[4.4 beta 4] iOS: Metal shader compilation warnings and unexpected compilation amount #103006

Open
georgwacker opened this issue Feb 18, 2025 · 26 comments · May be fixed by #103185

@georgwacker
Contributor

Tested versions

  • tested Metal iOS 4.4 beta 4
  • comparing MoltenVK iOS 4.3-stable

System information

Godot v4.4.beta4 - macOS Sonoma (14.7.2) - Multi-window, 2 monitors - Vulkan (Mobile)

Issue description

So far, I've been using MoltenVK on iOS with version 4.3-stable.

With manual particle preloading, I get 4 compilation warnings from MoltenVK, one for each particle system.

Switching to 4.4 beta4 using Metal with pipeline caching and no manual preloading is showing 78 compilation warnings on the first launch, which is very slow (logs down below).

I even got these warnings when running in Release mode via Xcode.

  • Should these warnings be suppressed?
  • Is the number of compilations out of the ordinary? I only have 4 particle systems and the game is fully 2D with control nodes (no sprites), so I'm not sure where the 78 shader compilations are coming from.
Log (truncated)
Godot Engine v4.4.beta4.official.93d270693 - https://godotengine.org
Metal 3.2 - Forward Mobile - Using Device #0: Apple - Apple A15 GPU (Apple8)
fopen failed for data file: errno = 2 (No such file or directory)
Errors found! Invalidating cache...
fopen failed for data file: errno = 2 (No such file or directory)
Errors found! Invalidating cache...
Warning: Compilation succeeded with: 

program_source:68:11: warning: unused variable 'DTid' [-Wunused-variable]
    uint3 DTid = gl_GlobalInvocationID;
          ^
Warning: Compilation succeeded with: 

program_source:68:11: warning: unused variable 'DTid' [-Wunused-variable]
    uint3 DTid = gl_GlobalInvocationID;
          ^
Warning: Compilation succeeded with: 

program_source:189:12: warning: unused variable 'instance_custom' [-Wunused-variable]
    float4 instance_custom = float4(0.0);
           ^
program_source:237:11: warning: unused variable 'bones' [-Wunused-variable]
    uint4 bones = uint4(0u);
          ^
program_source:238:12: warning: unused variable 'bone_weights' [-Wunused-variable]
    float4 bone_weights = float4(0.0);
           ^
program_source:240:11: warning: unused variable 'point_size' [-Wunused-variable]
    float point_size = 1.0;
          ^
program_source:107:15: warning: unused variable 'pso_sc_packed_0' [-Wunused-const-variable]
constant uint pso_sc_packed_0 = is_function_constant_defined(pso_sc_packed_0_tmp) ? pso_sc_packed_0_tmp : 0u;
              ^
Warning: Compilation succeeded with: 

program_source:212:15: warning: unused variable 'cVdotH' [-Wunused-variable]
        float cVdotH = fast::max(dot(view, half_vec), 0.0);
              ^
program_source:213:15: warning: unused variable 'cLdotH' [-Wunused-variable]
        float cLdotH = fast::max(dot(light_vec, half_vec), 0.0);
              ^
program_source:447:12: warning: unused variable 'screen_uv' [-Wunused-variable]
    float2 screen_uv = float2(0.0);
           ^
program_source:450:11: warning: unused variable 'normal_map_depth' [-Wunused-variable]
    float normal_map_depth = 1.0;
          ^
Warning: Compilation succeeded with: 

program_source:190:12: warning: unused variable 'instance_custom' [-Wunused-variable]
    float4 instance_custom = float4(0.0);
           ^
program_source:238:11: warning: unused variable 'bones' [-Wunused-variable]
    uint4 bones = uint4(0u);
          ^
program_source:239:12: warning: unused variable 'bone_weights' [-Wunused-variable]
    float4 bone_weights = float4(0.0);
           ^
program_source:107:15: warning: unused variable 'pso_sc_packed_0' [-Wunused-const-variable]
constant uint pso_sc_packed_0 = is_function_constant_defined(pso_sc_packed_0_tmp) ? pso_sc_packed_0_tmp : 0u;
              ^
Warning: Compilation succeeded with: 

program_source:196:12: warning: unused variable 'instance_custom' [-Wunused-variable]
    float4 instance_custom = float4(0.0);
           ^
program_source:226:11: warning: unused variable 'bones' [-Wunused-variable]
    uint4 bones = uint4(0u);
          ^
program_source:228:11: warning: unused variable 'point_size' [-Wunused-variable]
    float point_size = 1.0;
          ^
program_source:111:15: warning: unused variable 'pso_sc_packed_0' [-Wunused-const-variable]
constant uint pso_sc_packed_0 = is_function_constant_defined(pso_sc_packed_0_tmp) ? pso_sc_packed_0_tmp : 0u;
              ^
Warning: Compilation succeeded with: 

program_source:255:15: warning: unused variable 'cVdotH' [-Wunused-variable]
        float cVdotH = fast::max(dot(view, half_vec), 0.0);
              ^
program_source:256:15: warning: unused variable 'cLdotH' [-Wunused-variable]
        float cLdotH = fast::max(dot(light_vec, half_vec), 0.0);
              ^
program_source:570:12: warning: unused variable 'screen_uv' [-Wunused-variable]
    float2 screen_uv = float2(0.0);
           ^
program_source:573:11: warning: unused variable 'normal_map_depth' [-Wunused-variable]
    float normal_map_depth = 1.0;
          ^
Warning: Compilation succeeded with: 

program_source:196:11: warning: unused variable 'bones' [-Wunused-variable]
    uint4 bones = in.bone_attrib;
          ^
program_source:197:12: warning: unused variable 'bone_weights' [-Wunused-variable]
    float4 bone_weights = in.weight_attrib;
           ^
program_source:251:11: warning: unused variable 'point_size' [-Wunused-variable]
    float point_size = 1.0;
          ^
program_source:77:15: warning: unused variable 'pso_sc_packed_0' [-Wunused-const-variable]
constant uint pso_sc_packed_0 = is_function_constant_defined(pso_sc_packed_0_tmp) ? pso_sc_packed_0_tmp : 0u;
              ^
Warning: Compilation succeeded with: 

program_source:197:11: warning: unused variable 'bones' [-Wunused-variable]
    uint4 bones = in.bone_attrib;
          ^
program_source:198:12: warning: unused variable 'bone_weights' [-Wunused-variable]
    float4 bone_weights = in.weight_attrib;
           ^
program_source:77:15: warning: unused variable 'pso_sc_packed_0' [-Wunused-const-variable]
constant uint pso_sc_packed_0 = is_function_constant_defined(pso_sc_packed_0_tmp) ? pso_sc_packed_0_tmp : 0u;
              ^
Warning: Compilation succeeded with: 

program_source:182:15: warning: unused variable 'cVdotH' [-Wunused-variable]
        float cVdotH = fast::max(dot(view, half_vec), 0.0);
              ^
program_source:183:15: warning: unused variable 'cLdotH' [-Wunused-variable]
        float cLdotH = fast::max(dot(light_vec, half_vec), 0.0);
              ^
program_source:457:12: warning: unused variable 'screen_uv' [-Wunused-variable]
    float2 screen_uv = float2(0.0);
           ^
program_source:460:11: warning: unused variable 'normal_map_depth' [-Wunused-variable]
    float normal_map_depth = 1.0;
          ^
Warning: Compilation succeeded with: 

program_source:195:12: warning: unused variable 'instance_custom' [-Wunused-variable]
    float4 instance_custom = float4(0.0);
           ^
program_source:225:11: warning: unused variable 'bones' [-Wunused-variable]
    uint4 bones = uint4(0u);
          ^
program_source:227:11: warning: unused variable 'point_size' [-Wunused-variable]
    float point_size = 1.0;
          ^
program_source:111:15: warning: unused variable 'pso_sc_packed_0' [-Wunused-const-variable]
constant uint pso_sc_packed_0 = is_function_constant_defined(pso_sc_packed_0_tmp) ? pso_sc_packed_0_tmp : 0u;
              ^
[... truncated]

metal_4.4b4_log.txt

Steps to reproduce

Minimal reproduction project (MRP)

@Calinou
Member

Calinou commented Feb 18, 2025

Can you compare with 4.4beta4 on MoltenVK? You can switch back in the Project Settings using Rendering Driver overrides for macOS and iOS.

This is likely a consequence of the new ubershaders in 4.4, although I'm surprised they are compiled when 3D rendering is not used. I guess they are still needed when you use GPUParticles?

@georgwacker
Contributor Author

In the project settings "rendering/rendering_device/driver.ios" is already set to vulkan and is the only option for me, probably because I'm running the editor under x86, not on ARM. The iOS export seems to automatically run on Metal, regardless.

Is that the correct settings path for overriding it in theory?

I'm using 4 x GPUParticles2D, which are all in one scene, but only one gets set to emitting upon an initializer. Maybe that is causing all these permutations for compilation?

@Calinou
Member

Calinou commented Feb 18, 2025

Is that the correct settings path for overriding it in theory?

Yes. We should probably change the setting hint to allow setting Metal even in x86, so that you can export Metal even if you don't run it yourself.

cc @stuartcarnie

@stuartcarnie
Contributor

Good to know, thanks.

The warnings are ok; I will look at whether we can suppress them in release builds. An SPIR-V optimiser would reduce them significantly, as we could enable a few passes, like dead code elimination.

@bruvzg
Member

bruvzg commented Feb 18, 2025

Multiple unused variable warnings were always present with MoltenVK (on macOS as well if you enable verbose logging), it should be fine.

We should probably change the setting hint to allow setting Metal even in x86, so that you can export Metal even if you don't run it yourself.

It's a bit more complex: it will use Metal on iOS even if you set it to Vulkan on an x86_64 Mac. Since Vulkan is the default value on x86_64, it's not saved in the config (and the default on iOS is Metal). We should probably always show all available values for both macOS and iOS and always have the same default, and instead automatically fall back to Vulkan on x86_64 (#102341 already does the fallback, but a warning print should probably be added to avoid confusion).

@Calinou
Member

Calinou commented Feb 18, 2025

and instead automatically fall back to Vulkan on x86_64 (#102341 already does the fallback, but a warning print should probably be added to avoid confusion).

This would always print a warning on every startup on x86_64 hardware by default, so I'm not sure.

What we can do though is amend the rendering driver startup line with a notice about the fallback being applied. Something like this:

Godot Engine v4.4.beta.custom_build.e0cf7853b (2025-02-18 21:17:43 UTC) - https://godotengine.org
OpenGL API 3.3.0 NVIDIA 565.77 - Compatibility - Using Device: NVIDIA - NVIDIA GeForce RTX 4090 (fallback from Vulkan)

@akien-mga
Member

akien-mga commented Feb 18, 2025

We should probably change the setting hint to allow setting Metal even in x86, so that you can export Metal even if you don't run it yourself.

It's a bit more complex: it will use Metal on iOS even if you set it to Vulkan on an x86_64 Mac. Since Vulkan is the default value on x86_64, it's not saved in the config (and the default on iOS is Metal). We should probably always show all available values for both macOS and iOS and always have the same default, and instead automatically fall back to Vulkan on x86_64 (#102341 already does the fallback, but a warning print should probably be added to avoid confusion).

I confirm we shouldn't make the default value or available hints depend on the editor host; as we see here, that limits proper configuration.

I would make the default "auto" for macOS, which would be Metal on arm and Vulkan on x86_64. For iOS, the default should be Metal and Vulkan should be available to select as option (so no need for "auto" there I believe).

For 4.5, I think we should really implement a hint so that rendering method and drivers always get written to project.godot even when using default values. I thought we had a proposal for that but I couldn't find it (GH search isn't super helpful).

@bruvzg
Member

bruvzg commented Feb 19, 2025

I would make the default "auto" for macOS, which would be Metal on arm and Vulkan on x86_64. For iOS, the default should be Metal and Vulkan should be available to select as option (so no need for "auto" there I believe).

This is probably better, and with "auto" we do not need any warning.

@bruvzg
Member

bruvzg commented Feb 19, 2025

#103026

@georgwacker
Contributor Author

I've tested the override by manually editing the project file with the editor closed, but the game still starts up with Metal, so it doesn't seem to respect the override currently.

[rendering]   
rendering_device/driver.ios="vulkan"

Regarding the ubershader compilations, I wonder how the pipeline and specializations could be manually tweaked in the future. It is nice that particle preload basically happens automatically in 4.4, but the 78 "compilation succeeded" messages suggest that a lot of unnecessary features get pre-compiled that will never be used by a mostly Control-node-based game.

For comparison, in 4.3 I only ever get the shader compilations for the particle systems. I need to do some properly timed startup tests next.

Out of interest, I've compiled a custom iOS export with disable_3d=yes but that didn't reduce the shader compilations.

@georgwacker
Contributor Author

I've done some profiling with Instruments, testing the first run (app was removed before each test):

4.4 b4: 16s until menu (ubershader pipeline, 78 compilations, no manual preloading)
4.3: 4s until menu (no manual particle preloading)
4.3: 4s until menu, followed by 4.3s hang in menu for manual preloading (4 compilations)

So it seems the baseline is 4s to get to the menu, but the additional ubershader compilation in 4.4b4 takes 12s compared to the 4.3s of the manual preload.
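
A minimal sketch of the arithmetic behind that comparison, using only the figures listed above. The variable names are illustrative and not taken from Instruments.

```python
# Cold-start timings reported above, in seconds (app removed before each run).
baseline_to_menu = 4.0     # 4.3 without manual particle preloading
metal_44_to_menu = 16.0    # 4.4 beta 4 on Metal, 78 compilations
manual_preload_43 = 4.3    # 4.3 hang in the menu for 4 manual compilations

# Startup time beyond the 4 s baseline in 4.4 beta 4.
extra_44 = metal_44_to_menu - baseline_to_menu
ratio = extra_44 / manual_preload_43

print(f"4.4b4 extra startup cost: {extra_44:.1f}s")
print(f"4.3 manual preload cost:  {manual_preload_43:.1f}s")
print(f"ratio: {ratio:.1f}x")
```

With these inputs, the extra cost in 4.4 beta 4 is 12 s, a little under three times the 4.3 s manual preload.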

@georgwacker
Contributor Author

I've forced DisplayServerIOS to Vulkan with a custom build of 4.4b4, and I'm getting 4.21s until menu with only 4 compilations (one for each particle system, done automatically).

So it seems all those extra compilations only happen on Metal?

@clayjohn
Member

@georgwacker How are you measuring shader compiles?

I am a bit confused since the way we compile particle shaders hasn't changed between 4.3 and 4.4. The ubershader system applies to the shaders we use for drawing 3D meshes.

So are you measuring all shader compiles somehow? And if you are, how are you doing it? Later 4.4 releases can track shader compiles in the monitors, but that didn't exist in 4.3.

@georgwacker
Contributor Author

georgwacker commented Feb 20, 2025

Ah, I wasn't aware that the ubershader system is not used for particle shaders. But it must be related to the new pipeline cache system?

I'm testing cold bootup time to menu with no shader cache in Instruments and looking at Xcode logs for "compilation succeeded" messages. Below is the 16s "severe hang" before reaching the menu.

Image

When running under Metal, it shows 78 compilations vs. the 4 compilations under Vulkan, so I presumed the additional time is due to those additional compilations. But the slow bootup can be something else related to Metal, perhaps?

Edit: Those Points of Interest in the trace are all create_pipeline calls.

@stuartcarnie
Contributor

What version of MoltenVK are you using to build your application?

I have not verified this, but can you try running Metal and Vulkan with the Metal compilation cache completely disabled by setting the MTL_SHADER_CACHE_SIZE environment variable to 0? This undocumented feature was referenced in this comment.

There shouldn't be any reason that Metal is compiling more shaders than Vulkan, as it is driven by Godot's rendering driver. I would also expect MoltenVK to compile a lot more than 4 shaders on cold startup.

@stuartcarnie
Contributor

stuartcarnie commented Feb 20, 2025

As noted in #96052, from a cold startup, Metal should be faster than Vulkan, which was also confirmed by another user. Admittedly, this was only validated on macOS, as it is easy to clear the Metal shader cache there (see the Testing section of the PR description). I don't know how easy that is to test on iOS, where, after several runs and without rebooting the device, your results may be affected by previous runs. I'm hopeful that the MTL_SHADER_CACHE_SIZE environment variable will help.

I will run those tests again on macOS using master, to make sure there haven't been any regressions.

@clayjohn
Member

@georgwacker thank you for your response.

Indeed, I think Stuart is on the right track. It sounds like some sort of system caching is working successfully in MoltenVK that isn't working in our Metal backend. The actual number of pipeline compile requests should be the same between them.

@stuartcarnie do you know if the reported compilation number in Xcode is for pipelines that were compiled from scratch (as opposed to loaded from cache)?

@stuartcarnie
Contributor

@georgwacker try setting this environment variable when you run your iOS app from a cold start:

GODOT_MTL_SHADER_LOAD_STRATEGY=lazy

I'll elaborate in a follow-up comment, but it should make a significant difference.

@stuartcarnie
Contributor

I have determined the difference, which I identified in #96052:

Take note of the section summary, for the shader_compile statistics, that indicate 936 unique shaders were compiled.

It turns out that only 250 shaders (26%) of the requested shaders are used by the editor.

The same goes at runtime, where Godot will request that the driver compile a shader but may never use it in a pipeline, at least not immediately. I expect one contributor is all the shader variants.

More specifically, Godot will ask the RenderingDeviceDriver (Metal, Vulkan or D3D12) to compile the shader via shader_create_from_bytecode:

RDD::ShaderID RenderingDeviceDriverMetal::shader_create_from_bytecode(const Vector<uint8_t> &p_shader_binary, ShaderDescription &r_shader_desc, String &r_name, const Vector<ImmutableSampler> &p_immutable_samplers) {

which for the Vulkan driver won't do much, but for Metal, it will ask for a new MTLLibrary. This request results in a compilation of that library, which will show up in the Metal Shader Compiler graph:

Image

and associated log as Create Metal Library (Godot (PID)).

I implemented an alternative library loading strategy that compiles the MTLLibrary and shaders when Godot creates the render or compute pipeline. That can be triggered by specifying the following environment variable:

GODOT_MTL_SHADER_LOAD_STRATEGY=lazy

This behaviour more closely matches MoltenVK's implementation, which also delays MTLLibrary creation until the pipeline is created. The downside of this approach is that a lot of the render and compute pipeline creation in Godot is single-threaded, so when requesting the pipeline we have to wait for the MTLLibrary, and then the associated vertex, fragment or compute shader, to compile. This is the job of the Metal shader compiler services you see running in Activity Monitor.
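
To make the difference between the two strategies concrete, here is a toy Python model; it is not Godot's code. `ToyDriver`, `shader_create` and `pipeline_create` are hypothetical names, and the model assumes one library per shader, which is a simplification. The 936 requested and 250 used figures are the editor numbers quoted above, used only as sample input.

```python
class ToyDriver:
    """Toy model of a rendering driver; not Godot's actual implementation."""

    def __init__(self, strategy):
        assert strategy in ("immediate", "lazy")
        self.strategy = strategy
        self.libraries_created = 0
        self._has_library = set()

    def _create_library(self, shader):
        # Stands in for compiling a library from shader source (done at most once).
        if shader not in self._has_library:
            self._has_library.add(shader)
            self.libraries_created += 1

    def shader_create(self, shader):
        # Immediate: compile as soon as the engine hands over the shader.
        if self.strategy == "immediate":
            self._create_library(shader)

    def pipeline_create(self, shader):
        # Lazy: the library is only compiled here, when a pipeline needs it.
        self._create_library(shader)


requested = [f"shader_{i}" for i in range(936)]  # shaders handed to the driver
used = requested[:250]                           # shaders that end up in a pipeline

results = {}
for strategy in ("immediate", "lazy"):
    driver = ToyDriver(strategy)
    for shader in requested:
        driver.shader_create(shader)
    for shader in used:
        driver.pipeline_create(shader)
    results[strategy] = driver.libraries_created

print(results)
```

In this model the immediate strategy creates a library for every requested shader, while the lazy strategy creates one only for shaders that reach a pipeline.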

Note

On a desktop machine, there are significantly more Metal shader compiler services available for concurrent compilation, whereas there are only 2 on iOS devices, from what I learned from Apple. I found that lazy had a negative impact on Godot editor startup, as we don't concurrently create pipelines, so we lose a lot of parallelisation.

We tell Metal to maximise compilation services with the following API (macOS only):

#if TARGET_OS_OSX
if (@available(macOS 13.3, *)) {
    [id<MTLDeviceEx>(metal_device) setShouldMaximizeConcurrentCompilation:YES];
}
#endif
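
The trade-off described in the note can be sketched with a rough scheduling model in Python. Everything here is an assumption for illustration: equal 10 ms compile times, the 936 requested / 250 used split quoted earlier for the editor, and a desktop with 8 compiler services (the note only says desktops have significantly more than the 2 on iOS).

```python
import heapq

def wall_time_up_front(job_ms, services):
    """All jobs are queued at once and spread over `services` parallel compilers."""
    free_at = [0] * services             # when each compiler service is next free
    heapq.heapify(free_at)
    for ms in job_ms:
        start = heapq.heappop(free_at)   # earliest available service takes the job
        heapq.heappush(free_at, start + ms)
    return max(free_at)

def wall_time_one_at_a_time(job_ms):
    """The caller waits for each job before requesting the next (no overlap)."""
    return sum(job_ms)

COMPILE_MS = 10                          # hypothetical cost per library
requested = [COMPILE_MS] * 936           # immediate: every requested shader
used = [COMPILE_MS] * 250                # lazy: only shaders used in pipelines

results = {
    "immediate, 2 services": wall_time_up_front(requested, 2),
    "immediate, 8 services": wall_time_up_front(requested, 8),
    "lazy, single-threaded": wall_time_one_at_a_time(used),
}
for label, ms in results.items():
    print(f"{label}: {ms} ms")
```

Under these assumptions, compiling everything up front wins when many compiler services are available, and compiling only what pipelines need wins with 2. Real timings depend on per-shader cost and caching, which this model ignores.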

I further validated this strategy with the Bistro demo, by analysing the Metal compilations from cold start for the metal driver, the vulkan driver and the metal driver with the lazy strategy enabled. This data was pulled from the Metal Shader Compiler graph in Instruments.

Pay attention to the Create MTLibrary counts

Metal Cold Start

The default MTLLibrary creation behaviour.

┌────────────────────────────────────────┬──────────────┐
│                 Source                 │ count_star() │
│                varchar                 │    int64     │
├────────────────────────────────────────┼──────────────┤
│ Compile Compute shader (Godot (5032))  │          165 │
│ Compile Fragment shader (Godot (5032)) │          170 │
│ Compile Vertex shader (Godot (5032))   │          143 │
│ Create MTLibrary (Godot (5032))        │         1487 │
└────────────────────────────────────────┴──────────────┘

Metal Cold Start (lazy)

With the environment variable set to lazy. Notice the significant drop in MTLLibrary compilations.

┌────────────────────────────────────────┬──────────────┐
│                 Source                 │ count_star() │
│                varchar                 │    int64     │
├────────────────────────────────────────┼──────────────┤
│ Compile Compute shader (Godot (7482))  │          165 │
│ Compile Fragment shader (Godot (7482)) │          170 │
│ Compile Vertex shader (Godot (7482))   │          145 │
│ Create MTLibrary (Godot (7482))        │         1010 │
└────────────────────────────────────────┴──────────────┘

Vulkan Cold Start

┌────────────────────────────────────────┬──────────────┐
│                 Source                 │ count_star() │
│                varchar                 │    int64     │
├────────────────────────────────────────┼──────────────┤
│ Compile Compute shader (Godot (5203))  │          166 │
│ Compile Fragment shader (Godot (5203)) │          167 │
│ Compile Vertex shader (Godot (5203))   │          141 │
│ Create MTLibrary (Godot (5203))        │          941 │
└────────────────────────────────────────┴──────────────┘
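
For reference, a short Python sketch that compares the "Create MTLibrary" rows of the three tables above. The dictionary keys are illustrative labels, not names from Instruments.

```python
# "Create MTLibrary" counts from the three cold-start tables above.
create_library = {
    "metal": 1487,
    "metal_lazy": 1010,
    "vulkan": 941,
}

saved = create_library["metal"] - create_library["metal_lazy"]
reduction = saved / create_library["metal"]
gap_to_vulkan = create_library["metal_lazy"] - create_library["vulkan"]

print(f"lazy avoids {saved} library creations ({reduction:.0%} fewer)")
print(f"lazy still creates {gap_to_vulkan} more than Vulkan")
```

The lazy strategy avoids 477 library creations, roughly a third of the default count, and ends up 69 above the Vulkan (MoltenVK) run.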

Solution

@clayjohn

I can expose the compilation behaviour as a driver-specific project setting so users can override it. For iOS it would default to lazy, and for desktop it can stay as the current behaviour. Users can change the setting if iOS increases concurrency or if they find that macOS starts faster using the alternative approach for their specific project.

@stuartcarnie
Contributor

As an aside, the Metal driver uses considerably fewer resources:

Image

than MoltenVK for the Bistro demo:

Image

That doesn't necessarily mean it's better, but it is significant.

@clayjohn
Member

@stuartcarnie I think we need to re-evaluate some of our decisions in light of the Ubershader stuff.

I forgot about iOS' limit of 2 concurrent pipeline compiles. It really complicates the async compilation approach. We rely on being able to throw a bunch of stuff at the driver and then just use the results when ready.

But, ultimately we can now distinguish between Ubershader compiling and optimized pipelines compiling.

Ubershaders will be loaded at load time or on the first frame. We need to do more compilations than in 4.3, but it should be fine (Metal and Vulkan should behave the same). When we compile the ubershaders, we need to compile all the variants they need (i.e. the pipeline variants should be loaded ASAP). But then at run time the optimized variants should be scheduled to compile with as little overhead as possible.

I'm not sure I fully understand this lazy compile strategy. But if it allows us to defer the cost of creating pipelines, then that sounds like the right approach. Ideally, any cost from creating the optimized pipelines should be deferred and should be constrained to a background thread.

I don't think we need to expose a setting for this. I think we can design a solution specifically for iOS, since it is a unique platform. What we have now works great for macOS, so let's just figure out the minimal set of changes needed for iOS and then try it out.

@kisg
Contributor

kisg commented Feb 21, 2025

Would the shader baking PR improve on this? #102552

@georgwacker
Contributor Author

georgwacker commented Feb 21, 2025

@georgwacker try setting this environment variable when you run your iOS app from a cold start:

GODOT_MTL_SHADER_LOAD_STRATEGY=lazy

I'll elaborate in a follow-up comment, but it should make a significant difference.

With lazy loading on the official 4.4b4 build running Metal, I'm getting 8.4s to menu on cold boot, which is much better than the 16s with the default strategy.

With my custom build forcing vulkan 4.4b4 it's still only 3.5s on the cold boot, though. Custom build running metal takes 5.2s, so slightly better.

For these tests, I've been using the MoltenVK bundled with the iOS export template from 4.3, which shows as 1.2.283.

@stuartcarnie
Contributor

@georgwacker those numbers are more in line with what I'd expect. MoltenVK has a slight advantage here, as Metal has to convert all the SPIR-V to MSL during the calls to shader_compile_binary_from_spirv, so that is still roughly 78 shaders according to your numbers, whereas MoltenVK delays even this until it creates a pipeline. If you were to reboot your phone and try again (don't clear the Godot caches), you might find that Metal and MoltenVK are closer, and Metal possibly even faster here.

As @kisg noted, the shader baker PR will resolve this problem.

@stuartcarnie
Contributor

Further to @kisg's question about #102552, I am planning to leverage the Metal compiler tools, when available on Windows and macOS, so that baking shaders will not only generate the Metal source, but take it a step further and generate Metal libraries compiled to AIR, so Metal will have a significant advantage over MoltenVK here.

@stuartcarnie
Contributor

I forgot about iOS' limit of 2 concurrent pipeline compiles. It really complicates the async compilation approach. We rely on being able to throw a bunch of stuff at the driver and then just use the results when ready.

We can still do that, as Metal supports continuations / callbacks for compilation, so we use the results when ready. I use that feature already in the driver for the non-lazy (immediate) shader compiler mode.
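
The "submit now, use the results when ready" pattern can be sketched in Python as a loose analogy. This does not model the Metal API: `compile_shader` and `on_compiled` are hypothetical stand-ins, and the pool of 2 workers only mirrors the iOS limit mentioned in the quote.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def compile_shader(name):
    # Stand-in for a real compile; returns a fake "library" handle.
    return f"lib:{name}"

ready = {}                 # results become visible here as compiles finish
lock = threading.Lock()

def on_compiled(name, future):
    # Completion callback: record the result once the compile is done.
    with lock:
        ready[name] = future.result()

shaders = [f"shader_{i}" for i in range(6)]

# Queue everything at once; at most 2 compiles run concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    for name in shaders:
        future = pool.submit(compile_shader, name)
        future.add_done_callback(lambda f, n=name: on_compiled(n, f))

# Leaving the `with` block waits for all queued work to finish.
print(sorted(ready))
```

The caller never blocks on an individual compile; it queues all requests and picks up each result through its callback, even though only two run at a time.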
