September 9, 2021

MultiDrawIndirect and Metal

The most significant omission in Apple's Metal API is the absence of MultiDrawIndirect (MDI) functionality. MDI is the most compatible way of rendering for GPU-driven technology. There are several ways to emulate MDI on macOS and iOS.

The simplest way is a loop of drawIndexedPrimitives() commands, each with a different offset into the indirect buffer. Emulating MultiDrawIndirectCount this way requires CPU-GPU synchronization to read the Count value back from GPU memory. It is not optimal, but it works, and it can even outperform a single MDI call on some hardware because of insufficient driver optimizations.

// One indirect draw per command stored in the indirect buffer.
for(uint32_t i = 0; i < num_draws; i++) {
  [encoder drawIndexedPrimitives:... indirectBufferOffset:offset];
  offset += stride;
}
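The offset arithmetic behind that loop can be sketched in plain C++. This is a minimal sketch, not the benchmark's code: the struct mirrors the field layout of Metal's MTLDrawIndexedPrimitivesIndirectArguments, while `indirect_offsets`, `gpu_count`, and `max_draws` are hypothetical names introduced here. The count read back from the GPU is clamped so that a bad value cannot run past the commands that were actually allocated.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the layout of Metal's MTLDrawIndexedPrimitivesIndirectArguments.
struct DrawIndexedArguments {
  uint32_t index_count;
  uint32_t instance_count;
  uint32_t index_start;
  int32_t base_vertex;
  uint32_t base_instance;
};
static_assert(sizeof(DrawIndexedArguments) == 20, "five packed 32-bit fields");

// Returns the byte offset of every draw inside the indirect buffer.
// gpu_count is the Count value read back from GPU memory (hypothetical name);
// it is clamped to max_draws, the number of commands that were allocated.
std::vector<uint64_t> indirect_offsets(uint64_t base_offset, uint32_t gpu_count,
                                       uint32_t max_draws,
                                       uint32_t stride = sizeof(DrawIndexedArguments)) {
  assert(stride >= sizeof(DrawIndexedArguments));
  uint32_t num_draws = (gpu_count < max_draws) ? gpu_count : max_draws;
  std::vector<uint64_t> offsets;
  offsets.reserve(num_draws);
  for(uint32_t i = 0; i < num_draws; i++) {
    offsets.push_back(base_offset + (uint64_t)i * stride);
  }
  return offsets;
}
```

Each returned value would be passed as the indirectBufferOffset of one drawIndexedPrimitives() call.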

The official Metal way is to use an Indirect Command Buffer (ICB) and encode rendering commands on the CPU or GPU. All textures and samplers must be passed as Argument buffer parameters, even if they do not change during rendering. The Metal shading language has a built-in draw_indexed_primitives() function that corresponds exactly to its Metal API counterpart. That sounds good, except for a minor limitation of 16384 drawing commands.

In theory, a single call to executeCommandsInBuffer() should easily outperform the loop of drawIndexedPrimitives() calls. Let's run some tests to check. We will use the same applications as in our Mesh Shader performance comparison.

Device          Vertices/Primitives  Metal Loop  Metal ICB
Apple M1        4/2                  20 M        48 M
Apple M1        64/84                698 M       794 M
Apple M1        64/126               740 M       840 M
Apple M1        96/169               1.22 B      933 M
Apple M1        128/212              1.49 B      1.05 B
Apple M1        32-bit indices       1.36 B
Radeon Vega 56  4/2                  24 M        16 M
Radeon Vega 56  64/84                1.13 B      647 M
Radeon Vega 56  64/126               1.20 B      687 M
Radeon Vega 56  96/169               1.84 B      1.05 B
Radeon Vega 56  128/212              2.54 B      1.44 B
Radeon Vega 56  32-bit indices       3.9 B

  • On the Apple GPU, ICB performs better than the loop when the number of primitives per draw call is small. But ICB is about 1.4 times slower (1.49 B versus 1.05 B) when draw calls contain more than 200 primitives, which is rather strange behavior for ICB.
  • On the AMD GPU, ICB is always 1.5 to 1.8 times slower than the loop of indirect commands. The striking thing is that, with ICB, the power-hungry AMD Radeon Vega 56 runs at approximately the same performance as the integrated Apple M1.
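The ratios behind these two observations can be recomputed from the table. The sketch below only copies the published numbers and divides them; `Row` and `loop_over_icb` are illustrative names, not part of any benchmark code.

```cpp
#include <cassert>
#include <cmath>

// Triangles per second, copied from the Metal table above.
struct Row {
  const char *device;
  const char *config;  // vertices/primitives
  double loop;         // loop of drawIndexedPrimitives()
  double icb;          // Indirect Command Buffer
};

static const Row rows[] = {
  { "Apple M1", "4/2", 20e6, 48e6 },
  { "Apple M1", "64/84", 698e6, 794e6 },
  { "Apple M1", "64/126", 740e6, 840e6 },
  { "Apple M1", "96/169", 1.22e9, 933e6 },
  { "Apple M1", "128/212", 1.49e9, 1.05e9 },
  { "Radeon Vega 56", "4/2", 24e6, 16e6 },
  { "Radeon Vega 56", "64/84", 1.13e9, 647e6 },
  { "Radeon Vega 56", "64/126", 1.20e9, 687e6 },
  { "Radeon Vega 56", "96/169", 1.84e9, 1.05e9 },
  { "Radeon Vega 56", "128/212", 2.54e9, 1.44e9 },
};

// Returns loop / ICB throughput; values above 1.0 mean the loop is faster.
double loop_over_icb(const Row &row) {
  assert(row.icb > 0.0);
  return row.loop / row.icb;
}

// Same ratio rounded to two decimals for reporting.
double rounded_ratio(const Row &row) {
  return std::round(loop_over_icb(row) * 100.0) / 100.0;
}
```

On M1 the ratio goes from well below 1.0 at 4/2 to roughly 1.4 at 128/212, while on the Vega 56 it stays between 1.5 and about 1.8 for every configuration.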

How do the same tests perform on the same hardware under Direct3D12 and Vulkan?

Device          Vertices/Primitives  Metal Loop  Metal ICB  D3D12 MDI  VK MDI
Radeon Vega 56  4/2                  24 M        16 M       34 M       34 M
Radeon Vega 56  64/84                1.13 B      647 M      1.34 B     1.33 B
Radeon Vega 56  64/126               1.20 B      687 M      1.41 B     1.40 B
Radeon Vega 56  96/169               1.84 B      1.05 B     2.13 B     2.11 B
Radeon Vega 56  128/212              2.54 B      1.44 B     2.91 B     2.84 B

  • The loop of indirect commands is almost as fast as a single MDI API call, but the price of that performance is a huge CPU load. The ICB and MDI solutions do not load the CPU at all.
  • All AMD GPUs have native MDI functionality, available even in D3D11, but not in Metal. That is understandable, since AMD GPUs are not the main target for Apple.

The current version of the GravityMark benchmark uses the loop approach for rendering with Metal. This dramatically reduces the performance of the same AMD GPU.

ICB is not widely used, and it is very difficult to debug. Metal shader validation is still not compatible with ICB. The application crashes with:

-[MTLGPUDebugDevice newIndirectCommandBufferWithDescriptor:maxCommandCount:options:]:1036: failed assertion `Indirect Command Buffers are not currently supported with Shader Validation'

Fortunately, there is an MTL_SHADER_VALIDATION_GPUOPT_ENABLE_INDIRECT_COMMAND_BUFFERS parameter that enables ICB validation. But even trivial ICB shaders can cause a non-trivial error:

Compiler encountered an internal error

It is possible to simplify a trivial ICB shader so that it passes the compilation stage. Then MTLArgumentBuffer reminds us that it has a debugger of its own.

Okay, there is no debugging available for ICB for now. Let's try without it. The good news is that macOS usually does not hang for more than 20 seconds when an error occurs. More often, a nice magenta screen tells us that something is going wrong.

As a result, we have an internal ICB version of GravityMark which demonstrates that:

  • GPU-CPU synchronization is required because the indirect version of executeCommandsInBuffer() causes a magenta screen on M1.
  • The ICB length limit of 16384 is too low for all asteroids, so the indirect version of the executeCommandsInBuffer() command has to be repeated a couple of times.
  • AMD is 18% slower with ICB than with the loop. The only bonus is that the CPU is free. Running Windows or Linux natively on the Mac gives a 3x boost for graphics.
  • M1 is 39% faster with ICB than with the loop. It should be even better without synchronization. Even with this limitation, M1 outperforms the best integrated GPUs.
  • A14 is 44% faster with ICB than with the loop. It crashes on the last take, with a score of 3600 instead of 2536.
  • It is a bad time to buy a Mac with an AMD GPU for macOS.
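Working around the 16384-command limit comes down to splitting the draw list into ranges and issuing one executeCommandsInBuffer() call per range. The sketch below shows only that splitting step. `CommandRange` and `split_commands` are hypothetical names, and how each range maps to an actual ICB is left open, since the post does not describe it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One executeCommandsInBuffer() call: first command and number of commands.
struct CommandRange {
  uint32_t location;
  uint32_t length;
};

// Splits num_draws commands into ranges of at most max_commands each.
// The default of 16384 is the ICB limit quoted in the text.
std::vector<CommandRange> split_commands(uint32_t num_draws,
                                         uint32_t max_commands = 16384) {
  assert(max_commands > 0);
  std::vector<CommandRange> ranges;
  uint32_t location = 0;
  while(location < num_draws) {
    // The last range may be shorter than max_commands.
    uint32_t length = std::min(num_draws - location, max_commands);
    ranges.push_back({ location, length });
    location += length;
  }
  return ranges;
}
```

As an illustration, 200,000 commands would need 13 ranges, the last one covering the remaining 3392 commands.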

September 4, 2021

Mesh Shader versus MultiDrawIndirect

Mesh and Task (Amplification) shaders were a nice addition to the existing set of shaders. They allow the use of custom data formats for better compression and can dynamically generate geometry without using intermediate storage for it. Let’s see how they perform on different hardware and APIs.

We will use two simple test applications for that. The first one draws a million independent quad primitives without Task shader multiplication. Nvidia can dispatch a million Task shader invocations, but such numbers crash the AMD driver, so we will use 16 Task shader batches instead. The geometry output is just 4 vertices and 6 indices per Mesh shader.

The second application draws many instances of a complex model. Each model is made from 283K vertices and 491K triangles, and there are 81 instances. That makes about 23M vertices and 40M triangles to be processed. The number of Meshlets depends on the vertex/primitive counts and varies from 6130 to 2709. The size of the vertex buffer also increases because each Meshlet must be independent. There is no geometry culling; we are checking geometry throughput only.
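The exact products behind these rounded figures are easy to check. The sketch below multiplies the per-model counts by the instance count and derives the average number of triangles per Meshlet; all constant and function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t vertices_per_model = 283000;   // "283K vertices"
constexpr uint64_t triangles_per_model = 491000;  // "491K triangles"
constexpr uint64_t num_instances = 81;

// Totals submitted per frame, before any per-Meshlet vertex duplication.
constexpr uint64_t total_vertices = vertices_per_model * num_instances;
constexpr uint64_t total_triangles = triangles_per_model * num_instances;

// Average number of triangles per Meshlet for a given Meshlet count per model.
double triangles_per_meshlet(uint64_t num_meshlets) {
  assert(num_meshlets > 0);
  return (double)triangles_per_model / (double)num_meshlets;
}
```

With 6130 Meshlets the average is about 80 triangles per Meshlet, and with 2709 Meshlets it is about 181. That would fit the 84 and 212 primitive limits if those Meshlet counts belong to the smallest and largest configurations, which is an assumption here rather than something the post states.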

Both applications can also draw all Meshlets individually by using a single MultiDrawIndirect command. Instancing is not used.

GeForce 2080 Ti 4/2 685 M 707 M 655 M 360 M
GeForce 2080 Ti 64/84 12.92 B 11.20 B 11.34 B 10.80 B
GeForce 2080 Ti 64/126 12.51 B 11.21 B 11.71 B 10.92 B
GeForce 2080 Ti 96/169 12.71 B 11.23 B 11.27 B 11.44 B
GeForce 2080 Ti 128/212 12.12 B 12.20 B 10.84 B 11.72 B
GeForce 2080 Ti 32-bit indices 11.79 B 10.95 B
Quadro RTX 8000 4/2 683 M 693 M 661 M 345 M
Quadro RTX 8000 64/84 12.51 B 15.26 B 14.72 B 12.31 B
Quadro RTX 8000 64/126 12.23 B 15.37 B 14.30 B 12.83 B
Quadro RTX 8000 96/169 12.50 B 16.62 B 14.05 B 15.39 B
Quadro RTX 8000 128/212 12.15 B 17.04 B 13.43 B 15.64 B
Quadro RTX 8000 32-bit indices 13.63 B 12.88 B
Radeon 6700 XT 4/2 118 M 85.1 M 85.5 M
Radeon 6700 XT 64/84 4.31 B 3.42 B 3.40 B
Radeon 6700 XT 64/126 4.37 B 3.61 B 3.59 B
Radeon 6700 XT 96/169 4.43 B 5.48 B 5.40 B
Radeon 6700 XT 128/212 4.49 B 7.42 B 7.31 B
Radeon 6700 XT 32-bit indices 14.68 B 14.16 B

The table shows the number of triangles rendered per second.


The results are very interesting:

  • MultiDrawIndirect is faster than Mesh Shader on Nvidia Quadro.
  • There is no difference for Nvidia between Mesh shader and MultiDrawIndirect, except under Vulkan with a very small number of primitives.
  • 32-bit indices for raw geometry work faster on AMD. But any Mesh shader configuration cuts that hardware throughput by a factor of 3. More importantly, the MultiDrawIndirect approach starts to outperform Mesh shaders when the number of primitives is bigger than 128.

But let’s try to draw 256K boxes by using a Geometry shader. We will use another simple application that draws a 3D grid of boxes. Each box is a point primitive for the Geometry shader. The Mesh shader draws 64 boxes per Task shader group.

GeForce 2080 Ti 1.5 B 3.7 B 1.5 B 2.9 B
Radeon 6700 XT 1.4 B 2.2 B 1.9 B

Geometry shader is a clear winner here, especially on Nvidia hardware.

A properly implemented MultiDrawIndirectCount can do the same job as Mesh shaders. Geometry shaders do simple primitive rendering better than Mesh shaders. We hope that vendors will provide better API flexibility instead of implementing an enormous number of different shader types.

Here are a couple more observations from Mesh shaders:

  • On Nvidia under Vulkan, all Mesh shader vertices and indices must be written by threads with an index lower than 32, even if the shader group size is bigger than 32. Otherwise, the result will be ignored.
  • Nvidia can dispatch any number of Task shader groups under D3D12 and Vulkan. AMD has a limit of 65K.
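The 65K limit explains the batch count used in the first test application. A minimal sketch of the arithmetic follows, assuming the limit is 65535 groups per dispatch and that "a million" means 1,000,000 groups. Both values are assumptions, and `num_batches` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Number of dispatches needed to cover num_groups when a single dispatch
// can launch at most max_groups Task shader groups (rounded-up division).
uint32_t num_batches(uint32_t num_groups, uint32_t max_groups) {
  assert(max_groups > 0);
  return num_groups / max_groups + (num_groups % max_groups != 0 ? 1 : 0);
}
```

Under these assumptions, num_batches(1000000, 65535) gives 16, which matches the 16 Task shader batches mentioned earlier.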

The Clay shader compiler can automatically translate Task and Mesh shaders from GLSL/SPIR-V to HLSL.

Reproduction binaries for Windows:

June 30, 2021

Shader Pipeline

The first shaders were very simple programs that transformed and lit vertices before the rasterization stage. They were written in an assembly language. Now shaders can be found everywhere, from the user interface to physics and logic, because they are not any different from other source code. The number of shaders is growing every year with the flexibility of modern GPUs, and software delegates more and more tasks to the GPU instead of the CPU.

There are not many ways of writing shaders nowadays. It can be an HLSL, GLSL, MSL, WGSL, CU, or CL dialect, but mostly the code will be the same with some minor differences. It’s not possible to create a perfect binary shader that will run across all available hardware: because GPUs have different architectures, no binary format can be optimal for everybody. So, an intermediate binary representation is used to simplify the application runtime, and it’s the driver’s job to generate the perfect binary from the input intermediate shader representation.

With cross-API technology, you need different shaders for different APIs. Moreover, some platforms do not allow compiling shaders during the application’s execution and require precompiled shader input. In the case of Vulkan, it’s the SPIR-V format, which is a binary representation of GLSL shader code. That binary shader can be loaded directly by the Vulkan runtime, and the driver will transform it for the hardware. The OpenGL ARB_gl_spirv extension allows loading the same SPIR-V binary shader directly into the OpenGL runtime with minor modifications related to samplers and textures. Unfortunately, it only works on Nvidia. AMD and Intel can’t handle geometry and tessellation shaders from the SPIR-V binary.

The problems appear when the same shader needs to run on the Direct3D12, Metal, or WebGPU APIs. SPIR-V cross tools make it possible by translating binary shaders to HLSL or MSL formats with many tweak parameters for resource binding. That translation is not fully compatible between platforms, so the engine must know how to transform parameters best for each platform. After that, the shader source code can be compiled by the d3dcompiler/dxcompiler/metal toolset for the required runtime.

Another option is to use HLSL shaders as input and cross-compile them to SPIR-V representation for Vulkan. But that also requires resource binding magic. Every platform wants to have its own shader language, which is not compatible with other platforms. SPIR-V is a great attempt to make a standard for everybody. But only Vulkan API can use it. Other platforms require different shader languages or binary formats.
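The per-platform split described above can be summarized as a simple lookup. The sketch below is illustrative only: the `Platform` enum and `shader_format` function are hypothetical names, and the mapping follows the platform and language pairs named in this post rather than any engine API.

```cpp
#include <cassert>
#include <string>

enum class Platform {
  Vulkan, OpenGL, OpenGLES, Direct3D12, Direct3D11, WebGPU, Metal, Cuda
};

// Returns the shader language or binary format each platform consumes.
std::string shader_format(Platform platform) {
  switch(platform) {
    case Platform::Vulkan: return "SPIR-V";
    case Platform::OpenGL: return "GLSL";  // SPIR-V is possible via ARB_gl_spirv
    case Platform::OpenGLES: return "GLSL";
    case Platform::Direct3D12: return "HLSL";
    case Platform::Direct3D11: return "HLSL";
    case Platform::WebGPU: return "WGSL";
    case Platform::Metal: return "MSL";
    case Platform::Cuda: return "CUDA";
  }
  assert(false && "unknown platform");
  return "";
}
```

A cross-API engine needs some form of this table, plus the resource-binding rules for each target, regardless of which language is used as the single input.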

The number of different shader types is also growing. We have Vertex, Fragment, Geometry, Control, Evaluate, Compute, Task, Mesh, RayGen, RayMiss, Closest, AnyHit, Intersection, and Callable shader types. All of them have different input and output semantics. Luckily, Khronos group provides tools to validate and compile all of these shaders. We tried to use these tools, but, unfortunately, it was impossible to cross-compile our Compute, Tessellation, and Geometry shaders with their help. So, we’ve created our own shader pipeline based on GLSL and SPIR-V specifications. And now we are excited to tell you more about it.

We use the GLSL language in our Tellusim Engine as the primary language for all platforms. All shader types, including Mesh and Raytracing, are supported. Because of the high performance of our shader toolset, we can skip the offline shader compilation step and do everything at runtime, and it works fast with any amount of code. For platforms that do not allow compiling shaders at runtime, we use a precompiled shader cache.

The GravityMark GPU benchmark requires more than 20K lines of GLSL shaders, and there is a huge difference in application startup time when the Khronos glslang compiler is used for GLSL to SPIR-V compilation.

Here is a log from a build with the Khronos glslang compiler:

M:  63.32 ms: Creating 1600x900 Vulkan Window
M:   1.493 s: Creating SceneManager
M:   9.346 s: Creating RenderManager
M:  12.431 s: Creating Scene
M:  13.551 s: Creating 200,000 Asteroids
M:  13.701 s: Updating Scene
M:  13.851 s: GravityMark v1.2 is Ready in 13.9 s

And this is the Clay shader compiler doing the same job about 9 times faster (13.9 s versus 1.5 s):

M:  58.59 ms: Creating 1600x900 Vulkan Window
M: 288.47 ms: Creating SceneManager
M: 411.18 ms: Creating RenderManager
M: 541.40 ms: Creating Scene
M:   1.289 s: Creating 200,000 Asteroids
M:   1.364 s: Updating Scene
M:   1.500 s: GravityMark v1.2 is Ready in 1.5 s
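The two logs can be compared stage by stage with a few lines of arithmetic. The sketch below copies the timestamps from the logs above and is not part of the benchmark. `Stage` and the helper functions are illustrative names, and the logs do not say which work happens between two timestamps.

```cpp
#include <cassert>
#include <cmath>

// Timestamps in seconds, copied from the two logs above.
struct Stage {
  const char *name;
  double glslang;
  double clay;
};

static const Stage stages[] = {
  { "Creating 1600x900 Vulkan Window", 0.06332, 0.05859 },
  { "Creating SceneManager", 1.493, 0.28847 },
  { "Creating RenderManager", 9.346, 0.41118 },
  { "Creating Scene", 12.431, 0.54140 },
  { "Creating 200,000 Asteroids", 13.551, 1.289 },
  { "Updating Scene", 13.701, 1.364 },
  { "Ready", 13.851, 1.500 },
};
static const int num_stages = sizeof(stages) / sizeof(stages[0]);

// Time between stage i and the next logged stage for each compiler.
double glslang_duration(int i) {
  assert(i >= 0 && i + 1 < num_stages);
  return stages[i + 1].glslang - stages[i].glslang;
}
double clay_duration(int i) {
  assert(i >= 0 && i + 1 < num_stages);
  return stages[i + 1].clay - stages[i].clay;
}

// Ratio of the total startup times.
double startup_speedup() {
  return stages[num_stages - 1].glslang / stages[num_stages - 1].clay;
}
```

The total startup ratio comes out at roughly 9.2. The largest single gap is between "Creating SceneManager" and "Creating RenderManager", which shrinks from about 7.9 s to about 0.12 s.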

There is no difference in FPS between shader compilers.

Conversion to other shader languages from SPIR-V representation is performed with the same incredible speed. Moreover, all resource bindings are automatically handled by the engine. Only one GLSL shader is needed for all supported platforms, including Cuda and WebGPU. This gives great flexibility and significantly reduces the time it takes to develop new features. We also use all available debugging tools from supported platforms.

Some GLSL features, such as embedded arrays, are not supported because we don’t need them.

You can download the Clay shader compiler command-line tool for Windows and Linux with back ends for all shader languages (Vulkan SPIR-V, OpenGL SPIR-V, OpenGL GLSL, OpenGLES GLSL, Direct3D12 HLSL, Direct3D11 HLSL, WebGPU WGSL, Metal MSL, and Cuda) here.