January 16, 2022

Mesh Shader Emulation

Mesh shader performance is not where AMD GPUs shine, and there is not much difference in FPS between MDI and Mesh shaders either. But there is a trick that emulates Mesh shader functionality on all GPUs with compute shader support. This technique is faster than MDI and Mesh shaders on almost all AMD GPUs, and it brings an unbelievable (7x) performance boost on Qualcomm GPUs.

Geometry prepared for Mesh shader rendering contains vertex and meshlet data buffers. Each meshlet uses uint8_t vertex indices, which allows loading more primitives with a few memory lookups (one uint32_t[3] block per 4 triangles).

This is the data layout for each meshlet:

uint32_t Number of primitives
uint32_t Number of vertices 
uint32_t Base index (index buffer offset)
uint32_t Base vertex (vertex buffer offset)
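On the CPU side, the same layout and 8-bit index packing can be sketched in C++ (the names are illustrative, not the engine's actual API):

```cpp
#include <cstdint>
#include <vector>

// One meshlet header, matching the four uint32_t fields above.
struct Meshlet {
    uint32_t num_primitives;  // number of triangles
    uint32_t num_vertices;    // number of vertices
    uint32_t base_index;      // offset into the packed index stream
    uint32_t base_vertex;     // offset into the vertex buffer
};

// Unpack 8-bit vertex indices from the packed 32-bit stream:
// every uint32_t[3] block stores 12 uint8_t indices, i.e. 4 triangles.
std::vector<uint8_t> unpack_indices(const std::vector<uint32_t> &data,
    uint32_t base_index, uint32_t num_primitives) {
    std::vector<uint8_t> indices;
    indices.reserve(num_primitives * 3);
    for(uint32_t i = 0; i < num_primitives * 3; i++) {
        uint32_t value = data[base_index + i / 4];
        indices.push_back((value >> ((i % 4) * 8)) & 0xffu);
    }
    return indices;
}
```

Each shader thread in the emulation below consumes one such uint32_t[3] block, which is why a thread handles four triangles.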

The Mesh shader itself is straightforward: load the meshlet info, fetch the vertices and indices, and submit them.

But how do we render meshlets without Mesh shader support? We will use a single draw call with the help of a simple compute shader. This shader transforms the meshlet indices into packed uint32_t triangle indices that contain the meshlet index and the vertex index. The shader is trivial:

layout(local_size_x = GROUP_SIZE) in;

// base_group and num_meshlets come from a uniform block (omitted here)
layout(std430, binding = 0) readonly buffer meshlets_buffer { uint meshlets_data[]; };
layout(std430, binding = 1) writeonly buffer indexing_buffer { uvec4 indexing_data[]; };

shared uint num_primitives;
shared uint base_index;

void main() {
  uint group_id = base_group + gl_WorkGroupID.x;
  uint local_id = gl_LocalInvocationIndex;
  // the first thread loads the meshlet parameters
  [[branch]] if(local_id == 0u) {
    uint meshlet_index = (group_id % num_meshlets) * 4u;
    num_primitives = meshlets_data[meshlet_index + 0u];
    base_index = meshlets_data[meshlet_index + 2u];
  }
  memoryBarrierShared(); barrier();
  uint indices_0 = 0u;
  uint indices_1 = 0u;
  uint indices_2 = 0u;
  // each thread reads three uint32_t values (four packed triangles)
  [[branch]] if(local_id * 4u < num_primitives) {
    uint address = base_index + local_id * 3u;
    indices_0 = meshlets_data[address + 0u];
    indices_1 = meshlets_data[address + 1u];
    indices_2 = meshlets_data[address + 2u];
  }
  // combine the meshlet index (upper bits) with the unpacked 8-bit vertex indices
  uint group_index = group_id << 8u;
  uint index = (GROUP_SIZE * group_id + local_id) * 3u;
  indexing_data[index + 0u] = (uvec4(indices_0, indices_0 >> 8u, indices_0 >> 16u, indices_0 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 1u] = (uvec4(indices_1, indices_1 >> 8u, indices_1 >> 16u, indices_1 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 2u] = (uvec4(indices_2, indices_2 >> 8u, indices_2 >> 16u, indices_2 >> 24u) & 0xffu) | group_index;
}

After that compute pass, only a single drawElements() call is required to render all meshlets. The read and write memory traffic is about 11 MB per 1M triangles. It's not possible to use 16-bit indices because we could draw only 1024 meshlets per draw call in that case. A TriangleIndex built-in shader variable would make that possible, but we don't have one.

The vertex shader recovers the meshlet index and the meshlet-local vertex index with simple math on the VertexIndex built-in variable:

uint meshlet = gl_VertexIndex >> 8u;
uint vertex = gl_VertexIndex & 0xffu;
// load vertex data from SSBO
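The packing and the vertex-shader unpacking can be verified with a trivial CPU equivalent (a sketch; the function names are ours):

```cpp
#include <cstdint>

// Pack a meshlet index and a meshlet-local vertex index into one
// 32-bit index, mirroring the compute shader output above.
uint32_t pack_index(uint32_t meshlet, uint32_t vertex) {
    return (meshlet << 8u) | (vertex & 0xffu);
}

// The vertex shader side: recover both values with a shift and a mask.
uint32_t meshlet_of(uint32_t index) { return index >> 8u; }
uint32_t vertex_of(uint32_t index) { return index & 0xffu; }
```

With 32-bit indices, 24 bits remain for the meshlet index, so a single draw call can address up to 16M meshlets.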

There is no problem adding a per-triangle visibility culling test to that shader. It will even improve performance because back-facing and invisible triangles will not consume bandwidth.
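For the back-face part of such a test, the screen-space signed area is enough; a hedged CPU sketch (the winding convention here is an assumption):

```cpp
#include <array>

// Signed doubled area of a screen-space triangle: a non-negative
// determinant means the triangle is back-facing (or degenerate)
// under the winding convention assumed here, so its indices can be
// skipped before they are written to the index buffer.
bool is_back_facing(std::array<float, 2> p0, std::array<float, 2> p1,
    std::array<float, 2> p2) {
    float det = (p2[0] - p0[0]) * (p1[1] - p0[1])
              - (p2[1] - p0[1]) * (p1[0] - p0[0]);
    return det >= 0.0f;
}
```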

Let's check the results across different configurations, on the platforms where the emulation is faster:

                                 Single DIP   Emulation   Mesh Shader   MDI/ICB/Loop
Radeon 6900 XT 64/126            17.2 B       9.6 B       9.2 B         4.1 B
Radeon 6900 XT 128/212           17.2 B       10.1 B      9.2 B         8.5 B
Radeon 6700 XT 64/126            14.1 B       7.7 B       4.6 B         4.1 B
Radeon 6700 XT 128/212           14.1 B       7.9 B       4.6 B         8.3 B
Radeon 5600 M 64/126             5.0 B        2.9 B       n/a           1.4 B
Radeon 5600 M 128/212            5.0 B        2.9 B       n/a           2.8 B
Radeon Vega 56 64/126 (macOS)    4.1 B        2.7 B       n/a           860 M
Radeon Vega 56 128/212 (macOS)   4.1 B        2.9 B       n/a           1.8 B
Adreno 660 64/126                583 M        265 M       n/a           36.6 M
Adreno 660 128/212               596 M        309 M       n/a           74.7 M
Mali-G78 MP20 64/126             176 M        153 M       n/a           83 M
Mali-G78 MP20 128/212            187 M        123 M       n/a           134 M

The results are in billions (B) and millions (M) of processed triangles per second. MDI – MultiDrawIndirect, ICB – Indirect Command Buffer, Loop – a CPU loop of multiple drawIndirect() commands.

If you need to draw a lot of small DIPs, it's better to pack everything into a single draw call on AMD and Qualcomm. A single DIP works better than Mesh shaders or MDI even with the additional index-buffer generation overhead.

Windows and Linux binaries: TellusimDrawMeshlet.zip

December 16, 2021

Mesh Shader Performance

We have some new questionable results of Mesh Shader and MultiDrawIndirect performance on AMD GPUs. Let's look at the situation when we render geometry as independent triangles without an index buffer. The VS Draw Arrays column represents this rendering mode, where the number of Vertex Shader invocations equals the number of primitives times 3. Technically, this mode does the same as the Mesh Shader rendering mode, but without generating indices. That's why it's surprising to see that it's faster than MultiDrawIndirect and Mesh Shader on the 6700 XT. So if you are optimizing geometry for the vertex cache or generating the best possible meshlets, it makes no sense to do so for AMD GPUs. Other GPUs can share Vertex Shader output between adjacent triangles.

                      VS Draw Elements   VS Draw Arrays   MDI      MS       CS
Radeon 5600 M         5.0 B              1.5 B            1.1 B    n/a      8.3 B
Radeon 6700 XT        14.5 B             4.8 B            4.1 B    4.6 B    19.5 B
Radeon 6900 XT        17.6 B             7.2 B            4.1 B    9.1 B    34.5 B
GeForce RTX 2080 Ti   12.3 B             5.6 B            12.5 B   13.3 B   18.3 B
GeForce RTX 3090      14.3 B             5.9 B            14.6 B   20.7 B   28.8 B
Intel DG1             1.3 B              227 M            1.1 B    n/a      2.5 B
Apple M1              1.4 B              556 M            930 M    n/a      2.5 B

Compute versus Hardware
Mesh Shader versus MultiDrawIndirect

October 10, 2021

Blue Noise Generator

Blue noise has unique properties that make it best for dithering, spatial generation, and more. You can find more details about blue noise in these great articles: momentsingraphics.de, blog.demofox.org.

Performance and precision are the two main drawbacks of all blue noise generators. A 1024×1024 image requires hours or even days to generate, which is not acceptable. Moreover, 8-bit precision cannot represent a large noise image (more than 16×16) with perfect spatial accuracy because multiple pixels will share the same value. We made a GPU blue noise generator that uses a fast Fourier transform and provides blazing-fast generation. A 1024×1024 image is generated in less than 6 minutes on a GeForce 2080 Ti class GPU. The output noise image can be in 16-bit integer or 32-bit floating-point precision.
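The 16×16 limit follows from the pigeonhole principle: an n-bit image holds only 2^n distinct values, so at 8 bits only a 16×16 image can rank every pixel uniquely. A small sketch of that bound:

```cpp
#include <cstdint>

// Largest power-of-two image side whose pixels can all receive a
// unique value at the given bit depth (side * side <= 2^bits).
constexpr uint32_t max_unique_side(uint32_t bits) {
    uint64_t values = 1ull << bits;  // distinct pixel values
    uint32_t side = 1u;
    while(uint64_t(side * 2u) * uint64_t(side * 2u) <= values) side *= 2u;
    return side;
}
```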

The binaries of our blue noise generator for Windows and Linux (Vulkan) are available for everyone. But be careful: our Radeon Vega 7 died during 4K texture generation.

Download link: TellusimClayNoise.zip

The application has simple command line parameters:

Clay Blue Noise Image Generator (https://tellusim.com/)
Usage: ./noise -o noise.png
  -i  Input image
  -o  Output image
  -f  Forward image
  -h  Histogram output
  -bits  Image bits (8)
  -size  Image size (256)
  -width  Image width (256)
  -height  Image height (256)
  -seed  Random seed (random)
  -init  Initial pixels (10%)
  -sigma  Gaussian sigma (2.0)
  -epsilon  Quadratic epsilon (0.01)
  -device  Computation device index

This command makes a unique blue noise image with 16-bit precision in less than a minute. The Fourier transform of the noise can be obtained with the -f command line parameter. The histogram output is a text file. The initial random pixels can also be specified as an input image.

clay_noise.exe -size 512 -bits 16 -o noise_512x512_16bit.png

The sigma parameter changes the blur radius during generation. You can see the results of different sigma values in this WebGL application:


October 7, 2021

Ray Tracing versus Animation

There are no answers from vendors about acceleration structure performance, and no actual numbers on how dynamic geometry affects ray tracing performance; there are only recommendations on how to use acceleration structure flags. We had static performance numbers in our previous post. Now it's time to focus on acceleration structure builders.

We will draw animated characters. Because all characters are independent, each requires its own bottom-level acceleration structure (BLAS). Each pixel will cast two rays: one for the scene intersection and one for the shadow intersection. The resolution is 1600×900.

Acceleration structure builders have many configuration parameters, which is why there is a lot of data in the tables. We also split Vulkan and Direct3D12 results into separate rows.

  • FT (RT) – BLAS build time for all instances with the Fast Trace flag, and the RQ scene tracing time (in round brackets).
  • FB (RT) – The same with the Fast Build flag.
  • CS (RT) – Compute shader BLAS build time (CS scene tracing time).
  • FTU (RT) – BLAS update time for all instances with the Fast Trace flag (RQ scene tracing time).
  • FBU (RT) – BLAS update time with the Fast Build flag (RQ scene tracing time).
  • CSU (RT) – Compute shader BLAS update time (CS scene tracing time).
  • BLAS (Scratch) – Memory required for the BLAS buffers (scratch buffer size).

Here are the results with 81 instances of a 52K triangle model. The overall number of animated vertices/triangles is 2.9M/4.2M, and the number of joints is 4212:

FT (RT) FB (RT) CS (RT) FTU (RT) FBU (RT) CSU (RT) BLAS (Scratch)
GeForce 2080 Ti (D3D12) 16.9 ms (0.6 ms) 15.7 ms (0.6 ms) 9.1 ms (5.3 ms) 3.7 ms (0.8 ms) 1.2 ms (0.9 ms) 2.1 ms (7.7 ms) 255 MB (16 MB)
GeForce 2080 Ti (VK) 17.1 ms (0.6 ms) 15.1 ms (0.7 ms) 8.7 ms (3.4 ms) 3.8 ms (0.8 ms) 1.5 ms (1.0 ms) 1.8 ms (4.7 ms) 255 MB (16 MB)
Radeon 6700 XT (D3D12) 30.2 ms (1.2 ms) 30.2 ms (1.2 ms) 8.7 ms (4.8 ms) 4.6 ms (1.3 ms) 4.6 ms (1.3 ms) 3.0 ms (6.1 ms) 656 MB (872 MB)
Radeon 6700 XT (VK) 223.0 ms (1.4 ms) 73.0 ms (1.5 ms) 8.9 ms (4.9 ms) 4.6 ms (1.6 ms) 4.6 ms (1.6 ms) 3.2 ms (6.2 ms) 656 MB (872 MB)
Radeon Vega 56 (macOS) 185.5 ms (8.4 ms) 185.6 ms (8.4 ms) 16.8 ms (8.1 ms) 21.5 ms (11.9 ms) 21.7 ms (12.1 ms) 4.54 ms (11.2 ms) 355 MB (355 MB)
Apple M1 (macOS) 394.9 ms (29.2 ms) 395.2 ms (29.0 ms) 74.5 ms (42.6 ms) 29.8 ms (34.5 ms) 29.9 ms (34.8 ms) 18.1 ms (49.2 ms) 355 MB (355 MB)

Here is what we have:

  • It's impossible to build the BLAS for animated characters every frame; the BLAS update mode must be used. Otherwise, the BLAS builder will significantly reduce the FPS.
  • The AMD Vulkan BLAS builder is much slower than the Direct3D12 builder. The best AMD BLAS build time is twice as long as Nvidia's.
  • Ray queries on the Apple M1 are twice as fast as the CS solution, which is interesting considering there was no difference in our previous test.
  • There is no coherent buffer memory in the Metal shading language, which is why we perform the BVH update step using atomics. That affects performance negatively.
  • AMD has enormous memory consumption for the BLAS and scratch buffers.
  • Our GPU BVH builder outperforms any available implementation in full rebuild mode.

Now let's find the ray tracing limits by increasing the number of characters. This time we will use a simplified model of 1500 triangles. The overall number of characters is 2401, the number of triangles in the scene is 3.6M, and the number of joints is 125K. It's getting difficult for the CPU to animate all characters independently, so we reuse the joint transformations across multiple instances.

FT (RT) FB (RT) CS (RT) FTU (RT) FBU (RT) CSU (RT) BLAS (Scratch)
GeForce 2080 Ti (D3D12) 15.8 ms (0.6 ms) 14.5 ms (0.6 ms) 8.3 ms (6.2 ms) 7.0 ms (0.8 ms) 1.5 ms (0.9 ms) 1.6 ms (6.7 ms) 224 MB (15 MB)
GeForce 2080 Ti (VK) 16.3 ms (0.6 ms) 14.6 ms (0.7 ms) 8.1 ms (4.5 ms) 4.8 ms (0.8 ms) 1.1 ms (1.0 ms) 1.5 ms (4.6 ms) 224 MB (15 MB)
Radeon 6700 XT (D3D12) 55.9 ms (1.5 ms) 55.8 ms (1.5 ms) 7.5 ms (6.0 ms) 3.0 ms (1.5 ms) 3.0 ms (1.5 ms) 2.4 ms (6.7 ms) 559 MB (742 MB)
Radeon 6700 XT (VK) 2000 ms (5.2 ms) 1070 ms (5.3 ms) 7.8 ms (6.3 ms) 18.5 ms (1.7 ms) 18.7 ms (1.8 ms) 2.6 ms (7.0 ms) 559 MB (742 MB)
Radeon Vega 56 (macOS) 1840 ms (9.8 ms) 1800 ms (9.7 ms) 14.7 ms (10.3 ms) 297.1 ms (10.5 ms) 297.1 ms (10.1 ms) 3.6 ms (11.6 ms) 304 MB (325 MB)
Apple M1 (macOS) 1600 ms (33.1 ms) 1600 ms (32.8 ms) 63.4 ms (58.3 ms) 271.1 ms (36.3 ms) 273.1 ms (36.9 ms) 14.9 ms (62.7 ms) 304 MB (325 MB)

  • The number of instances doesn't affect the Nvidia and Tellusim BVH builder times; only the number of triangles does.
  • There is no mistake in the AMD Vulkan and Metal timings: the BLAS build time is measured in seconds.
  • The Vulkan API performs all BLAS builds/updates in a single API call, but AMD executes the builds serially on Vulkan. The Metal API does the same.
  • The Nvidia GPU is only 50-70% utilized on Windows in the second test, but it works at full speed on Linux.

Here are the videos of these two tests, captured in BLAS update mode; otherwise, the CS solution is faster than HW ray tracing.

How about 10000 BLAS instances? That is 520K joints and 15M triangles. The CPU cannot animate half a million joints well, so we will stop the animation. Nvidia requires a 1 GB BLAS buffer for all instances. The BLAS buffer for AMD is 2.3 GB, and the scratch buffer requires an additional 3 GB of memory. We are not counting pre-transformed vertices (600 MB in the case of 48 bytes per vertex). The BLAS builder will work in update mode. The Nvidia 2080 Ti provides 43 FPS in 2K video mode due to underutilization, but the overall timing is already more than 10 ms, so it's no better than 100 FPS. The Radeon 6700 XT shows 49 FPS with 100% utilization.

Each animated, morphed, or deformed object requires memory for its vertex and BLAS data. Sometimes that is a lot of data and a lot of memory bandwidth, and it affects performance. By contrast, rasterization doesn't require anything except the original geometry and joint transformations, and all transformed data goes directly to the rasterizer. The Tellusim engine is fully GPU-driven, including animation blend trees. The video with 10K animated characters runs at 135 FPS on an Nvidia 2080 Ti. The performance on a Radeon 6700 XT is more than 120 FPS. It works at 20 FPS, with animation, on a mobile Apple M1. Each character has a unique animation and casts a global shadow as well.

How to make a 10K animated characters scene with the Tellusim Engine:

// load object
ObjectMesh object(scene);
if(!object.load("object.glb", &material, ObjectMesh::FlagTriSkiBasTex | ObjectMesh::FlagAnimation)) return 1;

// create nodes (step is the grid spacing between characters)
uint32_t size = 100;
Array<NodeObject> nodes(size * size);
for(uint32_t i = 0; i < nodes.size(); i++) {
  nodes[i] = NodeObject(graph, object);
  nodes[i].setGlobalTransform(Matrix4x3d::translate((i % size) * step, (i / size) * step, 0.0));
}

// somewhere in the update loop
Random random(0);
for(NodeObject node : nodes) {
  ObjectFrame frame;
  uint32_t index = random.geti32(0, object.getNumAnimations() - 1);
  frame.append(ObjectFrame(object.getAnimation(index), time + random.getf32(0.0f, 32.0f)));
}

That's all that's required. The CPU only does some random number generation; everything else is delegated to the GPU.

Tellusim Animation Demo: TellusimAnimation.zip

September 24, 2021

Ray Tracing Performance Comparison

All modern APIs support ray tracing now. It's available on Windows (D3D12 and VK), Linux (VK), and macOS (Metal). There is no difference between D3D12 and VK ray tracing. The Metal API only supports ray queries, which are also available in D3D12 and VK. The main ray tracing concept is to create a BLAS for each geometry and combine all geometries inside a single TLAS. The driver is responsible for everything under the hood.

With a simple modification, our test app can work in ray tracing (RT), ray query (RQ), and compute shader (CS) rendering modes. We are going to render the same 81 instances of 490K triangles, but this time there will be no rasterization at all. Each pixel will always trace primary, shadow, and reflection rays.

The naive compute shader mode uses our generic GPU BVH builder with single-triangle-per-leaf partitioning. Each BVH node consumes 12 bytes, and the whole model requires a 45 MB BVH buffer. For rasterization, this model requires 13 MB for a 32-byte vertex buffer and a 32-bit index buffer.

Let’s take a look at the 1600×900 resolution results first:

RT D3D12 RT VK RQ D3D12 RQ VK CS BLAS (compacted) Build
GeForce 3080 0.58 ms 0.54 ms 0.60 ms 0.55 ms 4.92 ms 33 MB (15 MB) 7 ms (18 ms)
GeForce 2080 Ti 1.06 ms 1.03 ms 0.95 ms 0.97 ms 7.35 ms 33 MB (15 MB) 7 ms (18 ms)
GeForce 1060 M 34.59 ms 33.81 ms 43 MB (37 MB) 7 ms (18 ms)
Radeon 6700 XT 2.50 ms 2.55 ms 1.71 ms 2.06 ms 6.13 ms 76 MB (65 MB) 15 ms (> 500 ms)
Radeon Vega 56 (macOS) 15.84 ms 15.48 ms 82 MB 30 ms
Apple M1 (macOS) 48.8 ms 46.4 ms 82 MB 100 ms
Apple A14 (iOS) 98.0 ms 94.3 ms 82 MB
Adreno 660 (Android) 431 ms
  • Nvidia RTX series GPUs deliver the best performance, and ray queries are slightly faster. The great thing is that the compacted BLAS is only 15 MB, just a few MB larger than the model representation for rasterization. Compute shader ray tracing is 7.7 times slower than the HW-accelerated path.
  • AMD has a 50% difference between ray query (D3D12) and ray tracing (VK). It looks like the driver is only optimized for D3D12 ray queries. HW ray tracing is 3.5 times faster than the compute shader implementation. But the main problem is the size of the BLAS, which is more than four times bigger than it needs to be. Moreover, the first BLAS generation is painfully slow even with the fast build flag enabled.
  • Interesting fact: compute shader ray tracing is faster on AMD than on Nvidia.
  • There is no HW ray tracing on Metal, and the current ray query implementation is slower than a naive compute shader. The BLAS also requires twice as much memory. Some triangles do not intersect, so the model has holes during rendering on all hardware.
  • A passively cooled Apple M1 is only 33% slower than a GeForce 1060M GPU.
  • Vertex cache optimization also improves ray tracing performance.

Now, what about native 4K?

RT D3D12 RT VK RQ D3D12 RQ VK CS
GeForce 3080 3.16 ms 3.13 ms 3.25 ms 3.13 ms 21.00 ms
GeForce 2080 Ti 4.78 ms 4.77 ms 4.92 ms 4.86 ms 32.21 ms
Radeon 6700 XT 12.89 ms 13.35 ms 8.48 ms 9.77 ms 27.94 ms
  • AMD shows a 65% difference between ray query and the ray tracing pipeline. The choice for AMD is obvious.
  • It's better to use the ray tracing pipeline than ray queries on Nvidia, but the difference is only 3%.
  • The best Radeon 6700 XT ray query result is 70% slower than the GeForce 2080 Ti. The biggest gap is 280%.

It's interesting to see how an AMD GPU with more ray tracing cores would perform. The main issue for AMD is the acceleration structure size and its build time, especially for a dynamic TLAS with thousands of instances.


This is an image of compute shader ray tracing alongside a heat map image. The red channel represents the number of BLAS BVH intersection steps, scaled by 128. The main bottleneck for compute shader ray tracing is the ray-BVH intersection, which cannot be optimized well due to divergence and scattered memory access. By contrast, volume-BVH intersection performs great in a compute shader. But, unfortunately, we cannot reuse the HW (API) ray tracing acceleration structures for different purposes.

September 13, 2021

Compute versus Hardware

It was tempting to compare Compute Shader rasterization with Mesh Shader and MDI on all supported platforms. Unreal 5 Nanite is already demonstrating outstanding performance with small-triangle rendering. But what about numbers from tile-based architectures? And how good is a dedicated rasterizer?

Let's use our old Mesh Shader versus MDI test with 498990 64×128 meshlets, without any culling except back-face. The data prepared for meshlet rendering fits compute-based rasterization well: all vertices and indices are independent and tightly packed, so it's possible to spawn a group per meshlet and rasterize a triangle per thread. We will limit software rasterization to depth-only mode because 64-bit atomics are not available on mobile devices and Metal; it will not make a big difference in performance. Metal also does not support atomic operations on textures, but there is a way to create a texture from buffer data and use buffer atomic operations.

We compared the performance of a single DIP with 32-bit indices, Mesh Shader, MultiDrawIndirect, and Compute Shader rasterization on Nvidia, AMD, Intel, Qualcomm, and Apple GPUs.

And here are the results:

Single DIP Mesh Shader MDI/ICB/Loop Compute Shader
GeForce 2080 Ti 12.05 B 12.57 B 12.63 B 17.26 B
GeForce 1060 M 3.86 B 3.90 B 4.55 B
Radeon 6700 XT 14.73 B 4.38 B 3.63 B 16.74 B
Radeon RX 5600M 4.87 B 1.11 B 7.57 B
Radeon Vega 56 (macOS) 2.40 B 796.0 M 3.17 B
Apple M1 (macOS) 1.37 B 739.0 M 2.30 B
Apple A14 (iOS) 666.1 M 475.0 M 1.02 B
Intel UHD Graphics 680.0 M 396.5 M 556.1 M
Adreno 660 (Android) 565.2 M 31.17 M 497.3 M

The table shows the number of processed triangles per second (in millions and billions).


  • The era of GPU fixed-function units and an enormous number of shader types is almost over.
  • MultiDrawIndirect doesn’t work well on mobile because of the tile-based rendering.
  • Even the best Mesh Shader / MDI results are slower than compute-based rasterization.
  • Single shader type is better than 14 dedicated shader types.

What we need are compute shaders and the ability to spawn threads from shaders effectively. Everything else (including raytracing) can be easily implemented on Compute Shader level. Ten years ago, the performance of Intel Larrabee wasn’t enough for compute-only mode, but now it’s possible to discard all other shaders.

A compute shader extension allowing payloads to be written atomically into an image would change everything. Without it, we are limited to 32-bit payload data and must perform a redundant triangle intersection.

uint imageAtomicPayloadMax(gimage2D atomic_image, gimage2D payload_image,
  ivec2 P,
  uint atomic_data, gvec4 payload_data);
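Until such an extension exists, a common workaround on hardware with 64-bit atomics is to pack the depth into the high 32 bits and the payload into the low 32 bits, so a single atomic max resolves both. This is a generic CPU sketch of the idea, not our shader code; it relies on the fact that the bit pattern of a non-negative float compares in the same order as the float itself:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Pack depth (high 32 bits) and payload (low 32 bits) so that a
// single 64-bit atomic max keeps the payload of the winning fragment.
uint64_t pack_fragment(float depth, uint32_t payload) {
    uint32_t bits;
    std::memcpy(&bits, &depth, sizeof(bits));  // floatBitsToUint equivalent
    return (uint64_t(bits) << 32) | payload;
}

void write_fragment(std::atomic<uint64_t> &pixel, float depth, uint32_t payload) {
    uint64_t value = pack_fragment(depth, payload);
    uint64_t current = pixel.load();
    // atomic max via a compare-exchange loop
    while(value > current && !pixel.compare_exchange_weak(current, value)) {}
}
```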

Here is the unoptimized compute rasterization shader that was used in all tests on all platforms and APIs without any modification, thanks to our Clay shader compiler:

layout(local_size_x = GROUP_SIZE) in;

layout(row_major, binding = 0) uniform common_parameters {
  mat4 projection;
  mat4 modelview;
  vec4 camera;
};

layout(std140, binding = 1) uniform compute_parameters {
  uint num_meshlets;
  uint group_offset;
  vec2 surface_size;
  float surface_stride;
};

layout(std140, binding = 2) uniform transform_parameters {
  vec4 transforms[NUM_INSTANCES * 3u];
};

layout(std430, binding = 3) readonly buffer vertices_buffer { vec4 vertices_data[]; };
layout(std430, binding = 4) readonly buffer meshlets_buffer { uint meshlets_data[]; };
// Metal uses a buffer for the output surface, other APIs use a storage image
#if CLAY_MTL
  layout(std430, binding = 5) buffer surface_buffer { uint out_surface[]; };
#else
  layout(binding = 0, set = 1, r32ui) uniform uimage2D out_surface;
#endif

shared uint num_primitives;
shared uint num_vertices;
shared uint base_index;
shared uint base_vertex;

shared vec4 row_0;
shared vec4 row_1;
shared vec4 row_2;

shared vec3 positions[NUM_VERTICES];
shared uint indices[NUM_PRIMITIVES * 3u];

void rasterize(vec3 p0, vec3 p1, vec3 p2) {
  [[branch]] if(p0.z < 0.0f || p1.z < 0.0f || p2.z < 0.0f) return;
  vec3 p10 = p1 - p0;
  vec3 p20 = p2 - p0;
  // back-face culling
  float det = p20.x * p10.y - p20.y * p10.x;
  [[branch]] if(det >= 0.0f) return;
  // triangle bounds
  vec2 min_p = floor(min(min(p0.xy, p1.xy), p2.xy));
  vec2 max_p = ceil(max(max(p0.xy, p1.xy), p2.xy));
  [[branch]] if(max_p.x < 0.0f || max_p.y < 0.0f || min_p.x >= surface_size.x || min_p.y >= surface_size.y) return;
  min_p = clamp(min_p, vec2(0.0f), surface_size - 1.0f);
  max_p = clamp(max_p, vec2(0.0f), surface_size - 1.0f);
  // barycentric coordinate steps
  vec2 texcoord_dx = vec2(-p20.y, p10.y) / det;
  vec2 texcoord_dy = vec2(p20.x, -p10.x) / det;
  vec2 texcoord_x = texcoord_dx * (min_p.x - p0.x);
  vec2 texcoord_y = texcoord_dy * (min_p.y - p0.y);
  for(float y = min_p.y; y <= max_p.y; y += 1.0f) {
    vec2 texcoord = texcoord_x + texcoord_y;
    for(float x = min_p.x; x <= max_p.x; x += 1.0f) {
      if(texcoord.x >= 0.0f && texcoord.y >= 0.0f && texcoord.x + texcoord.y <= 1.0f) {
        float z = p10.z * texcoord.x + p20.z * texcoord.y + p0.z;
        #if CLAY_MTL
          uint index = uint(surface_stride * y + x);
          atomicMax(out_surface[index], floatBitsToUint(z));
        #else
          imageAtomicMax(out_surface, ivec2(vec2(x, y)), floatBitsToUint(z));
        #endif
      }
      texcoord += texcoord_dx;
    }
    texcoord_y += texcoord_dy;
  }
}

void main() {
  uint group_id = gl_WorkGroupID.x + group_offset;
  uint local_id = gl_LocalInvocationIndex;
  // meshlet parameters
  [[branch]] if(local_id == 0u) {
    uint transform_index = (group_id / num_meshlets) * 3u;
    row_0 = transforms[transform_index + 0u];
    row_1 = transforms[transform_index + 1u];
    row_2 = transforms[transform_index + 2u];
    uint meshlet_index = (group_id % num_meshlets) * 4u;
    num_primitives = meshlets_data[meshlet_index + 0u];
    num_vertices = meshlets_data[meshlet_index + 1u];
    base_index = meshlets_data[meshlet_index + 2u];
    base_vertex = meshlets_data[meshlet_index + 3u];
  }
  memoryBarrierShared(); barrier();
  // load vertices
  [[unroll]] for(uint i = 0u; i < NUM_VERTICES; i += GROUP_SIZE) {
    uint index = local_id + i;
    [[branch]] if(index < num_vertices) {
      uint address = (base_vertex + index) * 2u;
      vec4 position = vec4(vertices_data[address].xyz, 1.0f);
      position = vec4(dot(row_0, position), dot(row_1, position), dot(row_2, position), 1.0f);
      position = projection * (modelview * position);
      positions[index] = vec3(round((position.xy * (0.5f / position.w) + 0.5f) * surface_size * 256.0f) / 256.0f - 0.5f, position.z / position.w);
    }
  }
  // load indices (each thread unpacks four packed triangles)
  [[loop]] for(uint i = local_id; (i << 2u) < num_primitives; i += GROUP_SIZE) {
    uint index = i * 12u;
    uint address = base_index + i * 3u;
    uint indices_0 = meshlets_data[address + 0u];
    uint indices_1 = meshlets_data[address + 1u];
    uint indices_2 = meshlets_data[address + 2u];
    indices[index +  0u] = (indices_0 >>  0u) & 0xffu;
    indices[index +  1u] = (indices_0 >>  8u) & 0xffu;
    indices[index +  2u] = (indices_0 >> 16u) & 0xffu;
    indices[index +  3u] = (indices_0 >> 24u) & 0xffu;
    indices[index +  4u] = (indices_1 >>  0u) & 0xffu;
    indices[index +  5u] = (indices_1 >>  8u) & 0xffu;
    indices[index +  6u] = (indices_1 >> 16u) & 0xffu;
    indices[index +  7u] = (indices_1 >> 24u) & 0xffu;
    indices[index +  8u] = (indices_2 >>  0u) & 0xffu;
    indices[index +  9u] = (indices_2 >>  8u) & 0xffu;
    indices[index + 10u] = (indices_2 >> 16u) & 0xffu;
    indices[index + 11u] = (indices_2 >> 24u) & 0xffu;
  }
  memoryBarrierShared(); barrier();
  // rasterize triangles
  [[branch]] if(local_id < num_primitives) {
    uint index = local_id * 3u;
    uint index_0 = indices[index + 0u];
    uint index_1 = indices[index + 1u];
    uint index_2 = indices[index + 2u];
    rasterize(positions[index_0], positions[index_1], positions[index_2]);
  }
}

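One detail worth noting in the shader above is the vertex snapping: screen-space positions are rounded to 1/256 of a pixel (8-bit subpixel precision) before rasterization. The equivalent scalar operation:

```cpp
#include <cmath>

// Snap a screen-space coordinate to 8-bit subpixel precision
// (1/256 of a pixel), as the compute rasterizer above does.
float snap_subpixel(float x) {
    return std::round(x * 256.0f) / 256.0f;
}
```

A common reason for such snapping is to keep the edge arithmetic consistent between adjacent triangles.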
Reproduction binaries for Windows: TellusimDrawMeshlet.zip

September 9, 2021

MultiDrawIndirect and Metal

The most significant omission in the Apple Metal API is the absence of MultiDrawIndirect functionality. MDI is the most compatible way of rendering for GPU-driven technology. There are several ways to emulate MDI on macOS and iOS.

The simplest way is a loop of drawIndexedPrimitives() commands with different offsets into the indirect buffer. MultiDrawIndirectCount will require a CPU-GPU synchronization to get the count value from GPU memory. This may not be optimal, but it works and can even outperform a single MDI call on some hardware because of insufficient driver optimizations.

for(uint32_t i = 0; i < num_draws; i++) {
  [encoder drawIndexedPrimitives:... indirectBufferOffset:offset];
  offset += stride;
}

The official Metal way is to use an Indirect Command Buffer (ICB) and encode rendering commands on the CPU or GPU. All textures and samplers must be passed as argument buffer parameters even if they do not change during rendering. The Metal shading language has a built-in draw_indexed_primitives() function that corresponds 1:1 to its Metal API analog. That sounds good except for a minor limitation of 16384 drawing commands.

In theory, a single call to executeCommandsInBuffer() should easily outperform the loop of drawIndexedPrimitives(). But let's run some tests. We will use the same applications as in our Mesh Shader performance comparison.

Metal Loop Metal ICB
Apple M1 4/2 20 M 48 M
Apple M1 64/84 698 M 794 M
Apple M1 64/126 740 M 840 M
Apple M1 96/169 1.22 B 933 M
Apple M1 128/212 1.49 B 1.05 B
Apple M1 32-bit indices 1.36 B
Radeon Vega 56 4/2 24 M 16 M
Radeon Vega 56 64/84 1.13 B 647 M
Radeon Vega 56 64/126 1.20 B 687 M
Radeon Vega 56 96/169 1.84 B 1.05 B
Radeon Vega 56 128/212 2.54 B 1.44 B
Radeon Vega 56 32-bit indices 3.9 B
  • The ICB performs better than the loop when the number of primitives per draw call is small on the Apple GPU. But the ICB is 1.5 times slower when DIPs contain more than 200 primitives, which is weird behavior for the ICB.
  • The ICB is always almost two times slower than the loop of indirect commands on the AMD GPU. The exciting thing is that the power-hungry Radeon Vega 56 runs at approximately the same performance as the integrated Apple M1.

But how about running tests with the same HW under Direct3D12 and Vulkan:

Metal Loop Metal ICB D3D12 MDI VK MDI
Radeon Vega 56 4/2 24 M 16 M 34 M 34 M
Radeon Vega 56 64/84 1.13 B 647 M 1.34 B 1.33 B
Radeon Vega 56 64/126 1.20 B 687 M 1.41 B 1.40 B
Radeon Vega 56 96/169 1.84 B 1.05 B 2.13 B 2.11 B
Radeon Vega 56 128/212 2.54 B 1.44 B 2.91 B 2.84 B
  • The loop of indirect commands is almost as fast as a single MDI API call, but the cost is a huge CPU load. The ICB and MDI solutions do not load the CPU at all.
  • All AMD GPUs have native MDI functionality available even in D3D11, but not in Metal. It's understandable that AMD GPUs are not the main target for Apple.

The current version of the GravityMark benchmark uses the loop approach for Metal rendering. This dramatically reduces the performance of the same AMD GPU:


ICB is not widely used functionality, and it is very difficult to debug. Metal shader validation is still not compatible with ICB; the application crashes with:

-[MTLGPUDebugDevice newIndirectCommandBufferWithDescriptor:maxCommandCount:options:]:1036: failed assertion `Indirect Command Buffers are not currently supported with Shader Validation'

Fortunately, there is the MTL_SHADER_VALIDATION_GPUOPT_ENABLE_INDIRECT_COMMAND_BUFFERS parameter that can enable ICB validation. But trivial ICB shaders can cause a non-trivial error:

Compiler encountered an internal error

It's possible to simplify a trivial ICB shader so that it passes the compilation stage. Then MTLArgumentBuffer will remind you that it has its own debugger:


Okay, there is no debugging available for ICB now; let's try without it. The good thing is that macOS usually does not hang for more than 20 seconds in case of an error. More often, a nice magenta screen tells you that something has gone wrong:


As a result, we have an internal ICB version of GravityMark which demonstrates that:

  • A GPU-CPU synchronization is required because the indirect version of executeCommandsInBuffer() causes a magenta screen on the M1.
  • The 16384 ICB length limit is too low for all asteroids, so we have to repeat the indirect executeCommandsInBuffer() command a couple of times.
  • AMD is 18% slower with the ICB than with the loop. The only bonus is that the CPU is free. Native Windows or Linux on a Mac would give a 3x boost for graphics.
  • The M1 is 39% faster with the ICB than with the loop. It should be even better without the synchronization. Even with this limitation, the M1 outperforms the best integrated GPUs.
  • The A14 is 44% faster with the ICB than with the loop. It crashed on the last take, with a score of 3600 instead of 2536.
  • It is a bad time to buy a Mac with an AMD GPU for macOS.

September 4, 2021

Mesh Shader versus MultiDrawIndirect

Mesh and Task (Amplification) shaders were a nice addition to the existing set of shaders. They allow the use of custom data formats for better compression and can dynamically generate geometry without using intermediate storage for it. Let’s see how they perform on different hardware and API.

We will use two simple test applications for that. The first draws a million independent quad primitives without Task shader amplification. Nvidia can dispatch a million Task shader invocations, but such numbers crash the AMD driver, so we will use 16 Task shader batches instead. The geometry output is just 4 vertices and 6 indices per Mesh shader.

The second application draws many complex model instances. Each model is made of 283K vertices and 491K triangles, and the number of instances is 81. That makes 20M vertices and 40M triangles to be processed. The number of Meshlets depends on the vertex/primitive limits and varies from 6130 to 2709. The size of the vertex buffer also increases because each Meshlet must be independent. There is no geometry culling; we are checking geometry throughput only.

Both applications can draw all Meshlets individually by using a single MultiDrawIndirect command. Instancing is not used.

GeForce 2080 Ti 4/2 685 M 707 M 655 M 360 M
GeForce 2080 Ti 64/84 12.92 B 11.20 B 11.34 B 10.80 B
GeForce 2080 Ti 64/126 12.51 B 11.21 B 11.71 B 10.92 B
GeForce 2080 Ti 96/169 12.71 B 11.23 B 11.27 B 11.44 B
GeForce 2080 Ti 128/212 12.12 B 12.20 B 10.84 B 11.72 B
GeForce 2080 Ti 32-bit indices 11.79 B 10.95 B
Quadro RTX 8000 4/2 683 M 693 M 661 M 345 M
Quadro RTX 8000 64/84 12.51 B 15.26 B 14.72 B 12.31 B
Quadro RTX 8000 64/126 12.23 B 15.37 B 14.30 B 12.83 B
Quadro RTX 8000 96/169 12.50 B 16.62 B 14.05 B 15.39 B
Quadro RTX 8000 128/212 12.15 B 17.04 B 13.43 B 15.64 B
Quadro RTX 8000 32-bit indices 13.63 B 12.88 B
Radeon 6700 XT 4/2 118 M 85.1 M 85.5 M
Radeon 6700 XT 64/84 4.31 B 3.42 B 3.40 B
Radeon 6700 XT 64/126 4.37 B 3.61 B 3.59 B
Radeon 6700 XT 96/169 4.43 B 5.48 B 5.40 B
Radeon 6700 XT 128/212 4.49 B 7.42 B 7.31 B
Radeon 6700 XT 32-bit indices 14.68 B 14.16 B

The table demonstrates the number of triangles rendered per second.

[images: performance charts]

The results are very interesting:

  • MultiDrawIndirect is faster than Mesh shaders on Nvidia Quadro.
  • There is no difference for Nvidia between Mesh shaders and MultiDrawIndirect, except under Vulkan with a very small number of primitives.
  • 32-bit indices for raw geometry work faster on AMD. But any Mesh shader configuration reduces the hardware capabilities by 3 times. More importantly, the MultiDrawIndirect approach starts to work faster than Mesh shaders when the number of primitives per Meshlet is bigger than 128.

But let’s try to draw 256K boxes by using a Geometry shader. We will use another simple application that draws a 3D grid of boxes. Each box is a point primitive for the Geometry shader. The Mesh shader will draw 64 boxes per Task shader group.

GeForce 2080 Ti 1.5 B 3.7 B 1.5 B 2.9 B
Radeon 6700 XT 1.4 B 2.2 B 1.9 B

Geometry shader is a clear winner here, especially on Nvidia hardware.

A properly implemented MultiDrawIndirectCount allows doing the same job as Mesh shaders. Geometry shaders do simple primitive rendering better than Mesh shaders. We hope that vendors will provide better API flexibility instead of implementing an enormous number of different shader types.

Here are a couple more observations about Mesh shaders:

  • All Mesh shader vertices and indices must be written by threads with indices lower than 32 on Nvidia under Vulkan, even if the shader group size is bigger than 32. Otherwise, the result is ignored.
  • Nvidia can dispatch any number of Task shader groups under D3D12 and Vulkan. AMD has a limit of 65K.

The Clay shader compiler can automatically translate Task and Mesh shaders from GLSL/SPIR-V to HLSL.

Reproduction binaries for Windows: TellusimDrawMeshlet.zip

June 30, 2021

Shader Pipeline

The first shaders were very simple programs that transformed and lit vertices before the rasterization stage. They were written in an assembly language. Now shaders can be found everywhere, from the user interface to physics and logic, because they are no different from any other source code. The number of shaders grows every year with the flexibility of modern GPUs, and software delegates more and more tasks to the GPU instead of the CPU.

There are not many ways of writing shaders nowadays: a HLSL/GLSL/MSL/WGSL/CU/CL dialect. But mostly, the code will be the same, with some minor differences. It’s not possible to create a perfect binary shader that will run across all available hardware: because GPUs have different architectures, it’s impossible to make an optimal binary format for everybody. So an intermediate binary representation is used to simplify the application runtime, and it’s the driver’s job to generate the perfect binary from the input intermediate shader representation.

With cross-API technology, you need different shaders for different APIs. Moreover, some platforms do not allow compiling shaders during the application’s execution and require precompiled shader input. In the case of Vulkan, it’s the SPIR-V format, a binary representation of GLSL shader code. That binary shader can be loaded directly by the Vulkan runtime, and the driver will transform it for the hardware. The OpenGL ARB_gl_spirv extension allows loading the same SPIR-V binary shader directly into the OpenGL runtime with minor modifications related to samplers and textures. Unfortunately, it only works on Nvidia: AMD and Intel can’t handle geometry and tessellation shaders from the SPIR-V binary.

The problems appear when the same shader needs to run on the Direct3D12, Metal, or WebGPU APIs. SPIRV-Cross tools make it possible by translating binary shaders to HLSL or MSL formats with many tweak parameters for resource binding. That translation is not fully compatible between platforms, so the engine must know how best to transform parameters for each platform. After that, the shader source code can be compiled by the d3dcompiler/dxcompiler/metal toolset for the required runtime.

Another option is to use HLSL shaders as input and cross-compile them to the SPIR-V representation for Vulkan. But that also requires resource binding magic. Every platform wants to have its own shader language that is not compatible with other platforms. SPIR-V is a great attempt to make a standard for everybody, but only the Vulkan API can use it. Other platforms require different shader languages or binary formats.

The number of different shader types is also growing. We have Vertex, Fragment, Geometry, Tessellation Control, Tessellation Evaluation, Compute, Task, Mesh, RayGen, RayMiss, ClosestHit, AnyHit, Intersection, and Callable shader types. All of them have different input and output semantics. Luckily, the Khronos Group provides tools to validate and compile all of these shaders. We tried to use these tools, but, unfortunately, it was impossible to cross-compile our Compute, Tessellation, and Geometry shaders with their help. So we’ve created our own shader pipeline based on the GLSL and SPIR-V specifications, and now we are excited to tell you more about it.

We use the GLSL language in our Tellusim Engine as the primary language for all platforms. All shader types, including Mesh and Ray Tracing shaders, are supported. Because of the high performance of our shader toolset, we can skip the offline shader compilation step and do everything at runtime. And it works fast with any amount of code. For platforms that do not allow compiling shaders at runtime, we use a precompiled shader cache.

The GravityMark GPU benchmark requires more than 20K lines of GLSL shaders. And there is a huge difference in application startup time when the Khronos glslang compiler is used for GLSL to SPIR-V compilation.

Here is a log from a build with the Khronos glslang compiler:

M:  63.32 ms: Creating 1600x900 Vulkan Window
M:   1.493 s: Creating SceneManager
M:   9.346 s: Creating RenderManager
M:  12.431 s: Creating Scene
M:  13.551 s: Creating 200,000 Asteroids
M:  13.701 s: Updating Scene
M:  13.851 s: GravityMark v1.2 is Ready in 13.9 s

And this is the Clay shader compiler doing the same job 10 times faster:

M:  58.59 ms: Creating 1600x900 Vulkan Window
M: 288.47 ms: Creating SceneManager
M: 411.18 ms: Creating RenderManager
M: 541.40 ms: Creating Scene
M:   1.289 s: Creating 200,000 Asteroids
M:   1.364 s: Updating Scene
M:   1.500 s: GravityMark v1.2 is Ready in 1.5 s

There is no difference in FPS between shader compilers.

Conversion from the SPIR-V representation to other shader languages is performed at the same incredible speed. Moreover, all resource bindings are handled automatically by the engine. Only one GLSL shader is needed for all supported platforms, including Cuda and WebGPU. This gives great flexibility and significantly reduces the time it takes to develop new features. We also use all the available debugging tools from the supported platforms.

Some GLSL features, such as embedded arrays, are not supported because we don’t need them.

You can download the Clay shader compiler command-line tool for Windows and Linux with all shader language back ends (Vulkan SPIR-V, OpenGL SPIR-V, OpenGL GLSL, OpenGLES GLSL, Direct3D12 HLSL, Direct3D11 HLSL, WebGPU WGSL, Metal MSL, and Cuda) here.