January 16, 2022

Mesh Shader Emulation

Mesh shader performance is not where AMD GPUs shine: there is not much difference in FPS between MDI and Mesh shaders. But there is a trick that emulates Mesh shader functionality on any GPU with compute shader support. This technique runs faster than both MDI and Mesh shaders on almost all AMD GPUs. It also brings an unbelievable (7x) performance boost on Qualcomm GPUs.

Geometry prepared for Mesh shader rendering contains vertex and meshlet data buffers. Each meshlet uses uint8_t indices for its vertices, which allows loading more primitives in fewer memory lookups (uint32_t[3] per 4 triangles).
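The packing can be sketched on the CPU side as follows. This is a minimal illustration, not the original tool code: pack_triangles and unpack_index are hypothetical names, and the byte order (first index in the lowest byte) is taken from the way the compute shader unpacks each word.

```cpp
#include <array>
#include <cstdint>

// Packs 12 8-bit meshlet-local vertex indices (4 triangles) into 3 32-bit words.
// The first index goes into the lowest byte of each word (assumed byte order).
std::array<uint32_t, 3> pack_triangles(const std::array<uint8_t, 12> &indices) {
  std::array<uint32_t, 3> words = {0u, 0u, 0u};
  for (uint32_t i = 0; i < 12; i++) {
    words[i / 4] |= uint32_t(indices[i]) << ((i % 4) * 8);
  }
  return words;
}

// Extracts the i-th 8-bit index (0..11) from the packed words.
uint8_t unpack_index(const std::array<uint32_t, 3> &words, uint32_t i) {
  return uint8_t((words[i / 4] >> ((i % 4) * 8)) & 0xffu);
}
```

Four triangles need 12 index bytes, which is exactly three 32-bit loads.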

This is a data layout for each meshlet:

uint32_t Number of primitives
uint32_t Number of vertices 
uint32_t Base index (index buffer offset)
uint32_t Base vertex (vertex buffer offset)
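On the host side this header maps naturally to a 16-byte struct. The sketch below is an assumption about how it could be declared: the Meshlet type and the packed_index_words helper are hypothetical, with field names borrowed from the compute shader.

```cpp
#include <cstdint>

// One meshlet header: four 32-bit words, in the order listed above.
struct Meshlet {
  uint32_t num_primitives;  // number of triangles in the meshlet
  uint32_t num_vertices;    // number of vertices in the meshlet
  uint32_t base_index;      // offset of the packed 8-bit indices
  uint32_t base_vertex;     // offset of the first vertex in the vertex buffer
};

static_assert(sizeof(Meshlet) == 16, "header must be four tightly packed words");

// Number of 32-bit words read for the packed indices of one meshlet:
// triangles are fetched in groups of 4, and each group takes 3 words.
uint32_t packed_index_words(const Meshlet &m) {
  return ((m.num_primitives + 3u) / 4u) * 3u;
}
```

For example, a meshlet with 126 primitives reads 96 words of packed indices.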

The Mesh shader is straightforward: load the meshlet info, fetch vertices and indices, submit vertices and indices.

But how will we render meshlets without Mesh shader support? We will use a single draw call with the help of a simple compute shader. This shader will transform meshlet indices into the packed uint32_t triangle indices that contain the meshlet index and the vertex index. The shader is trivial:

layout(std430, binding = 0) readonly buffer meshlets_buffer { uint meshlets_data[]; };
layout(std430, binding = 1) writeonly buffer indexing_buffer { uvec4 indexing_data[]; };

// base_group, num_meshlets, GROUP_SIZE and the local_size_x layout are declared elsewhere.
// The two values below are assumed to be shared: thread 0 writes them, all threads read them.
shared uint num_primitives;
shared uint base_index;

void main() {
  uint group_id = base_group + gl_WorkGroupID.x;
  uint local_id = gl_LocalInvocationIndex;
  // the first thread loads the meshlet header for the whole group
  [[branch]] if(local_id == 0u) {
    uint meshlet_index = (group_id % num_meshlets) * 4u;
    num_primitives = meshlets_data[meshlet_index + 0u];
    base_index = meshlets_data[meshlet_index + 2u];
  }
  memoryBarrierShared(); barrier();
  // each thread loads 12 packed 8-bit indices (4 triangles)
  uint indices_0 = 0u;
  uint indices_1 = 0u;
  uint indices_2 = 0u;
  [[branch]] if(local_id * 4u < num_primitives) {
    uint address = base_index + local_id * 3u;
    indices_0 = meshlets_data[address + 0u];
    indices_1 = meshlets_data[address + 1u];
    indices_2 = meshlets_data[address + 2u];
  }
  // unpack the bytes and put the meshlet (group) index above the 8-bit vertex index
  uint group_index = group_id << 8u;
  uint index = (GROUP_SIZE * group_id + local_id) * 3u;
  indexing_data[index + 0u] = (uvec4(indices_0, indices_0 >> 8u, indices_0 >> 16u, indices_0 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 1u] = (uvec4(indices_1, indices_1 >> 8u, indices_1 >> 16u, indices_1 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 2u] = (uvec4(indices_2, indices_2 >> 8u, indices_2 >> 16u, indices_2 >> 24u) & 0xffu) | group_index;
}
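To check the output of this pass, the same expansion can be mirrored on the CPU. The function below is a simplified reference under stated assumptions: expand_meshlet is a hypothetical name, it handles one work group at a time, and it leaves out base_group and the modulo by num_meshlets.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference of the index expansion for one meshlet (one work group).
// meshlets_data holds the 4-word headers and the packed 8-bit indices.
// group_size is the number of threads per group; each thread emits 12 indices.
// Every output index is (group_id << 8) | meshlet-local vertex.
std::vector<uint32_t> expand_meshlet(const std::vector<uint32_t> &meshlets_data,
                                     uint32_t group_id, uint32_t group_size) {
  uint32_t num_primitives = meshlets_data[group_id * 4u + 0u];
  uint32_t base_index = meshlets_data[group_id * 4u + 2u];
  std::vector<uint32_t> out(std::size_t(group_size) * 12u);
  for (uint32_t local_id = 0; local_id < group_size; local_id++) {
    for (uint32_t i = 0; i < 12u; i++) {
      uint32_t vertex = 0u;  // slots past the last triangle stay at vertex 0
      if (local_id * 4u < num_primitives) {
        uint32_t word = meshlets_data[base_index + local_id * 3u + i / 4u];
        vertex = (word >> ((i % 4u) * 8u)) & 0xffu;
      }
      out[std::size_t(local_id) * 12u + i] = vertex | (group_id << 8u);
    }
  }
  return out;
}
```

Slots past the last triangle repeat the same index three times, so they form degenerate triangles that produce no pixels.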

After that compute pass, a single drawElements() call is enough to render all meshlets. The amount of memory written and then read is 11 MB per 1M triangles. 16-bit indices are not an option because they would limit a draw call to only 1024 meshlets. A TriangleIndex built-in shader variable would make that possible, but we don’t have it.
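The numbers above follow from simple arithmetic, sketched here with two hypothetical helpers. The byte count ignores padding slots at the end of each meshlet. The 1024-meshlet figure is consistent with a 6-bit vertex field, which would be enough for 64-vertex meshlets; that split is an assumption here, and with the 8-bit field used by the shader the limit would be 256.

```cpp
#include <cstdint>

// Bytes of generated index data for a given triangle count:
// 3 indices per triangle, 4 bytes per 32-bit index.
uint64_t index_bytes(uint64_t triangles) {
  return triangles * 3u * sizeof(uint32_t);
}

// Meshlets addressable in one draw call when vertex_bits of an
// index_bits-wide index are reserved for the meshlet-local vertex.
uint64_t max_meshlets(uint32_t index_bits, uint32_t vertex_bits) {
  return uint64_t(1) << (index_bits - vertex_bits);
}
```

index_bytes(1000000) gives 12,000,000 bytes, which is about 11.4 MiB, and max_meshlets(32, 8) gives roughly 16.7 million meshlets per draw call with 32-bit indices.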

The vertex shader can get the meshlet index and the meshlet vertex index with simple math on the VertexIndex built-in variable:

uint meshlet = gl_VertexIndex >> 8u;
uint vertex = gl_VertexIndex & 0xffu;
// load vertex data from SSBO
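The same decode can be mirrored on the CPU for validation. In this sketch, decode_index and vertex_address are hypothetical names, and adding the meshlet's base vertex (word 3 of its header) to the local vertex index is an assumption about how the vertex data is addressed.

```cpp
#include <cstdint>

struct DecodedIndex {
  uint32_t meshlet;  // meshlet index (upper 24 bits)
  uint32_t vertex;   // meshlet-local vertex index (lower 8 bits)
};

// Splits a packed index the same way the vertex shader does.
DecodedIndex decode_index(uint32_t vertex_index) {
  return DecodedIndex{vertex_index >> 8u, vertex_index & 0xffu};
}

// Vertex buffer location: the meshlet's base vertex plus the local vertex index.
uint32_t vertex_address(const uint32_t *meshlets_data, uint32_t vertex_index) {
  DecodedIndex d = decode_index(vertex_index);
  uint32_t base_vertex = meshlets_data[d.meshlet * 4u + 3u];
  return base_vertex + d.vertex;
}
```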

There is no problem adding a per-triangle visibility culling test to the compute shader. It will improve performance because back-facing and invisible triangles will not consume bandwidth.
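One possible form of such a test is a signed-area back-face check, sketched below. The Vec2 type and both functions are hypothetical, the positions are assumed to be already projected to screen space, and front faces are assumed to be counter-clockwise with the Y axis pointing up. How a rejected triangle is then dropped from the index buffer is not specified here.

```cpp
struct Vec2 {
  float x, y;
};

// Twice the signed area of triangle (a, b, c).
// Positive for counter-clockwise winding when Y points up.
float signed_area2(Vec2 a, Vec2 b, Vec2 c) {
  return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Back-facing and zero-area triangles are rejected.
bool is_culled(Vec2 a, Vec2 b, Vec2 c) {
  return signed_area2(a, b, c) <= 0.0f;
}
```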

Let’s check the results in different configurations and platforms where the emulation is faster:

GPU and configuration           Single DIP  Emulation  Mesh Shader  MDI/ICB/Loop
Radeon 6900 XT 64/126           17.2 B      9.6 B      9.2 B        4.1 B
Radeon 6900 XT 128/212          17.2 B      10.1 B     9.2 B        8.5 B
Radeon 6700 XT 64/126           14.1 B      7.7 B      4.6 B        4.1 B
Radeon 6700 XT 128/212          14.1 B      7.9 B      4.6 B        8.3 B
Radeon 5600 M 64/126            5.0 B       2.9 B      -            1.4 B
Radeon 5600 M 128/212           5.0 B       2.9 B      -            2.8 B
Radeon Vega 56 64/126 (macOS)   4.1 B       2.7 B      -            860 M
Radeon Vega 56 128/212 (macOS)  4.1 B       2.9 B      -            1.8 B
Adreno 660 64/126               583 M       265 M      -            36.6 M
Adreno 660 128/212              596 M       309 M      -            74.7 M
Mali-G78 MP20 64/126            176 M       153 M      -            83 M
Mali-G78 MP20 128/212           187 M       123 M      -            134 M

The results are in billions (B) and millions (M) of processed triangles per second. MDI is Multi Draw Indirect, ICB is Indirect Command Buffer, and Loop is a CPU loop of multiple drawIndirect() commands.

If you need to draw a lot of small DIPs, it’s better to pack everything into a single draw call on AMD and Qualcomm. A single DIP works better than Mesh shaders or MDI even with the additional index buffer generation overhead.

Windows and Linux binaries: TellusimDrawMeshlet.zip