September 4, 2021
Mesh and Task (Amplification) shaders were a nice addition to the existing set of shaders. They allow custom data formats for better compression and can generate geometry dynamically without intermediate storage. Let’s see how they perform on different hardware and APIs.
We will use two simple test applications. The first draws a million independent quad primitives without Task shader multiplication. Nvidia can dispatch a million Task shader invocations, but such numbers crash the AMD driver, so we use 16 Task shader batches instead. The geometry output is just 4 vertices and 6 indices per Mesh shader.
The second application draws many instances of a complex model. Each model is made of 283K vertices and 491K triangles, and there are 81 instances. That makes about 23M vertices and 40M triangles to process. The number of Meshlets depends on the vertex/primitive limits and varies from 6130 to 2709. The vertex buffer also grows because each Meshlet must be independent, so vertices shared between Meshlets are duplicated. There is no geometry culling: we are measuring geometry throughput only.
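To illustrate why the Meshlet count and the vertex buffer size depend on the vertex/primitive limits, here is a minimal sketch of a greedy Meshlet builder. It is not the builder used for these tests; the function name `build_meshlets` and the dictionary layout are hypothetical, and real builders usually also optimize for locality.

```python
def build_meshlets(indices, max_vertices, max_primitives):
    """Greedily split an indexed triangle list into independent meshlets."""
    meshlets = []
    local = {}       # global vertex index -> local index inside the current meshlet
    triangles = []   # triangles of the current meshlet, as local indices

    def flush():
        if triangles:
            # dict keeps insertion order, so keys are already in local-index order
            meshlets.append({"vertices": list(local), "triangles": list(triangles)})
        local.clear()
        triangles.clear()

    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        new_vertices = len({v for v in tri if v not in local})
        # start a new meshlet when either limit would be exceeded
        if len(local) + new_vertices > max_vertices or len(triangles) >= max_primitives:
            flush()
        for v in tri:
            local.setdefault(v, len(local))
        triangles.append(tuple(local[v] for v in tri))
    flush()
    return meshlets


# A strip of 4 quads: 10 unique vertices, 8 triangles.
indices = []
for q in range(4):
    a, b, c, d = 2 * q, 2 * q + 1, 2 * q + 2, 2 * q + 3
    indices += [a, b, c, c, b, d]

meshlets = build_meshlets(indices, max_vertices=4, max_primitives=2)
total_vertices = sum(len(m["vertices"]) for m in meshlets)
print(len(meshlets), total_vertices)
```

With the 4/2 limits, the strip's 10 unique vertices turn into 4 Meshlets holding 16 vertices in total, because the vertices on Meshlet boundaries are stored twice. Larger limits give fewer Meshlets and less duplication.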
Both applications can also draw every Meshlet individually with a single MultiDrawIndirect command. Instancing is not used.
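As a rough sketch of what the MultiDrawIndirect path consumes, the snippet below packs one indexed draw command per Meshlet into a single argument buffer. The five-field layout matches `VkDrawIndexedIndirectCommand` and `D3D12_DRAW_INDEXED_ARGUMENTS`; the helper `pack_draw_commands` and the input tuple layout are hypothetical and are not taken from the test applications.

```python
import struct

# Shared field layout: index count, instance count, first index,
# vertex offset (signed), first instance. Little-endian, 20 bytes per command.
COMMAND = struct.Struct("<IIIiI")


def pack_draw_commands(meshlets):
    """meshlets: iterable of (index_count, first_index, vertex_offset) tuples."""
    data = bytearray()
    for index_count, first_index, vertex_offset in meshlets:
        # one instance per draw, since instancing is not used
        data += COMMAND.pack(index_count, 1, first_index, vertex_offset, 0)
    return bytes(data)


# Two quads, each stored as an independent meshlet with 4 vertices and 6 indices.
quads = [(6, 0, 0), (6, 6, 4)]
buffer = pack_draw_commands(quads)
draw_count = len(buffer) // COMMAND.size
print(draw_count, COMMAND.size)
```

The whole buffer is then submitted with one indirect multi-draw call whose draw count equals the number of Meshlets.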
| GPU, vertices/primitives per Meshlet | D3D12 MS | D3D12 MDI | D3D12 VS | VK MS | VK MDI | VK VS |
|---|---|---|---|---|---|---|
| GeForce 2080 Ti 4/2 | 685 M | 707 M | | 655 M | 360 M | |
| GeForce 2080 Ti 64/84 | 12.92 B | 11.20 B | | 11.34 B | 10.80 B | |
| GeForce 2080 Ti 64/126 | 12.51 B | 11.21 B | | 11.71 B | 10.92 B | |
| GeForce 2080 Ti 96/169 | 12.71 B | 11.23 B | | 11.27 B | 11.44 B | |
| GeForce 2080 Ti 128/212 | 12.12 B | 12.20 B | | 10.84 B | 11.72 B | |
| GeForce 2080 Ti 32-bit indices | | | 11.79 B | | | 10.95 B |
| Quadro RTX 8000 4/2 | 683 M | 693 M | | 661 M | 345 M | |
| Quadro RTX 8000 64/84 | 12.51 B | 15.26 B | | 14.72 B | 12.31 B | |
| Quadro RTX 8000 64/126 | 12.23 B | 15.37 B | | 14.30 B | 12.83 B | |
| Quadro RTX 8000 96/169 | 12.50 B | 16.62 B | | 14.05 B | 15.39 B | |
| Quadro RTX 8000 128/212 | 12.15 B | 17.04 B | | 13.43 B | 15.64 B | |
| Quadro RTX 8000 32-bit indices | | | 13.63 B | | | 12.88 B |
| Radeon 6700 XT 4/2 | 118 M | 85.1 M | | | 85.5 M | |
| Radeon 6700 XT 64/84 | 4.31 B | 3.42 B | | | 3.40 B | |
| Radeon 6700 XT 64/126 | 4.37 B | 3.61 B | | | 3.59 B | |
| Radeon 6700 XT 96/169 | 4.43 B | 5.48 B | | | 5.40 B | |
| Radeon 6700 XT 128/212 | 4.49 B | 7.42 B | | | 7.31 B | |
| Radeon 6700 XT 32-bit indices | | | 14.68 B | | | 14.16 B |
The table shows the number of triangles rendered per second (M = million, B = billion).
The results are very interesting:
- MultiDrawIndirect is faster than Mesh shaders on the Nvidia Quadro.
- On the Nvidia GeForce there is little difference between Mesh shaders and MultiDrawIndirect, except under Vulkan with a very small number of primitives.
- On AMD, raw geometry with 32-bit indices is the fastest path, and any Mesh shader configuration cuts that throughput roughly 3 times. More importantly, MultiDrawIndirect becomes faster than Mesh shaders once the number of primitives per Meshlet exceeds 128.
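The AMD observations can be checked with a few lines of arithmetic on the Radeon 6700 XT numbers. The sketch below assumes that the three per-row values are D3D12 MS, D3D12 MDI and VK MDI, and that the 32-bit indices row holds the plain vertex shader results; the variable names are only illustrative.

```python
# Triangles per second in billions, copied from the Radeon 6700 XT rows (D3D12).
# Assumption: the first value of each row is Mesh shader, the second is MultiDrawIndirect.
mesh_shader = {"64/84": 4.31, "64/126": 4.37, "96/169": 4.43, "128/212": 4.49}
multi_draw = {"64/84": 3.42, "64/126": 3.61, "96/169": 5.48, "128/212": 7.42}
vertex_shader_32bit = 14.68

for config in mesh_shader:
    slowdown = vertex_shader_32bit / mesh_shader[config]
    mdi_vs_ms = multi_draw[config] / mesh_shader[config]
    print(f"{config}: VS/MS = {slowdown:.2f}x, MDI/MS = {mdi_vs_ms:.2f}x")
```

Under that reading, the vertex shader path is a bit more than 3 times faster than every Mesh shader configuration, and the MDI/MS ratio crosses 1.0 between the 64/126 and 96/169 configurations.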
Now let’s try to draw 256K boxes with a Geometry shader. We will use another simple application that draws a 3D grid of boxes. Each box is a point primitive for the Geometry shader. The Mesh shader draws 64 boxes per Task shader group.
| GPU | D3D12 MS | D3D12 GS | VK MS | VK GS |
|---|---|---|---|---|
| GeForce 2080 Ti | 1.5 B | 3.7 B | 1.5 B | 2.9 B |
| Radeon 6700 XT | 1.4 B | 2.2 B | | 1.9 B |
The Geometry shader is a clear winner here, especially on Nvidia hardware.
A properly implemented MultiDrawIndirectCount can do the same job as Mesh shaders, and Geometry shaders render simple primitives better than Mesh shaders. We hope that vendors will provide better API flexibility instead of implementing an enormous number of different shader types.
Here are a couple more observations about Mesh shaders:
- On Nvidia under Vulkan, all Mesh shader vertices and indices must be written by threads with an index below 32, even if the shader group size is bigger than 32. Otherwise, the writes are ignored.
- Nvidia can dispatch any number of Task shader groups under D3D12 and Vulkan. AMD has a limit of 65K.
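A dispatch limit like this is usually handled by splitting the work into several smaller dispatches. The sketch below shows one way to do it; the helper `split_dispatch` is hypothetical, and reading "65K" as 65,535 groups and "a million" as 1,000,000 are assumptions.

```python
def split_dispatch(total_groups, max_groups_per_dispatch):
    """Return (first_group, group_count) pairs that together cover total_groups."""
    batches = []
    first = 0
    while first < total_groups:
        count = min(max_groups_per_dispatch, total_groups - first)
        batches.append((first, count))
        first += count
    return batches


ASSUMED_LIMIT = 65_535  # assumption: the "65K" limit read as 2^16 - 1
batches = split_dispatch(1_000_000, ASSUMED_LIMIT)
print(len(batches), batches[-1])
```

Each batch passes its `first_group` offset to the shader, for example as a constant, so that every group can still compute its global index. With these assumed numbers the split yields 16 batches.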
The Clay shader compiler can automatically translate Task and Mesh shaders from GLSL/SPIR-V to HLSL.
Reproduction binaries for Windows: TellusimDrawMeshlet.zip