January 31, 2023

Dispatch, Dispatch, Dispatch

Nvidia’s famous “Batch, Batch, Batch” recommendation was a must-have recipe for scenes with a lot of small objects. The trick is to combine small draw calls so that the GPU always has triangles to process without waiting for the CPU. It was essential in the days of D3D9. Fortunately, Mesh Shaders and Instancing can crunch small batches very well on modern APIs and GPUs.

But what about compute? One hundred dispatches of empty compute shaders is fine. One thousand is a problem that can cost up to 10 ms (about 10 µs per dispatch), especially if the dispatches are indirect. APIs are not trying to solve this problem: we still don’t have callable shaders or multiDispatchIndirect(). So what can we do about it? Yes, the old-fashioned “Batch, Batch, Batch” is here again.

The Tellusim pipeline is completely GPU-driven, which means that we issue tons of dispatches. A GPU-driven pipeline is slower than a CPU-driven one when there are just a few objects, and it really shines when the scene is not that simple. Dispatching compute shaders carries a fixed overhead, and this overhead is the same for single-viewport and multiple-viewport rendering because we do “Dispatch, Dispatch, Dispatch” for every compute shader call: each shader is dispatched for all viewports back to back. That makes parallel stereo or multiple-viewport rendering faster than rendering the viewports serially, so we can easily drive multiple headsets from a single GPU or connect as many displays as the GPU supports.
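The difference between the two submission orders can be sketched in a few lines. This is only an illustration, not Tellusim code: the shader names are hypothetical, and a "bind" here simply means that the shader differs from the previously dispatched one.

```python
from typing import List, Tuple

Dispatch = Tuple[str, int]  # (shader name, viewport index)

def serial_order(shaders: List[str], viewports: List[int]) -> List[Dispatch]:
    """Render one viewport completely, then move on to the next one."""
    return [(s, v) for v in viewports for s in shaders]

def parallel_order(shaders: List[str], viewports: List[int]) -> List[Dispatch]:
    """Dispatch each shader for all viewports back to back."""
    return [(s, v) for s in shaders for v in viewports]

def shader_binds(order: List[Dispatch]) -> int:
    """Count how often the dispatched shader differs from the previous one."""
    binds = 0
    bound = None
    for shader, _ in order:
        if shader != bound:
            binds += 1
            bound = shader
    return binds

shaders = ["cull", "sort", "shade"]   # hypothetical shader names
viewports = list(range(6))            # six viewports, as in the test scene

serial = serial_order(shaders, viewports)
parallel = parallel_order(shaders, viewports)
print(shader_binds(serial), shader_binds(parallel))
```

Both orders issue the same 18 dispatches, but the serial order changes the shader before every one of them, while the parallel order changes it only once per shader.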

21K dynamic objects with unique materials, a light with a four-split PSSM shadow map, six 1024×1024 viewports, and zero CPU load:

VK Serial/Parallel D3D12 Serial/Parallel D3D11 Serial/Parallel GL Serial/Parallel Metal Serial/Parallel
GeForce 3090 163/255 FPS 159/209 FPS 64/112 FPS 139/224 FPS
GeForce 2080 Ti 94/169 FPS 129/167 FPS 76/114 FPS N/A
Radeon 6900 XT 180/211 FPS 162/188 FPS 156/180 FPS 165/188 FPS
Radeon 6700 XT 126/142 FPS 127/139 FPS 122/136 FPS 132/146 FPS
Radeon 6600 66/74 FPS
Intel Arc A380 29/31 FPS 2/3 FPS 5/5 FPS N/A
Apple M1 Max 58/71 FPS
Apple M1 25/27 FPS
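The gain from parallel rendering can be computed directly from the serial/parallel pairs. A small sketch for the two GeForce rows, with the values copied in the order they appear above (the variable and function names are illustrative):

```python
from statistics import mean
from typing import Dict, List, Tuple

# (serial FPS, parallel FPS) pairs, copied from the GeForce rows above.
geforce_pairs: Dict[str, List[Tuple[int, int]]] = {
    "GeForce 3090": [(163, 255), (159, 209), (64, 112), (139, 224)],
    "GeForce 2080 Ti": [(94, 169), (129, 167), (76, 114)],
}

def speedups(pairs: List[Tuple[int, int]]) -> List[float]:
    """Return parallel/serial ratios; 1.5 means 50% more frames per second."""
    return [parallel / serial for serial, parallel in pairs]

all_speedups = [s for pairs in geforce_pairs.values() for s in speedups(pairs)]
average_gain_percent = (mean(all_speedups) - 1.0) * 100.0
lowest_gain_percent = (min(all_speedups) - 1.0) * 100.0
highest_gain_percent = (max(all_speedups) - 1.0) * 100.0

print(f"average gain: {average_gain_percent:.0f}%")
print(f"range: {lowest_gain_percent:.0f}% to {highest_gain_percent:.0f}%")
```

Averaged over these seven pairs, the gain is about 55%, with individual results between roughly 29% and 80%.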

On Nvidia, parallel rendering is about 50% faster than serial. Our advice is simple: combine dispatches of the same compute shader together and minimize the number of compute shader switches, because a: D; b: D; a: D; b: D is slower than a: D, D; b: D, D.
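A minimal sketch of that reordering, using the same notation. It assumes the dispatches are independent of each other, so that moving them is safe; barriers and data dependencies between shaders are not modeled, and all names are illustrative.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Dispatch = Tuple[str, int]  # (shader name, hypothetical dispatch id)

def batch_by_shader(dispatches: List[Dispatch]) -> List[Dispatch]:
    """Reorder so that all dispatches of one shader are adjacent.

    Shaders keep their first-seen order, and dispatches of the same
    shader keep their relative order.
    """
    groups: Dict[str, List[Dispatch]] = {}
    for dispatch in dispatches:
        groups.setdefault(dispatch[0], []).append(dispatch)
    return [d for group in groups.values() for d in group]

def notation(dispatches: List[Dispatch]) -> str:
    """Format a sequence as 'a: D, D; b: D, D', one entry per shader switch."""
    parts = []
    for shader, run in groupby(dispatches, key=lambda d: d[0]):
        count = len(list(run))
        parts.append(f"{shader}: " + ", ".join(["D"] * count))
    return "; ".join(parts)

interleaved = [("a", 0), ("b", 1), ("a", 2), ("b", 3)]
batched = batch_by_shader(interleaved)

print(notation(interleaved))  # a: D; b: D; a: D; b: D
print(notation(batched))      # a: D, D; b: D, D
```

Each semicolon-separated entry corresponds to one shader switch, so the batched sequence needs two switches where the interleaved one needs four.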