January 31, 2023
Dispatch, Dispatch, Dispatch
Nvidia’s famous “Batch, Batch, Batch” recommendation was a must-have recipe for scenes with a lot of small objects. The trick is to combine small draw calls so that the GPU always has triangles to process without waiting for the CPU. It was essential in the D3D9 era. Fortunately, Mesh Shaders and Instancing can crunch small batches very well on modern APIs and GPUs.
But how about compute? One hundred empty compute shader dispatches are fine. One thousand is a problem that can cost up to 10 ms, especially if the dispatches are indirect. APIs are not trying to solve this problem: we still don’t have callable shaders or multiDispatchIndirect(). So what can we do about it? Yes, the old-fashioned “Batch, Batch, Batch” is here again.
The Tellusim pipeline is completely GPU-driven, which means that we have tons of dispatches. A GPU-driven pipeline is slower than a CPU-driven one when there are just a few objects, and it really shines when the scene is not that simple. We have a fixed overhead from compute shader dispatching. This overhead is the same for single-viewport and multiple-viewport rendering because we do “Dispatch, Dispatch, Dispatch” for every compute shader call: each shader is dispatched for all viewports back to back before switching to the next one. That makes rendering stereo or multiple viewports in parallel faster than rendering them serially, so we can easily drive multiple headsets from a single GPU or connect as many displays as we can.
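The effect of that fixed overhead can be illustrated with a minimal cost model. This is only a sketch: the function names are hypothetical, the 10 µs overhead is simply the 10 ms per thousand dispatches mentioned above divided out, and the 2 µs of useful work per viewport is an assumed number, not a measurement.

```python
def serial_frame_cost(passes, viewports, overhead_us, work_us):
    """Serial rendering: every viewport walks through all compute passes,
    so the fixed per-shader overhead is paid once per pass per viewport."""
    return viewports * passes * (overhead_us + work_us)


def parallel_frame_cost(passes, viewports, overhead_us, work_us):
    """Parallel rendering: each pass pays the fixed overhead once and then
    issues the work for all viewports back to back."""
    return passes * (overhead_us + viewports * work_us)


# Assumed numbers: 1000 passes, 6 viewports, 10 us overhead, 2 us of work.
print(serial_frame_cost(1000, 6, 10, 2))    # 72000 microseconds
print(parallel_frame_cost(1000, 6, 10, 2))  # 22000 microseconds
```

In this model the gap between the two grows with the number of viewports and with the share of frame time spent on overhead rather than on useful work.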
21K dynamic objects with unique materials, a light with a four-split PSSM shadow map, six 1024×1024 viewports, and zero CPU load:
|GPU||VK Serial/Parallel||D3D12 Serial/Parallel||D3D11 Serial/Parallel||GL Serial/Parallel||Metal Serial/Parallel|
|GeForce 3090||163/255 FPS||159/209 FPS||64/112 FPS||139/224 FPS|
|GeForce 2080 Ti||94/169 FPS||129/167 FPS||76/114 FPS||N/A|
|Radeon 6900 XT||180/211 FPS||162/188 FPS||156/180 FPS||165/188 FPS|
|Radeon 6700 XT||126/142 FPS||127/139 FPS||122/136 FPS||132/146 FPS|
|Radeon 6600||66/74 FPS|
|Intel Arc A380||29/31 FPS||2/3 FPS||5/5 FPS||N/A|
|Apple M1 Max||58/71 FPS|
|Apple M1||25/27 FPS|
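The Nvidia rows above can be turned into relative gains with a few lines of arithmetic. This sketch assumes that the values in each row follow the header order (VK, D3D12, D3D11, GL); the N/A entry is left out, and the helper name is hypothetical.

```python
# Serial/parallel FPS pairs copied from the Nvidia rows of the table.
# Assumption: values follow the header order VK, D3D12, D3D11, GL.
nvidia_fps = {
    "GeForce 3090": {
        "VK": (163, 255), "D3D12": (159, 209),
        "D3D11": (64, 112), "GL": (139, 224),
    },
    "GeForce 2080 Ti": {
        "VK": (94, 169), "D3D12": (129, 167), "D3D11": (76, 114),
    },
}


def speedup_percent(serial_fps, parallel_fps):
    """Gain of parallel over serial rendering, rounded to a whole percent."""
    return round(100.0 * (parallel_fps - serial_fps) / serial_fps)


gains = {
    gpu: {api: speedup_percent(s, p) for api, (s, p) in apis.items()}
    for gpu, apis in nvidia_fps.items()
}

for gpu, per_api in gains.items():
    print(gpu, per_api)
# GeForce 3090 {'VK': 56, 'D3D12': 31, 'D3D11': 75, 'GL': 61}
# GeForce 2080 Ti {'VK': 80, 'D3D12': 29, 'D3D11': 50}
```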
On Nvidia, parallel rendering is about 50% faster than serial (roughly 30% to 80% in the table above, depending on the GPU and API). Our advice is simple: combine dispatches of the same compute shader together and minimize the number of compute shader switches, because a: D; b: D; a: D; b: D is slower than a: D, D; b: D, D.
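A minimal sketch of that reordering, written as plain Python rather than against any graphics API. The command list, the shader names, and both helper functions are hypothetical. It also assumes that the reordered dispatches are independent of each other (for example, each viewport touches only its own resources), which a real renderer has to guarantee itself.

```python
def count_shader_switches(commands):
    """Count how many times the bound shader changes in a list of
    (shader, dispatch_args) commands."""
    switches = 0
    bound = None
    for shader, _ in commands:
        if shader != bound:
            switches += 1
            bound = shader
    return switches


def batch_by_shader(commands):
    """Group dispatches by shader, keeping first-seen shader order and the
    original dispatch order inside each group."""
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for shader, args in commands:
        groups.setdefault(shader, []).append(args)
    return [(shader, args) for shader, batch in groups.items() for args in batch]


# a: D; b: D; a: D; b: D  (one viewport after another)
interleaved = [("a", "view0"), ("b", "view0"), ("a", "view1"), ("b", "view1")]
# a: D, D; b: D, D        (same shader for all viewports, then the next one)
batched = batch_by_shader(interleaved)

print(count_shader_switches(interleaved))  # 4
print(count_shader_switches(batched))      # 2
```

The number of dispatches stays the same; only the number of shader switches drops, from one per dispatch to one per shader.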