Dispatch, Dispatch, Dispatch - Tellusim Technologies Inc.

January 31, 2023

Nvidia’s famous “Batch, Batch, Batch” recommendation was a must-have recipe for scenes with a lot of small objects. The trick is to combine small draw calls so that the GPU always has triangles to process without waiting for the CPU. It was essential in the D3D9 era. Fortunately, Mesh Shaders and Instancing crunch small batches very well on modern APIs and GPUs.

But what about compute? One hundred empty compute shader dispatches are fine. One thousand are a problem that can cost up to 10 ms (that is, up to about 10 µs per dispatch), especially if the dispatches are indirect. The APIs do not try to solve this problem: we still don’t have callable shaders or multiDispatchIndirect(). So what can we do about it? Yes, the old-fashioned “Batch, Batch, Batch” is here again.

The Tellusim pipeline is completely GPU-driven, which means that we have tons of dispatches. A GPU-driven pipeline is slower than a CPU-driven one when there are just a few objects, and it really shines when the scene is not that simple. We pay a fixed overhead for dispatching compute shaders. This overhead is the same for single-viewport and multiple-viewport rendering because we do “Dispatch, Dispatch, Dispatch” for every compute shader call: each shader is dispatched for all viewports back to back. That makes parallel stereo or multiple-viewport rendering faster than rendering the viewports serially, one after another, so we can easily drive multiple headsets from a single GPU or connect as many displays as we can.
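The difference between the two loop orders can be illustrated with a small counting model. This is only a sketch of one reading of the paragraph above, not the engine's actual code: the pass names in `SHADERS` and the helper `count_switches` are hypothetical, and a "switch" here simply means that the bound compute shader differs from the previous dispatch.

```python
# Hypothetical model: count compute shader switches for two loop orders.
# SHADERS and count_switches are illustrative names, not Tellusim API.
SHADERS = ["cull", "sort", "shade"]  # assumed per-frame compute passes
VIEWPORTS = 6                        # matches the six-viewport test setup


def serial_order():
    # Finish every pass of one viewport before starting the next viewport.
    return [(s, v) for v in range(VIEWPORTS) for s in SHADERS]


def parallel_order():
    # Bind each shader once, then dispatch it for every viewport.
    return [(s, v) for s in SHADERS for v in range(VIEWPORTS)]


def count_switches(order):
    # A switch happens whenever the shader differs from the previous dispatch.
    switches, bound = 0, None
    for shader, _viewport in order:
        if shader != bound:
            switches += 1
            bound = shader
    return switches


print(count_switches(serial_order()))    # 18
print(count_switches(parallel_order()))  # 3
```

Both orders issue the same 18 dispatches. Only the number of shader switches changes, and in the shader-major order it no longer depends on the viewport count.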

21K dynamic objects with unique materials, a light with a four-split PSSM shadow map, six 1024×1024 viewports, and zero CPU load:

GPU               VK Serial/Parallel      D3D12 Serial/Parallel   D3D11 Serial/Parallel   GL Serial/Parallel      Metal Serial/Parallel
GeForce 3090      163/255 FPS             159/209 FPS             64/112 FPS              139/224 FPS
GeForce 2080 Ti   94/169 FPS              129/167 FPS             76/114 FPS              N/A
Radeon 6900 XT    180/211 FPS             162/188 FPS             156/180 FPS             165/188 FPS
Radeon 6700 XT    126/142 FPS             127/139 FPS             122/136 FPS             132/146 FPS
Radeon 6600                                                                                                       66/74 FPS
Intel Arc A380    29/31 FPS               2/3 FPS                 5/5 FPS                 N/A
Apple M1 Max                                                                                                      58/71 FPS
Apple M1                                                                                                          25/27 FPS

On Nvidia, parallel rendering is about 50% faster than serial, ranging from roughly 30% to 80% depending on the API. Our advice is simple: combine dispatches of the same compute shader and minimize the number of compute shader switches, because the sequence a: D; b: D; a: D; b: D is slower than a: D, D; b: D, D.
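As a minimal sketch of this advice, the snippet below reorders a recorded list of dispatches so that all dispatches of the same shader run consecutively. The function names and the `(shader, args)` tuples are hypothetical. The sketch also assumes that the dispatches are independent of each other: if shader b reads what shader a wrote, reordering across that dependency is not valid.

```python
from itertools import groupby


def group_by_shader(dispatches):
    # Stable grouping: shaders keep their first-appearance order and the
    # dispatches of each shader keep their relative order.
    # ASSUMPTION: no data dependencies between the reordered dispatches.
    groups = {}
    for shader, args in dispatches:
        groups.setdefault(shader, []).append(args)
    return [(shader, args) for shader, batch in groups.items() for args in batch]


def shader_switches(dispatches):
    # Number of runs of consecutive dispatches that share one shader.
    return sum(1 for _ in groupby(dispatches, key=lambda d: d[0]))


interleaved = [("a", 0), ("b", 0), ("a", 1), ("b", 1)]  # a: D; b: D; a: D; b: D
grouped = group_by_shader(interleaved)                   # a: D, D; b: D, D

print(grouped)                       # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
print(shader_switches(interleaved))  # 4
print(shader_switches(grouped))      # 2
```

In a real renderer the same grouping would have to respect barriers between passes, so it is best applied within a set of dispatches that are known to be independent.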