September 24, 2021
Ray Tracing Performance Comparison
All modern APIs support ray tracing now. It’s available on Windows (D3D12 and VK), Linux (VK), and macOS (Metal). There is no difference between D3D12 and VK ray tracing. The Metal API only supports ray query, which is also available with D3D12 and VK API. The main ray tracing concept is to make BLAS for geometry and combine all geometries inside the single TLAS. And the driver is responsible for everything under the hood.
With a simple modification our test app can work in ray tracing (RT), ray query (RQ), and compute shader (CS) rendering modes. We are going to render the same 81 instances of 490K triangles. But this time, it will be no rasterization at all. Each pixel will always trace primary, shadow, and reflection rays.
A naive compute shader ray query mode will use our generic GPU BVH builder with a single triangle per leaf partitioning. Each BVH node consumes 12 bytes. The whole model requires a 45 MB BVH buffer. For rasterization, this model requires 13 MB for 32-byte vertex and 32-bit index buffer.
Let’s take a look at the 1600×900 resolution results first:
RT D3D12 | RT VK | RQ D3D12 | RQ VK | RQ MTL | RQ CS | BLAS Size | BLAS Build | |
---|---|---|---|---|---|---|---|---|
GeForce 3080 | 0.58 ms | 0.54 ms | 0.60 ms | 0.55 ms | 4.92 ms | 33 MB (15 MB) | 7 ms (18 ms) | |
GeForce 2080 Ti | 1.06 ms | 1.03 ms | 0.95 ms | 0.97 ms | 7.35 ms | 33 MB (15 MB) | 7 ms (18 ms) | |
GeForce 1060 M | 34.59 ms | 33.81 ms | 43 MB (37 MB) | 7 ms (18 ms) | ||||
Radeon 6700 XT | 2.50 ms | 2.55 ms | 1.71 ms | 2.06 ms | 6.13 ms | 76 MB (65 MB) | 15 ms (> 500 ms) | |
Radeon Vega 56 (macOS) | 15.84 ms | 15.48 ms | 82 MB | 30 ms | ||||
Apple M1 (macOS) | 48.8 ms | 46.4 ms | 82 MB | 100 ms | ||||
Apple A14 (iOS) | 98.0 ms | 94.3 ms | 82 MB | |||||
Adreno 660 (Android) | 431 ms |
- Nvidia RTX series GPUs work with the best performance, and Ray Queries are slightly faster. The great thing is that compacted BLAS is only 15 MB in size, and this is a few MB larger than the model representation for rasterization. compute shader ray tracing is 7.7 times slower than HW accelerated.
- AMD has a 50% difference between ray query (D3D12) and ray tracing (VK). It looks like the driver is only optimized for D3D12 Ray Queries. HW Raytracing is 3.5 times faster than compute shader implementation. But the main problem is the size of BLAS that is more than four times bigger than it can be. Moreover, the first BLAS generation is painfully slow even with a fast build flag enabled.
- Interesting fact: compute shader ray tracing is faster on AMD than on Nvidia.
- There is no HW ray tracing on Metal. The current implementation is slower than naive compute shader. The BLAS requires a twice larger amount of memory as well. Some of the triangles are not intersecting, so the model has holes during rendering on all hardware.
- A passive cooled Apple M1 is only 33% slower than GeForce 1060M GPU.
- Vertex cache optimization is also improving ray tracing performance.
Now, what about native 4K?
RT D3D12 | RT VK | RQ D3D12 | RQ VK | RQ CS | |
---|---|---|---|---|---|
GeForce 3080 | 3.16 ms | 3.13 ms | 3.25 ms | 3.13 ms | 21.00 ms |
GeForce 2080 Ti | 4.78 ms | 4.77 ms | 4.92 ms | 4.86 ms | 32.21 ms |
Radeon 6700 XT | 12.89 ms | 13.35 ms | 8.48 ms | 9.77 ms | 27.94 ms |
- AMD is showing a 65% difference between ray query and ray tracing pipeline. The choice for AMD is obvious.
- It’s better to use ray tracing than ray query for Nvidia, but the difference is only 3%.
- The best Radeon 6700 XT ray query performance is 70% slower than GeForce 2080 Ti. The biggest gap is 280%.
It’s interesting how AMD with a bigger number of ray tracing cores will perform. The main issue for AMD is an acceleration structure size and its build time. Especially for dynamic TLAS with thousands of instances.
This is an image of compute shader ray tracing and a heat map image. The red channel represents the number of BLAS BVH intersection steps scaled by 128. The main bottleneck for compute shader ray tracing is a Ray-BVH intersection that cannot be well optimized due to divergence and scattered memory access. By contrast, Volume-BVH intersection performs great on compute shader. But, unfortunately, we cannot reuse the HW (API) ray tracing acceleration structures for different purposes.