October 7, 2021

Ray Tracing versus Animation


There are no answers from vendors about acceleration structure performance and actual numbers of how the dynamic geometry affects ray tracing performance. There are only recommendations on how to use acceleration structure flags. We had static performance numbers in our previous post. Now it’s time to focus on acceleration structure builders.

We will draw animated characters. Because all characters are independent, they will all require a bottom-level acceleration structure. Each pixel will query two rays, one for the scene intersection and one for the shadow intersection. The resolution is 1600×900.

Acceleration structure builders have many configuration parameters, that’s why there is a lot of data in the tables. We will also split Vulkan and Direct3D12 results into different rows.

  • FT (RT) – All BLAS build time with Fast Trace flag enabled and the scene tracing time (in round brackets).
  • FB (RT) – The Fast Build time and (RQ scene tracing time).
  • CS (RT) – Compute shader BLAS build time (CS scene tracing time).
  • FTU (RT) – All BLAS Update time with Fast Trace flag enabled and (RQ scene tracing time).
  • FBU (RT) – The Fast Build Update time and (RQ scene tracing time).
  • CSU (RT) – Compute shader BLAS Update time (CS scene tracing time).
  • BLAS (Scratch) – The memory required for the BLAS buffers and (the scratch buffer size).

There are the results with 81 instances of the 52K triangle model. The overall number of animated vertices/triangles is 2.9M/4.2M. The number of joints is 4212:

FT (RT) FB (RT) CS (RT) FTU (RT) FBU (RT) CSU (RT) BLAS (Scratch)
GeForce 2080 Ti (D3D12) 16.9 ms (0.6 ms) 15.7 ms (0.6 ms) 9.1 ms (5.3 ms) 3.7 ms (0.8 ms) 1.2 ms (0.9 ms) 2.1 ms (7.7 ms) 255 MB (16 MB)
GeForce 2080 Ti (VK) 17.1 ms (0.6 ms) 15.1 ms (0.7 ms) 8.7 ms (3.4 ms) 3.8 ms (0.8 ms) 1.5 ms (1.0 ms) 1.8 ms (4.7 ms) 255 MB (16 MB)
Radeon 6700 XT (D3D12) 30.2 ms (1.2 ms) 30.2 ms (1.2 ms) 8.7 ms (4.8 ms) 4.6 ms (1.3 ms) 4.6 ms (1.3 ms) 3.0 ms (6.1 ms) 656 MB (872 MB)
Radeon 6700 XT (VK) 223.0 ms (1.4 ms) 73.0 ms (1.5 ms) 8.9 ms (4.9 ms) 4.6 ms (1.6 ms) 4.6 ms (1.6 ms) 3.2 ms (6.2 ms) 656 MB (872 MB)
Radeon Vega 56 (macOS) 185.5 ms (8.4 ms) 185.6 ms (8.4 ms) 16.8 ms (8.1 ms) 21.5 ms (11.9 ms) 21.7 ms (12.1 ms) 4.54 ms (11.2 ms) 355 MB (355 MB)
Apple M1 (macOS) 394.9 ms (29.2 ms) 395.2 ms (29.0 ms) 74.5 ms (42.6 ms) 29.8 ms (34.5 ms) 29.9 ms (34.8 ms) 18.1 ms (49.2 ms) 355 MB (355 MB)

Here is what we have:

  • It’s impossible to build BLAS for animated characters in each frame. The BLAS update mode must be used. Otherwise, the BLAS builder will significantly reduce the FPS.
  • AMD Vulkan BLAS builder is much slower than the Direct3D12 builder. The best AMD BLAS build time is twice bigger than on Nvidia.
  • The ray queries on Apple M1 are twice faster than the CS solution, which is interesting considering that there was no difference in our previous test.
  • There is no coherent buffer memory in Metal shading language. That is why we are performing the BVH update step by using atomics. And it affects performance negatively.
  • AMD has an enormous memory consumption for the BLAS and Scratch buffers.
  • Our GPU BVH builder outperforms any available implementation in full rebuild mode.

Now let’s find the ray tracing limits by increasing the number of characters. This time we will use a simplified model of 1500 triangles. The overall number of characters is 2401. The number of triangles in the scene is 3.6M. The number of joints is 125K. It’s getting difficult for the CPU to animate all characters independently, so we will reuse the joint transformations across multiple instances.

FT (RT) FB (RT) CS (RT) FTU (RT) FBU (RT) CSU (RT) BLAS (Scratch)
GeForce 2080 Ti (D3D12) 15.8 ms (0.6 ms) 14.5 ms (0.6 ms) 8.3 ms (6.2 ms) 7.0 ms (0.8 ms) 1.5 ms (0.9 ms) 1.6 ms (6.7 ms) 224 MB (15 MB)
GeForce 2080 Ti (VK) 16.3 ms (0.6 ms) 14.6 ms (0.7 ms) 8.1 ms (4.5 ms) 4.8 ms (0.8 ms) 1.1 ms (1.0 ms) 1.5 ms (4.6 ms) 224 MB (15 MB)
Radeon 6700 XT (D3D12) 55.9 ms (1.5 ms) 55.8 ms (1.5 ms) 7.5 ms (6.0 ms) 3.0 ms (1.5 ms) 3.0 ms (1.5 ms) 2.4 ms (6.7 ms) 559 MB (742 MB)
Radeon 6700 XT (VK) 2000 ms (5.2 ms) 1070 ms (5.3 ms) 7.8 ms (6.3 ms) 18.5 ms (1.7 ms) 18.7 ms (1.8 ms) 2.6 ms (7.0 ms) 559 MB (742 MB)
Radeon Vega 56 (macOS) 1840 ms (9.8 ms) 1800 ms (9.7 ms) 14.7 ms (10.3 ms) 297.1 ms (10.5 ms) 297.1 ms (10.1 ms) 3.6 ms (11.6 ms) 304 MB (325 MB)
Apple M1 (macOS) 1600 ms (33.1 ms) 1600 ms (32.8 ms) 63.4 ms (58.3 ms) 271.1 ms (36.3 ms) 273.1 ms (36.9 ms) 14.9 ms (62.7 ms) 304 MB (325 MB)
  • The number of instances doesn’t affect Nvidia and Tellusim BVH builder time. Only the number of triangles does.
  • There are no mistakes with AMD Vulkan and Metal timings. The BLAS build time is measured in seconds.
  • Vulkan API is performing all BLAS build/update in a single API call. But AMD is performing BLAS build in a non-parallel way on Vulkan. Metal API is doing the same.
  • Nvidia GPU is only 50-70% utilized on Windows in the second test. But it’s working at full speed on Linux.

Here are the videos of these two tests captured in BLAS update mode. Otherwise, the CS solution is faster than HW ray tracing.

How about 10000 BLAS instances? It will be 520K joints and 15M triangles. CPU cannot process half of million joints well. So we will stop the animation. Nvidia requires 1 GB BLAS buffer for all instances. The BLAS buffer for AMD is 2.3 GB, and the scratch buffer requires an additional 3 GB of memory. We are not counting pre-transformed vertices (600 MB in the case of 48 bytes per vertex). BLAS builder will work in the update mode. Nvidia 2080 Ti provides 43 FPS in 2K video mode due to underutilization, but the overall timing is already more than 10 ms, so it’s not better than 100 FPS. Radeon 6700 XT shows 49 FPS with 100% utilization.

Each animated, morphed, or deformed object requires memory for Vertex and BLAS data. Sometimes it can be a lot of data and a lot of memory bandwidth. And it affects performance. By contrast, rasterization doesn’t not require anything except original geometry and joint transformations. And all transformed data goes directly to the rasterizer. Tellusim engine is fully GPU-driven, including animation blend trees. The video with 10K animated characters is running with 135 FPS on Nvidia 2080 Ti. The performance on Radeon 6700 XT is more than 120 FPS. It works with 20 FPS and animation on mobile Apple M1. Each character has a unique animation and casts a global shadow as well.

How to make 10K animated characters scene with Tellusim Engine:

// load object
ObjectMesh object(scene);
if(!object.load("object.glb", &material, ObjectMesh::FlagTriSkiBasTex | ObjectMesh::FlagAnimation)) return 1;

// create nodes
uint32_t size = 100;
Array nodes(size * size);
for(uint32_t i = 0; i < nodes.size(); i++) {
  nodes[i] = NodeObject(graph, object);
  nodes[i].setGlobalTransform(Matrix4x3d::translate((i % size) * step, (i / size) * step, 0.0));
}

// somewhere in the update loop
Random random(0);
for(NodeObject node : nodes) {
  ObjectFrame frame;
  uint32_t index = random.geti32(0, object.getNumAnimations() - 1);
  frame.append(ObjectFrame(object.getAnimation(index), time + random.getf32(0.0f, 32.0f)));
  node.setFrame(frame);
}
scene.updateGraph(graph);
graph.updateObjectTree();
graph.updateNodes();

That’s all that’s required. CPU only does some random number generation. Everything else is delegated to the GPU.

Tellusim Animation Demo: Tellusim_Animation.zip