October 10, 2021

Blue Noise Generator


Blue noise has unique properties that make it ideal for dithering, spatial sample generation, and more. You can find more details about blue noise in these great articles: momentsingraphics.de, blog.demofox.org.

Performance and precision are the two main drawbacks of all blue noise generators. A 1024×1024 image can take hours or even days to generate, and that is not acceptable. Moreover, 8-bit precision cannot represent a large noise image (more than 16×16) with perfect spatial accuracy, because multiple pixels will share the same value. We made a GPU blue noise generator that uses the fast Fourier transform and provides blazing-fast generation: a 1024×1024 image is generated in less than 6 minutes on a GeForce 2080 Ti class GPU. The output noise image can use 16-bit integer or 32-bit floating-point precision.

The binaries of our blue noise generator for Windows and Linux (Vulkan) are available for everyone. But be careful because our Radeon Vega 7 died during 4K texture generation.

Download link: TellusimClayNoise.zip

The application has simple command line parameters:

Clay Blue Noise Image Generator (https://tellusim.com/)
Usage: ./noise -o noise.png
  -i  Input image
  -o  Output image
  -f  Forward image
  -h  Histogram output
  -bits  Image bits (8)
  -size  Image size (256)
  -width  Image width (256)
  -height  Image height (256)
  -seed  Random seed (random)
  -init  Initial pixels (10%)
  -sigma  Gaussian sigma (2.0)
  -epsilon  Quadratic epsilon (0.01)
  -device  Computation device index

The following command creates a unique blue noise image with 16-bit precision in less than a minute. The Fourier transform of the noise can be obtained with the -f command line parameter. The histogram output is a text file. The initial random pixels can also be specified as an input image.

clay_noise.exe -size 512 -bits 16 -o noise_512x512_16bit.png

The sigma parameter changes the blur radius during generation. You can see the result of different sigma values in this WebGL application.


October 7, 2021

Ray Tracing versus Animation


Vendors give no answers about acceleration structure performance, and there are no actual numbers on how dynamic geometry affects ray tracing performance; there are only recommendations on how to use acceleration structure flags. We had static performance numbers in our previous post. Now it's time to focus on acceleration structure builders.

We will draw animated characters. Because all characters are independent, each requires its own bottom-level acceleration structure (BLAS). Each pixel will trace two rays: one for the scene intersection and one for the shadow intersection. The resolution is 1600×900.

Acceleration structure builders have many configuration parameters, which is why there is a lot of data in the tables. We will also split Vulkan and Direct3D12 results into separate rows.

  • FT (RT) – Total BLAS build time with the Fast Trace flag enabled and the RQ scene tracing time (in round brackets).
  • FB (RT) – Total BLAS build time with the Fast Build flag enabled and (the RQ scene tracing time).
  • CS (RT) – Compute shader BLAS build time and (the CS scene tracing time).
  • FTU (RT) – Total BLAS update time with the Fast Trace flag enabled and (the RQ scene tracing time).
  • FBU (RT) – Total BLAS update time with the Fast Build flag enabled and (the RQ scene tracing time).
  • CSU (RT) – Compute shader BLAS update time and (the CS scene tracing time).
  • BLAS (Scratch) – Memory required for the BLAS buffers and (the scratch buffer size).

Here are the results with 81 instances of a 52K-triangle model. The overall number of animated vertices/triangles is 2.9M/4.2M, and the number of joints is 4212:

|  | FT (RT) | FB (RT) | CS (RT) | FTU (RT) | FBU (RT) | CSU (RT) | BLAS (Scratch) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GeForce 2080 Ti (D3D12) | 16.9 ms (0.6 ms) | 15.7 ms (0.6 ms) | 9.1 ms (5.3 ms) | 3.7 ms (0.8 ms) | 1.2 ms (0.9 ms) | 2.1 ms (7.7 ms) | 255 MB (16 MB) |
| GeForce 2080 Ti (VK) | 17.1 ms (0.6 ms) | 15.1 ms (0.7 ms) | 8.7 ms (3.4 ms) | 3.8 ms (0.8 ms) | 1.5 ms (1.0 ms) | 1.8 ms (4.7 ms) | 255 MB (16 MB) |
| Radeon 6700 XT (D3D12) | 30.2 ms (1.2 ms) | 30.2 ms (1.2 ms) | 8.7 ms (4.8 ms) | 4.6 ms (1.3 ms) | 4.6 ms (1.3 ms) | 3.0 ms (6.1 ms) | 656 MB (872 MB) |
| Radeon 6700 XT (VK) | 223.0 ms (1.4 ms) | 73.0 ms (1.5 ms) | 8.9 ms (4.9 ms) | 4.6 ms (1.6 ms) | 4.6 ms (1.6 ms) | 3.2 ms (6.2 ms) | 656 MB (872 MB) |
| Radeon Vega 56 (macOS) | 185.5 ms (8.4 ms) | 185.6 ms (8.4 ms) | 16.8 ms (8.1 ms) | 21.5 ms (11.9 ms) | 21.7 ms (12.1 ms) | 4.54 ms (11.2 ms) | 355 MB (355 MB) |
| Apple M1 (macOS) | 394.9 ms (29.2 ms) | 395.2 ms (29.0 ms) | 74.5 ms (42.6 ms) | 29.8 ms (34.5 ms) | 29.9 ms (34.8 ms) | 18.1 ms (49.2 ms) | 355 MB (355 MB) |

Here is what we have:

  • It’s impossible to build the BLAS for animated characters every frame; the BLAS update mode must be used. Otherwise, the BLAS builder will significantly reduce the FPS.
  • The AMD Vulkan BLAS builder is much slower than the Direct3D12 builder. The best AMD BLAS build time is twice that of Nvidia.
  • Ray queries on Apple M1 are twice as fast as the CS solution, which is interesting considering that there was no difference in our previous test.
  • There is no coherent buffer memory in the Metal shading language, which is why we perform the BVH update step using atomics, and that hurts performance.
  • AMD has enormous memory consumption for the BLAS and scratch buffers.
  • Our GPU BVH builder outperforms any available implementation in full rebuild mode.
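In Vulkan, the FT/FB and update columns map directly onto the acceleration structure build flags and build mode. A minimal sketch of recording a BLAS build or refit with VK_KHR_acceleration_structure (geometry setup, scratch allocation, and loading the extension entry point via vkGetDeviceProcAddr are omitted; names follow the Vulkan spec, not the Tellusim API):

```cpp
#include <vulkan/vulkan.h>

// Sketch: request a BLAS that can be refitted, then refit it each frame.
// In real code vkCmdBuildAccelerationStructuresKHR must be fetched with
// vkGetDeviceProcAddr, and the geometry/range structs filled from the mesh.
void build_or_update_blas(VkCommandBuffer cmd,
                          VkAccelerationStructureKHR blas,
                          const VkAccelerationStructureGeometryKHR *geometry,
                          const VkAccelerationStructureBuildRangeInfoKHR *range,
                          VkDeviceAddress scratch, bool full_rebuild) {
    VkAccelerationStructureBuildGeometryInfoKHR info = {};
    info.sType = VK_STRUCTURE_TYPE_ACCELERATION_STRUCTURE_BUILD_GEOMETRY_INFO_KHR;
    info.type = VK_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL_KHR;
    // PREFER_FAST_TRACE corresponds to the "FT" columns;
    // PREFER_FAST_BUILD would be "FB". ALLOW_UPDATE enables refitting.
    info.flags = VK_BUILD_ACCELERATION_STRUCTURE_FLAG_PREFER_FAST_TRACE_BIT_KHR |
                 VK_BUILD_ACCELERATION_STRUCTURE_FLAG_ALLOW_UPDATE_BIT_KHR;
    // UPDATE refits the existing BVH instead of rebuilding it ("FTU"/"FBU").
    info.mode = full_rebuild ? VK_BUILD_ACCELERATION_STRUCTURE_MODE_BUILD_KHR
                             : VK_BUILD_ACCELERATION_STRUCTURE_MODE_UPDATE_KHR;
    info.srcAccelerationStructure = full_rebuild ? VK_NULL_HANDLE : blas;
    info.dstAccelerationStructure = blas;
    info.geometryCount = 1;
    info.pGeometries = geometry;
    info.scratchData.deviceAddress = scratch;
    vkCmdBuildAccelerationStructuresKHR(cmd, 1, &info, &range);
}
```

A refit keeps the BVH topology and only updates node bounds, which is why the update columns are an order of magnitude cheaper than full builds in the tables above.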

Now let’s find the ray tracing limits by increasing the number of characters. This time we will use a simplified model of 1500 triangles. The overall number of characters is 2401, giving 3.6M triangles in the scene and 125K joints. It’s getting difficult for the CPU to animate all characters independently, so we will reuse the joint transformations across multiple instances.

|  | FT (RT) | FB (RT) | CS (RT) | FTU (RT) | FBU (RT) | CSU (RT) | BLAS (Scratch) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GeForce 2080 Ti (D3D12) | 15.8 ms (0.6 ms) | 14.5 ms (0.6 ms) | 8.3 ms (6.2 ms) | 7.0 ms (0.8 ms) | 1.5 ms (0.9 ms) | 1.6 ms (6.7 ms) | 224 MB (15 MB) |
| GeForce 2080 Ti (VK) | 16.3 ms (0.6 ms) | 14.6 ms (0.7 ms) | 8.1 ms (4.5 ms) | 4.8 ms (0.8 ms) | 1.1 ms (1.0 ms) | 1.5 ms (4.6 ms) | 224 MB (15 MB) |
| Radeon 6700 XT (D3D12) | 55.9 ms (1.5 ms) | 55.8 ms (1.5 ms) | 7.5 ms (6.0 ms) | 3.0 ms (1.5 ms) | 3.0 ms (1.5 ms) | 2.4 ms (6.7 ms) | 559 MB (742 MB) |
| Radeon 6700 XT (VK) | 2000 ms (5.2 ms) | 1070 ms (5.3 ms) | 7.8 ms (6.3 ms) | 18.5 ms (1.7 ms) | 18.7 ms (1.8 ms) | 2.6 ms (7.0 ms) | 559 MB (742 MB) |
| Radeon Vega 56 (macOS) | 1840 ms (9.8 ms) | 1800 ms (9.7 ms) | 14.7 ms (10.3 ms) | 297.1 ms (10.5 ms) | 297.1 ms (10.1 ms) | 3.6 ms (11.6 ms) | 304 MB (325 MB) |
| Apple M1 (macOS) | 1600 ms (33.1 ms) | 1600 ms (32.8 ms) | 63.4 ms (58.3 ms) | 271.1 ms (36.3 ms) | 273.1 ms (36.9 ms) | 14.9 ms (62.7 ms) | 304 MB (325 MB) |

  • The number of instances doesn’t affect the Nvidia and Tellusim BVH builder times; only the number of triangles does.
  • There are no mistakes in the AMD Vulkan and Metal timings: the BLAS build time is measured in seconds.
  • The Vulkan API performs all BLAS builds/updates in a single API call, but AMD executes the BLAS builds serially on Vulkan. The Metal API does the same.
  • The Nvidia GPU is only 50-70% utilized on Windows in the second test, but it works at full speed on Linux.

Here are the videos of these two tests, captured in BLAS update mode; otherwise, the CS solution would be faster than HW ray tracing.

How about 10000 BLAS instances? That is 520K joints and 15M triangles. The CPU cannot process half a million joints well, so we will stop the animation. Nvidia requires a 1 GB BLAS buffer for all instances. The BLAS buffer for AMD is 2.3 GB, and the scratch buffer requires an additional 3 GB of memory. We are not counting pre-transformed vertices (600 MB in the case of 48 bytes per vertex). The BLAS builder will work in update mode. The Nvidia 2080 Ti provides 43 FPS in 2K video mode due to underutilization, but the overall timing is already more than 10 ms, so it’s no better than 100 FPS. The Radeon 6700 XT shows 49 FPS with 100% utilization.

Each animated, morphed, or deformed object requires memory for vertex and BLAS data. Sometimes that is a lot of data and a lot of memory bandwidth, and it affects performance. By contrast, rasterization doesn’t require anything except the original geometry and joint transformations, and all transformed data goes directly to the rasterizer. The Tellusim engine is fully GPU-driven, including animation blend trees. The video with 10K animated characters runs at 135 FPS on an Nvidia 2080 Ti. The performance on a Radeon 6700 XT is more than 120 FPS. It runs at 20 FPS, with animation, on a mobile Apple M1. Each character has a unique animation and casts a global shadow as well.

Here is how to create a 10K animated character scene with the Tellusim Engine:

// load object
ObjectMesh object(scene);
if(!object.load("object.glb", &material, ObjectMesh::FlagTriSkiBasTex | ObjectMesh::FlagAnimation)) return 1;

// create nodes
uint32_t size = 100;
Array<NodeObject> nodes(size * size);
for(uint32_t i = 0; i < nodes.size(); i++) {
  nodes[i] = NodeObject(graph, object);
  nodes[i].setGlobalTransform(Matrix4x3d::translate((i % size) * step, (i / size) * step, 0.0));
}

// somewhere in the update loop
Random random(0);
for(NodeObject node : nodes) {
  ObjectFrame frame;
  uint32_t index = random.geti32(0, object.getNumAnimations() - 1);
  frame.append(ObjectFrame(object.getAnimation(index), time + random.getf32(0.0f, 32.0f)));
  node.setFrame(frame);
}
scene.updateGraph(graph);
graph.updateObjectTree();
graph.updateNodes();

That’s all that’s required. The CPU only does some random number generation; everything else is delegated to the GPU.

Tellusim Animation Demo: TellusimAnimation.zip

September 24, 2021

Ray Tracing Performance Comparison


All modern APIs support ray tracing now. It’s available on Windows (D3D12 and VK), Linux (VK), and macOS (Metal). There is no difference between D3D12 and VK ray tracing. The Metal API only supports ray query, which is also available in the D3D12 and VK APIs. The main ray tracing concept is to build a BLAS for each geometry and combine all geometries inside a single TLAS; the driver is responsible for everything under the hood.

With a simple modification, our test app can work in ray tracing (RT), ray query (RQ), and compute shader (CS) rendering modes. We are going to render the same 81 instances of 490K triangles, but this time there will be no rasterization at all. Each pixel will always trace primary, shadow, and reflection rays.

The naive compute shader mode uses our generic GPU BVH builder with single-triangle-per-leaf partitioning. Each BVH node consumes 12 bytes, and the whole model requires a 45 MB BVH buffer. For rasterization, this model requires 13 MB for the 32-byte vertex and 32-bit index buffers.

Let’s take a look at the 1600×900 resolution results first:

|  | RT D3D12 | RT VK | RQ D3D12 | RQ VK | RQ MTL | RQ CS | BLAS Size | BLAS Build |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GeForce 3080 | 0.58 ms | 0.54 ms | 0.60 ms | 0.55 ms | | 4.92 ms | 33 MB (15 MB) | 7 ms (18 ms) |
| GeForce 2080 Ti | 1.06 ms | 1.03 ms | 0.95 ms | 0.97 ms | | 7.35 ms | 33 MB (15 MB) | 7 ms (18 ms) |
| GeForce 1060 M | 34.59 ms | 33.81 ms | | | | | 43 MB (37 MB) | 7 ms (18 ms) |
| Radeon 6700 XT | 2.50 ms | 2.55 ms | 1.71 ms | 2.06 ms | | 6.13 ms | 76 MB (65 MB) | 15 ms (> 500 ms) |
| Radeon Vega 56 (macOS) | | | | | 15.84 ms | 15.48 ms | 82 MB | 30 ms |
| Apple M1 (macOS) | | | | | 48.8 ms | 46.4 ms | 82 MB | 100 ms |
| Apple A14 (iOS) | | | | | 98.0 ms | 94.3 ms | 82 MB | |
| Adreno 660 (Android) | | | | | | 431 ms | | |

  • Nvidia RTX series GPUs deliver the best performance, and ray queries are slightly faster. The great thing is that the compacted BLAS is only 15 MB, just a few MB larger than the model representation for rasterization. Compute shader ray tracing is 7.7 times slower than HW-accelerated ray tracing.
  • AMD shows a 50% difference between ray query (D3D12) and ray tracing (VK). It looks like the driver is only optimized for D3D12 ray queries. HW ray tracing is 3.5 times faster than the compute shader implementation. But the main problem is the BLAS size, which is more than four times larger than it could be. Moreover, the first BLAS generation is painfully slow even with the fast build flag enabled.
  • Interesting fact: compute shader ray tracing is faster on AMD than on Nvidia.
  • There is no HW ray tracing on Metal. The current implementation is slower than our naive compute shader, and the BLAS requires twice as much memory. Some triangles fail the intersection test, so the model has holes during rendering on all hardware.
  • A passively cooled Apple M1 is only 33% slower than a GeForce 1060M GPU.
  • Vertex cache optimization also improves ray tracing performance.

Now, what about native 4K?

|  | RT D3D12 | RT VK | RQ D3D12 | RQ VK | RQ CS |
| --- | --- | --- | --- | --- | --- |
| GeForce 3080 | 3.16 ms | 3.13 ms | 3.25 ms | 3.13 ms | 21.00 ms |
| GeForce 2080 Ti | 4.78 ms | 4.77 ms | 4.92 ms | 4.86 ms | 32.21 ms |
| Radeon 6700 XT | 12.89 ms | 13.35 ms | 8.48 ms | 9.77 ms | 27.94 ms |

  • AMD shows a 65% difference between the ray query and ray tracing pipelines. The choice for AMD is obvious.
  • It’s better to use ray tracing than ray query on Nvidia, but the difference is only 3%.
  • The best Radeon 6700 XT ray query performance is 70% slower than the GeForce 2080 Ti; the biggest gap is 280%.

It will be interesting to see how AMD GPUs with more ray tracing cores perform. The main issue for AMD is the acceleration structure size and its build time, especially for a dynamic TLAS with thousands of instances.

The images show the compute shader ray tracing output and a heat map. The red channel of the heat map represents the number of BLAS BVH intersection steps, scaled by 128. The main bottleneck for compute shader ray tracing is the ray-BVH intersection, which cannot be optimized well due to divergence and scattered memory access. By contrast, volume-BVH intersection performs great in a compute shader. But, unfortunately, we cannot reuse the HW (API) ray tracing acceleration structures for other purposes.

September 13, 2021

Compute versus Hardware


It was tempting to compare compute shader rasterization with Mesh Shader and MDI on all supported platforms. Unreal Engine 5 Nanite is already demonstrating outstanding performance with small-triangle rendering. But what about numbers from tile-based architectures? And how good is a dedicated rasterizer?

Let’s use our old Mesh Shader versus MDI test with 498990 64×128 meshlets and no culling except back-face. The data prepared for meshlets fits compute-based rasterization well: all vertices and indices are independent and tightly packed, so it’s possible to spawn a group per meshlet and rasterize a triangle per thread. We will limit software rasterization to depth-only mode because 64-bit atomics are not available on mobile devices and Metal; it will not make a big difference in performance. Metal also does not support atomic operations on textures, but it is possible to create a texture from buffer data and use buffer atomic operations.

We compared single DIP 32-bit indices, Mesh Shader, MultiDrawIndirect, and Compute Shader rasterization performance on Nvidia, AMD, Intel, Qualcomm, and Apple GPUs.

And here are the results:

|  | Single DIP | Mesh Shader | MDI/ICB/Loop | Compute Shader |
| --- | --- | --- | --- | --- |
| GeForce 2080 Ti | 12.05 B | 12.57 B | 12.63 B | 17.26 B |
| GeForce 1060 M | 3.86 B | | 3.90 B | 4.55 B |
| Radeon 6700 XT | 14.73 B | 4.38 B | 3.63 B | 16.74 B |
| Radeon RX 5600M | 4.87 B | | 1.11 B | 7.57 B |
| Radeon Vega 56 (macOS) | 2.40 B | | 796.0 M | 3.17 B |
| Apple M1 (macOS) | 1.37 B | | 739.0 M | 2.30 B |
| Apple A14 (iOS) | 666.1 M | | 475.0 M | 1.02 B |
| Intel UHD Graphics | 680.0 M | | 396.5 M | 556.1 M |
| Adreno 660 (Android) | 565.2 M | | 31.17 M | 497.3 M |

The table shows the number of processed triangles per second (M = millions, B = billions).

Well…

  • The era of GPU fixed-function units and an enormous number of shader types is almost over.
  • MultiDrawIndirect doesn’t work well on mobile because of tile-based rendering.
  • Even the best Mesh Shader / MDI results are slower than compute-based rasterization.
  • A single shader type is better than 14 dedicated shader types.

What we need are compute shaders and the ability to spawn threads from shaders efficiently. Everything else (including ray tracing) can easily be implemented at the compute shader level. Ten years ago, the performance of Intel Larrabee wasn’t enough for a compute-only mode, but now it’s possible to discard all other shader types.

A compute shader extension allowing payloads to be written into an image atomically would change everything. Without it, we are limited to 32-bit payload data and must perform a redundant triangle intersection.

uint imageAtomicPayloadMax(gimage2D atomic_image, gimage2D payload_image,
  ivec2 P,
  uint atomic_data, gvec4 payload_data);

Here is the unoptimized compute rasterization shader that was used in all tests on all platforms and APIs without any modification, thanks to our Clay shader compiler:

layout(local_size_x = GROUP_SIZE) in;

layout(row_major, binding = 0) uniform common_parameters {
  mat4 projection;
  mat4 modelview;
  vec4 camera;
};
layout(std140, binding = 1) uniform compute_parameters {
  uint num_meshlets;
  uint group_offset;
  vec2 surface_size;
  float surface_stride;
};
layout(std140, binding = 2) uniform transform_parameters {
  vec4 transforms[NUM_INSTANCES * 3u];
};

layout(std430, binding = 3) readonly buffer vertices_buffer { vec4 vertices_data[]; };
layout(std430, binding = 4) readonly buffer meshlets_buffer { uint meshlets_data[]; };
#if CLAY_MTL
  layout(std430, binding = 5) buffer surface_buffer { uint out_surface[]; };
#else
  layout(binding = 0, set = 1, r32ui) uniform uimage2D out_surface;
#endif

shared uint num_primitives;
shared uint num_vertices;
shared uint base_index;
shared uint base_vertex;

shared vec4 row_0;
shared vec4 row_1;
shared vec4 row_2;

shared vec3 positions[NUM_VERTICES];
shared uint indices[NUM_PRIMITIVES * 3u];

/*
 */
void rasterize(vec3 p0, vec3 p1, vec3 p2) {
  
  [[branch]] if(p0.z < 0.0f || p1.z < 0.0f || p2.z < 0.0f) return;
  
  vec3 p10 = p1 - p0;
  vec3 p20 = p2 - p0;
  float det = p20.x * p10.y - p20.y * p10.x;
  [[branch]] if(det >= 0.0f) return;
  
  vec2 min_p = floor(min(min(p0.xy, p1.xy), p2.xy));
  vec2 max_p = ceil(max(max(p0.xy, p1.xy), p2.xy));
  [[branch]] if(max_p.x < 0.0f || max_p.y < 0.0f || min_p.x >= surface_size.x || min_p.y >= surface_size.y) return;
  
  min_p = clamp(min_p, vec2(0.0f), surface_size - 1.0f);
  max_p = clamp(max_p, vec2(0.0f), surface_size - 1.0f);
  
  vec2 texcoord_dx = vec2(-p20.y, p10.y) / det;
  vec2 texcoord_dy = vec2(p20.x, -p10.x) / det;
  
  vec2 texcoord_x = texcoord_dx * (min_p.x - p0.x);
  vec2 texcoord_y = texcoord_dy * (min_p.y - p0.y);
  
  for(float y = min_p.y; y <= max_p.y; y += 1.0f) {
    vec2 texcoord = texcoord_x + texcoord_y;
    for(float x = min_p.x; x <= max_p.x; x += 1.0f) {
      if(texcoord.x >= 0.0f && texcoord.y >= 0.0f && texcoord.x + texcoord.y <= 1.0f) {
        float z = p10.z * texcoord.x + p20.z * texcoord.y + p0.z;
        #if CLAY_MTL
          uint index = uint(surface_stride * y + x);
          atomicMax(out_surface[index], floatBitsToUint(z));
        #else
          imageAtomicMax(out_surface, ivec2(vec2(x, y)), floatBitsToUint(z));
        #endif
      }
      texcoord += texcoord_dx;
    }
    texcoord_y += texcoord_dy;
  }
}

/*
 */
void main() {
  
  uint group_id = gl_WorkGroupID.x + group_offset;
  uint local_id = gl_LocalInvocationIndex;
  
  // meshlet parameters
  [[branch]] if(local_id == 0u) {
    uint transform_index = (group_id / num_meshlets) * 3u;
    row_0 = transforms[transform_index + 0u];
    row_1 = transforms[transform_index + 1u];
    row_2 = transforms[transform_index + 2u];
    uint meshlet_index = (group_id % num_meshlets) * 4u;
    num_primitives = meshlets_data[meshlet_index + 0u];
    num_vertices = meshlets_data[meshlet_index + 1u];
    base_index = meshlets_data[meshlet_index + 2u];
    base_vertex = meshlets_data[meshlet_index + 3u];
  }
  memoryBarrierShared(); barrier();
  
  // load vertices
  [[unroll]] for(uint i = 0; i < NUM_VERTICES; i += GROUP_SIZE) {
    uint index = local_id + i;
    [[branch]] if(index < num_vertices) {
      uint address = (base_vertex + index) * 2u;
      vec4 position = vec4(vertices_data[address].xyz, 1.0f);
      position = vec4(dot(row_0, position), dot(row_1, position), dot(row_2, position), 1.0f);
      position = projection * (modelview * position);
      positions[index] = vec3(round((position.xy * (0.5f / position.w) + 0.5f) * surface_size * 256.0f) / 256.0f - 0.5f, position.z / position.w);
    }
  }
  
  // load indices
  [[loop]] for(uint i = local_id; (i << 2u) < num_primitives; i += GROUP_SIZE) {
    uint index = i * 12u;
    uint address = base_index + i * 3u;
    uint indices_0 = meshlets_data[address + 0u];
    uint indices_1 = meshlets_data[address + 1u];
    uint indices_2 = meshlets_data[address + 2u];
    indices[index +  0u] = (indices_0 >>  0u) & 0xffu;
    indices[index +  1u] = (indices_0 >>  8u) & 0xffu;
    indices[index +  2u] = (indices_0 >> 16u) & 0xffu;
    indices[index +  3u] = (indices_0 >> 24u) & 0xffu;
    indices[index +  4u] = (indices_1 >>  0u) & 0xffu;
    indices[index +  5u] = (indices_1 >>  8u) & 0xffu;
    indices[index +  6u] = (indices_1 >> 16u) & 0xffu;
    indices[index +  7u] = (indices_1 >> 24u) & 0xffu;
    indices[index +  8u] = (indices_2 >>  0u) & 0xffu;
    indices[index +  9u] = (indices_2 >>  8u) & 0xffu;
    indices[index + 10u] = (indices_2 >> 16u) & 0xffu;
    indices[index + 11u] = (indices_2 >> 24u) & 0xffu;
  }
  memoryBarrierShared(); barrier();
  
  // rasterize triangles
  [[branch]] if(local_id < num_primitives) {
    uint index = local_id * 3u;
    uint index_0 = indices[index + 0u];
    uint index_1 = indices[index + 1u];
    uint index_2 = indices[index + 2u];
    rasterize(positions[index_0], positions[index_1], positions[index_2]);
  }
}

Reproduction binaries for Windows: TellusimDrawMeshlet.zip