September 20, 2022

Improved Blue Noise

Our GPU implementation of Void And Cluster blue noise worked well for us. It provides many tuning parameters that help to remove even more low noise frequencies, even if the blue noise texture is small. The extension for spatiotemporal effects was simple: fract(blue_noise + per_frame_random).
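The wrap is small enough to illustrate on the CPU. The following Python sketch is only an illustration, not the engine code: wrap_blue_noise and animate_texture are hypothetical names, and the per-frame value is assumed to be a single random number in [0, 1) shared by all texels of a frame.

```python
import math


def wrap_blue_noise(blue_noise: float, per_frame_random: float) -> float:
    """Offset a noise value in [0, 1) and wrap it back into [0, 1)."""
    value = blue_noise + per_frame_random
    return value - math.floor(value)  # same result as GLSL fract()


def animate_texture(texture, per_frame_random):
    """Apply one per-frame offset to every texel of a 2D noise texture."""
    return [[wrap_blue_noise(texel, per_frame_random) for texel in row]
            for row in texture]
```

Every texel receives the same offset, so the spatial distribution inside a frame is unchanged, while the value of a given texel from frame to frame follows the random offsets.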

Previous attempts to use sequences of multiple blue noise textures didn’t work well. The simple fract(blue_noise + per_frame_random) wrap produces the same white noise signal over time as multiple uncorrelated blue noise textures. Nvidia’s Spatiotemporal Blue Noise (STBN) extension removes that white noise over time. It improves the image quality at no additional cost because each frame renders in exactly the same way, only with different random seeds for the effects.

We wanted to keep the benefits of our blue noise, such as excellent performance and variable sigma, while reusing as much of the current algorithm as possible. The idea was simple: use the Void And Cluster algorithm for each layer and add correlation between layers. So we added a few lines to the current implementation that use the previous blue noise result as the input seed for the next generation. That preserves the quality of blue noise per layer, but what about time correlation? The result was promising: the amplitude of low frequencies over time became lower, but it was still not perfect. That is because the next iteration of blue noise generation makes the inverse sequence, so the result was almost the same as the layer before. After a few tries with different input seed jittering and randomization, we found that a simple image flip over the X and Y axes provides the best quality for time coherence.
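The layer chaining can be sketched as follows. This is a hedged illustration under two assumptions: the flip is read as mirroring the image along both axes, and generate_layer is a hypothetical stand-in for the Void And Cluster pass, which is not reproduced here.

```python
def flip_xy(layer):
    """Mirror a 2D layer along both axes (equivalent to a 180 degree rotation)."""
    return [list(reversed(row)) for row in reversed(layer)]


def generate_sequence(first_seed, num_layers, generate_layer):
    """Chain layers: each layer is generated from the flipped previous result."""
    layers = []
    seed = first_seed
    for _ in range(num_layers):
        layer = generate_layer(seed)  # hypothetical Void And Cluster pass
        layers.append(layer)
        seed = flip_xy(layer)         # flipped result seeds the next layer
    return layers
```

For example, flip_xy([[1, 2], [3, 4]]) returns [[4, 3], [2, 1]].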

Here are the spectra of a sequence of 64 blue noise images (64×64) over XY/XZ/YZ slices with different sigma values (2.0, 1.6, 1.2). As you can see, there are no low frequencies in time at all, and we can still control the amount of middle frequencies in the XY slice as before (8×8 grid of layers):

[Images: spectra of the XY, XZ, and YZ slices for sigma 2.0, 1.6, and 1.2]

Now let’s see the difference between a single wrapped blue noise image (white over time) and a sequence of blue noise images (blue over time). 64x64x32 noise with a 1.6 sigma value works best for us. This is a side-by-side comparison of the PCF shadow map with just 8 samples (super fast on modern GPUs). Almost all unwanted blue noise tiling artifacts are gone:

[Images: PCF shadow map with a single wrapped blue noise image and with a sequence of blue noise images]

A comparison with Nvidia STBN (128x128x64) also shows an image improvement for ray-traced ambient occlusion with 2 spp and reflections with 1 spp. That is interesting because we still use 64x64x32 noise, which is 8 times smaller. “Single” is the first layer from the noise sequence with the white noise wrap. “All” is all layers from the sequence.

The high-level source code of the improved blue noise generator is available on our GitHub page. The source code requires Tellusim Engine, but you can use it as a starting point in your noise explorations.

Ready-to-use binaries for Windows, Linux, and macOS:

Ready-to-use noise sequences with 1.6 sigma:

  • 64x64x32: png ktx
  • 64x64x64: png ktx
  • 128x128x64: png ktx (108 seconds on M1 Max).
  • 256x256x128: png ktx (21 minutes on M1 Max).

We are thankful for the following Sketchfab models that were used in this article:

June 19, 2022

Intel Arc A370M Analysis

Intel Arc GPUs were announced a while ago, but we did not have an opportunity to analyze their performance until now. Because Vulkan performance is lower on Intel GPUs, we will use the D3D12 API for the tests. The Xe GPU runs a more recent driver version than the Arc GPU. The main focus is raw Arc GPU performance in comparison with other GPUs. We can easily extrapolate the numbers to get the approximate performance of the upcoming Intel Arc A770M, which has 4x the cores and faster memory: just multiply (or divide) the results by 4.
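As a rough illustration of that extrapolation, the sketch below scales throughput up and timings down by the core ratio. It assumes perfectly linear scaling with core count and ignores the faster memory, so it is an optimistic estimate and not a measurement. The function names are hypothetical.

```python
def extrapolate_throughput(measured, core_ratio=4):
    """Throughput (triangles per second) grows with the core ratio."""
    return measured * core_ratio


def extrapolate_time_ms(measured_ms, core_ratio=4):
    """Timings (milliseconds) shrink with the core ratio."""
    return measured_ms / core_ratio
```

With these assumptions, the 3.2B triangles per second measured for the A370M in Single DIP mode would extrapolate to 12.8B, and a 3 ms pass would extrapolate to 0.75 ms.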

We will see how the new GPU handles triangles and batches in the first test. This is the same test we used before. The Meshlet size is 69/169. The test is rendering 262K Meshlets. The total amount of geometry is 20M vertices and 40M triangles per frame.

GPU                  | Single DIP | Mesh Indexing | MDI/ICB | Mesh Shader | Compute Shader
Intel Arc A370M      | 3.2B       | 1.0B          | 2.9B    | 770M        | 3.0B
Intel Iris Xe (12th) | 1.6B       | 450M          | 1.6B    | -           | 2.9B
Intel Iris Xe (11th) | 1.2B       | 370M          | 400M    | -           | 2.3B
Apple M1 Max         | 8.3B       | 3.5B          | 2.2B    | -           | 12.3B
Apple M1             | 1.4B       | 648M          | 1.0B    | -           | 2.7B
GeForce 2080 Ti      | 15.5B      | 5.2B          | 17.5B   | 14.3B       | 17.8B
GeForce 1060         | 3.8B       | 1.2B          | 4.0B    | -           | 4.5B
Radeon 6700 XT       | 14.2B      | 6.2B          | 6.3B    | 4.6B        | 17.0B
Radeon 5600 M        | 5.0B       | 2.4B          | 2.1B    | -           | 7.4B
Radeon 4800H         | 1.2B       | 530M          | 1.2B    | -           | 1.5B
Adreno 730           | 890M       | 287M          | 120M    | -           | 423M
  • Units are billions (B) or millions (M) of triangles per second.
  • Single DIP is drawing 81 instances with u32 indices without going to Meshlet level.
  • Mesh Indexing is a Mesh Shader emulation trick from this post.
  • MDI/ICB is Multi Draw Indirect or Indirect Command Buffer.
  • Mesh Shader is using Mesh Shaders rendering mode.
  • Compute Shader is using Compute Shader rasterization.

Intel Arc A370M provides 2x better triangle rasterization throughput than the Xe GPU. But unfortunately, there is no increase in Compute Shader rasterization, where the Arc GPU shows almost the same result as the Xe GPU. The 12th generation Xe GPU also benefits from the increased DDR5 memory bandwidth compared to the 11th generation, which uses DDR4. The new Xe generation also brings a huge 4x boost in MDI performance.

Mesh Shader rendering performance is not good: the GPU loses 4x of its theoretical triangle throughput. The Mesh Shader emulation trick is also faster than Mesh Shader on Intel Arc.

The second test is a ray tracing test with CS (Compute Shader) and API (HW) rendering modes.

GPU                  | CS Static | API Static | CS Dynamic Fast         | API Dynamic Fast        | CS Dynamic Full          | API Dynamic Full
Intel Arc A370M      | 15.0 FPS  | 142 FPS    | 23.8 FPS (8ms/15ms)     | 62.3 FPS (8ms/3ms)      | 12.0 FPS (53ms/12ms)     | 20.8 FPS (40ms/3ms)
Intel Iris Xe (12th) | 11.5 FPS  | -          | 12.5 FPS (14ms/56ms)    | -                       | 7.4 FPS (76ms/32ms)      | -
Intel Iris Xe (11th) | 10.1 FPS  | -          | 8.8 FPS (17ms/60ms)     | -                       | 5.4 FPS (101ms/50ms)     | -
Apple M1 Max         | 68.4 FPS  | 63.9 FPS   | 34.8 FPS                | 28.7 FPS                | 28.5 FPS                 | 2.7 FPS
Apple M1             | 16.5 FPS  | 16.5 FPS   | 11.3 FPS                | 14.3 FPS                | 7.3 FPS                  | 1.7 FPS
GeForce 2080 Ti      | 74.2 FPS  | 803 FPS    | 74.5 FPS (2ms/8ms)      | 353 FPS (1.1ms/0.6ms)   | 58.1 FPS (8.8ms/5ms)     | 61.2 FPS (15ms/0.5ms)
GeForce 1060         | 18.0 FPS  | -          | 16.8 FPS (10ms/35ms)    | -                       | 13.7 FPS (32ms/26ms)     | -
Radeon 6700 XT       | 134 FPS   | 368 FPS    | 62.7 FPS (3ms/10ms)     | 155 FPS (5ms/1ms)       | 50.4 FPS (9ms/8ms)       | 22 FPS (44ms/1ms)
Radeon 5600 M        | 73.3 FPS  | -          | 35.2 FPS (6ms/16ms)     | -                       | 25.7 FPS (19ms/13ms)     | -
Radeon 4800H         | 11.7 FPS  | -          | 7.5 FPS (35ms/66ms)     | -                       | 5.2 FPS (109ms/52ms)     | -
  • CS Static is Compute Shader ray tracing from our post (40M triangles total).
  • CS Dynamic Fast is Compute Shader ray tracing from this post (4.2M triangles and 2.9M vertices total).
  • CS Dynamic Full is the same as CS Dynamic Fast but with a full BLAS rebuild instead of a fast BVH update.
  • API Static, API Dynamic Fast, and API Dynamic Full use API-provided ray tracing.
  • Timings show BLAS update / Scene tracing times.

In these tests, the new Arc architecture demonstrates much better compute performance on workloads with high thread divergence. The HW ray tracing rate is great in comparison with the compute shader implementation. Extrapolating from these results, the RT performance of the Arc A770M should be better than that of AMD Radeon GPUs.

Let’s see how much memory we need for BLAS and Scratch buffers:

GPU             | Static BLAS | Static Scratch | Dynamic BLAS | Dynamic Scratch
Intel Arc A370M | 66 MB       | 23 MB          | 642 MB       | 280 MB
Apple M1        | 82 MB       | 88 MB          | 355 MB       | 382 MB
GeForce 2080 Ti | 33 MB       | 10 MB          | 255 MB       | 16 MB
Radeon 6700 XT  | 77 MB       | 105 MB         | 656 MB       | 887 MB

The numbers from the new Intel GPU look promising. But the GravityMark benchmark crashes on the D3D12 API (both Raster and RT), where Arc should demonstrate its potential. In practice, the results we do have are worse than the Intel Xe generation and even worse than the Apple A15. Hopefully, this will improve with new driver updates because, at the moment, Intel Xe is faster than Intel Arc.

Intel Arc and Intel Xe don’t differ much in Compute performance. So the driver could probably use both GPUs for rendering, for example by emulating multiple compute queues that run on different GPUs. If this is true, Intel’s driver team has a lot of work to do.

Update: ExecuteIndirect() with a GPU-specified count crashes the D3D12 driver. A CPU-GPU synchronization workaround (fetching the count parameter) allows running GravityMark on D3D12. But unfortunately, this workaround does not help Vulkan to be as fast as D3D12.

Update 2: Thanks to Intel developer relations, a simple engine optimization (flexible subgroup size) gives >200% better performance on Intel Arc GPU in Vulkan.

January 16, 2022

Mesh Shader Emulation

The Mesh shader performance is not where AMD GPUs shine. There is not much difference in FPS between MDI and Mesh shader either. But there is a trick that emulates Mesh shader functionality on all GPUs with compute shader support. And this technique works faster than MDI and Mesh shader on almost all AMD GPUs. It also brings an unbelievable (7x) performance boost for Qualcomm GPUs.

Geometry prepared for Mesh shader rendering contains vertex and meshlet data buffers. Each meshlet uses uint8_t vertex indices, which allows loading more primitives in a few memory lookups (uint32_t[3] per 4 triangles).

This is a data layout for each meshlet:

uint32_t Number of primitives
uint32_t Number of vertices 
uint32_t Base index (index buffer offset)
uint32_t Base vertex (vertex buffer offset)
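To make the layout concrete, here is a small Python sketch of the meshlet header and of the index packing. The field order follows the list above and the byte order inside each word (lowest byte first) follows the unpacking in the compute shader. The Meshlet class, pack_triangles, and the little-endian serialization are assumptions made for illustration only.

```python
import struct
from dataclasses import dataclass


@dataclass
class Meshlet:
    num_primitives: int  # number of primitives
    num_vertices: int    # number of vertices
    base_index: int      # index buffer offset
    base_vertex: int     # vertex buffer offset

    def pack(self) -> bytes:
        """Serialize as four little-endian uint32_t values (16 bytes)."""
        return struct.pack("<4I", self.num_primitives, self.num_vertices,
                           self.base_index, self.base_vertex)


def pack_triangles(indices):
    """Pack uint8_t vertex indices four per uint32_t, lowest byte first."""
    padded = list(indices) + [0] * (-len(indices) % 4)
    return [padded[i] | (padded[i + 1] << 8) |
            (padded[i + 2] << 16) | (padded[i + 3] << 24)
            for i in range(0, len(padded), 4)]
```

Four triangles use twelve 8-bit indices, so pack_triangles returns exactly three words for them, which is the uint32_t[3] per 4 triangles mentioned above.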

Mesh shader is straightforward: load meshlet info, fetch vertices and indices, submit vertices and indices.

But how will we render meshlets without Mesh shader support? We will use a single draw call with the help of a simple compute shader. This shader will transform meshlet indices into the packed uint32_t triangle indices that contain the meshlet index and the vertex index. The shader is trivial:

layout(std430, binding = 0) readonly buffer meshlets_buffer { uint meshlets_data[]; };
layout(std430, binding = 1) writeonly buffer indexing_buffer { uvec4 indexing_data[]; };

// GROUP_SIZE, base_group, and num_meshlets are declared elsewhere (not shown)
// the meshlet header is shared by all threads of the group
shared uint num_primitives;
shared uint base_index;

void main() {
  uint group_id = base_group + gl_WorkGroupID.x;
  uint local_id = gl_LocalInvocationIndex;
  // the first thread loads the meshlet header
  [[branch]] if(local_id == 0u) {
    uint meshlet_index = (group_id % num_meshlets) * 4u;
    num_primitives = meshlets_data[meshlet_index + 0u];
    base_index = meshlets_data[meshlet_index + 2u];
  }
  memoryBarrierShared(); barrier();
  // each thread loads three words (twelve 8-bit indices, four triangles)
  uint indices_0 = 0u;
  uint indices_1 = 0u;
  uint indices_2 = 0u;
  [[branch]] if(local_id * 4u < num_primitives) {
    uint address = base_index + local_id * 3u;
    indices_0 = meshlets_data[address + 0u];
    indices_1 = meshlets_data[address + 1u];
    indices_2 = meshlets_data[address + 2u];
  }
  // unpack the 8-bit indices and tag each one with the group index
  uint group_index = group_id << 8u;
  uint index = (GROUP_SIZE * group_id + local_id) * 3u;
  indexing_data[index + 0u] = (uvec4(indices_0, indices_0 >> 8u, indices_0 >> 16u, indices_0 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 1u] = (uvec4(indices_1, indices_1 >> 8u, indices_1 >> 16u, indices_1 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 2u] = (uvec4(indices_2, indices_2 >> 8u, indices_2 >> 16u, indices_2 >> 24u) & 0xffu) | group_index;
}

After that compute pass, only a single drawElements() call is required to render all meshlets. The amount of memory written and then read per 1M triangles is 11 MB. 16-bit indices are not an option because they would limit a draw call to 1024 meshlets. A TriangleIndex built-in shader variable would make that possible, but we don’t have it.
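A CPU-side reference can help when validating the generated index buffer. The Python sketch below mirrors what one thread of the compute pass does with a single packed word and estimates the size of the generated index buffer, assuming three 32-bit indices per triangle. The function names are hypothetical.

```python
def expand_word(word, group_id):
    """Split one packed word into four 32-bit indices tagged with the group."""
    group_index = group_id << 8
    return [((word >> shift) & 0xFF) | group_index for shift in (0, 8, 16, 24)]


def index_buffer_bytes(num_triangles, bytes_per_index=4):
    """Size of the generated index buffer: three indices per triangle."""
    return num_triangles * 3 * bytes_per_index
```

For 1M triangles, index_buffer_bytes returns 12,000,000 bytes, which is about 11.4 MiB and is consistent with the 11 MB figure above.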

Vertex shader can get the meshlet index and meshlet vertex by using simple math over VertexIndex built-in variable:

uint meshlet = gl_VertexIndex >> 8u;
uint vertex = gl_VertexIndex & 0xffu;
// load vertex data from SSBO
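The same bit layout can be checked with a round trip in Python. This is only an illustration of the encoding used above (8 low bits for the vertex, the remaining bits for the meshlet). encode_index and decode_index are hypothetical helper names.

```python
def encode_index(meshlet, vertex):
    """Pack a meshlet index and an 8-bit vertex index into one 32-bit index."""
    assert 0 <= vertex < 256, "the vertex index must fit into 8 bits"
    assert 0 <= meshlet < (1 << 24), "the meshlet index must fit into 24 bits"
    return (meshlet << 8) | vertex


def decode_index(vertex_index):
    """Recover (meshlet, vertex) the same way the vertex shader does."""
    return vertex_index >> 8, vertex_index & 0xFF
```

With 32-bit indices, the 24 remaining bits leave room for more than 16 million meshlets in a single draw call.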

There is no problem adding a per-triangle visibility culling test to that shader. And it will improve performance because back-face and invisible triangles will not use the bandwidth.
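One possible form of such a test is a screen-space back-face check. The Python sketch below is an assumption about how it could look, not the test used in the engine: it takes projected 2D positions with the Y axis pointing up and treats counter-clockwise triangles as front-facing.

```python
def signed_area_2d(a, b, c):
    """Twice the signed area of a 2D triangle (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def is_triangle_visible(a, b, c):
    """Reject back-facing and degenerate (zero-area) triangles."""
    return signed_area_2d(a, b, c) > 0
```

In the compute pass, a rejected triangle could be replaced with three identical indices, the same way unused threads already emit zero indices.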

Let’s check the results in different configurations and platforms where the emulation is faster:

GPU                    | Meshlet size | Single DIP | Emulation | Mesh Shader | MDI/ICB/Loop
Radeon 6900 XT         | 64/126       | 17.2 B     | 9.6 B     | 9.2 B       | 4.1 B
Radeon 6900 XT         | 128/212      | 17.2 B     | 10.1 B    | 9.2 B       | 8.5 B
Radeon 6700 XT         | 64/126       | 14.1 B     | 7.7 B     | 4.6 B       | 4.1 B
Radeon 6700 XT         | 128/212      | 14.1 B     | 7.9 B     | 4.6 B       | 8.3 B
Radeon 5600 M          | 64/126       | 5.0 B      | 2.9 B     | -           | 1.4 B
Radeon 5600 M          | 128/212      | 5.0 B      | 2.9 B     | -           | 2.8 B
Radeon Vega 56 (macOS) | 64/126       | 4.1 B      | 2.7 B     | -           | 860 M
Radeon Vega 56 (macOS) | 128/212      | 4.1 B      | 2.9 B     | -           | 1.8 B
Adreno 660             | 64/126       | 583 M      | 265 M     | -           | 36.6 M
Adreno 660             | 128/212      | 596 M      | 309 M     | -           | 74.7 M
Mali-G78 MP20          | 64/126       | 176 M      | 153 M     | -           | 83 M
Mali-G78 MP20          | 128/212      | 187 M      | 123 M     | -           | 134 M

The results are in billions (B) and millions (M) of processed triangles per second. MDI is Multi Draw Indirect, ICB is Indirect Command Buffer, and Loop is a CPU loop of multiple drawIndirect() commands.

If you need to draw a lot of small DIPs, it’s better to pack everything into a single draw call on AMD and Qualcomm. A single DIP works better than Mesh shaders or MDI there, even with the additional index buffer generation overhead.

Windows and Linux binaries:

December 16, 2021

Mesh Shader Performance

We have some new questionable results for Mesh Shader and MultiDrawIndirect performance on AMD GPUs. Let’s look at the situation when we render geometry using independent triangles without an index buffer. The VS Draw Arrays column represents this rendering mode, where the number of Vertex Shader invocations equals the number of primitives times 3. Technically, this mode does the same as the Mesh Shader rendering mode but without generating indices. That’s why it’s surprising to see that it’s faster than MultiDrawIndirect and Mesh Shader on the 6700 XT. So if you are optimizing geometry for the vertex cache or generating the best possible meshlets, it makes no sense to do it for AMD GPUs. Other GPUs can share “Vertex Shader” output between adjacent triangles.

GPU                 | VS Draw Elements | VS Draw Arrays | MDI    | MS     | CS
Radeon 5600 M       | 5.0 B            | 1.5 B          | 1.1 B  | -      | 8.3 B
Radeon 6700 XT      | 14.5 B           | 4.8 B          | 4.1 B  | 4.6 B  | 19.5 B
Radeon 6900 XT      | 17.6 B           | 7.2 B          | 4.1 B  | 9.1 B  | 34.5 B
GeForce RTX 2080 Ti | 12.3 B           | 5.6 B          | 12.5 B | 13.3 B | 18.3 B
GeForce RTX 3090    | 14.3 B           | 5.9 B          | 14.6 B | 20.7 B | 28.8 B
Intel DG1           | 1.3 B            | 227 M          | 1.1 B  | -      | 2.5 B
Apple M1            | 1.4 B            | 556 M          | 930 M  | -      | 2.5 B
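The difference in Vertex Shader work between the VS Draw Arrays and VS Draw Elements modes can be estimated with a short Python sketch. It assumes the best case for the indexed draw, where every unique vertex is shaded exactly once. Real hardware reuses results only within a limited window, so the actual count lies between the two numbers. The function names are hypothetical.

```python
def draw_arrays_invocations(num_primitives):
    """Non-indexed triangles: three Vertex Shader invocations per primitive."""
    return num_primitives * 3


def draw_elements_invocations(indices):
    """Best case for an indexed draw: each unique vertex is shaded once."""
    return len(set(indices))
```

For a meshlet with 64 vertices and 126 primitives, that is 378 invocations without an index buffer against 64 in the best indexed case.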

Compute versus Hardware
Mesh Shader versus MultiDrawIndirect