September 18, 2023

07 Hello Splatting

Gaussian Splatting is a fancy method of digitizing 2D or 3D scenes by representing them as a cloud of ellipsoids, each with unique sizes and orientations. An excellent 3D Gaussian Splatting for Real-Time Radiance Field Rendering paper enhances rendering quality through view-dependent color and fast generation.

Let’s implement Gaussian Splatting rendering using the Tellusim Core SDK for all platforms and APIs with minimal code and aim for the highest achievable FPS. We will utilize pre-trained models from the paper.

Pre-trained models are saved in the *PLY* file format with custom vertex properties that describe Gaussian parameters, including rotation, scale, color, and opacity. The Tellusim *Mesh* interface can load all this data, so our task is to convert and pack it. Each Gaussian is fully defined by the following parameters:

structGaussian { Vector4f position; // Gaussian positionfloat16x8_tharmonics[7]; // spherical harmonicsfloat16x8_trotation_scale; // rotation, scale, and opacity Vector4f covariance_depth; // covariance and depth Vector4f position_color; // position and color Vector2i min_tile; // minimum tile Vector2i max_tile; // maximum tile };

We are mixing constant and run-time values inside the same Gaussian structure for simplicity. Sixteen spherical harmonic coefficients and other parameters, except for position, are converted to half precision without loss in quality. The size of each Gaussian is 144 bytes using this simple packing. When including run-time parameters, the required size expands to 192 bytes. The conversion code is straightforward. It uses *SIMD* types optimized for the target platforms through *SSE/AVX/NEON* intrinsics:

for(uint32_ti = 0; i < gaussians.size(); i++) { Gaussian &gaussian = gaussians[i]; // copy position gaussian.position.set(position_data[i], 1.0f); // pack rotation, scale, and opacityfloat32_topacity = 1.0f / (1.0f + exp(-opacity_data[i])); Vector4f scale = Vector4f(exp(scale_0_data[i]), exp(scale_1_data[i]), exp(scale_2_data[i]), opacity); Quaternionf rotation = normalize(Quaternionf(rot_1_data[i], rot_2_data[i], rot_3_data[i], rot_0_data[i])); gaussian.rotation_scale =float16x8_t(float16x4_t(float32x4_t(rotation.q)),float16x4_t(float32x4_t(scale.v))); // copy harmonics harmonics[0].x = f_dc_0_data[i]; harmonics[0].y = f_dc_1_data[i]; harmonics[0].z = f_dc_2_data[i];for(uint32_tj = 1, k = 0; j < 16; j++, k++) { harmonics[j].x = harmonics_data[k + 0][i]; harmonics[j].y = harmonics_data[k + 15][i]; harmonics[j].z = harmonics_data[k + 30][i]; } // don't waste alpha channel harmonics[0].w = harmonics[14].x; harmonics[1].w = harmonics[14].y; harmonics[2].w = harmonics[14].z; harmonics[3].w = harmonics[15].x; harmonics[4].w = harmonics[15].y; harmonics[5].w = harmonics[15].z; // float32_t to float16_tfor(uint32_tj = 0, k = 0; j < 7; j++, k += 2) { gaussian.harmonics[j] =float16x8_t(float16x4_t(harmonics[k + 0]),float16x4_t(harmonics[k + 1])); } }

Prefix Scan and Radix Sort algorithms for all platforms are available in the Tellusim Core SDK. They are essential for nearly all compute-based applications, including Gaussian Splatting rendering. We will employ compute tile-based rasterization to render all Gaussians, requiring only six custom compute shaders. The rendering sequence is as follows:

- 1. Clear screen-tile counters.
- 2. Process all Gaussians and calculate the number of Gaussians per tile.
- 3. Align the tile counters to 4 elements (required for the Radix Sort algorithm).
- 4. Run the Prefix Scan algorithm over all tiles to find correct tile offsets.
- 5. Create Indirect dispatch arguments for the Scatter and Radix Sort steps.
- 6. Scatter visible Gaussian indices to screen tiles.
- 7. Sort screen tiles by depth (front to back).
- 8. Blend Gaussians in each tile.

The performance optimal tile size is 32×16. It’s suitable for high resolutions and avoids overburdening Radix Sort with additional independent sorting regions. The Radix Sort algorithm utilizes GPU-generated Indirect arguments, which helps minimize memory requirements through more compact data packing.

A huge number of Gaussians per tile and the blending termination condition can result in significant divergence between GPU threads, slowing down GPU performance. The Gaussian Splatting algorithm is inherently simple. However, the overhead of data loading adversely impacts performance, even when using shared memory for efficient per-tile data loading.

A superior approach is to avoid using shared memory entirely and instead blend multiple pixels per GPU thread. We have determined that using 8 pixels per GPU thread optimizes both performance and shader group memory usage. Multi-pixel blending significantly boosts performance. Instead of processing 8 million pixels (3840×2160 resolution) with substantial data loading overhead and high divergence, we now handle just 1 million pixels (960×1080 resolution) without data loading overhead for 7 out of every 8 pixels. This optimization enables real-time Gaussian Splatting on mobile devices. The only remaining question is how to efficiently pack (optimize) Gaussians to avoid consuming all available memory.

This is the final Gaussian Splatting rendering shader with 7 neighbor pixels removed:

/* */voidmain() {ivec2group_id =ivec2(gl_WorkGroupID.xy);ivec2global_id =ivec2(gl_GlobalInvocationID.xy) *ivec2(4, 2);uintlocal_id = gl_LocalInvocationIndex; // global parameters [[branch]]if(local_id == 0u) {inttile_index = tiles_width * group_id.y + group_id.x; tile_gaussians = count_buffer[tile_index + num_tiles]; data_offset = count_buffer[tile_index] + max_gaussians; } memoryBarrierShared(); barrier();vec2position =vec2(global_id);floattransparency_00 = 1.0f; ..vec3color_00 =vec3(0.0f); .. [[loop]]for(uinti = 0u; i < tile_gaussians; i++) {uintindex = order_buffer[data_offset + i];vec4position_color = gaussians_buffer[index].position_color;vec3covariance = gaussians_buffer[index].covariance_depth.xyz;vec4color_opacity = unpack_half4(position_color.zw);vec2direction_00 = position_color.xy - position; ..floatpower_00 = dot(covariance, direction_00.yxx * direction_00.yxy); ..floatalpha_00 = min(color_opacity.w * exp(power_00), 1.0f); .. color_00 += color_opacity.xyz * (alpha_00 * transparency_00); .. transparency_00 *= (1.0f - alpha_00); ..floattransparency_0 = max(max(transparency_00, transparency_10), max(transparency_20, transparency_30));floattransparency_1 = max(max(transparency_01, transparency_11), max(transparency_21, transparency_31)); [[branch]]if(max(transparency_0, transparency_1) < 1.0f / 255.0f)break; } // save result [[branch]]if(all(lessThan(global_id,ivec2(surface_width, surface_height)))) { imageStore(out_surface, global_id +ivec2(0, 0),vec4(color_00, transparency_00)); .. } }

With this optimization, a single GPU can easily handle 4K resolution. Comparing performance with the *CUDA* implementation is challenging because this tutorial example doesn’t read camera orientation parameters from scenes. However, it demonstrates a nearly twice improvement in speed (122 FPS compared to 72 FPS) when rendering from approximately the same locations in the Garden scene (with 5.8 million Gaussians). Another bonus of using the Tellusim Core SDK is the asset loading speed. The original example takes 45 seconds to start scene rendering, while the Tellusim-based implementation is ready in just 4 seconds.

Using the Kitchen (1.8 million Gaussians) scene, we were able to achieve the following FPS on different hardware (HW) and resolutions:

1920×1080 | 3840×2160 | |
---|---|---|

GeForce 3090 Ti | 238 FPS | 95 FPS |

GeForce 2080 Ti | 165 FPS | 66 FPS |

GeForce 3060 M | 104 FPS | 41 FPS |

Radeon 6900 XT | 223 FPS | 97 FPS |

Radeon 6700 XT | 165 FPS | 64 FPS |

Radeon 6600 | 113 FPS | 44 FPS |

Intel Arc A770 | 180 FPS | 80 FPS |

Intel Iris Xe 96 | 29 FPS | 12 FPS |

Apple M1 Max | 122 FPS | 40 FPS |

Apple M1 | 33 FPS | 13 FPS |

WebGPU demos (a WebGPU-compatible browser is required):