September 18, 2023

07 Hello Splatting

Gaussian Splatting is a fancy method of digitizing 2D or 3D scenes by representing them as a cloud of ellipsoids, each with unique sizes and orientations. An excellent 3D Gaussian Splatting for Real-Time Radiance Field Rendering paper enhances rendering quality through view-dependent color and fast generation.

Let’s implement Gaussian Splatting rendering using the Tellusim Core SDK for all platforms and APIs with minimal code and aim for the highest achievable FPS. We will utilize pre-trained models from the paper.

Pre-trained models are saved in the PLY file format with custom vertex properties that describe Gaussian parameters, including rotation, scale, color, and opacity. The Tellusim Mesh interface can load all this data, so our task is to convert and pack it. Each Gaussian is fully defined by the following parameters:

struct Gaussian {
    Vector4f position;              // Gaussian position
    float16x8_t harmonics[7];       // spherical harmonics
    float16x8_t rotation_scale;     // rotation, scale, and opacity
    Vector4f covariance_depth;      // covariance and depth
    Vector4f position_color;        // position and color
    Vector2i min_tile;              // minimum tile
    Vector2i max_tile;              // maximum tile

We are mixing constant and run-time values inside the same Gaussian structure for simplicity. Sixteen spherical harmonic coefficients and other parameters, except for position, are converted to half precision without loss in quality. The size of each Gaussian is 144 bytes using this simple packing. When including run-time parameters, the required size expands to 192 bytes. The conversion code is straightforward. It uses SIMD types optimized for the target platforms through SSE/AVX/NEON intrinsics:

for(uint32_t i = 0; i < gaussians.size(); i++) {
    Gaussian &gaussian = gaussians[i];

    // copy position
    gaussian.position.set(position_data[i], 1.0f);

    // pack rotation, scale, and opacity
    float32_t opacity = 1.0f / (1.0f + exp(-opacity_data[i]));
    Vector4f scale = Vector4f(exp(scale_0_data[i]), exp(scale_1_data[i]), exp(scale_2_data[i]), opacity);
    Quaternionf rotation = normalize(Quaternionf(rot_1_data[i], rot_2_data[i], rot_3_data[i], rot_0_data[i]));
    gaussian.rotation_scale = float16x8_t(float16x4_t(float32x4_t(rotation.q)), float16x4_t(float32x4_t(scale.v)));

    // copy harmonics
    harmonics[0].x = f_dc_0_data[i];
    harmonics[0].y = f_dc_1_data[i];
    harmonics[0].z = f_dc_2_data[i];
    for(uint32_t j = 1, k = 0; j < 16; j++, k++) {
        harmonics[j].x = harmonics_data[k +  0][i];
        harmonics[j].y = harmonics_data[k + 15][i];
        harmonics[j].z = harmonics_data[k + 30][i];

    // don't waste alpha channel
    harmonics[0].w = harmonics[14].x;
    harmonics[1].w = harmonics[14].y;
    harmonics[2].w = harmonics[14].z;
    harmonics[3].w = harmonics[15].x;
    harmonics[4].w = harmonics[15].y;
    harmonics[5].w = harmonics[15].z;

    // float32_t to float16_t
    for(uint32_t j = 0, k = 0; j < 7; j++, k += 2) {
        gaussian.harmonics[j] = float16x8_t(float16x4_t(harmonics[k + 0]), float16x4_t(harmonics[k + 1]));

Prefix Scan and Radix Sort algorithms for all platforms are available in the Tellusim Core SDK. They are essential for nearly all compute-based applications, including Gaussian Splatting rendering. We will employ compute tile-based rasterization to render all Gaussians, requiring only six custom compute shaders. The rendering sequence is as follows:

  • 1. Clear screen-tile counters.
  • 2. Process all Gaussians and calculate the number of Gaussians per tile.
  • 3. Align the tile counters to 4 elements (required for the Radix Sort algorithm).
  • 4. Run the Prefix Scan algorithm over all tiles to find correct tile offsets.
  • 5. Create Indirect dispatch arguments for the Scatter and Radix Sort steps.
  • 6. Scatter visible Gaussian indices to screen tiles.
  • 7. Sort screen tiles by depth (front to back).
  • 8. Blend Gaussians in each tile.

The performance optimal tile size is 32×16. It’s suitable for high resolutions and avoids overburdening Radix Sort with additional independent sorting regions. The Radix Sort algorithm utilizes GPU-generated Indirect arguments, which helps minimize memory requirements through more compact data packing.

A huge number of Gaussians per tile and the blending termination condition can result in significant divergence between GPU threads, slowing down GPU performance. The Gaussian Splatting algorithm is inherently simple. However, the overhead of data loading adversely impacts performance, even when using shared memory for efficient per-tile data loading.

A superior approach is to avoid using shared memory entirely and instead blend multiple pixels per GPU thread. We have determined that using 8 pixels per GPU thread optimizes both performance and shader group memory usage. Multi-pixel blending significantly boosts performance. Instead of processing 8 million pixels (3840×2160 resolution) with substantial data loading overhead and high divergence, we now handle just 1 million pixels (960×1080 resolution) without data loading overhead for 7 out of every 8 pixels. This optimization enables real-time Gaussian Splatting on mobile devices. The only remaining question is how to efficiently pack (optimize) Gaussians to avoid consuming all available memory.

This is the final Gaussian Splatting rendering shader with 7 neighbor pixels removed:

void main() {

    ivec2 group_id = ivec2(gl_WorkGroupID.xy);
    ivec2 global_id = ivec2(gl_GlobalInvocationID.xy) * ivec2(4, 2);
    uint local_id = gl_LocalInvocationIndex;

    // global parameters
    [[branch]] if(local_id == 0u) {
        int tile_index = tiles_width * group_id.y + group_id.x;
        tile_gaussians = count_buffer[tile_index + num_tiles];
        data_offset = count_buffer[tile_index] + max_gaussians;
    memoryBarrierShared(); barrier();

    vec2 position = vec2(global_id);

    float transparency_00 = 1.0f;

    vec3 color_00 = vec3(0.0f);

    [[loop]] for(uint i = 0u; i < tile_gaussians; i++) {

        uint index = order_buffer[data_offset + i];
        vec4 position_color = gaussians_buffer[index].position_color;
        vec3 covariance = gaussians_buffer[index];
        vec4 color_opacity = unpack_half4(;

        vec2 direction_00 = position_color.xy - position;

        float power_00 = dot(covariance, direction_00.yxx * direction_00.yxy);

        float alpha_00 = min(color_opacity.w * exp(power_00), 1.0f);

        color_00 += * (alpha_00 * transparency_00);

        transparency_00 *= (1.0f - alpha_00);

        float transparency_0 = max(max(transparency_00, transparency_10), max(transparency_20, transparency_30));
        float transparency_1 = max(max(transparency_01, transparency_11), max(transparency_21, transparency_31));
        [[branch]] if(max(transparency_0, transparency_1) < 1.0f / 255.0f) break;

    // save result
    [[branch]] if(all(lessThan(global_id, ivec2(surface_width, surface_height)))) {
        imageStore(out_surface, global_id + ivec2(0, 0), vec4(color_00, transparency_00));

With this optimization, a single GPU can easily handle 4K resolution. Comparing performance with the CUDA implementation is challenging because this tutorial example doesn’t read camera orientation parameters from scenes. However, it demonstrates a nearly twice improvement in speed (122 FPS compared to 72 FPS) when rendering from approximately the same locations in the Garden scene (with 5.8 million Gaussians). Another bonus of using the Tellusim Core SDK is the asset loading speed. The original example takes 45 seconds to start scene rendering, while the Tellusim-based implementation is ready in just 4 seconds.

Using the Kitchen (1.8 million Gaussians) scene, we were able to achieve the following FPS on different hardware (HW) and resolutions:

1920×1080 3840×2160
GeForce 3090 Ti 238 FPS 95 FPS
GeForce 2080 Ti 165 FPS 66 FPS
GeForce 3060 M 104 FPS 41 FPS
Radeon 6900 XT 223 FPS 97 FPS
Radeon 6700 XT 165 FPS 64 FPS
Radeon 6600 113 FPS 44 FPS
Intel Arc A770 180 FPS 80 FPS
Intel Iris Xe 96 29 FPS 12 FPS
Apple M1 Max 122 FPS 40 FPS
Apple M1 33 FPS 13 FPS

WebGPU demos (a WebGPU-compatible browser is required):

Responsive image
Responsive image
Responsive image