September 18, 2023

07 Hello Splatting

Gaussian Splatting is a fancy method of digitizing 2D or 3D scenes by representing them as a cloud of ellipsoids, each with a unique size and orientation. The excellent 3D Gaussian Splatting for Real-Time Radiance Field Rendering paper achieves high rendering quality through view-dependent colors and fast scene generation.

Let’s implement Gaussian Splatting rendering using the Tellusim Core SDK for all platforms and APIs with minimal code and aim for the highest achievable FPS. We will utilize pre-trained models from the paper.

Pre-trained models are saved in the PLY file format with custom vertex properties that describe Gaussian parameters, including rotation, scale, color, and opacity. The Tellusim Mesh interface can load all this data, so our task is to convert and pack it. Each Gaussian is fully defined by the following parameters:

struct Gaussian {
    Vector4f position;              // Gaussian position
    float16x8_t harmonics[7];       // spherical harmonics
    float16x8_t rotation_scale;     // rotation, scale, and opacity
    Vector4f covariance_depth;      // covariance and depth
    Vector4f position_color;        // position and color
    Vector2i min_tile;              // minimum tile
    Vector2i max_tile;              // maximum tile
};

We are mixing constant and run-time values inside the same Gaussian structure for simplicity. The sixteen spherical harmonic coefficients and all other parameters except position are converted to half precision without noticeable loss in quality. With this simple packing, the constant part of each Gaussian takes 144 bytes; including the run-time parameters, the required size expands to 192 bytes. The conversion code is straightforward and uses SIMD types optimized for the target platforms through SSE/AVX/NEON intrinsics:

for(uint32_t i = 0; i < gaussians.size(); i++) {
    Gaussian &gaussian = gaussians[i];

    // copy position
    gaussian.position.set(position_data[i], 1.0f);

    // pack rotation, scale, and opacity
    float32_t opacity = 1.0f / (1.0f + exp(-opacity_data[i]));
    Vector4f scale = Vector4f(exp(scale_0_data[i]), exp(scale_1_data[i]), exp(scale_2_data[i]), opacity);
    Quaternionf rotation = normalize(Quaternionf(rot_1_data[i], rot_2_data[i], rot_3_data[i], rot_0_data[i]));
    gaussian.rotation_scale = float16x8_t(float16x4_t(float32x4_t(rotation.q)), float16x4_t(float32x4_t(scale.v)));

    // copy harmonics
    harmonics[0].x = f_dc_0_data[i];
    harmonics[0].y = f_dc_1_data[i];
    harmonics[0].z = f_dc_2_data[i];
    for(uint32_t j = 1, k = 0; j < 16; j++, k++) {
        harmonics[j].x = harmonics_data[k +  0][i];
        harmonics[j].y = harmonics_data[k + 15][i];
        harmonics[j].z = harmonics_data[k + 30][i];
    }

    // don't waste alpha channel
    harmonics[0].w = harmonics[14].x;
    harmonics[1].w = harmonics[14].y;
    harmonics[2].w = harmonics[14].z;
    harmonics[3].w = harmonics[15].x;
    harmonics[4].w = harmonics[15].y;
    harmonics[5].w = harmonics[15].z;

    // float32_t to float16_t
    for(uint32_t j = 0, k = 0; j < 7; j++, k += 2) {
        gaussian.harmonics[j] = float16x8_t(float16x4_t(harmonics[k + 0]), float16x4_t(harmonics[k + 1]));
    }
}

Prefix Scan and Radix Sort algorithms for all platforms are available in the Tellusim Core SDK. They are essential for nearly all compute-based applications, including Gaussian Splatting rendering. We will employ compute tile-based rasterization to render all Gaussians, requiring only six custom compute shaders. The rendering sequence is as follows:

  1. Clear screen-tile counters.
  2. Process all Gaussians and calculate the number of Gaussians per tile.
  3. Align the tile counters to 4 elements (required for the Radix Sort algorithm).
  4. Run the Prefix Scan algorithm over all tiles to find correct tile offsets.
  5. Create Indirect dispatch arguments for the Scatter and Radix Sort steps.
  6. Scatter visible Gaussian indices to screen tiles.
  7. Sort screen tiles by depth (front to back).
  8. Blend Gaussians in each tile.

The optimal tile size for performance is 32×16: it suits high resolutions without overburdening the Radix Sort with additional independent sorting regions. The Radix Sort algorithm uses GPU-generated Indirect arguments, which minimizes memory requirements through more compact data packing.

A huge number of Gaussians per tile, combined with the blending termination condition, can cause significant divergence between GPU threads and hurt performance. The Gaussian Splatting algorithm itself is simple; however, the overhead of data loading hurts performance even when shared memory is used for efficient per-tile loads.

A superior approach is to avoid using shared memory entirely and instead blend multiple pixels per GPU thread. We have determined that using 8 pixels per GPU thread optimizes both performance and shader group memory usage. Multi-pixel blending significantly boosts performance. Instead of processing 8 million pixels (3840×2160 resolution) with substantial data loading overhead and high divergence, we now handle just 1 million pixels (960×1080 resolution) without data loading overhead for 7 out of every 8 pixels. This optimization enables real-time Gaussian Splatting on mobile devices. The only remaining question is how to efficiently pack (optimize) Gaussians to avoid consuming all available memory.

This is the final Gaussian Splatting rendering shader, with the code for the 7 neighbor pixels removed for brevity:

shared uint tile_gaussians;
shared uint data_offset;

void main() {

    ivec2 group_id = ivec2(gl_WorkGroupID.xy);
    ivec2 global_id = ivec2(gl_GlobalInvocationID.xy) * ivec2(4, 2);
    uint local_id = gl_LocalInvocationIndex;

    // global parameters
    [[branch]] if(local_id == 0u) {
        int tile_index = tiles_width * group_id.y + group_id.x;
        tile_gaussians = count_buffer[tile_index + num_tiles];
        data_offset = count_buffer[tile_index] + max_gaussians;
    }
    memoryBarrierShared(); barrier();

    vec2 position = vec2(global_id);

    float transparency_00 = 1.0f;

    vec3 color_00 = vec3(0.0f);

    [[loop]] for(uint i = 0u; i < tile_gaussians; i++) {

        uint index = order_buffer[data_offset + i];
        vec4 position_color = gaussians_buffer[index].position_color;
        vec3 covariance = gaussians_buffer[index].covariance_depth.xyz;
        vec4 color_opacity = unpack_half4(position_color.zw);

        vec2 direction_00 = position_color.xy - position;

        float power_00 = dot(covariance, direction_00.yxx * direction_00.yxy);

        float alpha_00 = min(color_opacity.w * exp(power_00), 1.0f);

        color_00 += color_opacity.xyz * (alpha_00 * transparency_00);

        transparency_00 *= (1.0f - alpha_00);

        [[branch]] if(transparency_00 < 1.0f / 255.0f) break;
    }

    // save result
    [[branch]] if(all(lessThan(global_id, ivec2(surface_width, surface_height)))) {
        imageStore(out_surface, global_id + ivec2(0, 0), vec4(color_00, transparency_00));
    }
}

With this optimization, a single GPU can easily handle 4K resolution. Comparing performance with the CUDA implementation is challenging because this tutorial example doesn’t read camera orientation parameters from scenes. However, it demonstrates nearly a 2x speed improvement (122 FPS compared to 72 FPS) when rendering from approximately the same locations in the Garden scene (5.8 million Gaussians). Another bonus of using the Tellusim Core SDK is the asset loading speed: the original example takes 45 seconds before it starts rendering the scene, while the Tellusim-based implementation is ready in just 4 seconds.

Using the Kitchen (1.8 million Gaussians) scene, we were able to achieve the following FPS on different hardware (HW) and resolutions:

HW 1920×1080 3840×2160
GeForce 3090 Ti 238 FPS 95 FPS
GeForce 2080 Ti 165 FPS 66 FPS
GeForce 3060 M 104 FPS 41 FPS
Radeon 6900 XT 223 FPS 97 FPS
Radeon 6700 XT 165 FPS 64 FPS
Radeon 6600 113 FPS 44 FPS
Intel Arc A770 180 FPS 80 FPS
Intel Iris Xe 96 29 FPS 12 FPS
Apple M1 Max 122 FPS 40 FPS
Apple M1 33 FPS 13 FPS

WebGPU demos are available (a WebGPU-compatible browser is required).


August 28, 2023

06 Hello Traversal

Ray Queries are very useful for simple scenes where material variety is low. They are good for shadows and ambient occlusion. However, it can be challenging to implement different BRDFs for reflections or primary rays using this approach. Ray Tracing Pipelines give much better flexibility by utilizing driver/GPU-side shader scheduling. Working with Ray Tracing Pipelines isn’t significantly more complex than working with Ray Queries, especially when utilizing the Tellusim SDK.

Within Tellusim Engine, the Ray Tracing Pipeline is referred to as the Traversal interface. It offers functionalities analogous to those of the Pipeline or Kernel interfaces. The Compute interface can dispatch it just as it would with a Kernel. It’s crucial to verify ray tracing support on our device before creating a Traversal. Furthermore, there is an AMD GPU Vulkan driver limitation that doesn’t allow the use of tracing recursion. So, the global RECURSION_DEPTH macro needs to be passed to all shaders:

// check ray tracing support
if(!device.getFeatures().rayTracing) {
    TS_LOG(Error, "ray tracing is not supported\n");
    return 0;
}
if(device.getFeatures().recursionDepth == 1) {
    TS_LOG(Error, "ray tracing recursion is not supported\n");
}

// shader macros
Shader::setMacro("RECURSION_DEPTH", device.getFeatures().recursionDepth);

In this example, we will trace primary, reflection, and shadow rays. For that, we need 3 shader groups. These groups are generated automatically. Your task during initialization is to simply combine the shaders together. Primary rays will launch secondary rays for reflections and shadows, thereby necessitating a recursion depth of 2:

// create traversal
Traversal traversal = device.createTraversal();
traversal.setUniformMask(0, Shader::MaskAll);
traversal.setStorageMasks(0, 3, Shader::MaskAll);
traversal.setSurfaceMask(0, Shader::MaskRayGen);
traversal.setTracingMask(0, Shader::MaskRayGen | Shader::MaskClosest);
traversal.setRecursionDepth(min(device.getFeatures().recursionDepth, 2u));

// entry shader
if(!traversal.loadShaderGLSL(Shader::TypeRayGen, "main.shader", "RAYGEN_SHADER=1")) return 1;

// primary shaders
if(!traversal.loadShaderGLSL(Shader::TypeRayMiss, "main.shader", "RAYMISS_SHADER=1; PRIMARY_SHADER=1")) return 1;
if(!traversal.loadShaderGLSL(Shader::TypeClosest, "main.shader", "CLOSEST_SHADER=1; PRIMARY_SHADER=1; PLANE_SHADER=1")) return 1;
if(!traversal.loadShaderGLSL(Shader::TypeClosest, "main.shader", "CLOSEST_SHADER=1; PRIMARY_SHADER=1; MODEL_SHADER=1")) return 1;
if(!traversal.loadShaderGLSL(Shader::TypeClosest, "main.shader", "CLOSEST_SHADER=1; PRIMARY_SHADER=1; DODECA_SHADER=1")) return 1;

// reflection shaders
if(!traversal.loadShaderGLSL(Shader::TypeRayMiss, "main.shader", "RAYMISS_SHADER=1; REFLECTION_SHADER=1")) return 1;
if(!traversal.loadShaderGLSL(Shader::TypeClosest, "main.shader", "CLOSEST_SHADER=1; REFLECTION_SHADER=1; PLANE_SHADER=1")) return 1;
if(!traversal.loadShaderGLSL(Shader::TypeClosest, "main.shader", "CLOSEST_SHADER=1; REFLECTION_SHADER=1; MODEL_SHADER=1")) return 1;
if(!traversal.loadShaderGLSL(Shader::TypeClosest, "main.shader", "CLOSEST_SHADER=1; REFLECTION_SHADER=1; DODECA_SHADER=1")) return 1;

// shadow shaders
if(!traversal.loadShaderGLSL(Shader::TypeRayMiss, "main.shader", "RAYMISS_SHADER=1; SHADOW_SHADER=1")) return 1;

// create traversal
if(!traversal.create()) return 1;

The example using Ray Queries involved a single geometry in the scene. In this case, we will utilize 3 different geometries. While it is possible to employ 6 separate buffers for vertex and index information, this approach may not be suitable for more complex scenes. Therefore, we will combine them into 2 buffers. The MeshModel interface offers two distinct methods to accomplish this: through Class inheritance or Data callbacks. For our application, Data callbacks provide more than sufficient functionality:

// vertex buffer callback
model.setVertexBufferCallback([&](const void *src, size_t size, bool owner) -> bool {

    // create geometry
    Geometry &geometry = geometries.append();
    geometry.base_vertex = vertices.size();
    geometry.base_index = indices.size();

    // copy vertices
    geometry.num_vertices = (uint32_t)(size / sizeof(Vertex));
    vertices.append((const Vertex*)src, geometry.num_vertices);

    // release memory
    if(owner) Allocator::free(src, size);

    return true;
});

// index buffer callback
model.setIndexBufferCallback([&](const void *src, size_t size, bool owner) -> bool {

    // copy indices
    Geometry &geometry = geometries.back();
    geometry.num_indices = (uint32_t)(size / sizeof(uint32_t));
    indices.append((const uint32_t*)src, geometry.num_indices);

    // release memory
    if(owner) Allocator::free(src, size);

    return true;
});

We simply copy the vertex and index data from the MeshModel callbacks into two buffers. Simultaneously, we save the number of vertices/indices along with their respective base offsets. The Tracing interface employs these offsets to properly initialize the appropriate geometry. If the build buffer is sufficiently large, all tracings can be constructed with a single API call:

// create tracings
size_t build_size = 0;
Array<Tracing> tracings;
for(Geometry &geometry : geometries) {
    Tracing tracing = device.createTracing();
    tracing.addVertexBuffer(geometry.num_vertices, FormatRGBf32, sizeof(Vertex), vertex_buffer, sizeof(Vertex) * geometry.base_vertex);
    tracing.addIndexBuffer(geometry.num_indices, FormatRu32, index_buffer, sizeof(uint32_t) * geometry.base_index);
    if(!tracing.create(Tracing::TypeTriangle, Tracing::FlagCompact | Tracing::FlagFastTrace)) return 1;
    build_size += tracing.getBuildSize();
}

// create build buffer
Buffer build_buffer = device.createBuffer(Buffer::FlagStorage | Buffer::FlagScratch, build_size);
if(!build_buffer) return 1;

// build tracings
if(!device.buildTracings(tracings, build_buffer, Tracing::FlagCompact)) return 1;

That is all for data preparation. To start the actual ray tracing, you simply need to call the Compute::dispatch() method:

// set traversal
compute.setTraversal(traversal);

// set uniform parameters
compute.setUniform(0, common_parameters);

// set storage buffers
compute.setStorageBuffers(0, { geometry_buffer, vertex_buffer, index_buffer });

// set instances tracing
compute.setTracing(0, instances_tracing);

// set surface texture
compute.setSurfaceTexture(0, surface);

// dispatch traversal
compute.dispatch(surface);

It’s possible to use GLSL shaders for Direct3D12 ray tracing. The Tellusim Shader compiler will convert them automatically. If you already have HLSL shaders, you can use them directly by passing them to the Traversal interface. The only necessary modification is to follow the Tellusim shader resource binding model. Here is an example of a Closest Hit shader that calculates intersection normals, applies simple Phong lighting, and launches secondary rays if recursion is supported:

void main() {

    // clear payloads
    #if PRIMARY_SHADER
        reflection_color = vec3(0.0f);
        #if RECURSION_DEPTH > 1
            shadow_value = 0.2f;
        #else
            shadow_value = 1.0f;
        #endif
    #endif

    vec3 position = gl_WorldRayOriginEXT + gl_WorldRayDirectionEXT * gl_HitTEXT;
    vec3 direction = normalize(camera.xyz - position);            // camera position from the common parameters
    vec3 light_direction = normalize(light.xyz - position);       // light position from the common parameters

    // geometry parameters
    uint base_vertex = geometry_buffer[gl_InstanceCustomIndexEXT].base_vertex;
    uint base_index = geometry_buffer[gl_InstanceCustomIndexEXT].base_index;

    // geometry normal
    uint index = gl_PrimitiveID * 3u + base_index;
    vec3 normal_0 = vertex_buffer[index_buffer[index + 0u] + base_vertex];
    vec3 normal_1 = vertex_buffer[index_buffer[index + 1u] + base_vertex];
    vec3 normal_2 = vertex_buffer[index_buffer[index + 2u] + base_vertex];
    vec3 normal = normal_0 * (1.0f - hit_attribute.x - hit_attribute.y) + normal_1 * hit_attribute.x + normal_2 * hit_attribute.y;
    normal = normalize(gl_ObjectToWorldEXT[0].xyz * normal.x + gl_ObjectToWorldEXT[1].xyz * normal.y + gl_ObjectToWorldEXT[2].xyz * normal.z);

    // light color
    float diffuse = clamp(dot(light_direction, normal), 0.0f, 1.0f);
    float specular = pow(clamp(dot(reflect(-light_direction, normal), direction), 0.0f, 1.0f), 16.0f);

    // instance parameters
    #if MODEL_SHADER
        vec3 color = cos(vec3(vec3(1.0f, 0.5f, 0.0f) * 3.14f + float(gl_InstanceID))) * 0.5f + 0.5f;
    #elif DODECA_SHADER
        vec3 color = vec3(16.0f, 219.0f, 217.0f) / 255.0f;
    #elif PLANE_SHADER
        ivec2 grid = ivec2(position.xy / 2.0f - 64.0f) & 0x01;
        vec3 color = vec3(((grid.x ^ grid.y) == 0) ? 0.8f : 0.4f);
    #endif


    #if PRIMARY_SHADER

        // trace secondary rays
        #if RECURSION_DEPTH > 1

            // reflection ray
            traceRayEXT(tracing, gl_RayFlagsOpaqueEXT, 0xffu, 3u, 3u, 1u, position, 1e-3f, reflect(-direction, normal), 1000.0f, 1);

            // shadow ray
            traceRayEXT(tracing, gl_RayFlagsOpaqueEXT | gl_RayFlagsTerminateOnFirstHitEXT | gl_RayFlagsSkipClosestHitShaderEXT, 0xffu, 0u, 3u, 2u, position, 1e-3f, light_direction, 1000.0f, 2);

        #endif

        // color payload
        color_value = (color * diffuse + specular) * shadow_value + reflection_color * 0.5f;

    #elif REFLECTION_SHADER

        // reflection payload
        reflection_color = color * diffuse + specular;

    #endif
}


Ray Tracing applications are very sensitive to screen resolution. Even 3 rays per pixel in this simple scene can significantly drop FPS, especially on non-top-tier GPUs:

1600×900 3840×2160
VK D3D12 VK D3D12
GeForce 3090 1410 FPS (0.5 ms) 1430 FPS (0.5 ms) 334 FPS (2.7 ms) 341 FPS (2.7 ms)
GeForce 2080 Ti 622 FPS (1.1 ms) 783 FPS (1.1 ms) 197 FPS (4.6 ms) 204 FPS (5.0 ms)
GeForce 3060 M 427 FPS (1.5 ms) 545 FPS (1.6 ms) 110 FPS (8.2 ms) 120 FPS (8.0 ms)
Radeon 6900 XT N/A 603 FPS (1.3 ms) N/A 139 FPS (6.6 ms)
Radeon 6700 XT N/A 362 FPS (2.5 ms) N/A 80 FPS (12.2 ms)
Radeon 6600 N/A 247 FPS (3.8 ms) N/A 52 FPS (19.0 ms)
Intel Arc A770 620 FPS (1.2 ms) 692 FPS (1.1 ms) 149 FPS (6.1 ms) 165 FPS (5.7 ms)

August 13, 2023

05 Hello Tracing

There are only a few ways to begin working with hardware-accelerated Ray Tracing. It’s always possible to use native Vulkan, Direct3D12, or Metal. However, this approach can result in a substantial amount of code where it’s very easy to make mistakes and difficult to maintain. And unfortunately, your applications will not work seamlessly on all platforms without significant refactoring. The Tellusim Core SDK solves this by providing cross-platform and cross-API Ray Tracing support from C++, C, C#, Rust, or Python.

Ray Tracing rendering has a significant performance advantage over rasterization in scenes containing millions or billions of triangles: it’s faster whenever the number of traced rays (pixels on the screen) is smaller than the number of visible triangles in the scene.

Let’s use the scene from the previous tutorial and render it using hardware-accelerated Ray Tracing. There are two ways to achieve this: the first is to use Ray Queries, which can be invoked from Compute or Fragment shaders, and the second way is to use Ray Tracing Pipelines. Tellusim supports both methods, and we will begin with the simpler Ray Queries approach.

We use a more compact naming convention for Ray Tracing objects. Instead of AccelerationStructure, we use the term Tracing. This term encompasses both BLAS (Bottom Level Acceleration Structure) and TLAS (Top Level Acceleration Structure). This approach enables us to create code that is more compact and easier to read/write.

In the first step, we need to ensure that Ray Queries are available from the Fragment shader. To achieve this, we must check the fragmentTracing member value of the Device::Features structure:

// check fragment tracing support
if(!device.getFeatures().fragmentTracing) {
    TS_LOG(Error, "fragment tracing is not supported\n");
    return 0;
}

Optionally, we can utilize Ray Queries from the Compute shader by testing the computeTracing member. There is no difference in performance.

Vertex and Index geometry buffers must be accessible to Ray Tracing shaders. This is crucial because hardware Ray Tracing only provides information about the intersected primitive and the hit distance; all other necessary parameters must be evaluated within the shader. To create the Vertex and Index buffers, we can use the MeshModel interface, which generates buffers according to the specified Pipeline vertex layout. We can reuse the same code used for creating rasterization models, adding the specific MeshModel::Flags required for Ray Tracing:

// create model pipeline
Pipeline model_pipeline = device.createPipeline();
model_pipeline.addAttribute(Pipeline::AttributePosition, FormatRGBf32, 0, offsetof(Vertex, position), sizeof(Vertex));
model_pipeline.addAttribute(Pipeline::AttributeNormal, FormatRGBf32, 0, offsetof(Vertex, normal), sizeof(Vertex));

// create model
MeshModel model;
if(!model.create(device, model_pipeline, mesh, MeshModel::FlagIndices32 | MeshModel::FlagBufferStorage | MeshModel::FlagBufferTracing | MeshModel::FlagBufferAddress)) return 1;

The geometry buffers are ready, so it’s time to create a geometry Tracing:

// create tracing
Tracing tracing = device.createTracing();
tracing.addVertexBuffer(model.getNumGeometryVertices(0), model_pipeline.getAttributeFormat(0), model.getVertexBufferStride(0), vertex_buffer);
tracing.addIndexBuffer(model.getNumIndices(), model.getIndexFormat(), index_buffer);
if(!tracing.create(Tracing::TypeTriangle, Tracing::FlagCompact | Tracing::FlagFastTrace)) return 1;

Geometry Tracing supports multiple geometries. Here, we are using a single geometry from the model, providing information about formats and strides. The optional offset argument is set to zero in our case. However, creating the Tracing interface is not enough: it must also be built by the GPU driver. For this purpose, a temporary build buffer is required, and its minimal size can be obtained using the Tracing::getBuildSize() method. The only requirement for this buffer is the Buffer::FlagScratch flag, and the buffer can be reused in different places. Building a geometry then takes a single line of code. The Device interface can build multiple Tracing objects in one call, but the build buffer must be proportionally larger; otherwise, the Device interface will split the process into multiple commands:

// build tracing
if(!device.buildTracing(tracing, build_buffer, Tracing::FlagCompact)) return 1;

To create the instance Tracing, we need to specify the Tracing::Instance structure and provide an Instance buffer that is large enough to store all of our instances:

// create instances
Tracing::Instance instance;
instance.mask = 0xff;
instance.tracing = &tracing;
Array<Tracing::Instance> instances(grid_size * grid_size * grid_height, instance);

// create instances buffer
Buffer instances_buffer = device.createBuffer(Buffer::FlagStorage | Buffer::FlagTracing, Tracing::InstanceSize * instances.size());
if(!instances_buffer) return 1;

// create instances tracing
Tracing instances_tracing = device.createTracing(instances.size(), instances_buffer);
if(!instances_tracing) return 1;

The final step involves creating a Rendering Pipeline responsible for rendering a full-screen triangle. We require a single Uniform argument, two Storage arguments for the Vertex and Index buffers, and a single argument for the instance Tracing, all of which are accessible from the Fragment shader:

// create pipeline
Pipeline pipeline = device.createPipeline();
pipeline.setUniformMask(0, Shader::MaskFragment);
pipeline.setStorageMasks(0, 2, Shader::MaskFragment);
pipeline.setTracingMask(0, Shader::MaskFragment);
if(!pipeline.loadShaderGLSL(Shader::TypeVertex, "main.shader", "VERTEX_SHADER=1")) return 1;
if(!pipeline.loadShaderGLSL(Shader::TypeFragment, "main.shader", "FRAGMENT_SHADER=1")) return 1;
if(!pipeline.create()) return 1;

The Ray Tracing shader calculates the ray direction from the camera to each pixel on the screen. The Fragment shader discards the output if the ray doesn’t hit any geometry. However, if the ray does intersect geometry, Ray Query objects provide us with detailed intersection information. This information includes instance and triangle indices, triangle barycentric coordinates, and the transformation matrix of the instance. By utilizing these parameters, we can calculate the normal at the intersection point using the Vertex and Index geometry buffers:

layout(location = 0) in vec2 s_texcoord;

layout(row_major, binding = 0) uniform CommonParameters {
    mat4 projection;
    mat4 imodelview;
    vec4 camera;
    float window_width;
    float window_height;
    float time;

layout(std430, binding = 1) readonly buffer VertexBuffer { vec4 vertex_buffer[]; };
layout(std430, binding = 2) readonly buffer IndexBuffer { uint index_buffer[]; };

layout(binding = 0, set = 1) uniform accelerationStructureEXT tracing;

layout(location = 0) out vec4 out_color;

void main() {

    // ray parameters
    float x = (s_texcoord.x * 2.0f - 1.0f + projection[2].x) / projection[0].x;
    float y = (s_texcoord.y * 2.0f - 1.0f + projection[2].y) / projection[1].y;
    vec3 ray_position = (imodelview * vec4(0.0f, 0.0f, 0.0f, 1.0f)).xyz;
    vec3 ray_direction = normalize((imodelview * vec4(x, y, -1.0f, 1.0f)).xyz - ray_position);

    // closest intersection
    rayQueryEXT ray_query;
    rayQueryInitializeEXT(ray_query, tracing, gl_RayFlagsOpaqueEXT, 0xff, ray_position, 0.0f, ray_direction, 1000.0f);
    while(rayQueryProceedEXT(ray_query)) {
        if(rayQueryGetIntersectionTypeEXT(ray_query, false) == gl_RayQueryCandidateIntersectionTriangleEXT) {
            rayQueryConfirmIntersectionEXT(ray_query);
        }
    }
    // check intersection
    [[branch]] if(rayQueryGetIntersectionTypeEXT(ray_query, true) == gl_RayQueryCommittedIntersectionNoneEXT) discard;

    // camera direction
    vec3 direction = -ray_direction;

    // intersection parameters
    uint instance = rayQueryGetIntersectionInstanceIdEXT(ray_query, true);
    uint index = rayQueryGetIntersectionPrimitiveIndexEXT(ray_query, true) * 3u;
    vec2 texcoord = rayQueryGetIntersectionBarycentricsEXT(ray_query, true);
    mat4x3 transform = rayQueryGetIntersectionObjectToWorldEXT(ray_query, true);

    // interpolate normal
    vec3 normal_0 = vertex_buffer[index_buffer[index + 0u] * 2u + 1u].xyz;
    vec3 normal_1 = vertex_buffer[index_buffer[index + 1u] * 2u + 1u].xyz;
    vec3 normal_2 = vertex_buffer[index_buffer[index + 2u] * 2u + 1u].xyz;
    vec3 normal = normal_0 * (1.0f - texcoord.x - texcoord.y) + normal_1 * texcoord.x + normal_2 * texcoord.y;
    normal = normalize(transform[0].xyz * normal.x + transform[1].xyz * normal.y + transform[2].xyz * normal.z);

    // light color
    float diffuse = clamp(dot(direction, normal), 0.0f, 1.0f);
    float specular = pow(clamp(dot(reflect(-direction, normal), direction), 0.0f, 1.0f), 16.0f);

    // instance color
    vec3 color = cos(vec3(0.0f, 0.5f, 1.0f) * 3.14f + float(instance)) * 0.5f + 0.5f;
    float position = window_width * (cos(time) * 0.25f + 0.75f);
    if(gl_FragCoord.x < position) color = vec3(0.75f);

    // output color
    if(abs(gl_FragCoord.x - position) < 1.0f) out_color = vec4(0.0f);
    else out_color = vec4(color, 1.0f) * diffuse + specular;
}

Our application can run on Windows, Linux, and macOS using Vulkan, Direct3D12, or Metal API, and no code modification is required. Tellusim Shader compiler will create shaders for the target API, and the platform abstraction layer will isolate us from low-level code:


Since we use the same scene as in the 04 Hello Raster tutorial, we can compare the performance of Compute rasterization and Ray Tracing:

1600×900 2560×1440
GeForce 3090 Ti 2050 FPS (0.3 ms) 780 FPS (1.1 ms) 1260 FPS (0.6 ms) 400 FPS (2.3 ms)
GeForce 2080 Ti 1290 FPS (0.5 ms) 520 FPS (1.7 ms) 734 FPS (1.1 ms) 270 FPS (3.4 ms)
GeForce 3060 M 720 FPS (0.9 ms) 195 FPS (4.3 ms) 360 FPS (2.3 ms) 100 FPS (9.0 ms)
Radeon 6900 XT 640 FPS (0.8 ms) 667 FPS (1.1 ms) 355 FPS (2.0 ms) 379 FPS (2.0 ms)
Radeon 6700 XT 415 FPS (1.6 ms) 367 FPS (2.1 ms) 217 FPS (3.6 ms) 212 FPS (4.0 ms)
Radeon 6600 310 FPS (2.6 ms) 270 FPS (3.0 ms) 140 FPS (6.0 ms) 150 FPS (6.0 ms)
Intel Arc A770 850 FPS (0.6 ms) 450 FPS (1.7 ms) 500 FPS (1.4 ms) 250 FPS (3.4 ms)
Intel Arc A380 320 FPS (2.6 ms) 130 FPS (7.2 ms) 160 FPS (5.8 ms) 65 FPS (14 ms)
Apple M1 Max 104 FPS 188 FPS (5 ms) 55 FPS 102 FPS (9.5 ms)
Apple M1 30 FPS 50 FPS (19 ms) 14 FPS 26 FPS (36 ms)

In this scenario, where 30% of rays do not hit geometry, Ray Tracing is significantly faster than rasterization. With hardware-accelerated Ray Tracing, you can forget about LODs and use high-polygonal models, achieving excellent frame rates. But there is a price for that:

  • Additional memory is needed for the Tracing interface, and it’s larger than the combined size of all LODs.
  • Each animated character (geometry) requires its own Tracing interface, significantly increasing memory consumption.
  • Animated and dynamic geometry must be rebuilt (updated) after geometry modification, which can reduce performance in dynamic scenes.
  • Alpha-test materials are very expensive for Ray Tracing.

August 12, 2023

04 Hello Raster

July 31, 2023

03 Hello Mesh

July 7, 2023

Scene Import

June 25, 2023

02 Hello Compute

May 15, 2023

01 Hello USDZ

May 14, 2023

00 Hello Triangle

April 24, 2023

WebGPU Update

April 4, 2023

Tellusim Upscaler Demo

February 10, 2023

DLSS 3.1.1 vs DLSS 2.4.0

January 31, 2023

Dispatch, Dispatch, Dispatch

October 28, 2022

Tellusim upscaler

October 14, 2022

Upscale SDK comparison

September 20, 2022

Improved Blue Noise

June 19, 2022

Intel Arc 370M analysis

January 16, 2022

Mesh Shader Emulation

December 16, 2021

Mesh Shader Performance

October 10, 2021

Blue Noise Generator

October 7, 2021

Ray Tracing versus Animation

September 24, 2021

Ray Tracing Performance Comparison

September 13, 2021

Compute versus Hardware

September 9, 2021

MultiDrawIndirect and Metal

September 4, 2021

Mesh Shader versus MultiDrawIndirect

June 30, 2021

Shader Pipeline