January 16, 2022

Mesh Shader Emulation

Mesh shader performance is not where AMD GPUs shine, and there is not much difference in FPS between MDI and Mesh shaders either. But there is a trick that emulates Mesh shader functionality on all GPUs with compute shader support. This technique is faster than MDI and Mesh shaders on almost all AMD GPUs, and it brings an unbelievable (7x) performance boost on Qualcomm GPUs.

Geometry prepared for Mesh shader rendering contains vertex and meshlet data buffers. Each meshlet uses uint8_t vertex indices, which allows loading more primitives with a few memory lookups (one uint32_t[3] block per 4 triangles).

This is the data layout for each meshlet:

uint32_t Number of primitives
uint32_t Number of vertices 
uint32_t Base index (index buffer offset)
uint32_t Base vertex (vertex buffer offset)
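On the CPU side, the same layout and 8-bit index packing can be sketched in C++ (the names are illustrative, not the engine's actual API):

```cpp
#include <cstdint>
#include <vector>

// One meshlet header, matching the four uint32_t fields above.
struct Meshlet {
    uint32_t num_primitives;  // number of triangles
    uint32_t num_vertices;    // number of vertices
    uint32_t base_index;      // offset into the packed index stream
    uint32_t base_vertex;     // offset into the vertex buffer
};

// Unpack 8-bit vertex indices from the packed 32-bit stream:
// every uint32_t[3] block stores 12 uint8_t indices, i.e. 4 triangles.
std::vector<uint8_t> unpack_indices(const std::vector<uint32_t> &data,
    uint32_t base_index, uint32_t num_primitives) {
    std::vector<uint8_t> indices;
    indices.reserve(num_primitives * 3);
    for(uint32_t i = 0; i < num_primitives * 3; i++) {
        uint32_t value = data[base_index + i / 4];
        indices.push_back((value >> ((i % 4) * 8)) & 0xffu);
    }
    return indices;
}
```

Each shader thread in the emulation below consumes one such uint32_t[3] block, which is why a thread handles four triangles.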

The Mesh shader itself is straightforward: load the meshlet info, fetch the vertices and indices, and submit them.

But how do we render meshlets without Mesh shader support? We will use a single draw call with the help of a simple compute shader. This shader transforms the meshlet indices into packed uint32_t triangle indices that contain the meshlet index and the vertex index. The shader is trivial:

layout(local_size_x = GROUP_SIZE) in;

// base_group and num_meshlets come from a uniform block (omitted here)
layout(std430, binding = 0) readonly buffer meshlets_buffer { uint meshlets_data[]; };
layout(std430, binding = 1) writeonly buffer indexing_buffer { uvec4 indexing_data[]; };

shared uint num_primitives;
shared uint base_index;

void main() {
  uint group_id = base_group + gl_WorkGroupID.x;
  uint local_id = gl_LocalInvocationIndex;
  // the first thread loads the meshlet parameters
  [[branch]] if(local_id == 0u) {
    uint meshlet_index = (group_id % num_meshlets) * 4u;
    num_primitives = meshlets_data[meshlet_index + 0u];
    base_index = meshlets_data[meshlet_index + 2u];
  }
  memoryBarrierShared(); barrier();
  uint indices_0 = 0u;
  uint indices_1 = 0u;
  uint indices_2 = 0u;
  // each thread reads three uint32_t values (four packed triangles)
  [[branch]] if(local_id * 4u < num_primitives) {
    uint address = base_index + local_id * 3u;
    indices_0 = meshlets_data[address + 0u];
    indices_1 = meshlets_data[address + 1u];
    indices_2 = meshlets_data[address + 2u];
  }
  // combine the meshlet index (upper bits) with the unpacked 8-bit vertex indices
  uint group_index = group_id << 8u;
  uint index = (GROUP_SIZE * group_id + local_id) * 3u;
  indexing_data[index + 0u] = (uvec4(indices_0, indices_0 >> 8u, indices_0 >> 16u, indices_0 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 1u] = (uvec4(indices_1, indices_1 >> 8u, indices_1 >> 16u, indices_1 >> 24u) & 0xffu) | group_index;
  indexing_data[index + 2u] = (uvec4(indices_2, indices_2 >> 8u, indices_2 >> 16u, indices_2 >> 24u) & 0xffu) | group_index;
}

After that compute pass, only a single drawElements() call is required to render all meshlets. The read and write memory traffic is about 11 MB per 1M triangles. It's not possible to use 16-bit indices because we could draw only 1024 meshlets per draw call in that case. A TriangleIndex built-in shader variable would make that possible, but we don't have one.

The vertex shader recovers the meshlet index and the meshlet-local vertex index with simple math on the VertexIndex built-in variable:

uint meshlet = gl_VertexIndex >> 8u;
uint vertex = gl_VertexIndex & 0xffu;
// load vertex data from SSBO
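The packing and the vertex-shader unpacking can be verified with a trivial CPU equivalent (a sketch; the function names are ours):

```cpp
#include <cstdint>

// Pack a meshlet index and a meshlet-local vertex index into one
// 32-bit index, mirroring the compute shader output above.
uint32_t pack_index(uint32_t meshlet, uint32_t vertex) {
    return (meshlet << 8u) | (vertex & 0xffu);
}

// The vertex shader side: recover both values with a shift and a mask.
uint32_t meshlet_of(uint32_t index) { return index >> 8u; }
uint32_t vertex_of(uint32_t index) { return index & 0xffu; }
```

With 32-bit indices, 24 bits remain for the meshlet index, so a single draw call can address up to 16M meshlets.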

There is no problem adding a per-triangle visibility culling test to that shader. It will even improve performance because back-facing and invisible triangles will not consume bandwidth.
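For the back-face part of such a test, the screen-space signed area is enough; a hedged CPU sketch (the winding convention here is an assumption):

```cpp
#include <array>

// Signed doubled area of a screen-space triangle: a non-negative
// determinant means the triangle is back-facing (or degenerate)
// under the winding convention assumed here, so its indices can be
// skipped before they are written to the index buffer.
bool is_back_facing(std::array<float, 2> p0, std::array<float, 2> p1,
    std::array<float, 2> p2) {
    float det = (p2[0] - p0[0]) * (p1[1] - p0[1])
              - (p2[1] - p0[1]) * (p1[0] - p0[0]);
    return det >= 0.0f;
}
```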

Let's check the results across different configurations, on the platforms where the emulation is faster:

                                 Single DIP   Emulation   Mesh Shader   MDI/ICB/Loop
Radeon 6900 XT 64/126            17.2 B       9.6 B       9.2 B         4.1 B
Radeon 6900 XT 128/212           17.2 B       10.1 B      9.2 B         8.5 B
Radeon 6700 XT 64/126            14.1 B       7.7 B       4.6 B         4.1 B
Radeon 6700 XT 128/212           14.1 B       7.9 B       4.6 B         8.3 B
Radeon 5600 M 64/126             5.0 B        2.9 B       n/a           1.4 B
Radeon 5600 M 128/212            5.0 B        2.9 B       n/a           2.8 B
Radeon Vega 56 64/126 (macOS)    4.1 B        2.7 B       n/a           860 M
Radeon Vega 56 128/212 (macOS)   4.1 B        2.9 B       n/a           1.8 B
Adreno 660 64/126                583 M        265 M       n/a           36.6 M
Adreno 660 128/212               596 M        309 M       n/a           74.7 M
Mali-G78 MP20 64/126             176 M        153 M       n/a           83 M
Mali-G78 MP20 128/212            187 M        123 M       n/a           134 M

The results are in billions (B) and millions (M) of processed triangles per second. MDI – MultiDrawIndirect, ICB – Indirect Command Buffer, Loop – a CPU loop of multiple drawIndirect() commands.

If you need to draw a lot of small DIPs, it's better to pack everything into a single draw call on AMD and Qualcomm. A single DIP works better than Mesh shaders or MDI even with the additional index-buffer generation overhead.

Windows and Linux binaries: TellusimDrawMeshlet.zip

December 16, 2021

Mesh Shader Performance

We have some new questionable results of Mesh Shader and MultiDrawIndirect performance on AMD GPUs. Let's look at the situation when we render geometry as independent triangles without an index buffer. The VS Draw Arrays column represents this rendering mode, where the number of Vertex Shader invocations equals the number of primitives times 3. Technically, this mode does the same as the Mesh Shader rendering mode, but without generating indices. That's why it's surprising to see that it's faster than MultiDrawIndirect and Mesh Shader on the 6700 XT. So if you are optimizing geometry for the vertex cache or generating the best possible meshlets, it makes no sense to do so for AMD GPUs. Other GPUs can share Vertex Shader output between adjacent triangles.

                      VS Draw Elements   VS Draw Arrays   MDI      MS       CS
Radeon 5600 M         5.0 B              1.5 B            1.1 B    n/a      8.3 B
Radeon 6700 XT        14.5 B             4.8 B            4.1 B    4.6 B    19.5 B
Radeon 6900 XT        17.6 B             7.2 B            4.1 B    9.1 B    34.5 B
GeForce RTX 2080 Ti   12.3 B             5.6 B            12.5 B   13.3 B   18.3 B
GeForce RTX 3090      14.3 B             5.9 B            14.6 B   20.7 B   28.8 B
Intel DG1             1.3 B              227 M            1.1 B    n/a      2.5 B
Apple M1              1.4 B              556 M            930 M    n/a      2.5 B

Compute versus Hardware
Mesh Shader versus MultiDrawIndirect

October 10, 2021

Blue Noise Generator

Blue noise has unique properties that make it best for dithering, spatial generation, and more. You can find more details about blue noise in these great articles: momentsingraphics.de, blog.demofox.org.

Performance and precision are the two main drawbacks of all blue noise generators. A 1024×1024 image requires hours or even days to generate, which is not acceptable. Moreover, 8-bit precision cannot represent a large noise image (more than 16×16) with perfect spatial accuracy because multiple pixels will share the same value. We made a GPU blue noise generator that uses a fast Fourier transform and provides blazing-fast generation. A 1024×1024 image is generated in less than 6 minutes on a GeForce 2080 Ti class GPU. The output noise image can be in 16-bit integer or 32-bit floating-point precision.
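The 16×16 limit follows from the pigeonhole principle: an n-bit image holds only 2^n distinct values, so at 8 bits only a 16×16 image can rank every pixel uniquely. A small sketch of that bound:

```cpp
#include <cstdint>

// Largest power-of-two image side whose pixels can all receive a
// unique value at the given bit depth (side * side <= 2^bits).
constexpr uint32_t max_unique_side(uint32_t bits) {
    uint64_t values = 1ull << bits;  // distinct pixel values
    uint32_t side = 1u;
    while(uint64_t(side * 2u) * uint64_t(side * 2u) <= values) side *= 2u;
    return side;
}
```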

The binaries of our blue noise generator for Windows and Linux (Vulkan) are available for everyone. But be careful: our Radeon Vega 7 died during 4K texture generation.

Download link: TellusimClayNoise.zip

The application has simple command line parameters:

Clay Blue Noise Image Generator (https://tellusim.com/)
Usage: ./noise -o noise.png
  -i  Input image
  -o  Output image
  -f  Forward image
  -h  Histogram output
  -bits  Image bits (8)
  -size  Image size (256)
  -width  Image width (256)
  -height  Image height (256)
  -seed  Random seed (random)
  -init  Initial pixels (10%)
  -sigma  Gaussian sigma (2.0)
  -epsilon  Quadratic epsilon (0.01)
  -device  Computation device index

This command makes a unique blue noise image with 16-bit precision in less than a minute. The Fourier transform of the noise can be obtained with the -f command line parameter. The histogram output is a text file. The initial random pixels can also be specified as an input image.

clay_noise.exe -size 512 -bits 16 -o noise_512x512_16bit.png

The sigma parameter changes the blur radius during generation. You can see the results of different sigma values in this WebGL application:


October 7, 2021

Ray Tracing versus Animation

There are no answers from vendors about acceleration structure performance, and no actual numbers on how dynamic geometry affects ray tracing performance; there are only recommendations on how to use acceleration structure flags. We had static performance numbers in our previous post. Now it's time to focus on acceleration structure builders.

We will draw animated characters. Because all characters are independent, each requires its own bottom-level acceleration structure (BLAS). Each pixel will cast two rays: one for the scene intersection and one for the shadow intersection. The resolution is 1600×900.

Acceleration structure builders have many configuration parameters, which is why there is a lot of data in the tables. We also split Vulkan and Direct3D12 results into separate rows.

  • FT (RT) – BLAS build time for all instances with the Fast Trace flag, and the RQ scene tracing time (in round brackets).
  • FB (RT) – The same with the Fast Build flag.
  • CS (RT) – Compute shader BLAS build time (CS scene tracing time).
  • FTU (RT) – BLAS update time for all instances with the Fast Trace flag (RQ scene tracing time).
  • FBU (RT) – BLAS update time with the Fast Build flag (RQ scene tracing time).
  • CSU (RT) – Compute shader BLAS update time (CS scene tracing time).
  • BLAS (Scratch) – Memory required for the BLAS buffers (scratch buffer size).

Here are the results with 81 instances of a 52K triangle model. The overall number of animated vertices/triangles is 2.9M/4.2M, and the number of joints is 4212:

FT (RT) FB (RT) CS (RT) FTU (RT) FBU (RT) CSU (RT) BLAS (Scratch)
GeForce 2080 Ti (D3D12) 16.9 ms (0.6 ms) 15.7 ms (0.6 ms) 9.1 ms (5.3 ms) 3.7 ms (0.8 ms) 1.2 ms (0.9 ms) 2.1 ms (7.7 ms) 255 MB (16 MB)
GeForce 2080 Ti (VK) 17.1 ms (0.6 ms) 15.1 ms (0.7 ms) 8.7 ms (3.4 ms) 3.8 ms (0.8 ms) 1.5 ms (1.0 ms) 1.8 ms (4.7 ms) 255 MB (16 MB)
Radeon 6700 XT (D3D12) 30.2 ms (1.2 ms) 30.2 ms (1.2 ms) 8.7 ms (4.8 ms) 4.6 ms (1.3 ms) 4.6 ms (1.3 ms) 3.0 ms (6.1 ms) 656 MB (872 MB)
Radeon 6700 XT (VK) 223.0 ms (1.4 ms) 73.0 ms (1.5 ms) 8.9 ms (4.9 ms) 4.6 ms (1.6 ms) 4.6 ms (1.6 ms) 3.2 ms (6.2 ms) 656 MB (872 MB)
Radeon Vega 56 (macOS) 185.5 ms (8.4 ms) 185.6 ms (8.4 ms) 16.8 ms (8.1 ms) 21.5 ms (11.9 ms) 21.7 ms (12.1 ms) 4.54 ms (11.2 ms) 355 MB (355 MB)
Apple M1 (macOS) 394.9 ms (29.2 ms) 395.2 ms (29.0 ms) 74.5 ms (42.6 ms) 29.8 ms (34.5 ms) 29.9 ms (34.8 ms) 18.1 ms (49.2 ms) 355 MB (355 MB)

Here is what we have:

  • It's impossible to build the BLAS for animated characters every frame; the BLAS update mode must be used. Otherwise, the BLAS builder will significantly reduce the FPS.
  • The AMD Vulkan BLAS builder is much slower than the Direct3D12 builder. The best AMD BLAS build time is twice as long as Nvidia's.
  • Ray queries on the Apple M1 are twice as fast as the CS solution, which is interesting considering there was no difference in our previous test.
  • There is no coherent buffer memory in the Metal shading language, which is why we perform the BVH update step using atomics. That affects performance negatively.
  • AMD has enormous memory consumption for the BLAS and scratch buffers.
  • Our GPU BVH builder outperforms any available implementation in full rebuild mode.

Now let's find the ray tracing limits by increasing the number of characters. This time we will use a simplified model of 1500 triangles. The overall number of characters is 2401, the number of triangles in the scene is 3.6M, and the number of joints is 125K. It's getting difficult for the CPU to animate all characters independently, so we reuse the joint transformations across multiple instances.

FT (RT) FB (RT) CS (RT) FTU (RT) FBU (RT) CSU (RT) BLAS (Scratch)
GeForce 2080 Ti (D3D12) 15.8 ms (0.6 ms) 14.5 ms (0.6 ms) 8.3 ms (6.2 ms) 7.0 ms (0.8 ms) 1.5 ms (0.9 ms) 1.6 ms (6.7 ms) 224 MB (15 MB)
GeForce 2080 Ti (VK) 16.3 ms (0.6 ms) 14.6 ms (0.7 ms) 8.1 ms (4.5 ms) 4.8 ms (0.8 ms) 1.1 ms (1.0 ms) 1.5 ms (4.6 ms) 224 MB (15 MB)
Radeon 6700 XT (D3D12) 55.9 ms (1.5 ms) 55.8 ms (1.5 ms) 7.5 ms (6.0 ms) 3.0 ms (1.5 ms) 3.0 ms (1.5 ms) 2.4 ms (6.7 ms) 559 MB (742 MB)
Radeon 6700 XT (VK) 2000 ms (5.2 ms) 1070 ms (5.3 ms) 7.8 ms (6.3 ms) 18.5 ms (1.7 ms) 18.7 ms (1.8 ms) 2.6 ms (7.0 ms) 559 MB (742 MB)
Radeon Vega 56 (macOS) 1840 ms (9.8 ms) 1800 ms (9.7 ms) 14.7 ms (10.3 ms) 297.1 ms (10.5 ms) 297.1 ms (10.1 ms) 3.6 ms (11.6 ms) 304 MB (325 MB)
Apple M1 (macOS) 1600 ms (33.1 ms) 1600 ms (32.8 ms) 63.4 ms (58.3 ms) 271.1 ms (36.3 ms) 273.1 ms (36.9 ms) 14.9 ms (62.7 ms) 304 MB (325 MB)

  • The number of instances doesn't affect the Nvidia and Tellusim BVH builder times; only the number of triangles does.
  • There is no mistake in the AMD Vulkan and Metal timings: the BLAS build time is measured in seconds.
  • The Vulkan API performs all BLAS builds/updates in a single API call, but AMD executes the builds serially on Vulkan. The Metal API does the same.
  • The Nvidia GPU is only 50-70% utilized on Windows in the second test, but it works at full speed on Linux.

Here are the videos of these two tests, captured in BLAS update mode; otherwise, the CS solution is faster than HW ray tracing.

How about 10000 BLAS instances? That is 520K joints and 15M triangles. The CPU cannot animate half a million joints well, so we will stop the animation. Nvidia requires a 1 GB BLAS buffer for all instances. The BLAS buffer for AMD is 2.3 GB, and the scratch buffer requires an additional 3 GB of memory. We are not counting pre-transformed vertices (600 MB in the case of 48 bytes per vertex). The BLAS builder will work in update mode. The Nvidia 2080 Ti provides 43 FPS in 2K video mode due to underutilization, but the overall timing is already more than 10 ms, so it's no better than 100 FPS. The Radeon 6700 XT shows 49 FPS with 100% utilization.

Each animated, morphed, or deformed object requires memory for its vertex and BLAS data. Sometimes that is a lot of data and a lot of memory bandwidth, and it affects performance. By contrast, rasterization doesn't require anything except the original geometry and joint transformations, and all transformed data goes directly to the rasterizer. The Tellusim engine is fully GPU-driven, including animation blend trees. The video with 10K animated characters runs at 135 FPS on an Nvidia 2080 Ti. The performance on a Radeon 6700 XT is more than 120 FPS. It works at 20 FPS, with animation, on a mobile Apple M1. Each character has a unique animation and casts a global shadow as well.

How to make a 10K animated characters scene with the Tellusim Engine:

// load object
ObjectMesh object(scene);
if(!object.load("object.glb", &material, ObjectMesh::FlagTriSkiBasTex | ObjectMesh::FlagAnimation)) return 1;

// create nodes (step is the grid spacing between characters)
uint32_t size = 100;
Array<NodeObject> nodes(size * size);
for(uint32_t i = 0; i < nodes.size(); i++) {
  nodes[i] = NodeObject(graph, object);
  nodes[i].setGlobalTransform(Matrix4x3d::translate((i % size) * step, (i / size) * step, 0.0));
}

// somewhere in the update loop
Random random(0);
for(NodeObject node : nodes) {
  ObjectFrame frame;
  uint32_t index = random.geti32(0, object.getNumAnimations() - 1);
  frame.append(ObjectFrame(object.getAnimation(index), time + random.getf32(0.0f, 32.0f)));
}

That's all that's required. The CPU only does some random number generation; everything else is delegated to the GPU.

Tellusim Animation Demo: TellusimAnimation.zip

September 24, 2021

Ray Tracing Performance Comparison

All modern APIs support ray tracing now. It's available on Windows (D3D12 and VK), Linux (VK), and macOS (Metal). There is no difference between D3D12 and VK ray tracing. The Metal API only supports ray queries, which are also available in D3D12 and VK. The main ray tracing concept is to create a BLAS for each geometry and combine all geometries inside a single TLAS. The driver is responsible for everything under the hood.

With a simple modification, our test app can work in ray tracing (RT), ray query (RQ), and compute shader (CS) rendering modes. We are going to render the same 81 instances of 490K triangles, but this time there will be no rasterization at all. Each pixel will always trace primary, shadow, and reflection rays.

The naive compute shader mode uses our generic GPU BVH builder with single-triangle-per-leaf partitioning. Each BVH node consumes 12 bytes, and the whole model requires a 45 MB BVH buffer. For rasterization, this model requires 13 MB for a 32-byte vertex buffer and a 32-bit index buffer.

Let’s take a look at the 1600×900 resolution results first:

RT D3D12 RT VK RQ D3D12 RQ VK CS BLAS (compacted) Build
GeForce 3080 0.58 ms 0.54 ms 0.60 ms 0.55 ms 4.92 ms 33 MB (15 MB) 7 ms (18 ms)
GeForce 2080 Ti 1.06 ms 1.03 ms 0.95 ms 0.97 ms 7.35 ms 33 MB (15 MB) 7 ms (18 ms)
GeForce 1060 M 34.59 ms 33.81 ms 43 MB (37 MB) 7 ms (18 ms)
Radeon 6700 XT 2.50 ms 2.55 ms 1.71 ms 2.06 ms 6.13 ms 76 MB (65 MB) 15 ms (> 500 ms)
Radeon Vega 56 (macOS) 15.84 ms 15.48 ms 82 MB 30 ms
Apple M1 (macOS) 48.8 ms 46.4 ms 82 MB 100 ms
Apple A14 (iOS) 98.0 ms 94.3 ms 82 MB
Adreno 660 (Android) 431 ms
  • Nvidia RTX series GPUs deliver the best performance, and ray queries are slightly faster. The great thing is that the compacted BLAS is only 15 MB, just a few MB larger than the model representation for rasterization. Compute shader ray tracing is 7.7 times slower than the HW-accelerated path.
  • AMD has a 50% difference between ray query (D3D12) and ray tracing (VK). It looks like the driver is only optimized for D3D12 ray queries. HW ray tracing is 3.5 times faster than the compute shader implementation. But the main problem is the size of the BLAS, which is more than four times bigger than it needs to be. Moreover, the first BLAS generation is painfully slow even with the fast build flag enabled.
  • Interesting fact: compute shader ray tracing is faster on AMD than on Nvidia.
  • There is no HW ray tracing on Metal, and the current ray query implementation is slower than a naive compute shader. The BLAS also requires twice as much memory. Some triangles do not intersect, so the model has holes during rendering on all hardware.
  • A passively cooled Apple M1 is only 33% slower than a GeForce 1060M GPU.
  • Vertex cache optimization also improves ray tracing performance.

Now, what about native 4K?

RT D3D12 RT VK RQ D3D12 RQ VK CS
GeForce 3080 3.16 ms 3.13 ms 3.25 ms 3.13 ms 21.00 ms
GeForce 2080 Ti 4.78 ms 4.77 ms 4.92 ms 4.86 ms 32.21 ms
Radeon 6700 XT 12.89 ms 13.35 ms 8.48 ms 9.77 ms 27.94 ms
  • AMD shows a 65% difference between ray query and the ray tracing pipeline. The choice for AMD is obvious.
  • It's better to use the ray tracing pipeline than ray queries on Nvidia, but the difference is only 3%.
  • The best Radeon 6700 XT ray query result is 70% slower than the GeForce 2080 Ti. The biggest gap is 280%.

It's interesting to see how an AMD GPU with more ray tracing cores would perform. The main issue for AMD is the acceleration structure size and its build time, especially for a dynamic TLAS with thousands of instances.


This is an image of compute shader ray tracing alongside a heat map image. The red channel represents the number of BLAS BVH intersection steps, scaled by 128. The main bottleneck for compute shader ray tracing is the ray-BVH intersection, which cannot be optimized well due to divergence and scattered memory access. By contrast, volume-BVH intersection performs great in a compute shader. But, unfortunately, we cannot reuse the HW (API) ray tracing acceleration structures for different purposes.

September 13, 2021

Compute versus Hardware

It was tempting to compare Compute Shader rasterization with Mesh Shader and MDI on all supported platforms. Unreal 5 Nanite is already demonstrating outstanding performance with small-triangle rendering. But what about numbers from tile-based architectures? And how good is a dedicated rasterizer?

Let's use our old Mesh Shader versus MDI test with 498990 64×128 meshlets, without any culling except back-face. The data prepared for meshlet rendering fits compute-based rasterization well: all vertices and indices are independent and tightly packed, so it's possible to spawn a group per meshlet and rasterize a triangle per thread. We will limit software rasterization to depth-only mode because 64-bit atomics are not available on mobile devices and Metal; it will not make a big difference in performance. Metal also does not support atomic operations on textures, but there is a way to create a texture from buffer data and use buffer atomic operations.

We compared the performance of a single DIP with 32-bit indices, Mesh Shader, MultiDrawIndirect, and Compute Shader rasterization on Nvidia, AMD, Intel, Qualcomm, and Apple GPUs.

And here are the results:

Single DIP Mesh Shader MDI/ICB/Loop Compute Shader
GeForce 2080 Ti 12.05 B 12.57 B 12.63 B 17.26 B
GeForce 1060 M 3.86 B 3.90 B 4.55 B
Radeon 6700 XT 14.73 B 4.38 B 3.63 B 16.74 B
Radeon RX 5600M 4.87 B 1.11 B 7.57 B
Radeon Vega 56 (macOS) 2.40 B 796.0 M 3.17 B
Apple M1 (macOS) 1.37 B 739.0 M 2.30 B
Apple A14 (iOS) 666.1 M 475.0 M 1.02 B
Intel UHD Graphics 680.0 M 396.5 M 556.1 M
Adreno 660 (Android) 565.2 M 31.17 M 497.3 M

The table shows the number of processed triangles per second (in millions and billions).


  • The era of GPU fixed-function units and an enormous number of shader types is almost over.
  • MultiDrawIndirect doesn’t work well on mobile because of the tile-based rendering.
  • Even the best Mesh Shader / MDI results are slower than compute-based rasterization.
  • Single shader type is better than 14 dedicated shader types.

What we need are compute shaders and the ability to spawn threads from shaders effectively. Everything else (including raytracing) can be easily implemented on Compute Shader level. Ten years ago, the performance of Intel Larrabee wasn’t enough for compute-only mode, but now it’s possible to discard all other shaders.

A compute shader extension allowing payloads to be written atomically into an image would change everything. Without it, we are limited to 32-bit payload data and must perform a redundant triangle intersection.

uint imageAtomicPayloadMax(gimage2D atomic_image, gimage2D payload_image,
  ivec2 P,
  uint atomic_data, gvec4 payload_data);
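Until such an extension exists, a common workaround on hardware with 64-bit atomics is to pack the depth into the high 32 bits and the payload into the low 32 bits, so a single atomic max resolves both. This is a generic CPU sketch of the idea, not our shader code; it relies on the fact that the bit pattern of a non-negative float compares in the same order as the float itself:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Pack depth (high 32 bits) and payload (low 32 bits) so that a
// single 64-bit atomic max keeps the payload of the winning fragment.
uint64_t pack_fragment(float depth, uint32_t payload) {
    uint32_t bits;
    std::memcpy(&bits, &depth, sizeof(bits));  // floatBitsToUint equivalent
    return (uint64_t(bits) << 32) | payload;
}

void write_fragment(std::atomic<uint64_t> &pixel, float depth, uint32_t payload) {
    uint64_t value = pack_fragment(depth, payload);
    uint64_t current = pixel.load();
    // atomic max via a compare-exchange loop
    while(value > current && !pixel.compare_exchange_weak(current, value)) {}
}
```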

Here is the unoptimized compute rasterization shader that was used in all tests on all platforms and APIs without any modification, thanks to our Clay shader compiler:

layout(local_size_x = GROUP_SIZE) in;

layout(row_major, binding = 0) uniform common_parameters {
  mat4 projection;
  mat4 modelview;
  vec4 camera;
};

layout(std140, binding = 1) uniform compute_parameters {
  uint num_meshlets;
  uint group_offset;
  vec2 surface_size;
  float surface_stride;
};

layout(std140, binding = 2) uniform transform_parameters {
  vec4 transforms[NUM_INSTANCES * 3u];
};

layout(std430, binding = 3) readonly buffer vertices_buffer { vec4 vertices_data[]; };
layout(std430, binding = 4) readonly buffer meshlets_buffer { uint meshlets_data[]; };
// Metal uses a buffer for the output surface, other APIs use a storage image
#if CLAY_MTL
  layout(std430, binding = 5) buffer surface_buffer { uint out_surface[]; };
#else
  layout(binding = 0, set = 1, r32ui) uniform uimage2D out_surface;
#endif

shared uint num_primitives;
shared uint num_vertices;
shared uint base_index;
shared uint base_vertex;

shared vec4 row_0;
shared vec4 row_1;
shared vec4 row_2;

shared vec3 positions[NUM_VERTICES];
shared uint indices[NUM_PRIMITIVES * 3u];

void rasterize(vec3 p0, vec3 p1, vec3 p2) {
  [[branch]] if(p0.z < 0.0f || p1.z < 0.0f || p2.z < 0.0f) return;
  vec3 p10 = p1 - p0;
  vec3 p20 = p2 - p0;
  // back-face culling
  float det = p20.x * p10.y - p20.y * p10.x;
  [[branch]] if(det >= 0.0f) return;
  // triangle bounds
  vec2 min_p = floor(min(min(p0.xy, p1.xy), p2.xy));
  vec2 max_p = ceil(max(max(p0.xy, p1.xy), p2.xy));
  [[branch]] if(max_p.x < 0.0f || max_p.y < 0.0f || min_p.x >= surface_size.x || min_p.y >= surface_size.y) return;
  min_p = clamp(min_p, vec2(0.0f), surface_size - 1.0f);
  max_p = clamp(max_p, vec2(0.0f), surface_size - 1.0f);
  // barycentric coordinate steps
  vec2 texcoord_dx = vec2(-p20.y, p10.y) / det;
  vec2 texcoord_dy = vec2(p20.x, -p10.x) / det;
  vec2 texcoord_x = texcoord_dx * (min_p.x - p0.x);
  vec2 texcoord_y = texcoord_dy * (min_p.y - p0.y);
  for(float y = min_p.y; y <= max_p.y; y += 1.0f) {
    vec2 texcoord = texcoord_x + texcoord_y;
    for(float x = min_p.x; x <= max_p.x; x += 1.0f) {
      if(texcoord.x >= 0.0f && texcoord.y >= 0.0f && texcoord.x + texcoord.y <= 1.0f) {
        float z = p10.z * texcoord.x + p20.z * texcoord.y + p0.z;
        #if CLAY_MTL
          uint index = uint(surface_stride * y + x);
          atomicMax(out_surface[index], floatBitsToUint(z));
        #else
          imageAtomicMax(out_surface, ivec2(vec2(x, y)), floatBitsToUint(z));
        #endif
      }
      texcoord += texcoord_dx;
    }
    texcoord_y += texcoord_dy;
  }
}

void main() {
  uint group_id = gl_WorkGroupID.x + group_offset;
  uint local_id = gl_LocalInvocationIndex;
  // meshlet parameters
  [[branch]] if(local_id == 0u) {
    uint transform_index = (group_id / num_meshlets) * 3u;
    row_0 = transforms[transform_index + 0u];
    row_1 = transforms[transform_index + 1u];
    row_2 = transforms[transform_index + 2u];
    uint meshlet_index = (group_id % num_meshlets) * 4u;
    num_primitives = meshlets_data[meshlet_index + 0u];
    num_vertices = meshlets_data[meshlet_index + 1u];
    base_index = meshlets_data[meshlet_index + 2u];
    base_vertex = meshlets_data[meshlet_index + 3u];
  }
  memoryBarrierShared(); barrier();
  // load vertices
  [[unroll]] for(uint i = 0u; i < NUM_VERTICES; i += GROUP_SIZE) {
    uint index = local_id + i;
    [[branch]] if(index < num_vertices) {
      uint address = (base_vertex + index) * 2u;
      vec4 position = vec4(vertices_data[address].xyz, 1.0f);
      position = vec4(dot(row_0, position), dot(row_1, position), dot(row_2, position), 1.0f);
      position = projection * (modelview * position);
      positions[index] = vec3(round((position.xy * (0.5f / position.w) + 0.5f) * surface_size * 256.0f) / 256.0f - 0.5f, position.z / position.w);
    }
  }
  // load indices (each thread unpacks four packed triangles)
  [[loop]] for(uint i = local_id; (i << 2u) < num_primitives; i += GROUP_SIZE) {
    uint index = i * 12u;
    uint address = base_index + i * 3u;
    uint indices_0 = meshlets_data[address + 0u];
    uint indices_1 = meshlets_data[address + 1u];
    uint indices_2 = meshlets_data[address + 2u];
    indices[index +  0u] = (indices_0 >>  0u) & 0xffu;
    indices[index +  1u] = (indices_0 >>  8u) & 0xffu;
    indices[index +  2u] = (indices_0 >> 16u) & 0xffu;
    indices[index +  3u] = (indices_0 >> 24u) & 0xffu;
    indices[index +  4u] = (indices_1 >>  0u) & 0xffu;
    indices[index +  5u] = (indices_1 >>  8u) & 0xffu;
    indices[index +  6u] = (indices_1 >> 16u) & 0xffu;
    indices[index +  7u] = (indices_1 >> 24u) & 0xffu;
    indices[index +  8u] = (indices_2 >>  0u) & 0xffu;
    indices[index +  9u] = (indices_2 >>  8u) & 0xffu;
    indices[index + 10u] = (indices_2 >> 16u) & 0xffu;
    indices[index + 11u] = (indices_2 >> 24u) & 0xffu;
  }
  memoryBarrierShared(); barrier();
  // rasterize triangles
  [[branch]] if(local_id < num_primitives) {
    uint index = local_id * 3u;
    uint index_0 = indices[index + 0u];
    uint index_1 = indices[index + 1u];
    uint index_2 = indices[index + 2u];
    rasterize(positions[index_0], positions[index_1], positions[index_2]);
  }
}

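One detail worth noting in the shader above is the vertex snapping: screen-space positions are rounded to 1/256 of a pixel (8-bit subpixel precision) before rasterization. The equivalent scalar operation:

```cpp
#include <cmath>

// Snap a screen-space coordinate to 8-bit subpixel precision
// (1/256 of a pixel), as the compute rasterizer above does.
float snap_subpixel(float x) {
    return std::round(x * 256.0f) / 256.0f;
}
```

A common reason for such snapping is to keep the edge arithmetic consistent between adjacent triangles.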
Reproduction binaries for Windows: TellusimDrawMeshlet.zip

September 9, 2021

MultiDrawIndirect and Metal

The most significant omission in the Apple Metal API is the absence of MultiDrawIndirect functionality. MDI is the most compatible way of rendering for GPU-driven technology. There are several ways to emulate MDI on macOS and iOS.

The simplest way is a loop of drawIndexedPrimitives() commands with different offsets into the indirect buffer. MultiDrawIndirectCount will require a CPU-GPU synchronization to get the count value from GPU memory. This may not be optimal, but it works and can even outperform a single MDI call on some hardware because of insufficient driver optimizations.

for(uint32_t i = 0; i < num_draws; i++) {
  [encoder drawIndexedPrimitives:... indirectBufferOffset:offset];
  offset += stride;
}

The official Metal way is to use an Indirect Command Buffer (ICB) and encode rendering commands on the CPU or GPU. All textures and samplers must be passed as argument buffer parameters even if they do not change during rendering. The Metal shading language has a built-in draw_indexed_primitives() function that corresponds 1:1 to its Metal API analog. That sounds good except for a minor limitation of 16384 drawing commands.

In theory, a single call to executeCommandsInBuffer() should easily outperform the loop of drawIndexedPrimitives(). But let's run some tests. We will use the same applications as in our Mesh Shader performance comparison.

Metal Loop Metal ICB
Apple M1 4/2 20 M 48 M
Apple M1 64/84 698 M 794 M
Apple M1 64/126 740 M 840 M
Apple M1 96/169 1.22 B 933 M
Apple M1 128/212 1.49 B 1.05 B
Apple M1 32-bit indices 1.36 B
Radeon Vega 56 4/2 24 M 16 M
Radeon Vega 56 64/84 1.13 B 647 M
Radeon Vega 56 64/126 1.20 B 687 M
Radeon Vega 56 96/169 1.84 B 1.05 B
Radeon Vega 56 128/212 2.54 B 1.44 B
Radeon Vega 56 32-bit indices 3.9 B
  • The ICB performs better than the loop when the number of primitives per draw call is small on the Apple GPU. But the ICB is 1.5 times slower when DIPs contain more than 200 primitives, which is weird behavior for the ICB.
  • The ICB is always almost two times slower than the loop of indirect commands on the AMD GPU. The exciting thing is that the power-hungry Radeon Vega 56 runs at approximately the same performance as the integrated Apple M1.

But how about running tests with the same HW under Direct3D12 and Vulkan:

Metal Loop Metal ICB D3D12 MDI VK MDI
Radeon Vega 56 4/2 24 M 16 M 34 M 34 M
Radeon Vega 56 64/84 1.13 B 647 M 1.34 B 1.33 B
Radeon Vega 56 64/126 1.20 B 687 M 1.41 B 1.40 B
Radeon Vega 56 96/169 1.84 B 1.05 B 2.13 B 2.11 B
Radeon Vega 56 128/212 2.54 B 1.44 B 2.91 B 2.84 B
  • The loop of indirect commands is almost as fast as a single MDI API call, but the cost is a huge CPU load. The ICB and MDI solutions do not load the CPU at all.
  • All AMD GPUs have native MDI functionality available even in D3D11, but not in Metal. It's understandable that AMD GPUs are not the main target for Apple.

The current version of the GravityMark benchmark uses the loop approach for Metal rendering. This dramatically reduces the performance of the same AMD GPU:


ICB is not widely used functionality, and it is very difficult to debug. Metal shader validation is still not compatible with ICB; the application crashes with:

-[MTLGPUDebugDevice newIndirectCommandBufferWithDescriptor:maxCommandCount:options:]:1036: failed assertion `Indirect Command Buffers are not currently supported with Shader Validation'

Fortunately, there is the MTL_SHADER_VALIDATION_GPUOPT_ENABLE_INDIRECT_COMMAND_BUFFERS parameter that can enable ICB validation. But trivial ICB shaders can cause a non-trivial error:

Compiler encountered an internal error

It's possible to simplify a trivial ICB shader so that it passes the compilation stage. Then MTLArgumentBuffer will remind you that it has its own debugger:


Okay, there is no debugging available for ICB now; let's try without it. The good thing is that macOS usually does not hang for more than 20 seconds in case of an error. More often, a nice magenta screen tells you that something has gone wrong:


As a result, we have an internal ICB version of GravityMark which demonstrates that:

  • A GPU-CPU synchronization is required because the indirect version of executeCommandsInBuffer() causes a magenta screen on the M1.
  • The 16384 ICB length limit is too low for all asteroids, so we have to repeat the indirect executeCommandsInBuffer() command a couple of times.
  • AMD is 18% slower with the ICB than with the loop. The only bonus is that the CPU is free. Native Windows or Linux on a Mac would give a 3x boost for graphics.
  • The M1 is 39% faster with the ICB than with the loop. It should be even better without the synchronization. Even with this limitation, the M1 outperforms the best integrated GPUs.
  • The A14 is 44% faster with the ICB than with the loop. It crashed on the last take, with a score of 3600 instead of 2536.
  • It is a bad time to buy a Mac with an AMD GPU for macOS.

September 4, 2021

Mesh Shader versus MultiDrawIndirect

Mesh and Task (Amplification) shaders were a nice addition to the existing set of shaders. They allow the use of custom data formats for better compression and can dynamically generate geometry without using intermediate storage for it. Let’s see how they perform on different hardware and API.

We will use two simple test applications for that. The first draws a million independent quad primitives without Task shader amplification. Nvidia can dispatch a million Task shader invocations, but such numbers crash the AMD driver, so we will use 16 Task shader batches instead. The geometry output is just 4 vertices and 6 indices per Mesh shader.

The second application draws many complex model instances. Each model is made of 283K vertices and 491K triangles, and the number of instances is 81. That makes 20M vertices and 40M triangles to be processed. The number of Meshlets depends on the vertex/primitive limits and varies from 6130 to 2709. The size of the vertex buffer also increases because each Meshlet must be independent. There is no geometry culling; we are checking geometry throughput only.

Both applications can draw all Meshlets individually by using a single MultiDrawIndirect command. Instancing is not used.

GeForce 2080 Ti 4/2 685 M 707 M 655 M 360 M
GeForce 2080 Ti 64/84 12.92 B 11.20 B 11.34 B 10.80 B
GeForce 2080 Ti 64/126 12.51 B 11.21 B 11.71 B 10.92 B
GeForce 2080 Ti 96/169 12.71 B 11.23 B 11.27 B 11.44 B
GeForce 2080 Ti 128/212 12.12 B 12.20 B 10.84 B 11.72 B
GeForce 2080 Ti 32-bit indices 11.79 B 10.95 B
Quadro RTX 8000 4/2 683 M 693 M 661 M 345 M
Quadro RTX 8000 64/84 12.51 B 15.26 B 14.72 B 12.31 B
Quadro RTX 8000 64/126 12.23 B 15.37 B 14.30 B 12.83 B
Quadro RTX 8000 96/169 12.50 B 16.62 B 14.05 B 15.39 B
Quadro RTX 8000 128/212 12.15 B 17.04 B 13.43 B 15.64 B
Quadro RTX 8000 32-bit indices 13.63 B 12.88 B
Radeon 6700 XT 4/2 118 M 85.1 M 85.5 M
Radeon 6700 XT 64/84 4.31 B 3.42 B 3.40 B
Radeon 6700 XT 64/126 4.37 B 3.61 B 3.59 B
Radeon 6700 XT 96/169 4.43 B 5.48 B 5.40 B
Radeon 6700 XT 128/212 4.49 B 7.42 B 7.31 B
Radeon 6700 XT 32-bit indices 14.68 B 14.16 B

The table demonstrates the number of triangles rendered per second.

[images: performance charts]

The results are very interesting:

  • MultiDrawIndirect is faster than Mesh shaders on Nvidia Quadro.
  • There is no difference for Nvidia between Mesh shaders and MultiDrawIndirect, except under Vulkan with a very small number of primitives.
  • 32-bit indices for raw geometry work faster on AMD. But any Mesh shader configuration reduces the hardware capabilities by 3 times. More importantly, the MultiDrawIndirect approach starts to work faster than Mesh shaders when the number of primitives per Meshlet is bigger than 128.

But let’s try to draw 256K boxes by using a Geometry shader. We will use another simple application that draws a 3D grid of boxes. Each box is a point primitive for the Geometry shader. The Mesh shader will draw 64 boxes per Task shader group.

GeForce 2080 Ti 1.5 B 3.7 B 1.5 B 2.9 B
Radeon 6700 XT 1.4 B 2.2 B 1.9 B

Geometry shader is a clear winner here, especially on Nvidia hardware.

A properly implemented MultiDrawIndirectCount allows doing the same job as Mesh shaders. Geometry shaders do simple primitive rendering better than Mesh shaders. We hope that vendors will provide better API flexibility instead of implementing an enormous number of different shader types.

Here are a couple more observations about Mesh shaders:

  • All Mesh shader vertices and indices must be written by threads with indices lower than 32 on Nvidia under Vulkan, even if the shader group size is bigger than 32. Otherwise, the result is ignored.
  • Nvidia can dispatch any number of Task shader groups under D3D12 and Vulkan. AMD has a limit of 65K.

The Clay shader compiler can automatically translate Task and Mesh shaders from GLSL/SPIR-V to HLSL.

Reproduction binaries for Windows: TellusimDrawMeshlet.zip

June 30, 2021

Shader Pipeline

The first shaders were very simple programs that transformed and lit vertices before the rasterization stage. They were written in an assembly language. Now shaders can be found everywhere, from the user interface to physics and logic, because they are no different from any other source code. The number of shaders grows every year with the flexibility of modern GPUs, and software delegates more and more tasks to the GPU instead of the CPU.

There are not many ways of writing shaders nowadays: a HLSL/GLSL/MSL/WGSL/CU/CL dialect. But mostly, the code will be the same, with some minor differences. It’s not possible to create a perfect binary shader that will run across all available hardware: because GPUs have different architectures, it’s impossible to make an optimal binary format for everybody. So an intermediate binary representation is used to simplify the application runtime, and it’s the driver’s job to generate the perfect binary from the input intermediate shader representation.

With cross-API technology, you need different shaders for different APIs. Moreover, some platforms do not allow compiling shaders during the application’s execution and require precompiled shader input. In the case of Vulkan, it’s the SPIR-V format, a binary representation of GLSL shader code. That binary shader can be loaded directly by the Vulkan runtime, and the driver will transform it for the hardware. The OpenGL ARB_gl_spirv extension allows loading the same SPIR-V binary shader directly into the OpenGL runtime with minor modifications related to samplers and textures. Unfortunately, it only works on Nvidia: AMD and Intel can’t handle geometry and tessellation shaders from the SPIR-V binary.

The problems appear when the same shader needs to run on the Direct3D12, Metal, or WebGPU APIs. SPIRV-Cross tools make it possible by translating binary shaders to HLSL or MSL formats with many tweak parameters for resource binding. That translation is not fully compatible between platforms, so the engine must know how best to transform parameters for each platform. After that, the shader source code can be compiled by the d3dcompiler/dxcompiler/metal toolset for the required runtime.

Another option is to use HLSL shaders as input and cross-compile them to the SPIR-V representation for Vulkan. But that also requires resource binding magic. Every platform wants to have its own shader language that is not compatible with other platforms. SPIR-V is a great attempt to make a standard for everybody, but only the Vulkan API can use it. Other platforms require different shader languages or binary formats.

The number of different shader types is also growing. We have Vertex, Fragment, Geometry, Tessellation Control, Tessellation Evaluation, Compute, Task, Mesh, RayGen, RayMiss, ClosestHit, AnyHit, Intersection, and Callable shader types. All of them have different input and output semantics. Luckily, the Khronos Group provides tools to validate and compile all of these shaders. We tried to use these tools, but, unfortunately, it was impossible to cross-compile our Compute, Tessellation, and Geometry shaders with their help. So we’ve created our own shader pipeline based on the GLSL and SPIR-V specifications, and now we are excited to tell you more about it.

We use the GLSL language in our Tellusim Engine as the primary language for all platforms. All shader types, including Mesh and Ray Tracing shaders, are supported. Because of the high performance of our shader toolset, we can skip the offline shader compilation step and do everything at runtime. And it works fast with any amount of code. For platforms that do not allow compiling shaders at runtime, we use a precompiled shader cache.

The GravityMark GPU benchmark requires more than 20K lines of GLSL shaders. And there is a huge difference in application startup time when the Khronos glslang compiler is used for GLSL to SPIR-V compilation.

Here is a log from a build with the Khronos glslang compiler:

M:  63.32 ms: Creating 1600x900 Vulkan Window
M:   1.493 s: Creating SceneManager
M:   9.346 s: Creating RenderManager
M:  12.431 s: Creating Scene
M:  13.551 s: Creating 200,000 Asteroids
M:  13.701 s: Updating Scene
M:  13.851 s: GravityMark v1.2 is Ready in 13.9 s

And this is the Clay shader compiler doing the same job 10 times faster:

M:  58.59 ms: Creating 1600x900 Vulkan Window
M: 288.47 ms: Creating SceneManager
M: 411.18 ms: Creating RenderManager
M: 541.40 ms: Creating Scene
M:   1.289 s: Creating 200,000 Asteroids
M:   1.364 s: Updating Scene
M:   1.500 s: GravityMark v1.2 is Ready in 1.5 s

There is no difference in FPS between shader compilers.

Conversion from the SPIR-V representation to other shader languages is performed at the same incredible speed. Moreover, all resource bindings are handled automatically by the engine. Only one GLSL shader is needed for all supported platforms, including Cuda and WebGPU. This gives great flexibility and significantly reduces the time it takes to develop new features. We also use all the available debugging tools from the supported platforms.

Some GLSL features, such as embedded arrays, are not supported because we don’t need them.

You can download the Clay shader compiler command-line tool for Windows and Linux with all shader language back ends (Vulkan SPIR-V, OpenGL SPIR-V, OpenGL GLSL, OpenGLES GLSL, Direct3D12 HLSL, Direct3D11 HLSL, WebGPU WGSL, Metal MSL, and Cuda) here.