June 25, 2023
02 Hello Compute
Making cross-platform and cross-API GPU particle systems is easy with Tellusim Core SDK. We designed our API to be very efficient at dispatching compute shaders because Tellusim Engine uses a large number of them. In this example, we will use indirect dispatch, the built-in GPU RadixSort algorithm, and particle rendering with instancing (for WebGPU compatibility).
A particle system is just an array of objects with the same behavior. Each particle has parameters like position, velocity, lifetime, and rotation. The particle system simulation has two essential parts: particle emission and particle simulation. Particle emission creates new particles with the specified position and parameters. Particle simulation updates particle position according to velocity, and additional external forces can be applied at this step.
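The two parts described above can be sketched on the CPU before moving to compute shaders. This is a minimal illustration, not the sample's code: the `Particle` struct, `emit()`, and `simulate()` are hypothetical names, and explicit Euler integration with a single gravity force is an assumption.

```cpp
#include <vector>

// Hypothetical CPU-side particle, for illustration only
struct Particle {
    float position[3];
    float velocity[3];
    float life;  // remaining lifetime in seconds; <= 0 means the slot is free
};

// Emission: reuse the first free slot for a new particle.
// Returns false when every slot is still alive.
inline bool emit(std::vector<Particle> &particles, const float (&position)[3],
                 const float (&velocity)[3], float life) {
    for(Particle &p : particles) {
        if(p.life > 0.0f) continue;
        for(int i = 0; i < 3; i++) {
            p.position[i] = position[i];
            p.velocity[i] = velocity[i];
        }
        p.life = life;
        return true;
    }
    return false;
}

// Simulation: apply an external force (gravity) and integrate with explicit Euler.
// ifps is the frame time in seconds.
inline void simulate(std::vector<Particle> &particles, const float (&gravity)[3], float ifps) {
    for(Particle &p : particles) {
        if(p.life <= 0.0f) continue;
        for(int i = 0; i < 3; i++) {
            p.velocity[i] += gravity[i] * ifps;
            p.position[i] += p.velocity[i] * ifps;
        }
        p.life -= ifps;
    }
}
```

The GPU version in this post follows the same split, but replaces the linear search for a free slot with an allocator buffer and runs one thread per particle.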
We will start with Application, Window, and Device initialization. However, it is crucial to ensure that compute shaders are supported:
/*
 */
int32_t main(int32_t argc, char **argv) {

    // create app
    App app(argc, argv);
    if(!app.create()) return 1;

    // create window
    Window window(app.getPlatform(), app.getDevice());
    if(!window || !window.setSize(app.getWidth(), app.getHeight())) return 1;
    if(!window.create("02 Hello Compute") || !window.setHidden(false)) return 1;
    window.setKeyboardPressedCallback([&](uint32_t key, uint32_t code) {
        if(key == Window::KeyEsc) window.stop();
    });

    // create device
    Device device(window);
    if(!device) return 1;

    // check compute shader support
    if(!device.hasShader(Shader::TypeCompute)) {
        TS_LOG(Error, "compute shader is not supported\n");
        return 0;
    }
Several structures must stay synchronized between the C++ and shader files, so we declare them in a separate file and include it where required. The maximum number of simulated particles is limited to 1M (1024 × 1024):
// declarations
#include "main.h"

// parameters
constexpr uint32_t group_size = 128;
constexpr uint32_t max_emitters = 1024;
constexpr uint32_t max_particles = 1024 * 1024;
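When a structure is shared between C++ and a storage buffer, the C++ side has to match the shader's layout rules. The sketch below is only an illustration of that idea: `ParticleStateSketch` is a hypothetical stand-in (the real `ParticleState` lives in `main.h`, which is not shown here), its field names are borrowed from the initialization kernel later in this post, and the trailing `padding` member is an assumption that mimics std430 rounding a struct with `vec4` members up to a multiple of 16 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of a shader-side structure (not the actual main.h contents)
struct ParticleStateSketch {
    float position[4];  // vec4 in GLSL, 16-byte aligned
    float velocity[4];  // vec4 in GLSL, 16-byte aligned
    float radius;
    float angle;
    float life;
    float padding;      // assumption: keeps the array stride a multiple of 16 bytes
};

// Compile-time checks catch layout drift between C++ and the shader
static_assert(offsetof(ParticleStateSketch, velocity) == 16, "unexpected velocity offset");
static_assert(offsetof(ParticleStateSketch, radius) == 32, "unexpected radius offset");
static_assert(sizeof(ParticleStateSketch) == 48, "unexpected structure size");

// Storage required for a given number of particles
constexpr size_t particles_buffer_bytes(uint32_t num_particles) {
    return sizeof(ParticleStateSketch) * (size_t)num_particles;
}
```

With this hypothetical layout, 1M particles would occupy 48 MiB of storage; the real figure depends on the actual structure declared in `main.h`.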
Kernel is the class that manages compute shaders. Its configuration is similar to the Pipeline configuration from the previous example, except that there are no shader stage masks. A simple particle simulation requires kernels for particle initialization, spawning, and update. Additional kernels update the geometry and the indirect dispatch/draw buffers:
// create init kernel
Kernel init_kernel = device.createKernel().setUniforms(1).setStorages(3, false);
if(!init_kernel.loadShaderGLSL("main.shader", "INIT_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!init_kernel.create()) return 1;

// create emitter kernel
Kernel emitter_kernel = device.createKernel().setUniforms(1).setStorages(5, false).setStorageDynamic(0, true);
if(!emitter_kernel.loadShaderGLSL("main.shader", "EMITTER_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!emitter_kernel.create()) return 1;

// create dispatch kernel
Kernel dispatch_kernel = device.createKernel().setUniforms(1).setStorages(4, false);
if(!dispatch_kernel.loadShaderGLSL("main.shader", "DISPATCH_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!dispatch_kernel.create()) return 1;

// create update kernel
Kernel update_kernel = device.createKernel().setUniforms(1).setStorages(4, false);
if(!update_kernel.loadShaderGLSL("main.shader", "UPDATE_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!update_kernel.create()) return 1;

// create geometry kernel
Kernel geometry_kernel = device.createKernel().setUniforms(1).setStorages(4, false);
if(!geometry_kernel.loadShaderGLSL("main.shader", "GEOMETRY_SHADER=1; GROUP_SIZE=%uu", group_size)) return 1;
if(!geometry_kernel.create()) return 1;
All particles must be sorted (back to front) for correct rendering. We will use a built-in RadixSort algorithm with an indirect dispatch and ordering option. Ordering sort mode creates integer indices that map input keys to the sorted state.
// create radix sort
RadixSort radix_sort;
PrefixScan prefix_scan;
if(!radix_sort.create(device, RadixSort::FlagSingle | RadixSort::FlagIndirect | RadixSort::FlagOrder, prefix_scan, max_particles)) return 1;
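To make the ordering mode concrete, here is a small CPU sketch of what "indices that map input keys to the sorted state" means. It uses `std::stable_sort` in place of a GPU radix sort, and `sort_order()` is a hypothetical helper, not part of the Tellusim API.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Returns indices such that keys[order[0]] <= keys[order[1]] <= ...
// The keys themselves are left untouched, only the index array is sorted.
inline std::vector<uint32_t> sort_order(const std::vector<uint32_t> &keys) {
    std::vector<uint32_t> order(keys.size());
    std::iota(order.begin(), order.end(), 0u);  // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(), [&](uint32_t a, uint32_t b) {
        return keys[a] < keys[b];
    });
    return order;
}
```

For keys `{ 30, 10, 20 }` the resulting order is `{ 1, 2, 0 }`. An ascending sort gives front-to-back order when the keys are distances, so one way to get back-to-front order is to store inverted keys (for example `~key`), which yields `{ 0, 2, 1 }` for the same input. How the sample encodes its distance keys is decided in the update kernel, which is not shown here.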
Now it’s time to create storage buffers that will contain particle parameters:
// create compute state buffer
// contains global particle system state
ComputeState state = {};
Buffer state_buffer = device.createBuffer(Buffer::FlagStorage, &state, sizeof(state));
if(!state_buffer) return 1;

// create emitters state buffer
// contains dynamic emitter parameters
Buffer emitters_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(EmitterState) * max_emitters * group_size);
if(!emitters_buffer) return 1;

// create particles state buffer
// contains per-particle state
Buffer particles_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(ParticleState) * max_particles);
if(!particles_buffer) return 1;

// create particle allocator buffer
// contains new particle indices
Buffer allocator_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(uint32_t) * max_particles);
if(!allocator_buffer) return 1;

// create particle distances buffer
// contains camera to particle distances and order indices
Buffer distances_buffer = device.createBuffer(Buffer::FlagStorage, sizeof(uint32_t) * max_particles * 2);
if(!distances_buffer) return 1;

// create particle vertex buffer
// contains particle position, velocity, and color
Buffer vertex_buffer = device.createBuffer(Buffer::FlagVertex | Buffer::FlagStorage, sizeof(Vertex) * max_particles);
if(!vertex_buffer) return 1;

// create particle indices buffer
const uint16_t indices_data[] = { 0, 1, 2, 2, 3, 0 };
Buffer indices_buffer = device.createBuffer(Buffer::FlagIndex, indices_data, sizeof(indices_data));
if(!indices_buffer) return 1;

// create indirect dispatch buffer
Compute::DispatchIndirect dispatch_data = {};
Buffer dispatch_buffer = device.createBuffer(Buffer::FlagStorage | Buffer::FlagIndirect, &dispatch_data, sizeof(dispatch_data));
if(!dispatch_buffer) return 1;

// create indirect draw buffer
Command::DrawElementsIndirect draw_data = {};
Buffer draw_buffer = device.createBuffer(Buffer::FlagStorage | Buffer::FlagIndirect, &draw_data, sizeof(draw_data));
if(!draw_buffer) return 1;

// create sort parameters buffer
RadixSort::DispatchParameters sort_data = {};
Buffer sort_buffer = device.createBuffer(Buffer::FlagStorage, &sort_data, sizeof(sort_data));
if(!sort_buffer) return 1;
This Compute dispatch will initialize particles. A barrier is required for all buffers that have been updated and will be used in the following stages:
// compute parameters
ComputeParameters compute_parameters = {};
compute_parameters.max_emitters = max_emitters;
compute_parameters.max_particles = max_particles;
compute_parameters.global_gravity = Vector4f(0.0f, 0.0f, -8.0f, 0.0f);
compute_parameters.wind_velocity = Vector4f(0.0f, 0.0f, 4.0f, 0.0f);
compute_parameters.wind_force = 0.2f;

// initialize buffers
{
    Compute compute = device.createCompute();

    // init kernel
    compute.setKernel(init_kernel);
    compute.setUniform(0, compute_parameters);
    compute.setStorageBuffers(0, { emitters_buffer, particles_buffer, allocator_buffer });
    compute.dispatch(max(max_emitters * group_size, max_particles));
    compute.barrier({ emitters_buffer, particles_buffer, allocator_buffer });
}
Compute dispatching is simpler than rendering with Command and Pipeline. The corresponding initialization kernel does nothing except write to storage buffers. The allocator buffer contains indices for new particles. The indices are stored in reverse order, so the last element contains the first particle index for better memory access. GLSL will be automatically converted to the target platform shader language, including CUDA and HIP compute:
layout(local_size_x = GROUP_SIZE) in;

layout(std140, binding = 0) uniform ComputeParametersBuffer { ComputeParameters compute; };
layout(std430, binding = 1) writeonly buffer EmitterStateBuffer { EmitterState emitters_buffer[]; };
layout(std430, binding = 2) writeonly buffer ParticleStateBuffer { ParticleState particles_buffer[]; };
layout(std430, binding = 3) writeonly buffer IndicesBuffer { uint allocator_buffer[]; };

/*
 */
void main() {

    uint global_id = gl_GlobalInvocationID.x;

    // initialize emitters
    [[branch]] if(global_id < compute.max_emitters * GROUP_SIZE) {
        emitters_buffer[global_id].position = vec4(0.0f);
        emitters_buffer[global_id].seed = ivec2(global_id);
        emitters_buffer[global_id].spawn = 0.0f;
    }

    // initialize particles
    [[branch]] if(global_id < compute.max_particles) {
        particles_buffer[global_id].position = vec4(1e16f);
        particles_buffer[global_id].velocity = vec4(0.0f);
        particles_buffer[global_id].radius = 0.0f;
        particles_buffer[global_id].angle = 0.0f;
        particles_buffer[global_id].life = 0.0f;
    }

    // initialize particle indices
    [[branch]] if(global_id < compute.max_particles) {
        allocator_buffer[global_id] = compute.max_particles - global_id - 1u;
    }
}
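The reversed fill of the allocator buffer makes sense if the buffer is treated as a stack of free indices that is popped from the end. The following CPU sketch assumes exactly that; `AllocatorSketch` is a hypothetical type, and the emitter and update kernels (not shown in this post) may manage the counter differently, most likely with atomic operations.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical CPU model of a free-index stack
struct AllocatorSketch {
    std::vector<uint32_t> indices;  // free particle indices
    uint32_t num_free;              // number of valid entries in indices

    explicit AllocatorSketch(uint32_t max_particles)
        : indices(max_particles), num_free(max_particles) {
        // same fill as the init kernel: the last element holds index 0
        for(uint32_t i = 0; i < max_particles; i++) {
            indices[i] = max_particles - i - 1u;
        }
    }

    // Pop a free index from the end of the stack; returns false when exhausted
    bool allocate(uint32_t &index) {
        if(num_free == 0) return false;
        index = indices[--num_free];
        return true;
    }

    // Push the index of a dead particle back onto the stack
    bool release(uint32_t index) {
        if(num_free == indices.size()) return false;
        indices[num_free++] = index;
        return true;
    }
};
```

Under this assumption the first allocations return 0, 1, 2, and so on, so freshly spawned particles occupy neighboring slots in the particle buffer.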
Now everything is ready for the particle system simulation, which must be executed every frame:
// simulate particles
{
    Compute compute = device.createCompute();

    // compute parameters
    compute_parameters.camera = Matrix4x3f::rotateZ(-time * 8.0f) * Vector4f(32.0f, 0.0f, 32.0f, 0.0f);
    compute_parameters.num_emitters = emitters.size();
    compute_parameters.ifps = ifps;

    // emitter kernel
    compute.setKernel(emitter_kernel);
    compute.setUniform(0, compute_parameters);
    compute.setStorageData(0, emitters.get(), emitters.bytes());
    compute.setStorageBuffers(1, { state_buffer, emitters_buffer, particles_buffer, allocator_buffer });
    compute.dispatch(emitters.size() * group_size);
    compute.barrier({ state_buffer, emitters_buffer, particles_buffer, allocator_buffer });

    // dispatch kernel
    compute.setKernel(dispatch_kernel);
    compute.setStorageBuffers(0, { state_buffer, dispatch_buffer, draw_buffer, sort_buffer });
    compute.dispatch(1);
    compute.barrier({ dispatch_buffer, draw_buffer, sort_buffer });

    // update kernel
    compute.setKernel(update_kernel);
    compute.setUniform(0, compute_parameters);
    compute.setStorageBuffers(0, { state_buffer, particles_buffer, allocator_buffer, distances_buffer });
    compute.setIndirectBuffer(dispatch_buffer);
    compute.dispatchIndirect();
    compute.barrier({ state_buffer, particles_buffer, allocator_buffer, distances_buffer });

    // sort particles
    if(!radix_sort.dispatchIndirect(compute, distances_buffer, sort_buffer, 0, RadixSort::FlagOrder)) return false;

    // geometry kernel
    compute.setKernel(geometry_kernel);
    compute.setUniform(0, compute_parameters);
    compute.setStorageBuffers(0, { state_buffer, particles_buffer, distances_buffer, vertex_buffer });
    compute.setIndirectBuffer(dispatch_buffer);
    compute.dispatchIndirect();
}
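The dispatch kernel runs as a single thread and converts the current particle count into arguments for the indirect dispatch and draw calls that follow. Its shader code is not shown in this post, so the sketch below is only a CPU model of the arithmetic under stated assumptions: `DispatchArgsSketch` and `DrawArgsSketch` are hypothetical structures (the field order follows the Vulkan indirect command layouts, not necessarily the Tellusim ones), and one instance of a six-index quad per particle is assumed.

```cpp
#include <cstdint>

// Hypothetical indirect dispatch arguments: three group counts
struct DispatchArgsSketch {
    uint32_t group_x, group_y, group_z;
};

// Hypothetical indexed indirect draw arguments
struct DrawArgsSketch {
    uint32_t num_indices, num_instances, base_index;
    int32_t base_vertex;
    uint32_t base_instance;
};

// One thread per particle: round the group count up so the last partial group is covered
inline DispatchArgsSketch make_dispatch_args(uint32_t num_particles, uint32_t group_size) {
    uint32_t groups = (num_particles + group_size - 1u) / group_size;
    return { groups, 1u, 1u };
}

// One quad (two triangles, six indices) per particle instance
inline DrawArgsSketch make_draw_args(uint32_t num_particles) {
    return { 6u, num_particles, 0u, 0, 0u };
}
```

Because the group count is rounded up, a kernel dispatched this way still needs a bounds check on its global invocation index, in the same way the initialization kernel compares `global_id` with `compute.max_particles`.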
As a result, we have particle positions, velocities, and colors in the vertex buffer and rendering parameters in the indirect buffer. The only rule to follow is to place the required write-read barriers between stages. Our debug runtime will report any invalid kernel arguments. Let’s render our particle system:
// window target
target.begin();
{
    // create command list
    Command command = device.createCommand(target);

    // set common parameters
    CommonParameters common_parameters;
    common_parameters.projection = Matrix4x4f::perspective(60.0f, (float32_t)window.getWidth() / window.getHeight(), 0.1f, 1000.0f);
    common_parameters.modelview = Matrix4x4f::lookAt(compute_parameters.camera.xyz, Vector3f::zero, Vector3f(0.0f, 0.0f, 1.0f));
    if(target.isFlipped()) common_parameters.projection = Matrix4x4f::scale(1.0f, -1.0f, 1.0f) * common_parameters.projection;

    // set pipeline
    command.setPipeline(pipeline);
    command.setSampler(0, sampler);
    command.setTexture(0, texture);
    command.setUniform(0, common_parameters);
    command.setVertexBuffer(0, vertex_buffer);
    command.setIndexBuffer(FormatRu16, indices_buffer);

    // draw particles
    command.setIndirectBuffer(draw_buffer);
    command.drawElementsIndirect(1);
}
target.end();
This sample runs on Android, iOS, WebGPU, Windows, Linux, and macOS platforms with excellent performance and does not require any source code modification. The left image is a link to the WebGPU application that can run in your browser: