WebGPU rendering pipeline
The WebGPU path runs frustum culling and importance sampling on the GPU, eliminating the CPU-side loop that the WebGL path requires.
Pipeline stages
1. CPU-to-GPU upload
Every throttle tick, the CPU ring buffer copies its contents into a GPU storage buffer flagged STORAGE | COPY_DST. This is a full-ring upload: O(N) CPU work per tick, where N is the current buffer fill. It is the main cost of the WebGPU path and the target of a planned incremental upload path.
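A minimal sketch of what a full-ring upload involves, assuming the ring stores xyz triples in a Float32Array with a head index and a fill count. RingSnapshot, QueueLike, linearizeRing, and uploadFullRing are hypothetical names, not PointFlow APIs; QueueLike mirrors only the shape of GPUQueue.writeBuffer(buffer, bufferOffset, data) so the example runs without a GPU.

```typescript
// Hypothetical ring layout: xyz triples, oldest point at `head`.
interface RingSnapshot {
  data: Float32Array; // capacity * 3 floats
  capacity: number;   // maximum number of points
  head: number;       // slot of the oldest point
  count: number;      // current fill (N)
}

// Structural stand-in for the part of GPUQueue used here.
interface QueueLike<B> {
  writeBuffer(buffer: B, bufferOffset: number, data: Float32Array): void;
}

// Copies the ring into a contiguous, oldest-first array: O(N) CPU work.
function linearizeRing(ring: RingSnapshot): Float32Array {
  const out = new Float32Array(ring.count * 3);
  for (let i = 0; i < ring.count; i++) {
    const src = ((ring.head + i) % ring.capacity) * 3;
    out.set(ring.data.subarray(src, src + 3), i * 3);
  }
  return out;
}

// Uploads every live point each tick, regardless of how many changed.
function uploadFullRing<B>(queue: QueueLike<B>, target: B, ring: RingSnapshot): number {
  const flat = linearizeRing(ring);
  queue.writeBuffer(target, 0, flat);
  return flat.byteLength; // bytes sent this tick
}
```

The returned byte count grows with the buffer fill rather than with the number of new points, which is the cost described above.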
2. Compute pass
A WGSL compute shader reads all uploaded points and produces a compacted output buffer:
    @compute @workgroup_size(256)
    fn cull_and_compact(@builtin(global_invocation_id) id: vec3<u32>) {
        let index = id.x;
        if (index >= uniforms.pointCount) { return; }

        let pos = positions[index];

        // 6-plane frustum test
        if (!passes_frustum(pos, uniforms.frustumPlanes)) { return; }

        // Temporal window filter (when timeWindowMs > 0)
        if (uniforms.timeWindowMs > 0u) {
            let age = uniforms.currentTime - timestamps[index];
            if (age > uniforms.timeWindowMs) { return; }
        }

        // Importance sampling (when enabled)
        if (uniforms.importanceSamplingEnabled == 1u) {
            let hash = pcg_hash(index, uniforms.frameSeed);
            if (f32(hash) / 4294967295.0 > importances[index]) { return; }
        }

        // Atomically append to output buffer
        let out = atomicAdd(&drawIndirectArgs.vertexCount, 1u);
        compacted[out] = index;
    }
The compute shader:
- Reads positions, timestamps, and importance values from storage buffers.
- Applies frustum, temporal window, and importance tests.
- Uses atomicAdd to append surviving point indices to a compacted index buffer.
- Writes the surviving point count directly into the drawIndirect arguments buffer.
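The three tests can be mirrored on the CPU for unit-testing the filter logic. This is a sketch under stated assumptions, not the project's code: planes are taken to be inward-facing (a, b, c, d) tuples, and because the body of pcg_hash is not shown here, the random value is injected as a hypothetical hash01 callback returning a number in [0, 1]. CullInput, cullAndCompactCpu, and workgroupCount are illustrative names.

```typescript
type Plane = [number, number, number, number]; // a, b, c, d

interface CullInput {
  positions: Float32Array;   // xyz per point
  timestamps: Uint32Array;   // milliseconds per point
  importances: Float32Array; // 0..1 per point
  planes: Plane[];           // assumed inward-facing: inside when a*x + b*y + c*z + d >= 0
  currentTime: number;
  timeWindowMs: number;      // 0 disables the temporal filter
  importanceSampling: boolean;
  hash01: (index: number) => number; // stand-in for the normalized pcg_hash value
}

function cullAndCompactCpu(input: CullInput): number[] {
  const compacted: number[] = [];
  const n = input.timestamps.length;
  for (let i = 0; i < n; i++) {
    const x = input.positions[i * 3];
    const y = input.positions[i * 3 + 1];
    const z = input.positions[i * 3 + 2];
    // Frustum test: the point must be on the inner side of every plane.
    const inside = input.planes.every(([a, b, c, d]) => a * x + b * y + c * z + d >= 0);
    if (!inside) continue;
    // Temporal window: drop points older than the window.
    if (input.timeWindowMs > 0 && input.currentTime - input.timestamps[i] > input.timeWindowMs) continue;
    // Importance sampling: keep a point with probability equal to its importance.
    if (input.importanceSampling && input.hash01(i) > input.importances[i]) continue;
    // The shader appends with atomicAdd, so output order on the GPU is not deterministic.
    compacted.push(i);
  }
  return compacted;
}

// One invocation per point, 256 invocations per workgroup.
function workgroupCount(pointCount: number): number {
  return Math.ceil(pointCount / 256);
}
```

Comparing the sorted GPU output against this function's result is one way to check the shader, since only the set of survivors is expected to match, not their order.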
3. Draw pass
drawIndirect draws from the compacted buffer using the count that the compute shader wrote. The CPU never needs to know how many points survived. There's no BufferAttribute.needsUpdate, no CPU readback, and no draw call preparation on the frame path.
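For reference, WebGPU defines the arguments of a non-indexed indirect draw as four consecutive u32 values: vertexCount, instanceCount, firstVertex, firstInstance. The sketch below builds the initial contents under the assumption, taken from the shader above, that the survivor count lives in vertexCount. How the count is zeroed between ticks is not stated here; resetting it from the CPU is one possibility, and makeDrawIndirectArgs and resetSurvivorCount are hypothetical helpers.

```typescript
// Size of one non-indexed indirect draw record: 4 x u32.
const INDIRECT_ARGS_BYTES = 16;

// Initial contents: vertexCount = 0 (filled in by the compute pass),
// instanceCount = 1, firstVertex = 0, firstInstance = 0.
function makeDrawIndirectArgs(): Uint32Array {
  return new Uint32Array([0, 1, 0, 0]);
}

// The compute pass increments the count atomically, so it has to start
// from zero each tick. This mirrors that reset on a CPU-side copy.
function resetSurvivorCount(args: Uint32Array): void {
  args[0] = 0;
}
```

A buffer holding these values would need INDIRECT usage in addition to STORAGE so that the same memory can be written by the compute pass and consumed by drawIndirect.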
Double buffering
Two sets of position and attribute storage buffers alternate each tick: one is being written by the CPU (upload) while the other is being read by the GPU (compute + draw). This ensures ingest and rendering never contend on the same memory.
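The alternation can be sketched as a small ping-pong holder. PingPong is a hypothetical helper written for illustration; in practice each slot would hold one set of GPU buffers.

```typescript
// Holds two resource sets and alternates their roles each tick.
class PingPong<T> {
  private readonly slots: [T, T];
  private writeIndex = 0;

  constructor(a: T, b: T) {
    this.slots = [a, b];
  }

  // Set the CPU uploads into during this tick.
  get uploadTarget(): T {
    return this.slots[this.writeIndex];
  }

  // Set the GPU reads from (compute + draw) during this tick.
  get renderSource(): T {
    return this.slots[1 - this.writeIndex];
  }

  // Call once per tick, after the upload has been submitted.
  swap(): void {
    this.writeIndex = 1 - this.writeIndex;
  }
}
```

At any moment uploadTarget and renderSource refer to different sets, which is the property that keeps ingest and rendering off the same memory.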
Support matrix
| Browser | Status |
|---|---|
| Chrome 113+ | Full WebGPU support |
| Edge 113+ | Full WebGPU support |
| Firefox | No WebGPU — uses WebGL fallback |
| Safari 18+ | Experimental WebGPU — uses WebGL fallback by default |
| Node.js / jsdom | No navigator.gpu — uses WebGL fallback |
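A renderer choice consistent with this table can be sketched as a two-step probe: check that navigator.gpu exists, then check that requestAdapter() resolves to a non-null adapter. NavigatorLike, hasWebGpuEntryPoint, and resolveRendererKind are hypothetical names; the navigator object is passed in so the sketch also runs where no global navigator exists.

```typescript
type RendererKind = "webgpu" | "webgl";

// Minimal shape of the WebGPU entry point on navigator.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<object | null> };
}

// False in Node.js / jsdom and in browsers that do not expose navigator.gpu.
function hasWebGpuEntryPoint(nav: NavigatorLike | undefined): boolean {
  return nav !== undefined && nav.gpu !== undefined;
}

// Falls back to WebGL when the entry point is missing or no adapter is available.
async function resolveRendererKind(nav: NavigatorLike | undefined): Promise<RendererKind> {
  const gpu = nav?.gpu;
  if (!gpu) return "webgl";
  const adapter = await gpu.requestAdapter();
  return adapter ? "webgpu" : "webgl";
}
```

The Safari row implies an extra policy decision on top of this probe, since WebGPU may be present there and the fallback is still chosen by default; that logic is not shown.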
Known limitations
Full-ring upload per tick: The current upload path copies all N positions on every tick, even when few points changed. Worker-to-GPU incremental upload (only the chunks added since the last tick) is the planned fix; it would make upload cost scale with the ingest rate instead of the buffer fill.
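To illustrate why an incremental path would be cheaper, the sketch below computes which ring slots were overwritten since the last tick. It is speculative: it assumes the GPU buffer mirrors ring slots one-to-one, which is not stated here, and SlotRange and dirtyRanges are hypothetical names.

```typescript
interface SlotRange {
  start: number;  // first slot, in points
  length: number; // number of consecutive slots
}

// writeCursor: slot where the first new point of this tick was written.
// appended:    number of points written since the last tick.
function dirtyRanges(capacity: number, writeCursor: number, appended: number): SlotRange[] {
  if (appended <= 0) return [];
  // The whole ring was overwritten at least once.
  if (appended >= capacity) return [{ start: 0, length: capacity }];
  const firstLen = Math.min(appended, capacity - writeCursor);
  const ranges: SlotRange[] = [{ start: writeCursor, length: firstLen }];
  // The write wrapped past the end of the ring.
  if (appended > firstLen) ranges.push({ start: 0, length: appended - firstLen });
  return ranges;
}
```

Each range would map to one buffer write at most, so a tick that ingests few points would upload only those points instead of all N.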
maxStorageBufferBindingSize: GPU devices cap the size of storage buffers, typically at 128–512 MB. PointFlow checks this at initialization and silently caps maxPoints if the requested size would exceed the device limit. Use onRendererResolved and the active policy to see the effective budget.
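The cap amounts to a simple division. The sketch below is an illustration, not PointFlow's code: resolvePointBudget and PointBudget are hypothetical, and the bytes-per-point figure is an assumption that depends on the actual buffer layout. The limit itself would come from device.limits.maxStorageBufferBindingSize.

```typescript
interface PointBudget {
  maxPoints: number; // effective budget after applying the device limit
  capped: boolean;   // true when the request exceeded the limit
}

function resolvePointBudget(
  requestedMaxPoints: number,
  maxStorageBufferBindingSize: number, // bytes, from the device limits
  bytesPerPoint: number,               // assumed layout, e.g. 16 for one vec4<f32>
): PointBudget {
  const deviceMax = Math.floor(maxStorageBufferBindingSize / bytesPerPoint);
  const maxPoints = Math.min(requestedMaxPoints, deviceMax);
  return { maxPoints, capped: maxPoints < requestedMaxPoints };
}
```

With a 128 MiB limit (134217728 bytes) and an assumed 16 bytes per point, a request for 20 million points would be reduced to 8388608.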
GPU point picking
On pointerdown, a second render pass is added to the same command encoder (after the compute pass, before the main draw). The picking pass:
- Renders all visible points (via the same drawIndirect buffer) into a full-resolution R32Uint texture.
- Each fragment writes visibleSlot[instanceIndex] + 1, the ring-buffer slot of the point, offset by one so that 0 means "nothing drawn here".
- A 1×1 region at the click coordinates is copied to a 256-byte staging buffer via copyTextureToBuffer.
- After device.queue.submit, the staging buffer is mapped asynchronously. The encoded value is decoded back to a ring-buffer slot and the CPU ring buffer returns the point's XYZ and attributes.
Coordinates are scaled by the actual device pixel ratio (canvas device pixels / CSS pixels) before the texture read, so the result is correct on HiDPI displays.
The picking pass adds no per-frame cost — it only runs in the frame where a click lands.
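The CPU side of picking comes down to two small conversions: pointer coordinates to a texel, and the encoded texel value back to a slot. A sketch, with CanvasSize, pointerToTexel, decodePick, and readPick as hypothetical names; it assumes the pointer position is given in CSS pixels relative to the canvas.

```typescript
interface CanvasSize {
  width: number;        // backing-store width in device pixels
  height: number;       // backing-store height in device pixels
  clientWidth: number;  // layout width in CSS pixels
  clientHeight: number; // layout height in CSS pixels
}

// CSS-pixel pointer position -> texel in the full-resolution picking texture.
function pointerToTexel(canvas: CanvasSize, cssX: number, cssY: number): [number, number] {
  const sx = canvas.width / canvas.clientWidth;   // actual device pixel ratio, x
  const sy = canvas.height / canvas.clientHeight; // actual device pixel ratio, y
  const x = Math.min(canvas.width - 1, Math.max(0, Math.floor(cssX * sx)));
  const y = Math.min(canvas.height - 1, Math.max(0, Math.floor(cssY * sy)));
  return [x, y];
}

// 0 means "nothing drawn here"; any other value is slot + 1.
function decodePick(encoded: number): number | null {
  return encoded === 0 ? null : encoded - 1;
}

// Reads the first u32 of the mapped staging buffer and decodes it.
function readPick(mapped: ArrayBuffer): number | null {
  return decodePick(new Uint32Array(mapped, 0, 1)[0]);
}
```

Deriving the scale from the canvas dimensions rather than from window.devicePixelRatio keeps the lookup correct when the backing store has been sized independently of the display's ratio.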
WebGL fallback
The WebGL path achieves the same visible output through CPU-side operations:
- copyToTypedArrays iterates the ring buffer, applies the frustum predicate, applies LOD stride, and writes into pre-allocated Float32Array position and color buffers.
- These buffers upload to BufferGeometry attributes with needsUpdate = true.
- drawArrays draws from the uploaded attributes.
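The CPU loop can be sketched as follows. This is an illustrative stand-in, not the actual copyToTypedArrays: copyVisiblePoints and FallbackRing are hypothetical, the frustum predicate is injected as a callback, and the stride is assumed to apply to points that already passed the frustum test.

```typescript
interface FallbackRing {
  positions: Float32Array; // xyz per point
  colors: Float32Array;    // rgb per point
  count: number;           // current fill
}

// Writes the surviving points into pre-allocated output arrays and
// returns how many were written (the count to draw).
function copyVisiblePoints(
  ring: FallbackRing,
  isVisible: (x: number, y: number, z: number) => boolean,
  lodStride: number,
  outPositions: Float32Array,
  outColors: Float32Array,
): number {
  const stride = Math.max(1, Math.floor(lodStride));
  let survivors = 0;
  let written = 0;
  for (let i = 0; i < ring.count; i++) {
    const p = i * 3;
    if (!isVisible(ring.positions[p], ring.positions[p + 1], ring.positions[p + 2])) continue;
    // Keep every stride-th survivor.
    if (survivors++ % stride !== 0) continue;
    outPositions.set(ring.positions.subarray(p, p + 3), written * 3);
    outColors.set(ring.colors.subarray(p, p + 3), written * 3);
    written++;
  }
  return written;
}
```

Because the output arrays are allocated once and reused, the per-frame cost is the loop itself plus the attribute upload, with no allocation on the frame path.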
The cost is higher because the frustum loop runs on the main thread, but the result is correct and works on every browser, including Firefox and older Safari.