
Lumina Global Illumination

Development Period: September 2025 - March 2026 

My thesis implements a real-time dynamic global illumination system built from scratch in DirectX 12, rendering ~1.5 million triangles at 1728×864 with multi-bounce indirect lighting, using no precomputed lightmaps and no hardware ray tracing.

​

Core Techniques:

  • SDF-Based Software Ray Tracing

  • Surface Cache

  • Screen Probe Final Gather

  • Temporal Stability: History reprojection, bilateral spatial filtering, and adaptive accumulation to eliminate real-time GI flickering

Demo

Implementations

DirectX 12 Rendering Framework

All GI subsystems are built on a custom DX12 backend designed around compute-first execution; no engine abstractions sit between the GI logic and the GPU.

  • Command list management with separate graphics and compute queues for potential async execution

  • Descriptor heap management: CBV/SRV/UAV heap (437 slots), RTV heap, DSV heap with dynamic allocation

  • Bindless resource model using large descriptor tables and root signature design

  • Resource state tracking with automatic barrier insertion between passes

  • Pipeline State Object (PSO) caching for compute shaders with hot-reload support

  • Structured buffer management for GPU-side data (probes, cards, BVH nodes)

  • Texture array and 3D texture support for atlas and volumetric data

  • Frame synchronization with triple buffering and fence-based GPU/CPU sync

Deferred Rendering Pipeline

  • RT0: Albedo (RGBA8_UNORM): Base diffuse color from material textures

  • RT1: World Normal (RGBA16_FLOAT): High-precision normals for accurate lighting

  • RT2: Material (RGBA32_FLOAT): Basic material parameters

  • RT3: World Position (RGBA32_FLOAT): Stored world coordinates for GI sampling

  • Depth (D32_FLOAT): Hardware depth buffer for occlusion queries

 

Shadow mapping pass with 2048×2048 shadow map and 3×3 PCF filtering. Direct lighting pass computing sun contribution with Lambert diffuse BRDF. Final composite pass combining direct lighting with screen-space indirect GI.

GBuffer_Composite.png

SurfaceCache System

Rather than computing lighting at every ray hit point, LuminaGI caches material and lighting data for all scene surfaces in a GPU-resident texture atlas. Results are stored on surface "cards" and reused across frames. Screen probes sample this cache during ray tracing, making multi-bounce indirect lighting affordable without hardware RT.

 Architecture

  • A Texture2DArray with 6 layers, each 4096×4096 pixels, persistent in GPU memory across frames:

    • Layer 0: Albedo (RGBA8_UNORM): Base material diffuse color

    • Layer 1: World Normal (RGBA16_FLOAT): Surface orientation

    • Layer 2: Material (RGBA16_FLOAT): Material properties

    • Layer 3: Direct Light (RGBA16_FLOAT): Sun and local light contribution, HDR

    • Layer 4: Indirect Light (RGBA16_FLOAT): GI from Surface Radiosity

    • Layer 5: Final Combined (RGBA16_FLOAT): Direct + Indirect, sampled by screen probes


Surface Card System

  • Each card represents a rectangular patch of mesh surface projected onto the atlas.

  • Cards store the world-to-card transform, atlas coordinates, world-space bounds for BVH queries, a 128-bit light mask encoding which of up to 128 dynamic point lights affect this surface, and state flags (VISIBLE, DIRTY, STATIC, DYNAMIC, TWO_SIDED).

​

Tile Allocation

  • A free-list allocator manages available atlas regions at 64×64 pixel tile granularity, supporting up to 4,096 tiles (4096² / 64² = 4096).

  • Tiles are returned to the free list on deallocation, keeping allocation O(1) for the common case while supporting fragmentation recovery.
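To make the allocation policy concrete, here is a minimal C++ sketch of a free-list allocator at 64×64 tile granularity. The class and method names are illustrative stand-ins, not the actual implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a free-list tile allocator for a 4096x4096 atlas at
// 64x64-pixel tile granularity (64x64 = 4096 tiles total). A tile index maps
// to atlas coordinates via (index % 64, index / 64) * 64.
class TileAllocator {
public:
    static constexpr uint32_t kTilesPerAxis = 64;            // 4096 / 64
    static constexpr uint32_t kTileCount = kTilesPerAxis * kTilesPerAxis;
    static constexpr uint32_t kInvalidTile = 0xFFFFFFFFu;

    TileAllocator() {
        // Push in reverse so tile 0 is handed out first.
        freeList.reserve(kTileCount);
        for (uint32_t i = kTileCount; i-- > 0;)
            freeList.push_back(i);
    }

    // O(1) pop from the free list; returns kInvalidTile when the atlas is full.
    uint32_t Allocate() {
        if (freeList.empty()) return kInvalidTile;
        uint32_t tile = freeList.back();
        freeList.pop_back();
        return tile;
    }

    // O(1) push back; freed tiles become immediately reusable, which is the
    // "fragmentation recovery" behavior described above.
    void Free(uint32_t tile) { freeList.push_back(tile); }

    uint32_t FreeCount() const { return static_cast<uint32_t>(freeList.size()); }

private:
    std::vector<uint32_t> freeList;
};
```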

​

Card Capture Pass

  • Each dirty card is re-rendered from an orthographic viewpoint aligned to its surface normal: orthographic projection ensures undistorted texel-to-world-space mapping regardless of card orientation.

  • A single MRT pass writes albedo, world normal, and direct lighting simultaneously into the card's atlas region, then updates the card's LastUpdateFrame.

​

Incremental Update System

  • Lighting doesn't change everywhere every frame. The system maintains a priority queue where dirty cards (geometry or light changed) are serviced first, followed by age-based refresh for temporal coherence.

  • With a budget of 50 cards per frame, all ~500 active cards are refreshed within approximately 10 frames.

  • The cache stays temporally stable without full per-frame recomputation.
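The budgeted refresh policy might be sketched on the CPU side as follows; the `Card` fields and function names here are hypothetical stand-ins for the real structures:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the per-frame card refresh budget: dirty cards
// (geometry or light changed) are serviced first, then the stalest cards by
// LastUpdateFrame, up to the per-frame budget.
struct Card {
    uint32_t id;
    bool dirty;
    uint32_t lastUpdateFrame;
};

std::vector<uint32_t> SelectCardsToRefresh(std::vector<Card>& cards,
                                           uint32_t currentFrame,
                                           size_t budget = 50) {
    // Order candidates: dirty first, then oldest LastUpdateFrame.
    std::vector<Card*> order;
    for (auto& c : cards) order.push_back(&c);
    std::stable_sort(order.begin(), order.end(), [](const Card* a, const Card* b) {
        if (a->dirty != b->dirty) return a->dirty;        // dirty before clean
        return a->lastUpdateFrame < b->lastUpdateFrame;   // stalest first
    });

    std::vector<uint32_t> selected;
    for (Card* c : order) {
        if (selected.size() >= budget) break;
        selected.push_back(c->id);
        c->dirty = false;                                 // serviced this frame
        c->lastUpdateFrame = currentFrame;
    }
    return selected;
}
```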


Direct Light Update Optimization

  • The DirectLightUpdate compute pass re-shades the direct lighting layer (Layer 3) of the surface cache whenever a light moves or the sun direction changes.

  • A 64×64 R32_UINT tile-granularity lookup texture (16 KB) maps each atlas tile to its owning card index (or 0xFFFFFFFF if empty), rebuilt on the CPU whenever card allocations change. Each thread then performs a single texture load at `atlasCoord / 64` instead of scanning all cards.

​​

Shadow System

  • Directional shadows use a 2048×2048 depth map rendered from the sun's orthographic projection, sampled with 3×3 PCF for soft edges. The pass only re-renders when the sun direction changes.

  • Point light shadows use a cube texture array: up to 4 shadow-casting lights × 6 faces = 24 depth slices at 512×512 each. Each face is rendered with a separate draw call targeting one array slice. Depth is stored as linear distance normalized by the light's far plane. Both shadow types share a single constant buffer (register b5) so all shaders access directional and point shadows through the same bind point.
     

Fig11_SurfaceCacheAtlas.png

Surface Cache Texture2DArray

Fig03_SixAxisCards.png

Axis Surface Card

Fig21_CardCapture.png

Card Capture

DirectionalShadowmap.png

Shadow Map

ca_sl.png

Point light omnidirectional shadow

Card BVH Acceleration Structure

Without spatial acceleration, every SDF ray hit would require testing against all active surface cards to find which card owns the intersection point — an O(N) cost per hit. The BVH reduces this to O(log N) by organizing cards into an axis-aligned bounding volume hierarchy.

​

BVH Node Structure

  • GPUCardBVHNode struct: 48 bytes, cache-line aligned

  • Bounds: min(x,y,z) and max(x,y,z) - 24 bytes

  • leftFirst: child index (internal) or first card index (leaf)

  • cardCount: 0 for internal nodes, >0 for leaf nodes

  • Flat array layout for GPU-friendly access: Leaf nodes contain up to 4 card indices

​

BVH Construction

  • Built top-down on the CPU after scene load. At each node, the split axis is chosen as the longest dimension of the current bounds, and the split position is the median of card centroids.

  • Recursion terminates when a node contains 4 or fewer cards or a depth limit is reached. The resulting flat array of nodes is uploaded to a GPU structured buffer.

  • The tree is rebuilt when cards are added or removed.

​​

GPU Traversal

  • Stack-based iterative traversal in compute shader (stack depth 32, sufficient for balanced trees). At each node, the query point is tested against the node AABB.

  • Misses skip the subtree; hits on leaf nodes test against individual card bounds and return the closest match; internal node hits push both children.

  • Returns the closest card ID or INVALID_CARD.
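The build and traversal described above can be sketched in host-side C++. This is a simplified illustration, not the shipping code: children are allocated as consecutive node pairs, and the leaf test is reduced to point containment returning the first match (the real traversal returns the closest card):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct AABB { float mn[3], mx[3]; };
struct BVHNode { AABB bounds; uint32_t leftFirst; uint32_t cardCount; }; // cardCount == 0 -> internal

struct BVH {
    std::vector<BVHNode> nodes;        // flat array, GPU-friendly layout
    std::vector<uint32_t> cardOrder;   // card indices, reordered during the build
};

static AABB Merge(const AABB& a, const AABB& b) {
    AABB r;
    for (int i = 0; i < 3; ++i) {
        r.mn[i] = std::min(a.mn[i], b.mn[i]);
        r.mx[i] = std::max(a.mx[i], b.mx[i]);
    }
    return r;
}

static void BuildNode(BVH& bvh, const std::vector<AABB>& cards,
                      uint32_t nodeIdx, uint32_t first, uint32_t count) {
    AABB bounds = cards[bvh.cardOrder[first]];
    for (uint32_t i = 1; i < count; ++i)
        bounds = Merge(bounds, cards[bvh.cardOrder[first + i]]);
    bvh.nodes[nodeIdx].bounds = bounds;

    if (count <= 4) {                                   // leaf: up to 4 cards
        bvh.nodes[nodeIdx].leftFirst = first;
        bvh.nodes[nodeIdx].cardCount = count;
        return;
    }
    // Split axis: longest dimension; split position: median of card centroids.
    int axis = 0;
    for (int i = 1; i < 3; ++i)
        if (bounds.mx[i] - bounds.mn[i] > bounds.mx[axis] - bounds.mn[axis]) axis = i;
    auto begin = bvh.cardOrder.begin() + first;
    std::nth_element(begin, begin + count / 2, begin + count,
                     [&](uint32_t a, uint32_t b) {
                         return cards[a].mn[axis] + cards[a].mx[axis] <    // 2x centroid
                                cards[b].mn[axis] + cards[b].mx[axis];
                     });
    uint32_t left = (uint32_t)bvh.nodes.size();         // children are consecutive
    bvh.nodes[nodeIdx].leftFirst = left;
    bvh.nodes[nodeIdx].cardCount = 0;
    bvh.nodes.push_back({});
    bvh.nodes.push_back({});
    BuildNode(bvh, cards, left, first, count / 2);
    BuildNode(bvh, cards, left + 1, first + count / 2, count - count / 2);
}

BVH BuildCardBVH(const std::vector<AABB>& cards) {
    BVH bvh;
    bvh.cardOrder.resize(cards.size());
    for (uint32_t i = 0; i < (uint32_t)cards.size(); ++i) bvh.cardOrder[i] = i;
    bvh.nodes.push_back({});
    BuildNode(bvh, cards, 0, 0, (uint32_t)cards.size());
    return bvh;
}

static bool Contains(const AABB& b, const float p[3]) {
    for (int i = 0; i < 3; ++i)
        if (p[i] < b.mn[i] || p[i] > b.mx[i]) return false;
    return true;
}

// Stack-based point query mirroring the GPU traversal (stack depth 32).
uint32_t QueryCard(const BVH& bvh, const std::vector<AABB>& cards, const float p[3]) {
    uint32_t stack[32];
    int top = 0;
    stack[top++] = 0;
    while (top > 0) {
        const BVHNode& n = bvh.nodes[stack[--top]];
        if (!Contains(n.bounds, p)) continue;           // miss: skip subtree
        if (n.cardCount > 0) {                          // leaf: test card bounds
            for (uint32_t i = 0; i < n.cardCount; ++i) {
                uint32_t card = bvh.cardOrder[n.leftFirst + i];
                if (Contains(cards[card], p)) return card;
            }
        } else {                                        // internal: push children
            stack[top++] = n.leftFirst;
            stack[top++] = n.leftFirst + 1;
        }
    }
    return 0xFFFFFFFFu;                                 // INVALID_CARD
}
```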

Signed Distance Field Pass

Hierarchical distance field representation enabling efficient software ray tracing. Each mesh has a 3D texture storing signed distances, allowing sphere tracing without triangle intersection tests. A global composite SDF handles far-field queries beyond individual mesh range.

Weixin Image_20251209153405_86_123.png

Global SDF Texture3D with different Z

Fig01_BVH_SphereTracing.png
VoxelLighting.png

Voxel Lighting

Per-Mesh SDF

  • Each mesh is baked into a 64³ R32_FLOAT texture storing the signed distance to the nearest surface.

  • Generation runs in a dedicated DX12 compute queue to avoid blocking the render timeline, and results are stored on disk for subsequent loads.

  • At runtime, all mesh SDFs are accessible as a bindless texture array.​

​

Global Scene SDF

  • A 64³ RG32_FLOAT volume covering the entire scene, storing the minimum signed distance across all meshes along with the instance ID of the closest mesh at each voxel.

  • Composed by transforming each voxel into every mesh's local space and taking the union. Enables material lookup at far-field ray hit points.

  • Updated when scene geometry changes. GPU memory footprint: ~2 MB.

​

Voxel Lighting

  • A 64³ RGBA16_FLOAT volume storing average incident radiance sampled from the surface cache at each voxel center.

  • Trilinear interpolation weighted by the hit surface's normal reconstructs a directional irradiance estimate.

  • Serves as a stable, low-frequency fallback for screen probes when mesh SDF misses.

  • GPU memory footprint: ~2 MB.


Sphere Tracing Algorithm

  • Each ray advances by the SDF value at its current position: the distance to the nearest surface is a guaranteed safe step size, so no triangle intersection is needed.

  • A minimum step size prevents stalling at grazing angles. Tracing terminates on a hit (SDF distance below 0.001), a miss (exceeded 200 world units), or after 64 iterations.

  • Per-mesh SDFs handle near-field geometry; the global SDF takes over beyond individual mesh bounds.
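A minimal C++ sketch of the marching loop, with an analytic sphere SDF standing in for the baked 64³ distance texture. The hit epsilon, maximum distance, and iteration count follow the text; the minimum step value is an assumption:

```cpp
#include <cassert>
#include <cmath>

// Minimal sphere-tracing sketch. A unit sphere at the origin stands in for a
// mesh SDF texture fetch. Constants mirror the text: hit epsilon 0.001, max
// distance 200 world units, 64 iterations; kMinStep is an assumed value.
struct Vec3 { float x, y, z; };

static Vec3 Advance(Vec3 o, Vec3 d, float t) {
    return {o.x + d.x * t, o.y + d.y * t, o.z + d.z * t};
}

float SphereSDF(Vec3 p) {                      // stand-in for the SDF texture
    return std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z) - 1.0f;
}

// Returns hit distance along the ray, or -1 on a miss.
float SphereTrace(Vec3 origin, Vec3 dir) {
    const float kHitEps = 0.001f;
    const float kMaxDist = 200.0f;
    const float kMinStep = 0.01f;              // prevents stalling at grazing angles
    float t = 0.0f;
    for (int i = 0; i < 64; ++i) {
        float d = SphereSDF(Advance(origin, dir, t));
        if (d < kHitEps) return t;             // hit: surface within epsilon
        t += std::fmax(d, kMinStep);           // the SDF value is a safe step size
        if (t > kMaxDist) break;               // miss: exceeded max range
    }
    return -1.0f;
}
```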

MeshSDFNormal.png

Mesh SDF Normal

Sphere Tracing Algorithm

Surface Radiosity Pass

Screen probes capture view-dependent indirect lighting for pixels on screen. Surface Radiosity handles the complementary problem: view-independent light transport between cached surfaces. It computes how light arriving at one surface patch re-radiates to neighboring patches, accumulating over multiple frames to simulate multi-bounce GI on the surface cache atlas.

The pass executes every 10 frames to amortize its cost (256×256 thread groups over the full 4096×4096 atlas). After 30 dispatches without lighting changes, it converges and stops entirely.

Pass 1: Radiosity Trace

  • A 1024×1024 probe grid placed on the surface cache atlas (4× denser than the screen probe grid).

  • Each probe fires 16 rays, reads the surface normal from Layer 1, and sphere-traces the Global SDF.

  • At hit points, it samples the voxel lighting volume, which contains the Combined layer data injected each frame.

  • A quadratic distance falloff (`(hitDist / refDist)²`) suppresses corner over-brightening from near-field hits in tight geometry junctions.

  • Output is normalized by 1/π and firefly-clamped to 1/π.

​

Pass 2: Spatial Filter

  • A 4-neighbor cross filter (center weight 2.0, neighbor weight 1.0 each) reduces noise from the limited 16-ray sampling.

  • Each texel is averaged with its four immediate neighbors, with double weight given to the center sample.

​

Pass 3: Convert to Spherical Harmonics

  • Projects filtered radiance onto a 1st-order SH basis (L0+L1, 4 coefficients per channel). This gives a compact directional representation sufficient for low-frequency diffuse indirect lighting, and decouples the radiosity solve from the final integration direction: the SH can be evaluated for any surface normal without re-tracing.
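The projection and evaluation round trip can be illustrated for a single color channel in C++; the basis constants are the standard real SH values, and the accumulation weight is the usual Monte Carlo 4π/N factor supplied by the caller:

```cpp
#include <cassert>
#include <cmath>

// Single-channel sketch of the L0+L1 spherical-harmonics round trip:
// project per-direction radiance into 4 coefficients, then evaluate the
// reconstruction for an arbitrary normal. Names are illustrative.
struct SH4 { float c[4]; };                    // Y00, Y1-1, Y10, Y11

static const float kY0 = 0.2820948f;           // 1 / (2*sqrt(pi))
static const float kY1 = 0.4886025f;           // sqrt(3) / (2*sqrt(pi))

// Accumulate one radiance sample from unit direction d, with the Monte Carlo
// weight (solid angle / sample count) folded in by the caller.
void SHAccumulate(SH4& sh, float radiance, const float d[3], float weight) {
    sh.c[0] += radiance * kY0 * weight;
    sh.c[1] += radiance * kY1 * d[1] * weight;
    sh.c[2] += radiance * kY1 * d[2] * weight;
    sh.c[3] += radiance * kY1 * d[0] * weight;
}

// Evaluate the SH in direction n; negatives clamped to suppress ringing.
float SHEvaluate(const SH4& sh, const float n[3]) {
    float v = sh.c[0] * kY0 +
              sh.c[1] * kY1 * n[1] +
              sh.c[2] * kY1 * n[2] +
              sh.c[3] * kY1 * n[0];
    return v > 0.0f ? v : 0.0f;
}
```

Projecting a constant radiance field and evaluating it back should return the same constant, which makes a convenient sanity check for the basis factors.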

​

Pass 4: Integrate SH to Atlas

  • Evaluates the SH coefficients in the direction of each surface normal from Layer 1, writing the resulting irradiance to Layer 4 (Indirect Light). Negative values are clamped to suppress SH ringing. This triggers the Combine pass to propagate the update to Layer 5, which screen probes and the next radiosity bounce will read.

Fig05_SurfaceRadiosityPipeline.png

Radiosity Trace Pipeline

RadiosityTrace.png

Radiosity Trace

SCIndirectLight.png

Surface Cache Indirect Light Layer

Combine SurfaceCache Pass

Composites Layer 3 (Direct) and Layer 4 (Indirect) into Layer 5 (Final Combined):

Combined = (Direct + Indirect) × Albedo × (1 / π)

 

  • Direct and Indirect layers store raw cosine-weighted irradiance. The Lambertian BRDF (`albedo / π`) is applied once at Combine time, producing outgoing radiance in Layer 5. Storing raw irradiance in the two source layers keeps them in consistent units, keeps albedo in one place, and decouples the atlas storage format from the diffuse model.

  • Layer 5 is what screen probes read as outgoing radiance at ray hit points. It is also injected into the voxel lighting volume each frame, which the next radiosity trace samples at ray hit positions. This closes the multi-bounce loop: Layer 5 → InjectVoxelLighting → RadiosityTrace → Layer 4 → CombineLight → Layer 5. Each radiosity iteration (every 10 frames) adds one bounce of indirect light.

Screen Probe Final Gather Pass

The core GI system implementing an 11-pass screen-space probe pipeline. The key insight is that indirect lighting varies slowly across smooth surfaces; per-pixel ray tracing at 1080p would require 2 billion rays per frame. Instead, probes are placed on a coarse 8×8 pixel grid, each tracing 64 rays to capture hemispherical radiance. This reduces ray count to ~2 million while maintaining quality through importance sampling, octahedral radiance storage, and multi-stage filtering.

Fig15_ScreenProbe_DataFlow.png

Probe Grid Configuration

  • Each probe samples the GBuffer at its cell center: reads GBuffer WorldPos layer, samples world-space normal.

  • Probes over sky pixels (depth ≥ 0.9999) are flagged INVALID and skipped in subsequent passes.

 

Pass 1: Probe Placement

  • Dispatch: ceil(ProbeGridW/8) × ceil(ProbeGridH/8) thread groups, 64 threads each

  • Each thread initializes one probe by sampling GBuffer at the cell center

  • Sample location: pixel coordinate (probeX × 8 + 4, probeY × 8 + 4)

  • Reads hardware depth buffer, converts to linear depth for filtering weights

  • Reconstructs world position: clipPos → inverse(ViewProjection) → worldPos

  • Samples world-space normal from GBuffer, decodes from octahedral or RGB encoding

  • Sky detection: flags probe as INVALID if depth ≥ 0.9999 (far plane)

  • Output: ProbeData structured buffer containing {WorldPosition, LinearDepth, WorldNormal, ValidFlags}
     

Pass 2: BRDF Importance Sampling

  • Computes a cosine-weighted hemisphere PDF for each probe's surface normal.

  • Directions aligned with the normal receive the highest probability; grazing directions approach zero.

  • Physical basis: Lambert's cosine law. A local tangent frame is built per probe to map ray indices to world-space directions.
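A C++ sketch of the cosine-weighted sampling with an explicit tangent frame; the frame-construction choice and helper names are illustrative, not the thesis implementation:

```cpp
#include <cassert>
#include <cmath>

// Sketch of cosine-weighted hemisphere sampling about a probe normal: build a
// tangent frame, then map two uniform random numbers to a direction whose PDF
// is cos(theta)/pi (Lambert's cosine law).
struct V3 { float x, y, z; };

static V3 Cross(V3 a, V3 b) {
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
static V3 Normalize(V3 v) {
    float l = std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    return {v.x / l, v.y / l, v.z / l};
}
static float Dot(V3 a, V3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// u1, u2 in [0,1). Returns a world-space direction; pdf = cos(theta)/pi.
V3 SampleCosineHemisphere(V3 n, float u1, float u2, float* pdf) {
    // Local tangent frame around the normal.
    V3 up = std::fabs(n.z) < 0.999f ? V3{0, 0, 1} : V3{1, 0, 0};
    V3 t = Normalize(Cross(up, n));
    V3 b = Cross(n, t);

    // Polar mapping: r = sqrt(u1), phi = 2*pi*u2.
    float r = std::sqrt(u1);
    float phi = 6.2831853f * u2;
    float lx = r * std::cos(phi), ly = r * std::sin(phi);
    float lz = std::sqrt(std::fmax(0.0f, 1.0f - u1));   // cos(theta)

    V3 d = {t.x*lx + b.x*ly + n.x*lz,
            t.y*lx + b.y*ly + n.y*lz,
            t.z*lx + b.z*ly + n.z*lz};
    if (pdf) *pdf = lz / 3.14159265f;                   // cos(theta)/pi
    return d;
}
```

Note how directions aligned with the normal (u1 near 0) get the highest density while grazing directions (u1 near 1) approach zero, matching the description above.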

​

Pass 3: Lighting Importance Sampling

  • Builds a per-probe PDF from the previous frame's radiance history. For each of 64 directions, samples the octahedral history texture and computes perceptual luminance.

  • The resulting PDF concentrates samples where light was actually found last frame, reducing variance in regions of high-frequency indirect illumination. Minimum probability clamped to 0.001 to keep all directions reachable.

​

Pass 4: MIS Sample Direction Generation

  • Combines BRDF and lighting PDFs using the MIS balance heuristic, reducing variance when either PDF alone would be a poor importance function.

  • Sub-pixel jitter from a blue noise texture is applied per frame for temporal anti-aliasing. Each ray direction is encoded with its MIS weight for use in Pass 7.
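The balance heuristic itself is a one-liner; this sketch shows the weight and how it might fold into the per-ray throughput (function names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// MIS balance heuristic: for a direction drawn from strategy A, the weight is
// pdfA / (pdfA + pdfB), where pdfB is the density the other strategy assigns
// to the same direction. Weights across both strategies sum to 1.
float BalanceHeuristic(float pdfA, float pdfB) {
    return pdfA / (pdfA + pdfB);
}

// Combined factor applied to a radiance sample drawn from pdfSample: the MIS
// weight divided by the sampling PDF, as folded into the ray payload.
float MISWeightedThroughput(float pdfSample, float pdfOther) {
    return BalanceHeuristic(pdfSample, pdfOther) / pdfSample;
}
```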

​

Pass 5: Mesh SDF Ray Trace (Near-Field)

  • Sphere traces each probe's 64 rays against per-mesh SDFs up to 100 world units. On hit, queries the Card BVH to find the owning surface card and samples Layer 5 of the surface cache at the intersection UV.

  • Rays that exit mesh SDF range without a hit are flagged for far-field processing.

​

Pass 6: Voxel SDF Ray Trace (Far-Field)

  • Cone traces flagged rays against the global scene SDF from 100 to 500 world units. Cone angle widens with distance (0.1 radians), providing softer, more stable results for distant geometry.

  • Hits sample the voxel lighting volume (64³ RGBA16F); misses sample the skybox.

​

Pass 7: Octahedral Radiance Composite

  • Merges near-field and far-field results into each probe's 8×8 octahedral texture. Selects the closer hit between the two passes, applies the MIS weight from Pass 4, and writes to the current frame's octahedral atlas. Octahedral mapping was chosen over cubemaps for its freedom from face seams, native bilinear filtering, and ~33% better texel efficiency.
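The octahedral mapping is the standard square-to-sphere fold; a C++ round-trip sketch (the lower hemisphere is folded over the square's diagonals, which is what removes cubemap-style face seams):

```cpp
#include <cassert>
#include <cmath>

// Standard octahedral direction encoding: a unit direction folds onto a
// [0,1]^2 square and back, used here as a sketch of the 8x8 per-probe
// radiance texture parameterization.
struct Dir { float x, y, z; };

// Unit direction -> octahedral UV in [0,1]^2.
void OctEncode(Dir d, float* u, float* v) {
    float norm = std::fabs(d.x) + std::fabs(d.y) + std::fabs(d.z);
    float ox = d.x / norm, oy = d.y / norm;
    if (d.z < 0.0f) {                          // fold the lower hemisphere
        float tx = (1.0f - std::fabs(oy)) * (ox >= 0.0f ? 1.0f : -1.0f);
        float ty = (1.0f - std::fabs(ox)) * (oy >= 0.0f ? 1.0f : -1.0f);
        ox = tx; oy = ty;
    }
    *u = ox * 0.5f + 0.5f;
    *v = oy * 0.5f + 0.5f;
}

// Octahedral UV -> unit direction.
Dir OctDecode(float u, float v) {
    float ox = u * 2.0f - 1.0f, oy = v * 2.0f - 1.0f;
    float oz = 1.0f - std::fabs(ox) - std::fabs(oy);
    if (oz < 0.0f) {                           // unfold the lower hemisphere
        float tx = (1.0f - std::fabs(oy)) * (ox >= 0.0f ? 1.0f : -1.0f);
        float ty = (1.0f - std::fabs(ox)) * (oy >= 0.0f ? 1.0f : -1.0f);
        ox = tx; oy = ty;
    }
    float len = std::sqrt(ox*ox + oy*oy + oz*oz);
    return {ox / len, oy / len, oz / len};
}
```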

​

Pass 8: Temporal Accumulation

  • Exponential moving average blending: 90% history, 10% current frame.

  • Over approximately 10 accumulated frames, noise (standard deviation) falls by roughly √10. A firefly suppression clamp (incoming radiance capped at 2× history value) prevents single-frame bright outliers from persisting. Disocclusion is handled by reducing the history weight when probe world positions shift significantly between frames.
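The per-texel blend can be sketched in a few lines of C++; note that a real implementation also needs a disocclusion path that resets or bypasses the history (and its clamp) when the history is invalid:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Sketch of the per-texel temporal accumulation: exponential moving average
// (90% history / 10% current) with the firefly clamp described in the text,
// capping incoming radiance at 2x the history value before blending.
float TemporalAccumulate(float history, float current, float historyWeight = 0.9f) {
    // Firefly suppression: a single-frame outlier cannot exceed 2x history.
    float clamped = std::min(current, 2.0f * history);
    return history * historyWeight + clamped * (1.0f - historyWeight);
}
```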

​

Pass 9: Spatial Filtering

  • 4-neighbor cross-bilateral filter over the probe grid. Two-factor weight: depth similarity (exponential falloff on plane distance) and normal alignment (power function).

  • Center sample has weight 1.0; each of the 4 axis-aligned neighbors contributes with its bilateral weight. Preserves lighting edges at geometry boundaries while smoothing noise within continuous surfaces.
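The two-factor weight might look like this in C++; the falloff constants here are assumptions for illustration, not the thesis values:

```cpp
#include <cassert>
#include <cmath>

// Sketch of the two-factor cross-bilateral weight: depth similarity via an
// exponential falloff on depth difference, and normal alignment via a power
// function. kDepthSigma and kNormalPower are assumed constants.
float BilateralWeight(float centerDepth, float neighborDepth,
                      const float centerN[3], const float neighborN[3]) {
    const float kDepthSigma = 0.1f;            // assumed falloff scale
    const float kNormalPower = 8.0f;           // assumed sharpness

    float depthW = std::exp(-std::fabs(neighborDepth - centerDepth) / kDepthSigma);
    float nDot = centerN[0]*neighborN[0] + centerN[1]*neighborN[1] + centerN[2]*neighborN[2];
    float normalW = std::pow(std::fmax(nDot, 0.0f), kNormalPower);
    return depthW * normalW;                   // near-zero across geometry edges
}
```

Because both factors collapse to zero across depth or normal discontinuities, neighbors across a geometry boundary contribute almost nothing, which is what preserves lighting edges.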

​

Pass 9B: Octahedral Irradiance (SH Low-Pass)

  • Projects each probe's filtered radiance into L2 spherical harmonics (9 coefficients per channel), then reconstructs irradiance from the SH representation. This acts as a low-pass filter, suppressing temporal noise while preserving the dominant lighting direction.

  • A 1/π firefly clamp is applied after reconstruction.

​

Pass 10: Screen-Space Final Gather

  • Interpolates filtered probe radiance to full 1920×1080 resolution. Each pixel finds its 4 nearest probes, computes bilinear weights modulated by normal similarity, samples the filtered octahedral atlas in the pixel's normal direction, and accumulates a weighted sum. Sky pixels output zero indirect lighting.

​

Pass 11: Screen-Space Temporal Filter

  • Motion-aware temporal reprojection on the final full-resolution indirect lighting texture. Blends current-frame results with reprojected history using a 1/π luminance clamp for firefly suppression. Disoccluded pixels fall back to the current frame only.

Fig07_ScreenProbeGrid_Octahedral.png
image.png

Probe Buffer

image.png

BRDF Output

Final Composite Pass

Deferred lighting pass combining all computed data into the final image:

Output = DirectLighting(GBuffer, ShadowMap) + IndirectLighting × Albedo

​

  • Reads the GBuffer, shadow map, depth buffer, and the full-resolution indirect lighting texture from Pass 10. Computes sun direct lighting via Lambert BRDF with PCF shadow lookup. Adds the screen-space indirect result. A subsequent forward pass renders debug overlays and diagnostic entities on top of the composited frame without interfering with the deferred pipeline.

DirectOnly.png

Direct Light Only

ColorBleeding.png

Color Bleeding

Final.png

Final Light

PLShadow.png

Point Light Shadow

GI Visualization System

A dedicated visualization subsystem exposes the internal state of every GI pass for debugging, iteration, and demonstration. Selected at runtime via an ImGui panel, each mode redirects the final output to display an intermediate buffer instead of the composited image — invaluable when debugging a pipeline where a single miswired descriptor can produce plausible-looking but wrong results 10 passes downstream.​

The visualization system is a separate compute/rasterization pipeline (`GIVisualization` class) that runs after the main frame is rendered and overwrites the back buffer if a debug mode is active. It uses two execution paths:

  • Fullscreen compute for modes that sample from a resource (GBuffer, voxel volumes, screen probe textures) and write to a full-screen UAV

  • Instanced VS/PS rasterization for surface-cache-space modes, which render the atlas content onto scene geometry using world-to-atlas UV mapping

All modes share a common `VisualizationConstants` buffer containing camera matrices, light parameters, atlas dimensions, probe grid config, and voxel/SDF volume parameters — so adding a new mode only requires authoring one HLSL entry point.

​

Why this matters

Software GI pipelines fail silently: a wrong root signature slot, a missed resource barrier, or an SH sign flip produces output that's dim or noisy rather than visibly broken. Having a mode switch that shows each intermediate buffer directly turned multi-day debugging sessions into minutes — you see the bad pass, fix the bug, verify in the same panel, move on. The system is also the basis for every figure in the thesis: every "internal state" screenshot comes from this pipeline.

Other Solo Projects

RedCraft

ChessSoul
