Real-Time Photo Finishing in a Browser Tab: A Multi-Pass WebGPU Pipeline

A photo finisher is a slider rack. Exposure for global brightness. Contrast for the difference between highlights and shadows. Saturation for color intensity. Warmth for color temperature. Gamma for mid-tone shaping. Blur for atmosphere. Sharpen for crispness. Vignette for focus. Grain for texture. Nine knobs, each one cheap to evaluate but combined into a finishing chain that has to render now, not after a server round-trip.

This post walks through how a multi-pass WebGPU pipeline implements that whole rack — what each pass does, why some operations have to be separated, and what the performance budget actually looks like in a browser tab.

Why Multi-Pass

Some image operations can be done in a single pass: read each input pixel, transform it, write the output. Exposure, contrast, saturation, warmth, gamma, invert — all of these are pointwise operations. Each output pixel depends only on the corresponding input pixel. They can all be combined into a single fragment shader.

Other operations are neighborhood operations: each output pixel depends on multiple input pixels. Blur, sharpen, edge detect, convolution of any kind. These can't be cheaply fused with the pointwise operations that precede them: a blur needs the tone-adjusted value of every neighbor, so a fused shader would redo the pointwise math for every tap, which would explode the cost.

And wide neighborhood operations have an even nastier property: done naively, their cost grows with the square of the radius. A 2D Gaussian blur with radius 10 would cost 21×21 = 441 texture samples per output pixel. But Gaussian blur happens to be separable: it can be decomposed into a horizontal 1D blur followed by a vertical 1D blur, each costing 21 samples. Two passes of 21 samples is 42, roughly a tenth of the 441 samples a single 2D pass would need.

So the pipeline shape is:

Tone pass. Pointwise: exposure, contrast, saturation, warmth, gamma. One fragment shader, one output texture.
Horizontal blur pass. 1D Gaussian, sampling along x. Input is the tone pass output.
Vertical blur pass. Same 1D Gaussian, sampling along y. Input is the horizontal blur output.
Finishing pass. Pointwise + small neighborhood: 5-tap sharpen, radial vignette, hashed grain. Reads the blurred image, writes the final result.

Four passes total. Each pass is a fragment shader writing to its own texture; the next pass reads from that texture. The intermediate textures stay in GPU memory and are reused, not reallocated, each time the user moves a slider.
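The chain can be written down as plain data before any GPU code exists. The TypeScript sketch below uses hypothetical names (PassSpec, FINISHING_CHAIN, isLinearChain, and the texture labels); none of them come from the pipeline described here. It only checks that each pass reads what the previous pass wrote.

```typescript
// Hypothetical description of the four-pass chain; all labels are illustrative.
interface PassSpec {
  label: string;
  reads: string;  // texture the fragment shader samples
  writes: string; // texture the pass renders into
}

const FINISHING_CHAIN: PassSpec[] = [
  { label: "tone", reads: "source", writes: "toneOut" },
  { label: "blur-h", reads: "toneOut", writes: "blurHOut" },
  { label: "blur-v", reads: "blurHOut", writes: "blurVOut" },
  { label: "finish", reads: "blurVOut", writes: "final" },
];

// True when every pass samples the texture written by the pass before it.
function isLinearChain(passes: PassSpec[]): boolean {
  return passes.every((pass, i) => {
    if (i === 0) return true;
    const prev = passes[i - 1];
    return prev !== undefined && pass.reads === prev.writes;
  });
}
```

Keeping the chain as data makes it easy to see which textures exist and which pass owns each one.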

The Tone Pass

Exposure, contrast, saturation, warmth, and gamma all fit in a single fragment shader because they're pointwise. Each operation is a one-line transformation:

// Read the input pixel
var color = textureSample(input, samp, uv).rgb;
 
// Exposure (power-of-two multiplier; -3 stops to +3 stops)
color = color * pow(2.0, exposure);
 
// Contrast (around mid-gray 0.5)
color = (color - 0.5) * contrast + 0.5;
 
// Warmth (shift red and blue channels)
color = vec3<f32>(color.r + warmth, color.g, color.b - warmth);
 
// Saturation (mix toward Rec. 709 luminance)
let luma = dot(color, vec3<f32>(0.2126, 0.7152, 0.0722));
color = mix(vec3<f32>(luma), color, saturation);
 
// Gamma (per-channel power curve)
color = pow(max(color, vec3<f32>(0.0)), vec3<f32>(1.0 / gamma));
 
return vec4<f32>(clamp(color, vec3<f32>(0.0), vec3<f32>(1.0)), 1.0);

The order matters. Exposure first (because it's a linear-domain operation), then contrast and color shifts, then gamma last (because gamma is a perceptual mapping that the eye expects to be applied after linear adjustments). Different orderings produce visually different results; the order above is the conventional photographic one.
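A CPU reference for the tone pass is handy for checking the shader against known values. The sketch below mirrors the same steps in the same order in TypeScript; the names tonePixel, ToneParams, and NEUTRAL_TONE are assumptions made for this example, and the "neutral" parameter values are inferred from the formulas rather than taken from any UI.

```typescript
type RGB = [number, number, number];

interface ToneParams {
  exposure: number;   // stops, 0 = unchanged
  contrast: number;   // 1 = unchanged
  warmth: number;     // 0 = unchanged
  saturation: number; // 1 = unchanged
  gamma: number;      // 1 = unchanged
}

const NEUTRAL_TONE: ToneParams = { exposure: 0, contrast: 1, warmth: 0, saturation: 1, gamma: 1 };

const clamp01 = (v: number): number => Math.min(1, Math.max(0, v));

// Same order as the shader: exposure, contrast, warmth, saturation, gamma, clamp.
function tonePixel(input: RGB, p: ToneParams): RGB {
  const gain = Math.pow(2, p.exposure);
  let r = input[0] * gain;
  let g = input[1] * gain;
  let b = input[2] * gain;

  r = (r - 0.5) * p.contrast + 0.5;
  g = (g - 0.5) * p.contrast + 0.5;
  b = (b - 0.5) * p.contrast + 0.5;

  r += p.warmth;
  b -= p.warmth;

  // mix(luma, channel, saturation) with Rec. 709 weights
  const luma = 0.2126 * r + 0.7152 * g + 0.0722 * b;
  r = luma + (r - luma) * p.saturation;
  g = luma + (g - luma) * p.saturation;
  b = luma + (b - luma) * p.saturation;

  const invGamma = 1 / p.gamma;
  r = Math.pow(Math.max(r, 0), invGamma);
  g = Math.pow(Math.max(g, 0), invGamma);
  b = Math.pow(Math.max(b, 0), invGamma);

  return [clamp01(r), clamp01(g), clamp01(b)];
}
```

With neutral parameters the function returns its input, and a gray of 0.25 pushed by one stop lands at roughly 0.5, which gives two quick sanity checks for the GPU version.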

The Separable Gaussian Blur

A 2D Gaussian kernel has weights G(x, y) = exp(-(x² + y²) / (2σ²)). The 2D kernel can be written as a product: G(x, y) = G(x) · G(y), where G(t) = exp(-t² / (2σ²)). This factoring lets you blur in x, then blur the result in y, and get the same answer as a single 2D blur, but in O(r) samples per pixel instead of O(r²).

For a radius-10 blur, the kernel size is 21 (one tap on the center pixel plus 10 on each side). The horizontal pass samples 21 pixels along the row; the vertical pass samples 21 pixels along the column. Total: 42 samples per output pixel. A naive 2D blur would be 441 samples.
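The factoring claim is easy to check numerically. The TypeScript sketch below (function names are made up for this example) compares the 2D weight against the product of two 1D weights over a whole kernel and counts the samples each strategy needs.

```typescript
// Unnormalized 1D Gaussian weight, the same formula the shader helper uses.
function gauss1d(offset: number, sigma: number): number {
  return Math.exp(-(offset * offset) / (2 * sigma * sigma));
}

// Largest difference between G(x, y) and G(x) * G(y) over a (2r+1) x (2r+1) kernel.
function maxFactoringError(radius: number, sigma: number): number {
  let worst = 0;
  for (let y = -radius; y <= radius; y++) {
    for (let x = -radius; x <= radius; x++) {
      const direct = Math.exp(-(x * x + y * y) / (2 * sigma * sigma));
      const product = gauss1d(x, sigma) * gauss1d(y, sigma);
      worst = Math.max(worst, Math.abs(direct - product));
    }
  }
  return worst;
}

// Texture samples per output pixel for each strategy.
function sampleCounts(radius: number): { naive2d: number; separable: number } {
  const taps = 2 * radius + 1;
  return { naive2d: taps * taps, separable: 2 * taps };
}
```

The error stays at floating-point rounding level, which is the whole justification for splitting the blur into two passes.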

The weight calculation in the shader:

fn gaussianWeight(offset: f32, sigma: f32) -> f32 {
  return exp(-(offset * offset) / (2.0 * sigma * sigma));
}

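The weights must be normalized so they sum to 1 (otherwise the blur changes the image's average brightness). The function above returns unnormalized weights, so in practice the shader either accumulates the weight sum as it samples and divides by it at the end, or multiplies in the analytic factor 1/(σ√(2π)) and relies on the tap radius being large enough relative to σ (around 3σ) that the truncation error is negligible.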
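One way to handle normalization is to compute the weights on the CPU and divide by their sum before they reach the shader. The sketch below is a TypeScript illustration of that idea, not necessarily how this pipeline does it; normalizedGaussianWeights is a hypothetical helper.

```typescript
// Returns 2 * radius + 1 weights that sum to 1, with the center tap in the middle.
function normalizedGaussianWeights(radius: number, sigma: number): number[] {
  // No blur requested: pass the center pixel through and avoid dividing by zero.
  if (radius <= 0 || sigma <= 0) return [1];

  const raw: number[] = [];
  for (let i = -radius; i <= radius; i++) {
    raw.push(Math.exp(-(i * i) / (2 * sigma * sigma)));
  }
  const total = raw.reduce((sum, w) => sum + w, 0);
  return raw.map((w) => w / total);
}
```

Dividing by the actual sum keeps brightness constant no matter where the kernel is truncated, which the analytic constant alone does not guarantee.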

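The radius slider in the UI exposes σ directly, and the same value sets the tap radius, so the kernel has 2σ + 1 taps. σ = 0 produces no blur (the shader has to special-case it, because the weight formula divides by σ²). σ = 5 produces a soft 11-tap blur. σ = 20 produces a heavy 41-tap blur. Cutting the kernel off at ±σ drops a noticeable part of the Gaussian tail, so normalizing by the actual weight sum matters here. Beyond σ ≈ 30 the performance starts to drop on integrated GPUs, but for finishing-pass blur work, σ ≤ 10 covers most aesthetic uses.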

The Finishing Pass

The last pass combines three small operations:

Sharpen. A five-tap unsharp mask: subtract the average of the four immediate neighbors from the center pixel, scale the difference by an amount parameter, and add it back to the center. This produces a high-frequency boost that visually enhances edges.

let center = textureSample(input, samp, uv).rgb;
let left   = textureSample(input, samp, uv + vec2(-texel.x, 0.0)).rgb;
let right  = textureSample(input, samp, uv + vec2( texel.x, 0.0)).rgb;
let up     = textureSample(input, samp, uv + vec2(0.0, -texel.y)).rgb;
let down   = textureSample(input, samp, uv + vec2(0.0,  texel.y)).rgb;
let blur4  = (left + right + up + down) * 0.25;
color = center + (center - blur4) * sharpen;
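To see what the five taps do to real values, here is a CPU version in TypeScript that runs on a small grayscale grid. It is a sketch: sharpen5 is a hypothetical name, and clamp-to-edge addressing at the borders is an assumption about how the sampler is configured.

```typescript
// Five-tap sharpen on a row-major grayscale image.
// Out-of-range coordinates reuse the nearest valid pixel (clamp-to-edge).
function sharpen5(src: number[], width: number, height: number, amount: number): number[] {
  const at = (x: number, y: number): number => {
    const cx = Math.min(width - 1, Math.max(0, x));
    const cy = Math.min(height - 1, Math.max(0, y));
    return src[cy * width + cx] ?? 0;
  };

  const out: number[] = [];
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const center = at(x, y);
      const blur4 = (at(x - 1, y) + at(x + 1, y) + at(x, y - 1) + at(x, y + 1)) * 0.25;
      out.push(center + (center - blur4) * amount);
    }
  }
  return out;
}
```

Flat regions pass through unchanged, while a hard 0-to-1 step overshoots on both sides of the edge (below 0 on the dark side, above 1 on the bright side), so the result needs clamping before it is displayed.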

Vignette. A radial darkening centered on the image. The amount of darkening at each pixel depends on its distance from the center:

let dist = length(uv - vec2(0.5)) * 2.0;  // 0 at center, ~1.4 at corners
let fade = 1.0 - smoothstep(0.5, 1.0, dist) * vignette;
color *= fade;

The smoothstep is the key — it gives a smooth falloff from full brightness in the center to darkened corners. A linear falloff produces a harder vignette ring; smoothstep produces the photographic feel.
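The falloff can be tabulated on the CPU to see where the darkening starts and stops. This TypeScript sketch reimplements smoothstep with its standard cubic definition; vignetteFade is a hypothetical name for the three shader lines above.

```typescript
// GLSL/WGSL-style smoothstep: 0 below edge0, 1 above edge1, cubic Hermite in between.
function smoothstep(edge0: number, edge1: number, x: number): number {
  const t = Math.min(1, Math.max(0, (x - edge0) / (edge1 - edge0)));
  return t * t * (3 - 2 * t);
}

// Brightness multiplier for a pixel at normalized coordinates (u, v) in [0, 1].
function vignetteFade(u: number, v: number, vignette: number): number {
  const dist = Math.hypot(u - 0.5, v - 0.5) * 2; // 0 at center, about 1.41 at corners
  return 1 - smoothstep(0.5, 1.0, dist) * vignette;
}
```

Because the smoothstep saturates at dist = 1, everything from the edge midpoints out to the corners gets the full darkening of 1 - vignette, and the inner half of the frame is left untouched.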

Grain. A pseudo-random per-pixel noise added to the output. Generated with a simple hash function:

fn hash(p: vec2<f32>) -> f32 {
  return fract(sin(dot(p, vec2(12.9898, 78.233))) * 43758.5453);
}
let n = (hash(uv * resolution) - 0.5) * grain;
color += vec3<f32>(n);

This is a classic GLSL one-liner that produces a noise pattern that's deterministic per pixel position but visually random. For animated film grain you'd add a time uniform to perturb the input; for static finishing the static hash is enough.
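The same hash can be run on the CPU to confirm it is deterministic and bounded. The TypeScript sketch below uses hypothetical names (hash2, grainOffset); note that JavaScript computes in 64-bit floats, so individual values will not match a GPU's 32-bit results even though the structure is the same.

```typescript
// fract(x) as in GLSL/WGSL: x minus its floor.
const fract = (x: number): number => x - Math.floor(x);

// CPU version of the shader hash for a pixel position (px, py).
function hash2(px: number, py: number): number {
  return fract(Math.sin(px * 12.9898 + py * 78.233) * 43758.5453);
}

// Grain offset for one pixel: noise centered on zero, scaled by the grain amount.
function grainOffset(px: number, py: number, grain: number): number {
  return (hash2(px, py) - 0.5) * grain;
}
```

Every offset stays within half the grain amount of zero, so a grain setting of 0.1 can move a channel by at most 0.05 in either direction.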

What's Happening in Memory

Each pass writes to a GPUTexture with format rgba8unorm, 4 bytes per pixel. For a 12-megapixel image, each texture is 48 MB. The pipeline has three intermediate textures (tone output, horizontal blur output, vertical blur output) plus the final output, so the chain holds about 192 MB, or roughly 240 MB once the full-resolution source texture that the tone pass reads is counted.
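The arithmetic is worth keeping as a helper so it can be rerun for other image sizes and formats. This TypeScript sketch assumes a 4000×3000 image as the 12-megapixel example and uses decimal megabytes; both function names are hypothetical.

```typescript
// Bytes for one 2D texture with no mip levels.
function textureBytes(width: number, height: number, bytesPerPixel: number): number {
  return width * height * bytesPerPixel;
}

// Total for `count` same-sized textures, in decimal megabytes.
function chainMegabytes(width: number, height: number, bytesPerPixel: number, count: number): number {
  return (textureBytes(width, height, bytesPerPixel) * count) / 1e6;
}
```

For example, chainMegabytes(4000, 3000, 4, 4) covers the four rgba8unorm textures in the chain, and passing 8 bytes per pixel gives the figure for rgba16float.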

This is well within the budget of any modern GPU. Integrated graphics on laptops typically have 1–2 GB of shared memory available to WebGPU; discrete GPUs have many gigabytes. Even a phone GPU can comfortably handle a 12-megapixel finishing pipeline.

The textures are allocated once when the image loads, not per-slider-drag. Slider changes only update the uniform buffer (a small GPUBuffer containing the parameter values) and re-issue the render commands. The textures themselves stay allocated for the lifetime of the editing session.
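A slider change then boils down to repacking a handful of floats. The sketch below shows one possible layout in TypeScript: nine f32 values padded to 64 bytes. The field order, the FinishParams name, and packUniforms are assumptions for illustration, and the matching WGSL struct would have to declare its members in the same order.

```typescript
interface FinishParams {
  exposure: number;
  contrast: number;
  saturation: number;
  warmth: number;
  gamma: number;
  blur: number;
  sharpen: number;
  vignette: number;
  grain: number;
}

// Hypothetical layout: nine f32 sliders, then zero padding up to 16 floats (64 bytes).
function packUniforms(p: FinishParams): Float32Array {
  const data = new Float32Array(16);
  data.set([
    p.exposure, p.contrast, p.saturation, p.warmth, p.gamma,
    p.blur, p.sharpen, p.vignette, p.grain,
  ]);
  return data;
}

// In a browser with a GPUDevice, the packed array would be uploaded with
// device.queue.writeBuffer(uniformBuffer, 0, data) before re-encoding the passes.
```

Since nothing else changes between frames, the per-drag cost on the CPU side is this one small array plus the command encoding.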

Latency Budget

A finishing pipeline that updates on slider-drag needs to maintain interactivity. The target is one frame per slider position — about 16ms at 60Hz.

Where does the 16ms budget go?

Slider event → JavaScript handler: under 1ms.
Uniform buffer update: under 1ms. A 64-byte buffer write is essentially free.
Command encoding (build the four passes): ~1ms. WebGPU command encoding is fast.
Submit to GPU and wait: ~5–10ms for a 12MP image on integrated graphics, 1–2ms on a discrete GPU.
Present to canvas: ~1ms.

Total: at most about 14ms on integrated graphics and about 6ms on a discrete GPU, both inside the 16ms budget. On low-end integrated GPUs with high-resolution input, the pipeline can drop to 30fps interaction, which is still acceptable. The user perceives "the slider moves and the image changes" as a single event.
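Adding up the stages makes the headroom explicit. The TypeScript sketch below encodes the integrated-graphics figures from the list above, treating "under 1ms" as a 0 to 1 ms range; the type and constant names are made up for this example.

```typescript
interface BudgetStage {
  label: string;
  minMs: number;
  maxMs: number;
}

const FRAME_BUDGET_MS = 1000 / 60; // about 16.7 ms at 60 Hz

// Figures for a 12MP image on integrated graphics, taken from the list above.
const INTEGRATED_12MP: BudgetStage[] = [
  { label: "slider event", minMs: 0, maxMs: 1 },
  { label: "uniform update", minMs: 0, maxMs: 1 },
  { label: "command encoding", minMs: 1, maxMs: 1 },
  { label: "GPU execution", minMs: 5, maxMs: 10 },
  { label: "present", minMs: 1, maxMs: 1 },
];

// Sum the best and worst case across all stages.
function totalRange(stages: BudgetStage[]): { minMs: number; maxMs: number } {
  return stages.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}
```

The worst case leaves only a couple of milliseconds of slack, which is why larger images or slower GPUs tip the interaction down to 30fps.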

Quality Considerations

A few things to watch for in a multi-pass pipeline:

Quantization. Each pass writes to rgba8unorm, which quantizes each channel to 8 bits. Four passes of quantization produce more rounding error than one pass. For finishing work where the final output is also 8-bit, this is usually invisible, though heavy adjustments to smooth gradients can show banding. For HDR editing or scientific imagery where every bit matters, use rgba16float for intermediates and pay the 2× memory cost.
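A deliberately exaggerated case shows how intermediate quantization loses information. The TypeScript sketch below darkens a gray ramp by 10× in one pass and brightens it by 10× in the next, with and without an 8-bit round trip in between; the function names are hypothetical and the round-to-nearest behavior is a simplification of what an rgba8unorm write does.

```typescript
// Round a value to the nearest 8-bit level after clamping to [0, 1].
function quantize8(v: number): number {
  const clamped = Math.min(1, Math.max(0, v));
  return Math.round(clamped * 255) / 255;
}

// Darken by 10x, then brighten by 10x, and count how many gray levels survive.
function distinctLevels(quantizeBetweenPasses: boolean): number {
  const seen = new Set<number>();
  for (let level = 0; level <= 255; level++) {
    let v = (level / 255) * 0.1;                 // pass 1: darken
    if (quantizeBetweenPasses) v = quantize8(v); // 8-bit intermediate texture
    v = v * 10;                                  // pass 2: brighten
    seen.add(Math.round(quantize8(v) * 255));    // final 8-bit output
  }
  return seen.size;
}
```

With a float intermediate all 256 input levels come back; with an 8-bit intermediate only a few dozen do. Real finishing chains are far gentler than this, but the mechanism is the same.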

Gamma space vs. linear space. Most consumer images are stored in gamma-corrected sRGB. Doing tone operations directly on gamma-corrected values produces subtle hue shifts in highlights. The "correct" pipeline linearizes on read, processes in linear, and gamma-encodes on write. For finishing work, the gamma-space pipeline is usually fine; the shifts are within taste rather than error.
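The difference between the two approaches shows up with a single gray value. The TypeScript sketch below uses the standard sRGB transfer functions; exposeGammaSpace and exposeLinearSpace are hypothetical helpers that apply the same exposure change in each space.

```typescript
// sRGB transfer functions (IEC 61966-2-1), per channel, values in [0, 1].
function srgbToLinear(c: number): number {
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function linearToSrgb(c: number): number {
  return c <= 0.0031308 ? c * 12.92 : 1.055 * Math.pow(c, 1 / 2.4) - 0.055;
}

// Exposure applied directly to the encoded value.
function exposeGammaSpace(encoded: number, stops: number): number {
  return Math.min(1, encoded * Math.pow(2, stops));
}

// Exposure applied in linear light: decode, scale, re-encode.
function exposeLinearSpace(encoded: number, stops: number): number {
  return linearToSrgb(Math.min(1, srgbToLinear(encoded) * Math.pow(2, stops)));
}
```

For an encoded gray of 0.25 and one stop of exposure, the gamma-space version returns 0.5 while the linear version returns roughly 0.35, so the same slider value reads as a noticeably stronger adjustment in gamma space. WebGPU also offers sRGB texture formats such as rgba8unorm-srgb that do the decode and encode in hardware, which may be simpler than doing it in the shader.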

Clamping. Aggressive exposure or contrast adjustments can push channel values outside [0, 1]. An rgba8unorm target clamps on write no matter what the shader does; with float intermediates, unclamped values reach downstream passes, where a negative input to pow (as in the gamma step) has no well-defined result. With clamping, you lose information that could be recovered by a later adjustment. The right choice depends on whether the pipeline is one-shot finishing (clamp at the end) or part of a longer chain (keep float range).

Comparison to CSS Filters

Modern browsers expose a filter CSS property that includes blur, contrast, brightness, saturate, drop-shadow, and a few others. Why not use it?

CSS filters are convenient but limited:

No fine control. The filter values are scalars; you can't compose custom operations.
No multi-pass logic. The chaining is fixed; you can't insert a custom pass.
No direct pixel access. The filtered image isn't trivially exportable. You can't canvas.toBlob() a CSS-filtered <img>: drawing it into a canvas copies the unfiltered source pixels, so the filter is lost.
Performance is implementation-defined. Some browsers implement CSS filters on the GPU; some don't. Behavior varies.

For preview-only effects on existing DOM elements, CSS filters are fine. For an export-quality photo finishing pipeline, a custom WebGPU pipeline gives you control over every pass, predictable performance, and direct access to the output pixels for PNG encoding.

What This Enables

Real-time GPU image processing in a browser tab moves a whole category of tools from "needs a server" to "runs locally":

Photo finishers no longer need cloud GPUs.
Scientific image inspection (histograms, levels, channel views) becomes interactive on any size of image.
AI inference (model in WebGPU, post-processing in WebGPU) becomes a single-pipeline operation.
Live video filters run at full framerate without a worker thread or WASM intermediate.

The Utilora WebGPU Filter Studio is the worked-example version of this pipeline. Nine sliders, four-pass shader chain, real-time response, zero upload. Edit phone photos, finish screenshots, normalize figure exposure for a paper — without leaving the browser tab.

Conclusion

A multi-pass WebGPU pipeline turns photo finishing from a server operation into a slider drag. The math is conventional — exposure, contrast, separable Gaussian blur, sharpen, vignette, grain are all techniques from the 1980s and 1990s. The novelty is implementing them as a chain of fragment shaders that share GPU memory and re-render in milliseconds.

For practical work, try Filter Studio for general finishing, Edge Detect for line-art extraction, Image Histogram & Levels when you want to see the tonal distribution and adjust black/white points explicitly, and Color Transform for quick CSS-level adjustments that don't require WebGPU support.
