Unified Diffusion Algorithm

The structure follows the same pattern. The training step is arguably the simplest of all four files — it’s literally “add noise, predict something, MSE.” The complexity lives entirely in sampling.

The key insight this file tries to make clear:

Training is identical across DDPM, DDIM, and v-prediction. DDPM and DDIM have the exact same training — both predict ε with MSE loss. They only differ at sampling time. This is surprisingly under-appreciated. You can train once and sample with either method.
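A minimal NumPy sketch of that shared training step (the names `model` and `alpha_bar` are placeholders, not from the source code):

```python
import numpy as np

def train_step(model, x0, alpha_bar, rng):
    # Identical for DDPM and DDIM: add noise, predict eps, MSE.
    T = len(alpha_bar)
    t = int(rng.integers(0, T))                       # sample random timestep
    eps = rng.standard_normal(x0.shape)               # sample noise
    ab = alpha_bar[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps    # forward process (add noise)
    pred = model(x_t, t)                              # network predicts eps
    return np.mean((eps - pred) ** 2)                 # MSE loss
```

A checkpoint trained this way can be sampled with either the DDPM or the DDIM rule; nothing in the loss ties it to one sampler.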

The three prediction targets (ε, x_0, v) are mathematically interchangeable. The “RELATIONSHIP BETWEEN PREDICTION TARGETS” section at the bottom shows the algebra — you can convert freely between them. The choice matters only because it changes the loss landscape: ε-prediction gives a loss dominated by high-noise timesteps, v-prediction balances the difficulty evenly, and x_0-prediction emphasises low-noise timesteps.

Classifier-free guidance is the critical practical addition. It’s what turned diffusion from “nice research” into “DALL-E / Stable Diffusion.” The trick — train with random conditioning dropout, then extrapolate away from unconditional at inference — is elegant and doesn’t fit neatly into the core algorithm’s pluggable methods, so it’s a separate composable component (same way the EMA is).
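The inference-time half of the trick is one line. A sketch, assuming the model has already been queried twice per step (once with the condition, once with it dropped):

```python
import numpy as np

def cfg_epsilon(eps_cond, eps_uncond, w):
    # Extrapolate away from the unconditional prediction:
    # w = 1 recovers the conditional model, w > 1 amplifies the condition.
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Because it only rewrites the predicted noise, it composes with any of the samplers below without touching their internals.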

The progression tells a similar story to the other files: DDPM solves the core problem (stable training, full distribution coverage), DDIM solves speed, v-prediction solves numerical stability, and classifier-free guidance solves controllability. Each one changes exactly one piece.

Summary: What changes vs. what stays the same

Section titled “Summary: What changes vs. what stays the same”
Training (identical across all variants):

  • Sample random timestep t
  • x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε (add noise)
  • loss = ‖target − model(x_t, t)‖² (MSE loss)
  • Gradient step

Sampling:

  • Start from x_T ~ N(0, I)
  • Loop: x_t → x_{t−1} via denoise_step() (PLUGGABLE)
  • Return x_0
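The sampling loop above can be sketched generically; `denoise_step` stands in for the pluggable method (hypothetical signature, not the source's exact API):

```python
import numpy as np

def sample(denoise_step, shape, T, rng):
    # Generic reverse loop. Everything variant-specific (DDPM's sigma*z noise,
    # DDIM's eta=0 determinism, v-prediction's target conversion) lives
    # entirely inside denoise_step.
    x = rng.standard_normal(shape)        # x_T ~ N(0, I)
    for t in reversed(range(T)):          # x_t -> x_{t-1}
        x = denoise_step(x, t)
    return x                              # x_0
```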
| Variant | Prediction target | Sampling | Speed |
| --- | --- | --- | --- |
| DDPM | noise ε | Stochastic (+ σ·z) | T steps (slow) |
| DDIM | noise ε | Deterministic (η=0) | ~50 steps (fast) |
| v-prediction | velocity v | Deterministic | ~50 steps (fast) |
| Variant | Problem solved | Intuition for solution |
| --- | --- | --- |
| DDPM | GANs are unstable to train (mode collapse, training oscillation) and don’t cover the full data distribution | Instead of adversarial training, learn to reverse a simple noise process. Training is just MSE on noise prediction — as stable as any regression. Covers all modes because the forward process does |
| DDIM | DDPM needs T=1000 sequential steps to generate one sample — extremely slow | Rewrite the reverse process as a non-Markovian chain that gives the same marginals but allows skipping steps. 50 steps ≈ 1000-step quality. Determinism enables interpolation and inversion |
| v-prediction | ε-prediction is numerically unstable at extremes: at t≈0 the noise is tiny (hard to predict), at t≈T the signal is gone (prediction is meaningless) | Predict v = √ᾱ·ε − √(1−ᾱ)·x_0, which stays well-scaled at all timesteps. The network’s job is equally difficult at every t, preventing the loss from being dominated by certain timesteps |
| Cosine schedule | Linear schedule destroys coarse structure too early — the model wastes capacity denoising already-destroyed images | Shape the noise curve so ᾱ_t follows a cosine — gentle at first, steep at the end. Coarse structure survives longer, giving the model useful signal at more timesteps |
| Classifier-free guidance | Need conditional generation, but classifier guidance needs a separate trained classifier and back-propagation through it at each sampling step | Train one model for both conditional and unconditional (random cond drop). At inference, extrapolate AWAY from unconditional: ε̃ = ε_u + w(ε_c − ε_u). w>1 amplifies the condition. Simple, no extra models needed |
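For the cosine-schedule row, a sketch of the ᾱ curve as formulated by Nichol & Dhariwal (s is their small offset; `cosine_alpha_bar` is an illustrative name):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    # abar_t follows a squared cosine: nearly flat near t=0 (coarse structure
    # survives), steep near t=T. Normalized so abar_0 = 1.
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]
```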

Relationship between prediction targets

Given: x_t = √ᾱ·x_0 + √(1−ᾱ)·ε

You can recover any target from any other:

  • x_0 from ε: x_0 = (x_t − √(1−ᾱ)·ε) / √ᾱ
  • ε from x_0: ε = (x_t − √ᾱ·x_0) / √(1−ᾱ)
  • v from both: v = √ᾱ·ε − √(1−ᾱ)·x_0
  • x_0 from v: x_0 = √ᾱ·x_t − √(1−ᾱ)·v
  • ε from v: ε = √(1−ᾱ)·x_t + √ᾱ·v
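The identities translate directly into code. A NumPy round-trip sketch (`ab` stands for ᾱ; function names are illustrative):

```python
import numpy as np

def v_from_eps_x0(x0, eps, ab):
    # v = sqrt(abar)*eps - sqrt(1-abar)*x0
    return np.sqrt(ab) * eps - np.sqrt(1 - ab) * x0

def x0_from_v(x_t, v, ab):
    # x0 = sqrt(abar)*x_t - sqrt(1-abar)*v
    return np.sqrt(ab) * x_t - np.sqrt(1 - ab) * v

def eps_from_v(x_t, v, ab):
    # eps = sqrt(1-abar)*x_t + sqrt(abar)*v
    return np.sqrt(1 - ab) * x_t + np.sqrt(ab) * v
```

Substituting x_t = √ᾱ·x_0 + √(1−ᾱ)·ε into either recovery formula and expanding confirms the cross terms cancel, which is what the round trip exercises.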

The choice of target changes what the network learns to be good at — the mathematical content is equivalent, but the loss landscape and gradient magnitudes differ.