Unified Diffusion Algorithm

The structure follows the same pattern. The training step is arguably the simplest of all four files — it’s literally “add noise, predict something, MSE.” The complexity lives entirely in sampling.

The key insight this file tries to make clear:

Training is identical across DDPM, DDIM, and v-prediction. DDPM and DDIM have the exact same training — both predict ε with MSE loss. They only differ at sampling time. This is surprisingly under-appreciated. You can train once and sample with either method.
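A minimal NumPy sketch of that shared training step (the names `model` and `alpha_bar` are placeholders, not from the source code):

```python
import numpy as np

def train_step(model, x0, alpha_bar, rng):
    # Identical for DDPM and DDIM: add noise, predict eps, MSE.
    T = len(alpha_bar)
    t = int(rng.integers(0, T))                       # sample random timestep
    eps = rng.standard_normal(x0.shape)               # sample noise
    ab = alpha_bar[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps    # forward process (add noise)
    pred = model(x_t, t)                              # network predicts eps
    return np.mean((eps - pred) ** 2)                 # MSE loss
```

A checkpoint trained this way can be sampled with either the DDPM or the DDIM rule; nothing in the loss ties it to one sampler.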

The three prediction targets (ε, x_0, v) are mathematically interchangeable. The “RELATIONSHIP BETWEEN PREDICTION TARGETS” section at the bottom shows the algebra — you can convert freely between them. The choice matters only because it changes the loss landscape: ε-prediction gives a loss dominated by high-noise timesteps, v-prediction balances the difficulty evenly, and x_0-prediction emphasises low-noise timesteps.

Classifier-free guidance is the critical practical addition. It’s what turned diffusion from “nice research” into “DALL-E / Stable Diffusion.” The trick — train with random conditioning dropout, then extrapolate away from unconditional at inference — is elegant and doesn’t fit neatly into the core algorithm’s pluggable methods, so it’s a separate composable component (same way the EMA is).
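The inference-time half of the trick is one line. A sketch, assuming the model has already been queried twice per step (once with the condition, once with it dropped):

```python
import numpy as np

def cfg_epsilon(eps_cond, eps_uncond, w):
    # Extrapolate away from the unconditional prediction:
    # w = 1 recovers the conditional model, w > 1 amplifies the condition.
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Because it only rewrites the predicted noise, it composes with any of the samplers below without touching their internals.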

The progression tells a similar story to the other files: DDPM solves the core problem (stable training, full distribution coverage), DDIM solves speed, v-prediction solves numerical stability, and classifier-free guidance solves controllability. Each one changes exactly one piece.

Summary: What changes vs. what stays the same

Section titled “Summary: What changes vs. what stays the same”
Training (identical across all variants):

  • Sample random timestep t
  • x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε (add noise)
  • loss = ‖target − model(x_t, t)‖² (MSE loss)
  • Gradient step

Sampling:

  • Start from x_T ~ N(0, I)
  • Loop: x_t → x_{t−1} via denoise_step() (PLUGGABLE)
  • Return x_0
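The sampling loop above can be sketched generically; `denoise_step` stands in for the pluggable method (hypothetical signature, not the source's exact API):

```python
import numpy as np

def sample(denoise_step, shape, T, rng):
    # Generic reverse loop. Everything variant-specific (DDPM's sigma*z noise,
    # DDIM's eta=0 determinism, v-prediction's target conversion) lives
    # entirely inside denoise_step.
    x = rng.standard_normal(shape)        # x_T ~ N(0, I)
    for t in reversed(range(T)):          # x_t -> x_{t-1}
        x = denoise_step(x, t)
    return x                              # x_0
```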
| Variant | Prediction target | Sampling | Speed |
| --- | --- | --- | --- |
| DDPM | noise ε | Stochastic (+ σ·z) | T steps (slow) |
| DDIM | noise ε | Deterministic (η=0) | ~50 steps (fast) |
| v-prediction | velocity v | Deterministic | ~50 steps (fast) |
| Variant | Problem solved | Intuition for solution |
| --- | --- | --- |
| DDPM | GANs are unstable to train (mode collapse, training oscillation) and don’t cover the full data distribution | Instead of adversarial training, learn to reverse a simple noise process. Training is just MSE on noise prediction — as stable as any regression. Covers all modes because the forward process does |
| DDIM | DDPM needs T=1000 sequential steps to generate one sample — extremely slow | Rewrite the reverse process as a non-Markovian chain that gives the same marginals but allows skipping steps. 50 steps ≈ 1000-step quality. Determinism enables interpolation and inversion |
| v-prediction | ε-prediction is numerically unstable at extremes: at t≈0 the noise is tiny (hard to predict), at t≈T the signal is gone (prediction is meaningless) | Predict v = √ᾱ·ε − √(1−ᾱ)·x_0, which stays well-scaled at all timesteps. The network’s job is equally difficult at every t, preventing the loss from being dominated by certain timesteps |
| Cosine schedule | Linear schedule destroys coarse structure too early — the model wastes capacity denoising already-destroyed images | Shape the noise curve so ᾱ_t follows a cosine — gentle at first, steep at the end. Coarse structure survives longer, giving the model useful signal at more timesteps |
| Classifier-free guidance | Need conditional generation, but classifier guidance needs a separate trained classifier and back-propagation through it at each sampling step | Train one model for both conditional and unconditional (random cond drop). At inference, extrapolate AWAY from unconditional: ε̃ = ε_u + w(ε_c − ε_u). w>1 amplifies the condition. Simple, no extra models needed |
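For the cosine-schedule row, a sketch of the ᾱ curve as formulated by Nichol & Dhariwal (s is their small offset; `cosine_alpha_bar` is an illustrative name):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    # abar_t follows a squared cosine: nearly flat near t=0 (coarse structure
    # survives), steep near t=T. Normalized so abar_0 = 1.
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]
```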

Relationship between prediction targets

Given: x_t = √ᾱ·x_0 + √(1−ᾱ)·ε

You can recover any target from any other:

  • x_0 from ε: x_0 = (x_t − √(1−ᾱ)·ε) / √ᾱ
  • ε from x_0: ε = (x_t − √ᾱ·x_0) / √(1−ᾱ)
  • v from both: v = √ᾱ·ε − √(1−ᾱ)·x_0
  • x_0 from v: x_0 = √ᾱ·x_t − √(1−ᾱ)·v
  • ε from v: ε = √(1−ᾱ)·x_t + √ᾱ·v
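The identities translate directly into code. A NumPy round-trip sketch (`ab` stands for ᾱ; function names are illustrative):

```python
import numpy as np

def v_from_eps_x0(x0, eps, ab):
    # v = sqrt(abar)*eps - sqrt(1-abar)*x0
    return np.sqrt(ab) * eps - np.sqrt(1 - ab) * x0

def x0_from_v(x_t, v, ab):
    # x0 = sqrt(abar)*x_t - sqrt(1-abar)*v
    return np.sqrt(ab) * x_t - np.sqrt(1 - ab) * v

def eps_from_v(x_t, v, ab):
    # eps = sqrt(1-abar)*x_t + sqrt(abar)*v
    return np.sqrt(1 - ab) * x_t + np.sqrt(ab) * v
```

Substituting x_t = √ᾱ·x_0 + √(1−ᾱ)·ε into either recovery formula and expanding confirms the cross terms cancel, which is what the round trip exercises.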

The choice of target changes what the network learns to be good at — the mathematical content is equivalent, but the loss landscape and gradient magnitudes differ.