ELBO (Evidence Lower Bound)
The training objective for variational autoencoders (VAEs) and variational inference in general. A lower bound on the log-likelihood of the data: maximising the ELBO approximately maximises the probability of the data under the model. It decomposes into a reconstruction term and a KL regularisation term, the two forces that shape VAE latent spaces.
Intuition
You want to maximise $\log p_\theta(x)$, the probability your model assigns to the data. But computing this requires integrating over all possible latent codes $z$, which is intractable. The ELBO is a computable lower bound: if you push the ELBO up, you are raising a floor under $\log p_\theta(x)$.
The bound comes from introducing an approximate posterior $q_\phi(z \mid x)$ (the encoder) and using Jensen's inequality. The gap between the ELBO and the true $\log p_\theta(x)$ is exactly $D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$: how far the encoder is from the true posterior. As the encoder improves, the gap shrinks and the bound tightens.
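Written out, the Jensen step is the standard derivation, using $p_\theta(x, z) = p_\theta(x \mid z)\,p(z)$ and the concavity of $\log$:

```latex
\log p_\theta(x)
  = \log \int p_\theta(x, z)\, dz
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right]
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right]
  = \mathrm{ELBO}
```

The inequality is Jensen's, and it holds with equality exactly when $q_\phi(z \mid x)$ equals the true posterior.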
The ELBO splits into two terms with opposing goals. The reconstruction term says "pick latent codes that let you reconstruct the input well." The KL term says "keep the latent distribution close to the prior." Reconstruction wants each input to have a unique, informative code; the KL term wants all codes to look like the same prior distribution. This tension is the core design challenge of VAEs. Push KL too hard (high $\beta$) and the model ignores the latent code ("posterior collapse"). Push reconstruction too hard and the latent space becomes unstructured and ungeneralisable.
The fundamental identity (always holds, for any $q$):

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{ELBO}} + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$$
Since KL is non-negative, $\mathrm{ELBO} \le \log p_\theta(x)$. Maximising the ELBO simultaneously:
- Maximises reconstruction quality: $\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$
- Minimises KL to the prior: $D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$
- Tightens the bound by making $q_\phi(z \mid x)$ closer to the true posterior $p_\theta(z \mid x)$
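The identity and the bound can be checked numerically in a toy model where every quantity has a closed form. The model below ($p(z) = \mathcal{N}(0, 1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, hence $p(x) = \mathcal{N}(0, 2)$ and true posterior $\mathcal{N}(x/2, 1/2)$) is an illustrative assumption, not from the text:

```python
import math

# Illustrative tractable model: p(z) = N(0,1), p(x|z) = N(z,1)
#   => marginal p(x) = N(0, 2), true posterior p(z|x) = N(x/2, 1/2)
# Approximate posterior q(z|x) = N(m, s^2), deliberately mismatched.

def log_px(x):
    # exact log-likelihood: log N(x; 0, 2)
    return -0.5 * math.log(2 * math.pi * 2) - x ** 2 / 4

def elbo(x, m, s):
    # E_q[log p(x|z)] in closed form (expand the quadratic under q)
    expected_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # closed-form KL(N(m, s^2) || N(0, 1))
    kl_to_prior = 0.5 * (m ** 2 + s ** 2 - math.log(s ** 2) - 1)
    return expected_loglik - kl_to_prior

def gap(x, m, s):
    # KL(q || true posterior N(x/2, 1/2))
    mu_p, var_p = x / 2, 0.5
    return (math.log(math.sqrt(var_p) / s)
            + (s ** 2 + (m - mu_p) ** 2) / (2 * var_p) - 0.5)

x, m, s = 1.0, 0.3, 0.8
assert abs(elbo(x, m, s) + gap(x, m, s) - log_px(x)) < 1e-9  # the identity
assert elbo(x, m, s) <= log_px(x)                            # the bound

# With the exact posterior the gap vanishes and the bound is tight
assert abs(elbo(x, x / 2, math.sqrt(0.5)) - log_px(x)) < 1e-9
```

Picking a worse $q$ (larger gap) lowers the ELBO while $\log p(x)$ stays fixed, which is the sense in which maximising the ELBO over $\phi$ tightens the bound.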
VAE loss (negative ELBO, what we actually minimise):

$$\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$
With Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}\big(\mu, \operatorname{diag}(\sigma^2)\big)$ and standard normal prior $p(z) = \mathcal{N}(0, I)$ (the common case), the KL term has a closed form:

$$D_{\mathrm{KL}} = -\tfrac{1}{2} \sum_{j=1}^{d} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$$

where $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, with $\odot$ denoting elementwise multiplication (reparameterisation trick).
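A quick sanity check (the numbers are illustrative) that reparameterised samples really follow $\mathcal{N}(\mu, \sigma^2)$ while all the randomness stays in $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_var = 1.5, np.log(0.25)      # target N(1.5, 0.25), i.e. std = 0.5
std = np.exp(0.5 * log_var)          # exp(0.5 * log sigma^2) = sigma

eps = rng.standard_normal(100_000)   # all randomness lives here, ~ N(0, 1)
z = mu + std * eps                   # deterministic in (mu, std) given eps

assert abs(z.mean() - mu) < 0.01     # sample mean matches mu
assert abs(z.std() - std) < 0.01     # sample std matches sigma
```

Because `z` is a deterministic function of `mu` and `std` once `eps` is drawn, gradients of the loss flow through the sample back to the encoder parameters.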
Beta-VAE (Higgins et al.):

$$\mathcal{L}_\beta = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

$\beta > 1$ encourages more disentangled latent representations at the cost of reconstruction quality.
```python
import torch
import torch.nn.functional as F

# ── Standard VAE loss (negative ELBO) ────────────────────────────
def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """
    x:       (B, C, H, W) or (B, D) — original input
    x_recon: same shape — decoder output
    mu:      (B, d_latent) — encoder mean
    log_var: (B, d_latent) — encoder log-variance
    beta:    float — KL weight (1.0 = standard VAE, >1 = beta-VAE)
    """
    # Reconstruction: MSE or BCE depending on data type
    recon_loss = F.mse_loss(x_recon, x, reduction='sum') / x.size(0)

    # KL divergence: closed-form for Gaussian q vs N(0,1) prior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)  # (B,)
    kl_loss = kl.mean()

    return recon_loss + beta * kl_loss

# ── With reparameterisation trick (encoder forward) ──────────────
mu, log_var = encoder(x)            # (B, d_latent) each
std = torch.exp(0.5 * log_var)      # (B, d_latent)
eps = torch.randn_like(std)         # (B, d_latent) ~ N(0, I)
z = mu + std * eps                  # (B, d_latent) — differentiable sample

# WARNING: some implementations use 'sigma' instead of 'log_var'.
# Using sigma directly can cause numerical issues — log_var is more stable
# because it can represent very small variances without underflow.
```

Manual Implementation
```python
import numpy as np

def elbo_loss(x, x_recon, mu, log_var, beta=1.0):
    """
    Negative ELBO for a Gaussian VAE with N(0,1) prior.
    x:       (B, D) original input (flattened)
    x_recon: (B, D) decoder reconstruction
    mu:      (B, d_latent) encoder means
    log_var: (B, d_latent) encoder log-variances
    """
    B = x.shape[0]

    # Reconstruction term: MSE (= Gaussian log-likelihood up to a constant)
    recon = np.sum((x - x_recon) ** 2) / B  # scalar

    # KL term: closed-form KL(N(mu, sigma^2) || N(0, 1))
    #        = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    kl_per_sample = -0.5 * np.sum(
        1 + log_var - mu ** 2 - np.exp(log_var),
        axis=1
    )  # (B,)
    kl = kl_per_sample.mean()  # scalar

    return recon + beta * kl

def reparameterise(mu, log_var):
    """
    Sample z from q(z|x) = N(mu, sigma^2) using the reparameterisation trick.
    mu:      (B, D) means
    log_var: (B, D) log-variances
    Returns: (B, D) sampled latent codes
    """
    std = np.exp(0.5 * log_var)       # (B, D)
    eps = np.random.randn(*mu.shape)  # (B, D) ~ N(0, I)
    return mu + std * eps             # (B, D)
```

Popular Uses
- VAEs and all variants (see variational-inference-vae/): the ELBO is the training objective. Beta-VAE, CVAE, and VQ-VAE all modify or approximate it
- Stable Diffusion's autoencoder: the "VAE" in latent diffusion is trained with ELBO (technically a KL-regularised autoencoder with very low KL weight) to compress images to a latent space
- Variational inference in Bayesian neural networks: approximate weight posteriors by maximising an ELBO over the weights
- Topic models (neural variational inference): learn document-topic distributions by treating topics as latent variables
- Amortised inference (any model with latent variables): the encoder network “amortises” the cost of inference by learning to map data to approximate posteriors in one forward pass
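The per-datapoint optimisation that an encoder amortises can be made concrete. For the same kind of tractable toy model used throughout (an illustrative assumption, not from the text: $p(z) = \mathcal{N}(0,1)$, $p(x \mid z) = \mathcal{N}(z, 1)$, true posterior $\mathcal{N}(x/2, 1/2)$), here is the inference loop you would otherwise have to run for every single $x$:

```python
import math

# Per-datapoint variational inference for an illustrative model:
#   p(z) = N(0, 1), p(x|z) = N(z, 1)  (true posterior: N(x/2, 1/2)).
# Amortised inference replaces this per-x optimisation loop with one
# encoder forward pass; below we run the loop it amortises.

def fit_q(x, steps=2000, lr=0.05):
    """Gradient descent on the negative ELBO over q(z|x) = N(m, s^2)."""
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(log_s)
        # analytic gradients of the negative ELBO
        #   neg_elbo = const + 0.5*(x-m)^2 + 0.5*m^2 + s^2 - log(s)
        grad_m = -(x - m) + m        # d/dm
        grad_log_s = 2 * s * s - 1   # d/d(log s), chain rule through s = e^{log s}
        m -= lr * grad_m
        log_s -= lr * grad_log_s
    return m, math.exp(log_s)

m, s = fit_q(1.0)
assert abs(m - 0.5) < 1e-3             # converges to true posterior mean x/2
assert abs(s - math.sqrt(0.5)) < 1e-3  # and true posterior std sqrt(1/2)
```

Because the optimum here is a simple function of $x$, a small network could learn the map $x \mapsto (m, \log s)$ once and skip the loop entirely, which is exactly what "amortised" means.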
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Exact log-likelihood (autoregressive models) | Tractable models like PixelCNN, GPT | No approximation gap, but generation is sequential and slow. No latent space |
| Adversarial loss (GAN) | Want sharp samples, don’t need likelihood | No mode-covering behaviour — may miss modes. No encoder or latent inference |
| Normalising-flow posterior (exact likelihood) | Need exact log-likelihood AND a latent space | Flexible flow posteriors can close the ELBO gap; invertible flows give exact likelihoods outright. Architecturally constrained |
| IWAE bound | Want a tighter bound than ELBO | Uses multiple importance-weighted samples. Tighter bound but higher variance gradients |
| Diffusion loss | Want high-quality generation with likelihood | Can be viewed as a hierarchical ELBO with T steps. Better sample quality than single-step VAE |
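The IWAE row can be demonstrated numerically with the same kind of tractable toy model (the model and numbers are illustrative assumptions, not from the text): averaging $K$ importance weights inside the log gives a bound between the ELBO and the true log-likelihood.

```python
import numpy as np

# Illustrative tractable model: p(z) = N(0, 1), p(x|z) = N(z, 1),
# so p(x) = N(0, 2) exactly. q(z|x) = N(0.3, 0.8^2), deliberately mismatched.
rng = np.random.default_rng(0)
x, m, s = 1.0, 0.3, 0.8

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

def log_w(z):
    # log importance weight: log p(x, z) - log q(z|x)
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, m, s ** 2)

M, K = 20_000, 10
z = m + s * rng.standard_normal((M, K))            # reparameterised q samples
lw = log_w(z)                                      # (M, K) log-weights

elbo_est = lw.mean()                               # standard ELBO (K = 1)
iwae_est = np.log(np.exp(lw).mean(axis=1)).mean()  # IWAE bound with K = 10

exact = log_normal(x, 0.0, 2.0)                    # exact log p(x)
assert elbo_est < iwae_est < exact                 # IWAE is strictly tighter here
```

The cost is $K$ decoder evaluations per datapoint, and (as the table notes) the gradient estimates get noisier as $K$ grows.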
Historical Context
The ELBO has roots in variational calculus and Bayesian statistics, but it entered deep learning through Kingma and Welling's "Auto-Encoding Variational Bayes" (2014) and the concurrent work by Rezende, Mohamed, and Wierstra. The key innovation was combining the ELBO with the reparameterisation trick, which made the expectation in the reconstruction term differentiable through the sampling step.
The tension between reconstruction and KL regularisation has driven most subsequent VAE research. Higgins et al. (2017) introduced beta-VAE, showing that increasing the KL weight encourages disentangled representations. Bowman et al. (2016) identified posterior collapse — where the model learns to ignore the latent code — and proposed KL annealing (gradually increasing the KL weight from 0 to 1 during training) as a mitigation. The VQ-VAE (van den Oord et al., 2017) sidestepped the continuous KL entirely by using discrete latent codes, replacing the KL term with a vector quantisation objective. The ELBO framework remains the foundation for understanding all these variants and their tradeoffs.