ELBO (Evidence Lower Bound)

The training objective for variational autoencoders (VAEs) and variational inference in general. A lower bound on the log-likelihood of the data: maximising the ELBO approximately maximises the probability of the data under the model. Decomposes into a reconstruction term and a KL regularisation term — the two forces that shape VAE latent spaces.

You want to maximise $\log p(x)$, the probability your model assigns to the data. But computing this requires integrating over all possible latent codes $z$, which is intractable. The ELBO is a computable lower bound: pushing the ELBO up raises a floor under $\log p(x)$, so the model's fit to the data can never fall below the bound you have achieved.

The bound comes from introducing an approximate posterior $q(z|x)$ — the encoder — and using Jensen's inequality. The gap between the ELBO and the true $\log p(x)$ is exactly $D_{\text{KL}}(q(z|x) \,\|\, p(z|x))$: how far the encoder is from the true posterior. As the encoder improves, the gap shrinks and the bound tightens.
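Spelled out, the Jensen step is the standard one-line derivation (reproduced here for completeness):

$$
\log p(x) = \log \int p(x|z)\,p(z)\,dz
          = \log \mathbb{E}_{q(z|x)}\!\left[\frac{p(x|z)\,p(z)}{q(z|x)}\right]
          \;\geq\; \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x|z)\,p(z)}{q(z|x)}\right]
          = \text{ELBO}
$$

The inequality holds because $\log$ is concave, and it is tight exactly when $q(z|x) = p(z|x)$.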

The ELBO splits into two terms with opposing goals. The reconstruction term says "pick latent codes that let you reconstruct the input well." The KL term says "keep the latent distribution close to the prior." Reconstruction wants each input to have a unique, informative code; the KL term wants all codes to look like the same prior distribution. This tension is the core design challenge of VAEs. Push KL too hard (high $\beta$) and the model ignores the latent code ("posterior collapse"). Push reconstruction too hard and the latent space becomes unstructured and ungeneralisable.

The fundamental identity (always holds, for any $q$):

$$\log p(x) = \underbrace{\mathbb{E}_{q(z|x)}\bigl[\log p(x|z)\bigr] - D_{\text{KL}}\bigl(q(z|x) \,\|\, p(z)\bigr)}_{\text{ELBO}} + D_{\text{KL}}\bigl(q(z|x) \,\|\, p(z|x)\bigr)$$

Since KL divergence is non-negative, $\text{ELBO} \leq \log p(x)$. Maximising the ELBO simultaneously:

  1. Maximises reconstruction quality: $\mathbb{E}_{q(z|x)}[\log p(x|z)]$
  2. Minimises the KL to the prior: $D_{\text{KL}}(q(z|x) \,\|\, p(z))$
  3. Tightens the bound by making $q(z|x)$ closer to the true posterior $p(z|x)$
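These effects can be checked numerically in a toy model where $\log p(x)$ is tractable. The model below is an illustrative choice (not from the text): with the exact posterior as $q$ the bound is tight, and with any other $q$ the ELBO sits strictly below $\log p(x)$.

```python
import numpy as np

# Toy model (an illustrative choice, not from the text) in which everything
# is tractable: p(z) = N(0, 1), p(x|z) = N(z, 1). Marginalising gives
# p(x) = N(0, 2), and the true posterior is p(z|x) = N(x/2, 1/2).

def log_px(x):
    """Exact log-evidence: log p(x) = log N(x; 0, 2)."""
    return -0.5 * np.log(2 * np.pi * 2) - x**2 / 4

def elbo(x, m, s2):
    """ELBO for the approximate posterior q(z|x) = N(m, s2)."""
    # E_q[log p(x|z)] for the Gaussian likelihood N(x; z, 1)
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # Closed-form KL(N(m, s2) || N(0, 1))
    kl = 0.5 * (m**2 + s2 - np.log(s2) - 1)
    return recon - kl

x = 1.3
print(elbo(x, m=0.0, s2=1.0))    # a crude q: strictly below log p(x)
print(elbo(x, m=x / 2, s2=0.5))  # q = true posterior: the bound is tight
print(log_px(x))                 # equals the previous line
```

Improving $q$ from the prior to the exact posterior closes the gap completely, which is point 3 above in miniature.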

VAE loss (negative ELBO, what we actually minimise):

$$\mathcal{L} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{\text{KL}}(q(z|x) \,\|\, p(z))$$

With Gaussian encoder and standard normal prior (the common case):

$$\mathcal{L} = \|x - \hat{x}\|^2 + \frac{1}{2}\sum_{j=1}^{d}\bigl(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\bigr)$$

where $\hat{x} = \text{decoder}(z)$, $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$ (the reparameterisation trick).

Beta-VAE (Higgins et al.):

$$\mathcal{L}_\beta = \text{recon} + \beta \cdot D_{\text{KL}}$$

$\beta > 1$ encourages more disentangled latent representations at the cost of reconstruction quality.

```python
import torch
import torch.nn.functional as F

# ── Standard VAE loss (negative ELBO) ────────────────────────────
def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """
    x:        (B, C, H, W) or (B, D) — original input
    x_recon:  same shape — decoder output
    mu:       (B, d_latent) — encoder mean
    log_var:  (B, d_latent) — encoder log-variance
    beta:     float — KL weight (1.0 = standard VAE, >1 = beta-VAE)
    """
    # Reconstruction: MSE or BCE depending on data type
    recon_loss = F.mse_loss(x_recon, x, reduction='sum') / x.size(0)
    # KL divergence: closed-form for Gaussian q vs N(0, I) prior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)  # (B,)
    kl_loss = kl.mean()
    return recon_loss + beta * kl_loss

# ── With reparameterisation trick (encoder forward) ──────────────
mu, log_var = encoder(x)        # (B, d_latent) each
std = torch.exp(0.5 * log_var)  # (B, d_latent)
eps = torch.randn_like(std)     # (B, d_latent) ~ N(0, I)
z = mu + std * eps              # (B, d_latent) — differentiable sample

# WARNING: some implementations use 'sigma' instead of 'log_var'.
# Using sigma directly can cause numerical issues — log_var is more stable
# because it can represent very small variances without underflow.
```
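The underflow point is easy to demonstrate. The numbers below are a toy check of my own (not from the text): in float64, $e^{-800}$ underflows to exactly zero, so a KL computed from $\sigma^2$ would hit $\log 0 = -\infty$, while the log-variance form never takes a log and stays finite.

```python
import numpy as np

# Why log-variance beats raw sigma numerically.
log_var = np.array([-800.0])  # an extremely small variance, log-space
sigma2 = np.exp(log_var)      # underflows to exactly 0.0 in float64
print(sigma2)                 # [0.]

# KL via sigma^2 would need log(sigma2) -> log(0) = -inf, and the loss
# becomes inf:   0.5 * (mu**2 + sigma2 - np.log(sigma2) - 1)
# KL via log_var substitutes sigma^2 = exp(log_var) and never takes a log:
kl = 0.5 * (0.0 + sigma2 - log_var - 1)  # mu = 0 here
print(kl)                     # [399.5] — large but finite, gradients flow
```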
The same loss and sampler in plain NumPy, for reference:

```python
import numpy as np

def elbo_loss(x, x_recon, mu, log_var, beta=1.0):
    """
    Negative ELBO for a Gaussian VAE with N(0, I) prior.
    x:        (B, D) original input (flattened)
    x_recon:  (B, D) decoder reconstruction
    mu:       (B, d_latent) encoder means
    log_var:  (B, d_latent) encoder log-variances
    """
    B = x.shape[0]
    # Reconstruction term: MSE (= Gaussian log-likelihood up to a constant)
    recon = np.sum((x - x_recon) ** 2) / B  # scalar
    # KL term: closed-form KL(N(mu, sigma^2) || N(0, I))
    #   = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    kl_per_sample = -0.5 * np.sum(
        1 + log_var - mu ** 2 - np.exp(log_var), axis=1
    )  # (B,)
    kl = kl_per_sample.mean()  # scalar
    return recon + beta * kl

def reparameterise(mu, log_var):
    """
    Sample z from q(z|x) = N(mu, sigma^2) using the reparameterisation trick.
    mu:       (B, D) means
    log_var:  (B, D) log-variances
    Returns:  (B, D) sampled latent codes
    """
    std = np.exp(0.5 * log_var)       # (B, D)
    eps = np.random.randn(*mu.shape)  # (B, D) ~ N(0, I)
    return mu + std * eps             # (B, D)
```
  • VAEs and all variants (see variational-inference-vae/): the ELBO is the training objective. Beta-VAE, CVAE, and VQ-VAE all modify or approximate it
  • Stable Diffusion’s autoencoder: the “VAE” in latent diffusion is trained with ELBO (technically a KL-AE with very low KL weight) to compress images to a latent space
  • Variational inference in Bayesian neural networks: approximate weight posteriors by maximising an ELBO over the weights
  • Topic models (neural variational inference): learn document-topic distributions by treating topics as latent variables
  • Amortised inference (any model with latent variables): the encoder network “amortises” the cost of inference by learning to map data to approximate posteriors in one forward pass
| Alternative | When to use | Tradeoff |
|---|---|---|
| Exact log-likelihood (autoregressive models) | Tractable models like PixelCNN, GPT | No approximation gap, but generation is sequential and slow. No latent space |
| Adversarial loss (GAN) | Want sharp samples, don't need likelihood | No mode-covering behaviour — may miss modes. No encoder or latent inference |
| Flow-based (exact log-likelihood) | Need exact log-likelihood AND a latent space | Normalising flows give exact posteriors, closing the ELBO gap. Architecturally constrained |
| IWAE bound | Want a tighter bound than the ELBO | Uses multiple importance-weighted samples. Tighter bound but higher-variance gradients |
| Diffusion loss | Want high-quality generation with likelihood | Can be viewed as a hierarchical ELBO with T steps. Better sample quality than a single-step VAE |
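The IWAE bound mentioned above is a short estimator to write down. The sketch below runs it on a tractable toy model of my own choosing (the model, sample counts, and function names are illustrative assumptions): $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, so $\log p(x)$ is known exactly. $K = 1$ recovers the ELBO; larger $K$ tightens the bound toward $\log p(x)$.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log-density of N(mean, var) at v (element-wise)."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def iwae_bound(x, m, s2, K, n_outer=20_000, seed=0):
    """Monte Carlo estimate of the K-sample IWAE bound with q(z|x) = N(m, s2)."""
    rng = np.random.default_rng(seed)
    z = m + np.sqrt(s2) * rng.standard_normal((n_outer, K))  # (n_outer, K)
    # Log importance weights: log p(x, z) - log q(z | x)
    log_w = (log_normal(z, 0.0, 1.0)      # log p(z)
             + log_normal(x, z, 1.0)      # log p(x | z)
             - log_normal(z, m, s2))      # log q(z | x)
    # Stable log-mean-exp over the K samples, then average the outer draws
    log_w_max = log_w.max(axis=1, keepdims=True)
    lme = log_w_max[:, 0] + np.log(np.exp(log_w - log_w_max).mean(axis=1))
    return lme.mean()

x = 1.3
exact = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4  # log p(x) = log N(x; 0, 2)
b1 = iwae_bound(x, m=0.0, s2=1.0, K=1)    # = ELBO estimate, loose bound
b50 = iwae_bound(x, m=0.0, s2=1.0, K=50)  # tighter, close to exact
print(b1, b50, exact)
```

With $q$ fixed at the prior, the $K{=}1$ bound is loose (the gap is the full posterior KL), while $K{=}50$ closes most of it without touching $q$ at all, which is the IWAE tradeoff in the table.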

The ELBO has roots in variational calculus and Bayesian statistics, but it entered deep learning through Kingma and Welling’s “Auto-Encoding Variational Bayes” (2014) and the concurrent work by Rezende, Mohamed, and Wierstra. The key innovation was combining the ELBO with the reparameterisation trick, which made the expectation in the reconstruction term differentiable through the sampling step.

The tension between reconstruction and KL regularisation has driven most subsequent VAE research. Higgins et al. (2017) introduced beta-VAE, showing that increasing the KL weight encourages disentangled representations. Bowman et al. (2016) identified posterior collapse — where the model learns to ignore the latent code — and proposed KL annealing (gradually increasing $\beta$ from 0 to 1 during training) as a mitigation. The VQ-VAE (van den Oord et al., 2017) sidestepped the continuous KL entirely by using discrete latent codes, replacing the KL term with a vector quantisation objective. The ELBO framework remains the foundation for understanding all these variants and their tradeoffs.
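KL annealing itself is a one-liner. Here is a minimal sketch assuming a linear warmup schedule (the parameter names are illustrative; Bowman et al. also experimented with sigmoid schedules):

```python
def kl_weight(step, warmup_steps=10_000, beta_max=1.0):
    """Linearly anneal the KL weight from 0 to beta_max over warmup_steps,
    then hold it constant. Used as: loss = recon + kl_weight(step) * kl."""
    return beta_max * min(1.0, step / warmup_steps)
```

Early in training the model is free to put information into the latent code; only once the decoder depends on $z$ does the full KL pressure arrive, which is what makes the schedule a mitigation for posterior collapse.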