ELBO (Evidence Lower Bound)

The training objective for variational autoencoders (VAEs) and variational inference in general. A lower bound on the log-likelihood of the data: maximising the ELBO approximately maximises the probability of the data under the model. Decomposes into a reconstruction term and a KL regularisation term — the two forces that shape VAE latent spaces.

You want to maximise $\log p(x)$, the probability your model assigns to the data. But computing this requires integrating over all possible latent codes $z$, which is intractable. The ELBO is a computable lower bound: pushing the ELBO up raises a floor under $\log p(x)$, so the model's fit to the data can never fall below the bound you have achieved.

The bound comes from introducing an approximate posterior $q(z|x)$ — the encoder — and using Jensen's inequality. The gap between the ELBO and the true $\log p(x)$ is exactly $D_{\text{KL}}(q(z|x) \,\|\, p(z|x))$: how far the encoder is from the true posterior. As the encoder improves, the gap shrinks and the bound tightens.
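Spelled out, the Jensen step is the standard one-line derivation (reproduced here for completeness):

$$
\log p(x) = \log \int p(x|z)\,p(z)\,dz
          = \log \mathbb{E}_{q(z|x)}\!\left[\frac{p(x|z)\,p(z)}{q(z|x)}\right]
          \;\geq\; \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x|z)\,p(z)}{q(z|x)}\right]
          = \text{ELBO}
$$

The inequality holds because $\log$ is concave, and it is tight exactly when $q(z|x) = p(z|x)$.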

The ELBO splits into two terms with opposing goals. The reconstruction term says "pick latent codes that let you reconstruct the input well." The KL term says "keep the latent distribution close to the prior." Reconstruction wants each input to have a unique, informative code; the KL term wants all codes to look like the same prior distribution. This tension is the core design challenge of VAEs. Push KL too hard (high $\beta$) and the model ignores the latent code ("posterior collapse"). Push reconstruction too hard and the latent space becomes unstructured and ungeneralisable.

The fundamental identity (always holds, for any $q$):

$$\log p(x) = \underbrace{\mathbb{E}_{q(z|x)}\bigl[\log p(x|z)\bigr] - D_{\text{KL}}\bigl(q(z|x) \,\|\, p(z)\bigr)}_{\text{ELBO}} + D_{\text{KL}}\bigl(q(z|x) \,\|\, p(z|x)\bigr)$$

Since KL divergence is non-negative, $\text{ELBO} \leq \log p(x)$. Maximising the ELBO simultaneously:

  1. Maximises reconstruction quality: $\mathbb{E}_{q(z|x)}[\log p(x|z)]$
  2. Minimises the KL to the prior: $D_{\text{KL}}(q(z|x) \,\|\, p(z))$
  3. Tightens the bound by making $q(z|x)$ closer to the true posterior $p(z|x)$
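These effects can be checked numerically in a toy model where $\log p(x)$ is tractable. The model below is an illustrative choice (not from the text): with the exact posterior as $q$ the bound is tight, and with any other $q$ the ELBO sits strictly below $\log p(x)$.

```python
import numpy as np

# Toy model (an illustrative choice, not from the text) in which everything
# is tractable: p(z) = N(0, 1), p(x|z) = N(z, 1). Marginalising gives
# p(x) = N(0, 2), and the true posterior is p(z|x) = N(x/2, 1/2).

def log_px(x):
    """Exact log-evidence: log p(x) = log N(x; 0, 2)."""
    return -0.5 * np.log(2 * np.pi * 2) - x**2 / 4

def elbo(x, m, s2):
    """ELBO for the approximate posterior q(z|x) = N(m, s2)."""
    # E_q[log p(x|z)] for the Gaussian likelihood N(x; z, 1)
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # Closed-form KL(N(m, s2) || N(0, 1))
    kl = 0.5 * (m**2 + s2 - np.log(s2) - 1)
    return recon - kl

x = 1.3
print(elbo(x, m=0.0, s2=1.0))    # a crude q: strictly below log p(x)
print(elbo(x, m=x / 2, s2=0.5))  # q = true posterior: the bound is tight
print(log_px(x))                 # equals the previous line
```

Improving $q$ from the prior to the exact posterior closes the gap completely, which is point 3 above in miniature.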

VAE loss (negative ELBO, what we actually minimise):

$$\mathcal{L} = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{\text{KL}}(q(z|x) \,\|\, p(z))$$

With Gaussian encoder and standard normal prior (the common case):

$$\mathcal{L} = \|x - \hat{x}\|^2 + \frac{1}{2}\sum_{j=1}^{d}\bigl(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\bigr)$$

where $\hat{x} = \text{decoder}(z)$, $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$ (the reparameterisation trick).

Beta-VAE (Higgins et al.):

$$\mathcal{L}_\beta = \text{recon} + \beta \cdot D_{\text{KL}}$$

$\beta > 1$ encourages more disentangled latent representations at the cost of reconstruction quality.

```python
import torch
import torch.nn.functional as F

# ── Standard VAE loss (negative ELBO) ────────────────────────────
def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """
    x:        (B, C, H, W) or (B, D) — original input
    x_recon:  same shape — decoder output
    mu:       (B, d_latent) — encoder mean
    log_var:  (B, d_latent) — encoder log-variance
    beta:     float — KL weight (1.0 = standard VAE, >1 = beta-VAE)
    """
    # Reconstruction: MSE or BCE depending on data type
    recon_loss = F.mse_loss(x_recon, x, reduction='sum') / x.size(0)
    # KL divergence: closed-form for Gaussian q vs N(0, I) prior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)  # (B,)
    kl_loss = kl.mean()
    return recon_loss + beta * kl_loss

# ── With reparameterisation trick (encoder forward) ──────────────
mu, log_var = encoder(x)        # (B, d_latent) each
std = torch.exp(0.5 * log_var)  # (B, d_latent)
eps = torch.randn_like(std)     # (B, d_latent) ~ N(0, I)
z = mu + std * eps              # (B, d_latent) — differentiable sample

# WARNING: some implementations use 'sigma' instead of 'log_var'.
# Using sigma directly can cause numerical issues — log_var is more stable
# because it can represent very small variances without underflow.
```
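The underflow point is easy to demonstrate. The numbers below are a toy check of my own (not from the text): in float64, $e^{-800}$ underflows to exactly zero, so a KL computed from $\sigma^2$ would hit $\log 0 = -\infty$, while the log-variance form never takes a log and stays finite.

```python
import numpy as np

# Why log-variance beats raw sigma numerically.
log_var = np.array([-800.0])  # an extremely small variance, log-space
sigma2 = np.exp(log_var)      # underflows to exactly 0.0 in float64
print(sigma2)                 # [0.]

# KL via sigma^2 would need log(sigma2) -> log(0) = -inf, and the loss
# becomes inf:   0.5 * (mu**2 + sigma2 - np.log(sigma2) - 1)
# KL via log_var substitutes sigma^2 = exp(log_var) and never takes a log:
kl = 0.5 * (0.0 + sigma2 - log_var - 1)  # mu = 0 here
print(kl)                     # [399.5] — large but finite, gradients flow
```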
The same loss and sampler in plain NumPy, for reference:

```python
import numpy as np

def elbo_loss(x, x_recon, mu, log_var, beta=1.0):
    """
    Negative ELBO for a Gaussian VAE with N(0, I) prior.
    x:        (B, D) original input (flattened)
    x_recon:  (B, D) decoder reconstruction
    mu:       (B, d_latent) encoder means
    log_var:  (B, d_latent) encoder log-variances
    """
    B = x.shape[0]
    # Reconstruction term: MSE (= Gaussian log-likelihood up to a constant)
    recon = np.sum((x - x_recon) ** 2) / B  # scalar
    # KL term: closed-form KL(N(mu, sigma^2) || N(0, I))
    #   = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    kl_per_sample = -0.5 * np.sum(
        1 + log_var - mu ** 2 - np.exp(log_var), axis=1
    )  # (B,)
    kl = kl_per_sample.mean()  # scalar
    return recon + beta * kl

def reparameterise(mu, log_var):
    """
    Sample z from q(z|x) = N(mu, sigma^2) using the reparameterisation trick.
    mu:       (B, D) means
    log_var:  (B, D) log-variances
    Returns:  (B, D) sampled latent codes
    """
    std = np.exp(0.5 * log_var)       # (B, D)
    eps = np.random.randn(*mu.shape)  # (B, D) ~ N(0, I)
    return mu + std * eps             # (B, D)
```
  • VAEs and all variants (see variational-inference-vae/): the ELBO is the training objective. Beta-VAE, CVAE, and VQ-VAE all modify or approximate it
  • Stable Diffusion’s autoencoder: the “VAE” in latent diffusion is trained with ELBO (technically a KL-AE with very low KL weight) to compress images to a latent space
  • Variational inference in Bayesian neural networks: approximate weight posteriors by maximising an ELBO over the weights
  • Topic models (neural variational inference): learn document-topic distributions by treating topics as latent variables
  • Amortised inference (any model with latent variables): the encoder network “amortises” the cost of inference by learning to map data to approximate posteriors in one forward pass
| Alternative | When to use | Tradeoff |
|---|---|---|
| Exact log-likelihood (autoregressive models) | Tractable models like PixelCNN, GPT | No approximation gap, but generation is sequential and slow. No latent space |
| Adversarial loss (GAN) | Want sharp samples, don't need likelihood | No mode-covering behaviour — may miss modes. No encoder or latent inference |
| Flow-based (exact log-likelihood) | Need exact log-likelihood AND a latent space | Normalising flows give exact posteriors, closing the ELBO gap. Architecturally constrained |
| IWAE bound | Want a tighter bound than the ELBO | Uses multiple importance-weighted samples. Tighter bound but higher-variance gradients |
| Diffusion loss | Want high-quality generation with likelihood | Can be viewed as a hierarchical ELBO with T steps. Better sample quality than a single-step VAE |
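The IWAE bound mentioned above is a short estimator to write down. The sketch below runs it on a tractable toy model of my own choosing (the model, sample counts, and function names are illustrative assumptions): $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$, so $\log p(x)$ is known exactly. $K = 1$ recovers the ELBO; larger $K$ tightens the bound toward $\log p(x)$.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log-density of N(mean, var) at v (element-wise)."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def iwae_bound(x, m, s2, K, n_outer=20_000, seed=0):
    """Monte Carlo estimate of the K-sample IWAE bound with q(z|x) = N(m, s2)."""
    rng = np.random.default_rng(seed)
    z = m + np.sqrt(s2) * rng.standard_normal((n_outer, K))  # (n_outer, K)
    # Log importance weights: log p(x, z) - log q(z | x)
    log_w = (log_normal(z, 0.0, 1.0)      # log p(z)
             + log_normal(x, z, 1.0)      # log p(x | z)
             - log_normal(z, m, s2))      # log q(z | x)
    # Stable log-mean-exp over the K samples, then average the outer draws
    log_w_max = log_w.max(axis=1, keepdims=True)
    lme = log_w_max[:, 0] + np.log(np.exp(log_w - log_w_max).mean(axis=1))
    return lme.mean()

x = 1.3
exact = -0.5 * np.log(2 * np.pi * 2) - x**2 / 4  # log p(x) = log N(x; 0, 2)
b1 = iwae_bound(x, m=0.0, s2=1.0, K=1)    # = ELBO estimate, loose bound
b50 = iwae_bound(x, m=0.0, s2=1.0, K=50)  # tighter, close to exact
print(b1, b50, exact)
```

With $q$ fixed at the prior, the $K{=}1$ bound is loose (the gap is the full posterior KL), while $K{=}50$ closes most of it without touching $q$ at all, which is the IWAE tradeoff in the table.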

The ELBO has roots in variational calculus and Bayesian statistics, but it entered deep learning through Kingma and Welling’s “Auto-Encoding Variational Bayes” (2014) and the concurrent work by Rezende, Mohamed, and Wierstra. The key innovation was combining the ELBO with the reparameterisation trick, which made the expectation in the reconstruction term differentiable through the sampling step.

The tension between reconstruction and KL regularisation has driven most subsequent VAE research. Higgins et al. (2017) introduced beta-VAE, showing that increasing the KL weight encourages disentangled representations. Bowman et al. (2016) identified posterior collapse — where the model learns to ignore the latent code — and proposed KL annealing (gradually increasing $\beta$ from 0 to 1 during training) as a mitigation. The VQ-VAE (van den Oord et al., 2017) sidestepped the continuous KL entirely by using discrete latent codes, replacing the KL term with a vector quantisation objective. The ELBO framework remains the foundation for understanding all these variants and their tradeoffs.
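KL annealing itself is a one-liner. Here is a minimal sketch assuming a linear warmup schedule (the parameter names are illustrative; Bowman et al. also experimented with sigmoid schedules):

```python
def kl_weight(step, warmup_steps=10_000, beta_max=1.0):
    """Linearly anneal the KL weight from 0 to beta_max over warmup_steps,
    then hold it constant. Used as: loss = recon + kl_weight(step) * kl."""
    return beta_max * min(1.0, step / warmup_steps)
```

Early in training the model is free to put information into the latent code; only once the decoder depends on $z$ does the full KL pressure arrive, which is what makes the schedule a mitigation for posterior collapse.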