Mutual Information
Measures how much knowing one variable reduces uncertainty about another. I(X;Y) = 0 means X and Y are independent; higher values mean more shared information. The core quantity behind representation learning (InfoMax principle) and the theoretical motivation for contrastive losses like InfoNCE.
Intuition
Mutual information asks: “how surprised would I be to learn that X and Y co-occur, compared to if they were independent?” If X is an image and Y is a caption, high MI means the caption tells you a lot about the image. If X is a noisy copy of Y, MI measures how much signal survives the noise.
Think of a Venn diagram of information. H(X) is one circle, H(Y) is the other. Their overlap is I(X;Y) — the information they share. The non-overlapping parts are what’s unique to each. Conditioning on Y removes its circle, leaving only H(X|Y) — the residual uncertainty. So I(X;Y) = H(X) - H(X|Y): knowing Y “explains away” exactly I(X;Y) nats of uncertainty about X.
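A small worked example makes the identity concrete. The joint table below is hypothetical, chosen only for illustration; it verifies that the Venn-diagram identity and the direct definition agree:

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y), chosen for illustration
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)  # marginal p(x) = [0.5, 0.5]
p_y = p_xy.sum(axis=0)  # marginal p(y) = [0.5, 0.5]

H_x = -(p_x * np.log(p_x)).sum()     # H(X)
H_y = -(p_y * np.log(p_y)).sum()     # H(Y)
H_xy = -(p_xy * np.log(p_xy)).sum()  # joint entropy H(X,Y)

H_x_given_y = H_xy - H_y  # chain rule: H(X|Y) = H(X,Y) - H(Y)

mi_identity = H_x - H_x_given_y  # I(X;Y) = H(X) - H(X|Y)
mi_direct = (p_xy * np.log(p_xy / np.outer(p_x, p_y))).sum()

print(mi_identity, mi_direct)  # both ≈ 0.193 nats
```

Knowing Y here removes about 0.193 of the 0.693 nats of uncertainty in X, exactly the overlap of the two circles.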
The catch: MI is intractable for high-dimensional continuous variables. You’d need the full joint density p(x,y), which is exactly what you don’t have. This is why contrastive learning exists — InfoNCE provides a lower bound on MI that you can estimate from samples alone, by training a critic to distinguish real (x,y) pairs from shuffled ones. The tighter the critic, the tighter the bound.
Definition (three equivalent forms):

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)

As KL divergence (most fundamental form):

I(X;Y) = KL( p(x,y) || p(x)p(y) ) = E_{p(x,y)}[ log( p(x,y) / (p(x) p(y)) ) ]

This measures how far the joint distribution is from the product of marginals. Zero iff X and Y are independent.

Conditional mutual information:

I(X;Y|Z) = E_{p(z)}[ KL( p(x,y|z) || p(x|z) p(y|z) ) ] = H(X|Z) - H(X|Y,Z)

InfoNCE lower bound (Oord et al., 2018):

I(X;Y) >= log(N) - L_InfoNCE

where N is the number of samples each positive competes against (one true pair plus N-1 negatives) and the InfoNCE loss is:

L_InfoNCE = -E[ log( f(x,y) / sum_{y' in batch} f(x,y') ) ]

The bound is tight when the critic is proportional to the density ratio, f(x,y) ∝ p(y|x)/p(y). In practice, the bound saturates at log(N) — more negatives = tighter bound.
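The log(N) saturation has a concrete consequence: when the true MI exceeds log(N), no critic can close the gap. A quick check using the standard closed-form MI of jointly Gaussian variables, I(X;Y) = -0.5 log(1 - rho^2); the correlation value is illustrative:

```python
import numpy as np

# Closed-form MI for jointly Gaussian (X, Y) with correlation rho
def gaussian_mi(rho):
    return -0.5 * np.log(1.0 - rho**2)  # in nats

true_mi = gaussian_mi(0.9999)  # strongly correlated pair, ≈ 4.26 nats

# InfoNCE can never report more than log(N), whatever the true MI is
for N in [64, 1024, 65536]:
    ceiling = np.log(N)
    print(f"N={N:6d}  log(N)={ceiling:.2f}  bound can reach true MI: {ceiling >= true_mi}")
# a batch of 64 caps the estimate at log(64) ≈ 4.16 nats, below the true 4.26
```

This is one reason contrastive methods like SimCLR benefit from large batch sizes.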
```python
import torch
import torch.nn.functional as F

# ── InfoNCE loss (contrastive MI estimation) ────────────────────
# This is the standard way to maximise MI in practice.
# z_x and z_y are paired embeddings from two views of the same data.
z_x = encoder_x(x)              # (B, D) — normalised embeddings
z_y = encoder_y(y)              # (B, D)
z_x = F.normalize(z_x, dim=-1)  # unit norm
z_y = F.normalize(z_y, dim=-1)  # unit norm

# Cosine similarity matrix scaled by temperature
logits = z_x @ z_y.T / temperature              # (B, B) — similarity scores
labels = torch.arange(B, device=logits.device)  # (B,) — diagonal is positive

# InfoNCE = cross-entropy where the "correct class" is the matching pair
loss = F.cross_entropy(logits, labels)  # scalar
```
```python
# ── MINE estimator (Belghazi et al., 2018) ──────────────────────
# Direct MI estimation via a learned statistics network T(x,y)
joint_scores = T(x, y)                        # (B,) — scores for real pairs
marginal_scores = T(x, y[torch.randperm(B)])  # (B,) — scores for shuffled pairs

# Donsker-Varadhan bound: I(X;Y) >= E[T(x,y)] - log(E[exp(T(x,y_shuffled))])
mi_lower_bound = (
    joint_scores.mean()
    - torch.logsumexp(marginal_scores, 0)
    + torch.log(torch.tensor(B, dtype=torch.float))
)
```

Manual Implementation
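The Donsker-Varadhan bound used by MINE can be sanity-checked without any training by plugging fixed, hand-picked critics T(x,y) = a·x·y into the bound on correlated Gaussian samples. Everything below is a constructed example; the Gaussian closed form -0.5 log(1 - rho^2) supplies the ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho = 100_000, 0.8

# Correlated Gaussian pairs: y = rho*x + sqrt(1 - rho^2)*noise
x = rng.standard_normal(B)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(B)
y_shuffled = y[rng.permutation(B)]  # breaks the pairing -> product of marginals

def log_mean_exp(t):
    m = t.max()
    return m + np.log(np.exp(t - m).mean())  # numerically stable

def dv_bound(t_joint, t_marginal):
    # I(X;Y) >= E[T] - log E[exp(T)], second expectation over shuffled pairs
    return t_joint.mean() - log_mean_exp(t_marginal)

# Fixed critics T(x,y) = a*x*y for a few values of a (no learning involved)
estimates = [dv_bound(a * x * y, a * x * y_shuffled) for a in (0.2, 0.35, 0.45)]

true_mi = -0.5 * np.log(1 - rho**2)  # ≈ 0.511 nats
print([round(e, 3) for e in estimates], round(true_mi, 3))
# every fixed-critic estimate sits below the true MI, as a lower bound should
```

A trained statistics network would push the bound higher; the point here is only that any critic yields a valid lower bound.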
```python
import numpy as np

def mutual_information_discrete(joint_probs):
    """
    MI from a joint probability table.
    joint_probs: (K_x, K_y) — p(x,y), must sum to 1
    Returns: scalar MI in nats
    """
    p_xy = np.clip(joint_probs, 1e-12, 1.0)  # (K_x, K_y)
    p_x = p_xy.sum(axis=1, keepdims=True)    # (K_x, 1) — marginal
    p_y = p_xy.sum(axis=0, keepdims=True)    # (1, K_y) — marginal

    # I(X;Y) = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
    return (p_xy * np.log(p_xy / (p_x * p_y))).sum()  # scalar

def infonce_loss(z_x, z_y, temperature=0.07):
    """
    InfoNCE contrastive loss (lower bound on MI).
    z_x: (B, D) — L2-normalised embeddings
    z_y: (B, D) — L2-normalised embeddings
    Returns: scalar loss
    """
    B = z_x.shape[0]
    # Cosine similarity matrix
    logits = (z_x @ z_y.T) / temperature  # (B, B)

    # Numerically stable cross-entropy with labels = diagonal
    shifted = logits - logits.max(axis=1, keepdims=True)  # (B, B)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))     # (B,)
    log_probs_diag = shifted[np.arange(B), np.arange(B)]  # (B,) — positive pairs
    return -(log_probs_diag - log_sum_exp).mean()         # scalar
```
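A quick standalone sanity check of the discrete formula (MI is recomputed inline so the snippet runs on its own): independence gives zero, and a deterministic relationship gives I(X;Y) = H(X).

```python
import numpy as np

def mi_nats(p_xy):
    # I(X;Y) = sum p(x,y) * log(p(x,y) / (p(x)*p(y))), clipped to avoid log(0)
    p_xy = np.clip(p_xy, 1e-12, 1.0)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return (p_xy * np.log(p_xy / (p_x * p_y))).sum()

independent = np.outer([0.3, 0.7], [0.6, 0.4])  # p(x,y) = p(x)*p(y) exactly
coupled = np.array([[0.5, 0.0],
                    [0.0, 0.5]])                # y is determined by x

print(mi_nats(independent))  # ≈ 0
print(mi_nats(coupled))      # ≈ log(2) ≈ 0.693, i.e. all of H(X)
```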
Popular Uses

- Contrastive learning (SimCLR, MoCo, CLIP): InfoNCE maximises a lower bound on MI between two augmented views — this is why contrastive learning works as a representation learning method
- InfoMax principle (Deep InfoMax, CPC): learn representations that maximise MI between input and encoding — “preserve as much information as possible”
- Information bottleneck (VIB): compress representations by minimising I(X;Z) while maximising I(Z;Y) — keep only the label-relevant information
- MINE / neural MI estimation (Belghazi et al., 2018): train a neural network to estimate MI directly using variational bounds — useful for analysing learned representations
- Disentangled representations (beta-VAE): total correlation TC(Z) = KL(q(z) || prod q(z_i)) is a multi-variable generalisation of MI — minimising it encourages independent latent dimensions
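For discrete latents, the total correlation mentioned above reduces to TC(Z) = sum_i H(z_i) - H(z), which for exactly two latents equals I(z1;z2). A toy check with hypothetical 2-dimensional binary latents:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def total_correlation(q_joint):
    # TC(Z) = KL(q(z) || prod q(z_i)) = sum_i H(z_i) - H(z) for discrete z
    q1 = q_joint.sum(axis=1)
    q2 = q_joint.sum(axis=0)
    return entropy(q1) + entropy(q2) - entropy(q_joint.ravel())

entangled = np.array([[0.45, 0.05],
                      [0.05, 0.45]])           # latents strongly coupled
factorised = np.outer([0.5, 0.5], [0.5, 0.5])  # q(z) = q(z1)*q(z2)

print(total_correlation(entangled))   # ≈ 0.368 nats: latents share information
print(total_correlation(factorised))  # 0: fully disentangled
```

Minimising this quantity is exactly what pushes beta-VAE-style models toward independent latent dimensions.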
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Correlation / cosine similarity | Quick check for linear relationships | Only captures linear dependence; MI captures arbitrary nonlinear relationships |
| KL divergence | When you have two distributions over the same variable, not two variables | Asymmetric; MI is symmetric and measures shared information between different variables |
| CKA (Centered Kernel Alignment) | Comparing learned representations across models | Practical and stable, but no information-theoretic interpretation |
| Hilbert-Schmidt Independence Criterion | Kernel-based independence testing | Consistent estimator without density estimation, but harder to optimise as a training objective |
| Wasserstein distance | When you care about geometry of distributions, not just dependence | Accounts for the metric structure of the space; MI treats all mismatches equally |
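The correlation row has a classic counterexample worth keeping in mind: Y = X^2 with X symmetric about zero has exactly zero covariance, yet Y is fully determined by X. A constructed illustration:

```python
import numpy as np

x_vals = np.array([-1.0, 0.0, 1.0])  # X uniform on {-1, 0, 1}
p_x = np.full(3, 1/3)

# Covariance of X and Y = X^2 vanishes: E[X^3] - E[X]*E[X^2] = 0
cov = (p_x * x_vals**3).sum() - (p_x * x_vals).sum() * (p_x * x_vals**2).sum()

# But Y is a deterministic function of X, so I(X;Y) = H(Y)
p_y = np.array([1/3, 2/3])  # P(Y=0) = P(X=0), P(Y=1) = P(X=-1 or 1)
mi = -(p_y * np.log(p_y)).sum()

print(cov)  # 0.0
print(mi)   # ≈ 0.637 nats: zero correlation, substantial dependence
```

This is the sense in which MI "captures arbitrary nonlinear relationships" while correlation does not.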
Historical Context
Mutual information was defined by Shannon (1948) alongside entropy as part of the foundational framework of information theory. It remained primarily a theoretical tool in statistics and communications until Linsker (1988) proposed the InfoMax principle: the optimal representation of input data is one that maximises the mutual information between the input and the representation, subject to constraints.
The deep learning revival of MI came from two directions. Oord et al. (2018, “Representation Learning with Contrastive Predictive Coding”) introduced InfoNCE, showing that a contrastive classification loss provides a tractable lower bound on MI — this directly motivated SimCLR, MoCo, and CLIP. Simultaneously, Belghazi et al. (2018, MINE) showed MI could be estimated with neural networks using variational bounds. The practical difficulty of MI estimation (bounds can be loose, high-variance) has led the field to treat contrastive losses as effective objectives in their own right, somewhat independent of the MI interpretation.