Entropy
Measures the uncertainty or information content of a probability distribution. Maximum for uniform distributions (every outcome equally likely), zero for deterministic ones (outcome is certain). The foundation of information theory — cross-entropy loss, KL divergence, and mutual information are all built on top of it.
Intuition
Entropy answers: “how many yes/no questions do I need to ask, on average, to determine the outcome?” If a coin is fair, you need exactly 1 bit (one question). If it has four equally likely outcomes, you need 2 bits. If the outcome is already certain, you need 0 bits — there’s nothing to learn.
The key insight is the logarithm. A rare event (probability 0.01) carries a lot of information when it occurs — “it’s snowing in July” is much more informative than “it’s sunny in July.” The log converts multiplicative probabilities into additive information content: two independent events that each need 3 bits to describe need 6 bits together. Entropy is just the expected (average) information content across all possible outcomes.
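The yes/no-question framing can be checked directly; a minimal sketch using base-2 logarithms (so the result comes out in bits):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits: expected number of yes/no questions."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                        # drop zero-probability outcomes (0 log 0 = 0)
    return float(-(nz * np.log2(nz)).sum())

entropy_bits([0.5, 0.5])    # fair coin: 1.0 bit
entropy_bits([0.25] * 4)    # four equal outcomes: 2.0 bits
entropy_bits([1.0])         # certain outcome: 0.0 bits
```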
Why does this matter for deep learning? Cross-entropy loss decomposes as H(P,Q) = H(P) + D_KL(P||Q). Since H(P) is constant for fixed labels, minimising cross-entropy is equivalent to minimising KL divergence. Entropy also appears directly in reinforcement learning as an exploration bonus — adding H(pi) to the reward encourages the policy to stay stochastic and explore, which is the core idea behind SAC.
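The decomposition can be verified numerically on any pair of distributions; a small sketch (the random Dirichlet draws are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))   # "true" label distribution P
q = rng.dirichlet(np.ones(5))   # model distribution Q

cross_entropy = -(p * np.log(q)).sum()      # H(P, Q)
entropy_p     = -(p * np.log(p)).sum()      # H(P)
kl            =  (p * np.log(p / q)).sum()  # D_KL(P || Q)

assert np.isclose(cross_entropy, entropy_p + kl)  # H(P, Q) = H(P) + D_KL(P || Q)
```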
General form (discrete distribution over $n$ outcomes):

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

with the convention $0 \log 0 = 0$ (the limit is well-defined).
Binary entropy (Bernoulli with parameter $p$):

$$H_b(p) = -p \log p - (1 - p) \log(1 - p)$$

Maximum at $p = \tfrac{1}{2}$, where $H_b(\tfrac{1}{2}) = \log 2 \approx 0.693$ nats (or 1 bit).
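A quick numerical check that the binary entropy curve peaks at $p = 0.5$:

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)                  # grid of Bernoulli parameters
H = -p * np.log(p) - (1 - p) * np.log(1 - p)     # binary entropy in nats
p_max = p[H.argmax()]                            # 0.5, where H = ln 2 ≈ 0.693 nats
```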
Differential entropy (continuous distribution with density $f$):

$$h(X) = -\int f(x) \log f(x)\, dx$$
Warning: differential entropy can be negative (unlike discrete entropy). A Gaussian has $h = \tfrac{1}{2}\log(2\pi e \sigma^2)$, which is negative whenever $\sigma^2 < 1/(2\pi e)$.
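To see this concretely: the Gaussian's differential entropy $\tfrac{1}{2}\ln(2\pi e \sigma^2)$ flips sign as $\sigma$ shrinks.

```python
import numpy as np

def gaussian_diff_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

gaussian_diff_entropy(1.0)   # positive (about 1.419 nats)
gaussian_diff_entropy(0.1)   # negative, since sigma^2 < 1/(2*pi*e)
```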
Maximum entropy: among all distributions on $n$ outcomes, the uniform distribution maximises entropy at $H = \log n$. This is the principle behind maximum entropy models — assume maximum uncertainty subject to constraints.
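A quick empirical sanity check: randomly sampled distributions never exceed the uniform distribution's entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
max_H = np.log(n)                        # uniform distribution: H = log n nats
for _ in range(1000):
    p = rng.dirichlet(np.ones(n))        # a random distribution on n outcomes
    H = -(p * np.log(p)).sum()
    assert H <= max_H + 1e-9             # never beats the uniform's entropy
```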
Relationship to cross-entropy and KL divergence:

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$$
```python
import torch
import torch.nn.functional as F

# ── Entropy of a categorical distribution from logits ───────────
logits = model(x)                           # (B, K) raw scores
probs = F.softmax(logits, dim=-1)           # (B, K)
log_probs = F.log_softmax(logits, dim=-1)   # (B, K) — use log_softmax, not log(softmax)
entropy = -(probs * log_probs).sum(dim=-1)  # (B,) — per-sample entropy

# ── Entropy regularisation in RL (SAC-style) ────────────────────
# Add entropy bonus to encourage exploration
alpha = 0.2  # temperature coefficient
policy_loss = (alpha * log_probs - q_values).mean()  # maximise entropy = minimise negative entropy

# ── Binary entropy ──────────────────────────────────────────────
p = torch.sigmoid(logits)  # (B,) probabilities
binary_entropy = F.binary_cross_entropy(p, p, reduction="none")  # H(p) = CE(p, p), per-element
# Equivalent: -(p * p.log() + (1-p) * (1-p).log())
# WARNING: use binary_cross_entropy (not with_logits) since both args are probs here
```

Manual Implementation
```python
import numpy as np

def entropy(probs):
    """
    Entropy of a discrete distribution.
    probs: (B, K) — each row sums to 1
    Returns: (B,) — entropy in nats
    """
    # Clip to avoid log(0) = -inf
    safe_probs = np.clip(probs, 1e-12, 1.0)            # (B, K)
    return -(probs * np.log(safe_probs)).sum(axis=-1)  # (B,)

def entropy_from_logits(logits):
    """
    Entropy from raw logits (numerically stable).
    logits: (B, K)
    Returns: (B,)
    """
    # Stable log-softmax: subtract max to prevent overflow
    shifted = logits - logits.max(axis=-1, keepdims=True)              # (B, K)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # (B, 1)
    log_probs = shifted - log_sum_exp                                  # (B, K)
    probs = np.exp(log_probs)                                          # (B, K)
    return -(probs * log_probs).sum(axis=-1)                           # (B,)

def binary_entropy(p):
    """
    Entropy of Bernoulli(p).
    p: (B,) probabilities in [0, 1]
    """
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)  # (B,)
```

Popular Uses
- Entropy regularisation in RL (SAC, A2C): add H(pi) as a reward bonus to encourage exploration and prevent premature convergence to a deterministic policy
- Cross-entropy loss decomposition: understanding that minimising CE means minimising KL (since H(P) is constant) — the fundamental insight behind why cross-entropy works for classification
- Maximum entropy models: constrain a distribution to match observed statistics while being maximally uncertain otherwise (MaxEnt RL, exponential family distributions)
- Bits-per-dimension evaluation: convert NLL from nats to bits using log2, then normalise by dimensionality — entropy provides the theoretical lower bound
- Information bottleneck: compress representations by minimising mutual information with input while maximising it with labels — entropy quantifies the compression
- Entropy coding (arithmetic coding, ANS): lossless compression that achieves rates approaching the entropy — used in neural compression models
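As one concrete case from the list above, the bits-per-dimension conversion is a one-liner; the NLL value and image shape below are made-up placeholders:

```python
import numpy as np

nll_nats = 1100.0                           # hypothetical per-image NLL, in nats
D = 3 * 32 * 32                             # dimensionality (e.g. a 32x32 RGB image)
bits_per_dim = nll_nats / (D * np.log(2))   # nats -> bits, then normalise by D
```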
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Rényi entropy | When you need to weight rare vs. common events differently | Parameterised by order $\alpha$; the limit $\alpha \to 1$ recovers Shannon entropy. Higher $\alpha$ focuses on high-probability events |
| Gini impurity | Decision trees (CART) | $G = 1 - \sum_i p_i^2$. Computationally cheaper than entropy (no log), nearly identical splits in practice |
| Variance | Continuous distributions where you want a simple uncertainty measure | Only captures second-order information; doesn’t fully characterise the distribution shape |
| Calibration metrics (ECE) | When you care about whether predicted probabilities are reliable | Measures calibration directly rather than information content; orthogonal to entropy |
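Gini impurity and Shannon entropy from the table can be compared side by side; the example distribution is arbitrary:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])            # arbitrary example distribution
gini = 1 - (p ** 2).sum()                # Gini impurity: no log required
shannon_bits = -(p * np.log2(p)).sum()   # Shannon entropy in bits
# Both are 0 for a deterministic distribution and maximal for a uniform one
```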
Historical Context
Shannon (1948) introduced entropy in “A Mathematical Theory of Communication,” borrowing the name from thermodynamics on von Neumann’s suggestion. Shannon proved that entropy is the fundamental limit on lossless compression — you cannot encode messages from a source with fewer than H(X) bits per symbol on average. This established the field of information theory.
Jaynes (1957) extended entropy to inference with the Maximum Entropy Principle: when building a probability model, choose the distribution with the highest entropy subject to your known constraints. This principle underlies exponential family distributions and connects information theory to statistical mechanics. In modern deep learning, entropy regularisation (Ziebart et al., 2008, “Maximum Entropy Inverse Reinforcement Learning”; Haarnoja et al., 2018, SAC) uses this same idea — maximise entropy subject to getting high reward — producing robust, exploration-friendly policies.