Entropy
Measures the uncertainty or information content of a probability distribution. Maximum for uniform distributions (every outcome equally likely), zero for deterministic ones (outcome is certain). The foundation of information theory — cross-entropy loss, KL divergence, and mutual information are all built on top of it.
Intuition
Entropy answers: “how many yes/no questions do I need to ask, on average, to determine the outcome?” If a coin is fair, you need exactly 1 bit (one question). If it has four equally likely outcomes, you need 2 bits. If the outcome is already certain, you need 0 bits — there’s nothing to learn.
The key insight is the logarithm. A rare event (probability 0.01) carries a lot of information when it occurs — “it’s snowing in July” is much more informative than “it’s sunny in July.” The log converts multiplicative probabilities into additive information content: two independent events that each need 3 bits to describe need 6 bits together. Entropy is just the expected (average) information content across all possible outcomes.
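The yes/no-question framing can be checked directly; a minimal sketch using base-2 logarithms (so the result comes out in bits):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits: expected number of yes/no questions."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                        # drop zero-probability outcomes (0 log 0 = 0)
    return float(-(nz * np.log2(nz)).sum())

entropy_bits([0.5, 0.5])    # fair coin: 1.0 bit
entropy_bits([0.25] * 4)    # four equal outcomes: 2.0 bits
entropy_bits([1.0])         # certain outcome: 0.0 bits
```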
Why does this matter for deep learning? Cross-entropy loss decomposes as H(P,Q) = H(P) + D_KL(P||Q). Since H(P) is constant for fixed labels, minimising cross-entropy is equivalent to minimising KL divergence. Entropy also appears directly in reinforcement learning as an exploration bonus — adding H(pi) to the reward encourages the policy to stay stochastic and explore, which is the core idea behind SAC.
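The decomposition can be verified numerically on any pair of distributions; a small sketch (the random Dirichlet draws are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))   # "true" label distribution P
q = rng.dirichlet(np.ones(5))   # model distribution Q

cross_entropy = -(p * np.log(q)).sum()      # H(P, Q)
entropy_p     = -(p * np.log(p)).sum()      # H(P)
kl            =  (p * np.log(p / q)).sum()  # D_KL(P || Q)

assert np.isclose(cross_entropy, entropy_p + kl)  # H(P, Q) = H(P) + D_KL(P || Q)
```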
General form (discrete distribution over $n$ outcomes):

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

with the convention $0 \log 0 = 0$ (the limit is well-defined).
Binary entropy (Bernoulli with parameter $p$):

$$H_b(p) = -p \log p - (1 - p) \log(1 - p)$$

Maximum at $p = \tfrac{1}{2}$, where $H_b(\tfrac{1}{2}) = \log 2 \approx 0.693$ nats (or 1 bit).
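A quick numerical check that the binary entropy curve peaks at $p = 0.5$:

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)                  # grid of Bernoulli parameters
H = -p * np.log(p) - (1 - p) * np.log(1 - p)     # binary entropy in nats
p_max = p[H.argmax()]                            # 0.5, where H = ln 2 ≈ 0.693 nats
```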
Differential entropy (continuous distribution with density $f$):

$$h(X) = -\int f(x) \log f(x)\, dx$$
Warning: differential entropy can be negative (unlike discrete entropy). A Gaussian has $h = \tfrac{1}{2}\log(2\pi e \sigma^2)$, which is negative whenever $\sigma^2 < 1/(2\pi e)$.
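To see this concretely: the Gaussian's differential entropy $\tfrac{1}{2}\ln(2\pi e \sigma^2)$ flips sign as $\sigma$ shrinks.

```python
import numpy as np

def gaussian_diff_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

gaussian_diff_entropy(1.0)   # positive (about 1.419 nats)
gaussian_diff_entropy(0.1)   # negative, since sigma^2 < 1/(2*pi*e)
```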
Maximum entropy: among all distributions on $n$ outcomes, the uniform distribution maximises entropy at $H = \log n$. This is the principle behind maximum entropy models — assume maximum uncertainty subject to constraints.
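A quick empirical sanity check: randomly sampled distributions never exceed the uniform distribution's entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
max_H = np.log(n)                        # uniform distribution: H = log n nats
for _ in range(1000):
    p = rng.dirichlet(np.ones(n))        # a random distribution on n outcomes
    H = -(p * np.log(p)).sum()
    assert H <= max_H + 1e-9             # never beats the uniform's entropy
```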
Relationship to cross-entropy and KL divergence:

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$$
```python
import torch
import torch.nn.functional as F

# ── Entropy of a categorical distribution from logits ───────────
logits = model(x)                           # (B, K) raw scores
probs = F.softmax(logits, dim=-1)           # (B, K)
log_probs = F.log_softmax(logits, dim=-1)   # (B, K) — use log_softmax, not log(softmax)
entropy = -(probs * log_probs).sum(dim=-1)  # (B,) — per-sample entropy

# ── Entropy regularisation in RL (SAC-style) ────────────────────
# Add entropy bonus to encourage exploration
alpha = 0.2  # temperature coefficient
policy_loss = (alpha * log_probs - q_values).mean()  # maximise entropy = minimise negative entropy

# ── Binary entropy ──────────────────────────────────────────────
p = torch.sigmoid(logits)  # (B,) probabilities
binary_entropy = F.binary_cross_entropy(p, p, reduction="none")  # H(p) = CE(p, p), per-element
# Equivalent: -(p * p.log() + (1-p) * (1-p).log())
# WARNING: use binary_cross_entropy (not with_logits) since both args are probs here
```

Manual Implementation
```python
import numpy as np

def entropy(probs):
    """
    Entropy of a discrete distribution.
    probs: (B, K) — each row sums to 1
    Returns: (B,) — entropy in nats
    """
    # Clip to avoid log(0) = -inf
    safe_probs = np.clip(probs, 1e-12, 1.0)            # (B, K)
    return -(probs * np.log(safe_probs)).sum(axis=-1)  # (B,)

def entropy_from_logits(logits):
    """
    Entropy from raw logits (numerically stable).
    logits: (B, K)
    Returns: (B,)
    """
    # Stable log-softmax: subtract max to prevent overflow
    shifted = logits - logits.max(axis=-1, keepdims=True)              # (B, K)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # (B, 1)
    log_probs = shifted - log_sum_exp                                  # (B, K)
    probs = np.exp(log_probs)                                          # (B, K)
    return -(probs * log_probs).sum(axis=-1)                           # (B,)

def binary_entropy(p):
    """
    Entropy of Bernoulli(p).
    p: (B,) probabilities in [0, 1]
    """
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)  # (B,)
```

Popular Uses
- Entropy regularisation in RL (SAC, A2C): add H(pi) as a reward bonus to encourage exploration and prevent premature convergence to a deterministic policy
- Cross-entropy loss decomposition: understanding that minimising CE means minimising KL (since H(P) is constant) — the fundamental insight behind why cross-entropy works for classification
- Maximum entropy models: constrain a distribution to match observed statistics while being maximally uncertain otherwise (MaxEnt RL, exponential family distributions)
- Bits-per-dimension evaluation: convert NLL from nats to bits using log2, then normalise by dimensionality — entropy provides the theoretical lower bound
- Information bottleneck: compress representations by minimising mutual information with input while maximising it with labels — entropy quantifies the compression
- Entropy coding (arithmetic coding, ANS): lossless compression that achieves rates approaching the entropy — used in neural compression models
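As one concrete case from the list above, the bits-per-dimension conversion is a one-liner; the NLL value and image shape below are made-up placeholders:

```python
import numpy as np

nll_nats = 1100.0                           # hypothetical per-image NLL, in nats
D = 3 * 32 * 32                             # dimensionality (e.g. a 32x32 RGB image)
bits_per_dim = nll_nats / (D * np.log(2))   # nats -> bits, then normalise by D
```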
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Rényi entropy | When you need to weight rare vs. common events differently | Parameterised by order $\alpha$; the limit $\alpha \to 1$ recovers Shannon entropy. Higher $\alpha$ focuses on high-probability events |
| Gini impurity | Decision trees (CART) | $G = 1 - \sum_i p_i^2$. Computationally cheaper than entropy (no log), nearly identical splits in practice |
| Variance | Continuous distributions where you want a simple uncertainty measure | Only captures second-order information; doesn’t fully characterise the distribution shape |
| Calibration metrics (ECE) | When you care about whether predicted probabilities are reliable | Measures calibration directly rather than information content; orthogonal to entropy |
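Gini impurity and Shannon entropy from the table can be compared side by side; the example distribution is arbitrary:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])            # arbitrary example distribution
gini = 1 - (p ** 2).sum()                # Gini impurity: no log required
shannon_bits = -(p * np.log2(p)).sum()   # Shannon entropy in bits
# Both are 0 for a deterministic distribution and maximal for a uniform one
```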
Historical Context
Shannon (1948) introduced entropy in “A Mathematical Theory of Communication,” borrowing the name from thermodynamics on von Neumann’s suggestion. Shannon proved that entropy is the fundamental limit on lossless compression — you cannot encode messages from a source with fewer than H(X) bits per symbol on average. This established the field of information theory.
Jaynes (1957) extended entropy to inference with the Maximum Entropy Principle: when building a probability model, choose the distribution with the highest entropy subject to your known constraints. This principle underlies exponential family distributions and connects information theory to statistical mechanics. In modern deep learning, entropy regularisation (Ziebart et al., 2008, “Maximum Entropy Inverse Reinforcement Learning”; Haarnoja et al., 2018, SAC) uses this same idea — maximise entropy subject to getting high reward — producing robust, exploration-friendly policies.