
Entropy Regularisation

Adds the entropy of the policy H(π) to the reinforcement learning objective: J = E[R] + αH(π). Encourages exploration by preventing the policy from collapsing to a deterministic action too early. The defining component of SAC and a key ingredient in A2C/PPO.

Without entropy regularisation, a policy gradient agent that discovers one good action will immediately exploit it — putting all probability on that action and never trying alternatives. This is premature convergence: the agent gets stuck in a local optimum because it stopped exploring before finding better strategies.

Entropy regularisation adds a bonus for randomness. A policy that spreads probability across many actions has high entropy and gets rewarded for it. The temperature parameter α controls the tradeoff: high α means “explore a lot, even if it costs some reward,” low α means “mostly exploit, with a small exploration nudge.” The optimal policy under entropy regularisation is the Boltzmann (softmax) distribution over Q-values — actions with higher value get more probability, but no action gets zero probability.
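A toy illustration of the temperature knob, with made-up Q-values: the Boltzmann policy π(a) ∝ exp(Q(a)/α) is nearly greedy at low α and nearly uniform at high α.

```python
import numpy as np

def boltzmann(q, alpha):
    """Softmax over Q-values at temperature alpha (max-shifted for stability)."""
    z = (q - q.max()) / alpha
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 0.5, 0.0])       # hypothetical Q-values for three actions

sharp = boltzmann(q, alpha=0.1)     # low temperature: nearly greedy
flat = boltzmann(q, alpha=10.0)     # high temperature: nearly uniform
# sharp ≈ [0.993, 0.007, 0.000]; flat ≈ [0.350, 0.333, 0.317]
```

Note that even at low temperature every action keeps strictly positive probability, which is exactly the property the prose above describes.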

This has a secondary benefit: entropy regularisation makes the optimisation landscape smoother. A deterministic policy has zero-volume support (a single action), making gradients noisy and the landscape spiky. A stochastic policy spreads probability mass, creating smoother gradients and more stable training. SAC exploits this to achieve the sample efficiency of off-policy methods with the stability of stochastic policies.

Entropy of a discrete policy:

H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \log \pi(a|s)

Maximum entropy is log|A| (uniform policy). Minimum is 0 (deterministic policy).
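These bounds are easy to verify numerically for a made-up four-action policy:

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution, with 0 * log 0 = 0 by convention."""
    p = np.asarray(p)
    nz = p[p > 0]
    return -(nz * np.log(nz)).sum()

uniform = [0.25, 0.25, 0.25, 0.25]
deterministic = [1.0, 0.0, 0.0, 0.0]

# entropy(uniform) == log(4) ≈ 1.386 nats, the maximum for |A| = 4
# entropy(deterministic) == 0.0, the minimum
```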

Entropy of a continuous Gaussian policy π = N(μ, σ²I):

H(\pi) = \frac{d}{2} \log(2\pi e) + \sum_{i=1}^{d} \log \sigma_i

Only depends on σ, not μ. Larger variance = higher entropy.
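A quick numeric check, with made-up d, σ, and means, that the entropy really ignores μ: estimate H = -E[log π(a)] by Monte Carlo for two very different means and compare with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 0.5

# Closed form: H = d/2 * log(2*pi*e) + sum_i log(sigma_i)
closed_form = 0.5 * d * np.log(2 * np.pi * np.e) + d * np.log(sigma)

# Monte Carlo estimate of -E[log pi(a)] for two very different means
estimates = []
for mu in (0.0, 10.0):
    x = rng.normal(mu, sigma, size=(100_000, d))
    log_pdf = (-0.5 * ((x - mu) / sigma) ** 2
               - np.log(sigma) - 0.5 * np.log(2 * np.pi)).sum(axis=1)
    estimates.append(-log_pdf.mean())

# Both estimates agree with closed_form (about 2.18 nats): entropy ignores mu
```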

Maximum-entropy RL objective (SAC):

J(\pi) = \mathbb{E}\left[\sum_t \gamma^t \left( r_t + \alpha H(\pi(\cdot | s_t)) \right)\right]

Soft Bellman equation (Q includes entropy):

Q(s, a) = r + \gamma \, \mathbb{E}_{s'} \left[ V(s') \right], \quad V(s) = \mathbb{E}_{a \sim \pi} \left[ Q(s, a) - \alpha \log \pi(a|s) \right]

Optimal policy (soft policy improvement):

\pi^*(a|s) = \frac{\exp(Q(s,a) / \alpha)}{Z(s)} \propto \exp(Q(s,a) / \alpha)

This is a Boltzmann distribution — actions are chosen proportionally to exponentiated Q-values.
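The Boltzmann policy and the soft value equation fit together neatly: substituting π* into V(s) = E[Q − α log π] gives V(s) = α log Σₐ exp(Q(s,a)/α), a "soft max" over Q-values. A small numpy check with made-up numbers:

```python
import numpy as np

q = np.array([2.0, 1.0, -0.5])   # hypothetical Q(s, ·) for three actions
alpha = 0.5

# Soft policy improvement: Boltzmann distribution over Q-values
z = (q - q.max()) / alpha
pi = np.exp(z) / np.exp(z).sum()

# Soft value, two equivalent ways
v_expect = (pi * (q - alpha * np.log(pi))).sum()       # E_pi[Q - alpha log pi]
v_logsumexp = alpha * np.log(np.exp(q / alpha).sum())  # soft max of the Q-values

# v_expect == v_logsumexp up to floating point
```

As α → 0 the soft max approaches max(Q) and the policy approaches greedy; as α grows, exploration is weighted more heavily.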

```python
import torch
from torch.distributions import Categorical, Normal

# Assumes policy_net, state, advantage, alpha, and target_entropy are defined.

# ── Discrete policy (A2C-style) ────────────────────────────────
logits = policy_net(state)                         # (B, n_actions)
dist = Categorical(logits=logits)
action = dist.sample()                             # (B,)
log_prob = dist.log_prob(action)                   # (B,)
entropy = dist.entropy()                           # (B,)

# Add entropy bonus to the policy loss.
# Note: we SUBTRACT because we minimise loss but want to MAXIMISE entropy.
policy_loss = -(log_prob * advantage).mean()
loss = policy_loss - alpha * entropy.mean()

# ── Continuous policy (SAC-style) ──────────────────────────────
mu, log_std = policy_net(state).chunk(2, dim=-1)   # (B, d), (B, d)
std = log_std.clamp(-20, 2).exp()                  # (B, d)
dist = Normal(mu, std)
action = dist.rsample()                            # (B, d) — reparameterised

# SAC uses log_prob directly instead of a separate entropy term
log_prob = dist.log_prob(action).sum(dim=-1)       # (B,)

# For tanh-squashed actions (standard in SAC):
squashed = torch.tanh(action)                      # (B, d) — bounded to [-1, 1]
# Jacobian correction for the tanh transform
log_prob -= torch.log(1 - squashed.pow(2) + 1e-6).sum(dim=-1)  # (B,)
# WARNING: the tanh correction is easy to forget and causes silent
# training instability. Always include it when using squashed actions.

# ── Automatic temperature tuning (SAC) ─────────────────────────
# Learn alpha to hit a target entropy (typically -dim(A))
log_alpha = torch.zeros(1, requires_grad=True)
alpha = log_alpha.exp()
alpha_loss = -(alpha * (log_prob + target_entropy).detach()).mean()
```
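To see the temperature mechanism end-to-end, here is a minimal numpy sketch of the same update with made-up log-probs and a hand-computed gradient (since α = exp(log α), the gradient of the alpha loss with respect to log α equals the loss itself):

```python
import numpy as np

# Made-up batch of log-probs from the current policy; target entropy -dim(A)
log_prob = np.array([-1.2, -0.8, -2.5, -1.9])   # policy entropy approx 1.6 nats
target_entropy = -2.0                            # entropy is above the target

log_alpha, lr = 0.0, 1e-2                        # learn in log-space so alpha > 0
for _ in range(100):
    alpha = np.exp(log_alpha)
    # d/d(log_alpha) of -(alpha * (log_prob + target_entropy)).mean()
    # is the loss itself, because alpha = exp(log_alpha)
    grad = -(alpha * (log_prob + target_entropy)).mean()
    log_alpha -= lr * grad

# Entropy exceeds the target, so the gradient steps shrink alpha below 1
```

When the policy's entropy falls below the target, the same update pushes α back up, which is how SAC automatically balances exploration against exploitation.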
```python
import numpy as np

def discrete_entropy(logits):
    """
    Entropy of a categorical distribution from logits.
    logits: (B, n_actions) raw scores
    """
    # Stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)                      # (B, A)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # (B, A)
    probs = np.exp(log_probs)                                                 # (B, A)
    return -(probs * log_probs).sum(axis=1)                                   # (B,)

def gaussian_entropy(log_std):
    """
    Entropy of a diagonal Gaussian.
    log_std: (B, d) log standard deviations
    """
    d = log_std.shape[1]
    return 0.5 * d * np.log(2 * np.pi * np.e) + log_std.sum(axis=1)           # (B,)

def entropy_regularised_loss(log_probs, advantages, entropy, alpha=0.01):
    """
    A2C-style policy loss with entropy bonus.
    log_probs: (B,) log-prob of taken action
    advantages: (B,) advantage estimates
    entropy: (B,) policy entropy
    """
    policy_loss = -(log_probs * advantages).mean()
    return policy_loss - alpha * entropy.mean()
```
  • SAC (Soft Actor-Critic): entropy regularisation is the core idea — the entire algorithm is built around maximum-entropy RL, with automatic temperature tuning
  • A2C / A3C: adds αH(π) to the policy loss to prevent premature convergence; typically α = 0.01
  • PPO (in practice): many PPO implementations include an entropy bonus even though the original paper doesn’t emphasise it
  • SQL (Soft Q-Learning): predecessor to SAC, uses entropy-augmented Bellman equation with energy-based policies
  • Exploration in sparse-reward environments: entropy bonus keeps the agent exploring when rewards are rare
  • AlphaGo / AlphaZero: MCTS exploration bonus serves a similar role to entropy regularisation in encouraging diverse action selection
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Epsilon-greedy | Simple discrete action spaces (DQN) | No gradient signal for exploration; abrupt transition from random to greedy |
| Boltzmann exploration | Discrete actions, temperature-based | Achieves a similar effect to entropy regularisation but applied post hoc to Q-values |
| Curiosity / ICM | Sparse reward, large state spaces | Intrinsic reward from prediction error; complements entropy regularisation |
| Noise injection (NoisyNet) | When you want learned exploration | Adds parametric noise to weights; exploration adapts per-state without explicit entropy |
| Count-based exploration | Tabular or small state spaces | Bonuses for visiting novel states; doesn’t scale well to continuous spaces |

Entropy regularisation in RL traces back to Ziebart et al. (2008, “Maximum Entropy Inverse Reinforcement Learning”), who formalised maximum-entropy decision-making. The idea entered deep RL through A3C (Mnih et al. 2016), which added an entropy bonus to stabilise policy gradient training — a small but critical detail that many practitioners discovered empirically was necessary.

SAC (Haarnoja et al. 2018, “Soft Actor-Critic”) elevated entropy regularisation from a training trick to a first-class design principle. By building the entire algorithm around maximum-entropy RL — including entropy-augmented Bellman equations and automatic temperature tuning — SAC achieved state-of-the-art sample efficiency and robustness in continuous control. The automatic temperature mechanism (α learned to target a desired entropy) was particularly important: it eliminated a sensitive hyperparameter and let the agent naturally transition from exploration to exploitation as training progressed.