
Entropy Regularisation

Adds the entropy of the policy H(π) to the reinforcement learning objective: J = E[R] + αH(π). Encourages exploration by preventing the policy from collapsing to a deterministic action too early. The defining component of SAC and a key ingredient in A2C/PPO.

Without entropy regularisation, a policy gradient agent that discovers one good action will immediately exploit it — putting all probability on that action and never trying alternatives. This is premature convergence: the agent gets stuck in a local optimum because it stopped exploring before finding better strategies.

Entropy regularisation adds a bonus for randomness. A policy that spreads probability across many actions has high entropy and gets rewarded for it. The temperature parameter α controls the tradeoff: high α means “explore a lot, even if it costs some reward,” low α means “mostly exploit, with a small exploration nudge.” The optimal policy under entropy regularisation is the Boltzmann (softmax) distribution over Q-values — actions with higher value get more probability, but no action gets zero probability.
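A toy illustration of the temperature knob, with made-up Q-values: the Boltzmann policy π(a) ∝ exp(Q(a)/α) is nearly greedy at low α and nearly uniform at high α.

```python
import numpy as np

def boltzmann(q, alpha):
    """Softmax over Q-values at temperature alpha (max-shifted for stability)."""
    z = (q - q.max()) / alpha
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 0.5, 0.0])       # hypothetical Q-values for three actions

sharp = boltzmann(q, alpha=0.1)     # low temperature: nearly greedy
flat = boltzmann(q, alpha=10.0)     # high temperature: nearly uniform
# sharp ≈ [0.993, 0.007, 0.000]; flat ≈ [0.350, 0.333, 0.317]
```

Note that even at low temperature every action keeps strictly positive probability, which is exactly the property the prose above describes.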

This has a secondary benefit: entropy regularisation makes the optimisation landscape smoother. A deterministic policy has zero-volume support (a single action), making gradients noisy and the landscape spiky. A stochastic policy spreads probability mass, creating smoother gradients and more stable training. SAC exploits this to achieve the sample efficiency of off-policy methods with the stability of stochastic policies.

Entropy of a discrete policy:

H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \log \pi(a|s)

Maximum entropy is log|A| (uniform policy). Minimum is 0 (deterministic policy).
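These bounds are easy to verify numerically for a made-up four-action policy:

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution, with 0 * log 0 = 0 by convention."""
    p = np.asarray(p)
    nz = p[p > 0]
    return -(nz * np.log(nz)).sum()

uniform = [0.25, 0.25, 0.25, 0.25]
deterministic = [1.0, 0.0, 0.0, 0.0]

# entropy(uniform) == log(4) ≈ 1.386 nats, the maximum for |A| = 4
# entropy(deterministic) == 0.0, the minimum
```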

Entropy of a continuous Gaussian policy π = N(μ, σ²I):

H(\pi) = \frac{d}{2} \log(2\pi e) + \sum_{i=1}^{d} \log \sigma_i

Only depends on σ, not μ. Larger variance = higher entropy.
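A quick numeric check, with made-up d, σ, and means, that the entropy really ignores μ: estimate H = -E[log π(a)] by Monte Carlo for two very different means and compare with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 0.5

# Closed form: H = d/2 * log(2*pi*e) + sum_i log(sigma_i)
closed_form = 0.5 * d * np.log(2 * np.pi * np.e) + d * np.log(sigma)

# Monte Carlo estimate of -E[log pi(a)] for two very different means
estimates = []
for mu in (0.0, 10.0):
    x = rng.normal(mu, sigma, size=(100_000, d))
    log_pdf = (-0.5 * ((x - mu) / sigma) ** 2
               - np.log(sigma) - 0.5 * np.log(2 * np.pi)).sum(axis=1)
    estimates.append(-log_pdf.mean())

# Both estimates agree with closed_form (about 2.18 nats): entropy ignores mu
```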

Maximum-entropy RL objective (SAC):

J(\pi) = \mathbb{E}\left[\sum_t \gamma^t \left( r_t + \alpha H(\pi(\cdot | s_t)) \right)\right]

Soft Bellman equation (Q includes entropy):

Q(s, a) = r + \gamma \, \mathbb{E}_{s'} \left[ V(s') \right], \quad V(s) = \mathbb{E}_{a \sim \pi} \left[ Q(s, a) - \alpha \log \pi(a|s) \right]

Optimal policy (soft policy improvement):

\pi^*(a|s) = \frac{\exp(Q(s,a) / \alpha)}{Z(s)} \propto \exp(Q(s,a) / \alpha)

This is a Boltzmann distribution — actions are chosen proportionally to exponentiated Q-values.
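The Boltzmann policy and the soft value equation fit together neatly: substituting π* into V(s) = E[Q − α log π] gives V(s) = α log Σₐ exp(Q(s,a)/α), a "soft max" over Q-values. A small numpy check with made-up numbers:

```python
import numpy as np

q = np.array([2.0, 1.0, -0.5])   # hypothetical Q(s, ·) for three actions
alpha = 0.5

# Soft policy improvement: Boltzmann distribution over Q-values
z = (q - q.max()) / alpha
pi = np.exp(z) / np.exp(z).sum()

# Soft value, two equivalent ways
v_expect = (pi * (q - alpha * np.log(pi))).sum()       # E_pi[Q - alpha log pi]
v_logsumexp = alpha * np.log(np.exp(q / alpha).sum())  # soft max of the Q-values

# v_expect == v_logsumexp up to floating point
```

As α → 0 the soft max approaches max(Q) and the policy approaches greedy; as α grows, exploration is weighted more heavily.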

```python
import torch
from torch.distributions import Categorical, Normal

# Assumes policy_net, state, advantage, alpha, and target_entropy are defined.

# ── Discrete policy (A2C-style) ────────────────────────────────
logits = policy_net(state)                         # (B, n_actions)
dist = Categorical(logits=logits)
action = dist.sample()                             # (B,)
log_prob = dist.log_prob(action)                   # (B,)
entropy = dist.entropy()                           # (B,)

# Add entropy bonus to the policy loss.
# Note: we SUBTRACT because we minimise loss but want to MAXIMISE entropy.
policy_loss = -(log_prob * advantage).mean()
loss = policy_loss - alpha * entropy.mean()

# ── Continuous policy (SAC-style) ──────────────────────────────
mu, log_std = policy_net(state).chunk(2, dim=-1)   # (B, d), (B, d)
std = log_std.clamp(-20, 2).exp()                  # (B, d)
dist = Normal(mu, std)
action = dist.rsample()                            # (B, d) — reparameterised

# SAC uses log_prob directly instead of a separate entropy term
log_prob = dist.log_prob(action).sum(dim=-1)       # (B,)

# For tanh-squashed actions (standard in SAC):
squashed = torch.tanh(action)                      # (B, d) — bounded to [-1, 1]
# Jacobian correction for the tanh transform
log_prob -= torch.log(1 - squashed.pow(2) + 1e-6).sum(dim=-1)  # (B,)
# WARNING: the tanh correction is easy to forget and causes silent
# training instability. Always include it when using squashed actions.

# ── Automatic temperature tuning (SAC) ─────────────────────────
# Learn alpha to hit a target entropy (typically -dim(A))
log_alpha = torch.zeros(1, requires_grad=True)
alpha = log_alpha.exp()
alpha_loss = -(alpha * (log_prob + target_entropy).detach()).mean()
```
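To see the temperature mechanism end-to-end, here is a minimal numpy sketch of the same update with made-up log-probs and a hand-computed gradient (since α = exp(log α), the gradient of the alpha loss with respect to log α equals the loss itself):

```python
import numpy as np

# Made-up batch of log-probs from the current policy; target entropy -dim(A)
log_prob = np.array([-1.2, -0.8, -2.5, -1.9])   # policy entropy approx 1.6 nats
target_entropy = -2.0                            # entropy is above the target

log_alpha, lr = 0.0, 1e-2                        # learn in log-space so alpha > 0
for _ in range(100):
    alpha = np.exp(log_alpha)
    # d/d(log_alpha) of -(alpha * (log_prob + target_entropy)).mean()
    # is the loss itself, because alpha = exp(log_alpha)
    grad = -(alpha * (log_prob + target_entropy)).mean()
    log_alpha -= lr * grad

# Entropy exceeds the target, so the gradient steps shrink alpha below 1
```

When the policy's entropy falls below the target, the same update pushes α back up, which is how SAC automatically balances exploration against exploitation.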
```python
import numpy as np

def discrete_entropy(logits):
    """
    Entropy of a categorical distribution from logits.
    logits: (B, n_actions) raw scores
    """
    # Stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)                      # (B, A)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # (B, A)
    probs = np.exp(log_probs)                                                 # (B, A)
    return -(probs * log_probs).sum(axis=1)                                   # (B,)

def gaussian_entropy(log_std):
    """
    Entropy of a diagonal Gaussian.
    log_std: (B, d) log standard deviations
    """
    d = log_std.shape[1]
    return 0.5 * d * np.log(2 * np.pi * np.e) + log_std.sum(axis=1)           # (B,)

def entropy_regularised_loss(log_probs, advantages, entropy, alpha=0.01):
    """
    A2C-style policy loss with entropy bonus.
    log_probs: (B,) log-prob of taken action
    advantages: (B,) advantage estimates
    entropy: (B,) policy entropy
    """
    policy_loss = -(log_probs * advantages).mean()
    return policy_loss - alpha * entropy.mean()
```
  • SAC (Soft Actor-Critic): entropy regularisation is the core idea — the entire algorithm is built around maximum-entropy RL, with automatic temperature tuning
  • A2C / A3C: adds αH(π) to the policy loss to prevent premature convergence; typically α = 0.01
  • PPO (in practice): many PPO implementations include an entropy bonus even though the original paper doesn’t emphasise it
  • SQL (Soft Q-Learning): predecessor to SAC, uses entropy-augmented Bellman equation with energy-based policies
  • Exploration in sparse-reward environments: entropy bonus keeps the agent exploring when rewards are rare
  • AlphaGo / AlphaZero: MCTS exploration bonus serves a similar role to entropy regularisation in encouraging diverse action selection
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Epsilon-greedy | Simple discrete action spaces (DQN) | No gradient signal for exploration; abrupt transition from random to greedy |
| Boltzmann exploration | Discrete actions, temperature-based | Achieves a similar effect to entropy regularisation but applied post hoc to Q-values |
| Curiosity / ICM | Sparse reward, large state spaces | Intrinsic reward from prediction error; complements entropy regularisation |
| Noise injection (NoisyNet) | When you want learned exploration | Adds parametric noise to weights; exploration adapts per-state without explicit entropy |
| Count-based exploration | Tabular or small state spaces | Bonuses for visiting novel states; doesn’t scale well to continuous spaces |

Entropy regularisation in RL traces back to Ziebart et al. (2008, “Maximum Entropy Inverse Reinforcement Learning”), who formalised maximum-entropy decision-making. The idea entered deep RL through A3C (Mnih et al. 2016), which added an entropy bonus to stabilise policy gradient training — a small but critical detail that many practitioners discovered empirically was necessary.

SAC (Haarnoja et al. 2018, “Soft Actor-Critic”) elevated entropy regularisation from a training trick to a first-class design principle. By building the entire algorithm around maximum-entropy RL — including entropy-augmented Bellman equations and automatic temperature tuning — SAC achieved state-of-the-art sample efficiency and robustness in continuous control. The automatic temperature mechanism (α learned to target a desired entropy) was particularly important: it eliminated a sensitive hyperparameter and let the agent naturally transition from exploration to exploitation as training progressed.