Log-Derivative Trick

The identity $\nabla_\theta \, p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$ lets you estimate gradients of expectations by sampling, without differentiating through the sampling process itself. Also called the REINFORCE trick or score function estimator. The foundation of all policy gradient methods (see policy-gradient/).

Suppose you want to optimise $\mathbb{E}_{x \sim p_\theta}[f(x)]$ — the expected value of some reward $f$ under a distribution you control. The problem: you can’t backprop through the act of sampling. Sampling is a discrete, non-differentiable operation — you rolled a die and got a 4, and there’s no gradient of “rolling a 4” with respect to the die’s probabilities.

The log-derivative trick sidesteps this entirely. Instead of differentiating through the sample, it says: “keep the sample fixed, and ask how much more likely that sample would become if you nudged the parameters.” If a high-reward sample would become more likely with a small parameter change, that’s a good direction. The gradient $\nabla_\theta \log p_\theta(x)$ is exactly this “how to make $x$ more likely” direction, and $f(x)$ weights it by how good that sample was.

The cost: high variance. You’re estimating a gradient from individual samples, and $f(x) \, \nabla_\theta \log p_\theta(x)$ can be noisy. This is why every practical algorithm (A2C, PPO) adds a baseline $b$ to form $[f(x) - b] \, \nabla_\theta \log p_\theta(x)$ — the baseline doesn’t change the expected gradient but dramatically reduces variance.

The core identity — from the chain rule applied to $\log p_\theta(x)$:

$$\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta \, p_\theta(x)}{p_\theta(x)} \quad \Longrightarrow \quad \nabla_\theta \, p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$$

Gradient of an expectation — substitute the identity into $\nabla_\theta \mathbb{E}[f(x)]$:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \nabla_\theta \int p_\theta(x) \, f(x) \, dx = \int p_\theta(x) \, f(x) \, \nabla_\theta \log p_\theta(x) \, dx$$

$$= \mathbb{E}_{x \sim p_\theta}\bigl[f(x) \, \nabla_\theta \log p_\theta(x)\bigr]$$

This expectation can be estimated by Monte Carlo sampling — draw x1,,xNpθx_1, \dots, x_N \sim p_\theta and average.
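As a worked sanity check (my own sketch, not from the text): take $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, where the answer is known analytically — $\mathbb{E}[x^2] = \theta^2 + 1$, so the true gradient is $2\theta$. The score of a unit-variance Gaussian is $\nabla_\theta \log p_\theta(x) = x - \theta$.

```python
import numpy as np

# Monte Carlo score-function estimate of ∇θ E[f(x)] for x ~ N(θ, 1), f(x) = x².
# True gradient: d/dθ (θ² + 1) = 2θ.
rng = np.random.default_rng(0)
theta, N = 1.5, 200_000

x = rng.normal(theta, 1.0, size=N)   # draw x_i ~ p_θ (no differentiation needed)
score = x - theta                    # ∇θ log N(x; θ, 1)
estimate = np.mean(x**2 * score)     # (1/N) Σ f(x_i) ∇θ log p_θ(x_i)

print(estimate)                      # ≈ 3.0 (true gradient is 2θ = 3)
```

Note the estimator only ever evaluates $f$ at the samples; it never needs $f$ to be differentiable.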

With baseline (variance reduction; does not change the expectation because $\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$):

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} \bigl[f(x_i) - b\bigr] \, \nabla_\theta \log p_\theta(x_i)$$
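The variance reduction is easy to see numerically. A hedged illustration (assumed Gaussian setup, not from the text): with $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, compare the per-sample variance of the estimator with $b = 0$ and with $b = \mathbb{E}[f(x)] = \theta^2 + 1$; both have the same mean, $2\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, N = 1.5, 200_000

x = rng.normal(theta, 1.0, size=N)
score = x - theta                    # ∇θ log p_θ(x) for a unit-variance Gaussian
b = theta**2 + 1.0                   # a reasonable (not optimal) baseline: E[f(x)]

terms_no_b = x**2 * score            # f(x) ∇θ log p
terms_b = (x**2 - b) * score         # [f(x) − b] ∇θ log p

print(terms_no_b.mean(), terms_b.mean())   # both ≈ 2θ = 3 — same expectation
print(terms_no_b.var(), terms_b.var())     # baselined variance is noticeably lower
```

The means agree (the baseline adds $-b \, \mathbb{E}[\nabla_\theta \log p_\theta] = 0$), while the per-sample variance drops, so fewer samples are needed for the same gradient accuracy.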

Policy gradient specialisation — $p_\theta$ is a policy $\pi_\theta(a|s)$, $f$ is the return $R$:

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\Bigl[\sum_{t=0}^{T} (R_t - b_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigr]$$

```python
import torch

# ── Policy gradient using the log-derivative trick ──────────────
# The key line: log_prob * advantage gives a surrogate loss whose
# gradient equals the policy gradient. We NEVER differentiate
# through the sampling — actions are treated as fixed constants.
# (Assumes policy_net, states, and advantages are defined elsewhere;
# advantages must be precomputed and detached from the graph.)
logits = policy_net(states)               # (B, n_actions)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                   # (B,) — no gradient here
log_probs = dist.log_prob(actions)        # (B,) — gradient flows through logits
surrogate_loss = -(log_probs * advantages).mean()   # scalar
surrogate_loss.backward()                 # ∇θ matches the policy gradient
# WARNING: the negative sign is essential — we MAXIMISE expected reward
# by MINIMISING negative log_prob * advantage.

# ── With entropy bonus (encourages exploration) ─────────────────
entropy = dist.entropy().mean()           # scalar
loss = -(log_probs * advantages).mean() - 0.01 * entropy
```

```python
import numpy as np

def reinforce_gradient(logits, actions, rewards, baseline=0.0):
    """
    Compute the REINFORCE policy gradient estimate.

    logits:   (B, A) raw scores from the policy network
    actions:  (B,)   integer actions that were taken
    rewards:  (B,)   scalar rewards received
    baseline: scalar or (B,) baseline for variance reduction

    Returns: (B, A) per-sample gradient w.r.t. the logits —
    the surrogate loss gradient.
    """
    B, A = logits.shape
    # Softmax: convert logits to action probabilities
    shifted = logits - logits.max(axis=1, keepdims=True)  # (B, A) numerical stability
    exp_logits = np.exp(shifted)                          # (B, A)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)  # (B, A)
    # ∇_logits log π(a|s) for a categorical distribution:
    # = one_hot(a) - probs (the "softmax gradient" identity)
    one_hot = np.zeros_like(logits)                       # (B, A)
    one_hot[np.arange(B), actions] = 1.0
    grad_log_prob = one_hot - probs                       # (B, A)
    # Weight by the advantage: (reward - baseline)
    advantage = (rewards - baseline)[:, None]             # (B, 1)
    return advantage * grad_log_prob  # (B, A); average over B for the batch gradient
```
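The `one_hot - probs` step relies on the softmax log-prob gradient identity. A quick finite-difference check (my own sketch, not part of the original) confirms it on a small example:

```python
import numpy as np

# Verify ∇_logits log softmax(logits)[a] = one_hot(a) − softmax(logits)
logits = np.array([0.5, -1.0, 2.0])
a = 2  # the "taken" action

def log_prob(z):
    z = z - z.max()                  # numerical stability
    return z[a] - np.log(np.exp(z).sum())

eps = 1e-6
fd = np.array([
    (log_prob(logits + eps * np.eye(3)[i]) - log_prob(logits - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
analytic = np.eye(3)[a] - probs

print(np.allclose(fd, analytic, atol=1e-4))  # True
```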
  • REINFORCE and all policy gradient methods (A2C, PPO): the log-derivative trick IS the policy gradient theorem (see policy-gradient/)
  • Variational inference (original VAE gradient estimator): before the reparameterisation trick, REINFORCE was used to estimate ϕEqϕ(zx)[logp(xz)]\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[\log p(x|z)] — it works but has high variance
  • Discrete latent variable models: any model with discrete sampling (hard attention, discrete VAE) must use score-function estimators since you can’t reparameterise discrete distributions
  • Black-box optimisation / evolution strategies (OpenAI ES): estimate gradients of non-differentiable objectives by perturbing parameters and weighting by fitness
  • Neural architecture search (ENAS): policy gradient over discrete architecture decisions
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Reparameterisation trick | Continuous latent variables (VAEs, normalising flows) | Much lower variance but requires a differentiable sampling path — doesn’t work for discrete distributions |
| Gumbel-softmax | Discrete variables where you want low-variance gradients | Continuous relaxation introduces bias; requires a temperature schedule |
| Straight-through estimator | Discrete forward pass with simple gradient approximation (VQ-VAE) | Biased gradient, but zero variance and dead simple to implement |
| Pathwise derivative | Deterministic functions of random inputs | Same idea as reparameterisation — only works when you can express sampling as a deterministic transform of fixed noise |
| Evolution strategies | Non-differentiable objectives, parallel hardware | Scales to massive parallelism but needs many more samples than REINFORCE |
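To make the “much lower variance” claim for the pathwise/reparameterised estimator concrete, here is a sketch under an assumed setup (not from the text): $x \sim \mathcal{N}(\theta, 1)$ written as $x = \theta + \varepsilon$ with fixed noise $\varepsilon \sim \mathcal{N}(0, 1)$, and $f(x) = x^2$, so both estimators target $\nabla_\theta \mathbb{E}[f] = 2\theta$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, N = 1.5, 100_000
eps = rng.normal(0.0, 1.0, size=N)   # fixed noise, independent of θ
x = theta + eps                      # reparameterised sample

score_terms = x**2 * (x - theta)     # score-function: f(x) ∇θ log p_θ(x)
pathwise_terms = 2 * x               # pathwise: f'(x) · ∂x/∂θ, with ∂x/∂θ = 1

print(score_terms.mean(), pathwise_terms.mean())  # both ≈ 2θ = 3
print(score_terms.var(), pathwise_terms.var())    # pathwise variance is far smaller
```

The pathwise estimator uses $f$’s derivative and so needs $f$ differentiable and a reparameterisable distribution; the score-function estimator needs neither, which is exactly the tradeoff in the table above.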

The identity $p \, \nabla \log p = \nabla p$ is elementary calculus, but its use for gradient estimation was formalised by Williams (1992) in the REINFORCE paper, which showed how to train stochastic neural networks by sampling. The key insight — that you could optimise expectations without differentiating through the sampling — opened the door to reinforcement learning with function approximation.

The trick was independently known in statistics as the “score function method” and in operations research as the “likelihood ratio method.” Its resurgence in deep learning came through policy gradient methods (Sutton et al., 1999) and later through variational inference (Wingate & Weber, 2013), though in the VAE setting it was quickly superseded by the lower-variance reparameterisation trick (Kingma & Welling, 2014).