Log-Derivative Trick

The identity $\nabla_\theta \, p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$ lets you estimate gradients of expectations by sampling, without differentiating through the sampling process itself. Also called the REINFORCE trick or score function estimator. The foundation of all policy gradient methods (see policy-gradient/).

Suppose you want to optimise $\mathbb{E}_{x \sim p_\theta}[f(x)]$ — the expected value of some reward $f$ under a distribution you control. The problem: you can’t backprop through the act of sampling. Sampling is a discrete, non-differentiable operation — you rolled a die and got a 4, and there’s no gradient of “rolling a 4” with respect to the die’s probabilities.

The log-derivative trick sidesteps this entirely. Instead of differentiating through the sample, it says: “keep the sample fixed, and ask how much more likely that sample would become if you nudged the parameters.” If a high-reward sample would become more likely with a small parameter change, that’s a good direction. The gradient $\nabla_\theta \log p_\theta(x)$ is exactly this “how to make $x$ more likely” direction, and $f(x)$ weights it by how good that sample was.

The cost: high variance. You’re estimating a gradient from individual samples, and $f(x) \, \nabla_\theta \log p_\theta(x)$ can be noisy. This is why every practical algorithm (A2C, PPO) adds a baseline $b$ to form $[f(x) - b] \, \nabla_\theta \log p_\theta(x)$ — the baseline doesn’t change the expected gradient but dramatically reduces variance.

The core identity — from the chain rule applied to $\log p_\theta(x)$:

$$\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta \, p_\theta(x)}{p_\theta(x)} \quad \Longrightarrow \quad \nabla_\theta \, p_\theta(x) = p_\theta(x) \, \nabla_\theta \log p_\theta(x)$$

Gradient of an expectation — substitute the identity into $\nabla_\theta \mathbb{E}[f(x)]$:

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \nabla_\theta \int p_\theta(x) \, f(x) \, dx = \int p_\theta(x) \, f(x) \, \nabla_\theta \log p_\theta(x) \, dx$$

$$= \mathbb{E}_{x \sim p_\theta}\bigl[f(x) \, \nabla_\theta \log p_\theta(x)\bigr]$$

This expectation can be estimated by Monte Carlo sampling — draw x1,,xNpθx_1, \dots, x_N \sim p_\theta and average.
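As a worked sanity check (my own sketch, not from the text): take $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, where the answer is known analytically — $\mathbb{E}[x^2] = \theta^2 + 1$, so the true gradient is $2\theta$. The score of a unit-variance Gaussian is $\nabla_\theta \log p_\theta(x) = x - \theta$.

```python
import numpy as np

# Monte Carlo score-function estimate of ∇θ E[f(x)] for x ~ N(θ, 1), f(x) = x².
# True gradient: d/dθ (θ² + 1) = 2θ.
rng = np.random.default_rng(0)
theta, N = 1.5, 200_000

x = rng.normal(theta, 1.0, size=N)   # draw x_i ~ p_θ (no differentiation needed)
score = x - theta                    # ∇θ log N(x; θ, 1)
estimate = np.mean(x**2 * score)     # (1/N) Σ f(x_i) ∇θ log p_θ(x_i)

print(estimate)                      # ≈ 3.0 (true gradient is 2θ = 3)
```

Note the estimator only ever evaluates $f$ at the samples; it never needs $f$ to be differentiable.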

With baseline (variance reduction; does not change the expectation because $\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$):

$$\nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} \bigl[f(x_i) - b\bigr] \, \nabla_\theta \log p_\theta(x_i)$$
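The variance reduction is easy to see numerically. A hedged illustration (assumed Gaussian setup, not from the text): with $x \sim \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, compare the per-sample variance of the estimator with $b = 0$ and with $b = \mathbb{E}[f(x)] = \theta^2 + 1$; both have the same mean, $2\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, N = 1.5, 200_000

x = rng.normal(theta, 1.0, size=N)
score = x - theta                    # ∇θ log p_θ(x) for a unit-variance Gaussian
b = theta**2 + 1.0                   # a reasonable (not optimal) baseline: E[f(x)]

terms_no_b = x**2 * score            # f(x) ∇θ log p
terms_b = (x**2 - b) * score         # [f(x) − b] ∇θ log p

print(terms_no_b.mean(), terms_b.mean())   # both ≈ 2θ = 3 — same expectation
print(terms_no_b.var(), terms_b.var())     # baselined variance is noticeably lower
```

The means agree (the baseline adds $-b \, \mathbb{E}[\nabla_\theta \log p_\theta] = 0$), while the per-sample variance drops, so fewer samples are needed for the same gradient accuracy.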

Policy gradient specialisation — $p_\theta$ is a policy $\pi_\theta(a|s)$, $f$ is the return $R$:

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\Bigl[\sum_{t=0}^{T} (R_t - b_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigr]$$

```python
import torch

# ── Policy gradient using the log-derivative trick ──────────────
# The key line: log_prob * advantage gives a surrogate loss whose
# gradient equals the policy gradient. We NEVER differentiate
# through the sampling — actions are treated as fixed constants.
# (Assumes policy_net, states, and advantages are defined elsewhere;
# advantages must be precomputed and detached from the graph.)
logits = policy_net(states)               # (B, n_actions)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                   # (B,) — no gradient here
log_probs = dist.log_prob(actions)        # (B,) — gradient flows through logits
surrogate_loss = -(log_probs * advantages).mean()   # scalar
surrogate_loss.backward()                 # ∇θ matches the policy gradient
# WARNING: the negative sign is essential — we MAXIMISE expected reward
# by MINIMISING negative log_prob * advantage.

# ── With entropy bonus (encourages exploration) ─────────────────
entropy = dist.entropy().mean()           # scalar
loss = -(log_probs * advantages).mean() - 0.01 * entropy
```

```python
import numpy as np

def reinforce_gradient(logits, actions, rewards, baseline=0.0):
    """
    Compute the REINFORCE policy gradient estimate.

    logits:   (B, A) raw scores from the policy network
    actions:  (B,)   integer actions that were taken
    rewards:  (B,)   scalar rewards received
    baseline: scalar or (B,) baseline for variance reduction

    Returns: (B, A) per-sample gradient w.r.t. the logits —
    the surrogate loss gradient.
    """
    B, A = logits.shape
    # Softmax: convert logits to action probabilities
    shifted = logits - logits.max(axis=1, keepdims=True)  # (B, A) numerical stability
    exp_logits = np.exp(shifted)                          # (B, A)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)  # (B, A)
    # ∇_logits log π(a|s) for a categorical distribution:
    # = one_hot(a) - probs (the "softmax gradient" identity)
    one_hot = np.zeros_like(logits)                       # (B, A)
    one_hot[np.arange(B), actions] = 1.0
    grad_log_prob = one_hot - probs                       # (B, A)
    # Weight by the advantage: (reward - baseline)
    advantage = (rewards - baseline)[:, None]             # (B, 1)
    return advantage * grad_log_prob  # (B, A); average over B for the batch gradient
```
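The `one_hot - probs` step relies on the softmax log-prob gradient identity. A quick finite-difference check (my own sketch, not part of the original) confirms it on a small example:

```python
import numpy as np

# Verify ∇_logits log softmax(logits)[a] = one_hot(a) − softmax(logits)
logits = np.array([0.5, -1.0, 2.0])
a = 2  # the "taken" action

def log_prob(z):
    z = z - z.max()                  # numerical stability
    return z[a] - np.log(np.exp(z).sum())

eps = 1e-6
fd = np.array([
    (log_prob(logits + eps * np.eye(3)[i]) - log_prob(logits - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
analytic = np.eye(3)[a] - probs

print(np.allclose(fd, analytic, atol=1e-4))  # True
```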
  • REINFORCE and all policy gradient methods (A2C, PPO): the log-derivative trick IS the policy gradient theorem (see policy-gradient/)
  • Variational inference (original VAE gradient estimator): before the reparameterisation trick, REINFORCE was used to estimate ϕEqϕ(zx)[logp(xz)]\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[\log p(x|z)] — it works but has high variance
  • Discrete latent variable models: any model with discrete sampling (hard attention, discrete VAE) must use score-function estimators since you can’t reparameterise discrete distributions
  • Black-box optimisation / evolution strategies (OpenAI ES): estimate gradients of non-differentiable objectives by perturbing parameters and weighting by fitness
  • Neural architecture search (ENAS): policy gradient over discrete architecture decisions
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Reparameterisation trick | Continuous latent variables (VAEs, normalising flows) | Much lower variance but requires a differentiable sampling path — doesn’t work for discrete distributions |
| Gumbel-softmax | Discrete variables where you want low-variance gradients | Continuous relaxation introduces bias; requires a temperature schedule |
| Straight-through estimator | Discrete forward pass with simple gradient approximation (VQ-VAE) | Biased gradient, but zero variance and dead simple to implement |
| Pathwise derivative | Deterministic functions of random inputs | Same idea as reparameterisation — only works when you can express sampling as a deterministic transform of fixed noise |
| Evolution strategies | Non-differentiable objectives, parallel hardware | Scales to massive parallelism but needs many more samples than REINFORCE |
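To make the “much lower variance” claim for the pathwise/reparameterised estimator concrete, here is a sketch under an assumed setup (not from the text): $x \sim \mathcal{N}(\theta, 1)$ written as $x = \theta + \varepsilon$ with fixed noise $\varepsilon \sim \mathcal{N}(0, 1)$, and $f(x) = x^2$, so both estimators target $\nabla_\theta \mathbb{E}[f] = 2\theta$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, N = 1.5, 100_000
eps = rng.normal(0.0, 1.0, size=N)   # fixed noise, independent of θ
x = theta + eps                      # reparameterised sample

score_terms = x**2 * (x - theta)     # score-function: f(x) ∇θ log p_θ(x)
pathwise_terms = 2 * x               # pathwise: f'(x) · ∂x/∂θ, with ∂x/∂θ = 1

print(score_terms.mean(), pathwise_terms.mean())  # both ≈ 2θ = 3
print(score_terms.var(), pathwise_terms.var())    # pathwise variance is far smaller
```

The pathwise estimator uses $f$’s derivative and so needs $f$ differentiable and a reparameterisable distribution; the score-function estimator needs neither, which is exactly the tradeoff in the table above.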

The identity $p \, \nabla \log p = \nabla p$ is elementary calculus, but its use for gradient estimation was formalised by Williams (1992) in the REINFORCE paper, which showed how to train stochastic neural networks by sampling. The key insight — that you could optimise expectations without differentiating through the sampling — opened the door to reinforcement learning with function approximation.

The trick was independently known in statistics as the “score function method” and in operations research as the “likelihood ratio method.” Its resurgence in deep learning came through policy gradient methods (Sutton et al., 1999) and later through variational inference (Wingate & Weber, 2013), though in the VAE setting it was quickly superseded by the lower-variance reparameterisation trick (Kingma & Welling, 2014).