Importance Sampling

Estimates an expectation under one distribution using samples from a different distribution. If you collected data under policy $q$ but need to evaluate something under policy $p$, multiply each sample by the ratio $p(x)/q(x)$ to correct the mismatch. This is the mathematical foundation for off-policy learning in RL.

Suppose you want to estimate average rainfall in Seattle but you only have weather data from Phoenix. Phoenix data under-represents rainy days, so you reweight: each rainy day gets a large multiplier (rain is common in Seattle, rare in Phoenix) and each sunny day gets a small one. The result is an unbiased estimate of Seattle’s average — as if you had sampled there directly.
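As a toy numerical check of this reweighting (the rain probabilities below are made up for the analogy, not real weather data), one can simulate Phoenix days and recover Seattle's average rainfall:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up rain probabilities for the analogy
p_rain_seattle = 0.40          # target distribution p
p_rain_phoenix = 0.05          # proposal distribution q
mm_if_rainy = 10.0             # rainfall on a rainy day, 0 otherwise

# Sample 200k days from the Phoenix distribution
rainy = rng.random(200_000) < p_rain_phoenix

# Importance weight p(x)/q(x) for each day type
w = np.where(rainy,
             p_rain_seattle / p_rain_phoenix,              # rainy: weight 8
             (1 - p_rain_seattle) / (1 - p_rain_phoenix))  # sunny: ~0.63

f = np.where(rainy, mm_if_rainy, 0.0)
seattle_estimate = (w * f).mean()   # unbiased estimate of 0.40 * 10 = 4.0 mm
```

Rainy days are rare in the Phoenix sample but each one carries a weight of 8, so the weighted average matches Seattle's true mean.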

The ratio $p(x)/q(x)$ is the importance weight. When $q$ assigns low probability to something that $p$ considers likely, the weight is huge. This is both the power and the danger: a few samples can dominate the estimate, causing high variance. In the extreme case where $q$ never samples something that $p$ cares about, the estimator breaks entirely.
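A small sketch of this failure mode (a toy Gaussian example, not from the original text): estimating $\mathbb{E}_p[x^2]$ for $p = \mathcal{N}(0,1)$ with a well-matched versus a badly mismatched proposal.

```python
import numpy as np

rng = np.random.default_rng(1)

def is_estimate(q_mean, q_std, n=100_000):
    """Importance-sampling estimate of E_p[x^2] for p = N(0, 1)."""
    x = rng.normal(q_mean, q_std, n)                  # samples from q
    log_p = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
    log_q = (-0.5 * ((x - q_mean) / q_std) ** 2
             - np.log(q_std) - 0.5 * np.log(2 * np.pi))
    w = np.exp(log_p - log_q)                         # importance weights
    return (w * x**2).mean(), w.max()

good_est, good_wmax = is_estimate(0.0, 1.2)  # q close to p: weights stay tame
bad_est, bad_wmax = is_estimate(3.0, 0.5)    # q misses p's mass: rare samples
                                             #   carry enormous weights
```

With the well-matched proposal the estimate lands near the true value of 1 and all weights stay close to 1; with the mismatched proposal, occasional samples from $p$'s bulk receive weights orders of magnitude larger than the rest and dominate the sum.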

In PPO, the ratio $\pi_\text{new}(a|s) / \pi_\text{old}(a|s)$ is exactly importance sampling. You collected trajectories under $\pi_\text{old}$ but want the gradient for $\pi_\text{new}$. The ratio corrects for this mismatch, but if $\pi_\text{new}$ drifts too far from $\pi_\text{old}$, the ratios explode, which is why PPO clips them to $[1-\epsilon, 1+\epsilon]$.

The identity:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x) \cdot \frac{p(x)}{q(x)}\right]$$

where $w(x) = p(x)/q(x)$ is the importance weight. Requires $q(x) > 0$ wherever $p(x) > 0$.
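On a finite outcome space the identity can be verified exactly; a minimal sketch with a made-up 3-outcome distribution:

```python
import numpy as np

# Discrete sanity check of the identity E_p[f] = E_q[f * p/q]
x = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])      # target distribution
q = np.array([0.6, 0.3, 0.1])      # proposal; q > 0 wherever p > 0
f = x ** 2

direct = (p * f).sum()                  # expectation computed under p
reweighted = (q * f * (p / q)).sum()    # expectation under q with IS weights
```

Both quantities equal $0.2 \cdot 0 + 0.5 \cdot 1 + 0.3 \cdot 4 = 1.7$; the weights $p/q$ cancel the proposal's probabilities exactly.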

Monte Carlo estimator (with $N$ samples from $q$):

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \cdot \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q$$

Self-normalised variant (more stable, slightly biased):

$$\hat{\mu}_\text{SN} = \frac{\sum_i f(x_i) \cdot w(x_i)}{\sum_i w(x_i)}$$

PPO surrogate objective (importance-sampled policy gradient):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1{-}\epsilon, 1{+}\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ is the importance ratio.

Prioritised experience replay (correcting sampling bias):

$$w_i = \left(\frac{1}{N \cdot P(i)}\right)^\beta$$

where $P(i)$ is the priority-based sampling probability and $\beta \to 1$ gives full correction.

```python
import torch
import torch.nn.functional as F

# ── PPO importance ratio ─────────────────────────────────────────
with torch.no_grad():
    old_log_probs = old_policy.log_prob(actions)  # (B,) — from collection policy
new_log_probs = policy.log_prob(actions)          # (B,) — current policy, keeps grad
ratio = (new_log_probs - old_log_probs).exp()     # (B,) — importance weight p/q

# Clipped surrogate objective
surr1 = ratio * advantages                        # (B,)
surr2 = ratio.clamp(1 - eps, 1 + eps) * advantages  # (B,)
loss = -torch.min(surr1, surr2).mean()

# ── Prioritised replay IS correction ─────────────────────────────
priorities = torch.tensor([0.5, 0.3, 0.2])        # P(i) — sampling probs
N = len(priorities)
beta = 0.4                                        # annealed toward 1.0
is_weights = (1.0 / (N * priorities)) ** beta     # (B,)
is_weights /= is_weights.max()                    # normalise for stability
loss = (is_weights * F.mse_loss(q_pred, target, reduction='none')).mean()
```

Warning: if `old_log_probs` is not detached, gradients flow into the old policy. Always compute old log-probs inside `torch.no_grad()` or `.detach()` them.

```python
import numpy as np

def importance_sampling_estimate(f_values, log_p, log_q):
    """
    Estimate E_p[f(x)] using samples from q.

    f_values: (N,) function values f(x_i) for each sample
    log_p:    (N,) log p(x_i) — target distribution
    log_q:    (N,) log q(x_i) — proposal distribution
    """
    log_w = log_p - log_q               # (N,) — log importance weights
    w = np.exp(log_w - log_w.max())     # (N,) — stabilised weights
    # Self-normalised estimate (lower variance, slight bias)
    return (w * f_values).sum() / w.sum()

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """
    PPO's clipped surrogate with importance sampling correction.
    All inputs: (B,)
    """
    ratio = np.exp(new_log_probs - old_log_probs)          # (B,) — IS ratio
    surr1 = ratio * advantages                             # (B,)
    surr2 = np.clip(ratio, 1 - eps, 1 + eps) * advantages  # (B,)
    return np.minimum(surr1, surr2).mean()

def prioritised_replay_weights(priorities, beta=0.4):
    """
    IS weights for prioritised experience replay.
    priorities: (B,) — sampling probabilities P(i)
    """
    N = len(priorities)
    w = (1.0 / (N * priorities)) ** beta   # (B,)
    return w / w.max()                     # (B,) — normalised
```
  • PPO (policy gradient): the probability ratio $\pi_\text{new}/\pi_\text{old}$ corrects for using trajectories from the old policy
  • Prioritised experience replay (DQN variants): samples with high TD error are drawn more often; IS weights debias the non-uniform sampling
  • Off-policy evaluation (RL): estimate the value of a new policy using data collected by a different (behaviour) policy
  • Variational inference (VAE): the ELBO can be derived as importance-weighted estimation of $\log p(x)$; IWAE uses multiple importance samples for a tighter bound
  • Particle filters / sequential Monte Carlo: reweight particles to track a target distribution over time
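The IWAE point can be made concrete with a hedged toy sketch: a conjugate Gaussian model chosen because $\log p(x)$ is available in closed form (the model, the `iw_bound` helper, and all numbers here are illustrative assumptions, not from any library). Increasing the number of importance samples $K$ tightens the bound toward the true log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy latent-variable model where log p(x) is known exactly:
#   z ~ N(0, 1),  x | z ~ N(z, 1)   =>   marginally  x ~ N(0, 2)
x = 1.5
true_log_px = -0.25 * x**2 - 0.5 * np.log(2 * np.pi * 2.0)

def log_normal(v, mean, var):
    return -0.5 * (v - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def iw_bound(K, n_rep=20_000):
    # Deliberately mismatched proposal q(z|x); the true posterior is N(x/2, 1/2)
    q_mean, q_var = x / 2, 0.8
    z = rng.normal(q_mean, np.sqrt(q_var), size=(n_rep, K))
    log_w = (log_normal(z, 0.0, 1.0)          # prior p(z)
             + log_normal(x, z, 1.0)          # likelihood p(x|z)
             - log_normal(z, q_mean, q_var))  # proposal q(z|x)
    # E[ log (1/K) sum_k w_k ], computed with a stable log-sum-exp
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))
    return log_mean_w.mean()

elbo = iw_bound(K=1)     # standard single-sample ELBO
iwae = iw_bound(K=10)    # importance-weighted bound: tighter
```

With $K = 1$ the bound is the ordinary ELBO, loose by exactly the KL between the mismatched proposal and the true posterior; averaging $K$ importance weights inside the log shrinks that gap.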
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| On-policy sampling | Can afford to recollect data each update (A2C, REINFORCE) | No IS needed; simpler but sample-inefficient |
| PPO clipping | IS ratios might explode | Biased but bounded variance; the standard RL solution |
| V-trace (IMPALA) | Highly off-policy distributed RL | Truncates IS ratios at $\bar{c}$ and $\bar{\rho}$; trades bias for stability |
| Retrace($\lambda$) | Multi-step off-policy returns | Truncates product of IS ratios; safe with any $\lambda$ |
| Rejection sampling | Need exact samples from $p$ | Unbiased samples but can be very wasteful if $p/q$ is large |
| Direct density ratio estimation | Don't know $p$ or $q$ analytically | Learns $p/q$ as a classifier; avoids explicit density computation |
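The truncation idea behind V-trace can be sketched in miniature (the log-probabilities below are hypothetical, and this clamps only the $\bar{\rho}$ ratio rather than constructing the full V-trace target):

```python
import numpy as np

# Hypothetical per-step log-probs under learner and behaviour policies
log_pi = np.array([-0.1, -2.3, -0.5])   # learner policy log pi(a|s)
log_mu = np.array([-1.2, -0.2, -0.6])   # behaviour policy log mu(a|s)
rho = np.exp(log_pi - log_mu)           # raw importance ratios

rho_bar = 1.0                           # truncation threshold rho-bar
rho_t = np.minimum(rho, rho_bar)        # truncated ratios: large ratios are
                                        #   capped, small ones pass unchanged
```

Capping the ratio bounds the variance of the off-policy correction at the cost of bias, the same bias-for-stability trade the table describes.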

Importance sampling is a classical Monte Carlo technique from statistics (Kahn & Marshall, 1953). It entered RL through off-policy policy evaluation and was formalised for policy gradients by Precup et al. (2000).

The practical challenge — variance explosion from large importance ratios — drove key RL innovations. TRPO (Schulman et al., 2015) constrained the KL divergence between old and new policies to keep ratios bounded. PPO (Schulman et al., 2017) simplified this to a clipped ratio, making importance sampling practical at scale. Prioritised experience replay (Schaul et al., 2016) applied IS correction in a different direction: fixing bias from non-uniform sampling in replay buffers.