Importance Sampling
Estimates an expectation under one distribution p using samples from a different distribution q. If you collected data under a behaviour policy but need to evaluate something under a target policy, multiply each sample by the ratio p(x)/q(x) to correct the mismatch. This is the mathematical foundation for off-policy learning in RL.
Intuition
Suppose you want to estimate average rainfall in Seattle but you only have weather data from Phoenix. Phoenix data under-represents rainy days, so you reweight: each rainy day gets a large multiplier (rain is common in Seattle, rare in Phoenix) and each sunny day gets a small one. The result is an unbiased estimate of Seattle’s average — as if you had sampled there directly.
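The rainfall analogy can be sketched numerically. The rain probabilities and rainfall amounts below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rain probabilities: target "Seattle" (p) vs proposal "Phoenix" (q)
p_rain, q_rain = 0.4, 0.05
N = 100_000

# Sample days from Phoenix's weather (the proposal q)
rainy = rng.random(N) < q_rain                  # (N,) bool
f = np.where(rainy, 10.0, 0.0)                  # rainfall in mm per day

# Importance weight p(x)/q(x): large on rainy days, small on sunny ones
w = np.where(rainy, p_rain / q_rain, (1 - p_rain) / (1 - q_rain))

estimate = (w * f).mean()                       # unbiased estimate of E_p[rainfall]
true_value = p_rain * 10.0                      # computable here because p is known
```

Rainy days get weight 0.4/0.05 = 8, sunny days get weight 0.6/0.95 ≈ 0.63, and the reweighted average recovers Seattle's mean even though every sample came from Phoenix.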
The ratio p(x)/q(x) is the importance weight. When q assigns low probability to something that p considers likely, the weight is huge. This is both the power and the danger: a few samples can dominate the estimate, causing high variance. In the extreme case where q never samples something that p cares about, the estimator breaks entirely.
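The variance danger is easy to demonstrate. A common diagnostic is the effective sample size, (Σw)² / Σw², which measures how many samples actually carry the estimate. A small sketch with two hypothetical Gaussians chosen to be badly mismatched:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(3, 1), proposal q = N(0, 1): deliberately mismatched
x = rng.normal(0.0, 1.0, size=10_000)            # samples from q
log_w = -0.5 * (x - 3.0) ** 2 + 0.5 * x ** 2     # log p(x) - log q(x); constants cancel
w = np.exp(log_w - log_w.max())                  # stabilised weights

# Effective sample size: how many samples actually carry the estimate
ess = w.sum() ** 2 / (w ** 2).sum()
```

Out of 10,000 draws, the effective sample size collapses to a tiny fraction: nearly all the weight sits on the few samples that happen to land near the target's mass.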
In PPO, the probability ratio r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is exactly importance sampling. You collected trajectories under π_θ_old but want the gradient for π_θ. The ratio corrects for this mismatch, but if π_θ drifts too far from π_θ_old, the ratios explode, which is why PPO clips them to [1 − ε, 1 + ε].
The identity:

E_{x~p}[f(x)] = E_{x~q}[(p(x)/q(x)) f(x)] = E_{x~q}[w(x) f(x)]

where w(x) = p(x)/q(x) is the importance weight. Requires q(x) > 0 wherever p(x) f(x) ≠ 0.
Monte Carlo estimator (with samples x_1, …, x_N from q):

μ̂_IS = (1/N) Σ_i w(x_i) f(x_i)
Self-normalised variant (more stable, slightly biased):

μ̂_SN = Σ_i w(x_i) f(x_i) / Σ_i w(x_i)
PPO surrogate objective (importance-sampled policy gradient):

L^CLIP(θ) = E_t[min(r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t)]

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the importance ratio.
Prioritised experience replay (correcting sampling bias):

w_i = (1 / (N · P(i)))^β

where P(i) is the priority-based sampling probability and β = 1 gives full correction.
import torch
import torch.nn.functional as F

# ── PPO importance ratio ─────────────────────────────────────────
with torch.no_grad():
    old_log_probs = old_policy.log_prob(actions)    # (B,) — from collection policy
new_log_probs = policy.log_prob(actions)            # (B,) — current policy
ratio = (new_log_probs - old_log_probs).exp()       # (B,) — importance weight p/q

# Clipped surrogate objective
surr1 = ratio * advantages                          # (B,)
surr2 = ratio.clamp(1 - eps, 1 + eps) * advantages  # (B,)
loss = -torch.min(surr1, surr2).mean()
# ── Prioritised replay IS correction ─────────────────────────────
priorities = torch.tensor([0.5, 0.3, 0.2])      # P(i) — sampling probs
N = len(priorities)
beta = 0.4                                      # annealed toward 1.0
is_weights = (1.0 / (N * priorities)) ** beta   # (B,)
is_weights /= is_weights.max()                  # normalise for stability
loss = (is_weights * F.mse_loss(q_pred, target, reduction='none')).mean()

Warning: If old_log_probs is not detached, gradients flow into the old policy. Always compute old log-probs inside torch.no_grad() or .detach() them.
Manual Implementation
import numpy as np
def importance_sampling_estimate(f_values, log_p, log_q):
    """
    Estimate E_p[f(x)] using samples from q.
    f_values: (N,) function values f(x_i) for each sample
    log_p:    (N,) log p(x_i) — target distribution
    log_q:    (N,) log q(x_i) — proposal distribution
    """
    log_w = log_p - log_q             # (N,) — log importance weights
    w = np.exp(log_w - log_w.max())   # (N,) — stabilised weights
    # Self-normalised estimate (lower variance, slight bias)
    return (w * f_values).sum() / w.sum()

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """
    PPO's clipped surrogate with importance sampling correction.
    All inputs: (B,)
    """
    ratio = np.exp(new_log_probs - old_log_probs)          # (B,) — IS ratio
    surr1 = ratio * advantages                             # (B,)
    surr2 = np.clip(ratio, 1 - eps, 1 + eps) * advantages  # (B,)
    return np.minimum(surr1, surr2).mean()

def prioritised_replay_weights(priorities, beta=0.4):
    """
    IS weights for prioritised experience replay.
    priorities: (B,) — sampling probabilities P(i)
    """
    N = len(priorities)
    w = (1.0 / (N * priorities)) ** beta  # (B,)
    return w / w.max()                    # (B,) — normalised

Popular Uses
- PPO (policy gradient): the probability ratio r_t(θ) corrects for using trajectories from the old policy
- Prioritised experience replay (DQN variants): samples with high TD error are drawn more often; IS weights debias the non-uniform sampling
- Off-policy evaluation (RL): estimate the value of a new policy using data collected by a different (behaviour) policy
- Variational inference (VAE): the ELBO can be derived as importance-weighted estimation of log p(x); IWAE uses multiple importance samples for a tighter bound
- Particle filters / sequential Monte Carlo: reweight particles to track a target distribution over time
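The particle-filter use is a direct application of self-normalised weights. A minimal sketch of one reweight-and-resample step, with a made-up 1-D Gaussian sensor model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D tracking step: particles approximate the state distribution
particles = rng.normal(0.0, 1.0, size=500)   # (P,) prior particles
observation = 1.5

# Reweight by likelihood p(obs | particle): here a unit-variance Gaussian sensor
log_lik = -0.5 * (observation - particles) ** 2
w = np.exp(log_lik - log_lik.max())
w /= w.sum()                                 # self-normalised importance weights

# Resample to fight weight degeneracy (multinomial resampling)
idx = rng.choice(len(particles), size=len(particles), p=w)
particles = particles[idx]

posterior_mean = particles.mean()            # estimate of the filtered state
```

Resampling discards low-weight particles and duplicates high-weight ones, keeping the particle set focused where the posterior has mass.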
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| On-policy sampling | Can afford to recollect data each update (A2C, REINFORCE) | No IS needed; simpler but sample-inefficient |
| PPO clipping | IS ratios might explode | Biased but bounded variance; the standard RL solution |
| V-trace (IMPALA) | Highly off-policy distributed RL | Truncates IS ratios at thresholds ρ̄ and c̄; trades bias for stability |
| Retrace(λ) | Multi-step off-policy returns | Truncates product of IS ratios; safe with any behaviour policy |
| Rejection sampling | Need exact samples from p | Unbiased samples but can be very wasteful if the envelope constant bounding p/q is large |
| Direct density ratio estimation | Don’t know p or q analytically | Learns the ratio p(x)/q(x) via a classifier; avoids explicit density computation |
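The last row can be sketched with a hand-rolled logistic regression: train a classifier to separate samples of p (label 1) from samples of q (label 0); with balanced classes, its logit estimates log p(x)/q(x). The Gaussians below are illustrative choices, picked so the true log-ratio is exactly x − 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from p = N(1, 1) (label 1) and q = N(0, 1) (label 0), equal counts
xp = rng.normal(1.0, 1.0, 5000)
xq = rng.normal(0.0, 1.0, 5000)
x = np.concatenate([xp, xq])
y = np.concatenate([np.ones(5000), np.zeros(5000)])

# Logistic regression by gradient descent on BCE loss: logit(x) = a*x + b
a, b = 0.0, 0.0
for _ in range(2000):
    logits = a * x + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad = probs - y                 # d(loss)/d(logit) per sample
    a -= 0.1 * (grad * x).mean()
    b -= 0.1 * grad.mean()

# With balanced classes, log p(x)/q(x) ≈ a*x + b; true values here: a = 1, b = -0.5
```

No density of p or q was evaluated anywhere: the ratio comes entirely from the classifier's decision boundary, which is the point of the technique.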
Historical Context
Importance sampling is a classical Monte Carlo technique from statistics (Kahn & Marshall, 1953). It entered RL through off-policy policy evaluation and was formalised for policy gradients by Precup et al. (2000).
The practical challenge — variance explosion from large importance ratios — drove key RL innovations. TRPO (Schulman et al., 2015) constrained the KL divergence between old and new policies to keep ratios bounded. PPO (Schulman et al., 2017) simplified this to a clipped ratio, making importance sampling practical at scale. Prioritised experience replay (Schaul et al., 2016) applied IS correction in a different direction: fixing bias from non-uniform sampling in replay buffers.