Importance Sampling

Estimates an expectation under one distribution using samples from a different distribution. If you collected data under policy $q$ but need to evaluate something under policy $p$, multiply each sample by the ratio $p(x)/q(x)$ to correct the mismatch. This is the mathematical foundation for off-policy learning in RL.

Suppose you want to estimate average rainfall in Seattle but you only have weather data from Phoenix. Phoenix data under-represents rainy days, so you reweight: each rainy day gets a large multiplier (rain is common in Seattle, rare in Phoenix) and each sunny day gets a small one. The result is an unbiased estimate of Seattle’s average — as if you had sampled there directly.
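As a toy numerical check of this reweighting (the rain probabilities below are made up for the analogy, not real weather data), one can simulate Phoenix days and recover Seattle's average rainfall:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up rain probabilities for the analogy
p_rain_seattle = 0.40          # target distribution p
p_rain_phoenix = 0.05          # proposal distribution q
mm_if_rainy = 10.0             # rainfall on a rainy day, 0 otherwise

# Sample 200k days from the Phoenix distribution
rainy = rng.random(200_000) < p_rain_phoenix

# Importance weight p(x)/q(x) for each day type
w = np.where(rainy,
             p_rain_seattle / p_rain_phoenix,              # rainy: weight 8
             (1 - p_rain_seattle) / (1 - p_rain_phoenix))  # sunny: ~0.63

f = np.where(rainy, mm_if_rainy, 0.0)
seattle_estimate = (w * f).mean()   # unbiased estimate of 0.40 * 10 = 4.0 mm
```

Rainy days are rare in the Phoenix sample but each one carries a weight of 8, so the weighted average matches Seattle's true mean.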

The ratio $p(x)/q(x)$ is the importance weight. When $q$ assigns low probability to something that $p$ considers likely, the weight is huge. This is both the power and the danger: a few samples can dominate the estimate, causing high variance. In the extreme case where $q$ never samples something that $p$ cares about, the estimator breaks entirely.
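A small sketch of this failure mode (a toy Gaussian example, not from the original text): estimating $\mathbb{E}_p[x^2]$ for $p = \mathcal{N}(0,1)$ with a well-matched versus a badly mismatched proposal.

```python
import numpy as np

rng = np.random.default_rng(1)

def is_estimate(q_mean, q_std, n=100_000):
    """Importance-sampling estimate of E_p[x^2] for p = N(0, 1)."""
    x = rng.normal(q_mean, q_std, n)                  # samples from q
    log_p = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)
    log_q = (-0.5 * ((x - q_mean) / q_std) ** 2
             - np.log(q_std) - 0.5 * np.log(2 * np.pi))
    w = np.exp(log_p - log_q)                         # importance weights
    return (w * x**2).mean(), w.max()

good_est, good_wmax = is_estimate(0.0, 1.2)  # q close to p: weights stay tame
bad_est, bad_wmax = is_estimate(3.0, 0.5)    # q misses p's mass: rare samples
                                             #   carry enormous weights
```

With the well-matched proposal the estimate lands near the true value of 1 and all weights stay close to 1; with the mismatched proposal, occasional samples from $p$'s bulk receive weights orders of magnitude larger than the rest and dominate the sum.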

In PPO, the ratio $\pi_\text{new}(a|s) / \pi_\text{old}(a|s)$ is exactly importance sampling. You collected trajectories under $\pi_\text{old}$ but want the gradient for $\pi_\text{new}$. The ratio corrects for this mismatch, but if $\pi_\text{new}$ drifts too far from $\pi_\text{old}$, the ratios explode, which is why PPO clips them to $[1-\epsilon, 1+\epsilon]$.

The identity:

$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\!\left[f(x) \cdot \frac{p(x)}{q(x)}\right]$$

where $w(x) = p(x)/q(x)$ is the importance weight. Requires $q(x) > 0$ wherever $p(x) > 0$.
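On a finite outcome space the identity can be verified exactly; a minimal sketch with a made-up 3-outcome distribution:

```python
import numpy as np

# Discrete sanity check of the identity E_p[f] = E_q[f * p/q]
x = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])      # target distribution
q = np.array([0.6, 0.3, 0.1])      # proposal; q > 0 wherever p > 0
f = x ** 2

direct = (p * f).sum()                  # expectation computed under p
reweighted = (q * f * (p / q)).sum()    # expectation under q with IS weights
```

Both quantities equal $0.2 \cdot 0 + 0.5 \cdot 1 + 0.3 \cdot 4 = 1.7$; the weights $p/q$ cancel the proposal's probabilities exactly.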

Monte Carlo estimator (with $N$ samples from $q$):

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} f(x_i) \cdot \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q$$

Self-normalised variant (more stable, slightly biased):

$$\hat{\mu}_\text{SN} = \frac{\sum_i f(x_i) \cdot w(x_i)}{\sum_i w(x_i)}$$

PPO surrogate objective (importance-sampled policy gradient):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1{-}\epsilon, 1{+}\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ is the importance ratio.

Prioritised experience replay (correcting sampling bias):

$$w_i = \left(\frac{1}{N \cdot P(i)}\right)^\beta$$

where $P(i)$ is the priority-based sampling probability and $\beta \to 1$ gives full correction.

```python
import torch
import torch.nn.functional as F

# ── PPO importance ratio ─────────────────────────────────────────
with torch.no_grad():
    old_log_probs = old_policy.log_prob(actions)  # (B,) — from collection policy
new_log_probs = policy.log_prob(actions)          # (B,) — current policy, keeps grad
ratio = (new_log_probs - old_log_probs).exp()     # (B,) — importance weight p/q

# Clipped surrogate objective
surr1 = ratio * advantages                        # (B,)
surr2 = ratio.clamp(1 - eps, 1 + eps) * advantages  # (B,)
loss = -torch.min(surr1, surr2).mean()

# ── Prioritised replay IS correction ─────────────────────────────
priorities = torch.tensor([0.5, 0.3, 0.2])        # P(i) — sampling probs
N = len(priorities)
beta = 0.4                                        # annealed toward 1.0
is_weights = (1.0 / (N * priorities)) ** beta     # (B,)
is_weights /= is_weights.max()                    # normalise for stability
loss = (is_weights * F.mse_loss(q_pred, target, reduction='none')).mean()
```

Warning: if `old_log_probs` is not detached, gradients flow into the old policy. Always compute old log-probs inside `torch.no_grad()` or `.detach()` them.

```python
import numpy as np

def importance_sampling_estimate(f_values, log_p, log_q):
    """
    Estimate E_p[f(x)] using samples from q.

    f_values: (N,) function values f(x_i) for each sample
    log_p:    (N,) log p(x_i) — target distribution
    log_q:    (N,) log q(x_i) — proposal distribution
    """
    log_w = log_p - log_q               # (N,) — log importance weights
    w = np.exp(log_w - log_w.max())     # (N,) — stabilised weights
    # Self-normalised estimate (lower variance, slight bias)
    return (w * f_values).sum() / w.sum()

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """
    PPO's clipped surrogate with importance sampling correction.
    All inputs: (B,)
    """
    ratio = np.exp(new_log_probs - old_log_probs)          # (B,) — IS ratio
    surr1 = ratio * advantages                             # (B,)
    surr2 = np.clip(ratio, 1 - eps, 1 + eps) * advantages  # (B,)
    return np.minimum(surr1, surr2).mean()

def prioritised_replay_weights(priorities, beta=0.4):
    """
    IS weights for prioritised experience replay.
    priorities: (B,) — sampling probabilities P(i)
    """
    N = len(priorities)
    w = (1.0 / (N * priorities)) ** beta   # (B,)
    return w / w.max()                     # (B,) — normalised
```
  • PPO (policy gradient): the probability ratio $\pi_\text{new}/\pi_\text{old}$ corrects for using trajectories from the old policy
  • Prioritised experience replay (DQN variants): samples with high TD error are drawn more often; IS weights debias the non-uniform sampling
  • Off-policy evaluation (RL): estimate the value of a new policy using data collected by a different (behaviour) policy
  • Variational inference (VAE): the ELBO can be derived as importance-weighted estimation of $\log p(x)$; IWAE uses multiple importance samples for a tighter bound
  • Particle filters / sequential Monte Carlo: reweight particles to track a target distribution over time
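The IWAE point can be made concrete with a hedged toy sketch: a conjugate Gaussian model chosen because $\log p(x)$ is available in closed form (the model, the `iw_bound` helper, and all numbers here are illustrative assumptions, not from any library). Increasing the number of importance samples $K$ tightens the bound toward the true log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy latent-variable model where log p(x) is known exactly:
#   z ~ N(0, 1),  x | z ~ N(z, 1)   =>   marginally  x ~ N(0, 2)
x = 1.5
true_log_px = -0.25 * x**2 - 0.5 * np.log(2 * np.pi * 2.0)

def log_normal(v, mean, var):
    return -0.5 * (v - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def iw_bound(K, n_rep=20_000):
    # Deliberately mismatched proposal q(z|x); the true posterior is N(x/2, 1/2)
    q_mean, q_var = x / 2, 0.8
    z = rng.normal(q_mean, np.sqrt(q_var), size=(n_rep, K))
    log_w = (log_normal(z, 0.0, 1.0)          # prior p(z)
             + log_normal(x, z, 1.0)          # likelihood p(x|z)
             - log_normal(z, q_mean, q_var))  # proposal q(z|x)
    # E[ log (1/K) sum_k w_k ], computed with a stable log-sum-exp
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))
    return log_mean_w.mean()

elbo = iw_bound(K=1)     # standard single-sample ELBO
iwae = iw_bound(K=10)    # importance-weighted bound: tighter
```

With $K = 1$ the bound is the ordinary ELBO, loose by exactly the KL between the mismatched proposal and the true posterior; averaging $K$ importance weights inside the log shrinks that gap.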
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| On-policy sampling | Can afford to recollect data each update (A2C, REINFORCE) | No IS needed; simpler but sample-inefficient |
| PPO clipping | IS ratios might explode | Biased but bounded variance; the standard RL solution |
| V-trace (IMPALA) | Highly off-policy distributed RL | Truncates IS ratios at $\bar{c}$ and $\bar{\rho}$; trades bias for stability |
| Retrace($\lambda$) | Multi-step off-policy returns | Truncates product of IS ratios; safe with any $\lambda$ |
| Rejection sampling | Need exact samples from $p$ | Unbiased samples but can be very wasteful if $p/q$ is large |
| Direct density ratio estimation | Don't know $p$ or $q$ analytically | Learns $p/q$ as a classifier; avoids explicit density computation |
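The truncation idea behind V-trace can be sketched in miniature (the log-probabilities below are hypothetical, and this clamps only the $\bar{\rho}$ ratio rather than constructing the full V-trace target):

```python
import numpy as np

# Hypothetical per-step log-probs under learner and behaviour policies
log_pi = np.array([-0.1, -2.3, -0.5])   # learner policy log pi(a|s)
log_mu = np.array([-1.2, -0.2, -0.6])   # behaviour policy log mu(a|s)
rho = np.exp(log_pi - log_mu)           # raw importance ratios

rho_bar = 1.0                           # truncation threshold rho-bar
rho_t = np.minimum(rho, rho_bar)        # truncated ratios: large ratios are
                                        #   capped, small ones pass unchanged
```

Capping the ratio bounds the variance of the off-policy correction at the cost of bias, the same bias-for-stability trade the table describes.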

Importance sampling is a classical Monte Carlo technique from statistics (Kahn & Marshall, 1953). It entered RL through off-policy policy evaluation and was formalised for policy gradients by Precup et al. (2000).

The practical challenge — variance explosion from large importance ratios — drove key RL innovations. TRPO (Schulman et al., 2015) constrained the KL divergence between old and new policies to keep ratios bounded. PPO (Schulman et al., 2017) simplified this to a clipped ratio, making importance sampling practical at scale. Prioritised experience replay (Schaul et al., 2016) applied IS correction in a different direction: fixing bias from non-uniform sampling in replay buffers.