Polyak Averaging (Soft Target Updates)
Slowly updating target network weights toward the online network: theta_target <- tau * theta + (1 - tau) * theta_target, with tau << 1 (typically 0.005). Stabilises Q-learning by providing a slowly moving bootstrap target. The same mechanism as an exponential moving average (EMA), applied specifically to target networks in RL.
Intuition
Imagine trying to hit a moving target with a bow — if the target jumps erratically, you’ll never land a shot. In Q-learning, the TD target uses the network’s own predictions. If the network updates its weights, the target changes too, creating a “moving target” problem. The network chases its own predictions, which shift every gradient step, potentially oscillating or diverging.
Polyak averaging fixes this by maintaining a separate copy of the network — the “target network” — that moves slowly. Instead of copying the online weights every N steps (the DQN approach, “hard updates”), Polyak averaging blends a tiny fraction tau of the online weights into the target at every step. With tau = 0.005, the target network is always ~200 gradient steps behind the online network, providing a stable bootstrap target.
The key insight: this is just an exponential moving average (EMA) of the network weights. The same idea appears in self-supervised learning (BYOL, EMA teacher), model averaging for better generalisation (SWA, EMA checkpoints), and pseudo-labelling. The RL community calls it “Polyak averaging” or “soft updates”; the rest of deep learning calls it “EMA.” They are the same operation.
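To make the equivalence concrete, here is a tiny numerical check (a sketch, not from the original text): applying the soft-update formula and a standard EMA update with decay 1 - tau to the same arrays produces identical results.

```python
import numpy as np

tau = 0.005
rng = np.random.default_rng(0)
online = rng.standard_normal(4)   # stand-in for online weights
target = rng.standard_normal(4)   # stand-in for target weights

# RL phrasing: blend a fraction tau of the online weights into the target
polyak = tau * online + (1 - tau) * target

# EMA phrasing: decay the running average with momentum m = 1 - tau
m = 1 - tau
ema = m * target + (1 - m) * online

assert np.allclose(polyak, ema)  # same operation, two names
```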
Soft update (applied after every gradient step):

theta_target <- tau * theta + (1 - tau) * theta_target

where tau is the interpolation coefficient, typically tau = 0.005.

This is equivalent to an exponential moving average with decay (1 - tau):

theta_target <- (1 - tau) * theta_target + tau * theta

Effective window: the target network “remembers” approximately 1/tau past versions of the online network. With tau = 0.005, that’s ~200 steps.

Hard update (the DQN alternative):

theta_target <- theta (every N steps)

This is the limit of Polyak averaging as tau -> 1, applied every N steps. Discontinuous updates make the target jump, which is less stable but simpler.

Relationship to EMA decay: some papers parameterise with momentum m = 1 - tau. BYOL uses m = 0.996, which is equivalent to tau = 0.004.
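As a quick sanity check on the ~200-step figure (a sketch with illustrative numbers): after k soft updates, the target retains a fraction (1 - tau)^k of its starting weights, which falls to roughly e^-1 (about 37%) at k = 1/tau.

```python
import math

tau = 0.005
k = round(1 / tau)          # ~200 steps, the "effective window"
retained = (1 - tau) ** k   # weight left on the initial target after k updates

# 0.995^200 is about 0.367, close to 1/e (about 0.368)
assert k == 200
assert abs(retained - math.exp(-1)) < 0.01
```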
import copy

import torch

# ── Soft update (Polyak averaging) ──────────────────────────────
# Call this AFTER every optimizer.step()
def soft_update(online_net, target_net, tau=0.005):
    """Polyak averaging: blend online weights into target."""
    for p_online, p_target in zip(online_net.parameters(),
                                  target_net.parameters()):
        p_target.data.mul_(1 - tau).add_(tau * p_online.data)
        # Equivalent to: p_target = (1-tau)*p_target + tau*p_online
        # Using in-place ops to avoid allocating new tensors

# ── Initialisation ──────────────────────────────────────────────
# Target net starts as an exact copy. NEVER forget this step.
target_net = copy.deepcopy(online_net)
target_net.requires_grad_(False)  # target never needs gradients

# ── In the training loop ────────────────────────────────────────
loss = compute_td_loss(online_net, target_net, batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
soft_update(online_net, target_net, tau=0.005)

# ── Hard update alternative (DQN-style) ────────────────────────
if step % target_update_freq == 0:
    target_net.load_state_dict(online_net.state_dict())

Manual Implementation
import numpy as np

def soft_update_numpy(online_params, target_params, tau=0.005):
    """
    Polyak averaging for parameter dictionaries (numpy arrays).

    online_params: dict of str -> ndarray (online network weights)
    target_params: dict of str -> ndarray (target network weights)
    Modifies target_params in-place.
    """
    for key in target_params:
        # theta_target = (1-tau)*theta_target + tau*theta_online
        target_params[key] *= (1 - tau)
        target_params[key] += tau * online_params[key]

def ema_decay_to_tau(decay):
    """Convert EMA momentum/decay to Polyak tau. decay=0.995 -> tau=0.005"""
    return 1.0 - decay

def effective_window(tau):
    """How many past steps the EMA effectively averages over."""
    return 1.0 / tau  # tau=0.005 -> ~200 steps

# ── Demonstration: EMA of a noisy signal ────────────────────────
def ema_demo(signal, tau=0.005):
    """Show that Polyak averaging smooths a noisy signal."""
    ema = signal[0]
    result = np.zeros_like(signal)
    for t in range(len(signal)):
        ema = tau * signal[t] + (1 - tau) * ema  # same formula
        result[t] = ema
    return result  # smoothed version

Popular Uses
- SAC, DDPG, TD3 (see q-learning/): soft target updates with tau = 0.005 after every gradient step. This is the modern default for continuous-control RL
- DQN (see q-learning/): uses hard updates (copy every 10K steps) — the predecessor to Polyak averaging in RL
- BYOL, MoCo (see contrastive-self-supervising/): the EMA teacher/momentum encoder uses the same formula. MoCo v1 uses m = 0.999 (tau = 0.001), BYOL uses m = 0.996 (tau = 0.004)
- Diffusion models (see diffusion/): EMA of model weights is standard for generation quality. Typical decay = 0.9999 (tau = 0.0001)
- Stochastic Weight Averaging (SWA): averages model checkpoints for better generalisation; related but uses uniform rather than exponential weighting
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Hard target updates (DQN) | Simple implementation, discrete update schedule | Target jumps discontinuously every N steps; less stable but fewer hyperparameters |
| No target network | Very simple environments, tabular RL | Works for tabular; diverges with function approximation in most cases |
| Double Q-learning | Want to reduce overestimation bias | Orthogonal to target update method — use two Q-networks to decouple selection and evaluation |
| Clipped Double Q (TD3, SAC) | Continuous control, want conservative Q estimates | Takes min of two Q-networks; complements Polyak averaging, doesn’t replace it |
| Periodic EMA reset | EMA weights become stale | Reset target to online weights periodically, then resume soft updates; used in some RLHF setups |
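The last row of the table, periodic EMA reset, can be sketched as follows (a minimal illustration; the `reset_every` value and the random-walk stand-in for gradient updates are assumptions, not from a specific paper):

```python
import numpy as np

tau = 0.005
reset_every = 1_000  # illustrative reset period

rng = np.random.default_rng(0)
online = np.zeros(4)
target = online.copy()  # target starts as an exact copy

for step in range(1, 5_001):
    online += 0.01 * rng.standard_normal(4)  # stand-in for a gradient step
    if step % reset_every == 0:
        # periodic reset: snap the target back onto the online weights
        target = online.copy()
    else:
        # otherwise, resume ordinary soft updates
        target = tau * online + (1 - tau) * target
```

Between resets the target lags the online weights by the usual ~1/tau steps; each reset discards that lag, which is why some RLHF setups use it to keep the target from going stale.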
Historical Context
Polyak averaging originates from Boris Polyak’s 1990 work on accelerating stochastic optimisation by averaging iterates. The idea of a separate target network in RL was introduced by Mnih et al. (2015) in DQN, but they used hard updates (full weight copy every 10K steps).
Lillicrap et al. (2016) introduced soft target updates (Polyak averaging) in DDPG, arguing that the smooth blending was more stable than periodic hard copies for continuous control. This was adopted by TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018), making it the standard for modern off-policy RL. Meanwhile, the self-supervised learning community independently adopted the same mechanism for momentum encoders in MoCo (He et al., 2020) and BYOL (Grill et al., 2020), calling it “EMA” rather than “Polyak averaging.”