
Polyak Averaging (Soft Target Updates)

Slowly updating target network weights toward the online network: theta_target <- tau * theta + (1 - tau) * theta_target, with tau << 1 (typically 0.005). Stabilises Q-learning by providing a slowly-moving bootstrap target. The same mechanism as exponential moving average (EMA), applied specifically to target networks in RL.

Imagine trying to hit a moving target with a bow — if the target jumps erratically, you’ll never land a shot. In Q-learning, the TD target uses the network’s own predictions. If the network updates its weights, the target changes too, creating a “moving target” problem. The network chases its own predictions, which shift every gradient step, potentially oscillating or diverging.

Polyak averaging fixes this by maintaining a separate copy of the network — the “target network” — that moves slowly. Instead of copying the online weights every N steps (the DQN approach, “hard updates”), Polyak averaging blends a tiny fraction tau of the online weights into the target at every step. With tau = 0.005, the target network is always ~200 gradient steps behind the online network, providing a stable bootstrap target.

The key insight: this is just an exponential moving average (EMA) of the network weights. The same idea appears in self-supervised learning (BYOL, EMA teacher), model averaging for better generalisation (SWA, EMA checkpoints), and pseudo-labelling. The RL community calls it “Polyak averaging” or “soft updates”; the rest of deep learning calls it “EMA.” They are the same operation.

Soft update (applied after every gradient step):

$$\theta_{\text{target}} \leftarrow \tau \cdot \theta + (1 - \tau) \cdot \theta_{\text{target}}$$

where $\tau \in (0, 1)$ is the interpolation coefficient, typically $\tau = 0.005$.

This is equivalent to an exponential moving average with decay $(1 - \tau)$:

$$\theta_{\text{target}}^{(t)} = (1 - \tau)^t \, \theta_{\text{target}}^{(0)} + \tau \sum_{k=0}^{t-1} (1 - \tau)^k \, \theta^{(t-k)}$$

Effective window: the target network “remembers” approximately $\frac{1}{\tau}$ past versions of the online network. With $\tau = 0.005$, that’s ~200 steps.
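This window estimate can be checked numerically: the weight the EMA places on the online snapshot from k steps ago is tau * (1 - tau)^k, a geometric distribution whose mean lag works out to (1 - tau) / tau, i.e. ~199 steps for tau = 0.005:

```python
import numpy as np

tau = 0.005

# Weight on the online snapshot from k steps ago: tau * (1 - tau)**k.
# (Truncated at 100k steps; the remaining tail weight is negligible.)
k = np.arange(100_000)
weights = tau * (1 - tau) ** k

mean_lag = np.sum(k * weights)             # (1 - tau) / tau = 199 for tau = 0.005
half_life = np.log(2) / -np.log(1 - tau)   # steps until a snapshot's weight halves

print(round(mean_lag), round(half_life))   # 199 138
```

The half-life (~138 steps here) is an alternative way to report the same lag; both scale as 1/tau for small tau.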

Hard update (the DQN alternative):

$$\theta_{\text{target}} \leftarrow \theta \quad \text{every } N \text{ steps}$$

This is the limit of Polyak averaging as $\tau \to 1$, applied every $N$ steps. Discontinuous updates make the target jump, which is less stable but simpler.

Relationship to EMA decay: some papers parameterise with momentum $m = 1 - \tau$. BYOL uses $m = 0.996$, which is equivalent to $\tau = 0.004$.

```python
import copy
import torch

# ── Soft update (Polyak averaging) ──────────────────────────────
# Call this AFTER every optimizer.step()
def soft_update(online_net, target_net, tau=0.005):
    """Polyak averaging: blend online weights into target."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        # p_target = (1 - tau) * p_target + tau * p_online,
        # using in-place ops to avoid allocating new tensors
        p_target.data.mul_(1 - tau).add_(p_online.data, alpha=tau)

# ── Initialisation ──────────────────────────────────────────────
# Target net starts as an exact copy. NEVER forget this step.
target_net = copy.deepcopy(online_net)
target_net.requires_grad_(False)  # target never needs gradients

# ── In the training loop ────────────────────────────────────────
# (compute_td_loss, batch, optimizer, step are defined elsewhere)
loss = compute_td_loss(online_net, target_net, batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
soft_update(online_net, target_net, tau=0.005)

# ── Hard update alternative (DQN-style) ────────────────────────
if step % target_update_freq == 0:
    target_net.load_state_dict(online_net.state_dict())
```
```python
import numpy as np

def soft_update_numpy(online_params, target_params, tau=0.005):
    """
    Polyak averaging for parameter dictionaries (numpy arrays).

    online_params: dict of str -> ndarray (online network weights)
    target_params: dict of str -> ndarray (target network weights)
    Modifies target_params in-place.
    """
    for key in target_params:
        # theta_target = (1 - tau) * theta_target + tau * theta_online
        target_params[key] *= (1 - tau)
        target_params[key] += tau * online_params[key]

def ema_decay_to_tau(decay):
    """Convert EMA momentum/decay to Polyak tau. decay=0.995 -> tau=0.005"""
    return 1.0 - decay

def effective_window(tau):
    """How many past steps the EMA effectively averages over."""
    return 1.0 / tau  # tau=0.005 -> ~200 steps

# ── Demonstration: EMA of a noisy signal ────────────────────────
def ema_demo(signal, tau=0.005):
    """Show that Polyak averaging smooths a noisy signal."""
    ema = signal[0]
    result = np.zeros_like(signal)
    for t in range(len(signal)):
        ema = tau * signal[t] + (1 - tau) * ema  # same update formula
        result[t] = ema
    return result  # smoothed version
```
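The closed-form EMA expansion given earlier can also be sanity-checked against the iterative update. This sketch uses a made-up random sequence standing in for a single scalar weight:

```python
import numpy as np

tau = 0.005
T = 50
rng = np.random.default_rng(0)
theta = rng.normal(size=T)   # stand-in sequence of online weights (one scalar)
theta_target_0 = 0.0

# Iterate the soft update T times.
target = theta_target_0
for t in range(T):
    target = tau * theta[t] + (1 - tau) * target

# Closed form: (1-tau)^T * theta_target^(0) + tau * sum_k (1-tau)^k * theta^(T-k)
k = np.arange(T)
closed_form = (1 - tau) ** T * theta_target_0 + tau * np.sum((1 - tau) ** k * theta[::-1])

assert np.isclose(target, closed_form)
```

Both routes give the same number: the iteration is just repeatedly discounting old snapshots by (1 - tau).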
  • SAC, DDPG, TD3 (see q-learning/): soft target updates with tau = 0.005 after every gradient step. This is the modern default for continuous-control RL
  • DQN (see q-learning/): uses hard updates (copy every 10K steps) — the predecessor to Polyak averaging in RL
  • BYOL, MoCo (see contrastive-self-supervising/): the EMA teacher/momentum encoder uses the same formula. MoCo v1 uses m = 0.999 (tau = 0.001), BYOL uses m = 0.996 (tau = 0.004)
  • Diffusion models (see diffusion/): EMA of model weights is standard for generation quality. Typical decay = 0.9999 (tau = 0.0001)
  • Stochastic Weight Averaging (SWA): averages model checkpoints for better generalisation; related but uses uniform rather than exponential weighting
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Hard target updates (DQN) | Simple implementation, discrete update schedule | Target jumps discontinuously every N steps; less stable but fewer hyperparameters |
| No target network | Very simple environments, tabular RL | Works for tabular; diverges with function approximation in most cases |
| Double Q-learning | Want to reduce overestimation bias | Orthogonal to target update method; uses two Q-networks to decouple selection and evaluation |
| Clipped Double Q (TD3, SAC) | Continuous control, want conservative Q estimates | Takes min of two Q-networks; complements Polyak averaging, doesn’t replace it |
| Periodic EMA reset | EMA weights become stale | Reset target to online weights periodically, then resume soft updates; used in some RLHF setups |
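The periodic-reset alternative is easy to sketch: run soft updates every step, plus an occasional hard copy. The schedule below (the `reset_every` value, the fake gradient noise) is purely illustrative, not taken from any specific paper:

```python
import numpy as np

def soft_update_numpy(online, target, tau=0.005):
    """Polyak averaging for dicts of numpy arrays; modifies target in-place."""
    for key in target:
        target[key] *= (1 - tau)
        target[key] += tau * online[key]

rng = np.random.default_rng(0)
online = {"w": rng.normal(size=(4, 2))}
target = {k: v.copy() for k, v in online.items()}  # start as an exact copy
reset_every = 1_000  # illustrative choice

for step in range(1, 3_001):
    # Stand-in for a gradient step on the online network.
    online["w"] += 0.001 * rng.normal(size=online["w"].shape)
    soft_update_numpy(online, target, tau=0.005)
    if step % reset_every == 0:
        # Hard reset: target snaps to the online weights, then soft
        # updates resume on the next step.
        target = {k: v.copy() for k, v in online.items()}
```

After each reset the target's ~1/tau-step lag is temporarily zero, then grows back as soft updates resume.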

Polyak averaging originates from Boris Polyak’s 1990 work on accelerating stochastic optimisation by averaging iterates. The idea of a separate target network in RL was introduced by Mnih et al. (2015) in DQN, but they used hard updates (full weight copy every 10K steps).

Lillicrap et al. (2016) introduced soft target updates (Polyak averaging) in DDPG, arguing that the smooth blending was more stable than periodic hard copies for continuous control. This was adopted by TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018), making it the standard for modern off-policy RL. Meanwhile, the self-supervised learning community independently adopted the same mechanism for momentum encoders in MoCo (He et al., 2020) and BYOL (Grill et al., 2020), calling it “EMA” rather than “Polyak averaging.”