
Polyak Averaging (Soft Target Updates)

Slowly updating target network weights toward the online network: theta_target <- tau * theta + (1 - tau) * theta_target, with tau << 1 (typically 0.005). Stabilises Q-learning by providing a slowly-moving bootstrap target. The same mechanism as exponential moving average (EMA), applied specifically to target networks in RL.

Imagine trying to hit a moving target with a bow — if the target jumps erratically, you’ll never land a shot. In Q-learning, the TD target uses the network’s own predictions. If the network updates its weights, the target changes too, creating a “moving target” problem. The network chases its own predictions, which shift every gradient step, potentially oscillating or diverging.

Polyak averaging fixes this by maintaining a separate copy of the network — the “target network” — that moves slowly. Instead of copying the online weights every N steps (the DQN approach, “hard updates”), Polyak averaging blends a tiny fraction tau of the online weights into the target at every step. With tau = 0.005, the target network is always ~200 gradient steps behind the online network, providing a stable bootstrap target.

The key insight: this is just an exponential moving average (EMA) of the network weights. The same idea appears in self-supervised learning (BYOL, EMA teacher), model averaging for better generalisation (SWA, EMA checkpoints), and pseudo-labelling. The RL community calls it “Polyak averaging” or “soft updates”; the rest of deep learning calls it “EMA.” They are the same operation.

Soft update (applied after every gradient step):

$$\theta_{\text{target}} \leftarrow \tau \cdot \theta + (1 - \tau) \cdot \theta_{\text{target}}$$

where $\tau \in (0, 1)$ is the interpolation coefficient, typically $\tau = 0.005$.

This is equivalent to an exponential moving average with decay $(1 - \tau)$:

$$\theta_{\text{target}}^{(t)} = (1 - \tau)^t \, \theta_{\text{target}}^{(0)} + \tau \sum_{k=0}^{t-1} (1 - \tau)^k \, \theta^{(t-k)}$$

Effective window: the target network “remembers” approximately $\frac{1}{\tau}$ past versions of the online network. With $\tau = 0.005$, that’s ~200 steps.
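This window estimate can be checked numerically: the weight the EMA places on the online snapshot from k steps ago is tau * (1 - tau)^k, a geometric distribution whose mean lag works out to (1 - tau) / tau, i.e. ~199 steps for tau = 0.005:

```python
import numpy as np

tau = 0.005

# Weight on the online snapshot from k steps ago: tau * (1 - tau)**k.
# (Truncated at 100k steps; the remaining tail weight is negligible.)
k = np.arange(100_000)
weights = tau * (1 - tau) ** k

mean_lag = np.sum(k * weights)             # (1 - tau) / tau = 199 for tau = 0.005
half_life = np.log(2) / -np.log(1 - tau)   # steps until a snapshot's weight halves

print(round(mean_lag), round(half_life))   # 199 138
```

The half-life (~138 steps here) is an alternative way to report the same lag; both scale as 1/tau for small tau.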

Hard update (the DQN alternative):

$$\theta_{\text{target}} \leftarrow \theta \quad \text{every } N \text{ steps}$$

This is the limit of Polyak averaging as $\tau \to 1$, applied every $N$ steps. Discontinuous updates make the target jump, which is less stable but simpler.

Relationship to EMA decay: some papers parameterise with momentum $m = 1 - \tau$. BYOL uses $m = 0.996$, which is equivalent to $\tau = 0.004$.

```python
import copy
import torch

# ── Soft update (Polyak averaging) ──────────────────────────────
# Call this AFTER every optimizer.step()
def soft_update(online_net, target_net, tau=0.005):
    """Polyak averaging: blend online weights into target."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        # p_target = (1 - tau) * p_target + tau * p_online,
        # using in-place ops to avoid allocating new tensors
        p_target.data.mul_(1 - tau).add_(p_online.data, alpha=tau)

# ── Initialisation ──────────────────────────────────────────────
# Target net starts as an exact copy. NEVER forget this step.
target_net = copy.deepcopy(online_net)
target_net.requires_grad_(False)  # target never needs gradients

# ── In the training loop ────────────────────────────────────────
# (compute_td_loss, batch, optimizer, step are defined elsewhere)
loss = compute_td_loss(online_net, target_net, batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
soft_update(online_net, target_net, tau=0.005)

# ── Hard update alternative (DQN-style) ────────────────────────
if step % target_update_freq == 0:
    target_net.load_state_dict(online_net.state_dict())
```
```python
import numpy as np

def soft_update_numpy(online_params, target_params, tau=0.005):
    """
    Polyak averaging for parameter dictionaries (numpy arrays).

    online_params: dict of str -> ndarray (online network weights)
    target_params: dict of str -> ndarray (target network weights)
    Modifies target_params in-place.
    """
    for key in target_params:
        # theta_target = (1 - tau) * theta_target + tau * theta_online
        target_params[key] *= (1 - tau)
        target_params[key] += tau * online_params[key]

def ema_decay_to_tau(decay):
    """Convert EMA momentum/decay to Polyak tau. decay=0.995 -> tau=0.005"""
    return 1.0 - decay

def effective_window(tau):
    """How many past steps the EMA effectively averages over."""
    return 1.0 / tau  # tau=0.005 -> ~200 steps

# ── Demonstration: EMA of a noisy signal ────────────────────────
def ema_demo(signal, tau=0.005):
    """Show that Polyak averaging smooths a noisy signal."""
    ema = signal[0]
    result = np.zeros_like(signal)
    for t in range(len(signal)):
        ema = tau * signal[t] + (1 - tau) * ema  # same update formula
        result[t] = ema
    return result  # smoothed version
```
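The closed-form EMA expansion given earlier can also be sanity-checked against the iterative update. This sketch uses a made-up random sequence standing in for a single scalar weight:

```python
import numpy as np

tau = 0.005
T = 50
rng = np.random.default_rng(0)
theta = rng.normal(size=T)   # stand-in sequence of online weights (one scalar)
theta_target_0 = 0.0

# Iterate the soft update T times.
target = theta_target_0
for t in range(T):
    target = tau * theta[t] + (1 - tau) * target

# Closed form: (1-tau)^T * theta_target^(0) + tau * sum_k (1-tau)^k * theta^(T-k)
k = np.arange(T)
closed_form = (1 - tau) ** T * theta_target_0 + tau * np.sum((1 - tau) ** k * theta[::-1])

assert np.isclose(target, closed_form)
```

Both routes give the same number: the iteration is just repeatedly discounting old snapshots by (1 - tau).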
  • SAC, DDPG, TD3 (see q-learning/): soft target updates with tau = 0.005 after every gradient step. This is the modern default for continuous-control RL
  • DQN (see q-learning/): uses hard updates (copy every 10K steps) — the predecessor to Polyak averaging in RL
  • BYOL, MoCo (see contrastive-self-supervising/): the EMA teacher/momentum encoder uses the same formula. MoCo v1 uses m = 0.999 (tau = 0.001), BYOL uses m = 0.996 (tau = 0.004)
  • Diffusion models (see diffusion/): EMA of model weights is standard for generation quality. Typical decay = 0.9999 (tau = 0.0001)
  • Stochastic Weight Averaging (SWA): averages model checkpoints for better generalisation; related but uses uniform rather than exponential weighting
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Hard target updates (DQN) | Simple implementation, discrete update schedule | Target jumps discontinuously every N steps; less stable but fewer hyperparameters |
| No target network | Very simple environments, tabular RL | Works for tabular; diverges with function approximation in most cases |
| Double Q-learning | Want to reduce overestimation bias | Orthogonal to target update method; uses two Q-networks to decouple selection and evaluation |
| Clipped Double Q (TD3, SAC) | Continuous control, want conservative Q estimates | Takes min of two Q-networks; complements Polyak averaging, doesn’t replace it |
| Periodic EMA reset | EMA weights become stale | Reset target to online weights periodically, then resume soft updates; used in some RLHF setups |
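The periodic-reset alternative is easy to sketch: run soft updates every step, plus an occasional hard copy. The schedule below (the `reset_every` value, the fake gradient noise) is purely illustrative, not taken from any specific paper:

```python
import numpy as np

def soft_update_numpy(online, target, tau=0.005):
    """Polyak averaging for dicts of numpy arrays; modifies target in-place."""
    for key in target:
        target[key] *= (1 - tau)
        target[key] += tau * online[key]

rng = np.random.default_rng(0)
online = {"w": rng.normal(size=(4, 2))}
target = {k: v.copy() for k, v in online.items()}  # start as an exact copy
reset_every = 1_000  # illustrative choice

for step in range(1, 3_001):
    # Stand-in for a gradient step on the online network.
    online["w"] += 0.001 * rng.normal(size=online["w"].shape)
    soft_update_numpy(online, target, tau=0.005)
    if step % reset_every == 0:
        # Hard reset: target snaps to the online weights, then soft
        # updates resume on the next step.
        target = {k: v.copy() for k, v in online.items()}
```

After each reset the target's ~1/tau-step lag is temporarily zero, then grows back as soft updates resume.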

Polyak averaging originates from Boris Polyak’s 1990 work on accelerating stochastic optimisation by averaging iterates. The idea of a separate target network in RL was introduced by Mnih et al. (2015) in DQN, but they used hard updates (full weight copy every 10K steps).

Lillicrap et al. (2016) introduced soft target updates (Polyak averaging) in DDPG, arguing that the smooth blending was more stable than periodic hard copies for continuous control. This was adopted by TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018), making it the standard for modern off-policy RL. Meanwhile, the self-supervised learning community independently adopted the same mechanism for momentum encoders in MoCo (He et al., 2020) and BYOL (Grill et al., 2020), calling it “EMA” rather than “Polyak averaging.”