
# Temporal Difference Learning

Updating value estimates using the difference between consecutive predictions rather than waiting for the full episode return. The foundation of all Q-learning variants. The core update: V(s) <- V(s) + alpha * (r + gamma * V(s’) - V(s)), which bootstraps from the next state’s estimated value.

Imagine you’re driving to a new restaurant. A Monte Carlo approach would be: drive the whole route, note the total time, then update your estimate of how long the drive takes. You learn nothing until you arrive. Temporal difference (TD) learning is different: at every intersection, you compare your current estimate of the remaining time against what you actually experience at the next step plus your estimate from there. If you expected 20 minutes left but after one block you now estimate 22 minutes left, you immediately adjust your beliefs.

The key property is bootstrapping: TD updates use the model’s own predictions as part of the target. This means you don’t need complete episodes to learn, which is essential for continuing tasks (like keeping a server running) that never truly terminate. The price is bias — your target includes your own potentially-wrong estimate — but in practice the bias shrinks as learning progresses, and the massive reduction in variance makes TD methods far more sample-efficient than Monte Carlo.

The “temporal difference” itself is the quantity delta = r + gamma * V(s’) - V(s), called the TD error. When delta is positive, the transition was better than expected — you increase V(s). When negative, it was worse — you decrease V(s). This same TD error shows up as the foundation of advantage estimation, actor-critic methods, and every Q-learning variant.
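A quick numeric check of the sign logic (the numbers here are made up purely for illustration): if a transition yields more reward-plus-bootstrapped-value than the current estimate predicted, delta is positive and V(s) moves up.

```python
# Illustrative numbers only: V(s) = 10, r = 1, gamma = 0.9, V(s') = 12
V_s, V_s_next = 10.0, 12.0
r, gamma, alpha = 1.0, 0.9, 0.1

delta = r + gamma * V_s_next - V_s   # 1 + 10.8 - 10 ≈ 1.8 > 0: better than expected
V_s = V_s + alpha * delta            # nudge V(s) up: 10 + 0.1 * 1.8 ≈ 10.18
```

Had the next state looked worse than expected, delta would be negative and the same update would push V(s) down.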

TD(0) update (one-step bootstrapping):

$$V(s_t) \leftarrow V(s_t) + \alpha \cdot \delta_t$$

where the TD error is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

where $\alpha$ is the learning rate, $\gamma \in [0, 1]$ is the discount factor, and $r_t$ is the immediate reward.

TD target: $y_t = r_t + \gamma V(s_{t+1})$. This replaces the full Monte Carlo return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$.

N-step TD (interpolate between TD(0) and Monte Carlo):

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$$

Using more real rewards in the target means less bias but more variance. TD(0) uses $n = 1$; Monte Carlo is the $n \to \infty$ limit.
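The n-step target can be sketched directly from the formula above, assuming you have the window of real rewards and a bootstrap value in hand (function and argument names here are illustrative):

```python
import numpy as np

def n_step_return(rewards, v_boot, gamma=0.99):
    """G_t^(n) = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * V(s_{t+n}).

    rewards: the n real rewards r_t, ..., r_{t+n-1}
    v_boot:  bootstrap value V(s_{t+n}); use 0 if the episode ended in the window
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)  # [1, gamma, ..., gamma^(n-1)]
    return float(discounts @ np.asarray(rewards) + gamma ** n * v_boot)

# n = 1 recovers the TD(0) target r + gamma * V(s'):
g1 = n_step_return([1.0], v_boot=5.0, gamma=0.9)             # 1 + 0.9 * 5 = 5.5
# larger n leans more on real rewards, less on the bootstrap:
g3 = n_step_return([1.0, 1.0, 1.0], v_boot=5.0, gamma=0.9)
```

With `v_boot = 0` and `n` equal to the episode length, this is exactly the Monte Carlo return.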

TD for action-values (Q-learning):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\bigr)$$

This is the update that DQN, Double DQN, and all variants in q-learning/ implement.

```python
import torch
import torch.nn.functional as F

# ── TD target for Q-learning (DQN-style) ────────────────────────
# This is the core of every variant in q-learning/
q_values = q_net(states)                             # (B, n_actions)
q_sa = q_values.gather(1, actions.unsqueeze(1))      # (B, 1) — Q(s, a)

with torch.no_grad():
    next_q = target_net(next_states)                 # (B, n_actions)
    max_next_q = next_q.max(dim=1, keepdim=True)[0]  # (B, 1)
    # TD target: r + gamma * max_a' Q_target(s', a')
    # Mask terminal states: if done, there is no next state
    td_target = rewards + gamma * max_next_q * (1 - dones)  # (B, 1)

loss = F.mse_loss(q_sa, td_target)  # or F.smooth_l1_loss (Huber)

# ── TD error (useful for prioritised replay) ────────────────────
td_error = (q_sa - td_target).abs()  # (B, 1)
```
```python
import numpy as np

def td_update_tabular(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """
    One-step TD(0) update for a tabular state-value function.

    V:      (n_states,) value table
    s:      int, current state index
    r:      float, reward received
    s_next: int, next state index
    done:   bool, whether the episode terminated
    """
    td_target = r + gamma * V[s_next] * (1 - done)  # bootstrap (0 if terminal)
    td_error = td_target - V[s]                     # the "temporal difference"
    V[s] = V[s] + alpha * td_error                  # nudge toward the target
    return td_error

def q_learning_tabular(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """
    One-step Q-learning update (off-policy TD control).

    Q: (n_states, n_actions) action-value table
    """
    best_next = Q[s_next].max() * (1 - done)  # max_a' Q(s', a')
    td_target = r + gamma * best_next
    td_error = td_target - Q[s, a]
    Q[s, a] = Q[s, a] + alpha * td_error
    return td_error
```
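As a usage sketch, here is the same TD(0) update (inlined for self-containment) run to convergence on a toy 3-state chain, a setup invented here for illustration: state 0 leads to state 1, state 1 leads to the terminal state 2 with reward 1.

```python
import numpy as np

# Toy chain: 0 -> 1 -> 2 (terminal), reward 1 on the final transition.
# The true values are V = [gamma, 1, 0], so repeated TD(0) sweeps
# should drive the table towards [0.9, 1.0, 0.0] with gamma = 0.9.
gamma, alpha = 0.9, 0.5
V = np.zeros(3)
for _ in range(200):
    for (s, r, s_next, done) in [(0, 0.0, 1, False), (1, 1.0, 2, True)]:
        td_target = r + gamma * V[s_next] * (1 - done)  # bootstrap (0 if terminal)
        V[s] += alpha * (td_target - V[s])              # nudge toward the target

print(np.round(V, 3))  # ≈ [0.9, 1.0, 0.0]
```

Note that no update ever waits for an episode to finish: each transition improves the table immediately, which is the whole point of bootstrapping.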
- DQN and all Q-learning variants (see q-learning/): TD learning is the backbone — every variant only changes how the TD target or loss is computed
- Actor-critic methods (see policy-gradient/): the critic learns V(s) or Q(s,a) via TD updates; the TD error becomes the advantage signal for the actor
- GAE (see generalised-advantage-estimation.md): constructs advantage estimates from a sequence of TD errors weighted by lambda
- Model-based RL (Dyna, MuZero): TD updates using simulated transitions from a learned world model
- Eligibility traces (TD(lambda)): blend TD and Monte Carlo by maintaining a decaying trace of recently visited states
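A minimal sketch of the backward view of TD(lambda) with accumulating eligibility traces, in the same tabular style as the updates above (names and defaults are illustrative):

```python
import numpy as np

def td_lambda_update(V, E, s, r, s_next, done, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda): broadcast the TD error to all recently
    visited states, weighted by their decaying eligibility trace.

    V: (n_states,) value table
    E: (n_states,) eligibility traces, carried across steps within an episode
    """
    delta = r + gamma * V[s_next] * (1 - done) - V[s]  # ordinary TD error
    E[s] += 1.0                  # accumulating trace for the current state
    V += alpha * delta * E       # every traced state shares in the TD error
    E *= gamma * lam             # traces decay each step
    if done:
        E[:] = 0.0               # reset traces at episode boundaries
    return delta
```

With `lam=0` this collapses to the one-step TD(0) update; as `lam` approaches 1 it behaves increasingly like a Monte Carlo update spread over the visited states.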
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Monte Carlo returns | Short episodes, need unbiased estimates | Zero bias but high variance; cannot learn from incomplete episodes |
| N-step returns | Moderate episode length, want bias-variance middle ground | More real rewards in the target reduces bias; increases variance and memory |
| TD(lambda) | Want smooth interpolation without choosing a fixed n | Elegant theory (forward/backward views equivalent); more complex to implement |
| Model-based planning | Have a good dynamics model | Can do many TD-like updates per real step; model errors compound |
| Residual gradient | Want true gradient descent on TD error | Converges to the correct solution even with function approximation; slower in practice |

TD learning was introduced by Richard Sutton in 1988, building on earlier work by Samuel (1959) on checkers and the theoretical framework of dynamic programming (Bellman, 1957). Sutton’s key insight was that you could learn value functions incrementally from experience without needing a model of the environment, by bootstrapping from your own estimates.

The convergence of tabular TD(0) was proved by Dayan (1992) and later extended. The combination of TD learning with neural network function approximation was notoriously unstable until Mnih et al. (2013, 2015) introduced DQN with two stabilising tricks: a replay buffer (breaking temporal correlation) and a target network (slowing down the bootstrap target). These two innovations made TD learning with deep networks practical and launched the modern deep RL era.