
# Temporal Difference Learning

Updating value estimates using the difference between consecutive predictions rather than waiting for the full episode return. The foundation of all Q-learning variants. The core update: V(s) <- V(s) + alpha * (r + gamma * V(s’) - V(s)), which bootstraps from the next state’s estimated value.

Imagine you’re driving to a new restaurant. A Monte Carlo approach would be: drive the whole route, note the total time, then update your estimate of how long the drive takes. You learn nothing until you arrive. Temporal difference (TD) learning is different: at every intersection, you compare your current estimate of the remaining time against what you actually experience at the next step plus your estimate from there. If you expected 20 minutes left but after one block you now estimate 22 minutes left, you immediately adjust your beliefs.

The key property is bootstrapping: TD updates use the model’s own predictions as part of the target. This means you don’t need complete episodes to learn, which is essential for continuing tasks (like keeping a server running) that never truly terminate. The price is bias — your target includes your own potentially-wrong estimate — but in practice the bias shrinks as learning progresses, and the massive reduction in variance makes TD methods far more sample-efficient than Monte Carlo.

The “temporal difference” itself is the quantity delta = r + gamma * V(s’) - V(s), called the TD error. When delta is positive, the transition was better than expected — you increase V(s). When negative, it was worse — you decrease V(s). This same TD error shows up as the foundation of advantage estimation, actor-critic methods, and every Q-learning variant.
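A quick numeric check of the sign logic (the numbers here are made up purely for illustration): if a transition yields more reward-plus-bootstrapped-value than the current estimate predicted, delta is positive and V(s) moves up.

```python
# Illustrative numbers only: V(s) = 10, r = 1, gamma = 0.9, V(s') = 12
V_s, V_s_next = 10.0, 12.0
r, gamma, alpha = 1.0, 0.9, 0.1

delta = r + gamma * V_s_next - V_s   # 1 + 10.8 - 10 ≈ 1.8 > 0: better than expected
V_s = V_s + alpha * delta            # nudge V(s) up: 10 + 0.1 * 1.8 ≈ 10.18
```

Had the next state looked worse than expected, delta would be negative and the same update would push V(s) down.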

TD(0) update (one-step bootstrapping):

$$V(s_t) \leftarrow V(s_t) + \alpha \cdot \delta_t$$

where the TD error is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

where $\alpha$ is the learning rate, $\gamma \in [0, 1]$ is the discount factor, and $r_t$ is the immediate reward.

TD target: $y_t = r_t + \gamma V(s_{t+1})$. This replaces the full Monte Carlo return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$.

N-step TD (interpolate between TD(0) and Monte Carlo):

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$$

Using more real rewards in the target means less bias but more variance. TD(0) uses $n = 1$; Monte Carlo is the $n \to \infty$ limit.
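The n-step target can be sketched directly from the formula above, assuming you have the window of real rewards and a bootstrap value in hand (function and argument names here are illustrative):

```python
import numpy as np

def n_step_return(rewards, v_boot, gamma=0.99):
    """G_t^(n) = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * V(s_{t+n}).

    rewards: the n real rewards r_t, ..., r_{t+n-1}
    v_boot:  bootstrap value V(s_{t+n}); use 0 if the episode ended in the window
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)  # [1, gamma, ..., gamma^(n-1)]
    return float(discounts @ np.asarray(rewards) + gamma ** n * v_boot)

# n = 1 recovers the TD(0) target r + gamma * V(s'):
g1 = n_step_return([1.0], v_boot=5.0, gamma=0.9)             # 1 + 0.9 * 5 = 5.5
# larger n leans more on real rewards, less on the bootstrap:
g3 = n_step_return([1.0, 1.0, 1.0], v_boot=5.0, gamma=0.9)
```

With `v_boot = 0` and `n` equal to the episode length, this is exactly the Monte Carlo return.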

TD for action-values (Q-learning):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\bigr)$$

This is the update that DQN, Double DQN, and all variants in q-learning/ implement.

```python
import torch
import torch.nn.functional as F

# ── TD target for Q-learning (DQN-style) ────────────────────────
# This is the core of every variant in q-learning/
q_values = q_net(states)                             # (B, n_actions)
q_sa = q_values.gather(1, actions.unsqueeze(1))      # (B, 1) — Q(s, a)

with torch.no_grad():
    next_q = target_net(next_states)                 # (B, n_actions)
    max_next_q = next_q.max(dim=1, keepdim=True)[0]  # (B, 1)
    # TD target: r + gamma * max_a' Q_target(s', a')
    # Mask terminal states: if done, there is no next state
    td_target = rewards + gamma * max_next_q * (1 - dones)  # (B, 1)

loss = F.mse_loss(q_sa, td_target)  # or F.smooth_l1_loss (Huber)

# ── TD error (useful for prioritised replay) ────────────────────
td_error = (q_sa - td_target).abs()  # (B, 1)
```
```python
import numpy as np

def td_update_tabular(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """
    One-step TD(0) update for a tabular state-value function.

    V:      (n_states,) value table
    s:      int, current state index
    r:      float, reward received
    s_next: int, next state index
    done:   bool, whether the episode terminated
    """
    td_target = r + gamma * V[s_next] * (1 - done)  # bootstrap (0 if terminal)
    td_error = td_target - V[s]                     # the "temporal difference"
    V[s] = V[s] + alpha * td_error                  # nudge toward the target
    return td_error

def q_learning_tabular(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """
    One-step Q-learning update (off-policy TD control).

    Q: (n_states, n_actions) action-value table
    """
    best_next = Q[s_next].max() * (1 - done)  # max_a' Q(s', a')
    td_target = r + gamma * best_next
    td_error = td_target - Q[s, a]
    Q[s, a] = Q[s, a] + alpha * td_error
    return td_error
```
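As a usage sketch, here is the same TD(0) update (inlined for self-containment) run to convergence on a toy 3-state chain, a setup invented here for illustration: state 0 leads to state 1, state 1 leads to the terminal state 2 with reward 1.

```python
import numpy as np

# Toy chain: 0 -> 1 -> 2 (terminal), reward 1 on the final transition.
# The true values are V = [gamma, 1, 0], so repeated TD(0) sweeps
# should drive the table towards [0.9, 1.0, 0.0] with gamma = 0.9.
gamma, alpha = 0.9, 0.5
V = np.zeros(3)
for _ in range(200):
    for (s, r, s_next, done) in [(0, 0.0, 1, False), (1, 1.0, 2, True)]:
        td_target = r + gamma * V[s_next] * (1 - done)  # bootstrap (0 if terminal)
        V[s] += alpha * (td_target - V[s])              # nudge toward the target

print(np.round(V, 3))  # ≈ [0.9, 1.0, 0.0]
```

Note that no update ever waits for an episode to finish: each transition improves the table immediately, which is the whole point of bootstrapping.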
- DQN and all Q-learning variants (see q-learning/): TD learning is the backbone — every variant only changes how the TD target or loss is computed
- Actor-critic methods (see policy-gradient/): the critic learns V(s) or Q(s,a) via TD updates; the TD error becomes the advantage signal for the actor
- GAE (see generalised-advantage-estimation.md): constructs advantage estimates from a sequence of TD errors weighted by lambda
- Model-based RL (Dyna, MuZero): TD updates using simulated transitions from a learned world model
- Eligibility traces (TD(lambda)): blend TD and Monte Carlo by maintaining a decaying trace of recently visited states
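A minimal sketch of the backward view of TD(lambda) with accumulating eligibility traces, in the same tabular style as the updates above (names and defaults are illustrative):

```python
import numpy as np

def td_lambda_update(V, E, s, r, s_next, done, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda): broadcast the TD error to all recently
    visited states, weighted by their decaying eligibility trace.

    V: (n_states,) value table
    E: (n_states,) eligibility traces, carried across steps within an episode
    """
    delta = r + gamma * V[s_next] * (1 - done) - V[s]  # ordinary TD error
    E[s] += 1.0                  # accumulating trace for the current state
    V += alpha * delta * E       # every traced state shares in the TD error
    E *= gamma * lam             # traces decay each step
    if done:
        E[:] = 0.0               # reset traces at episode boundaries
    return delta
```

With `lam=0` this collapses to the one-step TD(0) update; as `lam` approaches 1 it behaves increasingly like a Monte Carlo update spread over the visited states.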
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Monte Carlo returns | Short episodes, need unbiased estimates | Zero bias but high variance; cannot learn from incomplete episodes |
| N-step returns | Moderate episode length, want bias-variance middle ground | More real rewards in the target reduces bias; increases variance and memory |
| TD(lambda) | Want smooth interpolation without choosing a fixed n | Elegant theory (forward/backward views equivalent); more complex to implement |
| Model-based planning | Have a good dynamics model | Can do many TD-like updates per real step; model errors compound |
| Residual gradient | Want true gradient descent on TD error | Converges to the correct solution even with function approximation; slower in practice |

TD learning was introduced by Richard Sutton in 1988, building on earlier work by Samuel (1959) on checkers and the theoretical framework of dynamic programming (Bellman, 1957). Sutton’s key insight was that you could learn value functions incrementally from experience without needing a model of the environment, by bootstrapping from your own estimates.

The convergence of tabular TD(0) was proved by Dayan (1992) and later extended. The combination of TD learning with neural network function approximation was notoriously unstable until Mnih et al. (2013, 2015) introduced DQN with two stabilising tricks: a replay buffer (breaking temporal correlation) and a target network (slowing down the bootstrap target). These two innovations made TD learning with deep networks practical and launched the modern deep RL era.