Temporal Difference Learning
Updating value estimates using the difference between consecutive predictions rather than waiting for the full episode return. The foundation of all Q-learning variants. The core update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)), which bootstraps from the next state's estimated value.
Intuition
Imagine you're driving to a new restaurant. A Monte Carlo approach would be: drive the whole route, note the total time, then update your estimate of how long the drive takes. You learn nothing until you arrive. Temporal difference (TD) learning is different: at every intersection, you compare your current estimate of the remaining time against what you actually experience at the next step plus your estimate from there. If you expected 20 minutes left but after one block you now estimate 22 minutes left, you immediately adjust your beliefs.
The key property is bootstrapping: TD updates use the model’s own predictions as part of the target. This means you don’t need complete episodes to learn, which is essential for continuing tasks (like keeping a server running) that never truly terminate. The price is bias — your target includes your own potentially-wrong estimate — but in practice the bias shrinks as learning progresses, and the massive reduction in variance makes TD methods far more sample-efficient than Monte Carlo.
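For contrast, the Monte Carlo target can be sketched in a few lines (function name illustrative): it needs the complete reward sequence before a single update can be made, which is exactly what TD avoids.

```python
def monte_carlo_return(rewards, gamma=0.99):
    """Full return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    Computed backwards over the COMPLETE episode, so no learning
    can happen until the episode terminates."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = monte_carlo_return([1.0, 1.0, 1.0], gamma=0.5)
```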
The “temporal difference” itself is the quantity delta = r + gamma * V(s’) - V(s), called the TD error. When delta is positive, the transition was better than expected — you increase V(s). When negative, it was worse — you decrease V(s). This same TD error shows up as the foundation of advantage estimation, actor-critic methods, and every Q-learning variant.
TD(0) update (one-step bootstrapping):

V(s) <- V(s) + alpha * delta

where the TD error is:

delta = r + gamma * V(s') - V(s)

alpha is the learning rate, gamma is the discount factor, r is the immediate reward.

TD target: r + gamma * V(s'). This replaces the full Monte Carlo return G = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
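Plugging in concrete numbers (hypothetical values, purely to illustrate the sign convention):

```python
def td_error(r, v_s, v_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s); the bootstrap term is
    zeroed at terminal states, where there is no next value."""
    return r + gamma * v_next * (not done) - v_s

# We expected V(s) = 2.0, but got reward 1.0 and landed in a state
# worth 2.5: delta = 1.0 + 0.9 * 2.5 - 2.0 = 1.25 > 0, so raise V(s).
delta = td_error(r=1.0, v_s=2.0, v_next=2.5, gamma=0.9)
```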
N-step TD (interpolate between TD(0) and Monte Carlo):

V(s_t) <- V(s_t) + alpha * (r_t + gamma * r_{t+1} + ... + gamma^(n-1) * r_{t+n-1} + gamma^n * V(s_{t+n}) - V(s_t))
Using more real rewards in the target means less bias but more variance. TD(0) uses n = 1; Monte Carlo uses n = infinity.
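A sketch of the n-step target (helper name illustrative), showing how n = 1 recovers the TD(0) target while a complete episode with a zero bootstrap recovers the Monte Carlo return:

```python
def n_step_td_target(rewards, v_bootstrap, gamma=0.99):
    """target = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
              + gamma^n * V(s_{t+n}),  where n = len(rewards)."""
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return g + (gamma ** len(rewards)) * v_bootstrap

one_step = n_step_td_target([1.0], v_bootstrap=2.0, gamma=0.9)       # 1 + 0.9*2 = 2.8
two_step = n_step_td_target([1.0, 1.0], v_bootstrap=2.0, gamma=0.9)  # 1 + 0.9 + 0.81*2 = 3.52
```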
TD for action-values (Q-learning):

Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
This is the update that DQN, Double DQN, and all variants in q-learning/ implement.
```python
import torch
import torch.nn.functional as F

# ── TD target for Q-learning (DQN-style) ────────────────────────
# This is the core of every variant in q-learning/
q_values = q_net(states)                         # (B, n_actions)
q_sa = q_values.gather(1, actions.unsqueeze(1))  # (B, 1) — Q(s, a)

with torch.no_grad():
    next_q = target_net(next_states)                 # (B, n_actions)
    max_next_q = next_q.max(dim=1, keepdim=True)[0]  # (B, 1)
    # TD target: r + gamma * max_a' Q_target(s', a')
    # Mask terminal states: if done, there is no next state
    td_target = rewards + gamma * max_next_q * (1 - dones)  # (B, 1)

loss = F.mse_loss(q_sa, td_target)  # or F.smooth_l1_loss (Huber)

# ── TD error (useful for prioritised replay) ────────────────────
td_error = (q_sa - td_target).abs()  # (B, 1)
```

Manual Implementation
```python
import numpy as np

def td_update_tabular(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """
    One-step TD(0) update for tabular state-value function.
    V:      (n_states,) value table
    s:      int, current state index
    r:      float, reward received
    s_next: int, next state index
    done:   bool, whether episode terminated
    """
    td_target = r + gamma * V[s_next] * (1 - done)  # bootstrap (0 if terminal)
    td_error = td_target - V[s]                     # the "temporal difference"
    V[s] = V[s] + alpha * td_error                  # nudge toward target
    return td_error
```
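As a quick sanity check, this update can be run to convergence on a tiny deterministic chain (a hypothetical toy MRP; the update is written inline so the snippet stands alone):

```python
import numpy as np

# Deterministic chain 0 -> 1 -> 2 (terminal), reward +1 per step, gamma = 1.
# True values: V(0) = 2, V(1) = 1.
V = np.zeros(3)
alpha, gamma = 0.1, 1.0
for _ in range(2000):
    for s, r, s_next, done in [(0, 1.0, 1, False), (1, 1.0, 2, True)]:
        td_target = r + gamma * V[s_next] * (1 - done)  # bootstrap from V(s')
        V[s] += alpha * (td_target - V[s])              # TD(0) step
# V now approximates [2, 1, 0] to within numerical noise.
```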
```python
def q_learning_tabular(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """
    One-step Q-learning update (off-policy TD control).
    Q: (n_states, n_actions) action-value table
    """
    best_next = Q[s_next].max() * (1 - done)  # max_a' Q(s', a')
    td_target = r + gamma * best_next
    td_error = td_target - Q[s, a]
    Q[s, a] = Q[s, a] + alpha * td_error
    return td_error
```

Popular Uses
- DQN and all Q-learning variants (see q-learning/): TD learning is the backbone — every variant only changes how the TD target or loss is computed
- Actor-critic methods (see policy-gradient/): the critic learns V(s) or Q(s, a) via TD updates; the TD error becomes the advantage signal for the actor
- GAE (see generalised-advantage-estimation.md): constructs advantage estimates from a sequence of TD errors weighted by lambda
- Model-based RL (Dyna, MuZero): TD updates using simulated transitions from a learned world model
- Eligibility traces (TD(lambda)): blend TD and Monte Carlo by maintaining a decaying trace of recently visited states
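The eligibility-trace idea can be sketched in a few lines (backward view; function name and transition format are illustrative):

```python
import numpy as np

def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda): every recently visited state keeps a
    decaying eligibility trace and absorbs a share of each TD error.
    lam = 0 recovers TD(0); lam -> 1 approaches Monte Carlo."""
    z = np.zeros_like(V)                 # eligibility traces
    for s, r, s_next, done in transitions:
        delta = r + gamma * V[s_next] * (1 - done) - V[s]
        z *= gamma * lam                 # decay every trace
        z[s] += 1.0                      # accumulating trace for current state
        V += alpha * delta * z           # credit all traced states at once
    return V
```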
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Monte Carlo returns | Short episodes, need unbiased estimates | Zero bias but high variance; cannot learn from incomplete episodes |
| N-step returns | Moderate episode length, want bias-variance middle ground | More real rewards in target reduces bias; increases variance and memory |
| TD(lambda) | Want smooth interpolation without choosing fixed n | Elegant theory (forward/backward views equivalent); more complex to implement |
| Model-based planning | Have a good dynamics model | Can do many TD-like updates per real step; model errors compound |
| Residual gradient | Want true gradient descent on TD error | Converges to correct solution even with function approximation; slower in practice |
Historical Context
TD learning was introduced by Richard Sutton in 1988, building on earlier work by Samuel (1959) on checkers and the theoretical framework of dynamic programming (Bellman, 1957). Sutton's key insight was that you could learn value functions incrementally from experience without needing a model of the environment, by bootstrapping from your own estimates.
The convergence of tabular TD(0) was proved by Dayan (1992) and later extended. The combination of TD learning with neural network function approximation was notoriously unstable until Mnih et al. (2013, 2015) introduced DQN with two stabilising tricks: a replay buffer (breaking temporal correlation) and a target network (slowing down the bootstrap target). These two innovations made TD learning with deep networks practical and launched the modern deep RL era.