Discount Factor (Gamma, γ)
Weights future rewards exponentially, making near-term rewards worth more than distant ones. This single scalar defines what “long-term” means for an RL agent — it appears in every value function, every target computation, and every return estimate. Arguably the most important hyperparameter in RL.
Intuition
A dollar today is worth more than a dollar next year. Discounting in RL works the same way: a reward k steps in the future is worth γᵏ times its face value. With γ = 0.99, a reward 100 steps away is worth 0.99¹⁰⁰ ≈ 0.37 of its face value — significant but diminished. With γ = 0.9, that same reward is worth 0.9¹⁰⁰ ≈ 0.000027 — effectively invisible.
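The two weights quoted above are pure arithmetic and easy to check directly:

```python
# Weight of a reward 100 steps in the future under two discount factors.
for gamma in (0.99, 0.9):
    weight = gamma ** 100
    print(f"γ={gamma}: a reward 100 steps away is worth {weight:.6f} of face value")
```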
The effective horizon — how far ahead the agent meaningfully plans — is approximately 1/(1 − γ). So γ = 0.99 gives a horizon of ~100 steps, γ = 0.999 gives ~1000 steps, and γ = 0 makes the agent completely myopic (only cares about immediate reward).
Why not set γ = 1 and care about all future rewards equally? For episodic tasks (games, episodes that end), you can — the sum is finite. But for continuing tasks (robot staying balanced forever), undiscounted returns are infinite and value functions diverge. Discounting also reduces variance in practice: distant rewards add noise because they depend on many uncertain future actions.
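The divergence argument can be made concrete with a constant reward stream (a toy sketch, not tied to any environment): the discounted return converges to the closed form 1/(1 − γ), while the undiscounted partial sum just keeps growing with the number of steps.

```python
# Constant reward of 1 per step: the undiscounted return grows without bound,
# while the discounted return converges to 1 / (1 - γ).
gamma = 0.99
discounted = sum(gamma**k * 1.0 for k in range(10_000))  # effectively converged
undiscounted = sum(1.0 for _ in range(10_000))           # grows linearly in T

print(f"discounted return ≈ {discounted:.2f}  (closed form: {1 / (1 - gamma):.2f})")
print(f"undiscounted partial sum after 10k steps: {undiscounted:.0f}")
```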
Discounted return from time t:

G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{k=0}^{∞} γᵏ·r_{t+k}

Recursive form (Bellman equation):

G_t = r_t + γ·G_{t+1}

Value function:

V(s) = E[G_t | s_t = s]

Q-learning target:

y = r + γ·max_{a′} Q(s′, a′)

GAE (Generalised Advantage Estimation) — uses both γ and λ:

Â_t = δ_t + (γλ)·δ_{t+1} + (γλ)²·δ_{t+2} + …,  where δ_t = r_t + γ·V(s_{t+1}) − V(s_t)

Effective horizon:

H ≈ 1/(1 − γ)
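The recursive (Bellman) form of the return can be sanity-checked against the direct sum numerically (random rewards, plain Python):

```python
import random

# Check the Bellman recursion G_t = r_t + γ·G_{t+1} against the direct sum.
gamma = 0.99
rewards = [random.random() for _ in range(50)]

# Direct definition: G_0 = Σ_k γ^k · r_k
direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form, evaluated backwards from the last reward
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

print(abs(direct - G) < 1e-9)  # True
```

The backward pass is also how the code below computes returns: it is the recursion applied once per timestep, instead of an O(T²) direct sum.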
| γ | Effective horizon | Character |
|---|---|---|
| 0.0 | 1 step | Purely greedy |
| 0.9 | 10 steps | Short-sighted |
| 0.99 | 100 steps | Standard RL |
| 0.999 | 1000 steps | Very far-sighted |
| 1.0 | ∞ | Undiscounted (episodic only) |
```python
import torch

# ── Discounted returns (used in REINFORCE, A2C) ──────────────────
def compute_returns(rewards, gamma=0.99):
    """Compute discounted returns from a list of rewards."""
    returns = []
    G = 0.0
    for r in reversed(rewards):  # work backwards
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns)  # (T,)

# ── Q-learning target ────────────────────────────────────────────
gamma = 0.99
with torch.no_grad():
    next_q = target_net(next_states).max(dim=1).values  # (B,)
    target = rewards + gamma * (1 - dones) * next_q     # (B,) — zero out terminal

# ── GAE ──────────────────────────────────────────────────────────
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_val = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_val - values[t]  # TD error
        gae = delta + gamma * lam * gae                    # accumulate
        advantages[t] = gae
    return advantages  # (T,)
```

Warning: Always multiply by (1 - done) when computing targets. Without this, the agent bootstraps from the next episode’s state at terminal transitions, creating nonsensical value estimates.
Manual Implementation
```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """
    Compute G_t = r_t + γr_{t+1} + γ²r_{t+2} + ... for each timestep.

    rewards: (T,) array of rewards
    Returns: (T,) array of discounted returns
    """
    T = len(rewards)
    returns = np.zeros(T)
    G = 0.0
    for t in range(T - 1, -1, -1):  # backward pass
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns  # (T,)

def td_target(reward, next_value, done, gamma=0.99):
    """Single-step TD target: y = r + γ·V(s') (zeroed at terminal)."""
    return reward + gamma * (1.0 - done) * next_value

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation — combines γ and λ."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae_val = 0.0
    for t in range(T - 1, -1, -1):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # TD error δ_t
        gae_val = delta + gamma * lam * gae_val          # (γλ)-weighted sum
        advantages[t] = gae_val
    return advantages  # (T,)

# Example: verify effective horizon
gamma = 0.99
horizon = 1.0 / (1.0 - gamma)  # ≈ 100
weights = np.array([gamma**k for k in range(200)])
actual_95 = np.searchsorted(np.cumsum(weights) / weights.sum(), 0.95)
print(f"γ={gamma}: effective horizon={horizon:.0f}, 95% weight within {actual_95} steps")
```

Popular Uses
- Q-learning / DQN: y = r + γ·max_{a′} Q(s′, a′) — the bootstrap target discounts the next state’s value
- Policy gradient (REINFORCE, A2C, PPO): discounted returns or GAE advantages weight the policy gradient
- GAE: the product γλ controls bias-variance tradeoff in advantage estimation; γ = 0.99, λ = 0.95 is the standard combo
- Model-based RL (MuZero, Dreamer): discount factor in the imagined rollouts controls planning horizon
- Inverse RL: inferring the discount factor from expert behaviour reveals the expert’s planning horizon
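The λ endpoints are worth checking numerically: at λ = 0 the GAE advantage reduces to the one-step TD error δ_t, and at λ = 1 it telescopes to the full discounted return minus the value baseline. A self-contained sketch (the toy rewards and values are made up for illustration):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE: (γλ)-discounted sum of TD errors, computed backwards."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in range(T - 1, -1, -1):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 2.0])
values  = np.array([0.5, 0.4, 0.3])  # V past the final step treated as 0

# λ = 0: advantage at each step is just the one-step TD error δ_t
adv_l0 = gae_advantages(rewards, values, gamma=0.99, lam=0.0)
deltas = np.array([
    rewards[t] + 0.99 * (values[t + 1] if t + 1 < 3 else 0.0) - values[t]
    for t in range(3)
])
print(np.allclose(adv_l0, deltas))  # True

# λ = 1: advantage equals the full discounted return minus the value baseline
adv_l1 = gae_advantages(rewards, values, gamma=0.99, lam=1.0)
returns = np.zeros(3)
G = 0.0
for t in range(2, -1, -1):
    G = rewards[t] + 0.99 * G
    returns[t] = G
print(np.allclose(adv_l1, returns - values))  # True
```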
Alternatives
Section titled “Alternatives”| Alternative | When to use | Tradeoff |
|---|---|---|
| γ = 1 (undiscounted) | Short episodic tasks, bandits | Only works if episodes terminate; infinite returns otherwise |
| Average reward formulation | Continuing tasks (robotics) | Replaces discounting with the average reward rate ρ; harder to implement |
| Hyperbolic discounting | Modelling human behaviour | 1/(1 + kt) decay; matches human psychology but complicates Bellman equations |
| Learned / adaptive γ | Multi-timescale problems | Agent learns its own horizon; harder to train |
| n-step returns | Compromise between TD(0) and Monte Carlo | Fixed horizon n instead of exponential decay; requires tuning n |
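For contrast with the last row of the table, an n-step return truncates the discounted sum after n rewards and bootstraps the rest from the value function. A minimal sketch (the rewards and values here are illustrative):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """G_t^(n) = r_t + γr_{t+1} + ... + γ^{n-1}·r_{t+n-1} + γ^n·V(s_{t+n})."""
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):       # sum at most n discounted rewards
        G += gamma**k * rewards[t + k]
    if t + n < len(values):              # bootstrap if still inside the episode
        G += gamma**n * values[t + n]
    return G

rewards = [1.0, 0.0, 0.5, 2.0]
values  = [0.9, 0.8, 0.7, 0.6]
print(n_step_return(rewards, values, t=0, n=2))  # 1.0 + 0.99·0.0 + 0.99²·0.7
```

Unlike γ’s smooth exponential decay, everything past step n is summarised by a single bootstrapped value estimate, which is exactly the fixed-horizon tradeoff the table describes.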
Historical Context
Discounting entered RL from economics and dynamic programming (Bellman, 1957), where it models time-preference for money. Sutton & Barto formalised its role in the RL framework, showing that γ < 1 is necessary and sufficient for convergence of value functions in continuing tasks.
The practical insight that γ defines the effective planning horizon — and that tuning it can matter more than algorithm choice — emerged from empirical work in the 2000s-2010s. GAE (Schulman et al., 2016) introduced λ as a second discount-like parameter that controls bias-variance tradeoff in advantage estimation, giving practitioners two knobs: γ for “how far to plan” and λ for “how much to trust the value function.”