Sparse Rewards
The agent receives a reward signal only rarely — often only at task completion (success/failure). For most timesteps, the reward is exactly zero, providing no gradient signal about whether the agent's behaviour is improving. Most of the state-action space is a reward desert.
Intuition
Imagine learning to solve a Rubik's cube by trial and error, where the only feedback is "solved" or "not solved." You randomly twist the cube millions of times and never get any signal because you never accidentally solve it. Even if you occasionally make progress (getting one face right), you receive no feedback because the reward only fires at the end. You can't learn because you can't tell which of your millions of actions, if any, were steps in the right direction.
Dense rewards (reward at every timestep) are like a GPS giving turn-by-turn directions. Sparse rewards are like only being told “you arrived” or “you didn’t” — with no information about which wrong turns you took. The agent must stumble onto the reward through random exploration, and then propagate credit backward through potentially thousands of steps to figure out which actions led there.
The problem compounds with task length and action space size. For a 1000-step task with 10 actions per step, the space of possible trajectories is 10^1000. The fraction of trajectories that reach the reward is vanishingly small. Random exploration will essentially never find the reward, so learning never begins.
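To make the combinatorial argument concrete, here is a toy illustration (not from the source): a hypothetical task that is "solved" only if a uniformly random policy picks one specific action at every decision point. Even at a modest 20 steps with 4 actions, the success probability is (1/4)^20 ≈ 9 × 10⁻¹³, so random search never produces a single reward to learn from.

```python
import random

def random_search_success_rate(steps=20, n_actions=4, episodes=10_000, seed=0):
    """Estimate how often a uniformly random policy solves a toy task.

    Hypothetical task: 'success' requires choosing action 0 at every one
    of `steps` decision points. Success probability is (1/n_actions)**steps,
    so even this short task is effectively unreachable by random search.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(episodes):
        if all(rng.randrange(n_actions) == 0 for _ in range(steps)):
            successes += 1
    return successes / episodes

# Random exploration essentially never sees the reward, so the agent
# receives zero learning signal for the entire training run.
print(random_search_success_rate())
```

At 1000 steps and 10 actions the odds shrink from ~10⁻¹² to ~10⁻¹⁰⁰⁰, which is why the text says learning never begins.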
Manifestation
- The agent shows no improvement for long periods — reward is zero for every episode, so there's no gradient signal
- Random exploration never reaches the goal — the agent needs to discover a specific sequence of actions, and random search in a combinatorial space is hopeless
- Policy gradient variance is extreme — in the rare episode that gets a reward, the entire trajectory is upweighted, including irrelevant actions
- Performance is brittle — the agent may eventually find one path to the reward and repeat it exactly, with no generalisation to nearby strategies
- Training works immediately when you add dense reward shaping — confirming the problem was sparsity, not capacity
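The "no gradient signal" and "extreme variance" symptoms above can be seen directly in the REINFORCE update. The sketch below (my own minimal illustration, not code from the source) applies the update for a softmax policy: with a sparse reward, the episode return multiplies the whole gradient, so a failed episode moves the parameters by exactly zero, and a rare successful episode upweights every action it contained.

```python
import numpy as np

def reinforce_update(logits, actions, total_return, lr=0.1):
    """One REINFORCE update for a softmax policy over discrete actions.

    grad log pi(a) w.r.t. the logits is (one_hot(a) - pi), scaled by the
    episode return. Under sparse rewards, total_return is 0 for almost
    every episode, making the update below exactly zero.
    """
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    grad = np.zeros_like(logits)
    for a in actions:
        one_hot = np.zeros_like(logits)
        one_hot[a] = 1.0
        grad += total_return * (one_hot - pi)
    return logits + lr * grad

logits = np.zeros(3)
# Failed episode (return 0): the parameters do not move at all.
updated = reinforce_update(logits, actions=[0, 2, 1], total_return=0.0)
print(np.allclose(updated, logits))  # True
```

Note that when `total_return` is nonzero, every action in the trajectory is reinforced equally — including the irrelevant ones — which is the variance problem described above.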
Where It Appears
- Q-learning (q-learning/): sparse rewards make Q-value propagation very slow — the reward must bootstrap backward one step at a time through the Bellman equation, requiring the agent to visit the entire path from reward to initial state multiple times
- Policy gradient (policy-gradient/): REINFORCE with sparse rewards has zero expected gradient for almost all trajectories — only the rare successful trajectory contributes, and the signal is extremely noisy
- Contrastive learning (contrastive-self-supervising/): the InfoNCE loss provides dense self-supervised signal — one motivation for self-supervised pretraining is to avoid the sparse-reward problem entirely by learning representations from dense (self-generated) objectives
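The one-step-at-a-time bootstrapping mentioned for Q-learning is easy to demonstrate on a deterministic chain MDP (a toy setup of my own, assuming `alpha=1` for clarity). Each episode sweeps the chain from start to goal, and the terminal reward moves back by exactly one state per episode: after one episode only the state next to the goal has a nonzero value, and it takes as many episodes as there are states before the start state learns anything.

```python
import numpy as np

def chain_q_learning(n_states=5, episodes=1, alpha=1.0, gamma=0.9):
    """Tabular one-step Q-learning on a chain 0 -> 1 -> ... -> goal.

    Single action ('right'); reward 1 only on entering the terminal
    state. Each full episode propagates the reward signal back by
    exactly one state, so sparse terminal rewards take n_states
    episodes to reach the initial state's value estimate.
    """
    Q = np.zeros(n_states)  # Q(s, right)
    for _ in range(episodes):
        for s in range(n_states):
            if s == n_states - 1:
                target = 1.0          # terminal reward
            else:
                target = gamma * Q[s + 1]  # bootstrap from successor
            Q[s] += alpha * (target - Q[s])
    return Q

print(chain_q_learning(episodes=1))  # only the state next to the goal has value
print(chain_q_learning(episodes=5))  # value has finally reached the start state
```

This is exactly the slowness that n-step returns (in the table below of solutions) attack: propagating the reward back n states per update instead of one.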
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| Reward shaping | Add intermediate rewards that guide the agent toward the goal | (design principle) |
| Hindsight Experience Replay (HER) | Relabel failed episodes with achieved goals — turns every trajectory into a “success” for some goal | (Andrychowicz et al., 2017) |
| Curiosity / intrinsic motivation | Reward the agent for visiting novel states, providing signal even without external reward | (Pathak et al., 2017) |
| Hierarchical RL | Decompose the task into subtasks with their own (denser) rewards | (options framework) |
| n-step returns | Propagate reward backward n steps in one update instead of one step — speeds up credit propagation | atomic-concepts/rl-specific/temporal-difference-learning.md |
| Discount factor tuning | Lower γ makes the agent focus on nearby rewards, but a high γ is needed to see distant sparse rewards | atomic-concepts/rl-specific/discount-factor.md |
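Of the solutions above, HER's relabelling trick is the most mechanical, so here is a minimal sketch of the idea. The `(state, action, goal, reward)` transition tuple and the exact-match reward test are simplifying assumptions of mine, not the API from Andrychowicz et al. (2017): a failed trajectory is stored a second time with the desired goal swapped for a state that was actually reached, so the sparse reward fires for the relabelled copy.

```python
def her_relabel(trajectory, achieved_goal):
    """Hindsight relabelling in the spirit of HER.

    Takes a trajectory of (state, action, goal, reward) tuples collected
    while pursuing some desired goal, and rewrites it as if
    `achieved_goal` (a state actually visited) had been the goal all
    along. The reward is recomputed against the new goal, so the failed
    episode becomes a success for that goal.
    """
    relabelled = []
    for state, action, goal, reward in trajectory:
        new_reward = 1.0 if state == achieved_goal else 0.0
        relabelled.append((state, action, achieved_goal, new_reward))
    return relabelled

# A failed episode: the agent never reached goal 'G', so every reward is 0.
episode = [("s0", "a0", "G", 0.0), ("s1", "a1", "G", 0.0), ("s2", "a2", "G", 0.0)]
# Relabel with the final achieved state: the last transition now earns reward.
print(her_relabel(episode, achieved_goal="s2"))
```

Both the original and relabelled transitions go into the replay buffer, so a goal-conditioned policy gets dense learning signal without any domain knowledge.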
Historical Context
Sparse rewards have been a central challenge since the earliest RL research. Sutton & Barto's textbook uses the "cliff walking" and "maze" examples to illustrate how sparse terminal rewards make learning difficult. Hindsight Experience Replay (Andrychowicz et al., 2017) was a breakthrough: by relabelling goals after the fact, it converts sparse-reward problems into dense-reward problems without any domain knowledge. Curiosity-driven exploration (Pathak et al., 2017) attacked the problem from the other direction — generate your own reward signal from prediction error. Both approaches remain active research areas, and sparse rewards continue to be one of the hardest practical problems in RL.