Unified Policy Gradient Algorithm
Introduction
The structure mirrors the Q-learning file. The core update() is always the same three steps — compute advantages, compute policy loss, gradient step — and the variants only swap out what goes into those.
The progression tells a clean story:
REINFORCE → works, but high variance because Ψ is the raw return for the entire trajectory. A good action in a bad episode gets punished.
+ Baseline → subtract V(s) so the signal becomes “better or worse than expected.” Same expected gradient, massively less variance. This is the single biggest practical improvement.
A2C → swap Monte Carlo returns for GAE, which blends multi-step TD errors. You get to tune the bias-variance tradeoff with λ, and you no longer need to wait for episodes to end.
PPO → same advantages as A2C, but changes how the gradient is used. Instead of one update then throw away the data, do K epochs of minibatch updates with the probability ratio clipped so no single step can wreck the policy. This is what makes PPO practical — it’s dramatically more sample-efficient while staying stable.
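The clipping in that last step is easy to see numerically: once the probability ratio r = π(a|s)/π_old(a|s) leaves [1−ε, 1+ε], the clipped branch of min(r·A, clip(r)·A) takes over and the objective stops rewarding further movement. A small sketch (the function name, ε, and sample values are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# With a positive advantage, pushing the ratio past 1+eps buys nothing extra:
print(ppo_clip_objective(1.1, advantage=1.0))  # 1.1 (inside the clip range)
print(ppo_clip_objective(1.5, advantage=1.0))  # 1.2 (capped at 1+eps)
```

Because the objective is flat beyond the clip boundary, its gradient there is zero, so no minibatch step can push the policy far from π_old.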
The key structural difference from Q-learning that PPO highlights: it overrides update() itself to add the multi-epoch minibatch loop. That’s the one place where the “core never changes” rule bends, because PPO’s whole point is reusing the same rollout multiple times — which is an outer-loop concern, not just a different loss.
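That override can be sketched as follows; the function name, dict keys, and batch sizes are illustrative placeholders (the real gradient step is elided), not any specific framework's API:

```python
import numpy as np

def ppo_update(rollout, num_epochs=4, batch_size=2, eps=0.2, seed=0):
    """PPO's override of update(): reuse one rollout for K epochs of
    shuffled minibatch steps, with the probability ratio clipped in the loss."""
    rng = np.random.default_rng(seed)
    n = len(rollout["advantages"])
    losses = []
    for _ in range(num_epochs):  # K passes over the SAME rollout
        for idx in rng.permutation(n).reshape(-1, batch_size):  # shuffled minibatches
            ratio = np.exp(rollout["log_probs"][idx] - rollout["old_log_probs"][idx])
            adv = rollout["advantages"][idx]
            clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
            loss = -np.minimum(ratio * adv, clipped * adv).mean()  # clipped surrogate
            losses.append(loss)  # a real implementation takes a gradient step here
    return losses

rollout = {"log_probs": np.array([-0.6, -1.0, -0.3, -0.9]),
           "old_log_probs": np.array([-0.7, -1.1, -0.2, -0.8]),
           "advantages": np.array([1.0, -0.5, 0.8, 0.2])}
losses = ppo_update(rollout)  # 4 epochs x 2 minibatches = 8 loss evaluations
```

Note the outer loop lives inside update() itself, which is exactly the structural bend described above: the rollout is consumed K times before being discarded.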
Summary: What changes vs. what stays the same
Always the same (core loop)
- Collect rollout with current policy
- Compute advantages Ψ (PLUGGABLE)
- Compute policy loss using Ψ (PLUGGABLE)
- Gradient step on policy (+ optional value net)
- Discard data, repeat (on-policy)
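The core loop above can be sketched as a small class in which only the two pluggable hooks vary; the class, method, and key names here are illustrative, not from a specific codebase:

```python
import numpy as np

class PolicyGradientAgent:
    """Skeleton of the unified update(); variants override only the two hooks."""

    def compute_advantages(self, rollout):
        # Hook 1 (PLUGGABLE). REINFORCE default: Psi is the raw return G_t.
        return rollout["returns"]

    def policy_loss(self, log_probs, psi):
        # Hook 2 (PLUGGABLE). Vanilla score-function loss: -log pi(a|s) * Psi.
        return -(log_probs * psi).mean()

    def update(self, rollout):
        psi = self.compute_advantages(rollout)               # 1. advantages
        loss = self.policy_loss(rollout["log_probs"], psi)   # 2. policy loss
        return loss                                          # 3. gradient step (elided)

rollout = {"returns": np.array([1.0, 0.5, -0.2]),
           "log_probs": np.array([-0.7, -1.2, -0.4])}
loss = PolicyGradientAgent().update(rollout)
```

Subclasses for A2C or PPO would override `compute_advantages` (GAE) and `policy_loss` (clipped surrogate) without touching the three-step shape of update(), PPO's multi-epoch loop excepted.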
What varies by variant
| Variant | Advantage Ψ | Policy loss |
|---|---|---|
| REINFORCE | G_t (full return) | −log π · G_t |
| REINFORCE+baseline | G_t − V(s) | −log π · (G_t − V) |
| A2C | GAE(δ_t) | −log π · A_GAE |
| PPO (clip) | GAE(δ_t) | −min(r·A, clip(r)·A) |
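The GAE(δ_t) entry shared by the A2C and PPO rows can be computed as below: a backward pass accumulating discounted TD errors δ_t = r_t + γV(s_{t+1}) − V(s_t). This is a generic sketch (names and the toy numbers are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: a lambda-weighted sum of TD errors.
    `values` carries one extra entry, the bootstrap value of the final state."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # blend multi-step deltas
        advantages[t] = running
    return advantages

# lam=1 (with gamma=1 here) recovers Monte Carlo returns minus V(s): [2.5, 1.5, 0.5]
adv = gae(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.5, 0.5, 0.0]),
          gamma=1.0, lam=1.0)
```

Setting lam=0 instead reduces each advantage to the single TD error δ_t, which is the biased, low-variance end of the tradeoff.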
Motives for each variant
| Variant | Problem Solved | Intuition for Solution |
|---|---|---|
| REINFORCE | Need a model-free way to optimise a policy directly without learning Q-values | Use the score function trick: ∇J ≈ E[G·∇log π]. Sample trajectories, weight actions by how much total reward followed |
| REINFORCE+baseline | Raw returns have high variance: even good actions get noisy credit because the whole trajectory’s reward is lumped together | Subtract V(s) — a learned estimate of the average return from state s. Now the signal is “better or worse than expected” instead of “good or bad in absolute terms” |
| A2C | Monte Carlo returns still have high variance, you need complete episodes, and can’t do incremental updates | Use GAE: a weighted blend of multi-step TD advantages. λ controls the bias-variance tradeoff. Can update every N steps without waiting for episode ends |
| PPO (clip) | Large policy updates are catastrophic — performance collapses and doesn’t recover. Vanilla PG wastes data (one update then discard) | Reuse rollout data for K epochs of minibatch updates (more sample-efficient), but clip the probability ratio π/π_old to [1−ε, 1+ε] so no single update can change the policy too much |
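The baseline row can be checked numerically: subtracting a baseline leaves the expected gradient unchanged (since E[∇log π] = 0 under the policy) but shrinks its variance. A toy check with a Bernoulli policy and synthetic returns (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 100_000
actions = rng.random(n) < p                              # sample a ~ Bernoulli(p)
score = np.where(actions, 1 / p, -1 / (1 - p))           # d/dp log pi(a); mean is ~0
returns = np.where(actions, 2.0, 1.0) + rng.normal(0, 0.5, n)  # noisy toy returns

grad_raw = score * returns                      # REINFORCE estimator
grad_base = score * (returns - returns.mean())  # baseline-subtracted estimator

print(grad_raw.mean(), grad_base.mean())  # both near the true gradient
print(grad_raw.var() > grad_base.var())   # True: same mean, less variance
```

Here the baseline is just the empirical mean return; a learned V(s) is the state-dependent version of the same trick.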
Key differences from Q-learning
- Q-learning is OFF-POLICY: can reuse old data from a replay buffer
- Policy gradient is ON-POLICY: must collect fresh data each iteration (PPO stretches this with multiple epochs, but still discards after)
- Q-learning learns a value, derives the policy (argmax Q)
- Policy gradient directly optimises the policy parameters
- Q-learning is natural for discrete actions
- Policy gradient handles continuous actions naturally