Unified Policy Gradient Algorithm
Introduction
The structure mirrors the Q-learning file. The core update() is always the same three steps — compute advantages, compute policy loss, gradient step — and the variants only swap out what goes into those.
The progression tells a clean story:
REINFORCE → works, but high variance because Ψ is the raw return for the entire trajectory. A good action in a bad episode gets punished.
+ Baseline → subtract V(s) so the signal becomes “better or worse than expected.” Same expected gradient, massively less variance. This is the single biggest practical improvement.
A2C → swap Monte Carlo returns for GAE, which blends multi-step TD errors. You get to tune the bias-variance tradeoff with λ, and you no longer need to wait for episodes to end.
PPO → same advantages as A2C, but changes how the gradient is used. Instead of one update then throw away the data, do K epochs of minibatch updates with the probability ratio clipped so no single step can wreck the policy. This is what makes PPO practical — it’s dramatically more sample-efficient while staying stable.
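The clipping in that last step is easy to see numerically: once the probability ratio r = π(a|s)/π_old(a|s) leaves [1−ε, 1+ε], the clipped branch of min(r·A, clip(r)·A) takes over and the objective stops rewarding further movement. A small sketch (the function name, ε, and sample values are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# With a positive advantage, pushing the ratio past 1+eps buys nothing extra:
print(ppo_clip_objective(1.1, advantage=1.0))  # 1.1 (inside the clip range)
print(ppo_clip_objective(1.5, advantage=1.0))  # 1.2 (capped at 1+eps)
```

Because the objective is flat beyond the clip boundary, its gradient there is zero, so no minibatch step can push the policy far from π_old.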
The key structural difference from Q-learning that PPO highlights: it overrides update() itself to add the multi-epoch minibatch loop. That’s the one place where the “core never changes” rule bends, because PPO’s whole point is reusing the same rollout multiple times — which is an outer-loop concern, not just a different loss.
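That override can be sketched as follows; the function name, dict keys, and batch sizes are illustrative placeholders (the real gradient step is elided), not any specific framework's API:

```python
import numpy as np

def ppo_update(rollout, num_epochs=4, batch_size=2, eps=0.2, seed=0):
    """PPO's override of update(): reuse one rollout for K epochs of
    shuffled minibatch steps, with the probability ratio clipped in the loss."""
    rng = np.random.default_rng(seed)
    n = len(rollout["advantages"])
    losses = []
    for _ in range(num_epochs):  # K passes over the SAME rollout
        for idx in rng.permutation(n).reshape(-1, batch_size):  # shuffled minibatches
            ratio = np.exp(rollout["log_probs"][idx] - rollout["old_log_probs"][idx])
            adv = rollout["advantages"][idx]
            clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
            loss = -np.minimum(ratio * adv, clipped * adv).mean()  # clipped surrogate
            losses.append(loss)  # a real implementation takes a gradient step here
    return losses

rollout = {"log_probs": np.array([-0.6, -1.0, -0.3, -0.9]),
           "old_log_probs": np.array([-0.7, -1.1, -0.2, -0.8]),
           "advantages": np.array([1.0, -0.5, 0.8, 0.2])}
losses = ppo_update(rollout)  # 4 epochs x 2 minibatches = 8 loss evaluations
```

Note the outer loop lives inside update() itself, which is exactly the structural bend described above: the rollout is consumed K times before being discarded.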
Summary: What changes vs. what stays the same
Always the same (core loop)
- Collect rollout with current policy
- Compute advantages Ψ (PLUGGABLE)
- Compute policy loss using Ψ (PLUGGABLE)
- Gradient step on policy (+ optional value net)
- Discard data, repeat (on-policy)
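The core loop above can be sketched as a small class in which only the two pluggable hooks vary; the class, method, and key names here are illustrative, not from a specific codebase:

```python
import numpy as np

class PolicyGradientAgent:
    """Skeleton of the unified update(); variants override only the two hooks."""

    def compute_advantages(self, rollout):
        # Hook 1 (PLUGGABLE). REINFORCE default: Psi is the raw return G_t.
        return rollout["returns"]

    def policy_loss(self, log_probs, psi):
        # Hook 2 (PLUGGABLE). Vanilla score-function loss: -log pi(a|s) * Psi.
        return -(log_probs * psi).mean()

    def update(self, rollout):
        psi = self.compute_advantages(rollout)               # 1. advantages
        loss = self.policy_loss(rollout["log_probs"], psi)   # 2. policy loss
        return loss                                          # 3. gradient step (elided)

rollout = {"returns": np.array([1.0, 0.5, -0.2]),
           "log_probs": np.array([-0.7, -1.2, -0.4])}
loss = PolicyGradientAgent().update(rollout)
```

Subclasses for A2C or PPO would override `compute_advantages` (GAE) and `policy_loss` (clipped surrogate) without touching the three-step shape of update(), PPO's multi-epoch loop excepted.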
What varies by variant
| Variant | Advantage Ψ | Policy loss |
|---|---|---|
| REINFORCE | G_t (full return) | −log π · G_t |
| REINFORCE+baseline | G_t − V(s) | −log π · (G_t − V) |
| A2C | GAE(δ_t) | −log π · A_GAE |
| PPO (clip) | GAE(δ_t) | −min(r·A, clip(r)·A) |
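The GAE(δ_t) entry shared by the A2C and PPO rows can be computed as below: a backward pass accumulating discounted TD errors δ_t = r_t + γV(s_{t+1}) − V(s_t). This is a generic sketch (names and the toy numbers are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: a lambda-weighted sum of TD errors.
    `values` carries one extra entry, the bootstrap value of the final state."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # blend multi-step deltas
        advantages[t] = running
    return advantages

# lam=1 (with gamma=1 here) recovers Monte Carlo returns minus V(s): [2.5, 1.5, 0.5]
adv = gae(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.5, 0.5, 0.0]),
          gamma=1.0, lam=1.0)
```

Setting lam=0 instead reduces each advantage to the single TD error δ_t, which is the biased, low-variance end of the tradeoff.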
Motives for each variant
| Variant | Problem Solved | Intuition for Solution |
|---|---|---|
| REINFORCE | Need a model-free way to optimise a policy directly without learning Q-values | Use the score function trick: ∇J ≈ E[G·∇log π]. Sample trajectories, weight actions by how much total reward followed |
| REINFORCE+baseline | Raw returns have high variance: even good actions get noisy credit because the whole trajectory’s reward is lumped together | Subtract V(s) — a learned estimate of the average return from state s. Now the signal is “better or worse than expected” instead of “good or bad in absolute terms” |
| A2C | Monte Carlo returns still have high variance, you need complete episodes, and can’t do incremental updates | Use GAE: a weighted blend of multi-step TD advantages. λ controls the bias-variance tradeoff. Can update every N steps without waiting for episode ends |
| PPO (clip) | Large policy updates are catastrophic — performance collapses and doesn’t recover. Vanilla PG wastes data (one update then discard) | Reuse rollout data for K epochs of minibatch updates (more sample-efficient), but clip the probability ratio π/π_old to [1−ε, 1+ε] so no single update can change the policy too much |
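The baseline row can be checked numerically: subtracting a baseline leaves the expected gradient unchanged (since E[∇log π] = 0 under the policy) but shrinks its variance. A toy check with a Bernoulli policy and synthetic returns (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 100_000
actions = rng.random(n) < p                              # sample a ~ Bernoulli(p)
score = np.where(actions, 1 / p, -1 / (1 - p))           # d/dp log pi(a); mean is ~0
returns = np.where(actions, 2.0, 1.0) + rng.normal(0, 0.5, n)  # noisy toy returns

grad_raw = score * returns                      # REINFORCE estimator
grad_base = score * (returns - returns.mean())  # baseline-subtracted estimator

print(grad_raw.mean(), grad_base.mean())  # both near the true gradient
print(grad_raw.var() > grad_base.var())   # True: same mean, less variance
```

Here the baseline is just the empirical mean return; a learned V(s) is the state-dependent version of the same trick.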
Key differences from Q-learning
- Q-learning is OFF-POLICY: can reuse old data from a replay buffer
- Policy gradient is ON-POLICY: must collect fresh data each iteration (PPO stretches this with multiple epochs, but still discards after)
- Q-learning learns a value, derives the policy (argmax Q)
- Policy gradient directly optimises the policy parameters
- Q-learning is natural for discrete actions
- Policy gradient handles continuous actions naturally