Saddle Points

In high-dimensional loss landscapes, most critical points (where the gradient is zero) are saddle points, not local minima: the point is a minimum along some directions and a maximum along others. Gradient descent can slow to a crawl near saddle points, wasting compute on negligible updates.

Imagine walking on a mountain pass: you’re at the lowest point between two peaks (a minimum along the east-west axis) but also at the highest point of the ridge between two valleys (a maximum along the north-south axis). The gradient is zero because you’re at a critical point, but it’s not a minimum — if you could take one step north or south, you’d start descending again.
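The mountain-pass picture has a standard algebraic form. As a toy illustration (not from the source), the function f(x, y) = x² − y² has exactly this geometry at the origin: zero gradient, but a Hessian with one positive and one negative eigenvalue.

```python
import numpy as np

# Classic 2D saddle: f(x, y) = x^2 - y^2
# (a minimum along the x-axis, a maximum along the y-axis).
def f(p):
    x, y = p
    return x**2 - y**2

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

origin = np.array([0.0, 0.0])
print(grad(origin))           # the gradient vanishes at the origin: [0. 0.]

# The Hessian of f is constant; its mixed-sign eigenvalues are
# exactly what makes the origin a saddle point, not a minimum.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.eigvalsh(H))  # one negative, one positive eigenvalue
```

The mixed-sign eigenvalues correspond to the east-west valley and the north-south ridge in the analogy.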

In low dimensions (2D, 3D), local minima are common and saddle points are rare. In high dimensions, the opposite is true. A critical point is a local minimum only if the Hessian is positive definite, i.e. all its eigenvalues are positive. For a random symmetric matrix, the probability that every eigenvalue is positive decreases exponentially with dimension. In 1000 dimensions, a critical point that happens to curve upwards in all 1000 directions is astronomically unlikely. Almost all critical points are saddle points.
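The exponential decay is easy to observe empirically. The sketch below (an illustration, not from the source) samples random symmetric matrices as stand-ins for Hessians and measures how often all eigenvalues come out positive as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(n, trials=2000):
    """Fraction of random symmetric n x n matrices with all eigenvalues > 0."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2          # symmetrise, as a Hessian would be
        if np.all(np.linalg.eigvalsh(H) > 0):
            count += 1
    return count / trials

for n in (1, 2, 4, 8):
    print(n, frac_positive_definite(n))
```

Already by dimension 8 the fraction is essentially zero; at the dimensions of neural network parameter spaces, a randomly-curved critical point being a minimum is vanishingly unlikely.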

The good news is that saddle points aren’t permanent traps for SGD. The stochasticity of mini-batch gradients and the velocity accumulated by momentum help the optimiser escape. But they can cause long plateaus where training appears stuck, which wastes compute and can mislead practitioners into thinking the model has converged.
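The escape mechanism can be demonstrated on the toy saddle from before. In this sketch (an illustration under simplified assumptions, not a claim about real training), plain gradient descent started exactly on the ridge of f(x, y) = x² − y² converges to the saddle, while adding noise to the gradient lets the optimiser fall into the descent direction.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(p):                       # gradient of f(x, y) = x^2 - y^2
    return np.array([2 * p[0], -2 * p[1]])

def run(noise_scale, steps=200, lr=0.05):
    p = np.array([0.5, 0.0])       # start exactly on the saddle's ridge (y = 0)
    for _ in range(steps):
        g = grad(p) + noise_scale * rng.standard_normal(2)
        p = p - lr * g
    return p

print(run(0.0))    # plain GD: y stays exactly 0, converging to the saddle
print(run(0.1))    # noisy "SGD": any perturbation in y is amplified; it escapes
```

The ridge is an unstable equilibrium: the negative-curvature direction amplifies any perturbation geometrically, which is why even small mini-batch noise is enough to escape.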

In practice:

  • Loss plateaus for many steps then suddenly drops — the optimiser was near a saddle point and finally found a descent direction
  • Gradient norms become very small without the loss reaching a good value — distinguishes saddle points from true convergence
  • Training with full-batch gradient descent is much more susceptible to getting stuck than SGD — mini-batch noise helps escape
  • Second-order methods (which use curvature information) navigate saddle points efficiently but are too expensive for large models
Where this shows up:

  • NN training (nn-training/): the primary context — deep networks have loss landscapes dominated by saddle points; momentum-based optimisers (Adam, SGD+momentum) are preferred partly because they escape saddle points faster
  • Policy gradient (policy-gradient/): policy optimisation landscapes also have saddle points — entropy regularisation helps by maintaining exploration of the parameter space
  • GANs (gans/): the GAN minimax objective has saddle points by construction (the Nash equilibrium is a saddle point of the minimax loss) — this is one reason GAN training is hard
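The second bullet above — small gradient norm, poor loss — can be confirmed definitively on small problems by checking curvature: a negative Hessian eigenvalue at a near-critical point identifies a saddle rather than a minimum. This sketch uses a finite-difference Hessian (only affordable for toy problems; `num_hessian` is an illustrative helper, not a library function).

```python
import numpy as np

def f(p):                                   # toy loss with a saddle at the origin
    x, y = p
    return x**2 - y**2

def num_hessian(f, p, eps=1e-4):
    """Finite-difference Hessian estimate (O(n^2) evaluations; toy scale only)."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += eps; pp[j] += eps
            pm = p.copy(); pm[i] += eps; pm[j] -= eps
            mp = p.copy(); mp[i] -= eps; mp[j] += eps
            mm = p.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * eps**2)
    return H

p = np.array([0.0, 0.0])                    # a zero-gradient point with bad loss
eigs = np.linalg.eigvalsh(num_hessian(f, p))
print(eigs)   # a negative eigenvalue here flags a saddle, not convergence
```

For large models, the same information is typically obtained with Hessian-vector products rather than an explicit Hessian.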
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| SGD with momentum | Momentum accumulates velocity, carrying the optimiser past saddle points | (standard optimiser) |
| Adam / AdamW | Adaptive learning rates per parameter — can navigate around saddle points via the second moment | nn-training/ |
| Mini-batch noise | Stochastic gradients perturb the optimiser away from saddle points | (inherent in SGD) |
| Learning rate warmup | Gradual ramp-up prevents the optimiser from settling near early saddle points | atomic-concepts/optimisation-primitives/learning-rate-warmup.md |
| Entropy regularisation | In RL, maintains policy stochasticity, preventing the policy from collapsing to a saddle point | atomic-concepts/regularisation/entropy-regularisation.md |
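The momentum mechanism in the first row can be sketched in one dimension. In this toy (an illustration, not from the source), the loss has a perfectly flat plateau standing in for the near-zero-gradient region around a saddle: plain gradient descent stalls there, while the heavy-ball update keeps moving on accumulated velocity.

```python
def grad(x):
    # Piecewise slope: downhill for x < 1, a flat plateau for 1 <= x <= 3
    # (a crude stand-in for a saddle's plateau), downhill again beyond.
    if x < 1.0:
        return -1.0
    if x <= 3.0:
        return 0.0
    return -1.0

def descend(momentum, steps=100, lr=0.05):
    x, v = 0.0, 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad(x)   # heavy-ball update: v accumulates
        x = x + v
    return x

print(descend(0.0))   # plain GD: stalls at the plateau's edge (x stays near 1)
print(descend(0.9))   # momentum: coasts across the plateau (x ends beyond 3)
```

With momentum = 0, the update is zero wherever the gradient is zero; with momentum = 0.9, the velocity decays only geometrically, so the optimiser carries enough speed to cross the flat region.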

Dauphin et al. (2014) provided the first clear argument that saddle points, not local minima, are the primary obstacle in high-dimensional optimisation for deep learning. They showed that the loss surface of neural networks has exponentially more saddle points than local minima, and that the local minima that do exist tend to have loss values close to the global minimum. This shifted the research focus from “escaping local minima” (a 1990s concern) to “escaping saddle points” (the actual problem). The practical impact was a theoretical justification for stochastic methods: SGD’s noise is a feature, not a bug, because it prevents convergence to saddle points.