Saddle Points

In high-dimensional loss landscapes, most critical points (where the gradient is zero) are saddle points, not local minima: the point is a minimum along some directions and a maximum along others. Gradient descent can slow to a crawl near saddle points, wasting compute on negligible updates.

Imagine walking on a mountain pass: you’re at the lowest point between two peaks (a minimum along the east-west axis) but also at the highest point of the ridge between two valleys (a maximum along the north-south axis). The gradient is zero because you’re at a critical point, but it’s not a minimum — if you could take one step north or south, you’d start descending again.
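The mountain-pass picture has a standard algebraic form. As a toy illustration (not from the source), the function f(x, y) = x² − y² has exactly this geometry at the origin: zero gradient, but a Hessian with one positive and one negative eigenvalue.

```python
import numpy as np

# Classic 2D saddle: f(x, y) = x^2 - y^2
# (a minimum along the x-axis, a maximum along the y-axis).
def f(p):
    x, y = p
    return x**2 - y**2

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

origin = np.array([0.0, 0.0])
print(grad(origin))           # the gradient vanishes at the origin: [0. 0.]

# The Hessian of f is constant; its mixed-sign eigenvalues are
# exactly what makes the origin a saddle point, not a minimum.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.eigvalsh(H))  # one negative, one positive eigenvalue
```

The mixed-sign eigenvalues correspond to the east-west valley and the north-south ridge in the analogy.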

In low dimensions (2D, 3D), local minima are common and saddle points are rare. In high dimensions, the opposite is true. A critical point is a local minimum only if the Hessian is positive definite, i.e. all its eigenvalues are positive. For a random symmetric matrix, the probability that every eigenvalue is positive decreases exponentially with dimension. In 1000 dimensions, a critical point that happens to curve upwards in all 1000 directions is astronomically unlikely. Almost all critical points are saddle points.
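The exponential decay is easy to observe empirically. The sketch below (an illustration, not from the source) samples random symmetric matrices as stand-ins for Hessians and measures how often all eigenvalues come out positive as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(n, trials=2000):
    """Fraction of random symmetric n x n matrices with all eigenvalues > 0."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2          # symmetrise, as a Hessian would be
        if np.all(np.linalg.eigvalsh(H) > 0):
            count += 1
    return count / trials

for n in (1, 2, 4, 8):
    print(n, frac_positive_definite(n))
```

Already by dimension 8 the fraction is essentially zero; at the dimensions of neural network parameter spaces, a randomly-curved critical point being a minimum is vanishingly unlikely.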

The good news is that saddle points aren’t permanent traps for SGD. The stochasticity of mini-batch gradients and the velocity accumulated by momentum help the optimiser escape. But they can cause long plateaus where training appears stuck, which wastes compute and can mislead practitioners into thinking the model has converged.
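The escape mechanism can be demonstrated on the toy saddle from before. In this sketch (an illustration under simplified assumptions, not a claim about real training), plain gradient descent started exactly on the ridge of f(x, y) = x² − y² converges to the saddle, while adding noise to the gradient lets the optimiser fall into the descent direction.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(p):                       # gradient of f(x, y) = x^2 - y^2
    return np.array([2 * p[0], -2 * p[1]])

def run(noise_scale, steps=200, lr=0.05):
    p = np.array([0.5, 0.0])       # start exactly on the saddle's ridge (y = 0)
    for _ in range(steps):
        g = grad(p) + noise_scale * rng.standard_normal(2)
        p = p - lr * g
    return p

print(run(0.0))    # plain GD: y stays exactly 0, converging to the saddle
print(run(0.1))    # noisy "SGD": any perturbation in y is amplified; it escapes
```

The ridge is an unstable equilibrium: the negative-curvature direction amplifies any perturbation geometrically, which is why even small mini-batch noise is enough to escape.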

In practice:

  • Loss plateaus for many steps then suddenly drops — the optimiser was near a saddle point and finally found a descent direction
  • Gradient norms become very small without the loss reaching a good value — distinguishes saddle points from true convergence
  • Training with full-batch gradient descent is much more susceptible to getting stuck than SGD — mini-batch noise helps escape
  • Second-order methods (which use curvature information) navigate saddle points efficiently but are too expensive for large models
Where this shows up:

  • NN training (nn-training/): the primary context — deep networks have loss landscapes dominated by saddle points; momentum-based optimisers (Adam, SGD+momentum) are preferred partly because they escape saddle points faster
  • Policy gradient (policy-gradient/): policy optimisation landscapes also have saddle points — entropy regularisation helps by maintaining exploration of the parameter space
  • GANs (gans/): the GAN minimax objective has saddle points by construction (the Nash equilibrium is a saddle point of the minimax loss) — this is one reason GAN training is hard
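The second bullet above — small gradient norm, poor loss — can be confirmed definitively on small problems by checking curvature: a negative Hessian eigenvalue at a near-critical point identifies a saddle rather than a minimum. This sketch uses a finite-difference Hessian (only affordable for toy problems; `num_hessian` is an illustrative helper, not a library function).

```python
import numpy as np

def f(p):                                   # toy loss with a saddle at the origin
    x, y = p
    return x**2 - y**2

def num_hessian(f, p, eps=1e-4):
    """Finite-difference Hessian estimate (O(n^2) evaluations; toy scale only)."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += eps; pp[j] += eps
            pm = p.copy(); pm[i] += eps; pm[j] -= eps
            mp = p.copy(); mp[i] -= eps; mp[j] += eps
            mm = p.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * eps**2)
    return H

p = np.array([0.0, 0.0])                    # a zero-gradient point with bad loss
eigs = np.linalg.eigvalsh(num_hessian(f, p))
print(eigs)   # a negative eigenvalue here flags a saddle, not convergence
```

For large models, the same information is typically obtained with Hessian-vector products rather than an explicit Hessian.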
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| SGD with momentum | Momentum accumulates velocity, carrying the optimiser past saddle points | (standard optimiser) |
| Adam / AdamW | Adaptive learning rates per parameter — can navigate around saddle points via the second moment | nn-training/ |
| Mini-batch noise | Stochastic gradients perturb the optimiser away from saddle points | (inherent in SGD) |
| Learning rate warmup | Gradual ramp-up prevents the optimiser from settling near early saddle points | atomic-concepts/optimisation-primitives/learning-rate-warmup.md |
| Entropy regularisation | In RL, maintains policy stochasticity, preventing the policy from collapsing to a saddle point | atomic-concepts/regularisation/entropy-regularisation.md |
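The momentum mechanism in the first row can be sketched in one dimension. In this toy (an illustration, not from the source), the loss has a perfectly flat plateau standing in for the near-zero-gradient region around a saddle: plain gradient descent stalls there, while the heavy-ball update keeps moving on accumulated velocity.

```python
def grad(x):
    # Piecewise slope: downhill for x < 1, a flat plateau for 1 <= x <= 3
    # (a crude stand-in for a saddle's plateau), downhill again beyond.
    if x < 1.0:
        return -1.0
    if x <= 3.0:
        return 0.0
    return -1.0

def descend(momentum, steps=100, lr=0.05):
    x, v = 0.0, 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad(x)   # heavy-ball update: v accumulates
        x = x + v
    return x

print(descend(0.0))   # plain GD: stalls at the plateau's edge (x stays near 1)
print(descend(0.9))   # momentum: coasts across the plateau (x ends beyond 3)
```

With momentum = 0, the update is zero wherever the gradient is zero; with momentum = 0.9, the velocity decays only geometrically, so the optimiser carries enough speed to cross the flat region.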

Dauphin et al. (2014) provided the first clear argument that saddle points, not local minima, are the primary obstacle in high-dimensional optimisation for deep learning. They showed that the loss surface of neural networks has exponentially more saddle points than local minima, and that the local minima that do exist tend to have loss values close to the global minimum. This shifted the research focus from “escaping local minima” (a 1990s concern) to “escaping saddle points” (the actual problem). The practical impact was a theoretical justification for stochastic methods: SGD’s noise is a feature, not a bug, because it prevents convergence to saddle points.