What a saddle is
A saddle point has a zero gradient yet is not a minimum. The surface curves up in some directions and down in others, like a horse saddle.
- The gradient vanishes, so naive descent slows.
- It is neither a peak nor a valley.
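A small illustration may help here. The function f(x, y) = x^2 - y^2 is a common textbook saddle; it is chosen for this sketch and is not a function taken from these notes. The gradient is zero at the origin, yet the value rises along x and falls along y.

```python
import math

def f(x, y):
    # Illustrative saddle surface: curves up along x, down along y.
    return x**2 - y**2

def grad(x, y):
    # Analytic gradient of f.
    return (2 * x, -2 * y)

# The gradient has zero length at the origin, so the origin is a critical point.
grad_norm_at_origin = math.hypot(*grad(0.0, 0.0))

# Take a small step away from the origin along each axis.
step = 0.1
change_along_x = f(step, 0.0) - f(0.0, 0.0)  # positive: the surface rises
change_along_y = f(0.0, step) - f(0.0, 0.0)  # negative: the surface falls

print(grad_norm_at_origin, change_along_x, change_along_y)
```

Because the value goes up in one direction and down in another, the origin is neither a peak nor a valley.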
Why they matter
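In high-dimensional loss surfaces, saddle points are far more common than bad local minima. A critical point is a minimum only if the surface curves up in every direction at once; with many parameters, at least one direction usually curves down. Being stuck is therefore rarely permanent, provided the gradient along a descending direction is large enough to move the parameters.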
- Plateaus around saddles slow training.
- Pure gradient descent can crawl for a long time near them.
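The crawl can be reproduced on a toy problem. The sketch below assumes a hypothetical two-parameter loss, x^2 + y^4/4 - y^2/2, which has a saddle at (0, 0) and minima at (0, 1) and (0, -1). The loss, the starting point, and the learning rate are all illustrative choices, not values from these notes.

```python
def loss(x, y):
    # Toy loss: saddle at (0, 0) with value 0, minima at (0, +1) and (0, -1) with value -0.25.
    return x**2 + y**4 / 4 - y**2 / 2

def grad(x, y):
    # Analytic gradient of the toy loss.
    return 2 * x, y**3 - y

def run_descent(x, y, lr=0.1, steps=400):
    # Plain gradient descent; records the loss before the first step and after every step.
    history = [loss(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        history.append(loss(x, y))
    return history

# Start almost exactly on the line y = 0, which leads straight into the saddle.
history = run_descent(x=1.0, y=1e-6)
for step in (0, 25, 50, 100, 150, 200, 400):
    print(f"step {step:3d}  loss {history[step]: .6f}")
```

With these settings the loss falls quickly at first as x shrinks, then sits near the saddle's value of 0 for on the order of a hundred steps, and only then drops toward -0.25 once the tiny y component has grown large enough to matter.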
Escaping them
At a saddle, the Hessian (the matrix of second derivatives) has both positive and negative eigenvalues. SGD noise and momentum both help push parameters off the flat region and into a descending direction.
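The eigenvalue test can be sketched for a two-parameter function. The code below estimates the Hessian by finite differences and reads off the signs of its eigenvalues. The toy loss and the helper names (hessian_2d, eigenvalues_sym_2x2, classify) are hypothetical, introduced only for illustration, and the result is meaningful only at points where the gradient is already zero.

```python
import math

def loss(x, y):
    # Toy loss (illustrative): saddle at (0, 0), minima at (0, +1) and (0, -1).
    return x**2 + y**4 / 4 - y**2 / 2

def hessian_2d(f, x, y, h=1e-4):
    # Central finite-difference estimates of the second derivatives of f at (x, y).
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, c):
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first.
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius

def classify(f, x, y, tol=1e-6):
    # Assumes (x, y) is a critical point of f, i.e. the gradient there is zero.
    low, high = eigenvalues_sym_2x2(*hessian_2d(f, x, y))
    if low < -tol and high > tol:
        return "saddle"       # mixed signs: curves up and down
    if low > tol:
        return "minimum"      # all positive: curves up everywhere
    if high < -tol:
        return "maximum"      # all negative: curves down everywhere
    return "degenerate"       # a near-zero eigenvalue: the test is inconclusive

print(classify(loss, 0.0, 0.0))
print(classify(loss, 0.0, 1.0))
```

For real networks the full Hessian is usually too large to form, so this direct check is practical only for small problems like this one.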
Recognizing saddles explains why training can stall on a plateau yet later resume progress.
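The effect of noise can be seen in a minimal sketch. It assumes the same kind of toy loss with a saddle at (0, 0), and it uses Gaussian noise added to the gradient as a rough stand-in for minibatch noise in SGD. That noise model, the noise scale, and the function name descend are assumptions made for illustration.

```python
import random

def loss(x, y):
    # Toy loss (illustrative): saddle at (0, 0), minima at (0, +1) and (0, -1).
    return x**2 + y**4 / 4 - y**2 / 2

def grad(x, y):
    # Analytic gradient of the toy loss.
    return 2 * x, y**3 - y

def descend(x, y, noise_std, lr=0.1, steps=500, seed=0):
    # Gradient descent with optional Gaussian noise added to each gradient component.
    rng = random.Random(seed)
    for _ in range(steps):
        gx, gy = grad(x, y)
        if noise_std > 0:
            gx += rng.gauss(0.0, noise_std)
            gy += rng.gauss(0.0, noise_std)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Both runs start exactly on the saddle, where the true gradient is zero.
stuck = descend(0.0, 0.0, noise_std=0.0)
escaped = descend(0.0, 0.0, noise_std=0.01)

print("no noise:  ", stuck, loss(*stuck))
print("with noise:", escaped, loss(*escaped))
```

Without noise the parameters never leave the saddle, because every update is exactly zero. With even a small amount of noise, a perturbation along the descending direction gets amplified step after step until the run settles near one of the two minima.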
Key idea
A saddle point has a zero gradient but curves up and down in different directions, stalling naive descent until noise or momentum pushes parameters into a descending direction.