Momentum and Nesterov
Plain gradient descent can crawl when the loss surface forms a narrow ravine. Momentum adds inertia so updates build up speed in consistent directions and damp out oscillations.
How momentum works
- Keep a running velocity: an exponentially decaying sum of past gradients.
- Update parameters using this velocity rather than the raw gradient.
- A momentum coefficient, typically around 0.9, controls how much of the previous velocity carries over into the next step.
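The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name momentum_step, the learning rate lr, the coefficient name beta, and the toy loss f(x) = sum(x**2) are all assumptions chosen for the example.

```python
import numpy as np

def momentum_step(params, velocity, grad, lr=0.1, beta=0.9):
    """One classical momentum update; returns (new_params, new_velocity)."""
    # Decay the old velocity, then add the new scaled downhill step.
    velocity = beta * velocity - lr * grad
    # Move the parameters by the velocity, not by the raw gradient.
    params = params + velocity
    return params, velocity

def grad_fn(x):
    # Gradient of the toy loss f(x) = sum(x**2) (hypothetical example loss).
    return 2.0 * x

x = np.array([1.0, -2.0])
v = np.zeros_like(x)  # velocity starts at rest
for _ in range(200):
    x, v = momentum_step(x, v, grad_fn(x))

print(x)  # should be very close to the minimum at the origin
```

Some libraries write the same idea with the learning rate applied when the parameters are updated rather than when the velocity is accumulated; with a constant learning rate the two forms are equivalent.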
Why it helps
In a ravine the gradient points mostly across the valley walls and only weakly down the floor. Averaging gradients over time cancels the back-and-forth components across the walls while reinforcing the steady push along the floor, so progress smooths out and accelerates.
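To make the ravine picture concrete, the sketch below compares plain gradient descent with momentum on a toy quadratic whose curvature is 100 times larger in one direction than the other. The loss, the starting point, the learning rate of 0.01, and the step count are assumptions picked for illustration; results on real loss surfaces will differ.

```python
import numpy as np

# Curvatures of a toy ravine: steep across the valley, shallow along the floor.
H = np.array([100.0, 1.0])

def loss(p):
    return 0.5 * np.sum(H * p**2)

def grad(p):
    return H * p

def run_gd(p0, lr, steps):
    p = p0.copy()
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

def run_momentum(p0, lr, beta, steps):
    p = p0.copy()
    v = np.zeros_like(p)
    for _ in range(steps):
        v = beta * v - lr * grad(p)  # accumulate velocity
        p = p + v                    # step along the velocity
    return p

start = np.array([1.0, 1.0])
loss_gd = loss(run_gd(start, lr=0.01, steps=200))
loss_mom = loss(run_momentum(start, lr=0.01, beta=0.9, steps=200))

print("plain GD loss:", loss_gd)
print("momentum loss:", loss_mom)
```

In this setup the learning rate is capped by the steep direction, so plain gradient descent is left creeping along the shallow floor, while the accumulated velocity lets momentum cover that distance much sooner.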
The Nesterov twist
Nesterov momentum looks ahead before computing the gradient. It first takes a partial step in the velocity direction, then measures the gradient at that lookahead point. This anticipatory correction reacts sooner when the slope changes, often giving slightly faster and more stable convergence than plain momentum.
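The lookahead can be sketched as a small change to the momentum update: the gradient is evaluated at params + beta * velocity instead of at params. The function name nesterov_step, the toy ravine loss, and the hyperparameters below are assumptions for illustration, and this is one common formulation among several equivalent ones.

```python
import numpy as np

def nesterov_step(params, velocity, grad_fn, lr=0.01, beta=0.9):
    """One Nesterov momentum update; returns (new_params, new_velocity)."""
    # Partial step in the current velocity direction.
    lookahead = params + beta * velocity
    # Measure the gradient at the lookahead point, not at params.
    g = grad_fn(lookahead)
    velocity = beta * velocity - lr * g
    params = params + velocity
    return params, velocity

# Toy ravine (hypothetical): curvature 100 across the valley, 1 along it.
H = np.array([100.0, 1.0])

def grad_fn(p):
    return H * p

p = np.array([1.0, 1.0])
v = np.zeros_like(p)
for _ in range(200):
    p, v = nesterov_step(p, v, grad_fn)

print(p)  # should be very close to the minimum at the origin
```

Note that the function takes grad_fn rather than a precomputed gradient, because the point at which the gradient is evaluated depends on the current velocity.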
Key idea
Momentum accumulates a velocity to glide through ravines, and Nesterov measures the gradient at a lookahead point for sharper corrections.