The plain version
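Stochastic gradient descent updates weights by stepping in the direction of the negative gradient computed on a small batch. Because each batch gives only a noisy estimate of the true gradient, the path jitters from step to step. In long narrow valleys of the loss surface it also tends to zigzag across the steep walls while making slow progress along the valley floor.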
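As a rough illustration of the update described above, here is a minimal sketch in plain Python. The names sgd_step, noisy_grad and run_sgd are hypothetical, the loss is an assumed toy quadratic with one shallow and one steep direction, and Gaussian noise stands in for mini-batch sampling error.

```python
import random

def sgd_step(weights, grad, lr):
    """One plain SGD update: move against the mini-batch gradient."""
    return [w - lr * g for w, g in zip(weights, grad)]

def noisy_grad(weights, curvatures, noise, rng):
    """Gradient of 0.5 * sum(c * w**2), plus Gaussian noise.

    The noise is an assumed stand-in for mini-batch sampling error.
    """
    return [c * w + rng.gauss(0.0, noise) for w, c in zip(weights, curvatures)]

def run_sgd(steps, lr=0.07, noise=0.1, seed=0):
    """Run plain SGD on the toy valley and return every visited point."""
    rng = random.Random(seed)
    curvatures = [1.0, 25.0]  # shallow along w[0], steep along w[1]
    w = [10.0, 1.0]
    path = [w]
    for _ in range(steps):
        w = sgd_step(w, noisy_grad(w, curvatures, noise, rng), lr)
        path.append(w)
    return path

if __name__ == "__main__":
    # noise=0.0 isolates the zigzag caused by the valley shape alone
    for step, (w0, w1) in enumerate(run_sgd(steps=6, noise=0.0)):
        print(f"step {step}: w0={w0:.3f}  w1={w1:+.3f}")
```

With the noise switched off, the steep coordinate w1 flips sign on every step (its per-step factor is 1 - 0.07 * 25 = -0.75) while the shallow coordinate w0 shrinks by only 7% per step, which is the zigzag-plus-slow-progress pattern in miniature.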
Adding momentum
Momentum keeps a running velocity that blends the current gradient with past gradients. Instead of stepping by the raw gradient, the optimizer steps by this accumulated velocity.
- A momentum coefficient, often around 0.9, controls how much history is kept
- Consistent directions build up speed
- Conflicting directions cancel out and reduce wobble
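The velocity update can be sketched as follows. This assumes one common convention (velocity = mu * velocity + gradient, then a step of lr * velocity); other formulations fold the learning rate into the velocity instead. The function name momentum_step and the default values are illustrative, not taken from any particular library.

```python
def momentum_step(weights, velocity, grad, lr=0.01, mu=0.9):
    """One SGD-with-momentum update (assumed convention, see lead-in).

    velocity <- mu * velocity + grad
    weights  <- weights - lr * velocity
    """
    new_velocity = [mu * v + g for v, g in zip(velocity, grad)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

def velocity_after(grads, mu=0.9):
    """Feed a sequence of scalar gradients through the velocity update."""
    w, v = [0.0], [0.0]
    for g in grads:
        w, v = momentum_step(w, v, [g], mu=mu)
    return v[0]

if __name__ == "__main__":
    # Consistent direction: velocity grows toward 1 / (1 - mu) = 10
    print(velocity_after([1.0] * 50))
    # Conflicting directions: contributions largely cancel
    print(velocity_after([1.0, -1.0] * 25))
```

With a constant gradient of 1 the velocity approaches 1 / (1 - mu), so a coefficient of 0.9 amplifies a consistent direction by up to roughly ten times. With a gradient that alternates in sign, the velocity stays smaller in magnitude than any single gradient.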
The physical picture is a heavy ball rolling downhill. It carries inertia, so it smooths over small bumps and accelerates along consistent slopes.
Why it helps
- It dampens the zigzag in steep narrow ravines
- It speeds up movement across flat or gently sloped regions
- It can carry the weights through small stalls such as plateaus, saddle points, and shallow local minima
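A small experiment can make the speed-up concrete. The sketch below is an assumption-laden toy, not a benchmark: it uses a noise-free quadratic valley with hypothetical curvatures of 1 and 25, the same learning rate for both runs, and a helper named descend in which mu=0.0 reduces to plain gradient descent.

```python
def grad(w, curvatures):
    """Gradient of the toy loss 0.5 * sum(c * w**2)."""
    return [c * x for x, c in zip(w, curvatures)]

def loss(w, curvatures):
    return 0.5 * sum(c * x * x for x, c in zip(w, curvatures))

def descend(mu, steps=100, lr=0.02):
    """Return the final loss after `steps` updates with momentum `mu`."""
    curvatures = [1.0, 25.0]  # shallow along w[0], steep along w[1]
    w = [10.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w, curvatures)
        v = [mu * vi + gi for vi, gi in zip(v, g)]   # accumulate velocity
        w = [wi - lr * vi for wi, vi in zip(w, v)]   # step by velocity
    return loss(w, curvatures)

if __name__ == "__main__":
    print("plain   :", descend(mu=0.0))
    print("momentum:", descend(mu=0.9))
```

On this toy problem the momentum run ends with a much lower loss, mostly because it covers the shallow direction faster. The example only illustrates that one effect; it says nothing about noisy gradients or non-convex surfaces.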
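A refinement called Nesterov momentum looks ahead: it computes the gradient at the position the current velocity is about to carry the weights to, rather than at the current weights, which often gives more responsive updates. Momentum SGD remains a strong baseline and is still competitive with, and sometimes better than, adaptive optimizers on many vision tasks.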
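The look-ahead idea can be sketched like this, again assuming the convention where the velocity accumulates raw gradients and the learning rate is applied at the step. The name nesterov_step and the grad_fn callback are hypothetical; deep learning libraries typically use an algebraically rearranged form that avoids evaluating the gradient at a separate point.

```python
def nesterov_step(weights, velocity, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov momentum update (sketch under the stated convention).

    The gradient is evaluated at the look-ahead point, i.e. where the
    current velocity alone would move the weights.
    """
    lookahead = [w - lr * mu * v for w, v in zip(weights, velocity)]
    grad = grad_fn(lookahead)
    new_velocity = [mu * v + g for v, g in zip(velocity, grad)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

if __name__ == "__main__":
    # Toy loss f(w) = sum(w**2), so the gradient is 2 * w
    grad_fn = lambda w: [2.0 * x for x in w]
    w, v = [1.0], [0.0]
    for step in range(5):
        w, v = nesterov_step(w, v, grad_fn, lr=0.1)
        print(step, w[0])
```

When the velocity is zero the look-ahead point coincides with the current weights, so the first step is identical to a plain gradient step. The difference only appears once some velocity has built up.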
Key idea
Momentum accumulates a velocity from past gradients so descent rolls smoothly and quickly through noise and ravines.