SGD with Momentum

The plain version

Stochastic gradient descent updates weights by stepping in the direction of the negative gradient computed on a small batch. Because each batch is noisy, the path can zigzag and progress slowly in long narrow valleys of the loss surface.
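As a minimal sketch of the update just described, the step below moves each weight against its gradient. The function name `sgd_step`, the learning rate `lr`, and the sample numbers are illustrative assumptions, not taken from any particular library.

```python
def sgd_step(weights, grad, lr):
    """One plain SGD update: move each weight against its gradient."""
    return [w - lr * g for w, g in zip(weights, grad)]

# Hypothetical gradient computed on one small batch, for two weights.
weights = [1.0, 2.0]
batch_grad = [0.5, -1.0]

new_weights = sgd_step(weights, batch_grad, lr=0.1)
```

Each call uses only the gradient from the current batch, so a noisy batch moves the weights in a noisy direction.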

Adding momentum

Momentum keeps a running velocity: an exponentially decaying sum of past gradients, updated with the current gradient at every step. Instead of stepping by the raw gradient, the optimizer steps by this accumulated velocity.

A momentum coefficient, often around 0.9, controls how much history is kept
Consistent directions build up speed
Conflicting directions cancel out and reduce wobble
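The velocity update can be sketched as follows. This assumes one common convention, in which the velocity accumulates raw gradients and the learning rate is applied when the weights move; other write-ups fold the learning rate into the velocity instead. The names `momentum_step`, `mu`, and `lr` are illustrative.

```python
def momentum_step(weights, velocity, grad, lr, mu=0.9):
    """One SGD-with-momentum update.

    velocity <- mu * velocity + grad   (keep a fraction mu of the history)
    weights  <- weights - lr * velocity
    """
    velocity = [mu * v + g for v, g in zip(velocity, grad)]
    weights = [w - lr * v for w, v in zip(weights, velocity)]
    return weights, velocity

# A constant gradient stands in for a "consistent direction".
w, v = [0.0], [0.0]
for _ in range(200):
    w, v = momentum_step(w, v, grad=[1.0], lr=0.1)
```

Under this convention, a constant gradient makes the velocity approach 1 / (1 - mu) times that gradient, so a coefficient of 0.9 can enlarge the effective step by up to about ten times. Gradients that alternate in sign largely cancel inside the velocity.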

The physical picture is a heavy ball rolling downhill. It carries inertia, so it smooths over small bumps and accelerates along consistent slopes.

Why it helps

It dampens the zigzag in steep narrow ravines
It speeds up movement across flat or gently sloped regions
It can carry the model past shallow local minima and plateaus where the gradient is close to zero
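A small experiment can make these effects concrete. The sketch below is a toy setup chosen for illustration, not a benchmark: a two-dimensional quadratic "ravine" with curvature 1 along one axis and 50 along the other, exact gradients with no batch noise, and hand-picked values for `lr`, `steps`, and the starting point. Setting `mu=0.0` reduces the loop to plain gradient descent.

```python
def grad(w, curvatures):
    """Gradient of the loss 0.5 * sum(c_i * w_i**2)."""
    return [c * wi for c, wi in zip(curvatures, w)]

def loss(w, curvatures):
    return 0.5 * sum(c * wi * wi for c, wi in zip(curvatures, w))

def run(mu, lr=0.03, steps=200, curvatures=(1.0, 50.0)):
    """Run gradient descent with momentum mu and return the final loss."""
    w = [1.0, 1.0]   # start on the wall of the ravine
    v = [0.0, 0.0]   # velocity starts at rest
    for _ in range(steps):
        g = grad(w, curvatures)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return loss(w, curvatures)

plain_loss = run(mu=0.0)      # plain gradient descent
momentum_loss = run(mu=0.9)   # same learning rate, with momentum

print(f"plain:    {plain_loss:.3e}")
print(f"momentum: {momentum_loss:.3e}")
```

In this setup the steep direction caps the usable learning rate, so plain descent creeps along the shallow direction. With the same learning rate, the momentum run should finish at a lower loss because velocity builds up along the shallow axis. Different learning rates or curvatures can change the size of the gap.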

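A refinement called Nesterov momentum looks ahead: it computes the gradient at the position the current velocity is about to carry the weights to, rather than at the current position. This lets the update correct itself sooner when the velocity is about to overshoot. Momentum SGD remains a strong baseline and still matches or beats adaptive optimizers on many vision tasks.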
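The look-ahead step can be sketched like this, assuming the velocity accumulates raw gradients and the learning rate is applied when the weights move. The names `nesterov_step` and `grad_fn` are illustrative, and the toy loss 0.5 * w**2 is an assumption made only to have a gradient to call. Libraries may use an algebraically rearranged form of this update.

```python
def nesterov_step(weights, velocity, grad_fn, lr, mu=0.9):
    """One Nesterov momentum update.

    The gradient is evaluated at the look-ahead point, i.e. where the
    current velocity alone would carry the weights.
    """
    lookahead = [w - lr * mu * v for w, v in zip(weights, velocity)]
    g = grad_fn(lookahead)
    velocity = [mu * v + gi for v, gi in zip(velocity, g)]
    weights = [w - lr * v for w, v in zip(weights, velocity)]
    return weights, velocity

def toy_grad(w):
    """Gradient of the toy loss 0.5 * w**2 (an illustrative assumption)."""
    return list(w)

w, v = [1.0], [0.0]
w, v = nesterov_step(w, v, toy_grad, lr=0.1)   # first step: no velocity yet
w, v = nesterov_step(w, v, toy_grad, lr=0.1)   # second step uses look-ahead
```

The only difference from classical momentum is where the gradient is taken. On the first step the velocity is zero, so the two methods coincide.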

Key idea

Momentum accumulates a velocity from past gradients so descent rolls smoothly and quickly through noise and ravines.
