SGD with Momentum

Adding velocity to gradient descent so it rolls through noise and ravines.

The plain version

Stochastic gradient descent (SGD) updates weights by stepping along the negative gradient computed on a small batch. Because each batch gives only a noisy estimate of the true gradient, the path can zigzag. In long, narrow valleys of the loss surface progress is also slow, because the gradient points mostly across the valley rather than along it.
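To make the plain update concrete, here is a minimal sketch in pure Python. The one-weight model y = w * x, the synthetic data, the batch size of 4, and the names sgd_step and minibatch_grad are all assumptions chosen for illustration, not part of the lesson.

```python
import random

def sgd_step(w, grad, lr=0.1):
    """One plain SGD update: move each weight against its gradient."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def minibatch_grad(w, batch):
    """Gradient of mean squared error for the toy model y = w[0] * x."""
    n = len(batch)
    return [sum(2 * (w[0] * x - y) * x for x, y in batch) / n]

random.seed(0)
# Hypothetical data: true slope 3.0 plus a little noise.
data = [(i / 10, 3.0 * (i / 10) + random.gauss(0, 0.1)) for i in range(1, 21)]

w = [0.0]
for _ in range(200):
    batch = random.sample(data, 4)  # a small batch gives a noisy gradient estimate
    w = sgd_step(w, minibatch_grad(w, batch), lr=0.05)
```

Each step uses a different random batch, so consecutive gradients disagree slightly. That disagreement is the noise momentum is meant to smooth.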

Adding momentum

Momentum keeps a running velocity that blends the current gradient with past gradients. Instead of stepping by the raw gradient, the optimizer steps by this accumulated velocity.

  • A momentum coefficient, often around 0.9, controls how much history is kept
  • Consistent directions build up speed
  • Conflicting directions cancel out and reduce wobble

The physical picture is a heavy ball rolling downhill. It carries inertia, so it smooths over small bumps and accelerates along consistent slopes.
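The velocity update described above can be sketched as follows. This assumes one common convention, where the velocity accumulates raw gradients and the learning rate is applied when stepping. Other formulations scale the gradient before adding it to the velocity. The names momentum_step, v, and beta are illustrative.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum update (classical heavy-ball form).

    v is the running velocity; beta is the momentum coefficient.
    """
    # Blend the decayed history with the new gradient.
    v_new = [beta * vi + gi for vi, gi in zip(v, grad)]
    # Step by the accumulated velocity instead of the raw gradient.
    w_new = [wi - lr * vi for wi, vi in zip(w, v_new)]
    return w_new, v_new

# Feed 50 gradients that agree, and 50 that flip sign every step.
w = [0.0]
v_same, v_flip = [0.0], [0.0]
for t in range(50):
    _, v_same = momentum_step(w, v_same, [1.0])
    _, v_flip = momentum_step(w, v_flip, [1.0 if t % 2 == 0 else -1.0])
```

With beta = 0.9, the velocity under agreeing gradients grows toward 1 / (1 - beta) = 10 times a single gradient, while under alternating gradients it stays well below 1 in magnitude. That is the "build up speed" and "cancel out" behavior from the list above.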

Why it helps

  • It dampens the zigzag in steep narrow ravines
  • It speeds up movement across flat or gently sloped regions
  • It can carry the weights past shallow local minima and plateaus where plain SGD would stall
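A small experiment can illustrate the ravine case. The sketch below assumes a toy quadratic loss that is 100 times steeper in y than in x, and uses exact gradients without batch noise to keep the comparison deterministic. The loss, learning rate, and starting point are all illustrative choices.

```python
def grad(p):
    """Gradient of the ravine-shaped loss f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return (x, 100.0 * y)

def loss(p):
    x, y = p
    return 0.5 * (x * x + 100.0 * y * y)

def run(beta, lr=0.015, steps=100, start=(10.0, 1.0)):
    """Gradient descent with momentum; beta=0.0 reduces to plain gradient descent."""
    p, v = start, (0.0, 0.0)
    for _ in range(steps):
        g = grad(p)
        v = tuple(beta * vi + gi for vi, gi in zip(v, g))
        p = tuple(pi - lr * vi for pi, vi in zip(p, v))
    return loss(p)

loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(loss_plain, loss_momentum)
```

The steep y direction caps the learning rate, so plain descent crawls along the shallow x direction. With the same learning rate and step count, the momentum run ends at a much lower loss because velocity accumulates along the valley floor. This sketch shows only that speed-up. It does not demonstrate noise smoothing.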

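A refinement called Nesterov momentum looks ahead: it computes the gradient at the position the current velocity is about to carry the weights to, rather than at the current position, which often gives more responsive updates. Momentum SGD remains a strong baseline and is often reported to match or beat adaptive optimizers on vision tasks.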
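The look-ahead idea can be sketched as below. This is one common way to write Nesterov momentum, using the same velocity convention as classical momentum with raw gradients accumulated and the learning rate applied at the step. The names nesterov_step and grad_fn and the toy loss 0.5 * w**2 are assumptions for illustration.

```python
def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov-momentum update in look-ahead form."""
    # Where the existing velocity alone would carry the weights.
    lookahead = [wi - lr * beta * vi for wi, vi in zip(w, v)]
    # Evaluate the gradient at that predicted position, not at w.
    g = grad_fn(lookahead)
    v_new = [beta * vi + gi for vi, gi in zip(v, g)]
    w_new = [wi - lr * vi for wi, vi in zip(w, v_new)]
    return w_new, v_new

# Toy loss 0.5 * w**2, whose gradient is simply w.
w, v = [5.0], [0.0]
for _ in range(100):
    w, v = nesterov_step(w, v, lambda p: list(p))
```

The only difference from classical momentum is where the gradient is evaluated. Because the gradient is taken at the predicted position, the update can start braking before the weights overshoot.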

Key idea

Momentum accumulates a velocity from past gradients so descent rolls smoothly and quickly through noise and ravines.

Check yourself

1. What does momentum accumulate across steps?

2. What problem does momentum mainly fix in plain SGD?