Momentum and Nesterov

Giving gradient descent inertia to glide through ravines.

Plain gradient descent can crawl when the loss surface forms a narrow ravine. Momentum adds inertia so updates build up speed in consistent directions and damp out oscillations.

How momentum works

  • Keep a running velocity: an exponentially decaying sum of past gradients.
  • Update the parameters using this velocity rather than the raw gradient.
  • A momentum coefficient, commonly set near 0.9, controls how much of the past direction persists.
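
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name momentum_step, the learning rate lr, and the coefficient name beta are assumptions chosen for this example, and the toy objective f(w) = 0.5 * w**2 is used only because its gradient is simply w.

```python
import numpy as np

def momentum_step(params, velocity, grad, lr=0.1, beta=0.9):
    """One momentum update (hypothetical helper for illustration).

    Returns the new parameters and the new velocity.
    """
    # Decay the old velocity and add the current gradient.
    velocity = beta * velocity + grad
    # Move along the velocity instead of the raw gradient.
    params = params - lr * velocity
    return params, velocity

# Toy objective f(w) = 0.5 * w**2, whose gradient is w itself.
w = np.array([1.0])
v = np.zeros_like(w)
for _ in range(3):
    w, v = momentum_step(w, v, grad=w)
```

Some libraries fold the learning rate into the velocity or scale the gradient by (1 - beta); these variants differ only in how lr and beta interact, so the form shown here is one common convention rather than the only one.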

Why it helps

In a ravine the gradient points mostly across the valley, toward the steep walls, and only weakly along the floor. The across-valley components flip sign from step to step, so they largely cancel in the running velocity, while the along-floor components keep the same sign and add up. Oscillations shrink and progress along the floor accelerates.
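
A small experiment can make the ravine argument concrete. The sketch below assumes a two-dimensional quadratic with curvature 1 along the floor and 100 across the walls; the curvatures, starting point, learning rate, and step count are all illustrative choices, not values from the lesson. Setting beta to zero recovers plain gradient descent.

```python
import numpy as np

# Assumed toy ravine: gentle along the floor, steep across the walls.
CURVATURE = np.array([1.0, 100.0])

def loss(w):
    return 0.5 * np.sum(CURVATURE * w**2)

def grad(w):
    return CURVATURE * w

def run(beta, lr=0.015, steps=100):
    """Run momentum descent and return the final loss (beta=0 is plain GD)."""
    w = np.array([10.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)   # accumulate velocity
        w = w - lr * v           # step along the velocity
    return loss(w)

gd_loss = run(beta=0.0)
momentum_loss = run(beta=0.9)
print(f"plain GD loss: {gd_loss:.4g}, momentum loss: {momentum_loss:.4g}")
```

With these settings the steep direction caps the usable learning rate, so plain gradient descent creeps along the floor, while the momentum run ends at a much lower loss after the same number of steps.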

The Nesterov twist

Nesterov momentum looks ahead before computing the gradient. It first takes a provisional step along the current velocity, then evaluates the gradient at that lookahead point rather than at the current parameters. Because the gradient already reflects where the velocity is about to carry the parameters, the correction arrives sooner when the slope changes, which often gives slightly faster and more stable convergence than plain momentum.
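
The lookahead can be sketched as a single extra line compared with plain momentum. As before, this is a hedged illustration: nesterov_step, grad_fn, lr, and beta are names assumed for this example, and the toy objective f(w) = 0.5 * w**2 is chosen only for its simple gradient.

```python
def nesterov_step(params, velocity, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov momentum update (hypothetical helper for illustration)."""
    # Provisional step along the current velocity.
    lookahead = params - lr * beta * velocity
    # Measure the gradient at the lookahead point, not at params.
    velocity = beta * velocity + grad_fn(lookahead)
    # Apply the corrected velocity to the actual parameters.
    params = params - lr * velocity
    return params, velocity

# Toy objective f(w) = 0.5 * w**2, whose gradient is w itself.
w, v = 1.0, 0.0
for _ in range(3):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x)
```

The only difference from plain momentum is where the gradient is evaluated. Passing a gradient function rather than a precomputed gradient is a design choice of this sketch, since the lookahead point is not known to the caller in advance.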

Key idea

Momentum accumulates a velocity to glide through ravines, and Nesterov measures the gradient at a lookahead point for sharper corrections.

Check yourself

1. What does momentum accumulate?

2. How does Nesterov differ from plain momentum?