Machine Learning

Stochastic Gradient Descent

Estimate the gradient from one example at a time for fast noisy progress.

The idea

Stochastic gradient descent (SGD) updates parameters using the gradient from a single training example rather than the whole dataset.

  • Each example gives a noisy estimate of the true gradient.
  • Updates are cheap and frequent.
  • Over many steps the noise averages out: when examples are drawn at random, the single-example gradient equals the full gradient in expectation.
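
The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the model (a line y = w * x + b), the squared-error loss, the toy data, and the names sgd_linear, lr, and epochs are all assumptions chosen for the example.

```python
import random

def sgd_linear(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by SGD on squared error, one example per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                 # random order on every pass
        for i in indices:
            err = (w * xs[i] + b) - ys[i]    # error on ONE example
            # Gradient of 0.5 * err**2 with respect to w and b
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b

# Hypothetical toy data generated from y = 3x + 1 (no label noise)
xs = [i / 10 for i in range(-20, 21)]
ys = [3.0 * x + 1.0 for x in xs]

w, b = sgd_linear(xs, ys)
print(w, b)
```

With this noise-free data, w and b should settle very close to 3 and 1 even though no single update ever looks at more than one example.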

Why use it

Computing the full gradient over millions of examples is expensive. SGD makes many small updates per pass, so the model starts improving long before seeing all the data.

  • It scales to huge datasets.
  • The noise can help the parameters escape shallow local minima.
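
To make the cost argument concrete, the sketch below spends the same budget (one pass over the data) in two ways: full-batch gradient descent takes a single step, while SGD takes one step per example. The toy data, the learning rate, and the helper name half_mse are assumptions made for illustration only.

```python
import random

def half_mse(w, b, xs, ys):
    """Mean of 0.5 * squared error over the whole dataset."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Hypothetical toy data generated from y = 3x + 1
xs = [i / 10 for i in range(-20, 21)]
ys = [3.0 * x + 1.0 for x in xs]
n = len(xs)
lr = 0.05
w0, b0 = 0.0, 0.0

# Full-batch gradient descent: read all n examples, then take ONE step.
gw = sum((w0 * x + b0 - y) * x for x, y in zip(xs, ys)) / n
gb = sum((w0 * x + b0 - y) for x, y in zip(xs, ys)) / n
w_gd, b_gd = w0 - lr * gw, b0 - lr * gb

# SGD: read the same n examples, taking a step after EACH one.
rng = random.Random(0)
order = list(range(n))
rng.shuffle(order)
w_sgd, b_sgd = w0, b0
for i in order:
    err = w_sgd * xs[i] + b_sgd - ys[i]
    w_sgd -= lr * err * xs[i]
    b_sgd -= lr * err

loss_start = half_mse(w0, b0, xs, ys)
loss_gd = half_mse(w_gd, b_gd, xs, ys)
loss_sgd = half_mse(w_sgd, b_sgd, xs, ys)
print(loss_start, loss_gd, loss_sgd)
```

Both methods read every example exactly once here, but SGD has made n updates by the end of the pass, so its loss should be far lower than after the single full-batch step.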

The tradeoff

The path is jittery because each step trusts a single sample. A decreasing learning rate helps the updates settle near a minimum instead of bouncing around it.
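
The effect of a decreasing learning rate can be seen on a deliberately tiny problem: estimating the mean of noisy samples by running SGD on 0.5 * (w - x)**2. This is a sketch under stated assumptions. The 1/t schedule, the constant rate of 0.5, the Gaussian noise, and the function name tail_jitter are illustrative choices, not something the lesson prescribes.

```python
import random

def tail_jitter(lr_schedule, steps=5000, tail=1000, true_mean=2.0, seed=0):
    """Run SGD on 0.5 * (w - x)**2 and return the mean squared distance
    of w from true_mean over the last `tail` steps."""
    rng = random.Random(seed)
    w = 0.0
    sq_dists = []
    for t in range(1, steps + 1):
        x = rng.gauss(true_mean, 1.0)    # one noisy sample per step
        grad = w - x                     # gradient of 0.5 * (w - x)**2
        w -= lr_schedule(t) * grad
        if t > steps - tail:
            sq_dists.append((w - true_mean) ** 2)
    return sum(sq_dists) / len(sq_dists)

constant = tail_jitter(lambda t: 0.5)       # fixed step size
decaying = tail_jitter(lambda t: 1.0 / t)   # step size shrinks over time
print(constant, decaying)
```

Both runs see the same samples because they share a seed. With the constant rate, w keeps chasing each new sample and never stops bouncing; with the 1/t schedule, w is exactly the running average of the samples seen so far, so its late-stage jitter should be much smaller.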

SGD trades exact gradients for speed, and in practice that trade usually wins on large problems.

Key idea

SGD estimates the gradient from one example per step, giving cheap noisy updates that converge in aggregate and scale to large datasets.

Check yourself

1. How many examples does pure SGD use per update?

2. Why is the SGD path noisy?