Machine Learning

Gradient Descent Variants

Batch, stochastic, and mini batch ways to step downhill.

5 min read · core

Stepping down the loss

Gradient descent updates the weights by stepping against the gradient of the loss, the direction in which the loss falls fastest locally. The variants differ in how much data they use to estimate that gradient at each step.

  • Batch uses the entire dataset for one update.
  • Stochastic uses a single example per update.
  • Mini batch uses a small group, the common middle ground.
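
As a rough sketch, the three variants can share one training loop in which only the batch size changes. The toy data, the `gradient` and `train` helpers, the learning rate, and the epoch count below are all assumptions made for illustration, not part of the lesson.

```python
import random


def gradient(w, b, batch):
    """Mean squared error gradient for the line y = w*x + b over one batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def train(data, batch_size, lr=0.1, epochs=500, seed=0):
    """Gradient descent where batch_size alone selects the variant."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not touch the caller's list
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # One update per slice of batch_size examples.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w, grad_b = gradient(w, b, batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical toy data lying exactly on the line y = 2x + 1.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]

batch_fit = train(data, batch_size=len(data))  # batch: one update per pass
sgd_fit = train(data, batch_size=1)            # stochastic: one example per update
mini_fit = train(data, batch_size=4)           # mini batch: a small group per update
```

On this noise-free toy data all three calls should end up close to w = 2, b = 1. What differs is the number of updates per pass over the data: 1 for batch, 20 for stochastic, and 5 for mini batch.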

The tradeoffs

Batch descent gives a smooth, exact gradient, but each step is slow and memory heavy on large data. Stochastic descent makes every update cheap, and its noise can help escape shallow local minima, but the path jitters, so the loss fluctuates instead of settling smoothly.

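Mini batch combines the strengths. A batch of, say, a few hundred examples gives a reasonably stable gradient estimate while still updating often and fitting in memory. It also maps well to parallel hardware such as GPUs, because the examples in a batch can be processed at the same time.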
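
To get a feel for this tradeoff, the sketch below measures two things for several batch sizes: how many updates one pass over the data allows, and how much the gradient estimate varies from batch to batch. The noisy data, the one-weight model, and the `grad_w` and `summarize` helpers are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical noisy data scattered around the line y = 3x.
xs = [rng.uniform(-1, 1) for _ in range(1000)]
data = [(x, 3 * x + rng.gauss(0, 1)) for x in xs]


def grad_w(w, batch):
    """Mean squared error gradient for the one-weight model y = w*x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def summarize(batch_size, w=0.0, trials=200):
    """Draw many random batches and report how much their gradients disagree."""
    estimates = [grad_w(w, rng.sample(data, batch_size)) for _ in range(trials)]
    return {
        "updates_per_epoch": math.ceil(len(data) / batch_size),
        "gradient_spread": statistics.pstdev(estimates),
    }


# Stochastic, mini batch, and full batch on the same data.
results = {size: summarize(size) for size in (1, 32, len(data))}
for size, stats in results.items():
    print(size, stats)
```

Batch size 1 allows 1000 updates per pass but has the widest spread, the full dataset allows a single update with essentially no spread, and 32 sits in between on both counts.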

Why noise can help

The randomness in stochastic and mini batch steps is not purely a flaw. On a loss surface with many dips, the wobble lets the optimizer bounce out of shallow local minima and keep exploring, which can lead to better solutions than a perfectly smooth descent.
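
One common way to write the wobble down is as the true gradient plus zero-mean noise. This is a sketch under stated assumptions: examples are drawn uniformly at random with replacement, and the symbols are not from the lesson ($g_B$ is the gradient estimated from $B$ examples, $\varepsilon_B$ is its noise, and $\sigma^2$ is the variance of one component of a single-example gradient).

```latex
g_B(w) = \nabla L(w) + \varepsilon_B, \qquad
\mathbb{E}[\varepsilon_B] = 0, \qquad
\operatorname{Var}[\varepsilon_B] = \frac{\sigma^2}{B}
```

Under these assumptions the step points downhill on average, and a smaller $B$ means a larger wobble around that average.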

Key idea

Gradient descent variants trade gradient accuracy against update speed, and mini batch hits the practical balance of stable gradients, frequent updates, and helpful noise.

Check yourself

1. How much data does mini batch descent use per update?

2. How can the noise in stochastic updates help?