Gradient Descent Variants

Stepping down the loss

Gradient descent updates weights by moving them a small step against the gradient of the loss, the direction in which the loss falls fastest. The variants differ in how much data they use to estimate that gradient at each step.
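As a rough formalization of the update described above, the rule can be written as follows. The symbols are assumptions introduced here for illustration: w_t for the weights at step t, \eta for the learning rate, \ell_i for the loss on example i, and B_t for the set of examples used at step t.

```latex
w_{t+1} = w_t - \eta \, \hat{g}_t,
\qquad
\hat{g}_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_w \ell_i(w_t)
```

Under this notation, the variants differ only in the choice of B_t: the whole dataset, a single example, or a small subset.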

Batch gradient descent uses the entire dataset for each update.
Stochastic gradient descent uses a single example per update.
Mini-batch gradient descent uses a small group of examples, the common middle ground.
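The three variants can be sketched as one training loop in which only the batch size changes. This is a minimal illustration, not a reference implementation: the linear model, the mean squared error loss, the synthetic data, and the names `gradient`, `train`, `lr`, and `epochs` are all assumptions chosen for the example.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w (assumed setup)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, lr=0.05, epochs=100, seed=0):
    """Run gradient descent; batch_size selects the variant."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * gradient(w, X[idx], y[idx])
    return w

# Synthetic data, purely for illustration.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * data_rng.normal(size=200)

w_batch = train(X, y, batch_size=len(y))  # batch: one update per epoch
w_sgd = train(X, y, batch_size=1)         # stochastic: one update per example
w_mini = train(X, y, batch_size=32)       # mini-batch: one update per 32 examples
```

On this small, well-conditioned problem all three runs should land near `true_w`; the differences show up in how many updates each one makes per pass over the data and how noisy each update is.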

The tradeoffs

Batch gradient descent gives a smooth, exact gradient of the training loss, but each step is slow and memory-heavy on large datasets. Stochastic gradient descent updates quickly, and its noise can help it escape shallow local minima, but the path jitters and the loss fluctuates instead of settling smoothly.

Mini-batch gradient descent combines the strengths. A batch of, say, a few hundred examples gives a reasonably stable gradient estimate while still updating often and fitting in memory. It also maps well to parallel hardware, which can process a whole batch in one vectorized operation.
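The claim that a larger batch gives a more stable gradient can be checked numerically. The sketch below compares sampled gradients against the full-dataset gradient at a fixed point; the linear model, the synthetic data, and the helper names `gradient` and `mean_error` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, purely for illustration.
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=10_000)
w = np.zeros(5)  # point at which the gradients are compared

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model (assumed setup)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

full = gradient(w, X, y)  # the batch gradient, used as the reference

def mean_error(batch_size, trials=200):
    """Average distance between a sampled gradient and the full gradient."""
    errs = []
    for _ in range(trials):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errs.append(np.linalg.norm(gradient(w, X[idx], y[idx]) - full))
    return float(np.mean(errs))

errors = {b: mean_error(b) for b in (1, 16, 256)}
for b, e in errors.items():
    print(f"batch size {b:>3}: mean gradient error {e:.3f}")
```

The error should shrink as the batch grows, roughly in proportion to one over the square root of the batch size, which is why a few hundred examples is often enough for a usable estimate.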

Why noise can help

The randomness in stochastic and mini-batch steps is not purely a flaw. The wobble lets the optimizer bounce out of poor regions, such as shallow local minima, and keep exploring, which can lead to better solutions than a perfectly smooth descent.

Key idea

Gradient descent variants trade gradient accuracy against update speed, and mini-batch descent strikes the practical balance: stable gradients, frequent updates, and helpful noise.

Check yourself