Stepping down the loss
Gradient descent updates weights by moving them in the direction that lowers the loss. The variants differ in how much data they use to estimate that direction at each step.
- Batch uses the entire dataset for one update.
- Stochastic uses a single example per update.
- Mini-batch uses a small group of examples, the common middle ground.
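As a minimal sketch of how the three variants relate, the loop below treats them as one procedure that differs only in batch size. It assumes a linear model with a mean squared error loss and synthetic data. The function names (`gradient`, `descend`), the learning rate, and the epoch count are illustrative choices, not values from the text.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def descend(X, y, batch_size, lr=0.1, epochs=100, seed=0):
    """Gradient descent in which batch_size selects the variant."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Step against the gradient estimated from this batch only.
            w -= lr * gradient(w, X[idx], y[idx])
    return w

# Hypothetical synthetic data: y = 2*x0 - 3*x1 plus a little noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * data_rng.normal(size=200)

w_batch = descend(X, y, batch_size=len(y))  # batch: one update per epoch
w_sgd = descend(X, y, batch_size=1)         # stochastic: one example per update
w_mini = descend(X, y, batch_size=32)       # mini-batch: a small group per update
```

On this easy problem all three runs should land near the true weights. The difference is in cost and path: the batch run makes 100 updates in total, while the stochastic run makes 200 per epoch.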
The tradeoffs
Batch descent gives a smooth, exact gradient of the training loss, but each step is slow and memory-heavy on large datasets. Stochastic descent updates quickly, and its noise can help it escape shallow local minima, but the path jitters and convergence is erratic.
Mini-batch descent combines the strengths of both. A batch of, say, a few hundred examples gives a reasonably stable gradient estimate while still allowing frequent updates and fitting in memory. It also maps well to parallel hardware such as GPUs, which can process the examples in a batch at once.
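To make "stable gradient" concrete, the sketch below measures how far a gradient computed on a random batch lands from the full-dataset gradient, for a few batch sizes. It again assumes a linear model with squared error on synthetic data. The helper `gradient_noise`, the batch sizes, and the trial count are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=10_000)
w = np.zeros(5)  # the point at which all gradients are evaluated

def grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

full_gradient = grad(w, X, y)

def gradient_noise(batch_size, trials=500):
    """Average distance between a random-batch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errors.append(np.linalg.norm(grad(w, X[idx], y[idx]) - full_gradient))
    return float(np.mean(errors))

noise = {b: gradient_noise(b) for b in (1, 16, 256)}
for b, err in noise.items():
    print(f"batch size {b:>3}: mean gradient error {err:.3f}")
```

The error should fall as the batch grows, roughly in proportion to one over the square root of the batch size. That is why a few hundred examples already give a usable direction at a fraction of the cost of a full pass.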
Why noise can help
The randomness in stochastic and mini-batch steps is not purely a flaw. The wobble lets the optimizer bounce out of shallow local minima and explore more of the loss surface, which can lead to better solutions than a perfectly smooth descent.
Key idea
Gradient descent variants trade gradient accuracy against update speed, and mini-batch descent strikes the practical balance of stable gradients, frequent updates, and helpful noise.