Three flavors
The variants of gradient descent differ in how many examples they use per update.
- Batch gradient descent uses the whole dataset per step. Each step follows the exact gradient of the training loss but is slow to compute.
- Stochastic gradient descent, or SGD, uses one example per step. Steps are fast and noisy.
- Minibatch gradient descent uses a small group of examples per step, the common middle ground.
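The three flavors can be sketched as one loop where only the batch size changes. This is a minimal illustration, assuming a least-squares linear model and synthetic data; the names `mse_gradient` and `gradient_descent`, the learning rates, and the dataset are hypothetical choices for the example, not part of the original notes.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def gradient_descent(X, y, batch_size, lr=0.1, epochs=100, seed=0):
    """Run gradient descent; batch_size selects the flavor."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # fresh shuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * mse_gradient(w, X[idx], y[idx])
    return w

# Synthetic data (assumed for illustration): y = 2*x0 - 3*x1 + small noise
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(256, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.01 * data_rng.normal(size=256)

w_batch = gradient_descent(X, y, batch_size=len(y))     # batch: 1 update per epoch
w_sgd = gradient_descent(X, y, batch_size=1, lr=0.01)   # SGD: 256 updates per epoch
w_mini = gradient_descent(X, y, batch_size=32)          # minibatch: 8 updates per epoch
```

On this easy problem all three runs should land close to `true_w`; what differs is the number of updates per epoch and how noisy each update is.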
Why noise can help
The noise in SGD is not purely bad.
- It can help the optimization path escape shallow local minima and saddle points.
- It acts as a mild regularizer, discouraging sharp overfit solutions.
- But pure SGD updates are jittery and underuse modern hardware.
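To make "noise" concrete, the sketch below treats it as the gap between a single-example gradient and the full-batch gradient. It assumes a least-squares loss and a made-up dataset; all variable names are illustrative. The point it shows is that this noise averages to zero over the dataset, so SGD steps are right on average even though each one is jittery.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # hypothetical dataset
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
w = np.zeros(3)                                     # current parameters

residual = X @ w - y                                # shape (200,)
per_example_grads = X * residual[:, None]           # row i: gradient on example i
full_grad = per_example_grads.mean(axis=0)          # full-batch gradient

noise = per_example_grads - full_grad               # what a single-example step adds
mean_noise = noise.mean(axis=0)                     # zero up to floating-point error
noise_scale = np.linalg.norm(noise, axis=1).mean()  # typical size of the jitter
```

Here `mean_noise` is zero by construction, while `noise_scale` is clearly positive: the individual steps scatter around the full-batch direction rather than drifting away from it.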
Why minibatch wins in practice
- A batch of, say, sixty-four examples uses vectorized hardware efficiently.
- Averaging over the batch gives smoother gradients than single examples.
- The batch size becomes a knob balancing noise, speed, and memory.
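The smoothing effect of averaging can be checked numerically. The sketch below, again assuming a least-squares loss on synthetic data, measures how far a minibatch gradient sits from the full-batch gradient for several batch sizes; the helper names `grad` and `mean_gradient_error` and the chosen sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
X = rng.normal(size=(n, 5))                         # hypothetical dataset
y = X @ rng.normal(size=5) + rng.normal(size=n)
w = np.zeros(5)                                     # current parameters

def grad(idx):
    """Least-squares gradient averaged over the examples in idx."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

full_grad = grad(np.arange(n))

def mean_gradient_error(batch_size, trials=500):
    """Average distance between a random minibatch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errors.append(np.linalg.norm(grad(idx) - full_grad))
    return float(np.mean(errors))

errors = {b: mean_gradient_error(b) for b in (1, 8, 64, 512)}
```

The error should shrink as the batch grows, roughly with the square root of the batch size, which is why doubling the batch buys less smoothing each time while memory and compute per step keep rising.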
Practical notes
- Larger batches usually need a larger learning rate, a warmup period, or both.
- Shuffle the data each epoch so batches stay representative.
- Very large batches can hurt generalization if not tuned carefully.
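Two of these notes translate directly into small helpers. The sketch below assumes the linear scaling heuristic (learning rate grows in proportion to batch size) and a linear warmup; both are common conventions rather than rules stated in these notes, and the function names and numbers are hypothetical.

```python
import numpy as np

def scaled_lr(base_lr, base_batch, batch_size):
    """Linear scaling heuristic: learning rate proportional to batch size."""
    return base_lr * batch_size / base_batch

def warmup_lr(target_lr, step, warmup_steps):
    """Ramp linearly up to target_lr over warmup_steps updates, then hold."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps

def epoch_batches(n, batch_size, rng):
    """Yield index batches from a fresh shuffle; call once per epoch."""
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]

# Going from batch size 64 to 256 scales the learning rate by 4.
target = scaled_lr(base_lr=0.1, base_batch=64, batch_size=256)
schedule = [warmup_lr(target, s, warmup_steps=4) for s in range(6)]

# One epoch over 10 examples in batches of 4; the last batch is smaller.
batches = list(epoch_batches(10, 4, np.random.default_rng(0)))
```

Calling `epoch_batches` again at the start of each epoch gives a new ordering, so every example is visited once per epoch but in different company each time. How well the scaled learning rate works at very large batch sizes still has to be checked empirically.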
Key idea
Minibatch gradient descent balances the accuracy of full-batch updates against the helpful noise of single-example SGD while using hardware efficiently.