The middle ground
Mini-batch gradient descent computes the gradient over a small group of examples and updates the parameters once per batch. It sits between full-batch gradient descent and single-example SGD.
- A batch of, say, 32 to 256 examples is common.
- The averaged gradient is less noisy than the gradient from a single example.
- Each update is still far cheaper than a full pass over the dataset.
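The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `minibatch_gd`, the one-dimensional linear model, the squared-error loss, and the synthetic data are all assumptions chosen to keep the example small.

```python
import random

def minibatch_gd(xs, ys, batch_size=32, lr=0.1, epochs=50, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw fresh random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-example gradients over the batch.
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i]
                grad_b += 2 * err
            grad_w /= len(batch)
            grad_b /= len(batch)
            # One parameter update per batch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(512)]
ys = [3 * x + 1 for x in xs]
w, b = minibatch_gd(xs, ys)
```

With 512 examples and a batch size of 32, each epoch performs 16 updates instead of the single update a full-batch pass would give.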
Why it wins
Averaging over a batch smooths the noise while keeping updates frequent. It also maps well onto hardware: matrix operations over a batch run efficiently on a GPU.
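- Larger batches give smoother gradient estimates, but each update costs more and there are fewer updates per epoch.
- Smaller batches give cheaper, more frequent updates, and the extra noise can be useful.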
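The smoothing effect can be checked empirically. The sketch below, under assumed synthetic data and a hypothetical helper `gradient_spread`, draws many random batches at a fixed parameter value and measures how much the batch gradient varies. For independently sampled examples the spread is expected to shrink roughly in proportion to one over the square root of the batch size.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical noisy data around y = 3x.
xs = [rng.uniform(-1, 1) for _ in range(4096)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]
w = 0.0  # fixed parameter value at which gradients are evaluated

def batch_gradient(batch):
    # Mean gradient of (w*x - y)^2 with respect to w over the batch.
    return sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)

def gradient_spread(batch_size, trials=500):
    # Standard deviation of the batch gradient across random batches.
    grads = [
        batch_gradient(rng.sample(range(len(xs)), batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(grads)

spread = {bs: gradient_spread(bs) for bs in (1, 16, 256)}
for bs, s in spread.items():
    print(f"batch size {bs:>3}: gradient std {s:.3f}")
```

The exact numbers depend on the data and the seed, but the spread should fall steadily as the batch grows.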
Choosing a size
Batch size interacts with learning rate. Bigger batches often allow a larger learning rate, but too large a batch can hurt how well the model generalizes.
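One common starting point for this interaction is the linear scaling heuristic: when the batch size is multiplied by some factor, multiply the learning rate by the same factor. The sketch below assumes that heuristic. It is a rule of thumb rather than a guarantee, the helper name `scaled_learning_rate` and the reference configuration are hypothetical, and the resulting values still need to be validated on held-out data.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# Hypothetical reference configuration that is assumed to train well.
base_lr, base_batch_size = 0.1, 32

candidates = {
    bs: scaled_learning_rate(base_lr, base_batch_size, bs)
    for bs in (32, 64, 128, 256)
}
for bs, lr in candidates.items():
    print(f"batch size {bs:>3}: candidate learning rate {lr:.2f}")
```

Treat these as candidate values to try, not final settings: at very large batch sizes the heuristic tends to break down, which is consistent with the generalization concern noted above.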
Mini-batches are the default in modern deep learning training loops.
Key idea
Mini-batch gradient descent averages gradients over a small batch, balancing the smoothness of full-batch gradient descent with the speed of single-example SGD, and it maps well onto parallel hardware.