More throughput per step
Data-parallel distributed training grows the global batch size as you add devices, because the global batch is the per-device batch multiplied by the number of devices. Larger batches give lower-variance gradient estimates and higher hardware utilization, but they change the optimization dynamics in ways you must manage.
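- Larger batches mean fewer parameter updates per epoch.
- Gradient noise drops, which can weaken the implicit regularization that noisy small-batch updates provide.
- Scaling the batch without retuning the learning rate and schedule often costs test accuracy.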
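The first bullet is simple arithmetic, and a minimal sketch makes it concrete. The dataset size, per-device batch, and device counts below are hypothetical values chosen for illustration, and the function names are not from any library.

```python
import math


def global_batch_size(per_device_batch: int, num_devices: int) -> int:
    """Global batch in data-parallel training: every device processes its own shard."""
    return per_device_batch * num_devices


def updates_per_epoch(dataset_size: int, global_batch: int) -> int:
    """Number of optimizer steps needed to see the whole dataset once."""
    return math.ceil(dataset_size / global_batch)


if __name__ == "__main__":
    dataset_size = 1_000_000  # hypothetical dataset size
    per_device_batch = 32     # hypothetical per-device batch
    for num_devices in (1, 8, 64):
        batch = global_batch_size(per_device_batch, num_devices)
        steps = updates_per_epoch(dataset_size, batch)
        print(f"{num_devices:>3} devices: global batch {batch:>5}, {steps:>6} updates per epoch")
```

Going from 1 to 64 devices in this example cuts the updates per epoch from 31,250 to 489, so each update has to do far more useful work for the run to reach the same accuracy.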
Closing the generalization gap
Practitioners combine a warmup that ramps the learning rate up over the first training steps, optimizers designed for large batches (for example LARS or LAMB), and carefully tuned schedules to recover the accuracy a small-batch run would reach. Above a critical batch size, returns diminish sharply.
- Warmup avoids early instability caused by starting at a high learning rate.
- A critical batch size bounds useful scaling.
- Beyond it, doubling the batch barely speeds convergence.
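To see why doubling stops paying off, here is a rough sketch under an assumed simplified model that is not stated in the text above: the number of steps to reach a target loss is taken to be min_steps * (1 + critical_batch / batch_size). Real training curves only approximate this shape, and the batch sizes used are hypothetical.

```python
def steps_to_target(batch_size: float, critical_batch: float, min_steps: float) -> float:
    """Assumed model: steps needed shrink with batch size but never drop below min_steps."""
    return min_steps * (1.0 + critical_batch / batch_size)


def speedup_from_doubling(batch_size: float, critical_batch: float) -> float:
    """Ratio of steps needed at batch_size to steps needed at 2 * batch_size.

    min_steps cancels in the ratio, so any positive value can be used.
    """
    before = steps_to_target(batch_size, critical_batch, min_steps=1.0)
    after = steps_to_target(2.0 * batch_size, critical_batch, min_steps=1.0)
    return before / after


if __name__ == "__main__":
    critical_batch = 1024  # hypothetical critical batch size
    for batch_size in (16, 128, 1024, 8192):
        gain = speedup_from_doubling(batch_size, critical_batch)
        print(f"batch {batch_size:>5} -> {2 * batch_size:>5}: {gain:.2f}x fewer steps")
```

Under this model, doubling a batch far below the critical size nearly halves the step count, doubling at the critical size gives about 1.33x, and doubling at eight times the critical size gives only about 1.06x while costing twice the compute per step.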
Scaling the recipe
Large-batch training is less about a bigger number and more about the schedule that keeps it stable.
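One possible shape for such a schedule is sketched below: a linear warmup to a peak learning rate followed by cosine decay. Two choices here are assumptions rather than anything stated above: the peak rate is set with the linear scaling heuristic (the rate grows in proportion to the global batch), and cosine decay stands in for whatever post-warmup schedule a given recipe uses. All numeric values and function names are hypothetical.

```python
import math


def scaled_peak_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    """Linear scaling heuristic (an assumption): scale the rate with the batch size."""
    return base_lr * global_batch / base_batch


def lr_at_step(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr over the warmup window.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical recipe: tuned at batch 256 with rate 0.1, now run at batch 2048.
    peak = scaled_peak_lr(base_lr=0.1, base_batch=256, global_batch=2048)
    for step in (0, 250, 499, 5000, 10000):
        rate = lr_at_step(step, peak, warmup_steps=500, total_steps=10000)
        print(f"step {step:>5}: lr = {rate:.4f}")
```

The warmup length is itself a tuning knob: larger batches with larger peak rates often need a longer ramp before the full rate is safe to apply.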
Key idea
Large-batch training boosts throughput by enlarging the global batch, but it needs warmup and tuned schedules to preserve generalization, and its benefits flatten out beyond a critical batch size.