Machine Learning

Large Batch Training

Scale batch size for throughput while keeping generalization intact.

5 min read · advanced

More throughput per step

Data-parallel distributed training grows the global batch size as you add devices: each device processes its own per-device batch, and the global batch is the sum across devices. Large batches give lower-variance gradient estimates and higher hardware utilization, but they change the optimization dynamics in ways you must manage.

  • Larger batches mean fewer updates per epoch.
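  • Gradient noise drops, which can reduce the exploration that helps small-batch training generalize.
  • Scaling the batch without retuning the learning rate and schedule often loses test accuracy.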
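
The first two points can be checked with a little arithmetic. The sketch below is illustrative only: the helper names, the dataset size, and the toy "gradient" (a unit-variance Gaussian around a true value of 1.0) are assumptions, not part of any particular framework. It shows how the update count per epoch shrinks as the global batch grows, and how the spread of a mini-batch gradient estimate falls roughly as one over the square root of the batch size.

```python
import math
import random
import statistics


def global_batch_size(per_device_batch, num_devices, accumulation_steps=1):
    """Global batch for data-parallel training (hypothetical helper)."""
    return per_device_batch * num_devices * accumulation_steps


def updates_per_epoch(dataset_size, global_batch):
    """Optimizer steps needed to see every example once."""
    return math.ceil(dataset_size / global_batch)


def minibatch_gradient_std(batch_size, trials=1000, seed=0):
    """Empirical std of a mini-batch mean over toy per-example gradients.

    Each per-example gradient is drawn from N(1.0, 1.0), a stand-in
    for real gradient noise, so the expected std is 1 / sqrt(batch_size).
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(1.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    batch = global_batch_size(per_device_batch=32, num_devices=8)
    print(batch)                                # 256
    print(updates_per_epoch(1_000_000, batch))  # 3907
    print(updates_per_epoch(1_000_000, 4096))   # 245
    for b in (16, 256):
        print(b, minibatch_gradient_std(b))
```

Going from a batch of 16 to 256 cuts the toy gradient's spread by about a factor of four, while going from 256 to 4096 cuts the number of updates per epoch by about sixteen. That trade is what the learning rate and schedule have to compensate for.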

Closing the generalization gap

Practitioners combine three tools to recover the accuracy a small batch would reach: a warmup that ramps the learning rate from a small value to its peak over the initial training steps, optimizers designed for large batches (such as LARS or LAMB), and carefully tuned decay schedules. Above some critical batch size, returns diminish sharply.

  • Warmup avoids early instability from a high learning rate.
  • A critical batch size bounds useful scaling.
  • Beyond it, doubling the batch barely speeds convergence.
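
One way to see the diminishing returns is a simplified model, assumed here for illustration rather than taken from this lesson, in which the number of steps to reach a target loss is min_steps * (1 + critical_batch / batch_size). The function names and the critical batch value of 1024 are hypothetical.

```python
def steps_to_target(batch_size, critical_batch, min_steps):
    """Steps to reach a target loss under the assumed simplified model."""
    return min_steps * (1.0 + critical_batch / batch_size)


def speedup_from_doubling(batch_size, critical_batch):
    """Factor by which the step count shrinks when the batch is doubled."""
    before = steps_to_target(batch_size, critical_batch, 1.0)
    after = steps_to_target(2 * batch_size, critical_batch, 1.0)
    return before / after


if __name__ == "__main__":
    critical = 1024  # hypothetical critical batch size
    for batch in (64, 1024, 16384):
        print(batch, round(speedup_from_doubling(batch, critical), 3))
```

Under this model, doubling a batch far below the critical size nearly halves the step count, doubling at the critical size cuts it by about a quarter (a 1.33x speedup), and doubling far above it changes almost nothing. The real critical batch size depends on the model, data, and stage of training, so treat the numbers as a shape, not a prediction.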

Scaling the recipe

Large batch training is less about a bigger number and more about the schedule that keeps it stable.
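
A schedule of this kind might look like the sketch below. It assumes two common heuristics that this lesson does not prescribe: scaling the peak learning rate linearly with the global batch size, and following linear warmup with cosine decay. The function names and all numeric values are hypothetical.

```python
import math


def scaled_peak_lr(base_lr, base_batch, global_batch):
    """Linear scaling heuristic: grow the peak rate with the batch size."""
    return base_lr * global_batch / base_batch


def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical reference recipe: lr 0.1 at batch 256, scaled to 8192.
    peak = scaled_peak_lr(base_lr=0.1, base_batch=256, global_batch=8192)
    for step in (0, 249, 499, 5000, 10000):
        print(step, lr_at_step(step, peak, warmup_steps=500, total_steps=10000))
```

The warmup keeps the first updates small while the scaled peak rate would otherwise be too aggressive. Linear scaling is only a starting point: it tends to break down at very large batches, which is where tuning the schedule or switching optimizers comes in.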

Key idea

Large batch training boosts throughput by enlarging the global batch, but it needs warmup and tuned schedules to preserve generalization, and the gains hold only up to a critical batch size.

Check yourself

1. Why does a learning rate warmup help large batch training?

2. What does the critical batch size describe?

3. How do large batches affect gradient noise?