Matching the update size
When you grow the batch, each step uses more data, so a fixed learning rate makes effectively smaller progress per example. The linear scaling rule says scale the learning rate in proportion to the batch size to keep the effective update comparable.
- Double the batch, double the learning rate.
- This keeps the per example update roughly constant.
- It pairs with a warmup to avoid early blowups.
When it holds and breaks
The rule works well in a moderate range but breaks at very large batches, where a linearly scaled rate becomes too aggressive. Some setups prefer a square root scaling instead, and all of them need a warmup to survive the high rate at the start.
- Linear scaling is a starting heuristic, not a law.
- Square root scaling is gentler for huge batches.
- Always combine with warmup for stability.
Scaling together
The rule gives a principled first guess that warmup and tuning then refine.
Key idea
The learning rate scaling rule grows the rate with the batch size to keep effective updates comparable, working as a heuristic that warmup and tuning refine.