The Learning Rate Scaling Rule

Matching the update size

When you grow the batch, each step uses more data, so a fixed learning rate makes effectively smaller progress per example. The linear scaling rule says scale the learning rate in proportion to the batch size to keep the effective update comparable.

Double the batch, double the learning rate.
This keeps the per example update roughly constant.
It pairs with a warmup to avoid early blowups.

When it holds and breaks

The rule works well in a moderate range but breaks at very large batches, where a linearly scaled rate becomes too aggressive. Some setups prefer a square root scaling instead, and all of them need a warmup to survive the high rate at the start.

Linear scaling is a starting heuristic, not a law.
Square root scaling is gentler for huge batches.
Always combine with warmup for stability.

Scaling together

The rule gives a principled first guess that warmup and tuning then refine.

Key idea

The learning rate scaling rule grows the rate with the batch size to keep effective updates comparable, working as a heuristic that warmup and tuning refine.

The Learning Rate Scaling Rule

Matching the update size

When it holds and breaks

Scaling together

Key idea

Check yourself