Machine Learning

The Learning Rate Scaling Rule

Adjust the learning rate as the batch grows to keep updates comparable.

Matching the update size

When you grow the batch, each step averages the gradient over more examples and you take fewer steps per epoch, so a fixed learning rate makes less progress per example. The linear scaling rule says to scale the learning rate in proportion to the batch size: multiply the batch size by k, multiply the learning rate by k.

  • Double the batch, double the learning rate.
  • This keeps the per-example update roughly constant.
  • Pair it with a warmup to avoid early blowups.
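
The rule is simple arithmetic, and a minimal sketch makes it concrete. The function name and the reference point (a rate of 0.1 tuned at batch size 256) are illustrative assumptions, not values from this lesson.

```python
def linear_scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Scale the learning rate in proportion to the batch size."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Hypothetical reference point: lr 0.1 tuned at batch size 256.
for batch_size in (256, 512, 1024):
    print(batch_size, linear_scaled_lr(0.1, 256, batch_size))
```

Doubling the batch from 256 to 512 doubles the rate; treat the result as a first guess to tune from rather than a final value.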

When it holds and breaks

The rule works well over a moderate range of batch sizes but breaks down at very large batches, where a linearly scaled rate becomes too aggressive and training can diverge. Some setups prefer square root scaling instead, and either way a warmup is needed to survive the high rate at the start.

  • Linear scaling is a starting heuristic, not a law.
  • Square root scaling is gentler for huge batches.
  • Always combine with warmup for stability.
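
To see how much gentler square root scaling is, the sketch below compares the two rules side by side. The function names and the base setting (rate 0.1 at batch size 256) are assumptions chosen for illustration only.

```python
import math


def linear_scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size


def sqrt_scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Learning rate grows with the square root of the batch-size ratio."""
    return base_lr * math.sqrt(batch_size / base_batch_size)


# Hypothetical base setting: lr 0.1 at batch size 256.
for batch_size in (256, 1024, 16384):
    linear = linear_scaled_lr(0.1, 256, batch_size)
    root = sqrt_scaled_lr(0.1, 256, batch_size)
    print(f"batch={batch_size:6d}  linear={linear:.3f}  sqrt={root:.3f}")
```

At a 64x larger batch the linear rule asks for a 64x larger rate while the square root rule asks for only 8x, which is why the latter is the more conservative choice for huge batches.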

Scaling together

The rule gives a principled first guess that warmup and tuning then refine.
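
One way to combine the two ideas is to compute a scaled target rate and ramp up to it linearly over the first steps. This is a minimal sketch under assumed values (base rate 0.1 at batch size 256, a 5-step warmup); the function name `warmup_lr` is hypothetical, and real schedules often add a decay phase after the warmup.

```python
def warmup_lr(step: int, target_lr: float, warmup_steps: int) -> float:
    """Ramp linearly up to target_lr over warmup_steps, then hold it."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    # step is 0-indexed, so the final warmup step reaches target_lr exactly.
    return target_lr * (step + 1) / warmup_steps


# Assumed setup: lr 0.1 was tuned at batch 256; we now train at batch 2048.
base_lr, base_batch_size, batch_size = 0.1, 256, 2048
target_lr = base_lr * batch_size / base_batch_size  # linear scaling rule

for step in range(8):
    print(step, round(warmup_lr(step, target_lr, warmup_steps=5), 4))
```

The warmup length is itself a tuning knob: the larger the scaled rate, the longer a ramp is typically worth trying.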

Key idea

The learning rate scaling rule grows the rate with the batch size to keep effective updates comparable, working as a heuristic that warmup and tuning refine.

Check yourself

1. What does the linear scaling rule say to do when you double the batch?

2. Why combine the scaling rule with warmup?