Machine Learning

The Scaling Laws Deep

How loss falls as a smooth power law in model size, data, and compute.

What scaling laws say

Train many language models at different sizes and you find a clean pattern: test loss falls as a smooth power law in the resource you scale up. Plotted on log-log axes, the curve is a nearly straight line over many orders of magnitude.
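
As a sketch, a single-resource power law can be written as below. Here x stands for whichever resource is being scaled, and x_c and α are fitted constants; these symbols are assumptions chosen for illustration, not notation from this lesson.

```latex
L(x) \approx \left(\frac{x_c}{x}\right)^{\alpha}
\quad\Longrightarrow\quad
\log L(x) \approx \alpha \log x_c - \alpha \log x
```

Taking logs turns the power law into a line with slope −α, which is why the plot looks straight on log axes.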

Three knobs

The loss depends on three quantities, each with its own power law when the others are not the bottleneck.

  • Parameters N, the number of weights in the model.
  • Data D, the number of training tokens seen.
  • Compute C, the total number of floating-point operations used in training, roughly six times N times D (C ≈ 6ND).
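
The compute rule of thumb is simple arithmetic, so a minimal sketch may help. The function names and the example run sizes below are hypothetical, chosen only to illustrate C ≈ 6ND.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute using C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Tokens affordable for a model of n_params under a fixed compute budget."""
    return flops_budget / (6.0 * n_params)


# Hypothetical run: 1e9 parameters trained on 2e10 tokens.
c = training_flops(1e9, 2e10)
print(f"{c:.2e}")  # 1.20e+20
```

Inverting the same formula, as tokens_for_budget does, shows the trade-off under a fixed budget: doubling N halves the tokens you can afford.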

Why this is useful

  • You can fit the curve on small cheap runs and extrapolate to predict a large run before paying for it.
  • It tells you whether you are bottlenecked by model size or by data.
  • It frames training as an optimization over a fixed compute budget.
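
The fit-and-extrapolate idea can be sketched as a straight-line fit in log-log space. This is a minimal illustration, assuming a pure power law with no constant loss floor; the data is synthetic (generated from a known power law, not measured), and the function names are hypothetical.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares line fit in log-log space.

    Returns (a, alpha) for the model loss ≈ a * x ** (-alpha).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, alpha, x):
    """Evaluate the fitted power law at resource level x."""
    return a * x ** (-alpha)


# Synthetic "small runs": compute budgets and losses from a made-up power law.
small_compute = [1e15, 1e16, 1e17, 1e18]
small_losses = [20.0 * c ** -0.05 for c in small_compute]

a, alpha = fit_power_law(small_compute, small_losses)
# Extrapolate three orders of magnitude beyond the largest small run.
predicted = predict_loss(a, alpha, 1e21)
```

With real measurements the points will not sit exactly on a line, so the further you extrapolate beyond the fitted range, the less the prediction should be trusted.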

What it does not say

Power laws describe the smooth trend in average loss, not sudden jumps in specific capabilities. They also assume a fixed data distribution and a well-tuned training recipe; a badly chosen learning rate, for example, can break the trend.

Key idea

Test loss falls as a power law in parameters, data, and compute, which lets you extrapolate from cheap small runs to plan an expensive large one.

Check yourself

1. On a log loss versus log compute plot, a scaling law appears as roughly what shape?

2. Total training compute C is approximately what?

3. What do scaling laws not reliably predict?