What scaling laws say
Train many language models at different sizes and a clean pattern emerges: test loss falls as a smooth power law in the resource you add. Plotted on log-log axes, the curve is a nearly straight line over many orders of magnitude.
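To see why the line is straight, take a single resource x and write the power law with a constant a and an exponent α. Both symbols are notation introduced here for illustration, not fitted values, and this sketch of the pure power-law form ignores any irreducible loss floor.

```latex
L(x) = a\,x^{-\alpha}
\quad\Longrightarrow\quad
\log L(x) = \log a - \alpha \log x
```

On log-log axes this is a line with slope -α and intercept log a.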
Three knobs
The loss depends on three quantities, each with its own power law when the others are not the bottleneck.
- Parameters N, the number of weights in the model.
- Data D, the number of training tokens seen.
- Compute C, the total floating-point operations used in training, roughly 6 × N × D.
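The compute rule of thumb is easy to turn into arithmetic. The following is a minimal sketch, assuming the rough C ≈ 6 × N × D estimate stated above; the function names and the example model and token counts are hypothetical, chosen only for illustration.

```python
def approx_train_flops(n_params, n_tokens):
    """Rough training compute in FLOPs, using the C ~ 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens


def tokens_for_budget(flops_budget, n_params):
    """Tokens affordable for a given model size under a fixed compute budget.

    This just inverts the same rule of thumb: D ~ C / (6 * N).
    """
    return flops_budget / (6 * n_params)


# Hypothetical example values, not taken from any real training run.
n = 1_000_000_000      # 1B parameters
d = 20_000_000_000     # 20B training tokens

c = approx_train_flops(n, d)
print(f"Estimated compute: {c:.2e} FLOPs")

# With the same budget, a model twice as large sees half as many tokens.
print(f"Tokens for a 2x larger model: {tokens_for_budget(c, 2 * n):.2e}")
```

Because C is a product of N and D, fixing the budget forces a trade: every increase in model size is paid for with fewer training tokens.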
Why this is useful
- You can fit the curve on small cheap runs and extrapolate to predict a large run before paying for it.
- It tells you whether you are bottlenecked by model size or by data.
- It frames training as an optimization over a fixed compute budget.
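The fit-and-extrapolate step in the first bullet can be sketched as a straight-line fit on log-log axes. This is an illustrative sketch only: it assumes a pure power law with no irreducible loss floor, the helper names `fit_power_law` and `predict_loss` are hypothetical, and the "small runs" are synthetic numbers generated from a made-up law rather than real measurements.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss ~ a * x**(-alpha) by ordinary least squares on log-log axes."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the log-log line is -alpha; the intercept is log(a).
    return math.exp(intercept), -slope


def predict_loss(a, alpha, x):
    """Evaluate the fitted power law at a new resource level x."""
    return a * x ** (-alpha)


# Synthetic "small runs": losses generated from an assumed law, for illustration.
true_a, true_alpha = 10.0, 0.1
small_runs = [1e6, 1e7, 1e8]  # hypothetical parameter counts
losses = [true_a * n ** (-true_alpha) for n in small_runs]

a, alpha = fit_power_law(small_runs, losses)
predicted = predict_loss(a, alpha, 1e10)  # extrapolate two orders of magnitude
print(f"fitted a={a:.3f}, alpha={alpha:.3f}, predicted loss at 1e10: {predicted:.3f}")
```

Real runs are noisy, so in practice you would fit more than three points and check the extrapolation against at least one held-out larger run before trusting it.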
What it does not say
Power laws describe smooth average loss, not sudden capability jumps. They also assume the same data distribution and a well-tuned recipe; a badly chosen learning rate, for example, can break the trend.
Key idea
Test loss falls as a power law in parameters, data, and compute, which lets you extrapolate from cheap small runs to plan an expensive large one.