The Scaling Laws Deep

What scaling laws say

Train many language models at different sizes and you find a clean pattern: the test loss drops as a smooth power law in whichever resource you scale up. Plotted on log-log axes, the curve becomes a nearly straight line over many orders of magnitude.
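To see why a power law shows up as a straight line, here is a minimal sketch. It assumes a pure power law L = A * N^(-ALPHA); the constants A and ALPHA are made up for illustration and are not fitted values from any real model family.

```python
import math

# Hypothetical constants, chosen only for illustration.
A = 10.0      # scale factor
ALPHA = 0.1   # power-law exponent

def power_law_loss(n_params: float) -> float:
    """Loss as a pure power law in model size: L = A * N^(-ALPHA)."""
    return A * n_params ** (-ALPHA)

sizes = [1e6, 1e7, 1e8, 1e9]
log_points = [(math.log10(n), math.log10(power_law_loss(n))) for n in sizes]

# Slope between each pair of neighbouring points in log-log space.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(log_points, log_points[1:])
]
print(slopes)  # every slope is approximately -ALPHA
```

Taking logs turns L = A * N^(-ALPHA) into log L = log A - ALPHA * log N, so the slope of the line is the negated exponent and the intercept is log A.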

Three knobs

The loss depends on three quantities, each with its own power law when the others are not the bottleneck.

Parameters N, the number of weights in the model.
Data D, the number of training tokens seen.
Compute C, the total floating-point operations spent on training, roughly 6 × N × D.
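The relation between the three knobs can be sketched in a few lines. The function names and the example run (1 billion parameters, 20 billion tokens) are hypothetical, and the factor of six is the rough approximation stated above, not an exact count.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Invert the same approximation: tokens affordable at a given model size."""
    return flops_budget / (6.0 * n_params)

# Hypothetical run: 1e9 parameters trained on 20e9 tokens.
c = training_flops(1e9, 20e9)
print(f"{c:.2e} FLOPs")

# With the same budget, a model twice as large sees half as many tokens.
d_half = tokens_for_budget(c, 2e9)
print(f"{d_half:.2e} tokens")
```

Because C ties N and D together, fixing a compute budget leaves only one free choice: every extra parameter is paid for with fewer training tokens.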

Why this is useful

You can fit the curve on small, cheap runs and extrapolate to predict the loss of a large run before paying for it.
It tells you whether you are bottlenecked by model size or by data.
It frames training as an optimization over a fixed compute budget.
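The fit-and-extrapolate workflow can be sketched as below. This is an illustration under stated assumptions, not a recipe from any particular paper: the "small runs" are synthetic points generated from a made-up power law (true_a and true_alpha are hypothetical), and the fit is an ordinary least-squares line in log-log space. Real loss curves typically flatten toward an irreducible floor, which this pure power-law form ignores.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line fit in log-log space.

    Returns (a, alpha) for the model y = a * x^(-alpha).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic stand-in for small, cheap runs (not real measurements).
true_a, true_alpha = 10.0, 0.1
small_sizes = [1e6, 3e6, 1e7, 3e7]
small_losses = [true_a * n ** (-true_alpha) for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)

# Extrapolate to a model far larger than any run used in the fit.
predicted = a * 1e9 ** (-alpha)
print(f"fitted alpha = {alpha:.3f}, predicted loss at 1e9 params = {predicted:.3f}")
```

With noisy real measurements the fitted exponent carries uncertainty, and that uncertainty grows the further you extrapolate beyond the largest run in the fit.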

What it does not say

Power laws describe the smooth trend in average loss, not sudden jumps in capability. They also assume the data distribution stays the same and the training recipe is well tuned; a bad learning rate breaks the trend.

Key idea

Test loss falls as a power law in parameters, data, and compute, which lets you extrapolate from cheap small runs to plan an expensive large one.
