The Scaling Laws For Transformers

Loss falls predictably

A striking empirical finding is that transformer pretraining loss falls as a smooth power law in model size, dataset size, and compute, as long as the other two are not the bottleneck. Plot loss against any of these on log-log axes and you get a nearly straight line over many orders of magnitude.
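
A straight line on log-log axes means the exponent can be read off with ordinary linear regression on the logs. The sketch below assumes a pure power law, loss = a * size^(-alpha), and ignores any irreducible loss floor that real curves may have. The data points are synthetic and made up for illustration, not measurements, and fit_power_law is a hypothetical helper name.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by a straight-line fit in log-log space."""
    log_x = np.log(np.asarray(sizes, dtype=float))
    log_y = np.log(np.asarray(losses, dtype=float))
    # On log axes the power law is log(loss) = log(a) - alpha * log(size).
    slope, intercept = np.polyfit(log_x, log_y, 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic illustration only: generated from a = 10, alpha = 0.08.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 10.0 * sizes ** -0.08

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

With real runs the points scatter around the line, so the fitted exponent is only as trustworthy as the range of sizes it was fitted on.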

The three knobs

Parameters, the number of weights in the model.
Data, the number of training tokens.
Compute, roughly parameters times tokens times a constant (about 6 FLOPs per parameter per token is the usual estimate for training).
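
The relationship between the three knobs can be sketched as a one-line estimate. The constant of 6 FLOPs per parameter per token is a common rule of thumb for a forward and backward pass, taken here as an assumption. It ignores attention cost and hardware efficiency. The function name and the example sizes are hypothetical.

```python
def training_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Rough training compute: constant * parameters * tokens.

    flops_per_param_token = 6 is an assumed rule of thumb, not an exact count.
    """
    return flops_per_param_token * n_params * n_tokens

# Hypothetical example: a 1e9-parameter model trained on 2e10 tokens.
flops = training_flops(1e9, 2e10)
print(f"{flops:.2e} FLOPs")
```

Because compute is a product, doubling either parameters or tokens doubles the bill, which is why the split between them matters.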

Compute optimal balance

Given a fixed compute budget, there is an optimal split between making the model bigger and training on more data. Influential work argued that earlier large models were undertrained for their size, and that parameters and tokens should scale together in roughly equal proportion, so each grows roughly as the square root of compute.
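
To make the split concrete, here is a minimal sketch under two loudly stated assumptions: compute is about 6 * parameters * tokens, and the optimal ratio of tokens to parameters is a fixed constant. The default of 20 tokens per parameter is a commonly quoted rule of thumb, not a value taken from this text, and the right ratio depends on the data and setup. compute_optimal_split is a hypothetical helper.

```python
import math

def compute_optimal_split(flops_budget, tokens_per_param=20.0,
                          flops_per_param_token=6.0):
    """Split a compute budget into (parameters, tokens).

    Assumes flops = k * N * D and D = r * N, so N = sqrt(flops / (k * r)).
    Both k = 6 and r = 20 are assumed rules of thumb.
    """
    n_params = math.sqrt(flops_budget / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(1.2e20)
print(f"params = {n:.2e}, tokens = {d:.2e}")

# 100x more compute gives about 10x more parameters and 10x more tokens.
n_big, d_big = compute_optimal_split(1.2e22)
print(f"params grow {n_big / n:.1f}x, tokens grow {d_big / d:.1f}x")
```

Under these assumptions, spending all extra compute on parameters alone leaves the model short of data, which is the sense in which earlier large models were called undertrained.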

Why this matters

Scaling laws let teams predict the loss of a huge run from small cheap runs.
They turn model building into a question of budget allocation.
They warn that data, not just parameters, can become the bottleneck as models grow.

The limits

Power laws describe pretraining loss, not downstream usefulness, and the straight line eventually bends as fresh data runs out or data quality degrades. They are a planning tool, not a guarantee.

Key idea

Transformer loss follows smooth power laws in parameters, data, and compute. For a fixed compute budget, model size and data should scale together, which turns model building into predictable compute allocation rather than guesswork.