Machine Learning

The Scaling Laws For Transformers

The predictable power laws that guide how to spend compute.

6 min read · advanced

Loss falls predictably

One striking empirical finding is that transformer pretraining loss falls as a smooth power law in model size, dataset size, and compute, as long as the other two are not the bottleneck. Plot loss against any one of them on log-log axes and you get a nearly straight line spanning many orders of magnitude.
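To see why a power law shows up as a straight line, here is a minimal sketch. It assumes a pure power law L(N) = (N_C / N) ** ALPHA; the constants ALPHA and N_C are hypothetical values chosen only for illustration, not measured results.

```python
import math

# Hypothetical constants, chosen only for illustration.
ALPHA = 0.08   # power-law exponent
N_C = 1e13     # scale constant

def loss(n_params: float) -> float:
    """Pure power law: L(N) = (N_C / N) ** ALPHA."""
    return (N_C / n_params) ** ALPHA

# Model sizes from 1e6 to 1e10 parameters, one per decade.
sizes = [10.0 ** k for k in range(6, 11)]
points = [(math.log10(n), math.log10(loss(n))) for n in sizes]

# Slope between neighbouring points on log-log axes.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]

for (x, _), s in zip(points, slopes):
    print(f"from 1e{x:.0f} params: log-log slope = {s:.4f}")
```

Every segment has the same slope, equal to -ALPHA, which is exactly what a straight line on log-log axes means. Real loss curves are only approximately this clean.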

The three knobs

  • Parameters, the number of weights in the model.
  • Data, the number of training tokens.
  • Compute, roughly parameters times tokens times a constant (about 6 FLOPs per parameter per token is a common rule of thumb for dense transformers).
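The relationship between the three knobs can be sketched in a few lines. The constant is taken here as 6 FLOPs per parameter per token, a widely used approximation for dense transformers; treat it as an assumption rather than an exact figure. The function name training_flops is made up for this example.

```python
import math

def training_flops(n_params: float, n_tokens: float,
                   flops_per_param_token: float = 6.0) -> float:
    """Approximate training compute: C ~ k * N * D.

    k = 6 is a rule-of-thumb assumption, not an exact constant.
    """
    return flops_per_param_token * n_params * n_tokens

# Example: a 1B-parameter model trained on 20B tokens.
c = training_flops(1e9, 20e9)
print(f"{c:.2e} FLOPs")

# Doubling either knob doubles the compute bill.
assert math.isclose(training_flops(2e9, 20e9), 2 * c)
assert math.isclose(training_flops(1e9, 40e9), 2 * c)
```

Because compute is a product, the same budget can buy a large model trained briefly or a small model trained for longer.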

Compute-optimal balance

Given a fixed compute budget, there is an optimal split between a bigger model and more training data. The Chinchilla work (Hoffmann et al., 2022) argued that earlier large models were undertrained for their size, and that parameters and tokens should grow together in roughly equal proportion as the budget increases.
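A rough sketch of that split follows. It rests on two assumptions that are not stated in this lesson: compute is about 6 × parameters × tokens, and the optimal ratio is about 20 tokens per parameter, a commonly cited approximation. Both numbers are defaults you can change, and compute_optimal_split is a hypothetical helper name.

```python
import math

def compute_optimal_split(flops: float,
                          tokens_per_param: float = 20.0,
                          flops_per_param_token: float = 6.0):
    """Return (params, tokens) such that
         flops  = flops_per_param_token * params * tokens
         tokens = tokens_per_param * params
    Both default constants are rule-of-thumb assumptions.
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e20, 1e22, 1e24):
    n, d = compute_optimal_split(budget)
    print(f"{budget:.0e} FLOPs -> {n:.2e} params, {d:.2e} tokens")
```

Under these assumptions both parameters and tokens grow with the square root of compute, so a 100× larger budget means a roughly 10× larger model trained on roughly 10× more tokens.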

Why this matters

  • Scaling laws let teams predict the loss of a huge run from small cheap runs.
  • They turn model building into a question of budget allocation.
  • They warn that data, not just parameters, can become the bottleneck as models grow.
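The first point, predicting a large run from small ones, can be sketched as a straight-line fit in log-log space followed by extrapolation. The data below is synthetic, generated from a made-up power law (TRUE_A and TRUE_B are hypothetical), so it only demonstrates the mechanics, not real training results.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y).

    Returns (a, b) such that y is approximately a * x ** b.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic "small runs": losses generated from a made-up law.
TRUE_A, TRUE_B = 30.0, -0.05
small_budgets = [1e17, 1e18, 1e19, 1e20]            # FLOPs
small_losses = [TRUE_A * c ** TRUE_B for c in small_budgets]

a, b = fit_power_law(small_budgets, small_losses)

# Extrapolate four orders of magnitude beyond the largest small run.
big_budget = 1e24
predicted = a * big_budget ** b
print(f"fitted exponent: {b:.3f}")
print(f"predicted loss at {big_budget:.0e} FLOPs: {predicted:.3f}")
```

With real measurements the points scatter around the line, so the extrapolation carries uncertainty that grows with the distance from the fitted range.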

The limits

These power laws describe pretraining loss, not downstream usefulness, and the curves eventually bend as fresh data runs out or data quality degrades. They are a planning tool, not a guarantee.

Key idea

Transformer loss follows smooth power laws in parameters, data, and compute. As the compute budget grows, model size and training data should scale together, which turns model building into predictable compute allocation rather than guesswork.

Check yourself

1. How does transformer loss behave as model size, data, and compute grow?

2. What did compute-optimal scaling work conclude about earlier large models?

3. What is a stated limit of scaling laws?