Compute-Optimal Training

Reframing the goal

The naive instinct is to make the model as large as possible. The compute-optimal view instead fixes a FLOP budget and asks which combination of model size and training-token count reaches the lowest loss for exactly that budget.

The IsoFLOP method

A practical recipe sweeps configurations at a fixed compute level.

Pick several model sizes and train each on the token count that uses up the same total FLOPs, so smaller models see more tokens and larger models see fewer.
Plot final loss against model size to get a U-shaped curve.
The bottom of the U is the optimal size for that budget.
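
The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation. It assumes the common rule of thumb that training cost is roughly 6 FLOPs per parameter per token, which the text does not state. The helper names `tokens_for_budget` and `isoflop_minimum` are hypothetical, and the budget and sizes in the usage lines are placeholders rather than measured values. The bottom of the U is estimated by fitting a parabola in log model size through the lowest measured point and its two neighbours.

```python
import math

# Assumption (not from the text): training cost C is roughly 6 * N * D FLOPs
# for N parameters and D tokens.
FLOPS_PER_PARAM_TOKEN = 6


def tokens_for_budget(flops, n_params):
    """Token count that spends `flops` on a model with `n_params` parameters."""
    return flops / (FLOPS_PER_PARAM_TOKEN * n_params)


def isoflop_minimum(n_params, losses):
    """Estimate the loss-minimizing model size from one IsoFLOP sweep.

    `n_params` must be sorted in ascending order and `losses` holds the final
    loss of each run. A parabola in log(N) is passed through the lowest-loss
    point and its two neighbours, and the vertex is returned as a model size.
    """
    i = min(range(len(losses)), key=losses.__getitem__)
    if i == 0 or i == len(losses) - 1:
        raise ValueError("minimum is at the edge of the sweep; widen the size range")
    x0, x1, x2 = (math.log(n) for n in n_params[i - 1:i + 2])
    y0, y1, y2 = losses[i - 1:i + 2]
    # Vertex of the parabola through (x0, y0), (x1, y1), (x2, y2).
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    return math.exp(x1 - 0.5 * num / den)


# Placeholder sweep: one hypothetical budget, five hypothetical model sizes.
budget = 1e19
sizes = [1e8, 2e8, 4e8, 8e8, 1.6e9]
plan = [(n, tokens_for_budget(budget, n)) for n in sizes]
```

Each pair in `plan` is one training run to launch. Once the runs finish, passing `sizes` and the measured losses to `isoflop_minimum` gives the estimated optimal size for that budget. If the lowest loss sits at either end of the sweep, the function raises an error, because the sweep did not bracket the bottom of the U.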

What the curves reveal

Too small a model underfits even with abundant data.
Too large a model leaves too little of the budget for tokens and stays undertrained.
The optimal point moves to larger models and more tokens as the budget grows.

Using it in planning

Repeat the sweep at several budgets, fit how the optimal size scales with compute, then extrapolate to read off the right model size and token count for your real budget. This avoids the costly mistake of building a model the budget cannot properly train.
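
One way to do the fit is a straight line in log-log space, which corresponds to a power law of the form N_opt = k * C^a. The sketch below assumes that form and again assumes roughly 6 FLOPs per parameter per token; neither is stated in the text. The names `fit_power_law` and `plan_for_budget` are hypothetical. The coefficient and exponent have to come from your own IsoFLOP sweeps.

```python
import math


def fit_power_law(budgets, optimal_sizes):
    """Least-squares fit of log(N_opt) = log(k) + a * log(C). Returns (k, a)."""
    xs = [math.log(c) for c in budgets]
    ys = [math.log(n) for n in optimal_sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                       # slope: scaling exponent
    k = math.exp(mean_y - a * mean_x)   # intercept: scaling coefficient
    return k, a


def plan_for_budget(flops, k, a, flops_per_param_token=6):
    """Extrapolate the fit to a target budget.

    Returns (model size, token count). The token count uses the assumed
    relation C = flops_per_param_token * N * D.
    """
    n_opt = k * flops ** a
    d_opt = flops / (flops_per_param_token * n_opt)
    return n_opt, d_opt
```

Feed `fit_power_law` the budgets you swept and the optimal size found at each, then call `plan_for_budget` with the real budget. Extrapolating far beyond the swept range relies on the power law continuing to hold, so treat the result as a starting point rather than a guarantee.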

Key idea

Compute-optimal training fixes a FLOP budget and finds the model size and token count at the bottom of the IsoFLOP loss curve, avoiding models too big for their data.
