Machine Learning

Compute-Optimal Training

Spending a FLOP budget to minimize loss instead of maximizing model size.

5 min read · intro

Reframing the goal

The naive instinct is to make the model as large as possible. The compute-optimal view instead fixes a FLOP budget and asks which combination of model size and training-token count reaches the lowest loss for exactly that spend.
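
As a minimal sketch of that trade-off, assume the common rule of thumb that training costs roughly 6 FLOPs per parameter per token, so C ≈ 6 · N · D. The lesson does not state this rule, and the function name and the budget value below are hypothetical. Under that assumption, fixing the budget ties the token count to the model size:

```python
# Assumption: training cost is approximated by C ≈ 6 * N * D FLOPs,
# where N is the parameter count and D is the number of training tokens.
# This rule of thumb is not stated in the lesson itself.

def tokens_for_budget(flop_budget: float, n_params: float) -> float:
    """Tokens that exhaust the budget for a given model size, under C ≈ 6*N*D."""
    return flop_budget / (6.0 * n_params)

budget = 1e21  # hypothetical FLOP budget, chosen only for illustration
for n_params in (1e8, 1e9, 1e10):
    n_tokens = tokens_for_budget(budget, n_params)
    print(f"N = {n_params:.0e} params -> D = {n_tokens:.2e} tokens")
```

Every tenfold increase in model size cuts the affordable token count tenfold, which is why size and data have to be chosen together.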

The IsoFLOP method

A practical recipe sweeps configurations at a fixed compute level.

  • Pick several model sizes and train each on the token count that uses the same total FLOPs.
  • Plot final loss against model size to get a U-shaped curve.
  • The bottom of the U is the optimal size for that budget.
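
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not a real experiment: `synthetic_loss` is a toy stand-in for an actual training run with invented constants, the cost rule C ≈ 6 · N · D is an assumed approximation, and all function names are hypothetical. The bottom of the U is located by fitting a parabola through the three lowest points in log model size.

```python
import math

def synthetic_loss(n_params, n_tokens):
    # Toy stand-in for a real training run; the constants are invented
    # for illustration and are not measured results.
    return 1.7 + 400.0 / n_params**0.34 + 410.0 / n_tokens**0.28

def isoflop_sweep(flop_budget, model_sizes):
    """Return (model_size, tokens, loss) for each size at one FLOP budget."""
    rows = []
    for n in model_sizes:
        d = flop_budget / (6.0 * n)  # assumed cost rule C ≈ 6*N*D
        rows.append((n, d, synthetic_loss(n, d)))
    return rows

def parabola_vertex(xs, ys):
    """x-coordinate of the vertex of the parabola through three points."""
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    return x1 - 0.5 * num / den

budget = 1e21  # hypothetical FLOP budget
sizes = [10 ** (8 + 0.25 * i) for i in range(13)]  # 1e8 .. 1e11 parameters
rows = isoflop_sweep(budget, sizes)

# Index of the lowest loss on the grid, then refine it in log10(model size).
best = min(range(len(rows)), key=lambda i: rows[i][2])
near = rows[best - 1 : best + 2]
log_n_opt = parabola_vertex(
    [math.log10(r[0]) for r in near], [r[2] for r in near]
)
n_opt = 10 ** log_n_opt
print(f"estimated optimal size: {n_opt:.2e} params, "
      f"{budget / (6.0 * n_opt):.2e} tokens")
```

The refinement step assumes the lowest grid point is not at either end of the sweep; if it is, the range of model sizes needs widening before the minimum can be trusted.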

What the curves reveal

  • Too small a model underfits even with abundant data.
  • Too large a model spends the budget on too few tokens and stays undertrained.
  • The optimal point moves to larger models and more tokens as the budget grows.

Using it in planning

Repeat the sweep at several budgets, fit how the optimal model size scales with compute, then extrapolate that fit to read off the model size and token count for your real budget. This avoids the costly mistake of building a model that the budget cannot train properly.
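
A rough sketch of that planning step, assuming the optimum follows a power law in compute so that it becomes a straight line in log-log space. The loss function is a toy stand-in with invented constants, the cost rule C ≈ 6 · N · D is an assumed approximation, and the budgets, grid, and function names are all hypothetical.

```python
import math

def synthetic_loss(n_params, n_tokens):
    # Toy stand-in for a real training run; constants are invented.
    return 1.7 + 400.0 / n_params**0.34 + 410.0 / n_tokens**0.28

def best_size(flop_budget, model_sizes):
    """Lowest-loss model size on one IsoFLOP sweep (C ≈ 6*N*D assumed)."""
    return min(
        model_sizes,
        key=lambda n: synthetic_loss(n, flop_budget / (6.0 * n)),
    )

def fit_power_law(budgets, optima):
    """Least-squares line: log10(N_opt) = slope * log10(C) + intercept."""
    xs = [math.log10(c) for c in budgets]
    ys = [math.log10(n) for n in optima]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

sizes = [10 ** (7 + 0.05 * i) for i in range(101)]  # 1e7 .. 1e12 parameters
budgets = [1e19, 1e20, 1e21, 1e22]  # small budgets that are cheap to sweep
optima = [best_size(c, sizes) for c in budgets]
slope, intercept = fit_power_law(budgets, optima)

# Extrapolate the fitted line to a larger, hypothetical target budget.
target = 1e23
n_target = 10 ** (slope * math.log10(target) + intercept)
d_target = target / (6.0 * n_target)
print(f"exponent {slope:.2f}: plan ~{n_target:.2e} params, ~{d_target:.2e} tokens")
```

Extrapolating beyond the swept budgets is only as good as the power-law assumption, so the sweeps should reach as close to the real budget as is affordable.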

Key idea

Compute-optimal training fixes a FLOP budget and finds the model size and token count at the bottom of the IsoFLOP loss curve, avoiding models too big for their data.

Check yourself

1. What shape is the loss versus model size curve at fixed compute?

2. In the IsoFLOP method, what is held constant across configurations?