Machine Learning

The Chinchilla Optimal

Balancing parameters and tokens so a fixed compute budget buys the lowest loss.

The question Chinchilla answered

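Given a fixed compute budget, should you train a huge model on little data or a smaller model on much more data? The Chinchilla study (Hoffmann et al., 2022) fit scaling laws across many training runs and found that earlier large models were badly undertrained: they had far more parameters than their training data could support.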

The balanced recipe

The key result is that parameters and training tokens should grow together at roughly equal rates as compute rises.

  • A rough guide is about twenty training tokens per parameter for compute-optimal training.
  • Doubling compute means multiplying both model size and data by roughly the square root of two, since each grows as about the square root of compute.
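
The recipe above can be turned into a small calculator. This is a minimal sketch, not the study's fitting procedure: it assumes the common approximation that training costs about 6 FLOPs per parameter per token (an assumption not stated in this lesson), and it takes the twenty tokens per parameter guide at face value. The function name compute_optimal is made up for illustration.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumed approximation: training FLOPs ~= 6 * params * tokens
TOKENS_PER_PARAM = 20      # rough guide from the lesson text


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend compute_flops at the 20:1 ratio."""
    # C = 6 * N * D and D = 20 * N, so C = 120 * N**2.
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    # Each doubling of the budget scales both columns by about sqrt(2).
    for budget in (1e21, 2e21, 4e21):
        n, d = compute_optimal(budget)
        print(f"C={budget:.0e}  params={n:.3e}  tokens={d:.3e}")
```

Because both outputs depend on the square root of the budget, quadrupling compute doubles the model size and doubles the token count.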

Why bigger was not better

  • A giant model starved of data wastes capacity it never learns to use.
  • A smaller well fed model reaches lower loss for the same FLOPs.
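  • Chinchilla (70 billion parameters, about 1.4 trillion tokens) outperformed Gopher, a predecessor four times its size, using roughly the same training compute by rebalancing toward data.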
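
To see the rebalancing in numbers, the sketch below compares Chinchilla with its larger predecessor, Gopher, using their publicly reported sizes (about 280 billion parameters on 300 billion tokens for Gopher, about 70 billion parameters on 1.4 trillion tokens for Chinchilla). The 6 FLOPs per parameter per token cost is an assumed approximation, and the helper names are illustrative.

```python
# Approximate publicly reported sizes: (parameters, training tokens).
MODELS = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def training_flops(params: float, tokens: float) -> float:
    """Estimate training compute with the assumed 6 * N * D approximation."""
    return 6 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(
            f"{name:<10}  FLOPs ~ {training_flops(n, d):.2e}  "
            f"tokens/param ~ {tokens_per_param(n, d):.1f}"
        )
```

Under this approximation the two budgets land within about 15 percent of each other, while the tokens per parameter ratio differs by a factor of nearly twenty.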

A caveat for deployment

Compute-optimal training minimizes training cost only. If a model will serve billions of queries, you may deliberately train a smaller model on more data than the recipe calls for. That spends extra training compute to reach a given loss, but every later query is cheaper because the model is smaller.
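
A rough way to reason about this trade is to add inference cost to training cost. The sketch below assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per served token; both are common approximations, not figures from this lesson. The two configurations are hypothetical, and the code simply assumes they reach similar quality, which it does not verify.

```python
def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus inference FLOPs under the assumed 6ND and 2N-per-token costs."""
    training = 6 * params * train_tokens
    inference = 2 * params * served_tokens
    return training + inference


def break_even_served_tokens(big: tuple[float, float], small: tuple[float, float]) -> float:
    """Served tokens at which the smaller, longer-trained model becomes cheaper overall."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2 * (n_big - n_small)
    return extra_training / saving_per_token


# Hypothetical configurations: (parameters, training tokens).
BALANCED = (70e9, 1.4e12)      # about twenty tokens per parameter
OVERTRAINED = (35e9, 4.2e12)   # half the size, three times the data

if __name__ == "__main__":
    for served in (0, 1e12, 1e13, 1e14):
        a = lifetime_flops(*BALANCED, served)
        b = lifetime_flops(*OVERTRAINED, served)
        winner = "balanced" if a < b else "overtrained"
        print(f"served={served:.0e}  balanced={a:.2e}  overtrained={b:.2e}  cheaper={winner}")
    print(f"break-even ~ {break_even_served_tokens(BALANCED, OVERTRAINED):.2e} served tokens")
```

With no serving traffic the balanced model is cheaper; once enough tokens are served, the smaller model's lower cost per token outweighs its extra training cost.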

Key idea

For a fixed training budget, scale parameters and tokens together at about twenty tokens per parameter; earlier giant models were undertrained and a balanced smaller model wins.

Check yourself

1. What did Chinchilla say about many earlier large models?

2. Roughly how many training tokens per parameter is compute-optimal?

3. Why might you ignore Chinchilla and train a smaller model on extra data?