The question Chinchilla answered
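Given a fixed compute budget, should you train a huge model on little data or a smaller model on much more data? The Chinchilla study (Hoffmann et al., 2022) fit scaling laws across many training runs and found that earlier large models were badly undertrained: they had too many parameters for the number of tokens they were trained on.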
The balanced recipe
The key result is that parameters and training tokens should grow together at roughly equal rates as compute rises.
- A rough guide is about twenty training tokens per parameter for compute-optimal training.
- Doubling compute means multiplying both model size and data by roughly the square root of two (about 1.41), since compute scales with their product.
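The recipe above can be sketched as a small calculation. This is an illustration under stated assumptions, not the study's fitting procedure: it assumes the common rule of thumb that training compute is about 6 FLOPs per parameter per token (C ≈ 6 × N × D), which is not stated in these notes, and it uses the twenty-tokens-per-parameter guide as an exact ratio. The function name `compute_optimal_split` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: training compute C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # rough guide from the notes, treated as exact here


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend compute_flops at a 20:1 ratio."""
    # C = 6 * N * D with D = 20 * N  =>  N = sqrt(C / (6 * 20))
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Each doubling of the budget scales both N and D by sqrt(2).
    for budget in (1e21, 2e21, 4e21):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.1e} FLOPs -> N={n:.3e} params, D={d:.3e} tokens")
```

Because both quantities grow with the square root of compute, quadrupling the budget doubles the model size and doubles the token count.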
Why bigger was not better
- A giant model starved of data wastes capacity it never learns to use.
- A smaller, well-fed model reaches lower loss for the same training FLOPs.
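- Chinchilla (70B parameters) outperformed its much larger predecessor Gopher (280B parameters) at the same compute budget by rebalancing: a quarter of the parameters, trained on more than four times as many tokens.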
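To make the rebalancing concrete, the sketch below compares the two runs using rounded published figures: Gopher at 280B parameters and about 300B tokens, Chinchilla at 70B parameters and about 1.4T tokens. The compute estimate again assumes the C ≈ 6 × N × D rule of thumb, so the FLOP totals are approximate rather than the authors' own accounting. The helper names are hypothetical.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens), rounded published figures
RUNS = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def summarize(runs):
    """Return {name: (tokens_per_param, approx_train_flops)}."""
    return {name: (d / n, train_flops(n, d)) for name, (n, d) in runs.items()}


if __name__ == "__main__":
    for name, (ratio, flops) in summarize(RUNS).items():
        print(f"{name}: {ratio:.1f} tokens/param, ~{flops:.2e} training FLOPs")
```

Under this approximation the two budgets land within about 20 percent of each other, while the tokens-per-parameter ratio differs by more than an order of magnitude.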
A caveat for deployment
Compute-optimal refers to training cost only. If a model will serve billions of queries, you may deliberately train a smaller model on more data than the compute-optimal ratio calls for. That spends more training compute to reach a given loss, but every later query is cheaper to serve.
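The trade-off can be sketched as a break-even calculation. Everything here is an assumption for illustration: training is costed at about 6 FLOPs per parameter per token, inference at about 2 FLOPs per parameter per generated or processed token, and the two configurations are hypothetical (the smaller one is not claimed to reach the same loss as the larger one).

```python
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: training ≈ 6 * N * D
INFER_FLOPS_PER_PARAM_TOKEN = 2.0  # assumption: inference ≈ 2 * N per token


def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training compute plus inference compute over the model's deployment."""
    training = TRAIN_FLOPS_PER_PARAM_TOKEN * n_params * train_tokens
    inference = INFER_FLOPS_PER_PARAM_TOKEN * n_params * served_tokens
    return training + inference


def break_even_served_tokens(small: tuple[float, float], large: tuple[float, float]) -> float:
    """Served-token count above which the smaller model is cheaper overall.

    small, large: (n_params, train_tokens); small must have fewer parameters.
    """
    extra_training = lifetime_flops(*small, 0.0) - lifetime_flops(*large, 0.0)
    saving_per_token = INFER_FLOPS_PER_PARAM_TOKEN * (large[0] - small[0])
    return extra_training / saving_per_token


if __name__ == "__main__":
    large = (70e9, 1.4e12)  # hypothetical compute-optimal run (20 tokens/param)
    small = (35e9, 4.0e12)  # hypothetical smaller model trained on extra data
    tokens = break_even_served_tokens(small, large)
    print(f"Smaller model is cheaper overall beyond ~{tokens:.2e} served tokens")
```

The design point is that training cost is paid once while inference cost scales with usage, so the larger the expected serving volume, the more extra training a smaller model can justify.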
Key idea
For a fixed training budget, scale parameters and tokens together at about twenty tokens per parameter; earlier giant models were undertrained, and a smaller, balanced model reaches lower loss for the same compute.