The single most important knob
Vocabulary size sets how many distinct tokens exist. It ripples through model size, speed, and quality, so it deserves deliberate thought.
Pushing the size up
A larger vocabulary means:
- Shorter sequences, since more text fits per token.
- Faster inference per character, because generating the same text takes fewer decoding steps.
- A bigger embedding matrix and output layer, costing memory and parameters.
- More rare tokens, each seen less often during training and therefore learning a weaker representation.
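To put rough numbers on the embedding cost, here is a minimal sketch. It assumes the input embedding and the output layer are each a matrix of shape vocabulary size by hidden width, with biases ignored. The function name `embedding_params`, the `tied` flag, and the hidden width of 4096 are illustrative assumptions, not values taken from any particular model.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Count parameters in the input embedding plus the output projection.

    Assumes each is a vocab_size x d_model matrix (biases ignored). With
    tied=True the two share one matrix, so it is counted once.
    """
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix


if __name__ == "__main__":
    d_model = 4096  # hypothetical hidden width, chosen only for illustration
    for vocab_size in (32_000, 256_000):
        untied = embedding_params(vocab_size, d_model)
        tied = embedding_params(vocab_size, d_model, tied=True)
        print(f"vocab={vocab_size:>7,}  untied={untied:>13,}  tied={tied:>13,}")
```

Under these assumptions, going from 32,000 to 256,000 tokens multiplies the cost by eight, from about 131 million to about 1.05 billion parameters per matrix.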
Pushing the size down
A smaller vocabulary means:
- Longer sequences that eat context and slow generation.
- A leaner embedding table.
- Better-trained tokens, since each is seen more often.
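The two extremes can be compared on a toy string. The sketch below is only an illustration: it treats single characters as a stand-in for a small vocabulary and whitespace-separated words as a stand-in for a larger one, and the helper `token_stats` and the sample text are made up for this example.

```python
from collections import Counter


def token_stats(tokens):
    """Return (sequence length, vocabulary size, mean occurrences per vocabulary entry)."""
    counts = Counter(tokens)
    return len(tokens), len(counts), len(tokens) / len(counts)


text = "the cat sat on the mat the cat ate"  # toy corpus, purely illustrative

char_stats = token_stats(list(text))    # small vocabulary: one token per character
word_stats = token_stats(text.split())  # larger vocabulary: one token per word

print("characters:", char_stats)
print("words:     ", word_stats)
```

On this string the character tokens form a sequence of 34 over 10 distinct symbols, while the word tokens form a sequence of 9 over 6 distinct entries. Each character is therefore seen more often on average than each word, at the price of a much longer sequence. Real subword vocabularies sit between these two extremes.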
Common ranges
Modern large models often land between roughly thirty thousand and two hundred fifty thousand tokens, trending larger as multilingual coverage grows. The sweet spot depends on languages covered, model scale, and how much you care about sequence length versus parameter budget.
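One way to see why model scale matters is to ask what fraction of the parameter budget the embeddings consume. The sketch below assumes tied input and output embeddings and treats the total parameter count as fixed. The function `embedding_share` and both configurations are hypothetical, chosen only to contrast a small model with a large one.

```python
def embedding_share(vocab_size: int, d_model: int, total_params: int) -> float:
    """Fraction of total_params used by one vocab_size x d_model embedding matrix.

    Assumes tied input and output embeddings, so the matrix is counted once.
    """
    return vocab_size * d_model / total_params


# Hypothetical configurations: (name, hidden width, total parameter count).
configs = [("small", 2048, 1_000_000_000), ("large", 8192, 70_000_000_000)]

for name, d_model, total in configs:
    for vocab_size in (32_000, 128_000, 256_000):
        share = embedding_share(vocab_size, d_model, total)
        print(f"{name:>5}  vocab={vocab_size:>7,}  embedding share={share:.1%}")
```

With these made-up numbers, a 128,000-token vocabulary takes roughly a quarter of the small model's budget but under two percent of the large model's, which is one reason larger models can afford larger vocabularies.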
Key idea
Vocabulary size trades sequence length against embedding cost and per-token training quality, and the right value depends on languages, scale, and budget.