Machine Learning

Vocabulary Size Tradeoffs

Why picking a vocabulary size is a balancing act with no free lunch.

One of the most consequential knobs

Vocabulary size sets how many distinct tokens the tokenizer can produce. That choice ripples through model size, speed, and quality, so it deserves deliberate thought.

Pushing the size up

A larger vocabulary means:

  • Shorter sequences, since each token covers more text.
  • Faster inference per character, because the same text takes fewer decoding steps (though each step pays for a larger output layer).
  • A bigger embedding matrix and output layer, costing memory and parameters.
  • More rare tokens, each seen less often during training and so left with weaker representations.
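
To put rough numbers on the embedding cost, here is a minimal sketch. It assumes the input embedding and the output layer are each a vocab_size × d_model matrix, optionally sharing weights (tied). The function names, the hidden width of 1024, and the 2 bytes per parameter are illustrative assumptions, not values from this lesson.

```python
def embedding_parameter_count(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the input embedding plus the output projection.

    With tied weights the two share one vocab_size x d_model matrix;
    untied, each has its own matrix.
    """
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model


def memory_megabytes(param_count: int, bytes_per_param: int = 2) -> float:
    """Storage for the parameters, assuming 2 bytes each (16-bit floats)."""
    return param_count * bytes_per_param / 1e6


D_MODEL = 1024  # hypothetical hidden width, chosen only for illustration

for vocab_size in (32_000, 256_000):
    params = embedding_parameter_count(vocab_size, D_MODEL, tied=False)
    print(f"{vocab_size:>7} tokens -> {params:,} params, "
          f"{memory_megabytes(params):,.0f} MB")
```

The cost grows linearly with vocabulary size, so going from 32,000 to 256,000 tokens multiplies these matrices by eight regardless of the hidden width.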

Pushing the size down

A smaller vocabulary means:

  • Longer sequences that eat into the context window and slow generation.
  • A leaner embedding table.
  • Better-trained tokens, since each is seen more often.

Common ranges

Modern large models often land between roughly thirty thousand and two hundred fifty thousand tokens, trending larger as multilingual coverage grows. The sweet spot depends on languages covered, model scale, and how much you care about sequence length versus parameter budget.
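
Why model scale matters can be sketched with a back-of-the-envelope estimate. It assumes a standard transformer whose non-embedding parameters are roughly 12 × n_layers × d_model² (attention plus a 4× feed-forward block), with tied input and output embeddings. The two model shapes below are hypothetical, picked only to contrast a small and a large model.

```python
def embedding_share(vocab_size: int, d_model: int, n_layers: int, tied: bool = True) -> float:
    """Fraction of total parameters spent on the vocabulary matrices.

    Assumes roughly 12 * n_layers * d_model**2 non-embedding parameters,
    a common approximation for standard transformer blocks.
    """
    embed = (1 if tied else 2) * vocab_size * d_model
    body = 12 * n_layers * d_model ** 2
    return embed / (embed + body)


# Hypothetical model shapes: (label, n_layers, d_model)
SHAPES = (("small", 12, 768), ("large", 32, 4096))

for label, n_layers, d_model in SHAPES:
    for vocab_size in (32_000, 256_000):
        share = embedding_share(vocab_size, d_model, n_layers)
        print(f"{label} model, {vocab_size:>7} tokens: "
              f"{share:.0%} of parameters in embeddings")
```

Under these assumptions a large vocabulary dominates the parameter budget of the small model but stays a minor share of the large one, which is one reason bigger models can afford bigger vocabularies.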

Key idea

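Vocabulary size trades sequence length against embedding cost and per-token training quality: a larger vocabulary shortens sequences but enlarges the embedding table and leaves rare tokens undertrained. The right value depends on languages, scale, and budget.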

Check yourself

1. A larger vocabulary tends to produce

2. A risk of a very large vocabulary is