Multilingual Tokenization

Sharing one vocabulary

A multilingual model uses a single tokenizer for every language it covers. The vocabulary is trained on a mix of languages, and the mix decides who gets efficient tokens.

Fertility and fairness

Fertility is the average number of tokens the tokenizer produces per word. Languages that were rare in the tokenizer's training mix get high fertility: their words fragment into many small pieces.

High-resource languages often get common words as single tokens.
Low-resource languages and non-Latin scripts often split into characters or even individual bytes.
More tokens per word means higher cost and a shorter usable context for those users.
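
As a rough illustration of the definition above, the sketch below measures fertility with a toy tokenizer. The vocabulary, the `toy_tokenize` function, and the sample sentences are all hypothetical; a real tokenizer learns subword pieces rather than whole words only. Splitting on whitespace is also an assumption and does not work for languages written without spaces.

```python
from typing import Callable, List

# Hypothetical toy vocabulary: a handful of whole English words and nothing else.
TOY_VOCAB = {"the", "cat", "sat", "on", "mat"}


def toy_tokenize(word: str) -> List[str]:
    """Return the word as one token if it is in the vocabulary,
    otherwise fall back to one token per UTF-8 byte."""
    if word in TOY_VOCAB:
        return [word]
    return [f"<0x{b:02X}>" for b in word.encode("utf-8")]


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(w)) for w in words) / len(words)


english = "the cat sat on the mat"
hindi = "बिल्ली चटाई पर बैठी"  # no word is in the toy vocabulary

print("English fertility:", fertility(english, toy_tokenize))
print("Hindi fertility:  ", fertility(hindi, toy_tokenize))
```

Every English word here is a single token, while each Hindi word falls back to several byte tokens, so the second number comes out much larger than the first.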

Balancing the mix

To be fairer, those who train the tokenizer upsample underrepresented languages in its training data so that their tokens earn places in the vocabulary. This trades some efficiency on dominant languages for broader coverage.
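
One common way to upsample is to raise each language's share of the data to a power below 1 and renormalize. The text above does not say which scheme is used, so the sketch below is only an assumption; the function name, the `alpha` value, and the corpus counts are hypothetical.

```python
from typing import Dict


def smoothed_sampling_probs(counts: Dict[str, int], alpha: float = 0.5) -> Dict[str, float]:
    """Raise each language's data share to the power alpha and renormalize.

    alpha = 1.0 keeps the natural mix; a smaller alpha shifts probability
    toward rare languages.
    """
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical sentence counts per language code.
corpus = {"en": 900_000, "hi": 90_000, "yo": 10_000}

natural = smoothed_sampling_probs(corpus, alpha=1.0)
balanced = smoothed_sampling_probs(corpus, alpha=0.5)

for lang in corpus:
    print(f"{lang}: natural={natural[lang]:.3f}  balanced={balanced[lang]:.3f}")
```

With the smaller `alpha`, the rarest language is sampled more often than its raw share and the dominant language less often, which is the trade-off described above.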

A real equity issue

Because pricing and context limits are counted in tokens, speakers of high-fertility languages pay more and fit less per request for the same meaning. It is a genuine fairness concern, not just an efficiency footnote.
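
The arithmetic behind this can be sketched in a few lines. The context size, the price, and the two fertility values below are made-up numbers chosen only to show the effect, not measurements of any real model or tokenizer.

```python
def words_that_fit(context_tokens: int, fertility: float) -> int:
    """Approximate number of words that fit in a context window."""
    return int(context_tokens / fertility)


def cost_for_words(words: int, fertility: float, price_per_1k_tokens: float) -> float:
    """Approximate cost of sending the given number of words."""
    return words * fertility * price_per_1k_tokens / 1000


CONTEXT_TOKENS = 8000   # hypothetical context limit
PRICE_PER_1K = 0.01     # hypothetical price per 1,000 tokens

# Hypothetical fertility values for two languages.
examples = {"low-fertility language": 1.25, "high-fertility language": 5.0}

for name, f in examples.items():
    fit = words_that_fit(CONTEXT_TOKENS, f)
    cost = cost_for_words(1000, f, PRICE_PER_1K)
    print(f"{name}: {fit} words fit, 1000 words cost {cost:.4f}")
```

Under these assumed numbers, the high-fertility language fits a quarter as many words into the same window and pays four times as much for the same word count.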

Key idea

A shared multilingual vocabulary gives uneven fertility across languages, so underrepresented languages cost more tokens unless the training mix is balanced.

Check yourself