Out-of-Vocabulary Handling

What happens when a token is not in the vocabulary, and how subwords mostly fix it.

The old problem

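Word-level tokenizers cannot represent any word that is missing from the vocabulary. The classic fix was to map every such word to a single unknown token, which threw away all information about the surprising word: two completely different unseen words became indistinguishable.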
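
A minimal sketch of that behaviour, assuming a toy four-entry vocabulary and a whitespace splitter. The names `vocab`, `UNK`, and `encode_words` are made up for illustration and do not come from any particular library.

```python
# Toy word-level vocabulary; the words and ids are illustrative assumptions.
UNK = "<unk>"
vocab = {UNK: 0, "the": 1, "cat": 2, "sat": 3}

def encode_words(text, vocab):
    """Map each whitespace-separated word to its id, or to the unknown id."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]

ids_known = encode_words("the cat sat", vocab)
ids_axolotl = encode_words("the axolotl sat", vocab)
ids_quokka = encode_words("the quokka sat", vocab)

# Both unseen words collapse to the same id, so the model cannot tell them apart.
print(ids_known, ids_axolotl, ids_quokka)
```

After encoding, the sentences about the axolotl and the quokka are identical, which is the information loss described above.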

How subwords help

Subword tokenizers shrink this problem dramatically. A word the tokenizer never saw whole can still be broken into known pieces, which preserves much of its meaning and morphology.

A rare technical term splits into familiar roots and suffixes.
A misspelling degrades into smaller known chunks.
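
Both cases can be sketched with a greedy longest-match-first splitter. This is a simplification similar in spirit to WordPiece-style matching, not the merge procedure of BPE, and the piece inventory here is a hand-picked assumption rather than a learned vocabulary.

```python
import string

def segment(word, pieces):
    """Greedy longest-match-first split of `word` into entries of `pieces`.

    Returns None if some position cannot be matched by any piece.
    """
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at i first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            return None  # no piece covers the character at position i
    return out

# Hypothetical inventory: a few multi-letter pieces plus single ASCII letters.
pieces = {"token", "iz", "ation"} | set(string.ascii_lowercase)

whole = segment("tokenization", pieces)    # splits into root and suffixes
misspelt = segment("tokenizaton", pieces)  # degrades into smaller chunks
uncovered = segment("tokén", pieces)       # 'é' is not in the inventory

print(whole, misspelt, uncovered)
```

The misspelling keeps its recognizable prefix and only the damaged tail falls apart into single letters. The last call shows the remaining gap: a character outside the inventory still cannot be encoded.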

The byte safety net

The strongest guarantee comes from working at the byte level. Any text is a sequence of bytes, and all 256 possible byte values are in the base vocabulary, so every input can be encoded and no out-of-vocabulary case remains.
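
The guarantee can be checked directly with Python's built-in UTF-8 encoding. This sketch uses raw byte values as token ids, which is an assumption made for simplicity; real byte-level tokenizers usually merge frequent byte sequences into larger tokens on top of this base.

```python
def encode_bytes(text):
    """Encode text as a list of byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))

def decode_bytes(ids):
    """Invert encode_bytes."""
    return bytes(ids).decode("utf-8")

sample = "naïve 🦎"
ids = encode_bytes(sample)

# Every id is a valid entry in a 256-symbol base vocabulary,
# and the original text is recovered exactly.
print(len(sample), len(ids), decode_bytes(ids) == sample)
```

Note that the id list is longer than the character count, because non-ASCII characters take several bytes each.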

Residual issues

Even so, very strange input expands into long byte sequences, which slows processing and wastes context. Some older models still emit the unknown token and lose the input entirely at that position, so robust pipelines prefer tokenizers with byte fallback.
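
A rough sketch of byte fallback and its cost, assuming a tiny hand-written piece set and a `<0xNN>` spelling for byte tokens. The function `encode_with_fallback` and the piece set are hypothetical, chosen only to show the length blow-up.

```python
def encode_with_fallback(text, pieces):
    """Greedy longest-match over `pieces`; a character that matches nothing
    is emitted as one token per UTF-8 byte, written like '<0xF0>'."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in pieces:
                out.append(text[i:j])
                i = j
                break
        else:
            # Fallback: never emit an unknown token, spell out the bytes instead.
            out.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return out

pieces = {"hello", " ", "world"}  # illustrative assumption, not a real vocabulary

common = encode_with_fallback("hello world", pieces)
strange = encode_with_fallback("hello 🦎", pieces)

print(len(common), common)
print(len(strange), strange)
```

Nothing is lost in the second case, but a single unfamiliar character costs four tokens where a known word costs one. That is the trade-off: full coverage in exchange for longer sequences on unusual input.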

Key idea