Machine Learning

Out-of-Vocabulary Handling

What happens when a token is not in the vocabulary, and how subwords mostly fix it.

The old problem

Word-level tokenizers hit a wall on any word missing from the vocabulary. The classic fix was to map every such word to a single unknown token (often written <unk>), which threw away all information about the surprising word.
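To make the information loss concrete, here is a minimal sketch of a word-level tokenizer. The vocabulary, the <unk> spelling, and the function name word_tokenize are all assumptions made for illustration, not taken from any particular library.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of words.
UNK = "<unk>"
VOCAB = {"the", "model", "reads", "text", UNK}


def word_tokenize(text, vocab=VOCAB):
    """Split on whitespace and replace every word missing from vocab with UNK."""
    return [word if word in vocab else UNK for word in text.lower().split()]


tokens = word_tokenize("The model reads hieroglyphics")
print(tokens)  # ['the', 'model', 'reads', '<unk>']
```

Any two unseen words collapse to the same token, so downstream the model cannot tell them apart.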

How subwords help

Subword tokenizers shrink this problem dramatically. A word the tokenizer never saw whole can still be broken into known pieces, preserving much of its meaning and morphology.

  • A rare technical term splits into familiar roots and suffixes.
  • A misspelling degrades into smaller known chunks.
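Both bullets can be sketched with a greedy longest-match splitter. This is a deliberate simplification: production tokenizers learn their pieces from data (for example with BPE or a unigram model), and the SUBWORDS set and the single-character fallback below are assumptions chosen for the example.

```python
# Hypothetical subword vocabulary, hand-picked for this example.
SUBWORDS = {"token", "iz", "ation", "un", "break", "able"}


def subword_split(word, vocab=SUBWORDS):
    """Greedily take the longest known piece at each position.

    If no known piece starts at the current position, emit a single
    character so the loop always makes progress.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # single-character fallback
            i += 1
    return pieces


print(subword_split("tokenization"))  # ['token', 'iz', 'ation']
print(subword_split("unbreakable"))   # ['un', 'break', 'able']
print(subword_split("tokenizaton"))   # ['token', 'iz', 'a', 't', 'o', 'n']
```

The misspelled word keeps its recognizable prefix and only the damaged tail breaks into small chunks.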

The byte safety net

The strongest guarantee comes from working at the byte level. Any text can be encoded as a sequence of bytes (for example with UTF-8), and all 256 possible byte values are in the base vocabulary, so there is no out-of-vocabulary case left.
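A short sketch of that guarantee, assuming UTF-8 as the encoding and treating each byte value directly as a token id. The function names are hypothetical.

```python
def byte_tokenize(text):
    """Encode text as UTF-8 and use each byte value (0..255) as a token id."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids):
    """Invert byte_tokenize: rebuild the original string from byte ids."""
    return bytes(ids).decode("utf-8")


sample = "na\u00efve \U0001F642"  # accented letter plus an emoji
ids = byte_tokenize(sample)

# Every id fits the 256-entry base vocabulary, and the text round-trips exactly.
print(all(0 <= i < 256 for i in ids))
print(byte_detokenize(ids) == sample)
```

Real byte-level tokenizers usually learn merges on top of these 256 base tokens, so common text does not stay as one token per byte.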

Residual issues

Even so, very strange input falls back to long byte sequences, which are slow to process and waste context. The unknown token still appears in some older models, so robust pipelines prefer byte fallback.
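The cost of falling back to bytes can be sketched as follows. The KNOWN vocabulary, the pre-split input pieces, and the <0xNN> spelling of byte tokens are assumptions for illustration; actual tokenizers differ in how they represent fallback bytes.

```python
# Hypothetical vocabulary of whole pieces the tokenizer already knows.
KNOWN = {"hello", " ", "world"}


def encode_with_byte_fallback(pieces, vocab=KNOWN):
    """Keep known pieces as single tokens; expand unknown ones into byte tokens."""
    tokens = []
    for piece in pieces:
        if piece in vocab:
            tokens.append(piece)
        else:
            # One token per UTF-8 byte, so nothing is ever mapped to <unk>.
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens


tokens = encode_with_byte_fallback(["hello", " ", "\U0001F642"])
print(tokens)  # ['hello', ' ', '<0xF0>', '<0x9F>', '<0x99>', '<0x82>']
```

Here a single emoji costs four tokens while each known piece costs one, which is the context overhead described above.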

Key idea

Subword and byte-level tokenization nearly eliminate out-of-vocabulary words by splitting the unfamiliar into known smaller pieces.

Check yourself

1. How do subword tokenizers reduce the out-of-vocabulary problem?

2. Why does byte-level tokenization eliminate out-of-vocabulary cases entirely?