The old problem
Word level tokenizers hit a wall on any word missing from the vocabulary. The classic fix was a single unknown token, which threw away all information about the surprising word.
How subwords help
Subword tokenizers shrink this problem dramatically. A word the tokenizer never saw whole can still be broken into known pieces, preserving meaning and morphology.
- A rare technical term splits into familiar roots and suffixes.
- A misspelling degrades into smaller known chunks.
The byte safety net
The strongest guarantee comes from working at the byte level. Since any text is a sequence of bytes and all bytes are in the base vocabulary, there is literally no out of vocabulary case left.
Residual issues
Even so, very strange input becomes long byte sequences, which is slow and wastes context. The unknown token still appears in some older models, so robust pipelines prefer byte fallback.
Key idea
Subword and byte level tokenization nearly eliminate out of vocabulary words by splitting the unfamiliar into known smaller pieces.