The coverage guarantee
A byte-level tokenizer treats text as its underlying bytes (typically its UTF-8 encoding) rather than as characters. Because there are only 256 possible byte values and all of them sit in the base vocabulary, every input can be represented.
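As a minimal sketch of that guarantee, assume a tokenizer with no merges at all, where each token id is simply a byte value. The function names here are illustrative and do not come from any particular library.

```python
BASE_VOCAB_SIZE = 256  # one id per possible byte value


def encode_bytes(text: str) -> list[int]:
    """Map text to token ids, one id per UTF-8 byte (ids 0..255)."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes by reassembling the bytes and decoding as UTF-8."""
    return bytes(ids).decode("utf-8")


sample = "naive, na\u00efve, \u65e5\u672c\u8a9e, \U0001F642"
ids = encode_bytes(sample)

# Every id falls inside the base vocabulary, so nothing is "unknown".
assert all(0 <= i < BASE_VOCAB_SIZE for i in ids)
# The mapping is lossless: decoding returns the original text.
assert decode_bytes(ids) == sample
```

A real byte-level BPE would then learn merges on top of these 256 base ids; the base ids are what make the coverage unconditional.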
Two ways it shows up
- A pure byte-level BPE, where merges are learned over raw bytes from the start.
- A byte fallback, where a mostly subword vocabulary drops to individual bytes only for pieces it cannot otherwise cover.
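The second variant can be sketched with a toy vocabulary. Everything below is an assumption made for illustration: the tiny `VOCAB`, the greedy longest-match strategy, and the `<0xNN>` spelling of byte tokens. Real tokenizers differ in how they segment text and name their byte pieces.

```python
# Hypothetical toy subword vocabulary, for illustration only.
VOCAB = {"the", "token", "izer", " "}


def byte_token(b: int) -> str:
    """Spell a single byte as a token string such as '<0xC3>'."""
    return f"<0x{b:02X}>"


def tokenize_with_fallback(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Greedy longest-match segmentation that falls back to UTF-8 bytes."""
    tokens: list[str] = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No subword covers this character: emit its raw bytes instead.
            tokens.extend(byte_token(b) for b in text[i].encode("utf-8"))
            i += 1
    return tokens


print(tokenize_with_fallback("the tokenizer"))
# ['the', ' ', 'token', 'izer']
print(tokenize_with_fallback("the \u00e9"))
# ['the', ' ', '<0xC3>', '<0xA9>']
```

Covered text costs one token per piece, while the uncovered character costs one token per byte.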
Handling Unicode cleanly
A single non-Latin character may span several bytes (up to four in UTF-8). Byte-level schemes encode it as a few byte tokens, so emoji, rare scripts, and arbitrary binary-like text all tokenize without an unknown token.
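To make the byte widths concrete, the snippet below lists the UTF-8 bytes behind a few characters. It assumes UTF-8 as the encoding and uses a hypothetical helper name, `utf8_byte_tokens`.

```python
def utf8_byte_tokens(ch: str) -> list[int]:
    """Return the byte values that a byte-level scheme would see for ch."""
    return list(ch.encode("utf-8"))


# Characters are written as escapes so the source stays ASCII-only:
# Latin 'a', e with acute accent, a CJK ideograph, and an emoji.
for ch in ["a", "\u00e9", "\u4e2d", "\U0001F642"]:
    print(f"U+{ord(ch):04X}", utf8_byte_tokens(ch))
# U+0061 [97]
# U+00E9 [195, 169]
# U+4E2D [228, 184, 173]
# U+1F642 [240, 159, 153, 130]
```

Each value is below 256, so each one maps directly onto a base-vocabulary entry.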
The cost
The guarantee is not free. Rare or non-Latin text can explode into many byte tokens, raising fertility (the number of tokens per word), cost, and the chance of splitting a character across token boundaries. The scheme trades efficiency for total coverage.
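A rough way to see the cost is to measure fertility in the worst case, where no merges apply and every byte becomes its own token. The helper names and sample strings are assumptions for illustration; a trained byte-level BPE would usually do better than this on text resembling its training data.

```python
def byte_fertility(text: str) -> float:
    """Byte tokens per whitespace-separated word, with no merges applied."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    # Spaces are counted too, since they also consume byte tokens.
    return len(text.encode("utf-8")) / len(words)


def decodes_cleanly(byte_ids: list[int]) -> bool:
    """True if the byte tokens form complete UTF-8 characters."""
    try:
        bytes(byte_ids).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(byte_fertility("hello world"))  # 5.5
# Cyrillic letters take two bytes each in UTF-8.
print(byte_fertility("\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440"))  # 9.5

# A boundary that lands inside a character leaves an undecodable fragment.
print(decodes_cleanly([0xC3, 0xA9]))  # True: both bytes of U+00E9
print(decodes_cleanly([0xC3]))        # False: only the first byte
```

The second check shows why a sequence cut mid-character, for example by truncation at a token limit, may not decode to valid text on its own.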
Key idea
Byte-level tokenization and byte fallback guarantee that any input is representable by dropping to raw bytes, at the cost of long token sequences for unusual text.