Machine Learning

Byte-Level Fallback

How working at the byte level guarantees any input can be tokenized.

5 min read · advanced

The coverage guarantee

A byte-level tokenizer treats text as its underlying bytes (typically the UTF-8 encoding) rather than as characters. Because there are only 256 possible byte values and all of them sit in the base vocabulary, no input can ever be unrepresentable.
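The guarantee can be checked directly. The following is a minimal sketch, assuming UTF-8 as the encoding and using each byte value as its own token id; the function names `byte_tokenize` and `byte_detokenize` are hypothetical and do not come from any particular library.

```python
BASE_VOCAB_SIZE = 256  # one token id per possible byte value


def byte_tokenize(text: str) -> list[int]:
    """Encode text as UTF-8 and use each byte value (0..255) as a token id."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize by reassembling the bytes and decoding them."""
    return bytes(ids).decode("utf-8")


# Plain ASCII, accented Latin, CJK, an emoji, and control characters.
samples = ["hello", "na\u00efve", "\u65e5\u672c\u8a9e", "\U0001F642", "\x00\x7f"]
for s in samples:
    ids = byte_tokenize(s)
    # Every id falls inside the base vocabulary, so nothing is "unknown".
    assert all(0 <= i < BASE_VOCAB_SIZE for i in ids)
    # The mapping is lossless: decoding the ids recovers the input.
    assert byte_detokenize(ids) == s
```

A real byte-level BPE would learn merges on top of these 256 base ids; the sketch shows only the base layer that makes the coverage guarantee hold.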

Two ways it shows up

  • A pure byte-level BPE, where merges are learned over raw bytes from the start.
  • A byte fallback, where a mostly subword vocabulary drops to individual byte tokens only for pieces it cannot otherwise cover.
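The second variant can be sketched as a toy tokenizer. This is an illustration under stated assumptions, not how any specific library implements it: the vocabulary `SUBWORDS` is made up, the greedy longest-match strategy stands in for a real subword algorithm, and the `<0xNN>` spelling of byte tokens is just one possible naming convention.

```python
# Toy, hypothetical subword vocabulary (a real one would be learned from data).
SUBWORDS = {"the", "cat", "sat", " ", "s", "a", "t"}
MAX_PIECE_LEN = max(len(p) for p in SUBWORDS)


def tokenize_with_fallback(text: str) -> list[str]:
    """Greedy longest-match subword tokenization with a per-byte fallback."""
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(MAX_PIECE_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in SUBWORDS:
                tokens.append(piece)
                i += length
                break
        else:
            # No subword covers this character: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


print(tokenize_with_fallback("the cat"))
print(tokenize_with_fallback("cat\U0001F642"))  # emoji is not in SUBWORDS
```

Covered text stays compact as subwords, while the emoji, which no subword matches, is spelled out as its four UTF-8 bytes instead of collapsing to an unknown token.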

Handling Unicode cleanly

A single non-Latin character may span several bytes in UTF-8. Byte-level schemes encode it as a few byte tokens, so emoji, rare scripts, and arbitrary binary-like text all tokenize without an unknown token.
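To make the byte widths concrete, here is a small sketch, assuming UTF-8, that lists the byte tokens for characters of increasing width. The helper name `utf8_byte_tokens` is hypothetical.

```python
def utf8_byte_tokens(ch: str) -> list[str]:
    """Return the UTF-8 bytes of a character as hex-labelled byte tokens."""
    return [f"0x{b:02X}" for b in ch.encode("utf-8")]


# "A", "é", "中", and the emoji U+1F642, written as escapes for portability.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F642"]:
    tokens = utf8_byte_tokens(ch)
    print(f"U+{ord(ch):04X} -> {len(tokens)} byte token(s): {tokens}")

# Arbitrary binary data needs no decoding at all: each byte is already a
# valid token id, even when the sequence is not valid UTF-8.
raw = bytes([0xFF, 0x00, 0x80])
raw_ids = list(raw)
print(raw_ids)
```

The four sample characters take one, two, three, and four byte tokens respectively, and the raw byte string maps straight to ids with no unknown token involved.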

The cost

The guarantee is not free. Rare or non-Latin text can explode into many byte tokens, raising fertility (the average number of tokens per word), cost, and the chance of splitting a character across token boundaries. It trades efficiency for total coverage.
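Both costs can be illustrated with a rough sketch. It assumes a pure byte tokenizer with no merges, and it uses byte tokens per character as a simple stand-in for fertility, since per-word counts depend on how words are segmented. The helper names are hypothetical.

```python
def byte_tokens_per_char(text: str) -> float:
    """Byte tokens per character under a merge-free byte tokenizer (UTF-8)."""
    return len(text.encode("utf-8")) / len(text)


def decodes_cleanly(ids: list[int]) -> bool:
    """True if the byte ids form complete, valid UTF-8 on their own."""
    try:
        bytes(ids).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sequence length: ASCII costs one token per character, CJK costs three.
print(byte_tokens_per_char("hello"))
print(byte_tokens_per_char("\u65e5\u672c\u8a9e"))

# Split characters: "é" is two bytes, so cutting a sequence between them
# leaves a fragment that cannot be decoded by itself.
ids = list("\u00e9".encode("utf-8"))
print(decodes_cleanly(ids[:1]), decodes_cleanly(ids))
```

Learned merges reduce these counts for text that is frequent in the training data, which is why the blow-up mostly affects rare or underrepresented scripts.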

Key idea

Byte-level tokenization and byte fallback guarantee any input is representable by dropping to raw bytes, at the cost of long token sequences for unusual text.

Check yourself

1. Why can byte-level tokenization represent any input?

2. What is the main cost of byte-level fallback?