From compression to tokenizers
Byte pair encoding or BPE began as a compression trick and now powers many model tokenizers. It learns a vocabulary by repeatedly merging the most frequent adjacent pair of symbols.
Training the merges
Starting from individual characters or bytes:
- Count every adjacent pair across the corpus.
- Merge the single most frequent pair into a new symbol.
- Repeat until you reach the target vocabulary size.
Each merge is recorded as a rule. The ordered list of merge rules is the trained model.
Applying at inference
To tokenize new text you greedily apply the learned merges in the order they were learned, gluing pieces together until no more rules apply.
Why it works
Frequent words like the get merged into a single token early, while a rare word splits into several known pieces. This gives short sequences for common text and graceful fallback for anything unusual, with no unknown token needed when bytes are the base units.
Key idea
BPE builds its vocabulary by greedily merging the most frequent adjacent pairs, and tokenizes by replaying those merges.