Byte Pair Encoding

From compression to tokenizers

Byte pair encoding or BPE began as a compression trick and now powers many model tokenizers. It learns a vocabulary by repeatedly merging the most frequent adjacent pair of symbols.

Training the merges

Starting from individual characters or bytes:

Count every adjacent pair across the corpus.
Merge the single most frequent pair into a new symbol.
Repeat until you reach the target vocabulary size.

Each merge is recorded as a rule. The ordered list of merge rules is the trained model.

Applying at inference

To tokenize new text you greedily apply the learned merges in the order they were learned, gluing pieces together until no more rules apply.

Why it works

Frequent words like the get merged into a single token early, while a rare word splits into several known pieces. This gives short sequences for common text and graceful fallback for anything unusual, with no unknown token needed when bytes are the base units.

Key idea