Byte Pair Encoding Tokenization

The problem

Models cannot read raw text. They need a fixed vocabulary of pieces called tokens. Using whole words leaves no way to handle rare or unseen words, while using single characters makes sequences very long.

The subword middle ground

Byte pair encoding, often called BPE, finds a balance. It starts from individual characters and repeatedly merges the most frequent adjacent pair into a new token.

Begin with a vocabulary of single characters
Count all adjacent pairs in the corpus
Merge the most common pair into one new symbol
Repeat until the vocabulary reaches a target size

Common words become single tokens while rare words split into reusable pieces. The word tokenization might become token plus ization.

Why it works

BPE keeps the vocabulary small yet can represent any string, since the worst case falls back to characters. This handles typos, new words, and many languages with one shared scheme.

Key idea

BPE builds a subword vocabulary by greedily merging frequent character pairs, balancing coverage and sequence length.

Byte Pair Encoding Tokenization

The problem

The subword middle ground

Why it works

Key idea

Check yourself