The problem
Models cannot read raw text. They need a fixed vocabulary of pieces called tokens. Using whole words leaves no way to handle rare or unseen words, while using single characters makes sequences very long.
The subword middle ground
Byte pair encoding, often called BPE, finds a balance. It starts from individual characters and repeatedly merges the most frequent adjacent pair into a new token.
- Begin with a vocabulary of single characters
- Count all adjacent pairs in the corpus
- Merge the most common pair into one new symbol
- Repeat until the vocabulary reaches a target size
Common words become single tokens while rare words split into reusable pieces. The word tokenization might become token plus ization.
Why it works
BPE keeps the vocabulary small yet can represent any string, since the worst case falls back to characters. This handles typos, new words, and many languages with one shared scheme.
Key idea
BPE builds a subword vocabulary by greedily merging frequent character pairs, balancing coverage and sequence length.