Subword Tokenization Revisited

Choosing the unit of text is a quiet but crucial decision. Splitting on whole words creates a huge vocabulary and leaves you helpless against any word not seen in training (the out-of-vocabulary problem). Splitting on characters avoids that but makes sequences very long and strips away word-level meaning.

Subword tokenization finds a middle path. Frequent words stay whole, while rare words break into smaller reusable pieces. The word "tokenization" might split into "token" and "ization", parts that recur across many words.

A popular method is byte pair encoding. It starts from characters and repeatedly merges the most frequent adjacent pair into a new unit, building a vocabulary of common chunks until a target size is reached.
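The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name learn_bpe, the toy corpus, and the tie-breaking rule are assumptions made for the example, and num_merges stands in for the target vocabulary size (the final vocabulary is the starting characters plus one new unit per merge). Real implementations usually add pre-tokenization and end-of-word or byte-level handling, which are omitted here.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of single-character symbols,
    # mapped to how often that word occurs in the corpus.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically
        # smallest pair (an arbitrary choice to keep the result deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        # Rewrite every word, replacing each occurrence of the pair
        # with one merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
        merges.append(best)
    return merges


corpus = ["low", "low", "low", "lower", "lowest"]
print(learn_bpe(corpus, 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

On this toy corpus the frequent word "low" becomes a single unit after two merges, while the rarer "lower" and "lowest" remain split into "low" plus smaller pieces.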

These advantages explain why nearly every modern language model uses some form of subword tokenization:

No out-of-vocabulary words, since any string falls back to known pieces (provided the base vocabulary covers every character or byte)
A bounded vocabulary, which keeps the embedding table a manageable size
Graceful handling of morphology, typos, and new words
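To make the embedding-table point concrete, here is a small back-of-the-envelope calculation. The vocabulary sizes and the embedding width are illustrative assumptions, not figures from any particular model; the only fact used is that an embedding table holds one vector per vocabulary entry.

```python
def embedding_params(vocab_size, dim):
    """Number of parameters in an embedding table: one dim-sized vector per entry."""
    return vocab_size * dim


# Hypothetical sizes chosen only for illustration.
subword = embedding_params(50_000, 768)        # a bounded subword vocabulary
word_level = embedding_params(1_000_000, 768)  # an open-ended word vocabulary
print(f"{subword:,} vs {word_level:,}")
# 38,400,000 vs 768,000,000
```

Under these assumed numbers the word-level table is twenty times larger, and it would still miss any word outside its list.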

The tradeoff is that a word may become several tokens, so token counts differ from word counts, which matters for context limits and cost. Even so, subwords are the default unit for transformers.
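The gap between word counts and token counts is easy to see with a toy segmenter. The sketch below uses greedy longest-match against a small hand-written vocabulary, which is a simplification chosen for brevity rather than how byte pair encoding applies its merges; the VOCAB set and the segment function are hypothetical names for this example. Any span not covered by the vocabulary falls back to single characters.

```python
def segment(word, vocab):
    """Split word into the longest pieces found in vocab, scanning left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining candidate first. A single character
        # is always accepted, so the loop is guaranteed to make progress.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"the", "token", "ization", "of", "text"}

sentence = "the tokenization of text"
words = sentence.split()
tokens = [piece for word in words for piece in segment(word, VOCAB)]

print(tokens)                    # ['the', 'token', 'ization', 'of', 'text']
print(len(words), len(tokens))   # 4 5
print(segment("xyz", VOCAB))     # ['x', 'y', 'z']
```

Four words become five tokens here, and the unseen string "xyz" costs three tokens, one per character. Rare or unusual text therefore consumes more of a context window than its word count suggests.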

Key idea

Subword tokenization splits rare words into reusable pieces, eliminating out-of-vocabulary words while keeping the vocabulary bounded.
