Machine Learning

The WordPiece Tokenizer

BERT's likelihood-driven cousin of BPE.

4 min read · intro

A likelihood-based merge

WordPiece powers BERT and many other encoder models. Like BPE, it starts from individual characters and builds its vocabulary by repeatedly merging pairs of adjacent pieces, but it does not simply merge the most frequent pair. Instead, it merges the pair that most increases the likelihood of the training corpus under a unigram language model.

The selection score

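Concretely, it favors the pair whose merge gives the highest score: roughly the frequency of the pair divided by the product of the frequencies of its two parts. This prefers pairs that occur together more often than chance would predict, so a frequent pair built from very common pieces can lose to a rarer pair whose pieces almost always appear together.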
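The score can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: the toy corpus, its word counts, and the helper names count_pieces_and_pairs and wordpiece_scores are all made up for the example, and each word is assumed to be already split into characters with the ## continuation prefix.

```python
from collections import Counter

def count_pieces_and_pairs(words):
    """Count pieces and adjacent pairs, weighted by word frequency.

    `words` maps a tuple of pieces to how often that word occurs.
    """
    piece_counts = Counter()
    pair_counts = Counter()
    for pieces, freq in words.items():
        for piece in pieces:
            piece_counts[piece] += freq
        for left, right in zip(pieces, pieces[1:]):
            pair_counts[(left, right)] += freq
    return piece_counts, pair_counts

def wordpiece_scores(piece_counts, pair_counts):
    """Score each pair as count(pair) / (count(left) * count(right))."""
    return {
        (left, right): count / (piece_counts[left] * piece_counts[right])
        for (left, right), count in pair_counts.items()
    }

# Hypothetical toy corpus: word (as character pieces) -> frequency.
toy_words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

piece_counts, pair_counts = count_pieces_and_pairs(toy_words)
scores = wordpiece_scores(piece_counts, pair_counts)

# What a purely frequency-based rule (as in BPE) would pick.
most_frequent_pair = max(pair_counts, key=pair_counts.get)
# What the WordPiece-style score picks.
best_scoring_pair = max(scores, key=scores.get)

print(most_frequent_pair)  # ('##u', '##g')
print(best_scoring_pair)   # ('##g', '##s')
```

In this toy corpus the pair (##u, ##g) is the most frequent, but ##u appears in every word, so dividing by its count pulls the score down. The rarer pair (##g, ##s) wins because ##s never appears without ##g in front of it.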

The continuation marker

WordPiece marks pieces that continue a word with a prefix, such as a double hash (##). A split of playing might therefore become play and the continuation piece ##ing, which lets detokenization rejoin them cleanly.
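Rejoining pieces can be sketched as follows. This is a minimal illustration under the assumption that the marker is ## and that words are separated by single spaces; join_pieces is a hypothetical helper name, not a library function.

```python
def join_pieces(pieces, marker="##"):
    """Rejoin WordPiece-style pieces into a space-separated string.

    A piece that starts with `marker` is glued onto the previous piece;
    any other piece starts a new word.
    """
    words = []
    for piece in pieces:
        if piece.startswith(marker) and words:
            # Continuation piece: strip the marker and append to the last word.
            words[-1] += piece[len(marker):]
        else:
            # Word-initial piece (or a stray continuation with nothing before it).
            words.append(piece)
    return " ".join(words)

print(join_pieces(["she", "was", "play", "##ing"]))  # she was playing
```

Because the marker records where word boundaries are, no extra bookkeeping is needed to tell play followed by ##ing (one word) from play followed by ing (two words).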

Tokenizing

At inference, WordPiece tokenizes each word with greedy longest match from the front: it peels off the longest prefix that is in the vocabulary, then repeats on the remainder, marking every piece after the first as a continuation.
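The greedy longest match procedure can be sketched like this. The vocabulary is a made-up toy set, tokenize_word is a hypothetical helper name, and the fallback of mapping an unmatchable word to a single [UNK] token is an assumption chosen to mirror what BERT-style tokenizers commonly do.

```python
def tokenize_word(word, vocab, marker="##", unk_token="[UNK]"):
    """Split one word into pieces by greedy longest match from the front."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Anything after the first piece is a continuation.
                candidate = marker + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Assumed fallback: the whole word becomes one unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {"play", "##ing", "##ed", "p", "##l", "##a", "##y"}

print(tokenize_word("playing", toy_vocab))  # ['play', '##ing']
print(tokenize_word("played", toy_vocab))   # ['play', '##ed']
print(tokenize_word("plays", toy_vocab))    # ['[UNK]']
```

Note that the match is greedy: play is taken as soon as it is found, even though p, ##l, ##a, ##y would also cover those characters. For plays, nothing in the toy vocabulary covers the trailing s, so the sketch gives up on the whole word.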

Key idea

WordPiece merges by likelihood gain rather than raw frequency and uses continuation markers so pieces can be rejoined into words.

Check yourself

1. How does WordPiece choose which pair to merge?

2. What is the continuation marker used for?