The WordPiece Tokenizer

A likelihood-based merge

WordPiece powers BERT and many other encoder models. Like BPE, it starts from individual characters and builds its vocabulary through repeated merges, but it does not simply merge the most frequent pair. Instead, it merges the pair that most increases the likelihood of the training corpus under a unigram language model.

The selection score

Concretely, each candidate pair is scored as, roughly, the frequency of the pair divided by the product of the frequencies of its two parts, and the pair with the highest score is merged. This favors pairs that occur together more often than chance would predict, so two pieces are not merged merely because each is common on its own.
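The score described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name pair_scores and the toy corpus are hypothetical, words are represented as tuples of their current symbols mapped to word frequencies, and continuation markers are left out for brevity.

```python
from collections import Counter

def pair_scores(words):
    """Score each adjacent pair as count(pair) / (count(first) * count(second)).

    `words` maps a tuple of symbols (the current split of a word) to the
    number of times that word occurs in the corpus.
    """
    unit_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in words.items():
        for symbol in symbols:
            unit_counts[symbol] += freq
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += freq
    return {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

# Hypothetical toy corpus, already split into single characters.
toy_corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

scores = pair_scores(toy_corpus)
best_pair = max(scores, key=scores.get)
print(best_pair)
```

On this toy data the most frequent pair is ("u", "g"), which occurs 20 times, yet the highest-scoring pair is ("g", "s"): "u" is so common overall that every pair containing it is penalized by the denominator.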

The continuation marker

WordPiece marks pieces that continue a word with a prefix, such as the double hash ## used by BERT. A split of "playing" might therefore become play followed by the continuation piece ##ing, which lets detokenization rejoin the pieces cleanly.
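To see how the marker makes rejoining straightforward, here is a small sketch. The helper name join_pieces is hypothetical, and it assumes the marker is the ## prefix and that words are separated by single spaces.

```python
def join_pieces(pieces, marker="##"):
    """Rebuild space-separated words from WordPiece-style pieces."""
    words = []
    for piece in pieces:
        if piece.startswith(marker) and words:
            # Continuation piece: strip the marker and glue it to the previous word.
            words[-1] += piece[len(marker):]
        else:
            # Word-initial piece: start a new word.
            words.append(piece)
    return " ".join(words)

print(join_pieces(["play", "##ing", "is", "fun"]))
```

Because only continuation pieces carry the marker, a piece without it always starts a new word, so no separate record of word boundaries is needed.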

Tokenizing

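At inference, WordPiece tokenizes each word by greedy longest match from the front: it peels off the longest prefix found in the vocabulary, then repeats on the remainder, and every piece after the first carries the continuation marker. In BERT's implementation, a word that cannot be fully covered this way is mapped to a single unknown token.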
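The greedy procedure can be sketched as follows. This is an illustrative version under stated assumptions, not any library's code: the function name wordpiece_tokenize and the tiny vocabulary are hypothetical, the marker is assumed to be ##, and a word that cannot be fully matched is assumed to become a single [UNK] token, as BERT's tokenizer does.

```python
def wordpiece_tokenize(word, vocab, marker="##", unk="[UNK]"):
    """Split one word by greedy longest match from the front."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Anything after the first piece is a continuation.
                candidate = marker + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: give up on the whole word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration.
vocab = {"play", "un", "##ing", "##play", "##able"}

print(wordpiece_tokenize("playing", vocab))
print(wordpiece_tokenize("unplayable", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

Note that the same surface string can need two vocabulary entries: "play" at the start of a word and "##play" inside one are distinct pieces.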

Key idea

WordPiece merges by likelihood gain rather than raw frequency, and it uses continuation markers so that pieces can be rejoined into words.
