The SentencePiece Unigram Model

A probabilistic tokenizer that prunes a vocabulary down rather than building it up.

Top down instead of bottom up

The unigram model, most commonly used through the SentencePiece library, takes the opposite path from BPE. BPE starts from characters and grows a vocabulary by merging; the unigram model starts with a large candidate vocabulary and prunes it.

The probabilistic view

Each token has a probability, and tokens are treated as independent of one another, so the likelihood of any segmentation of a string is the product of its token probabilities. The best segmentation is the most probable one, and the Viterbi algorithm finds it without enumerating every candidate.
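The search for the most probable segmentation can be sketched in a few lines of Python. This is a minimal illustration, not SentencePiece's own code: the function name `viterbi_segment` and the toy vocabulary `toy_probs` are made up for the example, and the product of probabilities is computed as a sum of log probabilities to avoid underflow.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (pieces, log_score) for the most probable segmentation of text.

    log_probs maps each vocabulary piece to its log probability.
    """
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best log score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that best path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] > -math.inf:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    pieces.reverse()
    return pieces, best[n]

# Hypothetical toy vocabulary; the probabilities sum to 1.
toy_probs = {"h": 0.1, "u": 0.1, "g": 0.1, "s": 0.2, "hu": 0.1, "ug": 0.1, "hug": 0.3}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("hugs", toy_log_probs)
print(pieces)  # ['hug', 's']
```

Here "hug" + "s" wins with probability 0.3 × 0.2 = 0.06, while the runner-up segmentations such as "hu" + "g" + "s" reach only 0.002.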

Training by pruning

Seed a large set of candidate pieces.
Fit token probabilities with expectation-maximization (EM).
Score each piece by how much the total likelihood would fall if it were removed.
Drop the lowest-scoring pieces and repeat until the vocabulary reaches the target size.

Whitespace as a symbol

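SentencePiece treats the raw input as a stream of characters and encodes each space as a visible meta symbol, "▁" (U+2581). Because whitespace is kept inside the pieces, the text can be recovered by joining the pieces and turning the meta symbol back into spaces. The approach is language-agnostic and needs no separate pre-tokenizer.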
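The round trip can be illustrated with two small helpers. The names `escape_whitespace` and `join_pieces` are hypothetical, and the leading meta symbol added in front of the text is a simplifying assumption modeled on a dummy-prefix convention; the pieces in the example are written by hand and do not come from a trained model.

```python
META = "\u2581"  # the visible meta symbol "▁" that stands in for a space

def escape_whitespace(text):
    """Replace spaces with the meta symbol and mark the start of the text."""
    return META + text.replace(" ", META)

def join_pieces(pieces):
    """Invert the escaping: concatenate pieces and restore the spaces."""
    text = "".join(pieces).replace(META, " ")
    # Remove the single leading space introduced by the prefix symbol.
    return text[1:] if text.startswith(" ") else text

escaped = escape_whitespace("Hello world")
print(escaped)  # ▁Hello▁world

# Hand-written pieces standing in for a tokenizer's output.
example_pieces = [META + "Hello", META + "wor", "ld"]
print(join_pieces(example_pieces))  # Hello world
```

Because the space travels with the piece that follows it, no extra rule is needed to decide where spaces go when decoding.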

Key idea

The unigram model assigns probabilities to tokens, picks the most likely segmentation with Viterbi, and trains by pruning a large vocabulary down to size.
