Top-down instead of bottom-up
The unigram language model tokenizer, most often used through the SentencePiece library, takes the opposite path from byte pair encoding (BPE). It starts with a large candidate vocabulary and prunes it down, rather than building a vocabulary up from characters by repeated merges.
The probabilistic view
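Each token in the vocabulary has a probability, and tokens are assumed to occur independently of one another. The likelihood of a segmentation is therefore the product of its token probabilities. The best segmentation is the most probable one, and the Viterbi algorithm finds it with dynamic programming instead of enumerating every possible split.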
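As a rough illustration of the Viterbi step, here is a minimal sketch in Python. The function name `viterbi_segment` and the toy probabilities are made up for this example and do not come from any trained model. Scores are summed in log space, which is equivalent to multiplying the token probabilities.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (pieces, log_prob) of the most probable segmentation of text."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-probability for text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece in that path
    best[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] > -math.inf:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    pieces.reverse()
    return pieces, best[n]

# Hypothetical toy vocabulary; the probabilities are illustrative only.
toy_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.10,
             "hu": 0.10, "ug": 0.15, "gs": 0.20, "hug": 0.30}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("hugs", toy_log_probs)
print(pieces, math.exp(score))
```

With these toy numbers, the split into "hug" and "s" has probability 0.30 × 0.10 = 0.03, which beats "hu" plus "gs" at 0.10 × 0.20 = 0.02, so Viterbi returns the former.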
Training by pruning
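- Seed a large set of candidate pieces, for example the frequent substrings of the training corpus.
- Fit token probabilities with the expectation-maximization (EM) algorithm.
- Score each piece by how much the total likelihood would drop if it were removed.
- Drop the lowest-scoring pieces, always keeping single characters so that any string can still be segmented, and repeat from the EM step until the vocabulary reaches the target size.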
Whitespace as a symbol
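SentencePiece treats the raw input as a stream of Unicode characters and encodes each space as a visible meta symbol, ▁ (U+2581). Because spaces are kept as ordinary symbols, decoding can restore the original spacing by concatenating the pieces and mapping the meta symbol back to a space. This makes the tokenizer reversible and language-agnostic, with no separate pre-tokenizer needed.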
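The round trip can be shown without the library itself. The sketch below only mimics the whitespace convention. The helper names `to_symbols` and `from_pieces` are made up for this example, the segmentation is written by hand rather than produced by a model, and the input is assumed not to contain the meta symbol already.

```python
META = "\u2581"  # the visible meta symbol "▁" that stands in for a space

def to_symbols(text):
    """Replace every space with the meta symbol before segmentation."""
    return text.replace(" ", META)

def from_pieces(pieces):
    """Concatenate pieces and map the meta symbol back to a space."""
    return "".join(pieces).replace(META, " ")

text = "Hello world."
symbols = to_symbols(text)

# A hand-written segmentation of `symbols`, for illustration only.
pieces = ["Hello", META + "wor", "ld", "."]

print(symbols)
print(from_pieces(pieces) == text)
```

Because the space travels inside the piece "▁wor", no external rule is needed to decide where spaces go when detokenizing.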
Key idea
The unigram model assigns probabilities to tokens, picks the most likely segmentation with Viterbi, and trains by pruning a large vocabulary down to size.