Machine Learning

The SentencePiece Unigram Model

A probabilistic tokenizer that prunes a vocabulary down rather than building it up.

Top-down instead of bottom-up

The unigram model, most often used through the SentencePiece library, takes the opposite path from BPE. BPE starts from characters and grows a vocabulary one merge at a time; the unigram model starts with a large candidate vocabulary and prunes it down.

The probabilistic view

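Each token has a probability, and tokens are assumed to occur independently of one another. The likelihood of any segmentation of a string is therefore the product of its token probabilities. The best segmentation is the most probable one, and the Viterbi algorithm finds it with dynamic programming instead of enumerating every possible split.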
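The search for the most probable segmentation can be sketched in a few lines of Python. This is a minimal illustration, not SentencePiece's implementation: the function name viterbi_segment and the toy vocabulary with its probabilities are made up for the example. Log probabilities are summed so that long strings do not underflow.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (pieces, log_prob) of the most probable segmentation of text."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    # best[i] holds the best log-probability of any segmentation of text[:i].
    best = [-math.inf] * (n + 1)
    # back[i] holds the start index of the last piece in that segmentation.
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Hypothetical toy vocabulary; the probabilities are illustrative only.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1,
         "hu": 0.15, "ug": 0.2, "gs": 0.1, "hug": 0.3}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("hugs", log_probs)
print(pieces, math.exp(score))
```

With these numbers, "hug" + "s" scores 0.3 × 0.1 = 0.03, which beats "hu" + "gs" at 0.15 × 0.1 = 0.015 and every split into smaller pieces.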

Training by pruning

  • Seed a big set of candidate pieces.
  • Fit token probabilities with expectation maximization.
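  • Score each piece by how much the total likelihood would fall if it were removed.
  • Drop the lowest-scoring pieces, keeping single characters so every string stays segmentable, and repeat until the vocabulary reaches the target size.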

Whitespace as a symbol

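SentencePiece treats the raw input as a stream of characters and encodes each space as a visible meta symbol, "▁" (U+2581). Because word boundaries live inside the pieces themselves, decoding only has to concatenate the pieces and turn the meta symbol back into spaces, which recovers the normalized input text. No separate pre-tokenizer is needed, so the same pipeline is language-agnostic and works for languages that do not put spaces between words.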
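The round trip can be illustrated without the library. This is a minimal sketch that assumes the input contains no literal "▁" and skips SentencePiece's text normalization; the helper names and the hand-picked segmentation are hypothetical.

```python
META = "\u2581"  # "▁", the visible stand-in for a space

def encode_whitespace(text):
    """Make spaces explicit so they can live inside vocabulary pieces."""
    return text.replace(" ", META)

def decode_pieces(pieces):
    """Concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(META, " ")

text = "Hello world."
stream = encode_whitespace(text)

# A hand-picked segmentation for illustration; a trained model would
# choose its own pieces.
pieces = ["Hello", META + "wor", "ld", "."]

print(stream)
print(decode_pieces(pieces))
```

Note that the space travels with "▁wor", so no information about word boundaries is lost between tokenizing and detokenizing.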

Key idea

The unigram model assigns probabilities to tokens, picks the most likely segmentation with Viterbi, and trains by pruning a large vocabulary down to size.

Check yourself

1. How does the unigram model build its vocabulary?

2. Which algorithm finds the best segmentation under the unigram model?

3. How does SentencePiece handle whitespace?