Machine Learning

Perplexity Revisited

Why the classic language model metric still matters and where it quietly misleads.

5 min read · intro

A measure of surprise

Perplexity scores how well a model predicts text. It is the exponential of the average negative log-likelihood per token, where each token's likelihood is the probability the model assigned to it. Lower perplexity means the model was less surprised: it assigned higher probability to the tokens that actually appeared.
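
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than any library's implementation: the function name perplexity and the example probabilities are made up, and natural logarithms are assumed.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the observed tokens.

    token_probs is a hypothetical input: one probability per token that
    actually appeared in the text, each in the range (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, in nats.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate to turn the average back into a perplexity.
    return math.exp(avg_nll)

# Made-up probabilities for a four-token text.
confident = perplexity([0.5, 0.5, 0.5, 0.5])   # about 2
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])   # about 10
```

One way to read the result: a perplexity of about 2 means the model was, on average, as unsure as if it were choosing between two equally likely tokens at each step.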

Why it endures

  • It needs only a corpus, no human labels.
  • It is cheap to compute during and after training.
  • It correlates with fluency for models trained the same way.

During pretraining, a falling perplexity is a reliable signal that the model is learning the language distribution better.

Where it misleads

Perplexity depends on the tokenizer. Two models with different vocabularies cannot be compared directly because they split the same text into different units, so the per-token average is taken over a different number of tokens. Perplexity also rewards putting probability mass on plausible words, not being correct, helpful, or truthful. A model can have low perplexity and still hallucinate facts.
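
The tokenizer effect can be seen with a small worked example. The numbers are hypothetical, chosen only to make the arithmetic clear: two models assign the same total negative log-likelihood to the same sentence, but their tokenizers split it into different numbers of tokens.

```python
import math

def per_token_perplexity(total_nll, num_tokens):
    """Perplexity given a total negative log-likelihood (nats) and a token count."""
    return math.exp(total_nll / num_tokens)

# Hypothetical: both models assign the same total NLL to the same sentence.
total_nll = 12.0

tokens_a = 6  # tokenizer A splits the sentence into 6 tokens
tokens_b = 4  # tokenizer B splits the same sentence into 4 tokens

ppl_a = per_token_perplexity(total_nll, tokens_a)  # exp(2.0)
ppl_b = per_token_perplexity(total_nll, tokens_b)  # exp(3.0)

# Same predictive quality on the sentence as a whole, different perplexities,
# purely because the average is taken over a different number of units.
gap_is_tokenizer_only = ppl_a != ppl_b
```

In this sketch model A looks better on per-token perplexity even though both models gave the full sentence exactly the same probability.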

How to use it well

Treat perplexity as an internal training thermometer on a fixed tokenizer and corpus. For comparing finished models on usefulness, switch to task benchmarks and human judgment.
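
As a rough sketch of the thermometer idea: if a training run logs the mean cross-entropy loss in nats on a fixed evaluation corpus, perplexity is the exponential of that loss. The checkpoint losses below are invented for illustration, and the helper name is an assumption, not part of any framework.

```python
import math

def loss_to_perplexity(mean_cross_entropy_nats):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)

# Hypothetical evaluation losses at successive checkpoints,
# all measured with the same tokenizer on the same corpus.
eval_losses = [4.0, 3.2, 2.9, 2.8]
perplexities = [loss_to_perplexity(loss) for loss in eval_losses]

# The reading is only meaningful as a trend within this one setup.
is_improving = all(
    later < earlier
    for earlier, later in zip(perplexities, perplexities[1:])
)
```

If the loss were logged in bits rather than nats, the conversion would use a base of 2 instead of e.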

Key idea

Perplexity is a cheap, label-free measure of predictive surprise that tracks fluency within one tokenizer, but it cannot be compared across vocabularies and never guarantees factual correctness.

Check yourself

1. Why can perplexity not be compared directly across two different models?

2. What does lower perplexity directly indicate?