Perplexity Revisited

Why the classic language model metric still matters and where it quietly misleads.

A measure of surprise

Perplexity scores how well a model predicts text. It is the exponential of the average negative log-likelihood per token, where each token's likelihood is the probability the model assigned to the token that actually appeared. Lower perplexity means the model was less surprised: it assigned higher probability to the tokens that actually appeared.
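The definition above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: the function name `perplexity` and the probability values are made up for the example, and natural logarithms are assumed.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the average negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to
    the tokens that actually appeared (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Illustrative probabilities for a four-token text (invented values).
confident = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
unsure = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

ppl_confident = perplexity(confident)  # about 2
ppl_unsure = perplexity(unsure)        # about 10
```

A perplexity of about 2 can be read as the model being, on average, as uncertain as if it were choosing uniformly between two tokens; about 10 corresponds to ten equally likely options.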

Why it endures

It needs only a corpus, no human labels.
It is cheap to compute during and after training.
It correlates with fluency for models trained the same way.

During pretraining, falling perplexity on held-out text is a reliable signal that the model is learning the language distribution better.

Where it misleads

Perplexity depends on the tokenizer. Per-token perplexities from two models with different vocabularies cannot be compared directly, because the models split the same text into different units and therefore average over different token counts. Perplexity also rewards probability mass on plausible words, not on being correct, helpful, or truthful. A model can have low perplexity and still hallucinate facts.
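The tokenizer effect is easy to see with arithmetic. The sketch below uses invented numbers: two hypothetical models assign the same total log-probability to the same 60-character text but split it into different numbers of tokens. The helper `perplexity_per_unit` is illustrative only; normalizing by characters is one common workaround, not the only one.

```python
import math

def perplexity_per_unit(total_log_prob, n_units):
    """Perplexity when the total log-probability is averaged over n_units."""
    return math.exp(-total_log_prob / n_units)

# Invented values: both models give the whole text the same probability.
total_log_prob = -30.0  # natural log
n_chars = 60            # length of the text in characters
tokens_a = 10           # model A: coarse vocabulary, fewer tokens
tokens_b = 20           # model B: finer vocabulary, more tokens

ppl_a = perplexity_per_unit(total_log_prob, tokens_a)  # exp(3.0)
ppl_b = perplexity_per_unit(total_log_prob, tokens_b)  # exp(1.5)

# Normalizing by a tokenizer-independent unit removes the discrepancy.
char_ppl = perplexity_per_unit(total_log_prob, n_chars)  # exp(0.5) for both
```

Model B looks far better per token even though both models assigned the text exactly the same probability. Averaging over characters, bytes, or words puts them on a shared scale.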

How to use it well

Treat perplexity as an internal training thermometer on a fixed tokenizer and corpus. For comparing finished models on usefulness, switch to task benchmarks and human judgment.

Key idea