A measure of surprise
Perplexity scores how well a model predicts text. It is the exponential of the average negative log-likelihood per token, computed over the tokens that actually appear. Lower perplexity means the model assigned higher probability to those tokens; in other words, it was less surprised by the text.
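The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perplexity` and the example probabilities are made up for the sketch, and it assumes you already have the probability the model assigned to each observed token.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, using the natural log.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating turns the average back into an "effective number of choices".
    return math.exp(avg_nll)

# Hypothetical per-token probabilities for the same four-token text.
confident = [0.5, 0.5, 0.5, 0.5]   # model gave each actual token probability 0.5
surprised = [0.1, 0.1, 0.1, 0.1]   # model gave each actual token probability 0.1

print(perplexity(confident))  # approximately 2
print(perplexity(surprised))  # approximately 10
```

A model that assigns probability 1/k to every observed token has perplexity k, which is why perplexity is often read as the number of equally likely options the model was choosing between.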
Why it endures
- It needs only a corpus, no human labels.
- It is cheap to compute during and after training.
- It correlates with fluency for models trained the same way.
For pretraining, falling perplexity on held-out text is a reliable signal that the model is learning the language distribution better.
Where it misleads
Perplexity depends on the tokenizer. Because it is averaged per token, two models with different vocabularies cannot be compared directly: they split the same text into different units, and into different numbers of units. Perplexity also rewards probability mass on plausible words, not on being correct, helpful, or truthful. A model can have low perplexity and still hallucinate facts.
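The tokenizer effect can be shown with a small sketch. All numbers and token splits here are hypothetical: the example assumes two models that assign exactly the same total probability to the same text and differ only in how many tokens they split it into.

```python
import math

def perplexity_per_unit(total_log_prob, n_units):
    """exp of the average negative log-likelihood over n_units (natural log)."""
    return math.exp(-total_log_prob / n_units)

text = "unbelievable results"
# Assumption for illustration: both models give the whole text probability 1e-6.
total_log_prob = math.log(1e-6)

# Hypothetical splits of the same text by two different tokenizers.
tokens_a = ["un", "believ", "able", " results"]  # 4 subword tokens
tokens_b = ["unbelievable", " results"]          # 2 word-level tokens

ppl_a = perplexity_per_unit(total_log_prob, len(tokens_a))
ppl_b = perplexity_per_unit(total_log_prob, len(tokens_b))
print(f"per-token perplexity, tokenizer A: {ppl_a:.1f}")
print(f"per-token perplexity, tokenizer B: {ppl_b:.1f}")

# Normalizing by a unit both models share (characters here) removes the gap.
ppl_char = perplexity_per_unit(total_log_prob, len(text))
print(f"per-character perplexity, either model: {ppl_char:.2f}")
```

The two models are equally good at predicting the text, yet the per-token numbers differ by a large factor purely because of the token count in the denominator. The per-character figure is included only to show that the gap comes from the choice of unit.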
How to use it well
Treat perplexity as an internal training thermometer on a fixed tokenizer and corpus. For comparing finished models on usefulness, switch to task benchmarks and human judgment.
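As a rough sketch of the thermometer idea, the snippet below converts logged held-out losses into perplexities for successive checkpoints. The step numbers and loss values are invented for illustration, and the sketch assumes the logged loss is mean cross-entropy per token in nats, measured on the same corpus with the same tokenizer at every step.

```python
import math

# Hypothetical held-out losses (mean cross-entropy per token, in nats) by training step.
heldout_loss_by_step = {1000: 4.2, 2000: 3.6, 3000: 3.3, 4000: 3.3}

def to_perplexity(mean_nll_nats):
    """Perplexity is the exponential of the mean negative log-likelihood in nats."""
    return math.exp(mean_nll_nats)

ppl_by_step = {step: to_perplexity(loss) for step, loss in heldout_loss_by_step.items()}

for step in sorted(ppl_by_step):
    print(f"step {step}: held-out perplexity {ppl_by_step[step]:.1f}")
```

Because the tokenizer and corpus are held fixed, the readings are comparable with one another; a plateau like the one between the last two made-up checkpoints suggests the model has stopped improving on this corpus, not that it has become more useful.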
Key idea
Perplexity is a cheap, label-free measure of predictive surprise that tracks fluency within one tokenizer, but it cannot be compared across vocabularies and never guarantees factual correctness.