What it measures
Perplexity measures how surprised a language model is by a piece of text. Lower perplexity means the model assigned higher probability to the words that actually appeared, so it predicted the text better.
The intuition
Think of perplexity as the effective number of equally likely choices the model felt it had at each step. If a model is perfectly confident and correct, perplexity approaches one. If it guesses uniformly over its vocabulary, perplexity equals the vocabulary size.
- It is computed from the model's probability for each true token
- It is the exponential of the average negative log probability
- Lower is better
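The three points above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perplexity`, the probability lists, and the vocabulary size of 50 are all made up for the example, and natural logarithms are assumed.

```python
import math

def perplexity(true_token_probs):
    """Exponential of the average negative log probability of the true tokens."""
    if not true_token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities a model assigned to each true token.
confident = [0.99, 0.98, 0.99, 0.97]

# Hypothetical model that guesses uniformly over a 50-token vocabulary.
vocab_size = 50
uniform = [1 / vocab_size] * 4

ppl_confident = perplexity(confident)  # close to 1
ppl_uniform = perplexity(uniform)      # equals vocab_size, up to rounding

print(ppl_confident, ppl_uniform)
```

The confident model lands just above one, and the uniform guesser lands at the vocabulary size, which matches the two extremes described in the intuition.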
Cautions
Perplexity is only comparable across models that share the same tokenizer and vocabulary, because the unit of prediction changes the number. It also measures prediction quality, not usefulness. A model can have low perplexity yet still be unhelpful or produce unsafe text, which is why task-based evaluation remains essential.
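To see why the tokenizer matters, consider a rough sketch in which two models assign the same total log probability to the same text but split it into different numbers of tokens. Every number here is invented for illustration, and `per_unit_perplexity` is a hypothetical helper, not a library function.

```python
import math

def per_unit_perplexity(total_log_prob, num_units):
    """Perplexity when the total log probability is averaged over num_units."""
    return math.exp(-total_log_prob / num_units)

# Assumption: both models give the whole text the same total log probability.
total_log_prob = -60.0  # natural log, made-up value

tokens_coarse = 20  # hypothetical tokenizer that yields 20 tokens
tokens_fine = 30    # hypothetical tokenizer that yields 30 tokens

ppl_coarse = per_unit_perplexity(total_log_prob, tokens_coarse)  # exp(3)
ppl_fine = per_unit_perplexity(total_log_prob, tokens_fine)      # exp(2)

# Averaging over a unit both models share (here, an assumed 120 characters)
# removes the discrepancy.
num_chars = 120
ppl_char_a = per_unit_perplexity(total_log_prob, num_chars)
ppl_char_b = per_unit_perplexity(total_log_prob, num_chars)

print(ppl_coarse, ppl_fine, ppl_char_a, ppl_char_b)
```

The finer tokenizer reports a lower per-token perplexity even though both models found the text equally likely overall, so the gap reflects the unit of prediction rather than prediction quality.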
Key idea
Perplexity is the exponential of the average negative log probability of the true tokens, where lower means the model predicts text better.
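Written in symbols, the key idea looks like the formula below. The notation is an assumption added for illustration: N is the number of tokens, x_i is the i-th true token, and p(x_i | x_{<i}) is the probability the model gave that token after seeing the preceding ones.

```latex
\mathrm{PPL}(x_1, \dots, x_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```

Setting every probability to 1/V for a vocabulary of size V gives a perplexity of exactly V, and setting every probability to 1 gives a perplexity of 1.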