Perplexity

What it measures

Perplexity measures how surprised a language model is by a piece of text. A lower perplexity means the model assigned higher probability to the actual words, so it predicted better.

The intuition

Think of perplexity as the effective number of equally likely choices the model felt it had at each step. If a model is perfectly confident and correct, perplexity approaches one. If it guesses uniformly over its vocabulary, perplexity equals the vocabulary size.
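
The two limiting cases above can be checked numerically. This is a minimal sketch, not a reference implementation: the helper name `perplexity`, the vocabulary size of 50, and the probability 0.999 are made-up values chosen only for illustration.

```python
import math

def perplexity(true_token_probs):
    """Exponential of the average negative log probability of the true tokens."""
    avg_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_nll)

vocab_size = 50  # hypothetical vocabulary size

# Uniform guessing: every true token gets probability 1 / vocab_size.
uniform = perplexity([1 / vocab_size] * 10)

# Near-certain and correct: every true token gets probability close to 1.
confident = perplexity([0.999] * 10)

print(uniform, confident)
```

The uniform case comes out at the vocabulary size, and the confident case sits just above one.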

It is computed from the probability the model assigned to each true token.
It is the exponential of the average negative log probability of those tokens.
Lower is better.
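
The three points above can be sketched as three steps in code. The function name `perplexity_from_distributions` and the toy distributions are assumptions for illustration. A real model would produce a distribution over its whole vocabulary at each step.

```python
import math

def perplexity_from_distributions(step_distributions, true_tokens):
    """step_distributions: one dict (token -> probability) per step.
    true_tokens: the token that actually occurred at each step."""
    if len(step_distributions) != len(true_tokens):
        raise ValueError("need exactly one distribution per true token")
    # Step 1: look up the probability assigned to each true token.
    true_probs = [dist[tok] for dist, tok in zip(step_distributions, true_tokens)]
    # Step 2: average the negative log probabilities.
    avg_nll = -sum(math.log(p) for p in true_probs) / len(true_probs)
    # Step 3: exponentiate the average.
    return math.exp(avg_nll)

# Toy two-step example with made-up probabilities.
steps = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
]
ppl = perplexity_from_distributions(steps, ["the", "cat"])
print(ppl)
```

Here the true tokens received probabilities 0.5 and 0.25, which correspond to 2 and 4 equally likely choices. The result is their geometric mean, the square root of 8.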

Cautions

Perplexity is only comparable across models that share the same tokenizer and vocabulary, since the unit of prediction changes the number. It also measures prediction quality, not usefulness. A model can have low perplexity yet still be unhelpful or produce unsafe text, which is why task-based evaluation remains essential.
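
The tokenizer caution can be illustrated with a small sketch. It assumes two hypothetical models that assign the same total log probability to the same text but split it into different numbers of tokens. All numbers are invented for the example.

```python
import math

# Assumption: both models give the whole text the same total log probability.
total_log_prob = -12.0

# Hypothetical token counts for the same text under two tokenizers.
tokens_a = 4  # coarser tokenizer, fewer and longer tokens
tokens_b = 6  # finer tokenizer, more and shorter tokens

ppl_a = math.exp(-total_log_prob / tokens_a)
ppl_b = math.exp(-total_log_prob / tokens_b)

print(ppl_a, ppl_b)
```

Model B reports the lower perplexity even though, under this assumption, both models found the text equally likely. The difference comes only from how many tokens the average is taken over.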

Key idea

Perplexity is the exponential of the average negative log probability of the true tokens, where lower means the model predicts text better.
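
Written as a formula, the key idea looks like the following. The symbols are notation assumed here rather than taken from the text: N is the number of tokens, x_i is the i-th true token, and p(x_i | x_{<i}) is the probability the model assigned to it given the preceding tokens.

```latex
\[
\mathrm{PPL}(x_{1:N}) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right)
\]
```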
