Measuring surprise
Perplexity evaluates how well a language model predicts a text. Intuitively, it measures how surprised the model is by the tokens that actually come next. A lower perplexity means the model assigned higher probability to what really came next.
From cross entropy to perplexity
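Perplexity is the exponential of the average per-token (loosely, per-word) cross entropy, that is, of the average negative log-probability the model assigned to each actual next token. The exponential must use the same base as the logarithm: e for nats, 2 for bits. Because of that link, it can be read as an effective branching factor.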
- A perplexity of one means perfect prediction with no surprise.
- A perplexity of fifty means the model is as uncertain as choosing uniformly among fifty options at each step.
- Lower perplexity means the model concentrates more of its probability on the tokens that actually occur.
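The relationship above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perplexity` and the input `token_probs` (the probability the model gave to each token that actually occurred) are assumptions made for the example, and natural logarithms are used throughout.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each actual next token.

    Assumes every probability is strictly positive; a zero probability
    would make the perplexity infinite (math.log raises ValueError here).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability: the cross entropy in nats per token.
    avg_cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate with the same base as the logarithm (e).
    return math.exp(avg_cross_entropy)

# Perfect prediction: every actual token received probability 1.
certain = perplexity([1.0, 1.0, 1.0])

# Uniform guess among fifty options at every step.
uniform_fifty = perplexity([1 / 50] * 10)

# Mixed case: the geometric mean of the probabilities is 1/4.
mixed = perplexity([0.5, 0.25, 0.125])
```

Up to floating-point rounding, `certain` comes out as 1, `uniform_fifty` as 50, and `mixed` as 4, matching the branching-factor reading: perplexity is the reciprocal of the geometric mean of the assigned probabilities.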
Cautions
- Perplexity depends on the tokenization, so scores are only directly comparable across models that share a tokenizer and vocabulary and are evaluated on the same text.
- It rewards fluency and probability, not factual correctness.
- A model can have low perplexity yet still produce confident falsehoods.
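The tokenization caution can be made concrete with a small sketch. All numbers below are made up for illustration, and the helper `normalized_perplexity` is a hypothetical name: it assumes two models assign the same total log-probability to the same sentence but split it into different numbers of tokens. Normalizing by a tokenizer-independent unit such as words is one common workaround, shown here as an assumption rather than a prescribed method.

```python
import math

def normalized_perplexity(total_log_prob, num_units):
    """exp of the negative log-probability averaged over num_units.

    num_units may count tokens, words, or characters; total_log_prob is
    the natural-log probability of the whole text.
    """
    return math.exp(-total_log_prob / num_units)

# Hypothetical values for one sentence scored by two models.
total_log_prob = -24.0  # same total log-probability under both models
tokens_a = 12           # tokenizer A splits the sentence into 12 tokens
tokens_b = 8            # tokenizer B splits the same sentence into 8 tokens
num_words = 6           # word count does not depend on the tokenizer

ppl_a = normalized_perplexity(total_log_prob, tokens_a)        # exp(2)
ppl_b = normalized_perplexity(total_log_prob, tokens_b)        # exp(3)
ppl_per_word = normalized_perplexity(total_log_prob, num_words)  # exp(4)
```

Although both models find the sentence equally likely, the per-token perplexity of model A looks lower simply because its tokenizer produces more, smaller tokens. The per-word figure is the same for both, which is why a shared normalization unit is needed before comparing scores.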
Key idea
Perplexity is the exponential of average cross entropy, an effective branching factor showing how surprised a model is by real text. Lower is better, but it measures fluency and probability, not truth.