Perplexity For Language Models

Measuring surprise

Perplexity evaluates how well a language model predicts a text. Intuitively, it measures how surprised the model is by the words that actually come next. Lower perplexity means the model assigned higher probability to what really came next.

From cross entropy to perplexity

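Perplexity is the exponential of the average per-token cross entropy, that is, of the mean negative log-probability the model assigns to each token that actually occurs. The exponential must use the same base as the logarithm (e for nats, 2 for bits). Because of that link, perplexity can be read as an effective branching factor.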
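Written out, the definition looks like the following. The symbols are assumptions introduced here for illustration: N is the number of tokens in the evaluated text, x_i is the i-th token, and p(x_i | x_{<i}) is the probability the model gives that token given the preceding ones.

```latex
\mathrm{PPL}(x_{1:N}) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \ln p(x_i \mid x_{<i}) \right)
```

If the model assigns probability 1/k to every token, the sum collapses and the perplexity is exactly k, which is where the branching-factor reading comes from.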

A perplexity of one means perfect prediction with no surprise.
A perplexity of fifty means the model is, on average, as uncertain as if it were choosing uniformly among fifty options at each step.
Lower perplexity means the model concentrates more of its probability on the tokens that actually occur.
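The readings above can be checked with a minimal sketch. It assumes the per-token probabilities the model assigned to the actual next tokens are already available as a plain list; the function name `perplexity` and the sample values are hypothetical, not part of any particular library.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to each actual next token.

    token_probs is a hypothetical input: one probability in (0, 1] per token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood in nats: the cross entropy
    # measured against the tokens that really occurred.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate with the matching base (e) to get perplexity.
    return math.exp(avg_nll)


perfect = perplexity([1.0] * 10)           # every token predicted with certainty
uniform_fifty = perplexity([1 / 50] * 10)  # uniform guess among fifty options
mixed = perplexity([0.5, 0.25])            # geometric mean of 2 and 4
```

Here `perfect` comes out as one and `uniform_fifty` as fifty, up to floating-point rounding, matching the two cases described above. The mixed case shows that perplexity is the geometric mean of the inverse probabilities, so a few badly predicted tokens pull it up sharply.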

Cautions

Perplexity depends on the tokenization, so scores are directly comparable only between models that share a vocabulary and are evaluated on the same text.
It rewards fluency and probability, not factual correctness.
A model can have low perplexity yet still produce confident falsehoods.
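The tokenization caution can be made concrete with a small sketch. The numbers are made up for illustration: two hypothetical models assign the same total log-probability to the same text, but one tokenizer splits it into 12 tokens and the other into 20.

```python
import math


def per_unit_perplexity(total_log_prob, n_units):
    """Perplexity when the total natural-log probability is averaged over n_units."""
    return math.exp(-total_log_prob / n_units)


# Hypothetical: both models give the whole text the same log-probability (in nats).
total_log_prob = -60.0

ppl_coarse = per_unit_perplexity(total_log_prob, 12)  # tokenizer A: 12 tokens
ppl_fine = per_unit_perplexity(total_log_prob, 20)    # tokenizer B: 20 tokens

# Normalizing by a tokenizer-independent unit (here, an assumed 10 words)
# gives both models the same score.
ppl_per_word = per_unit_perplexity(total_log_prob, 10)
```

Both models predict the text equally well as a whole, yet the finer tokenizer reports a much lower per-token perplexity simply because the same surprise is spread over more steps. Normalizing by words, characters, or bytes is one common way to put such models on a shared scale.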

Key idea

Perplexity is the exponential of average cross entropy, an effective branching factor showing how surprised a model is by real text. Lower is better, but it measures fluency and probability, not truth.
