The BLEU Score for Text

What it compares

BLEU scores a generated translation against one or more human reference translations. It rewards short sequences of words, called n-grams, that appear in both the candidate and the references.

How it works

BLEU computes modified precision for n-grams of several lengths, typically one through four words. The "modified" part caps how often a matched n-gram can count at the number of times it appears in a reference, so repeating a correct word does not inflate the score.
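
The capping step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `ngrams` and `modified_precision` are hypothetical, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # The cap for each n-gram is the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Each candidate n-gram counts at most as often as its cap allows.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Assumed toy data: a candidate that repeats one correct word seven times.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7, not 7/7
```

Without the cap, every "the" would match and unigram precision would be 1.0. With it, only two of the seven count, because the reference contains "the" twice.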

Higher-order n-grams reward fluent word order, not just vocabulary. A geometric mean combines the precisions across lengths.
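
As a rough sketch of the combining step, the snippet below takes a list of per-length precisions and returns their geometric mean. Equal weighting across lengths is an assumption here, and `geometric_mean` is a hypothetical helper name.

```python
import math

def geometric_mean(precisions):
    """Equal-weight geometric mean; returns 0.0 if any precision is zero."""
    if not precisions or min(precisions) <= 0:
        return 0.0
    # Average the logs, then exponentiate.
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Assumed example precisions for 1-grams through 4-grams.
print(geometric_mean([0.8, 0.6, 0.4, 0.2]))
print(geometric_mean([0.9, 0.5, 0.0, 0.0]))  # 0.0
```

Because the mean is geometric, a single zero precision drives the combined value to zero. Short sentences often have no matching 4-gram, so this can happen easily when scoring one sentence at a time.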

The brevity penalty

Precision alone could be gamed by a very short output that emits only the words the system is sure of. BLEU adds a brevity penalty that lowers the score when the candidate is shorter than the reference, discouraging truncated outputs.
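
The penalty is commonly written as 1 when the candidate is at least as long as the reference, and exp(1 - r/c) otherwise, where c is the candidate length and r the reference length. The sketch below assumes that form; `brevity_penalty` is a hypothetical name, and with several references the choice of which reference length to use is left to the caller.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Return a multiplier in [0, 1] that shrinks as the candidate gets shorter."""
    if candidate_len <= 0:
        return 0.0  # An empty candidate gets no credit.
    if candidate_len >= reference_len:
        return 1.0  # No penalty for outputs as long as the reference or longer.
    return math.exp(1.0 - reference_len / candidate_len)

print(brevity_penalty(10, 10))  # 1.0
print(brevity_penalty(5, 10))   # exp(-1), roughly 0.37
```

The multiplier is applied to the combined precision, so a candidate half the length of its reference keeps only about a third of its precision score.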

Its limits

BLEU correlates with human judgments of quality when averaged over many sentences, but it is unreliable for any single sentence. It cannot see paraphrases, synonyms, or meaning, so a correct translation that uses different words may score low. It is best read as a relative metric for comparing systems, not as an absolute judgment of fluency.
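
A toy case makes the paraphrase problem concrete. The sentences below are invented for illustration, and `clipped_precision` is a hypothetical single-reference helper that assumes whitespace tokenization.

```python
from collections import Counter

def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in cand.items()) / total

reference = "the film was excellent".split()
paraphrase = "the movie was great".split()  # same meaning, different words

print(clipped_precision(paraphrase, reference, 1))  # 0.5
print(clipped_precision(paraphrase, reference, 2))  # 0.0
```

The paraphrase is a perfectly good rendering, yet only half its words match and none of its word pairs do. Additional references reduce this effect but do not remove it.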

Key idea

BLEU measures capped n-gram precision against reference translations, combined with a brevity penalty. It works well in aggregate but misses paraphrase and meaning at the sentence level.
