What it compares
BLEU scores a generated translation (the candidate) against one or more human reference translations. It rewards matching short sequences of words, called n-grams, between the candidate and the references.
How it works
BLEU computes modified precision for n-grams of several lengths, typically one through four words. The "modified" part caps how often a matched n-gram can count: each n-gram is credited at most as many times as it appears in any single reference, so repeating a correct word does not inflate the score.
- Higher-order n-grams reward fluent word order, not just vocabulary.
- A geometric mean combines the precisions across lengths.
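The capping and the geometric mean can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `ngrams`, `modified_precision`, and `geometric_mean` are hypothetical, tokens are assumed to be whitespace-split words, the weights across lengths are assumed uniform, and no smoothing is applied.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # The cap for each n-gram is its highest count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def geometric_mean(precisions):
    """Uniformly weighted geometric mean; zero if any precision is zero."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Repeating a correct word is capped by the reference count of that word.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split()]
p1 = modified_precision(candidate, references, 1)  # 2 of 7 tokens are credited
```

Without the cap, every "the" in the candidate would count as a match and the unigram precision would be 7/7; with it, only the two occurrences present in the reference are credited.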
The brevity penalty
Precision alone could be gamed by producing a very short output that contains only the words the system is most sure of. BLEU adds a brevity penalty that lowers the score when the candidate is shorter than the reference, discouraging clipped outputs.
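As a rough sketch, the penalty in its commonly used form is 1 when the candidate is at least as long as the reference and exp(1 - r/c) otherwise, where c and r are the candidate and reference lengths. The function name `brevity_penalty` is hypothetical, and the sketch assumes a single reference length has already been chosen (with several references, implementations typically pick the one closest in length to the candidate).

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Return a multiplier in [0, 1] that penalizes short candidates."""
    if candidate_len == 0:
        return 0.0  # an empty output gets no credit
    if candidate_len >= reference_len:
        return 1.0  # no penalty for outputs at least as long as the reference
    # Shorter outputs are penalized exponentially in the length ratio.
    return math.exp(1.0 - reference_len / candidate_len)

half_length = brevity_penalty(5, 10)   # exp(-1), about 0.37
full_length = brevity_penalty(10, 10)  # 1.0
```

The penalty only ever lowers the score; outputs longer than the reference are already discouraged by precision, since extra words tend to be unmatched.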
Its limits
BLEU correlates with human judgments of quality when averaged over many sentences, but it is unreliable for any single sentence. It cannot see paraphrases, synonyms, or meaning, so a correct translation using different words may score low. It is best read as a relative metric for comparing systems, not an absolute judgment of fluency.
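The paraphrase problem is easy to reproduce with unigrams alone. The sketch below uses a hypothetical helper, `clipped_unigram_precision`, and made-up example sentences; it covers only the one-word case against a single reference, not full BLEU.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate words credited after capping by reference counts."""
    if not candidate:
        return 0.0
    cand = Counter(candidate)
    ref = Counter(reference)
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / len(candidate)

reference = "the cat sat on the mat".split()
literal = "the cat sat on the mat".split()
paraphrase = "a feline rested upon the rug".split()

literal_score = clipped_unigram_precision(literal, reference)        # 1.0
paraphrase_score = clipped_unigram_precision(paraphrase, reference)  # 1/6
```

The paraphrase conveys the same meaning, yet only "the" overlaps with the reference, so it is credited for one word in six. Adding more reference translations reduces this effect but does not remove it.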
Key idea
BLEU combines capped n-gram precision against reference translations with a brevity penalty. It works well in aggregate but misses paraphrase and meaning at the sentence level.