Machine Learning

The BLEU Score for Text

Measuring machine translation by overlap with references.

6 min read · advanced

What it compares

BLEU scores a generated translation against one or more human reference translations. It rewards short sequences of words, called n-grams, that appear in both the candidate and the references.

How it works

BLEU computes modified precision for n-grams of several lengths, typically one through four words. The "modified" part clips each candidate n-gram's count at the largest number of times it appears in any single reference, so repeating a correct word does not inflate the score.

  • Higher-order n-grams reward fluent word order, not just vocabulary.
  • A geometric mean, usually equally weighted, combines the precisions across lengths.
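
The clipping and the geometric mean described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `ngrams`, `modified_precision`, and `geometric_mean` are hypothetical, inputs are assumed to be whitespace-tokenized word lists, and no smoothing is applied.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return every contiguous n-gram in a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n words

    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # A candidate n-gram gets credit at most max_ref_counts[gram] times.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def geometric_mean(precisions):
    """Equally weighted geometric mean; any zero precision gives zero."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

# Repeating a correct word: 7 candidate unigrams, but "the" is capped at 2.
repeated = ["the"] * 7
print(modified_precision(repeated, references, 1))  # 2/7

candidate = "the cat sat on the mat".split()
p1 = modified_precision(candidate, references, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, references, 2)  # 3 of 5 bigrams match
print(p1, p2, geometric_mean([p1, p2]))
```

Without clipping, the all-"the" candidate would score a unigram precision of 7/7, which is exactly the inflation the cap is meant to stop.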

The brevity penalty

Precision alone could be gamed by producing a very short output that emits only the words the system is most sure of. BLEU therefore multiplies the score by a brevity penalty, a factor below one whenever the candidate is shorter than the reference, which discourages clipped outputs.
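
As a rough sketch, the penalty in the standard BLEU definition is 1 when the candidate is at least as long as the reference and exp(1 - r/c) otherwise, where c is the candidate length and r the reference length in words. The function name `brevity_penalty` is hypothetical, and the sketch assumes the caller has already chosen a single reference length (with several references, the one closest in length to the candidate is a common choice).

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Multiplicative penalty for candidates shorter than the reference."""
    if candidate_len == 0:
        return 0.0  # an empty candidate earns no credit
    if candidate_len >= reference_len:
        return 1.0  # no penalty, and no bonus for extra length
    return math.exp(1 - reference_len / candidate_len)


print(brevity_penalty(6, 6))   # same length: 1.0
print(brevity_penalty(3, 6))   # half the length: exp(-1), about 0.37
print(brevity_penalty(10, 6))  # longer than the reference: 1.0
```

Longer candidates are not penalized here because extra words that do not match already lower the precision terms.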

Its limits

BLEU correlates with quality in aggregate but is weak per sentence. It cannot see paraphrases, synonyms, or meaning, so a correct translation using different words may score low. It is best read as a relative metric across systems, not an absolute judgment of fluency.
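
A toy example makes the paraphrase problem concrete. The sketch below uses a hypothetical helper, `clipped_precision`, with a single reference and made-up sentences. It compares a correct paraphrase with a near copy that reverses the meaning.

```python
from collections import Counter


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision against one reference (lists of words)."""
    # zip over n shifted copies of the list yields the n-grams as tuples.
    cand = Counter(zip(*(candidate[i:] for i in range(n))))
    ref = Counter(zip(*(reference[i:] for i in range(n))))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in cand.items()) / total


reference = "the film was excellent".split()
paraphrase = "the movie was great".split()       # same meaning, new words
wrong_meaning = "the film was terrible".split()  # opposite meaning

for name, cand in [("paraphrase", paraphrase), ("wrong meaning", wrong_meaning)]:
    scores = [clipped_precision(cand, reference, n) for n in (1, 2)]
    print(name, scores)
```

The paraphrase matches only 2 of 4 unigrams and none of its 3 bigrams, while the sentence with the opposite meaning matches 3 of 4 unigrams and 2 of 3 bigrams. Surface overlap, not meaning, decides the score.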

Key idea

BLEU measures capped n-gram precision against reference translations with a brevity penalty, working well in aggregate but missing paraphrase and meaning per sentence.
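
Written as a formula, this summary corresponds to the standard BLEU definition. The symbols are notation introduced here, not taken from the lesson: p_n is the modified precision for n-grams of length n, w_n is its weight (commonly 1/N with N = 4), c is the candidate length, and r is the reference length.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

The exponential of the weighted sum of logs is the geometric mean of the precisions, and BP is the brevity penalty.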

Check yourself

1. What does the brevity penalty prevent?

2. A key limitation of BLEU is that it