Machine Learning

BLEU And ROUGE For Text

Overlap-based scores for translation and summarization quality.

Comparing generated text to references

When a model writes text, we often compare it to one or more human-written references. BLEU and ROUGE measure how much the generated text overlaps with those references in terms of shared n-grams, which are contiguous sequences of n words.
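To make the idea of shared n-grams concrete, here is a minimal Python sketch. The helper name `ngrams`, the whitespace tokenization, and the example sentences are assumptions chosen for illustration; real evaluation toolkits tokenize more carefully.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of every contiguous n-token tuple in `tokens`."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical example sentences, split on whitespace for simplicity.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

cand_bigrams = ngrams(candidate, 2)
ref_bigrams = ngrams(reference, 2)

# Counter intersection keeps each n-gram with the smaller of its two counts,
# which is the "overlap" that BLEU and ROUGE both build on.
shared = cand_bigrams & ref_bigrams
print(sorted(shared))
```

Here three of the five bigrams on each side are shared: ("the", "cat"), ("on", "the"), and ("the", "mat").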

BLEU for translation

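BLEU is precision-oriented. For each n-gram length (typically 1 through 4), it computes the fraction of the candidate's n-grams that also appear in the references, clipping each n-gram's count at the maximum number of times it occurs in any single reference. It then combines these precisions with a geometric mean.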

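  • A brevity penalty scales the score down when the candidate is shorter than the reference, since a very short output could otherwise reach high precision by emitting only a few safe words.
  • BLEU rewards output that says what the references say without adding extraneous words.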
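The pieces described above (clipped n-gram precision, a geometric mean, and the brevity penalty) can be sketched in a few lines of Python. This is a simplified sentence-level illustration, not a reference implementation: the function names are made up for this example, tokenization is a plain whitespace split, and no smoothing is applied, so any n-gram length with zero matches drives the score to 0. BLEU as usually reported is aggregated over a whole corpus.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of contiguous n-token tuples."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against all references."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def brevity_penalty(candidate, references):
    """1.0 for candidates longer than the closest reference, else exp(1 - r/c)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)

def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU sketch: brevity penalty times geometric mean of precisions."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

reference = "the cat is on the mat".split()
# Clipping in action: "the" occurs 4 times in the candidate but only 2 count.
print(modified_precision("the the the the".split(), [reference], 1))  # 0.5
```

Without clipping, the degenerate candidate "the the the the" would get a unigram precision of 1.0. Without the brevity penalty, a two-word candidate such as "the cat" would also look perfect on unigrams and bigrams.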

ROUGE for summarization

ROUGE is recall-oriented. It measures the fraction of the reference's n-grams that appear in the candidate.

  • ROUGE-N counts overlapping n-grams.
  • ROUGE-L uses the longest common subsequence to reward words that appear in the same order, even when they are not adjacent.
  • Both variants reward covering the content of the reference.
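As a rough illustration of the two variants listed above, the following Python sketch computes the recall side of ROUGE-N and ROUGE-L for a single reference. The function names and example sentences are assumptions for this example, and tokenization is a plain whitespace split; ROUGE toolkits commonly report precision and F-scores as well.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of contiguous n-token tuples."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    ref = ngram_counts(reference, n)
    if not ref:
        return 0.0
    overlap = sum((ngram_counts(candidate, n) & ref).values())
    return overlap / sum(ref.values())

def lcs_length(a, b):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 reference unigrams
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 reference bigrams
print(rouge_l_recall(candidate, reference))     # LCS "the cat on the mat" has length 5
```

Note that the denominator is always the reference, which is what makes these scores recall-oriented: a candidate that omits reference content is penalized, while extra words in the candidate are not.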

Both metrics are cheap to compute but shallow. They match surface word forms only, so they penalize valid paraphrases and cannot tell whether the meaning is preserved. They work best alongside human judgment.
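A small Python sketch shows how surface overlap can disagree with meaning. The `unigram_overlap` helper and the three sentences are hypothetical, made up for this illustration; the helper is a bare unigram recall, not full BLEU or ROUGE.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Fraction of reference words also found in the candidate (clipped counts)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    return sum((cand & ref).values()) / sum(ref.values())

reference = "the film was excellent"
paraphrase = "that movie is great"         # same meaning, no shared words
negation = "the film was not excellent"    # opposite meaning, every reference word present

paraphrase_score = unigram_overlap(paraphrase, reference)
negation_score = unigram_overlap(negation, reference)
print(paraphrase_score, negation_score)  # 0.0 1.0
```

The faithful paraphrase scores zero while the sentence that reverses the meaning scores perfectly, which is why overlap scores need a human check.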

Key idea

BLEU leans on precision for translation while ROUGE leans on recall for summarization; both measure n-gram overlap with references. They are fast but blind to paraphrase, so treat them as proxies, not ground truth.

Check yourself

1. BLEU and ROUGE differ mainly in that BLEU emphasizes what?

2. Why does BLEU include a brevity penalty?