Comparing generated text to references
When a model writes text, we often compare it to one or more human references. BLEU and ROUGE measure how much the generated text overlaps with those references using shared word sequences called n-grams.
BLEU for translation
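BLEU is precision-oriented. It counts how many n-grams in the candidate also appear in the references, for several n-gram lengths (typically 1 through 4), clips each count at the most times that n-gram occurs in any single reference, then combines the precisions with a geometric mean.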
- A brevity penalty discourages very short outputs that would otherwise game precision.
- BLEU rewards saying what the references say without adding junk.
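The BLEU recipe described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace-tokenized input, scores a single sentence, and applies no smoothing, so any n-gram order with zero matches drives the score to 0. The function names (`ngrams`, `modified_precision`, `brevity_penalty`, `bleu`) are made up for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped precision: a candidate n-gram is credited at most as many
    times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total

def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # assumption: no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

reference = "the cat sat on the mat".split()
print(modified_precision("the the the the".split(), [reference], 1))  # 0.5
```

Clipping is what stops a candidate from scoring well by repeating one common word: "the" appears four times in the candidate but only twice in the reference, so only two of the four occurrences are credited.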
ROUGE for summarization
ROUGE is recall-oriented. It measures how many n-grams from the reference also appear in the candidate.
- ROUGE-N counts overlapping n-grams.
- ROUGE-L uses the longest common subsequence to reward shared ordering.
- ROUGE rewards covering the content of the reference.
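The two ROUGE variants listed above can be sketched as follows. This is a simplified illustration that assumes whitespace-tokenized input and a single reference, and it reports recall only (ROUGE tooling commonly reports precision and an F-score as well). The function names (`ngram_counts`, `rouge_n_recall`, `lcs_length`, `rouge_l_recall`) are made up for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref_counts = ngram_counts(reference, n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    cand_counts = ngram_counts(candidate, n)
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=2))  # 0.6
print(lcs_length(candidate, reference))           # 5
```

In this example one swapped word breaks two of the five reference bigrams, while the longest common subsequence ("the cat on the mat") still covers five of the six reference tokens, which is how ROUGE-L rewards shared ordering without requiring contiguous matches.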
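Both metrics are cheap to compute but shallow. Because they match surface tokens, they miss paraphrases and meaning: "the film was great" scores poorly against the reference "the movie was excellent" even though the two say the same thing. They therefore pair best with human judgment.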
Key idea
BLEU leans on precision for translation while ROUGE leans on recall for summarization; both measure n-gram overlap with references. They are fast but blind to paraphrase, so treat them as proxies, not ground truth.