What it is for
ROUGE scores generated summaries against human reference summaries. Where BLEU leans on precision, ROUGE leans on recall, asking how much of the reference content the candidate managed to cover.
The common variants
- ROUGE-N counts the n-grams shared by the candidate and the reference, with ROUGE-1 (unigrams) and ROUGE-2 (bigrams) the usual choices.
- ROUGE-L measures the longest common subsequence, rewarding shared word order even with gaps between matches.
Because subsequences allow gaps, ROUGE-L captures sentence-level structure that fixed n-grams miss.
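The two variants can be sketched in a few lines of Python. This is a minimal illustration, not the official ROUGE toolkit: it assumes lowercase text split on whitespace, and the helper names (ngrams, rouge_n_recall, lcs_length) are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams that also appear in the candidate.

    Assumption: whitespace tokenization, no stemming or stopword removal.
    Counts are clipped, so a repeated n-gram is matched at most as many
    times as it occurs in both texts.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

reference = "the cat sat on the mat"
candidate = "the cat quietly sat on a mat"

r1 = rouge_n_recall(candidate, reference, 1)  # 5 of 6 reference unigrams matched
r2 = rouge_n_recall(candidate, reference, 2)  # 2 of 5 reference bigrams matched
lcs = lcs_length(candidate.split(), reference.split())  # "the cat sat on mat"
```

In this example the inserted words "quietly" and "a" break most bigrams, so the bigram recall drops to 2 of 5, while the longest common subsequence still spans five of the six reference words because it tolerates the gaps.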
Reading the numbers
ROUGE often reports precision, recall, and an F-measure together. For summarization the recall side is emphasized because a good summary should cover the key points that the reference captures. Recall alone, however, rewards a summary that simply copies everything; the F-measure also factors in precision, so padding lowers the score.
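A small unigram example shows why the three numbers are read together. It is a sketch under stated assumptions: whitespace tokenization, a balanced F1 as the F-measure (ROUGE implementations may weight recall differently), and a hypothetical helper named rouge_1_scores.

```python
from collections import Counter

def rouge_1_scores(candidate, reference):
    """Return (precision, recall, f1) over unigram overlap.

    Assumption: balanced F1; tokens come from a plain whitespace split.
    """
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "storm closes coastal roads"
concise = "storm closes roads"
padded = ("officials said on monday that the storm closes "
          "many coastal roads near the town")

concise_p, concise_r, concise_f = rouge_1_scores(concise, reference)
padded_p, padded_r, padded_f = rouge_1_scores(padded, reference)
```

Here the padded candidate reaches perfect recall (all four reference words appear) but only 4 of its 14 words match, so its F1 falls below that of the concise candidate, which misses one reference word but contains nothing extra.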
Cautions
Like BLEU, ROUGE measures surface overlap and cannot judge whether a summary is faithful or readable. A summary that paraphrases the reference well can score lower than a clumsy copy, so ROUGE is best paired with human review.
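The paraphrase problem is easy to reproduce with made-up sentences. The snippet below is an illustrative sketch only: the texts are invented, tokens come from a whitespace split, and unigram_recall is a hypothetical helper rather than a library function.

```python
from collections import Counter

def unigram_recall(candidate, reference):
    """Share of reference words (with clipped counts) found in the candidate."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

reference = "the company reported higher profits this quarter"
paraphrase = "the firm announced increased earnings for the period"
scrambled = "the company profits higher reported quarter this this"

paraphrase_recall = unigram_recall(paraphrase, reference)  # only "the" matches
scrambled_recall = unigram_recall(scrambled, reference)    # every reference word matches
```

The fluent paraphrase shares a single word with the reference, while the scrambled copy matches all seven, so a unigram score ranks the unreadable candidate far higher.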
Key idea
ROUGE measures recall-oriented overlap of n-grams and longest common subsequences against reference summaries, capturing coverage but not faithfulness or readability.