The ROUGE Score

What it is for

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores generated summaries against human-written reference summaries. Where BLEU leans on precision, ROUGE leans on recall, asking how much of the reference content the candidate managed to cover.

The common variants

ROUGE-N counts overlapping n-grams between candidate and reference, with ROUGE-1 (unigrams) and ROUGE-2 (bigrams) the usual choices.
ROUGE-L measures the longest common subsequence, rewarding shared word order even with gaps between matches.

Because subsequences allow gaps, ROUGE-L captures sentence-level structure that fixed-length n-grams miss.
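
The two variants above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes lowercase text split on whitespace, with no stemming or stopword handling, and the function names (ngrams, rouge_n_recall, lcs_length) are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference, candidate, n):
    """Share of reference n-grams that also appear in the candidate."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    # Counter intersection clips each n-gram to the smaller of its two counts.
    overlap = sum((ref_counts & cand_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

rouge_1 = rouge_n_recall(reference, candidate, 1)
rouge_2 = rouge_n_recall(reference, candidate, 2)
lcs = lcs_length(reference.split(), candidate.split())
print(rouge_1, rouge_2, lcs)
```

In this pair a single substituted word breaks two of the five reference bigrams, so ROUGE-2 recall drops to 3/5, while the longest common subsequence simply skips the mismatch and still spans five of the six reference words.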

Reading the numbers

ROUGE is usually reported as precision, recall, and an F-measure together. For summarization the recall side is emphasized because a good summary should retain the key points captured in the reference. Recall alone, however, can be inflated by a long candidate that copies everything; the F-measure also accounts for precision, which keeps such summaries from being rewarded.
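
A short sketch makes the trade-off concrete. It assumes whitespace tokenization and a balanced F-measure (F1); real toolkits may weight recall differently. The function name rouge_1_scores and the example sentences are invented for illustration.

```python
from collections import Counter


def rouge_1_scores(reference, candidate):
    """Return (precision, recall, F1) for unigram overlap."""
    ref = Counter(reference.split())
    cand = Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "storm closes coastal roads"
# A long candidate that contains every reference word plus filler.
padded = "a storm closes coastal roads and many schools stay shut today too"
# A short candidate that misses one reference word.
concise = "storm closes roads"

padded_scores = rouge_1_scores(reference, padded)
concise_scores = rouge_1_scores(reference, concise)
print(padded_scores)
print(concise_scores)
```

The padded candidate reaches perfect recall but only one third precision, so its F1 is 0.5. The concise candidate gives up some recall (0.75) yet scores a higher F1, which is the behavior the F-measure is meant to provide.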

Cautions

Like BLEU, ROUGE measures surface overlap and cannot judge whether a summary is faithful or readable. A summary that paraphrases the reference well can score lower than a clumsy copy, so ROUGE is best paired with human review.
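
The paraphrase problem is easy to reproduce. The sketch below assumes whitespace tokenization and a balanced unigram F1; the helper name rouge_1_f1 and all three sentences are invented for illustration.

```python
from collections import Counter


def rouge_1_f1(reference, candidate):
    """Balanced F-measure over clipped unigram overlap."""
    ref = Counter(reference.split())
    cand = Counter(candidate.split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the company reported higher profits this quarter"
# Same meaning as the reference, almost no shared words.
paraphrase = "earnings rose for the firm in the latest period"
# Shuffled and stuttering, but built from the reference's own words.
clumsy_copy = "the company higher profits reported this this quarter quarter"

paraphrase_f1 = rouge_1_f1(reference, paraphrase)
clumsy_f1 = rouge_1_f1(reference, clumsy_copy)
print(paraphrase_f1, clumsy_f1)
```

The fluent paraphrase shares only the word "the" with the reference and scores far below the garbled copy, even though a human reader would prefer it.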

Key idea

ROUGE measures recall-oriented overlap of n-grams and longest common subsequences against reference summaries, capturing coverage but not faithfulness or readability.
