The Retrieval Evaluation Metrics

Why measure retrieval alone

In a retrieval-augmented system, the generator can only work with what retrieval supplies. If the right passage never reaches the context, no amount of clever generation can recover the answer. So the retrieval stage must be measured on its own, separately from end-to-end answer quality.

The core metrics

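Recall at k: what fraction of the relevant passages appear in the top k results. When a query has a single relevant passage, this reduces to whether that passage appeared in the top k.
Precision at k: what fraction of the top k results are relevant.
Mean reciprocal rank (MRR): the reciprocal of the rank of the first relevant result, averaged over queries. A first hit at rank 1 scores 1; a first hit at rank 4 scores 0.25.
Normalized discounted cumulative gain (NDCG): uses graded relevance and discounts each result by its rank, so relevant items near the top count more. The score is normalized by the best possible ordering.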
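The four definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the passage IDs are made up for the example, and the NDCG variant shown assumes linear gain with a log2 rank discount (other formulations exist, such as exponential gain).

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no labels for this query; callers may prefer to skip it
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG with linear gain and log2 discount (an assumed variant).

    gains maps passage ID -> graded relevance; missing IDs count as 0.
    """
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: the retriever returned four passages, two are relevant.
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}
print(recall_at_k(ranked, relevant, 3))     # one of the two relevant passages is in the top 3
print(precision_at_k(ranked, relevant, 3))  # one of the top 3 results is relevant
print(reciprocal_rank(ranked, relevant))    # first relevant result is at rank 2
print(ndcg_at_k(ranked, {"p1": 2, "p2": 1}, 3))
```

Mean reciprocal rank is the average of `reciprocal_rank` over all labeled queries; the other three are usually averaged over queries in the same way.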

Choosing the right one

For a generator that reads several passages, recall at k matters most, since the answer must be present somewhere in the context.
When ranking order matters, MRR or NDCG captures how early the good results appear.
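A small contrast makes the distinction concrete. In the sketch below (passage IDs and both rankings are invented for illustration), two retrievers find the same relevant passage within the top 5, so recall at 5 cannot tell them apart, while reciprocal rank does.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

relevant = {"p1"}
ranking_a = ["p1", "p4", "p5", "p6", "p7"]  # relevant passage ranked first
ranking_b = ["p4", "p5", "p6", "p7", "p1"]  # relevant passage ranked last

recall_a = recall_at_k(ranking_a, relevant, 5)  # 1.0
recall_b = recall_at_k(ranking_b, relevant, 5)  # 1.0
rr_a = reciprocal_rank(ranking_a, relevant)     # 1.0
rr_b = reciprocal_rank(ranking_b, relevant)     # 0.2

print(recall_a, recall_b)  # identical recall at 5
print(rr_a, rr_b)          # reciprocal rank separates the two rankings
```

If the generator reads all five passages, both rankings may serve it equally well; if only the first result is used, only the first ranking does.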

The labeling cost

Every metric needs ground truth: a set of queries paired with their truly relevant passages. Building this set is the real work, and incomplete or noisy labels make every number untrustworthy.
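One simple shape for such a labeled set is a mapping from each query to the IDs of its relevant passages. The sketch below assumes that shape and averages recall at k over the labeled queries. The queries, passage IDs, and the `toy_retrieve` stand-in are all hypothetical; in practice the retriever under test would take its place.

```python
from statistics import mean

# Hypothetical ground truth: query -> set of relevant passage IDs.
labels = {
    "how do I reset my password": {"p12"},
    "what is the refund window": {"p3", "p8"},
}

def toy_retrieve(query, k):
    """Stand-in for a real retriever: returns canned ranked passage IDs."""
    canned = {
        "how do I reset my password": ["p12", "p40", "p7"],
        "what is the refund window": ["p8", "p1", "p9"],
    }
    return canned.get(query, [])[:k]

def mean_recall_at_k(labels, retrieve, k):
    """Average recall at k over all queries that have at least one label."""
    scores = []
    for query, relevant in labels.items():
        if not relevant:
            continue  # an unlabeled query cannot be scored
        top_k = set(retrieve(query, k))
        scores.append(len(top_k & relevant) / len(relevant))
    return mean(scores) if scores else 0.0

print(mean_recall_at_k(labels, toy_retrieve, k=3))  # 0.75
```

The second query shows how label quality feeds into the number: if "p3" were missing from the labels, the same ranking would score a perfect recall for that query.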

Key idea

Retrieval is measured with recall, precision, mean reciprocal rank, and NDCG against labeled relevant passages, with recall at k often mattering most for downstream generation.
