Machine Learning

Retrieval Evaluation Metrics

Numbers that tell you whether retrieval is actually finding the right passages.

5 min read · advanced

Why measure retrieval alone

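In a retrieval-augmented system, the generator can only work with what retrieval supplies. If the right passage never appears in the retrieved context, no amount of clever generation can recover the answer. Measuring the retrieval stage on its own tells you whether a bad answer came from retrieval or from generation.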

The core metrics

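  • Recall at k: what fraction of the relevant passages appear in the top k results. With a single relevant passage per query, this reduces to whether that passage appears in the top k at all.
  • Precision at k: what fraction of the top k results are relevant.
  • Mean reciprocal rank (MRR): the reciprocal of the rank of the first relevant result, averaged over queries. It is 1 when the first result is always relevant and falls toward 0 as the first relevant result slips down the list.
  • Normalized discounted cumulative gain (nDCG): the discounted gain of the ranking divided by that of the ideal ranking. It rewards relevant items near the top and supports graded relevance.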
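The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the passage ids, and the graded gains are all assumptions made for the example, and nDCG uses the common linear-gain, log2-discount form.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant (denominator is k)."""
    relevant = set(relevant_ids)
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    relevant = set(relevant_ids)
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps passage id -> graded relevance; unlisted ids count as 0."""
    dcg = sum(gains.get(pid, 0) / math.log2(i + 2)
              for i, pid in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query: two relevant passages, one retrieved at rank 2.
ranked = ["p7", "p2", "p9", "p4"]
relevant = {"p2", "p4"}
gains = {"p2": 2, "p4": 1}  # assumed graded labels for nDCG

print(recall_at_k(ranked, relevant, 3))     # 0.5
print(precision_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))    # 0.5
print(ndcg_at_k(ranked, gains, 3))
```

Mean reciprocal rank is the average of `reciprocal_rank` over all queries in the evaluation set.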

Choosing the right one

  • For a generator that reads several passages, recall at k matters most, since the answer must be present somewhere in the context.
  • When ranking order matters, for example when only the top result or two is used, MRR or nDCG captures how soon the good result appears.
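A small sketch of why the choice matters: two rankings can score identically on recall at k yet differ sharply on reciprocal rank. The passage ids and both rankings are invented for the illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    relevant = set(relevant_ids)
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

relevant = {"p3"}  # the one passage that answers the query (assumed label)
ranking_a = ["p3", "p1", "p2", "p4", "p5"]  # relevant passage first
ranking_b = ["p1", "p2", "p4", "p5", "p3"]  # relevant passage last

for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    print(name,
          "recall@5 =", recall_at_k(ranking, relevant, 5),
          "RR =", reciprocal_rank(ranking, relevant))
# A recall@5 = 1.0 RR = 1.0
# B recall@5 = 1.0 RR = 0.2
```

A generator that reads all five passages sees the answer in both cases, so recall at 5 treats the rankings as equal. A system that shows or uses only the top result would succeed with ranking A and fail with ranking B, which is what the reciprocal rank exposes.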

The labeling cost

Every metric needs ground truth: a set of queries, each paired with its truly relevant passages. Building this set is the real work, and weak labels make every number untrustworthy.
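One possible shape for such a labeled set, and an evaluation loop over it, is sketched below. Everything here is hypothetical: the query and passage ids, the `retrieve` stub (which returns canned results in place of a real retriever), and the `evaluate` helper. Note that the reciprocal rank in this sketch only looks at the top k results.

```python
from statistics import mean

# Ground truth: query id -> set of passage ids labeled relevant (assumed data).
ground_truth = {
    "q1": {"p2"},
    "q2": {"p5", "p6"},
    "q3": {"p9"},
}

# Stand-in for a real retriever: canned rankings per query.
canned_results = {
    "q1": ["p2", "p1", "p3"],
    "q2": ["p4", "p5", "p7"],
    "q3": ["p1", "p2", "p3"],
}

def retrieve(query_id, k):
    """Hypothetical retriever returning the top k passage ids."""
    return canned_results[query_id][:k]

def evaluate(ground_truth, retrieve, k):
    """Average recall@k and reciprocal rank (within the top k) over queries."""
    recalls, reciprocal_ranks = [], []
    for query_id, relevant in ground_truth.items():
        ranked = retrieve(query_id, k)
        recalls.append(len(set(ranked) & relevant) / len(relevant))
        first_hit = next(
            (rank for rank, pid in enumerate(ranked, start=1) if pid in relevant),
            None,
        )
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    return {f"recall@{k}": mean(recalls), "mrr": mean(reciprocal_ranks)}

report = evaluate(ground_truth, retrieve, k=3)
print(report)  # {'recall@3': 0.5, 'mrr': 0.5}
```

If a label in `ground_truth` is wrong or missing, a correct retrieval is scored as a miss, which is why label quality bounds how far any of these numbers can be trusted.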

Key idea

Retrieval is measured with recall, precision, mean reciprocal rank (MRR), and nDCG against labeled relevant passages, with recall at k often mattering most for downstream generation.

Check yourself

1. Why measure the retrieval stage separately from generation?

2. Which metric matters most when a generator reads several retrieved passages?