Why measure retrieval alone
In a retrieval-augmented system the generator can only work with what retrieval supplies. If the right passage never appears in the retrieved set, no amount of clever generation can recover the answer. So you must measure the retrieval stage on its own.
The core metrics
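- Recall at k: what fraction of the relevant passages appear in the top k results. With one relevant passage per query, this reduces to whether that passage appears in the top k at all.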
- Precision at k: what fraction of the top k results are relevant.
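- Mean reciprocal rank (MRR): the reciprocal of the rank of the first relevant result (1 for rank 1, 1/2 for rank 2, and so on), averaged over queries.
- Normalized discounted cumulative gain (NDCG): discounts each result's graded relevance by its rank and normalizes by the score of the ideal ordering, so relevant items near the top count most.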
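The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the passage IDs, and the graded gains are all made up for the example, and the NDCG variant shown uses the common linear-gain, log2-discount form.

```python
import math
from typing import Dict, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant (divides by k by convention)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering of the labels."""
    dcg = sum(
        gains.get(pid, 0.0) / math.log2(rank + 1)
        for rank, pid in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one query, four retrieved passage IDs.
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}            # binary labels
gains = {"p1": 2.0, "p2": 1.0}     # graded labels for NDCG

print(recall_at_k(ranked, relevant, 3))     # one of two relevant passages found
print(precision_at_k(ranked, relevant, 3))  # one of three results relevant
print(reciprocal_rank(ranked, relevant))    # first relevant result at rank 2
print(ndcg_at_k(ranked, gains, 4))
```

Mean reciprocal rank is the average of `reciprocal_rank` over all queries; the other three are likewise usually reported as averages over the query set.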
Choosing the right one
- For a generator that reads several passages, recall at k matters most, since the answer must be present somewhere in the context.
- When ranking order matters, MRR or NDCG capture how soon the good result appears.
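To make the distinction concrete, the sketch below compares two hypothetical rankings that both contain the relevant passage within the top 5. A recall-style check treats them as equal, while reciprocal rank separates them. The passage IDs and helper names are invented for illustration.

```python
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if any relevant passage appears in the top k results."""
    return any(pid in relevant for pid in ranked[:k])


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


relevant = {"p9"}
ranking_a = ["p9", "p1", "p2", "p3", "p4"]  # relevant passage first
ranking_b = ["p1", "p2", "p3", "p4", "p9"]  # relevant passage last

# Both rankings pass the top-5 presence check...
print(hit_at_k(ranking_a, relevant, 5), hit_at_k(ranking_b, relevant, 5))
# ...but reciprocal rank tells them apart.
print(reciprocal_rank(ranking_a, relevant), reciprocal_rank(ranking_b, relevant))
```

If the generator reads all five passages, the two rankings may serve it equally well; if only the top result is shown or used, ranking B is much worse.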
The labeling cost
Every metric needs ground truth: a set of queries paired with their truly relevant passages. Building this set is the real work, and weak labels make every number untrustworthy.
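One possible shape for such a labeled set is a mapping from each query to the IDs of the passages judged relevant. The sketch below assumes that layout and averages recall at k over it. The queries, document IDs, and the `toy_retrieve` stand-in are all hypothetical; a real evaluation would call the actual retriever instead.

```python
from typing import Callable, Dict, List, Set

# Hypothetical labeled set: query text -> IDs of passages judged relevant.
ground_truth: Dict[str, Set[str]] = {
    "how do I reset my password": {"doc-12"},
    "refund policy for annual plans": {"doc-40", "doc-41"},
}


def mean_recall_at_k(
    retrieve: Callable[[str], List[str]],
    labels: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average recall at k over every labeled query."""
    scores = []
    for query, relevant in labels.items():
        if not relevant:
            continue  # a query with no labeled passages cannot be scored
        top_k = retrieve(query)[:k]
        hits = len(relevant.intersection(top_k))
        scores.append(hits / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


def toy_retrieve(query: str) -> List[str]:
    """Stand-in retriever returning canned rankings (illustration only)."""
    canned = {
        "how do I reset my password": ["doc-12", "doc-3"],
        "refund policy for annual plans": ["doc-40", "doc-7", "doc-41"],
    }
    return canned.get(query, [])


print(mean_recall_at_k(toy_retrieve, ground_truth, k=2))
```

Because the score is computed only against the labeled IDs, a relevant passage that the labelers missed counts as a miss when retrieved, which is one way weak labels distort the number.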
Key idea
Retrieval is measured with recall, precision, mean reciprocal rank, and NDCG against labeled relevant passages, with recall at k often mattering most for downstream generation.