Two stages, two failures
A retrieval-augmented system first retrieves documents, then generates an answer from them. A bad answer can come from missing context or from misusing good context, so evaluation must score the two stages separately.
Grading retrieval
The retriever is scored like search:
- Recall: did the relevant document appear among the results?
- Precision: how much of what was retrieved is actually relevant?
- Ranking metrics: scores that reward placing relevant chunks near the top.
If the right context is never retrieved, no generator can recover.
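The retrieval metrics above can be sketched in a few lines. This is a minimal illustration, assuming each query comes with a labeled set of relevant document IDs; the function names and the sample IDs are hypothetical, and reciprocal rank stands in here as one common example of a ranking metric.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical example: ranked retriever output and labeled relevant documents.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4", "d5"}

print(recall_at_k(retrieved, relevant, k=3))     # one of three relevant docs found
print(precision_at_k(retrieved, relevant, k=3))  # one of three results is relevant
print(reciprocal_rank(retrieved, relevant))      # first relevant doc is at rank 2
```

Averaging these per-query values over an evaluation set gives the retriever's score independently of any generator.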
Grading generation
Given the retrieved context, the answer is judged on:
- Faithfulness: every claim is supported by the retrieved text.
- Answer relevance: the response addresses the question.
- Context use: the model relies on the provided evidence rather than its own memory.
Measuring faithfulness against the context that was actually retrieved separates true grounding from lucky parametric recall, where the model answers correctly from memory rather than from the evidence.
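As a rough illustration of a faithfulness score, the sketch below splits an answer into sentence-level claims and counts a claim as supported when most of its words occur in the retrieved context. The word-overlap check is an assumption made to keep the example self-contained; it is a crude stand-in for the entailment model or judge model a real evaluation would more likely use, and the names and the 0.8 threshold are hypothetical.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Crude proxy: a claim counts as supported if enough of its words
    appear in the retrieved context. Not a real entailment check."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return True
    overlap = len(claim_tokens & _tokens(context)) / len(claim_tokens)
    return overlap >= threshold


def faithfulness(answer: str, context: str, threshold: float = 0.8) -> float:
    """Fraction of sentence-level claims in the answer that are supported."""
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not claims:
        return 0.0
    supported = sum(is_supported(c, context, threshold) for c in claims)
    return supported / len(claims)


# Hypothetical example: the second sentence is not backed by the context,
# even though it may be true, so it lowers the faithfulness score.
context = "The Eiffel Tower is in Paris. It was completed in 1889."
answer = "The Eiffel Tower is in Paris. It is 330 metres tall."
print(faithfulness(answer, context))
```

The important design point is that the score is computed against the context passed to the generator, not against ground truth, so a correct but unsupported statement still counts against faithfulness.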
End-to-end and diagnosis
A combined score reflects user experience, but only stage-wise metrics tell you where to fix things. Low recall points to the retriever; low faithfulness with good recall points to the generator. Report the end-to-end score alongside the stage-wise metrics so a single failing number becomes an actionable diagnosis rather than a mystery.
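The diagnostic rule can be written down directly. This is a sketch only: the function name and both threshold values are assumptions chosen for illustration, and real cut-offs would be set per application.

```python
def diagnose(recall: float, faithfulness: float,
             recall_floor: float = 0.7,
             faithfulness_floor: float = 0.8) -> str:
    """Map stage-wise scores to the stage most likely at fault.

    Thresholds are illustrative assumptions, not recommended values.
    Retrieval is checked first because a generator cannot recover
    from context that was never retrieved.
    """
    if recall < recall_floor:
        return "retriever"
    if faithfulness < faithfulness_floor:
        return "generator"
    return "ok"


print(diagnose(recall=0.4, faithfulness=0.9))  # retriever
print(diagnose(recall=0.9, faithfulness=0.5))  # generator
print(diagnose(recall=0.9, faithfulness=0.9))  # ok
```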
Key idea
Retrieval-augmented evaluation grades the retriever with recall and precision and the generator with faithfulness and relevance, because only stage-wise metrics turn a poor end-to-end score into an actionable diagnosis.