Machine Learning

The Retrieval Augmented Eval

Scoring a RAG system means grading retrieval and generation separately and together.

7 min read · advanced

Two stages, two failures

A retrieval-augmented system first retrieves documents, then generates an answer from them. A bad answer can come from missing context or from misusing good context, so evaluation must score the two stages separately.

Grading retrieval

The retriever is scored like search:

  • Recall: did the relevant document appear among the results?
  • Precision: how much of what was retrieved is actually relevant?
  • Ranking metrics: scores that reward placing relevant chunks near the top.

If the right context is never retrieved, no generator can recover.
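
The three retriever scores above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the document IDs, and the choice of reciprocal rank as the ranking metric are all assumptions made for the example.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical data: ranked result IDs from a retriever and gold labels.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4", "d5"}

r = recall_at_k(retrieved, relevant, k=3)      # 1 of 3 relevant docs in top 3
p = precision_at_k(retrieved, relevant, k=3)   # 1 of 3 returned docs is relevant
rr = reciprocal_rank(retrieved, relevant)      # first relevant hit is at rank 2
```

Here "d5" is never retrieved at any rank, which is exactly the case no generator can fix downstream.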

Grading generation

Given the retrieved context, the answer is judged on:

  • Faithfulness: every claim is supported by the retrieved text.
  • Answer relevance: the response addresses the question.
  • Context use: the model relies on provided evidence rather than its own memory.

Measuring faithfulness against the context that was actually retrieved separates true grounding from lucky parametric recall, where the model happens to answer correctly from memory.
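
One way to picture faithfulness is as the share of answer claims that the retrieved context supports. The sketch below assumes the answer has already been split into claims, and it uses a crude token-overlap test as a stand-in for the support check; in practice that check is often delegated to an entailment model or an LLM judge. The names `token_overlap_supported` and `faithfulness`, the 0.8 threshold, and the example text are all hypothetical.

```python
import re
from typing import Callable, List


def token_overlap_supported(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Crude stand-in for a support check: share of claim tokens found in context."""
    claim_tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_tokens:
        return False
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold


def faithfulness(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool] = token_overlap_supported,
) -> float:
    """Fraction of claims judged supported by the retrieved context."""
    if not claims:
        return 0.0  # assumption: an answer with no claims earns no credit
    return sum(is_supported(claim, context) for claim in claims) / len(claims)


# Hypothetical retrieved context and answer claims.
context = "The Eiffel Tower is in Paris. It was completed in 1889."
claims = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower was designed by Gustave Eiffel's company.",
]

score = faithfulness(claims, context)  # first claim supported, second is not
```

The second claim is true in the world but absent from the retrieved text, so it counts against faithfulness. That is the distinction between grounding and parametric recall.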

End-to-end and diagnosis

A combined score reflects user experience, but only stage-wise metrics tell you where to fix things. Low recall points to the retriever; low faithfulness with good recall points to the generator. Report both the combined score and the stage-wise metrics so that a single failing number becomes an actionable diagnosis rather than a mystery.
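
The diagnosis rule can be written as a small decision function. This is a sketch under stated assumptions: the `diagnose` name and the 0.8 floors are hypothetical and would need tuning per system.

```python
def diagnose(
    recall: float,
    faithfulness: float,
    recall_floor: float = 0.8,        # hypothetical threshold
    faithfulness_floor: float = 0.8,  # hypothetical threshold
) -> str:
    """Map stage-wise scores to the stage that most likely needs fixing."""
    # Check retrieval first: if the right context never arrives,
    # generator scores say little about the generator itself.
    if recall < recall_floor:
        return "retriever"
    # Retrieval is adequate, so unsupported claims are a generation problem.
    if faithfulness < faithfulness_floor:
        return "generator"
    return "ok"


low_recall = diagnose(recall=0.40, faithfulness=0.90)
low_faith = diagnose(recall=0.95, faithfulness=0.50)
healthy = diagnose(recall=0.95, faithfulness=0.95)
```

Checking recall before faithfulness mirrors the rule in the text: low faithfulness implicates the generator only when recall is already good.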

Key idea

Retrieval-augmented evaluation grades the retriever with recall and precision and the generator with faithfulness and relevance, because only stage-wise metrics turn a poor end-to-end score into an actionable diagnosis.

Check yourself

1. Why must a RAG eval grade retrieval and generation separately?

2. What does faithfulness measure in a RAG system?

3. Good recall but low faithfulness points to a problem in which stage?