Two systems to grade
A RAG answer can fail because retrieval fetched the wrong passages or because generation ignored good ones. Evaluating only the final answer hides which stage broke, so good practice measures retrieval and generation separately.
Retrieval metrics
- Context recall asks whether the passages needed to answer were retrieved at all. Low recall means the answer was doomed before generation.
- Context precision asks whether the retrieved passages are mostly relevant rather than padded with noise.
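As a rough illustration, both retrieval metrics can be computed from sets of passage IDs. This is a minimal sketch that assumes each question comes with a hand-labeled set of needed passages. The function names and the IDs are hypothetical, and some evaluation libraries use a rank-weighted form of precision rather than the plain ratio shown here.

```python
def context_recall(retrieved: list[str], needed: set[str]) -> float:
    """Fraction of the needed passages that appear among the retrieved ones."""
    if not needed:
        return 1.0  # assumption: nothing was required, so nothing was missed
    return len(needed & set(retrieved)) / len(needed)


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the retrieved passages that are relevant (unweighted)."""
    if not retrieved:
        return 0.0  # assumption: an empty retrieval earns no precision credit
    hits = sum(1 for passage_id in retrieved if passage_id in relevant)
    return hits / len(retrieved)


# Hypothetical example: two passages are needed, only one was retrieved,
# and it arrived alongside three irrelevant ones.
retrieved = ["p1", "p7", "p9", "p4"]
needed = {"p1", "p2"}
print(context_recall(retrieved, needed))     # 0.5
print(context_precision(retrieved, needed))  # 0.25
```

Here recall is 0.5 because passage p2 was never fetched, and precision is 0.25 because three of the four retrieved passages are noise.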
Generation metrics
- Faithfulness asks whether every claim in the answer is supported by the retrieved context. Unsupported claims are hallucinations even when the context was good.
- Answer relevance asks whether the answer actually addresses the question rather than drifting.
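Faithfulness is often reported as the share of answer claims that the context supports. The sketch below assumes the answer has already been split into claims and takes the support check as a pluggable function. The word-overlap checker is only a toy stand-in (a real setup would more likely use an entailment model or an LLM judge), and all names here are hypothetical.

```python
from typing import Callable


def faithfulness(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims that the support check accepts against the context."""
    if not claims:
        return 1.0  # assumption: an answer with no claims has nothing unsupported
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_is_supported(claim: str, context: str) -> bool:
    """Toy check: every word of the claim must appear in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in claim.lower().split())


context = "the eiffel tower is in paris and opened in 1889"
claims = ["the eiffel tower is in paris", "the tower opened in 1925"]
print(faithfulness(claims, context, toy_is_supported))  # 0.5
```

The second claim fails because 1925 never appears in the context, so half of the answer counts as unsupported even though retrieval did its job.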
LLM as judge
Many of these metrics are scored by a strong language model acting as a judge, prompted to check, for example, whether each answer sentence is entailed by the context. This scales evaluation, but the judge's verdicts need spot checks against human labels, since judges carry their own biases.
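A judge loop of this kind might look like the sketch below. It assumes the judge is any function that takes a prompt string and returns text. The prompt wording, the `stub_judge` stand-in, and the naive sentence splitter are all hypothetical, and a real setup would replace the stub with a call to an actual model.

```python
import re
from typing import Callable

PROMPT = (
    "Context:\n{context}\n\n"
    "Statement:\n{sentence}\n\n"
    "Is the statement entailed by the context? Answer yes or no."
)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def judge_answer(answer: str, context: str, judge: Callable[[str], str]) -> list[bool]:
    """Ask the judge about each answer sentence; True means 'entailed'."""
    verdicts = []
    for sentence in split_sentences(answer):
        reply = judge(PROMPT.format(context=context, sentence=sentence))
        verdicts.append(reply.strip().lower().startswith("yes"))
    return verdicts


def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of items where the judge and the human annotator agree."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a model call: rejects statements mentioning 1925."""
    statement = prompt.split("Statement:\n")[1].split("\n\n")[0]
    return "no" if "1925" in statement else "yes"


context = "The Eiffel Tower is in Paris and opened in 1889."
answer = "The Eiffel Tower is in Paris. It opened in 1925."
verdicts = judge_answer(answer, context, stub_judge)
print(verdicts)                            # [True, False]
print(agreement(verdicts, [True, False]))  # 1.0
```

The `agreement` function is the spot check: on a small human-labeled sample, a low agreement rate is a sign that the judge's scores should not be trusted at scale.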
Reading the grid
Crossing the metrics localizes faults. High recall but low faithfulness points at the generator; low recall points at retrieval or chunking. This separation tells you which knob to turn instead of guessing.
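The grid reading above can be written as a small decision rule. This is only a sketch: the 0.7 cutoff is an arbitrary assumption rather than a recommended value, and the function name is hypothetical.

```python
def diagnose(context_recall: float, faithfulness: float, threshold: float = 0.7) -> str:
    """Map a (recall, faithfulness) pair to the stage most likely at fault."""
    if context_recall < threshold:
        # The needed passages never arrived, so generation could not succeed.
        return "retrieval or chunking"
    if faithfulness < threshold:
        # Good passages were available, but the answer was not grounded in them.
        return "generator"
    return "no fault flagged"


print(diagnose(context_recall=0.9, faithfulness=0.4))  # generator
print(diagnose(context_recall=0.3, faithfulness=0.9))  # retrieval or chunking
```

Recall is checked first because a faithfulness score says little about the generator when the passages it needed were never retrieved.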
Key idea
RAG evaluation splits into retrieval metrics like context recall and precision and generation metrics like faithfulness and answer relevance, so faults localize to the stage that actually broke.