Machine Learning

The RAG Evaluation Metrics Deep

Measure retrieval and generation separately to find where a RAG system fails.

6 min read · advanced

Two systems to grade

A RAG answer can fail because retrieval fetched the wrong passages or because generation ignored good ones. Evaluating only the final answer hides which stage broke, so good practice measures retrieval and generation separately.

Retrieval metrics

  • Context recall asks whether the passages needed to answer were retrieved at all. Low recall means the answer was doomed before generation.
  • Context precision asks whether the retrieved passages are mostly relevant rather than padded with noise.
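
As a minimal sketch of the two retrieval metrics, assume each passage has an ID and that someone has labelled which passages are needed to answer the question. The function names and the sample IDs below are hypothetical. This is the simple set-based form; rank-aware variants also exist.

```python
def context_recall(retrieved_ids, needed_ids):
    """Fraction of the needed passages that were actually retrieved."""
    needed = set(needed_ids)
    if not needed:
        return 1.0  # assumption: nothing was needed, so nothing was missed
    return len(needed & set(retrieved_ids)) / len(needed)


def context_precision(retrieved_ids, relevant_ids):
    """Fraction of the retrieved passages that are relevant."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0  # assumption: an empty retrieval scores zero
    relevant = set(relevant_ids)
    return sum(1 for pid in retrieved if pid in relevant) / len(retrieved)


# Hypothetical example: two passages are needed, four were retrieved.
retrieved = ["p1", "p4", "p7", "p9"]
needed = ["p1", "p2"]

recall = context_recall(retrieved, needed)        # p2 was missed
precision = context_precision(retrieved, needed)  # three of four are noise
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here recall is 0.50 because one of the two needed passages is missing, and precision is 0.25 because only one of the four retrieved passages is relevant.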

Generation metrics

  • Faithfulness asks whether every claim in the answer is supported by the retrieved context. Unsupported claims are hallucinations even when the context was good.
  • Answer relevance asks whether the answer actually addresses the question rather than drifting.
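
Faithfulness can be sketched as a ratio once the answer has been split into claims and each claim has a supported or unsupported verdict. How the claims are extracted and checked is left open here; the function names and sample claims are hypothetical. Answer relevance is not covered, since it needs a separate judgment against the question.

```python
def faithfulness(claim_supported):
    """Share of answer claims that are supported by the retrieved context."""
    verdicts = list(claim_supported)
    if not verdicts:
        return 1.0  # assumption: an answer with no claims has nothing unsupported
    return sum(1 for ok in verdicts if ok) / len(verdicts)


def unsupported_claims(claims, claim_supported):
    """Return the claims that were marked as unsupported."""
    return [c for c, ok in zip(claims, claim_supported) if not ok]


# Hypothetical answer broken into three claims with per-claim verdicts.
claims = [
    "The warranty lasts two years.",
    "It covers battery defects.",
    "It includes free shipping.",
]
supported = [True, True, False]

score = faithfulness(supported)
print(f"faithfulness={score:.2f}")
print(unsupported_claims(claims, supported))
```

Listing the unsupported claims alongside the score makes it easier to see which statements the generator invented.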

LLM as judge

Many of these metrics are scored by a strong language model acting as a judge. The judge is prompted to check, for example, whether each answer sentence is entailed by the context. This scales evaluation, but the judge's verdicts need spot checks against human labels because judges carry their own biases.
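
The judge loop and the human spot check might look like the sketch below. Everything named here is an assumption: `ask_model` stands for whatever function sends a prompt to your judge model and returns its text reply, the prompt wording is illustrative, and `stub_model` is a deliberately crude stand-in (verbatim matching) so the example runs without any API.

```python
JUDGE_PROMPT = (
    "Context:\n{context}\n\n"
    "Sentence:\n{sentence}\n\n"
    "Is the sentence entailed by the context? Answer YES or NO."
)


def judge_sentences(sentences, context, ask_model):
    """Ask the judge about each answer sentence; True means 'entailed'."""
    verdicts = []
    for sentence in sentences:
        reply = ask_model(JUDGE_PROMPT.format(context=context, sentence=sentence))
        verdicts.append(reply.strip().upper().startswith("YES"))
    return verdicts


def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of items where the judge and the human label agree."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(1 for j, h in zip(judge_verdicts, human_verdicts) if j == h)
    return matches / len(judge_verdicts)


def stub_model(prompt):
    """Stand-in judge: says YES only if the sentence appears verbatim."""
    context, rest = prompt.split("\n\nSentence:\n")
    sentence = rest.split("\n\n")[0]
    return "YES" if sentence in context else "NO"


context = "The Eiffel Tower is in Paris. It opened in 1889."
sentences = [
    "The Eiffel Tower is in Paris.",
    "The tower opened in 1889.",  # a paraphrase the stub judge rejects
    "It is 500 metres tall.",
]
human = [True, True, False]  # hypothetical human labels

judge = judge_sentences(sentences, context, stub_model)
print(judge, f"agreement={agreement_rate(judge, human):.2f}")
```

The stub rejects the paraphrased second sentence that the human accepted, so agreement comes out below 1.0. A real judge fails in subtler ways, which is why a sample of its verdicts is compared against human labels.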

Reading the grid

Crossing the metrics localizes faults. High recall but low faithfulness points at the generator; low recall points at retrieval or chunking. This separation tells you which knob to turn instead of guessing.
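
The reading rules above can be written as a small decision function. The 0.8 threshold is an arbitrary assumption for illustration, not a recommended cutoff, and `diagnose` is a hypothetical name.

```python
def diagnose(context_recall, faithfulness, threshold=0.8):
    """Map a pair of scores to the stage most likely at fault."""
    if context_recall < threshold:
        # The needed passages never arrived, so generation could not succeed.
        return "retrieval or chunking"
    if faithfulness < threshold:
        # Good passages were retrieved but the answer made unsupported claims.
        return "generator"
    return "no fault localized by these two metrics"


print(diagnose(context_recall=0.95, faithfulness=0.40))
print(diagnose(context_recall=0.30, faithfulness=0.90))
```

Recall is checked first because low recall makes the faithfulness score hard to interpret: the generator had little usable context to be faithful to.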

Key idea

RAG evaluation splits into retrieval metrics like context recall and precision and generation metrics like faithfulness and answer relevance, so faults localize to the stage that actually broke.

Check yourself

1. What does faithfulness measure?

2. What does high context recall but low faithfulness suggest?

3. Why spot check an LLM judge against humans?