What it is
An evaluation harness is the test infrastructure for an LLM system. It runs a fixed set of inputs through a model or pipeline, applies graders, and reports aggregate metrics. It turns vibes into numbers you can track across versions.
Core parts
- Dataset: curated inputs with expected behavior, split into representative slices.
- Runner: code that calls the model with the right prompt and config for each case.
- Graders: functions that score outputs, from exact match to an LLM judge.
- Report: aggregate scores plus per-case traces for debugging.
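The four parts above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Case`, `Trace`, `run_suite`, `report`, and `toy_model` are hypothetical names, and the canned `toy_model` stands in for a real model or pipeline call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One dataset entry (hypothetical schema)."""
    case_id: str
    slice_name: str
    prompt: str
    expected: str


@dataclass
class Trace:
    """Per-case record kept for debugging."""
    case: Case
    output: str
    score: float


def exact_match(output: str, expected: str) -> float:
    """Simplest grader: 1.0 on a whitespace-trimmed exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_suite(
    cases: List[Case],
    model: Callable[[str], str],
    grader: Callable[[str, str], float] = exact_match,
) -> List[Trace]:
    """Runner: call the model on every case and grade each output."""
    traces = []
    for case in cases:
        output = model(case.prompt)
        traces.append(Trace(case, output, grader(output, case.expected)))
    return traces


def report(traces: List[Trace]) -> dict:
    """Report: an aggregate score plus the failing traces."""
    mean = sum(t.score for t in traces) / len(traces) if traces else 0.0
    failures = [t for t in traces if t.score < 1.0]
    return {"n": len(traces), "mean_score": mean, "failures": failures}


def toy_model(prompt: str) -> str:
    """Assumption: a canned stand-in for a real model call."""
    canned = {"2+2=": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "unknown")


# Made-up cases for illustration only.
CASES = [
    Case("math-1", "math", "2+2=", "4"),
    Case("geo-1", "geography", "Capital of France?", "Paris"),
    Case("geo-2", "geography", "Capital of Japan?", "Tokyo"),
]

result = report(run_suite(CASES, toy_model))
for t in result["failures"]:
    print(t.case.case_id, repr(t.output), "expected", repr(t.case.expected))
```

Passing the grader in as a function means an exact-match check can later be swapped for a stricter or model-based grader without touching the runner.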
Why it matters
Without a harness, every change is a guess. With one, you can run an offline suite on each prompt or model update and catch regressions before users do.
- Track a headline metric but also watch slices, since an average can hide a broken subgroup.
- Keep a golden set stable so scores stay comparable over time.
- Log full traces so a drop in score points to specific failing cases.
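To make the slice point concrete, here is a small sketch with made-up numbers. The function names, the slice labels, the baseline scores, and the 0.02 tolerance are all assumptions chosen for illustration. The headline mean looks acceptable while one slice has failed completely.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def slice_scores(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean score per slice from (slice_name, score) pairs."""
    by_slice = defaultdict(list)
    for name, score in records:
        by_slice[name].append(score)
    return {name: sum(scores) / len(scores) for name, scores in by_slice.items()}


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, float]]:
    """Slices whose score dropped by more than `tolerance` versus the baseline."""
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }


# Hypothetical run: 90 "general" cases all pass, 10 "legal" cases all fail.
records = [("general", 1.0)] * 90 + [("legal", 0.0)] * 10

overall = sum(score for _, score in records) / len(records)
per_slice = slice_scores(records)

# Hypothetical scores from the previous version on the same golden set.
baseline = {"general": 0.95, "legal": 0.90}
regressions = find_regressions(baseline, per_slice)

print("overall:", overall)
print("per slice:", per_slice)
print("regressions:", regressions)
```

The comparison is only meaningful because both runs use the same golden set; if the cases change between versions, a score difference no longer isolates the model or prompt change.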
Online versus offline
Offline evals run on a frozen dataset, so results are repeatable and comparable across versions. Online evals sample real traffic and grade it, catching drift in what users actually send that the static set misses.
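One way to sample traffic for online grading is to hash a request identifier, so the same request is always either in or out of the sample. The sketch below assumes each request carries a string ID. The names `should_sample`, `non_empty`, and `grade_sampled` are hypothetical, and the `non_empty` check is a placeholder for a real reference-free grader, since live traffic usually has no expected answer.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def non_empty(output: str) -> float:
    """Placeholder reference-free grader: 1.0 if the output has any content."""
    return 1.0 if output.strip() else 0.0


def grade_sampled(
    traffic: Iterable[Tuple[str, str]],
    rate: float,
    grader: Callable[[str], float] = non_empty,
) -> List[Tuple[str, float]]:
    """Grade only the sampled (request_id, output) pairs."""
    return [(rid, grader(out)) for rid, out in traffic if should_sample(rid, rate)]


# Synthetic traffic for illustration: every 50th response is empty.
traffic = [(f"req-{i}", "some answer" if i % 50 else "") for i in range(10_000)]
graded = grade_sampled(traffic, rate=0.1)

online_score = sum(score for _, score in graded) / len(graded)
print(len(graded), "requests graded, mean score", round(online_score, 3))
```

Hash-based selection keeps the sample stable across retries and reruns, which a per-request random draw would not.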
Key idea
An evaluation harness pairs a stable dataset with a runner and graders so that model changes are measured rather than guessed, with slice-level visibility into where scores move.