Evaluation Harnesses for LLMs

What it is

An evaluation harness is the test infrastructure for an LLM system. It runs a fixed set of inputs through a model or pipeline, applies graders, and reports aggregate metrics. It turns vibes into numbers you can track across versions.

Core parts

Dataset: curated inputs with expected behavior, split into representative slices.
Runner: code that calls the model with the right prompt and config for each case.
Graders: functions that score outputs, from exact match to an LLM judge.
Report: aggregate scores plus per-case traces for debugging.
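
To make the four parts concrete, here is a minimal sketch in Python. Every name in it (Case, Trace, exact_match, run_suite, stub_model) is hypothetical and chosen for illustration, and the stub stands in for a real model call, which would need whatever client your provider supplies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One dataset entry: an input, its expected answer, and its slice."""
    case_id: str
    prompt: str
    expected: str
    slice_name: str


@dataclass
class Trace:
    """Per-case record kept in the report for debugging."""
    case_id: str
    slice_name: str
    output: str
    score: float


def exact_match(output: str, expected: str) -> float:
    """Grader: 1.0 if the strings match after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_suite(
    cases: list[Case],
    model: Callable[[str], str],
    grader: Callable[[str, str], float],
) -> dict:
    """Runner: call the model on every case, grade it, and build the report."""
    traces = []
    for case in cases:
        output = model(case.prompt)
        score = grader(output, case.expected)
        traces.append(Trace(case.case_id, case.slice_name, output, score))
    mean = sum(t.score for t in traces) / len(traces) if traces else 0.0
    return {"mean_score": mean, "traces": traces}


def stub_model(prompt: str) -> str:
    """Assumption: a canned lookup standing in for a real LLM call."""
    answers = {"2+2=": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


# Illustrative dataset, made up for this sketch.
cases = [
    Case("math-1", "2+2=", "4", "math"),
    Case("geo-1", "Capital of France?", "paris", "geography"),
    Case("geo-2", "Capital of Peru?", "Lima", "geography"),
]

report = run_suite(cases, stub_model, exact_match)
print(report["mean_score"])
for trace in report["traces"]:
    print(trace.case_id, trace.score, repr(trace.output))
```

Keeping the grader and the model as plain callables means an exact-match check can later be swapped for an LLM judge without touching the runner.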

Why it matters

Without a harness, every change is a guess. With one, you can run an offline suite on each prompt or model update and catch regressions before users do.

Track a headline metric but also watch slices, since an average can hide a broken subgroup.
Keep a golden set stable so scores stay comparable over time.
Log full traces so a drop in score points to specific failing cases.
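
A small sketch of why slices matter, assuming results arrive as (slice name, score) pairs. The function name slice_report and the data are made up for illustration.

```python
from collections import defaultdict


def slice_report(results: list[tuple[str, float]]) -> tuple[float, dict[str, float]]:
    """Return the overall mean score and the mean score for each slice."""
    if not results:
        return 0.0, {}
    by_slice: dict[str, list[float]] = defaultdict(list)
    for slice_name, score in results:
        by_slice[slice_name].append(score)
    overall = sum(score for _, score in results) / len(results)
    per_slice = {name: sum(scores) / len(scores) for name, scores in by_slice.items()}
    return overall, per_slice


# Invented scores: nine passing cases in one slice, one failing case in another.
results = [("english", 1.0)] * 9 + [("spanish", 0.0)]

overall, per_slice = slice_report(results)
print(overall)
print(per_slice)
```

With this data the headline mean is 0.9, which looks healthy, while the spanish slice scores 0.0. Only the per-slice breakdown exposes the broken subgroup.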

Online versus offline

Offline evals run on a frozen dataset. Online evals sample real traffic and grade it, catching drift the static set misses.
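
One way the online side might look, as a sketch under stated assumptions: requests carry a string ID, a fixed fraction of traffic is sampled by hashing that ID, and the grader is reference-free because live traffic has no expected answer. The names in_sample, length_check, and grade_sample are hypothetical, and the length limit is an arbitrary placeholder.

```python
import hashlib
from typing import Callable, Iterable


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically decide whether a request falls in the sampled fraction."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def length_check(output: str) -> float:
    """Reference-free grader (assumption): non-empty and at most 2000 characters."""
    return 1.0 if 0 < len(output.strip()) <= 2000 else 0.0


def grade_sample(
    traffic: Iterable[tuple[str, str]],
    rate: float,
    grader: Callable[[str], float] = length_check,
) -> dict[str, float]:
    """Grade only the sampled requests; returns request_id -> score."""
    return {
        request_id: grader(output)
        for request_id, output in traffic
        if in_sample(request_id, rate)
    }


# Invented traffic for illustration.
traffic = [("req-1", "hello"), ("req-2", ""), ("req-3", "x" * 5000)]
print(grade_sample(traffic, rate=1.0))
```

Hashing the request ID rather than drawing a random number keeps the sample reproducible, so the same request is always in or out of the graded set when a run is repeated.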

Key idea

An evaluation harness pairs a stable dataset with graders and a runner, so model changes are measured rather than guessed, with slice-level visibility.
