Machine Learning

Evaluation Harnesses for LLMs

Repeatable pipelines that measure model quality across many cases.

What it is

An evaluation harness is the test infrastructure for an LLM system. It runs a fixed set of inputs through a model or pipeline, applies graders, and reports aggregate metrics. It turns vibes into numbers you can track across versions.

Core parts

  • Dataset: curated inputs with expected behavior, split into representative slices.
  • Runner: code that calls the model with the right prompt and config for each case.
  • Graders: functions that score outputs, from exact match to an LLM judge.
  • Report: aggregate scores plus per-case traces for debugging.
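
The four parts can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Case`, `exact_match`, `run_harness`, and `toy_model` are hypothetical names, and `toy_model` is a lookup table standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One dataset entry: an input, its expected answer, and its slice."""
    case_id: str
    prompt: str
    expected: str
    slice: str


def exact_match(output: str, expected: str) -> float:
    """Grader: 1.0 if the normalized strings are equal, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_harness(
    dataset: list[Case],
    model: Callable[[str], str],
    grader: Callable[[str, str], float],
) -> dict:
    """Runner plus report: call the model per case, grade, and aggregate."""
    traces = []
    for case in dataset:
        output = model(case.prompt)
        traces.append({
            "case_id": case.case_id,
            "slice": case.slice,
            "output": output,
            "expected": case.expected,
            "score": grader(output, case.expected),
        })
    mean = sum(t["score"] for t in traces) / len(traces) if traces else 0.0
    return {"mean_score": mean, "traces": traces}


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: a fixed lookup table)."""
    answers = {"2+2=": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


DATASET = [
    Case("math-1", "2+2=", "4", "math"),
    Case("geo-1", "Capital of France?", "paris", "geo"),
    Case("geo-2", "Capital of Peru?", "Lima", "geo"),
]

report = run_harness(DATASET, toy_model, exact_match)
failing = [t["case_id"] for t in report["traces"] if t["score"] == 0.0]
print(f"mean score: {report['mean_score']:.2f}, failing cases: {failing}")
```

Keeping the model and the grader as plain callables means the same runner can later take a real API client or an LLM judge without changing the dataset or the report format.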

Why it matters

Without a harness, every change is a guess. With one, you can run an offline suite on each prompt or model update and catch regressions before users do.

  • Track a headline metric but also watch slices, since an average can hide a broken subgroup.
  • Keep a golden set stable so scores stay comparable over time.
  • Log full traces so a drop in score points to specific failing cases.
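
To see how an average can hide a broken subgroup, here is a small sketch with made-up scores. The helper names (`slice_means`, `find_regressions`, `make_traces`), the slice names, and the 0.05 tolerance are all assumptions chosen for illustration.

```python
from collections import defaultdict


def overall_mean(traces: list[dict]) -> float:
    """Headline metric: mean score over every case."""
    return sum(t["score"] for t in traces) / len(traces)


def slice_means(traces: list[dict]) -> dict[str, float]:
    """Mean score per slice."""
    by_slice = defaultdict(list)
    for t in traces:
        by_slice[t["slice"]].append(t["score"])
    return {name: sum(s) / len(s) for name, s in by_slice.items()}


def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.05) -> dict:
    """Slices whose score dropped by more than `tolerance` (hypothetical threshold)."""
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }


def make_traces(slice_name: str, scores: list[int]) -> list[dict]:
    """Build illustrative traces; the scores are invented for this example."""
    return [
        {"case_id": f"{slice_name}-{i}", "slice": slice_name, "score": s}
        for i, s in enumerate(scores)
    ]


# Same golden set scored under two versions of a prompt.
baseline = make_traces("english", [1, 1, 1, 1, 1, 1, 0, 0]) + make_traces("spanish", [1, 1])
candidate = make_traces("english", [1] * 8) + make_traces("spanish", [0, 0])

regressions = find_regressions(slice_means(baseline), slice_means(candidate))
broken_cases = [
    t["case_id"] for t in candidate
    if t["slice"] in regressions and t["score"] == 0
]

print(overall_mean(baseline), overall_mean(candidate))  # headline is identical
print(regressions)   # the per-slice view exposes the drop
print(broken_cases)  # traces point to the specific failing cases
```

Both versions score 8 out of 10 overall, yet the smaller slice went from fully passing to fully failing, which only the per-slice comparison reveals.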

Online versus offline

Offline evals run on a frozen dataset, so scores are repeatable from run to run. Online evals sample real traffic and grade it, catching drift in live inputs that the static set misses.
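
One way the online side could look is sketched below, under stated assumptions: requests are sampled deterministically by hashing a request ID, and because live traffic has no expected answers, the grader is reference-free. `should_sample`, `non_empty`, and `grade_online` are hypothetical names, and the non-empty check is a placeholder for a real grader such as an LLM judge.

```python
import hashlib
from typing import Callable, Optional


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def non_empty(output: str) -> float:
    """Placeholder reference-free grader: 1.0 if the output has any content."""
    return 1.0 if output.strip() else 0.0


def grade_online(
    traffic: list[tuple[str, str]],
    rate: float,
    grader: Callable[[str], float],
) -> dict:
    """Grade the sampled share of (request_id, output) pairs."""
    graded = [
        {"request_id": request_id, "score": grader(output)}
        for request_id, output in traffic
        if should_sample(request_id, rate)
    ]
    mean: Optional[float] = (
        sum(g["score"] for g in graded) / len(graded) if graded else None
    )
    return {"sampled": len(graded), "mean_score": mean, "graded": graded}


# Illustrative traffic; a real system would read this from request logs.
traffic = [("req-1", "Here is your answer."), ("req-2", "")]
online_report = grade_online(traffic, rate=1.0, grader=non_empty)
print(online_report["sampled"], online_report["mean_score"])
```

Hashing the ID rather than drawing a random number keeps the sampling decision stable, so the same request is always in or out of the graded set when logs are reprocessed.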

Key idea

An evaluation harness pairs a stable dataset with a runner and graders, so model changes are measured rather than guessed, and slice-level scores show where quality moved.

Check yourself

1. Why watch per-slice scores and not just the headline average?

2. What do online evals add over a frozen offline dataset?