The Agent Evaluation Harness

Why agents are hard to test

An agent takes many paths to a goal, calls tools, and may succeed in different ways. A single accuracy number hides this, so you need an evaluation harness: a repeatable rig that runs the agent on tasks and scores the outcomes.

What a harness contains

Task set: fixed scenarios with clear success criteria.
Environment: tools or sandboxes the agent acts in, reset between runs.
Scorer: a check on the final state, not just the text, such as did the file get created.
Trace capture: the full sequence of steps for debugging failures.

Outcome versus process

Scoring the outcome asks whether the goal was met. Scoring the process asks whether the steps were sensible. Good harnesses track both, since an agent can reach the goal by luck or fail despite reasonable steps.

Keeping it honest

Run each task several times since agents are stochastic.
Hold out a private task set to avoid tuning to the eval.
Watch cost and latency alongside success, not in isolation.

A harness turns vague impressions of an agent into numbers you can improve against.

Key idea

An agent evaluation harness runs the agent on fixed tasks in a resettable environment scoring both outcome and process and capturing traces, repeated across stochastic runs so vague impressions become numbers you can improve.

The Agent Evaluation Harness

Why agents are hard to test

What a harness contains

Outcome versus process

Keeping it honest

Key idea

Check yourself