Why agents are hard to test
An agent can take many paths to a goal, call tools along the way, and succeed in different ways. A single accuracy number hides this variation, so you need an evaluation harness: a repeatable rig that runs the agent on tasks and scores the outcomes.
What a harness contains
- Task set: fixed scenarios with clear success criteria.
- Environment: tools or sandboxes the agent acts in, reset between runs.
- Scorer: a check on the final state rather than just the agent's text output, for example whether the expected file was created.
- Trace capture: the full sequence of steps for debugging failures.
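The four parts above can be wired together in very little code. The following is a minimal sketch, not a reference implementation: `Task`, `Result`, `run_task`, and `toy_agent` are hypothetical names chosen for illustration, and the "environment" is assumed to be a plain dictionary standing in for a sandbox filesystem.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration; none of these names come from a real library.
State = Dict[str, str]   # the sandbox, modelled as filename -> contents
Trace = List[str]        # ordered log of the steps the agent took
Agent = Callable[[str, State, Trace], None]


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[State], bool]  # scorer: inspects the final state, not the text
    initial_state: State = field(default_factory=dict)


@dataclass
class Result:
    task: str
    success: bool
    trace: Trace


def run_task(agent: Agent, task: Task) -> Result:
    state = dict(task.initial_state)  # fresh copy, so every run starts from a reset environment
    trace: Trace = []
    agent(task.prompt, state, trace)  # the agent mutates the state and appends its steps
    return Result(task.name, task.check(state), trace)


def toy_agent(prompt: str, state: State, trace: Trace) -> None:
    # Stand-in for a real agent: "creates" a file and records the step.
    trace.append("write_file report.txt")
    state["report.txt"] = "done"


task = Task(
    name="create-report",
    prompt="Create report.txt",
    check=lambda s: "report.txt" in s,
)
result = run_task(toy_agent, task)
print(result.success, result.trace)
```

Copying `initial_state` on every run is what keeps the environment resettable here; a real sandbox would need its own teardown and setup step in the same place.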
Outcome versus process
Scoring the outcome asks whether the goal was met. Scoring the process asks whether the steps were sensible. Good harnesses track both, since an agent can reach the goal by luck or fail despite reasonable steps.
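To make the distinction concrete, here is a small sketch that scores the two dimensions separately. The process rules (a step budget and a forbidden tool called `delete_all`) are assumptions invented for this example; a real harness would define its own notion of sensible steps.

```python
from typing import Dict, List


def outcome_ok(final_state: Dict[str, str]) -> bool:
    # Outcome: was the goal met? Here the goal is that report.txt exists.
    return "report.txt" in final_state


def process_ok(trace: List[str], max_steps: int = 5) -> bool:
    # Process: were the steps sensible? Two hypothetical rules:
    # stay within a step budget and never call a forbidden tool.
    forbidden = {"delete_all"}
    within_budget = len(trace) <= max_steps
    no_forbidden = all(step.split(" ", 1)[0] not in forbidden for step in trace)
    return within_budget and no_forbidden


def score(final_state: Dict[str, str], trace: List[str]) -> Dict[str, bool]:
    return {"outcome": outcome_ok(final_state), "process": process_ok(trace)}


# Reaches the goal, but only after a reckless step.
lucky = score({"report.txt": "x"}, ["delete_all", "write_file report.txt"])

# Reasonable steps, but the write did not take effect in the environment.
unlucky = score({}, ["list_files", "write_file report.txt"])

print(lucky)    # {'outcome': True, 'process': False}
print(unlucky)  # {'outcome': False, 'process': True}
```

Reporting the two flags side by side, rather than folding them into one number, keeps both of these cases visible.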
Keeping it honest
- Run each task several times since agents are stochastic.
- Hold out a private task set to avoid tuning to the eval.
- Track cost and latency alongside success rate; none of the three means much in isolation.
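The repeated-run and cost/latency points above can be sketched as a single aggregation loop. This is an illustration under stated assumptions: `flaky_agent` is a hypothetical stand-in that succeeds at random, and its cost figures are made up, not measurements from any real system.

```python
import random
import statistics
import time
from typing import Callable, Dict, Tuple

# Hypothetical agent interface: returns (success, cost_in_dollars) for one attempt.
AgentRun = Callable[[random.Random], Tuple[bool, float]]


def flaky_agent(rng: random.Random) -> Tuple[bool, float]:
    # Stand-in for a stochastic agent: succeeds roughly 70% of the time.
    return rng.random() < 0.7, rng.uniform(0.01, 0.05)


def evaluate(agent: AgentRun, runs: int, seed: int = 0) -> Dict[str, float]:
    if runs < 1:
        raise ValueError("runs must be at least 1")
    rng = random.Random(seed)  # seeded so the evaluation itself is repeatable
    successes, costs, latencies = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        ok, cost = agent(rng)
        latencies.append(time.perf_counter() - start)
        successes.append(ok)
        costs.append(cost)
    # Report success, cost, and latency together rather than one at a time.
    return {
        "runs": runs,
        "success_rate": sum(successes) / runs,
        "mean_cost": statistics.mean(costs),
        "mean_latency_s": statistics.mean(latencies),
    }


report = evaluate(flaky_agent, runs=20)
print(report)
```

The same `evaluate` call could be pointed at a held-out task set that is never used during tuning; only the task source changes, not the loop.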
A harness turns vague impressions of an agent into numbers you can improve against.
Key idea
An agent evaluation harness runs the agent on fixed tasks in a resettable environment, scores both outcome and process, captures traces, and repeats each task across stochastic runs, so vague impressions become numbers you can improve against.