← Lessons

quiz vs the machine

Gold1430

Machine Learning

The Agent Evaluation Harness

How to measure agent quality with repeatable tasks scoring and traces.

5 min read · core · beat Gold to climb

Why agents are hard to test

An agent takes many paths to a goal, calls tools, and may succeed in different ways. A single accuracy number hides this, so you need an evaluation harness: a repeatable rig that runs the agent on tasks and scores the outcomes.

What a harness contains

  • Task set: fixed scenarios with clear success criteria.
  • Environment: tools or sandboxes the agent acts in, reset between runs.
  • Scorer: a check on the final state, not just the text, such as did the file get created.
  • Trace capture: the full sequence of steps for debugging failures.

Outcome versus process

Scoring the outcome asks whether the goal was met. Scoring the process asks whether the steps were sensible. Good harnesses track both, since an agent can reach the goal by luck or fail despite reasonable steps.

Keeping it honest

  • Run each task several times since agents are stochastic.
  • Hold out a private task set to avoid tuning to the eval.
  • Watch cost and latency alongside success, not in isolation.

A harness turns vague impressions of an agent into numbers you can improve against.

Key idea

An agent evaluation harness runs the agent on fixed tasks in a resettable environment scoring both outcome and process and capturing traces, repeated across stochastic runs so vague impressions become numbers you can improve.

Check yourself

Answer to earn rating on the learn ladder.

1. Why should an agent scorer check the final state, not just the text?

2. Why run each evaluation task several times?