Agent Evaluation Harness Deep Dive

Why evaluation is different for agents

A single model output is easy to score against a reference. An agent runs many steps, calls tools, and reaches a final state. You must evaluate the outcome and often the trajectory that produced it.

What a harness measures

Task success: did the agent reach the correct end state?
Efficiency: how many steps, tokens, and tool calls did it take?
Trajectory quality: were the intermediate actions sensible?
Robustness: does it still succeed across varied inputs and seeds?
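
One way to make these four measures concrete is to record them per run and then aggregate. The sketch below is illustrative only: the RunRecord schema, the summarize function, and the idea of counting reviewer-flagged "bad actions" as a proxy for trajectory quality are assumptions, not part of any particular harness.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run on one task with one seed (hypothetical schema)."""
    task_id: str
    seed: int
    success: bool      # task success: correct end state reached
    steps: int         # efficiency: number of agent steps
    tokens: int        # efficiency: tokens consumed
    tool_calls: int    # efficiency: tool invocations
    bad_actions: int   # trajectory quality: steps flagged as not sensible


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Collapse a batch of runs into one number per measure."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    total_steps = sum(r.steps for r in records)

    # Group outcomes by task so robustness can be judged across seeds.
    by_task: dict[str, list[bool]] = {}
    for r in records:
        by_task.setdefault(r.task_id, []).append(r.success)

    return {
        "success_rate": sum(r.success for r in records) / n,
        "mean_steps": total_steps / n,
        "mean_tokens": sum(r.tokens for r in records) / n,
        "mean_tool_calls": sum(r.tool_calls for r in records) / n,
        # Share of all steps that were flagged (assumed quality proxy).
        "bad_action_rate": (
            sum(r.bad_actions for r in records) / total_steps
            if total_steps else 0.0
        ),
        # Robustness: fraction of tasks solved on every seed tried.
        "robust_task_rate": sum(all(v) for v in by_task.values()) / len(by_task),
    }
```

Reporting a per-task robustness figure alongside the overall success rate separates an agent that solves most tasks every time from one that solves every task only some of the time.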

How it runs

The harness loads a task suite, runs the agent in a sandbox, captures the full trace, and scores the final state against expected outcomes.
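
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Task, Result, run_suite, and increment_agent are hypothetical names, the "sandbox" is just a deep-copied dictionary standing in for a real container or VM, and the agent is a toy function rather than a model.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable

State = dict[str, int]
# Assumed agent interface: mutate the sandbox state, log actions to the trace.
Agent = Callable[[State, list[str]], None]


@dataclass
class Task:
    task_id: str
    initial_state: State
    expected_state: State


@dataclass
class Result:
    task_id: str
    success: bool
    trace: list[str] = field(default_factory=list)


def run_suite(agent: Agent, suite: list[Task]) -> list[Result]:
    """Run the agent on each task and score the final state."""
    results = []
    for task in suite:
        # Fresh copy per task: a stand-in for a real sandbox.
        sandbox = copy.deepcopy(task.initial_state)
        trace: list[str] = []
        try:
            agent(sandbox, trace)
        except Exception as exc:
            # A crash is recorded in the trace and scored like any other run.
            trace.append(f"error: {exc!r}")
        success = sandbox == task.expected_state
        results.append(Result(task.task_id, success, trace))
    return results


def increment_agent(state: State, trace: list[str]) -> None:
    """Toy agent: bumps the counter until it reaches 3."""
    while state["counter"] < 3:
        state["counter"] += 1
        trace.append(f"increment -> {state['counter']}")


suite = [
    Task("from-zero", {"counter": 0}, {"counter": 3}),
    Task("wrong-target", {"counter": 0}, {"counter": 5}),
]
for result in run_suite(increment_agent, suite):
    print(result.task_id, result.success, len(result.trace))
```

Keeping the trace next to the success flag means a failed or suspicious run can be inspected later without rerunning it.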

Pitfalls

Single runs are noisy because agents are stochastic, so average over multiple seeds. Outcome-only scoring can reward lucky paths that took unsafe actions, so inspect trajectories too. Keep tasks isolated so that one run cannot pollute another's state.
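
The first two pitfalls can be illustrated together. The sketch below averages success over several seeds, reports a standard error, and separately counts only the successes whose trajectory avoided unsafe actions. Everything here is assumed for illustration: toy_agent is a random stand-in for a real agent, and the UNSAFE_ACTIONS set and action names are made up.

```python
import random
from statistics import mean, stdev

# Hypothetical list of actions the harness treats as unsafe.
UNSAFE_ACTIONS = {"delete_all", "disable_checks"}


def is_safe(actions: list[str]) -> bool:
    """True if the trajectory contains no unsafe action."""
    return not any(a in UNSAFE_ACTIONS for a in actions)


def toy_agent(seed: int) -> tuple[bool, list[str]]:
    """Stand-in for a stochastic agent: returns (success, actions taken)."""
    rng = random.Random(seed)
    actions = ["read_file", "edit_file"]
    if rng.random() < 0.3:
        actions.append("disable_checks")  # a shortcut that can still "pass"
    actions.append("run_tests")
    success = rng.random() < 0.7
    return success, actions


def evaluate(seeds) -> dict[str, float]:
    """Average outcomes over seeds, with and without the safety check."""
    outcomes, safe_outcomes = [], []
    for seed in seeds:
        success, actions = toy_agent(seed)
        outcomes.append(1.0 if success else 0.0)
        safe_outcomes.append(1.0 if success and is_safe(actions) else 0.0)
    n = len(outcomes)
    return {
        "success_rate": mean(outcomes),
        "safe_success_rate": mean(safe_outcomes),
        # Standard error of the mean; undefined for a single run.
        "std_error": stdev(outcomes) / n ** 0.5 if n > 1 else float("nan"),
    }


print(evaluate(range(20)))
```

A gap between success_rate and safe_success_rate is the signal that outcome-only scoring is rewarding runs it should not.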

Key idea

An agent evaluation harness scores final outcomes and trajectories across many seeds in a sandbox, because single noisy runs cannot tell you if an agent truly works.
