Why evaluation is different for agents
A single model output is easy to score against a reference. An agent runs many steps, calls tools, and reaches a final state. You must evaluate the outcome and often the trajectory that produced it.
What a harness measures
- Task success: did the agent reach the correct end state?
- Efficiency: how many steps, tokens, and tool calls did it take?
- Trajectory quality: were the intermediate actions sensible?
- Robustness: does it still succeed across varied inputs and seeds?
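As a rough illustration, the four measurements above could be recorded per run and then aggregated. This is a minimal sketch, not a standard schema: `RunRecord`, its field names, and `summarize` are all hypothetical, and "flagged actions" stands in for whatever trajectory check a given harness uses.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run on one task with one seed (hypothetical schema)."""
    task_id: str
    seed: int
    success: bool         # task success: correct end state reached
    steps: int            # efficiency
    tokens: int           # efficiency
    tool_calls: int       # efficiency
    flagged_actions: int  # trajectory quality: actions a rule or reviewer flagged


def _rate(flags: list[bool]) -> float:
    """Fraction of True values in a non-empty list."""
    return sum(flags) / len(flags)


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into the four headline measurements."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)

    # Group outcomes by task so robustness can be judged across seeds.
    by_task: dict[str, list[bool]] = {}
    for r in records:
        by_task.setdefault(r.task_id, []).append(r.success)

    return {
        "success_rate": _rate([r.success for r in records]),
        "mean_steps": sum(r.steps for r in records) / n,
        "mean_tokens": sum(r.tokens for r in records) / n,
        "mean_tool_calls": sum(r.tool_calls for r in records) / n,
        "clean_trajectory_rate": _rate([r.flagged_actions == 0 for r in records]),
        # Robustness (one possible definition): tasks solved on every seed.
        "all_seeds_pass_rate": _rate([all(v) for v in by_task.values()]),
    }
```

Reporting the all-seeds pass rate next to the plain success rate is one way to separate an agent that is reliably right on some tasks from one that is occasionally right on all of them.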
How it runs
The harness loads a task suite, runs the agent in a sandbox, captures the full trace, and scores the final state against expected outcomes.
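The loop described above can be sketched in a few lines. Everything here is an assumption made for illustration: `Task`, `Trace`, `run_suite`, and `toy_agent` are hypothetical names, and a deep-copied dictionary stands in for a real sandbox such as a container or virtual machine.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable

State = dict[str, Any]
# Hypothetical agent interface: mutates the sandbox state and logs its actions.
Agent = Callable[[State, list[str]], None]


@dataclass
class Task:
    task_id: str
    initial_state: State
    expected_state: State


@dataclass
class Trace:
    task_id: str
    actions: list[str] = field(default_factory=list)
    final_state: State = field(default_factory=dict)
    success: bool = False


def run_suite(agent: Agent, tasks: list[Task]) -> list[Trace]:
    """Run the agent on each task and score the final state."""
    traces: list[Trace] = []
    for task in tasks:
        # Fresh copy per task, so one run cannot alter another's starting state.
        sandbox = copy.deepcopy(task.initial_state)
        actions: list[str] = []
        agent(sandbox, actions)  # the full action log is captured in `actions`
        # Success means every expected key holds its expected value.
        success = all(sandbox.get(k) == v for k, v in task.expected_state.items())
        traces.append(Trace(task.task_id, actions, sandbox, success))
    return traces


def toy_agent(state: State, log: list[str]) -> None:
    """Stand-in agent that increments a counter once."""
    log.append("increment counter")
    state["counter"] = state.get("counter", 0) + 1
```

Returning the whole trace rather than a bare pass/fail flag keeps the trajectory available for later inspection.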
Pitfalls
Single runs are noisy because agents are stochastic, so average over multiple seeds. Outcome-only scoring can reward lucky paths that took unsafe actions, so inspect trajectories too. Finally, keep tasks isolated so that one run cannot pollute another's state.
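The first two pitfalls can be made concrete with a small simulation. This is a sketch under stated assumptions: `simulated_run` is a made-up stand-in for a real agent, the probabilities are arbitrary, and the `UNSAFE_ACTIONS` set and action names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

UNSAFE_ACTIONS = {"delete_all"}  # hypothetical deny-list


def simulated_run(seed: int) -> tuple[bool, list[str]]:
    """Fake agent run: returns (success, actions). Probabilities are arbitrary."""
    rng = random.Random(seed)  # seeded so each run is reproducible
    actions = ["read_config"]
    if rng.random() < 0.3:
        # Unsafe shortcut that happens to reach the expected end state.
        actions.append("delete_all")
        return True, actions
    actions.append("apply_patch")
    return rng.random() < 0.7, actions


def score(seeds: list[int]) -> dict[str, float]:
    """Compare outcome-only scoring with trajectory-aware scoring over seeds."""
    outcome: list[float] = []
    safe: list[float] = []
    for seed in seeds:
        success, actions = simulated_run(seed)
        is_safe = UNSAFE_ACTIONS.isdisjoint(actions)
        outcome.append(float(success))
        safe.append(float(success and is_safe))  # unsafe wins do not count
    n = len(outcome)
    return {
        "n_seeds": float(n),
        "outcome_success_rate": mean(outcome),
        "safe_success_rate": mean(safe),
        # Standard error of the outcome rate; undefined for a single run.
        "outcome_stderr": stdev(outcome) / sqrt(n) if n > 1 else float("nan"),
    }


if __name__ == "__main__":
    print(score(list(range(20))))
```

Any gap between `outcome_success_rate` and `safe_success_rate` is the share of runs that outcome-only scoring would have rewarded despite an unsafe action, and the standard error gives a rough sense of how much the headline rate could move with a different set of seeds.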
Key idea
An agent evaluation harness scores final outcomes and trajectories across many seeds in a sandbox, because a single noisy run cannot tell you whether an agent truly works.