Machine Learning

Agent Evaluation Harness Deep Dive

Measuring whether an agent actually completes its tasks.

6 min read · advanced

Why evaluation is different for agents

A single model output is easy to score against a reference. An agent runs many steps, calls tools, and reaches a final state. You must evaluate the outcome and often the trajectory that produced it.

What a harness measures

  • Task success: did the agent reach the correct end state?
  • Efficiency: how many steps, tokens, and tool calls did it take?
  • Trajectory quality: were the intermediate actions sensible?
  • Robustness: does it still succeed across varied inputs and seeds?
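
One way to make these four measurements concrete is to record them per episode and then aggregate. The sketch below is illustrative only: `EpisodeResult`, `summarize`, and every field name are hypothetical, and counting "flagged actions" is just one assumed proxy for trajectory quality.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    """One agent run on one task with one seed (hypothetical schema)."""
    task_id: str
    seed: int
    success: bool         # task success: correct end state reached
    steps: int            # efficiency
    tokens: int           # efficiency
    tool_calls: int       # efficiency
    flagged_actions: int  # trajectory quality: actions marked as not sensible


def summarize(results):
    """Collapse a list of episodes into suite-level metrics."""
    n = len(results)
    successes_by_task = {}
    for r in results:
        successes_by_task.setdefault(r.task_id, []).append(r.success)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "mean_steps": mean(r.steps for r in results),
        "mean_tokens": mean(r.tokens for r in results),
        "mean_tool_calls": mean(r.tool_calls for r in results),
        "clean_trajectory_rate": sum(r.flagged_actions == 0 for r in results) / n,
        # Robustness proxy: share of tasks solved on every seed that was tried.
        "all_seed_pass_rate": sum(all(v) for v in successes_by_task.values())
        / len(successes_by_task),
    }
```

Reporting `all_seed_pass_rate` next to `success_rate` separates an agent that is reliably right on some tasks from one that is occasionally right on all of them.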

How it runs

The harness loads a task suite, runs the agent in a sandbox, captures the full trace, and scores the final state against expected outcomes.
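
The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not a real harness: `Task`, `Sandbox`, `run_task`, `run_suite`, and `toy_agent` are all hypothetical names, the sandbox is a plain in-memory dict, and "scoring" is an equality check against the expected state.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    initial_state: dict
    expected_state: dict


@dataclass
class Sandbox:
    """Hypothetical in-memory sandbox: holds state and records every action."""
    state: dict
    trace: list = field(default_factory=list)

    def apply(self, action, key, value=None):
        self.trace.append((action, key, value))  # capture the full trace
        if action == "set":
            self.state[key] = value
        elif action == "delete":
            self.state.pop(key, None)


def run_task(agent, task, seed):
    # Deep-copy the initial state so each run starts from a fresh sandbox.
    sandbox = Sandbox(state=copy.deepcopy(task.initial_state))
    agent(sandbox, random.Random(seed))  # the agent gets its own seeded RNG
    return {
        "task": task.name,
        "seed": seed,
        "success": sandbox.state == task.expected_state,
        "trace": sandbox.trace,
    }


def run_suite(agent, tasks, seeds):
    return [run_task(agent, task, seed) for task in tasks for seed in seeds]


def toy_agent(sandbox, rng):
    """Stand-in agent that always takes the same single action."""
    sandbox.apply("set", "status", "done")


suite = [Task("mark_done", {"status": "open"}, {"status": "done"})]
suite_results = run_suite(toy_agent, suite, seeds=[0, 1])
```

A real sandbox would isolate processes, files, and network access rather than a dict, but the shape stays the same: fresh state in, trace and final state out.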

Pitfalls

Single runs are noisy because agents are stochastic, so run each task with multiple seeds and report the average. Outcome-only scoring can reward lucky paths that took unsafe actions, so inspect trajectories too. Keep tasks isolated so one run cannot pollute another's state.
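
A small sketch of the first two pitfalls, with made-up data. The `UNSAFE_ACTIONS` set, the helper names, and the four example runs are hypothetical. The standard error uses the usual normal approximation for a proportion, which is rough at small sample sizes.

```python
import math

# Hypothetical deny-list; a real harness would define its own safety rules.
UNSAFE_ACTIONS = {"delete_all", "disable_checks"}


def success_rate_with_stderr(outcomes):
    """Mean success over runs, plus the standard error sqrt(p * (1 - p) / n)."""
    n = len(outcomes)
    p = sum(outcomes) / n
    return p, math.sqrt(p * (1 - p) / n)


def safe_success(success, trace):
    """Count a run as a success only if no action in its trace was unsafe."""
    return success and not any(action in UNSAFE_ACTIONS for action in trace)


# Illustrative (success, trace) pairs for one task across four seeds.
runs = [
    (True, ["read", "write"]),
    (True, ["delete_all", "write"]),  # reached the goal via an unsafe action
    (False, ["read"]),
    (True, ["read", "write"]),
]

outcome_only = success_rate_with_stderr([s for s, _ in runs])
trajectory_aware = success_rate_with_stderr([safe_success(s, t) for s, t in runs])
```

On this data the outcome-only rate is 0.75 and the trajectory-aware rate is 0.5. Both standard errors are above 0.2, so four seeds are too few to separate the two numbers with any confidence.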

Key idea

An agent evaluation harness scores final outcomes and trajectories across many seeds in a sandbox, because single noisy runs cannot tell you if an agent truly works.

Check yourself

1. Why is evaluating an agent harder than scoring a single output?

2. Why run each task across multiple seeds?