Evaluation of Agent Trajectories
Evaluating an agent on its final answer alone hides how that answer was produced. A trajectory is the full sequence of thoughts, tool calls, and observations; judging it reveals how the agent got there.
Why outcome alone is not enough
- An agent can reach the right answer by luck, through a flawed path that may fail on the next similar task.
- It can reach a wrong answer despite mostly sound reasoning, a failure that is often easy to fix once the faulty step is found.
- Two agents with the same score can differ wildly in cost and reliability.
What to measure
Useful trajectory metrics include task success rate, the number of steps taken, tool selection accuracy, and how often the agent recovered from an error. Some evaluations check each step against a reference trajectory; others use an LLM judge to rate whether each action was reasonable given the state.
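As a minimal sketch of how these metrics might be computed, assume each trajectory is stored as a list of steps with the tool called, an optional reference tool, and an error flag. The names Step, Trajectory, expected_tool, and the three functions are hypothetical, chosen only for illustration. "Recovery" is defined here as an error step followed directly by a non-error step, which is one possible definition among several.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    tool: str                            # tool the agent called at this step
    expected_tool: Optional[str] = None  # reference tool, if the step is labeled
    error: bool = False                  # True if the observation was an error


@dataclass
class Trajectory:
    steps: List[Step]
    success: bool                        # whether the task was solved


def tool_selection_accuracy(traj: Trajectory) -> Optional[float]:
    """Share of labeled steps whose tool matches the reference tool."""
    labeled = [s for s in traj.steps if s.expected_tool is not None]
    if not labeled:
        return None  # nothing to compare against
    return sum(s.tool == s.expected_tool for s in labeled) / len(labeled)


def error_recovery_rate(traj: Trajectory) -> Optional[float]:
    """Share of error steps followed directly by a non-error step (assumed definition)."""
    error_idx = [i for i, s in enumerate(traj.steps) if s.error]
    if not error_idx:
        return None  # no errors, so recovery is undefined
    recovered = sum(
        1 for i in error_idx
        if i + 1 < len(traj.steps) and not traj.steps[i + 1].error
    )
    return recovered / len(error_idx)


def summarize(trajs: List[Trajectory]) -> dict:
    """Aggregate task success rate and mean step count over a non-empty set of runs."""
    n = len(trajs)
    return {
        "success_rate": sum(t.success for t in trajs) / n,
        "mean_steps": sum(len(t.steps) for t in trajs) / n,
    }
```

Returning None rather than 0 when a metric is undefined keeps runs with no labeled steps or no errors from dragging down an average. An LLM judge could replace the exact-match comparison in tool_selection_accuracy when no reference trajectory exists.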
Building a harness
A good evaluation harness runs many tasks with fixed seeds and logged traces, so results are reproducible and failures are inspectable. When a run fails, you can replay its trajectory and see the exact step that went wrong. Without this, agent improvements are guesswork, because you cannot tell whether a change helped the reasoning or just got lucky on a few examples.
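The harness idea can be sketched roughly as follows, under several assumptions: the agent is a plain function that takes a task and a seeded random generator, each trace is a dictionary, and traces are serialized as JSON lines. toy_agent, run_task, run_suite, to_jsonl, and first_bad_step are hypothetical names, and toy_agent is only a stand-in for a real agent.

```python
import json
import random


def toy_agent(task, rng):
    """Hypothetical stand-in agent: picks a tool at random for each of 3 steps."""
    tools = ["search", "calculator", "lookup"]
    return [{"step": i, "tool": rng.choice(tools)} for i in range(3)]


def run_task(agent, task, seed):
    """Run one task with its own seeded generator and return a loggable trace."""
    rng = random.Random(seed)  # fixed seed makes the run repeatable
    return {"task": task, "seed": seed, "steps": agent(task, rng)}


def run_suite(agent, tasks, seeds):
    """Run every task under every seed."""
    return [run_task(agent, task, seed) for task in tasks for seed in seeds]


def to_jsonl(traces):
    """Serialize traces as one JSON object per line for logging."""
    return "\n".join(json.dumps(t, sort_keys=True) for t in traces)


def first_bad_step(trace, is_ok):
    """Replay a logged trace and return the index of the first step that fails is_ok."""
    for step in trace["steps"]:
        if not is_ok(step):
            return step["step"]
    return None  # every step passed the check
```

Running run_suite twice with the same tasks and seeds should give identical logs, which is what makes a before-and-after comparison meaningful. A real agent that calls a remote model would also need deterministic decoding or cached responses to get the same property; seeding the local generator alone is not enough in that case.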
Key idea
Trajectory evaluation judges the whole path of thoughts and actions, not just the final answer, so that a measured improvement reflects better reasoning rather than luck.