Machine Learning

Evaluation of Agent Trajectories

Judging not just the answer but the path the agent took.

Evaluating an agent on its final answer alone hides a lot. A trajectory is the full sequence of thoughts, tool calls, and observations, and judging it reveals how the agent got there.

Why outcome alone is not enough

  • An agent can reach the right answer by luck, through a flawed path that is likely to fail on the next task.
  • It can reach a wrong answer despite mostly sound reasoning, a failure that is often easy to fix once the faulty step is identified.
  • Two agents with the same score can differ wildly in cost and reliability.

What to measure

Useful trajectory metrics include task success rate, the number of steps taken, tool selection accuracy, and how often the agent recovered from an error. Some evaluations check each step against a reference trajectory; others use an LLM judge to rate whether each action was reasonable given the state.
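
The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `Step` and `Trajectory` classes, the `error` flag, and the definition of "recovered" (an error step immediately followed by a non-error step) are all assumptions made for the example. Tool selection accuracy is computed here by comparing against a reference trajectory position by position, which is only one of several reasonable ways to do it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    thought: str
    tool: Optional[str]   # tool called at this step, or None if no tool was used
    observation: str
    error: bool = False   # assumed flag: True if the observation reported a failure


@dataclass
class Trajectory:
    steps: List[Step]
    success: bool         # did the task succeed overall?


def tool_selection_accuracy(traj: Trajectory, reference_tools: List[str]) -> float:
    """Fraction of reference positions where the agent picked the same tool."""
    if not reference_tools:
        return 0.0
    chosen = [s.tool for s in traj.steps]
    matches = sum(1 for got, want in zip(chosen, reference_tools) if got == want)
    return matches / len(reference_tools)


def recovery_rate(traj: Trajectory) -> Optional[float]:
    """Fraction of error steps that are immediately followed by a non-error step.

    Returns None when the trajectory contains no errors at all.
    """
    error_idx = [i for i, s in enumerate(traj.steps) if s.error]
    if not error_idx:
        return None
    recovered = sum(
        1 for i in error_idx
        if i + 1 < len(traj.steps) and not traj.steps[i + 1].error
    )
    return recovered / len(error_idx)


def summarize(trajs: List[Trajectory]) -> dict:
    """Aggregate task success rate and mean step count over a set of runs."""
    n = len(trajs)
    return {
        "success_rate": sum(t.success for t in trajs) / n,
        "mean_steps": sum(len(t.steps) for t in trajs) / n,
    }
```

An LLM-judge variant would replace `tool_selection_accuracy` with a per-step rating from a model; that part is omitted here because it depends on the judge model and prompt in use.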

Building a harness

A good evaluation harness runs many tasks with fixed seeds and logged traces, so results are reproducible and failures are inspectable. When a run fails, you can replay its trajectory and find the exact step that went wrong. Without this, agent improvements are guesswork, because you cannot tell whether a change helped the reasoning or just got lucky on a few examples.
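
A harness along these lines might look like the sketch below. Everything here is hypothetical: `toy_agent` is a stand-in that picks tools at random, the task and trace dictionaries use an invented schema, and traces are serialized as JSON lines simply because that format is easy to inspect. With a real LLM-backed agent, a fixed seed alone may not guarantee identical runs, which is one more reason to keep the logged traces.

```python
import json
import random


def toy_agent(task, rng):
    """Hypothetical stand-in agent: picks tools at random until it picks 'finish'."""
    steps = []
    for _ in range(task["max_steps"]):
        tool = rng.choice(["search", "calculator", "finish"])
        steps.append({"tool": tool, "observation": f"ran {tool}"})
        if tool == "finish":
            break
    success = bool(steps) and steps[-1]["tool"] == "finish"
    return {"task_id": task["id"], "steps": steps, "success": success}


def run_suite(tasks, seed):
    """Run every task with its own seeded RNG and return the full traces."""
    traces = []
    for i, task in enumerate(tasks):
        task_seed = seed + i              # one fixed seed per task
        rng = random.Random(task_seed)    # isolated RNG, no global state
        trace = toy_agent(task, rng)
        trace["seed"] = task_seed         # stored so the run can be repeated
        traces.append(trace)
    return traces


def to_jsonl(traces):
    """Serialize traces as one JSON object per line for logging."""
    return "\n".join(json.dumps(t, sort_keys=True) for t in traces)


def first_failing_step(trace, is_ok):
    """Replay a logged trace and return the index of the first step that fails
    the check `is_ok`, or None if every step passes."""
    for i, step in enumerate(trace["steps"]):
        if not is_ok(step):
            return i
    return None
```

Running `run_suite` twice with the same `seed` yields identical traces for this toy agent, so any difference between two runs can be attributed to a change in the agent rather than to chance. `first_failing_step` takes the check as a parameter because what counts as a bad step (wrong tool, error observation, judge score below a threshold) varies by evaluation.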

Key idea

Trajectory evaluation judges the whole path of thoughts and actions, not just the final answer, so you can tell real improvements from lucky ones.

Check yourself

1. What is an agent trajectory?

2. Why is judging the final answer alone insufficient?

3. Why log full traces with fixed seeds?