Machine Learning

The Agent Trajectory Eval

Judging an agent by its whole sequence of actions, not just the final answer.

7 min read · advanced

Why outcome alone is not enough

An agent that plans, calls tools, and acts over many steps produces a trajectory: the full sequence of thoughts, tool calls, and observations. Scoring only the final outcome misses how the agent got there, including wasteful, unsafe, or lucky paths.
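One way to picture a trajectory is as a plain record of steps. The sketch below is illustrative only: the `Step` and `Trajectory` classes, their field names, and the tool names `search` and `read_file` are assumptions made for this example, not part of any particular agent framework. It shows two runs that end with the same answer but took very different paths.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Step:
    """One step of a run (hypothetical schema for illustration)."""
    thought: str
    tool: Optional[str] = None          # None means a pure reasoning step
    args: Dict[str, Any] = field(default_factory=dict)
    observation: str = ""


@dataclass
class Trajectory:
    """The full record of a run: goal, ordered steps, and final answer."""
    goal: str
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""

    def tool_calls(self) -> List[Step]:
        # Only the steps that actually invoked a tool.
        return [s for s in self.steps if s.tool is not None]


GOAL = "Report the configured timeout"

# A run that looks up the value before answering.
careful = Trajectory(
    goal=GOAL,
    steps=[
        Step("Find the config file", "search", {"query": "config"}, "config.yaml"),
        Step("Read the file", "read_file", {"path": "config.yaml"}, "timeout: 30"),
    ],
    final_answer="30",
)

# A run that guesses and happens to be right.
lucky = Trajectory(
    goal=GOAL,
    steps=[Step("Guess a common default")],
    final_answer="30",
)
```

Both runs would pass a check on `final_answer` alone; only the recorded steps reveal that one of them never looked anything up.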

What to measure

  • Task success: did the end state satisfy the goal?
  • Trajectory quality: were the steps sensible and efficient?
  • Tool use correctness: right tool, right arguments, and correct handling of results.
  • Efficiency: number of steps, tokens, and wall-clock time.
  • Safety: did it avoid forbidden or destructive actions?

A right answer reached through a destructive shortcut should not score full marks.
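Several of these dimensions can be computed mechanically from a logged run. The following is a minimal sketch under stated assumptions: the tool registry `TOOL_SPECS`, the `FORBIDDEN_TOOLS` set, the step dictionary layout, and the `max_steps` budget are all hypothetical. It covers task success, tool argument checks, step count, and safety; trajectory quality and token or time cost are left out because they need a judge or real instrumentation.

```python
# Hypothetical tool registry: tool name -> required argument names.
TOOL_SPECS = {
    "search": {"query"},
    "read_file": {"path"},
    "delete_file": {"path"},
}

# Assumption for this example: deleting files is forbidden in the task.
FORBIDDEN_TOOLS = {"delete_file"}


def measure(steps, goal_met, max_steps=10):
    """Return per-dimension measurements for one logged run."""
    calls = [s for s in steps if s.get("tool")]
    # A call counts as well-formed if the tool exists and all required args are present.
    valid = [
        s for s in calls
        if s["tool"] in TOOL_SPECS
        and TOOL_SPECS[s["tool"]] <= set(s.get("args", {}))
    ]
    return {
        "task_success": bool(goal_met),
        "tool_correctness": len(valid) / len(calls) if calls else 1.0,
        "num_steps": len(steps),
        "within_budget": len(steps) <= max_steps,
        "safe": all(s["tool"] not in FORBIDDEN_TOOLS for s in calls),
    }


# The goal is met, but only by way of a destructive shortcut.
run = [
    {"tool": "search", "args": {"query": "stale cache"}},
    {"tool": "delete_file", "args": {"path": "cache.db"}},
]
report = measure(run, goal_met=True)
```

Here `report["task_success"]` is true while `report["safe"]` is false, which is exactly the case that should not earn full marks.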

Outcome versus process

Outcome-based scoring checks the final state, which is objective but blind to bad paths and lucky successes. Process-based scoring inspects each step, catching errors that happened to cancel out. Strong evals combine both: success gates the score, while trajectory quality refines it.
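The gating idea can be sketched as a small scoring function. The 0.5 base and 0.5 process weight below are arbitrary choices for illustration, and `step_scores` is assumed to hold per-step quality labels in the range 0 to 1 from a human or automated judge; a real eval would pick its own weights.

```python
from typing import List


def combined_score(success: bool, step_scores: List[float],
                   base: float = 0.5, process_weight: float = 0.5) -> float:
    """Success gates the score; mean per-step quality refines it.

    The weights are illustrative assumptions, not a standard.
    """
    if not success:
        return 0.0  # gate: a failed task earns nothing, however tidy the steps
    quality = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return base + process_weight * quality


clean_run = combined_score(True, [1.0, 1.0])    # every step judged sound
sloppy_run = combined_score(True, [1.0, 0.0])   # succeeded despite a bad step
failed_run = combined_score(False, [1.0, 1.0])  # good steps, goal not met
```

With these weights a successful run lands between 0.5 and 1.0, so two runs that both reach the goal can still be ranked by how they got there.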

Practical challenges

Trajectories are long, branching, and stochastic, so the same agent's results vary across runs. Evals need reproducible environments, seeds, and per-step checkpoints. Automated judges can label steps, but they inherit the biases of any LLM judge and must be validated against human review.
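A rough sketch of both points follows. `toy_agent_run` is a stand-in for a real agent episode (its step range and 70% success chance are made up), and `judge_agreement` is a simple match rate chosen for brevity; a real validation might prefer a chance-corrected statistic. The point is only that fixed seeds make a multi-run eval repeatable, and that judge labels can be compared with human labels on the same steps.

```python
import random
import statistics


def toy_agent_run(seed: int) -> dict:
    """Stand-in for one agent episode; a real harness would also reset the environment."""
    rng = random.Random(seed)  # local RNG, so runs do not share hidden state
    steps = rng.randint(3, 8)
    success = rng.random() < 0.7
    return {"seed": seed, "steps": steps, "success": success}


def evaluate(seeds) -> dict:
    """Run once per seed and aggregate, since a single run is not representative."""
    runs = [toy_agent_run(s) for s in seeds]
    return {
        "runs": runs,
        "success_rate": sum(r["success"] for r in runs) / len(runs),
        "mean_steps": statistics.mean(r["steps"] for r in runs),
    }


def judge_agreement(judge_labels, human_labels) -> float:
    """Fraction of steps where the automated judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


first = evaluate(range(5))
second = evaluate(range(5))  # same seeds, so the results should be identical

agreement = judge_agreement(
    ["ok", "bad", "ok", "ok"],   # automated judge
    ["ok", "bad", "bad", "ok"],  # human review
)
```

If `first` and `second` ever differ, some source of randomness is escaping the seed, which is worth fixing before comparing agents.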

Key idea

Agent trajectory evaluation scores the whole sequence of actions, combining outcome success with process quality, tool correctness, efficiency, and safety, because a right answer reached by a reckless path is not a good result.

Check yourself

1. Why is outcome-only scoring insufficient for agents?

2. What is a trajectory in agent evaluation?

3. Why do agent evals need reproducible environments and seeds?