Why outcome alone is not enough
An agent that plans, calls tools, and acts over many steps produces a trajectory: the full sequence of thoughts, tool calls, and observations. Scoring only the final outcome ignores how the agent got there, so wasteful, unsafe, or lucky paths go undetected.
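To make the term concrete, here is a minimal sketch of how a trajectory might be recorded. The class and field names (`Step`, `Trajectory`, `thought`, `tool`, `args`, `observation`) are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Step:
    """One step of an agent run (hypothetical schema)."""
    thought: str                       # the agent's reasoning at this step
    tool: Optional[str] = None         # name of the tool called, if any
    args: dict = field(default_factory=dict)   # arguments passed to the tool
    observation: Optional[Any] = None  # what the tool or environment returned


@dataclass
class Trajectory:
    """The full sequence of steps taken toward a goal."""
    goal: str
    steps: list = field(default_factory=list)
    final_answer: Optional[str] = None

    def tool_calls(self) -> list:
        """Return only the steps that invoked a tool."""
        return [s for s in self.steps if s.tool is not None]


traj = Trajectory(
    goal="Report the capital of France",
    steps=[
        Step(thought="I should look this up.", tool="search",
             args={"query": "capital of France"}, observation="Paris"),
        Step(thought="The search result answers the question."),
    ],
    final_answer="Paris",
)
```

An outcome-only eval would read just `final_answer`; a trajectory eval also reads `steps`.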
What to measure
- Task success: did the end state satisfy the goal?
- Trajectory quality: were the steps sensible and efficient?
- Tool use correctness: was the right tool called with the right arguments, and were the results handled correctly?
- Efficiency: number of steps, tokens, and wall-clock time.
- Safety: did the agent avoid forbidden or destructive actions?
A right answer reached through a destructive shortcut should not score full marks.
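Several of these dimensions can be computed mechanically from a logged run. The sketch below assumes each step is a dict with `tool` and `args` keys, and that the eval author supplies a table of required arguments per tool, a list of forbidden tools, and a step budget. All of these names and values are hypothetical. Trajectory quality is left out because it usually needs a human or model judge.

```python
# Hypothetical eval configuration: required argument names per known tool,
# and the tools the agent must never call in this environment.
TOOL_SCHEMAS = {
    "search": {"query"},
    "read_file": {"path"},
    "drop_table": {"name"},
}
FORBIDDEN_TOOLS = {"drop_table"}


def tool_call_valid(step: dict) -> bool:
    """A call is valid if the tool is known and all required args are present."""
    required = TOOL_SCHEMAS.get(step["tool"])
    return required is not None and required <= set(step["args"])


def score_dimensions(steps: list, succeeded: bool, step_budget: int) -> dict:
    """Report each dimension separately instead of collapsing to one number."""
    calls = [s for s in steps if s.get("tool")]
    valid = sum(tool_call_valid(s) for s in calls)
    return {
        "task_success": succeeded,
        # Fraction of tool calls that were well-formed.
        "tool_correctness": valid / len(calls) if calls else 1.0,
        # 1.0 when within budget, shrinking as the run overshoots it.
        "efficiency": min(1.0, step_budget / len(steps)) if steps else 1.0,
        "safe": not any(s["tool"] in FORBIDDEN_TOOLS for s in calls),
    }


steps = [
    {"tool": "search", "args": {"query": "stale rows"}},
    {"tool": "read_file", "args": {"file": "notes.txt"}},   # wrong argument name
    {"tool": "drop_table", "args": {"name": "users"}},      # destructive shortcut
]
report = score_dimensions(steps, succeeded=True, step_budget=3)
```

Here `report["task_success"]` is true while `report["safe"]` is false, which is exactly the case that an outcome-only score would miss.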
Outcome versus process
Outcome-based scoring checks the final state, which is objective but blind to bad paths and lucky successes. Process-based scoring inspects each step, catching errors that happened to cancel out. Strong evals combine both: success gates the score, while trajectory quality refines it.
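One possible way to express "success gates, quality refines" is sketched below. The function name, the `floor` parameter, and the linear blend are assumptions chosen for illustration; other combinations are equally reasonable.

```python
def trajectory_score(success: bool, process_quality: float, floor: float = 0.5) -> float:
    """Combine outcome and process into a single score in [0, 1].

    success          -- did the end state satisfy the goal (the gate)
    process_quality  -- step-level quality in [0, 1], e.g. from a rubric or judge
    floor            -- assumed credit for succeeding with the worst possible process
    """
    if not 0.0 <= process_quality <= 1.0:
        raise ValueError("process_quality must be in [0, 1]")
    if not success:
        # Gate: a failed task scores zero however tidy the steps were.
        return 0.0
    # Refine: a successful run earns between `floor` and 1.0 depending on process.
    return floor + (1.0 - floor) * process_quality


clean_success = trajectory_score(True, 1.0)
sloppy_success = trajectory_score(True, 0.0)
tidy_failure = trajectory_score(False, 1.0)
```

With these settings a sloppy success still outranks a tidy failure, but only a clean success reaches full marks.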
Practical challenges
Trajectories are long, branching, and stochastic, so the same agent can behave differently from run to run. Evals need reproducible environments, fixed seeds, and per-step checkpoints. Automated judges can label steps, but they inherit the biases of any LLM judge and must be validated against human review.
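A rough sketch of both ideas follows: repeating an eval over fixed seeds and reporting the spread, and checking a judge's step labels against human labels with Cohen's kappa. `run_agent` is a stand-in that simulates a run with a seeded random generator. A real harness would reset the environment and invoke the actual agent there. The label lists are made-up illustration data.

```python
import random
import statistics
from collections import Counter


def run_agent(seed: int) -> dict:
    """Hypothetical stand-in for one agent run in a resettable environment."""
    rng = random.Random(seed)          # per-run generator, so runs are repeatable
    return {"seed": seed, "steps": rng.randint(3, 12), "success": rng.random() < 0.7}


def evaluate(seeds) -> dict:
    """Run once per seed and summarize the variation across runs."""
    runs = [run_agent(s) for s in seeds]
    step_counts = [r["steps"] for r in runs]
    return {
        "success_rate": sum(r["success"] for r in runs) / len(runs),
        "mean_steps": statistics.mean(step_counts),
        "stdev_steps": statistics.pstdev(step_counts),
    }


def cohen_kappa(judge: list, human: list) -> float:
    """Agreement between two labelers, corrected for chance agreement."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


summary = evaluate(range(20))

# Made-up step labels: 1 = step judged acceptable, 0 = step judged faulty.
judge_labels = [1, 1, 0, 1, 0, 0]
human_labels = [1, 0, 0, 1, 0, 1]
kappa = cohen_kappa(judge_labels, human_labels)
```

Re-running `evaluate` with the same seeds gives the same summary, which makes regressions distinguishable from noise. A low kappa suggests the judge's labels should not be trusted without more human review.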
Key idea
Agent trajectory evaluation scores the whole sequence of actions, combining outcome success with process quality, tool correctness, efficiency, and safety, because a right answer reached by a reckless path is not a good result.