← Lessons

quiz vs the machine

Gold1410

Machine Learning

The Agent Error Recovery

How agents detect failed tool calls and recover instead of crashing or hallucinating.

5 min read · core · beat Gold to climb

Failure is normal

Tools time out, return errors, or hand back garbage. A robust agent expects this and has a recovery strategy so one bad step does not derail the whole task.

Recovery moves

  • Detect: read the tool result and notice an error code, empty payload, or malformed output.
  • Retry: call again, possibly with a small delay, for transient failures.
  • Repair: fix the arguments if the error names a bad parameter.
  • Reroute: switch to a different tool that can answer the same need.
  • Escalate: ask a human or report the blocker when nothing works.

Feeding the error back

The key trick is to put the error message into the context as an observation. The model then reasons about it and chooses the next move, rather than ignoring the failure.

Pitfalls

  • Blind retries on a permanent error waste budget.
  • A retry cap prevents infinite loops.
  • The agent must distinguish a real error from a valid but unexpected result.

Treating errors as observations to reason about, not exceptions to swallow, is what makes an agent resilient.

Key idea

Agent error recovery detects failed tool calls and responds by retrying repairing rerouting or escalating, feeding the error back as an observation with a retry cap so one bad step does not derail the task.

Check yourself

Answer to earn rating on the learn ladder.

1. What is the key trick for handling a tool error in an agent?

2. Why does error recovery need a retry cap?