Failure is normal
Tools time out, return errors, or hand back garbage. A robust agent expects this and has a recovery strategy so one bad step does not derail the whole task.
Recovery moves
- Detect: read the tool result and notice an error code, empty payload, or malformed output.
- Retry: call again, possibly with a small delay, for transient failures.
- Repair: fix the arguments if the error names a bad parameter.
- Reroute: switch to a different tool that can answer the same need.
- Escalate: ask a human or report the blocker when nothing works.
Feeding the error back
The key trick is to put the error message into the context as an observation. The model then reasons about it and chooses the next move, rather than ignoring the failure.
Pitfalls
- Blind retries on a permanent error waste budget.
- A retry cap prevents infinite loops.
- The agent must distinguish a real error from a valid but unexpected result.
Treating errors as observations to reason about, not exceptions to swallow, is what makes an agent resilient.
Key idea
Agent error recovery detects failed tool calls and responds by retrying repairing rerouting or escalating, feeding the error back as an observation with a retry cap so one bad step does not derail the task.