Where the cost goes
Each agent step is a model call, and steps stack up fast. The biggest drivers are token volume, the number of steps, and serial waiting on slow tools. Optimization attacks all three.
Levers that help
- Smaller models route easy steps to a cheaper model, reserve the large one for hard reasoning
- Prompt caching reuse stable context so repeated tokens are not reprocessed
- Parallel tools run independent tool calls at once instead of in sequence
- Step pruning stop early when the goal is met, avoid redundant calls
A routing view
A router classifies each step and sends it to the cheapest model that can handle it, escalating only when needed.
Balancing the trade
Aggressive cost cutting can hurt quality if you under power a hard step or prune too early. Measure success rate alongside cost and latency, and only trim where success stays flat. Caching stable system prompts is often the cheapest win.
Key idea
Cut agent cost and latency by routing to smaller models, caching stable context, parallelizing tools, and stopping early, while watching that success rate holds.