Machine Learning

Cost and Latency Optimization for Agents

Making agents cheaper and faster without losing quality.

Where the cost goes

Each agent step is a model call, and steps stack up fast. The biggest drivers are token volume, the number of steps, and serial waiting on slow tools. Optimization attacks all three.
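As a rough illustration of how these drivers add up, the sketch below totals token cost and serial tool wait for one run. The `Step` fields, the function names, and the per-1k-token prices in the usage lines are hypothetical values chosen for the example, not real provider pricing.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent step (hypothetical record, for illustration only)."""
    input_tokens: int    # tokens sent to the model on this step
    output_tokens: int   # tokens the model generated on this step
    tool_seconds: float  # time spent waiting on tools during this step


def run_cost(steps, usd_per_1k_in, usd_per_1k_out):
    """Total model cost of a run, given assumed per-1k-token prices."""
    return sum(
        s.input_tokens / 1000 * usd_per_1k_in
        + s.output_tokens / 1000 * usd_per_1k_out
        for s in steps
    )


def serial_tool_wait(steps):
    """Total tool wait if every tool call runs one after another."""
    return sum(s.tool_seconds for s in steps)


# Example run: input tokens grow each step as context accumulates.
steps = [Step(2000, 200, 1.5), Step(3000, 200, 0.5), Step(4000, 200, 2.0)]
print(f"cost: ${run_cost(steps, 0.001, 0.002):.4f}")
print(f"serial tool wait: {serial_tool_wait(steps):.1f}s")
```

Logging these three numbers per run makes it easier to see which driver dominates before choosing a lever.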

Levers that help

  • Smaller models: route easy steps to a cheaper model and reserve the large one for hard reasoning
  • Prompt caching: reuse stable context so repeated tokens are not reprocessed
  • Parallel tools: run independent tool calls at once instead of in sequence
  • Step pruning: stop early when the goal is met and avoid redundant calls
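The parallel tools lever can be sketched with Python's standard `asyncio` library. `fake_tool` is a stand-in that only sleeps, and the tool names and delays are made up for the example; a real agent would await actual I/O here. The sketch assumes the calls are independent, meaning none needs another's result.

```python
import asyncio
import time


async def fake_tool(name, delay):
    """Hypothetical tool: simulates a slow I/O call by sleeping."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_serial(calls):
    """Await each tool call before starting the next one."""
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls):
    """Start all independent tool calls together and wait for all of them."""
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


def timed(runner, calls):
    """Run a coroutine function and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(runner(calls))
    return results, time.perf_counter() - start


calls = [("search", 0.05), ("database", 0.05), ("weather", 0.05)]
_, serial_time = timed(run_serial, calls)
_, parallel_time = timed(run_parallel, calls)
print(f"serial: {serial_time:.2f}s, parallel: {parallel_time:.2f}s")
```

In the serial version the waits add up, while in the parallel version the total is roughly the slowest single call. `asyncio.gather` returns results in the order the calls were passed in, so downstream code can treat both versions the same way.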

A routing view

A router classifies each step and sends it to the cheapest model that can handle it, escalating only when needed.
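A minimal sketch of such a router follows. Everything in it is an assumption made for illustration: the three model names, the keyword-based `classify` heuristic, and the `call_model` and `validate` callables that the caller supplies. A production router would more likely use a trained classifier or a cheap model call to judge difficulty.

```python
# (name, capability tier), ordered cheapest first. Names are placeholders.
MODELS = [("small", 1), ("medium", 2), ("large", 3)]


def classify(step):
    """Toy difficulty heuristic: returns the minimum tier a step needs."""
    text = step.lower()
    if any(word in text for word in ("prove", "plan", "debug")):
        return 3
    if len(text.split()) > 12:
        return 2
    return 1


def route(step, call_model, validate):
    """Try the cheapest adequate model first; escalate if its answer fails.

    call_model(name, step) -> answer and validate(answer) -> bool are
    supplied by the caller (hypothetical interfaces).
    """
    needed = classify(step)
    for name, tier in MODELS:
        if tier < needed:
            continue  # too weak for this step, skip without calling it
        answer = call_model(name, step)
        if validate(answer):
            return name, answer
    raise RuntimeError("no model produced a valid answer")


# Usage with a fake model that just echoes which model handled the step.
def fake_call(name, step):
    return f"{name} handled: {step}"


print(route("format this date", fake_call, lambda answer: True)[0])
```

Escalation is what keeps quality from dropping: a cheap model that fails validation costs one extra call, but the step still ends up with an answer from a model that can handle it.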

Balancing the trade

Aggressive cost cutting can hurt quality if you underpower a hard step or prune too early. Measure success rate alongside cost and latency, and only trim where success stays flat. Caching stable system prompts is often the cheapest win.
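One way to make "only trim where success stays flat" concrete is a simple acceptance check run over the same evaluation set before and after a change. The `EvalResult` fields, the 0.01 tolerance, and the numbers in the usage lines are illustrative assumptions, not measured results.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Aggregate metrics from one evaluation run (hypothetical shape)."""
    success_rate: float  # fraction of tasks completed correctly, 0.0 to 1.0
    cost_usd: float      # average cost per task
    latency_s: float     # average end-to-end latency per task


def keep_optimization(baseline, candidate, max_success_drop=0.01):
    """Accept a change only if success holds and cost or latency improves."""
    success_holds = candidate.success_rate >= baseline.success_rate - max_success_drop
    improves = (
        candidate.cost_usd < baseline.cost_usd
        or candidate.latency_s < baseline.latency_s
    )
    return success_holds and improves


baseline = EvalResult(success_rate=0.92, cost_usd=0.040, latency_s=12.0)
with_caching = EvalResult(success_rate=0.92, cost_usd=0.031, latency_s=10.5)
aggressive_pruning = EvalResult(success_rate=0.85, cost_usd=0.020, latency_s=6.0)

print(keep_optimization(baseline, with_caching))
print(keep_optimization(baseline, aggressive_pruning))
```

The tolerance is a judgment call: a small allowance absorbs evaluation noise, while a large one quietly trades quality for savings.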

Key idea

Cut agent cost and latency by routing to smaller models, caching stable context, parallelizing tools, and stopping early, while watching that success rate holds.

Check yourself

1. Which technique reuses stable context to avoid reprocessing tokens?

2. What must you watch when cutting agent cost?