The Latency Budget for Inference

Allocating a hard end-to-end deadline across the stages of a prediction request.

A deadline you must split

If a request must return within 100 milliseconds, model inference gets only a slice of that time. Network transit, feature lookup, and post-processing consume the rest.

Typical breakdown

  • Network and routing: fixed overhead you rarely control
  • Feature fetching: reads from a feature store or cache
  • Model inference: the forward pass itself
  • Post-processing: ranking, filtering, formatting
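
As a rough illustration of the breakdown above, the sketch below subtracts the non-model stages from a 100 ms deadline to see what is left for the forward pass. The per-stage allowances are made-up numbers chosen for the example, not measurements or recommendations.

```python
DEADLINE_MS = 100.0

# Hypothetical tail-latency allowances for the non-model stages (ms).
other_stage_budget_ms = {
    "network_and_routing": 20.0,
    "feature_fetching": 25.0,
    "post_processing": 10.0,
}

# Whatever the other stages do not consume is the ceiling for inference.
inference_budget_ms = DEADLINE_MS - sum(other_stage_budget_ms.values())

print(f"Model inference may use at most {inference_budget_ms:.0f} ms")
```

With these assumed allowances the model gets less than half of the deadline, which is why the forward pass is rarely the only stage worth measuring.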

Together, the stages must fit the deadline at the tail, not just on average. Optimize for p99, the latency that 99% of requests stay under, because the slowest requests are the ones users feel.
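
To see why the average can hide a missed deadline, here is a minimal sketch that compares the mean and the p99 of a set of end-to-end latencies. The sample values are invented for illustration, and `percentile` is a small nearest-rank helper written for this example rather than a library function.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

DEADLINE_MS = 100.0

# Hypothetical end-to-end latencies: 98 fast requests and 2 slow ones.
latencies_ms = [40.0] * 98 + [180.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean fits the deadline: {mean_ms <= DEADLINE_MS}")
print(f"p99 fits the deadline:  {p99_ms <= DEADLINE_MS}")
```

In this made-up sample the mean sits comfortably inside the deadline while the p99 does not, so a dashboard that tracks only the average would report a healthy service.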

Techniques to fit the budget

  • Cache features and frequent predictions
  • Quantize or distill the model to shrink the forward pass
  • Precompute embeddings offline so serving is a fast lookup
  • Parallelize independent stages
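
The last technique can be sketched with Python's `asyncio`: two lookups that do not depend on each other are issued concurrently, so the wall-clock cost is roughly the slower of the two rather than their sum. The function names, return values, and the `asyncio.sleep` delays are stand-ins invented for this example; a real service would call its own feature store and embedding index.

```python
import asyncio
import time

async def fetch_user_features(user_id):
    await asyncio.sleep(0.10)  # stand-in for a feature-store read
    return {"user_id": user_id, "age_bucket": 3}

async def fetch_item_embeddings(item_ids):
    await asyncio.sleep(0.10)  # stand-in for a precomputed-embedding lookup
    return {item_id: [0.0, 1.0] for item_id in item_ids}

async def gather_inputs(user_id, item_ids):
    # Neither call needs the other's result, so they can overlap.
    return await asyncio.gather(
        fetch_user_features(user_id),
        fetch_item_embeddings(item_ids),
    )

start = time.perf_counter()
user_features, item_embeddings = asyncio.run(gather_inputs(7, [1, 2]))
elapsed_s = time.perf_counter() - start

print(f"both lookups finished in about {elapsed_s * 1000:.0f} ms")
```

Run one after the other, the two simulated lookups would take about 200 ms; overlapped, they take about as long as one of them. This only helps when the stages are truly independent, as the bullet says.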

Key idea

Latency is a budget split across stages. Measure the tail, find the dominant stage, and optimize that one first.
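
A small sketch of "measure the tail, find the dominant stage": given per-stage latency samples, compare which stage is largest at the median with which is largest at p99. The samples are hypothetical, and `percentile` is a nearest-rank helper defined here for the example.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical per-stage latencies in milliseconds, 100 requests each.
stage_samples_ms = {
    "network_and_routing": [5.0] * 99 + [8.0],
    "feature_fetching": [10.0] * 95 + [60.0] * 5,  # occasional cache misses
    "model_inference": [20.0] * 99 + [25.0],
    "post_processing": [3.0] * 100,
}

def dominant_stage(q):
    # The stage with the highest q-th percentile latency.
    return max(stage_samples_ms, key=lambda s: percentile(stage_samples_ms[s], q))

dominant_at_p50 = dominant_stage(50)
dominant_at_p99 = dominant_stage(99)

print(f"dominant at p50: {dominant_at_p50}")
print(f"dominant at p99: {dominant_at_p99}")
```

In this invented data the model looks like the bottleneck at the median, but feature fetching dominates the tail, so shrinking the model first would leave the p99 almost unchanged.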

Check yourself

1. Why optimize for p99 latency rather than the average?

2. Which technique reduces the model forward pass time itself?