Machine Learning

The Latency Budget for Inference

Allocating a hard end-to-end deadline across the stages of a prediction.

A deadline you must split

If a request must return in 100 milliseconds, model inference gets only a slice of that time. Network, feature lookup, and post-processing consume the rest.

Typical breakdown

Network and routing: fixed overhead you rarely control
Feature fetching: reads from a feature store or cache
Model inference: the forward pass itself
Post-processing: ranking, filtering, formatting

Together, these stages must fit the deadline at the tail, not just on average. Optimize for p99, the latency that 99% of requests stay under, since the slowest requests define user pain.
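
To make the tail-versus-average point concrete, here is a minimal sketch. The per-request stage timings are invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank method; a real service would take these numbers from its own tracing data.

```python
import math

DEADLINE_MS = 100.0

# Hypothetical per-request timings in milliseconds (illustrative, not measured).
# Each tuple is (network, feature_fetch, inference, post_processing).
requests = [(10, 15, 30, 5)] * 98 + [(12, 80, 35, 6)] * 2


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# End-to-end latency of each request is the sum of its stages.
totals = [sum(stages) for stages in requests]
mean_ms = sum(totals) / len(totals)
p99_ms = percentile(totals, 99)

print(f"mean = {mean_ms:.1f} ms, fits deadline: {mean_ms <= DEADLINE_MS}")
print(f"p99  = {p99_ms:.1f} ms, fits deadline: {p99_ms <= DEADLINE_MS}")
```

With this made-up data the mean is about 61 ms and looks comfortable, while the p99 is 133 ms and misses the 100 ms deadline. Two slow feature fetches out of a hundred requests are enough to break the budget at the tail.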

Techniques to fit the budget

Cache features and frequent predictions
Quantize or distill the model to shrink the forward pass
Precompute embeddings offline so serving is a fast lookup
Parallelize independent stages
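
As one illustration of the last technique, the sketch below runs two independent lookups concurrently with `asyncio.gather`. The function names `fetch_user_features` and `fetch_item_embeddings` are hypothetical, and the `asyncio.sleep` calls only stand in for real I/O such as a feature-store read.

```python
import asyncio
import time


async def fetch_user_features(user_id):
    # Hypothetical stage: stands in for a feature-store read (about 50 ms).
    await asyncio.sleep(0.05)
    return {"user_id": user_id, "clicks_7d": 3}


async def fetch_item_embeddings(item_ids):
    # Hypothetical stage: stands in for a lookup of precomputed embeddings (about 50 ms).
    await asyncio.sleep(0.05)
    return {item_id: [0.0, 1.0] for item_id in item_ids}


async def gather_inputs(user_id, item_ids):
    # Neither lookup needs the other's result, so they can overlap in time.
    return await asyncio.gather(
        fetch_user_features(user_id),
        fetch_item_embeddings(item_ids),
    )


start = time.perf_counter()
features, embeddings = asyncio.run(gather_inputs(42, [1, 2]))
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"both lookups finished in about {elapsed_ms:.0f} ms")
```

Run one after the other, the two lookups would cost roughly the sum of their latencies; run concurrently, they cost roughly the slower of the two. This only helps when the stages are truly independent, so it does not apply to inference waiting on the features it needs.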

Key idea

Latency is a budget split across stages. Measure the tail, find the dominant stage, and optimize that one first.
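
The key idea can be sketched as a small per-stage analysis. The traces below are hypothetical and the stage names simply mirror the breakdown in this lesson; the point is to compare which stage looks largest on average with which stage dominates at p99.

```python
import math

# Hypothetical traces: one dict of stage timings (ms) per request.
traces = [
    {"network": 10, "feature_fetch": 15, "inference": 30, "post_processing": 5}
] * 98 + [
    {"network": 12, "feature_fetch": 80, "inference": 35, "post_processing": 6}
] * 2


def p99(values):
    """Nearest-rank 99th percentile of a list of samples."""
    ordered = sorted(values)
    return ordered[math.ceil(99 * len(ordered) / 100) - 1]


stages = list(traces[0])
stage_mean = {s: sum(t[s] for t in traces) / len(traces) for s in stages}
stage_p99 = {s: p99([t[s] for t in traces]) for s in stages}

# The stage with the largest value under each view.
dominant_on_average = max(stage_mean, key=stage_mean.get)
dominant_at_tail = max(stage_p99, key=stage_p99.get)

print(f"largest on average: {dominant_on_average}")
print(f"largest at p99:     {dominant_at_tail}")
```

In this invented data the average points at inference, but the tail points at feature fetching, so caching or speeding up the feature lookup would be the first thing to try.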

Check yourself

1. Why optimize for p99 latency rather than the average?

2. Which technique reduces the model forward pass time itself?