A deadline you must split
If a request must return in 100 milliseconds, model inference gets only a slice. Network, feature lookup, and post processing all consume the rest.
Typical breakdown
- Network and routing fixed overhead you rarely control
- Feature fetching reads from a feature store or cache
- Model inference the forward pass itself
- Post processing ranking, filtering, formatting
Sum these and they must fit the deadline at the tail, not the average. Optimize for p99, since the worst requests define user pain.
Techniques to fit the budget
- Cache features and frequent predictions
- Quantize or distill the model to shrink the forward pass
- Precompute embeddings offline so serving is a fast lookup
- Parallelize independent stages
Key idea
Latency is a budget split across stages. Measure the tail, find the dominant stage, and optimize that one first.