Throughput and latency tension
Scaling inference means more queries per second while holding the latency budget. The two goals pull against each other.
Horizontal scaling
- Replicate the model server and load balance across replicas
- Autoscale on queue depth or utilization
- Stateless servers make replication trivial
Squeezing each machine
- Batching group requests to use the accelerator efficiently
- Hardware GPUs and accelerators for heavy models
- Quantization smaller faster math per request
- Caching skip recompute for popular inputs
Dynamic batching raises throughput but adds queueing delay, so cap the batch window to protect the tail latency.
Know your bottleneck
Profile first. If you are compute bound, add hardware or batch; if memory bound, quantize or shard; if IO bound, cache and colocate features.
Key idea
Scale out with stateless replicas and scale up with batching and accelerators, but cap batch windows so throughput gains never wreck tail latency.