quiz vs the machine

Platinum1800

Machine Learning

Scaling Inference

Serving more predictions per second without blowing latency or budget.

6 min read · advanced · beat Platinum to climb

Throughput and latency tension

Scaling inference means more queries per second while holding the latency budget. The two goals pull against each other.

Horizontal scaling

Replicate the model server and load balance across replicas
Autoscale on queue depth or utilization
Stateless servers make replication trivial

Squeezing each machine

Batching group requests to use the accelerator efficiently
Hardware GPUs and accelerators for heavy models
Quantization smaller faster math per request
Caching skip recompute for popular inputs

Dynamic batching raises throughput but adds queueing delay, so cap the batch window to protect the tail latency.

Know your bottleneck

Profile first. If you are compute bound, add hardware or batch; if memory bound, quantize or shard; if IO bound, cache and colocate features.

Key idea

Scale out with stateless replicas and scale up with batching and accelerators, but cap batch windows so throughput gains never wreck tail latency.

Check yourself

Answer to earn rating on the learn ladder.

1. What is the tradeoff introduced by dynamic batching?

2. What should you do before choosing a scaling technique?