Why inference cost adds up
Training is a one-time burst, but inference runs for as long as the model is in service, at traffic scale. A small per-request cost multiplied by billions of calls can dominate the total. Monitoring cost keeps a useful model from becoming an unaffordable one.
What drives the bill
- Compute, the GPU or CPU time per prediction.
- Model size, larger models cost more to run and host.
- Traffic volume, average calls per second times how long the service stays up.
- Idle capacity, provisioned hardware sitting underused.
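As a rough sketch of how these drivers combine, the Python snippet below estimates a monthly bill when hardware is provisioned for peak traffic, and shows how much of that capacity is actually used. The function names, the per-replica throughput, the hourly rate, and the 730-hour month are all illustrative assumptions, not figures from any provider.

```python
import math


def replicas_for_peak(peak_rps: float, per_replica_rps: float) -> int:
    """Replicas needed so that peak traffic fits (assumed linear scaling)."""
    return math.ceil(peak_rps / per_replica_rps)


def monthly_serving_cost(peak_rps: float, per_replica_rps: float,
                         hourly_rate: float, hours: float = 730) -> float:
    """Cost of keeping peak capacity running for the whole month."""
    return replicas_for_peak(peak_rps, per_replica_rps) * hourly_rate * hours


def utilization(avg_rps: float, peak_rps: float, per_replica_rps: float) -> float:
    """Fraction of provisioned throughput used on average; the rest is idle."""
    capacity = replicas_for_peak(peak_rps, per_replica_rps) * per_replica_rps
    return avg_rps / capacity


# Hypothetical numbers: 1000 req/s peak, 250 req/s average,
# 100 req/s per replica, 2.0 currency units per replica-hour.
cost = monthly_serving_cost(1000, 100, 2.0)
used = utilization(250, 1000, 100)
print(f"monthly cost: {cost:.0f}, average utilization: {used:.0%}")
```

With these made-up inputs, three quarters of the provisioned capacity sits idle on average, which is the gap that autoscaling is meant to close.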
Levers to pull
- Quantization and distillation shrink a model, often with little quality loss.
- Batching raises throughput per unit of hardware.
- Autoscaling matches capacity to demand instead of paying for peak capacity around the clock.
- Caching reuses results for repeated identical inputs.
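The caching lever is the easiest to illustrate in a few lines. The sketch below wraps a stand-in model call with Python's functools.lru_cache so that repeated identical inputs are served from memory. Here run_model is a hypothetical placeholder for an expensive prediction, and the cache size of 1024 is an arbitrary assumption; a real service would more likely use a shared cache and would need inputs that are hashable and outputs that are deterministic.

```python
from functools import lru_cache

model_calls = {"count": 0}  # counts how often the expensive path runs


def run_model(text: str) -> int:
    """Hypothetical stand-in for an expensive model call."""
    model_calls["count"] += 1
    return len(text)  # placeholder "prediction"


@lru_cache(maxsize=1024)
def cached_predict(text: str) -> int:
    """Serve repeated identical inputs from the in-process cache."""
    return run_model(text)


requests = ["hello", "world", "hello", "hello", "world"]
results = [cached_predict(r) for r in requests]

info = cached_predict.cache_info()
print(f"requests: {len(requests)}, model calls: {model_calls['count']}, "
      f"cache hits: {info.hits}")
```

Only the first occurrence of each distinct input reaches the model, so the saving depends entirely on how often real traffic repeats itself.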
Tie cost to value
Track cost per prediction alongside the value each prediction creates. A model is worth running only when its benefit clears its serving cost.
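One minimal way to express this check in Python is shown below. The function names and the sample figures are assumptions for illustration; in practice the value per prediction is usually the harder number to estimate.

```python
def cost_per_prediction(total_cost: float, predictions: int) -> float:
    """Average serving cost of one prediction over a billing period."""
    if predictions <= 0:
        raise ValueError("predictions must be positive")
    return total_cost / predictions


def net_value_per_prediction(total_cost: float, predictions: int,
                             value_per_prediction: float) -> float:
    """Estimated value created per prediction minus its serving cost."""
    return value_per_prediction - cost_per_prediction(total_cost, predictions)


def worth_running(total_cost: float, predictions: int,
                  value_per_prediction: float) -> bool:
    """True when the estimated benefit clears the serving cost."""
    return net_value_per_prediction(
        total_cost, predictions, value_per_prediction) > 0


# Hypothetical month: 14600 spent on serving, 1,000,000 predictions,
# each judged to be worth 0.02 to the business.
unit_cost = cost_per_prediction(14600.0, 1_000_000)
print(f"cost per prediction: {unit_cost:.4f}")
print("worth running:", worth_running(14600.0, 1_000_000, 0.02))
```

Tracking both numbers over time shows whether a drop in traffic or a rise in hardware cost has pushed the model below break-even.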
Key idea
Inference cost scales with traffic, so monitor cost per prediction and tie it to value, using quantization, distillation, batching, autoscaling, and caching to keep models affordable.