Accuracy is not free
Every accuracy point costs compute, latency, and engineering time. The senior question is whether the gain pays for itself in business value.
Diminishing returns
Accuracy gains usually flatten while cost keeps climbing. A model twice as expensive may add only a fraction of a percent.
Levers to cut cost
- Distillation train a small model to mimic a large one
- Quantization use lower precision for faster cheaper inference
- Caching reuse predictions for repeated inputs
- Cascades run a cheap model first, escalate only hard cases to the big one
Frame it in money
Translate accuracy into business value and compare it to the serving cost. If a one percent lift earns less than the extra inference spend, the simpler model wins.
Key idea
Optimize value per dollar, not raw accuracy. Cascades, distillation, and quantization capture most of the quality for a fraction of the cost.