Reliability with a number
A service level objective is a target for a measurable reliability indicator over a window, such as ninety nine percent of requests under two hundred milliseconds per month. For ML services SLOs must cover not just availability but prediction quality.
Indicators worth targeting
- Availability, the fraction of successful responses.
- Latency, a percentile like p95 under a bound.
- Quality, accuracy or a proxy staying above a floor.
- Freshness, features and the model not older than a limit.
Error budgets
An SLO implies an error budget, the allowed shortfall. Spending it freely is fine until it runs out, at which point teams pause risky changes and invest in reliability.
ML specific care
- Choose percentiles, not averages, since tail latency hurts users.
- Tie quality SLOs to a metric with timely enough labels or a trusted proxy.
- Set targets from real user needs, not arbitrary nines.
Key idea
An ML SLO sets measurable targets over availability, latency, quality, and freshness, and its error budget governs how aggressively teams ship versus stabilize.