← Lessons

quiz vs the machine

Platinum1740

Machine Learning

The Slo For Ml Services

Setting measurable reliability targets that cover quality as well as uptime.

5 min read · advanced · beat Platinum to climb

Reliability with a number

A service level objective is a target for a measurable reliability indicator over a window, such as ninety nine percent of requests under two hundred milliseconds per month. For ML services SLOs must cover not just availability but prediction quality.

Indicators worth targeting

  • Availability, the fraction of successful responses.
  • Latency, a percentile like p95 under a bound.
  • Quality, accuracy or a proxy staying above a floor.
  • Freshness, features and the model not older than a limit.

Error budgets

An SLO implies an error budget, the allowed shortfall. Spending it freely is fine until it runs out, at which point teams pause risky changes and invest in reliability.

ML specific care

  • Choose percentiles, not averages, since tail latency hurts users.
  • Tie quality SLOs to a metric with timely enough labels or a trusted proxy.
  • Set targets from real user needs, not arbitrary nines.

Key idea

An ML SLO sets measurable targets over availability, latency, quality, and freshness, and its error budget governs how aggressively teams ship versus stabilize.

Check yourself

Answer to earn rating on the learn ladder.

1. Why must ML SLOs go beyond availability and latency?

2. What is an error budget?

3. Why prefer percentiles over averages for latency SLOs?