System Design

Service Level Indicators

Choosing the few measurements that actually reflect what users experience.

Measuring what users feel

A service level indicator, or SLI, is a carefully chosen metric that reflects the user-facing health of a service. The goal is to measure what a user would notice, not internal trivia.

Good SLIs

Most useful SLIs are expressed as a ratio of good events to valid events, which naturally lands between zero and one hundred percent.

  • Availability is the fraction of requests that succeed.
  • Latency is the fraction of requests served faster than a threshold.
  • Quality is the fraction of responses that are correct or complete.
  • Freshness is the fraction of data younger than a limit.
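
As a rough illustration of the good-over-valid ratio, the sketch below computes an availability SLI and a latency SLI from a handful of request records. The `Request` type, the sample data, the choice to count every request as valid, and the rule that any status below 500 counts as a success are all assumptions made for this example, not definitions from the lesson.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request


def sli_ratio(good: int, valid: int) -> float:
    """Return good / valid as a percentage (100.0 when nothing is valid)."""
    return 100.0 if valid == 0 else 100.0 * good / valid


def availability(requests: list[Request]) -> float:
    # Assumption: every request is a valid event, and a status below 500
    # counts as a success.
    good = sum(1 for r in requests if r.status < 500)
    return sli_ratio(good, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    # Good events are requests served faster than the threshold.
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return sli_ratio(good, len(requests))


# Hypothetical sample data, invented for illustration.
SAMPLE = [
    Request(200, 120.0),
    Request(200, 250.0),
    Request(500, 90.0),
    Request(200, 900.0),
    Request(200, 310.0),
]

print(availability(SAMPLE))  # 80.0: four of five requests succeeded
print(latency_sli(SAMPLE))   # 60.0: three of five finished under 300 ms
```

Quality and freshness fit the same `sli_ratio` helper; only the definition of a good event changes.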

How to choose

  • Measure as close to the user as possible, ideally at the client or load balancer, because that vantage point captures what users actually experience.
  • Prefer a percentile or threshold over an average, because averages hide painful tails.
  • Keep the set small, just a few SLIs per service, so they stay meaningful.
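
To see why an average can hide a painful tail, consider a small sketch with made-up latencies: 95 fast requests and 5 very slow ones. The data and the nearest-rank `percentile` helper are assumptions for illustration; real monitoring systems often estimate percentiles from histograms instead.

```python
from statistics import fmean


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, kept to at least rank 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 95 fast requests, 5 very slow ones.
LATENCIES = [100.0] * 95 + [4000.0] * 5

mean_ms = fmean(LATENCIES)
p50_ms = percentile(LATENCIES, 50)
p99_ms = percentile(LATENCIES, 99)
under_300 = 100.0 * sum(1 for v in LATENCIES if v < 300.0) / len(LATENCIES)

print(mean_ms)    # 295.0: the average looks acceptable
print(p50_ms)     # 100.0: the typical request is fast
print(p99_ms)     # 4000.0: the tail is four seconds
print(under_300)  # 95.0: one request in twenty misses the threshold
```

The mean of 295 ms sits just under a three hundred millisecond target, yet five percent of requests take four seconds. The percentile and the threshold ratio both expose that tail.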

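A weak SLI such as average CPU utilization does not tell you whether users are happy, because a service can run hot while serving everyone quickly or sit idle while failing requests. A strong SLI such as the fraction of requests served in under three hundred milliseconds does.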

Key idea

An SLI is a small, user-centric ratio of good to valid events, measured close to the user, that reflects experienced quality rather than internal noise.

Check yourself

1. What does a good SLI measure?

2. Why prefer a percentile over an average for latency SLIs?