Measuring what users feel
A service level indicator, or SLI, is a carefully chosen metric that reflects the user-facing health of a service. The goal is to measure what a user would notice, not internal trivia.
Good SLIs
Most useful SLIs are expressed as a ratio of good events to valid events, which as a percentage always falls between zero and one hundred.
- Availability is the fraction of requests that succeed.
- Latency is the fraction of requests served faster than a threshold.
- Quality is the fraction of responses that are correct or complete.
- Freshness is the fraction of data younger than a limit.
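As a rough illustration of the good-to-valid ratio, here is a minimal Python sketch that computes an availability SLI and a latency SLI from a list of request records. The Request type, the function names, and the sample data are hypothetical, invented for this example. It assumes every request counts as a valid event and that any status below 500 counts as a success; a real service may define both differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical request record, invented for this sketch."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def sli_ratio(good: int, valid: int) -> float:
    """Return good / valid as a percentage.

    Assumption: with no valid events there is nothing to fail,
    so the SLI is reported as 100.0.
    """
    if valid == 0:
        return 100.0
    return 100.0 * good / valid


def availability_sli(requests: List[Request]) -> float:
    # Assumption: every request is a valid event and any
    # status below 500 counts as a success.
    good = sum(1 for r in requests if r.status < 500)
    return sli_ratio(good, len(requests))


def latency_sli(requests: List[Request], threshold_ms: float = 300.0) -> float:
    # Good events are requests served faster than the threshold.
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return sli_ratio(good, len(requests))


if __name__ == "__main__":
    sample = [
        Request(200, 120.0),
        Request(200, 280.0),
        Request(500, 90.0),
        Request(200, 450.0),
        Request(200, 310.0),
    ]
    print("availability:", availability_sli(sample))  # 4 of 5 succeed
    print("latency:", latency_sli(sample))            # 3 of 5 under 300 ms
```

Quality and freshness follow the same pattern: only the rule for what counts as a good event changes.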
How to choose
- Measure as close to the user as possible, ideally at the load balancer or client, since metrics taken deep inside the backend can miss failures and delays that users still feel.
- Prefer a percentile or threshold over an average, because averages hide painful tails.
- Keep the set small, just a few SLIs per service, so they stay meaningful.
A weak SLI such as average CPU utilization does not tell you whether users are happy. A strong SLI such as the fraction of requests served in under three hundred milliseconds does.
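To see why an average can hide a painful tail, consider the following sketch. The latency values are made up for illustration, and the nearest-rank percentile helper is one simple definition among several in common use, not a standard library function.

```python
import math
import statistics
from typing import List


def nearest_rank_percentile(values: List[float], p: int) -> float:
    """Return the p-th percentile using the nearest-rank method.

    Hypothetical helper for this sketch; other percentile
    definitions interpolate and can give slightly different values.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_under(values: List[float], threshold_ms: float) -> float:
    """Return the percentage of values strictly below the threshold."""
    good = sum(1 for v in values if v < threshold_ms)
    return 100.0 * good / len(values)


# Made-up data: 95 fast requests and 5 very slow ones.
latencies_ms = [100.0] * 95 + [4000.0] * 5

if __name__ == "__main__":
    # The mean sits under 300 ms, so it looks healthy on its own.
    print("mean:", statistics.mean(latencies_ms))
    # The 99th percentile exposes the slow tail.
    print("p99:", nearest_rank_percentile(latencies_ms, 99))
    # The threshold SLI shows how many requests were actually fast.
    print("under 300 ms:", fraction_under(latencies_ms, 300.0))
```

With this data the mean stays just below three hundred milliseconds even though one request in twenty takes four seconds, which is exactly the kind of pain a threshold or percentile SLI is meant to surface.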
Key idea
An SLI is a small, user-centric ratio of good to valid events, measured close to the user, that reflects experienced quality rather than internal noise.