Defining reliability
Saying a system should be reliable is vague. These three terms make it precise.
- An SLI, service level indicator, is a measured number like the fraction of successful requests.
- An SLO, service level objective, is the target for that indicator, such as ninety nine point nine percent success over a month.
- An error budget is the allowed failure, the gap between one hundred percent and the SLO.
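As a minimal sketch of how the three terms relate, the Python below computes each one from request counts. The function names and sample counts are hypothetical, chosen only for illustration, and treating a window with no traffic as fully successful is an assumption.

```python
def sli(successful: int, total: int) -> float:
    """Measured indicator: the fraction of requests that succeeded."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully successful
    return successful / total


def error_budget(slo: float) -> float:
    """Allowed failure: the gap between one hundred percent and the SLO."""
    return 1.0 - slo


SLO = 0.999  # the target: ninety nine point nine percent success

# Hypothetical counts for one measurement window.
measured = sli(successful=999_500, total=1_000_000)

print(f"SLI: {measured:.4%}")
print(f"SLO met: {measured >= SLO}")
print(f"Error budget: {error_budget(SLO):.3%}")
```

The SLI is the only value that comes from measurement. The SLO is a choice, and the error budget follows from that choice.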
The power of the error budget
If your SLO is ninety nine point nine percent, up to one tenth of a percent of requests may fail within the measurement window. That allowance is a budget you can spend. It reframes reliability from a vague goal into an accountable number.
- Budget remaining means you can ship risky changes faster.
- Budget exhausted means you freeze features and focus on stability.
This aligns product and reliability teams around one shared number. Chasing one hundred percent is wasteful, so the budget defines exactly how much risk is acceptable.
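The ship-or-freeze rule can be sketched as a small check. This is an illustration under stated assumptions, not a standard policy: the names budget_remaining and release_policy are hypothetical, the SLO is passed as a decimal string so the arithmetic stays exact, and "exhausted" is taken to mean that no budget is left.

```python
from fractions import Fraction


def budget_remaining(slo: str, total: int, failed: int) -> Fraction:
    """Share of the error budget still unspent (negative if overspent)."""
    allowed = (1 - Fraction(slo)) * total  # failed requests the SLO permits
    if allowed == 0:
        # Assumption: no allowance (a 100% SLO or no traffic) means no budget.
        return Fraction(0)
    return 1 - Fraction(failed) / allowed


def release_policy(remaining: Fraction) -> str:
    """Hypothetical rule: ship while budget remains, otherwise freeze."""
    return "ship" if remaining > 0 else "freeze"


# Hypothetical window: one million requests against a 0.999 SLO.
healthy = budget_remaining("0.999", total=1_000_000, failed=250)
spent = budget_remaining("0.999", total=1_000_000, failed=1_000)

print(healthy, release_policy(healthy))
print(spent, release_policy(spent))
```

With an SLO of 0.999 and one million requests, the budget is 1,000 failed requests, so 250 failures leave three quarters of it unspent. Once failures reach 1,000, nothing is left and the rule switches to a freeze.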
Key idea
An SLI is measured, an SLO is the target, and the error budget is the failure you are allowed to spend on risk.