The alerting balance
An alert threshold decides when a metric breach pages a human. Set it too tight and the team drowns in false alarms until they ignore them. Set it too loose and real failures slip through. Good thresholds sit between noise and blindness.
Ways to set a threshold
- Static: a fixed line, such as accuracy below 90 percent.
- Relative: a drop of more than a set percentage from baseline.
- Statistical: alert when a metric leaves a band of several standard deviations around its mean.
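As a rough illustration, the three approaches can be sketched as small Python checks. The function names, the 10 percent drop, and the three-sigma band are assumptions chosen for the example, not recommended values; the 90 percent floor echoes the accuracy example in the list.

```python
from statistics import mean, stdev

# All names and default values here are hypothetical, for illustration only.

def static_breach(value, floor=0.90):
    """Static: breach when the metric falls below a fixed line."""
    return value < floor

def relative_breach(value, baseline, max_drop=0.10):
    """Relative: breach when the metric drops more than max_drop
    (a fraction, so 0.10 means 10 percent) below the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative check")
    return (baseline - value) / baseline > max_drop

def statistical_breach(value, history, k=3.0):
    """Statistical: breach when the metric leaves a band of k sample
    standard deviations around the mean of past values."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two past values
    return abs(value - mu) > k * sigma

# Made-up accuracy readings, used only to exercise the checks.
history = [0.95, 0.94, 0.96, 0.95, 0.93, 0.95, 0.96, 0.94]
print(static_breach(0.88))
print(relative_breach(0.80, baseline=0.95))
print(statistical_breach(0.90, history))
```

The static check needs no history, the relative check needs one baseline value, and the statistical check needs enough past samples for the standard deviation to be meaningful.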
Reducing false pages
- Require a breach to persist over a window before alerting.
- Add severity tiers: a warning goes to a dashboard, while a page goes to the on-call engineer.
- Account for seasonality so normal weekend dips do not fire alarms.
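The three techniques above might look like the following minimal sketch. The class and function names, the window of three samples, the tier cutoffs, and the per-weekday baseline are all assumptions made for the example; a real system would tune them to its own metric.

```python
from collections import deque

# All names, windows, and cutoffs below are hypothetical.

class PersistentAlert:
    """Fire only once the last `window` samples have all breached."""

    def __init__(self, window=3):
        self.recent = deque(maxlen=window)

    def observe(self, breached):
        self.recent.append(bool(breached))
        # True only when the window is full and every sample breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)

def severity(value, warn_below=0.92, page_below=0.88):
    """Map a metric value to a tier: 'ok', 'warn' (dashboard), or 'page' (on call)."""
    if value < page_below:
        return "page"
    if value < warn_below:
        return "warn"
    return "ok"

def seasonal_baseline(history_by_weekday, weekday):
    """Average past values for the same weekday, so a weekend reading
    is compared with weekend history rather than weekday history."""
    values = history_by_weekday[weekday]
    return sum(values) / len(values)

# A single bad sample does not fire; three in a row do.
alert = PersistentAlert(window=3)
fired = [alert.observe(b) for b in [True, True, False, True, True, True]]
print(fired)
```

Here a breach interrupted by one healthy sample resets the count, which is one simple way to require persistence; a time-based window or a "k of the last n samples" rule would be equally reasonable choices.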
Alert fatigue
The biggest risk is fatigue: too many low-value alerts train people to dismiss them, so a real one gets ignored. Every alert should be actionable.
Key idea
Alerting thresholds must balance false alarms against missed failures. They can be static, relative, or statistical, and persistence windows and severity tiers help fight alert fatigue.