When alerts stop working
Alert fatigue happens when responders receive so many alerts, especially false or low-value ones, that they stop trusting and acting on them. A noisy pager is worse than a quiet one, because the real incident hides in the noise.
What drives it
- Non-actionable alerts that no human can do anything about.
- Flapping, where a metric crosses and recrosses a threshold rapidly.
- Duplicate pages for the same root cause across many components.
- Thresholds set too tight, firing on normal variation.
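
To make the cost of flapping and tight thresholds concrete, here is a minimal sketch that counts how many pages a naive rule would send if it paged on every upward threshold crossing. The function name count_naive_pages, the sample readings, and the threshold of 80 are all hypothetical, chosen only for illustration.

```python
def count_naive_pages(samples, threshold):
    """Count pages from a naive rule that pages on every upward crossing."""
    pages = 0
    above = False
    for value in samples:
        now_above = value > threshold
        if now_above and not above:
            pages += 1  # each fresh crossing sends a new page
        above = now_above
    return pages


# Hypothetical CPU readings hovering around an 80% threshold.
flapping = [79, 81, 79, 82, 78, 81, 79, 83]
print(count_naive_pages(flapping, threshold=80))  # prints 4
```

Eight readings that never stray far from the threshold produce four separate pages, even though nothing meaningful changed between them.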
How to fix it
- Make every alert actionable, with a clear playbook step. If there is no action, it should not page.
- Tie alerts to symptoms and SLOs so they fire on real user impact.
- Add hysteresis and "for" durations (the condition must hold for a set time before the alert fires) so a brief blip does not page.
- Group and deduplicate related alerts into one notification.
- Route by severity, sending low-priority issues to a queue and paging only for urgent ones.
- Review regularly and delete alerts that never lead to action.
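
Several of the fixes above can be sketched in a few lines of code. The following is an illustrative sketch, not a real alerting system: evaluate_with_hysteresis combines a "for" duration (counted in consecutive samples) with separate fire and clear thresholds, and route_alerts groups related alerts and picks one destination per group. Every name, threshold, and field ("group", "severity", "page", "queue") is an assumption made for this example.

```python
from collections import defaultdict


def evaluate_with_hysteresis(samples, fire_above, clear_below, for_samples):
    """Return the sample indexes at which the alert starts firing.

    The alert fires only after `for_samples` consecutive readings above
    `fire_above`, and clears only once a reading drops below `clear_below`.
    """
    firing = False
    streak = 0
    fired_at = []
    for i, value in enumerate(samples):
        if firing:
            if value < clear_below:  # hysteresis: clear well below the fire level
                firing = False
                streak = 0
        else:
            streak = streak + 1 if value > fire_above else 0
            if streak >= for_samples:  # "for" duration satisfied
                firing = True
                fired_at.append(i)
    return fired_at


def route_alerts(alerts):
    """Collapse alerts sharing a group key into one notification each."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["group"]].append(alert)
    notifications = []
    for key, members in groups.items():
        urgent = any(a["severity"] == "critical" for a in members)
        notifications.append({
            "group": key,
            "count": len(members),
            # Page only when something in the group is urgent.
            "destination": "page" if urgent else "queue",
        })
    return notifications


# Hypothetical data: a flapping series, a sustained breach, and three alerts.
flapping = [79, 81, 79, 82, 78, 81, 79, 83]
sustained = [79, 85, 88, 90, 91, 75, 86, 87, 60, 85, 86, 87]
alerts = [
    {"group": "db-primary-down", "severity": "critical"},
    {"group": "db-primary-down", "severity": "warning"},
    {"group": "disk-usage", "severity": "warning"},
]

print(evaluate_with_hysteresis(flapping, 80, 70, 3))   # prints []
print(evaluate_with_hysteresis(sustained, 80, 70, 3))  # prints [3, 11]
print(route_alerts(alerts))
```

With these assumed settings the flapping series never fires, while the sustained breach fires once, stays firing through the dip to 75, and fires again only after it has fully cleared. The three raw alerts collapse into two notifications, and only the group containing a critical alert pages.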
Key idea
Alert fatigue erodes trust when alerts are noisy or non-actionable, so page only on actionable, symptom-based, deduplicated signals.