Alerting on symptoms, not causes
Good alerting wakes a human only when something actually matters. The classic guidance is to alert on symptoms (the user-visible pain) rather than on causes (the many internal conditions that may or may not lead to that pain).
The difference
A symptom is something users experience directly, such as a high error rate on the checkout endpoint or latency exceeding the promised target. A cause is an internal condition, such as a server at high CPU, a full disk, or one slow database replica. Causes are useful for diagnosis but make poor alerts.
Why cause alerts hurt
- They produce noise, since many causes self-heal or never affect users
- They cause alert fatigue, so responders start ignoring pages
- They miss novel failures whose cause you never thought to alert on
When you alert on the symptom, one rule covers many possible causes. If users can check out fine, a busy CPU is not an emergency.
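As a rough illustration, a symptom rule can be sketched as one check over recent requests. The names (Request, should_page) and both thresholds are hypothetical, chosen only for this example, and the percentile uses a simple nearest-rank method rather than whatever a real monitoring system would compute.

```python
from dataclasses import dataclass

# Hypothetical thresholds, assumed for illustration only.
MAX_ERROR_RATE = 0.02       # page if more than 2% of requests fail
MAX_P95_LATENCY_MS = 500.0  # page if 95th-percentile latency exceeds the target


@dataclass
class Request:
    ok: bool
    latency_ms: float


def error_rate(requests):
    """Fraction of failed requests in the window (0.0 for an empty window)."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def p95_latency_ms(requests):
    """Nearest-rank 95th-percentile latency (0.0 for an empty window)."""
    if not requests:
        return 0.0
    latencies = sorted(r.latency_ms for r in requests)
    rank = (95 * len(latencies) + 99) // 100  # integer ceil(0.95 * n)
    return latencies[rank - 1]


def should_page(requests):
    """One symptom rule: page only when users are actually affected."""
    return (
        error_rate(requests) > MAX_ERROR_RATE
        or p95_latency_ms(requests) > MAX_P95_LATENCY_MS
    )
```

The rule never asks why requests fail or slow down. A full disk, a slow replica, or a failure nobody anticipated all trip the same check, but only once users feel it.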
Pair with diagnosis
Symptom alerts tell you that something is wrong. Dashboards, traces, and logs then tell you why. Keep cause signals as context for the on-call engineer, not as pages.
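One way to picture this split is a small routing function: only a firing symptom creates a page, and any firing cause signals ride along as context. Signal, Page, and build_page are hypothetical names for this sketch, not part of any real alerting tool.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    name: str
    kind: str     # "symptom" or "cause" (labels assumed for this sketch)
    firing: bool


@dataclass
class Page:
    symptoms: list
    context: list = field(default_factory=list)


def build_page(signals):
    """Return a Page if any symptom is firing, otherwise None."""
    symptoms = [s.name for s in signals if s.kind == "symptom" and s.firing]
    if not symptoms:
        return None  # firing causes alone never wake anyone
    # Firing causes are attached as diagnostic context, not paged separately.
    context = [s.name for s in signals if s.kind == "cause" and s.firing]
    return Page(symptoms=symptoms, context=context)
```

With this shape, a busy CPU on its own produces no page, while a checkout error spike arrives with the busy CPU already listed as a lead for the on-call engineer.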
Key idea
Page on user-visible symptoms so one alert covers many causes, and keep cause signals as diagnostic context to avoid fatigue.