Alerting On Symptoms Not Causes

Good alerting wakes a human only when something actually matters. The classic guidance is to alert on symptoms (the user-visible pain) rather than on causes (the many internal conditions that may or may not lead to that pain).

The difference

A symptom is something users experience directly, such as a high error rate on the checkout endpoint or latency exceeding the promised target. A cause is an internal condition, such as a server at high CPU, a full disk, or one slow database replica. Causes are useful for diagnosis but make poor alerts.
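As a rough illustration, a symptom check can be written purely in terms of what users experience. The sketch below is a minimal example, not a reference implementation: the `WindowStats` type, the `symptom_breaches` function, and both threshold values are hypothetical, and real targets would come from the service's own promises.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Aggregated request statistics for one endpoint over one evaluation window."""
    total_requests: int
    failed_requests: int
    p99_latency_ms: float


# Hypothetical targets, chosen only for illustration.
MAX_ERROR_RATE = 0.01          # at most 1% of requests may fail
MAX_P99_LATENCY_MS = 500.0     # promised 99th-percentile latency


def symptom_breaches(stats: WindowStats) -> List[str]:
    """Return a description of each user-visible symptom that breaches its target."""
    breaches = []
    if stats.total_requests > 0:
        error_rate = stats.failed_requests / stats.total_requests
        if error_rate > MAX_ERROR_RATE:
            breaches.append(
                f"error rate {error_rate:.1%} exceeds target {MAX_ERROR_RATE:.1%}"
            )
    if stats.p99_latency_ms > MAX_P99_LATENCY_MS:
        breaches.append(
            f"p99 latency {stats.p99_latency_ms:.0f} ms exceeds target "
            f"{MAX_P99_LATENCY_MS:.0f} ms"
        )
    return breaches


if __name__ == "__main__":
    checkout = WindowStats(total_requests=1000, failed_requests=50, p99_latency_ms=900.0)
    for breach in symptom_breaches(checkout):
        print("checkout:", breach)
```

Note that nothing in the check mentions CPU, disk, or replicas. It only asks whether users are seeing errors or slowness.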

Why cause alerts hurt

They produce noise, since many causes self-heal or never affect users.
They cause alert fatigue, so responders start ignoring pages.
They miss novel failures whose cause you never thought to alarm on.

Alerting on the symptom lets one rule cover many possible causes, including ones nobody anticipated. If users can check out fine, a busy CPU is not an emergency.
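The "one rule, many causes" idea can be sketched as a paging decision that reads only the symptom. Everything here is an assumption made for illustration: the `Snapshot` type, the `should_page` function, the signal names, and the threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Snapshot:
    """One observation of the system: a symptom plus whatever cause signals exist."""
    checkout_error_rate: float                                      # symptom
    cause_signals: Dict[str, float] = field(default_factory=dict)   # diagnostic only


# Hypothetical threshold for the checkout error rate.
ERROR_RATE_THRESHOLD = 0.02


def should_page(snapshot: Snapshot) -> bool:
    """Decide whether to page by looking at the symptom alone."""
    return snapshot.checkout_error_rate > ERROR_RATE_THRESHOLD


if __name__ == "__main__":
    scenarios = {
        "busy CPU, users unaffected": Snapshot(0.001, {"cpu_utilization": 0.97}),
        "full disk breaking writes": Snapshot(0.20, {"disk_used_fraction": 1.0}),
        "unforeseen failure, no cause signal": Snapshot(0.15),
    }
    for name, snapshot in scenarios.items():
        print(f"{name}: page={should_page(snapshot)}")
```

The third scenario is the interesting one. No cause signal exists for it, yet the same rule still fires because users are affected.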

Pair with diagnosis

Symptom alerts say that something is wrong. Dashboards, traces, and logs then tell you why. Keep cause signals as context for the on-call engineer, not as pages.
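One way to keep cause signals as context rather than pages is to attach them to the symptom alert when it fires. The following is a minimal sketch under assumed names: `Page`, `build_page`, and the signal keys are hypothetical and do not correspond to any particular alerting tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Page:
    """A page sent to the on-call engineer."""
    summary: str                                              # the symptom that fired
    context: Dict[str, float] = field(default_factory=dict)   # cause signals, for diagnosis


def build_page(symptom: Optional[str], cause_signals: Dict[str, float]) -> Optional[Page]:
    """Create a page only when a symptom fired; cause signals never page on their own."""
    if symptom is None:
        return None
    # Copy the signals so later updates to the caller's dict do not alter the page.
    return Page(summary=symptom, context=dict(cause_signals))


if __name__ == "__main__":
    signals = {"cpu_utilization": 0.97, "replica_lag_seconds": 42.0}

    # High CPU with no symptom: nobody is paged.
    print(build_page(None, signals))

    # Symptom fired: the page carries the cause signals as a starting point.
    print(build_page("checkout error rate above target", signals))
```

With this shape, a cause signal can only ever reach a human as supporting detail on a page that a symptom already justified.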

Key idea

Page on user-visible symptoms so that one alert covers many causes, and keep cause signals as diagnostic context to avoid alert fatigue.
