System Design

Alerting On Symptoms Not Causes

Page humans for user-visible pain, not for every internal blip.

Good alerting wakes a human only when something actually matters. The classic guidance is to alert on symptoms (the pain users actually see) rather than on causes (the many internal conditions that may or may not lead to that pain).

The difference

A symptom is something users experience directly, such as a high error rate on the checkout endpoint or latency exceeding the promised target. A cause is an internal condition, such as a server at high CPU, a full disk, or one slow database replica. Causes are useful for diagnosis but make poor alerts.
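To make the latency symptom concrete, here is a minimal sketch of a check against a promised target. The 500 ms target, the choice of the 99th percentile, and the names `percentile` and `latency_symptom` are all assumptions made for illustration, not values from this lesson.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_symptom(samples_ms: Sequence[float],
                    target_ms: float = 500.0,  # hypothetical promised target
                    pct: float = 99.0) -> bool:
    """True when the chosen percentile of request latency exceeds the target."""
    return percentile(samples_ms, pct) > target_ms
```

Note that the check looks only at what users experienced. It does not ask whether a CPU, disk, or replica was the reason.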

Why cause alerts hurt

  • They produce noise, since many causes self-heal or never affect users
  • They cause alert fatigue, so responders start ignoring pages
  • They miss novel failures whose cause you never thought to alarm on

When you alert on the symptom, one rule covers many possible causes. If users can check out fine, a busy CPU is not an emergency.
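A small sketch of that idea: a single rule on the checkout error rate, which fires whatever the underlying cause turns out to be. The 2% threshold and the names `WindowStats` and `should_page` are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass

# Hypothetical threshold: page when more than 2% of checkout requests fail.
ERROR_RATE_THRESHOLD = 0.02


@dataclass
class WindowStats:
    """Checkout request counts for one evaluation window."""
    total_requests: int
    failed_requests: int


def error_rate(stats: WindowStats) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def should_page(stats: WindowStats,
                threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Symptom rule: page only when users are actually seeing errors."""
    return error_rate(stats) > threshold


# A busy CPU with healthy checkouts does not page.
busy_cpu_but_fine = WindowStats(total_requests=10_000, failed_requests=12)
# A full disk, a slow replica, or a bad deploy all show up the same way.
users_failing = WindowStats(total_requests=10_000, failed_requests=900)
```

Here `should_page(busy_cpu_but_fine)` is false and `should_page(users_failing)` is true. The rule never inspects CPU or disk, so it also catches failures nobody thought to alarm on.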

Pair with diagnosis

Symptom alerts tell you that something is wrong. Dashboards, traces, and logs then tell you why. Keep cause signals as context for the on-call engineer, not as pages.
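One way to picture this pairing, as a sketch only: cause signals never trigger a page by themselves, but they ride along on the page once a symptom fires. The `Page` type, the `build_page` function, and the signal names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Page:
    """What the on-call engineer receives."""
    summary: str
    # Cause signals attached as diagnostic context, not as separate pages.
    context: Dict[str, float] = field(default_factory=dict)


def build_page(symptom_firing: bool,
               summary: str,
               cause_signals: Dict[str, float]) -> Optional[Page]:
    """Return a page only when the symptom fires; causes alone never page."""
    if not symptom_firing:
        return None
    return Page(summary=summary, context=dict(cause_signals))


# Hypothetical cause signals collected alongside the symptom check.
signals = {"cpu_utilization": 0.97, "disk_used_fraction": 0.71}
quiet = build_page(False, "checkout error rate high", signals)
paged = build_page(True, "checkout error rate high", signals)
```

With the symptom quiet, `quiet` is `None` even though CPU is at 97%. With the symptom firing, `paged` carries the same signals as context for diagnosis.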

Key idea

Page on user-visible symptoms so one alert covers many causes, and keep cause signals as diagnostic context to avoid fatigue.

Check yourself

1. Why prefer symptom-based alerts over cause-based ones?

2. What distinguishes a symptom from a cause?