System Design

On Call and Incident Response

The human process that turns an alert into a coordinated, learning-driven recovery.

6 min read · advanced

People close the loop

Observability surfaces problems, but humans resolve them. On call is the rotation of engineers who respond when something breaks, and incident response is the structured process they follow once an alert fires.

The lifecycle

  • Detect when an alert or report signals a problem.
  • Triage to assess severity and user impact, declaring an incident if it is significant.
  • Coordinate by assigning an incident commander who runs the response, plus communication and operations roles, so responders do not duplicate or undo each other's work.
  • Mitigate first to stop user harm, for example by rolling back or shedding load, before chasing the underlying root cause.
  • Resolve once the symptom is gone and the system is stable.
  • Review with a blameless postmortem that captures the timeline, contributing factors, and concrete action items.
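
The lifecycle above can be read as a small state machine. The following is a minimal sketch, not a reference implementation: the names `Stage`, `Incident`, and `advance` are hypothetical, and the rule that a commander must be assigned before coordination starts is one assumed way to encode the "Coordinate" step.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Stage(Enum):
    DETECTED = "detected"
    TRIAGED = "triaged"
    COORDINATING = "coordinating"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"
    REVIEWED = "reviewed"


# Each stage may only advance to the next one in the lifecycle.
NEXT_STAGE = {
    Stage.DETECTED: Stage.TRIAGED,
    Stage.TRIAGED: Stage.COORDINATING,
    Stage.COORDINATING: Stage.MITIGATED,
    Stage.MITIGATED: Stage.RESOLVED,
    Stage.RESOLVED: Stage.REVIEWED,
}


@dataclass
class Incident:
    summary: str
    stage: Stage = Stage.DETECTED
    commander: Optional[str] = None  # hypothetical field: who runs the response
    timeline: List[Tuple[str, str]] = field(default_factory=list)

    def advance(self, note: str) -> Stage:
        """Move to the next lifecycle stage and record it on the timeline."""
        nxt = NEXT_STAGE.get(self.stage)
        if nxt is None:
            raise ValueError("incident has already been reviewed")
        # Assumed rule: coordination cannot start without an incident commander.
        if nxt is Stage.COORDINATING and self.commander is None:
            raise ValueError("assign an incident commander before coordinating")
        self.stage = nxt
        self.timeline.append((nxt.value, note))
        return nxt
```

Because every transition appends to `timeline`, the record needed for the postmortem accumulates as a side effect of running the response rather than being reconstructed afterwards.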

What makes it healthy

  • Mitigate before diagnosing, because stopping the bleeding beats understanding it fully.
  • Clear roles prevent confusion during the chaos.
  • Runbooks give first responders known steps for common failures.
  • Blameless culture focuses on systems and process, not individuals, so people share what really happened.
  • Sustainable rotations with reasonable load and follow-the-sun coverage, where teams in different time zones hand off so nobody is routinely paged overnight, prevent burnout.
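
As a rough illustration of follow-the-sun coverage, the sketch below routes a page to whichever region is in its working window. The three regions, their 8-hour UTC windows, and the function name `on_call_region` are all assumptions made for the example; a real rotation would come from a scheduling tool.

```python
# Hypothetical shift table: (start hour, end hour, region), in UTC,
# with the end hour exclusive. Together the windows cover all 24 hours.
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def on_call_region(utc_hour: int) -> str:
    """Return the region that should receive a page at the given UTC hour."""
    if not 0 <= utc_hour < 24:
        raise ValueError("utc_hour must be in the range 0..23")
    for start, end, region in SHIFTS:
        if start <= utc_hour < end:
            return region
    # Only reachable if the shift table is edited to leave a gap.
    raise RuntimeError("shift table does not cover this hour")
```

For example, `on_call_region(9)` returns `"EMEA"` under this assumed table, so an alert at 09:00 UTC pages an engineer during their working day rather than waking someone elsewhere.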

Key idea

Incident response moves from detection through triage, coordination, mitigation, and resolution to a blameless review, putting the end of user harm and the lessons learned ahead of assigning blame.

Check yourself

1. What should responders prioritize first during an incident?

2. What is the role of the incident commander?

3. Why are postmortems blameless?