On-Call and Incident Response

The human process that turns an alert into a coordinated, learning-driven recovery.

People close the loop

Observability surfaces problems, but humans resolve them. On-call is the rotation of engineers responsible for responding to alerts, and incident response is the structured process they follow once an alert fires.

The lifecycle

Detect when an alert or report signals a problem.
Triage to assess severity and user impact, declaring an incident if it is significant.
Coordinate by assigning an incident commander who runs the response, plus communication and operations roles, so people do not collide.
Mitigate first to stop user harm, for example by rolling back a release or shedding load, before chasing the underlying root cause.
Resolve once the symptom is gone and the system is stable.
Review with a blameless postmortem that captures the timeline, contributing factors, and concrete action items.
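
The six stages above can be sketched as a small state machine. This is only an illustration under stated assumptions, not a real incident tool: the `Stage` and `Incident` names and the `advance` method are hypothetical. Each step records a note against the current stage, so the timeline needed for the review builds up as the response proceeds.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Hypothetical names mirroring the lifecycle stages, in order.
    DETECT = 1
    TRIAGE = 2
    COORDINATE = 3
    MITIGATE = 4
    RESOLVE = 5
    REVIEW = 6


@dataclass
class Incident:
    title: str
    stage: Stage = Stage.DETECT
    # (stage name, note) pairs; this becomes the postmortem timeline.
    timeline: list = field(default_factory=list)

    def advance(self, note: str) -> Stage:
        """Record a note for the current stage, then move to the next one."""
        self.timeline.append((self.stage.name, note))
        if self.stage is not Stage.REVIEW:
            # REVIEW is the final stage; earlier stages step forward by one.
            self.stage = Stage(self.stage.value + 1)
        return self.stage


if __name__ == "__main__":
    incident = Incident("checkout errors")
    for note in ["alert fired", "high user impact", "commander assigned", "rolled back"]:
        incident.advance(note)
    print(incident.stage.name, incident.timeline)
```

Keeping mitigation as its own stage ahead of resolution mirrors the ordering in the list: user harm is stopped first, and deeper diagnosis can wait for the review.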

What makes it healthy

Mitigate before diagnosing, because stopping the bleeding matters more than understanding it fully.
Clear roles prevent confusion during the chaos.
Runbooks give first responders known steps for common failures.
Blameless culture focuses on systems and process, not individuals, so people share what really happened.
Sustainable rotations with reasonable load and follow-the-sun coverage prevent burnout.
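
Follow-the-sun coverage can be pictured as a lookup from the current UTC hour to the region whose engineers are in working hours. The sketch below is a minimal illustration: the three regions, the eight-hour windows, and the `region_on_call` function are all assumptions made for the example, and real teams pick their own boundaries.

```python
from datetime import datetime, timezone

# Hypothetical three-way split of the UTC day: (start hour, end hour, region).
SHIFTS = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "AMER"),
]


def region_on_call(now: datetime) -> str:
    """Return the region covering the given timezone-aware moment."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    # Unreachable while SHIFTS covers all 24 hours; guards against gaps after edits.
    raise ValueError(f"no shift covers hour {hour}")


if __name__ == "__main__":
    print(region_on_call(datetime.now(timezone.utc)))
```

Because each window falls in one region's daytime, nobody is paged overnight as a matter of routine, which is what lets the rotation stay sustainable.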

Key idea

Incident response moves from detection through triage, coordination, mitigation, and resolution to a blameless review, prioritizing stopping user harm and learning over assigning blame.
