Knowledge that survives the panic
During an incident, the responder is stressed and time is short. A runbook is a written guide for handling a specific alert or task: what it means, how to confirm it, and the concrete steps to mitigate it. Good runbooks let any qualified responder act without the one expert who is asleep.
What a runbook contains
- What the alert means and why it fires.
- Diagnosis steps and the dashboards to check.
- Mitigation actions, including safe commands to run.
- Escalation paths when the steps do not resolve it.
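One way to keep runbooks consistent is to store the four parts above as structured data and check that none is left empty. The sketch below is only an illustration: the `Runbook` class, its field names, and the `HighDiskUsage` example are hypothetical, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical container for the four parts of a runbook."""
    alert: str                                            # name of the alert or task
    meaning: str                                          # what it means and why it fires
    diagnosis: list[str] = field(default_factory=list)    # steps and dashboards to check
    mitigation: list[str] = field(default_factory=list)   # safe actions to take
    escalation: list[str] = field(default_factory=list)   # who to contact if steps fail

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "meaning": self.meaning,
            "diagnosis": self.diagnosis,
            "mitigation": self.mitigation,
            "escalation": self.escalation,
        }
        return [name for name, value in sections.items() if not value]


# Illustrative entry with the escalation path not yet written.
example = Runbook(
    alert="HighDiskUsage",
    meaning="Disk usage on a host has stayed above the alert threshold.",
    diagnosis=["Check the disk usage dashboard for the affected host."],
    mitigation=["Remove old log files that are safe to delete."],
)

print(example.missing_sections())
```

A check like `missing_sections` could run in review so that a runbook without an escalation path is caught before an incident rather than during one.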
On call done humanely
On call is the rotation that decides who responds to alerts at any given time. A healthy rotation keeps the load sustainable.
- Alert only on things that need human action now, to reduce noise and fatigue.
- Track toil and turn repeated manual fixes into automation.
- Hold a handoff at each shift change so the next responder inherits context.
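Tracking toil can start as a simple count of how often each manual fix is repeated. The sketch below assumes a hypothetical log of fix descriptions and an arbitrary threshold of three repeats; both are illustrative choices, not recommendations.

```python
from collections import Counter

# Assumed cutoff: a fix repeated this many times is worth automating.
AUTOMATION_THRESHOLD = 3


def automation_candidates(fix_log, threshold=AUTOMATION_THRESHOLD):
    """Return manual fixes that were repeated at least `threshold` times."""
    counts = Counter(fix_log)
    return sorted(fix for fix, n in counts.items() if n >= threshold)


# Hypothetical log of manual fixes recorded during one rotation.
fix_log = [
    "restart worker",
    "clear cache",
    "restart worker",
    "rotate logs",
    "restart worker",
    "clear cache",
]

print(automation_candidates(fix_log))
```

Reviewing this list at each handoff gives the next responder context and keeps a running record of which fixes to automate first.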
Key idea
Runbooks turn expert knowledge into steps any responder can follow, and a humane on call rotation keeps that response sustainable.