System Design

Runbooks and On Call

Capturing operational knowledge so any responder can act under pressure.

Knowledge that survives the panic

During an incident, the responder is stressed and time is short. A runbook is a written guide for handling a specific alert or task: what it means, how to confirm it, and the concrete steps to mitigate it. Good runbooks let any qualified responder act without the one expert who is asleep.

What a runbook contains

  • The alert meaning and why it fires.
  • Diagnosis steps and the dashboards to check.
  • Mitigation actions, including safe commands to run.
  • Escalation paths when the steps do not resolve it.
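
The four parts above can be sketched as a small data structure, which makes it easy to check that no section was left empty. This is a minimal illustration, not a standard schema: the `Runbook` class, its field names, and the `DiskUsageHigh` example entry are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """One runbook entry. Field names are illustrative, not a standard."""
    alert: str
    meaning: str  # what the alert means and why it fires
    diagnosis: list[str] = field(default_factory=list)   # checks and dashboards
    mitigation: list[str] = field(default_factory=list)  # safe actions to take
    escalation: list[str] = field(default_factory=list)  # who to contact next

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "meaning": self.meaning,
            "diagnosis": self.diagnosis,
            "mitigation": self.mitigation,
            "escalation": self.escalation,
        }
        return [name for name, value in sections.items() if not value]


# Hypothetical entry: the alert name and steps are invented for the example.
disk_full = Runbook(
    alert="DiskUsageHigh",
    meaning="Data volume is nearly full; writes may start failing.",
    diagnosis=["Check the disk usage dashboard for the affected host."],
    mitigation=["Delete rotated logs older than 7 days."],
)

print(disk_full.missing_sections())  # ['escalation']
```

A check like `missing_sections` could run in review so that a runbook without an escalation path is caught before the alert ever fires.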

On call done humanely

On call is the rotation that decides who responds to alerts at any given time. A healthy rotation keeps the load sustainable.

  • Alert only on things that need human action now, to reduce noise and fatigue.
  • Track toil and turn repeated manual fixes into automation.
  • Hold a handoff at each shift change so the next responder inherits context.
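
The first two practices can be sketched in a few lines. This is a rough illustration under stated assumptions: the `Alert` fields, the `route` and `automation_candidates` helpers, the "ticket" and "log" destinations, and the threshold of three repeats are all hypothetical choices, not rules from any particular paging tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    needs_human: bool  # can a person actually do something about it?
    urgent: bool       # does it need action now rather than later?


def route(alert: Alert) -> str:
    """Page only for urgent, human-actionable alerts; keep the rest quiet."""
    if alert.needs_human and alert.urgent:
        return "page"
    if alert.needs_human:
        return "ticket"  # assumed non-paging queue for later follow-up
    return "log"


def automation_candidates(manual_fixes: list[str], threshold: int = 3) -> list[str]:
    """Return fixes done by hand at least `threshold` times (assumed cutoff)."""
    counts = Counter(manual_fixes)
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical alerts and fix log, invented for the example.
print(route(Alert("CheckoutErrorsHigh", needs_human=True, urgent=True)))    # page
print(route(Alert("CertExpiresIn30Days", needs_human=True, urgent=False)))  # ticket

fixes = ["restart worker", "clear cache", "restart worker", "restart worker"]
print(automation_candidates(fixes))  # ['restart worker']
```

Counting repeated manual fixes is only one way to surface toil; the point is that the rotation keeps a record it can act on.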

Key idea

Runbooks turn expert knowledge into steps any responder can follow, and a humane on call rotation keeps that response sustainable.

Check yourself

1. What is the main purpose of a runbook?

2. What should trigger a page to an on call engineer?

3. What is toil in operations?