System Design

Failure Detectors

Mechanisms for suspecting that a node has died.

Why they exist

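In an asynchronous system, where message delays have no upper bound, you cannot reliably tell a crashed node from one that is merely slow. This is the same difficulty that underlies the FLP impossibility result (Fischer, Lynch, and Paterson). A failure detector is the component that makes a best guess, turning the unknowable into an actionable suspicion that protocols can use.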

How they work

The common implementation is heartbeats:

  • Each node periodically sends an "I am alive" message
  • If a peer hears nothing for a timeout, it suspects that node
  • Some detectors un-suspect a node if a late heartbeat arrives
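
The heartbeat steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the class name HeartbeatDetector, its methods, and the injected now timestamps are all assumptions made for the example, and a real detector would read a monotonic clock and receive heartbeats over the network.

```python
class HeartbeatDetector:
    """Timeout-based failure detector (illustrative sketch, hypothetical API)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a peer is suspected
        self.last_seen = {}      # peer id -> time of its most recent heartbeat

    def heartbeat(self, peer, now):
        # Record an "I am alive" message. A late heartbeat refreshes the
        # timestamp, so a suspected peer is un-suspected on the next check.
        self.last_seen[peer] = now

    def suspected(self, now):
        # A peer is suspected once its silence exceeds the timeout.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


d = HeartbeatDetector(timeout=3.0)
d.heartbeat("a", now=0.0)
d.heartbeat("b", now=0.0)
d.heartbeat("a", now=2.0)
print(d.suspected(now=4.0))   # {'b'}: b has been silent for 4.0 s
d.heartbeat("b", now=4.5)     # late heartbeat from b
print(d.suspected(now=4.8))   # set(): b is no longer suspected
```

Passing the time in explicitly keeps the sketch deterministic and easy to test; it also makes clear that the detector only ever sees silence, never the crash itself.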

The phi accrual detector goes further, outputting a suspicion level on a continuous scale instead of a binary alive or dead, so callers pick their own threshold.
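
A rough sketch of the accrual idea follows. It assumes heartbeat gaps are exponentially distributed, which is a simplification chosen here to keep the math short; the original phi accrual formulation models the gaps with a normal distribution. Under that assumption the probability that a heartbeat is still on its way after a silence of elapsed seconds is exp(-elapsed / mean), and phi is the negative base-10 logarithm of that probability. The class and method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Accrual-style detector for one peer (simplified, exponential model)."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeats
        self.last = None                       # time of the latest heartbeat

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        # With no history there is nothing to base a suspicion on.
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last
        # phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))


d = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    d.heartbeat(t)
print(round(d.phi(4.0), 2))    # 0.43 after one second of silence
print(round(d.phi(13.0), 2))   # 4.34 after ten seconds of silence
```

A caller then applies its own threshold: one that needs fast failover might act at a low phi, while one that cannot tolerate false alarms would wait for a higher value.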

Two quality properties

Chandra and Toueg classify detectors by:

  • Completeness: every crashed node is eventually suspected
  • Accuracy: correct nodes are not wrongly suspected

There is a tension: short timeouts catch crashes fast but cause false suspicions; long timeouts rarely suspect a healthy node but take longer to notice a real crash.
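
The tension can be made concrete with a small calculation. The gap values below are made up for illustration (a healthy node with occasional delays), and the function name false_suspicions is hypothetical. For each candidate timeout, the sketch counts how often the healthy node would be wrongly suspected; the timeout itself is how long a real crash can go unnoticed after the last heartbeat.

```python
# Hypothetical gaps (seconds) between heartbeats from a healthy but jittery node.
gaps = [1.0, 1.1, 0.9, 1.0, 2.5, 1.0, 1.2, 4.0, 1.0, 0.9]


def false_suspicions(gaps, timeout):
    # A healthy node is wrongly suspected whenever a gap exceeds the timeout.
    return sum(1 for g in gaps if g > timeout)


for timeout in (1.5, 3.0, 5.0):
    wrong = false_suspicions(gaps, timeout)
    print(f"timeout={timeout}s  false suspicions={wrong}  crash noticed within {timeout}s of last heartbeat")
```

With this sample, a 1.5 s timeout yields two false suspicions, 3.0 s yields one, and 5.0 s yields none, at the cost of a slower reaction to a real crash.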

Why they matter

  • Trigger leader election and failover
  • Drive membership changes in clusters
  • Feed gossip protocols like SWIM that spread suspicion efficiently

Key idea

A failure detector uses heartbeats and timeouts to suspect dead nodes, trading detection speed against accuracy, and tools like phi accrual and SWIM make that suspicion tunable and scalable.

Check yourself

1. Why can a failure detector only suspect rather than know a node is dead?

2. What does the phi accrual detector output?

3. What is the tension in choosing a heartbeat timeout?