Failure Detectors

Why they exist

In an asynchronous system, a node cannot reliably tell whether a peer has crashed or is merely slow; this ambiguity underlies the FLP impossibility result. A failure detector is the component that makes a best guess, turning the unknowable into an actionable suspicion that protocols can use.

How they work

The common implementation is heartbeats:

Each node periodically sends an "I am alive" message.
If a peer hears nothing from a node within a timeout, it suspects that node.
Some detectors stop suspecting a node if a late heartbeat arrives.
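
The heartbeat steps above can be sketched as a small timeout-based detector. This is a minimal illustration, not a production design: the class name HeartbeatDetector, its methods, and the explicit `now` timestamps are assumptions made for the example, and message transport is left out entirely.

```python
class HeartbeatDetector:
    """Timeout-based failure detector driven by caller-supplied timestamps."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of its most recent heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        # Record the arrival. A late heartbeat implicitly clears any
        # earlier suspicion, because suspicion is recomputed from last_seen.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # A node is suspected once its silence exceeds the timeout.
        return {
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        }


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.0)
print(detector.suspected(now=4.0))  # only "b" has been silent for more than 3.0

detector.heartbeat("b", now=4.5)    # late heartbeat from "b"
print(detector.suspected(now=4.8))  # nobody is suspected any more
```

Passing the clock in as an argument keeps the sketch deterministic; a real detector would read a monotonic clock and receive heartbeats over the network.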

The phi accrual detector goes further: it outputs a suspicion level on a continuous scale instead of a binary alive-or-dead verdict, so each caller picks its own threshold.
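
To make the continuous suspicion level concrete, here is a simplified sketch of a phi calculation. It assumes heartbeat inter-arrival times follow an exponential distribution, which is a simplification chosen for brevity (the original phi accrual paper models them with a normal distribution). The class name, window size, and method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual detector for a single monitored node."""

    def __init__(self, window: int = 100) -> None:
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        if self.last_arrival is None or not self.intervals:
            return 0.0  # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # Exponential assumption: P(next heartbeat is later than `elapsed`)
        # = exp(-elapsed / mean), so phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(now=t)

print(round(detector.phi(now=4.0), 2))  # about 0.43: one normal interval of silence
print(round(detector.phi(now=8.0), 2))  # about 2.17: five intervals of silence
```

A caller that needs fast failover might act at a low phi threshold, while one that cannot tolerate false alarms would wait for a higher value; the detector itself stays the same.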

Two quality properties

Chandra and Toueg classify detectors by:

Completeness: every crashed node is eventually suspected.
Accuracy: correct nodes are not wrongly suspected.

There is a tension: short timeouts catch crashes fast but cause false suspicions; long timeouts are accurate but slow.
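
The tension can be illustrated with a small calculation over a made-up heartbeat trace. The arrival times, the crash time, and the function name `evaluate` are all invented for this sketch; they are not measurements from any real system.

```python
def evaluate(timeout, arrivals, crash_time):
    """Return (false_suspicions, detection_delay) for one timeout value."""
    # While the node is alive, any gap between consecutive heartbeats that
    # exceeds the timeout would have triggered a false suspicion.
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    false_suspicions = sum(1 for gap in gaps if gap > timeout)
    # After the crash, suspicion fires once the timeout has elapsed since
    # the last heartbeat that was received.
    detection_delay = arrivals[-1] + timeout - crash_time
    return false_suspicions, detection_delay


# Hypothetical trace: mostly 1-second heartbeats with two delayed ones,
# followed by a crash at t = 10.5.
arrivals = [0.0, 1.0, 2.1, 4.6, 5.5, 6.5, 9.0, 10.0]
crash_time = 10.5

for timeout in (1.5, 3.0):
    false_suspicions, delay = evaluate(timeout, arrivals, crash_time)
    print(f"timeout={timeout}: {false_suspicions} false suspicions, "
          f"crash detected after {delay}s")
```

On this trace the 1.5-second timeout detects the crash after 1.0 second but wrongly suspects the node twice, while the 3.0-second timeout never suspects it wrongly but takes 2.5 seconds to notice the crash.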

Why they matter

Trigger leader election and failover
Drive membership changes in clusters
Feed gossip protocols like SWIM that spread suspicion efficiently

Key idea

A failure detector uses heartbeats and timeouts to suspect dead nodes, trading detection speed against accuracy, and tools like phi accrual and SWIM make that suspicion tunable and scalable.
