Why they exist
In an asynchronous system you cannot reliably tell a crashed node from one that is merely slow, which is the observation at the heart of the FLP impossibility result. A failure detector is the component that makes a best guess, turning this uncertainty into an actionable suspicion that protocols can use.
How they work
The common implementation is heartbeats:
- Each node periodically sends an "I am alive" message
- If a peer hears nothing from a node within the timeout, it suspects that node
- Some detectors un-suspect a node if a late heartbeat arrives
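The steps above can be sketched as a small timeout-based detector. This is a minimal illustration, not a production design: the class name `HeartbeatDetector`, its methods, and the 3-second timeout are all assumptions made for the example, and time is passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatDetector:
    """Timeout-based failure detector (illustrative sketch, hypothetical API)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a peer is suspected
        self.last_seen = {}      # peer id -> time of its most recent heartbeat

    def heartbeat(self, peer, now):
        # Record an "I am alive" message. A late heartbeat refreshes the
        # timestamp, which un-suspects the peer on the next check.
        self.last_seen[peer] = now

    def suspected(self, now):
        # Peers whose last heartbeat is older than the timeout.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.0)
at_4 = detector.suspected(now=4.0)   # "b" has been silent for 4 s, "a" for 2 s
detector.heartbeat("b", now=4.5)     # late heartbeat from "b"
at_5 = detector.suspected(now=5.0)   # nobody is past the timeout any more
```

Here `at_4` contains only `"b"`, and `at_5` is empty because the late heartbeat cleared the suspicion.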
The phi accrual detector goes further: instead of a binary alive-or-dead verdict, it outputs a suspicion level on a continuous scale, so each caller picks its own threshold.
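A rough sketch of the accrual idea follows. It makes a simplifying assumption that heartbeat gaps are exponentially distributed, so the probability of a silence longer than `elapsed` is `exp(-elapsed / mean)` and phi is the negative base-10 logarithm of that probability. Real implementations may fit a different distribution to the observed gaps. The class name, the window size, and the method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Accrual-style detector for one peer (sketch, exponential assumption)."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeats
        self.last = None                       # time of the latest heartbeat

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        # No estimate is possible until at least one gap has been observed.
        if not self.intervals:
            return 0.0
        mean = max(sum(self.intervals) / len(self.intervals), 1e-9)
        elapsed = now - self.last
        # phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))


d = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    d.heartbeat(t)
phi_soon = d.phi(3.5)    # half a second of silence: low suspicion
phi_late = d.phi(13.0)   # ten seconds of silence: high suspicion
```

With heartbeats one second apart, phi is about 0.2 after half a second of silence and about 4.3 after ten seconds. A caller that must fail over quickly might act at a low threshold, while a caller for which false alarms are costly might wait for a higher one.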
Two quality properties
Chandra and Toueg classify detectors by:
- Completeness: every crashed node is eventually suspected
- Accuracy: correct nodes are not wrongly suspected
There is a tension: short timeouts catch crashes fast but cause false suspicions; long timeouts are accurate but slow.
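The tension can be made concrete with a small calculation. The gap values below are invented for illustration, representing the seconds between consecutive heartbeats from a node that never actually crashed. Any gap longer than the timeout would have produced a false suspicion, while the timeout itself bounds how long a real crash goes unnoticed after the last heartbeat.

```python
# Hypothetical gaps (seconds) between heartbeats from a node that stayed alive.
gaps = [1.0, 1.1, 0.9, 2.5, 1.0, 1.2, 4.0, 1.0, 0.8, 1.3]


def false_suspicions(gaps, timeout):
    # Each gap longer than the timeout would have wrongly suspected the node.
    return sum(1 for g in gaps if g > timeout)


# Map each candidate timeout to the number of false suspicions it would cause.
results = {t: false_suspicions(gaps, t) for t in (1.5, 3.0, 5.0)}
```

On this data a 1.5-second timeout yields two false suspicions, 3 seconds yields one, and 5 seconds yields none, but the 5-second setting also takes more than three times as long to notice a real crash.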
Why they matter
- Trigger leader election and failover
- Drive membership changes in clusters
- Feed gossip protocols like SWIM that spread suspicion efficiently
Key idea
A failure detector uses heartbeats and timeouts to suspect dead nodes, trading detection speed against accuracy; tools like phi accrual and SWIM make that suspicion tunable and scalable.