The basic mechanism
A heartbeat is a small periodic message a node sends to say it is still alive. A monitor expects one every interval. If none arrives within a timeout, the monitor marks the node as suspected or dead.
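As a rough illustration of the mechanism, here is a minimal sketch of a fixed-timeout monitor. The class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are assumptions made for this example, not part of any particular library.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and flags nodes that go silent."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence tolerated (hypothetical parameter)
        self.clock = clock       # injectable so the example can use a fake clock
        self.last_seen = {}      # node id -> time of the most recent heartbeat

    def heartbeat(self, node):
        """Record that a heartbeat from `node` just arrived."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


# Usage with a fake clock so the outcome does not depend on real time.
now = [0.0]
monitor = HeartbeatMonitor(timeout=3.0, clock=lambda: now[0])
monitor.heartbeat("a")
monitor.heartbeat("b")
now[0] = 2.0
monitor.heartbeat("a")      # "a" keeps reporting, "b" goes silent
now[0] = 4.0
print(monitor.suspected())  # ['b']
```

A monotonic clock is the default here because wall-clock time can jump backwards or forwards, which would distort the measured silence.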
The hard part is timeouts
Networks are asynchronous: message delay has no fixed upper bound, so a missing heartbeat could mean a crash, a slow node, or a delayed packet. From silence alone, the monitor cannot tell these cases apart.
- A short timeout reacts fast but produces false positives during congestion.
- A long timeout avoids false alarms but reacts slowly to real failures.
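The trade-off can be made concrete with a small sketch. The gap values below are invented for illustration: they stand for the time between consecutive heartbeats from a healthy node that sends every second, with two gaps stretched by congestion. The helper `false_positives` is a hypothetical name.

```python
def false_positives(gaps, timeout):
    """Count gaps from a healthy node that a given timeout would misread as a failure."""
    return sum(1 for g in gaps if g > timeout)


# Made-up inter-arrival gaps in seconds; 2.6 and 4.5 represent congestion.
gaps = [1.0, 1.1, 0.9, 2.6, 1.0, 1.2, 4.5, 1.0]

for timeout in (2.0, 3.0, 5.0):
    print(timeout, false_positives(gaps, timeout))
# 2.0 2
# 3.0 1
# 5.0 0
```

The 5-second timeout raises no false alarms on this data, but a node that crashes right after sending a heartbeat would go unnoticed for about 5 seconds, compared with about 2 seconds for the shortest setting.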
Smarter detectors
Rather than enforcing a fixed deadline, a phi accrual detector outputs a suspicion level that rises as silence grows, and each caller chooses its own threshold. Because the level is computed from recently observed heartbeat arrival times, it adapts to changing network conditions instead of relying on one guessed number.
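A simplified sketch of the idea follows. It assumes heartbeat gaps are exponentially distributed, which keeps the arithmetic short; the original formulation of the phi accrual detector fits a normal distribution to the gaps instead. The class name, the `window` parameter, and the method names are assumptions for this example.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Suspicion level phi = -log10(P(next heartbeat arrives later than now))."""

    def __init__(self, window=100):
        self.gaps = deque(maxlen=window)  # sliding window of recent inter-arrival gaps
        self.last = None                  # arrival time of the latest heartbeat

    def heartbeat(self, now):
        """Record a heartbeat arriving at time `now` (seconds)."""
        if self.last is not None:
            self.gaps.append(now - self.last)
        self.last = now

    def phi(self, now):
        """Return the suspicion level at time `now`; 0.0 if there is no usable history."""
        if not self.gaps:
            return 0.0
        mean = sum(self.gaps) / len(self.gaps)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last
        # Exponential assumption: P(gap > elapsed) = exp(-elapsed / mean),
        # so -log10 of that probability simplifies to elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):   # steady heartbeats, one per second
    detector.heartbeat(t)

print(detector.phi(4.0) < detector.phi(8.0))  # True: suspicion grows with silence
```

Under this model, a threshold of phi = 1 corresponds to roughly a 10% chance that the node is merely late, and phi = 2 to roughly 1%. If the observed gaps lengthen, the mean grows and the same silence yields a lower phi, which is how the detector adapts.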
Combine with action
Detection alone does nothing. Pair it with a response, such as failing over to a replica, removing the node from a load balancer, or triggering a leader election, so the cluster keeps making progress.
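One way to wire detection to action is sketched below. Everything here is hypothetical: `react_to_failures`, the `pool` set standing in for a load balancer's backend list, and the `elect` callback. A real leader election would run a consensus protocol; picking the smallest node id with `min` is only a placeholder.

```python
def react_to_failures(suspected, pool, leader, elect):
    """Drop suspected nodes from the pool and pick a new leader if the old one is gone."""
    for node in suspected:
        pool.discard(node)              # stop routing traffic to the suspected node
    if leader in suspected:
        leader = elect(pool) if pool else None   # placeholder for a real election
    return leader


pool = {"a", "b", "c"}
leader = react_to_failures(["a"], pool, leader="a", elect=min)
print(sorted(pool), leader)  # ['b', 'c'] b
```

Because a suspicion can be a false positive, reactions like these are usually designed to be reversible, for example by adding the node back to the pool once its heartbeats resume.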
Key idea
Heartbeats infer liveness from periodic pings, but in an asynchronous network every timeout trades reaction speed against false positives.