Detecting that a node is alive
A heartbeat is a small message a node sends at a regular interval to say it is still alive. A peer that stops hearing heartbeats for long enough concludes the sender has failed.
The two knobs
- Heartbeat interval: how often beats are sent. Shorter means faster detection but more traffic.
- Timeout: how long a peer waits without hearing a beat before declaring the sender dead. It should span several intervals so that a single lost or delayed beat does not cause a false alarm.
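To make the two knobs concrete, here is a minimal sketch of the receiving side in Python. The class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are all hypothetical, chosen for illustration. The sketch also assumes that a peer never heard from counts as suspected, which a real system might handle differently.

```python
import time
from typing import Callable, Dict


class HeartbeatMonitor:
    """Remembers when each peer was last heard from and flags silent ones.

    Illustrative sketch only: names and policy choices are assumptions.
    """

    def __init__(self, interval: float, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # The timeout should cover more than one interval, otherwise a
        # healthy peer would be suspected between two consecutive beats.
        if interval <= 0 or timeout <= interval:
            raise ValueError("need 0 < interval < timeout")
        self.interval = interval
        self.timeout = timeout
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def record_heartbeat(self, peer: str) -> None:
        """Call this whenever a beat from `peer` arrives."""
        self._last_seen[peer] = self._clock()

    def is_suspected(self, peer: str) -> bool:
        """True if `peer` has been silent for longer than the timeout."""
        last = self._last_seen.get(peer)
        if last is None:
            return True  # assumption: unknown peers are treated as suspected
        return self._clock() - last > self.timeout
```

With `interval=1.0` and `timeout=3.0`, a peer is suspected only after more than three seconds of silence. A monotonic clock is used by default so that wall-clock adjustments do not trigger or hide a timeout.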
The unavoidable trade
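You cannot minimize both detection time and false positives at once. A short timeout catches failures quickly but can declare a slow yet healthy peer dead when the network is merely delayed; a long timeout avoids those false alarms but leaves real failures undetected for longer. Tuning balances the two for your environment.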
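The trade can be put into rough numbers. The sketch below uses a deliberately simplified model: beats are evenly spaced, network delay is zero, and the receiver suspects a peer as soon as the silence exceeds the timeout. The function names are hypothetical, and times are integer milliseconds to avoid floating-point rounding.

```python
from typing import Tuple


def tolerated_lost_beats(interval_ms: int, timeout_ms: int) -> int:
    """Consecutive lost beats that do NOT trigger a false alarm.

    Losing k beats in a row leaves a gap of (k + 1) * interval between
    received beats; the peer stays trusted while that gap <= timeout.
    """
    if interval_ms <= 0 or timeout_ms <= 0:
        raise ValueError("interval and timeout must be positive")
    return max(timeout_ms // interval_ms - 1, 0)


def detection_latency_ms(interval_ms: int, timeout_ms: int) -> Tuple[int, int]:
    """Approximate (best, worst) delay between a crash and its detection.

    Worst case: the node dies right after sending a beat, so the full
    timeout must elapse. Best case: it dies just before the next beat
    was due, so almost one interval of silence has already passed.
    """
    if interval_ms <= 0 or timeout_ms <= 0:
        raise ValueError("interval and timeout must be positive")
    return (max(timeout_ms - interval_ms, 0), timeout_ms)


if __name__ == "__main__":
    # Compare a tight, a moderate, and a generous timeout at a 1 s interval.
    for timeout in (1500, 3000, 10000):
        lost = tolerated_lost_beats(1000, timeout)
        best, worst = detection_latency_ms(1000, timeout)
        print(f"timeout={timeout} ms: tolerates {lost} lost beats, "
              f"detects in about {best}-{worst} ms")
```

Under this model, raising the timeout increases both numbers together: more lost beats are forgiven, and a real failure takes longer to notice. Real networks add delay and jitter, so these figures are bounds for intuition rather than measurements.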
Key idea
Heartbeats are periodic liveness pings, and the timeout sets the trade-off between detecting failures fast and avoiding false alarms from slow networks.