Failure Detection with Heartbeats

Deciding that a node is dead from periodic pings, and why timeouts make that tricky.

The basic mechanism

A heartbeat is a small periodic message a node sends to say it is still alive. A monitor expects one every interval. If none arrives within a timeout, the monitor marks the node as suspected or dead.
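
As a minimal sketch of this mechanism, the monitor below records the last heartbeat time per node and reports any node that has been silent for longer than a fixed timeout. The class name, method names, and the injectable `clock` parameter are illustrative assumptions, not part of any particular library.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that go quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock       # injectable so tests can control time
        self.last_seen = {}      # node id -> time of last heartbeat

    def record_heartbeat(self, node_id):
        """Call this whenever a heartbeat from node_id arrives."""
        self.last_seen[node_id] = self.clock()

    def suspected(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A caller would invoke `record_heartbeat` on each incoming message and poll `suspected()` on a timer. The sketch uses a monotonic clock by default so that wall-clock adjustments do not look like silence.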

The hard part is timeouts

Networks are asynchronous, so a missing heartbeat could mean a crash, a slow node, or a delayed or dropped packet. From the monitor's side these cases look the same, and no timeout can tell them apart with certainty.

  • A short timeout reacts fast but produces false positives during congestion.
  • A long timeout avoids false alarms but reacts slowly to real failures.
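
The trade-off above can be made concrete with a small sketch. The gaps below are made-up numbers for a node that never crashed. Any gap longer than the timeout would have been reported as a failure, so it counts as a false alarm.

```python
def false_alarms(gaps_s, timeout_s):
    """Count gaps between heartbeats from a live node that exceed the timeout."""
    return sum(1 for gap in gaps_s if gap > timeout_s)


# Hypothetical gaps (seconds) between heartbeats from a healthy node,
# with two slow stretches caused by congestion.
gaps = [1.0, 1.1, 0.9, 2.4, 1.0, 3.2, 1.0]

for timeout in (1.5, 2.5, 5.0):
    # A real crash is only noticed once the timeout has elapsed,
    # so the timeout is also a lower bound on detection delay.
    print(f"timeout={timeout}s false_alarms={false_alarms(gaps, timeout)}")
```

With these illustrative numbers, raising the timeout removes the false alarms one by one, but every extra second is also a second longer before a real crash is noticed.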

Smarter detectors

Rather than applying a fixed deadline, a phi accrual detector outputs a suspicion level that rises as silence grows, letting callers choose their own threshold. Because the level is computed from the recent history of heartbeat arrival times, it adapts to changing network conditions instead of relying on one guessed number.
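
The sketch below shows one simplified way to compute such a suspicion level. It assumes heartbeat gaps follow an exponential distribution, so the probability that a heartbeat is still on its way after `elapsed` seconds is exp(-elapsed / mean), and phi is the negative base-10 logarithm of that probability. This is a simplification chosen for brevity, not the only formulation, and the class and parameter names are illustrative.

```python
import math
import time
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual detector (exponential inter-arrival assumption)."""

    def __init__(self, window=100, clock=time.monotonic):
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeats
        self.clock = clock                     # injectable so tests can control time
        self.last_arrival = None

    def heartbeat(self):
        """Record a heartbeat and the gap since the previous one."""
        now = self.clock()
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self):
        """Suspicion level: 0 right after a heartbeat, growing with silence."""
        if not self.intervals:
            return 0.0  # not enough history to judge yet
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = self.clock() - self.last_arrival
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

Each caller compares `phi()` against its own threshold. Under this model, phi = 1 means roughly a 10% chance the node is merely late and phi = 2 roughly 1%. If heartbeats have been arriving further apart recently, the same silence yields a lower phi, which is the adaptive behavior described above.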

Combine with action

Detection alone does nothing. Pair it with a response: fail over to a replica, remove the node from a load balancer, or trigger a leader election so the cluster keeps making progress.
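
One way to wire detection to a response is sketched below. The function name, the `pool` set standing in for a load balancer's backend list, and the `elect_leader` callback are all hypothetical placeholders for whatever the real system provides.

```python
def handle_suspects(suspects, pool, leader, elect_leader):
    """React to suspected nodes and return the (possibly new) leader.

    pool is a mutable set of node ids currently receiving traffic.
    elect_leader is a hypothetical hook that picks a leader from a pool.
    """
    for node in suspects:
        pool.discard(node)  # stop routing traffic to the suspected node
    if leader in suspects:
        # Only re-elect when the leader itself is suspected.
        leader = elect_leader(pool) if pool else None
    return leader
```

For example, with `pool = {"a", "b", "c"}`, leader `"a"`, and `min` as a stand-in election rule, suspecting `"a"` leaves `{"b", "c"}` in the pool and makes `"b"` the leader.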

Key idea

Heartbeats infer liveness from periodic pings, but in an asynchronous network every timeout trades reaction speed against false positives.

Check yourself

1. What is the main drawback of a very short heartbeat timeout?

2. What advantage does a phi accrual detector offer over a fixed timeout?