Keeping the Cluster in Agreement
A distributed database needs every node to know which other nodes exist, which are alive, and which have failed. Asking a central coordinator creates a single point of failure. Instead, many systems use a gossip protocol where nodes exchange state with one another peer to peer.
How Gossip Works
- On a regular interval, each node picks a random peer and exchanges what it knows about the cluster.
- Each node tracks a heartbeat counter per node, incrementing its own and merging higher counters it learns.
- Information spreads like a rumor, reaching every node in time proportional to the logarithm of the cluster size.
Detecting Failure
If a node stops increasing its heartbeat and no fresher value arrives within a timeout, peers mark it suspected and eventually down. Because the judgment is shared through gossip, the whole cluster converges on the same view without a central authority.
Why It Scales
- There is no coordinator to overload or to fail.
- Each node talks to only a few peers per round, so traffic stays bounded.
- The approach tolerates message loss because state keeps being re shared.
Key idea
A gossip protocol spreads membership and failure state by having each node exchange heartbeats with random peers, converging the whole cluster without any central coordinator.