Membership with SWIM

Scalable failure detection by random pinging and indirect probes.

6 min read · advanced

The two halves

SWIM separates failure detection from membership dissemination. Detection runs a randomized probe protocol so the work per node stays constant as the cluster grows.

  • Every period, a node picks a random member and sends a direct ping.
  • If no ack arrives in time, it asks k other members to ping the target indirectly.
  • Only if both the direct and indirect probes fail does the node mark the target suspect.

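Indirect probing guards against false positives: if only the link between the prober and the target is congested, a helper on a different path can still reach the target, so a healthy node is not wrongly suspected.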
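The three steps above can be sketched as one function per protocol period. This is a minimal illustration, not a reference implementation: `probe_round`, `direct_ping`, and `indirect_ping` are hypothetical names, and the two callbacks stand in for a real network layer that returns True when an ack arrives before the timeout.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical transport hooks (assumptions, not part of any real library).
DirectPing = Callable[[str], bool]         # target -> ack received in time?
IndirectPing = Callable[[str, str], bool]  # (helper, target) -> ack relayed?


def probe_round(
    self_id: str,
    members: Sequence[str],
    k: int,
    direct_ping: DirectPing,
    indirect_ping: IndirectPing,
    rng=random,
) -> Tuple[Optional[str], Optional[str]]:
    """Run one protocol period and return (target, verdict)."""
    others = [m for m in members if m != self_id]
    if not others:
        return None, None  # nobody to probe

    # Step 1: pick one random member and ping it directly.
    target = rng.choice(others)
    if direct_ping(target):
        return target, "alive"

    # Step 2: ask up to k other members to ping the target for us.
    # A real implementation would send these requests in parallel;
    # this sketch tries them one at a time and stops at the first ack.
    candidates = [m for m in others if m != target]
    helpers = rng.sample(candidates, min(k, len(candidates)))
    if any(indirect_ping(helper, target) for helper in helpers):
        return target, "alive"

    # Step 3: both the direct and the indirect probes failed.
    return target, "suspect"
```

Whatever the cluster size, one call sends at most one direct ping and k indirect requests, which is where the constant per-node load comes from.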

Suspicion and dissemination

A naive design marks an unresponsive member dead as soon as its probes fail. SWIM instead adds a suspect state with a timeout: a suspected node that is still alive can refute the suspicion before the timeout expires, and only an unrefuted suspicion ends in a dead declaration. Membership changes are piggybacked on the normal ping and ack messages, so no separate broadcast is needed.
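A small sketch of the suspect state and the piggyback queue may make this concrete. All names here (`Member`, `MembershipTable`, `suspect_timeout`, `piggyback`) are hypothetical, and the sketch simplifies in two ways that are assumptions rather than SWIM's full design: dead is treated as final, and each update is handed out once instead of being gossiped repeatedly. The full protocol also orders conflicting updates with incarnation numbers, which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Member:
    state: str = "alive"        # "alive", "suspect", or "dead"
    suspected_at: float = 0.0   # time the current suspicion started


@dataclass
class MembershipTable:
    suspect_timeout: float
    members: Dict[str, Member] = field(default_factory=dict)
    # Updates waiting to ride along on the next ping or ack.
    pending: List[Tuple[str, str]] = field(default_factory=list)

    def _set(self, node: str, state: str, now: float) -> None:
        m = self.members.setdefault(node, Member())
        if m.state == state or m.state == "dead":
            return  # no change, or already dead (final in this sketch)
        m.state = state
        m.suspected_at = now if state == "suspect" else 0.0
        self.pending.append((node, state))

    def suspect(self, node: str, now: float) -> None:
        self._set(node, "suspect", now)

    def refute(self, node: str, now: float) -> None:
        # Called when the suspected node proves it is still alive.
        self._set(node, "alive", now)

    def expire(self, now: float) -> None:
        # Declare dead every suspicion that outlived the timeout.
        for node, m in list(self.members.items()):
            if m.state == "suspect" and now - m.suspected_at >= self.suspect_timeout:
                self._set(node, "dead", now)

    def piggyback(self, limit: int) -> List[Tuple[str, str]]:
        # Take up to `limit` queued updates to attach to an outgoing message.
        batch, self.pending = self.pending[:limit], self.pending[limit:]
        return batch
```

A sender would call `piggyback(limit)` when building each ping or ack, so updates spread on traffic that the failure detector is already generating.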

  • Probe load per node stays constant, independent of cluster size.
  • Detection time and accuracy are tuned by the protocol period and the indirect-probe count k.
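A rough back-of-the-envelope calculation illustrates both bullets. The message-count model below is an assumption made for illustration (one ping, plus four messages per indirect helper in the worst case), and the function names are hypothetical. The probability formula assumes each of the other n - 1 members independently picks its probe target uniformly at random.

```python
import math


def swim_messages_per_probe(k: int) -> int:
    """Worst-case messages caused by one node's probe in one period.

    Assumed model: 1 direct ping, then for each of k helpers a
    ping request, the helper's ping, the target's ack, and the relayed ack.
    """
    return 1 + 4 * k


def heartbeat_messages_per_node(n: int) -> int:
    """All-to-all heartbeating for comparison: one message to every other member."""
    return n - 1


def prob_probed_in_period(n: int) -> float:
    """Chance that a given member is picked by at least one of the other n - 1."""
    if n < 2:
        return 0.0
    return 1.0 - (1.0 - 1.0 / (n - 1)) ** (n - 1)


if __name__ == "__main__":
    k = 3
    for n in (10, 100, 1000):
        print(
            n,
            swim_messages_per_probe(k),      # same for every n
            heartbeat_messages_per_node(n),  # grows with n
            round(prob_probed_in_period(n), 3),
        )
    # For large n the probability approaches 1 - 1/e.
    print(round(1 - math.exp(-1), 3))
```

Under these assumptions the SWIM cost depends only on k, while the chance that a failed member is probed in any one period stays near 1 - 1/e for large clusters. The expected time to the first probe is therefore a small constant number of periods, so shortening the period speeds up detection, while raising k reduces false suspicions at the cost of more messages.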

Key idea

SWIM keeps per-node work constant by combining random direct pings with indirect pings, and it spreads membership updates by piggybacking them on existing traffic.

Check yourself

1. Why does SWIM use indirect probes through k members?

2. How does SWIM disseminate membership changes?

3. What is the purpose of the suspect state?