The challenge
How do you record a consistent global snapshot of a running distributed system when there is no shared clock and messages are in flight? The Chandy Lamport algorithm does this without pausing the system.
The marker mechanism
The algorithm uses a special marker message on FIFO channels:
- An initiator records its own state, then sends a marker on every outgoing channel
- When a node first sees a marker, it records its own state and sends markers out
- A node records messages arriving on a channel between recording its state and receiving the marker on that channel; these are the in flight messages
Why it is consistent
The collected snapshot is a state the system could have been in, even if it was never simultaneously true everywhere. Channel states capture messages that were sent before the cut but not yet received, so nothing is lost or double counted.
Where it is used
- Checkpointing for crash recovery
- Deadlock and termination detection
- Stream processing systems like Flink use a marker based variant for exactly once state
Assumptions
- Channels are reliable and FIFO
- Markers eventually propagate to all nodes
Key idea
Chandy Lamport records a consistent global snapshot by flooding marker messages on FIFO channels, capturing each node state plus the in flight messages, all without halting the system.