The zombie holder problem
Suppose a node acquires a lock, then stalls in a long garbage-collection pause. Its lease expires and another node takes the lock. When the first node wakes, it does not know it lost the lock and proceeds to write. Two writers now believe they hold exclusive access. A lock alone cannot stop this.
The fix is a token
A fencing token is a number the lock service hands out that increases every time the lock is granted. The holder must include its token with every write to the protected resource.
- Each grant returns a strictly larger token than the last.
- The holder attaches the token to every protected operation.
- The resource rejects any write with a token lower than the highest it has seen.
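The three rules above can be sketched in a few lines of Python. This is a minimal in-process illustration, not a real lock service: `LockService`, `FencedResource`, `StaleTokenError`, `grant`, `write`, and `read` are all hypothetical names chosen for the example, and a production system would persist the counter and the highest-seen token.

```python
import itertools
import threading


class StaleTokenError(Exception):
    """Raised when a write carries a token lower than one already seen."""


class LockService:
    """Hypothetical lock service: issues a strictly increasing token per grant."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._mutex = threading.Lock()

    def grant(self) -> int:
        # Each grant returns a strictly larger token than the last.
        with self._mutex:
            return next(self._counter)


class FencedResource:
    """Hypothetical protected resource: checks the token on every write."""

    def __init__(self):
        self._highest_seen = 0
        self._value = None
        self._mutex = threading.Lock()

    def write(self, token: int, value) -> None:
        with self._mutex:
            # Reject any write whose token is lower than the highest seen.
            if token < self._highest_seen:
                raise StaleTokenError(
                    f"token {token} is older than {self._highest_seen}"
                )
            self._highest_seen = token
            self._value = value

    def read(self):
        with self._mutex:
            return self._value
```

The comparison is `<` rather than `<=` so that the current holder can issue several writes under the same token, which matches the rule as stated: only tokens lower than the highest seen are rejected.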
Why this works
Once the new holder has written, the resource has seen its higher token. When the paused node wakes and writes with its stale token, that write is rejected. The resource itself, not the lock, enforces correctness.
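To make the timeline concrete, the sketch below replays writes in the order they reach the resource. The function name `replay`, the token values 33 and 34, and the node labels are assumptions made up for illustration. The example also assumes the new holder's write arrives before the paused node wakes.

```python
def replay(writes):
    """Apply (token, value) writes in arrival order.

    Returns the final stored value and the list of rejected writes.
    """
    highest_seen = 0
    value = None
    rejected = []
    for token, new_value in writes:
        if token < highest_seen:
            # Stale token: a newer holder has already written.
            rejected.append((token, new_value))
            continue
        highest_seen = token
        value = new_value
    return value, rejected


# Node A was granted token 33 and then paused.
# Node B was granted token 34 and writes first; A's write arrives late.
arrivals = [(34, "from B"), (33, "from A (stale)")]
final_value, rejected = replay(arrivals)
```

Here `final_value` is B's write and A's late write ends up in `rejected`, so the paused node cannot overwrite the new holder's data.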
Key idea
A fencing token is a monotonically increasing number checked by the resource itself. A delayed former lock holder cannot corrupt state, because its stale, lower token is rejected.