Distributed Locks with Fencing Tokens

Why a paused lock holder can corrupt data and how monotonic tokens fence it out.

The illusion of safety

A distributed lock lets one client at a time access a resource. It feels safe, but a subtle failure breaks it. Suppose client A acquires the lock, then pauses for a long garbage collection or a network stall. The lock service times out the lease and grants the lock to client B. Now A wakes up, still believing it holds the lock, and writes. Both A and B are now writing to the resource, and A's stale write can overwrite B's.
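The failure can be reproduced in a few lines. The following is a minimal sketch, not a real lock service: LeaseLock and UnfencedStorage are hypothetical in-memory stand-ins, and time is passed in explicitly as `now` so the pause can be simulated without sleeping.

```python
class LeaseLock:
    """Toy in-memory lock with a lease (hypothetical, for illustration)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client, now):
        # Grant if the lock is free or the previous lease has expired.
        if self.holder is None or now >= self.expires_at:
            self.holder = client
            self.expires_at = now + self.lease_seconds
            return True
        return False


class UnfencedStorage:
    """Accepts every write; it cannot tell a stale holder from a current one."""

    def __init__(self):
        self.value = None

    def write(self, client, value):
        self.value = value


lock = LeaseLock(lease_seconds=10)
storage = UnfencedStorage()

a_has_lock = lock.acquire("A", now=0)    # A acquires, then pauses
b_has_lock = lock.acquire("B", now=15)   # lease expired, so B is granted the lock
storage.write("B", "from B")

# A wakes up. Its local belief (a_has_lock) was never revoked.
if a_has_lock:
    storage.write("A", "from A")         # silently overwrites B's write
```

Nothing in A's own state tells it that the lease ran out during the pause, which is why checking the lock on the client side just before writing does not close the gap.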

The fix is a fencing token

Each time the lock is granted, the service returns a monotonically increasing number called a fencing token. The client must include this token with every write to the storage system.
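As a rough sketch of the issuing side, assuming a single-process lock service: TokenLockService and grant are hypothetical names, and lease tracking is left out so that only the token counter is shown.

```python
import itertools
import threading


class TokenLockService:
    """Hands out a strictly increasing fencing token with every grant.

    Hypothetical sketch: the caller is assumed to have already decided
    that the previous lease is free or expired.
    """

    def __init__(self):
        self._counter = itertools.count(1)   # 1, 2, 3, ...
        self._mutex = threading.Lock()       # serialises concurrent grants
        self.holder = None

    def grant(self, client):
        with self._mutex:
            self.holder = client
            return next(self._counter)


service = TokenLockService()
token_a = service.grant("A")
token_b = service.grant("B")   # a later grant always gets a larger token
```

A real service would also have to keep the counter durable across restarts and failovers, because a token that repeats or goes backwards would defeat the check at the storage layer.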

A gets token 33 then pauses.
B gets token 34 and writes successfully.
A wakes, writes with token 33, and storage rejects it because it has already accepted a write with the higher token 34.
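The three steps above can be sketched on the storage side as follows. FencedStorage and StaleTokenError are hypothetical names, and the sketch assumes the storage can compare the token and apply the write as one atomic step.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class FencedStorage:
    """Remembers the highest token it has accepted and rejects anything lower."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # An equal token is allowed: the current holder may write many times.
        if token < self.highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self.highest_token}"
            )
        self.highest_token = token
        self.value = value


storage = FencedStorage()

# A holds token 33 but is paused. B holds token 34 and writes first.
storage.write(34, "from B")

# A wakes and tries to write with its old token.
try:
    storage.write(33, "from A")
    a_rejected = False
except StaleTokenError:
    a_rejected = True
```

After this runs, B's value is still in place. In a real system the comparison and the write must happen atomically inside the storage (for example as a conditional write), otherwise two requests could interleave between the check and the update.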

Why the storage must check

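The lock service alone cannot stop A: once A wakes, it talks to the storage directly, and the lock service is not on that path. Only the resource itself, by remembering the highest token it has accepted and refusing anything lower, can reject the stale writer. The token turns a hopeful lock into an enforced one.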

Takeaway

Never rely on a distributed lock alone for correctness when a holder can pause past its lease. Combine it with fencing tokens checked at the resource so a delayed old holder can never overwrite newer work.

Key idea