The thundering herd
A thundering herd is when many processes wake up and contend for the same resource at once, only for one to win and the rest to do wasted work. In caching, it appears when a missing key makes hundreds of workers all try to rebuild the same value.
The mutex lock approach
The fix is a distributed lock keyed on the value being rebuilt. The first worker to acquire the lock recomputes the value and writes it to the cache. The others fail to acquire the lock and instead wait for the result or serve a stale copy.
Pitfalls to handle
- The lock needs a timeout so a crashed holder does not block everyone forever.
- Waiters need a fallback, such as serving stale data, if the holder is slow.
- The lock should be fine grained, one per key, not one global lock.
Key idea
A thundering herd lock elects one worker to rebuild a value while the rest wait, but it must use per key locks with timeouts to stay safe.