The problem
With plain sharding, every customer on a shard shares the same workers. One abusive customer can take down that shard and everyone on it. Shuffle sharding reduces the chance that two customers fully share fate.
How it works
Instead of being pinned to one fixed shard, each customer is assigned a small random subset of the worker pool, say two of eight workers.
- Two customers rarely get the exact same pair of workers.
- If one customer floods its workers, another customer usually still has at least one healthy worker.
- If clients retry failed requests on their other worker, that surviving worker keeps the second customer up.
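The assignment described above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the worker names, the `shuffle_shard` and `healthy_workers` helpers, and the choice to seed the random pick from a SHA-256 hash of the customer ID are all assumptions made for the example.

```python
import hashlib
import random

# Hypothetical pool of eight workers, two per customer, as in the example above.
WORKERS = [f"worker-{i}" for i in range(8)]
SHARD_SIZE = 2


def shuffle_shard(customer_id, workers=WORKERS, size=SHARD_SIZE):
    """Return a stable pseudo-random subset of workers for one customer."""
    # Assumption: seed the generator from a hash of the customer ID so the
    # same customer always maps to the same subset without a stored table.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(workers, size))


def healthy_workers(customer_id, flooded):
    """Workers in this customer's subset that are not in the flooded set."""
    return [w for w in shuffle_shard(customer_id) if w not in flooded]


# A noisy customer floods its own two workers; a victim stays up as long as
# at least one of its workers is outside that flooded set.
flooded = set(shuffle_shard("noisy-customer"))
victim_still_up = len(healthy_workers("quiet-customer", flooded)) > 0
```

Deriving the subset from the customer ID keeps the mapping repeatable across restarts. A real system might instead store assignments explicitly so that changing the pool does not reshuffle existing customers.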
Why it is powerful
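The number of possible subsets grows combinatorially with the pool size (eight workers taken two at a time already give 28 distinct pairs), so the odds of complete overlap between any two customers shrink dramatically. A single bad actor fully takes down only the few customers who happen to share all its workers. Customers who share just some of them are degraded but keep at least one healthy worker.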
Shuffle sharding turns a shared fate problem into many partial overlaps that are far easier to survive.
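To put rough numbers on this, the sketch below counts subsets for the eight-worker, two-per-customer example. It assumes each customer's subset is drawn uniformly and independently, which the text implies but does not state. The function name `overlap_probabilities` is made up for illustration.

```python
from math import comb


def overlap_probabilities(n_workers, shard_size):
    """Chance that a second customer's subset overlaps a given customer's.

    Returns (full, partial, none): the probabilities of sharing every
    worker, sharing some but not all, and sharing no worker at all.
    Assumes uniform, independent subset choice.
    """
    total = comb(n_workers, shard_size)  # number of possible subsets
    full = 1 / total  # the exact same subset
    none = comb(n_workers - shard_size, shard_size) / total  # disjoint subsets
    partial = 1 - full - none
    return full, partial, none


full, partial, none = overlap_probabilities(8, 2)
# Eight workers, two each: 28 subsets, so full = 1/28, partial = 12/28, none = 15/28.
```

Under these assumptions, a given victim fully shares fate with a bad actor about 3.6% of the time, compared with a certainty for every customer on the same plain shard. Growing the pool or the subset size shrinks the full-overlap figure further.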
Key idea
Give each customer a random subset of workers so a noisy neighbor rarely takes down all of another customer's workers.