What blast radius means
Blast radius is the set of users, requests, or components affected when a single component fails. Containment is the practice of designing a system so that any single failure can damage only a small, predictable slice of it.
Techniques that shrink it
- Bulkheads: isolate resources so one overloaded tenant cannot exhaust the pool everyone shares. Named after ship compartments that stop one flooded section from sinking the vessel.
- Cells: partition the whole system into independent units, each serving a subset of users with its own capacity. A failed cell takes down only its share.
- Shuffle sharding: assign each customer a random subset of a larger resource pool, so that two customers rarely share exactly the same set. This limits correlated harm.
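The bulkhead idea from the list above can be sketched as a per-tenant cap on in-flight requests. This is a minimal illustration, not a production limiter: the class name `TenantBulkheads`, the parameter `max_in_flight_per_tenant`, and the tenant names are all hypothetical.

```python
import threading
from collections import defaultdict


class TenantBulkheads:
    """Per-tenant concurrency caps (hypothetical helper, not a library API)."""

    def __init__(self, max_in_flight_per_tenant: int) -> None:
        self._limit = max_in_flight_per_tenant
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)  # tenant -> requests currently running

    def try_acquire(self, tenant: str) -> bool:
        """Reserve a slot for the tenant, or refuse if its compartment is full."""
        with self._lock:
            if self._in_flight[tenant] >= self._limit:
                return False
            self._in_flight[tenant] += 1
            return True

    def release(self, tenant: str) -> None:
        """Free a slot when a request finishes."""
        with self._lock:
            if self._in_flight[tenant] > 0:
                self._in_flight[tenant] -= 1


bulkheads = TenantBulkheads(max_in_flight_per_tenant=2)

# A noisy tenant fires five requests at once; only two are admitted.
noisy = [bulkheads.try_acquire("tenant-a") for _ in range(5)]
# A quiet tenant is unaffected because it has its own compartment.
quiet = bulkheads.try_acquire("tenant-b")

print(noisy, quiet)  # [True, True, False, False, False] True
```

Rejecting the excess requests immediately, rather than queueing them, is one design choice among several. The point is only that tenant-a's overload cannot consume tenant-b's capacity.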
Why it beats raw redundancy
Redundancy adds copies, but a poison-pill request or a bad deploy can still hit every copy at once. Containment instead limits how far any one fault can spread, so even a logic bug that redundancy cannot mask harms only the partition where it is triggered.
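To make the contrast concrete, here is a small sketch of shuffle sharding under assumed numbers: 8 workers and 2 workers per customer, which gives C(8, 2) = 28 possible shards. If one customer sends a poison-pill request that takes down both of its workers, only customers assigned exactly the same pair lose all capacity, which is about 1 in 28 on average when assignment is uniform. The names `shard_for` and `fully_affected`, the pool size, and the hash-seeded assignment are illustrative assumptions, not a description of any particular service.

```python
import hashlib
import math
import random

WORKERS = [f"worker-{i}" for i in range(8)]  # assumed pool size
SHARD_SIZE = 2                               # assumed workers per customer


def shard_for(customer_id: str) -> frozenset:
    """Pick a stable pseudo-random subset of workers for one customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(WORKERS, SHARD_SIZE))


def fully_affected(poisoned: str, others: list) -> list:
    """Customers whose every worker is also in the poisoned customer's shard."""
    bad = shard_for(poisoned)
    return [c for c in others if shard_for(c) <= bad]


possible_shards = math.comb(len(WORKERS), SHARD_SIZE)  # 28 distinct pairs

customers = [f"customer-{i}" for i in range(1000)]
hit = fully_affected("customer-0", customers[1:])
print(f"{possible_shards} possible shards; "
      f"{len(hit)} of {len(customers) - 1} other customers lose every worker")
```

With plain redundancy and no sharding, the same poison pill could reach all 8 workers and affect every customer. Here, customers who share only one worker with the poisoned shard still have a healthy worker left.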
The deploy angle
Rolling out a change to one cell first turns a global outage risk into a single-cell incident that you can catch and roll back.
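A cell-by-cell rollout might look like the sketch below. It assumes three callbacks, `deploy`, `is_healthy`, and `rollback`, that a real system would wire to its own tooling. Here they are hypothetical stand-ins operating on an in-memory dictionary, and the "bad build" is simulated.

```python
from typing import Callable, Dict, List


def staged_rollout(
    cells: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy one cell at a time; on the first unhealthy cell, roll it back and stop."""
    updated: List[str] = []
    for cell in cells:
        deploy(cell)
        if not is_healthy(cell):
            rollback(cell)  # the damage is confined to this one cell
            break
        updated.append(cell)
    return updated


# Simulated fleet: every cell starts on v1, and v2 is pretended to be a bad build.
versions: Dict[str, str] = {"cell-1": "v1", "cell-2": "v1", "cell-3": "v1"}


def deploy(cell: str) -> None:
    versions[cell] = "v2"


def rollback(cell: str) -> None:
    versions[cell] = "v1"


def is_healthy(cell: str) -> bool:
    return versions[cell] != "v2"


updated = staged_rollout(list(versions), deploy, is_healthy, rollback)
print(updated, versions)
# [] {'cell-1': 'v1', 'cell-2': 'v1', 'cell-3': 'v1'}
```

The bad build only ever ran in cell-1, and cell-2 and cell-3 never received it. In practice the health check would typically observe the cell for some time before the rollout moves on.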
Key idea
Containment with bulkheads and cells limits how far any single failure can spread, whereas redundancy only increases the number of copies.