System Design

Blast Radius Containment

Designing so that one failure can only harm a small slice of the system.

5 min read · advanced

What blast radius means

Blast radius is the set of users, requests, or components harmed when one thing fails. Containment is the practice of designing so that any single failure can only damage a small, predictable slice.

Techniques that shrink it

  • Bulkheads: isolate resources so one overloaded tenant cannot exhaust the pool everyone shares. Named after ship compartments that stop one flooded section from sinking the vessel.
  • Cells: partition the whole system into independent units, each serving a subset of users with its own capacity. A failed cell takes down only its share.
  • Shuffle sharding: assign each customer a random combination of resources so two customers rarely share the same full set, limiting correlated harm.
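The three techniques above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the names `Bulkhead`, `cell_for`, `shuffle_shard`, and `stable_hash` are hypothetical, the bulkhead is a simple per-tenant counter, and the worker and cell counts are made-up numbers.

```python
import hashlib
import math
import random


def stable_hash(key):
    """Deterministic hash of a string (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Bulkhead:
    """Caps in-flight requests per tenant so one tenant cannot drain a shared pool."""

    def __init__(self, per_tenant_limit):
        self.limit = per_tenant_limit
        self.in_flight = {}  # tenant -> number of requests currently running

    def try_acquire(self, tenant):
        used = self.in_flight.get(tenant, 0)
        if used >= self.limit:
            return False  # only this tenant is rejected; others are unaffected
        self.in_flight[tenant] = used + 1
        return True

    def release(self, tenant):
        self.in_flight[tenant] = max(0, self.in_flight.get(tenant, 0) - 1)


def cell_for(user_id, num_cells):
    """Cells: pin each user to exactly one independent cell."""
    return stable_hash(user_id) % num_cells


def shuffle_shard(customer_id, num_workers, shard_size):
    """Shuffle sharding: give each customer a pseudo-random subset of workers."""
    rng = random.Random(stable_hash(customer_id))  # seeded, so the shard is stable
    return frozenset(rng.sample(range(num_workers), shard_size))


# Usage with assumed numbers: 8 workers, 2 workers per customer.
bulkhead = Bulkhead(per_tenant_limit=2)
shard_a = shuffle_shard("customer-a", num_workers=8, shard_size=2)
possible_shards = math.comb(8, 2)  # 28 distinct pairs of workers
```

With 8 workers and shards of 2 there are 28 possible shards, so a customer who takes down both of its workers fully overlaps only with customers that happened to draw the same pair.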

Why it beats raw redundancy

Redundancy adds copies, which protects against independent failures such as a single dead host. But a poison pill request or a bad deploy is a correlated fault that can still hit every copy at once. Containment instead limits how far any one fault can spread, so even a logic bug stays local.
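A toy model makes the difference concrete. It assumes a poison pill request crashes every replica that handles it and that the client retries until its pool has no healthy replica left. The function names and the replica counts (12 replicas, 4 cells of 3) are illustrative assumptions.

```python
def send_with_retries(pool, is_poison):
    """Try each healthy replica in turn; a poison request crashes whichever handles it."""
    for i, healthy in enumerate(pool):
        if not healthy:
            continue
        if is_poison:
            pool[i] = False  # replica crashes, client retries on the next one
        else:
            return  # a normal request is served by the first healthy replica


def healthy_fraction(pools):
    """Share of all replicas still healthy across every pool."""
    replicas = [r for pool in pools for r in pool]
    return sum(replicas) / len(replicas)


# Redundancy only: 12 identical replicas behind one shared pool.
shared = [[True] * 12]
send_with_retries(shared[0], is_poison=True)

# Containment: the same 12 replicas split into 4 cells of 3,
# with the sender pinned to cell 0.
cells = [[True] * 3 for _ in range(4)]
send_with_retries(cells[0], is_poison=True)

shared_left = healthy_fraction(shared)  # 0.0, every copy was hit
cells_left = healthy_fraction(cells)    # 0.75, only one cell was lost
```

The hardware budget is identical in both cases. Only the boundary around the failing request changes.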

The deploy angle

Rolling out a change to one cell first turns a global outage risk into a single-cell incident that you can catch and roll back before it reaches the other cells.
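As a rough sketch of that rollout order, assume each cell runs one version and a health check is available per cell. `staged_rollout` and `is_healthy` are hypothetical names; a real pipeline would usually wait between steps and promote in several waves rather than all at once.

```python
def staged_rollout(versions, order, new_version, is_healthy):
    """Deploy to the first cell in `order`; promote to the rest only if it stays healthy.

    versions:   dict mapping cell name -> currently deployed version (mutated in place)
    is_healthy: callable (cell, version) -> bool, assumed to be supplied by monitoring
    """
    first, rest = order[0], order[1:]
    previous = versions[first]
    versions[first] = new_version
    if not is_healthy(first, new_version):
        versions[first] = previous  # roll back; the other cells never saw the change
        return False
    for cell in rest:
        versions[cell] = new_version
    return True


order = ["cell-a", "cell-b", "cell-c"]

# A bad deploy is caught in the first cell and rolled back.
bad = {cell: "v1" for cell in order}
bad_ok = staged_rollout(bad, order, "v2", lambda cell, version: False)

# A good deploy is promoted to every cell.
good = {cell: "v1" for cell in order}
good_ok = staged_rollout(good, order, "v2", lambda cell, version: True)
```

In the failing case, at most one cell ever ran the bad version, and only until the health check caught it.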

Key idea

Containment with bulkheads and cells limits how far any single failure can spread, not just how many copies exist.

Check yourself

1. What does a bulkhead do?

2. Why does containment beat raw redundancy against a poison pill request?