Testing the unknown
You cannot be confident a system survives failures you have never tried. Chaos engineering is the practice of deliberately injecting failures, like killing servers or adding latency, to discover weaknesses before a real outage does.
The scientific method
A good chaos experiment is a hypothesis test, not random destruction.
- Define steady state, a measurable signal of healthy behavior.
- Hypothesize that steady state continues during a failure.
- Inject the failure on a small slice of traffic.
- Compare the result to the hypothesis.
Blast radius control
The first rule is to limit the blast radius. Start in staging or on a tiny fraction of production, and have an abort switch ready. As confidence grows, you widen the experiment.
Famous tooling like the chaos monkey randomly terminates instances so engineers are forced to build systems that tolerate node loss as a matter of routine.
Key idea
Chaos engineering injects controlled failures to validate that a system survives the conditions it will inevitably face.