System Design

Chaos Engineering

Deliberately breaking things in production to find weaknesses before they break you.

Testing the unknown

You cannot be confident that a system survives failures you have never tested. Chaos engineering is the practice of deliberately injecting failures, such as killing servers or adding latency, to discover weaknesses before a real outage does.
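As one illustration of "adding latency", the sketch below wraps a function so that some calls are delayed before they run. The names `inject_latency`, `delay_s`, `probability`, and `fetch_profile` are hypothetical and chosen for this example; real fault-injection tools usually work at the network or infrastructure layer rather than inside application code.

```python
import functools
import random
import time


def inject_latency(delay_s: float, probability: float, rng: random.Random = None):
    """Return a decorator that delays a fraction of calls (hypothetical helper)."""
    rng = rng or random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Delay only the chosen fraction of calls, then run the real function.
            if rng.random() < probability:
                time.sleep(delay_s)
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@inject_latency(delay_s=0.05, probability=1.0)
def fetch_profile(user_id: int) -> dict:
    """Stand-in for a downstream call; always succeeds in this sketch."""
    return {"id": user_id}
```

Setting `probability` to a small value keeps the fault limited to a fraction of calls, while callers of `fetch_profile` still receive the normal result, only later.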

The scientific method

A good chaos experiment is a hypothesis test, not random destruction.

  • Define steady state, a measurable signal of healthy behavior.
  • Hypothesize that steady state continues during a failure.
  • Inject the failure on a small slice of traffic.
  • Compare the measured steady state to the hypothesis. A deviation exposes a weakness to fix.
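The four steps above can be sketched as a small simulation. Everything here is an assumption made for illustration: a toy service with three replicas, success rate as the steady-state signal, one replica taken down as the injected failure, and a `tolerance` of one percentage point. It is not a real chaos framework.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control_success_rate: float
    experiment_success_rate: float
    hypothesis_holds: bool


def send_request(rng: random.Random, replicas_up: list[bool], retry: bool) -> bool:
    """Toy client: pick a random replica; with retry, fall back to the others."""
    order = list(range(len(replicas_up)))
    rng.shuffle(order)
    attempts = order if retry else order[:1]
    return any(replicas_up[i] for i in attempts)


def success_rate(rng: random.Random, replicas_up: list[bool], retry: bool, requests: int) -> float:
    """Steady-state signal: fraction of requests that succeed."""
    ok = sum(send_request(rng, replicas_up, retry) for _ in range(requests))
    return ok / requests


def run_experiment(retry: bool, requests: int = 1000,
                   tolerance: float = 0.01, seed: int = 0) -> ExperimentResult:
    rng = random.Random(seed)
    # Control group: no failure injected.
    control = success_rate(rng, [True, True, True], retry, requests)
    # Experiment group: one replica is down (the injected failure).
    experiment = success_rate(rng, [False, True, True], retry, requests)
    # Hypothesis: steady state stays within tolerance of the control group.
    holds = (control - experiment) <= tolerance
    return ExperimentResult(control, experiment, holds)
```

In this toy model, `run_experiment(retry=True)` confirms the hypothesis because retries mask the dead replica, while `run_experiment(retry=False)` refutes it because roughly a third of requests land on the dead replica and fail. The refuted case is the useful one: it points at a missing retry path.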

Blast radius control

The first rule is to limit the blast radius. Start in staging or on a tiny fraction of production, and have an abort switch ready. As confidence grows, you widen the experiment.
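A minimal sketch of widening the blast radius with an abort switch might look like the following. The function `ramp_experiment`, the stage fractions, and the `measure_error_rate` callback are hypothetical; in practice the measurement would come from your monitoring system, and aborting would also roll back the injected fault.

```python
from typing import Callable, Sequence


def ramp_experiment(
    stages: Sequence[float],
    measure_error_rate: Callable[[float], float],
    abort_threshold: float,
) -> tuple[float, bool]:
    """Widen the blast radius stage by stage and abort on the first breach.

    Returns (largest traffic fraction that passed, whether the run aborted).
    """
    widest_safe = 0.0
    for fraction in stages:
        # Inject the failure on this fraction of traffic and observe the result.
        error_rate = measure_error_rate(fraction)
        if error_rate > abort_threshold:
            # Abort switch: stop here and do not widen any further.
            return widest_safe, True
        widest_safe = fraction
    return widest_safe, False
```

For example, calling it with stages `[0.01, 0.05, 0.25, 1.0]` starts at 1% of traffic and only moves to the next stage while the observed error rate stays at or below `abort_threshold`.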

Well-known tooling such as Netflix's Chaos Monkey randomly terminates instances, so engineers are forced to build systems that tolerate node loss as a matter of routine.
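The idea of random instance termination can be illustrated with an in-memory toy. This is not the real tool's API: the fleet is just a set of made-up instance IDs, and `terminate_random_instance` and `fleet_survives` are hypothetical helpers.

```python
import random


def terminate_random_instance(instances: set[str], rng: random.Random) -> str:
    """Remove and return one randomly chosen instance from the fleet."""
    # Sort first so a seeded rng picks the same victim on every run.
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def fleet_survives(instances: set[str], min_healthy: int) -> bool:
    """Toy health check: enough instances remain to serve traffic."""
    return len(instances) >= min_healthy


fleet = {"i-a", "i-b", "i-c"}
victim = terminate_random_instance(fleet, random.Random(42))
print(f"terminated {victim}, survives: {fleet_survives(fleet, min_healthy=2)}")
```

A fleet sized so that it needs every instance would fail this check after a single termination, which is exactly the kind of weakness routine termination is meant to surface.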

Key idea

Chaos engineering injects controlled failures to validate that a system survives the conditions it will inevitably face.

Check yourself

1. What is the goal of chaos engineering?

2. Why is limiting the blast radius the first rule?