Failure Modes Overview

Why name failures

You cannot defend against what you have not named. A failure mode is a specific way a system stops doing what users expect. Listing them turns vague worry into a checklist you can design against.

Common modes

Crash failure: a process stops cleanly and goes silent.
Omission failure: a message or response is simply dropped.
Timing failure: a reply arrives too late to be useful.
Byzantine failure: a component returns wrong or contradictory answers.
Partition: the network splits the cluster into groups that cannot reach each other.
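
One way to turn the list above into a checklist is to encode the modes as an enumeration and ask which ones a design has not yet addressed. The following is a minimal sketch, not a prescribed method: the names FailureMode, uncovered, and plan are hypothetical, and the responses in plan are placeholder examples rather than recommendations.

```python
from enum import Enum


class FailureMode(Enum):
    # One member per mode in the list above; values are short descriptions.
    CRASH = "process stops and goes silent"
    OMISSION = "message or response is dropped"
    TIMING = "reply arrives too late to be useful"
    BYZANTINE = "component returns wrong or contradictory answers"
    PARTITION = "network splits the cluster into isolated groups"


def uncovered(plan):
    """Return the failure modes that have no planned response yet."""
    return [mode for mode in FailureMode if not plan.get(mode)]


# Hypothetical, deliberately incomplete design plan.
plan = {
    FailureMode.CRASH: "fail over to a replica",
    FailureMode.TIMING: "give up after a deadline",
}

missing = uncovered(plan)
print([mode.name for mode in missing])  # ['OMISSION', 'BYZANTINE', 'PARTITION']
```

Iterating over the enumeration instead of over the plan is the point of the sketch: a mode nobody thought about still shows up as a gap.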

Why partial failure is special

On a single machine, code either runs or the whole box is down. Across a network you get partial failure: some nodes work, some do not, and from the outside you often cannot tell which is which. To a caller waiting on a socket, a slow node and a dead node look identical.
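
The sketch below illustrates that ambiguity under simplifying assumptions. It stands in an in-process queue for the socket and a timer thread for the remote node. The function call and its node_delay parameter are hypothetical names invented for this example.

```python
import queue
import threading


def call(node_delay, timeout):
    """Wait up to `timeout` seconds for a reply.

    node_delay is how long the simulated node takes to answer;
    None models a dead node that never answers at all.
    """
    replies = queue.Queue()
    if node_delay is not None:
        timer = threading.Timer(node_delay, replies.put, args=("ok",))
        timer.daemon = True  # a late reply must not keep the program alive
        timer.start()
    try:
        return replies.get(timeout=timeout)
    except queue.Empty:
        return "timeout"


slow = call(node_delay=0.5, timeout=0.05)   # node is alive, just slow
dead = call(node_delay=None, timeout=0.05)  # node will never reply
fast = call(node_delay=0.0, timeout=1.0)    # healthy node

print(slow, dead, fast)  # timeout timeout ok
```

From the caller's side the slow and dead cases produce the same observation. The timeout turns an indefinite wait into a decision point, but it does not reveal which of the two happened.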

Designing with modes in mind

Assume any remote call can hang, fail, or lie.
Decide what the system should do for each mode, not just the happy path.
Prefer modes you can detect over silent ones you cannot.
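
As a rough illustration of these guidelines, the wrapper below converts each of "hang, fail, or lie" into an explicit, detectable outcome. It is a sketch under stated assumptions, not a production pattern: guarded_call, is_valid, and the outcome labels are hypothetical names, and a thread with a deadline stands in for a real remote call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def guarded_call(fn, timeout, is_valid):
    """Run fn and report what was observed: ok, hung, failed, or lied."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        value = future.result(timeout=timeout)
    except FutureTimeout:
        return ("hung", None)      # no answer within the deadline
    except Exception as exc:
        return ("failed", exc)     # the callee reported an explicit error
    finally:
        pool.shutdown(wait=False)  # do not block on a call still running
    if not is_valid(value):
        return ("lied", value)     # an answer arrived but fails validation
    return ("ok", value)


def slow():
    time.sleep(0.3)
    return 1


def broken():
    raise ConnectionError("refused")


def is_positive(value):
    return isinstance(value, int) and value > 0


results = [
    guarded_call(lambda: 7, 1.0, is_positive)[0],
    guarded_call(slow, 0.05, is_positive)[0],
    guarded_call(broken, 1.0, is_positive)[0],
    guarded_call(lambda: -1, 1.0, is_positive)[0],
]
print(results)  # ['ok', 'hung', 'failed', 'lied']
```

The validation step can only catch answers that break a rule the caller is able to check. A wrong answer that still looks plausible passes through, which is why detectable modes are preferable to silent ones.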

Key idea

Resilience starts by enumerating concrete failure modes so each one gets a deliberate response.

