Why name failures
You cannot defend against what you have not named. A failure mode is a specific way a system stops doing what users expect. Listing them turns vague worry into a checklist you can design against.
Common modes
- Crash failure: a process halts and stops responding; it sends nothing further, right or wrong.
- Omission failure: a message or response is simply dropped.
- Timing failure: a reply arrives too late to be useful.
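- Byzantine failure: a component behaves arbitrarily, returning wrong or contradictory answers (for example, telling different peers different things).
- Partition: the network splits the cluster into groups that cannot talk to each other, even though the nodes in each group may be healthy.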
Why partial failure is special
On a single machine, to a first approximation, code either runs or the whole box is down. Across a network you get partial failure: some nodes work, some do not, and you often cannot tell which from the outside. To a caller waiting on a socket, a slow node and a dead node look identical: in both cases no reply has arrived yet.
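A minimal sketch of that ambiguity, using threads in place of real network nodes. The names call_with_deadline, healthy_node, slow_node, and dead_node are hypothetical, and the delays are arbitrary values chosen for illustration. The caller waits up to a deadline and gets back either a reply or None.

```python
import queue
import threading
import time


def call_with_deadline(node, timeout_s):
    """Run node in a background thread; return its reply, or None on timeout."""
    replies = queue.Queue(maxsize=1)
    # The node receives a callback it may use to send its reply.
    threading.Thread(target=node, args=(replies.put,), daemon=True).start()
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # No reply within the deadline. The caller cannot tell why.
        return None


def healthy_node(reply):
    reply("ok")


def slow_node(reply):
    time.sleep(0.5)  # assumed delay, longer than the caller's deadline
    reply("ok")


def dead_node(reply):
    pass  # stand-in for a crashed node: it never replies


if __name__ == "__main__":
    for node in (healthy_node, slow_node, dead_node):
        print(node.__name__, call_with_deadline(node, timeout_s=0.05))
```

With a 0.05 second deadline, the slow node and the dead node both produce None. The timeout turns "no answer yet" into a detectable event, but it does not say which failure occurred, so the caller's next step has to be safe in either case.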
Designing with modes in mind
- Assume any remote call can hang, fail, or lie.
- Decide what the system should do for each mode, not just the happy path.
- Prefer failures you can detect over silent ones: a component that stops loudly is easier to handle than one that quietly returns bad data.
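One way to make "a response for each mode" checkable is to keep the modes and their responses in a table and fail loudly when an entry is missing. This is a sketch under assumptions: FailureMode, RESPONSES, unhandled_modes, and respond_to are hypothetical names, and the response strings are placeholder policies, not recommendations for any particular system.

```python
from enum import Enum


class FailureMode(Enum):
    CRASH = "crash"
    OMISSION = "omission"
    TIMING = "timing"
    BYZANTINE = "byzantine"
    PARTITION = "partition"


# Illustrative policies only; a real system would choose its own.
RESPONSES = {
    FailureMode.CRASH: "fail over to a replica and restart the process",
    FailureMode.OMISSION: "retry with an idempotent request",
    FailureMode.TIMING: "time out and serve a cached or degraded answer",
    FailureMode.BYZANTINE: "cross-check answers across replicas",
    FailureMode.PARTITION: "serve only from the side that holds a quorum",
}


def unhandled_modes(responses):
    """Return every failure mode that has no planned response."""
    return [mode for mode in FailureMode if mode not in responses]


def respond_to(mode):
    """Look up the planned response; raise rather than fail silently."""
    if mode not in RESPONSES:
        raise LookupError(f"no response planned for {mode.name}")
    return RESPONSES[mode]


if __name__ == "__main__":
    print(unhandled_modes(RESPONSES))
    print(respond_to(FailureMode.TIMING))
```

Running unhandled_modes in a test or at startup turns a forgotten mode into a visible error instead of an unplanned outage.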
Key idea
Resilience starts by enumerating concrete failure modes so each one gets a deliberate response.