System Design

Failure Modes Overview

A map of how distributed systems break, which you need before you can design for resilience.

Why name failures

You cannot defend against what you have not named. A failure mode is a specific way a system stops doing what users expect. Listing them turns vague worry into a checklist you can design against.

Common modes

  • Crash failure: a process stops cleanly and goes silent.
  • Omission failure: a message or response is simply dropped.
  • Timing failure: a reply arrives too late to be useful.
  • Byzantine failure: a component returns wrong or contradictory answers.
  • Partition: the network splits the cluster into groups that cannot talk to each other.

Why partial failure is special

On a single machine, code either runs or the box is down. Across a network you get partial failure: some nodes work, some do not, and you often cannot tell which from the outside. A slow node and a dead node look identical to a caller waiting on a socket: both are just silence until a timeout fires.
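
To make the slow-versus-dead ambiguity concrete, here is a minimal sketch in Python. It uses threads in place of real network nodes, and the names call_with_timeout, healthy_node, slow_node, and dead_node are hypothetical helpers invented for this illustration. Under those assumptions, the caller gets the same result from a slow node and a dead one.

```python
import queue
import threading
import time


def call_with_timeout(node, timeout_s):
    """Run node() in a worker thread and wait at most timeout_s for a reply.

    Returns ("ok", value) if a reply arrived in time, else ("timeout", None).
    """
    replies = queue.Queue(maxsize=1)
    worker = threading.Thread(target=lambda: replies.put(node()), daemon=True)
    worker.start()
    try:
        return ("ok", replies.get(timeout=timeout_s))
    except queue.Empty:
        return ("timeout", None)


def healthy_node():
    return "pong"


def slow_node():
    time.sleep(0.3)  # alive, but slower than the caller's deadline
    return "pong"


def dead_node():
    threading.Event().wait()  # blocks forever: this node never replies
    return "pong"


if __name__ == "__main__":
    for node in (healthy_node, slow_node, dead_node):
        print(node.__name__, call_with_timeout(node, timeout_s=0.05))
```

The caller only ever sees "ok" or "timeout". Nothing in the timeout result says whether the node would have answered a moment later or never at all, which is why the deadline has to be a design decision rather than a diagnosis.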

Designing with modes in mind

  • Assume any remote call can hang, fail, or lie.
  • Decide what the system should do for each mode, not just the happy path.
  • Prefer modes you can detect over silent ones you cannot.
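
One way to turn these rules into something checkable is a table that pairs every named mode with a chosen response. The sketch below is only an illustration: the FailureMode enum mirrors the list of common modes, while the RESPONSES entries and the helper names unhandled_modes and respond_to are assumptions made up for this example, not prescribed answers.

```python
from enum import Enum


class FailureMode(Enum):
    CRASH = "crash"
    OMISSION = "omission"
    TIMING = "timing"
    BYZANTINE = "byzantine"
    PARTITION = "partition"


# Hypothetical policy table: these responses are illustrative assumptions,
# not recommendations taken from the lesson.
RESPONSES = {
    FailureMode.CRASH: "restart the process and route traffic to a replica",
    FailureMode.OMISSION: "retry with an idempotency key",
    FailureMode.TIMING: "enforce a deadline and serve a fallback",
    FailureMode.BYZANTINE: "validate the reply and reject it if checks fail",
    FailureMode.PARTITION: "refuse writes on the minority side",
}


def unhandled_modes(responses):
    """Return the failure modes that have no deliberate response yet."""
    return [mode for mode in FailureMode if mode not in responses]


def respond_to(mode, responses=RESPONSES):
    """Look up the planned response, failing loudly if none was chosen."""
    if mode not in responses:
        raise LookupError(f"no response designed for {mode.name}")
    return responses[mode]


if __name__ == "__main__":
    print(unhandled_modes(RESPONSES))
    print(respond_to(FailureMode.TIMING))
```

Raising on a missing entry follows the third rule: a gap in the plan becomes a detectable error instead of a silent one.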

Key idea

Resilience starts by enumerating concrete failure modes so each one gets a deliberate response.

Check yourself

1. What makes partial failure harder than a full crash?

2. Why list failure modes explicitly?