System Design

The Partial Failure Model

In distributed systems some parts fail while others keep running.

The core difference

In a single-process program, a crash takes everything down at once, so the failure is total and obvious. In a distributed system, one node can fail while its peers keep running. This is partial failure, and it is the defining difficulty of distributed computing.

Why it is hard

  • A caller cannot tell a slow node from a dead node
  • A request may have succeeded even though the reply was lost
  • Nodes can fail independently of one another, so any subset can be down at once
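
The second point is the easiest to underestimate, so here is a minimal sketch of it. FlakyCounter and naive_retry are hypothetical names made up for this illustration, and the "network" is just a scripted list saying which replies get lost. The server applies the change first and only then may lose the reply, so the caller's retry repeats the effect.

```python
class FlakyCounter:
    """Hypothetical remote service: applies the increment, then may lose the reply."""

    def __init__(self, drop_replies):
        self.value = 0
        # Scripted per-call behaviour (True = reply lost) keeps the run deterministic.
        self._drop_replies = list(drop_replies)

    def increment(self):
        self.value += 1  # the side effect happens on the server
        if self._drop_replies.pop(0):
            raise TimeoutError("no reply")  # the caller sees only silence
        return self.value


def naive_retry(server, attempts):
    """Retry on silence, without knowing whether the earlier attempt took effect."""
    for _ in range(attempts):
        try:
            return server.increment()
        except TimeoutError:
            continue  # cannot tell "request never arrived" from "reply was lost"
    return None


# First reply is lost, second one arrives.
server = FlakyCounter(drop_replies=[True, False])
result = naive_retry(server, attempts=2)
print(result, server.value)
```

The caller wanted one increment, but the counter ends at 2, because the first attempt succeeded silently.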

The two generals problem

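Over an unreliable network, you can never be fully certain a remote action happened, because the acknowledgment can also be lost, and so can any acknowledgment of that acknowledgment. No finite exchange of messages removes the doubt. This means you design for uncertainty, not for guaranteed knowledge.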

Coping strategies

  • Timeouts turn silence into a decision
  • Idempotency lets you safely retry without double effects
  • Health checks and heartbeats flag peers that have stopped responding
  • Bulkheads keep one failure from sinking the whole system
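
The first two strategies work well together, and a small sketch can show how. Everything here is an assumption made for illustration: IdempotentLedger and deposit_with_retry are hypothetical names, a raised TimeoutError stands in for a real network timeout, and the list of dropped replies is scripted. The client reuses one idempotency key for every attempt of the same logical request, and the server remembers keys it has already applied.

```python
import uuid


class IdempotentLedger:
    """Hypothetical service that deduplicates deposits by idempotency key."""

    def __init__(self, drop_replies):
        self.balance = 0
        self._seen = {}  # idempotency key -> result recorded on first application
        self._drop_replies = list(drop_replies)  # True = reply lost (scripted)

    def deposit(self, key, amount):
        if key not in self._seen:
            # First time this key is seen: apply the effect exactly once.
            self.balance += amount
            self._seen[key] = self.balance
        if self._drop_replies.pop(0):
            raise TimeoutError("no reply before the deadline")
        return self._seen[key]  # repeats get the originally recorded result


def deposit_with_retry(ledger, amount, attempts=3):
    """The timeout turns silence into a decision: retry with the same key."""
    key = str(uuid.uuid4())  # one key for all attempts of this logical request
    for _ in range(attempts):
        try:
            return ledger.deposit(key, amount)
        except TimeoutError:
            continue
    raise RuntimeError("outcome still unknown after retries")


# Two replies are lost, the third arrives.
ledger = IdempotentLedger(drop_replies=[True, True, False])
new_balance = deposit_with_retry(ledger, 50)
print(new_balance, ledger.balance)
```

The deposit is attempted three times but applied once, so the balance ends at 50. If every attempt times out, the caller still does not know the outcome, which is why the function raises instead of reporting failure.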

A mental shift

Stop asking "did it work?" and start asking "what do I do when I cannot tell?" The honest answer drives retries, compensation, and reconciliation.
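
One way to build this shift into code is to give a remote call three possible outcomes instead of two. The sketch below is one possible policy, not a standard one: Outcome and next_step are hypothetical names, and the returned strings are placeholders for real handlers.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"    # the remote side definitely rejected or did not apply the call
    UNKNOWN = "unknown"  # timed out: the call may or may not have taken effect


def next_step(outcome, idempotent):
    """Pick a follow-up action (illustrative policy only)."""
    if outcome is Outcome.SUCCEEDED:
        return "done"
    if outcome is Outcome.FAILED:
        return "retry"  # nothing was applied, so repeating cannot double the effect
    # UNKNOWN: a retry is only safe when repeating the call has no extra effect.
    if idempotent:
        return "retry"
    return "reconcile"  # look up the real state later and compensate if needed


print(next_step(Outcome.UNKNOWN, idempotent=True))
print(next_step(Outcome.UNKNOWN, idempotent=False))
```

Treating UNKNOWN as its own case stops a timeout from being quietly counted as a failure.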

Key idea

Partial failure means some nodes die while others live, and you can never be sure a remote call succeeded, so design for uncertainty with timeouts and idempotency.

Check yourself

1. What makes partial failure harder than a local crash?

2. Why does idempotency help with partial failure?