The core difference
In a single program, a crash takes everything down at once. In a distributed system, one node can fail while its peers keep running. This is partial failure, and it is the defining difficulty of distributed computing.
Why it is hard
- A caller cannot tell a slow node from a dead node
- A request may have succeeded even though the reply was lost
- Nodes can fail independently of one another, so any subset can be down at once
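The first two points can be illustrated with a small simulation. This is only a sketch: the `Node` class, the `Fault` enum, and the `call` function are hypothetical names invented for the example, and a timeout is modeled simply as a `None` reply rather than a real network deadline.

```python
import enum

class Fault(enum.Enum):
    NONE = "none"
    REQUEST_LOST = "request lost"
    REPLY_LOST = "reply lost"
    NODE_SLOW = "node slow"

class Node:
    """Hypothetical remote node that counts the requests it applied."""
    def __init__(self):
        self.applied = 0

    def handle(self):
        self.applied += 1
        return "ok"

def call(node, fault):
    """Return the reply the caller observes, or None for a timeout."""
    if fault is Fault.REQUEST_LOST:
        return None              # the node never saw the request
    reply = node.handle()        # the node does the work
    if fault in (Fault.REPLY_LOST, Fault.NODE_SLOW):
        return None              # work done, but no reply arrived in time
    return reply

for fault in Fault:
    node = Node()
    observed = call(node, fault)
    print(f"{fault.value:12} caller sees {observed!r:6} node applied {node.applied}")
```

In three of the four cases the caller observes the same thing, silence, yet the node applied the request in two of them and not in the third. Nothing on the caller's side distinguishes these cases.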
The two generals problem
Over an unreliable channel you can never be fully certain a remote action happened, because the acknowledgment can be lost too, and so can any acknowledgment of that acknowledgment. This means you design for uncertainty, not for guaranteed knowledge.
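A little arithmetic shows why retrying narrows the uncertainty without ever removing it. The sketch below assumes, purely for illustration, that every message is lost independently with the same probability; real networks are not this tidy, and `confirmed_probability` is a hypothetical helper name.

```python
def confirmed_probability(p_loss, attempts):
    """Chance that at least one request AND its acknowledgment both arrive."""
    round_trip = (1 - p_loss) ** 2          # request arrives and ack arrives
    return 1 - (1 - round_trip) ** attempts

for attempts in (1, 3, 10):
    print(attempts, confirmed_probability(0.1, attempts))
```

Under this model the probability climbs toward 1 as attempts increase, but for any nonzero loss rate it never reaches 1 after a finite number of attempts.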
Coping strategies
- Timeouts turn silence into a decision
- Idempotency lets you retry safely, because a repeated request has no additional effect
- Health checks and heartbeats flag peers that have stopped responding
- Bulkheads keep one failure from sinking the whole system
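The first two strategies work together, and one possible shape for that is sketched below. Everything here is an assumption made for the example: `FlakyLedger` is a hypothetical in-memory server, a lost reply is simulated by raising Python's built-in `TimeoutError`, and the idempotency key is a random UUID that the client reuses on every retry.

```python
import uuid

class FlakyLedger:
    """Hypothetical server: applies a deposit, but loses the first N replies."""
    def __init__(self, replies_to_drop):
        self.balance = 0
        self.seen = {}                    # idempotency key -> stored result
        self.replies_to_drop = replies_to_drop

    def deposit(self, key, amount):
        if key not in self.seen:          # first time: apply the effect once
            self.balance += amount
            self.seen[key] = self.balance
        if self.replies_to_drop > 0:      # simulate a reply lost in transit
            self.replies_to_drop -= 1
            raise TimeoutError("no reply before the deadline")
        return self.seen[key]

def deposit_with_retry(ledger, amount, max_attempts=3):
    key = str(uuid.uuid4())               # one key reused for every retry
    for attempt in range(1, max_attempts + 1):
        try:
            return ledger.deposit(key, amount)
        except TimeoutError:
            if attempt == max_attempts:
                raise                     # still unknown: caller must decide

ledger = FlakyLedger(replies_to_drop=2)
print(deposit_with_retry(ledger, 50))     # result of the deposit
print(ledger.balance)                     # the deposit was applied only once
```

The timeout turns silence into a decision to retry, and the stored key makes that retry harmless: the server recognizes the repeated request and returns the earlier result instead of depositing again. When every attempt times out, the function re-raises, because at that point the outcome is still unknown.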
A mental shift
Stop asking "did it work?" and start asking "what do I do when I cannot tell?" The honest answer to the second question drives retries, compensation, and reconciliation.
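One way to make this shift concrete is to give "cannot tell" its own value instead of folding it into failure. The sketch below is a hypothetical policy, not a standard one: the `Outcome` enum, the `next_step` function, and the action names are all invented for illustration.

```python
import enum

class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"      # the remote side definitely did not apply the call
    UNKNOWN = "unknown"    # timed out: it may or may not have been applied

def next_step(outcome, idempotent):
    """Pick a follow-up action for a remote call (hypothetical policy)."""
    if outcome is Outcome.SUCCEEDED:
        return "continue"
    if outcome is Outcome.FAILED:
        return "retry"                 # nothing happened, so repeating is safe
    # UNKNOWN: retry only if a duplicate is harmless; otherwise find out first
    return "retry" if idempotent else "reconcile"

print(next_step(Outcome.UNKNOWN, idempotent=True))
print(next_step(Outcome.UNKNOWN, idempotent=False))
```

Here "reconcile" stands for asking the remote side what actually happened before acting. Compensation would follow only if that check reveals an effect that has to be undone.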
Key idea
Partial failure means some nodes die while others live, and you can never be sure a remote call succeeded, so design for uncertainty with timeouts and idempotency.