The Partial Failure Model

The core difference

In a single program, a crash takes everything down at once. In a distributed system, one node can fail while its peers keep running. This is partial failure, and it is the defining hardship of distributed computing.

Why it is hard

A caller cannot tell a slow node from a dead node
A request may have succeeded even though the reply was lost
Nodes can fail independently, so any subset can be down at once
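
The first two points can be sketched as a small simulation. This is an illustration only: the Fault, Server, and call names are hypothetical, and a return value of None stands in for a timeout. It shows that a lost request and a lost reply look identical to the caller even though the server ends up in different states.

```python
from enum import Enum


class Fault(Enum):
    NONE = "none"
    REQUEST_LOST = "request_lost"
    REPLY_LOST = "reply_lost"


class Server:
    """Hypothetical remote node that counts how many requests it applied."""

    def __init__(self):
        self.applied = 0

    def handle(self):
        self.applied += 1
        return "ok"


def call(server, fault):
    """Return what the caller observes: a reply, or None for a timeout."""
    if fault is Fault.REQUEST_LOST:
        return None  # the server never saw the request
    reply = server.handle()
    if fault is Fault.REPLY_LOST:
        return None  # the server did the work, but the caller never hears back
    return reply


a, b = Server(), Server()
seen_a = call(a, Fault.REQUEST_LOST)
seen_b = call(b, Fault.REPLY_LOST)

# The caller's view is the same in both cases, yet the servers disagree.
print(seen_a, seen_b)        # what the caller sees
print(a.applied, b.applied)  # what actually happened
```

From the caller's side both calls simply time out, so no amount of local inspection can say whether the work was done.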

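The Two Generals problem

Over an unreliable channel, no finite exchange of messages can make both sides certain they agree, because the acknowledgment can be lost too, and so can the acknowledgment of that acknowledgment. You can never be fully certain a remote action happened, so you design for uncertainty, not for guaranteed knowledge.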

Coping strategies

Timeouts turn silence into a decision
Idempotency lets you safely retry without double effects
Health checks and heartbeats flag peers that are probably dead
Bulkheads keep one failure from sinking the whole system
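
A minimal sketch of the first two strategies working together, under stated assumptions: Account, LossyChannel, and deposit_with_retry are hypothetical names, the channel is rigged to drop a fixed number of replies, and a raised Timeout stands in for a real deadline expiring. The client reuses one idempotency key across all attempts, so the server applies the deposit once no matter how many retries arrive.

```python
import uuid


class Timeout(Exception):
    """Raised when no reply arrives before the (simulated) deadline."""


class Account:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self.balance = 0
        self.seen = {}  # idempotency key -> result of the first attempt

    def deposit(self, key, amount):
        if key in self.seen:
            return self.seen[key]  # repeat request: replay the stored result
        self.balance += amount
        self.seen[key] = self.balance
        return self.balance


class LossyChannel:
    """Delivers every request but drops the first `drop_replies` replies."""

    def __init__(self, account, drop_replies):
        self.account = account
        self.drop_replies = drop_replies

    def deposit(self, key, amount):
        result = self.account.deposit(key, amount)  # the server acts
        if self.drop_replies > 0:
            self.drop_replies -= 1
            raise Timeout("no reply before deadline")
        return result


def deposit_with_retry(channel, amount, attempts=3):
    key = str(uuid.uuid4())  # one key shared by every attempt
    for _ in range(attempts):
        try:
            return channel.deposit(key, amount)
        except Timeout:
            continue  # silence became a decision: try again
    raise Timeout(f"gave up after {attempts} attempts")


account = Account()
result = deposit_with_retry(LossyChannel(account, drop_replies=2), 100)
print(result, account.balance)
```

Without the key, the two dropped replies would have caused three deposits. A real system would also bound how long keys are remembered and add backoff between attempts, both of which this sketch leaves out.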

A mental shift

Stop asking "did it work?" and start asking "what do I do when I cannot tell?" The honest answer drives retries, compensation, and reconciliation.
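
One way to make that shift concrete is to give "cannot tell" its own value instead of folding it into failure. The sketch below is an assumption-laden illustration, not a prescribed design: PaymentService, try_charge, and reconcile are hypothetical, and the lose_reply flag exists only to simulate a dropped reply.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    UNKNOWN = "unknown"  # the call timed out, so the caller cannot tell


class PaymentService:
    """Hypothetical remote service with a ledger keyed by order id."""

    def __init__(self):
        self.ledger = {}

    def charge(self, order_id, amount, lose_reply=False):
        self.ledger[order_id] = amount  # the work happens on the server
        if lose_reply:
            raise TimeoutError("reply lost")  # simulated dropped reply
        return True

    def lookup(self, order_id):
        return self.ledger.get(order_id)


def try_charge(service, order_id, amount, lose_reply=False):
    """Report UNKNOWN on a timeout rather than guessing success or failure."""
    try:
        service.charge(order_id, amount, lose_reply=lose_reply)
        return Outcome.SUCCEEDED
    except TimeoutError:
        return Outcome.UNKNOWN


def reconcile(service, order_id):
    """Resolve an UNKNOWN by asking the service what it actually recorded."""
    if service.lookup(order_id) is not None:
        return Outcome.SUCCEEDED
    return Outcome.FAILED


service = PaymentService()
first = try_charge(service, "order-1", 25, lose_reply=True)
resolved = reconcile(service, "order-1") if first is Outcome.UNKNOWN else first
print(first, resolved)
```

The lookup here is assumed to be reliable, which a real reconciliation step cannot take for granted: the query can time out too, and a "not found" answer is only trustworthy once the original request can no longer be in flight.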

Key idea

Partial failure means some nodes die while others live, and you can never be sure a remote call succeeded, so design for uncertainty with timeouts and idempotency.