The Event
A failover is the process of promoting a replica to primary when the current primary becomes unavailable. Promotion is the moment a replica stops following and starts accepting writes. Get it wrong and you lose data or end up with two writable primaries.
The Steps
A sound failover does several things in order:
- Detect the failure reliably, not just a transient blip, using a timeout and multiple observers.
- Choose the most up to date replica, the one with the longest applied log, to minimize lost writes.
- Promote that replica and redirect clients to it.
- Fence the old primary so it cannot accept writes if it comes back.
The Split Brain Danger
The worst outcome is split brain: the old primary returns, still thinks it is leader, and both nodes take writes. Resolving the divergence later means discarding someone's data. Fencing, often via a coordination service or a lease, prevents this by making leadership exclusive.
Key idea
Failover promotes the most current replica after reliably detecting failure, and fencing the old primary is essential to prevent split brain where two nodes accept writes.