Failover and Promotion

When a primary dies, a replica must be promoted to take writes, and doing it safely means never having two primaries at once.

The Event

A failover is the process of promoting a replica to primary when the current primary becomes unavailable. Promotion is the moment a replica stops following and starts accepting writes. Get it wrong and you lose data or end up with two writable primaries.

The Steps

A sound failover does several things in order:

Detect the failure reliably, telling a real outage apart from a transient blip, by using a timeout and multiple observers.
Choose the most up-to-date replica, the one that has applied the most of the primary's log, to minimize lost writes.
Promote that replica and redirect clients to it.
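Fence the old primary so it cannot accept writes if it comes back. The fence has to be in effect by the time the new primary starts taking writes, or the two can overlap.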
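
The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the names Replica, applied_lsn, primary_is_down, and choose_candidate are hypothetical, and it assumes each observer records the time it last heard from the primary and each replica reports the log position it has applied.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_lsn: int  # hypothetical: position of the last log entry applied


def primary_is_down(last_heard, now, timeout, quorum):
    """Declare failure only if at least `quorum` observers have timed out.

    last_heard maps observer name -> time it last heard from the primary.
    A single observer timing out may just be a network blip on its side.
    """
    timed_out = sum(1 for t in last_heard.values() if now - t > timeout)
    return timed_out >= quorum


def choose_candidate(replicas):
    """Pick the replica that has applied the most log, to lose the fewest writes."""
    return max(replicas, key=lambda r: r.applied_lsn)


if __name__ == "__main__":
    observers = {"obs-a": 100.0, "obs-b": 91.0, "obs-c": 90.0}
    replicas = [Replica("r1", 120), Replica("r2", 135), Replica("r3", 98)]
    if primary_is_down(observers, now=102.0, timeout=10.0, quorum=2):
        print("promote", choose_candidate(replicas).name)
```

Requiring a quorum of observers rather than one is what keeps a single flaky network link from triggering an unnecessary failover.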

The Split Brain Danger

The worst outcome is split brain: the old primary returns, still thinks it is leader, and both nodes take writes. Resolving the divergence later means discarding someone's data. Fencing, often via a coordination service or a lease, prevents this by making leadership exclusive.
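
One common way to make leadership exclusive is a fencing token: a number that grows with every promotion and that downstream storage checks on each write. The sketch below assumes that scheme; Coordinator, Storage, and the epoch field are hypothetical names for illustration, not the API of any particular coordination service.

```python
class Coordinator:
    """Hypothetical coordination service that hands out a growing epoch."""

    def __init__(self):
        self.epoch = 0
        self.leader = None

    def promote(self, node):
        # Every promotion gets a strictly larger epoch (the fencing token).
        self.epoch += 1
        self.leader = node
        return self.epoch


class Storage:
    """Hypothetical shared storage that rejects writes from stale leaders."""

    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        # A write carrying an older epoch comes from a deposed primary.
        if epoch < self.highest_epoch:
            raise PermissionError(
                f"stale epoch {epoch}, current is {self.highest_epoch}"
            )
        self.highest_epoch = epoch
        self.data[key] = value


if __name__ == "__main__":
    coord, store = Coordinator(), Storage()
    old_epoch = coord.promote("old-primary")
    store.write(old_epoch, "k", "v1")
    new_epoch = coord.promote("new-primary")
    store.write(new_epoch, "k", "v2")
    try:
        store.write(old_epoch, "k", "late write")  # old primary comes back
    except PermissionError as err:
        print("rejected:", err)
```

In this sketch the old primary does not need to know it was deposed. The storage layer refuses its writes once it has seen a newer epoch, so the check holds even if the old node's own view is stale.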

Key idea

Failover promotes the most current replica after reliably detecting failure, and fencing the old primary is essential to prevent split brain, where two nodes accept writes at once.
