When a whole region fails
A single region can fail entirely from a power, network, or software fault. Multi-region failover means running in more than one region so that traffic can shift to a healthy region when one fails. The hard parts are detecting the failure, redirecting traffic, and keeping data consistent across the move.
The moving pieces
- Health detection decides that a region is unhealthy. Checks usually run from outside the region, because a failing region cannot be trusted to report its own state.
- Traffic redirection moves users to the surviving region, often by updating DNS or a global load balancer.
- Data replication keeps regions in sync. Asynchronous replication can lose the last few writes on failover, and the size of that window determines the recovery point objective (RPO) the system can achieve.
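The three pieces above can be sketched in a few lines. This is a minimal illustration, not a production design: the names `ProbeResult`, `unhealthy_regions`, `pick_target`, and `lost_write_window`, the probe vantage points, and the majority threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    vantage: str   # where the probe ran (hypothetical probe location)
    region: str    # region that was probed
    healthy: bool


def unhealthy_regions(results, quorum=0.5):
    """Regions that more than `quorum` of external probes report as failing."""
    votes = {}
    for r in results:
        if r.vantage == r.region:
            continue  # ignore self-reports: a failing region cannot vouch for itself
        ok, total = votes.get(r.region, (0, 0))
        votes[r.region] = (ok + (1 if r.healthy else 0), total + 1)
    return {
        region
        for region, (ok, total) in votes.items()
        if (total - ok) / total > quorum
    }


def pick_target(current, regions, unhealthy):
    """Keep the current region if it is healthy, else the first healthy one."""
    if current not in unhealthy:
        return current
    for region in regions:
        if region not in unhealthy:
            return region
    return None  # nothing healthy to fail over to


def lost_write_window(last_primary_write, last_replicated_write):
    """Seconds of writes that asynchronous replication had not yet shipped."""
    return max(0.0, last_primary_write - last_replicated_write)
```

In a real system, the result of `pick_target` would drive the DNS or load balancer update, and the value from `lost_write_window` would be compared against the RPO the service has committed to.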
Active-passive versus active-active
- Active-passive keeps one region serving and another on standby. Failover is simpler, but the standby capacity sits idle.
- Active-active serves from both regions at once. It uses capacity well but must handle conflicting writes and cross-region consistency.
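To make the conflicting-writes problem concrete, here is a sketch of one common resolution rule, last writer wins. The text does not prescribe this rule; it is chosen here only because it is short. The `Versioned` type and both function names are hypothetical, and the approach assumes region clocks are reasonably synchronized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # write time, assumed comparable across regions
    region: str       # tie-breaker when timestamps are equal


def merge(a, b):
    """Keep the later write; break timestamp ties by region name."""
    return a if (a.timestamp, a.region) >= (b.timestamp, b.region) else b


def merge_stores(local, remote):
    """Merge two regions' key-value stores so both converge on the same result."""
    merged = dict(local)
    for key, incoming in remote.items():
        merged[key] = merge(merged[key], incoming) if key in merged else incoming
    return merged
```

Because the rule is deterministic and symmetric, both regions reach the same state regardless of merge order. The cost is that the losing write is silently discarded, which is why some systems prefer application-level merging or a single writer per key.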
Beware split brain, where both regions believe they are primary and accept writes independently. Avoid it by keeping a single source of truth for which region is active.
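One way to picture a single source of truth is a lease with an epoch number. The sketch below is illustrative only: `ActiveRegionRegistry` is a hypothetical in-memory stand-in for what would in practice be a strongly consistent store that itself survives a region failure, and the caller passes the time in as `now` to keep the example deterministic.

```python
import threading


class ActiveRegionRegistry:
    """In-memory stand-in for a consistent store recording the active region."""

    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None
        self._expires_at = 0.0
        self._epoch = 0

    def acquire(self, region, now, ttl=10.0):
        """Try to become (or remain) active. Returns the epoch, or None if refused."""
        with self._lock:
            free = self._holder is None or now >= self._expires_at
            if free or self._holder == region:
                if self._holder != region:
                    self._epoch += 1  # new holder: older epochs become stale
                self._holder = region
                self._expires_at = now + ttl
                return self._epoch
            return None

    def is_current(self, region, epoch, now):
        """True only for the present holder with an unexpired lease."""
        with self._lock:
            return (
                self._holder == region
                and self._epoch == epoch
                and now < self._expires_at
            )
```

A region would call `acquire` periodically to renew its lease and check `is_current` before accepting writes. A region that was cut off and later returns holds an old epoch, so its writes can be rejected rather than competing with the new primary.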
Key idea
Multi-region failover detects a dead region, redirects traffic to a healthy one, and replicates data carefully while avoiding split brain.