Planning for the worst day
Disaster recovery is the plan for restoring service after a major loss, such as a region going down or data being corrupted. Two numbers anchor the plan and drive its cost.
The two targets
- Recovery point objective (RPO) is how much data you can afford to lose, measured in time. An RPO of five minutes means the newest backup or replica must never be more than five minutes behind production, so at most five minutes of data is lost.
- Recovery time objective (RTO) is how long you can be down. An RTO of one hour means service must be restored within an hour of the disaster.
A tighter RPO needs more frequent or continuous replication. A tighter RTO needs warmer standby capacity ready to take over.
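To make the link between RPO and replication frequency concrete, here is a minimal Python sketch. It assumes a simple periodic backup schedule in which each backup captures the data as of its start time and only becomes usable once it finishes. The function names and the numbers in the usage note are illustrative assumptions, not part of any particular tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst-case age of the newest usable backup (assumed model).

    Just before the next backup finishes, the newest complete copy
    reflects data from one interval plus one backup duration ago.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo
```

Under this model, a backup every four minutes that takes two minutes to complete can lose up to six minutes of data, so it would miss a five-minute RPO even though the interval alone looks tight enough.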
Matching strategy to targets
- Backup and restore is cheap but slow, so it fits a loose RTO.
- Warm standby keeps a scaled-down copy running, which supports a moderate RTO.
- Hot standby runs a full second site for a near-zero RTO at high cost.
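The mapping from RTO to strategy can be sketched as a simple lookup. The cut-off values below are hypothetical, chosen only to illustrate the ordering of the three options. Real thresholds depend on the system, its restore speed, and the budget.

```python
from datetime import timedelta

# Hypothetical cut-offs for illustration only; they are not industry standards.
HOT_STANDBY_MAX_RTO = timedelta(minutes=5)
WARM_STANDBY_MAX_RTO = timedelta(hours=4)


def pick_strategy(rto: timedelta) -> str:
    """Return the cheapest of the three strategies assumed to meet the RTO."""
    if rto <= HOT_STANDBY_MAX_RTO:
        return "hot standby"
    if rto <= WARM_STANDBY_MAX_RTO:
        return "warm standby"
    return "backup and restore"
```

With these assumed cut-offs, a one-hour RTO maps to warm standby, while a one-day RTO can be served by backup and restore.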
Test the recovery regularly, because an untested plan usually fails when it is finally needed.
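One way to make a recovery test measurable is to record three timestamps during each drill and compare the results with the targets. The sketch below is an assumption about how such a record might look. `DrillResult`, its field names, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    disaster_at: datetime        # when the simulated failure was triggered
    last_good_copy_at: datetime  # timestamp of the newest data that was recovered
    restored_at: datetime        # when service was confirmed healthy again

    @property
    def data_loss(self) -> timedelta:
        """Achieved recovery point: how much recent data was not recovered."""
        return self.disaster_at - self.last_good_copy_at

    @property
    def downtime(self) -> timedelta:
        """Achieved recovery time: how long service was unavailable."""
        return self.restored_at - self.disaster_at


def drill_passed(result: DrillResult, rpo: timedelta, rto: timedelta) -> bool:
    """True only if the drill stayed within both targets."""
    return result.data_loss <= rpo and result.downtime <= rto


# Made-up drill: 3 minutes of data lost, 48 minutes of downtime.
drill = DrillResult(
    disaster_at=datetime(2024, 1, 1, 12, 0),
    last_good_copy_at=datetime(2024, 1, 1, 11, 57),
    restored_at=datetime(2024, 1, 1, 12, 48),
)
print(drill_passed(drill, rpo=timedelta(minutes=5), rto=timedelta(hours=1)))  # True
```

Tracking these two measured values across drills shows whether the plan still meets its targets as the system grows.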
Key idea
RPO is the tolerable data loss and RTO is the tolerable downtime, and together they decide how much standby and replication you must pay for.