AWS Scenarios
Multi-Region Disaster Recovery On AWS
Design a multi-Region disaster recovery strategy using RTO, RPO, backup and restore, pilot light, warm standby, active-active, Route 53, ARC routing controls, data replication, and failover testing.
After this, you will understand
Disaster recovery becomes much easier to reason about once you stop asking which service is best and start asking which recovery objectives the business is paying for.
Choose the lowest-cost recovery pattern that can meet the required RTO and RPO, then test the failover path before a real outage.
Common failures: teams keep backups but never test a restore, replicate corrupted data, or build a standby Region that cannot actually take traffic.
Translate business impact into RTO and RPO, choose backup and restore, pilot light, warm standby, or active-active, and design traffic, data, identity, capacity, and rollback together.
Think before reading: Why is active-active not automatically the best DR answer?
Concepts Covered
- Disaster recovery strategy
- Recovery Time Objective
- Recovery Point Objective
- Backup and restore
- Pilot light
- Warm standby
- Active-active
- DNS and traffic failover
- Data replication and corruption risk
- SAA-C03 resilience traps
1. Situation
A company runs a customer-facing application in one AWS Region. The application uses an ALB, EC2 Auto Scaling, Amazon RDS, S3, Route 53, CloudFront, CloudWatch, and IaC for deployments.
The business asks for a disaster recovery plan. This does not automatically mean "run everything in two Regions all the time." It means the team must decide how much downtime and data loss the business can tolerate.
The two numbers that matter are:
RTO = how long can the system be unavailable?
RPO = how much recently written data can be lost?
The architecture follows those numbers. A workload that can be down for several hours can use a cheaper pattern than a payment system that must recover in minutes.
2. Naive Design
The naive answer is "enable backups" and call that disaster recovery.
Backups are necessary, but a backup alone is not a running application. Someone still has to create infrastructure, restore data, update DNS, confirm credentials, scale capacity, and verify that dependencies work in the recovery Region.
Another naive answer is "just replicate everything." Replication can reduce data loss, but it can also replicate bad writes, accidental deletes, application bugs, and ransomware-encrypted objects if the design has no immutability or recovery points.
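A toy model makes the replication risk concrete. The sketch below does not call any AWS API; `replicate`, the order records, and the recovery point are hypothetical stand-ins. It only illustrates that a replica faithfully copies an accidental delete, while a recovery point taken before the incident still holds the data.

```python
import copy


def replicate(primary: dict) -> dict:
    """Toy replication: the replica mirrors whatever the primary holds now."""
    return copy.deepcopy(primary)


# Hypothetical order data in the primary Region.
primary = {"order-1": "paid", "order-2": "shipped"}

# A recovery point taken before the incident (for example, a nightly backup).
recovery_point = copy.deepcopy(primary)

# An application bug deletes a record, and replication copies the delete.
del primary["order-2"]
replica = replicate(primary)

replica_has_order = "order-2" in replica
recovery_point_has_order = "order-2" in recovery_point
print(replica_has_order, recovery_point_has_order)  # False True
```

The replica is current but wrong; the recovery point is older but intact. A design usually needs both properties, which is why replication alone is not a recovery plan.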
A third mistake is building a beautiful secondary Region that no one has tested. DR that only exists in diagrams often fails during the first real event.
3. What Breaks
Recovery time breaks when infrastructure is not already defined. If the team must click through consoles during an outage, RTO becomes wishful thinking.
Recovery point breaks when data movement does not match the business need. Daily backups may be fine for an internal reporting system but unacceptable for an order system that cannot lose a day of orders.
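The backup arithmetic can be written down directly. This is a minimal sketch, assuming the worst case is a failure just before the next backup becomes usable in the recovery Region; the function names, the one-hour copy delay, and the RPO values are illustrative assumptions, not figures from any AWS service.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         copy_delay: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on lost writes for periodic backups.

    If the failure hits just before the next backup is usable in the
    recovery Region, everything since the previous backup is lost.
    """
    return backup_interval + copy_delay


def meets_rpo(backup_interval: timedelta, copy_delay: timedelta,
              rpo: timedelta) -> bool:
    """True when the worst-case loss stays within the required RPO."""
    return worst_case_data_loss(backup_interval, copy_delay) <= rpo


daily = timedelta(hours=24)
copy_delay = timedelta(hours=1)  # assumed cross-Region copy time

# Hypothetical requirements: reporting tolerates 48 hours, orders 15 minutes.
print(meets_rpo(daily, copy_delay, timedelta(hours=48)))    # True
print(meets_rpo(daily, copy_delay, timedelta(minutes=15)))  # False
```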
Traffic failover breaks when DNS records, health checks, certificates, origin policies, and client behavior are not tested.
Capacity breaks when the standby Region has data but not enough compute, quotas, NAT capacity, database class capacity, or third-party integrations.
Security breaks when the recovery Region lacks KMS keys, IAM roles, secrets, CloudTrail, log delivery, or least-privilege access.
4. AWS Architecture
Start with recovery objectives, then choose a strategy.
Backup and restore keeps backups in another place and rebuilds the workload during recovery. It is usually the lowest-cost option and has the longest RTO.
Pilot light keeps the critical core running or ready in the recovery Region, often including replicated data and minimal infrastructure. During recovery, the team scales out the rest of the stack.
Warm standby keeps a complete but smaller version of the workload running in the recovery Region. Recovery means scaling up and shifting traffic.
Active-active runs production traffic in more than one Region. It can provide very low RTO, but the hardest part becomes data consistency, routing, operational discipline, and failure isolation.
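The selection rule, "lowest-cost pattern that meets the required RTO and RPO", can be sketched as a short lookup. The achievable RTO and RPO values below are placeholder assumptions for illustration only; real values should come from your own failover and restore tests, not from this table.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rto: timedelta
    achievable_rpo: timedelta


# Ordered from lowest to highest cost. Timings are illustrative assumptions.
STRATEGIES = [
    Strategy("backup and restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("pilot light", timedelta(hours=1), timedelta(minutes=15)),
    Strategy("warm standby", timedelta(minutes=15), timedelta(minutes=5)),
    Strategy("active-active", timedelta(minutes=1), timedelta(minutes=1)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both objectives."""
    for strategy in STRATEGIES:
        if strategy.achievable_rto <= rto and strategy.achievable_rpo <= rpo:
            return strategy
    return None  # no listed pattern meets the objectives


choice = cheapest_strategy(rto=timedelta(minutes=30), rpo=timedelta(minutes=10))
print(choice.name if choice else "no match")  # warm standby
```

Because the list is ordered by cost, the function never returns active-active when a cheaper pattern already satisfies the objectives.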
Route 53 can route traffic using failover, weighted, latency, or other policies. Amazon Application Recovery Controller routing controls can give operators a safer manual failover control for whole application replicas.
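As a rough illustration of failover routing, the sketch below builds the change batch for a primary and a secondary Route 53 failover record as plain Python dictionaries. The domain name, IP addresses (from documentation ranges), set identifiers, and health check ID are hypothetical. An ALB-fronted stack would more likely use alias records; plain A records keep the sketch short.

```python
from typing import Optional


def failover_change(name: str, role: str, ip_address: str,
                    health_check_id: Optional[str] = None,
                    ttl: int = 60) -> dict:
    """Build one UPSERT change for a Route 53 failover routing record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",  # hypothetical identifier
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        # The primary record is only served while this check is healthy.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Failover routing sketch",
    "Changes": [
        failover_change("app.example.com.", "PRIMARY", "192.0.2.10",
                        health_check_id="hypothetical-health-check-id"),
        failover_change("app.example.com.", "SECONDARY", "198.51.100.10"),
    ],
}
```

With boto3, a dictionary of this shape would be passed as `ChangeBatch` to the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId`. That call is omitted here so the sketch runs without AWS credentials.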
5. Request Or Data Flow
In normal operation, users reach the primary Region through Route 53, CloudFront, Global Accelerator, or another entry point.
Application writes land in the primary data store. Depending on the chosen pattern, data is backed up, snapshot-copied, asynchronously replicated, or globally replicated to another Region.
If the primary Region fails, the recovery process starts:
detect failure
confirm failover criteria
prepare or scale standby
verify data state
shift traffic
monitor errors and latency
decide failback path later
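The steps above can be sketched as an ordered runbook in which each step must succeed before the next one runs. This is a minimal sketch: `run_failover` and the lambda checks are hypothetical placeholders for real automation or operator confirmations.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure."""
    completed = []
    for name, action in steps:
        if not action():
            break  # do not continue past a failed gate
        completed.append(name)
    return completed


# Placeholder checks. Here the data verification fails on purpose.
steps: List[Step] = [
    ("detect failure", lambda: True),
    ("confirm failover criteria", lambda: True),
    ("prepare or scale standby", lambda: True),
    ("verify data state", lambda: False),
    ("shift traffic", lambda: True),
    ("monitor errors and latency", lambda: True),
    ("decide failback path later", lambda: True),
]

completed = run_failover(steps)
print(completed)
```

In this run, traffic is never shifted because the data check failed first. Ordering the gates this way avoids sending users to a Region whose data has not been verified.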
The failback path matters. Returning to the primary Region after recovery can be harder than failing over, especially if writes happened in the recovery Region.
6. Security Controls
Replicate or recreate IAM roles, KMS keys, secrets, certificates, and security groups deliberately. Do not assume a restored database is useful if the application cannot decrypt secrets or connect to it.
Protect backups with vault policies, cross-account copies, and immutability where appropriate. A compromised production account should not be able to delete every recovery point.
Log the recovery environment. CloudTrail, CloudWatch, Config, GuardDuty, and Security Hub should not disappear during a disaster.
Limit who can trigger failover. Failover is a powerful operational action and can become an outage if used incorrectly.
7. Resilience Controls
Test restore regularly. A backup that has never been restored is not proven.
Test failover with realistic runbooks. Include DNS behavior, client retries, cache behavior, database promotion, queue draining, and downstream dependencies.
Keep infrastructure as code for both Regions. This reduces configuration drift and helps rebuild consistently.
Use health checks carefully. Fully automatic failover can help simple systems, but complex systems often need operator confirmation to avoid failing over for a partial or false signal.
Track replication lag. Low RTO does not guarantee low RPO.
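One simple way to track this is to compare the worst observed replication lag with the RPO, leaving a safety margin. The sketch below assumes lag samples in seconds, which might come from a metric such as the RDS `ReplicaLag` CloudWatch metric if that matches the chosen database; the function name, the 80 percent margin, and the sample values are illustrative assumptions.

```python
from datetime import timedelta
from typing import Iterable


def rpo_at_risk(lag_samples_seconds: Iterable[float], rpo: timedelta,
                margin: float = 0.8) -> bool:
    """True when the worst observed lag exceeds margin * RPO.

    Missing data is treated as a risk, because unknown lag cannot be
    shown to satisfy the RPO.
    """
    samples = list(lag_samples_seconds)
    if not samples:
        return True
    return max(samples) > rpo.total_seconds() * margin


rpo = timedelta(minutes=1)
print(rpo_at_risk([2.0, 5.0, 40.0], rpo))  # False: 40 s is under the 48 s limit
print(rpo_at_risk([2.0, 5.0, 55.0], rpo))  # True: 55 s exceeds the 48 s limit
```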
8. Performance Controls
Warm standby and active-active require capacity planning. A standby Region that can only handle five percent of normal traffic may fail under full production load unless scaling is fast and tested.
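A back-of-the-envelope estimate shows why the five percent case is risky. The sketch assumes capacity grows linearly at a fixed number of instances per minute, which real Auto Scaling behavior, quotas, and warm-up times may not match; all numbers are hypothetical.

```python
import math


def minutes_to_full_capacity(required_instances: int,
                             standby_instances: int,
                             instances_added_per_minute: float) -> int:
    """Estimate minutes until the standby Region can carry full load."""
    if instances_added_per_minute <= 0:
        raise ValueError("scale-out rate must be positive")
    shortfall = max(0, required_instances - standby_instances)
    return math.ceil(shortfall / instances_added_per_minute)


# Standby sized at 5 of the 100 instances that production load needs.
scale_minutes = minutes_to_full_capacity(100, 5, instances_added_per_minute=10)
print(scale_minutes)  # 10
```

If the RTO budget also has to cover detection, the failover decision, and DNS propagation, ten minutes of scale-out may already consume most of it.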
Global users may benefit from multi-Region routing even outside disaster recovery, but read latency and write consistency are separate concerns.
Synchronous cross-Region writes are rarely the default answer because they add latency and coupling. Many AWS services use asynchronous replication patterns, so design for possible lag.
For static assets, S3 replication and CloudFront can reduce regional dependency. For databases, choose the service-specific replication pattern rather than assuming all databases behave the same way.
9. Cost Controls
DR cost rises as RTO and RPO shrink.
Backup and restore is the cheapest because little compute runs in the recovery Region. Pilot light costs more because critical infrastructure is already in place. Warm standby costs more again because the whole stack runs at reduced size. Active-active is the most expensive, both operationally and financially.
Replication, data transfer, duplicate environments, observability, KMS requests, and Route 53 or ARC controls all affect cost.
Do not buy active-active because it sounds mature. Buy it only when the business requirement justifies the complexity.
10. Exam Variants
"Lowest cost and can tolerate hours of downtime" often points to backup and restore.
"Core infrastructure and replicated data exist, but full capacity is launched during recovery" points to pilot light.
"Complete environment is already running at reduced capacity" points to warm standby.
"Both Regions actively serve production traffic" points to active-active.
"Need controlled failover of an entire application stack" can point to ARC routing controls with Route 53.
"Need meet specific RTO/RPO" means choose the strategy that satisfies those objectives, not the one with the most services.
11. Common Traps
Do not confuse high availability across Availability Zones with disaster recovery across Regions.
Do not assume backups equal recovery.
Do not replicate data without thinking about corruption and deletion.
Do not forget quotas and capacity in the recovery Region.
Do not assume DNS failover is instant.
Do not ignore failback.
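The DNS trap can be made concrete with a rough estimate: health checks need several failed probes before a record is marked unhealthy, and resolvers may keep serving the old answer until the TTL expires. The formula below is a simplification under those assumptions, and the interval, threshold, and TTL values are examples rather than recommendations.

```python
def estimated_dns_failover_seconds(check_interval_s: int,
                                   failure_threshold: int,
                                   record_ttl_s: int) -> int:
    """Rough time until most clients resolve to the recovery Region.

    Detection takes about interval * threshold; cached answers can then
    persist for up to one TTL. Clients that ignore TTLs can take longer.
    """
    detection = check_interval_s * failure_threshold
    return detection + record_ttl_s


# Assumed settings: 30 s checks, 3 consecutive failures, 60 s TTL.
print(estimated_dns_failover_seconds(30, 3, 60))  # 150
```

Even with these fairly tight settings the estimate is two and a half minutes, before any database promotion or scale-out is counted.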
12. Related Topics
Review Amazon Route 53, AWS Backup, S3 Replication, Amazon Aurora, and AWS Well-Architected Tool.