AWS Scenarios

Multi-Region Disaster Recovery On AWS

Design a multi-Region disaster recovery strategy using RTO, RPO, backup and restore, pilot light, warm standby, active-active, Route 53, ARC routing controls, data replication, and failover testing.

Intermediate | 6 min read | Updated 2026-06-03 | Cloud, Certification, Reliability, Operations

RTO, RPO, Backup And Restore, Pilot Light, Warm Standby, Active-Active, Route 53 Failover, ARC Routing Control

After this, you will understand

Disaster recovery becomes much easier to reason about once learners stop asking which service is best and start asking which recovery objectives the business is willing to pay for.

Plain version

Choose the lowest-cost recovery pattern that can meet the required RTO and RPO, then test the failover path before a real outage.

Decision pressure

Teams keep backups but never test restore, replicate corrupted data, or build a standby Region that cannot actually take traffic.

Exam-ready model

Translate business impact into RTO and RPO, choose backup and restore, pilot light, warm standby, or active-active, and design traffic, data, identity, capacity, and rollback together.

Think before reading: Why is active-active not automatically the best DR answer?

It lowers recovery time but adds cost, consistency, routing, deployment, and operational complexity that many workloads do not need.

Concepts Covered

Disaster recovery strategy
Recovery Time Objective
Recovery Point Objective
Backup and restore
Pilot light
Warm standby
Active-active
DNS and traffic failover
Data replication and corruption risk
SAA-C03 resilience traps

1. Situation

A company runs a customer-facing application in one AWS Region. The application uses an ALB, EC2 Auto Scaling, Amazon RDS, S3, Route 53, CloudFront, CloudWatch, and IaC for deployments.

The business asks for a disaster recovery plan. This does not automatically mean "run everything in two Regions all the time." It means the team must decide how much downtime and data loss the business can tolerate.

The two numbers that matter are:

RTO (Recovery Time Objective) = how long can the system be unavailable?
RPO (Recovery Point Objective) = how much recently written data can be lost, measured as a window of time?

The architecture follows those numbers. A workload that can be down for several hours can use a cheaper pattern than a payment system that must recover in minutes.

2. Naive Design

The naive answer is "enable backups" and call that disaster recovery.

Backups are necessary, but a backup alone is not a running application. Someone still has to create infrastructure, restore data, update DNS, confirm credentials, scale capacity, and verify that dependencies work in the recovery Region.

Another naive answer is "just replicate everything." Replication can reduce data loss, but it can also replicate bad writes, accidental deletes, application bugs, and ransomware-encrypted objects if the design has no immutability or recovery points.
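
A toy model makes the difference between replication and recovery points concrete. This is a sketch, not an AWS API: `replicate` and `snapshot` are hypothetical helpers, and the dictionaries stand in for a primary data store, its cross-Region replica, and a point-in-time recovery point.

```python
import copy

def replicate(primary: dict, replica: dict) -> None:
    """Make the replica mirror the primary, including deletions and bad writes."""
    replica.clear()
    replica.update(primary)

def snapshot(primary: dict) -> dict:
    """Return an independent point-in-time copy (a recovery point)."""
    return copy.deepcopy(primary)

primary = {"order-1": "paid", "order-2": "paid"}
replica: dict = {}

replicate(primary, replica)
recovery_point = snapshot(primary)   # taken while the data was still good

# A buggy deploy corrupts one record and deletes another.
primary["order-1"] = "CORRUPTED"
del primary["order-2"]
replicate(primary, replica)          # replication faithfully copies the damage

print("replica:       ", replica)
print("recovery point:", recovery_point)
```

The replica ends up exactly as damaged as the primary, while the earlier recovery point still holds both good records. That is why replication and retained recovery points are usually treated as complementary rather than interchangeable.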

A third mistake is building a beautiful secondary Region that no one has tested. DR that only exists in diagrams often fails during the first real event.

3. What Breaks

Recovery time breaks when infrastructure is not already defined. If the team must click through consoles during an outage, RTO becomes wishful thinking.

Recovery point breaks when data movement does not match the business need. Daily backups may be fine for an internal reporting system, but terrible for an order system.

Traffic failover breaks when DNS records, health checks, certificates, origin policies, and client behavior are not tested.

Capacity breaks when the standby Region has data but not enough compute, quotas, NAT capacity, database class capacity, or third-party integrations.

Security breaks when the recovery Region lacks KMS keys, IAM roles, secrets, CloudTrail, log delivery, or least-privilege access.

4. AWS Architecture

Start with recovery objectives, then choose a strategy.

Backup and restore keeps backups in another Region (and ideally another account) and rebuilds the workload during recovery. It is usually the lowest-cost option and has the longest RTO.

Pilot light keeps the critical core running or ready in the recovery Region, often including replicated data and minimal infrastructure. During recovery, the team scales out the rest of the stack.

Warm standby keeps a complete but smaller version of the workload running in the recovery Region. Recovery means scaling up and shifting traffic.

Active-active runs production traffic in more than one Region. It can provide very low RTO, but the hardest part becomes data consistency, routing, operational discipline, and failure isolation.
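
The selection rule (pick the cheapest pattern that still meets both objectives) can be sketched as a small lookup. The RTO and RPO figures below are illustrative assumptions, not AWS guarantees; real values come from measuring your own workload in failover tests. `Strategy`, `STRATEGIES`, and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Strategy:
    name: str
    relative_cost: int    # 1 = cheapest; an ordering only, not dollars
    rto_minutes: float    # assumed achievable recovery time (illustrative)
    rpo_minutes: float    # assumed achievable data-loss window (illustrative)

# Illustrative numbers only. Replace them with results from your own tests.
STRATEGIES = [
    Strategy("backup and restore", 1, rto_minutes=24 * 60, rpo_minutes=24 * 60),
    Strategy("pilot light", 2, rto_minutes=60, rpo_minutes=15),
    Strategy("warm standby", 3, rto_minutes=15, rpo_minutes=5),
    Strategy("active-active", 4, rto_minutes=1, rpo_minutes=1),
]

def cheapest_strategy(required_rto_minutes: float,
                      required_rpo_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets both objectives, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= required_rto_minutes
        and s.rpo_minutes <= required_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)

# An internal reporting system that can be down for two days.
print(cheapest_strategy(48 * 60, 24 * 60).name)
# An order system that must be back within 30 minutes, losing at most 10.
print(cheapest_strategy(30, 10).name)
```

Under these assumed figures the reporting system lands on backup and restore and the order system on warm standby. The point is the shape of the decision, not the specific thresholds.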

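Route 53 can route traffic using failover, weighted, latency-based, or other routing policies. Amazon Application Recovery Controller (ARC) routing controls give operators a deliberate, manually operated switch for shifting traffic between whole application replicas, which is safer than editing DNS records by hand during an incident.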
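
As a rough illustration of the failover routing policy, the sketch below builds the record sets for a primary and a secondary Region as plain Python dictionaries. The domain, load balancer DNS names, hosted zone IDs, and health check ID are all placeholder assumptions. Nothing here calls AWS; the resulting dictionary has the shape that boto3's Route 53 `change_resource_record_sets` call expects as its `ChangeBatch` argument.

```python
from typing import Dict, List, Optional

def failover_record(name: str, role: str, alb_dns_name: str,
                    alb_zone_id: str,
                    health_check_id: Optional[str] = None) -> Dict:
    """Build one alias record set for a Route 53 failover routing policy."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-region",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,      # the load balancer's zone, not yours
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record

def change_batch(records: List[Dict]) -> Dict:
    """Wrap record sets in an UPSERT change batch."""
    return {
        "Comment": "DR failover records (example)",
        "Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records],
    }

# All identifiers below are placeholders for illustration.
batch = change_batch([
    failover_record("app.example.com.", "PRIMARY",
                    "primary-alb.example.com.", "ZPRIMARYEXAMPLE",
                    health_check_id="hc-primary-example"),
    failover_record("app.example.com.", "SECONDARY",
                    "secondary-alb.example.com.", "ZSECONDARYEXAMPLE"),
])
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```

With real identifiers, the batch could be applied with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. Keeping record construction in a pure function makes the routing configuration easy to review and test before anything touches DNS.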

5. Request Or Data Flow

In normal operation, users reach the primary Region through Route 53, CloudFront, Global Accelerator, or another entry point.

Application writes land in the primary data store. Depending on the chosen pattern, data is backed up, snapshot-copied, asynchronously replicated, or globally replicated to another Region.

If the primary Region fails, the recovery process starts:

detect failure
confirm failover criteria
prepare or scale standby
verify data state
shift traffic
monitor errors and latency
decide failback path later
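
The steps above can be sketched as an ordered runbook that halts at the first step that does not succeed, so traffic is never shifted onto an unverified standby. This is a minimal sketch: `run_runbook` is a hypothetical helper, and the lambdas are placeholders for real checks and automation.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]

def run_runbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order. Return (completed step names, step that halted the run)."""
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name    # stop here; later steps must not run
        completed.append(name)
    return completed, None

# Placeholder actions. In practice each would call monitoring or automation.
steps: List[Step] = [
    ("detect failure", lambda: True),
    ("confirm failover criteria", lambda: True),
    ("prepare or scale standby", lambda: True),
    ("verify data state", lambda: False),     # simulate a stale replica
    ("shift traffic", lambda: True),
    ("monitor errors and latency", lambda: True),
]

completed, halted_at = run_runbook(steps)
print("completed:", completed)
print("halted at:", halted_at)
```

In this simulated run the data check fails, so the runbook stops before "shift traffic". Encoding the order explicitly is one way to keep a stressed operator from skipping the verification step.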

The failback path matters. Returning to the primary Region after recovery can be harder than failing over, especially if writes happened in the recovery Region.

6. Security Controls

Replicate or recreate IAM roles, KMS keys, secrets, certificates, and security groups deliberately. Do not assume a restored database is useful if the application cannot decrypt secrets or connect to it.

Protect backups with vault policies, cross-account copies, and immutability where appropriate. A compromised production account should not be able to delete every recovery point.
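
One way to express "production cannot delete recovery points" is a deny statement in the backup vault's access policy. The sketch below only builds the policy JSON. The role ARN is a hypothetical break-glass role, and the exact action list and condition are assumptions to check against the AWS Backup documentation before use. A deny that matches every principal can lock out administrators too, which is why the example carves out one exception.

```python
import json

def vault_deny_delete_policy(break_glass_role_arn: str) -> str:
    """Build a vault access policy that denies tampering with recovery points."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRecoveryPointTampering",
                "Effect": "Deny",
                "Principal": "*",
                "Action": [
                    "backup:DeleteRecoveryPoint",
                    "backup:UpdateRecoveryPointLifecycle",
                    "backup:PutBackupVaultAccessPolicy",
                ],
                "Resource": "*",
                # Everyone except the break-glass role is denied.
                "Condition": {
                    "ArnNotEquals": {"aws:PrincipalArn": break_glass_role_arn}
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)

# Placeholder account ID and role name.
policy_json = vault_deny_delete_policy(
    "arn:aws:iam::111122223333:role/dr-break-glass"
)
print(policy_json)
```

The string could then be passed as the `Policy` argument of boto3's `put_backup_vault_access_policy`. An access policy is still editable by whoever holds the exempted role; AWS Backup Vault Lock is the feature aimed at stronger immutability and is not covered by this sketch.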

Log the recovery environment. CloudTrail, CloudWatch, Config, GuardDuty, and Security Hub should not disappear during a disaster.

Limit who can trigger failover. Failover is a powerful operational action and can become an outage if used incorrectly.

7. Resilience Controls

Test restore regularly. A backup that has never been restored is not proven.

Test failover with realistic runbooks. Include DNS behavior, client retries, cache behavior, database promotion, queue draining, and downstream dependencies.

Keep infrastructure as code for both Regions. This reduces configuration drift and helps rebuild consistently.

Use health checks carefully. Fully automatic failover can help simple systems, but complex systems often need operator confirmation to avoid failing over for a partial or false signal.
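
A minimal sketch of the "sustained signal plus operator confirmation" idea follows. The function name and threshold are hypothetical; the list of booleans stands in for recent health check results, oldest first, with `True` meaning healthy.

```python
from typing import Sequence

def should_fail_over(health_results: Sequence[bool],
                     required_consecutive_failures: int,
                     operator_confirmed: bool) -> bool:
    """Recommend failover only after a sustained failure AND operator sign-off."""
    if required_consecutive_failures <= 0:
        raise ValueError("required_consecutive_failures must be positive")
    if len(health_results) < required_consecutive_failures:
        return False                      # not enough evidence yet
    recent = health_results[-required_consecutive_failures:]
    sustained_failure = not any(recent)   # every recent check failed
    return sustained_failure and operator_confirmed

# One failed check in a healthy stream: a blip, not a disaster.
print(should_fail_over([True, True, False, True], 3, operator_confirmed=True))
# Three failures in a row, but nobody has confirmed yet.
print(should_fail_over([True, False, False, False], 3, operator_confirmed=False))
# Sustained failure and a human has signed off.
print(should_fail_over([True, False, False, False], 3, operator_confirmed=True))
```

Only the last case returns `True`. A simple system might drop the confirmation flag and fail over automatically; the trade-off is a faster reaction against a higher chance of acting on a partial or false signal.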

Track replication lag. Low RTO does not guarantee low RPO.
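
Replication lag can be compared directly against the RPO. The sketch below assumes you can obtain the timestamp of the last change known to have reached the recovery Region, however the chosen service reports it; `rpo_status` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

def rpo_status(last_replicated_at: datetime, now: datetime,
               rpo: timedelta) -> Tuple[timedelta, bool]:
    """Return (current lag, True if failing over now would violate the RPO)."""
    lag = now - last_replicated_at
    return lag, lag > rpo

now = datetime(2026, 6, 3, 12, 0, 0, tzinfo=timezone.utc)
last_replicated = datetime(2026, 6, 3, 11, 52, 0, tzinfo=timezone.utc)

lag, violated = rpo_status(last_replicated, now, rpo=timedelta(minutes=5))
print(f"lag={lag}, RPO violated={violated}")
```

Here the standby is eight minutes behind against a five-minute RPO, so failing over at this moment would lose more data than the business accepted, even if traffic could be shifted in seconds.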

8. Performance Controls

Warm standby and active-active require capacity planning. A standby Region that can only handle five percent of normal traffic may fail under full production load unless scaling is fast and tested.
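
The capacity gap is simple arithmetic worth doing before an outage. The sketch below uses made-up traffic and quota figures; `standby_capacity_gap` is a hypothetical helper, and per-instance throughput is assumed to be known from load testing.

```python
import math
from typing import Dict

def standby_capacity_gap(peak_rps: float, rps_per_instance: float,
                         standby_instances: int,
                         instance_quota: int) -> Dict[str, object]:
    """Compare what the standby Region runs today with what full load needs."""
    required = math.ceil(peak_rps / rps_per_instance)
    return {
        "required_instances": required,
        "to_launch": max(0, required - standby_instances),
        "standby_share": standby_instances / required,
        "quota_ok": required <= instance_quota,
    }

# Illustrative figures: 10,000 requests/s at peak, 250 requests/s per instance,
# 2 instances running in the standby Region, and a quota of 32 instances there.
gap = standby_capacity_gap(peak_rps=10_000, rps_per_instance=250,
                           standby_instances=2, instance_quota=32)
print(gap)
```

With these numbers the standby covers five percent of peak, needs 38 more instances, and would hit the Region's quota before reaching full capacity. A quota increase requested during the outage is unlikely to arrive within the RTO.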

Global users may benefit from multi-Region routing even outside disaster recovery, but read latency and write consistency are separate concerns.

Synchronous cross-Region writes are rarely the default answer because they add latency and coupling. Many AWS services use asynchronous replication patterns, so design for possible lag.

For static assets, S3 replication and CloudFront can reduce regional dependency. For databases, choose the service-specific replication pattern rather than assuming all databases behave the same way.

9. Cost Controls

DR cost rises as RTO and RPO shrink.

Backup and restore is the cheapest because little or no compute runs in the recovery Region. Pilot light costs more because critical infrastructure is kept in place. Warm standby costs more again because the whole stack runs at reduced size. Active-active is the most expensive, both operationally and financially.

Replication, data transfer, duplicate environments, observability, KMS requests, and Route 53 or ARC controls all affect cost.

Do not buy active-active because it sounds mature. Buy it only when the business requirement justifies the complexity.

10. Exam Variants

"Lowest cost and can tolerate hours of downtime" often points to backup and restore.

"Core infrastructure and replicated data exist, but full capacity is launched during recovery" points to pilot light.

"Complete environment is already running at reduced capacity" points to warm standby.

"Both Regions actively serve production traffic" points to active-active.

"Need controlled failover of an entire application stack" can point to ARC routing controls with Route 53.

"Need to meet a specific RTO/RPO" means choose the strategy that satisfies those objectives, not the one with the most services.

11. Common Traps

Do not confuse high availability across Availability Zones with disaster recovery across Regions.

Do not assume backups equal recovery.

Do not replicate data without thinking about corruption and deletion.

Do not forget quotas and capacity in the recovery Region.

Do not assume DNS failover is instant.

Do not ignore failback.

Review Amazon Route 53, AWS Backup, S3 Replication, Amazon Aurora, and AWS Well-Architected Tool.
