AWS Exam Review

Design Resilient Architectures

Review SAA-C03 Domain 2 by connecting loose coupling, high availability, fault tolerance, disaster recovery, backups, failover, quotas, and resilience traps.

Intermediate, 5 min read, updated 2026-06-04. Cloud, Certification, Reliability, Operations, Capacity, Tradeoffs

Resilience, Loose Coupling, High Availability, Fault Tolerance, Disaster Recovery, Multi-AZ, RPO And RTO, Failover

After this, you will understand

Resilience questions become easier when learners separate four ideas that exam options blur: scaling, loose coupling, high availability, and recoverability.

Plain version

This domain asks whether you can design systems that keep working during component failures and can recover when bad things still happen.

Decision pressure

Learners add replicas without backups, use read replicas for automatic failover, keep every component in one AZ, or confuse asynchronous messaging with guaranteed processing.

Exam-ready model

Identify the failure mode first, then choose isolation, redundancy, decoupling, backup, failover, or disaster recovery controls.

Think before reading

Why is replication not the same thing as backup?

Replication keeps another copy current, which can also copy bad changes; backups preserve earlier recovery points.

Concepts Covered

SAA-C03 resilient architecture domain
Loose coupling
Multi-tier design
Event-driven architecture
Horizontal scaling
Multi-AZ high availability
Fault tolerance
Backups and replication
Disaster recovery
Service quotas and throttling

1. Domain Mental Model

Resilience is about what happens when parts of the system fail.

The core question is:

what failure are we surviving, and what design control handles it?

A single EC2 instance failure is not the same as an Availability Zone disruption. A bad database migration is not the same as a database instance failure. A traffic spike is not the same as a regional outage. A queue backlog is not the same as lost data.

SAA-C03 resilience questions reward precise matching. Use Multi-AZ for high availability. Use backups for historical recovery. Use replication for continuity or geographic copies. Use queues and events for loose coupling. Use Auto Scaling for variable load. Use DR patterns when an entire Region or workload environment must be recoverable.
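The matching habit described above can be sketched as a small lookup. This is only a study aid: the failure-mode labels and the `pick_control` helper are hypothetical names invented for illustration, and real questions usually need more context than a single keyword.

```python
# Illustrative lookup: failure mode -> primary design control.
# The keys are hypothetical labels chosen for this sketch, not AWS terminology.
CONTROL_FOR_FAILURE = {
    "instance_failure": "Auto Scaling group behind a load balancer",
    "az_disruption": "Multi-AZ placement",
    "bad_data_change": "Backups and point-in-time recovery",
    "traffic_spike": "Auto Scaling, optionally with SQS buffering",
    "regional_outage": "Disaster recovery pattern in a second Region",
}


def pick_control(failure_mode: str) -> str:
    """Return the control matched to a failure mode, or flag it as unmapped."""
    return CONTROL_FOR_FAILURE.get(
        failure_mode, "unmapped: identify the failure first"
    )


print(pick_control("bad_data_change"))
print(pick_control("az_disruption"))
```

The fallback string is deliberate: if the failure mode cannot be named, no control can be matched to it yet.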

2. Official Task Map

AWS groups this domain into two task areas:

scalable, loosely coupled architectures
highly available or fault-tolerant architectures

The official weighting is 26 percent of scored content.

Arcflow maps that to:

decoupling and asynchronous design
multi-tier architecture and scaling boundaries
managed service selection
high availability across Availability Zones
backup, replication, and disaster recovery strategy
service quota and throttling awareness
workload visibility and automation

The exam often hides the answer in the failure mode. Read slowly.

3. What AWS Is Testing

AWS is testing whether you can design for failure without overbuilding.

For loose coupling, expect SQS, SNS, EventBridge, Step Functions, API Gateway, Lambda, ECS/Fargate, load balancers, containers, microservices, stateless workloads, and caching.

For high availability, expect Regions, Availability Zones, Route 53, ALB, Auto Scaling groups, RDS Multi-AZ, Aurora replicas, DynamoDB global tables, S3 durability, EFS mount targets, CloudFront, and failover routing.

For disaster recovery, expect RTO, RPO, backup and restore, pilot light, warm standby, active-active, cross-Region replication, cross-account backups, and restore testing.

For resilience operations, expect CloudWatch, X-Ray, service quotas, throttling, automation, immutable infrastructure, and managed services that reduce operational burden.

4. Service And Concept Clusters

Start with infrastructure placement:

Then decouple components:

Then protect state:

Then plan recovery:

5. Architecture Reasoning Patterns

Separate availability from recovery.

Availability asks:

can the workload keep serving users during a failure?

Recovery asks:

can we restore to a good state after damage or outage?

Multi-AZ helps availability. Backups help recovery. Replication can support continuity, geographic access, or DR, but it can also copy corruption or deletes.
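The difference between a replica and a backup can be shown with a toy in-memory model. The dictionaries and the `replicate` function below are assumptions made for this sketch; they stand in for a database and its replication stream and do not call any AWS API.

```python
import copy

# Toy "database" holding two records.
primary = {"order-1": "paid", "order-2": "shipped"}

# A backup is a point-in-time copy taken before the bad change.
backup = copy.deepcopy(primary)


def replicate(source: dict) -> dict:
    """Model replication: the replica mirrors the source's current state."""
    return copy.deepcopy(source)


# A faulty migration deletes a record on the primary.
del primary["order-2"]

# Replication faithfully carries the delete over to the replica.
replica = replicate(primary)

print("order-2" in replica)  # False: the replica copied the damage
print("order-2" in backup)   # True: the backup kept the earlier state
```

The replica is only as good as the primary's latest state, which is why the two controls answer different questions.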

Separate scaling from resilience. Auto Scaling handles variable capacity. It does not make a stateful dependency recoverable by itself.

Separate queueing from processing. SQS can buffer work, but consumers still need idempotency, retries, DLQs, visibility timeout design, and monitoring.
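A minimal sketch of what the consumer side has to handle, using an in-memory queue rather than the real SQS API. The `drain` function, the `MAX_RECEIVES` threshold, and the message shape are all hypothetical; the point is that duplicate deliveries are skipped by ID and repeatedly failing messages are set aside instead of being retried forever.

```python
from collections import deque

MAX_RECEIVES = 3  # hypothetical threshold before a message is dead-lettered


def drain(messages, handler):
    """Process messages with idempotency, bounded retries, and a dead-letter list."""
    queue = deque((msg, 0) for msg in messages)
    processed_ids = set()
    results, dead_letters = [], []
    while queue:
        msg, receives = queue.popleft()
        receives += 1
        if msg["id"] in processed_ids:
            continue  # duplicate delivery: acknowledge without reprocessing
        try:
            results.append(handler(msg))
            processed_ids.add(msg["id"])
        except Exception:
            if receives >= MAX_RECEIVES:
                dead_letters.append(msg)  # give up and keep it for inspection
            else:
                queue.append((msg, receives))  # make it visible again for retry
    return results, dead_letters


def handler(msg):
    """Toy handler that fails on a poison message."""
    if msg["body"] == "poison":
        raise ValueError("cannot process")
    return msg["body"].upper()


messages = [
    {"id": "a", "body": "charge"},
    {"id": "a", "body": "charge"},  # the same message delivered twice
    {"id": "b", "body": "poison"},
]
results, dead = drain(messages, handler)
print(results)  # ['CHARGE']
print(dead)     # [{'id': 'b', 'body': 'poison'}]
```

The customer is charged once despite the duplicate, and the poison message ends up in the dead-letter list after three attempts.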

Use managed services when operational overhead matters. RDS, Aurora, DynamoDB, S3, Lambda, Fargate, SQS, SNS, and EventBridge remove different kinds of infrastructure burden.

6. High-Yield Comparisons

Multi-AZ vs read replica: a synchronous standby with automatic failover versus asynchronous copies for read scaling or cross-Region reads.

Backup vs replication: historical restore points versus current or near-current copy.

SQS vs SNS vs EventBridge: a queue that buffers work, pub/sub fanout to many subscribers, and rule-based event routing.

Step Functions vs SQS retries: workflow state and orchestration versus message buffering and consumer retry.

Lambda vs ECS/Fargate vs EC2: event-driven serverless compute, managed containers, and full instance control.

CloudFront vs Global Accelerator: CDN caching and web edge controls versus static IPs and accelerated routing to endpoints.

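Pilot light vs warm standby vs active-active: data kept replicated with compute switched off until needed, a scaled-down but fully running copy of the environment, and live serving from multiple Regions. Cost rises and RTO falls as you move along that list.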

RDS Multi-AZ vs backups: failover for infrastructure failure versus point-in-time recovery for bad data.
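The DR comparison above is essentially a trade of cost against recovery time. The sketch below picks the cheapest pattern that plausibly fits an RTO target. The minute thresholds are assumptions chosen only to make the ordering concrete; they are not AWS-published limits, and the `suggest_dr_pattern` helper is hypothetical.

```python
def suggest_dr_pattern(rto_minutes: float) -> str:
    """Suggest the least expensive DR pattern for an RTO target.

    Thresholds are illustrative assumptions, not official figures.
    """
    if rto_minutes < 1:
        return "active-active"
    if rto_minutes <= 15:
        return "warm standby"
    if rto_minutes <= 60:
        return "pilot light"
    return "backup and restore"


print(suggest_dr_pattern(24 * 60))  # backup and restore
print(suggest_dr_pattern(5))        # warm standby
```

The function checks the tightest targets first and falls through to cheaper patterns, which mirrors the exam habit of not overbuilding past the stated requirement.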

7. Scenario Triggers

"Application tier must scale horizontally" points to Auto Scaling groups, ALB, ECS Service Auto Scaling, Lambda concurrency, or Fargate service scaling.

"Decouple producers and consumers" points to SQS, SNS, EventBridge, or Step Functions depending on flow.

"Handle sudden spikes without losing requests" often points to SQS buffering plus scalable consumers.

"Survive one AZ failure" points to multi-AZ placement across compute, load balancers, databases, and subnets.

"Automatically fail over relational database" points to RDS Multi-AZ or Aurora HA behavior.

"Recover from accidental deletion" points to backups, versioning, snapshots, or PITR.

"Regional DR with low RTO" points toward warm standby or active-active, not basic backup and restore.

8. Common Traps

Do not pick read replicas when a standard RDS question asks for automatic failover. Multi-AZ provides that; a read replica has to be promoted manually.

Do not remove backups because replication exists.

Do not build highly available compute with one subnet in one AZ.

Do not forget stateful dependencies when scaling stateless app servers.

Do not assume queues guarantee exactly-once business processing.

Do not choose active-active multi-Region when requirements only need daily backup restore.

Do not ignore service quotas in standby environments.

Do not call S3 Standard-IA less durable than S3 Standard. Study availability, durability, and retrieval behavior separately.
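Two of the traps above, single-AZ placement and ignored quotas in a standby environment, can be caught with simple checks. The sketch below works on plain Python data and assumes you have already collected the placement and quota numbers; the function names, tier names, and figures are hypothetical.

```python
def single_az_tiers(placement: dict[str, set[str]]) -> list[str]:
    """Return the tiers that run in fewer than two Availability Zones."""
    return sorted(tier for tier, azs in placement.items() if len(azs) < 2)


def quota_shortfall(required: int, quota: int) -> int:
    """Return how many units the standby quota falls short of failover demand."""
    return max(0, required - quota)


# Hypothetical placement: the app tier was left in a single AZ.
placement = {
    "web": {"us-east-1a", "us-east-1b"},
    "app": {"us-east-1a"},
    "db": {"us-east-1a", "us-east-1b"},
}
print(single_az_tiers(placement))              # ['app']
print(quota_shortfall(required=40, quota=32))  # 8
```

A non-empty list or a non-zero shortfall means the design would not survive the failure it claims to handle.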

9. Study Path

Study in this order:

After that, practice tracing one failure at a time. Ask what breaks, which component owns the recovery, and whether the design meets the required RTO and RPO.
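The RTO and RPO check can be reduced to simple arithmetic for a backup-and-restore design. The sketch assumes worst-case data loss equals the backup interval and that recovery time is detection time plus restore time; the `meets_targets` helper and the sample numbers are hypothetical.

```python
def meets_targets(backup_interval_h: float, detect_h: float, restore_h: float,
                  rpo_h: float, rto_h: float) -> tuple[bool, bool]:
    """Return (meets_rpo, meets_rto) for a backup-and-restore design.

    Assumes worst-case data loss is one full backup interval and that
    recovery time is detection plus restore, with no other delays.
    """
    worst_case_data_loss_h = backup_interval_h
    recovery_time_h = detect_h + restore_h
    return worst_case_data_loss_h <= rpo_h, recovery_time_h <= rto_h


# Daily backups, 30 minutes to detect, 3 hours to restore, 4-hour targets.
print(meets_targets(backup_interval_h=24, detect_h=0.5, restore_h=3,
                    rpo_h=4, rto_h=4))  # (False, True)
```

Here the restore is fast enough for the RTO, but daily backups can lose up to 24 hours of data, so the RPO fails and the design needs more frequent backups or point-in-time recovery.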

Review Design Secure Architectures, Design High-Performing Architectures, Highly Available RDS App, and Event-Driven Order Processing.

What to study next

These links keep the session moving: read prerequisites first, then open the systems, concepts, and patterns that deepen this page.

Prerequisites

Read these first if the mechanics feel unfamiliar.

AWS Global Infrastructure: start here if AWS Global Infrastructure is still fuzzy.
VPC Networking Model: start here if VPC Networking Model is still fuzzy.
