AWS Exam Review
Design Resilient Architectures
Review SAA-C03 Domain 2 by connecting loose coupling, high availability, fault tolerance, disaster recovery, backups, failover, quotas, and resilience traps.
After this, you will understand
Resilience questions become easier when learners separate four ideas that exam options blur: scaling, loose coupling, high availability, and recoverability.
This domain asks whether you can design systems that keep working during component failures and can recover when bad things still happen.
Common mistakes: learners add replicas without backups, use read replicas for automatic failover, keep every component in one AZ, or confuse asynchronous messaging with guaranteed processing.
Identify the failure mode first, then choose isolation, redundancy, decoupling, backup, failover, or disaster recovery controls.
Think before reading: Why is replication not the same thing as backup?
Concepts Covered
- SAA-C03 resilient architecture domain
- Loose coupling
- Multi-tier design
- Event-driven architecture
- Horizontal scaling
- Multi-AZ high availability
- Fault tolerance
- Backups and replication
- Disaster recovery
- Service quotas and throttling
1. Domain Mental Model
Resilience is about what happens when parts of the system fail.
The core question is:
what failure are we surviving, and what design control handles it?
A single EC2 instance failure is not the same as an Availability Zone disruption. A bad database migration is not the same as a database instance failure. A traffic spike is not the same as a regional outage. A queue backlog is not the same as lost data.
SAA-C03 resilience questions reward precise matching. Use Multi-AZ for high availability. Use backups for historical recovery. Use replication for continuity or geographic copies. Use queues and events for loose coupling. Use Auto Scaling for variable load. Use DR patterns when an entire Region or workload environment must be recoverable.
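The matching above can be written down as a small lookup table. This is only a study aid: the keys are hypothetical labels chosen for this sketch, not AWS terminology, and the values restate the pairings from the paragraph above.

```python
# Hypothetical labels for the needs named in this section (not AWS terms).
CONTROLS = {
    "high_availability": "Multi-AZ",
    "historical_recovery": "backups",
    "continuity_or_geographic_copies": "replication",
    "loose_coupling": "queues and events",
    "variable_load": "Auto Scaling",
    "region_or_environment_recovery": "DR patterns",
}


def control_for(need: str) -> str:
    """Return the control this section pairs with a given need."""
    try:
        return CONTROLS[need]
    except KeyError:
        raise ValueError(f"unknown need: {need!r}") from None


print(control_for("historical_recovery"))  # backups
print(control_for("variable_load"))        # Auto Scaling
```

Reading an exam scenario then becomes a two-step habit: name the need first, and only then look at the answer options.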
2. Official Task Map
AWS groups this domain into two task areas:
- scalable, loosely coupled architectures
- highly available or fault-tolerant architectures
The official weighting is 26 percent of scored content.
Arcflow maps that to:
- decoupling and asynchronous design
- multi-tier architecture and scaling boundaries
- managed service selection
- high availability across Availability Zones
- backup, replication, and disaster recovery strategy
- service quota and throttling awareness
- workload visibility and automation
The exam often hides the answer in the failure mode. Read slowly.
3. What AWS Is Testing
AWS is testing whether you can design for failure without overbuilding.
For loose coupling, expect SQS, SNS, EventBridge, Step Functions, API Gateway, Lambda, ECS/Fargate, load balancers, containers, microservices, stateless workloads, and caching.
For high availability, expect Regions, Availability Zones, Route 53, ALB, Auto Scaling groups, RDS Multi-AZ, Aurora replicas, DynamoDB global tables, S3 durability, EFS mount targets, CloudFront, and failover routing.
For disaster recovery, expect RTO, RPO, backup and restore, pilot light, warm standby, active-active, cross-Region replication, cross-account backups, and restore testing.
For resilience operations, expect CloudWatch, X-Ray, service quotas, throttling, automation, immutable infrastructure, and managed services that reduce operational burden.
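Among the disaster recovery terms above, the required RTO is usually what separates one pattern from another. The sketch below picks the least elaborate pattern for a given RTO. The minute thresholds are illustrative assumptions made for this example, not AWS-published limits, and the function name is hypothetical.

```python
def pick_dr_pattern(rto_minutes: float) -> str:
    """Map a required RTO to the least elaborate pattern likely to meet it.

    The thresholds are assumptions for illustration only.
    """
    if rto_minutes < 1:
        return "active-active"
    if rto_minutes <= 15:
        return "warm standby"
    if rto_minutes <= 120:
        return "pilot light"
    return "backup and restore"


print(pick_dr_pattern(24 * 60))  # backup and restore
print(pick_dr_pattern(10))       # warm standby
```

The ordering matters more than the numbers: tighter RTO means more infrastructure already running in the recovery Region, and more cost.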
4. Service And Concept Clusters
Start with infrastructure placement:
- AWS Global Infrastructure
- Public vs Private Subnets
- Application Load Balancer vs Network Load Balancer vs Gateway Load Balancer
- Amazon EC2 Auto Scaling
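Then decouple components:
- SQS vs SNS vs EventBridge
- Step Functions vs SQS And Lambda Retries
Then protect state:
- RDS Multi-AZ vs Read Replicas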
Then plan recovery:
- AWS Backup
- S3 Replication
- Backup vs Replication Recovery Design
- Multi-Region Disaster Recovery On AWS
5. Architecture Reasoning Patterns
Separate availability from recovery.
Availability asks:
can the workload keep serving users during a failure?
Recovery asks:
can we restore to a good state after damage or outage?
Multi-AZ helps availability. Backups help recovery. Replication can support continuity, geographic access, or DR, but it can also copy corruption or deletes.
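A toy simulation makes the last point concrete. It uses plain dictionaries rather than any AWS API, and the names (`primary`, `replica`, `backups`) are hypothetical. It shows a replica faithfully copying a delete and a corrupted value while an earlier backup still holds the good state.

```python
import copy

primary = {"order-1": "paid", "order-2": "shipped"}
replica = {}
backups = []  # list of (label, snapshot) restore points


def replicate():
    """Replication mirrors the current state, including deletes and corruption."""
    replica.clear()
    replica.update(primary)


def take_backup(label):
    """A backup keeps an independent, historical copy."""
    backups.append((label, copy.deepcopy(primary)))


replicate()
take_backup("nightly")

# Accidental delete and corruption on the primary, then replication runs again.
del primary["order-1"]
primary["order-2"] = "???"
replicate()

label, snapshot = backups[-1]
print(replica)   # {'order-2': '???'}
print(snapshot)  # {'order-1': 'paid', 'order-2': 'shipped'}
```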
Separate scaling from resilience. Auto Scaling handles variable capacity. It does not make a stateful dependency recoverable by itself.
Separate queueing from processing. SQS can buffer work, but consumers still need idempotency, retries, DLQs, visibility timeout design, and monitoring.
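The consumer responsibilities listed above can be sketched with an in-memory queue. This is a simplified model and does not call SQS: `MAX_RECEIVES` is an assumed threshold that plays the role of a redrive policy's receive limit, and re-appending a message stands in for it becoming visible again after a failed attempt.

```python
from collections import deque

MAX_RECEIVES = 3  # assumed threshold before a message is moved to the DLQ


def drain(queue, handler, processed_ids, dlq):
    """Process messages at-least-once, skip duplicates, park poison messages."""
    receives = {}
    while queue:
        msg = queue.popleft()
        msg_id = msg["id"]
        if msg_id in processed_ids:
            continue  # duplicate delivery: the idempotency check makes it a no-op
        receives[msg_id] = receives.get(msg_id, 0) + 1
        try:
            handler(msg)
            processed_ids.add(msg_id)
        except Exception:
            if receives[msg_id] >= MAX_RECEIVES:
                dlq.append(msg)    # give up and keep it for inspection
            else:
                queue.append(msg)  # retry later


charged = []


def charge(msg):
    if msg["body"] == "bad":
        raise ValueError("cannot process")
    charged.append(msg["id"])


queue = deque([
    {"id": "a", "body": "ok"},
    {"id": "a", "body": "ok"},   # duplicate delivery of the same message
    {"id": "b", "body": "bad"},  # poison message that always fails
])
done, dlq = set(), []
drain(queue, charge, done, dlq)
print(charged, [m["id"] for m in dlq])  # ['a'] ['b']
```

The queue only buffered the work. The duplicate was harmless because the consumer tracked processed IDs, and the poison message stopped retrying because the consumer enforced a limit.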
Use managed services when operational overhead matters. RDS, Aurora, DynamoDB, S3, Lambda, Fargate, SQS, SNS, and EventBridge remove different kinds of infrastructure burden.
6. High-Yield Comparisons
Multi-AZ vs read replica: a synchronous standby with automatic failover versus asynchronous copies for read scaling or cross-Region reads.
Backup vs replication: historical restore points versus current or near-current copy.
SQS vs SNS vs EventBridge: queue for work buffering, pub/sub fanout, event routing.
Step Functions vs SQS retries: workflow state and orchestration versus message buffering and consumer retry.
Lambda vs ECS/Fargate vs EC2: event-driven serverless compute, managed containers, and full instance control.
CloudFront vs Global Accelerator: CDN caching and web edge controls versus static IPs and accelerated routing to endpoints.
Pilot light vs warm standby vs active-active: replicated data with minimal core infrastructure kept ready, a scaled-down but running copy of the environment, and live multi-Region serving.
RDS Multi-AZ vs backups: failover for infrastructure failure versus point-in-time recovery for bad data.
7. Scenario Triggers
"Application tier must scale horizontally" points to Auto Scaling groups, ALB, ECS Service Auto Scaling, Lambda concurrency, or Fargate service scaling.
"Decouple producers and consumers" points to SQS, SNS, EventBridge, or Step Functions depending on flow.
"Handle sudden spikes without losing requests" often points to SQS buffering plus scalable consumers.
"Survive one AZ failure" points to multi-AZ placement across compute, load balancers, databases, and subnets.
"Automatically fail over relational database" points to RDS Multi-AZ or Aurora HA behavior.
"Recover from accidental deletion" points to backups, versioning, snapshots, or PITR.
"Regional DR with low RTO" points toward warm standby or active-active, not basic backup and restore.
8. Common Traps
Do not pick read replicas when a standard RDS question asks for automatic failover. That answer is Multi-AZ.
Do not remove backups because replication exists.
Do not build highly available compute with one subnet in one AZ.
Do not forget stateful dependencies when scaling stateless app servers.
Do not assume queues guarantee exactly-once business processing.
Do not choose active-active multi-Region when requirements only need daily backup restore.
Do not ignore service quotas in standby environments.
Do not call S3 Standard-IA less durable than S3 Standard. Study availability, durability, and retrieval behavior separately.
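Several of these traps can be checked mechanically against a simplified design description. The sketch below is a hypothetical review helper: the dictionary layout and field names are assumptions made for this example and do not correspond to any AWS API.

```python
def find_traps(design: dict) -> list[str]:
    """Flag a few of the traps above in a simplified design description."""
    findings = []
    for tier, azs in design["tier_azs"].items():
        if len(set(azs)) < 2:
            findings.append(f"{tier}: runs in a single AZ")
    if design["replication"] and not design["backups"]:
        findings.append("replication without backups")
    if design["standby_quota"] < design["peak_capacity"]:
        findings.append("standby quota below peak capacity")
    return findings


design = {
    "tier_azs": {"web": ["us-east-1a", "us-east-1b"], "db": ["us-east-1a"]},
    "replication": True,
    "backups": False,
    "standby_quota": 20,   # assumed instance quota in the standby environment
    "peak_capacity": 50,   # instances needed to carry full production load
}
for finding in find_traps(design):
    print(finding)
```

For this sample design the helper reports the single-AZ database tier, the missing backups, and the undersized standby quota.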
9. Study Path
Study in this order:
- AWS Global Infrastructure
- Amazon EC2 Auto Scaling
- Application Load Balancer vs Network Load Balancer vs Gateway Load Balancer
- SQS vs SNS vs EventBridge
- Step Functions vs SQS And Lambda Retries
- RDS Multi-AZ vs Read Replicas
- Backup vs Replication Recovery Design
- Multi-Region Disaster Recovery On AWS
- CloudTrail vs Config vs CloudWatch vs Trusted Advisor
After that, practice tracing one failure at a time. Ask what breaks, which component owns the recovery, and whether the design meets the required RTO and RPO.
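The final check, whether a design meets its RTO and RPO, reduces to two comparisons. The sketch below assumes periodic backups, so worst-case data loss is one full backup interval. The class and field names are hypothetical, and the restore time is assumed to come from an actual restore test.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_min: float  # time between restore points
    restore_time_min: float     # measured time to restore and cut over


def meets_objectives(plan: RecoveryPlan, rto_min: float, rpo_min: float) -> bool:
    """Worst-case data loss for periodic backups is one full interval."""
    worst_case_rpo = plan.backup_interval_min
    return plan.restore_time_min <= rto_min and worst_case_rpo <= rpo_min


nightly = RecoveryPlan(backup_interval_min=24 * 60, restore_time_min=90)
print(meets_objectives(nightly, rto_min=120, rpo_min=60))       # False
print(meets_objectives(nightly, rto_min=120, rpo_min=24 * 60))  # True
```

A nightly backup restores quickly enough here, but it cannot satisfy a one-hour RPO no matter how fast the restore is.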
10. Related Topics
Review Design Secure Architectures, Design High-Performing Architectures, Highly Available RDS App, and Event-Driven Order Processing.