Amazon CloudWatch

Understand CloudWatch for AWS metrics, logs, alarms, dashboards, events, and operational visibility across applications and infrastructure.

foundation · 6 min read · Updated 2026-06-02 · Cloud, Certification, Operations, Reliability
Metric, Alarm, Log Group, Log Stream, Dashboard, Metric Filter, Event, Observability

After this, you will understand

CloudWatch is where AWS architectures become observable enough to scale, alert, debug, and prove that they are healthy.

Plain version

CloudWatch collects metrics and logs, then lets you create dashboards, alarms, filters, and operational signals.

Decision pressure

Learners confuse CloudWatch with CloudTrail and expect metrics to explain who changed a resource.

Exam-ready model

Send workload signals to CloudWatch, alarm on symptoms and saturation, and use logs plus metrics together for operations.

Think before reading

What is the simplest difference between CloudWatch and CloudTrail?
CloudWatch observes performance and logs; CloudTrail records AWS API activity and who did what.

Study path

Read these in order

Start with the mechanics, then move into the patterns that explain why the system is shaped this way.

  1. AWS CloudTrail
  2. AWS Lambda

Concepts Covered

  • CloudWatch metrics
  • CloudWatch alarms
  • CloudWatch Logs
  • Log groups and log streams
  • Metric filters
  • Dashboards
  • Agent-based metrics
  • Container and Lambda insights
  • Operational visibility
  • CloudWatch versus CloudTrail

1. Plain-English Mental Model

Amazon CloudWatch is AWS's main monitoring and observability service.

It collects metrics, stores logs, creates alarms, powers dashboards, and helps teams see whether infrastructure and applications are healthy.

The simple model is:

resources and apps -> metrics and logs -> CloudWatch -> alarms and dashboards

Metrics answer questions like "how much CPU is this instance using?" or "how many messages are in this queue?" Logs answer questions like "what did the application write at this time?" Alarms answer "should someone or something react?"

CloudWatch is not the same as CloudTrail. CloudWatch is about operational telemetry. CloudTrail is about AWS API activity and audit history.

2. Why This Service Exists

Cloud systems fail in distributed ways.

An EC2 instance can run out of CPU. A load balancer target can become unhealthy. A Lambda function can error. An SQS queue can build up. A database can run out of connections. An application can log repeated timeouts long before users fully notice.

Without centralized monitoring, teams discover problems through customer complaints.

CloudWatch exists to collect signals from AWS services and custom applications, then turn those signals into dashboards, alarms, scaling inputs, and troubleshooting evidence.

For SAA-C03, CloudWatch appears in questions about alarms, Auto Scaling metrics, EC2 detailed monitoring, log retention, metric filters, application logs, CPU alarms, SQS queue depth, Lambda errors, and operational visibility.

3. The Naive Approach And Where It Breaks

The naive approach is to deploy the architecture and hope managed services tell you when something is wrong.

That breaks quickly. Managed services emit metrics, but you still need to decide which symptoms matter. A database can be available but overloaded. A queue can be durable but growing. A Lambda function can be invoked but failing.

Another naive approach is to SSH into servers and inspect local log files. That fails when instances scale out, are replaced, or run in private subnets. Logs should survive instance lifecycle events.

Another mistake is alarming on everything. Too many noisy alarms train teams to ignore alerts.

CloudWatch needs intentional signal design: symptoms, saturation, errors, latency, capacity, and business-specific metrics.

4. Core Primitives

A metric is a time-series measurement, such as CPU utilization, request count, duration, errors, or queue depth. Each metric lives in a namespace (such as AWS/EC2) and is identified by its name plus a set of dimensions (such as InstanceId).

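An alarm watches a metric or metric math expression and moves between the OK, ALARM, and INSUFFICIENT_DATA states as the value crosses a threshold over a configured number of evaluation periods.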
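A minimal sketch of what an alarm definition can look like, assuming boto3 is the client library. The helper names, the alarm naming scheme, the instance ID, and the SNS topic ARN are all hypothetical; only the parameter names follow the PutMetricAlarm API.

```python
def cpu_alarm_params(instance_id, topic_arn, threshold=80.0):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds covered by each datapoint
        "EvaluationPeriods": 3,   # look at the 3 most recent datapoints
        "DatapointsToAlarm": 2,   # go to ALARM if 2 of those 3 breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarm(cloudwatch_client, params):
    """Send the definition; expects something like boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_alarm(**params)


params = cpu_alarm_params(
    "i-0123456789abcdef0",                            # placeholder instance ID
    "arn:aws:sns:us-east-1:111122223333:ops-alerts",  # placeholder topic ARN
)
print(params["AlarmName"])
```

Requiring two breaching datapoints out of three is one way to avoid alarming on a single short spike; the right numbers depend on the workload.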

A log group is a container for log streams. A log stream is usually a sequence of log events from one source, such as an instance, container, or Lambda execution environment.

A metric filter extracts metric data from log events. This can turn patterns like "ERROR" into a metric that can alarm.
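As an illustration, a metric filter that counts log events containing "ERROR" might be described like this. This is a sketch assuming boto3; the filter name, metric name, namespace, and log group are made up for the example.

```python
def error_metric_filter_params(log_group_name):
    """Build keyword arguments for the CloudWatch Logs PutMetricFilter API."""
    return {
        "logGroupName": log_group_name,
        "filterName": "error-count",            # hypothetical filter name
        "filterPattern": "ERROR",               # matches events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorCount",     # hypothetical metric name
                "metricNamespace": "ExampleApp",  # hypothetical namespace
                "metricValue": "1",             # add 1 for every matching event
                "defaultValue": 0.0,            # report 0 when nothing matches
            }
        ],
    }


def create_metric_filter(logs_client, params):
    """Expects something like boto3.client("logs")."""
    logs_client.put_metric_filter(**params)


filter_params = error_metric_filter_params("/example/app")
print(filter_params["metricTransformations"][0]["metricName"])
```

An alarm can then watch the resulting ErrorCount metric like any other metric.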

A dashboard displays metrics and widgets.

The CloudWatch agent can collect OS-level metrics and logs from EC2 instances. Some metrics, such as memory and disk usage, require an agent because EC2 default metrics do not see inside the guest OS.
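The agent is driven by a JSON configuration file. The sketch below builds a small configuration in Python and prints it as JSON; the log file path and log group name are placeholders, and a real deployment would likely need more settings than shown here.

```python
import json

# Minimal agent configuration: memory and root-disk usage plus one log file.
agent_config = {
    "metrics": {
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
        },
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/example-app/app.log",  # placeholder path
                        "log_group_name": "example-app",              # placeholder group
                        "log_stream_name": "{instance_id}",           # one stream per instance
                    }
                ]
            }
        }
    },
}

rendered = json.dumps(agent_config, indent=2)
print(rendered)
```

Using the instance ID as the stream name keeps each instance's events separate inside a shared log group.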

5. Architecture Use Cases

Use CloudWatch alarms to trigger Auto Scaling policies based on CPU, request count, queue depth, or custom metrics.

Use CloudWatch Logs for application logs from EC2, Lambda, ECS, API Gateway, and other services.

Use dashboards for service health views: ALB latency and 5xx errors, Auto Scaling group capacity, RDS CPU and connections, Lambda duration and errors, SQS queue depth and message age.

Use metric filters to alarm on log patterns, such as repeated authentication failures or application exceptions.

Use CloudWatch with SNS to notify teams when alarms fire.

Use custom metrics when AWS service metrics do not capture application health, such as checkout failures per minute or background job lag.
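A custom metric such as checkout failures could be published roughly as follows. This is a sketch assuming boto3; the namespace, metric name, and dimension are invented for the example.

```python
from datetime import datetime, timezone


def checkout_failure_metric(failures, environment):
    """Build keyword arguments for CloudWatch's PutMetricData API."""
    return {
        "Namespace": "ExampleShop/Checkout",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "CheckoutFailures",  # hypothetical metric name
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(failures),
                "Unit": "Count",
            }
        ],
    }


def publish(cloudwatch_client, params):
    """Expects something like boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_data(**params)


metric = checkout_failure_metric(7, "prod")
print(metric["MetricData"][0]["MetricName"], metric["MetricData"][0]["Value"])
```

Each distinct combination of namespace, metric name, and dimensions counts as a separate metric, so dimensions are best kept to a small, bounded set of values.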

7. Security Model

CloudWatch access is controlled through IAM.

Applications need permission to put logs or metrics. Operators need read access to dashboards, metrics, and logs. Administrators can modify alarms and retention policies.

Logs can contain sensitive data. Do not log secrets, tokens, passwords, or personal data unless the architecture explicitly protects and justifies it.

CloudWatch Logs can be encrypted with KMS where required. Retention policies should match compliance and cost requirements.

Use least privilege for log writers. A Lambda execution role should write to its log group, not administer all monitoring resources.
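One way to express that least-privilege idea is an IAM policy scoped to a single log group. The sketch below assumes the log group already exists (otherwise the role would also need permission to create it); the region, account ID, and function name are placeholders.

```python
import json


def lambda_log_writer_policy(region, account_id, function_name):
    """Return an IAM policy document allowing writes to one function's log group."""
    log_group_arn = (
        f"arn:aws:logs:{region}:{account_id}:log-group:/aws/lambda/{function_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the two actions needed to write log events.
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                # The trailing :* covers the log streams inside this one group.
                "Resource": f"{log_group_arn}:*",
            }
        ],
    }


policy = lambda_log_writer_policy("us-east-1", "111122223333", "example-fn")
print(json.dumps(policy, indent=2))
```

Nothing in this policy allows deleting log groups, changing retention, or editing alarms, which keeps monitoring administration out of the function's reach.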

CloudWatch data can be important forensic evidence, but CloudTrail is still the audit source for AWS API actions.

8. Reliability And Resilience

CloudWatch helps reliability by detecting failure symptoms and triggering response.

Alarms can notify humans or trigger automated actions. Auto Scaling can use CloudWatch metrics to add or remove capacity.

Shipping logs to CloudWatch Logs, with a retention period set on the log group, ensures they survive instance or container replacement. This is critical in elastic systems where the broken host may disappear.

Composite alarms can reduce noise by combining signals. For example, alert only when high latency and high error rate occur together.
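The latency-plus-errors example could be written as a composite alarm rule along these lines. This is a sketch assuming boto3 and two existing metric alarms; every alarm name and the topic ARN are hypothetical.

```python
def composite_alarm_params(latency_alarm, error_alarm, topic_arn):
    """Build keyword arguments for CloudWatch's PutCompositeAlarm API."""
    return {
        "AlarmName": "api-degraded",  # hypothetical composite alarm name
        # Fire only when both child alarms are in the ALARM state.
        "AlarmRule": f'ALARM("{latency_alarm}") AND ALARM("{error_alarm}")',
        "AlarmActions": [topic_arn],
    }


def create_composite_alarm(cloudwatch_client, params):
    """Expects something like boto3.client("cloudwatch")."""
    cloudwatch_client.put_composite_alarm(**params)


composite = composite_alarm_params(
    "api-high-latency",                               # hypothetical child alarm
    "api-high-5xx",                                   # hypothetical child alarm
    "arn:aws:sns:us-east-1:111122223333:ops-alerts",  # placeholder topic ARN
)
print(composite["AlarmRule"])
```

With this arrangement the child alarms would usually carry no notification actions of their own, so only the combined condition pages anyone.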

CloudWatch does not make a workload reliable by itself. It provides the feedback loop. The architecture still needs redundancy, retries, failover, and capacity.

9. Performance And Scaling

CloudWatch metrics help identify bottlenecks: CPU, memory, I/O, network, latency, errors, throttles, queue depth, and concurrency.

Basic monitoring publishes EC2 metrics at 5-minute intervals. Detailed monitoring publishes them at 1-minute intervals for an extra charge, which can matter for faster scaling decisions and troubleshooting.

Custom metrics let applications emit domain-specific performance signals.

For high-volume logs, ingestion and query patterns matter. Structured logs are easier to search and filter than raw text.
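A small sketch of structured logging, assuming the application writes one JSON object per line to standard output. The field names and the `log_event` helper are illustrative rather than any required schema.

```python
import json
from datetime import datetime, timezone


def log_event(level, message, **fields):
    """Return one log line as a JSON object with consistent top-level keys."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,  # extra context such as IDs or timings
    }
    return json.dumps(record)


line = log_event(
    "ERROR",
    "payment gateway timeout",
    order_id="o-123",     # illustrative field
    latency_ms=5021,      # illustrative field
)
print(line)
```

With lines shaped like this, a JSON filter pattern such as `{ $.level = "ERROR" }` can select events by field instead of relying on a substring match.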

Metric math can combine signals. Percentiles can be more useful than averages for latency.
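To see why percentiles can tell a different story than averages, consider 100 requests where 98 are fast and 2 are very slow. The sketch below uses a simple nearest-rank percentile written for this example; it is not how CloudWatch itself computes percentile statistics.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# 98 requests at 20 ms, plus two slow outliers.
latencies_ms = [20] * 98 + [2000, 3000]

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p99={p99_ms}")
# prints: mean=69.6 p50=20 p99=2000
```

The mean suggests every request is somewhat slow, the median shows that the typical request is fine, and p99 exposes the tail that the slowest users actually experience.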

10. Cost Model

CloudWatch cost includes custom metrics, dashboards, alarms, log ingestion, log storage, log queries, detailed monitoring, and add-on features such as Container Insights and Lambda Insights.

Leaving verbose debug logs on forever can become expensive.

Set retention policies for log groups. Move long-term audit data to S3 when needed.
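Retention is set per log group, and the API accepts only specific day counts. The sketch below assumes boto3 and rounds a required retention up to the next accepted value; the list reflects the values documented for PutRetentionPolicy as best I know and should be checked against the current API reference. The log group name is a placeholder.

```python
# Day counts accepted by PutRetentionPolicy (verify against current AWS docs).
VALID_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def retention_params(log_group_name, required_days):
    """Pick the shortest accepted retention that still covers required_days."""
    for days in VALID_RETENTION_DAYS:
        if days >= required_days:
            return {"logGroupName": log_group_name, "retentionInDays": days}
    raise ValueError(f"no accepted retention covers {required_days} days")


def apply_retention(logs_client, params):
    """Expects something like boto3.client("logs")."""
    logs_client.put_retention_policy(**params)


retention = retention_params("/example/app", 45)
print(retention["retentionInDays"])
# prints: 60
```

Rounding up rather than down keeps the compliance requirement satisfied while still bounding how long verbose logs are stored.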

Use custom metrics deliberately. They are powerful but not free.

Alarm on useful symptoms, not every possible low-level signal.

12. SAA-C03 Exam Signals

"Create an alarm when CPU exceeds threshold" points to CloudWatch alarms.

"Scale EC2 based on a metric" points to a CloudWatch metric plus an Auto Scaling policy.

"Collect application logs from EC2" points to the CloudWatch agent, configured to send log files to CloudWatch Logs.

"Alarm when a log contains ERROR" points to metric filters and CloudWatch alarms.

"Need memory or disk metrics for EC2" points to the CloudWatch agent.

"Who made an API call?" points to CloudTrail, not CloudWatch.

"Queue depth should scale workers" points to an SQS metric in CloudWatch, such as ApproximateNumberOfMessagesVisible, plus an Auto Scaling policy.

13. Common Exam Traps

Do not confuse CloudWatch metrics with CloudTrail audit events.

Do not assume EC2 default metrics include memory or disk usage. Use the agent.

Do not rely on instance-local logs in Auto Scaling groups.

Do not leave log groups without retention if cost or compliance matters.

Do not alarm only on CPU for every workload. Queue depth, latency, and errors may be better signals.

Do not treat dashboards as alerts. Dashboards require someone to look.

Review Public Web App On AWS, AWS CloudTrail, and AWS Lambda.
