AWS Scenarios
Analytics Data Lake On S3
Design an AWS analytics data lake using S3, Glue, Athena, Redshift, Kinesis, QuickSight, IAM, partitioning, lifecycle, and cost controls for SAA-C03 scenarios.
After this, you will understand
This scenario makes analytics service choices practical: raw storage, catalog, ad hoc SQL, warehouse queries, streaming ingestion, and dashboards.
Store durable data in S3, catalog it with Glue, query it with Athena or Redshift, and visualize it with QuickSight.
Learners pick Redshift for every query, Athena for every dashboard, or S3 alone without catalog, partitioning, security, and lifecycle design.
Use S3 as the durable lake, organize data into zones, catalog tables, optimize file layout, and choose Athena or Redshift based on query and performance requirements.
Think before readingWhat is the simplest Athena cost lever in a data lake?
Reading in progress
This page is saved in your local study history so you can continue later.
Study path
Read these in order
Start with the mechanics, then move into the patterns that explain why the system is shaped this way.
Concepts Covered
- S3 as a data lake foundation
- Raw, curated, and analytics zones
- Glue Data Catalog and crawlers
- Athena serverless SQL
- Redshift data warehouse
- Kinesis streaming ingestion
- QuickSight dashboards
- Partitioning and Parquet
- Lake security and cost controls
- Analytics exam traps
1. Situation
An organization wants to collect application logs, transaction exports, clickstream events, and operational data in AWS. Analysts need ad hoc SQL. Business users need dashboards. Some data arrives in batches and some arrives as a stream.
The data should be durable, cost-effective, and usable by multiple analytics tools without every team copying it into separate databases.
The common architecture is:
sources -> S3 data lake -> Glue catalog -> Athena, Redshift, QuickSight
The exact services depend on query frequency, latency, data format, governance, and cost.
2. Naive Design
The naive design sends all data directly into one relational database or one Redshift cluster:
all sources -> warehouse -> every query and dashboard
That may work at first, but it mixes raw retention, transformation, ad hoc exploration, dashboards, and warehouse performance into one system.
Another naive design dumps files into S3 with no structure:
bucket -> random CSV files -> everyone guesses paths
That is cheap storage, not a usable data lake. Without cataloging, partitions, file formats, permissions, lifecycle, and ownership, the lake becomes a messy archive.
3. What Breaks
Raw CSV files become expensive to query because Athena scans too much data.
Dashboards become slow if every refresh runs broad ad hoc queries over unoptimized objects.
Data ownership breaks when every team writes different schemas into the same prefix.
Security breaks when analysts receive broad bucket access instead of table, prefix, or role-scoped access.
Cost breaks when lifecycle is missing, crawlers run too often, logs are retained forever in hot storage, and Redshift capacity stays running for occasional queries.
The data lake needs architecture, not just a bucket.
4. AWS Architecture
Use S3 as the durable storage foundation. Organize data into zones:
raw -> cleaned -> curated -> analytics
Use prefixes that reflect domain and partition keys, such as date, source, Region, or tenant where appropriate.
Use AWS Glue Data Catalog to define tables over S3 data. Use Glue crawlers or explicit table definitions depending on schema control needs.
Use Athena for serverless ad hoc SQL over S3, especially when queries are occasional or exploratory.
Use Redshift when the workload needs a data warehouse with consistent high-performance analytics, complex joins, dashboard concurrency, and managed warehouse features.
Use Kinesis Data Streams or Firehose-style ingestion patterns when data arrives continuously.
Use QuickSight for dashboards and business intelligence.
5. Request Or Data Flow
Batch systems export files to S3 raw prefixes. Streaming producers send events to Kinesis and then to S3 or downstream processing.
Glue crawlers or jobs update metadata in the Data Catalog. ETL jobs convert raw data into curated formats such as Parquet, partitioned by useful keys.
Analysts query curated tables with Athena. Dashboards read from Athena, Redshift, or prepared datasets depending on performance requirements.
Redshift can load curated data, query external data through Spectrum-style patterns, or use zero-ETL integrations where supported by service-specific docs.
CloudWatch, CloudTrail, S3 access logs, and cost tools provide operational and governance visibility.
6. Security Controls
Keep S3 buckets private. Use IAM roles, bucket policies, KMS encryption, and least privilege.
Separate raw and curated zones with permissions. Not every analyst needs raw sensitive data.
Use Lake Formation where centralized data lake permissions and fine-grained access control are required.
Encrypt data at rest with S3 encryption and KMS where needed. Encrypt data in transit.
Use CloudTrail data events selectively for sensitive buckets when object-level audit is required, understanding the cost.
Avoid putting secrets or personal data into logs without governance.
7. Resilience Controls
S3 provides durable object storage across multiple Availability Zones for general purpose buckets.
Use versioning for important datasets where accidental overwrite or deletion risk matters.
Use replication when compliance, regional resilience, or data locality requirements justify extra cost.
For streaming ingestion, design retry and buffering behavior. A failed consumer should not lose events.
For ETL, make jobs restartable and idempotent. A failed transformation should not create duplicate curated records.
Keep raw data long enough to reprocess if transformation bugs are discovered.
8. Performance Controls
Use columnar formats such as Parquet or ORC for analytics tables.
Partition data on common query filters, such as date or tenant, but avoid overly granular partitions that create metadata pain.
Compress files and compact small files. Thousands of tiny files can slow query planning and increase request overhead.
Use Athena workgroups to control query settings and scanned data.
Use Redshift for repeated, complex, high-concurrency analytical workloads where a warehouse is justified.
Use QuickSight extracts or prepared datasets when dashboard performance matters.
9. Cost Controls
S3 cost depends on storage class, requests, lifecycle, retrieval, replication, and data transfer.
Athena charges mainly by data scanned, so file format and partitioning are direct cost controls.
Glue crawlers and ETL jobs cost based on usage. Do not crawl huge datasets unnecessarily.
Redshift costs are different: provisioned or serverless warehouse capacity, storage, snapshots, data transfer, and concurrency features.
Move cold raw data to lower-cost storage classes when retrieval requirements allow it.
Use Cost Explorer, CUR/Data Exports, and budgets to track analytics spend.
10. Exam Variants
"Serverless SQL over data in S3" points to Athena.
"Metadata catalog for S3 datasets" points to Glue Data Catalog.
"ETL jobs and crawlers" point to AWS Glue.
"Data warehouse for complex analytics and dashboards" points to Redshift.
"Streaming ingestion" can point to Kinesis Data Streams or related streaming services.
"Reduce Athena query cost" points to partitioning, compression, columnar formats, and limiting scanned data.
11. Common Traps
Do not choose Redshift for occasional ad hoc S3 queries if Athena is enough.
Do not choose Athena for every high-concurrency dashboard workload without performance planning.
Do not dump raw CSV into S3 and call it a data lake.
Do not ignore small files, partitions, and schema evolution.
Do not give broad bucket access when table or prefix-level access is needed.
Do not forget lifecycle policies for raw, log, and archive data.
12. Related Topics
Review Amazon S3, AWS Glue, Amazon Athena, Amazon Redshift, and Amazon QuickSight.
Official AWS references:
What to study next
These links keep the session moving: read prerequisites first, then open the systems, concepts, and patterns that deepen this page.
Prerequisites
Read these first if the mechanics feel unfamiliar.
More Links
Additional references connected to this page.