AWS Scenarios

Analytics Data Lake On S3

Design an AWS analytics data lake using S3, Glue, Athena, Redshift, Kinesis, QuickSight, IAM, partitioning, lifecycle, and cost controls for SAA-C03 scenarios.

intermediate5 min readUpdated 2026-06-03CloudCertificationDataCostOperations

Amazon S3 Data LakeAWS Glue Data CatalogAmazon AthenaAmazon RedshiftAmazon Kinesis Data StreamsAmazon QuickSightPartitioningColumnar Formats

After this, you will understand

This scenario makes analytics service choices practical: raw storage, catalog, ad hoc SQL, warehouse queries, streaming ingestion, and dashboards.

Plain version

Store durable data in S3, catalog it with Glue, query it with Athena or Redshift, and visualize it with QuickSight.

Decision pressure

Learners pick Redshift for every query, Athena for every dashboard, or S3 alone without catalog, partitioning, security, and lifecycle design.

Exam-ready model

Use S3 as the durable lake, organize data into zones, catalog tables, optimize file layout, and choose Athena or Redshift based on query and performance requirements.

Think before readingWhat is the simplest Athena cost lever in a data lake?

Reduce scanned data with partitioning, compression, columnar formats, and narrower queries.

Reading in progress

This page is saved in your local study history so you can continue later.

Next: Backup vs Replication Recovery Design

Study path

Read these in order

Start with the mechanics, then move into the patterns that explain why the system is shaped this way.

1On-Premises Migration To AWSaws-scenarios

Concepts Covered

S3 as a data lake foundation
Raw, curated, and analytics zones
Glue Data Catalog and crawlers
Athena serverless SQL
Redshift data warehouse
Kinesis streaming ingestion
QuickSight dashboards
Partitioning and Parquet
Lake security and cost controls
Analytics exam traps

An organization wants to collect application logs, transaction exports, clickstream events, and operational data in AWS. Analysts need ad hoc SQL. Business users need dashboards. Some data arrives in batches and some arrives as a stream.

The data should be durable, cost-effective, and usable by multiple analytics tools without every team copying it into separate databases.

The common architecture is:

sources -> S3 data lake -> Glue catalog -> Athena, Redshift, QuickSight

The exact services depend on query frequency, latency, data format, governance, and cost.

2. Naive Design

The naive design sends all data directly into one relational database or one Redshift cluster:

all sources -> warehouse -> every query and dashboard

That may work at first, but it mixes raw retention, transformation, ad hoc exploration, dashboards, and warehouse performance into one system.

Another naive design dumps files into S3 with no structure:

bucket -> random CSV files -> everyone guesses paths

That is cheap storage, not a usable data lake. Without cataloging, partitions, file formats, permissions, lifecycle, and ownership, the lake becomes a messy archive.

3. What Breaks

Raw CSV files become expensive to query because Athena scans too much data.

Dashboards become slow if every refresh runs broad ad hoc queries over unoptimized objects.

Data ownership breaks when every team writes different schemas into the same prefix.

Security breaks when analysts receive broad bucket access instead of table, prefix, or role-scoped access.

Cost breaks when lifecycle is missing, crawlers run too often, logs are retained forever in hot storage, and Redshift capacity stays running for occasional queries.

The data lake needs architecture, not just a bucket.

4. AWS Architecture

Use S3 as the durable storage foundation. Organize data into zones:

raw -> cleaned -> curated -> analytics

Use prefixes that reflect domain and partition keys, such as date, source, Region, or tenant where appropriate.

Use AWS Glue Data Catalog to define tables over S3 data. Use Glue crawlers or explicit table definitions depending on schema control needs.

Use Athena for serverless ad hoc SQL over S3, especially when queries are occasional or exploratory.

Use Redshift when the workload needs a data warehouse with consistent high-performance analytics, complex joins, dashboard concurrency, and managed warehouse features.

Use Kinesis Data Streams or Firehose-style ingestion patterns when data arrives continuously.

Use QuickSight for dashboards and business intelligence.

5. Request Or Data Flow

Batch systems export files to S3 raw prefixes. Streaming producers send events to Kinesis and then to S3 or downstream processing.

Glue crawlers or jobs update metadata in the Data Catalog. ETL jobs convert raw data into curated formats such as Parquet, partitioned by useful keys.

Analysts query curated tables with Athena. Dashboards read from Athena, Redshift, or prepared datasets depending on performance requirements.

Redshift can load curated data, query external data through Spectrum-style patterns, or use zero-ETL integrations where supported by service-specific docs.

CloudWatch, CloudTrail, S3 access logs, and cost tools provide operational and governance visibility.

6. Security Controls

Keep S3 buckets private. Use IAM roles, bucket policies, KMS encryption, and least privilege.

Separate raw and curated zones with permissions. Not every analyst needs raw sensitive data.

Use Lake Formation where centralized data lake permissions and fine-grained access control are required.

Encrypt data at rest with S3 encryption and KMS where needed. Encrypt data in transit.

Use CloudTrail data events selectively for sensitive buckets when object-level audit is required, understanding the cost.

Avoid putting secrets or personal data into logs without governance.

7. Resilience Controls

S3 provides durable object storage across multiple Availability Zones for general purpose buckets.

Use versioning for important datasets where accidental overwrite or deletion risk matters.

Use replication when compliance, regional resilience, or data locality requirements justify extra cost.

For streaming ingestion, design retry and buffering behavior. A failed consumer should not lose events.

For ETL, make jobs restartable and idempotent. A failed transformation should not create duplicate curated records.

Keep raw data long enough to reprocess if transformation bugs are discovered.

8. Performance Controls

Use columnar formats such as Parquet or ORC for analytics tables.

Partition data on common query filters, such as date or tenant, but avoid overly granular partitions that create metadata pain.

Compress files and compact small files. Thousands of tiny files can slow query planning and increase request overhead.

Use Athena workgroups to control query settings and scanned data.

Use Redshift for repeated, complex, high-concurrency analytical workloads where a warehouse is justified.

Use QuickSight extracts or prepared datasets when dashboard performance matters.

9. Cost Controls

S3 cost depends on storage class, requests, lifecycle, retrieval, replication, and data transfer.

Athena charges mainly by data scanned, so file format and partitioning are direct cost controls.

Glue crawlers and ETL jobs cost based on usage. Do not crawl huge datasets unnecessarily.

Redshift costs are different: provisioned or serverless warehouse capacity, storage, snapshots, data transfer, and concurrency features.

Move cold raw data to lower-cost storage classes when retrieval requirements allow it.

Use Cost Explorer, CUR/Data Exports, and budgets to track analytics spend.

10. Exam Variants

"Serverless SQL over data in S3" points to Athena.

"Metadata catalog for S3 datasets" points to Glue Data Catalog.

"ETL jobs and crawlers" point to AWS Glue.

"Data warehouse for complex analytics and dashboards" points to Redshift.

"Streaming ingestion" can point to Kinesis Data Streams or related streaming services.

"Reduce Athena query cost" points to partitioning, compression, columnar formats, and limiting scanned data.

11. Common Traps

Do not choose Redshift for occasional ad hoc S3 queries if Athena is enough.

Do not choose Athena for every high-concurrency dashboard workload without performance planning.

Do not dump raw CSV into S3 and call it a data lake.

Do not ignore small files, partitions, and schema evolution.

Do not give broad bucket access when table or prefix-level access is needed.

Do not forget lifecycle policies for raw, log, and archive data.

Review Amazon S3, AWS Glue, Amazon Athena, Amazon Redshift, and Amazon QuickSight.

Official AWS references:

What to study next

These links keep the session moving: read prerequisites first, then open the systems, concepts, and patterns that deepen this page.

Prerequisites

Read these first if the mechanics feel unfamiliar.

Amazon S3Start here if Amazon S3 is still fuzzy.AWS GlueStart here if AWS Glue is still fuzzy.Amazon AthenaStart here if Amazon Athena is still fuzzy.