AWS Glue
Understand AWS Glue for serverless data integration, including Data Catalog, crawlers, ETL jobs, Glue Studio, DataBrew, schemas, and SAA-C03 signals.
After this, you will understand
Glue teaches the missing middle of analytics: data must be discovered, cataloged, transformed, and cleaned before queries become useful.
AWS Glue is a serverless data integration service with a Data Catalog, crawlers, and ETL jobs for preparing analytics data.
A common misconception: learners expect S3 files to become queryable and trustworthy without metadata, schema discovery, transformations, or pipeline jobs.
Use Glue to catalog data, discover schemas, run ETL jobs, and prepare S3 or warehouse datasets for Athena, Redshift, and BI tools.
Think before reading: What is the difference between Glue and Athena?
Concepts Covered
- AWS Glue
- Data Catalog
- Crawlers
- ETL jobs
- Glue Studio
- DataBrew
- Schema discovery
- S3 data lakes
- Glue versus Athena, Redshift, and Kinesis
- SAA-C03 analytics traps
1. Plain-English Mental Model
AWS Glue is serverless data integration for analytics.
The simple model is:
raw data -> Glue catalog and ETL -> curated data -> Athena, Redshift, QuickSight
Data does not become useful just because it lands in S3. It needs metadata, schemas, formats, partitions, transformations, cleanup, and sometimes enrichment.
Glue provides a central Data Catalog, crawlers that discover schemas, and ETL jobs that move and transform data.
Glue is not the query engine. Athena queries. Redshift warehouses. QuickSight visualizes. Glue prepares and describes the data.
2. Why This Service Exists
Analytics systems need data plumbing.
Files arrive in S3 with messy formats. Columns change. Logs need partitioning by date. CSV data should become Parquet. Operational data needs to be extracted, transformed, and loaded into a warehouse. Analysts need table metadata instead of guessing file paths.
Glue exists to reduce custom data integration work.
For SAA-C03, Glue appears in questions about ETL, serverless data preparation, crawlers, schema discovery, Data Catalog, data lake metadata, transforming S3 files for Athena or Redshift, and creating reusable metadata tables.
The exam boundary: Glue prepares and catalogs data. Athena queries S3 data. Redshift is the warehouse. Kinesis streams data. QuickSight visualizes data.
3. The Naive Approach And Where It Breaks
The naive pattern is a pile of files:
S3 bucket -> many folders -> analysts guess schemas
This breaks when file formats differ, columns drift, partitions are missing, and no one knows what a dataset means.
Another naive pattern is hand-written ETL scripts on EC2. That creates server patching, dependency management, scheduling, retry, logging, and scaling work.
Another mistake is using Glue crawlers and assuming data quality is solved. Crawlers discover metadata. They do not automatically make dirty or semantically inconsistent data trustworthy.
Glue gives you tools for catalog and transformation, but data modeling and pipeline ownership still matter.
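The gap between schema discovery and data quality can be sketched in plain Python. This does not call Glue. `infer_column_type` is a hypothetical, simplified stand-in for the kind of type inference a crawler performs, and the sample rows are invented for illustration.

```python
from typing import Dict, List


def infer_column_type(values: List[str]) -> str:
    """Hypothetical, simplified inference: 'bigint' if every value parses as an integer, else 'string'."""
    def is_int(text: str) -> bool:
        try:
            int(text)
            return True
        except ValueError:
            return False

    return "bigint" if all(is_int(v) for v in values) else "string"


def find_invalid_rows(rows: List[Dict[str, str]], column: str) -> List[int]:
    """Return the indexes of rows whose value in `column` is not a non-negative integer."""
    return [i for i, row in enumerate(rows) if not row[column].isdigit()]


# Invented sample data: a single dirty value changes the inferred type.
rows = [
    {"order_id": "1001", "quantity": "2"},
    {"order_id": "1002", "quantity": "N/A"},
    {"order_id": "1003", "quantity": "-5"},
]

inferred = infer_column_type([r["quantity"] for r in rows])
invalid = find_invalid_rows(rows, "quantity")
print(inferred)  # string
print(invalid)   # [1, 2]
```

Inference produces a schema that describes the files as they are. Deciding that "N/A" and "-5" are unacceptable quantities is a business rule that someone still has to write and own.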
4. Core Primitives
The AWS Glue Data Catalog stores metadata about databases, tables, schemas, partitions, and data locations.
A crawler scans data stores and creates or updates catalog metadata.
An ETL job runs transformation code, typically as a Spark job or a Python shell job, depending on the job type you choose.
Glue Studio provides a visual interface for creating and managing ETL jobs.
Triggers and workflows orchestrate Glue jobs.
Connections define access to data stores such as JDBC sources.
DataBrew supports visual, no-code data preparation, such as profiling, cleaning, and normalizing datasets.
Schema Registry can support schema management for streaming data integrations.
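To show how a crawler ties the primitives together, here is a minimal sketch that assembles the keyword arguments for boto3's Glue `create_crawler` call. The crawler name, role ARN, database name, S3 path, and schedule are all assumptions made up for illustration. The sketch only builds the request so it runs without AWS credentials.

```python
from typing import Any, Dict


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Assemble keyword arguments in the shape boto3's Glue `create_crawler` accepts."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,                       # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # data store to scan
        "Schedule": "cron(0 2 * * ? *)",                # assumed schedule: daily at 02:00 UTC
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",     # update table definitions when the schema changes
            "DeleteBehavior": "LOG",                    # log, rather than delete, when objects disappear
        },
    }


# Hypothetical names and ARN.
request = build_crawler_request(
    name="raw-logs-crawler",
    role_arn="arn:aws:iam::111122223333:role/ExampleGlueCrawlerRole",
    database="raw_logs",
    s3_path="s3://example-raw-bucket/logs/",
)
print(request["Targets"])

# With boto3 installed and credentials configured, the request could be submitted as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

The crawler reads from a data store, assumes an IAM role, and writes table metadata into a catalog database. The ETL job, trigger, and connection primitives are configured separately.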
5. Architecture Use Cases
Use Glue crawlers to discover S3 data lake tables for Athena.
Use Glue ETL to convert raw CSV or JSON logs into partitioned Parquet:
raw S3 zone -> Glue job -> curated S3 zone -> Athena and QuickSight
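The "partitioned" part of that flow usually means Hive-style key=value folders in the curated zone. A minimal sketch of deriving such a prefix from a record date follows. The bucket name and the year/month/day layout are assumptions, not something Glue requires.

```python
from datetime import date


def partition_prefix(base: str, day: date) -> str:
    """Build a Hive-style partition prefix (key=value folders) for a given date."""
    return f"{base.rstrip('/')}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


# Hypothetical curated-zone location.
prefix = partition_prefix("s3://example-curated-bucket/logs", date(2024, 1, 15))
print(prefix)  # s3://example-curated-bucket/logs/year=2024/month=01/day=15/
```

Query engines that understand this layout can skip whole folders when a query filters on year, month, or day.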
Use Glue to extract from databases, transform data, and load into Redshift.
Use Glue Data Catalog as the common metadata store for Athena, Redshift Spectrum, EMR, and other analytics tools.
Use scheduled Glue jobs for recurring batch pipelines.
Use workflows to coordinate multiple dependent ETL steps.
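The idea of dependent ETL steps can be sketched as a dependency graph. This is not the Glue workflow API. It uses Python's standard `graphlib` only to show the ordering problem a workflow solves, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job names. Each key maps to the set of jobs that must finish first.
dependencies = {
    "crawl_raw": set(),
    "convert_to_parquet": {"crawl_raw"},
    "crawl_curated": {"convert_to_parquet"},
    "load_redshift": {"convert_to_parquet"},
}

# static_order() yields every job after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two final steps do not depend on each other, so their relative order is not fixed and they could run in parallel.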
7. Security Model
Glue security depends on IAM roles, S3 permissions, Data Catalog permissions, network connections, KMS, and Lake Formation where used.
Glue jobs need permissions to read source data, write target data, access catalog metadata, and use KMS keys.
If jobs connect to private databases, VPC networking, security groups, subnet choices, and secrets matter.
The Data Catalog can reveal datasets and schema details. Restrict catalog access when datasets are sensitive.
Job logs may reveal data samples, paths, or errors containing sensitive values. Protect CloudWatch Logs.
Use least privilege for job roles, crawlers, and human operators.
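As a rough illustration of least privilege for a job role, the sketch below builds an IAM policy document as a Python dictionary. The bucket names, account ID, and key ID are hypothetical, and the statement list is deliberately incomplete: a real job role typically also needs permissions such as listing buckets and writing logs.

```python
import json

# Hypothetical resources; replace with real bucket names and key ARNs.
job_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadRawData",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-raw-bucket/*",
        },
        {
            "Sid": "WriteCuratedData",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::example-curated-bucket/*",
        },
        {
            "Sid": "UseCatalog",
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions", "glue:BatchCreatePartition"],
            # Shown as "*" for brevity; narrow to specific catalog, database, and table ARNs in practice.
            "Resource": "*",
        },
        {
            "Sid": "UseKmsKey",
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
        },
    ],
}

print(json.dumps(job_role_policy, indent=2))
```

Each statement maps to one need named above: read the source, write the target, use the catalog, and use the KMS key.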
8. Reliability And Resilience
Glue jobs should be designed as repeatable pipelines.
If a job fails halfway, it should be safe to retry without duplicating or corrupting data.
Partition registration should match data writes. If partitions are missing, Athena queries may miss data.
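One way to keep the catalog in step with writes is to register each new partition right after the job writes it. The sketch below only builds an Athena DDL string. The table name, the `dt` partition key, and the bucket are assumptions, and the inputs are interpolated directly, so it is suitable only for trusted, job-generated values.

```python
def add_partition_statement(table: str, dt: str, base_location: str) -> str:
    """Build an Athena DDL statement that registers one date partition in the catalog."""
    location = f"{base_location.rstrip('/')}/dt={dt}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{dt}') LOCATION '{location}'"
    )


# Hypothetical table and location.
statement = add_partition_statement("orders", "2024-01-15", "s3://example-curated-bucket/orders")
print(statement)
```

`IF NOT EXISTS` makes the statement safe to run again if the job is retried. Re-running a crawler is another way to pick up new partitions.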
Schema changes need governance. A crawler can detect change, but downstream queries or dashboards may break.
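A simple governance check is to compare the previous and current table schemas before downstream consumers see the change. This is a minimal sketch with invented column names and types. It compares plain dictionaries rather than reading from the Data Catalog.

```python
from typing import Dict, List


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two column-name -> type mappings and report what changed."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }


# Invented schemas for illustration.
previous = {"order_id": "bigint", "amount": "double", "region": "string"}
current = {"order_id": "string", "amount": "double", "channel": "string"}

drift = diff_schemas(previous, current)
print(drift)  # {'added': ['channel'], 'removed': ['region'], 'retyped': ['order_id']}
```

Added columns are often harmless. Removed or retyped columns are the ones most likely to break existing queries and dashboards, so a pipeline might alert or stop on those.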
Use bookmarks or idempotent job design where appropriate to process incremental data safely.
Monitor job failures, duration, data volume, and output quality.
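Retry safety can be sketched with a deterministic output key. The example uses an in-memory dictionary as a stand-in for an S3 bucket, and the key layout is an assumption. It illustrates the idempotent-design option only. Job bookmarks are a separate Glue feature and are not shown.

```python
from typing import Dict, List

# In-memory stand-in for an S3 bucket: object key -> list of records.
fake_bucket: Dict[str, List[dict]] = {}


def write_partition(bucket: Dict[str, List[dict]], run_date: str, records: List[dict]) -> str:
    """Write one day's output to a key derived only from the partition, replacing any earlier attempt."""
    key = f"curated/orders/dt={run_date}/part-00000.json"
    bucket[key] = list(records)
    return key


records = [{"order_id": 1}, {"order_id": 2}]
write_partition(fake_bucket, "2024-01-15", records)
write_partition(fake_bucket, "2024-01-15", records)  # simulated retry after a failure

print(len(fake_bucket))                                                  # 1
print(len(fake_bucket["curated/orders/dt=2024-01-15/part-00000.json"]))  # 2
```

Because the key depends only on the partition, the retry overwrites the first attempt instead of adding a second copy. A key that included a timestamp or random suffix would leave duplicates behind.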
9. Performance And Scaling
Glue ETL performance depends on data size, file format, partitioning, worker type, worker count, transformation logic, and source systems.
Converting row-oriented text files to columnar formats like Parquet can improve Athena and Redshift Spectrum performance.
Many small files can hurt downstream query performance.
Glue is serverless, but not magic. Bad transformations can still be slow or expensive.
Tune job resources and use partitioning to avoid scanning unnecessary data.
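The small-files problem is often handled by compacting many small inputs into fewer, larger outputs. The sketch below shows one possible greedy grouping. The 128 MB target and the file sizes are assumptions for illustration, not Glue defaults.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 128.0) -> List[List[float]]:
    """Greedily group consecutive files so each batch stays at or under the target size where possible."""
    batches: List[List[float]] = []
    current: List[float] = []
    current_size = 0.0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


sizes = [40.0, 50.0, 30.0, 60.0, 70.0, 10.0]  # invented input file sizes in MB
batches = plan_compaction(sizes)
print(batches)  # [[40.0, 50.0, 30.0], [60.0], [70.0, 10.0]]
```

Six input files become three output files. Downstream engines then open fewer objects per query.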
10. Cost Model
Glue cost depends on crawler runtime, ETL job resources and duration, interactive sessions, DataBrew usage, and related services such as S3, CloudWatch Logs, and KMS.
A crawler that runs too often over huge datasets can cost more than expected.
ETL jobs that transform raw data into efficient formats can reduce Athena and Redshift costs later.
The cost model is pipeline-wide: ingestion, transformation, storage, query, monitoring, and reprocessing.
Do not optimize the ETL job alone if it causes expensive downstream queries.
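The pipeline-wide view can be made concrete with simple arithmetic. Every number below is an assumed placeholder, including the per-DPU-hour and per-terabyte rates and the tenfold reduction in data scanned. Check current AWS pricing before relying on any of them.

```python
# All rates and volumes are assumptions for illustration, not quoted prices.
DPU_HOUR_RATE = 0.44      # assumed USD per DPU-hour for an ETL job
QUERY_RATE_PER_TB = 5.00  # assumed USD per TB scanned by queries


def etl_cost(dpus: int, hours: float, rate: float = DPU_HOUR_RATE) -> float:
    """ETL cost = DPUs x hours x rate."""
    return dpus * hours * rate


def query_cost(tb_scanned_per_query: float, queries: int, rate: float = QUERY_RATE_PER_TB) -> float:
    """Query cost = TB scanned per query x number of queries x rate."""
    return tb_scanned_per_query * queries * rate


# Option A: no ETL, 20 daily queries each scan 1 TB of raw CSV.
cost_without_etl = query_cost(1.0, 20)

# Option B: a 10-DPU job runs 30 minutes, and queries then scan an assumed 0.1 TB of Parquet.
cost_with_etl = etl_cost(dpus=10, hours=0.5) + query_cost(0.1, 20)

print(round(cost_without_etl, 2))  # 100.0
print(round(cost_with_etl, 2))     # 12.2
```

Under these assumptions the ETL job adds a small daily cost and removes a much larger query cost. With different volumes or query counts the balance can shift, which is why the comparison needs to cover the whole pipeline.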
12. SAA-C03 Exam Signals
"Serverless ETL" points to AWS Glue.
"Crawler discovers schema" points to AWS Glue.
"Central metadata catalog for S3 data lake" points to Glue Data Catalog.
"Prepare data for Athena or Redshift" points to Glue.
"Run SQL query over S3" points to Athena.
"Data warehouse" points to Redshift.
"Real-time stream ingestion" points to Kinesis.
13. Common Exam Traps
Do not confuse Glue with Athena.
Do not assume crawlers fix data quality.
Do not run custom EC2 ETL when serverless Glue is the lower-operational-overhead answer.
Do not forget S3 and KMS permissions for job roles.
Do not ignore partitioning and file format.
Do not let schema drift silently break downstream dashboards.
15. Related Topics
Review Amazon Athena, Amazon Redshift, Amazon S3, Amazon Kinesis Data Streams, and Amazon QuickSight.