Glue Crawler vs Data Catalog vs ETL Jobs
Compare AWS Glue crawlers, the AWS Glue Data Catalog, and Glue ETL jobs for schema discovery, metadata management, transformations, table definitions, partitions, and analytics pipelines.
After this, you will understand
Glue becomes much less abstract when learners separate the metadata store, the scanner that fills it, and the jobs that transform data.
The Data Catalog stores metadata, crawlers discover schemas and update catalog tables, and ETL jobs transform data between sources and targets.
Common mistakes: expecting crawlers to clean data, expecting the catalog to move bytes, or querying with Athena without both catalog and S3 permissions.
Name the job first: discover metadata, store metadata, or transform data.
Think before reading: What does a Glue crawler actually create or update?
Concepts Covered
- AWS Glue Data Catalog
- Glue crawlers
- Classifiers
- Table metadata
- Partitions
- ETL jobs
- Sources and targets
- Schema drift
- Athena integration
- SAA-C03 Glue traps
1. Plain-English Mental Model
Glue is easier to learn when its parts are separated.
Data Catalog = metadata store
Crawler = scanner that discovers schema and updates metadata
ETL job = transformation job that reads, changes, and writes data
The Data Catalog does not move data. A crawler does not clean dirty data. An ETL job does the transformation work.
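The three roles can be sketched as a toy model in plain Python. This is an illustration only: `catalog`, `crawl`, and `etl` are hypothetical names, not Glue APIs, and the type inference is far cruder than a real classifier.

```python
import csv
import io

# Hypothetical in-memory stand-ins; none of these names are Glue APIs.
catalog = {}  # "Data Catalog": table name -> metadata only, never rows


def crawl(table_name, location, sample_csv):
    """'Crawler': infer column names and rough types, then record metadata."""
    rows = list(csv.DictReader(io.StringIO(sample_csv)))
    columns = {}
    for name in rows[0]:
        values = [r[name] for r in rows]
        is_int = all(v.strip().lstrip("-").isdigit() for v in values)
        columns[name] = "bigint" if is_int else "string"
    catalog[table_name] = {"location": location, "format": "csv", "columns": columns}


def etl(sample_csv):
    """'ETL job': read rows, clean them, and return new rows to write."""
    rows = csv.DictReader(io.StringIO(sample_csv))
    return [
        {"user": r["user"].strip().lower(), "amount": int(r["amount"])}
        for r in rows
        if r["amount"].strip().lstrip("-").isdigit()  # drop unparseable amounts
    ]


raw = "user,amount\n Alice ,10\nBOB,oops\ncarol,7\n"
crawl("orders", "s3://example-bucket/raw/orders/", raw)
curated = etl(raw)
```

In this sketch the crawler records `amount` as a string because one dirty record is present. It describes the data as it is; only the ETL step drops or repairs records.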
This distinction matters because AWS exam questions often ask for "discover schema", "query S3 with Athena", "convert CSV to Parquet", or "store central table metadata." Each phrase points to a different Glue component.
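As a study aid, the phrase-to-component habit can be written as a small lookup. The keyword lists below are assumptions chosen for this sketch, not an official mapping.

```python
# Hypothetical keyword lists; first match wins, so order matters.
SIGNALS = [
    (("discover schema", "populate"), "Glue crawler"),
    (("metadata store", "table metadata"), "Glue Data Catalog"),
    (("transform", "convert", "clean", "join", "enrich"), "Glue ETL job"),
    (("run sql", "query s3"), "Athena"),
]


def component_for(requirement):
    """Return the component a requirement phrase most likely points to."""
    text = requirement.lower()
    for keywords, component in SIGNALS:
        if any(keyword in text for keyword in keywords):
            return component
    return "unclear: re-read the requirement"


answers = [
    component_for("Discover schema from S3 data"),
    component_for("Store central table metadata"),
    component_for("Convert CSV to Parquet"),
    component_for("Query S3 with Athena"),
]
```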
2. Why This Service Exists
Data lakes need more than files.
S3 can store raw data, but query engines need to know the schema, file locations, formats, partitions, and table names. Analysts need to discover datasets without guessing folder structure. Pipelines need repeatable transforms from raw to curated data.
Glue exists to provide serverless data integration around those needs.
The Data Catalog provides persistent metadata. Crawlers can populate or update that metadata by scanning data stores. ETL jobs can use the catalog as source and target metadata while transforming actual data.
3. The Naive Approach And Where It Breaks
The naive data lake is just a bucket:
S3 raw files -> everyone guesses paths and columns
That breaks when schemas change, partitions are missing, files are inconsistent, and query engines cannot infer the right table shape.
Another naive approach is to run a crawler and assume the data is now trustworthy. Crawlers infer metadata. They do not validate business meaning, clean bad records, deduplicate, or transform file formats by themselves.
A third mistake is thinking the Data Catalog stores the dataset. It stores metadata about the dataset. The data remains in S3, JDBC sources, or other data stores.
4. Core Primitives
The AWS Glue Data Catalog stores databases, tables, schemas, partitions, locations, and metadata properties.
A crawler scans data stores, uses classifiers to infer schema and format, and creates or updates catalog tables and partitions.
Classifiers recognize file formats such as JSON, CSV, and Avro. When the built-in classifiers do not match the data, a custom classifier can be defined.
An ETL job runs code or visual job definitions that read from sources, transform data, and write to targets.
Glue connections provide network and credential details for certain data sources.
Glue workflows and triggers coordinate job execution.
5. Architecture Use Cases
Use the Data Catalog as a shared metadata layer for S3 data lake tables:
S3 dataset -> Glue Data Catalog table -> Athena query
Use crawlers when data arrives in known stores and schema discovery or partition updates should be automated.
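A crawler definition might look roughly like the request below. It is a minimal sketch built for boto3's `create_crawler` call; the crawler name, role ARN, database, bucket path, and schedule are all placeholder assumptions, and the request is only constructed here, not sent.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for boto3's glue.create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update table definitions when the schema changes, but only log
        # objects that disappear instead of deleting their tables.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }
    if schedule:
        request["Schedule"] = schedule
    return request


# All values below are placeholders for illustration.
request = build_crawler_request(
    name="orders-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    database="analytics_raw",
    s3_path="s3://example-bucket/raw/orders/",
    schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)

# With credentials configured, the request could be submitted like this:
# import boto3
# boto3.client("glue").create_crawler(**request)
```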
Use ETL jobs to transform raw data into curated data:
raw JSON or CSV -> Glue ETL -> partitioned Parquet -> Athena or Redshift Spectrum
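The shape of that transform can be sketched in plain Python. A real Glue job would typically run on Spark and write Parquet; this sketch only shows the clean-and-partition logic, and the column names (`order_id`, `order_date`, `amount`) and the `curated/orders/` prefix are assumptions.

```python
import csv
import io
from collections import defaultdict


def transform(raw_csv):
    """Parse, clean, and group rows by a Hive-style partition prefix."""
    partitions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop records whose amount cannot be parsed
        year, month, _day = row["order_date"].split("-")
        prefix = f"curated/orders/year={year}/month={month}/"
        partitions[prefix].append({"order_id": row["order_id"], "amount": amount})
    return dict(partitions)


raw = (
    "order_id,order_date,amount\n"
    "1,2024-01-15,9.50\n"
    "2,2024-01-20,bad\n"
    "3,2024-02-01,3\n"
)
curated = transform(raw)
```

Each key in `curated` corresponds to one output partition, which is the layout that lets Athena or Redshift Spectrum skip partitions a query does not need.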
Use manual table definitions when schema control matters more than automatic inference.
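A manually defined table could be expressed as a `TableInput` structure for boto3's `create_table` call. This is a sketch for a Parquet table on S3; the table name, location, and columns are placeholder assumptions, and nothing is sent to AWS.

```python
def parquet_table_input(name, location, columns, partition_keys):
    """Build a TableInput dict for boto3's glue.create_table call."""
    def as_columns(pairs):
        return [{"Name": col, "Type": col_type} for col, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Placeholder table for illustration.
table_input = parquet_table_input(
    name="orders",
    location="s3://example-bucket/curated/orders/",
    columns=[("order_id", "string"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)

# With credentials configured:
# import boto3
# boto3.client("glue").create_table(DatabaseName="analytics", TableInput=table_input)
```

Because the columns and types are written down explicitly, the schema changes only when someone edits this definition, not when a crawler infers something new.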
Use Glue jobs before Redshift loads when data needs cleaning, normalization, or format changes.
7. Security Model
Glue security includes IAM roles, Data Catalog permissions, S3 permissions, KMS keys, Lake Formation where used, and network access for private sources.
A crawler role needs permission to read the data source and write metadata into the Data Catalog.
An ETL job role needs permission to read sources, write targets, access catalog metadata, write logs, and use KMS keys.
Athena users need permission to read catalog metadata, read S3 data, and write query results.
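The three Athena permission groups can be made concrete with an IAM policy document. The sketch below is illustrative and not exhaustive: bucket names and the prefix are placeholders, and the `"Resource": "*"` entries are a simplification that a production policy would normally narrow.

```python
import json


def athena_reader_policy(data_bucket, data_prefix, results_bucket):
    """Build an illustrative IAM policy for querying one S3 dataset with Athena."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # run queries and fetch their results
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                ],
                "Resource": "*",
            },
            {   # read catalog metadata
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                "Resource": "*",
            },
            {   # read the underlying data objects
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{data_bucket}/{data_prefix}*",
            },
            {   # list both buckets
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [
                    f"arn:aws:s3:::{data_bucket}",
                    f"arn:aws:s3:::{results_bucket}",
                ],
            },
            {   # write and read query results
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{results_bucket}/*",
            },
        ],
    }


policy = athena_reader_policy("example-data-bucket", "curated/orders/", "example-athena-results")
policy_json = json.dumps(policy, indent=2)
```

Removing any one group reproduces a familiar failure: the query cannot find the table, cannot read the files, or cannot save its output.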
Catalog metadata can reveal sensitive dataset names and schemas. Protect metadata as well as data.
8. Reliability And Resilience
Catalog correctness is operationally important. If partitions are missing, Athena may not see new data. If schema inference changes unexpectedly, downstream jobs can fail.
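One way to reason about missing partitions is to compare what exists in storage with what the catalog has registered. The sketch below assumes Hive-style `key=value` prefixes and takes plain lists as input; fetching the real object keys and registered partitions is left out.

```python
def partition_of(key, table_prefix):
    """Extract Hive-style partition values, e.g. ('2024', '02'), from an object key."""
    relative = key[len(table_prefix):]
    parts = [p for p in relative.split("/")[:-1] if "=" in p]
    return tuple(p.split("=", 1)[1] for p in parts)


def missing_partitions(object_keys, registered, table_prefix):
    """Return partitions that have data in storage but no catalog entry."""
    found = {
        partition_of(key, table_prefix)
        for key in object_keys
        if key.startswith(table_prefix)
    }
    found.discard(())  # ignore objects that sit outside any partition folder
    return sorted(found - set(registered))


keys = [
    "curated/orders/year=2024/month=01/part-0.parquet",
    "curated/orders/year=2024/month=02/part-0.parquet",
]
gaps = missing_partitions(keys, [("2024", "01")], "curated/orders/")
```

Here `gaps` holds the February partition: the data is in S3, but a query through the catalog would not see it until that partition is registered.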
Crawlers should run on a schedule that matches data arrival and cost expectations.
ETL jobs should be idempotent. A failed retry should not duplicate output data or corrupt curated partitions.
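A common way to get idempotent output is to replace a whole partition on every run instead of appending to it. The sketch below uses a dict as a stand-in for object storage; `write_partition` and the key names are hypothetical.

```python
def write_partition(store, prefix, run_rows):
    """Replace everything under a partition prefix with this run's output."""
    for key in [k for k in store if k.startswith(prefix)]:
        del store[key]
    store[prefix + "part-00000"] = list(run_rows)


store = {}  # stand-in for an S3 bucket: object key -> rows
rows = [{"order_id": "1", "amount": 9.5}, {"order_id": "3", "amount": 3.0}]
prefix = "curated/orders/year=2024/month=01/"

write_partition(store, prefix, rows)
write_partition(store, prefix, rows)  # simulated retry of the same run

total_rows = sum(len(v) for v in store.values())
```

After the retry the partition still holds one file with two rows. An append-style write would have left four.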
Schema drift needs governance. Automatic updates can be useful, but silent schema changes can break dashboards and downstream pipelines.
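A simple governance check is to diff the previous and newly inferred schemas before accepting an update. This is a sketch under the assumption that added columns are safe and removed or retyped columns are breaking; real pipelines may draw that line differently.

```python
def schema_diff(old, new):
    """Compare two {column: type} mappings."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    retyped = {c: (old[c], new[c]) for c in old if c in new and old[c] != new[c]}
    return {"added": added, "removed": removed, "retyped": retyped}


def is_breaking(diff):
    """Treat removed or retyped columns as changes that need review."""
    return bool(diff["removed"] or diff["retyped"])


previous = {"order_id": "string", "amount": "double"}
inferred = {"order_id": "string", "amount": "string", "coupon": "string"}

diff = schema_diff(previous, inferred)
needs_review = is_breaking(diff)
```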
Monitor crawler runs, job failures, output row counts, data freshness, and partition updates.
9. Performance And Scaling
The Data Catalog affects query planning and discovery, but it does not process data by itself.
Crawlers can be expensive or slow if they repeatedly scan huge datasets unnecessarily.
ETL job performance depends on worker type, worker count, file size, partitioning, transformation logic, and source-system limits.
Good Glue pipelines improve Athena and Redshift performance by converting data to columnar formats, compacting tiny files, and registering useful partitions.
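Compaction planning can be sketched as grouping small files into batches near a target size, with each batch rewritten as one larger file. The greedy strategy and the sizes below are assumptions for illustration; an actual job would let the processing engine handle the rewrite.

```python
def plan_compaction(files, target_bytes):
    """Greedily group (name, size) pairs into batches of at most target_bytes.

    A single file larger than target_bytes still gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [("a", 60), ("b", 50), ("c", 40), ("d", 30)]
batches = plan_compaction(small_files, target_bytes=100)
```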
Poor metadata and many small raw files can make downstream queries slow and expensive.
10. Cost Model
Glue costs include crawler runtime, ETL job resources and duration, interactive sessions, Data Catalog storage and requests where applicable, logs, and related S3/KMS charges.
Crawlers that run too often over unchanged data can waste money.
ETL jobs can save money downstream by reducing Athena scanned bytes or Redshift load cost.
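The downstream saving can be estimated with simple arithmetic on bytes scanned. Every number below is an assumption for illustration: the price per TB is a placeholder to replace with current Athena pricing, and the size reduction and scanned fraction depend heavily on the data and the queries.

```python
TB = 1024 ** 4


def monthly_scan_cost(dataset_bytes, queries_per_month, scanned_fraction, price_per_tb):
    """Estimate monthly query cost from bytes scanned per query."""
    scanned = dataset_bytes * scanned_fraction * queries_per_month
    return scanned / TB * price_per_tb


PRICE_PER_TB = 5.0  # placeholder price, not a quoted rate

# Raw CSV: every query reads the full 2 TB dataset.
raw_cost = monthly_scan_cost(2 * TB, 100, 1.0, PRICE_PER_TB)

# Curated Parquet: assume the data shrinks to 0.5 TB and that partition and
# column pruning leave 10% of it to scan per query.
curated_cost = monthly_scan_cost(TB // 2, 100, 0.1, PRICE_PER_TB)

savings = raw_cost - curated_cost
```

Under these assumptions the raw layout costs about 1000 per month and the curated layout about 25, so the ETL job pays for itself if it costs less than the difference.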
Data Catalog cost is usually not the biggest line item, but metadata sprawl and inefficient pipelines still matter.
Optimize the whole pipeline, not just the Glue job.
12. SAA-C03 Exam Signals
"Discover schema from S3 data" points to Glue crawler.
"Populate or update the Data Catalog" points to crawler.
"Persistent metadata store for Athena tables" points to Glue Data Catalog.
"Transform CSV to Parquet" points to Glue ETL job.
"Clean, join, or enrich datasets before analytics" points to Glue job.
"Run SQL query over S3 data" points to Athena using catalog metadata.
"Data warehouse analytics" points to Redshift, often after ETL.
13. Common Exam Traps
Do not confuse the Data Catalog with the data itself.
Do not expect crawlers to clean or transform data.
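Do not assume a crawler can scan every source. Crawlers target data stores such as S3 and JDBC databases, so check that the source type is supported before choosing a crawler for streaming data.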
Do not forget S3 permissions for Athena and Glue roles.
Do not let schema drift silently break downstream queries.
Do not use Glue when the requirement is only to build dashboards or visualize data. That points to QuickSight.
15. Related Topics
Review AWS Glue, Amazon Athena, Amazon Redshift, Amazon S3, and Analytics Data Lake On S3.