AWS Services

Glue Crawler vs Data Catalog vs ETL Jobs

Compare AWS Glue crawlers, the AWS Glue Data Catalog, and Glue ETL jobs for schema discovery, metadata management, transformations, table definitions, partitions, and analytics pipelines.

foundation | 5 min read | Updated 2026-06-03 | Cloud | Certification | Data | Operations

AWS Glue, Glue Crawler, Glue Data Catalog, ETL Job, Schema Discovery, Table Metadata, Partition, Data Pipeline

After this, you will understand

Glue becomes much less abstract when learners separate the metadata store, the scanner that fills it, and the jobs that transform data.

Plain version

The Data Catalog stores metadata, crawlers discover schemas and update catalog tables, and ETL jobs transform data between sources and targets.

Decision pressure

Learners often expect crawlers to clean data, expect the catalog to move bytes, or try to query with Athena without having both catalog and S3 permissions.

Exam-ready model

Name the job first: discover metadata, store metadata, or transform data.

Think before reading: What does a Glue crawler actually create or update?

It creates or updates metadata tables and partitions in the AWS Glue Data Catalog after scanning data sources.

Concepts Covered

AWS Glue Data Catalog
Glue crawlers
Classifiers
Table metadata
Partitions
ETL jobs
Sources and targets
Schema drift
Athena integration
SAA-C03 Glue traps

1. Plain-English Mental Model

Glue is easier to learn when its parts are separated.

Data Catalog = metadata store
Crawler = scanner that discovers schema and updates metadata
ETL job = transformation job that reads, changes, and writes data

The Data Catalog does not move data. A crawler does not clean dirty data. An ETL job does the transformation work.

This distinction matters because AWS exam questions often ask for "discover schema", "query S3 with Athena", "convert CSV to Parquet", or "store central table metadata." Each phrase points to a different Glue component.

2. Why This Service Exists

Data lakes need more than files.

S3 can store raw data, but query engines need to know the schema, file locations, formats, partitions, and table names. Analysts need to discover datasets without guessing folder structure. Pipelines need repeatable transforms from raw to curated data.

Glue exists to provide serverless data integration around those needs.

The Data Catalog provides persistent metadata. Crawlers can populate or update that metadata by scanning data stores. ETL jobs can use the catalog as source and target metadata while transforming actual data.

3. The Naive Approach And Where It Breaks

The naive data lake is just a bucket:

S3 raw files -> everyone guesses paths and columns

That breaks when schemas change, partitions are missing, files are inconsistent, and query engines cannot infer the right table shape.

Another naive approach is to run a crawler and assume the data is now trustworthy. Crawlers infer metadata. They do not validate business meaning, clean bad records, deduplicate, or transform file formats by themselves.

A third mistake is thinking the Data Catalog stores the dataset. It stores metadata about the dataset. The data remains in S3, JDBC sources, or other data stores.

4. Core Primitives

The AWS Glue Data Catalog stores databases, tables, schemas, partitions, locations, and metadata properties.
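To make the idea of a metadata store concrete, the sketch below shows a catalog entry as a Python dict. The field names follow the shape of the TableInput structure used by Glue's CreateTable API; the table name, columns, and S3 path are hypothetical. Note that the only link to the data is a location string. No rows live in the catalog.

```python
# Hypothetical catalog entry for a Parquet table on S3.
# Field names follow Glue's TableInput structure; every value is made up.
table_input = {
    "Name": "orders",
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {"classification": "parquet"},
    "PartitionKeys": [
        {"Name": "year", "Type": "string"},
        {"Name": "month", "Type": "string"},
    ],
    "StorageDescriptor": {
        "Location": "s3://example-data-lake/curated/orders/",
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "amount", "Type": "double"},
        ],
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}


def describe(table):
    """Summarize the metadata a query engine would read from the catalog."""
    sd = table["StorageDescriptor"]
    return {
        "location": sd["Location"],
        "columns": [c["Name"] for c in sd["Columns"]],
        "partition_keys": [p["Name"] for p in table["PartitionKeys"]],
    }


print(describe(table_input))
```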

A crawler scans data stores, uses classifiers to infer schema and format, and creates or updates catalog tables and partitions.
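A crawler definition is mostly configuration: where to scan, which role to use, which catalog database to write into, and how to react to schema changes. The helper below is a minimal sketch that assembles arguments in the shape of Glue's CreateCrawler request. The helper name, crawler name, role ARN, database, path, and schedule are all hypothetical, and nothing here calls AWS.

```python
def build_crawler_request(name, role_arn, database, s3_path, cron=None):
    """Assemble keyword arguments in the shape of Glue's CreateCrawler request."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Log schema changes instead of rewriting table definitions automatically.
        "SchemaChangePolicy": {
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        },
    }
    if cron:
        request["Schedule"] = cron
    return request


request = build_crawler_request(
    name="raw-orders-crawler",
    role_arn="arn:aws:iam::123456789012:role/example-glue-crawler-role",
    database="raw",
    s3_path="s3://example-data-lake/raw/orders/",
    cron="cron(0 2 * * ? *)",  # assumed schedule: daily at 02:00 UTC
)
print(request["Targets"])
# With boto3 installed and credentials configured, this dict could be passed as
# boto3.client("glue").create_crawler(**request)
```

Choosing LOG for both behaviors is one conservative option; UPDATE_IN_DATABASE lets the crawler rewrite the table definition itself.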

Classifiers recognize file formats such as JSON, CSV, and Avro. Glue provides built-in classifiers, and custom classifiers can be defined for formats the built-in ones do not recognize.

An ETL job runs code or visual job definitions that read from sources, transform data, and write to targets.

Glue connections provide network and credential details for certain data sources.

Glue workflows and triggers coordinate job execution.

5. Architecture Use Cases

Use the Data Catalog as a shared metadata layer for S3 data lake tables:

S3 dataset -> Glue Data Catalog table -> Athena query

Use crawlers when data arrives in known stores and schema discovery or partition updates should be automated.

Use ETL jobs to transform raw data into curated data:

raw JSON or CSV -> Glue ETL -> partitioned Parquet -> Athena or Redshift Spectrum
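A real Glue ETL job would typically be a Spark script. The stand-in below uses only the Python standard library to show the shape of the work in the pipeline above: read raw rows, clean and type them, and decide which partition each record belongs to. The column names, sample data, and year=/month= layout are assumptions for illustration, and it stops short of writing Parquet.

```python
import csv
import io

# Hypothetical raw input; the second row has an unparseable amount.
RAW_CSV = """order_id,order_date,amount
A1,2024-01-15,19.99
A2,2024-02-03,not-a-number
A3,2024-02-20,5.00
"""


def transform(raw_text):
    """Clean raw CSV rows and group them by Hive-style partition path."""
    partitions = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        year, month, _day = row["order_date"].split("-")
        key = f"year={year}/month={month}"
        partitions.setdefault(key, []).append(
            {"order_id": row["order_id"], "amount": amount}
        )
    return partitions


result = transform(RAW_CSV)
for path, records in sorted(result.items()):
    print(path, records)
```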

Use manual table definitions when schema control matters more than automatic inference.

Use Glue jobs before Redshift loads when data needs cleaning, normalization, or format changes.

7. Security Model

Glue security includes IAM roles, Data Catalog permissions, S3 permissions, KMS keys, Lake Formation where used, and network access for private sources.

A crawler role needs permission to read the data source and write metadata into the Data Catalog.

An ETL job role needs permission to read sources, write targets, access catalog metadata, write logs, and use KMS keys.

Athena users need permission to read catalog metadata, read S3 data, and write query results.
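The three permission groups for an Athena user can be sketched as an IAM policy document. This is deliberately incomplete: it omits the Athena API actions themselves (for example athena:StartQueryExecution) and any KMS permissions, and it uses a wildcard for the catalog resource where a real policy would scope to specific catalog ARNs. The bucket names and helper name are hypothetical.

```python
import json


def athena_reader_policy(data_bucket, results_bucket):
    """Build an IAM policy document covering catalog read, data read, and result write."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # 1. Read table metadata from the Data Catalog
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                "Resource": "*",
            },
            {
                # 2. Read the underlying data files in S3
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{data_bucket}",
                    f"arn:aws:s3:::{data_bucket}/*",
                ],
            },
            {
                # 3. Write (and read back) query results
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:GetBucketLocation"],
                "Resource": [
                    f"arn:aws:s3:::{results_bucket}",
                    f"arn:aws:s3:::{results_bucket}/*",
                ],
            },
        ],
    }


policy = athena_reader_policy("example-data-lake", "example-athena-results")
print(json.dumps(policy, indent=2))
```

If any one of the three statements is missing, the query fails even though the other two are in place.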

Catalog metadata can reveal sensitive dataset names and schemas. Protect metadata as well as data.

8. Reliability And Resilience

Catalog correctness is operationally important. If partitions are missing, Athena may not see new data. If schema inference changes unexpectedly, downstream jobs can fail.
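The missing-partition problem can be illustrated by comparing the partitions implied by object keys with the partitions registered in the catalog. In the sketch below both lists are hard-coded; in practice the keys would come from an S3 listing and the registered values from the catalog. The function names, prefix, and keys are hypothetical.

```python
def partition_of(key, table_prefix):
    """Extract Hive-style partition values (name=value segments) from an object key."""
    relative = key[len(table_prefix):]
    segments = relative.split("/")[:-1]  # drop the file name
    return tuple(seg.split("=", 1)[1] for seg in segments if "=" in seg)


def missing_partitions(object_keys, table_prefix, registered):
    """Return partitions that exist as S3 prefixes but are not in the catalog."""
    found = {
        partition_of(key, table_prefix)
        for key in object_keys
        if key.startswith(table_prefix)
    }
    found.discard(())  # ignore files that sit outside any partition folder
    return sorted(found - set(registered))


keys = [
    "curated/orders/year=2024/month=01/part-0.parquet",
    "curated/orders/year=2024/month=02/part-0.parquet",
    "curated/orders/year=2024/month=02/part-1.parquet",
]
registered = {("2024", "01")}

print(missing_partitions(keys, "curated/orders/", registered))
```

Data under the February prefix exists in S3 but stays invisible to queries until that partition is registered, for example by a crawler run or an Athena ALTER TABLE ADD PARTITION statement.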

Crawlers should run on a schedule that matches data arrival and cost expectations.

ETL jobs should be idempotent. A failed retry should not duplicate output data or corrupt curated partitions.
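One common way to get idempotent output is to make each run own a whole partition and replace it, rather than append to it. The sketch below demonstrates the idea on a local temporary directory using JSON lines; it is a stand-in, not Glue code, and replacing a prefix on S3 is not atomic the way this local example suggests. The function names and the partition path are hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path


def write_partition(base_dir, partition, records):
    """Replace the whole partition directory so a retry cannot append duplicates."""
    target = Path(base_dir) / partition
    if target.exists():
        shutil.rmtree(target)  # discard output from any earlier attempt
    target.mkdir(parents=True)
    out_file = target / "part-00000.json"
    with out_file.open("w") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return out_file


def count_rows(base_dir, partition):
    """Count rows across all output files in one partition."""
    return sum(
        len(path.read_text().splitlines())
        for path in (Path(base_dir) / partition).glob("*.json")
    )


base = tempfile.mkdtemp()
rows = [{"order_id": "A1"}, {"order_id": "A3"}]
write_partition(base, "year=2024/month=02", rows)
write_partition(base, "year=2024/month=02", rows)  # simulated retry
print(count_rows(base, "year=2024/month=02"))
```

An append-style writer would leave four rows after the retry; the replace-style writer leaves two.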

Schema drift needs governance. Automatic updates can be useful, but silent schema changes can break dashboards and downstream pipelines.
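A simple governance check is to compare the schema a pipeline expects with the schema most recently observed, and to flag differences before anything downstream consumes them. The sketch below works on plain column-to-type mappings; the function name, column names, and types are hypothetical.

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report drift."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col
        for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


expected = {"order_id": "string", "amount": "double"}
observed = {"order_id": "string", "amount": "string", "coupon": "string"}

drift = diff_schema(expected, observed)
print(drift)
if any(drift.values()):
    print("schema drift detected; review before updating the catalog")
```

A new column is often harmless, while a changed type (here amount going from double to string) is the kind of silent change that breaks dashboards.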

Monitor crawler runs, job failures, output row counts, data freshness, and partition updates.

9. Performance And Scaling

The Data Catalog affects query planning and discovery, but it does not process data by itself.

Crawlers can be expensive or slow if they repeatedly scan huge datasets unnecessarily.

ETL job performance depends on worker type, worker count, file size, partitioning, transformation logic, and source-system limits.

Good Glue pipelines improve Athena and Redshift performance by converting data to columnar formats, compacting tiny files, and registering useful partitions.
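Compacting tiny files is essentially a grouping problem: combine many small inputs into fewer outputs near a target size. The sketch below only plans the grouping and does not read or write data. The 128 MB target, function name, and file names are assumptions for illustration.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group files so each group stays at or under target_bytes."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
small_files = {f"part-{i:03d}.json": 8 * MB for i in range(40)}  # 40 files of 8 MB

plan = plan_compaction(small_files, target_bytes=128 * MB)
print(len(small_files), "input files ->", len(plan), "output files")
```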

Poor metadata and many small raw files can make downstream queries slow and expensive.

10. Cost Model

Glue costs include crawler runtime, ETL job resources and duration, interactive sessions, Data Catalog storage and requests where applicable, logs, and related S3/KMS charges.

Crawlers that run too often over unchanged data can waste money.

ETL jobs can save money downstream by reducing Athena scanned bytes or Redshift load cost.
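The downstream saving can be estimated with simple arithmetic, because Athena bills by the amount of data a query scans. In the sketch below the per-terabyte price and the 5% scan fraction are placeholder assumptions, not quoted figures; check current pricing for your Region and measure your own queries.

```python
def athena_scan_cost(bytes_scanned, price_per_tb):
    """Cost of one query under per-terabyte-scanned pricing."""
    return bytes_scanned / 1024**4 * price_per_tb


TB = 1024**4
PRICE_PER_TB = 5.0  # placeholder price in USD; not an authoritative figure

raw_scan = 1 * TB          # query reads the whole raw CSV dataset
curated_scan = 0.05 * TB   # assumed: columnar format plus partition pruning reads 5%

raw_cost = athena_scan_cost(raw_scan, PRICE_PER_TB)
curated_cost = athena_scan_cost(curated_scan, PRICE_PER_TB)
print(f"raw: ${raw_cost:.2f} per query, curated: ${curated_cost:.2f} per query")
```

Whether the ETL job pays for itself depends on how often the curated data is queried compared with how often the job runs.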

Data Catalog cost is usually not the biggest line item, but metadata sprawl and inefficient pipelines still matter.

Optimize the whole pipeline, not just the Glue job.

12. SAA-C03 Exam Signals

"Discover schema from S3 data" points to Glue crawler.

"Populate or update the Data Catalog" points to crawler.

"Persistent metadata store for Athena tables" points to Glue Data Catalog.

"Transform CSV to Parquet" points to Glue ETL job.

"Clean, join, or enrich datasets before analytics" points to Glue job.

"Run SQL query over S3 data" points to Athena using catalog metadata.

"Data warehouse analytics" points to Redshift, often after ETL.

13. Common Exam Traps

Do not confuse the Data Catalog with the data itself.

Do not expect crawlers to clean or transform data.

Do not point crawlers at streaming sources. Crawlers scan data stores such as S3 and JDBC sources, not live streams.

Do not forget S3 permissions for Athena and Glue roles.

Do not let schema drift silently break downstream queries.

Do not use Glue when the requirement is only to visualize dashboards. That points to QuickSight.

Review AWS Glue, Amazon Athena, Amazon Redshift, Amazon S3, and Analytics Data Lake On S3.
