Step 4 - Glue

AWS Glue

EcommCo’s Data Lake uses AWS Glue to extract, transform, and load (ETL) data from the Curated Datasets Bucket. Glue Crawlers also auto-discover the Curated Datasets, which makes the data immediately searchable, queryable, and available for ETL in the Data Lake.

a. Demonstrate the ingest process (ETL) and automatic schema discovery with Glue

The diagram below illustrates how EcommCo uses Glue to perform an ETL job on the Curated Datasets Bucket.

b. Curated Datasets ETL

{% include 'error_box.html' %}

When you click this button, the following steps will be performed within your AWS account:

  • AWS Glue Crawler {{ curated_datasets_crawler_name }} runs on the Curated Datasets Bucket
  • AWS Glue Job {{ curated_datasets_job_name }} performs the JSON-to-Parquet conversion
  • AWS Glue Crawler {{ curated_datasets_crawler_name }} waits for the job above to complete, then runs again
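The three steps above can be sketched in Python. This is only an illustration of the crawl, convert, crawl-again sequence, not the code behind the button. It assumes `glue` is a boto3 Glue client (for example `boto3.client("glue")`), and the function names, the `poll_seconds` parameter, and the polling approach are hypothetical choices made for this example.

```python
import time

# Job run states after which polling stops (assumed sufficient for this sketch).
TERMINAL_JOB_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_crawler(glue, crawler_name, poll_seconds=30):
    """Start a crawler and block until it returns to the READY state."""
    glue.start_crawler(Name=crawler_name)
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)


def run_job(glue, job_name, poll_seconds=30):
    """Start a job run, block until it finishes, and return its final state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        job_run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if job_run["JobRunState"] in TERMINAL_JOB_STATES:
            return job_run["JobRunState"]
        time.sleep(poll_seconds)


def run_curated_datasets_etl(glue, crawler_name, job_name, poll_seconds=30):
    """Crawl the bucket, run the conversion job, then crawl again."""
    run_crawler(glue, crawler_name, poll_seconds)      # discover the JSON data
    state = run_job(glue, job_name, poll_seconds)      # JSON-to-Parquet conversion
    if state != "SUCCEEDED":
        raise RuntimeError(f"Glue job {job_name} ended in state {state}")
    run_crawler(glue, crawler_name, poll_seconds)      # pick up the Parquet output
    return state
```

With real credentials, a call might look like `run_curated_datasets_etl(boto3.client("glue"), crawler_name, job_name)`, where the two names are the crawler and job shown in the list above.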
c. Observe AWS Glue crawlers and databases created by them

  • Visit AWS Glue crawlers in your AWS Management Console to see the {{ curated_datasets_crawler_name }} crawler.
  • Visit AWS Glue databases in your AWS Management Console to see the {{ curated_datasets_database_name }} database.
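The same information can also be read programmatically. The sketch below is one possible way to do it, assuming `glue` is a boto3 Glue client; the helper names `describe_crawler` and `list_table_schemas` are hypothetical and exist only for this example.

```python
def describe_crawler(glue, crawler_name):
    """Return a short summary of a crawler: state, target database, last crawl."""
    crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
    return {
        "name": crawler["Name"],
        "state": crawler["State"],
        "database": crawler.get("DatabaseName"),
        "last_crawl_status": crawler.get("LastCrawl", {}).get("Status"),
    }


def list_table_schemas(glue, database_name):
    """Map each table in the database to its (column name, type) pairs."""
    schemas = {}
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [(c["Name"], c.get("Type")) for c in columns]
        token = page.get("NextToken")
        if not token:          # no more pages of tables
            return schemas
        kwargs["NextToken"] = token
```

Comparing the output of `list_table_schemas` before and after the ETL run is one way to see which tables the second crawl added.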
d. Observe AWS Glue ETL job

    1. Visit AWS Glue jobs in your AWS Management Console.

    Note: This job run can take about 20 minutes.
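Because the run can take a while, it may be convenient to check its progress from a script rather than refreshing the console. The following is a minimal sketch, assuming `glue` is a boto3 Glue client; the helper name `latest_job_run` and the shape of the returned summary are hypothetical.

```python
def latest_job_run(glue, job_name):
    """Summarize the most recently started run of a Glue job, or None if no runs."""
    runs = glue.get_job_runs(JobName=job_name, MaxResults=10)["JobRuns"]
    if not runs:
        return None
    # Pick the newest run explicitly instead of relying on response ordering.
    latest = max(runs, key=lambda run: run["StartedOn"])
    return {
        "id": latest["Id"],
        "state": latest["JobRunState"],
        # ExecutionTime is reported in seconds; convert to minutes for readability.
        "execution_minutes": round(latest.get("ExecutionTime", 0) / 60, 1),
        "error": latest.get("ErrorMessage"),
    }
```

A `state` of `RUNNING` means the conversion is still in progress; `SUCCEEDED` means the job has finished and the crawler can run again.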