Add a crawler for curated data

Note: To proceed with this step, you need to wait for all the Glue jobs to finish.

Now that we have the data in Parquet format, we need to infer its schema.

Repeat this step for each job you created in the previous step.

A Glue crawler connects to a data store, determines the schema of your data, and then creates metadata tables in the Data Catalog.

  • Start by navigating to the Crawlers menu in the navigation pane, then press Add crawler.

  • Specify the name “byod-YOUR-TABLE-NAME-curated-crawler” and press Next.

  • Choose Data stores as the Crawler source type and press Next.

  • Choose S3 as the data store, add the S3 path where your curated data resides, and press Next.

  • If you have more than one folder (that is, different types of data sets), add each of them as an additional data store in this step, one by one. Otherwise, choose “No”.

  • Choose the glue-processor-role as the IAM role and proceed to the schedule.

  • Leave the Frequency set to Run on demand and press Next.

  • Click the Add database button and specify “byod_curated” as the database name (this will be the name of the curated database in the Data Catalog; make sure the name does not contain “-”, as that can cause problems in later steps). Press Next, then Finish.

  • Select the newly created crawler and press the Run crawler button. It will take a few minutes for it to populate the Data Catalog.
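
The console steps above can also be scripted. Below is a minimal sketch using boto3; the crawler name, S3 path, role name, and database name mirror the values used in this walkthrough, but the bucket path is a placeholder you must replace with your own, and this assumes your AWS credentials and region are already configured.

```python
# Sketch: create and run the curated crawler programmatically instead of via
# the console. Placeholder values (YOUR-TABLE-NAME, YOUR-BUCKET) must be
# replaced with values from your own environment.

CRAWLER_NAME = "byod-YOUR-TABLE-NAME-curated-crawler"
CURATED_S3_PATH = "s3://YOUR-BUCKET/curated/YOUR-TABLE-NAME/"  # placeholder path

crawler_config = {
    "Name": CRAWLER_NAME,
    "Role": "glue-processor-role",        # IAM role chosen in the steps above
    "DatabaseName": "byod_curated",       # no "-" in the name, per the note above
    "Targets": {"S3Targets": [{"Path": CURATED_S3_PATH}]},
    # No "Schedule" key: the crawler remains "run on demand".
}

def create_and_run_crawler(config):
    """Create the crawler, start it, and poll until it is done."""
    import time
    import boto3  # assumed installed and configured with credentials

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])

    # The crawler returns to the READY state once it has finished populating
    # the Data Catalog; poll until then.
    while glue.get_crawler(Name=config["Name"])["Crawler"]["State"] != "READY":
        time.sleep(30)
```

To use it, fill in the placeholders and call `create_and_run_crawler(crawler_config)`; add further entries to `S3Targets` if you have more than one curated folder.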