+++ title = "Create data catalog from Amazon S3 files" date = 2020-10-20T20:12:11+01:00 weight = 22 chapter = true pre = "3.2. " +++ We will be using AWS Glue Crawlers to infer the schema of the files and create data catalog. Without a crawler, you can still read data from the Amazon S3 by a AWS Glue job, but it will not be able to determine data types (string, int, etc) for each column. Go back to the AWS Glue Console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). - Start by navigating to the **Crawlers** menu on the navigation pane, then press **Add crawler**. - Specify the name: “byod-YOUR_TABLE_NAME-raw-crawler” and press **Next**; - Choose **Data stores** as _Crawler source type_ and press **Next**; - Choose _Amazon S3_ as data store. Add Amazon S3 path of the **specific folder** where your files for your table resides (for example: mybyodbucket/raw/orders) and press \*_Next_. We will create a separate crawler for each table (each folder in your raw folder),so for this step just add one folder. - At this stage we don't add any other data source; - Choose the _glue-processor-role_ as IAM Role and proceed to the schedule; - Leave the _Run on demand_ option at the Frequency section and press **Next**; - Click on the **Add database** button and specify "byod_src" as database name (this will be the name representing the source database in the data catalog). Press **Next** and **Finish**; ![add a new database](/glue_images/ingestion/crawler2.png) ![add a new database](/glue_images/ingestion/crawler3.png) - Select the newly created crawler and push the **Run crawler** button. It will take a few minutes until it populates the data catalog. Validation: After the crawler finishes running, from the left menu, go to **Databases**, select your database, and click "Tables in " link. Choose the table you just created. You will be able to see information and schema of your files.