# Config Driven ETL Framework for AWS Glue

A lightweight, config-driven ETL (extract, transform, load) framework for performing ETL operations on data lakes using AWS Glue.

## Features

- Config driven: reads job definitions from a YAML file
- Extracts data from the data lake
- Runs transformations using Spark SQL
- Loads the processed data back to the data lake
- Can execute external Python scripts
## ETL Architecture for AWS Glue

![ETL Framework Architecture](images/etl_framework_architecture.png)
## Config File

The config file is a YAML file where you provide the config parameter values used to execute the ETL operation.

|Key|Sub Key|Description|Type|Valid Values|Default|Example|
|--|--|--|--|--|--|--|
|job|||Required||||
||name|Name of the job|Optional||||
||type|Type of the job|Required|spark-sql<br>python-script|||
||spark-config|Optional Spark config items|Optional|||spark-config:<br>- spark.executor.memory: 4g<br>- spark.driver.memory: 4g|
|variables|||Required||||
||pipeline_name|The name of your pipeline<br>* You can add multiple pipeline names if the config is shared|Required|Config value should be in Python dictionary format||my_pipeline: {"variable1":"value1", "variable2":"value2"}|
|input_tables|||Required||||
||- name<br>(YAML array)|* The name of the table to read from<br>* If source is S3, it's the Spark view name<br>* Use the same name in the SQL|Required||||
||source|Source type of the table|Required|* S3 : direct access to S3 locations<br>* GDC : access source tables using the Glue Data Catalog<br>* JDBC : access source tables using JDBC||S3|
||database|The database to read from|Required if source is GDC or JDBC||||
||location|The physical location of the table<br>* Can include variables provided in pipeline variables|Required if source is S3|||s3://data-lake-bucket/{value1}/{value2}/table_name/|
||format|File format when the source is S3|Optional|parquet<br>csv<br>json<br>orc<br>avro|parquet|parquet|
||format_options|Format options for the source files|Optional|https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html||{"quoteChar": '"',<br>"withHeader": True,<br>"separator": ",",<br>"optimizePerformance": False}|
|output_table|||Required||||
||name|Name of the output table<br>Can include the {pipeline_name} variable|Required|||patient_{pipeline_name}|
||target|Target type of the output table|Required|S3<br>GDC||S3|
||database|The database to write to|Required when target is GDC or JDBC||||
||schema|The db schema to write to|Required when target is JDBC||||
||location|The physical location of the table<br>Can include variables provided in pipeline variables|Required when target is S3|||s3://data-lake-bucket/output-folder/my_table_{pipeline_name}/|
||governed|Specifies whether the output table is a governed table in GDC|Optional<br>Can be used only when target is GDC|yes||yes|
||format|File format when the target is S3|Optional|parquet<br>csv<br>json<br>orc<br>avro|parquet|parquet|
||format_options|Format options for the output files|Optional|https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html||{"quoteChar": '"',<br>"withHeader": True,<br>"separator": ",",<br>"optimizePerformance": False}|
||refresh|Table data refresh type|Optional|full<br>incremental|||
||coalesce|Controls the number of files produced in the output table|Optional|Any integer value||10|
||repartition|Controls the number of files produced in the output table|Optional|Any integer value||10|
||partition_keys|Table partition keys|Optional|||"column1,column2"|
|sql||* A single SQL SELECT statement in YAML multiline format<br>* Can include variables provided in pipeline variables|Optional|* Either sql or sql_file should be provided, not both<br>* If both are provided, sql is used||\|<br>SELECT a.\*, b.\*<br>FROM spark_table1 a<br>JOIN spark_table2 b ON a.column1 = b.column1|
|sql_file||* A SQL script file that contains a single SELECT statement<br>* Can include variables provided in pipeline variables|Optional|* Either sql or sql_file should be provided, not both<br>* If both are provided, sql is used||s3://data-lake-bucket/etl-framework/config/sql/my_query.sql|
|python_file||* A Python script file to be executed within the Glue job<br>* Can include variables provided in pipeline variables|Required when job.type is python-script|||s3://data-lake-bucket/etl-framework/config/python/my_script.py|
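Putting these keys together, a minimal config might look like the sketch below. All table, bucket, and variable names here are hypothetical placeholders, and `format: parquet` is shown only to make the default explicit.

```
job:
  type: spark-sql            # or python-script

variables:
  my_pipeline: {"year": "y2022", "area": "sales"}   # one entry per pipeline sharing this config

input_tables:
  - name: sv_orders          # Spark view name used in the SQL below
    source: s3
    location: s3://data-lake-bucket/{year}/{area}/orders/
    format: parquet          # default, shown for clarity

output_table:
  name: orders_summary_{pipeline_name}
  target: s3
  location: s3://data-lake-bucket/output-folder/{year}/orders_summary/

sql: |
  SELECT order_id, sum(amount) AS total_amount
  FROM sv_orders
  GROUP BY order_id
```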
#### Usage Notes

- The framework can perform UPSERT on both non-partitioned and partitioned data lake tables (see the sketch below).
- You can read from one or many tables, but you write back to a single table.
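For instance, an incremental load into a partitioned output table could presumably combine the `refresh`, `partition_keys`, and `coalesce` keys from the config table above. The snippet below is a hedged sketch with hypothetical names; the exact UPSERT behaviour depends on the framework implementation.

```
output_table:
  name: sales_by_region
  target: s3
  location: s3://data-lake-bucket/output-folder/sales_by_region/
  refresh: incremental               # "full" would rewrite the table instead of upserting
  partition_keys: "region_id,sale_date"   # hypothetical partition columns
  coalesce: 10                       # limit the number of output files
```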

# Example 1

Pipeline Name = 2022_sales
Job Type = spark-sql

Input tables :

   sv_sales :
      Source = S3,
      Location = s3://data-lake-bucket/y2022/sales/
      Format = Default (parquet)
   sv_region :
      Source = S3,
      Format = CSV,
      Location = s3://data-lake-bucket/sales/region/
   sv_customer :
      Source = S3,
      Location = s3://data-lake-bucket/sales/customer/
      Format = Default (parquet)

Output table :
   sales_by_region :
      Target = S3,
      Location = s3://datalake-bucket/outputfolder/y2022/sales_by_region/


## Config File for Example 1

```
job:
  type: spark-sql

variables:
  sales_2021: {"datasource": "y2021", "area": "sales"}
  sales_2022: {"datasource": "y2022", "area": "sales"}

input_tables:
  - name: sv_sales
    source: s3
    location: s3://datalake-bucket/{datasource}/sales/
  - name: sv_region
    source: s3
    location: s3://data-lake-bucket/{area}/region/
    format: csv
    format_options: {"quoteChar": '"', "withHeader": True, "separator": ",", "optimizePerformance": False}
  - name: sv_customer
    source: s3
    location: s3://data-lake-bucket/{area}/customer/

output_table:
  name: sales_by_region
  target: s3
  location: s3://datalake-bucket/outputfolder/{datasource}/sales_by_region/

sql: |
  SELECT r.region_id, sum(s.total_amount) as total_amount
  FROM sv_sales s
  JOIN sv_customer c ON c.customer_id = s.customer_id
  JOIN sv_region r ON r.region_id = c.region_id
  GROUP BY r.region_id
```
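If the Glue job needs non-default Spark settings, the optional `spark-config` key described in the config table could be added to the `job` section. The values below mirror the table's example and are illustrative only; whether a given property takes effect depends on the Glue environment.

```
job:
  type: spark-sql
  spark-config:
    - spark.executor.memory: 4g
    - spark.driver.memory: 4g
```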
# Example 2

Pipeline Name = 2022_sales
Job Type = spark-sql

Input tables :

   sv_sales :
      Source = GDC
      Database = sales_2022
   sv_region :
      Source = GDC
      Database = common_db
   sv_customer :
      Source = GDC
      Database = common_db

Output table :
   sales_by_region :
      Target = GDC
      Database = ML


## Config File for Example 2

```
job:
  type: spark-sql

variables:
  sales_2021: {"datasource": "y2021", "area": "sales"}
  sales_2022: {"datasource": "y2022", "area": "sales"}

input_tables:
  - name: sv_sales
    source: GDC
    database: sales_2022
  - name: sv_region
    source: GDC
    database: common_db
  - name: sv_customer
    source: GDC
    database: common_db

output_table:
  name: sales_by_region_{datasource}
  target: gdc
  database: common_db

sql_file: s3://datalake-bucket/config/sql/sales.sql
```
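For completeness, a `python-script` job (the other `job.type` listed in the config table) would reference an external Python file through the `python_file` key instead of SQL. The sketch below uses hypothetical paths and variable names.

```
job:
  type: python-script

variables:
  my_pipeline: {"datasource": "y2022"}

# Hypothetical script location; the file is executed within the Glue job
python_file: s3://data-lake-bucket/config/python/my_script.py
```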