# Build a Genomics Data Lake on AWS

This repo contains the code referenced in the AWS blog post "Build a Genomics data lake on AWS".

#### ETL - contains the transformation scripts and the CloudFormation template to spin up the EMR cluster

**EMRGenomics.py** - Lambda function that is triggered by the CloudFormation template to create the EMR cluster that processes VCFs.

**EventEMRGenomics.py** - Event trigger Lambda function.

**emr_config.json** - JSON file with the EMR configuration for this example. Edit this file to change EMR configuration parameters.

**vcfToParquetTransform.py** - PySpark script that performs the VCF to Parquet transformation using the Hail API. It can be customized to perform any specific transformation steps required.

**genomics_datalake_emr.template** - CloudFormation template that can be deployed in your account for the solution.

#### 1000Genomes.ipynb - Python notebook with sample queries

**For instructions on how to create the Glue data catalog tables for 1000 Genomes on the Registry of Open Data, please check the DataLakeAsCode repo at https://github.com/aws-samples/data-lake-as-code/tree/roda#readme. The repo also has CloudFormation templates for ClinVar and gnomAD.**
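
To give a rough idea of what a VCF to Parquet transformation involves, here is a minimal, standard-library-only sketch. It is not the code in **vcfToParquetTransform.py**, which uses Hail on EMR; it only illustrates how one tab-separated VCF data line can be flattened into one row per sample, the kind of shape that suits a columnar format. The function name `flatten_vcf_line`, the output field names, and the sample values are hypothetical and chosen for illustration.

```python
# Illustrative only: the repo's script uses Hail; this sketch uses no libraries.
# The nine fixed columns come from the VCF specification.
VCF_FIXED_COLUMNS = [
    "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
]


def flatten_vcf_line(line, sample_names):
    """Turn one tab-separated VCF data line into one dict per sample."""
    fields = line.rstrip("\n").split("\t")
    fixed = dict(zip(VCF_FIXED_COLUMNS, fields[:9]))
    # FORMAT lists the per-sample keys (for example GT:DP), colon-separated.
    format_keys = fixed["FORMAT"].split(":")

    rows = []
    for sample, call in zip(sample_names, fields[9:]):
        call_values = dict(zip(format_keys, call.split(":")))
        rows.append({
            "chrom": fixed["CHROM"],
            "pos": int(fixed["POS"]),
            "ref": fixed["REF"],
            "alt": fixed["ALT"].split(","),  # multi-allelic sites list several ALTs
            "sample": sample,
            "genotype": call_values.get("GT"),
        })
    return rows


if __name__ == "__main__":
    # Synthetic record and sample names, not real 1000 Genomes data.
    example = "1\t12345\t.\tA\tG,T\t50\tPASS\t.\tGT:DP\t0|1:12\t1|2:7"
    for row in flatten_vcf_line(example, ["SAMPLE_A", "SAMPLE_B"]):
        print(row)
```

A real pipeline would also parse the header to obtain sample names and would carry INFO and additional FORMAT fields through; those steps are omitted here to keep the sketch short.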