# Genomic analysis using Glow on EMR Serverless

Some customers may want to use [Glow](https://glow.readthedocs.io/en/latest/index.html) to work with genomic data. This is an example of how to package Glow dependencies and run a job on EMR Serverless using [1000 Genomes](https://registry.opendata.aws/1000-genomes/) data.

## Getting Started

First, we'll need to build a `virtualenv` that we can upload to S3. Similar to our base example, we'll use a Dockerfile to do this. In addition, we're _also_ going to create an uber-jar directly from the [Glow](https://github.com/projectglow/glow) project.

- In this directory, run the following command to build the Docker image and export `pyspark_glow.tar.gz` and `glow-spark3-assembly-1.1.2-SNAPSHOT.jar` to your local filesystem.

```shell
# Enable BuildKit backend
DOCKER_BUILDKIT=1 docker build --output ./dependencies .
```

- Now copy those files to S3.

```shell
export S3_BUCKET=
aws s3 sync ./dependencies/ s3://${S3_BUCKET}/artifacts/pyspark/glow/
```

- And copy the demo code.

```shell
aws s3 cp glow_demo.py s3://${S3_BUCKET}/code/pyspark/
```

## Submitting a new job

Make sure to follow the [EMR Serverless getting started guide](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/getting-started.html) to spin up your Spark application.

- Define your EMR Serverless application ID and job role, then submit the job. During the preview, EMR Serverless does not have outbound internet access, so we can't use the `--packages` option. Instead, we provide the `--archives` option for our PySpark virtualenv and the `--jars` option for our Glow uber-jar on S3.

```shell
export APPLICATION_ID=
export JOB_ROLE_ARN=

aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/glow_demo.py",
            "entryPointArguments": ["pyspark_glow.tar.gz"],
            "sparkSubmitParameters": "--conf spark.hadoop.io.compression.codecs=io.projectglow.sql.util.BGZFCodec --conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=3 --conf spark.executor.memory=4g --conf spark.executor.instances=20 --archives=s3://'${S3_BUCKET}'/artifacts/pyspark/glow/pyspark_glow.tar.gz --jars s3://'${S3_BUCKET}'/artifacts/pyspark/glow/glow-spark3-assembly-1.1.2-SNAPSHOT.jar"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'
```

- Set your job run ID and get the status of the job.

```shell
export JOB_RUN_ID=001234567890
aws emr-serverless get-job-run \
    --application-id $APPLICATION_ID \
    --job-run-id $JOB_RUN_ID
```

- View the logs while the job is running.

```shell
aws s3 cp s3://${S3_BUCKET}/logs/applications/${APPLICATION_ID}/jobs/${JOB_RUN_ID}/SPARK_DRIVER/stdout.gz - | gunzip
```

## Alternatives

One other alternative I came across while researching this is to use a temporary image to fetch your dependencies. It uses an EMR on EKS image, so you'll need to [authenticate before building](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images-steps.html).

```shell
# Enable BuildKit backend
DOCKER_BUILDKIT=1 docker build -f Dockerfile.jars . --output jars
```

This will export ~100 jars to your local filesystem, which you can upload to S3 and specify via the `--jars` option as a comma-delimited list.
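
For example, you could upload the exported jars and construct that comma-delimited list with something like the following. This is a sketch: the `glow-jars/` prefix and the `JARS` variable are illustrative, not part of this repo.

```shell
# Upload the exported jars to S3 (adjust the prefix to your own layout)
aws s3 sync ./jars/ s3://${S3_BUCKET}/artifacts/pyspark/glow-jars/

# Build a comma-delimited list of the uploaded jar paths to pass to --jars
JARS=$(aws s3 ls s3://${S3_BUCKET}/artifacts/pyspark/glow-jars/ \
    | awk -v bucket="${S3_BUCKET}" '{print "s3://" bucket "/artifacts/pyspark/glow-jars/" $4}' \
    | paste -sd, -)
echo ${JARS}
```

`spark-submit` expects `--jars` as a single comma-delimited string, while `aws s3 ls` output is line-based, hence the `paste -sd, -` step to join the lines.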