{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [Cromwell on AWS](https://docs.opendata.aws/genomics-workflows/)\n", "\n", "[Cromwell](https://github.com/broadinstitute/cromwell) is a Workflow Management System geared towards scientific workflows. Cromwell is open sourced under the [BSD 3-Clause license](https://github.com/broadinstitute/cromwell/blob/develop/LICENSE.txt).\n", "\n", "![Image of Cromwell](https://docs.opendata.aws/genomics-workflows/cromwell/images/cromwell-on-aws_infrastructure.png)\n", "\n", "### Initialize Notebook Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sys\n", "import os\n", "import json\n", "import base64\n", "import project_path # path to helper methods\n", "import pprint\n", "import time\n", "import pandas as pd\n", "\n", "from lib import workshop\n", "from botocore.exceptions import ClientError\n", "\n", "cfn = boto3.client('cloudformation')\n", "batch = boto3.client('batch')\n", "iam = boto3.client('iam')\n", "ec2_client = boto3.client('ec2')\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "\n", "key_name = 'genomics-ami'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)\n", "\n", "We will create an S3 bucket that will be used throughout the workshop for storing our data.\n", "\n", "[s3.create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket = workshop.create_bucket_name('genomics-')\n", "session.resource('s3').create_bucket(Bucket=bucket, CreateBucketConfiguration={'LocationConstraint': region})\n", "print(bucket)\n", "\n", "response = session.client('s3').put_bucket_encryption(\n", " Bucket=bucket,\n", " ServerSideEncryptionConfiguration={\n", " 'Rules': [\n", " {\n", " 'ApplyServerSideEncryptionByDefault': {\n", " 'SSEAlgorithm': 'AES256'\n", " }\n", " },\n", " ]\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create VPC](https://aws.amazon.com/vpc/)\n", "\n", "Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. You can use both IPv4 and IPv6 in your VPC for secure and easy access to resources and applications." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vpc, subnet, subnet2 = workshop.create_and_configure_vpc()\n", "vpc_id = vpc.id\n", "subnet_id = subnet.id\n", "subnet2_id = subnet2.id\n", "print(vpc_id)\n", "print(subnet_id)\n", "print(subnet2_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create a custom AMI for Cromwell](https://docs.opendata.aws/genomics-workflows/aws-batch/create-custom-ami/)\n", "\n", "In all cases, you will need a AMI ID for the AWS Batch Compute Resource AMI that you created using the [\"Create a Custom AMI\"](https://docs.opendata.aws/genomics-workflows/aws-batch/create-custom-ami/) guide! 
We do not provide a default value because, for most genomics workloads, you will need to account for more storage than the default AWS Batch AMI provides. We will download and execute a Python script to generate the custom AMI for use with Cromwell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://s3.amazonaws.com/aws-genomics-workflows/artifacts/aws-custom-ami.tgz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!tar --warning=no-unknown-keyword -xzf aws-custom-ami.tgz && rm aws-custom-ami.tgz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute the Python script, passing in the VPC and subnet created above. We will be using the UserData for Cromwell in this example, and the key pair will be created by the script.\n", "\n", "Replace the values for `vpc_id` and `subnet_id` with the values from the VPC created above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python ./custom-ami/create-genomics-ami.py --user-data custom-ami/cromwell-genomics-ami.cloud-init.yaml --key-pair-name 'genomics-ami' \\\n", "--vpc-id '{{vpc_id}}' --subnet-id '{{subnet_id}}' --use-instance-profile\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create the Batch Environment](https://docs.opendata.aws/genomics-workflows/aws-batch/configure-aws-batch-cfn/)\n", "\n", "We will create the required AWS Batch environment for genomics workflows in the next few cells. If you prefer the quickstart instead, there is a self-contained [`Full Stack`](https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=GenomicsEnv-Full&templateURL=https://s3.amazonaws.com/aws-genomics-workflows/templates/aws-genomics-root.template.yaml) template that will create all of the AWS resources, including the VPC network, security groups, etc.
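\n", "\n", "If you want to go that route, the next cell is a minimal sketch of launching that template with the `cfn` client created above; the stack name mirrors the quickstart link, and the capability list is an assumption you may need to adjust. Otherwise, continue with the step-by-step cells that follow." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: launch the self-contained 'Full Stack' quickstart template instead of the manual steps below.\n", "# The stack name and capabilities are assumptions; review the template and its parameters before enabling this.\n", "launch_full_stack = False\n", "\n", "if launch_full_stack:\n", "    cfn.create_stack(\n", "        StackName='GenomicsEnv-Full',\n", "        TemplateURL='https://s3.amazonaws.com/aws-genomics-workflows/templates/aws-genomics-root.template.yaml',\n", "        Capabilities=['CAPABILITY_IAM']\n", "    )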
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Helper Methods for the Batch Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_compute_environment(computeEnvironmentName, computeType, unitVCpus, imageId, serviceRole, instanceRole,\n", " subnets, securityGroups, keyPair, bidPercentage=None, spotFleetRole=None):\n", " \n", " compute_resources = {\n", " 'type': computeType,\n", " 'imageId': imageId,\n", " 'minvCpus': unitVCpus * 1,\n", " 'maxvCpus': unitVCpus * 16,\n", " 'desiredvCpus': unitVCpus * 1,\n", " 'instanceTypes': ['optimal'],\n", " 'subnets': subnets,\n", " 'securityGroupIds': securityGroups,\n", " 'ec2KeyPair': keyPair,\n", " 'instanceRole': instanceRole\n", " }\n", " \n", " if computeType == 'SPOT':\n", " compute_resources = {\n", " 'type': computeType,\n", " 'imageId': imageId,\n", " 'minvCpus': unitVCpus * 1,\n", " 'maxvCpus': unitVCpus * 16,\n", " 'desiredvCpus': unitVCpus * 1,\n", " 'instanceTypes': ['optimal'],\n", " 'subnets': subnets,\n", " 'securityGroupIds': securityGroups,\n", " 'ec2KeyPair': keyPair,\n", " 'instanceRole': instanceRole,\n", " 'bidPercentage': bidPercentage,\n", " 'spotIamFleetRole': spotFleetRole,\n", " }\n", " \n", " response = batch.create_compute_environment(\n", " computeEnvironmentName=computeEnvironmentName,\n", " type='MANAGED',\n", " serviceRole=serviceRole,\n", " computeResources=compute_resources\n", " )\n", "\n", " while True:\n", " describe = batch.describe_compute_environments(computeEnvironments=[computeEnvironmentName])\n", " computeEnvironment = describe['computeEnvironments'][0]\n", " status = computeEnvironment['status']\n", " if status == 'VALID':\n", " print('\\rSuccessfully created compute environment {}'.format(computeEnvironmentName))\n", " break\n", " elif status == 'INVALID':\n", " reason = computeEnvironment['statusReason']\n", " raise Exception('Failed to create compute environment: {}'.format(reason))\n", " print('\\rCreating compute environment...')\n", " time.sleep(5)\n", "\n", " return response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the AWS Batch Service Role" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "role_doc = {\n", " \"Version\": \"2012-10-17\", \n", " \"Statement\": [\n", " {\"Sid\": \"\", \n", " \"Effect\": \"Allow\", \n", " \"Principal\": {\n", " \"Service\": \"batch.amazonaws.com\"\n", " }, \n", " \"Action\": \"sts:AssumeRole\"\n", " }]\n", " }\n", "\n", "batch_role_arn = workshop.create_role(iam=iam, policy_name='GenomicsEnvBatchServiceRole', \\\n", " assume_role_policy_document=json.dumps(role_doc), \\\n", " managed_policy='arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole')\n", "print(batch_role_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the AWS Batch Spot Fleet Role" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "role_doc = {\n", " \"Version\": \"2012-10-17\", \n", " \"Statement\": [\n", " {\"Sid\": \"\", \n", " \"Effect\": \"Allow\", \n", " \"Principal\": {\n", " \"Service\": \"spotfleet.amazonaws.com\"\n", " }, \n", " \"Action\": \"sts:AssumeRole\"\n", " }]\n", " }\n", "\n", "spot_fleet_role_arn = workshop.create_role(iam=iam, policy_name='GenomicsEnvBatchSpotFleetRole', \\\n", " assume_role_policy_document=json.dumps(role_doc), \\\n", " managed_policy='arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole')\n", 
"print(spot_fleet_role_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Default and High Priority Environments\n", "\n", "Grabs values from above for:\n", "* `imageId` use value from `EC2 AMI ImageId:`.\n", "* `instanceRole` use value from `IAM Instance Profile:`\n", "* `securityGroups` use value from `EC2 Security Group:`\n", "\n", "### Create the Default Compute Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image_id = '{{image_id}}'\n", "instance_role = '{{instance_role}}'\n", "security_groups = ['{{security_group}}']\n", "\n", "bid_percentage = 100\n", "default_env = 'DefaultCromwellEnvironment'\n", "hp_env = 'HighPriorityCromwellEnvironment'\n", "desired_cpu = 4\n", "\n", "resp = create_compute_environment(default_env, 'SPOT', desired_cpu, image_id, batch_role_arn, instance_role, \\\n", " [subnet_id], security_groups, key_name, bid_percentage, spot_fleet_role_arn)\n", "default_ce_arn = resp['computeEnvironmentArn']\n", "default_ce = resp['computeEnvironmentName']\n", "print(default_ce_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the High Priority Compute Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = create_compute_environment(hp_env, 'EC2', desired_cpu, image_id, batch_role_arn, instance_role, \\\n", " [subnet_id], security_groups, key_name)\n", "hp_ce_arn = resp['computeEnvironmentArn']\n", "hp_ce = resp['computeEnvironmentName']\n", "print(hp_ce_arn)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the AWS Batch Job Queues\n", "\n", "We will be creating two job queues one each for the default and high priority environments." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_job_queue(primaryComputeEnvironmentName, secondaryComputeEnvironment, priority):\n", " jobQueueName = primaryComputeEnvironmentName + '_queue'\n", " response = batch.create_job_queue(jobQueueName=jobQueueName,\n", " priority=priority,\n", " computeEnvironmentOrder=[\n", " {\n", " 'order': 1,\n", " 'computeEnvironment': primaryComputeEnvironmentName\n", " },\n", " {\n", " 'order': 2,\n", " 'computeEnvironment': secondaryComputeEnvironment\n", " }\n", " ])\n", "\n", " while True:\n", " describe = batch.describe_job_queues(jobQueues=[jobQueueName])\n", " jobQueue = describe['jobQueues'][0]\n", " status = jobQueue['status']\n", " if status == 'VALID':\n", " print('\\rSuccessfully created job queue {}'.format(jobQueueName))\n", " break\n", " elif status == 'INVALID':\n", " reason = jobQueue['statusReason']\n", " raise Exception('Failed to create job queue: {}'.format(reason))\n", " print('\\rCreating job queue... 
')\n", " time.sleep(5)\n", "\n", " return response" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = create_job_queue(hp_env, default_env, 1000)\n", "hp_queue_arn = resp['jobQueueArn']\n", "hp_queue = resp['jobQueueName']\n", "print(hp_queue_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = create_job_queue(default_env, hp_env, 1)\n", "default_queue_arn = resp['jobQueueArn']\n", "default_queue = resp['jobQueueName']\n", "print(default_queue_arn)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Launch the Cromwell Server CloudFormation stack](https://docs.opendata.aws/genomics-workflows/cromwell/cromwell-aws-batch/)\n", "\n", "#### Cromwell Server\n", "To ensure the highest level of security, and robustness for long running workflows, it is recommended that you use an EC2 instance as your Cromwell server for submitting workflows to AWS Batch.\n", "\n", "A couple things to note:\n", "\n", "* This server does not need to be permanent. In fact, when you are not running workflows, you should stop or terminate the instance so that you are not paying for resources you are not using.\n", "\n", "* You can launch a Cromwell server just for yourself and exactly when you need it.\n", "\n", "* This server does not need to be in the same VPC as the one that Batch will launch instances in.\n", "\n", "We will pull the latest Cromwell server CloudFormation, build the required parameters, and create the stack." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://s3.amazonaws.com/aws-genomics-workflows/templates/cromwell/cromwell-server.template.yaml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat cromwell-server.template.yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Required parameters for CloudFormation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('VpcId={}'.format(vpc_id))\n", "print('PublicSubnetId={}'.format(subnet_id))\n", "print('KeyName={}'.format(key_name))\n", "print('S3BucketName={}'.format(bucket))\n", "print('BatchQueue={}'.format(default_queue_arn))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Replace the `ParameterValue` below based on the `ParameterKey` above.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile cromwell-params.json\n", "\n", "[\n", " {\n", " \"ParameterKey\": \"InstanceType\",\n", " \"ParameterValue\": \"t2.medium\"\n", " }, \n", " {\n", " \"ParameterKey\": \"VpcId\",\n", " \"ParameterValue\": \"{{VpcId}}\"\n", " }, \n", " {\n", " \"ParameterKey\": \"PublicSubnetID\",\n", " \"ParameterValue\": \"{{PublicSubnetId}}\"\n", " }, \n", " {\n", " \"ParameterKey\": \"LatestAmazonLinuxAMI\",\n", " \"ParameterValue\": \"/aws/service/ami-amazon-linux-latest/amzn-ami-hvm-x86_64-gp2\"\n", " }, \n", " {\n", " \"ParameterKey\": \"InstanceName\",\n", " \"ParameterValue\": \"cromwell-server\"\n", " }, \n", " {\n", " \"ParameterKey\": \"KeyName\",\n", " \"ParameterValue\": \"{{KeyName}}\"\n", " }, \n", " {\n", " \"ParameterKey\": \"SSHLocation\",\n", " \"ParameterValue\": \"0.0.0.0/0\"\n", " }, \n", " {\n", " \"ParameterKey\": \"HTTPLocation\",\n", " \"ParameterValue\": \"0.0.0.0/0\"\n", " }, \n", " {\n", " \"ParameterKey\": \"S3BucketName\",\n", " \"ParameterValue\": 
\"{{S3BucketName}}\"\n", " }, \n", " {\n", " \"ParameterKey\": \"BatchQueue\",\n", " \"ParameterValue\": \"{{BatchQueue}}\"\n", " }\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the Cromwell Server CloudFormation stack" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stack_name = 'CromwellServer'\n", "\n", "!aws cloudformation create-stack --stack-name $stack_name \\\n", " --template-body file://cromwell-server.template.yaml \\\n", " --parameters file://cromwell-params.json \\\n", " --capabilities CAPABILITY_IAM \\\n", " --region $region" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait for Cromwell CloudFormation stack completion " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('waiting for stack complete...')\n", "waiter = cfn.get_waiter('stack_create_complete')\n", "waiter.wait(\n", " StackName=stack_name\n", ")\n", "print('stack complete.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Outputs from CloudFormation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = cfn.describe_stacks(StackName=stack_name)\n", "\n", "outputs = response[\"Stacks\"][0][\"Outputs\"]\n", "pd.set_option('display.max_colwidth', -1)\n", "pd.DataFrame(outputs, columns=[\"OutputKey\", \"OutputValue\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the Hello World wdl file for Cromwell" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile simple-hello.wdl\n", "\n", "task echoHello{\n", " command {\n", " echo \"Hello AWS!\"\n", " }\n", " runtime {\n", " docker: \"ubuntu:latest\"\n", " }\n", "\n", "}\n", "\n", "workflow printHelloAndGoodbye {\n", " call echoHello\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit Hello World wdl file\n", "\n", "In the curl below swap out `{{cromwell server}}` with the `HostName` from the CloudFormation template above. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -X POST \"http://{{cromwell server}}/api/workflows/v1\" \\\n", " -H \"accept: application/json\" \\\n", " -F \"workflowSource=@simple-hello.wdl\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logging\n", "\n", "All Cromwell server logs get sent to CloudWatch Logs. With these logs you can diagnose and troubleshoot any issues that may arise with submitting jobs to the AWS Batch compute environments." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('https://{0}.console.aws.amazon.com/cloudwatch/home?region={0}#logStream:group=cromwell-server'.format(region))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Real world example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile HaplotypeCaller.aws.wdl\n", "\n", "## Copyright Broad Institute, 2017\n", "##\n", "## This WDL workflow runs HaplotypeCaller from GATK4 in GVCF mode on a single sample\n", "## according to the GATK Best Practices (June 2016), scattered across intervals.\n", "##\n", "## Requirements/expectations :\n", "## - One analysis-ready BAM file for a single sample (as identified in RG:SM)\n", "## - Set of variant calling intervals lists for the scatter, provided in a file\n", "##\n", "## Outputs :\n", "## - One GVCF file and its index\n", "##\n", "## Cromwell version support\n", "## - Successfully tested on v29\n", "## - Does not work on versions < v23 due to output syntax\n", "##\n", "## IMPORTANT NOTE: HaplotypeCaller in GATK4 is still in evaluation phase and should not\n", "## be used in production until it has been fully vetted. In the meantime, use the GATK3\n", "## version for any production needs.\n", "##\n", "## Runtime parameters are optimized for Broad's Google Cloud Platform implementation.\n", "##\n", "## LICENSING :\n", "## This script is released under the WDL source code license (BSD-3) (see LICENSE in\n", "## https://github.com/broadinstitute/wdl). Note however that the programs it calls may\n", "## be subject to different licenses. Users are responsible for checking that they are\n", "## authorized to run all programs before running this script. 
Please see the dockers\n", "## for detailed licensing information pertaining to the included programs.\n", "\n", "# WORKFLOW DEFINITION\n", "workflow HaplotypeCallerGvcf_GATK4 {\n", " File input_bam\n", " File input_bam_index\n", " File ref_dict\n", " File ref_fasta\n", " File ref_fasta_index\n", " File scattered_calling_intervals_list\n", "\n", " String gatk_docker\n", "\n", " String gatk_path\n", "\n", " Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list)\n", "\n", " String sample_basename = basename(input_bam, \".bam\")\n", "\n", " String gvcf_name = sample_basename + \".g.vcf.gz\"\n", " String gvcf_index = sample_basename + \".g.vcf.gz.tbi\"\n", "\n", " # Call variants in parallel over grouped calling intervals\n", " scatter (interval_file in scattered_calling_intervals) {\n", "\n", " # Generate GVCF by interval\n", " call HaplotypeCaller {\n", " input:\n", " input_bam = input_bam,\n", " input_bam_index = input_bam_index,\n", " interval_list = interval_file,\n", " gvcf_name = gvcf_name,\n", " ref_dict = ref_dict,\n", " ref_fasta = ref_fasta,\n", " ref_fasta_index = ref_fasta_index,\n", " docker_image = gatk_docker,\n", " gatk_path = gatk_path\n", " }\n", " }\n", "\n", " # Merge per-interval GVCFs\n", " call MergeGVCFs {\n", " input:\n", " input_vcfs = HaplotypeCaller.output_gvcf,\n", " vcf_name = gvcf_name,\n", " vcf_index = gvcf_index,\n", " docker_image = gatk_docker,\n", " gatk_path = gatk_path\n", " }\n", "\n", " # Outputs that will be retained when execution is complete\n", " output {\n", " File output_merged_gvcf = MergeGVCFs.output_vcf\n", " File output_merged_gvcf_index = MergeGVCFs.output_vcf_index\n", " }\n", "}\n", "\n", "# TASK DEFINITIONS\n", "\n", "# HaplotypeCaller per-sample in GVCF mode\n", "task HaplotypeCaller {\n", " File input_bam\n", " File input_bam_index\n", " String gvcf_name\n", " File ref_dict\n", " File ref_fasta\n", " File ref_fasta_index\n", " File interval_list\n", " Int? interval_padding\n", " Float? contamination\n", " Int? 
max_alt_alleles\n", "\n", " Int preemptible_tries\n", " Int disk_size\n", " String mem_size\n", "\n", " String docker_image\n", " String gatk_path\n", " String java_opt\n", "\n", " command {\n", " ${gatk_path} --java-options ${java_opt} \\\n", " HaplotypeCaller \\\n", " -R ${ref_fasta} \\\n", " -I ${input_bam} \\\n", " -O ${gvcf_name} \\\n", " -L ${interval_list} \\\n", " -ip ${default=100 interval_padding} \\\n", " -contamination ${default=0 contamination} \\\n", " --max-alternate-alleles ${default=3 max_alt_alleles} \\\n", " -ERC GVCF\n", " }\n", "\n", " runtime {\n", " docker: docker_image\n", " memory: mem_size\n", " cpu: 1\n", " disks: \"local-disk\"\n", " }\n", "\n", " output {\n", " File output_gvcf = \"${gvcf_name}\"\n", " }\n", "}\n", "\n", "# Merge GVCFs generated per-interval for the same sample\n", "task MergeGVCFs {\n", " Array [File] input_vcfs\n", " String vcf_name\n", " String vcf_index\n", "\n", " Int preemptible_tries\n", " Int disk_size\n", " String mem_size\n", "\n", " String docker_image\n", " String gatk_path\n", " String java_opt\n", "\n", " command {\n", " ${gatk_path} --java-options ${java_opt} \\\n", " MergeVcfs \\\n", " --INPUT=${sep=' --INPUT=' input_vcfs} \\\n", " --OUTPUT=${vcf_name}\n", " }\n", "\n", " runtime {\n", " docker: docker_image\n", " memory: mem_size\n", " cpu: 1\n", " disks: \"local-disk\"\n", "}\n", "\n", " output {\n", " File output_vcf = \"${vcf_name}\"\n", " File output_vcf_index = \"${vcf_index}\"\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Input parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile HaplotypeCaller.aws.json\n", "\n", "{\n", " \"##_COMMENT1\": \"INPUT BAM\",\n", " \"HaplotypeCallerGvcf_GATK4.input_bam\": \"s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam\",\n", " \"HaplotypeCallerGvcf_GATK4.input_bam_index\": \"s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bai\",\n", "\n", " \"##_COMMENT2\": \"REFERENCE FILES\",\n", " \"HaplotypeCallerGvcf_GATK4.ref_dict\": \"s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict\",\n", " \"HaplotypeCallerGvcf_GATK4.ref_fasta\": \"s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta\",\n", " \"HaplotypeCallerGvcf_GATK4.ref_fasta_index\": \"s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai\",\n", "\n", " \"##_COMMENT3\": \"INTERVALS\",\n", " \"HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list\": \"s3://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt\",\n", " \"HaplotypeCallerGvcf_GATK4.HaplotypeCaller.interval_padding\": 100,\n", "\n", " \"##_COMMENT4\": \"DOCKERS\",\n", " \"HaplotypeCallerGvcf_GATK4.gatk_docker\": \"broadinstitute/gatk:4.0.0.0\",\n", "\n", " \"##_COMMENT5\": \"PATHS\",\n", " \"HaplotypeCallerGvcf_GATK4.gatk_path\": \"/gatk/gatk\",\n", "\n", " \"##_COMMENT6\": \"JAVA OPTIONS\",\n", " \"HaplotypeCallerGvcf_GATK4.HaplotypeCaller.java_opt\": \"-Xms8000m\",\n", " \"HaplotypeCallerGvcf_GATK4.MergeGVCFs.java_opt\": \"-Xms8000m\",\n", "\n", " \"##_COMMENT7\": \"MEMORY ALLOCATION\",\n", " \"HaplotypeCallerGvcf_GATK4.HaplotypeCaller.mem_size\": \"10 GB\",\n", " \"HaplotypeCallerGvcf_GATK4.MergeGVCFs.mem_size\": \"30 GB\",\n", "\n", " \"##_COMMENT8\": \"DISK SIZE ALLOCATION\",\n", " \"HaplotypeCallerGvcf_GATK4.HaplotypeCaller.disk_size\": 100,\n", " \"HaplotypeCallerGvcf_GATK4.MergeGVCFs.disk_size\": 100,\n", "\n", " \"##_COMMENT9\": \"PREEMPTION\",\n", " 
\"HaplotypeCallerGvcf_GATK4.HaplotypeCaller.preemptible_tries\": 3,\n", " \"HaplotypeCallerGvcf_GATK4.MergeGVCFs.preemptible_tries\": 3\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit job to Cromwell server" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -X POST \"http://{{cromwell server}}/api/workflows/v1\" \\\n", " -H \"accept: application/json\" \\\n", " -F \"workflowSource=@HaplotypeCaller.aws.wdl\" \\\n", " -F \"workflowInputs=@HaplotypeCaller.aws.json\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def delete_compute_environment(computeEnvironment):\n", " response = batch.update_compute_environment(\n", " computeEnvironment=computeEnvironment,\n", " state='DISABLED',\n", " )\n", " print(response)\n", " time.sleep(10)\n", " response = batch.delete_compute_environment(\n", " computeEnvironment=computeEnvironment\n", " )\n", " return response\n", "\n", "def delete_job_queue(name):\n", " response = batch.update_job_queue(\n", " jobQueue=name,\n", " state='DISABLED'\n", " )\n", " print(response)\n", " time.sleep(10)\n", " response = batch.delete_job_queue(\n", " jobQueue=name\n", " )\n", " return response\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = delete_job_queue(hp_queue)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = delete_job_queue(default_queue)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = delete_compute_environment(hp_ce)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "resp = delete_compute_environment(default_ce)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = cfn.delete_stack(StackName=stack_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('waiting for stack complete...')\n", "waiter = cfn.get_waiter('stack_delete_complete')\n", "waiter.wait(\n", " StackName=stack_name\n", ")\n", "print('stack complete.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = ec2_client.delete_key_pair(KeyName=key_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "workshop.vpc_cleanup(vpc_id)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }