{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Running WDL and Nextflow pipelines with HealthOmics Workflows" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, you will learn how to create, run, and debug WDL and Nextflow based pipelines that process data from HealthOmics Storage and Amazon S3 using HealthOmics Workflows." ] }, { "attachments": {}, "cell_type": "markdown", "id": "4550118a", "metadata": {}, "source": [ "## Prerequisites\n", "### Python requirements\n", "* Python >= 3.8\n", "* Packages:\n", " * boto3 >= 1.26.19\n", " * botocore >= 1.29.19\n", "\n", "### AWS requirements\n", "\n", "#### AWS CLI\n", "You will need the AWS CLI installed and configured in your environment. Supported AWS CLI versions are:\n", "\n", "* AWS CLI v2 >= 2.9.3 (Recommended)\n", "* AWS CLI v1 >= 1.27.19\n", "\n", "#### Output buckets\n", "You will need a bucket **in the same region** you are running this tutorial in to store workflow outputs.\n", "\n", "#### Input data\n", "If you modify any of the workflows to retrieve input data (e.g. references or raw reads), that data **MUST reside in the same region**. AWS HealthOmics does not support cross-region read or write at this time." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Environment setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from datetime import datetime\n", "import glob\n", "import io\n", "import os\n", "from pprint import pprint\n", "from textwrap import dedent\n", "from time import sleep\n", "from urllib.parse import urlparse\n", "from zipfile import ZipFile, ZIP_DEFLATED\n", "\n", "import boto3\n", "import botocore.exceptions" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Create a service IAM role\n", "To use AWS HealthOmics, you need to create an IAM role that grants the service permissions to access resources in your account. We'll do this below using the IAM client.\n", "\n", "> **Note**: this step is fully automated from the HealthOmics Workflows Console when you create a run" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dt_fmt = '%Y%m%dT%H%M%S'\n", "ts = datetime.now().strftime(dt_fmt)\n", "\n", "iam = boto3.client('iam')\n", "role = iam.create_role(\n", " RoleName=f\"OmicsServiceRole-{ts}\",\n", " AssumeRolePolicyDocument=json.dumps({\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [{\n", " \"Principal\": {\n", " \"Service\": \"omics.amazonaws.com\"\n", " },\n", " \"Effect\": \"Allow\",\n", " \"Action\": \"sts:AssumeRole\"\n", " }]\n", " }),\n", " Description=\"HealthOmics service role\",\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "After creating the role, we next need to add policies to grant permissions. In this case, we are allowing read/write access to all S3 buckets in the account. This is fine for this tutorial, but in a real world setting you will want to scope this down to only the necessary resources. We are also adding a permissions to create CloudWatch Logs which is where any outputs sent to `STDOUT` or `STDERR` are collected." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "AWS_ACCOUNT_ID = boto3.client('sts').get_caller_identity()['Account']\n", "\n", "policy_s3 = iam.create_policy(\n", " PolicyName=f\"omics-s3-access-{ts}\",\n", " PolicyDocument=json.dumps({\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"s3:PutObject\",\n", " \"s3:Get*\",\n", " \"s3:List*\",\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::*/*\"\n", " ]\n", " }\n", " ]\n", " })\n", ")\n", "\n", "policy_logs = iam.create_policy(\n", " PolicyName=f\"omics-logs-access-{ts}\",\n", " PolicyDocument=json.dumps({\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"logs:CreateLogGroup\"\n", " ],\n", " \"Resource\": [\n", " f\"arn:aws:logs:*:{AWS_ACCOUNT_ID}:log-group:/aws/omics/WorkflowLog:*\"\n", " ]\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"logs:DescribeLogStreams\",\n", " \"logs:CreateLogStream\",\n", " \"logs:PutLogEvents\",\n", " ],\n", " \"Resource\": [\n", " f\"arn:aws:logs:*:{AWS_ACCOUNT_ID}:log-group:/aws/omics/WorkflowLog:log-stream:*\"\n", " ]\n", " }\n", " ]\n", " })\n", ")\n", "\n", "\n", "for policy in (policy_s3, policy_logs):\n", " _ = iam.attach_role_policy(\n", " RoleName=role['Role']['RoleName'],\n", " PolicyArn=policy['Policy']['Arn']\n", " )" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Using AWS HealthOmics Workflows - the basics\n", "AWS HealthOmics Workflows allows you to perform bioinformatics compute - like genomics secondary analysis - at scale on AWS. These compute workloads are defined using workflow languages like WDL and Nextflow that specify multiple compute tasks and their input and output dependencies.\n", "\n", "The cell below creates an example WDL workflow. (To learn more about WDL see: https://github.com/openwdl/wdl). This is a simple workflow with one task that creates a copy of a file. It's simple enough that we can stash the entire definition in a Python string. Note that more complex workflows may be larger and have multiple files. In that case, it would be better to create and edit the workflow in a separate text editor, notably one that also supports language specific syntax highlighting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.makedirs('workflows/wdl/sample', exist_ok=True)\n", "\n", "wdl = dedent(\"\"\"\n", "version 1.0\n", "\n", "workflow Test {\n", "\tinput {\n", "\t\tFile input_file\n", "\t}\n", "\n", "\tcall FileCopy {\n", "\t\tinput:\n", "\t\t\tinput_file = input_file,\n", "\t}\n", "\n", "\toutput {\n", "\t\tFile output_file = FileCopy.output_file\n", "\t}\n", "}\n", "\n", "task FileCopy {\n", "\tinput {\n", "\t\tFile input_file\n", "\t}\n", "\n", "\tcommand {\n", "\t\techo \"copying ~{input_file}\" | tee >(cat >&2)\n", "\t\tcat ~{input_file} > output\n", "\t}\n", "\n", "\toutput {\n", "\t\tFile output_file = \"output\"\n", "\t}\n", "}\n", "\"\"\").strip()\n", "\n", "with open('workflows/wdl/sample/main.wdl', 'wt') as f:\n", " f.write(wdl)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To run this workflow, we'll start by creating a client for the `omics` service." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "omics = boto3.client('omics')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to bundle up the workflow as a zip-file and call the `create_workflow` API for `omics`. We'll encapsulate these operations in a function for reuse later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_workflow(\n", " workflow_root_dir, \n", " parameters={\"param_name\":{\"description\": \"param_desc\"}}, \n", " name=None, \n", " description=None, \n", " main=None):\n", " buffer = io.BytesIO()\n", " print(\"creating zip file:\")\n", " with ZipFile(buffer, mode='w', compression=ZIP_DEFLATED) as zf:\n", " for file in glob.iglob(os.path.join(workflow_root_dir, '**/*'), recursive=True):\n", " if os.path.isfile(file):\n", " arcname = file.replace(os.path.join(workflow_root_dir, ''), '')\n", " print(f\".. adding: {file} -> {arcname}\")\n", " zf.write(file, arcname=arcname)\n", "\n", " response = omics.create_workflow(\n", " name=name,\n", " description=description,\n", " definitionZip=buffer.getvalue(), # this argument needs bytes\n", " main=main,\n", " parameterTemplate=parameters,\n", " )\n", "\n", " workflow_id = response['id']\n", " print(f\"workflow {workflow_id} created, waiting for it to become ACTIVE\")\n", "\n", " try:\n", " waiter = omics.get_waiter('workflow_active')\n", " waiter.wait(id=workflow_id)\n", "\n", " print(f\"workflow {workflow_id} ready for use\")\n", " except botocore.exceptions.WaiterError as e:\n", " print(f\"workflow {workflow_id} FAILED:\")\n", " print(e)\n", "\n", " workflow = omics.get_workflow(id=workflow_id)\n", " return workflow" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "There are a few things to notice:\n", "\n", "* To avoid polluting the local filesystem the zip-file is created in-memory with a byte buffer. If your workflow has many files such that the resultant bundle is large, you should consider alternative means of creating the zip file.\n", "* A `main.(ext)` file, where `ext` matches the type of the workflow (e.g. `wdl`, or `nf`) must be at the root level of the zip file. HealthOmics uses this file as the primary entrypoint for the workflow. This is relevant for more modular workflows that have multiple definition files. In the call below, we explicitly point to `main.wdl`.\n", "* The `definitionZip` argument takes a binary value and reads the byte buffer value directly.\n", "* The `parameters` argument is a list of `parameterTemplate`s, which for now provide the parameter's name, and a description of what the parameter is. Actual parameter values are provided when the workflow is \"run\" - more on this below." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can now use this function to create a workflow in HealthOmics Workflows from our WDL definition above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "workflow = create_workflow(\n", " 'workflows/wdl/sample', \n", " parameters={\"input_file\": {\"description\": \"input text file to copy\"}},\n", " name=\"Sample\",\n", " description=\"Sample WDL workflow\",\n", " main=\"main.wdl\"\n", ")\n", "pprint(workflow)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now we can start a workflow run with some input data using the `start_run` API call.\n", "\n", "Note the following:\n", "* Here the parameter value `input_file` is associated with an S3 uri. This is specific to this case. Workflow parameters will vary depending on the workflow definition they are associated with.\n", "\n", "* We provide the ARN to the service role we created above. You can specify different roles as needed depending on what resources your workflow needs access to.\n", "\n", "* We provide an `outputUri` with `start_run`. This is where the workflow will place **final** outputs as they are defined by the workflow definition (e.g. values in the `workflow.output` block of a WDL workflow). All intermediate results are discarded once the workflow completes.\n", "\n", "In the cell below, we're using `waiters` to check for when the run starts and completes. These will block the current execution thread.\n", "\n", "It will take about **30min** for this workflow to start (scale up resources), run, and stop (scale down resources). Because this workflow is simple, the time it spends in a `RUNNING` state is fairly short relative to the scale-up/down times. For more complex workflows, or ones that process large amounts of data, the `RUNNING` state will be much longer (e.g. several hours). In that case, it's recommended to asynchronously check on the workflow status." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "role['Role']['Arn']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "## NOTE: replace these S3 URIs with ones you have access to\n", "input_uri = \"s3://my_source_data_bucket/source_file_to_copy\"\n", "output_uri = \"s3://my_results_data_bucket/path/to/results\"\n", "\n", "run = omics.start_run(\n", " workflowId=workflow['id'],\n", " name=\"Sample workflow run\",\n", " roleArn=role['Role']['Arn'],\n", " parameters={\n", " \"input_file\": input_uri\n", " },\n", " outputUri=output_uri,\n", ")\n", "\n", "print(f\"running workflow {workflow['id']}, starting run {run['id']}\")\n", "try:\n", " waiter = omics.get_waiter('run_running')\n", " waiter.wait(id=run['id'], WaiterConfig={'Delay': 30, 'MaxAttempts': 60})\n", "\n", " print(f\"run {run['id']} is running\")\n", "\n", " waiter = omics.get_waiter('run_completed')\n", " waiter.wait(id=run['id'], WaiterConfig={'Delay': 60, 'MaxAttempts': 60})\n", "\n", " print(f\"run {run['id']} completed\")\n", "except botocore.exceptions.WaiterError as e:\n", " print(e)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Once the run completes we can verify its status by either listing it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[_ for _ in omics.list_runs()['items'] if _['id'] == run['id']]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "or getting its full details:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "omics.get_run(id=run['id'])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can verify that the correct output was generated by listing the `outputUri` for the workflow run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3uri = urlparse(omics.get_run(id=run['id'])['outputUri'])\n", "boto3.client('s3').list_objects_v2(\n", " Bucket=s3uri.netloc, \n", " Prefix='/'.join([s3uri.path[1:], run['id']])\n", ")['Contents']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Workflows typically have multiple tasks. We can list workflow tasks with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tasks = omics.list_run_tasks(id=run['id'])\n", "pprint(tasks['items'])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "and get specific task details with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "task = omics.get_run_task(id=run['id'], taskId=tasks['items'][0]['taskId'])\n", "pprint(task)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "After running the cell above we should see that each task has an associated CloudWatch Logs LogStream. These capture any text generated by the workflow task that has been sent to either `STDOUT` or `STDERR`. 
These outputs are helpful for debugging any task failures and can be retrieved with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "events = boto3.client('logs').get_log_events(\n", "    logGroupName=\"/aws/omics/WorkflowLog\",\n", "    logStreamName=f\"run/{run['id']}/task/{task['taskId']}\"\n", ")\n", "for event in events['events']:\n", "    print(event['message'])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Using HealthOmics Workflows RunGroups\n", "HealthOmics Workflows Run Groups are a means to control the amount of resources a set of runs can use, and therefore the costs associated with running workflows. With a Run Group you can set the maximum number of concurrent vCPUs used by tasks, the maximum duration of individual runs, and the maximum number of concurrent runs.\n", "\n", "In the cell below, we'll create a run group with a maximum of 100 vCPUs and a run duration limit of 600 minutes (10 hrs)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run_group = omics.create_run_group(\n", "    name=\"TestRunGroup\",\n", "    maxCpus=100,\n", "    maxDuration=600,\n", ")\n", "\n", "omics.get_run_group(id=run_group['id'])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "One of the ways you can use a RunGroup is to run multiple iterations of a workflow - each with different input values. Below we'll define a simple Nextflow workflow that takes a string parameter we can easily modify for each iteration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.makedirs('workflows/nf/sample', exist_ok=True)\n", "\n", "nf = dedent('''\n", "nextflow.enable.dsl = 2\n", "\n", "params.greeting = 'hello'\n", "params.addressee = null\n", "\n", "if (!params.addressee) exit 1, \"required parameter 'addressee' missing\"\n", "\n", "process Greet {\n", "    publishDir '/mnt/workflow/pubdir'\n", "    input:\n", "        val greeting\n", "        val addressee\n", "\n", "    output:\n", "        path \"output\", emit: output_file\n", "\n", "    script:\n", "        \"\"\"\n", "        echo \"${greeting} ${addressee}\" | tee output\n", "        \"\"\"\n", "}\n", "\n", "workflow {\n", "    Greet(params.greeting, params.addressee)\n", "}\n", "\n", "''').strip()\n", "\n", "with open('workflows/nf/sample/main.nf', 'wt') as f:\n", "    f.write(nf)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the `create_workflow` function we defined above to create a HealthOmics workflow from this definition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "workflow = create_workflow(\n", "    'workflows/nf/sample',\n", "    parameters={\n", "        \"greeting\": {\n", "            \"description\": \"(string) greeting to use\",\n", "            \"optional\": True\n", "        },\n", "        \"addressee\": {\n", "            \"description\": \"(string) who to greet\"\n", "        }\n", "    },\n", "    name=\"GreetingsNF\",\n", "    description=\"Greetings Nextflow workflow\",\n", "    main=\"main.nf\"\n", ")\n", "pprint(workflow)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can now run this workflow with our run group.\n
We'll start several runs of the workflow concurrently, each with different inputs to distinguish them, to see how the run group works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rg_runs = []\n", "run_inputs = [\n", " {\"greeting\": \"Hello\", \"addressee\": \"AWS\"},\n", " {\"greeting\": \"Bonjour\", \"addressee\": \"HealthOmics\"},\n", " {\"greeting\": \"Hola\", \"addressee\": \"Workflows\"},\n", "]\n", "\n", "for run_num, run_input in enumerate(run_inputs):\n", " run = omics.start_run(\n", " workflowId=workflow['id'],\n", " name=f\"{workflow['name']} - {run_num} :: {run_input}\",\n", " roleArn=role['Role']['Arn'],\n", " parameters=run_input,\n", " outputUri=output_uri,\n", " \n", " runGroupId=run_group['id'], # <-- here is where we specify the run group\n", " )\n", " \n", " print(f\"({run_num}) workflow {workflow['id']}, run group {run_group['id']}, run {run['id']}, input {run_input}\")\n", " rg_runs += [run]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can now list all the runs in the RunGroup and should see all of them transition from `PENDING` to `STARTING` at once.\n", "\n", "(run the following cell multiple times)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(_['id'], _['status']) for _ in omics.list_runs(runGroupId=run_group['id'])['items']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 }