{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "86c73553", "metadata": {}, "source": [ "# Querying annotations and variants with HealthOmics Analytics" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f82bb38a", "metadata": {}, "source": [ "The goal of this notebook is to help you get started importing variant and annotation data into a variety of HealthOmics Analytics stores. We will cover:\n", "\n", "1. VCF and gVCF inputs\n", "2. Annotation data in VCF, TSV, and GFF formats\n", "3. How to write queries on these data\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "30100b3f", "metadata": {}, "source": [ "## Prerequisites\n", "### Python requirements\n", "* Python >= 3.8\n", "* Packages:\n", " * boto3 >= 1.26.19\n", " * botocore >= 1.29.19\n", " * [AWS Wrangler](https://pypi.org/project/awswrangler/) >= 2.17.0\n", "\n", "### AWS requirements\n", "\n", "#### AWS CLI\n", "You will need the AWS CLI installed and configured in your environment. Supported AWS CLI versions are:\n", "\n", "* AWS CLI v2 >= 2.9.3 (Recommended)\n", "* AWS CLI v1 >= 1.27.19\n", "\n", "#### AWS Region\n", "AWS HealthOmics is currently available in Oregon (us-west-2), N. Virginia (us-east-1), Dublin (eu-west-1), London (eu-west-2), Frankfurt (eu-central-1), and Singapore (ap-southeast-1).\n", "\n", "**================**\n", "\n", "**AWS HealthOmics only supports importing data within the same region.**\n", "\n", "This notebook works in all AWS HealthOmics supported regions, utilizing regional buckets. Data in the regional buckets originates from the following publicly available buckets:\n", "\n", "* s3://1000genomes-dragen/data/precisionFDA/hg38-graph-based/\n", "* s3://aws-genomics-datasets/omics-e2e/clinvar.vcf.gz\n", "\n", "#### Output buckets\n", "You will need a bucket in the same region you are running this tutorial in to store query outputs.\n", "\n", "#### AWS HealthOmics references\n", "You need to have the following AWS HealthOmics resources available:\n", "\n", "* Reference store\n", "* A reference in the Reference store named \"GRCh38\"\n", "\n", "For more information on how to create these, see the AWS HealthOmics Storage tutorial." ] }, { "attachments": {}, "cell_type": "markdown", "id": "14d8e363", "metadata": {}, "source": [ "## Set Up" ] }, { "attachments": {}, "cell_type": "markdown", "id": "14cc69fb-15ed-4ad7-b1f0-149f88e31c64", "metadata": {}, "source": [ "### Execution role\n", "\n", "To run this notebook in Amazon Sagemaker on a notebook instance you will need an execution role with the following permissions:\n", "\n", "```json\n", "{\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"iam:GetUser\",\n", " \"iam:GetPolicy\",\n", " \"iam:CreatePolicy\",\n", " \"iam:GetRole\",\n", " \"iam:PassRole\",\n", " \"iam:CreateRole\",\n", " \"iam:AttachRolePolicy\"\n", " ],\n", " \"Resource\": \"*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"omics:*\"\n", " ],\n", " \"Resource\": \"*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"ram:AcceptResourceShareInvitation\",\n", " \"ram:GetResourceShareInvitations\",\n", " \"ram:ListResources\",\n", " \"ram:GetResourceShares\"\n", " ],\n", " \"Resource\": \"*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"glue:CreateTable\",\n", " \"glue:DeleteTable\",\n", " \"glue:GetTable\",\n", " \"glue:UpdateTable\",\n", " \"glue:GetTables\"\n", " ],\n", " \"Resource\": \"*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"athena:ListWorkGroups\",\n", " \"athena:CreateWorkGroup\",\n", " \"athena:GetWorkGroup\",\n", " \"athena:StartQueryExecution\",\n", " \"athena:GetQueryExecution\",\n", " \"athena:GetQueryResults\"\n", " ],\n", " \"Resource\": [\n", " \"*\"\n", " ]\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"lakeformation:GetDataAccess\",\n", " \"lakeformation:GrantPermissions\",\n", " \"lakeformation:GetDataLakeSettings\",\n", " \"lakeformation:PutDataLakeSettings\"\n", " ],\n", " \"Resource\": [\n", " \"*\"\n", " ]\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"s3:PutObject\",\n", " \"s3:GetObject\",\n", " \"s3:ListBucket\",\n", " \"s3:DeleteObject\"\n", " ],\n", " \"Resource\": \"arn:aws:s3:::*\"\n", " }\n", " ]\n", "}\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f3f7f81e-cdb8-4128-ad96-8ccf534531b8", "metadata": {}, "source": [ "### Python package imports" ] }, { "cell_type": "code", "execution_count": null, "id": "2b32e188", "metadata": { "tags": [] }, "outputs": [], "source": [ "from datetime import datetime\n", "import json\n", "import os\n", "import time\n", "import urllib\n", "\n", "import boto3\n", "import botocore.exceptions\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4ee00a75", "metadata": {}, "source": [ "### Create a service IAM role" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1f69e3ce", "metadata": {}, "source": [ "For the purposes of this tutorial, we will use the following policy and trust policy to demo the capabilities of AWS HealthOmics, you are free to customize permissions as required for your use case after going though this tutorial.\n", "\n", "**NOTE:**\n", "In this case we've defined rather permissive permissions (i.e. \"\\*\" Resources). In particular, we are allowing read/write access to all S3 buckets available to the account for this tutorial. In a real world setting you will want to scope this down to only the minimally needed actions on necessary resources." ] }, { "cell_type": "code", "execution_count": null, "id": "0060b6b7", "metadata": { "tags": [] }, "outputs": [], "source": [ "# set a timestamp\n", "dt_fmt = '%Y%m%dT%H%M%S'\n", "ts = datetime.now().strftime(dt_fmt)" ] }, { "cell_type": "code", "execution_count": null, "id": "d99a6a17", "metadata": { "tags": [] }, "outputs": [], "source": [ "demo_policy = {\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"omics:*\"\n", " ],\n", " \"Resource\": \"*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"ram:AcceptResourceShareInvitation\",\n", " \"ram:GetResourceShareInvitations\"\n", " ],\n", " \"Resource\": \"*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": [\n", " \"s3:GetBucketLocation\",\n", " \"s3:PutObject\",\n", " \"s3:GetObject\",\n", " \"s3:ListBucket\",\n", " \"s3:AbortMultipartUpload\",\n", " \"s3:ListMultipartUploadParts\",\n", " \"s3:GetObjectAcl\",\n", " \"s3:PutObjectAcl\"\n", " ],\n", " \"Resource\": \"*\"\n", " }\n", " ]\n", "}\n", "\n", "demo_trust_policy = {\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"omics.amazonaws.com\"\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " }\n", " ]\n", "}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ee051a10", "metadata": {}, "source": [ "In order to proceed we need to create a couple of resources the first is the role that you will be passing into AWS HealthOmics. If the role doesn't exist, we will need to create it and attach the policy and trust policy defined above." ] }, { "cell_type": "code", "execution_count": null, "id": "d87de02a", "metadata": { "tags": [] }, "outputs": [], "source": [ "# We will use this as the base name for our role and policy\n", "omics_iam_name = f'OmicsTutorialRole-{ts}'\n", "\n", "# Create the iam client\n", "iam = boto3.resource('iam')\n", "\n", "# Check if the role already exist if not create it\n", "try:\n", " role = iam.Role(omics_iam_name)\n", " role.load()\n", " \n", "except botocore.exceptions.ClientError as ex:\n", " if ex.response[\"Error\"][\"Code\"] == \"NoSuchEntity\":\n", " #Create the role with the corresponding trust policy\n", " role = iam.create_role(\n", " RoleName=omics_iam_name, \n", " AssumeRolePolicyDocument=json.dumps(demo_trust_policy))\n", " \n", " #Create policy\n", " policy = iam.create_policy(\n", " PolicyName='{}-policy'.format(omics_iam_name), \n", " Description=\"Policy for AWS HealthOmicsics demo\",\n", " PolicyDocument=json.dumps(demo_policy))\n", " \n", " #Attach the policy to the role\n", " policy.attach_role(RoleName=omics_iam_name)\n", " else:\n", " print('Something went wrong, please retry and check your account settings and permissions')\n", " print(ex)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cb24de06", "metadata": {}, "source": [ "Now that we know the role exist, lets create a helper function to help us retrieve the role arn which we will need to pass into the service API calls. The role arn will grant AWS HealthOmics the permissions it needs to access the resources it needs in your aws account." ] }, { "cell_type": "code", "execution_count": null, "id": "7ae8fd6b", "metadata": { "tags": [] }, "outputs": [], "source": [ "def get_role_arn(role_name):\n", " try:\n", " iam = boto3.resource('iam')\n", " role = iam.Role(role_name)\n", " role.load() # calls GetRole to load attributes\n", " except ClientError:\n", " print(\"Couldn't get role named %s.\"%role_name)\n", " raise\n", " else:\n", " return role.arn" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3dcd8ac3", "metadata": {}, "source": [ "### Creating the AWS HealthOmics client\n", "\n", "Here we create a client for AWS HealthOmics.\n", "\n", "We'll also get the region that the notebook is running in. This is used to select the right regional data source." ] }, { "cell_type": "code", "execution_count": null, "id": "c3dc86d0", "metadata": { "tags": [] }, "outputs": [], "source": [ "omics = boto3.client('omics')\n", "\n", "region = omics.meta.region_name\n", "regional_bucket = f'aws-genomics-static-{region}'" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3b26679d", "metadata": {}, "source": [ "## Variant Store\n", "\n", "We'll be using the DRAGEN Thousand Genomes data set, which is available in the [Registry of Open Data on AWS](https://registry.opendata.aws/ilmn-dragen-1kgp/). \n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "000394ad", "metadata": {}, "source": [ "In order to create a variant store, you will need to have uploaded a reference into omics storage, the following helper method will retrieve the reference id for a given reference imported into omics storage. " ] }, { "cell_type": "code", "execution_count": null, "id": "68721614", "metadata": { "tags": [] }, "outputs": [], "source": [ "def get_reference_arn(ref_name, client=None):\n", " if not client:\n", " client = boto3.client('omics')\n", " \n", " resp = client.list_reference_stores(maxResults=10)\n", " ref_stores = resp.get('referenceStores')\n", " \n", " # There can only be one reference store per account per region\n", " # if there is a store present, it is the first one\n", " ref_store = ref_stores[0] if ref_stores else None\n", " \n", " if not ref_store:\n", " raise RuntimeError(\"You have not created a reference store, please got to the AWS HealthOmics Storage tutorial to learn how to create one. Do note continue with this notebook\")\n", " \n", " ref_arn = None\n", " resp = omics.list_references(referenceStoreId=ref_store['id'])\n", " ref_list = resp.get('references')\n", " \n", " for ref in resp.get('references'):\n", " if ref['name'] == ref_name:\n", " ref_arn = ref['arn']\n", " \n", " if ref_arn == None:\n", " raise RuntimeError(f\"Could not find {ref_name}, please got to the AWS HealthOmics Storage tutorial to learn how to import one. Do note continue with this notebook\")\n", " \n", " return ref_arn" ] }, { "cell_type": "code", "execution_count": null, "id": "3e34ba04", "metadata": { "tags": [] }, "outputs": [], "source": [ "omics.list_reference_stores()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "68125bec", "metadata": {}, "source": [ "### Step 1 - Prepare data\n", "\n", "This tutorial will import a few VCFs that originated from the 1000 Genomes dataset which is available via the [Registry of Open Data on AWS](https://registry.opendata.aws/ilmn-dragen-1kgp/). The original data is located in `us-east-1`. Since HealthOmics does not allow for cross-region data access, the data have been replicated across all regions that HealthOmics is available in into S3 buckets used by this tutorial.\n", "\n", "For demonstration purposes, we're only using 3 genomes worth of data. AWS HealthOmics can handle much more, and it is possible to import the entire 1000 Genomes dataset (3202 samples)." ] }, { "cell_type": "code", "execution_count": null, "id": "e69ecd44", "metadata": { "tags": [] }, "outputs": [], "source": [ "SOURCE_VARIANT_URI = f\"s3://{regional_bucket}/omics-tutorials/data/variants\"" ] }, { "cell_type": "code", "execution_count": null, "id": "6a53d030", "metadata": { "tags": [] }, "outputs": [], "source": [ "source = urllib.parse.urlparse(SOURCE_VARIANT_URI)\n", "bucket = source.netloc\n", "prefix = source.path[1:]\n", "\n", "s3r = boto3.resource('s3')\n", "\n", "bucket = s3r.Bucket(bucket)\n", "objects = bucket.objects.filter(Prefix=prefix, MaxKeys=10_000)\n", "ext = 'hard-filtered.vcf.gz'\n", "\n", "# this is a list comprehension for demo purposes.\n", "# if you have a larger set of data a generator comprehension will be more performant here \n", "# as it effectively lazy-loads the list of object keys\n", "vcf_list = [f\"s3://{o.bucket_name}/{o.key}\" for o in objects if o.key.endswith(ext)]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0036071f", "metadata": {}, "source": [ "Let's take a look at the list of our source VCFs" ] }, { "cell_type": "code", "execution_count": null, "id": "f7aea493", "metadata": { "tags": [] }, "outputs": [], "source": [ "vcf_list" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c7ca4077", "metadata": {}, "source": [ "### Step 2 - Create Variant Store\n", "\n", "Now, let's create a variant store. Feel free to update `var_store_name`. Note that Variant store names need to match the pattern `[a-z]{1}[a-z0-9_]{254}` (i.e. alphanumeric lowercase with \"_\" up to 255 characters)" ] }, { "cell_type": "code", "execution_count": null, "id": "e09ed446", "metadata": { "tags": [] }, "outputs": [], "source": [ "var_store_name = f'tutorial_variants_{ts.lower()}'\n", "ref_name = 'GRCh38' ## Change this reference name to match one you have created if needed" ] }, { "cell_type": "code", "execution_count": null, "id": "80424e0a", "metadata": { "tags": [] }, "outputs": [], "source": [ "response = omics.create_variant_store(\n", " name=var_store_name, \n", " reference={\"referenceArn\": get_reference_arn(ref_name, omics)}\n", ")\n", "\n", "var_store = response\n", "response" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1a43b2be", "metadata": {}, "source": [ "Creating a variant store takes up to 5 min to complete. We can use a `waiter` to tell us when the store is ready to use." ] }, { "cell_type": "code", "execution_count": null, "id": "07cb1e72", "metadata": { "tags": [] }, "outputs": [], "source": [ "try:\n", " waiter = omics.get_waiter('variant_store_created')\n", " waiter.wait(name=var_store['name'])\n", "\n", " print(f\"variant store {var_store['name']} ready for use\")\n", "except botocore.exceptions.WaiterError as e:\n", " print(f\"variant store {var_store['name']} FAILED:\")\n", " print(e)\n", "\n", "var_store = omics.get_variant_store(name=var_store['name'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "55812b38", "metadata": {}, "source": [ "After the store is created, navigate to Lake Formation and create resource links as described in the AWS HealthOmics [documentation](https://docs.aws.amazon.com/omics/latest/dev/creating-variant-stores.html#create-resource-links). These resource links will be needed for querying the data in Amazon Athena below.\n", "\n", "In the meantime, you can continue importing data (these can be done in parallel)." ] }, { "attachments": {}, "cell_type": "markdown", "id": "78aea9d5", "metadata": {}, "source": [ "### Step 3 - Import VCFs" ] }, { "cell_type": "code", "execution_count": null, "id": "f7fa3609-a94a-45bc-a3b8-af207e2234db", "metadata": { "tags": [] }, "outputs": [], "source": [ "response = omics.start_variant_import_job(\n", " destinationName=var_store['name'],\n", " roleArn=get_role_arn(omics_iam_name),\n", " \n", " # since we only have 3 vcfs to import we can use a simple list comprehension here\n", " # variant import jobs have a limit of 1000 sources per import\n", " items=[{\"source\": uri} for uri in vcf_list]\n", ")\n", "response" ] }, { "attachments": {}, "cell_type": "markdown", "id": "943643ed", "metadata": {}, "source": [ "Variant import jobs can take up to 10min to complete, depending on the size of the data being imported. We can monitor the progress of our imports by periodically listing them. This will not block the kernel so we can do other things." ] }, { "cell_type": "code", "execution_count": null, "id": "ed11e6f5", "metadata": { "tags": [] }, "outputs": [], "source": [ "omics.list_variant_import_jobs(filter={\"storeName\": var_store['name']})" ] }, { "attachments": {}, "cell_type": "markdown", "id": "15084505", "metadata": {}, "source": [ "### Step 4 - Query in Athena" ] }, { "attachments": {}, "cell_type": "markdown", "id": "99fc64c0", "metadata": {}, "source": [ "In order to query your datain Amazon Athena, you need to create resource links to your database using the [AWS LakeFormation Console](https://console.aws.amazon.com/lakeformation/home). You will also need to ensure that the IAM user running this notebook is a Data Lake Administrator. **Note** without both of these in place, the following queries will fail. To satisfy these prerequisites, refer to the instructions provided in the [AWS Lake Formation documention](https://docs.aws.amazon.com/lake-formation/latest/dg/getting-started-setup.html#create-data-lake-admin) and [AWS HealthOmics documentation](https://docs.aws.amazon.com/omics/latest/dev/creating-variant-stores.html#create-resource-links)." ] }, { "attachments": {}, "cell_type": "markdown", "id": "3253c93a", "metadata": {}, "source": [ "The following code will create resource links to the `default` database in the `AwsDataCatalog` in AWS Glue. It makes a few assumptions to do so - like IAM identity you are using to run this notebook is a Data Lake Administrator and has the permissions to create AWS Glue tables.\n", "\n", "If you want to be fully sure you are making the correct resource links and providing access to them to the correct identities it is best to create them directly. Refer to the instructions in the AWS HealthOmics [documentation](https://docs.aws.amazon.com/omics/latest/dev/creating-variant-stores.html#create-resource-links) on how to do this." ] }, { "attachments": {}, "cell_type": "markdown", "id": "f80e7ad7", "metadata": {}, "source": [ "We'll need to work with AWS RAM, AWS Glue, and AWS Lakeformation to setup resource links and grant database permissions." ] }, { "cell_type": "code", "execution_count": null, "id": "e40fe661", "metadata": { "tags": [] }, "outputs": [], "source": [ "ram = boto3.client('ram')\n", "glue = boto3.client('glue')\n", "\n", "caller_identity = boto3.client('sts').get_caller_identity()\n", "AWS_ACCOUNT_ID = caller_identity['Account']\n", "AWS_IDENITY_ARN = caller_identity['Arn']" ] }, { "attachments": {}, "cell_type": "markdown", "id": "12aa5f0a", "metadata": {}, "source": [ "First we'll list available shared resources from `OTHER-ACCOUNTS` in AWS RAM and look for the resource that matches the `id` of the Variant store we created above." ] }, { "cell_type": "code", "execution_count": null, "id": "d7517618", "metadata": { "tags": [] }, "outputs": [], "source": [ "response = ram.list_resources(resourceOwner='OTHER-ACCOUNTS', resourceType='glue:Database')\n", "\n", "if not response.get('resources'):\n", " print('no shared resources found. verify that you have successfully created an HealthOmics Analytics store')\n", "else:\n", " variantstore_resources = [resource for resource in response['resources'] if var_store['id'] in resource['arn']]\n", " if not variantstore_resources:\n", " print(f\"no shared resources matching variant store id {var_store['id']} found\")\n", " else:\n", " variantstore_resource = variantstore_resources[0]\n", "\n", "variantstore_resource" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f838bd6c", "metadata": {}, "source": [ "Next, we'll get the specific resource share the Variant store is associated with. This is so we can get the `owningAccountId` attribute for the share. (Note we could also do this by parsing the `resourceShareArn` for the resource above, but doing it this way is more explicit)" ] }, { "cell_type": "code", "execution_count": null, "id": "3af41982", "metadata": { "tags": [] }, "outputs": [], "source": [ "resource_share = ram.get_resource_shares(\n", " resourceOwner='OTHER-ACCOUNTS', \n", " resourceShareArns=[variantstore_resource['resourceShareArn']])['resourceShares'][0]\n", "resource_share" ] }, { "attachments": {}, "cell_type": "markdown", "id": "96428ef3", "metadata": {}, "source": [ "Now we'll create a table in AWS Glue for the variant store. This is the same as creating a resource link in AWS LakeFormation." ] }, { "cell_type": "code", "execution_count": null, "id": "77662b2b", "metadata": { "tags": [] }, "outputs": [], "source": [ "# this creates a resource link to the table for the variant store and adds it to the `default` database\n", "glue.create_table(\n", " DatabaseName='default',\n", " TableInput = {\n", " \"Name\": var_store['name'],\n", " \"TargetTable\": {\n", " \"CatalogId\": resource_share['owningAccountId'],\n", " \"DatabaseName\": f\"variant_{AWS_ACCOUNT_ID}_{var_store['id']}\",\n", " \"Name\": var_store['name'],\n", " }\n", " }\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1e18e22c-9647-46ef-b378-251a2a14e895", "metadata": {}, "source": [ "For this section of the tutorial, the identity that runs this notebook either:\n", "\n", "1. needs to be a Data lake administrator in AWS Lake Formation, or\n", "2. must be granted access to the AWS RAM shared resources by an existing administrator.\n", "\n", "The latter pattern is recommended. Both `DESCRIBE` and `SELECT` on the target table for the variant store is required and can be done via the \"Grant on target\" action on a resource link in the [AWS LakeFormation Console](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html). You can also do this with and admin identity via the SDK with code like:\n", "\n", "```python\n", "lfn = boto3.client('lakeformation')\n", "\n", "lfn.grant_permissions(\n", " Principal={\n", " \"DataLakePrincipalIdentifier\": principal\n", " },\n", " Resource={\n", " \"Table\": {\n", " \"CatalogId\": resource_share['owningAccountId'],\n", " \"DatabaseName\": f\"variant_{AWS_ACCOUNT_ID}_{var_store['id']}\",\n", " \"Name\": var_store['name']\n", " }\n", " },\n", " Permissions=[ 'DESCRIBE' ],\n", " PermissionsWithGrantOption=[ 'DESCRIBE' ]\n", ")\n", "\n", "lfn.grant_permissions(\n", " Principal={\n", " \"DataLakePrincipalIdentifier\": principal\n", " },\n", " Resource={\n", " \"TableWithColumns\": {\n", " \"CatalogId\": resource_share['owningAccountId'],\n", " \"DatabaseName\": f\"variant_{AWS_ACCOUNT_ID}_{var_store['id']}\",\n", " \"Name\": var_store['name'],\n", " \"ColumnWildcard\": {}\n", " }\n", " },\n", " Permissions=[ 'SELECT' ],\n", " PermissionsWithGrantOption=[ 'SELECT' ]\n", ")\n", "```\n", "\n", "To do the former, you can add the identity you are using for this notebook (e.g. if in Amazon Sagemaker, the notebook instance's execution role) as an AWS Lake Formation data lake administrator with the following code:\n", "\n", "```python\n", "lfn = boto3.client('lakeformation')\n", "\n", "caller_arn = boto3.client('sts').get_caller_identity()['Arn']\n", "caller_identity = caller_arn.split(':')[-1].split('/')\n", "\n", "principal = caller_arn # if an IAM user\n", "if caller_identity[0] == 'assumed-role':\n", " principal = 'role/'.join([caller_arn.replace('sts', 'iam').split('assumed-role')[0], caller_identity[1]])\n", "\n", "dls = lfn.get_data_lake_settings()['DataLakeSettings']\n", "if principal not in [admin['DataLakePrincipalIdentifier'] for admin in dls['DataLakeAdmins']]:\n", " dls['DataLakeAdmins'] += [{'DataLakePrincipalIdentifier': principal}]\n", "\n", "lfn.put_data_lake_settings(DataLakeSettings=dls)\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2dbde67e", "metadata": {}, "source": [ "Now that we have resource links created, we can start quering the data using [Amazon Athena](https://console.aws.amazon.com/athena). You don't need to wait for all the import jobs to complete to start doing this. Queries can be made while data imports in the background.\n", "\n", "To query AWS HealthOmics Analytics stores, you need to use Athena engine version 3. The following code checks if you have an existing Athena workgroup that satisfies this criteria. If not it will create one called `omics`." ] }, { "cell_type": "code", "execution_count": null, "id": "c726651b", "metadata": { "tags": [] }, "outputs": [], "source": [ "athena = boto3.client('athena')" ] }, { "cell_type": "code", "execution_count": null, "id": "e64eca70", "metadata": { "tags": [] }, "outputs": [], "source": [ "athena_workgroups = athena.list_work_groups()['WorkGroups']\n", "athena_workgroups" ] }, { "cell_type": "code", "execution_count": null, "id": "9e87b93e", "metadata": { "tags": [] }, "outputs": [], "source": [ "athena_workgroups = athena.list_work_groups()['WorkGroups']\n", "\n", "athena_workgroup = None\n", "for wg in athena_workgroups:\n", " print(wg['EngineVersion']['EffectiveEngineVersion'])\n", " if wg['EngineVersion']['EffectiveEngineVersion'] == 'Athena engine version 3':\n", " print(f\"Workgroup '{wg['Name']}' found using Athena engine version 3\")\n", " athena_workgroup = wg\n", " break\n", "else:\n", " print(\"No workgroups with Athena engine version 3 found. creating one\")\n", " athena_workgroup = athena.create_work_group(\n", " Name='omics',\n", " Configuration={\n", " \"EngineVersion\": {\n", " \"SelectedEngineVersion\": \"Athena engine version 3\"\n", " }\n", " }\n", " )\n", "\n", "athena_workgroup" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6e8e6768", "metadata": {}, "source": [ "Let's start writing queries!\n", "\n", "For fun, let's calculate the TI/TV ratio for these samples. You can navigate to the Athena console or do this from your Jupyter Notebook. This example uses the workgroup `omics` and assumes you have made a resource link to your Variant store in your `default` database." ] }, { "cell_type": "code", "execution_count": null, "id": "65348c6a", "metadata": { "tags": [] }, "outputs": [], "source": [ "titv_query = f\"\"\"\n", "WITH dataset AS (\n", " SELECT \n", " referenceallele,\n", " alternatealleles,\n", " if (cardinality(calls) = 2 and element_at(calls,1)=element_at(calls, 2),2,1) as allelecount, \n", " sampleid\n", " FROM \"default\".\"{var_store['name']}\"\n", " WHERE filters[1] = 'PASS'\n", ") \n", "SELECT \n", " sampleid, \n", " CAST (sum(CASE WHEN\n", " referenceallele in ('C','T') and alternateallele in ('C','T') THEN allelecount\n", " WHEN referenceallele in ('A','G') and alternateallele in ('A','G') THEN allelecount ELSE 0 END) AS double )/ sum(CASE WHEN\n", " referenceallele in ('C','T') and alternateallele in ('A','G') THEN allelecount\n", " WHEN referenceallele in ('A','G') and alternateallele in ('C','T') THEN allelecount ELSE 0 END) as titv_ratio\n", "FROM dataset\n", "CROSS JOIN UNNEST(alternatealleles) as t(alternateallele)\n", "WHERE \n", " length(referenceallele) = 1 and length(alternateallele) = 1\n", "GROUP BY \n", " sampleid\n", "\"\"\"\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "06cf4c1b", "metadata": {}, "source": [ "We can now use [AWS Wrangler](https://aws-sdk-pandas.readthedocs.io/en/stable/) to submit the query and return results as a Pandas Dataframe\n", "\n", "If you don't have AWS Wrangler installed in your environment you can get it by running the following cell. Otherwise, you can skip to the next one." ] }, { "cell_type": "code", "execution_count": null, "id": "3112dcb6-cc75-460d-af66-f9a6b704b825", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install awswrangler" ] }, { "cell_type": "code", "execution_count": null, "id": "05c3b884", "metadata": { "tags": [] }, "outputs": [], "source": [ "import awswrangler as wr" ] }, { "cell_type": "code", "execution_count": null, "id": "be455e5a", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_titv = wr.athena.read_sql_query(\n", " titv_query, \n", " database='default', \n", " workgroup=athena_workgroup['Name'])" ] }, { "cell_type": "code", "execution_count": null, "id": "52a253ae", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_titv" ] }, { "attachments": {}, "cell_type": "markdown", "id": "91d612ae", "metadata": {}, "source": [ "## Annotation Store" ] }, { "attachments": {}, "cell_type": "markdown", "id": "508594fb", "metadata": {}, "source": [ "Now, let's set up an Annotation store.\n", "\n", "AWS HealthOmics Annotation stores support annotations in VCF, GFF, and TSV formats. In this tutorial, we import [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) annotations which can be [downloaded](https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/) from the NCBI as a VCF file. Imports need to come from an S3 location in the same region, so we'll use a copy in a regional bucket for this tutorial." ] }, { "cell_type": "code", "execution_count": null, "id": "3bfcb98a", "metadata": { "tags": [] }, "outputs": [], "source": [ "SOURCE_ANNOTATION_URI = f\"s3://{regional_bucket}/omics-tutorials/data/annotations/clinvar.vcf.gz\"" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2f64df28", "metadata": {}, "source": [ "### Creating, importing data into, and querying an Annotation store\n", "\n", "The process of creating, importing data into, and querying an Annotation store is similar to the process you did above for the Variant store, so we'll be brief on the descriptions of each step." ] }, { "attachments": {}, "cell_type": "markdown", "id": "9a01ec26", "metadata": {}, "source": [ "#### Creating ..." ] }, { "cell_type": "code", "execution_count": null, "id": "647c275f", "metadata": { "tags": [] }, "outputs": [], "source": [ "ann_store_name = f'tutorial_annotations_{ts.lower()}'\n", "ref_name = 'GRCh38' ## Change this reference name to match one you have created if needed" ] }, { "cell_type": "code", "execution_count": null, "id": "8e19b59b", "metadata": { "tags": [] }, "outputs": [], "source": [ "response = omics.create_annotation_store(\n", " name=ann_store_name, \n", " reference={\"referenceArn\": get_reference_arn(ref_name, omics)},\n", " storeFormat='VCF' ## <<--- THIS IS UNIQUE TO ANNOTATION STORES\n", ")\n", "\n", "ann_store = response\n", "response" ] }, { "cell_type": "code", "execution_count": null, "id": "02fdec96", "metadata": { "tags": [] }, "outputs": [], "source": [ "try:\n", " waiter = omics.get_waiter('annotation_store_created')\n", " waiter.wait(name=ann_store['name'])\n", "\n", " print(f\"annotation store {ann_store['name']} ready for use\")\n", "except botocore.exceptions.WaiterError as e:\n", " print(f\"annotation store {ann_store['name']} FAILED:\")\n", " print(e)\n", "\n", "ann_store = omics.get_annotation_store(name=ann_store['name'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e7eded44", "metadata": {}, "source": [ "#### Importing data ..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "9f1028b4", "metadata": {}, "source": [ "Since the ClinVar dataset relatively small, this should take under 2min to complete." ] }, { "cell_type": "code", "execution_count": null, "id": "1e653f84", "metadata": { "tags": [] }, "outputs": [], "source": [ "response = omics.start_annotation_import_job(\n", " destinationName=ann_store['name'],\n", " roleArn=get_role_arn(omics_iam_name),\n", " items=[{\"source\": SOURCE_ANNOTATION_URI}]\n", ")\n", "response" ] }, { "cell_type": "code", "execution_count": null, "id": "c939af73", "metadata": { "tags": [] }, "outputs": [], "source": [ "omics.list_annotation_import_jobs(filter={\"storeName\": ann_store['name']})" ] }, { "attachments": {}, "cell_type": "markdown", "id": "222e2c24", "metadata": {}, "source": [ "#### Querying ..." ] }, { "attachments": {}, "cell_type": "markdown", "id": "d2b28896", "metadata": {}, "source": [ "Creating a resource link to the Annotation store is the same as with the Variant store. We'll do this all in one cell below." ] }, { "cell_type": "code", "execution_count": null, "id": "80d2f99c", "metadata": { "tags": [] }, "outputs": [], "source": [ "response = ram.list_resources(resourceOwner='OTHER-ACCOUNTS', resourceType='glue:Database')\n", "\n", "if not response.get('resources'):\n", " print('no shared resources found. verify that you have successfully created an HealthOmics Analytics store')\n", "else:\n", " annotationstore_resources = [resource for resource in response['resources'] if ann_store['id'] in resource['arn']]\n", " if not annotationstore_resources:\n", " print(f\"no shared resources matching annotation store id {ann_store['id']} found\")\n", " else:\n", " annotationstore_resource = annotationstore_resources[0]\n", "\n", " resource_share = ram.get_resource_shares(\n", " resourceOwner='OTHER-ACCOUNTS', \n", " resourceShareArns=[annotationstore_resource['resourceShareArn']])['resourceShares'][0]\n", " \n", " # this creates a resource link to the table for the annotation store and adds it to the `default` database\n", " glue.create_table(\n", " DatabaseName='default',\n", " TableInput = {\n", " \"Name\": ann_store['name'],\n", " \"TargetTable\": {\n", " \"CatalogId\": resource_share['owningAccountId'],\n", " \"DatabaseName\": f\"annotation_{AWS_ACCOUNT_ID}_{ann_store['id']}\",\n", " \"Name\": ann_store['name'],\n", " }\n", " }\n", " )\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "75e4e37d", "metadata": {}, "source": [ "Querying the data and returning a Pandas Dataframe:" ] }, { "cell_type": "code", "execution_count": null, "id": "680d87fc", "metadata": { "tags": [] }, "outputs": [], "source": [ "# this query identifies single nucleotide variants associated \n", "# with Non-small cell lung cancer (SNOMED_CT 254637007)\n", "# with \"drug_response\" clinical significance\n", "ncsclc_query = f\"\"\"\n", "SELECT \n", " concat('chr', contigname) as contigname, \n", " start, \n", " referenceallele,\n", " alternatealleles,\n", " attributes['GENEINFO'] as gene_info,\n", " attributes['CLNSIG'] as clinical_significance, \n", " regexp_extract_all(attributes['CLNDISDB'], 'SNOMED_CT:\\d+') as snomed_ct, \n", " attributes['CLNDN'] as disease_name\n", "from \"default\".\"{ann_store['name']}\"\n", "where \n", " attributes['CLNDISDB'] like '%SNOMED_CT:254637007%'\n", " and length(referenceallele) = 1\n", " and cardinality(alternatealleles) = 1\n", " and length(alternatealleles[1]) = 1\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "3fc9169b", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_nsclc = wr.athena.read_sql_query(\n", " ncsclc_query, \n", " database='default', \n", " workgroup=athena_workgroup['Name'])" ] }, { "cell_type": "code", "execution_count": null, "id": "f6c0e7f7", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_nsclc" ] }, { "cell_type": "code", "execution_count": null, "id": "fdd83824", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 }