{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "c0097d01",
"metadata": {
"tags": []
},
"source": [
"# Using HealthOmics Storage with genomic references and readsets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c5afbe0b",
"metadata": {
"tags": []
},
"source": [
"The goal of this notebook is to get you acquainted with HealthOmics Storage.
If you complete this notebook you will have:\n",
"1. Created a Reference Store\n",
"2. Imported a Reference Genome\n",
"3. Created a Sequence Store\n",
"4. Imported several FASTQ files\n",
"5. Imported a CRAM file\n",
"6. Downloaded a ReadSet using the HealthOmics Transfer Manager."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cec921d9",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"### Python requirements\n",
"* Python >= 3.8\n",
"* Packages:\n",
" * boto3 >= 1.26.19\n",
" * botocore >= 1.29.19\n",
"\n",
"### AWS requirements\n",
"\n",
"#### AWS CLI\n",
"You will need the AWS CLI installed and configured in your environment.
Supported AWS CLI versions are:\n",
"\n",
"* AWS CLI v2 >= 2.9.3 (Recommended)\n",
"* AWS CLI v1 >= 1.27.19\n",
"\n",
"#### AWS Region\n",
"AWS HealthOmics is currently available in Oregon (us-west-2), N. Virginia (us-east-1), Dublin (eu-west-1), London (eu-west-2), Frankfurt (eu-central-1), and Singapore (ap-southeast-1).\n",
"\n",
"**================**\n",
"\n",
"**AWS HealthOmics only allows importing data within the same region.** \n",
"\n",
"This notebook works in all AWS HealthOmics supported regions, utilizing regional buckets.\n",
"Data in the regional buckets originate from the following, publicaly available buckets:\n",
"\n",
"* s3://1000genomes-dragen-v3.7.6/references/fasta/hg38.fa\n",
"* s3://1000genomes/1000G_2504_high_coverage/additional_698_related/data/ERR3988761/HG00405.final.cram*\n",
"* s3://giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140407_D00360_0016_AH948VADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R{1,2}_*.fastq.gz"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "850bb2c1-9c44-40b8-aa8c-1e7c51e9d8e3",
"metadata": {},
"source": [
"## Policy Set Up\n",
"\n",
"This notebook runs under the role that was created/selected when the notebook instance was created.
\n",
"This role is a very specific role, for running notebooks in Sagemaker."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f7d940d5-9ce5-4303-a142-d53646a06ebd",
"metadata": {},
"source": [
"Prior to executing this notebook, verify that the specific role has the below policies attached:"
]
},
{
"cell_type": "raw",
"id": "ae0c68af-8cbd-44de-9438-af42d896ed4e",
"metadata": {},
"source": [
"{\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"iam:GetUser\",\n",
" \"iam:GetPolicy\",\n",
" \"iam:CreatePolicy\",\n",
" \"iam:GetRole\",\n",
" \"iam:PassRole\",\n",
" \"iam:CreateRole\",\n",
" \"iam:AttachRolePolicy\"\n",
" ],\n",
" \"Resource\": \"*\"\n",
" },\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"omics:*\"\n",
" ],\n",
" \"Resource\": \"*\"\n",
" },\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"s3:PutObject\",\n",
" \"s3:GetObject\",\n",
" \"s3:ListBucket\",\n",
" \"s3:DeleteObject\"\n",
" ],\n",
" \"Resource\": \"arn:aws:s3:::*\"\n",
" }\n",
" ]\n",
"}"
]
},
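{
"attachments": {},
"cell_type": "markdown",
"id": "a1f20e3b",
"metadata": {},
"source": [
"As a quick, optional check, the cell below uses the STS `GetCallerIdentity` API to print the identity this notebook runs as; the role in the printed ARN is the one that needs the above permissions attached."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c91d4e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Optional sanity check: print the IAM identity this notebook runs as.\n",
"# The role in this ARN is the one that needs the above permissions attached.\n",
"import boto3\n",
"\n",
"print(boto3.client('sts').get_caller_identity()['Arn'])"
]
},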
{
"attachments": {},
"cell_type": "markdown",
"id": "f0547669",
"metadata": {
"tags": []
},
"source": [
"## Environment Set Up\n",
"We execute a reset of resources, in case we re-run the tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e97dc65e-5b3c-428f-b606-f605d7b04885",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%reset -f"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6471c00c-3f4f-47b0-89dc-441e5ad1fd99",
"metadata": {},
"source": [
"We set up our environment, by importing required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8fc5e986",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from datetime import datetime\n",
"import itertools as it\n",
"import json\n",
"import gzip\n",
"import os\n",
"from pprint import pprint\n",
"import time\n",
"import urllib\n",
"\n",
"import boto3\n",
"import botocore.exceptions\n",
"import requests\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7c486647",
"metadata": {},
"source": [
"### Create a service IAM role"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5dde7782",
"metadata": {},
"source": [
"For the purposes of this tutorial, we will use the following policy and trust policy to demo the capabilities of AWS HealthOmics, you are free to customize permissions as required for your use case after going though this tutorial.\n",
"\n",
"**NOTE:**\n",
"In this case we've defined rather permissive permissions (i.e. \"\\*\" Resources). In particular, we are allowing read/write access to all S3 buckets available to the account for this tutorial. In a real world setting you will want to scope this down to only the minimally needed actions on necessary resources."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ac5de4d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# set a timestamp\n",
"dt_fmt = '%Y%m%dT%H%M%S'\n",
"ts = datetime.now().strftime(dt_fmt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "715f4201",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"demo_policy = {\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"omics:*\"\n",
" ],\n",
" \"Resource\": \"*\"\n",
" },\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"ram:AcceptResourceShareInvitation\",\n",
" \"ram:GetResourceShareInvitations\"\n",
" ],\n",
" \"Resource\": \"*\"\n",
" },\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"s3:GetBucketLocation\",\n",
" \"s3:PutObject\",\n",
" \"s3:GetObject\",\n",
" \"s3:ListBucket\",\n",
" \"s3:AbortMultipartUpload\",\n",
" \"s3:ListMultipartUploadParts\",\n",
" \"s3:GetObjectAcl\",\n",
" \"s3:PutObjectAcl\"\n",
" ],\n",
" \"Resource\": \"*\"\n",
" }\n",
" ]\n",
"}\n",
"\n",
"demo_trust_policy = {\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Principal\": {\n",
" \"Service\": \"omics.amazonaws.com\"\n",
" },\n",
" \"Action\": \"sts:AssumeRole\"\n",
" }\n",
" ]\n",
"}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c2d1cdb9",
"metadata": {},
"source": [
"In order to proceed we need to create a couple of resources.
The first is the role that you will be passing into AWS HealthOmics. If the role doesn't exist, we will need to create it and attach the policy and trust policy defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36e04bb2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# We will use this as the base name for our role and policy\n",
"omics_iam_name = f'OmicsTutorialRole-{ts}'\n",
"\n",
"# Create the iam client\n",
"iam = boto3.resource('iam')\n",
"\n",
"# Check if the role already exist if not create it\n",
"try:\n",
" role = iam.Role(omics_iam_name)\n",
" role.load()\n",
" \n",
"except botocore.exceptions.ClientError as ex:\n",
" if ex.response[\"Error\"][\"Code\"] == \"NoSuchEntity\":\n",
" #Create the role with the corresponding trust policy\n",
" role = iam.create_role(\n",
" RoleName=omics_iam_name, \n",
" AssumeRolePolicyDocument=json.dumps(demo_trust_policy))\n",
" \n",
" #Create policy\n",
" policy = iam.create_policy(\n",
" PolicyName='{}-policy'.format(omics_iam_name), \n",
" Description=\"Policy for AWS HealthOmicsics demo\",\n",
" PolicyDocument=json.dumps(demo_policy))\n",
" \n",
" #Attach the policy to the role\n",
" policy.attach_role(RoleName=omics_iam_name)\n",
" else:\n",
" print('Somthing went wrong, please retry and check your account settings and permissions')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f2b91bc5-7127-4c0e-8b8b-7e08cac4a088",
"metadata": {},
"source": [
"Now that we know the role exists, lets create a helper function to help us retrieve the role arn which we will need to pass into the service API calls. \n",
"
The role arn will grant AWS HealthOmics the proper permissions to access the resources it needs in your AWS account."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "809efb81",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_role_arn(role_name):\n",
" try:\n",
" iam = boto3.resource('iam')\n",
" role = iam.Role(role_name)\n",
" role.load() # calls GetRole to load attributes\n",
" except botocore.exceptions.ClientError:\n",
" print(\"Couldn't get role named %s.\"%role_name)\n",
" raise\n",
" else:\n",
" return role.arn"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1d9bcb5d-0c45-48fb-8fd8-c0e3dbdfe53c",
"metadata": {},
"source": [
"Let's get the region in which we are running our notebook. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c877099-c8ea-47a4-80ce-30cd1ea3a005",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"region = boto3.session.Session().region_name"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "583e6092-f54a-43bb-9efd-7681b422340f",
"metadata": {},
"source": [
"We are using region information to create a regional dictionary of datasources we'll be using in this tutorial"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ddaabc9",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"regional_bucket = f'aws-genomics-static-{region}'\n",
"\n",
"SOURCE_S3_URI_TPL = {\n",
" \"reads\": {\n",
" \"fastq\": \"s3://{regional_bucket}/omics-tutorials/data/fastq/NA12878/Sample_U0a/U0a_CGATGT_L001_R{read}_{part:03}.fastq.gz\",\n",
" }\n",
"}\n",
"\n",
"SOURCE_S3_URIS = {\n",
" \"reference\": f's3://{regional_bucket}/omics-tutorials/data/references/hg38/Homo_sapiens_assembly38.fasta',\n",
" \"reads\": {\n",
" \"fastq\": [\n",
" SOURCE_S3_URI_TPL['reads']['fastq'].format(regional_bucket=regional_bucket,read=read, part=part) \n",
" for read, part, in list(it.product([1,2], [1,2,3,4]))],\n",
" \"cram\": [\n",
" f's3://{regional_bucket}/omics-tutorials/data/cram/ERR3988761/HG00405.final.cram',\n",
" f's3://{regional_bucket}/omics-tutorials/data/cram/ERR3988762/HG00408.final.cram',\n",
" f's3://{regional_bucket}/omics-tutorials/data/cram/ERR3988763/HG00418.final.cram',\n",
" f's3://{regional_bucket}/omics-tutorials/data/cram/ERR3988764/HG00420.final.cram'\n",
" ]\n",
" }\n",
"}"
]
},
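{
"attachments": {},
"cell_type": "markdown",
"id": "c3d02e5f",
"metadata": {},
"source": [
"Since imports must read from a bucket in the same region, it can help to verify the source data is reachable before starting any jobs. The cell below is a minimal, optional check using S3 `HeadObject` on the reference object; it raises a `ClientError` if the object is missing or inaccessible."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e13f60",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Optional: verify the reference object is reachable in this region's bucket.\n",
"# head_object raises a ClientError (404/403) if it is missing or inaccessible.\n",
"s3 = boto3.client('s3')\n",
"uri = urllib.parse.urlparse(SOURCE_S3_URIS['reference'])\n",
"s3.head_object(Bucket=uri.netloc, Key=uri.path[1:])\n",
"print(f\"found: {SOURCE_S3_URIS['reference']}\")"
]
},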
{
"attachments": {},
"cell_type": "markdown",
"id": "7c0a0f09",
"metadata": {
"tags": []
},
"source": [
"### Creating the AWS HealthOmics client"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "989dbe72",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"omics = boto3.client('omics', region_name=region)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "880e76c4",
"metadata": {},
"source": [
"## Reference stores"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c1b8812f",
"metadata": {},
"source": [
"We are first going to import a genome reference. This reference will be taken from 1000 genomes, which will also be the source of the reads. This import has two steps. First, if this is the first time you've created a reference, you first must create a reference store.\n",
"\n",
"Reference Stores (and Seqeuence Stores) also support CMKs; however, we'll use AWS-owned encryption keys throughout.\n",
"\n",
"To start lets create a helper method to retrieve the reference store id and have it return empty if it doesn't exist. There should only be one reference store per region per account.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bdd07f62",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_ref_store_id(client=None):\n",
" if not client:\n",
" client = boto3.client('omics')\n",
" \n",
" resp = client.list_reference_stores(maxResults=10)\n",
" list_of_stores = resp.get('referenceStores')\n",
" store_id = None\n",
" \n",
" if list_of_stores != None:\n",
" # As mentioned above there can only be one store per region\n",
" # if there is a store present is the first one\n",
" store_id = list_of_stores[0].get('id')\n",
" \n",
" return store_id"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd8b0bf5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(f\"Checking for a reference store in region: {omics.meta.region_name}\")\n",
"if get_ref_store_id(omics) == None:\n",
" response = omics.create_reference_store(name='myReferenceStore')\n",
" print(response)\n",
"else:\n",
" print(\"Congratulations, you have an existing reference store!\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "acee901f",
"metadata": {},
"source": [
"### Importing references"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4706b33f",
"metadata": {},
"source": [
"We'll now import a reference using the `start_reference_import_job` API call.\n",
"\n",
"This will use the reference store we created (or found) and the IAM role we created above. All references in a Reference store must have a unique name. So, we're also going to apply a timestamp to the reference name to ensure that it is unique.\n",
"\n",
"Also note that the IAM role used should have the ability to read the object in the S3 bucket that your reference is sourced from."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1aa5486a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ref_name = f'tutorial-1kg-grch38-{ts}'\n",
"\n",
"ref_import_job = omics.start_reference_import_job(\n",
" referenceStoreId=get_ref_store_id(omics), \n",
" roleArn=get_role_arn(omics_iam_name),\n",
" sources=[{\n",
" 'sourceFile': SOURCE_S3_URIS[\"reference\"],\n",
" 'name': ref_name,\n",
" 'tags': {'SourceLocation': '1kg'}\n",
" }])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "42a642e8",
"metadata": {},
"source": [
"You can check the status here either using `omics.get_reference_import_job` or by navigating to the [Console](https://console.aws.amazon.com/omics/home#/reference/referenceStoreDetail)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7dcc38fb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ref_import_job = omics.get_reference_import_job(\n",
" referenceStoreId=get_ref_store_id(omics), \n",
" id=ref_import_job['id'])\n",
"ref_import_job"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "16123de3",
"metadata": {},
"source": [
"The import can take up to 5 minutes to complete. We can wait for it to complete using a `waiter`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8233b8e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(f\"waiting for job {ref_import_job['id']} to complete\")\n",
"try:\n",
" waiter = omics.get_waiter('reference_import_job_completed')\n",
" waiter.wait(referenceStoreId=ref_import_job['referenceStoreId'], id=ref_import_job['id'])\n",
"\n",
" print(f\"job {ref_import_job['id']} complete\")\n",
"except botocore.exceptions.WaiterError as e:\n",
" print(f\"job {ref_import_job['id']} FAILED:\")\n",
" print(e)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4c68c8bb",
"metadata": {},
"source": [
"Once the import job is complete, we should be able to see our reference in the list of available references in the Reference store."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2e9401d4",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"resp = omics.list_references(referenceStoreId=get_ref_store_id(omics), filter={\"name\": ref_name})\n",
"\n",
"ref_list = resp\n",
"pprint(resp)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a6154064",
"metadata": {},
"source": [
"And now we can get more details for our imported reference with the following:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7042d44",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# storing this specific reference as we'll be using it later\n",
"ref = omics.get_reference_metadata(\n",
" referenceStoreId=get_ref_store_id(omics), \n",
" id=ref_list['references'][0]['id'])\n",
"ref"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "976911a7",
"metadata": {},
"source": [
"Now that you've imported your reference, let's import our first set of genomics reads."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c813e9ad",
"metadata": {},
"source": [
"## Sequence Stores"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "21712624",
"metadata": {},
"source": [
"We will now create our first sequence store. A sequence store is similar to an S3 bucket; it holds a set of objects, known as read sets. \n",
"\n",
"The easiest way to think about a read set is it an abstraction of a set of genomics file types (FASTQ, BAM, CRAM) that store reads from a sequencer. When you get a read set back from a sequence store and process it, the data you get back is semantically identical to what you put in.\n",
"\n",
"Note that you can use customer managed CMKs with HealthOmics storage; these are managed at the store level."
]
},
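{
"attachments": {},
"cell_type": "markdown",
"id": "e5f24a71",
"metadata": {},
"source": [
"A store's encryption configuration is set when the store is created, via the `sseConfig` parameter. The raw cell below is a sketch (not executed here, and the key ARN is a placeholder) of creating a sequence store with a customer managed KMS key:"
]
},
{
"cell_type": "raw",
"id": "f6a35b82",
"metadata": {},
"source": [
"# Sketch only -- not executed. The key ARN below is a placeholder;\n",
"# substitute a KMS key that you own.\n",
"omics.create_sequence_store(\n",
"    name='my-cmk-seq-store',\n",
"    sseConfig={\n",
"        'type': 'KMS',\n",
"        'keyArn': 'arn:aws:kms:us-west-2:111122223333:key/EXAMPLE-KEY-ID'\n",
"    }\n",
")"
]
},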
{
"attachments": {},
"cell_type": "markdown",
"id": "d2fa606f",
"metadata": {},
"source": [
"### Creating sequence stores"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e1526367",
"metadata": {},
"source": [
"Let's create a sequence store and retrieve its details."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93d538bb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"sequence_store_name = f'tutorial-seq-store-{ts}'\n",
"response = omics.create_sequence_store(name=sequence_store_name)\n",
"\n",
"response"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd5d0fcb",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"seqstore = response\n",
"omics.get_sequence_store(id=seqstore['id'])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9c45971c",
"metadata": {},
"source": [
"### Import sequence data\n",
"\n",
"Now, let's import some data in the form of FASTQs and CRAMs. AWS HealthOmics allows you to provide additional metadata for your reads using import manifests. Let's look at how to generate manifests for CRAM and FASTQ data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "8d7c49f7",
"metadata": {},
"source": [
"#### CRAM manifests\n",
"CRAM manifests are relatively simple since all the read data is typically in one source file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5d8944f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"role_arn = get_role_arn(omics_iam_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75090db7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# create a list of cram sources\n",
"cram_sources = []\n",
"for source_uri in SOURCE_S3_URIS['reads']['cram']:\n",
" source = urllib.parse.urlparse(source_uri)\n",
" subject_id = source.path[1:].split('/')[-2] # accession, e.g. ERR1234567\n",
" sample_id = source.path[1:].split('/')[-1].split('.')[0] # basename of object, e.g. HG12345\n",
" \n",
" cram_sources += [{\n",
" 'subjectId': subject_id,\n",
" 'sampleId': sample_id,\n",
" 'sourceFileType': 'CRAM',\n",
" 'sourceFiles': { 'source1': source_uri },\n",
" 'referenceArn': ref['arn'], # we stashed this earlier\n",
" 'generatedFrom': source_uri,\n",
" 'description': f'{subject_id}/{sample_id} from {source.netloc}',\n",
" 'tags': {'SourceLocation': source.netloc}\n",
" }]\n",
"cram_sources"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a52c05f9",
"metadata": {},
"source": [
"#### FASTQ manifests\n",
"FASTQs can be a bit more complicated since read data can come from one or two source files. Additionally, sources for multiple paired-end reads may exist in the same directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acc2e947",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# the set of source URIs is evenly matched with two reads {R1, R2} and for parts each\n",
"# so we can just sort and divide the list into two to get matched lists of paired reads\n",
"source_uris = zip(\n",
" sorted(SOURCE_S3_URIS['reads']['fastq'])[:len(SOURCE_S3_URIS['reads']['fastq'])//2],\n",
" sorted(SOURCE_S3_URIS['reads']['fastq'])[len(SOURCE_S3_URIS['reads']['fastq'])//2:]\n",
")\n",
"\n",
"fastq_sources = []\n",
"for reads in source_uris:\n",
" source = urllib.parse.urlparse(reads[0])\n",
" subject_id = source.path[1:].split('/')[-3] # NA12878\n",
" sample_id = source.path[1:].split('/')[-2] # Sample_U0a\n",
" \n",
" fastq_sources += [{\n",
" 'subjectId': subject_id,\n",
" 'sampleId': sample_id,\n",
" 'sourceFileType': 'FASTQ',\n",
" 'sourceFiles': { \n",
" 'source1': reads[0],\n",
" 'source2': reads[1]\n",
" },\n",
" 'referenceArn': ref['arn'], # we stashed this earlier\n",
" 'generatedFrom': f'{subject_id}/{sample_id} fastq',\n",
" 'description': f'{subject_id}/{sample_id} from {source.netloc}',\n",
" 'tags': {'SourceLocation': source.netloc}\n",
" }]\n",
"fastq_sources"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "aa94356c",
"metadata": {},
"source": [
"## Importing ReadSets\n",
"\n",
"Now, we'll make a combined set of import sources for our store. You can mix and match read sets of different types with imports. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5407c9f1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import_sources = fastq_sources + cram_sources\n",
"import_sources"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ad23fe0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"readset_import_job = omics.start_read_set_import_job(\n",
" roleArn=get_role_arn(omics_iam_name),\n",
" sequenceStoreId=seqstore['id'], \n",
" sources=import_sources)\n",
"readset_import_job"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0435b7f7",
"metadata": {},
"source": [
"You can monitor this progress with `omics.get_read_set_import_job`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f285ccfe",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"readset_import_job = omics.get_read_set_import_job(\n",
" sequenceStoreId=seqstore['id'], \n",
" id=readset_import_job['id'])\n",
"\n",
"readset_import_job"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d86025c4",
"metadata": {},
"source": [
"Depending on the size of the source data, ReadSet import jobs can take up to 30min for each source to complete.\n",
"\n",
"The import of an individual Read Set into a sequence store will take longer than a traditional S3 copy. This is because AWS HealthOmics will validate files, calculate metadata, and organize your ReadSets to make it easier for you to consume and share.\n",
"\n",
"AWS HealthOmics supports high parallelism across imports. For example, if the import of a FASTQ takes X minutes, the import of many FASTQs of the same size should also take approximately X minutes. Each import job supports a maximum of 1000 read sets. So we could import all of 1000 genomes with only 4 API calls. For higher parallelism, you can split larger datastes into multiple ReadSet import jobs."
]
},
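{
"attachments": {},
"cell_type": "markdown",
"id": "a7b46c93",
"metadata": {},
"source": [
"Rather than polling `get_read_set_import_job` yourself, you can block until the job finishes with the `read_set_import_job_completed` waiter, mirroring the reference import waiter used earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8c57da4",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(f\"waiting for job {readset_import_job['id']} to complete\")\n",
"try:\n",
"    waiter = omics.get_waiter('read_set_import_job_completed')\n",
"    waiter.wait(sequenceStoreId=seqstore['id'], id=readset_import_job['id'])\n",
"\n",
"    print(f\"job {readset_import_job['id']} complete\")\n",
"except botocore.exceptions.WaiterError as e:\n",
"    print(f\"job {readset_import_job['id']} FAILED:\")\n",
"    print(e)"
]
},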
{
"attachments": {},
"cell_type": "markdown",
"id": "cc7078c4",
"metadata": {},
"source": [
"## Copying References and ReadSets to a local file system"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "cc8ca528",
"metadata": {},
"source": [
"Many existing tools for genomics data processing expect to read files on the local filesystem.\n",
"\n",
"AWS HealthOmics supports multipart downloads much like S3, but specifically designed for genomic datatypes. After importing your references and readsets into AWS HealthOmics Storage, you can easily retrieve them and their supporting files (e.g. index files like `*.fai` for references, or `*.crai` for reads)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e1525cd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ref_list = omics.list_references(referenceStoreId=get_ref_store_id(omics))\n",
"readset_list = omics.list_read_sets(sequenceStoreId=seqstore['id'])\n",
"\n",
"pprint(ref_list)\n",
"pprint(readset_list)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f0765309",
"metadata": {},
"source": [
"AWS HealthOmics provides a number of basic methods to retrieve data. For efficiency, references and read sets are stored in multiple parts. We can see this if we look at one of the references we imported earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2aa1e204",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ref_metadata = omics.get_reference_metadata(referenceStoreId=ref['referenceStoreId'], id=ref['id'])\n",
"ref_metadata"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "47eadc8c",
"metadata": {},
"source": [
"Notice the `files` property:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91c6a4f2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"ref_metadata['files']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3babc1dc",
"metadata": {},
"source": [
"Here we can see the `source` file has several parts. There is also a smaller `index` file which will typically have only one part.\n",
"\n",
"The AWS HealthOmics API `get_reference` lets you download one part at a time. Let's use it to download pieces of the reference and its associated index."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6eab8ebf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# download a part of the reference\n",
"response = omics.get_reference(referenceStoreId=ref['referenceStoreId'], id=ref['id'], partNumber=1, file='SOURCE')\n",
"with open('reference.part.fa', 'wb') as f:\n",
" f.write(response['payload'].read())\n",
"\n",
"# download a part of the reference index\n",
"response = omics.get_reference(referenceStoreId=ref['referenceStoreId'], id=ref['id'], partNumber=1, file='INDEX')\n",
"with open('reference.part.fai', 'wb') as f:\n",
" f.write(response['payload'].read())"
]
},
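{
"attachments": {},
"cell_type": "markdown",
"id": "c9d68eb5",
"metadata": {},
"source": [
"To reconstruct a complete file, fetch every part in order and concatenate them; each file's part count is available in the `files` metadata retrieved above. The cell below is a minimal sketch that reassembles the full reference FASTA (note that hg38 is about 3 GB, so this takes a while)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0e79fc6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Reassemble the complete reference FASTA by fetching every part in order.\n",
"# The part count comes from the 'files' metadata retrieved above.\n",
"total_parts = ref_metadata['files']['source']['totalParts']\n",
"with open('reference.fa', 'wb') as f:\n",
"    for part_number in range(1, total_parts + 1):\n",
"        resp = omics.get_reference(\n",
"            referenceStoreId=ref['referenceStoreId'], id=ref['id'],\n",
"            partNumber=part_number, file='SOURCE')\n",
"        f.write(resp['payload'].read())\n",
"print(f\"wrote {total_parts} parts to reference.fa\")"
]
},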
{
"attachments": {},
"cell_type": "markdown",
"id": "fcbbfc41",
"metadata": {},
"source": [
"There are similar APIs for downloading ReadSets. One notable difference is that ReadSets can have more than one source - e.g. `SOURCE1` and `SOURCE2` - and no index if they originate from FASTQ, or only a single source and an accompanying index if they come from an aligned format like BAM or CRAM.\n",
"\n",
"> **Note** the following cells will only work if you have an available readset. The first cell below will check for one. Subsequent cells thereafter assume a readset was found."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "686652be",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# check if there are readsets available\n",
"readset_import_jobs = omics.list_read_set_import_jobs(sequenceStoreId=seqstore['id'], filter={'status': 'IN_PROGRESS'})\n",
"readset_list = omics.list_read_sets(sequenceStoreId=seqstore['id'])\n",
"if not readset_list.get('readSets'):\n",
" print(f\"no active readsets available in {seqstore['id']}.\")\n",
" if readset_import_jobs.get('importJobs'):\n",
" print(f\"{len(readset_import_jobs.get('importJobs'))} in progress readset import jobs found. Readsets should be available in a few minutes.\")\n",
"else:\n",
" readset = readset_list['readSets'][0]\n",
" pprint(readset)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae0848c8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"readset_metadata = omics.get_read_set_metadata(sequenceStoreId=readset['sequenceStoreId'], id=readset['id'])\n",
"readset_metadata"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97081da1",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# download a part of the readset\n",
"response = omics.get_read_set(sequenceStoreId=readset['sequenceStoreId'], id=readset['id'], partNumber=1, file='SOURCE1')\n",
"with open(f\"reads1.part.{readset_metadata['fileType'].lower()}\", 'wb') as f:\n",
" f.write(response['payload'].read())"
]
}
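,
{
"attachments": {},
"cell_type": "markdown",
"id": "e1f80ad7",
"metadata": {},
"source": [
"Finally, for whole-file downloads you can use the HealthOmics Transfer Manager from the separately installed `amazon-omics-tools` package (`pip install amazon-omics-tools`). The sketch below assumes that package's documented interface; it downloads all files of a read set (e.g. `SOURCE1`/`SOURCE2` for paired FASTQ) into a local directory, handling the per-part fetching and concatenation for you."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2a91be8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Sketch: whole-file download via the HealthOmics Transfer Manager.\n",
"# Assumes the amazon-omics-tools package is installed and uses its\n",
"# documented interface (pip install amazon-omics-tools).\n",
"from omics.transfer.manager import TransferManager\n",
"\n",
"manager = TransferManager(omics)\n",
"\n",
"# Download all files of the read set into the 'readset-data' directory\n",
"manager.download_read_set(readset['sequenceStoreId'], readset['id'], 'readset-data')"
]
}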
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}