{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "SPDX-License-Identifier: Apache-2.0" ] }, { "cell_type": "markdown", "metadata": { "id": "pc5-mbsX9PZC" }, "source": [ "# Using the AWS Batch Architecture for AlphaFold\n", "\n", "This notebook allows you to predict protein structures using AlphaFold on AWS Batch. \n", "\n", "**Differences to AlphaFold Notebooks**\n", "\n", "In comparison to AlphaFold v2.1.0, this notebook uses AWS Batch to submit protein analysis jobs to a scalable compute cluster. The accuracy should be the same as if you ran it locally. However, by using HPC services like AWS Batch and Amazon FSx for Lustre, we can support parallel job execution and optimize the resources for each run.\n", "\n", "**Citing this work**\n", "\n", "Any publication that discloses findings arising from using this notebook should [cite](https://github.com/deepmind/alphafold/#citing-this-work) the [AlphaFold paper](https://doi.org/10.1038/s41586-021-03819-2).\n", "\n", "**Licenses**\n", "\n", "Please refer to the `LICENSE` and `THIRD-PARTY-NOTICES` file for more information about third-party software/licensing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "0. [Install Dependencies](#0.-Install-Dependencies)\n", "1. [Run a monomer analysis job](#1.-Run-a-monomer-analysis-job)\n", "2. [Run a multimer analysis job](#2.-Run-a-multimer-analysis-job) \n", "3. [Analyze multiple proteins](#3.-Analyze-multiple-proteins)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Install Dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -r notebook-requirements.txt -q -q" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import required Python packages\n", "from Bio import SeqIO\n", "from Bio.SeqRecord import SeqRecord\n", "import boto3\n", "from datetime import datetime\n", "from IPython import display\n", "from nbhelpers import nbhelpers\n", "import os\n", "import pandas as pd\n", "import sagemaker\n", "from time import sleep\n", "\n", "pd.set_option(\"max_colwidth\", None)\n", "\n", "# Get client informatiion\n", "boto_session = boto3.session.Session()\n", "sm_session = sagemaker.session.Session(boto_session)\n", "region = boto_session.region_name\n", "s3_client = boto_session.client(\"s3\", region_name=region)\n", "batch_client = boto_session.client(\"batch\")\n", "\n", "S3_BUCKET = sm_session.default_bucket()\n", "print(f\" S3 bucket name is {S3_BUCKET}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have multiple AWS-Alphafold stacks deployed in your account, which one should we use? If not specified, the `submit_batch_alphafold_job` function defaults to the first item in this list. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nbhelpers.list_alphafold_stacks()" ] }, { "cell_type": "markdown", "metadata": { "id": "W4JpOs6oA-QS" }, "source": [ "## 1. 
, { "cell_type": "markdown", "metadata": { "id": "W4JpOs6oA-QS" }, "source": [ "## 1. Run a monomer analysis job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Provide sequences for analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "rowN0bVYLe9n" }, "outputs": [], "source": [ "## Enter the amino acid sequence to fold\n", "id_1 = \"T1084\"\n", "sequence_1 = \"MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH\"\n", "\n", "DB_PRESET = \"reduced_dbs\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Validate the input and determine which models to use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_sequences = (sequence_1,)\n", "input_ids = (id_1,)\n", "input_sequences, model_preset = nbhelpers.validate_input(input_sequences)\n", "sequence_length = len(input_sequences[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload the input file to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = nbhelpers.create_job_name()\n", "object_key = nbhelpers.upload_fasta_to_s3(\n", "    input_sequences, input_ids, S3_BUCKET, job_name, region=region\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Submit two Batch jobs: the first prepares the input features (data prep), and the second (dependent on the first) runs the structure prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define resources for data prep and prediction steps\n", "if DB_PRESET == \"reduced_dbs\":\n", "    prep_cpu = 4\n", "    prep_mem = 16\n", "    prep_gpu = 0\n", "else:\n", "    prep_cpu = 16\n", "    prep_mem = 32\n", "    prep_gpu = 0\n", "\n", "if sequence_length < 700:\n", "    predict_cpu = 4\n", "    predict_mem = 16\n", "    predict_gpu = 1\n", "else:\n", "    predict_cpu = 16\n", "    predict_mem = 64\n", "    predict_gpu = 1\n", "\n", "step_1_response = nbhelpers.submit_batch_alphafold_job(\n", "    job_name=str(job_name),\n", "    fasta_paths=object_key,\n", "    output_dir=job_name,\n", "    db_preset=DB_PRESET,\n", "    model_preset=model_preset,\n", "    s3_bucket=S3_BUCKET,\n", "    cpu=prep_cpu,\n", "    memory=prep_mem,\n", "    gpu=prep_gpu,\n", "    run_features_only=True,\n", "    use_spot_instances=True,\n", ")\n", "\n", "print(f\"Job ID {step_1_response['jobId']} submitted\")\n", "\n", "step_2_response = nbhelpers.submit_batch_alphafold_job(\n", "    job_name=str(job_name),\n", "    fasta_paths=object_key,\n", "    output_dir=job_name,\n", "    db_preset=DB_PRESET,\n", "    model_preset=model_preset,\n", "    s3_bucket=S3_BUCKET,\n", "    cpu=predict_cpu,\n", "    memory=predict_mem,\n", "    gpu=predict_gpu,\n", "    features_paths=os.path.join(job_name, job_name, \"features.pkl\"),\n", "    depends_on=step_1_response[\"jobId\"],\n", ")\n", "\n", "print(f\"Job ID {step_2_response['jobId']} submitted\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of the jobs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_1 = nbhelpers.get_batch_job_info(step_1_response[\"jobId\"])\n", "status_2 = nbhelpers.get_batch_job_info(step_2_response[\"jobId\"])\n", "\n", "print(f\"Data prep job {status_1['jobName']} is in {status_1['status']} status\")\n", "print(f\"Predict job {status_2['jobName']} is in {status_2['status']} status\")" ] }
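, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can block until both jobs reach a terminal state. The cell below is a convenience sketch; it assumes `get_batch_job_info` keeps returning a dict with `jobName` and `status` keys, as in the cell above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: poll both jobs until they finish. A convenience sketch that\n", "# assumes get_batch_job_info returns a dict with \"jobName\" and \"status\"\n", "# keys, as used in the cell above.\n", "terminal_states = {\"SUCCEEDED\", \"FAILED\"}\n", "for job_id in (step_1_response[\"jobId\"], step_2_response[\"jobId\"]):\n", "    while True:\n", "        info = nbhelpers.get_batch_job_info(job_id)\n", "        print(f\"{info['jobName']}: {info['status']}\")\n", "        if info[\"status\"] in terminal_states:\n", "            break\n", "        sleep(60)" ] }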
null, "metadata": {}, "outputs": [], "source": [ "nbhelpers.get_batch_logs(status_1[\"logStreamName\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download results from S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = status_1[\"jobName\"]\n", "# job_name = \"1234567890\" # You can also provide the name of a previous job here\n", "nbhelpers.download_results(bucket=S3_BUCKET, job_name=job_name, local=\"data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View MSA information" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nbhelpers.plot_msa_output_folder(\n", " path=f\"data/{job_name}/{job_name}/msas\", id=input_ids[0]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View predicted structure" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pdb_path = os.path.join(f\"data/{job_name}/{job_name}/ranked_0.pdb\")\n", "print(\"Default coloring is by plDDT\")\n", "nbhelpers.display_structure(pdb_path)\n", "\n", "print(\"Can also use rainbow coloring\")\n", "nbhelpers.display_structure(pdb_path, color=\"rainbow\")" ] }, { "cell_type": "markdown", "metadata": { "id": "W4JpOs6oA-QS" }, "source": [ "## 2. Run a multimer analysis job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Provide sequences for analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "rowN0bVYLe9n" }, "outputs": [], "source": [ "## Enter the amino acid sequence to fold\n", "\n", "id_1 = \"5ZNG_1\"\n", "sequence_1 = \"MDLSNMESVVESALTGQRTKIVVKVHMPCGKSRAKAMALAASVNGVDSVEITGEDKDRLVVVGRGIDPVRLVALLREKCGLAELLMVELVEKEKTQLAGGKKGAYKKHPTYNLSPFDYVEYPPSAPIMQDINPCSTM\"\n", "\n", "id_2 = \"5ZNG_2\"\n", "sequence_2 = (\n", " \"MAWKDCIIQRYKDGDVNNIYTANRNEEITIEEYKVFVNEACHPYPVILPDRSVLSGDFTSAYADDDESCYRHHHHHH\"\n", ")\n", "\n", "# Add additional sequences, if necessary\n", "\n", "input_ids = (\n", " id_1,\n", " id_2,\n", ")\n", "input_sequences = (\n", " sequence_1,\n", " sequence_2,\n", ")\n", "\n", "DB_PRESET = \"reduced_dbs\"\n", "\n", "input_sequences, model_preset = nbhelpers.validate_input(input_sequences)\n", "sequence_length = len(max(input_sequences))\n", "\n", "# Upload input file to S3\n", "job_name = nbhelpers.create_job_name()\n", "object_key = nbhelpers.upload_fasta_to_s3(\n", " input_sequences, input_ids, S3_BUCKET, job_name, region=region\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Submit batch jobs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define resources for data prep and prediction steps\n", "if DB_PRESET == \"reduced_dbs\":\n", " prep_cpu = 4\n", " prep_mem = 16\n", " prep_gpu = 0\n", "\n", "else:\n", " prep_cpu = 16\n", " prep_mem = 32\n", " prep_gpu = 0\n", "\n", "if sequence_length < 700:\n", " predict_cpu = 4\n", " predict_mem = 16\n", " predict_gpu = 1\n", "else:\n", " predict_cpu = 16\n", " predict_mem = 64\n", " predict_gpu = 1\n", "\n", "step_1_response = nbhelpers.submit_batch_alphafold_job(\n", " job_name=str(job_name),\n", " fasta_paths=object_key,\n", " output_dir=job_name,\n", " db_preset=DB_PRESET,\n", " model_preset=model_preset,\n", " s3_bucket=S3_BUCKET,\n", " cpu=prep_cpu,\n", " memory=prep_mem,\n", " gpu=prep_gpu,\n", " use_spot_instances=True,\n", " run_features_only=True,\n", ")\n", "\n", "print(f\"Job ID {step_1_response['jobId']} submitted\")\n", "\n", 
"step_2_response = nbhelpers.submit_batch_alphafold_job(\n", " job_name=str(job_name),\n", " fasta_paths=object_key,\n", " output_dir=job_name,\n", " db_preset=DB_PRESET,\n", " model_preset=model_preset,\n", " s3_bucket=S3_BUCKET,\n", " cpu=predict_cpu,\n", " memory=predict_mem,\n", " gpu=predict_gpu,\n", " features_paths=os.path.join(job_name, job_name, \"features.pkl\"),\n", " depends_on=step_1_response[\"jobId\"],\n", ")\n", "\n", "print(f\"Job ID {step_2_response['jobId']} submitted\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check status of jobs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status_1 = nbhelpers.get_batch_job_info(step_1_response[\"jobId\"])\n", "status_2 = nbhelpers.get_batch_job_info(step_2_response[\"jobId\"])\n", "\n", "print(f\"Data prep job {status_1['jobName']} is in {status_1['status']} status\")\n", "print(f\"Predict job {status_2['jobName']} is in {status_2['status']} status\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download results from S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = status_1[\"jobName\"]\n", "# job_name = \"1234567890\" # You can also provide the name of a previous job here\n", "nbhelpers.download_results(bucket=S3_BUCKET, job_name=job_name, local=\"data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View MSA information" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nbhelpers.plot_msa_output_folder(\n", " path=f\"data/{job_name}/{job_name}/msas\", id=input_ids[0]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View predicted structure" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pdb_path = os.path.join(f\"data/{job_name}/{job_name}/ranked_0.pdb\")\n", "print(\"Can also color by chain\")\n", "nbhelpers.display_structure(pdb_path, chains=2, color=\"chain\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Analyze multiple proteins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download and process the CASP14 sequences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget \"https://predictioncenter.org/download_area/CASP14/sequences/casp14.seq.txt\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!sed '137,138d' \"casp14.seq.txt\" > \"casp14_dedup.fa\"  # Remove duplicate entry for T1085" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "casp14_iterator = SeqIO.parse(\"casp14_dedup.fa\", \"fasta\")\n", "casp14_df = pd.DataFrame(\n", "    (\n", "        (record.id, record.description, len(record), record.seq)\n", "        for record in casp14_iterator\n", "    ),\n", "    columns=[\"id\", \"description\", \"length\", \"seq\"],\n", ").sort_values(by=\"length\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display information about the CASP14 proteins" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pd.option_context(\"display.max_rows\", None):\n", "    display.display(casp14_df.loc[:, [\"id\", \"description\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the distribution of protein lengths" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, ax = plt.subplots()\n", "plt.hist(casp14_df.length, bins=50)\n", "plt.ylabel(\"Sample count\")\n", "plt.xlabel(\"Residue count\")\n", "plt.title(\"CASP-14 Protein Length Distribution\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Submit analysis jobs for a subset of CASP14 proteins, smallest to largest. Each job name is collected in `job_name_list` so the results can be downloaded together afterwards (see the sketch after this cell)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "protein_count = 5  # Change this to analyze a larger number of CASP14 targets\n", "job_name_list = []\n", "\n", "for row in casp14_df[:protein_count].itertuples(index=False):\n", "    record = SeqRecord(row.seq, id=row.id, description=row.description)\n", "    print(f\"Protein sequence for analysis is \\n{record.description}\")\n", "    sequence_length = len(record.seq)\n", "    print(f\"Sequence length is {sequence_length}\")\n", "    print(record.seq)\n", "\n", "    input_ids = (record.id,)\n", "    input_sequences = (str(record.seq),)\n", "    DB_PRESET = \"reduced_dbs\"\n", "\n", "    input_sequences, model_preset = nbhelpers.validate_input(input_sequences)\n", "\n", "    # Upload input file to S3\n", "    job_name = nbhelpers.create_job_name()\n", "    object_key = nbhelpers.upload_fasta_to_s3(\n", "        input_sequences, input_ids, S3_BUCKET, job_name, region=region\n", "    )\n", "\n", "    # Define resources for data prep and prediction steps\n", "    if DB_PRESET == \"reduced_dbs\":\n", "        prep_cpu = 4\n", "        prep_mem = 16\n", "        prep_gpu = 0\n", "    else:\n", "        prep_cpu = 16\n", "        prep_mem = 32\n", "        prep_gpu = 0\n", "\n", "    if sequence_length < 700:\n", "        predict_cpu = 4\n", "        predict_mem = 16\n", "        predict_gpu = 1\n", "    else:\n", "        predict_cpu = 16\n", "        predict_mem = 64\n", "        predict_gpu = 1\n", "\n", "    step_1_response = nbhelpers.submit_batch_alphafold_job(\n", "        job_name=str(job_name),\n", "        fasta_paths=object_key,\n", "        output_dir=job_name,\n", "        db_preset=DB_PRESET,\n", "        model_preset=model_preset,\n", "        s3_bucket=S3_BUCKET,\n", "        cpu=prep_cpu,\n", "        memory=prep_mem,\n", "        gpu=prep_gpu,\n", "        run_features_only=True,\n", "    )\n", "\n", "    print(f\"Job ID {step_1_response['jobId']} submitted\")\n", "\n", "    step_2_response = nbhelpers.submit_batch_alphafold_job(\n", "        job_name=str(job_name),\n", "        fasta_paths=object_key,\n", "        output_dir=job_name,\n", "        db_preset=DB_PRESET,\n", "        model_preset=model_preset,\n", "        s3_bucket=S3_BUCKET,\n", "        cpu=predict_cpu,\n", "        memory=predict_mem,\n", "        gpu=predict_gpu,\n", "        features_paths=os.path.join(job_name, job_name, \"features.pkl\"),\n", "        depends_on=step_1_response[\"jobId\"],\n", "    )\n", "\n", "    print(f\"Job ID {step_2_response['jobId']} submitted\")\n", "    job_name_list.append(job_name)\n", "    sleep(1)" ] }
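, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the jobs above finish, you can retrieve all of the results at once. The cell below is a convenience sketch that reuses `download_results` from sections 1 and 2 with the job names collected by the loop." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: after the Batch jobs above succeed, download the results for\n", "# every submitted CASP14 target. A sketch reusing download_results, which\n", "# assumes each job wrote its output under its job name as in sections 1-2.\n", "for casp14_job_name in job_name_list:\n", "    nbhelpers.download_results(\n", "        bucket=S3_BUCKET, job_name=casp14_job_name, local=\"data\"\n", "    )" ] }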
print(f\"Job ID {step_1_response['jobId']} submitted\")\n", "\n", " step_2_response = nbhelpers.submit_batch_alphafold_job(\n", " job_name=str(job_name),\n", " fasta_paths=object_key,\n", " output_dir=job_name,\n", " db_preset=DB_PRESET,\n", " model_preset=model_preset,\n", " s3_bucket=S3_BUCKET,\n", " cpu=predict_cpu,\n", " memory=predict_mem,\n", " gpu=predict_gpu,\n", " features_paths=os.path.join(job_name, job_name, \"features.pkl\"),\n", " depends_on=step_1_response[\"jobId\"],\n", " )\n", "\n", " print(f\"Job ID {step_2_response['jobId']} submitted\")\n", " sleep(1)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "AlphaFold.ipynb", "private_outputs": true, "provenance": [] }, "interpreter": { "hash": "8c6456aa60ef065493dc18f7bd7b99bea83a2d81b0d08a88287447b08bfc7a5d" }, "kernelspec": { "display_name": "Python 3.9.8 ('.venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 0 }