{ "cells": [ { "cell_type": "markdown", "id": "c35e567a-517a-4995-8aa9-350d777d6c34", "metadata": { "tags": [] }, "source": [ "# Module 2: Run Protein Structure Design and Protein Structure Prediction\n" ] }, { "cell_type": "markdown", "id": "349a9fdf-0b63-4439-bd00-13405b03e142", "metadata": {}, "source": [ "NOTE: The authors recommend running this notebook in Amazon SageMaker Studio with the following environment settings: \n", "* **PyTorch 1.13 Python 3.9 GPU-optimized** image \n", "* **Python 3** kernel \n", "* **ml.g4dn.xlarge** instance type \n", "\n", "---" ] }, { "cell_type": "markdown", "id": "a7d6f50f-8480-4776-acb4-0011962b6641", "metadata": {}, "source": [ "Analyzing large macromolecules like proteins is an essential part of designing new therapeutics. Recently, a number of deep learning-based approaches have improved the speed and accuracy of protein structure analysis. Some of these methods are shown in the image below.\n", "\n", "\n", "\n", "In this module, we will use several AI algorithms to design **new** variants of the heavy chain of Herceptin (trastuzumab). The steps for the pipeline are as follows:\n", "\n", "* [RFdiffusion](https://github.com/RosettaCommons/RFdiffusion) is used to generate a small number of variant designs. We will only attempt to redesign parts of the variable region.\n", "\n", "* [ProteinMPNN](https://github.com/dauparas/ProteinMPNN) is then used to discover novel sequences that are expected to fold into the new structure.\n", "\n", "* [ESMFold](https://github.com/facebookresearch/esm) is then used to score each of the candidate proteins. ESMFold returns the average predicted local distance difference test (pLDDT) score, which represents the confidence (averaged over all residues) in the predicted structure. 
We will use this score to assess whether the predicted structure is likely to be correct.\n", "\n", "To run ESMFold, we will use the ESMFold endpoint deployed in Module 1, so please ensure that you have run that module **before** running this one." ] }, { "cell_type": "markdown", "id": "f3439cfe-4302-4c67-81d6-ab416019f5f4", "metadata": {}, "source": [ "---\n", "## 1. Setup and installation" ] }, { "cell_type": "markdown", "id": "954d42aa-dc06-4dd4-88e7-4e154b6c8429", "metadata": {}, "source": [ "Install RFdiffusion and its dependencies." ] }, { "cell_type": "code", "execution_count": null, "id": "715d02b9-dc64-4fec-b8de-dd76f2df32c9", "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install -U -q -r protein-design-requirements.txt --disable-pip-version-check" ] }, { "cell_type": "markdown", "id": "0c147e3f-f8aa-4f69-adf5-7b8a2229f263", "metadata": {}, "source": [ "Download and extract the RFdiffusion and ProteinMPNN model weights, along with the 1N8Z structure file. This will take several minutes." ] }, { "cell_type": "code", "execution_count": null, "id": "64bc76b6-4ff0-4595-b1fa-300a705be5cb", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%bash\n", "mkdir -p \"data/weights/rfdiffusion\" \"data/weights/proteinmpnn\" \n", "# Download and unpack the RFdiffusion weights, then remove the archive\n", "aws s3 cp --no-sign-request \"s3://aws-batch-architecture-for-alphafold-public-artifacts/compressed/rfdiffusion_parameters_220407.tar.gz\" \"weights.tar.gz\"\n", "tar --extract -z --file=\"weights.tar.gz\" --directory=\"data/weights/rfdiffusion\" --no-same-owner\n", "rm \"weights.tar.gz\"\n", "# Download the ProteinMPNN weights\n", "wget -q -P \"data/weights/proteinmpnn\" https://github.com/dauparas/ProteinMPNN/raw/main/vanilla_model_weights/v_48_020.pt\n", "# Download the 1N8Z starting structure\n", "wget -q -P \"data\" https://files.rcsb.org/download/1N8Z.pdb" ] }, { "cell_type": "markdown", "id": "d32f66e4-9b21-4d2b-bce9-48c7b22aa861", "metadata": {}, "source": [ "---\n", "## 2. Generate new heavy chain structures with RFdiffusion\n", "First, we will run RFdiffusion to generate some novel protein structures. 
To do this, we give the model a starting structure and tell it which parts to change. We want it to redesign **only** residues 98-109 on the B chain, while keeping the rest of the structure the same.\n", "\n", "Let's take a look at the regions of interest. In the following view, the part of the heavy chain we want to redesign is colored blue and the target region is in green.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f2ffc610-9566-4c87-98aa-539ad302a289", "metadata": { "tags": [] }, "outputs": [], "source": [ "import py3Dmol\n", "\n", "view = py3Dmol.view(width=600, height=400)\n", "with open(\"data/1N8Z.pdb\") as ifile:\n", "    experimental_structure = ifile.read()\n", "view.addModel(experimental_structure)\n", "# Hide chain A\n", "view.setStyle({\"chain\": \"A\"}, {\"opacity\": 0})\n", "# Heavy chain (B) in blue, with the region to redesign highlighted\n", "view.setStyle({\"chain\": \"B\"}, {\"cartoon\": {\"color\": \"blue\", \"opacity\": 0.4}})\n", "view.addStyle(\n", "    {\"chain\": \"B\", \"resi\": \"98-109\"}, {\"cartoon\": {\"color\": \"#57C4F8\", \"opacity\": 1.0}}\n", ")\n", "# Target (C) in green, with residues 540-580 highlighted\n", "view.setStyle({\"chain\": \"C\"}, {\"cartoon\": {\"color\": \"green\", \"opacity\": 0.4}})\n", "view.setStyle(\n", "    {\"chain\": \"C\", \"resi\": \"540-580\"}, {\"cartoon\": {\"color\": \"#37F20E\", \"opacity\": 1.0}}\n", ")\n", "view.zoomTo()\n", "view.show()" ] }, { "cell_type": "markdown", "id": "d3856c56-e4e7-4e3d-a743-e501215ce6a1", "metadata": { "tags": [] }, "source": [ "Here are our design constraints:\n", "\n", "- Look at residues 540-580 of the target molecule (green structure in image above)\n", "- Create a new structure that includes residues 1-97 of the heavy chain (blue above), then generate 13 new residues (the `13` in the contig string below), then add residues 110-120 from the heavy chain.\n", "- Use residues 570 and 573 from the target molecule as \"hotspots\", meaning we want to make sure that the new structure interacts with these specific amino acids.\n", "- Create 4 designs in total\n", "- Leave the rest of the RFdiffusion parameters 
at their default values\n", "\n", "The RFdiffusion job will take about 5 minutes to complete on an ml.g4dn.xlarge instance." ] }, { "cell_type": "code", "execution_count": null, "id": "d6c003d7-6f06-4da7-b25c-44d77334d126", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%time\n", "!mkdir -p data/results/rfdiffusion\n", "\n", "from prothelpers.rfdiffusion import create_structures\n", "\n", "create_structures(\n", "    overrides=[\n", "        \"inference.input_pdb=data/1N8Z.pdb\",\n", "        \"inference.output_prefix=data/results/rfdiffusion/rfdiffusion_result\",\n", "        \"inference.model_directory_path=data/weights/rfdiffusion\",\n", "        # Target residues C540-580, then heavy chain B1-97 + 13 new residues + B110-120\n", "        \"contigmap.contigs=[C540-580/0 B1-97/13/B110-120]\",\n", "        \"ppi.hotspot_res=[C570,C573]\",\n", "        \"inference.num_designs=4\",\n", "    ]\n", ")" ] }, { "cell_type": "markdown", "id": "6f2a0b2f-9e50-4501-a81a-2e0314516b7b", "metadata": {}, "source": [ "Our new designs are in the `data/results/rfdiffusion` folder. Let's take a look at them." ] }, { "cell_type": "code", "execution_count": null, "id": "0309a81c-c345-466b-b8ab-dc4e4d9817fe", "metadata": { "tags": [] }, "outputs": [], "source": [ "from prothelpers.structure import extract_structures_from_dir, create_tiled_py3dmol_view\n", "\n", "rfdiffusion_results_dir = \"data/results/rfdiffusion\"\n", "structures = extract_structures_from_dir(rfdiffusion_results_dir)\n", "\n", "# Show the designs in a grid with two columns\n", "view = create_tiled_py3dmol_view(structures, total_cols=2)\n", "\n", "view.setStyle({\"chain\": \"A\"}, {\"cartoon\": {\"color\": \"blue\", \"opacity\": 0.4}})\n", "view.setStyle(\n", "    {\"chain\": \"A\", \"resi\": \"98-109\"}, {\"cartoon\": {\"color\": \"#57C4F8\", \"opacity\": 1.0}}\n", ")\n", "view.setStyle({\"chain\": \"B\"}, {\"cartoon\": {\"color\": \"green\"}})\n", "\n", "view.zoomTo()\n", "view.show()" ] }, { "cell_type": "markdown", "id": "071f3f9a-306f-4a71-99f5-4a8d8aa5dae1", "metadata": { "tags": [] }, "source": [ "## 3. 
Translate Structure into Sequence with ProteinMPNN\n", "ProteinMPNN is a tool for **inverse protein folding**. In inverse protein folding, the input is a protein tertiary structure, and the output is one or more sequences that are predicted to fold into the specified structure. Here is a schematic of how it works:\n", "