{ "cells": [ { "cell_type": "markdown", "id": "1b2aebda", "metadata": {}, "source": [ "# Serve Multiple DL models on GPU with Amazon SageMaker Multi-model endpoints (MME)\n", "\n", "\n", "\n", "Amazon SageMaker multi-model endpoints(MME) provide a scalable and cost-effective way to deploy large number of deep learning models. Previously, customers had limited options to deploy 100s of deep learning models that need accelerated compute with GPUs. Now customers can deploy 1000s of deep learning models behind one SageMaker endpoint. Now, MME will run multiple models on a GPU, share GPU instances behind an endpoint across multiple models and dynamically load/unload models based on the incoming traffic. With this, customers can significantly save cost and achieve best price performance.\n", "\n", "\n", "\n", "
💡 **Note**\n", "This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g5.xlarge`.\n"
" ] }, { "cell_type": "markdown", "id": "96038ada", "metadata": {}, "source": [ "In this notebook, we will walk you through how to use NVIDIA Triton Inference Server on Amazon SageMaker MME with GPU feature to deploy a **T5** NLP model for **Translation**. " ] }, { "cell_type": "markdown", "id": "07d31e8a", "metadata": {}, "source": [ "## Installs\n", "\n", "Installs the dependencies required to package the model and run inferences using Triton server. Update SageMaker, boto3, awscli etc" ] }, { "cell_type": "code", "execution_count": 26, "id": "367c8d77", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install -qU pip awscli boto3 sagemaker\n", "!pip install nvidia-pyindex --quiet\n", "!pip install tritonclient[http] --quiet\n", "!pip install transformers[sentencepiece] --quiet" ] }, { "cell_type": "markdown", "id": "f29c9de0", "metadata": {}, "source": [ "## Imports and variables" ] }, { "cell_type": "code", "execution_count": 3, "id": "2386d882", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.12-py3\n" ] } ], "source": [ "import boto3, json, sagemaker, time\n", "from sagemaker import get_execution_role\n", "import numpy as np\n", "import os\n", "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n", "\n", "# sagemaker variables\n", "role = get_execution_role()\n", "sm_client = boto3.client(service_name=\"sagemaker\")\n", "runtime_sm_client = boto3.client(\"sagemaker-runtime\")\n", "sagemaker_session = sagemaker.Session(boto_session=boto3.Session())\n", "s3_client = boto3.client('s3')\n", "bucket = sagemaker.Session().default_bucket()\n", "prefix = \"nlp-mme-gpu\"\n", "\n", "# account mapping for SageMaker MME Triton Image\n", "account_id_map = {\n", " \"us-east-1\": \"785573368785\",\n", " \"us-east-2\": \"007439368137\",\n", " \"us-west-1\": \"710691900526\",\n", " \"us-west-2\": \"301217895009\",\n", " \"eu-west-1\": \"802834080501\",\n", " \"eu-west-2\": \"205493899709\",\n", " \"eu-west-3\": \"254080097072\",\n", " \"eu-north-1\": \"601324751636\",\n", " \"eu-south-1\": \"966458181534\",\n", " \"eu-central-1\": \"746233611703\",\n", " \"ap-east-1\": \"110948597952\",\n", " \"ap-south-1\": \"763008648453\",\n", " \"ap-northeast-1\": \"941853720454\",\n", " \"ap-northeast-2\": \"151534178276\",\n", " \"ap-southeast-1\": \"324986816169\",\n", " \"ap-southeast-2\": \"355873309152\",\n", " \"cn-northwest-1\": \"474822919863\",\n", " \"cn-north-1\": \"472730292857\",\n", " \"sa-east-1\": \"756306329178\",\n", " \"ca-central-1\": \"464438896020\",\n", " \"me-south-1\": \"836785723513\",\n", " \"af-south-1\": \"774647643957\",\n", "}\n", "\n", "region = boto3.Session().region_name\n", "if region not in account_id_map.keys():\n", " raise (\"UNSUPPORTED REGION\")\n", "\n", "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", "triton_image_uri = (\n", " \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.12-py3\".format(\n", " account_id=account_id_map[region], region=region, base=base\n", " )\n", ")\n", "print(triton_image_uri)" ] }, { "cell_type": "markdown", "id": "ce64b95c", "metadata": {}, "source": [ "## Workflow Overview\n", "\n", "This section presents overview of main steps for preparing a T5 Pytorch model (served using Python backend) using Triton Inference Server.\n", "### 1. 
Generate Model Artifacts\n", "\n" ] }, { "cell_type": "markdown", "id": "a0f444f2", "metadata": {}, "source": [ "#### T5 PyTorch Model\n", "\n", "In the case of the T5-small HuggingFace PyTorch model, since we serve it using Triton's [python backend](https://github.com/triton-inference-server/python_backend#usage), we have a Python script [model.py](./workspace/model.py) that implements all the logic to initialize the T5 model and execute inference for the translation task." ] }, { "cell_type": "markdown", "id": "eb3dd938", "metadata": {}, "source": [ "### 2. Build Model Repository\n", "\n", "Using Triton on SageMaker requires us to first set up a [model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md) folder containing the models we want to serve. For each model we need to create a model directory containing the model artifact and define a config.pbtxt file to specify the [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) which Triton uses to load and serve the model. \n", "\n" ] }, { "cell_type": "markdown", "id": "a39c4592", "metadata": {}, "source": [ "#### T5 Python Backend Model\n", "\n", "Model repository structure for the T5 model:\n", "\n", "```\n", "t5_pytorch\n", "├── 1\n", "│   └── model.py\n", "└── config.pbtxt\n", "```\n" ] }, { "cell_type": "markdown", "id": "93792c33", "metadata": {}, "source": [ "Next, we set up the T5 PyTorch Python Backend Model in the model repository:" ] }, { "cell_type": "code", "execution_count": 4, "id": "4b6d6272", "metadata": { "tags": [] }, "outputs": [], "source": [ "!mkdir -p model_repository/t5_pytorch/1\n", "!cp workspace/model.py model_repository/t5_pytorch/1/" ] }, { "cell_type": "markdown", "id": "a0a12da8", "metadata": {}, "source": [ "##### Create Conda Environment for Dependencies\n", "\n", "To serve the HuggingFace T5 PyTorch model using Triton's Python backend, we need PyTorch and HuggingFace transformers as dependencies.\n", "\n", "We follow the instructions from the [Triton documentation for packaging dependencies](https://github.com/triton-inference-server/python_backend#2-packaging-the-conda-environment) to be used in the Python backend as a conda environment tar file. Running the bash script [create_hf_env.sh](./workspace/create_hf_env.sh) creates the conda environment containing PyTorch and HuggingFace transformers and packages it as a tar file, which we then move into the t5_pytorch model directory. This can take a few minutes." ] }, { "cell_type": "code", "execution_count": 5, "id": "e203209f", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting package metadata (current_repodata.json): done\n", "Solving environment: done\n", "\n", "\n", "==> WARNING: A newer version of conda exists. 
<==\n", " current version: 22.9.0\n", " latest version: 23.1.0\n", "\n", "Please update conda by running\n", "\n", " $ conda update -n base -c conda-forge conda\n", "\n", "\n", "\n", "## Package Plan ##\n", "\n", " environment location: /home/ec2-user/anaconda3/envs/hf_env\n", "\n", " added / updated specs:\n", " - python=3.8\n", "\n", "\n", "The following packages will be downloaded:\n", "\n", " package | build\n", " ---------------------------|-----------------\n", " pip-23.0.1 | pyhd8ed1ab_0 1.3 MB conda-forge\n", " python-3.8.16 |he550d4f_1_cpython 21.8 MB conda-forge\n", " setuptools-67.6.0 | pyhd8ed1ab_0 566 KB conda-forge\n", " ------------------------------------------------------------\n", " Total: 23.6 MB\n", "\n", "The following NEW packages will be INSTALLED:\n", "\n", " _libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge None\n", " _openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-2_gnu None\n", " bzip2 conda-forge/linux-64::bzip2-1.0.8-h7f98852_4 None\n", " ca-certificates conda-forge/linux-64::ca-certificates-2022.12.7-ha878542_0 None\n", " ld_impl_linux-64 conda-forge/linux-64::ld_impl_linux-64-2.40-h41732ed_0 None\n", " libffi conda-forge/linux-64::libffi-3.4.2-h7f98852_5 None\n", " libgcc-ng conda-forge/linux-64::libgcc-ng-12.2.0-h65d4601_19 None\n", " libgomp conda-forge/linux-64::libgomp-12.2.0-h65d4601_19 None\n", " libnsl conda-forge/linux-64::libnsl-2.0.0-h7f98852_0 None\n", " libsqlite conda-forge/linux-64::libsqlite-3.40.0-h753d276_0 None\n", " libuuid conda-forge/linux-64::libuuid-2.32.1-h7f98852_1000 None\n", " libzlib conda-forge/linux-64::libzlib-1.2.13-h166bdaf_4 None\n", " ncurses conda-forge/linux-64::ncurses-6.3-h27087fc_1 None\n", " openssl conda-forge/linux-64::openssl-3.0.8-h0b41bf4_0 None\n", " pip conda-forge/noarch::pip-23.0.1-pyhd8ed1ab_0 None\n", " python conda-forge/linux-64::python-3.8.16-he550d4f_1_cpython None\n", " readline conda-forge/linux-64::readline-8.1.2-h0f457ee_0 None\n", " setuptools conda-forge/noarch::setuptools-67.6.0-pyhd8ed1ab_0 None\n", " tk conda-forge/linux-64::tk-8.6.12-h27826a3_0 None\n", " wheel conda-forge/noarch::wheel-0.38.4-pyhd8ed1ab_0 None\n", " xz conda-forge/linux-64::xz-5.2.6-h166bdaf_0 None\n", "\n", "\n", "\n", "Downloading and Extracting Packages\n", "python-3.8.16 | 21.8 MB | ##################################### | 100% \n", "setuptools-67.6.0 | 566 KB | ##################################### | 100% \n", "pip-23.0.1 | 1.3 MB | ##################################### | 100% \n", "Preparing transaction: done\n", "Verifying transaction: done\n", "Executing transaction: done\n", "#\n", "# To activate this environment, use\n", "#\n", "# $ conda activate hf_env\n", "#\n", "# To deactivate an active environment, use\n", "#\n", "# $ conda deactivate\n", "\n", "Retrieving notices: ...working... 
done\n", "Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com, https://pypi.ngc.nvidia.com, https://download.pytorch.org/whl/cu116\n", "Collecting torch\n", " Downloading https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl (1977.9 MB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m2.0/2.0 GB\u001b[0m \u001b[31m23.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:02\u001b[0mm\n", "\u001b[?25hCollecting typing-extensions\n", " Downloading typing_extensions-4.5.0-py3-none-any.whl (27 kB)\n", "Installing collected packages: typing-extensions, torch\n", "Successfully installed torch-1.13.1+cu116 typing-extensions-4.5.0\n", "Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com, https://pypi.ngc.nvidia.com\n", "Collecting transformers[sentencepiece]\n", " Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m6.3/6.3 MB\u001b[0m \u001b[31m84.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25hCollecting packaging>=20.0\n", " Downloading packaging-23.0-py3-none-any.whl (42 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m42.7/42.7 kB\u001b[0m \u001b[31m267.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting numpy>=1.17\n", " Downloading numpy-1.24.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m17.3/17.3 MB\u001b[0m \u001b[31m267.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25hCollecting filelock\n", " Downloading filelock-3.9.0-py3-none-any.whl (9.7 kB)\n", "Collecting requests\n", " Downloading requests-2.28.2-py3-none-any.whl (62 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m62.8/62.8 kB\u001b[0m \u001b[31m313.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting tokenizers!=0.11.3,<0.14,>=0.11.1\n", " Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m7.6/7.6 MB\u001b[0m \u001b[31m268.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting regex!=2019.12.17\n", " Downloading regex-2022.10.31-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (772 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m772.3/772.3 kB\u001b[0m \u001b[31m420.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting pyyaml>=5.1\n", " Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)\n", "\u001b[2K 
\u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m701.2/701.2 kB\u001b[0m \u001b[31m412.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting tqdm>=4.27\n", " Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m77.1/77.1 kB\u001b[0m \u001b[31m206.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting huggingface-hub<1.0,>=0.11.0\n", " Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m199.2/199.2 kB\u001b[0m \u001b[31m358.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting sentencepiece!=0.1.92,>=0.1.91\n", " Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m421.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting protobuf<=3.20.2\n", " Downloading protobuf-3.20.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m1.0/1.0 MB\u001b[0m \u001b[31m415.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: typing-extensions>=3.7.4.3 in /home/ec2-user/anaconda3/envs/hf_env/lib/python3.8/site-packages (from huggingface-hub<1.0,>=0.11.0->transformers[sentencepiece]) (4.5.0)\n", "Collecting certifi>=2017.4.17\n", " Downloading certifi-2022.12.7-py3-none-any.whl (155 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m155.3/155.3 kB\u001b[0m \u001b[31m368.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting charset-normalizer<4,>=2\n", " Downloading charset_normalizer-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (195 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m195.9/195.9 kB\u001b[0m \u001b[31m308.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting urllib3<1.27,>=1.21.1\n", " Downloading urllib3-1.26.15-py2.py3-none-any.whl (140 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m140.9/140.9 kB\u001b[0m \u001b[31m335.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting idna<4,>=2.5\n", " Downloading idna-3.4-py3-none-any.whl (61 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m61.5/61.5 kB\u001b[0m \u001b[31m266.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hInstalling collected packages: tokenizers, sentencepiece, urllib3, tqdm, regex, pyyaml, protobuf, packaging, numpy, idna, filelock, charset-normalizer, certifi, 
requests, huggingface-hub, transformers\n", "Successfully installed certifi-2022.12.7 charset-normalizer-3.1.0 filelock-3.9.0 huggingface-hub-0.13.2 idna-3.4 numpy-1.24.2 packaging-23.0 protobuf-3.20.2 pyyaml-6.0 regex-2022.10.31 requests-2.28.2 sentencepiece-0.1.97 tokenizers-0.13.2 tqdm-4.65.0 transformers-4.26.1 urllib3-1.26.15\n", "Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com, https://pypi.ngc.nvidia.com\n", "Collecting conda-pack\n", " Downloading conda-pack-0.6.0.tar.gz (43 kB)\n", "\u001b[2K \u001b[90mโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”\u001b[0m \u001b[32m43.2/43.2 kB\u001b[0m \u001b[31m4.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25ldone\n", "\u001b[?25hRequirement already satisfied: setuptools in /home/ec2-user/anaconda3/envs/hf_env/lib/python3.8/site-packages (from conda-pack) (67.6.0)\n", "Building wheels for collected packages: conda-pack\n", " Building wheel for conda-pack (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for conda-pack: filename=conda_pack-0.6.0-py2.py3-none-any.whl size=30882 sha256=789ef55b9179bb233a9bf3686845b73d832f9c210a0cb1566961661f6ccb97dc\n", " Stored in directory: /tmp/pip-ephem-wheel-cache-ccr06n7x/wheels/56/1b/9e/0da27a4c18349d8f048a8fe87d763d75d3098384e9fa285e45\n", "Successfully built conda-pack\n", "Installing collected packages: conda-pack\n", "Successfully installed conda-pack-0.6.0\n", "Collecting packages...\n", "Packing environment at '/home/ec2-user/anaconda3/envs/hf_env' to 'hf_env.tar.gz'\n", "[########################################] | 100% Completed | 2min 25.2s\n" ] } ], "source": [ "!bash workspace/create_hf_env.sh\n", "!mv hf_env.tar.gz model_repository/t5_pytorch/" ] }, { "cell_type": "markdown", "id": "3363b6dc", "metadata": {}, "source": [ "After creating the tar file from the conda environment and placing it in model folder, you need to tell Python backend to use that environment for your model. We do this by including the lines below in the model `config.pbtxt` file:\n", "\n", "```\n", "parameters: {\n", " key: \"EXECUTION_ENV_PATH\",\n", " value: {string_value: \"$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz\"}\n", "}\n", "```\n", "Here, `$$TRITON_MODEL_DIRECTORY` helps provide environment path relative to the model folder in model repository and is resolved to `$pwd/model_repository/t5_pytorch`. Finally `hf_env.tar.gz` is the name we gave to our conda env file." 
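 ] }, { "cell_type": "markdown", "id": "0a1b2c3d", "metadata": {}, "source": [ "As an optional sanity check, we can list the model directory to confirm that `model.py` and the packed `hf_env.tar.gz` are in place (the `config.pbtxt` is written in the next step):" ] }, { "cell_type": "code", "execution_count": null, "id": "1c2d3e4f", "metadata": {}, "outputs": [], "source": [ "!ls -lR model_repository/t5_pytorch"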
] }, { "cell_type": "markdown", "id": "de6f3953", "metadata": {}, "source": [ "Now we are ready to define the config file for the T5 PyTorch model served through Triton's Python backend:" ] }, { "cell_type": "code", "execution_count": 6, "id": "43dcaf95", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing model_repository/t5_pytorch/config.pbtxt\n" ] } ], "source": [ "%%writefile model_repository/t5_pytorch/config.pbtxt\n", "name: \"t5_pytorch\"\n", "backend: \"python\"\n", "max_batch_size: 8\n", "input: [\n", "  {\n", "    name: \"input_ids\"\n", "    data_type: TYPE_INT32\n", "    dims: [ -1 ]\n", "  },\n", "  {\n", "    name: \"attention_mask\"\n", "    data_type: TYPE_INT32\n", "    dims: [ -1 ]\n", "  }\n", "]\n", "output [\n", "  {\n", "    name: \"output\"\n", "    data_type: TYPE_INT32\n", "    dims: [ -1 ]\n", "  }\n", "]\n", "instance_group {\n", "  count: 1\n", "  kind: KIND_GPU\n", "}\n", "dynamic_batching {\n", "}\n", "parameters: {\n", "  key: \"EXECUTION_ENV_PATH\",\n", "  value: {string_value: \"$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz\"}\n", "}" ] }, { "cell_type": "markdown", "id": "c3b7670a", "metadata": {}, "source": [ "### 3. Package models and upload to S3\n", "\n", "Next, we package our model as a `*.tar.gz` file for uploading to S3. " ] }, { "cell_type": "code", "execution_count": 7, "id": "5b6bb061", "metadata": { "tags": [] }, "outputs": [], "source": [ "!tar -C model_repository/ -czf t5_pytorch.tar.gz t5_pytorch\n", "model_uri_t5_pytorch = sagemaker_session.upload_data(path=\"t5_pytorch.tar.gz\", key_prefix=prefix)" ] }, { "cell_type": "markdown", "id": "2b690fe2", "metadata": {}, "source": [ "### 4. Create SageMaker Endpoint\n", "\n", "Now that we have uploaded the model artifacts to S3, we can create a SageMaker endpoint." ] }, { "cell_type": "markdown", "id": "0075cbb6", "metadata": {}, "source": [ "#### Define the serving container\n", "In the container definition, define the `ModelDataUrl` to specify the S3 directory that contains all the models the SageMaker multi-model endpoint will use to load and serve predictions. Set `Mode` to `MultiModel` to indicate that SageMaker should create the endpoint with the MME container specifications. 
We set the container to an image that supports deploying multi-model endpoints on GPU." ] }, { "cell_type": "code", "execution_count": 27, "id": "8681e13c", "metadata": { "tags": [] }, "outputs": [], "source": [ "model_data_url = f\"s3://{bucket}/{prefix}/\"\n", "\n", "container = {\n", "    \"Image\": triton_image_uri,\n", "    \"ModelDataUrl\": model_data_url,\n", "    \"Mode\": \"MultiModel\",  # set to \"MultiModel\" to host more than one model behind the endpoint\n", "}" ] }, { "cell_type": "markdown", "id": "a6c879df", "metadata": {}, "source": [ "#### Create a multi-model object" ] }, { "cell_type": "markdown", "id": "3a396ecb", "metadata": {}, "source": [ "Once the image and data location are set, we create the model using `create_model`, specifying the `ModelName` and the container definition." ] }, { "cell_type": "code", "execution_count": 28, "id": "9a22f650", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Arn: arn:aws:sagemaker:us-west-2:757967535041:model/nlp-mme-gpu-mdl-2023-03-17-22-46-46\n" ] } ], "source": [ "ts = time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "sm_model_name = f\"{prefix}-mdl-{ts}\"\n", "\n", "create_model_response = sm_client.create_model(\n", "    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n", ")\n", "\n", "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])" ] }, { "cell_type": "markdown", "id": "a4e335d8", "metadata": {}, "source": [ "#### Define configuration for the multi-model endpoint" ] }, { "cell_type": "markdown", "id": "30d79536", "metadata": {}, "source": [ "Using the model above, we create an [endpoint configuration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) where we can specify the type and number of instances we want in the endpoint. Here we deploy to a single `ml.g5.2xlarge` NVIDIA GPU instance." ] }, { "cell_type": "code", "execution_count": 29, "id": "a9d4b510", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Endpoint Config Arn: arn:aws:sagemaker:us-west-2:757967535041:endpoint-config/nlp-mme-gpu-epc-2023-03-17-22-46-46-2xl\n" ] } ], "source": [ "endpoint_config_name = f\"{prefix}-epc-{ts}-2xl\"\n", "\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", "    EndpointConfigName=endpoint_config_name,\n", "    ProductionVariants=[\n", "        {\n", "            \"InstanceType\": \"ml.g5.2xlarge\",\n", "            \"InitialVariantWeight\": 1,\n", "            \"InitialInstanceCount\": 1,\n", "            \"ModelName\": sm_model_name,\n", "            \"VariantName\": \"AllTraffic\",\n", "        }\n", "    ],\n", ")\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])" ] }, { "cell_type": "markdown", "id": "3a23980c", "metadata": {}, "source": [ "#### Create SageMaker Endpoint" ] }, { "cell_type": "markdown", "id": "52f050c5", "metadata": {}, "source": [ "Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to **InService** once the deployment is successful."
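 ] }, { "cell_type": "markdown", "id": "2e3f4a5b", "metadata": {}, "source": [ "After calling `create_endpoint` below, you can wait for the deployment either with the polling loop shown below or with boto3's built-in waiter, which blocks until the endpoint reaches **InService** (a minimal sketch; it raises a `WaiterError` if creation fails or times out):\n", "\n", "```python\n", "waiter = sm_client.get_waiter(\"endpoint_in_service\")\n", "waiter.wait(EndpointName=endpoint_name)\n", "```"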
] }, { "cell_type": "code", "execution_count": 30, "id": "9b1b19a5", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Endpoint Arn: arn:aws:sagemaker:us-west-2:757967535041:endpoint/nlp-mme-gpu-ep-2023-03-17-22-46-46-2xl\n" ] } ], "source": [ "endpoint_name = f\"{prefix}-ep-{ts}-2xl\"\n", "\n", "create_endpoint_response = sm_client.create_endpoint(\n", "    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])" ] }, { "cell_type": "code", "execution_count": 31, "id": "e3bba0d5", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Status: InService\n", "Arn: arn:aws:sagemaker:us-west-2:757967535041:endpoint/nlp-mme-gpu-ep-2023-03-17-22-46-46-2xl\n", "Status: InService\n" ] } ], "source": [ "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", "    time.sleep(60)\n", "    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "    status = resp[\"EndpointStatus\"]\n", "    print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "b21403ea", "metadata": {}, "source": [ "### 5. Run Inference" ] }, { "cell_type": "markdown", "id": "6be758ed", "metadata": {}, "source": [ "Once we have the endpoint running, we can use some sample raw data to run inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard [inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/protocol/README.md)." ] }, { "cell_type": "markdown", "id": "a408192b", "metadata": {}, "source": [ "#### Add utility methods for preparing JSON request payload\n", "\n" ] }, { "cell_type": "markdown", "id": "59d65e6f", "metadata": {}, "source": [ "We'll use the following utility methods to convert our inference request for the T5 model into a JSON payload."
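 ] }, { "cell_type": "markdown", "id": "3f4a5b6c", "metadata": {}, "source": [ "For reference, the KFServing v2 JSON inference request that these helpers build has the following general shape (the values shown are taken from the sample output below and are illustrative):\n", "\n", "```json\n", "{\n", "  \"inputs\": [\n", "    {\"name\": \"input_ids\", \"shape\": [1, 7], \"datatype\": \"INT32\", \"data\": [[8774, 6, 82, 1782, 19, 5295, 1]]},\n", "    {\"name\": \"attention_mask\", \"shape\": [1, 7], \"datatype\": \"INT32\", \"data\": [[1, 1, 1, 1, 1, 1, 1]]}\n", "  ]\n", "}\n", "```"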
] }, { "cell_type": "code", "execution_count": 33, "id": "ca3c7965", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sample_payload is {'inputs': [{'name': 'input_ids', 'shape': (1, 7), 'datatype': 'INT32', 'data': [[8774, 6, 82, 1782, 19, 5295, 1]]}, {'name': 'attention_mask', 'shape': (1, 7), 'datatype': 'INT32', 'data': [[1, 1, 1, 1, 1, 1, 1]]}]}\n", "payload type is <class 'dict'>\n" ] } ], "source": [ "from transformers import AutoTokenizer\n", "\n", "def get_tokenizer(model_name):\n", "    tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "    return tokenizer\n", "\n", "def tokenize_text(model_name, text):\n", "    tokenizer = get_tokenizer(model_name)\n", "    tokenized_text = tokenizer(text, padding=True, return_tensors=\"np\")\n", "    return tokenized_text.input_ids, tokenized_text.attention_mask\n", "\n", "def get_text_payload(model_name, text):\n", "    input_ids, attention_mask = tokenize_text(model_name, text)\n", "    payload = {}\n", "    payload[\"inputs\"] = []\n", "    payload[\"inputs\"].append({\"name\": \"input_ids\", \"shape\": input_ids.shape, \"datatype\": \"INT32\", \"data\": input_ids.tolist()})\n", "    payload[\"inputs\"].append({\"name\": \"attention_mask\", \"shape\": attention_mask.shape, \"datatype\": \"INT32\", \"data\": attention_mask.tolist()})\n", "    return payload\n", "\n", "\n", "text_input = \"Hello, my dog is cute\"\n", "sample_payload = get_text_payload('t5-small', text_input)\n", "\n", "print(\"sample_payload is\", sample_payload)\n", "print(\"payload type is\", type(sample_payload))" ] }, { "cell_type": "markdown", "id": "7c8701a7", "metadata": {}, "source": [ "#### Invoke target model on Multi-Model Endpoint" ] }, { "cell_type": "markdown", "id": "9386a4a0", "metadata": {}, "source": [ "We can send an inference request to the multi-model endpoint using the `invoke_endpoint` API. We specify the `TargetModel` in the invocation call and pass in the payload for each model."
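 ] }, { "cell_type": "markdown", "id": "4a5b6c7d", "metadata": {}, "source": [ "Note that the first time a given model is invoked, MME downloads it from S3 and loads it onto the GPU, so the first request sees higher latency than subsequent ones. The general calling pattern is sketched here (the concrete calls follow below):\n", "\n", "```python\n", "response = runtime_sm_client.invoke_endpoint(\n", "    EndpointName=endpoint_name,\n", "    ContentType=\"application/octet-stream\",\n", "    Body=json.dumps(payload),  # v2 JSON request built by get_text_payload\n", "    TargetModel=\"t5_pytorch.tar.gz\",  # S3 key of the model archive to route to\n", ")\n", "```"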
] }, { "cell_type": "markdown", "id": "39cbadfb", "metadata": {}, "source": [ "#### T5 PyTorch Model" ] }, { "cell_type": "markdown", "id": "84e38866", "metadata": {}, "source": [ "Next, we run a sample translation inference against the T5 PyTorch model deployed on Triton's Python backend behind the SageMaker MME GPU endpoint." ] }, { "cell_type": "code", "execution_count": 34, "id": "35c245b3", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n" ] } ], "source": [ "# endpoint_name was defined above when the endpoint was created\n", "texts_to_translate = [\"translate English to German: The house is wonderful.\"]\n", "batch_size = len(texts_to_translate)\n", "print(batch_size)" ] }, { "cell_type": "code", "execution_count": 35, "id": "b618886a-311e-4adc-874d-8eec0db99caa", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'inputs': [{'name': 'input_ids', 'shape': (1, 11), 'datatype': 'INT32', 'data': [[13959, 1566, 12, 2968, 10, 37, 629, 19, 1627, 5, 1]]}, {'name': 'attention_mask', 'shape': (1, 11), 'datatype': 'INT32', 'data': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}]}\n" ] } ], "source": [ "t5_payload = get_text_payload(\"t5-small\", texts_to_translate)\n", "print(t5_payload)" ] }, { "cell_type": "markdown", "id": "3e22d076", "metadata": {}, "source": [ "##### Sample T5 Inference using JSON Payload" ] }, { "cell_type": "code", "execution_count": 36, "id": "ce955041", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2 µs, sys: 0 ns, total: 2 µs\n", "Wall time: 4.05 µs\n", "{'ResponseMetadata': {'RequestId': '422dac26-c9d1-4c90-a326-ce1678277572', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '422dac26-c9d1-4c90-a326-ce1678277572', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Fri, 17 Mar 2023 22:55:00 GMT', 'content-type': 'application/json', 'content-length': '166'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x...>}\n" ] } ], "source": [ "%time\n", "t5_payload = get_text_payload(\"t5-small\", texts_to_translate)\n", "\n", "response = runtime_sm_client.invoke_endpoint(\n", "    EndpointName=endpoint_name,\n", "    ContentType=\"application/octet-stream\",\n", "    Body=json.dumps(t5_payload),\n", "    TargetModel=\"t5_pytorch.tar.gz\",\n", ")\n", "print(response)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c27afedc-3573-47ee-9f1a-94a00ff25a03", "metadata": { "tags": [] }, "outputs": [], "source": [ "response_body = json.loads(response[\"Body\"].read().decode(\"utf8\"))\n", "response_body" ] }, { "cell_type": "code", "execution_count": null, "id": "a620c0cf-b463-430a-8345-c5b13c3c149d", "metadata": {}, "outputs": [], "source": [ "output_ids = np.array(response_body[\"outputs\"][0][\"data\"]).reshape(batch_size, -1)\n", "t5_tokenizer = get_tokenizer(\"t5-small\")\n", "decoded_outputs = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n", "for text in decoded_outputs:\n", "    print(text, \"\\n\")" ] }, { "cell_type": "markdown", "id": "db6dc833", "metadata": {}, "source": [ "##### Sample T5 Inference using Binary + JSON Payload" ] }, { "cell_type": "code", "execution_count": 19, "id": "d4c5f084-da45-467b-a897-61c8a32c068e", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2 µs, sys: 0 ns, total: 2 µs\n", "Wall time: 3.81 µs\n" ] } ], "source": [ "%time\n", 
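"# Build the request in Triton's binary+JSON format: tensor data is sent as raw\n", "# binary after a JSON header, avoiding JSON-encoding of large arrays.\n", 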
"import tritonclient.http as httpclient\n", "import numpy as np\n", "\n", "def get_text_payload_binary(model_name, text):\n", " inputs = []\n", " outputs = []\n", " input_ids, attention_mask = tokenize_text(model_name, text)\n", " inputs.append(httpclient.InferInput(\"input_ids\", input_ids.shape, \"INT32\"))\n", " inputs.append(httpclient.InferInput(\"attention_mask\", attention_mask.shape, \"INT32\"))\n", "\n", " inputs[0].set_data_from_numpy(input_ids.astype(np.int32), binary_data=True)\n", " inputs[1].set_data_from_numpy(attention_mask.astype(np.int32), binary_data=True)\n", " \n", " output_name = \"output\" if model_name == \"t5-small\" else \"logits\"\n", " request_body, header_length = httpclient.InferenceServerClient.generate_request_body(\n", " inputs, outputs=outputs\n", " )\n", " return request_body, header_length" ] }, { "cell_type": "code", "execution_count": 18, "id": "9c0a33cd", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2 ยตs, sys: 0 ns, total: 2 ยตs\n", "Wall time: 4.05 ยตs\n", "Das Haus ist wunderbar. \n", "\n" ] } ], "source": [ "%time\n", "request_body, header_length = get_text_payload_binary(\"t5-small\", texts_to_translate)\n", "\n", "response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name,\n", " ContentType='application/vnd.sagemaker-triton.binary+json;json-header-size={}'.format(header_length),\n", " Body=request_body,\n", " TargetModel='t5_pytorch.tar.gz')\n", "\n", "# Parse json header size length from the response\n", "header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n", "header_length_str = response['ContentType'][len(header_length_prefix):]\n", "output_name = \"output\"\n", "# Read response body\n", "result = httpclient.InferenceServerClient.parse_response_body(\n", " response['Body'].read(), header_length=int(header_length_str))\n", "output_ids = result.as_numpy(output_name)\n", "decoded_output = t5_tokenizer.batch_decode(output_ids, skip_special_tokens=True)\n", "for text in decoded_outputs:\n", " print(text, \"\\n\")" ] }, { "cell_type": "markdown", "id": "0b6977d9", "metadata": {}, "source": [ "### Terminate endpoint and clean up artifacts" ] }, { "cell_type": "code", "execution_count": null, "id": "a91a735c", "metadata": {}, "outputs": [], "source": [ "# sm_client.delete_model(ModelName=sm_model_name)\n", "# sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "# sm_client.delete_endpoint(EndpointName=endpoint_name)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 5 }