{ "cells": [ { "cell_type": "markdown", "id": "bb02351a", "metadata": {}, "source": [ "# Benchmarking PyTorch NLP Model with NVIDIA Triton Inference Server and Inference Recommender on Amazon SageMaker\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "bb02351a", "metadata": {}, "source": [ "\n", "
\ud83d\udca1 Note \n", "Use an Amazon SageMaker notebook instance to execute this notebook; it will not work on Amazon SageMaker Studio notebooks.\n", "
\n", "\n", "This notebook demonstrates the use of Amazon SageMaker Inference recommender to perform custom load testing in order to performance fine tune the NLP BERT Model serving using NVIDIA Triton Serving on SageMaker.\n", "\n", "[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.\n", "\n", "Now, [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.\n", "\n", "SageMaker Inference Recommender is a new capability of SageMaker that reduces the time required to get machine learning (ML) models in production by automating performance benchmarking and load testing models across SageMaker ML instances. You can use Inference Recommender to deploy your model to a real-time inference endpoint that delivers the best performance at the lowest cost.\n", "\n", "This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `ml.g4dn.8xlarge` with 50 GB EBS volume.\n", "\n", "\n", "
\ud83d\udca1 Pricing: \n", "The estimated cost to execute this notebook in us-east-1 is < 15 USD. Prices vary by region; please refer to the Amazon SageMaker pricing page to estimate the cost of executing this notebook. The cost of instances provisioned by Amazon SageMaker Inference Recommender during the custom load test is based on instance usage; there is no additional fee for SageMaker Inference Recommender. For example, the estimated cost to run a custom load test using ml.g4dn.xlarge ( SupportedRealtimeInferenceInstanceTypes ) in us-east-1 for 2 hours is < 1.5 USD. Instances provisioned during the load tests are automatically terminated by Inference Recommender after the duration of the job; no explicit clean up is required.\n", "
\n" ] }, { "cell_type": "markdown", "id": "ca32a358", "metadata": {}, "source": [ "## Introduction to NVIDIA Triton Server\n", "\n", "[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.\n", "\n", "Some key features of Triton are:\n", "* **Support for Multiple frameworks**: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats. \n", "* **Model pipelines**: Triton model ensemble represents a pipeline of one or more models or pre/post processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.\n", "* **Concurrent model execution**: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.\n", "* **Dynamic batching**: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.\n", "* **Diverse CPUs and GPUs**: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.\n", "\n", "**Note**: This initial release of NVIDIA Triton on SageMaker will only support a single model. Future releases will have multi-model support. A minimal `config.pbtxt` configuration file is **required** in the model artifacts. This release doesn't support inferring the model config automatically." ] }, { "cell_type": "markdown", "id": "81e22ce6", "metadata": {}, "source": [ "#### Install packages\n", "\n", "Installs the dependencies required to package the model and run inferences using Triton server." 
] }, { "cell_type": "code", "execution_count": null, "id": "ce0372d3", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pip install -qU pip awscli boto3 sagemaker transformers==4.9.1\n", "!pip install nvidia-pyindex\n", "!pip install tritonclient[http]" ] }, { "cell_type": "markdown", "id": "997b792b", "metadata": {}, "source": [ "#### Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "d6b436c4", "metadata": {}, "outputs": [], "source": [ "# general imports\n", "import boto3\n", "import json\n", "import os\n", "import re\n", "import copy\n", "import time\n", "from time import gmtime, strftime\n", "import numpy as np\n", "import datetime\n", "import pprint\n", "import pandas as pd\n", "\n", "# sagemaker\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "# triton\n", "import tritonclient.http as httpclient\n", "\n", "# transformers\n", "from transformers import BertTokenizer\n", "\n", "# custom CloudWatch\n", "from cloudwatch import get_endpoint_metrics" ] }, { "cell_type": "markdown", "id": "6ae46734", "metadata": {}, "source": [ "#### Set Variables\n", "\n", "We set SageMaker variables and other variables below, also define the IAM role that will give Amazon SageMaker access to the model artifacts and the NVIDIA Triton ECR image." ] }, { "cell_type": "code", "execution_count": null, "id": "35f724e8", "metadata": {}, "outputs": [], "source": [ "sess = boto3.Session()\n", "sm = sess.client(\"sagemaker\")\n", "sagemaker_session = sagemaker.Session(boto_session=sess)\n", "role = get_execution_role()\n", "region = boto3.Session().region_name\n", "bucket = sagemaker.Session().default_bucket()\n", "prefix = \"sagemaker/pt-triton-inference-recommender\"\n", "\n", "sm_client = boto3.client(\"sagemaker\", region_name=region)\n", "cw_client = boto3.client(\"cloudwatch\", region)\n", "\n", "account_id_map = {\n", " \"us-east-1\": \"785573368785\",\n", " \"us-east-2\": \"007439368137\",\n", " \"us-west-1\": \"710691900526\",\n", " \"us-west-2\": \"301217895009\",\n", " \"eu-west-1\": \"802834080501\",\n", " \"eu-west-2\": \"205493899709\",\n", " \"eu-west-3\": \"254080097072\",\n", " \"eu-north-1\": \"601324751636\",\n", " \"eu-south-1\": \"966458181534\",\n", " \"eu-central-1\": \"746233611703\",\n", " \"ap-east-1\": \"110948597952\",\n", " \"ap-south-1\": \"763008648453\",\n", " \"ap-northeast-1\": \"941853720454\",\n", " \"ap-northeast-2\": \"151534178276\",\n", " \"ap-southeast-1\": \"324986816169\",\n", " \"ap-southeast-2\": \"355873309152\",\n", " \"cn-northwest-1\": \"474822919863\",\n", " \"cn-north-1\": \"472730292857\",\n", " \"sa-east-1\": \"756306329178\",\n", " \"ca-central-1\": \"464438896020\",\n", " \"me-south-1\": \"836785723513\",\n", " \"af-south-1\": \"774647643957\",\n", "}\n", "\n", "\n", "if region not in account_id_map.keys():\n", " raise (\"UNSUPPORTED REGION\")\n", "\n", "print(f\"SageMaker Role: {role}\")\n", "print(f\"Region Name: {region}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "588caf80", "metadata": {}, "outputs": [], "source": [ "ts = time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "sm_model_name = \"pt-triton-benchmark-model-\" + ts\n", "model_package_group_name = \"pt-triton-benchmark-model-group-\" + ts\n", "advanced_job = \"pt-triton-benchmark-advanced-job-\" + ts\n", "\n", "print(f\"SageMaker Model Name: {sm_model_name}\")\n", "print(f\"SageMaker Mode Package Name: {model_package_group_name}\")\n", "print(f\"SageMaker Advanced Job Name: {advanced_job}\")" ] }, { "cell_type": 
"markdown", "id": "c3077340", "metadata": {}, "source": [ "#### Amazon SageMaker Triton Inference Server Deep Learning Container Image\n", "\n", "Let's retrieve Amazon SageMaker NVIDIA Triton Inference server container image based on the account ID you are running this notebook." ] }, { "cell_type": "markdown", "id": "93c22f17", "metadata": {}, "source": [ "Set `triton_image_uri` based on the `account_id` and `region` information" ] }, { "cell_type": "code", "execution_count": null, "id": "5a79da5a", "metadata": {}, "outputs": [], "source": [ "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", "triton_image_uri = \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:21.08-py3\".format(\n", " account_id=account_id_map[region], region=region, base=base\n", ")\n", "\n", "print(f\"Triton Inference server DLC image: {triton_image_uri}\")" ] }, { "cell_type": "markdown", "id": "52978f33", "metadata": {}, "source": [ "## NLP Use case\n", "\n", "Deploying and scaling NLP models in a production set up can be quite challenging. NLP models are often very large in size, containing millions of model parameters. Optimal model configurations are required to satisfy stringent performance and scalability of production grade NLP applications\n", "\n", "In this notebook, we will benchmark a NLP use case using SageMaker Triton inference server and recommend performance tuning optimizations for the below NLP profile. We will use a large pre-trained transformer based `bert-large-uncased` model which has about 336 million model parameters. The input sentence used for the binary classification model will be padded and truncated to a maximum input sequence length 512 tokens. The inference load test will simulate to achieve 500 TPS (30000 maximum invocations per minute) and model latency of < 0.5 seconds (500 milliseconds)\n", "\n", "## NVIDIA Triton Setup with Amazon SageMaker\n", "\n", "1. We will use this [generate_models.sh](./workspace/generate_models.sh) to generate the `bert-large-uncased` model to be used with NVIDIA Triton inference server.\n", "2. The script for loading the pre-trained `bert_large_uncased` model and saving it can be found in this [pt_exporter.py](./workspace/pt_exporter.py)\n", "3. Pre-trained model is loaded in torchscript format and model artifacts are jit traced with a dummy input and store in model.pt format\n", "4. After the model is serialized we package it into the format that Triton and SageMaker expect it to be.\n", "5. We used the pre-configured `config.pbtxt` file provided with this repo to specify model [configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) which Triton uses to load the model. \n", "6. We tar the model directory and upload it to s3 to later create a [SageMaker Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html).\n", "\n", "\n", "**Note**: Amazon SageMaker expects the model tarball file to have a top level directory with the same name as the model defined in the `config.pbtxt`. Below is the sample model directory structure\n", "\n", "```\n", "bert\n", "\u251c\u2500\u2500 1\n", "\u2502 \u2514\u2500\u2500 model.pt\n", "\u2514\u2500\u2500 config.pbtxt\n", "```" ] }, { "cell_type": "markdown", "id": "f50d97c7", "metadata": {}, "source": [ "
, { "cell_type": "markdown", "id": "f50d97c7", "metadata": {}, "source": [ "\ud83d\udca1 NOTE \n", "The below script uses Docker and thus will not work on an Amazon SageMaker Studio notebook. Please use an Amazon SageMaker notebook instance to execute this notebook.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "ab6ab65a", "metadata": {}, "outputs": [], "source": [ "!docker run --gpus=all --rm -it \\\n", " -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:21.08-py3 \\\n", " /bin/bash generate_models.sh" ] }, { "cell_type": "markdown", "id": "a03638ed", "metadata": {}, "source": [ "The script saves the model in this [workspace](./workspace/) directory" ] }, { "cell_type": "markdown", "id": "b0035908", "metadata": {}, "source": [ "### Model Configuration\n", "\n", "- Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a `config.pbtxt` file specified as `ModelConfig protobuf`. \n", "- The model configuration name property is optional. If the name of the model is not specified in the configuration it is assumed to be the same as the model repository directory containing the model. If name is specified it must match the name of the model repository directory containing the model\n", "- The `max_batch_size` property indicates the maximum batch size that the model supports for the types of batching that can be exploited by Triton. If the model's batch dimension is the first dimension, and all inputs and outputs to the model have this batch dimension, then Triton can use its dynamic batcher or sequence batcher to automatically use batching with the model. In this case `max_batch_size` should be set to a value greater-or-equal-to 1 that indicates the maximum batch size that Triton should use with the model\n", "- Each model input and output must specify a name, datatype, and shape. The name specified for an input or output tensor must match the name expected by the model." ] }, { "cell_type": "markdown", "id": "100665cd", "metadata": {}, "source": [ "The below is the baseline configuration for PyTorch model" ] }, { "cell_type": "code", "execution_count": null, "id": "c6bbba3e", "metadata": {}, "outputs": [], "source": [ "!mkdir -p triton-serve-pt/bert/" ] }, { "cell_type": "code", "execution_count": null, "id": "a6c56843", "metadata": {}, "outputs": [], "source": [ "%%writefile triton-serve-pt/bert/config.pbtxt\n", "platform: \"pytorch_libtorch\"\n", "max_batch_size: 32\n", "input [\n", " {\n", " name: \"INPUT__0\"\n", " data_type: TYPE_INT32\n", " dims: [512]\n", " },\n", " {\n", " name: \"INPUT__1\"\n", " data_type: TYPE_INT32\n", " dims: [512]\n", " }\n", "]\n", "output [\n", " {\n", " name: \"OUTPUT__0\"\n", " data_type: TYPE_FP32\n", " dims: [512, 768]\n", " },\n", " {\n", " name: \"1634__1\"\n", " data_type: TYPE_FP32\n", " dims: [768]\n", " }\n", "]" ] }, { "cell_type": "markdown", "id": "b3652895", "metadata": {}, "source": [ "You can find other model configurations that are used in the benchmarking exercise in this [workspaces](./workspaces) directory.\n", "\n", "1. config-dg.pbtxt (Dynamic batching enabled)\n", "2. config-ig.pbtxt (Multiple instance group)\n", "3. config-db-ig.pbtxt (Dynamic batching, Multiple instance group enabled)\n", "\n", "To execute benchmarking load using Amazon SageMaker Inference recommender copy above files with `config.pbtxt` and reupload the model tar files." ] }, { "cell_type": "markdown", "id": "4b285777", "metadata": {}, "source": [ "We will copy the `model.pt` file and create model tar file to be uploaded to S3. 
, { "cell_type": "markdown", "id": "4b285777", "metadata": {}, "source": [ "We will copy the `model.pt` file and create the model tar file to be uploaded to S3. The model archive tar file will be used by the Triton Inference Server." ] }, { "cell_type": "code", "execution_count": null, "id": "cd9f6b68", "metadata": {}, "outputs": [], "source": [ "!mkdir -p triton-serve-pt/bert/1/\n", "!cp workspace/model.pt triton-serve-pt/bert/1/\n", "!tar -C triton-serve-pt/ -czf model.tar.gz bert\n", "model_uri = sagemaker_session.upload_data(path=\"model.tar.gz\", key_prefix=\"triton-serve-pt\")" ] }, { "cell_type": "markdown", "id": "443d90b3", "metadata": {}, "source": [ "### Create Amazon SageMaker Real Time Endpoint\n", "\n", "We start off by creating a [sagemaker model](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) from the model files we uploaded to s3 in the previous step.\n", "\n", "In this step we also provide an additional environment variable, `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`, which specifies the name of the model to be loaded by Triton. **The value of this key should match the folder name in the model package uploaded to s3**. This variable is optional in case of a single model. In case of ensemble models, this key **has to be** specified for Triton to start up in SageMaker.\n", "\n", "Additionally, customers can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` for optimizing the thread counts.\n", "\n", "*Note*: The current release of Triton (21.08-py3) on SageMaker doesn't support running instances of different models on the same server, except in case of [ensembles](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#ensemble-models). Only multiple model instances of the same model are supported, which can be specified under the [instance-groups](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups) section of the config.pbtxt file." ] }, { "cell_type": "markdown", "id": "7ddc3594", "metadata": {}, "source": [ "### Create payload \n", "\n", "Create the payload in JSON format and upload it to S3. This will be used by Inference Recommender to run the custom load test. " ] }, { "cell_type": "code", "execution_count": null, "id": "f9cbaaad", "metadata": {}, "outputs": [], "source": [ "def tokenize_text(text):\n", " enc = BertTokenizer.from_pretrained(\"bert-large-uncased\")\n", " encoded_text = enc(text, padding=\"max_length\", max_length=512, truncation=True)\n", " return encoded_text[\"input_ids\"], encoded_text[\"attention_mask\"]" ] }, { "cell_type": "markdown", "id": "4974b7ef", "metadata": {}, "source": [ "If you want to change the payload (token length), make the following changes (see the sketch after this list):\n", "1. Change the JSON with the shape reflecting the right token length below\n", "2. Change the tokenize_text method to reflect the token length\n", "3. Change the config.pbtxt in the triton* folder to reflect the input id and attention mask length." ] }
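, { "cell_type": "markdown", "id": "payload-token-sketch", "metadata": {}, "source": [ "For example, a hypothetical 128-token variant (the names and values below are purely illustrative) would change the tokenization and request shapes roughly as follows; the traced model and the `dims` in `config.pbtxt` would also have to change to 128:\n", "\n", "```python\n", "# Illustrative 128-token variant of the payload (sketch only)\n", "sample_text = \"Example sentence for the shorter payload.\"\n", "enc = BertTokenizer.from_pretrained(\"bert-large-uncased\")\n", "encoded_text = enc(sample_text, padding=\"max_length\", max_length=128, truncation=True)\n", "\n", "# Request shapes must match the new token length\n", "payload = {\n", "    \"inputs\": [\n", "        {\"name\": \"INPUT__0\", \"shape\": [1, 128], \"datatype\": \"INT32\", \"data\": encoded_text[\"input_ids\"]},\n", "        {\"name\": \"INPUT__1\", \"shape\": [1, 128], \"datatype\": \"INT32\", \"data\": encoded_text[\"attention_mask\"]},\n", "    ]\n", "}\n", "```" ] }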
\n", " This will be used by Inference Recommender to run the load test.\n", " \"\"\"\n", "\n", "input_ids, attention_mask = tokenize_text(text_triton)\n", "\n", "payload = {\n", " \"inputs\": [\n", " {\"name\": \"INPUT__0\", \"shape\": [1, 512], \"datatype\": \"INT32\", \"data\": input_ids},\n", " {\"name\": \"INPUT__1\", \"shape\": [1, 512], \"datatype\": \"INT32\", \"data\": attention_mask},\n", " ]\n", "}\n", "\n", "print(f\"Sample payload to be used with Inference Recommender\")\n", "print(payload)\n", "\n", "payload_location = \"./sample-payload/\"\n", "\n", "if not os.path.exists(payload_location):\n", " os.makedirs(payload_location)\n", " print(f\"Directory Created {payload_location}\")\n", "else:\n", " print(f\"Directory already exists {payload_location}\")\n", "\n", "payload_archive_name = \"payload.tar.gz\"\n", "\n", "f = open(payload_location + \"request.json\", \"w\")\n", "json.dump(payload, f)\n", "f.close()\n", "\n", "!cd ./sample-payload/ && tar czvf ../payload.tar.gz *" ] }, { "cell_type": "code", "execution_count": null, "id": "52b1bda8", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "sample_payload_url = sagemaker.Session().upload_data(\n", " payload_archive_name, bucket=bucket, key_prefix=prefix + \"/inference\"\n", ")\n", "\n", "model_archive_name = \"model.tar.gz\"\n", "model_url = sagemaker.Session().upload_data(\n", " model_archive_name, bucket=bucket, key_prefix=prefix + \"/model\"\n", ")\n", "\n", "print(f\"Sample Payload location in S3: {sample_payload_url}\")\n", "print(f\"Model archive location: {model_url}\")" ] }, { "cell_type": "markdown", "id": "be919d53", "metadata": {}, "source": [ "### Amazon SageMaker Inference Recommender set up\n", "\n", "Set the Domain, Task, Framework, version and Model for Inference Recommender Job. " ] }, { "cell_type": "code", "execution_count": null, "id": "af90f0bc", "metadata": {}, "outputs": [], "source": [ "ml_domain = \"NATURAL_LANGUAGE_PROCESSING\"\n", "ml_task = \"FILL_MASK\"\n", "ml_framework = \"PYTORCH\"\n", "framework_version = \"1.6.0\"\n", "model = \"bert-base-uncased\"" ] }, { "cell_type": "markdown", "id": "bd1859a9", "metadata": {}, "source": [ "Create the Triton Container Dictionary object and Model Package group for Inference recommender Job" ] }, { "cell_type": "code", "execution_count": null, "id": "04e28d89", "metadata": {}, "outputs": [], "source": [ "container = {\n", " \"Image\": triton_image_uri,\n", " \"ModelDataUrl\": model_url,\n", " \"NearestModelName\": model,\n", " \"Framework\": ml_framework,\n", " \"Environment\": {\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"bert\"},\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "4688dff7", "metadata": {}, "outputs": [], "source": [ "model_pacakge_group_response = sm_client.create_model_package_group(\n", " ModelPackageGroupName=str(model_package_group_name),\n", " ModelPackageGroupDescription=\"BERT large uncased Model group for Triton Serving\",\n", ")\n", "print(f\"Model Registry package group: {model_pacakge_group_response}\")" ] }, { "cell_type": "markdown", "id": "f09f7001", "metadata": {}, "source": [ "Amazon SageMaker model registry model package with domain, task and Inference container specification information. Specify the list of supported inference instance types in `SupportedRealtimeInferenceInstanceTypes` parameter. 
Also, define the `ContentType` and response MIME type information." ] }, { "cell_type": "code", "execution_count": null, "id": "e76c8f98", "metadata": {}, "outputs": [], "source": [ "model_package_version_response = sm_client.create_model_package(\n", " ModelPackageGroupName=str(model_package_group_name),\n", " ModelPackageDescription=\"BERT large uncased Model group for Triton Serving\",\n", " Domain=ml_domain,\n", " Task=ml_task,\n", " SamplePayloadUrl=sample_payload_url,\n", " InferenceSpecification={\n", " \"Containers\": [container],\n", " \"SupportedRealtimeInferenceInstanceTypes\": [\n", " \"ml.g4dn.4xlarge\",\n", " \"ml.g4dn.8xlarge\",\n", " \"ml.g4dn.16xlarge\",\n", " \"ml.g4dn.12xlarge\",\n", " \"ml.g4dn.xlarge\",\n", " \"ml.g4dn.2xlarge\",\n", " ],\n", " \"SupportedContentTypes\": [\"application/octet-stream\"],\n", " \"SupportedResponseMIMETypes\": [\"application/json\"],\n", " },\n", ")" ] }, { "cell_type": "markdown", "id": "32643b2d", "metadata": {}, "source": [ "### Amazon SageMaker Inference Recommender Custom Load Test\n", "\n", "Create a custom Inference Recommender job for the Triton container serving the BERT model with a 512-token input length.\n", "\n", "Use `create_inference_recommendations_job` to create an Inference Recommender load test and specify the below parameters:\n", "- Specify Advanced for the JobType field and provide:\n", "- A job name for your load test (JobName). \n", "- The Amazon Resource Name (ARN) of an IAM role that enables Inference Recommender to perform tasks on your behalf.\n", "- A traffic pattern of the load test (TrafficPattern)\n", " - Initial number of users = 2\n", " - Spawn Rate = 3 (creates 3 new users every minute over the 15-minute phase)\n", "- An endpoint configuration dictionary (InputConfig) where you specify an AWS instance type against which to run benchmarks\n", "- StoppingConditions (Inference Recommender adjusts the initial number of instances to satisfy the below stopping conditions)\n", " - MaxInvocations is set to 30000 \n", " - ModelLatencyThresholds P95 threshold of 500 ms" ] }, { "cell_type": "markdown", "id": "7baed395", "metadata": {}, "source": [ "
\ud83d\udca1 ResourceLimitExceeded Exception \n", "Please make sure you have access to the requested instance types in your AWS account. You may receive an exception similar to the following if the resource limits for a certain instance type are inadequate: \"INVALID_INPUT : 1. An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.g4dn.12xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit\"\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "6edb96ed", "metadata": {}, "outputs": [], "source": [ "advanced_response = sm_client.create_inference_recommendations_job(\n", " JobName=advanced_job,\n", " JobDescription=\"nlp triton Inference Advanced Recommender Job\",\n", " JobType=\"Advanced\",\n", " RoleArn=role,\n", " InputConfig={\n", " \"ModelPackageVersionArn\": model_package_version_response[\"ModelPackageArn\"],\n", " \"JobDurationInSeconds\": 7200,\n", " \"EndpointConfigurations\": [\n", " {\"InstanceType\": \"ml.p3.8xlarge\"},\n", " {\"InstanceType\": \"ml.p3.2xlarge\"},\n", " {\"InstanceType\": \"ml.p3.16xlarge\"},\n", " {\"InstanceType\": \"ml.g4dn.xlarge\"},\n", " {\"InstanceType\": \"ml.g4dn.8xlarge\"},\n", " {\"InstanceType\": \"ml.g4dn.4xlarge\"},\n", " {\"InstanceType\": \"ml.g4dn.2xlarge\"},\n", " {\"InstanceType\": \"ml.g4dn.16xlarge\"},\n", " {\"InstanceType\": \"ml.g4dn.12xlarge\"},\n", " ],\n", " \"TrafficPattern\": {\n", " \"TrafficType\": \"PHASES\",\n", " \"Phases\": [\n", " {\n", " \"InitialNumberOfUsers\": 2,\n", " \"SpawnRate\": 3,\n", " \"DurationInSeconds\": 900,\n", " }, # simulating 50 users, 2 initial and 3 new users every minute for 16 minutes\n", " ], # second phase, we will strt with 50 users, steady traffic for 5 minutes\n", " },\n", " \"ResourceLimit\": {\"MaxNumberOfTests\": 10, \"MaxParallelOfTests\": 5},\n", " },\n", " StoppingConditions={\n", " \"MaxInvocations\": 30000,\n", " \"ModelLatencyThresholds\": [{\"Percentile\": \"P95\", \"ValueInMilliseconds\": 500}],\n", " },\n", ")\n", "\n", "print(advanced_response)" ] }, { "cell_type": "markdown", "id": "d6640050", "metadata": {}, "source": [ "Let's get the inference recommender job details using `describe_inference_recommendations_job` boto3 API" ] }, { "cell_type": "code", "execution_count": null, "id": "3d28ff91", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "ended = False\n", "while not ended:\n", " inference_recommender_job = sm_client.describe_inference_recommendations_job(\n", " JobName=str(advanced_job)\n", " )\n", " if inference_recommender_job[\"Status\"] in [\"COMPLETED\", \"STOPPED\", \"FAILED\"]:\n", " print(f\"Inference recommender job status: {inference_recommender_job['Status']} \")\n", " ended = True\n", " else:\n", " print(\"Inference recommender job in progress\")\n", " time.sleep(300)\n", "\n", "if inference_recommender_job[\"Status\"] == \"FAILED\":\n", " print(\"Inference recommender job failed \")\n", " print(\"Failed Reason: {}\".inference_recommender_job[\"FailedReason\"])\n", "else:\n", " print(\"Inference recommender job completed\")" ] }, { "cell_type": "markdown", "id": "5c7af9ae", "metadata": {}, "source": [ "### Visualize CloudWatch Metrics" ] }, { "cell_type": "markdown", "id": "d2bdf420", "metadata": {}, "source": [ "Use `get_endpoint_metrics` helper functions, visualize the CloudWatch metrics. This will provide detailed overview of resource usage during the load test. Metrics such as GPU Memory utilization, Invocations and Model Latency metrics will allow to tweak NVIDIA Triton model configuration to improve application performance." 
] }, { "cell_type": "code", "execution_count": null, "id": "d12400b6", "metadata": {}, "outputs": [], "source": [ "job_name = advanced_response[\"JobArn\"].split(\"/\")[-1]\n", "\n", "df_cw = get_endpoint_metrics(sm_client, cw_client, region, job_name, include_plots=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "878a2e15", "metadata": {}, "outputs": [], "source": [ "data = [\n", " {**x[\"EndpointConfiguration\"], **x[\"ModelConfiguration\"], **x[\"Metrics\"]}\n", " for x in inference_recommender_job[\"InferenceRecommendations\"]\n", "]\n", "df = pd.DataFrame(data)\n", "df.drop(\"VariantName\", inplace=True, axis=1)\n", "pd.set_option(\"max_colwidth\", 400)" ] }, { "cell_type": "markdown", "id": "2449d58d", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This notebook provides an overview of NVIDIA Triton support for PyTorch framework, steps to leverage Amazon SageMaker Inference recommender to execute custom load test and find the right configurations, instance types for large NLP workloads. You can use different Triton configurations, visualize resource consumptions and optimize GPU resource utilization during inference. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-triton|benchmark|pytorch|pytorch_triton_inference_recommender.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }