{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c24fc73e",
   "metadata": {},
   "source": [
    "# Run mulitple deep learning models on GPUs with Amazon SageMaker Multi-model endpoints (MME)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c528238f",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33cc860f",
   "metadata": {},
   "source": [
    "\n",
    "[Amazon SageMaker](https://aws.amazon.com/sagemaker/) multi-model endpoints(MME) provide a scalable and cost-effective way to deploy large number of deep learning models. Previously, customers had limited options to deploy 100s of deep learning models that need accelerated compute with GPUs. Now customers can deploy 1000s of deep learning models behind one SageMaker endpoint. Now, MME will run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models and dynamically load/unload models based on the incoming traffic. With this, customers can significantly save cost and achieve best price performance.\n",
    "\n",
    "<div class=\"alert alert-info\"> <strong> Note </strong>\n",
    "This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g4dn.4xlarge`\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c7155c8",
   "metadata": {},
   "source": [
    "## Amazon SageMaker Multi-Model endpoints with GPU Support\n",
    "\n",
    "Amazon SageMaker multi-model endpoints with GPU works using [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/). NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance. Triton supports all major training and inference frameworks, such as TensorFlow, NVIDIA TensorRT, PyTorch, MXNet, Python, ONNX, XGBoost, scikit-learn, RandomForest, OpenVINO, custom C++, and more. It offers dynamic batching, concurrent execution, post-training quantization, optimal model configuration to achieve high performance inference.\n",
    "When SageMaker receives an invocation request for a particular model, it does the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ae0a48d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython import display\n",
    "\n",
    "display.Image(\"images/mme-gpu.jpg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c6dc1df",
   "metadata": {},
   "source": [
    "## How it works?\n",
    "\n",
    "1. SageMaker routes traffic to the right instance behind the endpoint where the target model is loaded. SageMaker takes care of model management behind the endpoint, loads model to the container's memory and unloads the model based on the endpoint's traffic pattern.\n",
    "2. Dynamically loads models from Amazon Simple Storage Service(S3) to the instance’s storage volume. If the invoked models is not available on instance storage volume, the model is downloaded onto instance storage volume. If the instance storage volume reaches capacity, SageMaker deletes any unused models from the storage volume.\n",
    "3. SageMaker loads the model to NVIDIA Triton container’s memory on GPU accelerated instance and serve the inference request. If the model is already loaded in the container memory, the subsequents requests are served faster as SageMaker does not need to download and load it again.\n",
    "4. SageMaker takes care of traffic shaping to the MME endpoint, SageMaker continues to routes traffics to the instance where the model is loaded. If the instance resources reach capacity due to high utilization,  SageMaker unloads least used models from the container to free up resource to load more frequently used models.\n",
    "5. SageMaker MME can horizontally scale using auto-scaling policy, provision additional GPU compute instances based on metrics such as GPU utilization, memory utilization etc to serve spiky traffic to MME endpoints."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1a911f2",
   "metadata": {},
   "source": [
    "In this notebook, we will show you how to use the new features Amazon SageMaker MME with GPU with a computer vision use case. For demonstration purpose, we will use a ResNet-50 convolutional neural network pre-trained model  that can classify images into 1000 categories. We will -\n",
    "\n",
    "*  Show how to use NVIDIA Triton inference container on SageMaker MME, leverage different model frameworks such and PyTorch and TensorRT. \n",
    "*  Walk you through steps to convert ResNet-50 models to optimized TensorRT engine format and deploy it with SageMaker MME. \n",
    "* Show how to get insights into instance and invocation metrics using Amazon CloudWatch."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a839183b",
   "metadata": {},
   "source": [
    "### Installs\n",
    "\n",
    "Installs the dependencies required to package the model and run inferences using Triton server. Update SageMaker, boto3, awscli etc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0982079f",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qU pip awscli boto3 sagemaker\n",
    "!pip install nvidia-pyindex --quiet\n",
    "!pip install tritonclient[http] --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a91424d",
   "metadata": {},
   "source": [
    "### Imports and variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b38bc485",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "import boto3, json, sagemaker, time\n",
    "from sagemaker import get_execution_role\n",
    "import numpy as np\n",
    "from PIL import Image\n",
    "import tritonclient.http as httpclient\n",
    "\n",
    "# variables\n",
    "s3_client = boto3.client(\"s3\")\n",
    "auto_scaling_client = boto3.client(\"application-autoscaling\")\n",
    "sample_image_name = \"shiba_inu_dog.jpg\"\n",
    "ts = time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n",
    "\n",
    "# sagemaker variables\n",
    "role = get_execution_role()\n",
    "sm_client = boto3.client(service_name=\"sagemaker\")\n",
    "runtime_sm_client = boto3.client(\"sagemaker-runtime\")\n",
    "sagemaker_session = sagemaker.Session(boto_session=boto3.Session())\n",
    "bucket = sagemaker_session.default_bucket()\n",
    "prefix = \"resnet50-mme-gpu\"\n",
    "\n",
    "# endpoint variables\n",
    "sm_model_name = f\"{prefix}-mdl-{ts}\"\n",
    "endpoint_config_name = f\"{prefix}-epc-{ts}\"\n",
    "endpoint_name = f\"{prefix}-ep-{ts}\"\n",
    "model_data_url = f\"s3://{bucket}/{prefix}/\"\n",
    "\n",
    "# account mapping for SageMaker MME Triton Image\n",
    "account_id_map = {\n",
    "    \"us-east-1\": \"785573368785\",\n",
    "    \"us-east-2\": \"007439368137\",\n",
    "    \"us-west-1\": \"710691900526\",\n",
    "    \"us-west-2\": \"301217895009\",\n",
    "    \"eu-west-1\": \"802834080501\",\n",
    "    \"eu-west-2\": \"205493899709\",\n",
    "    \"eu-west-3\": \"254080097072\",\n",
    "    \"eu-north-1\": \"601324751636\",\n",
    "    \"eu-south-1\": \"966458181534\",\n",
    "    \"eu-central-1\": \"746233611703\",\n",
    "    \"ap-east-1\": \"110948597952\",\n",
    "    \"ap-south-1\": \"763008648453\",\n",
    "    \"ap-northeast-1\": \"941853720454\",\n",
    "    \"ap-northeast-2\": \"151534178276\",\n",
    "    \"ap-southeast-1\": \"324986816169\",\n",
    "    \"ap-southeast-2\": \"355873309152\",\n",
    "    \"cn-northwest-1\": \"474822919863\",\n",
    "    \"cn-north-1\": \"472730292857\",\n",
    "    \"sa-east-1\": \"756306329178\",\n",
    "    \"ca-central-1\": \"464438896020\",\n",
    "    \"me-south-1\": \"836785723513\",\n",
    "    \"af-south-1\": \"774647643957\",\n",
    "}\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "if region not in account_id_map.keys():\n",
    "    raise (\"UNSUPPORTED REGION\")\n",
    "\n",
    "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n",
    "mme_triton_image_uri = (\n",
    "    \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.07-py3\".format(\n",
    "        account_id=account_id_map[region], region=region, base=base\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd4ed8b1",
   "metadata": {},
   "source": [
    "### Creating Model Artifacts\n",
    "\n",
    "This section presents overview of steps to prepare ResNet-50 pre-trained model to be deployed on SageMaker MME using Triton Inference server model configurations. \n",
    "\n",
    "\n",
    "<div class=\"alert alert-info\"><strong> Note </strong>\n",
    "We are demonstrating deployment with 2 models. However, customers can prepare and 100s of models. The models may or may not share the same framework.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20f7b29d",
   "metadata": {},
   "source": [
    "#### Prepare PyTorch Model \n",
    "\n",
    "`generate_model_pytorch.sh` file in the `workspace` directory contains scripts to generate a PyTorch model. First, we load a pre-trained ResNet50 model using torchvision models package. We save the model as model.pt file in TorchScript optimized and serialized format. TorchScript needs an example inputs to do a model forward pass, so we pass one instance of a RGB image with 3 color channels of dimension 224X224. The script for exporting this model can be found [here](./workspace/generate_model_pytorch.sh)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7871e6ed",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!docker run --gpus=all --rm -it \\\n",
    "            -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.07-py3 \\\n",
    "            /bin/bash generate_model_pytorch.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "725a0764",
   "metadata": {},
   "source": [
    "#### PyTorch Model Respository\n",
    "\n",
    "The model repository contains model to serve, in our case it will be the `model.pt` and configuration file with input/output specifications and metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11ddc53a",
   "metadata": {},
   "outputs": [],
   "source": [
    "display.Image(\"images/pyt-model-repo.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2c114f3",
   "metadata": {},
   "source": [
    "#### PyTorch Model configuration\n",
    "\n",
    "Model configuration file `config.pbtxt` must specify name of the model(`resnet`), the platform and backend properties (`pytorch_libtorch`), max_batch_size(128) and the input and output tensors along with the data type(TYPE_FP32) information. Additionally, you can specify `instance_group` and `dynamic_batching` properties to achieve high performance inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3502dd08",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p triton-serve-pt/resnet/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b8795a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile triton-serve-pt/resnet/config.pbtxt\n",
    "name: \"resnet\"\n",
    "platform: \"pytorch_libtorch\"\n",
    "max_batch_size: 128\n",
    "input {\n",
    "  name: \"INPUT__0\"\n",
    "  data_type: TYPE_FP32\n",
    "  dims: 3\n",
    "  dims: 224\n",
    "  dims: 224\n",
    "}\n",
    "output {\n",
    "  name: \"OUTPUT__0\"\n",
    "  data_type: TYPE_FP32\n",
    "  dims: 1000\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04c9d653",
   "metadata": {},
   "source": [
    "#### Prepare TensorRT Model\n",
    "\n",
    "1. We export the pre-trained ResNet50 model into an ONNX file, which runs the model once to trace its execution and then export the traced model to the specified file. It is one of the better options in terms model conversion and deployment when converting using ONNX.\n",
    "\n",
    "2. We use `trtexec` to automatically convert ONNX model to TensorRT plan. As ONNX is framework agnostic it works with models in TF, PyTorch and more. You will export the weights of your model from the framework and load them into your TensorRT network.\n",
    "\n",
    "\n",
    "In this step, we load pre-trained ResNet50 model from torch and convert to onnx representation using torch onnx exporter. Once onnx model is created, we use TensorRT trtexec command to create the model plan to be hosted with Triton. The script for exporting this model can be found [here](./workspace/generate_model_trt.sh). This is run as part of the `generate_model_trt.sh` script from the below cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "996e0df3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!docker run --gpus=all --rm -it \\\n",
    "            -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.07-py3 \\\n",
    "            /bin/bash generate_model_trt.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "384aa533",
   "metadata": {},
   "source": [
    "#### TensorRT Model Respository\n",
    "\n",
    "The model repository contains model to serve, for TensorRT model it will be the model.plan(created in above steps) and configuration file with input/output specifications and metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6917ebba",
   "metadata": {},
   "outputs": [],
   "source": [
    "display.Image(\"images/trt-model-repo.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ec0a618",
   "metadata": {},
   "source": [
    "#### TensorRT Model configuration\n",
    "\n",
    " For the TensorRT model, we specify `tensorrt_plan` as platform, input tensor specification of the image of dimension 224X224 which has 3 color channels. Output tensor with 1000 dimensions of type TYPE_FP32 corresponding the different object categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dff45fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p triton-serve-trt/resnet/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8a19004",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile triton-serve-trt/resnet/config.pbtxt\n",
    "name: \"resnet\"\n",
    "platform: \"tensorrt_plan\"\n",
    "max_batch_size: 128\n",
    "input {\n",
    "  name: \"input\"\n",
    "  data_type: TYPE_FP32\n",
    "  dims: 3\n",
    "  dims: 224\n",
    "  dims: 224\n",
    "}\n",
    "output {\n",
    "  name: \"output\"\n",
    "  data_type: TYPE_FP32\n",
    "  dims: 1000\n",
    "}\n",
    "model_warmup {\n",
    "    name: \"bs128 Warmup\"\n",
    "    batch_size: 128\n",
    "    inputs: {\n",
    "        key: \"input\"\n",
    "        value: {\n",
    "            data_type: TYPE_FP32\n",
    "            dims: 3\n",
    "            dims: 224\n",
    "            dims: 224\n",
    "            zero_data: false\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97e0eacc",
   "metadata": {},
   "source": [
    "#### 3. Export model artifacts to S3\n",
    "\n",
    "SageMaker expects the model artifacts in below format, it should also satisfy Triton container requirements such as model name, version, config.pbtxt files etc. `tar` the folder containing the model file as `model.tar.gz` and upload it to s3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "215f3cdc",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p triton-serve-pt/resnet/1/\n",
    "!mv -f workspace/model.pt triton-serve-pt/resnet/1/\n",
    "!tar -C triton-serve-pt/ -czf resnet_pt_v0.tar.gz resnet\n",
    "model_uri_pt = sagemaker_session.upload_data(path=\"resnet_pt_v0.tar.gz\", key_prefix=prefix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea83e748",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p triton-serve-trt/resnet/1/\n",
    "!mv -f workspace/model.plan triton-serve-trt/resnet/1/\n",
    "!tar -C triton-serve-trt/ -czf resnet_trt_v0.tar.gz resnet\n",
    "model_uri_trt = sagemaker_session.upload_data(path=\"resnet_trt_v0.tar.gz\", key_prefix=prefix)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "876e4eeb",
   "metadata": {},
   "source": [
    "Now that we have uploaded the model artifacts to S3, we can create a SageMaker multi-model endpoint."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "356fbd99",
   "metadata": {},
   "source": [
    "#### Deploy Models with MME\n",
    "\n",
    "We will now deploy ResNet-50 model with 2 different framework backends i.e. PyTorch and TensorRT to SageMaker MME. You can reproduce all the steps using step by step notebook on GitHub.\n",
    "\n",
    "\n",
    "\n",
    "<div class=\"alert alert-info\"> <strong> Note </strong>\n",
    "you can deploy 100s of models. The models can use same framework. They can also use different frameworks as shown in this note.\n",
    "</div>\n",
    "\n",
    "We will use AWS SDK for Python (Boto) APIs [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) to create a mulit-model endpoint.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d750dadf",
   "metadata": {},
   "source": [
    "#### Define the serving container "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef1f76d0",
   "metadata": {},
   "source": [
    " In the container definition, define the `ModelDataUrl` to specify the S3 directory that contains all the models that SageMaker multi-model endpoint will use to load  and serve predictions. Set `Mode` to `MultiModel` to indicates SageMaker would create the endpoint with MME container specifications. We set the container with an image that supports deploying multi-model endpoints with GPU, see MME [container images](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html#multi-model-support) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7a59206",
   "metadata": {},
   "outputs": [],
   "source": [
    "container = {\"Image\": mme_triton_image_uri, \"ModelDataUrl\": model_data_url, \"Mode\": \"MultiModel\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98d097db",
   "metadata": {},
   "source": [
    "#### Create a multi model object\n",
    "\n",
    "Using the SageMaker boto3 client, create the model using [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) API. We will pass the container definition to the create model API along with ModelName and ExecutionRoleArn.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "488389f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_model_response = sm_client.create_model(\n",
    "    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n",
    ")\n",
    "\n",
    "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb517e9e",
   "metadata": {},
   "source": [
    "#### Define configuration for the multi model endpoint\n",
    "\n",
    "Create a multi-model endpoint configurations using [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) boto3 API. Specify an accelerated GPU computing instance in InstanceType, in this post we will use g4dn.4xlarge instance. We recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.\n",
    "\n",
    "<div class=\"alert alert-info\"> <strong> Note </strong>\n",
    "Based on our findings, customers get price performance on ML optimized instances with single GPU core. Hence, this feature is only enabled for single GPU core instances. For full list of instances supported see this (https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html#multi-model-support to Docs page where we capture list of isntances.)\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b92ddff",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_endpoint_config_response = sm_client.create_endpoint_config(\n",
    "    EndpointConfigName=endpoint_config_name,\n",
    "    ProductionVariants=[\n",
    "        {\n",
    "            \"InstanceType\": \"ml.g4dn.4xlarge\",\n",
    "            \"InitialVariantWeight\": 1,\n",
    "            \"InitialInstanceCount\": 1,\n",
    "            \"ModelName\": sm_model_name,\n",
    "            \"VariantName\": \"AllTraffic\",\n",
    "        }\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "babb2dc2",
   "metadata": {},
   "source": [
    "#### Create Multi Model Endpoint\n",
    "\n",
    "Using the above endpoint configuration we create a new sagemaker endpoint and wait for the deployment to finish. The status will change to **InService** once the deployment is successful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1351b04b",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_endpoint_response = sm_client.create_endpoint(\n",
    "    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n",
    ")\n",
    "\n",
    "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0066e338",
   "metadata": {},
   "outputs": [],
   "source": [
    "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n",
    "status = resp[\"EndpointStatus\"]\n",
    "print(\"Status: \" + status)\n",
    "\n",
    "while status == \"Creating\":\n",
    "    time.sleep(60)\n",
    "    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n",
    "    status = resp[\"EndpointStatus\"]\n",
    "    print(\"Status: \" + status)\n",
    "\n",
    "print(\"Arn: \" + resp[\"EndpointArn\"])\n",
    "print(\"Status: \" + status)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c32f202a",
   "metadata": {},
   "source": [
    "#### Setup Autoscaling policies for GPU Multi Model Endpoint\n",
    "\n",
    "Amazon SageMaker multi-model endpoints supports automatic scaling (auto scaling) for your hosted models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.\n",
    "\n",
    "In the below scaling policy, use a custom metric GPUUtilization in TargetTrackingScalingPolicyConfiguration configuration and set a TargetValue of 60.0 for the target value of that metric. This autoscaling policy will provision additional instances upto MaxCapacity when GPU Utilization is more than 60%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0da5e08",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform auto-scaling of the endpoint based on GPU memory utilization\n",
    "\n",
    "# This is the format in which application autoscaling references the endpoint\n",
    "resource_id = \"endpoint/\" + endpoint_name + \"/variant/\" + \"AllTraffic\"\n",
    "response = auto_scaling_client.register_scalable_target(\n",
    "    ServiceNamespace=\"sagemaker\",\n",
    "    ResourceId=resource_id,\n",
    "    ScalableDimension=\"sagemaker:variant:DesiredInstanceCount\",\n",
    "    MinCapacity=1,\n",
    "    MaxCapacity=5,\n",
    ")\n",
    "\n",
    "\n",
    "# GPUMemoryUtilization metric\n",
    "response = auto_scaling_client.put_scaling_policy(\n",
    "    PolicyName=\"GPUUtil-ScalingPolicy\",\n",
    "    ServiceNamespace=\"sagemaker\",\n",
    "    ResourceId=resource_id,\n",
    "    ScalableDimension=\"sagemaker:variant:DesiredInstanceCount\",  # SageMaker supports only Instance Count\n",
    "    PolicyType=\"TargetTrackingScaling\",  # 'StepScaling'|'TargetTrackingScaling'\n",
    "    TargetTrackingScalingPolicyConfiguration={\n",
    "        # Scale out when GPU utilization hits GPUUtilization target value.\n",
    "        \"TargetValue\": 60.0,\n",
    "        \"CustomizedMetricSpecification\": {\n",
    "            \"MetricName\": \"GPUUtilization\",\n",
    "            \"Namespace\": \"/aws/sagemaker/Endpoints\",\n",
    "            \"Dimensions\": [\n",
    "                {\"Name\": \"EndpointName\", \"Value\": endpoint_name},\n",
    "                {\"Name\": \"VariantName\", \"Value\": \"AllTraffic\"},\n",
    "            ],\n",
    "            \"Statistic\": \"Average\",  # Possible - 'Statistic': 'Average'|'Minimum'|'Maximum'|'SampleCount'|'Sum'\n",
    "            \"Unit\": \"Percent\",\n",
    "        },\n",
    "        \"ScaleInCooldown\": 600,\n",
    "        \"ScaleOutCooldown\": 200,\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6991a709",
   "metadata": {},
   "source": [
    "#### Prepare Input Payload for PyTorch and TensorRT model\n",
    "\n",
    "The following method transforms a sample image we will be using for inference into the payload that can be sent for inference to the Triton server.\n",
    "\n",
    "The `tritonclient` package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d44058b",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_client.download_file(\n",
    "    f\"sagemaker-example-files-prod-{region}\",\n",
    "    \"datasets/image/pets/shiba_inu_dog.jpg\",\n",
    "    \"shiba_inu_dog.jpg\",\n",
    ")\n",
    "\n",
    "\n",
    "def get_sample_image():\n",
    "    image_path = \"./shiba_inu_dog.jpg\"\n",
    "    img = Image.open(image_path).convert(\"RGB\")\n",
    "    img = img.resize((224, 224))\n",
    "    img = (np.array(img).astype(np.float32) / 255) - np.array(\n",
    "        [0.485, 0.456, 0.406], dtype=np.float32\n",
    "    ).reshape(1, 1, 3)\n",
    "    img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)\n",
    "    img = np.transpose(img, (2, 0, 1))\n",
    "    return img.tolist()\n",
    "\n",
    "\n",
    "def _get_sample_image_binary(input_name, output_name):\n",
    "    inputs = []\n",
    "    outputs = []\n",
    "    inputs.append(httpclient.InferInput(input_name, [1, 3, 224, 224], \"FP32\"))\n",
    "    input_data = np.array(get_sample_image(), dtype=np.float32)\n",
    "    input_data = np.expand_dims(input_data, axis=0)\n",
    "    inputs[0].set_data_from_numpy(input_data, binary_data=True)\n",
    "    outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))\n",
    "    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(\n",
    "        inputs, outputs=outputs\n",
    "    )\n",
    "    return request_body, header_length\n",
    "\n",
    "\n",
    "def get_sample_image_binary_pt():\n",
    "    return _get_sample_image_binary(\"INPUT__0\", \"OUTPUT__0\")\n",
    "\n",
    "\n",
    "def get_sample_image_binary_trt():\n",
    "    return _get_sample_image_binary(\"input\", \"output\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3b0948b",
   "metadata": {},
   "source": [
    "#### Invoke target model on Multi Model Endpoint\n",
    "\n",
    "Once the endpoint is successfully created, we can send inference request to multi-model endpoint using invoke_enpoint API. We specify the TargetModel in the invocation call and pass in the payload for each model type. Sample invocation for PyTorch model and TensorRT model is shown below"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92a1b419",
   "metadata": {},
   "source": [
    "#### TensorRT model prediction "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "905816d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# tensorRT payload\n",
    "trt_payload = {\n",
    "    \"inputs\": [\n",
    "        {\"name\": \"input\", \"shape\": [1, 3, 224, 224], \"datatype\": \"FP32\", \"data\": get_sample_image()}\n",
    "    ]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2c38d39",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = runtime_sm_client.invoke_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    ContentType=\"application/octet-stream\",\n",
    "    Body=json.dumps(trt_payload),\n",
    "    TargetModel=\"resnet_trt_v0.tar.gz\",\n",
    ")\n",
    "\n",
    "print(json.loads(response[\"Body\"].read().decode(\"utf8\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc0d2e19",
   "metadata": {},
   "source": [
    "#### PyTorch Model prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdae9c1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# PyTorch payload\n",
    "pt_payload = {\n",
    "    \"inputs\": [\n",
    "        {\n",
    "            \"name\": \"INPUT__0\",\n",
    "            \"shape\": [1, 3, 224, 224],\n",
    "            \"datatype\": \"FP32\",\n",
    "            \"data\": get_sample_image(),\n",
    "        }\n",
    "    ]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d9307d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = runtime_sm_client.invoke_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    ContentType=\"application/octet-stream\",\n",
    "    Body=json.dumps(pt_payload),\n",
    "    TargetModel=\"resnet_pt_v0.tar.gz\",\n",
    ")\n",
    "\n",
    "response = json.loads(response[\"Body\"].read().decode(\"utf8\"))\n",
    "output = response[\"outputs\"][0][\"data\"]\n",
    "\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d7c44f1",
   "metadata": {},
   "source": [
    "We can also use binary+json as the payload format to get better performance for the inference call. The specification of this format is provided [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md).\n",
    "\n",
    "**Note:** With the `binary+json` format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header `application/vnd.sagemaker-triton.binary+json;json-header-size={}`.\n",
    "\n",
    "Please not, this is different from using `Inference-Header-Content-Length` header on a stand-alone Triton server since custom headers are not allowed in SageMaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69432ae7",
   "metadata": {},
   "outputs": [],
   "source": [
    "request_body, header_length = get_sample_image_binary_pt()\n",
    "\n",
    "response = runtime_sm_client.invoke_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    ContentType=\"application/vnd.sagemaker-triton.binary+json;json-header-size={}\".format(\n",
    "        header_length\n",
    "    ),\n",
    "    Body=request_body,\n",
    "    TargetModel=\"resnet_pt_v0.tar.gz\",\n",
    ")\n",
    "\n",
    "# Parse json header size length from the response\n",
    "header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n",
    "header_length_str = response[\"ContentType\"][len(header_length_prefix) :]\n",
    "\n",
    "# Read response body\n",
    "result = httpclient.InferenceServerClient.parse_response_body(\n",
    "    response[\"Body\"].read(), header_length=int(header_length_str)\n",
    ")\n",
    "output0_data = result.as_numpy(\"OUTPUT__0\")\n",
    "print(output0_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1568cbb",
   "metadata": {},
   "source": [
    "#### Cloudwatch metrics for GPU Multi Model Endpoints\n",
    "\n",
    "Amazon SageMaker multi-model endpoints provides instance level metrics to monitor, for more details refer [Monitor Amazon SageMaker with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html)\n",
    "\n",
    "\n",
    "*  Number of models loaded in the containers (LoadedModelCount),  \n",
    "* Precentage of GPU units that are used by the containers (GPUUtilization), \n",
    "* Precentage of GPU memory used by the containers (GPUMemoryUtilization), \n",
    "* Precentage of disk space used by the containers (DiskUtilization) etc. \n",
    "\n",
    "SageMaker MME also provides Model loading metrics such as-\n",
    "\n",
    "*  Time interval for model to be downloaded or loaded (ModelLoadingWaitTime),\n",
    "*  Time interval to unload model from container (ModelUnloadingTime),\n",
    "*  Time to download the model from S3 (ModelDownloadingTime), \n",
    "* Number of invocations to model that are already loaded onto the container(ModelCacheHit) etc to get model invocation level insights. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbf9016e",
   "metadata": {},
   "source": [
    "#### Terminate endpoint and clean up artifacts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6d232e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "sm_client.delete_model(ModelName=sm_model_name)\n",
    "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n",
    "sm_client.delete_endpoint(EndpointName=endpoint_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0f67390",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/multi-model-endpoints|mme-on-gpu|cv|resnet50_mme_with_gpu.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.p3.2xlarge",
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}