{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a64814c4-c157-4162-b859-6180ef5f0ff2",
   "metadata": {},
   "source": [
    "# Triton on SageMaker: export a RoBERTa model to ONNX and serve it with the Triton ONNX backend\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a64814c4-c157-4162-b859-6180ef5f0ff2",
   "metadata": {},
   "source": [
    "\n",
    "Amazon SageMaker is a fully managed service for data science and machine learning workflows. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.\n",
    "\n",
    "NVIDIA Triton Inference Server can be used to serve models for inference in Amazon SageMaker. With the NVIDIA Triton container image, you can serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.\n",
    "\n",
    "This notebook was tested on Studio with ml.g4dn.xlarge, which comes with 1 GPU, and with ml.m5.large, which is a CPU-only instance.\n",
    "\n",
    "### Contents\n",
    "\n",
    "* Introduction to NVIDIA Triton Server\n",
    "* Set up the environment\n",
    "* Basic: RoBERTa model\n",
    "  * PyTorch: Export the model to ONNX and save it\n",
    "  * PyTorch: Package the model files and upload them to S3\n",
    "  * PyTorch: Create the SageMaker endpoint\n",
    "  * PyTorch: Run inference\n",
    "  * PyTorch: Terminate the endpoint and clean up artifacts\n",
    "\n",
    "\n",
    "### Introduction to NVIDIA Triton Server\n",
    "\n",
    "NVIDIA Triton Inference Server was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.\n",
    "\n",
    "Some key features of Triton are:\n",
    "\n",
    "* Support for multiple frameworks: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree-based models, and OpenVINO model formats.\n",
    "* Model pipelines: Triton model ensemble represents a pipeline of one or more models or pre/post-processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.\n",
    "* Concurrent model execution: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.\n",
    "* Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.\n",
    "* Diverse CPUs and GPUs: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.\n",
    "\n",
    "Note: The initial release of NVIDIA Triton on SageMaker supported only a single model per endpoint; this notebook uses the multi-model mode (`MultiModel`). A minimal `config.pbtxt` configuration file is required in the model artifacts, because the model configuration is not inferred automatically here.\n",
    "\n",
    "This notebook shows how to take an open-source NLP model, export it to the open ONNX format, and host the ONNX model with Triton on a SageMaker multi-model endpoint."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b237a7c",
   "metadata": {},
   "source": [
    "## Set up the environment\n",
    "\n",
    "Install the dependencies required to package the model and run inference with Triton server.\n",
    "\n",
    "Also define the IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton ECR image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b1552a9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install timm -Uqq\n",
    "%pip install transformers -Uqq\n",
    "%pip install locust -Uqq\n",
    "%pip install boto3 -Uqq\n",
    "%pip install sagemaker -Uqq\n",
    "%pip install matplotlib -Uqq\n",
    "%pip install Jinja2 -Uqq\n",
    "%pip install nvidia-pyindex --quiet\n",
    "%pip install tritonclient[http] --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fc27e19-c615-4102-99b3-eb4f13b64483",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f53c22c3-a414-4bf6-abf2-3e4e73aee07d",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install boto3 -Uqq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77daf460-6119-4ba7-931e-5c5163c34d7a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install transformers[onnx] -Uqq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc8b4725",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "import torch\n",
    "from pathlib import Path\n",
    "\n",
    "import boto3\n",
    "import json\n",
    "import time\n",
    "import datetime as dt\n",
    "import warnings\n",
    "\n",
    "from utils import model_utils\n",
    "\n",
    "role = get_execution_role()\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "account = sess.account_id()\n",
    "bucket = sess.default_bucket()  # or use your own custom bucket name\n",
    "prefix = \"mme-roberta-benchmark\"\n",
    "\n",
    "use_case = \"nlp\"\n",
    "\n",
    "sm_client = boto3.client(service_name=\"sagemaker\")\n",
    "runtime_sm_client = boto3.client(\"sagemaker-runtime\")\n",
    "\n",
    "tested_models = [\"bert-base-uncased\", \"roberta-large\"]\n",
    "\n",
    "model_name = \"roberta-large\"  # change the model name to benchmark different NLP models\n",
    "\n",
    "max_seq_len = 512"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a8a043a-3603-4eee-af62-53687881ab6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%env TOKENIZERS_PARALLELISM=False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e9a074b-0d80-47db-922f-c81687e94542",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(torch.__version__)\n",
    "import transformers\n",
    "\n",
    "print(transformers.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d5503b2",
   "metadata": {},
   "source": [
    "Account ID mapping, by region, for the Triton inference container images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1041ca81",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "account_id_map = {\n",
    "    \"us-east-1\": \"785573368785\",\n",
    "    \"us-east-2\": \"007439368137\",\n",
    "    \"us-west-1\": \"710691900526\",\n",
    "    \"us-west-2\": \"301217895009\",\n",
    "    \"eu-west-1\": \"802834080501\",\n",
    "    \"eu-west-2\": \"205493899709\",\n",
    "    \"eu-west-3\": \"254080097072\",\n",
    "    \"eu-north-1\": \"601324751636\",\n",
    "    \"eu-south-1\": \"966458181534\",\n",
    "    \"eu-central-1\": \"746233611703\",\n",
    "    \"ap-east-1\": \"110948597952\",\n",
    "    \"ap-south-1\": \"763008648453\",\n",
    "    \"ap-northeast-1\": \"941853720454\",\n",
    "    \"ap-northeast-2\": \"151534178276\",\n",
    "    \"ap-southeast-1\": \"324986816169\",\n",
    "    \"ap-southeast-2\": \"355873309152\",\n",
    "    \"cn-northwest-1\": \"474822919863\",\n",
    "    \"cn-north-1\": \"472730292857\",\n",
    "    \"sa-east-1\": \"756306329178\",\n",
    "    \"ca-central-1\": \"464438896020\",\n",
    "    \"me-south-1\": \"836785723513\",\n",
    "    \"af-south-1\": \"774647643957\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5abb439f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "region = boto3.Session().region_name\n",
    "if region not in account_id_map:\n",
    "    # Raising a plain string is a TypeError in Python 3; raise a real exception instead\n",
    "    raise ValueError(f\"Unsupported region: {region}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "226d4db3",
   "metadata": {},
   "source": [
    "## Generate Pretrained Models"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17fed9ec",
   "metadata": {},
   "source": [
    "The following steps generate our pretrained model artifact:\n",
    "\n",
    "1. Load a pretrained model from the Hugging Face Hub\n",
    "\n",
    "2. Export the model to ONNX and save the `.onnx` file\n",
    "\n",
    "3. Create a model artifact comprising the ONNX file and a model configuration (`config.pbtxt`) for Triton serving\n",
    "\n",
    "Helper functions have been created for each of these steps and are imported from the `utils.model_utils` local module."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ea86725-e521-43d0-8b08-92a6fc0fb8d6",
   "metadata": {},
   "source": [
    "#### Load a model that returns 2 tensors as predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87f0bad5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Warn about untested models, then load the tokenizer and model either way\n",
    "if model_name not in tested_models:\n",
    "    warnings.warn(f\"{model_name} has not been tested and may not work\")\n",
    "tokenizer, model = model_utils.get_model_from_hf_hub(model_name)\n",
    "model.eval()\n",
    "\n",
    "print(f\"loaded model {model_name} with {model_utils.count_parameters(model)} parameters\")\n",
    "\n",
    "example_input = tokenizer(\n",
    "    \"This is a sample\",\n",
    "    padding=\"max_length\",\n",
    "    max_length=max_seq_len,\n",
    "    return_tensors=\"pt\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a42e669f-7254-4e8b-bbeb-0387bd874799",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load RoBERTa directly from the Hugging Face Hub\n",
    "from transformers import AutoTokenizer, AutoModel\n",
    "\n",
    "max_seq_len = 512\n",
    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    "print(\"Using {} device\".format(device))\n",
    "\n",
    "# AutoModel loads the base encoder without a task-specific head\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"roberta-base\")  # or roberta-large\n",
    "model = AutoModel.from_pretrained(\"roberta-base\", torchscript=True)  # or roberta-large\n",
    "model = model.eval()\n",
    "\n",
    "# This cell loads roberta-base (hidden size 768); model_name only names the packaged artifacts\n",
    "print(f\"loaded roberta-base (packaged as {model_name}) with {model_utils.count_parameters(model)} parameters\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3223d338-3ed9-4b50-81e6-20ea5ec7a794",
   "metadata": {},
   "outputs": [],
   "source": [
    "example_input = tokenizer(\n",
    "    \"This is a sample\",\n",
    "    padding=\"max_length\",\n",
    "    max_length=max_seq_len,\n",
    "    return_tensors=\"pt\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecc6b4a6-f8c6-43f9-906b-41ba91019a52",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "example_input[\"input_ids\"].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b18fd7ca-4108-48ad-a973-4e9b3b7ab6b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "example_input.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2acea252-5917-4324-a62a-f80947ea81af",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "model(**example_input)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d581bc0a-d900-47b4-8cb2-378977862d8f",
   "metadata": {},
   "outputs": [],
   "source": [
    "model(example_input[\"input_ids\"], example_input[\"attention_mask\"])[0].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c187f7c",
   "metadata": {},
   "source": [
    "## Benchmark an ONNX model\n",
    "Converting the PyTorch model to a TensorRT model takes the following steps. This notebook stops after step 1 and serves the ONNX model directly.\n",
    "\n",
    "1. Convert the model to ONNX. ONNX is an intermediate file format that is framework agnostic. It works with models in TF, PyTorch and more.\n",
    "\n",
    "2. Spin up a SageMaker Processing job to convert the ONNX model to a TensorRT model plan. You export the weights of your model from the framework and load them into your TensorRT network.\n",
    "\n",
    "<div class=\"alert-danger\" role=\"alert\"> <strong> Warning: </strong>\n",
    "For TensorRT models, the hosting instance type must match the instance type the model is compiled on. To help ensure compatibility, compile the model using a SageMaker Processing job. Note that this incurs the additional cost of running the job. Also, SageMaker Processing does not currently support the ml.g5 family of instances, so these compiled models will only run on the ml.g4dn family of instances. To deploy on an ml.g5 instance, compile the model on an EC2 or a SageMaker notebook instance.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5d9d01c-5edc-46c3-8aff-0467859d7ce1",
   "metadata": {},
   "source": [
    "## Packaging the ONNX model for Triton server on SageMaker\n",
    "\n",
    "**Note**: SageMaker expects the model tarball file to have a top level directory with the same name as the model defined in the `config.pbtxt`.\n",
    "\n",
    "```\n",
    "model_name\n",
    "\u251c\u2500\u2500 1\n",
    "\u2502   \u2514\u2500\u2500 model.onnx\n",
    "\u2514\u2500\u2500 config.pbtxt\n",
    "```"
   ]
  },
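  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the layout above, the tarball could also be built with Python's standard `tarfile` module instead of shell commands. The helper name `package_model` and its arguments are assumptions made for illustration; they are not part of `utils.model_utils`.\n",
    "\n",
    "```python\n",
    "import tarfile\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def package_model(repo_dir, model_dir_name, tar_path):\n",
    "    # Hypothetical helper: archive <repo_dir>/<model_dir_name> so that\n",
    "    # <model_dir_name>/ is the single top-level directory in the tarball.\n",
    "    source_dir = Path(repo_dir) / model_dir_name\n",
    "    with tarfile.open(tar_path, 'w:gz') as tar:\n",
    "        # arcname roots the archive at the model directory name\n",
    "        tar.add(str(source_dir), arcname=model_dir_name)\n",
    "    # Re-open the archive and list its members as a quick sanity check\n",
    "    with tarfile.open(tar_path, 'r:gz') as tar:\n",
    "        return sorted(tar.getnames())\n",
    "```\n",
    "\n",
    "For example, `package_model('triton-serve-pt', 'roberta-large', 'roberta-large-v1.tar.gz')` should list `roberta-large/config.pbtxt` and `roberta-large/1/model.onnx` once those files exist."
   ]
  },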
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddb98561-812d-49de-855d-8fbac57178a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import torch\n",
    "from pathlib import Path\n",
    "from typing import Union, List\n",
    "from jinja2 import Environment, FileSystemLoader\n",
    "import sagemaker\n",
    "from sagemaker.processing import ScriptProcessor\n",
    "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
    "import shutil\n",
    "import subprocess\n",
    "import tarfile\n",
    "import transformers\n",
    "from transformers import AutoTokenizer, AutoModel\n",
    "from transformers.onnx import FeaturesManager\n",
    "import time\n",
    "import timm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75f582ff-81f5-4fee-8f50-aeb8a4d15f01",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "onnx_path = Path(\"onnx-models\")\n",
    "onnx_path.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bc0ae61",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "onnx_model_path = Path(f\"onnx-models/{model_name}\")\n",
    "onnx_model_path.mkdir(exist_ok=True)\n",
    "exported_onnx_path, onnx_config = model_utils.export_onnx_nlp(\n",
    "    model=model, tokenizer=tokenizer, save_path=onnx_model_path\n",
    ")\n",
    "exported_onnx_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12e96538-6fc8-43d4-999f-841b530037a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "onnx_model_path = Path(f\"triton-serve-pt/{model_name}/1\")\n",
    "onnx_model_path.mkdir(parents=True, exist_ok=True)\n",
    "print(exported_onnx_path)\n",
    "\n",
    "%cp {exported_onnx_path} {onnx_model_path}\n",
    "!cd {onnx_model_path} && rm -rf \".ipynb_checkpoints\"\n",
    "%ls -alrt {onnx_model_path}\n",
    "\n",
    "\n",
    "# onnx_model_path = model_utils.export_pt_jit(model, list(example_input.values()), pytorch_model_path) #export jit compiled model to specified directory"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bdd693cc-f49d-49d8-9a1f-6fda4a6c94b7",
   "metadata": {},
   "source": [
    "### Use this to test with 2 output tensors"
   ]
  },
  {
   "cell_type": "raw",
   "id": "f73e1af9-da3e-4c3d-a644-016e7e9f67a3",
   "metadata": {},
   "source": [
    "%%writefile triton-serve-pt/roberta-large/config.pbtxt\n",
    "name: \"nlp-onnx\"\n",
    "platform: \"onnxruntime_onnx\"\n",
    "backend: \"onnxruntime\" \n",
    "max_batch_size: 32\n",
    "\n",
    "  input {\n",
    "    name: \"input_ids\"\n",
    "    data_type: TYPE_INT64\n",
    "    dims: [512]\n",
    "  }\n",
    "  input {\n",
    "    name: \"attention_mask\"\n",
    "    data_type: TYPE_INT64\n",
    "    dims: [512]\n",
    "  }\n",
    "\n",
    "  output {\n",
    "    name: \"last_hidden_state\"\n",
    "    data_type: TYPE_FP32\n",
    "    dims: [-1, 768]\n",
    "  }\n",
    "  output {\n",
    "    name: \"1550\"\n",
    "    data_type: TYPE_FP32\n",
    "    dims: [768]\n",
    "  }\n",
    "instance_group {\n",
    "  count: 1\n",
    "  kind: KIND_GPU\n",
    "}\n",
    "dynamic_batching {\n",
    "    max_queue_delay_microseconds: 100\n",
    "    preferred_batch_size:5\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd4a5b92-a69f-4459-9e43-111a2ae8f1d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "triton_inputs = [\n",
    "    {\"name\": input_name, \"data_type\": \"TYPE_INT64\", \"dims\": \"[512]\"} for input_name in example_input\n",
    "]\n",
    "# - 1550, last_hidden_state\n",
    "triton_outputs = [\n",
    "    {\n",
    "        \"name\": \"last_hidden_state\",\n",
    "        \"data_type\": \"TYPE_FP32\",\n",
    "        \"dims\": \"[-1, 768]\",\n",
    "    },\n",
    "    {\n",
    "        \"name\": \"1550\",\n",
    "        \"data_type\": \"TYPE_FP32\",\n",
    "        \"dims\": \"[768]\",\n",
    "    },\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "647a98e6-9e9e-4028-a673-f56d50d964c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "triton_config_path = model_utils.generate_triton_config(\n",
    "    platform=\"onnx\",\n",
    "    triton_inputs=triton_inputs,\n",
    "    triton_outputs=triton_outputs,\n",
    "    save_path=Path(f\"triton-serve-pt/{model_name}\"),  # onnx_model_path\n",
    "    template_path=\"config_templates\",\n",
    ")\n",
    "triton_config_path"
   ]
  },
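  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The template used by `model_utils.generate_triton_config` is not shown in this notebook. As a rough sketch of what such a step produces, the hypothetical `render_config` function below builds the text of a minimal `config.pbtxt` from dicts shaped like `triton_inputs` and `triton_outputs`. The function name, its defaults, and the set of emitted fields are assumptions for illustration only.\n",
    "\n",
    "```python\n",
    "def render_config(name, inputs, outputs, max_batch_size=32):\n",
    "    # Hypothetical stand-in for a config template: returns config.pbtxt text.\n",
    "    lines = [\n",
    "        f'name: \"{name}\"',\n",
    "        'backend: \"onnxruntime\"',\n",
    "        f'max_batch_size: {max_batch_size}',\n",
    "    ]\n",
    "    for kind, tensors in (('input', inputs), ('output', outputs)):\n",
    "        for t in tensors:\n",
    "            t_name, t_type, t_dims = t['name'], t['data_type'], t['dims']\n",
    "            # One input { ... } or output { ... } block per tensor\n",
    "            lines.append(f'{kind} {{')\n",
    "            lines.append(f'  name: \"{t_name}\"')\n",
    "            lines.append(f'  data_type: {t_type}')\n",
    "            lines.append(f'  dims: {t_dims}')\n",
    "            lines.append('}')\n",
    "    return '\\n'.join(lines)\n",
    "```\n",
    "\n",
    "Calling `render_config(model_name, triton_inputs, triton_outputs)` returns a string whose `name` matches the top-level directory in the tarball, which is the constraint noted in the packaging section."
   ]
  },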
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c717d249-f4a3-42dd-9c00-1ba080d584a0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "#!cat {onnx_model_path}/config.pbtxt\n",
    "print(triton_config_path)\n",
    "%cat {triton_config_path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08083250-60d9-4161-abf4-554504ea8d2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "281a5807-96e4-4d5a-b754-53a4697d3bf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "onnx_model_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01b3096e-c89e-4d8d-957f-216fccd8838c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "tar_file_name = \"roberta-large-v1.tar.gz\"\n",
    "\n",
    "!cd triton-serve-pt && rm -f {tar_file_name}\n",
    "!cd triton-serve-pt/roberta-large && rm -rf \".ipynb_checkpoints\"\n",
    "!cd triton-serve-pt && tar czvf {tar_file_name} roberta-large\n",
    "#!cp triton-serve-pt/roberta-large-v1.tar.gz triton-server-pt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b78b4dde-8a91-4c56-93f2-5fd15ad1adf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "initial_model_path = sess.upload_data(\n",
    "    f\"./triton-serve-pt/{tar_file_name}\",\n",
    "    bucket=bucket,\n",
    "    key_prefix=f\"{prefix}/{model_name}\",\n",
    ")\n",
    "print(f\"uploaded model path = {initial_model_path}\")\n",
    "\n",
    "mme_path = f\"s3://{bucket}/{prefix}/{model_name}/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68f8dc24-00a1-4311-82c5-23e4ebbc1962",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "initial_model_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd63a9fb-cc9a-4ca0-9c8f-5015fdb48b9a",
   "metadata": {},
   "outputs": [],
   "source": [
    "tar_file_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc9df01a-012a-4b1b-856b-1034d71872e3",
   "metadata": {},
   "source": [
    "### Create the SageMaker endpoint for ONNX"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "876a27c6-a677-4423-8ec9-c71141ea091a",
   "metadata": {},
   "outputs": [],
   "source": [
    "example_input = tokenizer(\n",
    "    \"This is a sample\",\n",
    "    padding=\"max_length\",\n",
    "    max_length=max_seq_len,\n",
    "    return_tensors=\"pt\",\n",
    ")\n",
    "example_input[\"input_ids\"].shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef8a93cd-c163-432c-87c9-77f57810b173",
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils.endpoint_utils import (\n",
    "    create_endpoint,\n",
    "    delete_endpoint,\n",
    "    get_instance_utilization,\n",
    "    run_load_test,\n",
    ")\n",
    "\n",
    "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n",
    "mme_triton_image_uri = (\n",
    "    f\"{account_id_map[region]}.dkr.ecr.{region}.{base}\" + \"/sagemaker-tritonserver:22.10-py3\"\n",
    ")\n",
    "print(mme_triton_image_uri)\n",
    "instance_type = \"ml.g4dn.4xlarge\"  # 'ml.g4dn.2xlarge' --"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70e7a203-caa6-406a-a86e-2d4dcd425fd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(mme_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "092ef5ae-2760-4089-b29b-fa5bf35cf123",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(initial_model_path)\n",
    "print(mme_path)\n",
    "!aws s3 ls {mme_path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6a4f891-806e-4f98-9214-785951c0aa5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "container = {\n",
    "    \"Image\": mme_triton_image_uri,\n",
    "    \"ModelDataUrl\": mme_path,\n",
    "    \"Mode\": \"MultiModel\",\n",
    "    \"Environment\": {\n",
    "        #'SAGEMAKER_PROGRAM' : 'inference.py',\n",
    "        #'SAGEMAKER_SUBMIT_DIRECTORY' : 'code',\n",
    "        # \"SAGEMAKER_TRITON_BATCH_SIZE\": \"1\",\n",
    "        # \"SAGEMAKER_TRITON_MAX_BATCH_DELAY\": \"1000\",\n",
    "        \"SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE\": \"16777216000\",  # \"16777216\", #\"16777216000\",\n",
    "        \"SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE\": \"10485760\",\n",
    "        # \"SAGEMAKER_TRITON_LOG_VERBOSE\":\"false\"\n",
    "    },\n",
    "}\n",
    "container"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73dc966d-9493-44e4-95fa-252e40fcad07",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = model_name.replace(\"_\", \"-\")\n",
    "\n",
    "engine = \"onnx\"\n",
    "\n",
    "sm_model_name = f\"{model_name}-{engine}-gpu-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n",
    "\n",
    "create_model_response = sm_client.create_model(\n",
    "    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n",
    ")\n",
    "\n",
    "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])\n",
    "\n",
    "endpoint_config_name = f\"{model_name}-{engine}-gpu-\" + time.strftime(\n",
    "    \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n",
    ")\n",
    "\n",
    "create_endpoint_config_response = sm_client.create_endpoint_config(\n",
    "    EndpointConfigName=endpoint_config_name,\n",
    "    ProductionVariants=[\n",
    "        {\n",
    "            \"InstanceType\": instance_type,\n",
    "            \"InitialVariantWeight\": 1,\n",
    "            \"InitialInstanceCount\": 1,\n",
    "            \"ModelName\": sm_model_name,\n",
    "            \"VariantName\": \"AllTraffic\",\n",
    "            # \"VolumeSizeInGB\" : 256\n",
    "        }\n",
    "    ],\n",
    ")\n",
    "print(\"ebs volume NOT added\")\n",
    "\n",
    "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])\n",
    "\n",
    "endpoint_name = f\"{model_name}-{engine}-gpu-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n",
    "\n",
    "create_endpoint_response = sm_client.create_endpoint(\n",
    "    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n",
    ")\n",
    "\n",
    "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])\n",
    "\n",
    "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n",
    "status = resp[\"EndpointStatus\"]\n",
    "print(\"Status: \" + status)\n",
    "\n",
    "while status == \"Creating\":\n",
    "    time.sleep(60)\n",
    "    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n",
    "    status = resp[\"EndpointStatus\"]\n",
    "    print(\"Status: \" + status)\n",
    "\n",
    "print(\"Arn: \" + resp[\"EndpointArn\"])\n",
    "print(\"Status: \" + status)\n",
    "\n",
    "print(sm_model_name, endpoint_config_name, endpoint_name)"
   ]
  },
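  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The polling loop above only waits while the status is `Creating`, so a deployment that ends in `Failed` simply prints its status and the notebook continues. A minimal sketch of a stricter wait, with a timeout, might look like the following. The function `wait_for_endpoint` and its parameters are hypothetical; `describe` is any callable that returns a dict with an `EndpointStatus` key.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def wait_for_endpoint(describe, poll_seconds=60, timeout_seconds=1800, sleep=time.sleep):\n",
    "    # Poll until the endpoint leaves 'Creating' or the timeout elapses\n",
    "    waited = 0\n",
    "    status = describe()['EndpointStatus']\n",
    "    while status == 'Creating' and waited < timeout_seconds:\n",
    "        sleep(poll_seconds)\n",
    "        waited += poll_seconds\n",
    "        status = describe()['EndpointStatus']\n",
    "    # Anything other than 'InService' (e.g. 'Failed', or still 'Creating') is an error\n",
    "    if status != 'InService':\n",
    "        raise RuntimeError(f'endpoint ended in status {status}')\n",
    "    return status\n",
    "```\n",
    "\n",
    "In this notebook it could be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. boto3 also provides a built-in waiter, `sm_client.get_waiter('endpoint_in_service')`, which serves a similar purpose."
   ]
  },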
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c77c572c-cfda-4bb8-8cb1-957fe2ae5a99",
   "metadata": {},
   "outputs": [],
   "source": [
    "# sm_model_name, endpoint_config_name, endpoint_name = create_endpoint(sm_client, model_name, role, container, instance_type, \"onnx\")\n",
    "# print(sm_model_name, endpoint_config_name, endpoint_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7437b64-9ef2-4243-8335-d8ebc67d0bb0",
   "metadata": {},
   "outputs": [],
   "source": [
    "runtime_sm_client"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e2cf8c8-865f-4f65-993b-e9f1d225eade",
   "metadata": {},
   "source": [
    "#### Invoke the endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a377a831-e956-4021-b955-58819fca6791",
   "metadata": {},
   "outputs": [],
   "source": [
    "max_seq_len"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "156ec6db-dd69-4304-9a9d-c1213dd223b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make an inference request, which also loads the model into memory\n",
    "# Tokenize without return_tensors so the values are plain Python lists (no squeeze needed)\n",
    "\n",
    "example_input_nopt = tokenizer(\"This is a sample\", padding=\"max_length\", max_length=max_seq_len)\n",
    "\n",
    "print(tar_file_name)\n",
    "payload = {\n",
    "    \"inputs\": [\n",
    "        {\n",
    "            \"name\": \"input_ids\",\n",
    "            \"shape\": [1, max_seq_len],\n",
    "            \"datatype\": \"INT64\",\n",
    "            \"data\": example_input_nopt[\"input_ids\"],\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"attention_mask\",\n",
    "            \"shape\": [1, max_seq_len],\n",
    "            \"datatype\": \"INT64\",\n",
    "            \"data\": example_input_nopt[\"attention_mask\"],\n",
    "        },\n",
    "    ]\n",
    "}\n",
    "response = runtime_sm_client.invoke_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    ContentType=\"application/octet-stream\",\n",
    "    Body=json.dumps(payload),\n",
    "    TargetModel=f\"{tar_file_name}\",\n",
    "    # TargetModel=f\"roberta-large-v0.tar.gz\",\n",
    ")\n",
    "response"
   ]
  },
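  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The request body above follows a JSON layout with one entry per input tensor (`name`, `shape`, `datatype`, `data`). As a small sketch, the same body can be assembled by a helper so that the shape always matches the data length. The name `build_payload` is hypothetical, and the helper assumes a single sequence per request with values given as plain Python lists.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "\n",
    "def build_payload(encoded, input_names=('input_ids', 'attention_mask')):\n",
    "    # encoded maps each input name to a flat list of ints for one sequence\n",
    "    inputs = []\n",
    "    for name in input_names:\n",
    "        values = list(encoded[name])\n",
    "        inputs.append(\n",
    "            {\n",
    "                'name': name,\n",
    "                'shape': [1, len(values)],  # batch of one sequence\n",
    "                'datatype': 'INT64',\n",
    "                'data': values,\n",
    "            }\n",
    "        )\n",
    "    return json.dumps({'inputs': inputs})\n",
    "```\n",
    "\n",
    "With the tokenizer output from the cell above, `build_payload(example_input_nopt)` would produce a string that can be passed as `Body` to `invoke_endpoint`."
   ]
  },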
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74cfd775-f5f7-4393-bccb-f0eb74a85550",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "resp_bin = response[\"Body\"].read().decode(\"utf8\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11751e78-8717-4492-ac77-caf861e82600",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# -- keys are -- \"outputs\":[{\"name\":\"1550\",\"datatype\":\"FP32\",\"shape\":[1,768],\"data\": [0.0013,0,3433...]}]\n",
    "for data in json.loads(resp_bin)[\"outputs\"]:\n",
    "    shape_1 = list(data[\"shape\"])\n",
    "    dat_1 = np.array(data[\"data\"])\n",
    "    dat_1.resize(shape_1)\n",
    "    print(f\"Data outputs received back: shape: {dat_1.shape}\")\n",
    "# data_resp = json.loads(resp_bin)['outputs'][0]['data']\n",
    "# np.array(data_resp).shape"
   ]
  },
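  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop above prints only the shape of each output. As a sketch of how the same response could be turned into named arrays for further use, the hypothetical helper `parse_outputs` below returns a dict keyed by output name. It assumes the response text has the `outputs` layout shown in the comment above, with a flat `data` list and a `shape` per output.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def parse_outputs(response_text):\n",
    "    # Map each output name to a NumPy array reshaped to the reported shape\n",
    "    outputs = {}\n",
    "    for out in json.loads(response_text)['outputs']:\n",
    "        outputs[out['name']] = np.array(out['data']).reshape(out['shape'])\n",
    "    return outputs\n",
    "```\n",
    "\n",
    "For instance, `parse_outputs(resp_bin)['last_hidden_state']` would give the token-level hidden states, provided the output carries that name in the deployed model configuration."
   ]
  },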
  {
   "cell_type": "markdown",
   "id": "55eeca4e-0b3b-49ae-a7d8-b6f56ed04193",
   "metadata": {},
   "source": [
    "### Run load tests\n",
    "\n",
    "Uncomment the cells below to execute them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "586f3012-38a8-4e26-8525-ee2de0023aff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# import boto3\n",
    "\n",
    "# max_model_in_memory = 10\n",
    "\n",
    "# # upload several copies of the model under different names\n",
    "\n",
    "# s3_client = boto3.client(\"s3\")\n",
    "# for i in range(0, max_model_in_memory):\n",
    "#     with open(f'./triton-serve-pt/{tar_file_name}', \"rb\") as f:\n",
    "#         # the S3 key must be relative to the bucket, not the full s3:// URI\n",
    "#         s3_client.upload_fileobj(f, bucket, f\"{prefix}/{model_name}/{model_name}-v{i}.tar.gz\")\n",
    "#         print(f\"uploading:: {mme_path}{model_name}-v{i}.tar.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b366960d-9438-4ec5-bb11-98862105e403",
   "metadata": {},
   "outputs": [],
   "source": [
    "# models_loaded = 0\n",
    "# memory_utilization_threshold = 0.9\n",
    "# memory_utilization_history = []\n",
    "\n",
    "# max_model_in_memory = 10\n",
    "\n",
    "# import time\n",
    "\n",
    "# while models_loaded < max_model_in_memory:\n",
    "\n",
    "#     # make a inference request to load model into memory\n",
    "#     response = runtime_sm_client.invoke_endpoint(\n",
    "#             EndpointName=endpoint_name,\n",
    "#             ContentType=\"application/octet-stream\",\n",
    "#             Body=json.dumps(payload),\n",
    "#             TargetModel=f\"{model_name}-v{models_loaded}.tar.gz\",\n",
    "#         )\n",
    "\n",
    "#     models_loaded+=1\n",
    "\n",
    "\n",
    "#     print(f\"loaded {models_loaded} models\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3b0bfb9-64db-4cdb-900c-ba9dd91e4d87",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete the endpoint, the endpoint configuration, and the model\n",
    "sm_client.delete_endpoint(EndpointName=endpoint_name)\n",
    "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n",
    "sm_client.delete_model(ModelName=sm_model_name)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.g4dn.xlarge",
  "kernelspec": {
   "display_name": "Python 3 (Data Science 2.0)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-38"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}