{ "cells": [ { "cell_type": "markdown", "id": "3019e08e", "metadata": {}, "source": [ "# Triton on SageMaker - NLP Bert\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "3019e08e", "metadata": {}, "source": [ "\n", "[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.\n", "\n", "Now, [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.\n", "\n", "This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g4dn`." ] }, { "cell_type": "markdown", "id": "fe5e1c64", "metadata": {}, "source": [ "## Contents\n", "1. [Introduction to NVIDIA Triton Server](#Introduction-to-NVIDIA-Triton-Server)\n", "1. [Set up the environment](#Set-up-the-environment)\n", "1. [Add utility methods for preparing request payload](#Add-utility-methods-for-preparing-request-payload)\n", "1. [Basic: PyTorch NLP-Bert](#PyTorch-NLP-Bert)\n", " 1. [PyTorch: Packaging model files and uploading to s3](#PyTorch:-Packaging-model-files-and-uploading-to-s3)\n", " 1. [PyTorch: Create SageMaker Endpoint](#PyTorch:-Create-SageMaker-Endpoint)\n", " 1. [PyTorch: Run inference](#PyTorch:-Run-inference)\n", " 1. [PyTorch: Terminate endpoint and clean up artifacts](#PyTorch:-Terminate-endpoint-and-clean-up-artifacts)\n", "1. [Advanced: TensorRT NLP-Bert](#TensorRT-NLP-Bert)\n", " 1. [TensorRT: Packaging model files and uploading to s3](#TensorRT:-Packaging-model-files-and-uploading-to-s3)\n", " 1. [TensorRT: Create SageMaker Endpoint](#TensorRT:-Create-SageMaker-Endpoint)\n", " 1. [TensorRT: Run inference](#TensorRT:-Run-inference)\n", " 1. [TensorRT: Terminate endpoint and clean up artifacts](#TensorRT:-Terminate-endpoint-and-clean-up-artifacts)" ] }, { "cell_type": "markdown", "id": "f1515f05", "metadata": {}, "source": [ "## Introduction to NVIDIA Triton Server\n", "\n", "[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.\n", "\n", "Some key features of Triton are:\n", "* **Support for Multiple frameworks**: Triton can be used to deploy models from all major frameworks. 
Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats. \n", "* **Model pipelines**: Triton model ensemble represents a pipeline of one or more models or pre/post processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.\n", "* **Concurrent model execution**: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.\n", "* **Dynamic batching**: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.\n", "* **Diverse CPUs and GPUs**: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.\n", "\n", "**Note**: This initial release of NVIDIA Triton on SageMaker will only support a single model. Future releases will have multi-model support. A minimal `config.pbtxt` configuration file is **required** in the model artifacts. This release doesn't support inferring the model config automatically." ] }, { "cell_type": "markdown", "id": "4bc7961c", "metadata": {}, "source": [ "## Set up the environment\n", "\n", "Installs the dependencies required to package the model and run inferences using Triton server.\n", "\n", "Also define the IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton ECR image." ] }, { "cell_type": "code", "execution_count": null, "id": "eb5ae6e7", "metadata": {}, "outputs": [], "source": [ "!pip install -qU pip awscli boto3 sagemaker transformers==4.9.1\n", "!pip install nvidia-pyindex\n", "!pip install tritonclient[http]" ] }, { "cell_type": "code", "execution_count": null, "id": "6ba8bbc6", "metadata": {}, "outputs": [], "source": [ "import boto3, json, sagemaker, time\n", "from sagemaker import get_execution_role\n", "\n", "sess = boto3.Session()\n", "sm = sess.client(\"sagemaker\")\n", "sagemaker_session = sagemaker.Session(boto_session=sess)\n", "role = get_execution_role()\n", "client = boto3.client(\"sagemaker-runtime\")" ] }, { "cell_type": "code", "execution_count": null, "id": "393af1df", "metadata": {}, "outputs": [], "source": [ "account_id_map = {\n", " 'us-east-1': '785573368785',\n", " 'us-east-2': '007439368137',\n", " 'us-west-1': '710691900526',\n", " 'us-west-2': '301217895009',\n", " 'eu-west-1': '802834080501',\n", " 'eu-west-2': '205493899709',\n", " 'eu-west-3': '254080097072',\n", " 'eu-north-1': '601324751636',\n", " 'eu-south-1': '966458181534',\n", " 'eu-central-1': '746233611703',\n", " 'ap-east-1': '110948597952',\n", " 'ap-south-1': '763008648453',\n", " 'ap-northeast-1': '941853720454',\n", " 'ap-northeast-2': '151534178276',\n", " 'ap-southeast-1': '324986816169',\n", " 'ap-southeast-2': '355873309152',\n", " 'cn-northwest-1': '474822919863',\n", " 'cn-north-1': '472730292857',\n", " 'sa-east-1': '756306329178',\n", " 'ca-central-1': '464438896020',\n", " 'me-south-1': '836785723513',\n", " 'af-south-1': '774647643957'\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "e88b9be8", "metadata": {}, "outputs": [], "source": [ "region = boto3.Session().region_name\n", "if region not in account_id_map.keys():\n", " 
raise(\"UNSUPPORTED REGION\")" ] }, { "cell_type": "code", "execution_count": null, "id": "0e951400", "metadata": {}, "outputs": [], "source": [ "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", "triton_image_uri = \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:21.08-py3\".format(\n", " account_id=account_id_map[region], region=region, base=base\n", ")" ] }, { "cell_type": "markdown", "id": "e5469481", "metadata": {}, "source": [ "## Add utility methods for preparing request payload\n", "\n", "The following method transforms the sample text we will be using for inference into the payload that can be sent for inference to the Triton server." ] }, { "cell_type": "markdown", "id": "d1b2c4ca", "metadata": {}, "source": [ "The `tritonclient` package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference." ] }, { "cell_type": "code", "execution_count": null, "id": "469f89f9", "metadata": {}, "outputs": [], "source": [ "import tritonclient.http as httpclient\n", "from transformers import BertTokenizer\n", "import numpy as np\n", "\n", "\n", "def tokenize_text(text):\n", " enc = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n", " encoded_text = enc(text, padding=\"max_length\", max_length=128)\n", " return encoded_text[\"input_ids\"], encoded_text[\"attention_mask\"]\n", "\n", "\n", "def _get_sample_tokenized_text_binary(text, input_names, output_names):\n", " inputs = []\n", " outputs = []\n", " inputs.append(httpclient.InferInput(input_names[0], [1, 128], \"INT32\"))\n", " inputs.append(httpclient.InferInput(input_names[1], [1, 128], \"INT32\"))\n", " indexed_tokens, attention_mask = tokenize_text(text)\n", "\n", " indexed_tokens = np.array(indexed_tokens, dtype=np.int32)\n", " indexed_tokens = np.expand_dims(indexed_tokens, axis=0)\n", " inputs[0].set_data_from_numpy(indexed_tokens, binary_data=True)\n", "\n", " attention_mask = np.array(attention_mask, dtype=np.int32)\n", " attention_mask = np.expand_dims(attention_mask, axis=0)\n", " inputs[1].set_data_from_numpy(attention_mask, binary_data=True)\n", "\n", " outputs.append(httpclient.InferRequestedOutput(output_names[0], binary_data=True))\n", " outputs.append(httpclient.InferRequestedOutput(output_names[1], binary_data=True))\n", " request_body, header_length = httpclient.InferenceServerClient.generate_request_body(\n", " inputs, outputs=outputs\n", " )\n", " return request_body, header_length\n", "\n", "\n", "def get_sample_tokenized_text_binary_pt(text):\n", " return _get_sample_tokenized_text_binary(\n", " text, [\"INPUT__0\", \"INPUT__1\"], [\"OUTPUT__0\", \"1634__1\"]\n", " )\n", "\n", "\n", "def get_sample_tokenized_text_binary_trt(text):\n", " return _get_sample_tokenized_text_binary(text, [\"token_ids\", \"attn_mask\"], [\"output\", \"1634\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "914b72f2", "metadata": {}, "outputs": [], "source": [ "!docker run --gpus=all --rm -it \\\n", " -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:21.08-py3 \\\n", " /bin/bash generate_models.sh" ] }, { "cell_type": "markdown", "id": "8ce5f7a7", "metadata": {}, "source": [ "## PyTorch NLP-Bert\n", "\n", "For a simple use case we will take the pre-trained NLP Bert model from [Hugging Face](https://huggingface.co/transformers/model_doc/bert.html) and deploy it on SageMaker with Triton as the model 
server. The script for exporting this model can be found [here](./workspace/pt_exporter.py). This is run as part of the `generate_models.sh` script from the previous cell. After the model is serialized, we package it in the format that Triton and SageMaker expect. We use the pre-configured `config.pbtxt` file provided with this repo [here](./triton-serve-pt/bert/config.pbtxt) to specify the model [configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) that Triton uses to load the model. We tar the model directory and upload it to s3 to later create a [SageMaker Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html).\n", "\n", "**Note**: SageMaker expects the model tarball to have a top-level directory with the same name as the model defined in the `config.pbtxt`.\n", "\n", "```\n", "bert\n", "\u251c\u2500\u2500 1\n", "\u2502 \u2514\u2500\u2500 model.pt\n", "\u2514\u2500\u2500 config.pbtxt\n", "```" ] }, { "cell_type": "markdown", "id": "5623e0f7", "metadata": {}, "source": [ "### PyTorch: Packaging model files and uploading to s3" ] }, { "cell_type": "code", "execution_count": null, "id": "5fe7ed0d", "metadata": {}, "outputs": [], "source": [ "!mkdir -p triton-serve-pt/bert/1/\n", "!cp workspace/model.pt triton-serve-pt/bert/1/\n", "!tar -C triton-serve-pt/ -czf model.tar.gz bert\n", "model_uri = sagemaker_session.upload_data(path=\"model.tar.gz\", key_prefix=\"triton-serve-pt\")" ] }, { "cell_type": "markdown", "id": "080e5e0b", "metadata": {}, "source": [ "### PyTorch: Create SageMaker Endpoint\n", "\n", "We start off by creating a [SageMaker model](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) from the model files we uploaded to s3 in the previous step.\n", "\n", "In this step we also provide an additional environment variable, `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`, which specifies the name of the model to be loaded by Triton. **The value of this key should match the folder name in the model package uploaded to s3**. This variable is optional for a single model. For ensemble models, this key **has to be** specified for Triton to start up in SageMaker.\n", "\n", "Additionally, customers can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` to optimize the thread counts.\n", "\n", "*Note*: The current release of Triton (21.08-py3) on SageMaker doesn't support running instances of different models on the same server, except in the case of [ensembles](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#ensemble-models). Only multiple model instances of the same model are supported, which can be specified under the [instance-groups](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups) section of the config.pbtxt file."
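, "\n", "\nFor reference, the following is a rough, illustrative sketch of what the `config.pbtxt` for this model could look like. The input names, datatypes, and shapes match the inference payloads used later in this notebook; the output datatypes and dimensions and the `instance_group` entry are assumptions shown for illustration only, and the file provided with this repo at [triton-serve-pt/bert/config.pbtxt](./triton-serve-pt/bert/config.pbtxt) is the authoritative version.\n", "\n", "```\n", "name: \"bert\"\n", "platform: \"pytorch_libtorch\"\n", "input [\n", "  {\n", "    name: \"INPUT__0\"\n", "    data_type: TYPE_INT32\n", "    dims: [1, 128]\n", "  },\n", "  {\n", "    name: \"INPUT__1\"\n", "    data_type: TYPE_INT32\n", "    dims: [1, 128]\n", "  }\n", "]\n", "output [\n", "  {\n", "    name: \"OUTPUT__0\"\n", "    data_type: TYPE_FP32\n", "    dims: [1, 128, 768]\n", "  },\n", "  {\n", "    name: \"1634__1\"\n", "    data_type: TYPE_FP32\n", "    dims: [1, 768]\n", "  }\n", "]\n", "instance_group [\n", "  {\n", "    count: 1\n", "    kind: KIND_GPU\n", "  }\n", "]\n", "```"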
] }, { "cell_type": "code", "execution_count": null, "id": "bfcd015d", "metadata": {}, "outputs": [], "source": [ "sm_model_name = \"triton-nlp-bert-pt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "container = {\n", " \"Image\": triton_image_uri,\n", " \"ModelDataUrl\": model_uri,\n", " \"Environment\": {\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"bert\"},\n", "}\n", "\n", "create_model_response = sm.create_model(\n", " ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n", ")\n", "\n", "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])" ] }, { "cell_type": "markdown", "id": "09337c37", "metadata": {}, "source": [ "Using the model above, we create an [endpoint configuration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) where we can specify the type and number of instances we want in the endpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "54090bdc", "metadata": {}, "outputs": [], "source": [ "endpoint_config_name = \"triton-nlp-bert-pt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_endpoint_config_response = sm.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.g4dn.4xlarge\",\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": sm_model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])" ] }, { "cell_type": "markdown", "id": "9f524cb7", "metadata": {}, "source": [ "Using the above endpoint configuration we create a new sagemaker endpoint and wait for the deployment to finish. The status will change to **InService** once the deployment is successful." ] }, { "cell_type": "code", "execution_count": null, "id": "add88402", "metadata": {}, "outputs": [], "source": [ "endpoint_name = \"triton-nlp-bert-pt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_endpoint_response = sm.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "472f93d4", "metadata": {}, "outputs": [], "source": [ "resp = sm.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "082ecd3a", "metadata": {}, "source": [ "### PyTorch: Run inference\n", "\n", "Once we have the endpoint running we can use a sample text to do an inference using json as the payload format. For inference request format, Triton uses the KFServing community standard [inference protocols](https://github.com/triton-inference-server/server/blob/main/docs/protocol/README.md)." 
] }, { "cell_type": "code", "execution_count": null, "id": "9d703109", "metadata": {}, "outputs": [], "source": [ "text_triton = \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\"\n", "input_ids, attention_mask = tokenize_text(text_triton)\n", "\n", "payload = {\n", " \"inputs\": [\n", " {\"name\": \"INPUT__0\", \"shape\": [1, 128], \"datatype\": \"INT32\", \"data\": input_ids},\n", " {\"name\": \"INPUT__1\", \"shape\": [1, 128], \"datatype\": \"INT32\", \"data\": attention_mask},\n", " ]\n", "}\n", "\n", "response = client.invoke_endpoint(\n", " EndpointName=endpoint_name, ContentType=\"application/octet-stream\", Body=json.dumps(payload)\n", ")\n", "\n", "print(json.loads(response[\"Body\"].read().decode(\"utf8\")))" ] }, { "cell_type": "markdown", "id": "b1f50ae9", "metadata": {}, "source": [ "We can also use binary+json as the payload format to get better performance for the inference call. The specification of this format is provided [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md).\n", "\n", "**Note:** With the `binary+json` format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header `application/vnd.sagemaker-triton.binary+json;json-header-size={}`.\n", "\n", "Please not, this is different from using `Inference-Header-Content-Length` header on a stand-alone Triton server since custom headers are not allowed in SageMaker." ] }, { "cell_type": "code", "execution_count": null, "id": "1592cc0f", "metadata": {}, "outputs": [], "source": [ "text_sm = \"Amazon SageMaker helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML.\"\n", "request_body, header_length = get_sample_tokenized_text_binary_pt(text_sm)\n", "\n", "response = client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/vnd.sagemaker-triton.binary+json;json-header-size={}\".format(\n", " header_length\n", " ),\n", " Body=request_body,\n", ")\n", "\n", "# Parse json header size length from the response\n", "header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n", "header_length_str = response[\"ContentType\"][len(header_length_prefix) :]\n", "\n", "# Read response body\n", "result = httpclient.InferenceServerClient.parse_response_body(\n", " response[\"Body\"].read(), header_length=int(header_length_str)\n", ")\n", "output0_data = result.as_numpy(\"OUTPUT__0\")\n", "output1_data = result.as_numpy(\"1634__1\")\n", "print(output0_data)\n", "print(output1_data)" ] }, { "cell_type": "markdown", "id": "64d14ab1", "metadata": {}, "source": [ "### PyTorch: Terminate endpoint and clean up artifacts" ] }, { "cell_type": "code", "execution_count": null, "id": "e9a24ea4", "metadata": {}, "outputs": [], "source": [ "sm.delete_model(ModelName=sm_model_name)\n", "sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm.delete_endpoint(EndpointName=endpoint_name)" ] }, { "cell_type": "markdown", "id": "4a766ab5", "metadata": {}, "source": [ "## TensorRT NLP-Bert\n", "\n", "Another way to improve performance is to convert the PyTorch NLP-Bert model to a TensorRT plan and use it natively to run inferences on Triton. 
By using the [onnx_exporter.py](./workspace/onnx_exporter.py) script and `trtexec`, we create a TensorRT plan from the pre-trained PyTorch NLP-Bert model. This is already done as part of the `generate_models.sh` script that we ran earlier in this notebook. We'll package the model and the provided `config.pbtxt` according to the Triton model specification and upload it to s3 to create a SageMaker model and endpoint." ] }, { "cell_type": "markdown", "id": "db82ec3e", "metadata": {}, "source": [ "### TensorRT: Packaging model files and uploading to s3" ] }, { "cell_type": "code", "execution_count": null, "id": "ec60f1ee", "metadata": {}, "outputs": [], "source": [ "!mkdir -p triton-serve-trt/bert/1/\n", "!cp workspace/model_bs16.plan triton-serve-trt/bert/1/model.plan\n", "!tar -C triton-serve-trt/ -czf model.tar.gz bert\n", "model_uri = sagemaker_session.upload_data(path=\"model.tar.gz\", key_prefix=\"triton-serve-trt\")" ] }, { "cell_type": "markdown", "id": "3e9a6d1c", "metadata": {}, "source": [ "### TensorRT: Create SageMaker Endpoint" ] }, { "cell_type": "code", "execution_count": null, "id": "d3b94fc5", "metadata": {}, "outputs": [], "source": [ "sm_model_name = \"triton-nlp-bert-trt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "container = {\n", " \"Image\": triton_image_uri,\n", " \"ModelDataUrl\": model_uri,\n", " \"Environment\": {\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"bert\"},\n", "}\n", "\n", "create_model_response = sm.create_model(\n", " ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n", ")\n", "\n", "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "9ce70635", "metadata": {}, "outputs": [], "source": [ "endpoint_config_name = \"triton-nlp-bert-trt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_endpoint_config_response = sm.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.g4dn.4xlarge\",\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": sm_model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "5cb8240b", "metadata": {}, "outputs": [], "source": [ "endpoint_name = \"triton-nlp-bert-trt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_endpoint_response = sm.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "286fe559", "metadata": {}, "outputs": [], "source": [ "resp = sm.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "92594d24", "metadata": {}, "source": [ "### TensorRT: Run inference\n", "\n", "Once we have the endpoint running, we can run inference using both a json payload and a binary+json payload as described in the standard 
PyTorch deployment section." ] }, { "cell_type": "code", "execution_count": null, "id": "7bee3bd8", "metadata": {}, "outputs": [], "source": [ "input_ids, attention_mask = tokenize_text(text_triton)\n", "\n", "payload = {\n", " \"inputs\": [\n", " {\"name\": \"token_ids\", \"shape\": [1, 128], \"datatype\": \"INT32\", \"data\": input_ids},\n", " {\"name\": \"attn_mask\", \"shape\": [1, 128], \"datatype\": \"INT32\", \"data\": attention_mask},\n", " ]\n", "}\n", "response = client.invoke_endpoint(\n", " EndpointName=endpoint_name, ContentType=\"application/octet-stream\", Body=json.dumps(payload)\n", ")\n", "\n", "print(json.loads(response[\"Body\"].read().decode(\"utf8\")))" ] }, { "cell_type": "code", "execution_count": null, "id": "897107cb", "metadata": {}, "outputs": [], "source": [ "request_body, header_length = get_sample_tokenized_text_binary_trt(text_sm)\n", "\n", "response = client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/vnd.sagemaker-triton.binary+json;json-header-size={}\".format(\n", " header_length\n", " ),\n", " Body=request_body,\n", ")\n", "\n", "# Parse json header size length from the response\n", "header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n", "header_length_str = response[\"ContentType\"][len(header_length_prefix) :]\n", "\n", "# Read response body\n", "result = httpclient.InferenceServerClient.parse_response_body(\n", " response[\"Body\"].read(), header_length=int(header_length_str)\n", ")\n", "output0_data = result.as_numpy(\"output\")\n", "output1_data = result.as_numpy(\"1634\")\n", "print(output0_data)\n", "print(output1_data)" ] }, { "cell_type": "markdown", "id": "cf72b317", "metadata": {}, "source": [ "### TensorRT: Terminate endpoint and clean up artifacts" ] }, { "cell_type": "code", "execution_count": null, "id": "86d43509", "metadata": {}, "outputs": [], "source": [ "sm.delete_endpoint(EndpointName=endpoint_name)\n", "sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm.delete_model(ModelName=sm_model_name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-triton|nlp_bert|triton_nlp_bert.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }