{ "cells": [ { "cell_type": "markdown", "id": "60a1fdce", "metadata": {}, "source": [ "# Triton on SageMaker - Deploying on Inferentia instance type\n" ] }, { "cell_type": "markdown", "id": "a6ce5cb1", "metadata": {}, "source": [ "\n", "[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.\n", "\n", "Now, [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.\n", "\n", "This notebook was tested on an Amazon SageMaker notebook instance of type `inf2.24xlarge`." ] }, { "cell_type": "markdown", "id": "4e5b70d9-d941-4c73-8ca1-0ecc31a53cb9", "metadata": {}, "source": [ "## Introduction to INF2/Neuronx\n", "\n", "AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost for your deep learning (DL) inference applications. AWS Inferentia2 accelerator delivers a major leap in performance and capabilities over first-generation AWS Inferentia. Inferentia2 delivers up to 4x higher throughput and up to 10x lower latency compared to Inferentia.\n", "\n", "They are optimized to deploy increasingly complex models, such as large language models (LLM) and vision transformers, at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators. You can now efficiently and cost-effectively deploy models with hundreds of billions of parameters across multiple accelerators on Inf2 instances.\n", "\n", "[AWS Neuron](https://aws.amazon.com/machine-learning/neuron/) is the SDK that helps developers deploy models on both AWS Inferentia accelerators and run your inference applications for natural language processing/understanding, language translation, text summarization, video and image generation, speech recognition, personalization, fraud detection, and more. It integrates natively with popular machine learning (ML) frameworks, such as PyTorch and TensorFlow, so that you can continue to use your existing code and workflows and run on Inferentia accelerators.\n" ] }, { "cell_type": "markdown", "id": "b16f14ea", "metadata": {}, "source": [ "## Introduction to NVIDIA Triton Server\n", "\n", "[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.\n", "\n", "Some key features of Triton are:\n", "* **Support for Multiple frameworks**: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats. 
\n", "* **Model pipelines**: Triton model ensemble represents a pipeline of one or more models or pre/post processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.\n", "* **Concurrent model execution**: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.\n", "* **Dynamic batching**: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.\n", "* **Diverse CPUs and GPUs**: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements." ] }, { "cell_type": "markdown", "id": "5efee277-f51a-4a8b-8746-5b2a72d84af3", "metadata": {}, "source": [ "## Deploying Inf2 models using python backend on SageMaker (SM) Triton Container\n", "\n", "In this notebook, we explore how to use the python backend of Triton Inference Server to load a Neuron-traced model on an inf2 accelerator on SageMaker, and run inference requests against it. To make it easier to get started, we use a model.py and a config.pbtxt generator helper as described here: https://github.com/triton-inference-server/python_backend/tree/main/inferentia#using-triton-with-inferentia-2-or-trn1. \n", "\n", "In this example, the model.py and config.pbtxt files are already provided for OPT-125M and GPTJ-6B using the above generator script, and then modifying them as per available Neuron APIs as described here - https://github.com/triton-inference-server/python_backend/tree/main/inferentia#transformers-neuronx.\n", "\n", "\n", "You can find more information on how to use the python backend of Triton in the documentation here: https://github.com/triton-inference-server/python_backend#usage, specifically the use of initialize(), execute(), and finalize() functions.\n", "\n", "There are two ways to load the model using python backend of Triton:\n", "1. Download and trace the neuron model on the SageMaker endpoint in the initialize() function itself. While this means that you can do the tracing directly on the same instance as you host the model, this process can take very long (several hours) before the endpoint is ready to serve the model.\n", "2. Trace the model on an inf2 instance, and upload the neuron-traced model to an S3 bucket. In the initalize() function, load the model from the current directory (relative to where the model.py file is present). In this process, the pre-traced model will take several minutes to load, making it easy to experiment with, and helps when restarting the endpoint multiple times.\n", "\n", "In this example, we will use option 2 above i.e. trace a neuron model on an inf2 instance, and use that on a SageMaker endpoint. This means that this notebook should be executed on an inf2 instance to trace the model, and then upload that to S3. 
This process only needs to happen once for a given version of the model.\n" ] }, { "cell_type": "markdown", "id": "cf042bea", "metadata": {}, "source": [ "### Set up the environment\n", "\n", "Installs the dependencies required to package the model and run inferences using Triton server.\n", "\n", "Also define the IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton ECR image." ] }, { "cell_type": "code", "execution_count": null, "id": "7788c22c", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!pip install -qU pip awscli boto3 sagemaker\n", "!pip install nvidia-pyindex\n", "!pip install tritonclient[http]" ] }, { "cell_type": "code", "execution_count": null, "id": "4aec711f", "metadata": { "tags": [] }, "outputs": [], "source": [ "import boto3, json, sagemaker, time\n", "from sagemaker import get_execution_role\n", "\n", "sm_client = boto3.client(service_name=\"sagemaker\", region_name=\"us-east-2\")\n", "runtime_sm_client = boto3.client(\"sagemaker-runtime\", region_name=\"us-east-2\")\n", "sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=\"us-east-2\"))\n", "role = get_execution_role(sagemaker_session=sagemaker_session)" ] }, { "cell_type": "code", "execution_count": null, "id": "bbc64c0b", "metadata": { "tags": [] }, "outputs": [], "source": [ "account_id_map = {\n", " \"us-east-1\": \"785573368785\",\n", " \"us-east-2\": \"007439368137\",\n", " \"us-west-1\": \"710691900526\",\n", " \"us-west-2\": \"301217895009\",\n", " \"eu-west-1\": \"802834080501\",\n", " \"eu-west-2\": \"205493899709\",\n", " \"eu-west-3\": \"254080097072\",\n", " \"eu-north-1\": \"601324751636\",\n", " \"eu-south-1\": \"966458181534\",\n", " \"eu-central-1\": \"746233611703\",\n", " \"ap-east-1\": \"110948597952\",\n", " \"ap-south-1\": \"763008648453\",\n", " \"ap-northeast-1\": \"941853720454\",\n", " \"ap-northeast-2\": \"151534178276\",\n", " \"ap-southeast-1\": \"324986816169\",\n", " \"ap-southeast-2\": \"355873309152\",\n", " \"cn-northwest-1\": \"474822919863\",\n", " \"cn-north-1\": \"472730292857\",\n", " \"sa-east-1\": \"756306329178\",\n", " \"ca-central-1\": \"464438896020\",\n", " \"me-south-1\": \"836785723513\",\n", " \"af-south-1\": \"774647643957\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "9616dd32", "metadata": {}, "outputs": [], "source": [ "region = boto3.Session().region_name\n", "if region not in account_id_map.keys():\n", " raise ValueError(\"UNSUPPORTED REGION\")" ] }, { "cell_type": "code", "execution_count": null, "id": "65934acb", "metadata": {}, "outputs": [], "source": [ "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", "triton_image_uri = \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.03-py3\".format(\n", " account_id=account_id_map[region], region=region, base=base\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "1404d239-9f01-4997-9eef-6d30280110d7", "metadata": { "tags": [] }, "outputs": [], "source": [ "triton_image_uri" ] }, { "cell_type": "markdown", "id": "7bb2cab3-c977-4d2e-b181-611b2773e30b", "metadata": {}, "source": [ "**Note: update the 'FROM' base image in Dockerfile located in docker/ folder as per above**" ] }, { "cell_type": "markdown", "id": "4f618f8e", "metadata": {}, "source": [ "### Add utility methods for preparing request payload\n", "\n", "The following method transforms a sample image we will be using for inference into the payload that can be sent for inference to the 
Triton server." ] }, { "cell_type": "code", "execution_count": null, "id": "5ef2c6e0", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from PIL import Image\n", "\n", "s3_client = boto3.client(\"s3\")\n", "s3_client.download_file(\n", " \"sagemaker-sample-files\", \"datasets/image/pets/shiba_inu_dog.jpg\", \"shiba_inu_dog.jpg\"\n", ")\n", "\n", "\n", "def get_sample_image():\n", " image_path = \"./shiba_inu_dog.jpg\"\n", " img = Image.open(image_path).convert(\"RGB\")\n", " img = img.resize((224, 224))\n", " img = (np.array(img).astype(np.float32) / 255) - np.array(\n", " [0.485, 0.456, 0.406], dtype=np.float32\n", " ).reshape(1, 1, 3)\n", " img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)\n", " img = np.transpose(img, (2, 0, 1))\n", " return img.tolist()" ] }, { "cell_type": "markdown", "id": "c171f622", "metadata": {}, "source": [ "The `tritonclient` package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference." ] }, { "cell_type": "code", "execution_count": null, "id": "f9799f41", "metadata": {}, "outputs": [], "source": [ "import tritonclient.http as httpclient\n", "\n", "\n", "def _get_sample_image_binary(input_name, output_name):\n", " inputs = []\n", " outputs = []\n", " inputs.append(httpclient.InferInput(input_name, [1, 3, 224, 224], \"FP32\"))\n", " input_data = np.array(get_sample_image(), dtype=np.float32)\n", " input_data = np.expand_dims(input_data, axis=0)\n", " inputs[0].set_data_from_numpy(input_data, binary_data=True)\n", " outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))\n", " request_body, header_length = httpclient.InferenceServerClient.generate_request_body(\n", " inputs, outputs=outputs\n", " )\n", " return request_body, header_length\n", "\n", "\n", "def get_sample_image_binary_pt():\n", " return _get_sample_image_binary(\"INPUT__0\", \"OUTPUT__0\")\n", "\n", "\n", "def get_sample_image_binary_trt():\n", " return _get_sample_image_binary(\"input\", \"output\")" ] }, { "cell_type": "markdown", "id": "00f0f261-e960-4a00-a9ad-8a884f9f27aa", "metadata": {}, "source": [ "### Set up the Tritonserver Container with neuronx libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "64505ce0-5bb1-4bf6-a8da-d9a1dca46e78", "metadata": { "tags": [] }, "outputs": [], "source": [ "!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id_map[region]}.dkr.ecr.{region}.amazonaws.com" ] }, { "cell_type": "code", "execution_count": null, "id": "8b45184a-87c2-4297-8a5f-d29912f0ab55", "metadata": {}, "outputs": [], "source": [ "!mkdir -p docker/mylib" ] }, { "cell_type": "markdown", "id": "5ea0bd27-0e80-44b6-bb1e-322c34dbb9cb", "metadata": {}, "source": [ "Sockets and rules required by the container to interact with neuron hardware" ] }, { "cell_type": "code", "execution_count": null, "id": "c887b597-effe-44a6-9c8a-f644b5c53071", "metadata": {}, "outputs": [], "source": [ "!cp -R /lib/udev/rules.d/* docker/mylib/" ] }, { "cell_type": "code", "execution_count": null, "id": "b31ccacc-abce-4fb2-aaf1-31cd7db68eae", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!docker build --no-cache -t tritonserver-neuronx docker/ --build-arg SM_TRITON_IMAGE_URI={triton_image_uri}" ] }, { "cell_type": "code", "execution_count": null, "id": 
"d5942ca5-cdbb-49b1-a234-5662043aa056", "metadata": {}, "outputs": [], "source": [ "# Upload the container to ECR for SageMaker to consume\n", "# inf2 instances are available in us-east-2\n", "curr_account_id = boto3.client(\"sts\").get_caller_identity().get(\"Account\")\n", "inf2_region = \"us-east-2\"\n", "\n", "triton_neuronx_image_uri = (\n", " f\"{curr_account_id}.dkr.ecr.{inf2_region}.amazonaws.com/sagemaker-tritonserver:23.03-py3\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "fa51efe0-764f-4930-926b-d2215e54dfb0", "metadata": {}, "outputs": [], "source": [ "!docker tag tritonserver-neuronx:latest {triton_neuronx_image_uri}" ] }, { "cell_type": "code", "execution_count": null, "id": "6de84451-dafd-4aad-8624-7818b7be0389", "metadata": {}, "outputs": [], "source": [ "!aws ecr get-login-password --region {inf2_region} | docker login --username AWS --password-stdin {curr_account_id}.dkr.ecr.{inf2_region}.amazonaws.com" ] }, { "cell_type": "markdown", "id": "554b50cb-4e32-4ad2-8d59-0391a2294c98", "metadata": {}, "source": [ "Push the container image.\n", "Note: This command assumes that that the ECR repository `sagemaker-tritonserver` exists in the account. If it doesn't, please create one" ] }, { "cell_type": "code", "execution_count": null, "id": "f0bf5e03-9317-4b94-9277-22a6f57e2ea2", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!docker push {triton_neuronx_image_uri}" ] }, { "cell_type": "code", "execution_count": null, "id": "55b483bf-39ba-4fc3-bb0f-5a73d50282d5", "metadata": {}, "outputs": [], "source": [ "triton_neuronx_image_uri" ] }, { "cell_type": "markdown", "id": "d3c6ab5c-5991-4959-8b85-439ab44498ab", "metadata": {}, "source": [ "### PyTorch-Neuronx " ] }, { "cell_type": "code", "execution_count": null, "id": "d1c45fc3-6125-4127-bd41-2dda4d6b1016", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pip install torch-neuronx transformers-neuronx" ] }, { "cell_type": "code", "execution_count": null, "id": "53a28890-e67c-4f24-ab22-cbdde7ab52c0", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import os\n", "import urllib\n", "from PIL import Image\n", "\n", "import torch\n", "import torch_neuronx\n", "from torchvision import models\n", "from torchvision.transforms import functional\n", "\n", "\n", "def get_image(batch_size=1, image_shape=(224, 224)):\n", " # Get an example input\n", " filename = \"000000039769.jpg\"\n", " if not os.path.exists(filename):\n", " url = \"http://images.cocodataset.org/val2017/000000039769.jpg\"\n", " urllib.request.urlretrieve(url, filename)\n", " image = Image.open(filename).convert(\"RGB\")\n", " image = functional.resize(image, (image_shape))\n", " image = functional.to_tensor(image)\n", " image = torch.unsqueeze(image, 0)\n", " image = torch.repeat_interleave(image, batch_size, 0)\n", " return (image,)\n", "\n", "\n", "# Create the model\n", "model = models.resnet50(pretrained=True)\n", "model.eval()\n", "\n", "# Get an example input\n", "image = get_image()\n", "\n", "# Run inference on CPU\n", "output_cpu = model(*image)\n", "\n", "# Compile the model\n", "model_neuron = torch_neuronx.trace(model, image)\n", "\n", "# Save the TorchScript for inference deployment\n", "filename = \"model.pt\"\n", "torch.jit.save(model_neuron, filename)" ] }, { "cell_type": "code", "execution_count": null, "id": "056f1d2a-5bf2-4413-a41f-a11d5828c9a9", "metadata": {}, "outputs": [], "source": [ "!mkdir -p triton-pt-inf2/resnet/1\n", "!mv model.pt triton-pt-inf2/resnet/\n", "\n", "# Generate a 
pytorch-neuronx triton model.py and config.pbtxt file\n", "!git clone https://github.com/triton-inference-server/python_backend && cd python_backend\n", "!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,FP32,3x224x224 --triton_output OUTPUT__0,FP32,1000 --compiled_model model.pt --neuron_core_range 0:1 --triton_model_dir ../triton-pt-inf2/resnet --enable_dynamic_batching --max_batch_size 4" ] }, { "cell_type": "markdown", "id": "05b7fd73-2107-4705-922a-80dd7ef16833", "metadata": {}, "source": [ "Move the generated config.pbtxt and model.py to the model folder" ] }, { "cell_type": "code", "execution_count": null, "id": "cb219878-c268-4cbf-bb90-a457bed6278d", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!tar -C triton-pt-inf2/ -czf triton-pt-inf2.tar.gz resnet\n", "model_uri = sagemaker_session.upload_data(\n", " path=\"triton-pt-inf2.tar.gz\", key_prefix=\"triton-inf2-models\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "951a03c6-f8f8-43b8-b3ae-88270b0a7edd", "metadata": {}, "outputs": [], "source": [ "model_uri" ] }, { "cell_type": "markdown", "id": "15ac5e59-936e-4adb-a91e-67db42735307", "metadata": {}, "source": [ "Load the model" ] }, { "cell_type": "code", "execution_count": null, "id": "fd1b4432-fa25-4641-ac68-ac5a43404caf", "metadata": {}, "outputs": [], "source": [ "sm_model_name = \"triton-resnet-inf2-pt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "container = {\n", " \"Image\": triton_neuronx_image_uri,\n", " \"ModelDataUrl\": model_uri,\n", " \"Environment\": {\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"resnet\"},\n", "}\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n", ")\n", "\n", "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])" ] }, { "cell_type": "markdown", "id": "4dc97ed0-3155-4658-96f3-7e058c801e7c", "metadata": {}, "source": [ "Create endpoint configuration" ] }, { "cell_type": "code", "execution_count": null, "id": "935b3f39-ce23-4d68-9caf-f0e48ef9499f", "metadata": {}, "outputs": [], "source": [ "endpoint_config_name = \"triton-resnet-inf2-pt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.inf2.xlarge\",\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": sm_model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "ca2bf973-ff89-4d8c-93b7-976562d4623a", "metadata": {}, "outputs": [], "source": [ "endpoint_name = \"triton-resnet-inf2-pt-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "aada0ee8-c63d-469c-b99e-0e4c1b856583", "metadata": {}, "outputs": [], "source": [ "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == 
\"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "aa5cca31-bcd6-4ded-9b1a-085ee8e2094b", "metadata": {}, "source": [ "Run regular inference" ] }, { "cell_type": "code", "execution_count": null, "id": "8b7ff165-0cd9-4c09-87da-0145be751035", "metadata": { "scrolled": true }, "outputs": [], "source": [ "payload = {\n", " \"inputs\": [\n", " {\n", " \"name\": \"INPUT__0\",\n", " \"shape\": [1, 3, 224, 224],\n", " \"datatype\": \"FP32\",\n", " \"data\": get_sample_image(),\n", " }\n", " ]\n", "}\n", "\n", "response = runtime_sm_client.invoke_endpoint(\n", " EndpointName=endpoint_name, ContentType=\"application/octet-stream\", Body=json.dumps(payload)\n", ")\n", "\n", "print(json.loads(response[\"Body\"].read().decode(\"utf8\")))" ] }, { "cell_type": "markdown", "id": "de473c37-7e5a-4f72-bac1-06524622e41f", "metadata": {}, "source": [ "Run inference with binary payload" ] }, { "cell_type": "code", "execution_count": null, "id": "92139d44-cdd6-4e80-a48b-fb6af003cc9e", "metadata": { "scrolled": true }, "outputs": [], "source": [ "request_body, header_length = get_sample_image_binary_pt()\n", "\n", "response = runtime_sm_client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/vnd.sagemaker-triton.binary+json;json-header-size={}\".format(\n", " header_length\n", " ),\n", " Body=request_body,\n", ")\n", "\n", "# Parse json header size length from the response\n", "header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n", "header_length_str = response[\"ContentType\"][len(header_length_prefix) :]\n", "\n", "# Read response body\n", "result = httpclient.InferenceServerClient.parse_response_body(\n", " response[\"Body\"].read(), header_length=int(header_length_str)\n", ")\n", "output0_data = result.as_numpy(\"OUTPUT__0\")\n", "print(output0_data)" ] }, { "cell_type": "markdown", "id": "fbbd1c02-fcbf-4f7c-b05b-aba9775449de", "metadata": {}, "source": [ "### Terminate and cleanup" ] }, { "cell_type": "code", "execution_count": null, "id": "2ccd5819-c19e-4409-8095-1acd84dc581d", "metadata": {}, "outputs": [], "source": [ "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm_client.delete_model(ModelName=sm_model_name)" ] }, { "cell_type": "markdown", "id": "ca3088a0-09c8-47dc-b21e-270a6f82df51", "metadata": {}, "source": [ "## Run a Transformers based OPT-125M model\n", "In this section we will run a pre-trained neuron-traced OPT model on an inf2.24xl instance type" ] }, { "cell_type": "markdown", "id": "79ed11d0-9e14-46d4-955c-dd02c04e7867", "metadata": {}, "source": [ "Create and upload OPT-125M model on SageMaker" ] }, { "cell_type": "markdown", "id": "f079437f-f90d-4ff7-b90b-efbbc9625861", "metadata": {}, "source": [ "### Note: the following commands to compile/trace the model should be run on an inf2 instance" ] }, { "cell_type": "code", "execution_count": null, "id": "8369027e-6364-4850-a517-e170929264ef", "metadata": {}, "outputs": [], "source": [ "!pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "code", "execution_count": null, "id": "672aab23-9e71-4520-9194-3d62aa917a10", "metadata": {}, "outputs": [], "source": [ "import os\n", 
"import time\n", "import argparse\n", "import torch\n", "from transformers_neuronx.module import save_pretrained_split\n", "from transformers_neuronx.dtypes import to_torch_dtype\n", "from transformers_neuronx.opt.model import OPTForSampling\n", "from transformers import AutoModelForCausalLM, AutoTokenizer" ] }, { "cell_type": "code", "execution_count": null, "id": "a0fa64ea-1d3e-4c7d-80c5-bc31650cbba6", "metadata": {}, "outputs": [], "source": [ "def amp_callback(model, dtype):\n", " # cast attention and mlp to low precisions only; layernorms stay as f32\n", " for block in model.model.decoder.layers:\n", " block.self_attn.to(dtype)\n", " block.fc1.to(dtype)\n", " block.fc2.to(dtype)\n", " model.lm_head.to(dtype)\n", "\n", "\n", "def compile(\n", " model_name, batch_size, compiler_args, amp=\"bf16\", tp_degree=2, n_positions=2048, unroll=None\n", "):\n", " os.environ[\"NEURON_CC_FLAGS\"] = compiler_args\n", " os.environ[\"NEURONX_DUMP_TO\"] = \"opt-125m-tp12\"\n", " os.environ[\"NEURON_CACHE\"] = \"on\"\n", "\n", " # Split the GPT-J model for faster loading\n", " model_dir = f\"opt-125m-model\"\n", " if not os.path.exists(model_dir):\n", " model_cpu = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)\n", " dtype = to_torch_dtype(amp)\n", " amp_callback(model_cpu, dtype=dtype)\n", " save_pretrained_split(model_cpu, model_dir)\n", "\n", " # Create and compile the Neuron model.\n", " model_neuron = OPTForSampling.from_pretrained(\n", " model_dir,\n", " batch_size=batch_size,\n", " amp=amp,\n", " tp_degree=tp_degree,\n", " n_positions=n_positions,\n", " unroll=unroll,\n", " )\n", " model_neuron.to_neuron()" ] }, { "cell_type": "markdown", "id": "17500c6b-d59c-44bf-93af-7e1fb7fd6783", "metadata": {}, "source": [ "The above compilation to neuron may also be done in model.py (see below when creating the tar.gz archive). However, doing it here and packaging it with the model archive will save time when loading the model at the SageMaker endpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "3992c924-e5da-412a-b899-e9a116bb570a", "metadata": { "scrolled": true }, "outputs": [], "source": [ "compile(\"facebook/opt-125m\", 1, \"--model-type=transformer\", \"bf16\", 12)" ] }, { "cell_type": "markdown", "id": "6fe19a0a-1c82-422f-ac7b-8e3549e79145", "metadata": {}, "source": [ "The `.to_neuron` compilation will create a folder `opt-125m-tp12` to save the split model \n", "\n", "A `model.py` and `config.pbtxt` file has been made available in the `inf2_llm/opt-125m/opt` folder. Both the files have been generated using the `gen_triton_model.py` script, and modified as per the LLM HF model to be used" ] }, { "cell_type": "code", "execution_count": 14, "id": "a61000c1-c968-4231-a69c-9c42452e52d1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n# 1.0 Generate a pytorch-neuronx triton model.py and config.pbtxt file\\n\\n!git clone https://github.com/triton-inference-server/python_backend && cd python_backend\\n!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,INT64,128 --triton_output OUTPUT__0,INT64,2048 --compiled_model opt-125m-model/ --neuron_core_range 0:11 --triton_model_dir inf2_llm/opt-125m/opt --enable_dynamic_batching --max_batch_size 128\\n\\n# 2.0 Modify the model.py to use HF APIs as per the README here - https://github.com/triton-inference-server/python_backend/blob/main/inferentia/README.md#using-triton-with-inferentia-2-or-trn1. 
\\n'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Note that in this example, the following step is already complete and provided in the folder inf2_llm/opt-125m/opt\n", "\n", "\"\"\"\n", "# 1.0 Generate a pytorch-neuronx triton model.py and config.pbtxt file\n", "\n", "!git clone https://github.com/triton-inference-server/python_backend && cd python_backend\n", "!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,INT64,128 --triton_output OUTPUT__0,INT64,2048 --compiled_model opt-125m-model/ --neuron_core_range 0:11 --triton_model_dir inf2_llm/opt-125m/opt --enable_dynamic_batching --max_batch_size 128\n", "\n", "# 2.0 Modify the model.py to use HF APIs as per the README here - https://github.com/triton-inference-server/python_backend/blob/main/inferentia/README.md#using-triton-with-inferentia-2-or-trn1. \n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "3a679c5a-d37a-4e00-bcb2-b1def1e5acc9", "metadata": {}, "outputs": [], "source": [ "# Note: Use -L when copying artifacts to model folder to avoid symlinks\n", "# https://repost.aws/knowledge-center/sagemaker-endpoint-creation-fail\n", "!mkdir -p inf2_llm/opt-125m/opt/opt-125m-model\n", "!mkdir -p inf2_llm/opt-125m/opt/opt-125m-tp12\n", "!cp -R -L opt-125m-model/* inf2_llm/opt-125m/opt/opt-125m-model\n", "!cp -R -L opt-125m-tp12/* inf2_llm/opt-125m/opt/opt-125m-tp12" ] }, { "cell_type": "code", "execution_count": null, "id": "1483d51d-bddc-4dd4-b998-3009c8e4be22", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!tar -C inf2_llm/opt-125m/ -hcvzf triton-transformers-opt-inf2.tar.gz opt" ] }, { "cell_type": "code", "execution_count": null, "id": "daefe31f-c110-47ca-8f5f-235a62be1316", "metadata": {}, "outputs": [], "source": [ "model_uri = sagemaker_session.upload_data(\n", " path=\"triton-transformers-opt-inf2.tar.gz\", key_prefix=\"triton-inf2-models\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "0e343e0c-b9ad-4d77-91f5-a288a475769c", "metadata": {}, "outputs": [], "source": [ "model_uri" ] }, { "cell_type": "markdown", "id": "61f8bb6f-8605-4e81-8ed2-87e9fbbc4f52", "metadata": {}, "source": [ "#### We are now ready to start a SageMaker endpoint, and load the model" ] }, { "cell_type": "code", "execution_count": null, "id": "00756638-0be6-4891-a023-1f23ed5def3c", "metadata": {}, "outputs": [], "source": [ "sm_model_name = \"triton-opt-inf2-transformers-opt-\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "\n", "container = {\n", " \"Image\": triton_neuronx_image_uri,\n", " \"ModelDataUrl\": model_uri,\n", " \"Environment\": {\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"opt\"},\n", "}\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n", ")\n", "\n", "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])\n", "\n", "endpoint_config_name = \"triton-opt-inf2-transformers-opt-\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.inf2.24xlarge\",\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": sm_model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(\"Endpoint Config 
Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])\n", "\n", "endpoint_name = \"triton-opt-inf2-transformers-opt-\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "\n", "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])\n", "\n", "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "8ea72aed-98b2-4a2d-9eb6-8f65c04f671e", "metadata": {}, "source": [ "#### Once the model is 'InService', it is ready to be served. We now create the payload and run the inference request" ] }, { "cell_type": "code", "execution_count": null, "id": "4e562f03-c6e9-41b7-b27a-c2655cc192b1", "metadata": { "scrolled": true }, "outputs": [], "source": [ "from transformers import AutoTokenizer\n", "import torch\n", "\n", "prompt = \"this summer\"\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(\"facebook/opt-125m\")\n", "if not tokenizer.pad_token:\n", " tokenizer.pad_token = tokenizer.eos_token\n", "encoded_text = tokenizer.encode(prompt, padding=\"max_length\", max_length=128, truncation=True)\n", "# Run inference here\n", "\n", "payload = {\n", " \"inputs\": [{\"name\": \"INPUT__0\", \"shape\": [1, 128], \"datatype\": \"INT64\", \"data\": encoded_text}]\n", "}\n", "\n", "response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(payload))\n", "\n", "generated_sequence = json.loads(response[\"Body\"].read().decode(\"utf8\"))[\"outputs\"][0][\"data\"]\n", "output = \" \".join(tokenizer.batch_decode(generated_sequence, skip_special_tokens=True)).encode(\n", " \"utf-8\"\n", ")\n", "\n", "print(output)" ] }, { "cell_type": "markdown", "id": "2e4c3f87-90ed-43bb-b320-96fb657078ad", "metadata": {}, "source": [ "## Terminate Resources\n", "As this is a sample notebook, after a successful experiment, we will terminate the resources." 
] }, { "cell_type": "code", "execution_count": null, "id": "07d1ae58-e2d5-4d32-9ca5-e62287bb50d0", "metadata": {}, "outputs": [], "source": [ "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm_client.delete_model(ModelName=sm_model_name)" ] }, { "cell_type": "markdown", "id": "b3561d5a-9ab0-4205-85ee-4aefacc8f849", "metadata": {}, "source": [ "## Run a Transformers based GPTJ-6B model\n", "In this section we will run a pre-trained neuron-traced GPTJ-6B model on an inf2.24xl instance type" ] }, { "cell_type": "markdown", "id": "f42bec12", "metadata": {}, "source": [ "### PyTorch: Packaging model files and uploading to s3" ] }, { "cell_type": "markdown", "id": "ecd78917-ab23-46db-941b-8443c767448c", "metadata": {}, "source": [ "Create and upload GPTJ-6B model on SageMaker" ] }, { "cell_type": "markdown", "id": "907d3eb5-acf2-4f10-843a-715e82ea51d6", "metadata": {}, "source": [ "### Note: the following commands to compile/trace the model should be run on an inf2 instance" ] }, { "cell_type": "code", "execution_count": null, "id": "f7b2272b-dae5-4756-aa35-7463c98dd221", "metadata": {}, "outputs": [], "source": [ "!pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com" ] }, { "cell_type": "code", "execution_count": null, "id": "ca8fef31-9854-40ec-a463-1ed464a47d3c", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import os\n", "import time\n", "import argparse\n", "import torch\n", "from transformers_neuronx.module import save_pretrained_split\n", "from transformers_neuronx.dtypes import to_torch_dtype\n", "from transformers_neuronx.gptj.model import GPTJForSampling\n", "from transformers import AutoModelForCausalLM, AutoTokenizer\n", "\n", "\n", "def amp_callback(model, dtype):\n", " # cast attention and mlp to low precisions only; layernorms stay as f32\n", " for block in model.transformer.h:\n", " block.attn.to(dtype)\n", " block.mlp.to(dtype)\n", " model.lm_head.to(dtype)\n", "\n", "\n", "def compile(\n", " model_name, batch_size, compiler_args, amp=\"bf16\", tp_degree=2, n_positions=2048, unroll=None\n", "):\n", " os.environ[\"NEURON_CC_FLAGS\"] = compiler_args\n", " os.environ[\"NEURONX_DUMP_TO\"] = \"gpt-j-6b-artifacts-tp4\"\n", " os.enviorn[\"NEURON_CACHE\"] = \"on\"\n", "\n", " # Split the GPT-J model for faster loading\n", " model_dir = f\"EleutherAI-gpt-j-6B-bf16-local\"\n", " if not os.path.exists(model_dir):\n", " model_cpu = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)\n", " dtype = to_torch_dtype(amp)\n", " amp_callback(model_cpu, dtype=dtype)\n", " save_pretrained_split(model_cpu, model_dir)\n", "\n", " # Create and compile the Neuron model\n", " model_neuron = GPTJForSampling.from_pretrained(\n", " model_dir,\n", " batch_size=batch_size,\n", " amp=amp,\n", " tp_degree=tp_degree,\n", " n_positions=n_positions,\n", " unroll=unroll,\n", " )\n", " model_neuron.to_neuron()\n", " return model_neuron\n", "\n", "\n", "compile(\"EleutherAI/gpt-j-6B\", 1, \"--model-type=transformer\", \"bf16\", 4)" ] }, { "cell_type": "markdown", "id": "076950ad-ab9d-44d2-9826-a463848af213", "metadata": {}, "source": [ "The `*.to_neuron()` compilation will create a folder `gpt-j-6b-artifacts-tp4` to cache the split model \n", "\n", "A `model.py` and `config.pbtxt` file has been made available in the `inf2_llm/gptj-6b/gptj` folder. 
" ] }, { "cell_type": "code", "execution_count": 15, "id": "9851afde-45f4-449b-a7c3-5945116c4db3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\n# 1. Generate a pytorch-neuronx triton model.py and config.pbtxt file\\n!git clone https://github.com/triton-inference-server/python_backend && cd python_backend\\n!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,INT64,128 --triton_output OUTPUT__0,INT64,2048 --compiled_model EleutherAI-gpt-j-6B-bf16-local/ --neuron_core_range 0:4 --triton_model_dir inf2_llm/gptj-6b/gptj --enable_dynamic_batching --max_batch_size 128\\n# 2. Modify the model.py to use HF APIs as per the README here - https://github.com/triton-inference-server/python_backend/blob/main/inferentia/README.md#using-triton-with-inferentia-2-or-trn1.\\n'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Note that in this example, the following step is already complete and provided in the folder inf2_llm/gptj-6b/gptj\n", "\n", "\"\"\"\n", "# 1. Generate a pytorch-neuronx triton model.py and config.pbtxt file\n", "!git clone https://github.com/triton-inference-server/python_backend && cd python_backend\n", "!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,INT64,128 --triton_output OUTPUT__0,INT64,2048 --compiled_model EleutherAI-gpt-j-6B-bf16-local/ --neuron_core_range 0:4 --triton_model_dir inf2_llm/gptj-6b/gptj --enable_dynamic_batching --max_batch_size 128\n", "# 2. Modify the model.py to use HF APIs as per the README here - https://github.com/triton-inference-server/python_backend/blob/main/inferentia/README.md#using-triton-with-inferentia-2-or-trn1.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "69397620-7544-47a9-90bd-1255b1054c3e", "metadata": {}, "outputs": [], "source": [ "# Note: Use -L when copying artifacts to model folder to avoid symlinks\n", "# https://repost.aws/knowledge-center/sagemaker-endpoint-creation-fail\n", "!mkdir -p inf2_llm/gptj-6b/gptj/EleutherAI-gpt-j-6B-bf16-local\n", "!mkdir -p inf2_llm/gptj-6b/gptj/gpt-j-6b-artifacts-tp4\n", "!cp -R -L EleutherAI-gpt-j-6B-bf16-local/* inf2_llm/gptj-6b/gptj/EleutherAI-gpt-j-6B-bf16-local\n", "!cp -R -L gpt-j-6b-artifacts-tp4/* inf2_llm/gptj-6b/gptj/gpt-j-6b-artifacts-tp4" ] }, { "cell_type": "code", "execution_count": null, "id": "50ecbf8b-da51-452a-9ddf-2a7841c04cda", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!tar -C inf2_llm/gptj-6b/ -hvczf triton-transformers-gptj-inf2.tar.gz gptj" ] }, { "cell_type": "code", "execution_count": null, "id": "4ff5d85f-4544-4e04-a735-70771982e02b", "metadata": {}, "outputs": [], "source": [ "model_uri = sagemaker_session.upload_data(\n", " path=\"triton-transformers-gptj-inf2.tar.gz\", key_prefix=\"triton-inf2-models\"\n", ")" ] }, { "cell_type": "markdown", "id": "81a98829-7497-4e77-944d-0621719f4a71", "metadata": {}, "source": [ "#### We are now ready to start a SageMaker endpoint, and load the model" ] }, { "cell_type": "code", "execution_count": null, "id": "3c1fe838-3ebd-41cc-a843-590e5e19a308", "metadata": {}, "outputs": [], "source": [ "preload_model_argument = \"true --strict-readiness=false\"" ] }, { "cell_type": "code", "execution_count": null, "id": "efeb87ad-c492-410b-9f1a-be099e5b52d4", "metadata": {}, 
"outputs": [], "source": [ "sm_model_name = \"triton-gptj-inf2-transformers-gptj-\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "\n", "container = {\n", " \"Image\": triton_neuronx_image_uri,\n", " \"ModelDataUrl\": model_uri,\n", " \"Environment\": {\n", " \"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"gptj\",\n", " \"SAGEMAKER_TRITON_LOG_VERBOSE\": preload_model_argument,\n", " },\n", "}\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n", ")\n", "\n", "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])\n", "\n", "endpoint_config_name = \"triton-gptj-inf2-transformers-gptj-\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.inf2.24xlarge\",\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": sm_model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " \"VolumeSizeInGB\": 256,\n", " \"ModelDataDownloadTimeoutInSeconds\": 1800,\n", " \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600,\n", " }\n", " ],\n", ")\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])\n", "\n", "endpoint_name = \"triton-gptj-inf2-transformers-gptj-\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "\n", "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])\n", "\n", "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "ddf79acd-3ad0-4e88-b746-1a831cc257c7", "metadata": {}, "source": [ "#### Once the model is 'InService', it is ready to be served. 
We now create the payload and run the inference request" ] }, { "cell_type": "code", "execution_count": null, "id": "e9077122-058e-40d0-a681-c6c9bfd6e861", "metadata": {}, "outputs": [], "source": [ "from transformers import AutoTokenizer\n", "import torch\n", "\n", "prompt = \"this summer\"\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(\"EleutherAI/gpt-j-6B\")\n", "if not tokenizer.pad_token:\n", " tokenizer.pad_token = tokenizer.eos_token\n", "encoded_text = tokenizer.encode(prompt, padding=\"max_length\", max_length=128, truncation=True)\n", "# Run inference here\n", "\n", "payload = {\n", " \"inputs\": [{\"name\": \"INPUT__0\", \"shape\": [1, 128], \"datatype\": \"INT64\", \"data\": encoded_text}]\n", "}\n", "\n", "response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(payload))\n", "\n", "generated_sequence = json.loads(response[\"Body\"].read().decode(\"utf8\"))[\"outputs\"][0][\"data\"]\n", "output = \" \".join(tokenizer.batch_decode(generated_sequence, skip_special_tokens=True)).encode(\n", " \"utf-8\"\n", ")\n", "\n", "print(output)" ] }, { "cell_type": "markdown", "id": "5b338dd0-f9fc-4b48-9b69-6d3658b590be", "metadata": {}, "source": [ "## Terminate Resources" ] }, { "cell_type": "code", "execution_count": null, "id": "dbe89d0f-d528-4bf2-b84a-1c04e1c5437c", "metadata": {}, "outputs": [], "source": [ "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm_client.delete_model(ModelName=sm_model_name)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }