{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "27811923-bb05-4bae-a8dc-eaf6887a72ab", "metadata": {}, "source": [ "# Triton on SageMaker - this notebook shows how you can take a RoBERTA model and create a traced model and leverage the Pytorch back end for Triton\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "id": "27811923-bb05-4bae-a8dc-eaf6887a72ab", "metadata": {}, "source": [ "\n", "Amazon SageMaker is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.\n", "\n", "Now, NVIDIA Triton Inference Server can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.\n", "\n", "This notebook was tested on Studio with ml.g4dn.xlarge which comes with 1 GPU and with ml.m5.large which is a CPU based machine only Contents\n", "\n", "Introduction to NVIDIA Triton Server\n", "Set up the environment\n", "Basic: RoBERTA Model\n", "* PyTorch: JIT Trace the model and create a Scripted model\n", "* PyTorch: Testing the JIT Traced model \n", "* PyTorch: Packaging model files and uploading to s3\n", "* PyTorch: Create SageMaker Endpoint\n", "* PyTorch: Run inference\n", "* PyTorch: Leverage the Predictions to view the results for Object detection\n", "* PyTorch: Terminate endpoint and clean up artifacts\n", "\n", "\n", "### Introduction to NVIDIA Triton Server\n", "\n", "NVIDIA Triton Inference Server was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.\n", "\n", "Some key features of Triton are:\n", "\n", "* Support for Multiple frameworks: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats.\n", "* Model pipelines: Triton model ensemble represents a pipeline of one or more models or pre/post-processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.\n", "* Concurrent model execution: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.\n", "* Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.\n", "* Diverse CPUs and GPUs: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.\n", "\n", "Note: This initial release of NVIDIA Triton on SageMaker will only support a single model. Future releases will have multi-model support. A minimal config.pbtxt configuration file is required in the model artifacts. This release doesn't support inferring the model config automatically. Set up the environment\n", "\n", "Installs the dependencies required to package the model and run inferences using Triton server.\n", "\n", "Also define the IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton ECR image.\n", "\n", "The purpose of this file is to show the ability to take a pytorch computer vision model and create a scripted model which can then be leveraged by Triton using the pytorch back end.\n", "\n", "The other option is to build using a python back end but in that we loose some performance gains by compilation to native format\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d2064a2c", "metadata": {}, "outputs": [], "source": [ "account_id_map = {\n", " \"us-east-1\": \"785573368785\",\n", " \"us-east-2\": \"007439368137\",\n", " \"us-west-1\": \"710691900526\",\n", " \"us-west-2\": \"301217895009\",\n", " \"eu-west-1\": \"802834080501\",\n", " \"eu-west-2\": \"205493899709\",\n", " \"eu-west-3\": \"254080097072\",\n", " \"eu-north-1\": \"601324751636\",\n", " \"eu-south-1\": \"966458181534\",\n", " \"eu-central-1\": \"746233611703\",\n", " \"ap-east-1\": \"110948597952\",\n", " \"ap-south-1\": \"763008648453\",\n", " \"ap-northeast-1\": \"941853720454\",\n", " \"ap-northeast-2\": \"151534178276\",\n", " \"ap-southeast-1\": \"324986816169\",\n", " \"ap-southeast-2\": \"355873309152\",\n", " \"cn-northwest-1\": \"474822919863\",\n", " \"cn-north-1\": \"472730292857\",\n", " \"sa-east-1\": \"756306329178\",\n", " \"ca-central-1\": \"464438896020\",\n", " \"me-south-1\": \"836785723513\",\n", " \"af-south-1\": \"774647643957\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "627f8f60", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install transformers[torch]" ] }, { "cell_type": "code", "execution_count": null, "id": "dae037f2", "metadata": {}, "outputs": [], "source": [ "!pip install nvidia-pyindex -qU\n", "!pip install tritonclient[http] -qU\n", "\n", "!pip install -qU pip awscli boto3 sagemaker transformers" ] }, { "cell_type": "markdown", "id": "ddfa9ef5", "metadata": {}, "source": [ "### Start RoBERTA Base for Triton\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "2bd0a228-31e1-42fb-b553-8bf3bde5c69c", "metadata": { "tags": [] }, "outputs": [], "source": [ "!mkdir -p triton-serve-pt/roberta-traced\n", "!mkdir -p triton-serve-pt/roberta-traced/1\n", "\n", "\n", "!cd triton-serve-pt/roberta-traced/1 && rm -rf \".ipynb_checkpoints\"\n", "!cd triton-serve-pt/roberta-traced && rm -rf \".ipynb_checkpoints\"\n", "!cd triton-serve-pt && rm -rf \".ipynb_checkpoints\"" ] }, { "cell_type": "code", "execution_count": null, "id": "17611bb9-d1a0-4e96-8da8-a95b30de5ad6", "metadata": { "tags": [] }, "outputs": [], "source": [ "!ls -alrt triton-serve-pt/roberta-traced/1" ] }, { "cell_type": "code", "execution_count": null, "id": "ee6cd9b5", "metadata": {}, "outputs": [], "source": [ "%%writefile triton-serve-pt/roberta-traced/config.pbtxt\n", "platform: \"pytorch_libtorch\"\n", "max_batch_size: 32\n", "input [\n", " {\n", " name: \"INPUT__0\"\n", " data_type: TYPE_INT32\n", " dims: [512]\n", " },\n", " {\n", " name: \"INPUT__1\"\n", " data_type: TYPE_INT32\n", " dims: [512]\n", " }\n", "]\n", "output [\n", " {\n", " name: \"OUTPUT__0\"\n", " data_type: TYPE_FP32\n", " dims: [512, 768]\n", " },\n", " {\n", " name: \"1634__1\"\n", " data_type: TYPE_FP32\n", " dims: [768]\n", " }\n", "]\n", "instance_group {\n", " count: 1\n", " kind: KIND_GPU\n", "}\n", "dynamic_batching {\n", " preferred_batch_size: 32\n", "}" ] }, { "cell_type": "markdown", "id": "3ba4a981", "metadata": {}, "source": [ "### Run for Triton server" ] }, { "cell_type": "markdown", "id": "ec079557", "metadata": {}, "source": [ "**Note**: Amazon SageMaker expects the model tarball file to have a top level directory with the same name as the model defined in the `config.pbtxt`. Below is the sample model directory structure\n", "\n", "```\n", "roberta-large\n", "\u251c\u2500\u2500 1\n", "\u2502 \u2514\u2500\u2500 model.pt\n", "\u2514\u2500\u2500 config.pbtxt\n", "```" ] }, { "cell_type": "markdown", "id": "3a576b1e", "metadata": {}, "source": [ "**Have to use the same Tokenizer to generate the input to test as BERT uncased**" ] }, { "cell_type": "markdown", "id": "769d635d", "metadata": {}, "source": [ "### Create the RoBERTA Model in Torch Script mode -- .pt model\n", "use the ore trained and use torchscript flag here" ] }, { "cell_type": "code", "execution_count": null, "id": "736daf7d", "metadata": {}, "outputs": [], "source": [ "from transformers import GPT2Tokenizer, GPTJModel\n", "from transformers import GPTJForCausalLM, AutoTokenizer\n", "\n", "import torch" ] }, { "cell_type": "markdown", "id": "e4a3446e", "metadata": {}, "source": [ "### Run a simple test for RoBERTA base \n", "\n", " * We run multiple tests\n", " * First we token ize and then de tokenize to make sure the vaues match\n", " * Then we use the model and run predictions to get values\n", " * Then we run on the traced Model and run predictions to get values \n", " * Check to make sure they match" ] }, { "cell_type": "markdown", "id": "2692ec25-a715-4363-a0fc-94ba87f0fb50", "metadata": {}, "source": [ "### Prepare some dummy inputs for tracing" ] }, { "cell_type": "code", "execution_count": null, "id": "1e5a3980-ea0e-4331-aa24-f3a5e5689ddb", "metadata": {}, "outputs": [], "source": [ "# Tokenizing input text\n", "tokenizer = AutoTokenizer.from_pretrained(\"roberta-large\")\n", "\n", "text = \"[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]\"\n", "tokenized_text = tokenizer.tokenize(text)\n", "print(f\"BERT:Tokenized:Text={tokenized_text}:::\")\n", "\n", "# Masking one of the input tokens\n", "masked_index = 8\n", "tokenized_text[masked_index] = \"[MASK]\"\n", "indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)\n", "print(f\"BERT:indexed_tokens:={indexed_tokens}::\")\n", "\n", "# -- segments id's\n", "segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]\n", "\n", "# Creating a dummy input\n", "tokens_tensor = torch.tensor([indexed_tokens])\n", "segments_tensors = torch.tensor([segments_ids])" ] }, { "cell_type": "code", "execution_count": null, "id": "49e33dd8", "metadata": { "tags": [] }, "outputs": [], "source": [ "### Roberta -\n", "from transformers import AutoTokenizer, AutoModel\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "print(\"Using {} device\".format(device))\n", "\n", "# -- IF you use from bert it comes without HEAD\n", "tokenizer = AutoTokenizer.from_pretrained(\"roberta-base\") # roberta-large\n", "model = AutoModel.from_pretrained(\"roberta-base\", torchscript=True) # roberta-large\n", "model = model.eval()\n", "\n", "\n", "bs = 1\n", "seq_len = 512\n", "dummy_inputs = [\n", " torch.randint(1000, (bs, seq_len)).to(\"cpu\"), # to(device),\n", " torch.zeros(bs, seq_len, dtype=torch.int).to(\"cpu\"), # to(device),\n", "]\n", "\n", "text = \"Replace me by any text you'd like.\"\n", "dummy_inputs = tokenizer(\n", " text,\n", " return_tensors=\"pt\",\n", " max_length=seq_len,\n", " padding=True,\n", " truncation=True,\n", ")\n", "print(dummy_inputs.keys())\n", "\n", "\n", "# Creating the trace\n", "# traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])\n", "traced_model = torch.jit.trace(model, [dummy_inputs[\"input_ids\"], dummy_inputs[\"attention_mask\"]])\n", "\n", "model = model.eval()\n", "# model.to(device)\n", "torch.jit.save(traced_model, \"./triton-serve-pt/roberta-traced/1/model.pt\")\n", "\n", "print(\"Saved {}\".format(traced_model))" ] }, { "cell_type": "code", "execution_count": null, "id": "2bd95803", "metadata": { "tags": [] }, "outputs": [], "source": [ "model" ] }, { "cell_type": "markdown", "id": "85246e13", "metadata": {}, "source": [ "#### Test encoders various methods" ] }, { "cell_type": "code", "execution_count": null, "id": "0824610a", "metadata": {}, "outputs": [], "source": [ "tokenizer(\n", " \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\",\n", " padding=\"max_length\",\n", " max_length=64,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "9d85b28f", "metadata": { "tags": [] }, "outputs": [], "source": [ "encoded_tokens = tokenizer.encode_plus(\n", " \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\",\n", " add_special_tokens=True, # Add '[CLS]' and '[SEP]'\n", " max_length=512,\n", " pad_to_max_length=True, # Pad & truncate all sentences\n", ")\n", "# encoded_tokens" ] }, { "cell_type": "markdown", "id": "d85d075e-5585-45f0-ac22-1571e9373820", "metadata": {}, "source": [ "### Test the HuggingFace and then the scripted model locally" ] }, { "cell_type": "code", "execution_count": null, "id": "80984984", "metadata": { "tags": [] }, "outputs": [], "source": [ "import torch\n", "import torch.nn.functional as F\n", "\n", "encoded_input = tokenizer(\n", " \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\",\n", " return_tensors=\"pt\",\n", " add_special_tokens=True, # Add '[CLS]' and '[SEP]'\n", " max_length=512, # -- this model has max length set to 100 -- not to 512,\n", " pad_to_max_length=True, # Pad & truncate all sentences\n", ")\n", "# unscripted_output = model.generate( # --\n", "unscripted_output = model( # -- both work the same way\n", " **encoded_input,\n", " # inputs=encoded_input['attention_mask']],\n", " return_dict=True,\n", " output_attentions=False,\n", " output_hidden_states=False,\n", " # do_sample=True,\n", " # temperature=0.9,\n", " # max_length=128,\n", ") # -- BaseModelOutputWithPoolingAndCrossAttentions\n", "\n", "# tokenizer.decode(unscripted_output[0])\n", "unscripted_output[0].shape" ] }, { "cell_type": "markdown", "id": "226688a9-61de-4e27-aa9d-e6e4fa0a369c", "metadata": {}, "source": [ "#### Now test the Scripted model -- Scripted model gives us tensors back" ] }, { "cell_type": "code", "execution_count": null, "id": "2ccbc136-88d7-48c3-a191-495654f557be", "metadata": { "tags": [] }, "outputs": [], "source": [ "import torch\n", "import torch.nn.functional as F\n", "\n", "encoded_input = tokenizer(\n", " \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\",\n", " return_tensors=\"pt\",\n", " add_special_tokens=True, # Add '[CLS]' and '[SEP]'\n", " max_length=512, # -- this model has max length set to 100 -- not to 512,\n", " pad_to_max_length=True, # Pad & truncate all sentences\n", ")\n", "# Traced Model expects ONLY the INPUT ID's\n", "unscripted_traced_output = traced_model( # -- both work the same way\n", " encoded_input[\"input_ids\"], encoded_input[\"attention_mask\"]\n", ")\n", "\n", "# tokenizer.decode(unscripted_output[0])\n", "print(unscripted_traced_output[0].shape)\n", "print(unscripted_traced_output[1].shape)" ] }, { "cell_type": "code", "execution_count": null, "id": "dedcb78c", "metadata": { "tags": [] }, "outputs": [], "source": [ "unscripted_output[0]" ] }, { "cell_type": "markdown", "id": "34790ced", "metadata": {}, "source": [ "### Upload the Model.tar after it has been created correctly by the above scripted and the config.pbtxt files\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "58abbcb6-98d5-4ad6-9a5c-84dcdf9b63dd", "metadata": {}, "outputs": [], "source": [ "tar_file_name = \"roberta-traced-v1.tar.gz\"" ] }, { "cell_type": "code", "execution_count": null, "id": "0f924882-5e88-430e-8631-be48b7588062", "metadata": {}, "outputs": [], "source": [ "!cd triton-serve-pt && tar --exclude=\".git\" --exclude=\".gitattributes\" --exclude=\"model.tar.gz\" --exclude=\"*.bin\" --exclude \"*.tar\" --exclude \"*.ipynb_checkpoints\" -zcvf {tar_file_name} roberta-traced" ] }, { "cell_type": "markdown", "id": "feed1b55", "metadata": {}, "source": [ "**Upload the model.tar.gz to S3 location**" ] }, { "cell_type": "code", "execution_count": null, "id": "8e9ba035", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker import get_execution_role, Session, image_uris\n", "from sagemaker.utils import name_from_base\n", "import boto3\n", "\n", "region = boto3.Session().region_name\n", "session = sagemaker.Session()\n", "role = get_execution_role()\n", "sm_client = boto3.client(service_name=\"sagemaker\")\n", "runtime_sm_client = boto3.client(\"sagemaker-runtime\")\n", "model_bucket = session.default_bucket()" ] }, { "cell_type": "code", "execution_count": null, "id": "86f0d03c", "metadata": {}, "outputs": [], "source": [ "s3_model_path_triton = sagemaker.s3.S3Uploader().upload(\n", " local_path=f\"./triton-serve-pt/{tar_file_name}\",\n", " desired_s3_uri=f\"s3://{model_bucket}/mme-roberta-benchmark/roberta-large\",\n", " sagemaker_session=session,\n", ")\n", "s3_mme_model_path = f\"s3://{model_bucket}/mme-roberta-benchmark/roberta-large/\"\n", "print(s3_model_path_triton)\n", "print(s3_mme_model_path)" ] }, { "cell_type": "markdown", "id": "bae2b009", "metadata": {}, "source": [ "#### Start Single Model Triton for starting" ] }, { "cell_type": "markdown", "id": "741da45c", "metadata": {}, "source": [ "**Triton Image download and sagemaker variables**" ] }, { "cell_type": "code", "execution_count": null, "id": "d863f79e", "metadata": {}, "outputs": [], "source": [ "from sagemaker import get_execution_role, Session, image_uris\n", "import boto3\n", "from sagemaker.utils import name_from_base\n", "\n", "region = boto3.Session().region_name\n", "role = get_execution_role()\n", "sm_client = boto3.client(service_name=\"sagemaker\")\n", "runtime_sm_client = boto3.client(\"sagemaker-runtime\")\n", "\n", "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", "triton_image_uri = \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.10-py3\".format(\n", " account_id=account_id_map[region], region=region, base=base\n", ")\n", "print(triton_image_uri)" ] }, { "cell_type": "markdown", "id": "d82661f3", "metadata": {}, "source": [ "**Model creation**" ] }, { "cell_type": "code", "execution_count": null, "id": "f05affc2", "metadata": {}, "outputs": [], "source": [ "endpoint_name = name_from_base(f\"roberta-base-\")\n", "print(endpoint_name)\n", "\n", "container_p5 = {\n", " \"Image\": triton_image_uri,\n", " \"ModelDataUrl\": s3_mme_model_path,\n", " \"Mode\": \"MultiModel\",\n", " \"Environment\": {\n", " #'SAGEMAKER_PROGRAM' : 'inference.py',\n", " #'SAGEMAKER_SUBMIT_DIRECTORY' : 'code',\n", " #'SAGEMAKER_TRITON_DEFAULT_MODEL_NAME': 'bert-uc',\n", " # \"SAGEMAKER_TRITON_BATCH_SIZE\": \"16\",\n", " \"SAGEMAKER_TRITON_MAX_BATCH_DELAY\": \"1000\",\n", " \"SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE\": \"16777216000\", # \"16777216000\",\n", " \"SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE\": \"104857600\",\n", " },\n", "}\n", "create_model_response = sm_client.create_model(\n", " ModelName=endpoint_name, ExecutionRoleArn=role, PrimaryContainer=container_p5\n", ")\n", "print(create_model_response)" ] }, { "cell_type": "markdown", "id": "96ba56d7", "metadata": {}, "source": [ "**Endpoint config**" ] }, { "cell_type": "code", "execution_count": null, "id": "a8e83ed5", "metadata": {}, "outputs": [], "source": [ "# Sampling percentage. Choose an integer value between 0 and 100\n", "initial_sampling_percentage = 10\n", "\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.g4dn.xlarge\", # \"ml.g5.8xlarge\", # \"ml.g4dn.xlarge\", \"ml.g4dn.4xlarge\"\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": endpoint_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])" ] }, { "cell_type": "markdown", "id": "7fe59ec7", "metadata": {}, "source": [ "**Endpoint**" ] }, { "cell_type": "code", "execution_count": null, "id": "0d85217c", "metadata": {}, "outputs": [], "source": [ "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "81f5e1fb", "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"SINGLE:Model:endpoint:Triton:Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Single:model:triton:Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Single:model:triton:Status: \" + status)" ] }, { "cell_type": "markdown", "id": "bdcb90ea", "metadata": {}, "source": [ "**Now Invoke The endpoint**\n", "
  • First option is JSON
  • \n", "
  • Second is native binary headers
  • " ] }, { "cell_type": "code", "execution_count": null, "id": "4b8b44b4-8c91-4840-8f25-367937aba784", "metadata": {}, "outputs": [], "source": [ "import tritonclient.http as httpclient\n", "from transformers import BertTokenizer\n", "import numpy as np\n", "from tritonclient.utils import np_to_triton_dtype\n", "\n", "\n", "def tokenize_text(text, enc, max_length=512):\n", " # enc = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n", " print(f\"Tokenize:text:why??::max_length={max_length}::Tokenizer={enc}\")\n", " encoded_text = enc(text, padding=\"max_length\", max_length=max_length)\n", " return encoded_text[\"input_ids\"], encoded_text[\"attention_mask\"]\n", "\n", "\n", "# Inference hyperparameters\n", "def prepare_tensor(name, input_d):\n", " tensor = httpclient.InferInput(name, input_d.shape, np_to_triton_dtype(input_d.dtype))\n", " tensor.set_data_from_numpy(input_d)\n", " return tensor\n", "\n", "\n", "# explanation\n", "def prepare_roberta_2_inputs(input0, attention_0):\n", " input0_data = np.array(input0, dtype=np.int32) # - convert to Numpy from PyTorch tensors\n", " input_attention_data = np.array(attention_0, dtype=np.int32)\n", "\n", " inputs = [ # - match the config.pbtxt\n", " prepare_tensor(\"INPUT__0\", input0_data),\n", " prepare_tensor(\"INPUT__1\", input_attention_data),\n", " ]\n", "\n", " outputs = []\n", " outputs.append(httpclient.InferRequestedOutput(\"OUTPUT__0\", binary_data=True))\n", " outputs.append(httpclient.InferRequestedOutput(\"1634__1\", binary_data=True))\n", " (\n", " request_body,\n", " header_length,\n", " ) = httpclient.InferenceServerClient.generate_request_body(inputs, outputs=outputs)\n", " return request_body, header_length\n", "\n", "\n", "def get_decoded_text(tensors_tokens, enc):\n", " return_text = tokenizer.batch_decode(gen_tokens)[0]\n", " return return_text" ] }, { "cell_type": "markdown", "id": "b00f4bb8-f52f-4ee6-903c-c9acc1005765", "metadata": {}, "source": [ "**Run the JSON invocation**" ] }, { "cell_type": "code", "execution_count": null, "id": "32d674e1", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "import json\n", "\n", "max_seq_length = 512\n", "text_triton = \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\"\n", "print(\n", " f\"Leverage the Tokenizer={tokenizer}::max_seq_length={max_seq_length}:: create above when creating the model \"\n", ")\n", "\n", "input_ids, attention_mask = tokenize_text(text_triton, tokenizer, max_length=max_seq_length)\n", "\n", "payload = {\n", " \"inputs\": [\n", " {\n", " \"name\": \"INPUT__0\",\n", " \"shape\": [1, max_seq_length],\n", " \"datatype\": \"INT32\",\n", " \"data\": input_ids,\n", " },\n", " {\n", " \"name\": \"INPUT__1\",\n", " \"shape\": [1, max_seq_length],\n", " \"datatype\": \"INT32\",\n", " \"data\": attention_mask,\n", " },\n", " ]\n", "}\n", "\n", "response = runtime_sm_client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/octet-stream\",\n", " Body=json.dumps(payload),\n", " TargetModel=\"roberta-traced-v1.tar.gz\",\n", ")\n", "\n", "output = json.loads(response[\"Body\"].read().decode(\"utf8\"))\n", "\n", "print(output.keys())" ] }, { "cell_type": "code", "execution_count": null, "id": "ed92b242", "metadata": { "tags": [] }, "outputs": [], "source": [ "output[\"outputs\"][0][\"data\"]" ] }, { "cell_type": "markdown", "id": "e413c074-0fdf-4850-ad78-4ec5f1819660", "metadata": {}, "source": [ "**Invoke using the Binary Format**" ] }, { "cell_type": "code", "execution_count": null, "id": "0c3fe2c5-5a5d-46f6-bb13-451af0cbc99c", "metadata": {}, "outputs": [], "source": [ "encoded_input = tokenizer(\n", " \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\",\n", " return_tensors=\"pt\",\n", " add_special_tokens=True, # Add '[CLS]' and '[SEP]'\n", " max_length=512, #\n", " pad_to_max_length=True, # Pad & truncate all sentences\n", ")\n", "# Traced Model expects ONLY the INPUT ID's\n", "input_ids = encoded_input[\"input_ids\"]\n", "attention_mask = encoded_input[\"attention_mask\"]\n", "\n", "triton_request_body, triton_header_length = prepare_roberta_2_inputs(input_ids, attention_mask)" ] }, { "cell_type": "code", "execution_count": null, "id": "ee35125a-4655-4bd5-b0b3-14f091b45611", "metadata": { "tags": [] }, "outputs": [], "source": [ "response_binary = runtime_sm_client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/vnd.sagemaker-triton.binary+json;json-header-size={}\".format(\n", " triton_header_length\n", " ),\n", " Body=triton_request_body,\n", " TargetModel=f\"{tar_file_name}\",\n", ")\n", "print(response_binary)\n", "\n", "# # Parse json header size length from the response\n", "# header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n", "# header_length_str = response[\"ContentType\"][len(header_length_prefix) :]\n", "# # print(response_binary[\"Body\"].read())\n", "\n", "# try:\n", "# # Read response body\n", "# result = httpclient.InferenceServerClient.parse_response_body(\n", "# response_binary[\"Body\"].read() # , header_length=int(header_length_str)\n", "# )\n", "# output0_data = result.as_numpy(\"1634__1\")\n", "# output1_data = result.as_numpy(\"OUTPUT__0\")\n", "# print(output0_data)\n", "# print(output1_data)\n", "# except:\n", "# print(\"Error in parsing response -- \")" ] }, { "cell_type": "markdown", "id": "1745887a", "metadata": {}, "source": [ "### Stress Test it " ] }, { "cell_type": "code", "execution_count": null, "id": "c7e2ff26-d79e-4061-b702-900753a408f2", "metadata": {}, "outputs": [], "source": [ "# model_name = \"roberta-base\"\n", "# print(s3_model_path_triton)\n", "# print(s3_mme_model_path)\n", "# print(model_name)" ] }, { "cell_type": "code", "execution_count": null, "id": "56358ec4-eeba-4cbb-9977-7a981d9fdb1f", "metadata": {}, "outputs": [], "source": [ "# text_triton = \"Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.\"\n", "# print(\n", "# f\"Leverage the Tokenizer={tokenizer}::max_seq_length={max_seq_length}:: create above when creating the model \"\n", "# )\n", "\n", "# input_ids, attention_mask = tokenize_text(\n", "# text_triton, tokenizer, max_length=max_seq_length\n", "# )\n", "\n", "# payload = {\n", "# \"inputs\": [\n", "# {\n", "# \"name\": \"INPUT__0\",\n", "# \"shape\": [1, max_seq_length],\n", "# \"datatype\": \"INT32\",\n", "# \"data\": input_ids,\n", "# },\n", "# {\n", "# \"name\": \"INPUT__1\",\n", "# \"shape\": [1, max_seq_length],\n", "# \"datatype\": \"INT32\",\n", "# \"data\": attention_mask,\n", "# },\n", "# ]\n", "# }" ] }, { "cell_type": "code", "execution_count": null, "id": "14a4cb4d", "metadata": {}, "outputs": [], "source": [ "# models_loaded = 0\n", "# memory_utilization_threshold = 0.9\n", "# memory_utilization_history = []\n", "# max_models_test = 10\n", "# while models_loaded < max_models_test:\n", "# # make a copy of the model\n", "# !aws s3 cp {s3_model_path_triton} {s3_mme_model_path}/{model_name}-v{models_loaded}.tar.gz\n", "\n", "# # make a inference request to load model into memory\n", "# response = runtime_sm_client.invoke_endpoint(\n", "# EndpointName=endpoint_name,\n", "# ContentType=\"application/octet-stream\",\n", "# Body=json.dumps(payload),\n", "# TargetModel=f\"{model_name}-v{models_loaded}.tar.gz\",\n", "# )\n", "\n", "# models_loaded+=1\n", "\n", "\n", "# print(f\"loaded {models_loaded} models with memory utilzation of {memory_utilization:.2%}\")" ] }, { "cell_type": "markdown", "id": "9db9aaaa", "metadata": {}, "source": [ "### Clean up" ] }, { "cell_type": "code", "execution_count": null, "id": "a412ed75", "metadata": {}, "outputs": [], "source": [ "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)\n", "sm_client.delete_model(ModelName=endpoint_name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.g4dn.xlarge", "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 5 }