# Triton on SageMaker: Exporting a RoBERTa model to ONNX and serving it with the Triton ONNX backend


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

---


Amazon SageMaker is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

Now, NVIDIA Triton Inference Server can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

This notebook was tested on Studio with ml.g4dn.xlarge, which comes with 1 GPU, and with ml.m5.large, which is a CPU-only machine.

### Contents

* Introduction to NVIDIA Triton Server
* Set up the environment
* Basic: RoBERTa Model
  * PyTorch: ONNX compile the model and save it
  * PyTorch: Packaging model files and uploading to S3
  * PyTorch: Create SageMaker Endpoint
  * PyTorch: Run inference
  * PyTorch: Terminate endpoint and clean up artifacts


### Introduction to NVIDIA Triton Server

NVIDIA Triton Inference Server was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.

Some key features of Triton are:

* Support for Multiple frameworks: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats.
* Model pipelines: Triton model ensemble represents a pipeline of one or more models or pre/post-processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.
* Concurrent model execution: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.
* Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
* Diverse CPUs and GPUs: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.
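
To make the dynamic batching idea more concrete, here is a toy sketch in plain Python. It is not Triton's scheduler and makes no claim about how Triton is implemented; it only illustrates the general idea of closing a batch when it is full or when waiting any longer would exceed a delay budget. The function name `group_into_batches` and the parameters `max_batch_size` and `max_delay` are hypothetical names chosen for this illustration.

```python
from typing import List


def group_into_batches(
    arrivals: List[float], max_batch_size: int, max_delay: float
) -> List[List[int]]:
    """Group request indices into batches (toy model, not Triton's algorithm).

    A batch is closed when it is full, or when the next request arrives
    more than `max_delay` seconds after the first request in the batch.
    `arrivals` must be sorted in ascending order.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for idx, arrival in enumerate(arrivals):
        batch_is_full = len(current) == max_batch_size
        waited_too_long = bool(current) and arrival - arrivals[current[0]] > max_delay
        if current and (batch_is_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


# Three requests arrive close together, then two more arrive much later.
arrivals = [0.000, 0.001, 0.002, 0.050, 0.051]
print(group_into_batches(arrivals, max_batch_size=4, max_delay=0.010))
# prints [[0, 1, 2], [3, 4]]
```

The client that sent request 1 never sees that it was executed together with requests 0 and 2, which is what "transparent to the client" means in the list above.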

Note: The initial release of NVIDIA Triton on SageMaker supported only a single model per endpoint; this notebook deploys the model in multi-model mode. A minimal config.pbtxt configuration file is required in the model artifacts, because the model config is not inferred automatically.
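
As a rough illustration of what a minimal `config.pbtxt` can contain, the sketch below renders one as text. The helper names `render_config` and `_tensor_entries` are hypothetical, the tensor names and dims mirror the ones used later in this notebook, and the `max_batch_size` value is an assumption made for the example. In the notebook itself the file is produced by `model_utils.generate_triton_config`, whose template may differ from this output.

```python
from typing import Dict, List


def _tensor_entries(tensors: List[Dict]) -> str:
    """Render a list of tensor specs as protobuf-text entries."""
    entries = []
    for t in tensors:
        dims = ", ".join(str(d) for d in t["dims"])
        entries.append(
            "  {\n"
            f'    name: "{t["name"]}"\n'
            f'    data_type: {t["data_type"]}\n'
            f"    dims: [ {dims} ]\n"
            "  }"
        )
    return ",\n".join(entries)


def render_config(name: str, inputs: List[Dict], outputs: List[Dict], max_batch_size: int) -> str:
    """Render a minimal Triton model configuration for an ONNX model."""
    return (
        f'name: "{name}"\n'
        'backend: "onnxruntime"\n'
        f"max_batch_size: {max_batch_size}\n"
        f"input [\n{_tensor_entries(inputs)}\n]\n"
        f"output [\n{_tensor_entries(outputs)}\n]\n"
    )


config_text = render_config(
    name="roberta-large",
    inputs=[
        {"name": "input_ids", "data_type": "TYPE_INT64", "dims": [512]},
        {"name": "attention_mask", "data_type": "TYPE_INT64", "dims": [512]},
    ],
    outputs=[{"name": "last_hidden_state", "data_type": "TYPE_FP32", "dims": [-1, 768]}],
    max_batch_size=8,  # assumed value, for illustration only
)
print(config_text)
```

Because `max_batch_size` is greater than zero, the batch dimension is left out of `dims`; Triton adds it implicitly.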


The purpose of this notebook is to show how to take an open-source NLP model, export it to the open ONNX standard, and then host the ONNX model using Triton in multi-model mode.

This notebook has been tested with an `ml.g4dn.xlarge` instance, which comes with 1 GPU.

## Set up the environment

Install the dependencies required to package the model and run inference using the Triton server.

Also define the IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

In [None]:
%pip install timm -Uqq
%pip install transformers -Uqq
%pip install locust -Uqq
%pip install boto3 -Uqq
%pip install sagemaker -Uqq
%pip install matplotlib -Uqq
%pip install Jinja2 -Uqq
%pip install nvidia-pyindex --quiet
%pip install tritonclient[http] --quiet

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
!pip install boto3 -Uqq

In [None]:
%pip install transformers[onnx] -Uqq

In [None]:
import datetime as dt
import json
import time
import warnings
from pathlib import Path

import boto3
import sagemaker
import torch
from sagemaker import get_execution_role

from utils import model_utils

role = get_execution_role()
sess = sagemaker.Session()

account = sess.account_id()
bucket = sess.default_bucket()  # or use your own custom bucket name
prefix = "mme-roberta-benchmark"

use_case = "nlp"

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")

tested_models = ["bert-base-uncased", "roberta-large"]

model_name = "roberta-large"  # change the model name to benchmark different NLP models

max_seq_len = 512

In [None]:
%env TOKENIZERS_PARALLELISM=False

In [None]:
print(torch.__version__)
import transformers

print(transformers.__version__)

The following mapping gives, for each region, the AWS account ID that hosts the SageMaker Triton inference container image.

In [None]:
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

In [None]:
region = boto3.Session().region_name
if region not in account_id_map:
    # Raising a plain string is a TypeError in Python 3; raise a real exception instead.
    raise ValueError(f"Unsupported region: {region}")

## Generate Pretrained Models

The following cells generate our pretrained model. They do the following:

1. Load a pretrained model from the Hugging Face Hub

2. Export the model to ONNX and save the ONNX file

3. Create a model artifact comprising the ONNX file and a model configuration (config.pbtxt) for Triton serving

Helper functions have been created for each of these steps and are imported from the `utils.model_utils` local module.

#### This creates a model that returns 2 tensors as predictions

In [None]:
if model_name not in tested_models:
    warnings.warn(f"{model_name} has not been tested and may not work")
tokenizer, model = model_utils.get_model_from_hf_hub(model_name)
model.eval()

print(f"loaded model {model_name} with {model_utils.count_parameters(model)} parameters")

example_input = tokenizer(
    "This is a sample",
    padding="max_length",
    max_length=max_seq_len,
    return_tensors="pt",
)

In [None]:
# RoBERTa
from transformers import AutoTokenizer, AutoModel

max_seq_len = 512
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

# AutoModel loads the bare encoder without a task-specific head.
# roberta-base has a hidden size of 768, which the Triton output dims below assume.
checkpoint = "roberta-base"  # or "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, torchscript=True)
model = model.eval()

print(f"loaded model {checkpoint} with {model_utils.count_parameters(model)} parameters")

In [None]:
example_input = tokenizer(
    "This is a sample",
    padding="max_length",
    max_length=max_seq_len,
    return_tensors="pt",
)

In [None]:
example_input["input_ids"].shape

In [None]:
example_input.keys()

In [None]:
model(**example_input)

In [None]:
model(example_input["input_ids"], example_input["attention_mask"])[0].shape

## Benchmark an ONNX model
The full path from a PyTorch model to a TensorRT model has the following steps. This notebook carries out only the first step and serves the resulting ONNX model.

1. Convert the model to ONNX. ONNX is an intermediate file format that is framework agnostic. It works with models in TF, PyTorch, and more.

2. Spin up a SageMaker Processing job to convert the ONNX model to a TensorRT model plan. You will export the weights of your model from the framework and load them into your TensorRT network.

<div class="alert-danger" role="alert"> <strong> Warning: </strong>
For TensorRT models, the hosting instance type must match the instance type the model is compiled on. To help ensure compatibility, compile the model using a SageMaker Processing job. Note that this will incur the additional cost of running the job. Also, SageMaker Processing does not currently support the ml.g5 family of instances, so these compiled models will only run on the ml.g4dn family of instances. To deploy on an ml.g5 instance, please compile the model on an EC2 or a SageMaker notebook instance.
</div>


## Packaging the ONNX model for Triton server on SageMaker

**Note**: SageMaker expects the model tarball file to have a top level directory with the same name as the model defined in the `config.pbtxt`.

```
model_name
├── 1
│   └── model.onnx
└── config.pbtxt
```
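
The layout above can also be produced from Python rather than with shell commands. The following is a minimal sketch using only the standard library; `package_model` is a hypothetical helper name, and the files it packs here are empty placeholders created in a temporary directory, not a real model.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def package_model(model_dir: Path, tar_path: Path) -> List[str]:
    """Create a .tar.gz whose single top-level directory is `model_dir.name`.

    Returns the archive member names so the layout can be checked.
    """
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname keeps only the directory name, not the full local path.
        tar.add(model_dir, arcname=model_dir.name)
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "model_name"
    (root / "1").mkdir(parents=True)
    (root / "1" / "model.onnx").write_bytes(b"")  # placeholder, not a real model
    (root / "config.pbtxt").write_text('name: "model_name"\n')
    names = package_model(root, Path(tmp) / "model_name.tar.gz")
    print(sorted(names))
```

Every member name starts with `model_name/`, which matches the `name` field in `config.pbtxt` as the note above requires.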

In [None]:
import boto3
import torch
from pathlib import Path
from typing import Union, List
from jinja2 import Environment, FileSystemLoader
import sagemaker
from sagemaker.processing import ScriptProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
import shutil
import subprocess
import tarfile
import transformers
from transformers import AutoTokenizer, AutoModel
from transformers.onnx import FeaturesManager
import time
import timm

In [None]:
onnx_path = Path("onnx-models")
onnx_path.mkdir(exist_ok=True)

In [None]:
onnx_model_path = Path(f"onnx-models/{model_name}")
onnx_model_path.mkdir(exist_ok=True)
exported_onnx_path, onnx_config = model_utils.export_onnx_nlp(
    model=model, tokenizer=tokenizer, save_path=onnx_model_path
)
exported_onnx_path

In [None]:
onnx_model_path = Path(f"triton-serve-pt/{model_name}/1")
onnx_model_path.mkdir(parents=True, exist_ok=True)
print(exported_onnx_path)

%cp {exported_onnx_path} {onnx_model_path}
!cd {onnx_model_path} && rm -rf ".ipynb_checkpoints"
%ls -alrt {onnx_model_path}


# onnx_model_path = model_utils.export_pt_jit(model, list(example_input.values()), pytorch_model_path) #export jit compiled model to specified directory

### Use this to test with 2 output tensors

In [None]:
triton_inputs = [
    {"name": input_name, "data_type": "TYPE_INT64", "dims": "[512]"} for input_name in example_input
]
# Output names in the exported ONNX graph: "last_hidden_state" and a second output named "1550"
triton_outputs = [
    {
        "name": "last_hidden_state",
        "data_type": "TYPE_FP32",
        "dims": "[-1, 768]",
    },
    {
        "name": "1550",
        "data_type": "TYPE_FP32",
        "dims": "[768]",
    },
]

In [None]:
triton_config_path = model_utils.generate_triton_config(
    platform="onnx",
    triton_inputs=triton_inputs,
    triton_outputs=triton_outputs,
    save_path=Path(f"triton-serve-pt/{model_name}"),  # onnx_model_path
    template_path="config_templates",
)
triton_config_path

In [None]:
#!cat {onnx_model_path}/config.pbtxt
print(triton_config_path)
%cat {triton_config_path}

In [None]:
model_name

In [None]:
onnx_model_path

In [None]:
%%time
tar_file_name = "roberta-large-v1.tar.gz"

!cd triton-serve-pt && rm -f {tar_file_name}
!cd triton-serve-pt/roberta-large && rm -rf ".ipynb_checkpoints"
!cd triton-serve-pt && tar czvf {tar_file_name} roberta-large
#!cp triton-serve-pt/roberta-large-v1.tar.gz triton-server-pt

In [None]:
initial_model_path = sess.upload_data(
    f"./triton-serve-pt/{tar_file_name}",
    bucket=bucket,
    key_prefix=f"{prefix}/{model_name}",
)
print(f"final model path = {initial_model_path}")

mme_path = f"s3://{bucket}/{prefix}/{model_name}/"

In [None]:
initial_model_path

In [None]:
tar_file_name

### Create the SageMaker endpoint for ONNX

In [None]:
example_input = tokenizer(
    "This is a sample",
    padding="max_length",
    max_length=max_seq_len,
    return_tensors="pt",
)
example_input["input_ids"].shape

In [None]:
from utils.endpoint_utils import (
    create_endpoint,
    delete_endpoint,
    get_instance_utilization,
    run_load_test,
)

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    f"{account_id_map[region]}.dkr.ecr.{region}.{base}" + "/sagemaker-tritonserver:22.10-py3"
)
print(mme_triton_image_uri)
instance_type = "ml.g4dn.4xlarge"  # 'ml.g4dn.2xlarge' --

In [None]:
print(mme_path)

In [None]:
print(initial_model_path)
print(mme_path)
!aws s3 ls {mme_path}

In [None]:
container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": mme_path,
    "Mode": "MultiModel",
    "Environment": {
        #'SAGEMAKER_PROGRAM' : 'inference.py',
        #'SAGEMAKER_SUBMIT_DIRECTORY' : 'code',
        # "SAGEMAKER_TRITON_BATCH_SIZE": "1",
        # "SAGEMAKER_TRITON_MAX_BATCH_DELAY": "1000",
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "16777216000",  # "16777216", #"16777216000",
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "10485760",
        # "SAGEMAKER_TRITON_LOG_VERBOSE":"false"
    },
}
container

In [None]:
model_name = model_name.replace("_", "-")

engine = "onnx"

sm_model_name = f"{model_name}-{engine}-gpu-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

endpoint_config_name = f"{model_name}-{engine}-gpu-" + time.strftime(
    "%Y-%m-%d-%H-%M-%S", time.gmtime()
)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
            # "VolumeSizeInGB" : 256
        }
    ],
)
print("ebs volume NOT added")

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

endpoint_name = f"{model_name}-{engine}-gpu-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

print(sm_model_name, endpoint_config_name, endpoint_name)
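
The polling loop above waits indefinitely and does not distinguish a failed endpoint from a healthy one. Below is a hedged sketch of a variant with a timeout and an explicit failure check. `wait_for_endpoint` is a hypothetical helper, not part of `utils.endpoint_utils`; it takes the describe call as a function so it can be exercised without AWS, and the demo at the end uses a fake describe function.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[[], dict],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `describe()` until EndpointStatus leaves "Creating" or the timeout expires."""
    waited = 0.0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still Creating after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe()["EndpointStatus"]
    if status != "InService":
        raise RuntimeError(f"Endpoint ended in status {status}")
    return status


# Demo with a fake describe function; no AWS calls are made and no time is spent sleeping.
_fake_statuses = iter(["Creating", "Creating", "InService"])
print(wait_for_endpoint(lambda: {"EndpointStatus": next(_fake_statuses)}, sleep=lambda s: None))
# prints InService
```

In the notebook this could be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be the simpler choice when no custom logging is needed.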

In [None]:
# sm_model_name, endpoint_config_name, endpoint_name = create_endpoint(sm_client, model_name, role, container, instance_type, "onnx")
# print(sm_model_name, endpoint_config_name, endpoint_name)

In [None]:
runtime_sm_client

#### Invoke the endpoint

In [None]:
max_seq_len

In [None]:
# Make an inference request to load the model into memory.
# Tokenize without return_tensors so the inputs are plain Python lists that can be serialized to JSON.

example_input_nopt = tokenizer("This is a sample", padding="max_length", max_length=max_seq_len)

print(tar_file_name)
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": [1, max_seq_len],
            "datatype": "INT64",
            "data": example_input_nopt["input_ids"],
        },
        {
            "name": "attention_mask",
            "shape": [1, max_seq_len],
            "datatype": "INT64",
            "data": example_input_nopt["attention_mask"],
        },
    ]
}
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel=f"{tar_file_name}",
    # TargetModel=f"roberta-large-v0.tar.gz",
)
response
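
The payload above is written out by hand for a single sequence. As a sketch of how the same JSON request body could be built for one or more sequences, the hypothetical helper `build_payload` below flattens each input row by row and sets the `shape` field to `[batch, seq_len]`. It assumes every input is INT64, as in this notebook, and that the model config allows the chosen batch size.

```python
import json
from typing import Dict, List


def build_payload(encoded: Dict[str, List[List[int]]]) -> str:
    """Build a JSON inference request body from per-input lists of token rows."""
    inputs = []
    for name, rows in encoded.items():
        seq_len = len(rows[0])
        if any(len(row) != seq_len for row in rows):
            raise ValueError(f"All rows of {name!r} must have the same length")
        inputs.append(
            {
                "name": name,
                "shape": [len(rows), seq_len],
                "datatype": "INT64",
                # Flatten row by row so the data matches the declared shape.
                "data": [token for row in rows for token in row],
            }
        )
    return json.dumps({"inputs": inputs})


# Tiny made-up token IDs, for illustration only.
body = build_payload(
    {
        "input_ids": [[0, 713, 2], [0, 42, 2]],
        "attention_mask": [[1, 1, 1], [1, 1, 1]],
    }
)
print(json.loads(body)["inputs"][0]["shape"])
# prints [2, 3]
```

With the tokenizer from this notebook, a single-sequence call would look like `build_payload({k: [example_input_nopt[k]] for k in ("input_ids", "attention_mask")})`.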

In [None]:
resp_bin = response["Body"].read().decode("utf8")

In [None]:
import numpy as np

# The response has the form
# {"outputs": [{"name": "1550", "datatype": "FP32", "shape": [1, 768], "data": [0.0013, ...]}, ...]}
for data in json.loads(resp_bin)["outputs"]:
    # Restore the tensor shape from the flat data list.
    dat_1 = np.array(data["data"]).reshape(data["shape"])
    print(f"Output {data['name']} received with shape {dat_1.shape}")
# data_resp = json.loads(resp_bin)['outputs'][0]['data']
# np.array(data_resp).shape
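
When debugging, it can help to check that each output's flat `data` list really has as many elements as its `shape` implies before reshaping it. The sketch below does this with the standard library only. `output_shapes` is a hypothetical helper, and the response body in the demo is a tiny made-up example in the same JSON format, not real endpoint output.

```python
import json
import math
from typing import Dict, List


def output_shapes(response_body: str) -> Dict[str, List[int]]:
    """Map each output name to its shape, checking the element count on the way."""
    shapes = {}
    for out in json.loads(response_body)["outputs"]:
        shape = list(out["shape"])
        expected = math.prod(shape)
        if len(out["data"]) != expected:
            raise ValueError(
                f"Output {out['name']!r}: got {len(out['data'])} values, expected {expected}"
            )
        shapes[out["name"]] = shape
    return shapes


# Made-up response body for illustration.
fake_body = json.dumps(
    {
        "outputs": [
            {"name": "last_hidden_state", "datatype": "FP32", "shape": [1, 2, 3], "data": [0.0] * 6},
            {"name": "1550", "datatype": "FP32", "shape": [1, 3], "data": [0.1, 0.2, 0.3]},
        ]
    }
)
print(output_shapes(fake_body))
# prints {'last_hidden_state': [1, 2, 3], '1550': [1, 3]}
```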

### Run load tests

Uncomment the cells below to run the load tests.

In [None]:
# import boto3

# max_model_in_memory = 10

# # make copies of the model under the multi-model S3 prefix

# s3_client = boto3.client("s3")
# for i in range(max_model_in_memory):
#     # upload_fileobj expects an object key, not a full s3:// URI
#     key = f"{prefix}/{model_name}/{model_name}-v{i}.tar.gz"
#     with open(f"./triton-serve-pt/{tar_file_name}", "rb") as f:
#         s3_client.upload_fileobj(f, bucket, key)
#     print(f"uploaded s3://{bucket}/{key}")

In [None]:
# models_loaded = 0
# memory_utilization_threshold = 0.9
# memory_utilization_history = []

# max_model_in_memory = 10

# import time

# while models_loaded < max_model_in_memory:

#     # make an inference request to load the model into memory
#     response = runtime_sm_client.invoke_endpoint(
#             EndpointName=endpoint_name,
#             ContentType="application/octet-stream",
#             Body=json.dumps(payload),
#             TargetModel=f"{model_name}-v{models_loaded}.tar.gz",
#         )

#     models_loaded+=1


#     print(f"loaded {models_loaded} models")

In [None]:
# Delete the endpoint, the endpoint config, and the model
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|nlp|realtime|triton|multi-model|onnx-roberta-backend|roberta_onnx_triton.ipynb)
