# Triton on SageMaker: trace a RoBERTa model and serve it with the Triton PyTorch backend


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

---


Amazon SageMaker is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

Now, NVIDIA Triton Inference Server can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

This notebook was tested on SageMaker Studio with `ml.g4dn.xlarge`, which comes with 1 GPU, and with `ml.m5.large`, which is a CPU-only instance.

### Contents

* Introduction to NVIDIA Triton Server
* Set up the environment
* Basic: RoBERTa model
    * PyTorch: JIT trace the model and create a TorchScript model
    * PyTorch: Test the JIT traced model
    * PyTorch: Package the model files and upload them to S3
    * PyTorch: Create the SageMaker endpoint
    * PyTorch: Run inference
    * PyTorch: Inspect the predictions
    * PyTorch: Terminate the endpoint and clean up artifacts


### Introduction to NVIDIA Triton Server

NVIDIA Triton Inference Server was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.

Some key features of Triton are:

* Support for Multiple frameworks: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats.
* Model pipelines: Triton model ensemble represents a pipeline of one or more models or pre/post-processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.
* Concurrent model execution: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.
* Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
* Diverse CPUs and GPUs: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.

Note: This initial release of NVIDIA Triton on SageMaker will only support a single model. Future releases will have multi-model support. A minimal `config.pbtxt` configuration file is required in the model artifacts. This release doesn't support inferring the model config automatically.

### Set up the environment

This section installs the dependencies required to package the model and run inference against Triton server.

It also defines the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

The purpose of this notebook is to show how to take a PyTorch NLP model (RoBERTa), trace it into a TorchScript model, and serve that model with Triton's PyTorch backend.

The other option is to use the Python backend, but with it we lose some of the performance gained by compiling to a native format.




In [None]:
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

In [None]:
!pip install transformers[torch]

In [None]:
!pip install nvidia-pyindex -qU
!pip install tritonclient[http] -qU

!pip install -qU pip awscli boto3 sagemaker transformers

### Prepare RoBERTa base for Triton



In [None]:
!mkdir -p triton-serve-pt/roberta-traced
!mkdir -p triton-serve-pt/roberta-traced/1


!cd triton-serve-pt/roberta-traced/1 && rm -rf ".ipynb_checkpoints"
!cd triton-serve-pt/roberta-traced && rm -rf ".ipynb_checkpoints"
!cd triton-serve-pt && rm -rf ".ipynb_checkpoints"

In [None]:
!ls -alrt triton-serve-pt/roberta-traced/1

In [None]:
%%writefile triton-serve-pt/roberta-traced/config.pbtxt
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [512]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT32
    dims: [512]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [512, 768]
  },
  {
    name: "1634__1"
    data_type: TYPE_FP32
    dims: [768]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 32
}

### Package layout for Triton server

**Note**: Amazon SageMaker expects the model tarball to have a top-level directory named after the model. The `config.pbtxt` in this notebook does not set a `name`, so Triton takes the model name from the directory, `roberta-traced`. Below is the model directory structure:

```
roberta-traced
├── 1
│   └── model.pt
└── config.pbtxt
```
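The layout above can be checked before uploading. The following is a minimal sketch using only the standard library; the helper name `check_triton_tarball` and its parameters are hypothetical and not part of SageMaker or Triton. It assumes a gzip-compressed tarball with one version directory.

```python
import tarfile


def check_triton_tarball(tar_path, model_name, version="1", model_file="model.pt"):
    """Return a list of layout problems found in the tarball (empty if none)."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = {member.name.rstrip("/") for member in tar.getmembers()}

    problems = []
    # Every entry should live under the single top-level model directory.
    stray = sorted(
        n for n in names if n != model_name and not n.startswith(model_name + "/")
    )
    if stray:
        problems.append(f"entries outside '{model_name}/': {stray}")
    # The config file and the versioned model file must both be present.
    for required in (
        f"{model_name}/config.pbtxt",
        f"{model_name}/{version}/{model_file}",
    ):
        if required not in names:
            problems.append(f"missing {required}")
    return problems
```

Once the tarball has been created later in this notebook, `check_triton_tarball("triton-serve-pt/roberta-traced-v1.tar.gz", "roberta-traced")` should return an empty list if the layout is as expected.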

**Use the same tokenizer to generate the test inputs as the one used when tracing the model.**

### Create the RoBERTa model in TorchScript mode (.pt model)

Use the pretrained model and set the `torchscript=True` flag when loading it.

In [None]:
# Only the Auto* classes are needed for RoBERTa
from transformers import AutoModel, AutoTokenizer

import torch

### Run a simple test for RoBERTa base

We run several tests:

* First, tokenize and then detokenize to make sure the values match.
* Then run predictions with the original model.
* Then run predictions with the traced model.
* Finally, check that the two sets of predictions match.
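The last step, checking that the two sets of predictions match, can be sketched without any framework dependency. This is an illustrative sketch only: the helpers `max_abs_diff` and `outputs_match` and the tolerance of `1e-4` are assumptions, and the inputs are assumed to be nested Python lists (for example, tensors converted with `.tolist()`). When working with tensors directly, `torch.allclose` serves the same purpose.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equally shaped nested lists."""
    a_is_seq = isinstance(a, (list, tuple))
    b_is_seq = isinstance(b, (list, tuple))
    if a_is_seq != b_is_seq:
        raise ValueError("shape mismatch")
    if a_is_seq:
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        # Recurse into each pair of sub-lists; an empty list contributes 0.0.
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def outputs_match(a, b, atol=1e-4):
    """True when every element of `a` is within `atol` of the matching element of `b`."""
    return max_abs_diff(a, b) <= atol
```

A small tolerance is used rather than exact equality because a traced model can differ from the original in the last few floating-point digits.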

### Prepare some dummy inputs for tracing

In [None]:
# Tokenize the input text (RoBERTa uses <s> and </s>, not BERT's [CLS] and [SEP])
tokenizer = AutoTokenizer.from_pretrained("roberta-large")

text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)
print(f"RoBERTa:Tokenized:Text={tokenized_text}:::")

# Mask one of the input tokens with RoBERTa's own mask token
masked_index = 8
tokenized_text[masked_index] = tokenizer.mask_token
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(f"RoBERTa:indexed_tokens:={indexed_tokens}::")

# Segment IDs: RoBERTa does not use token types, so every position is 0
segments_ids = [0] * len(indexed_tokens)

# Create the dummy input tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

In [None]:
# Load RoBERTa and trace it to TorchScript
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

# AutoModel loads the base encoder without a task-specific head.
# torchscript=True makes the model return tuples, which tracing requires.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # or "roberta-large"
model = AutoModel.from_pretrained("roberta-base", torchscript=True)  # or "roberta-large"
model = model.eval()

seq_len = 512

# Tokenize a sample sentence to use as the example input for tracing (kept on CPU)
text = "Replace me by any text you'd like."
dummy_inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=seq_len,
    padding=True,
    truncation=True,
)
print(dummy_inputs.keys())

# Create the trace. The positional order matches config.pbtxt:
# INPUT__0 = input_ids, INPUT__1 = attention_mask
traced_model = torch.jit.trace(model, [dummy_inputs["input_ids"], dummy_inputs["attention_mask"]])

# Save the traced model into version directory 1 of the Triton model repository
torch.jit.save(traced_model, "./triton-serve-pt/roberta-traced/1/model.pt")

print("Saved {}".format(traced_model))

In [None]:
model

#### Test different ways of calling the tokenizer

In [None]:
tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    padding="max_length",
    max_length=64,
)

In [None]:
encoded_tokens = tokenizer.encode_plus(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    add_special_tokens=True,  # Add '<s>' and '</s>'
    max_length=512,
    padding="max_length",  # Pad every sentence to max_length
    truncation=True,  # Truncate anything longer than max_length
)
# encoded_tokens

### Test the Hugging Face model and then the traced model locally

In [None]:
import torch

encoded_input = tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    return_tensors="pt",
    add_special_tokens=True,  # Add '<s>' and '</s>'
    max_length=512,
    padding="max_length",  # Pad every sentence to max_length
    truncation=True,  # Truncate anything longer than max_length
)
# Run the original (untraced) model.
# With return_dict=True it returns a BaseModelOutputWithPoolingAndCrossAttentions.
unscripted_output = model(
    **encoded_input,
    return_dict=True,
    output_attentions=False,
    output_hidden_states=False,
)

# Shape of the last hidden state: (batch, sequence length, hidden size)
unscripted_output[0].shape

#### Now test the traced model, which returns a tuple of tensors

In [None]:
import torch

encoded_input = tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    return_tensors="pt",
    add_special_tokens=True,  # Add '<s>' and '</s>'
    max_length=512,
    padding="max_length",  # Pad every sentence to max_length
    truncation=True,  # Truncate anything longer than max_length
)
# The traced model takes positional tensors: input IDs first, then the attention mask
unscripted_traced_output = traced_model(
    encoded_input["input_ids"], encoded_input["attention_mask"]
)

# Output 0 is the last hidden state, output 1 is the pooled output
print(unscripted_traced_output[0].shape)
print(unscripted_traced_output[1].shape)

In [None]:
unscripted_output[0]

### Package the traced model and `config.pbtxt` into a tarball and upload it



In [None]:
tar_file_name = "roberta-traced-v1.tar.gz"

In [None]:
!cd triton-serve-pt && tar --exclude=".git" --exclude=".gitattributes" --exclude="model.tar.gz" --exclude="*.bin" --exclude "*.tar" --exclude "*.ipynb_checkpoints"  -zcvf {tar_file_name} roberta-traced

**Upload the model tarball to S3**

In [None]:
import sagemaker
from sagemaker import get_execution_role, Session, image_uris
from sagemaker.utils import name_from_base
import boto3

region = boto3.Session().region_name
session = sagemaker.Session()
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
model_bucket = session.default_bucket()

In [None]:
s3_model_path_triton = sagemaker.s3.S3Uploader().upload(
    local_path=f"./triton-serve-pt/{tar_file_name}",
    desired_s3_uri=f"s3://{model_bucket}/mme-roberta-benchmark/roberta-large",
    sagemaker_session=session,
)
s3_mme_model_path = f"s3://{model_bucket}/mme-roberta-benchmark/roberta-large/"
print(s3_model_path_triton)
print(s3_mme_model_path)

#### Deploy the model on a Triton endpoint

**Triton Image download and sagemaker variables**

In [None]:
from sagemaker import get_execution_role, Session, image_uris
import boto3
from sagemaker.utils import name_from_base

region = boto3.Session().region_name
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.10-py3".format(
    account_id=account_id_map[region], region=region, base=base
)
print(triton_image_uri)

**Model creation**

In [None]:
endpoint_name = name_from_base("roberta-base")
print(endpoint_name)

container_p5 = {
    "Image": triton_image_uri,
    "ModelDataUrl": s3_mme_model_path,
    "Mode": "MultiModel",
    "Environment": {
        #'SAGEMAKER_PROGRAM' : 'inference.py',
        #'SAGEMAKER_SUBMIT_DIRECTORY' : 'code',
        #'SAGEMAKER_TRITON_DEFAULT_MODEL_NAME': 'bert-uc',
        # "SAGEMAKER_TRITON_BATCH_SIZE": "16",
        "SAGEMAKER_TRITON_MAX_BATCH_DELAY": "1000",
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "16777216000",  # "16777216000",
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "104857600",
    },
}
create_model_response = sm_client.create_model(
    ModelName=endpoint_name, ExecutionRoleArn=role, PrimaryContainer=container_p5
)
print(create_model_response)

**Endpoint config**

In [None]:
# Endpoint configuration: one GPU instance that receives all traffic

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.xlarge",  # "ml.g5.8xlarge",  # "ml.g4dn.xlarge", "ml.g4dn.4xlarge"
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": endpoint_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

**Endpoint**
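Endpoint creation takes several minutes, and the polling loop in this section keeps waiting for as long as the status is `Creating`. The following is a minimal sketch of the same wait with a timeout and an explicit failure check. The helper `wait_for_endpoint`, its parameters, and the default timeout of 1800 seconds are assumptions made for illustration; `describe` is any callable that returns a dictionary with an `EndpointStatus` key, as `describe_endpoint` does.

```python
import time


def wait_for_endpoint(describe, timeout_s=1800, poll_s=60, sleep=time.sleep):
    """Poll `describe()` until the status leaves "Creating" or the timeout expires."""
    waited = 0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_s:
            raise TimeoutError(f"endpoint still Creating after {waited}s")
        sleep(poll_s)
        waited += poll_s
        status = describe()["EndpointStatus"]
    # Any terminal status other than InService (for example Failed) is an error.
    if status != "InService":
        raise RuntimeError(f"endpoint ended in status {status}")
    return status
```

In this notebook it could be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))` after the endpoint has been created. The `sleep` parameter exists only so the helper can be exercised without real delays.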

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("SINGLE:Model:endpoint:Triton:Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Single:model:triton:Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Single:model:triton:Status: " + status)

**Now invoke the endpoint**

* The first option sends the request as JSON.
* The second option uses the native binary format with a JSON header.
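With the JSON option, each output tensor in the response is expected to arrive with a `name`, a `shape`, and its values in `data` as a flat, row-major list. The following is a minimal sketch for turning those flat lists back into nested lists. The helpers `reshape_flat` and `outputs_by_name` are hypothetical, and the response layout is an assumption based on the `outputs` key read in the JSON invocation cell.

```python
def reshape_flat(data, shape):
    """Rebuild nested lists of the given shape from a flat, row-major list."""
    if not shape:
        raise ValueError("shape must have at least one dimension")
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError(f"expected {expected} values, got {len(data)}")

    def build(offset, dims):
        # Innermost dimension: slice the values directly.
        if len(dims) == 1:
            return list(data[offset : offset + dims[0]])
        # Outer dimension: each child covers `stride` consecutive values.
        stride = 1
        for d in dims[1:]:
            stride *= d
        return [build(offset + i * stride, dims[1:]) for i in range(dims[0])]

    return build(0, list(shape))


def outputs_by_name(response_json):
    """Map each output name to its values reshaped according to its declared shape."""
    return {
        out["name"]: reshape_flat(out["data"], out["shape"])
        for out in response_json["outputs"]
    }
```

For example, `outputs_by_name(output)["OUTPUT__0"]` would give the hidden states as nested lists, where `output` is the parsed JSON response from the JSON invocation cell.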

In [None]:
import tritonclient.http as httpclient
import numpy as np
from tritonclient.utils import np_to_triton_dtype


def tokenize_text(text, enc, max_length=512):
    """Tokenize `text` with the tokenizer `enc`, padding to `max_length`."""
    print(f"Tokenize:text::max_length={max_length}::Tokenizer={enc}")
    encoded_text = enc(text, padding="max_length", max_length=max_length)
    return encoded_text["input_ids"], encoded_text["attention_mask"]


def prepare_tensor(name, input_d):
    """Wrap a NumPy array in a Triton InferInput with matching shape and dtype."""
    tensor = httpclient.InferInput(name, input_d.shape, np_to_triton_dtype(input_d.dtype))
    tensor.set_data_from_numpy(input_d)
    return tensor


def prepare_roberta_2_inputs(input0, attention_0):
    """Build the binary request body and JSON header length for the two model inputs."""
    input0_data = np.array(input0, dtype=np.int32)  # convert PyTorch tensors to NumPy
    input_attention_data = np.array(attention_0, dtype=np.int32)

    inputs = [  # names must match config.pbtxt
        prepare_tensor("INPUT__0", input0_data),
        prepare_tensor("INPUT__1", input_attention_data),
    ]

    outputs = []
    outputs.append(httpclient.InferRequestedOutput("OUTPUT__0", binary_data=True))
    outputs.append(httpclient.InferRequestedOutput("1634__1", binary_data=True))
    (
        request_body,
        header_length,
    ) = httpclient.InferenceServerClient.generate_request_body(inputs, outputs=outputs)
    return request_body, header_length


def get_decoded_text(tensors_tokens, enc):
    """Decode the first sequence of token IDs back to text with the tokenizer `enc`."""
    return_text = enc.batch_decode(tensors_tokens)[0]
    return return_text

**Run the JSON invocation**

In [None]:
%%time

import json

max_seq_length = 512
text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
print(
    f"Using the tokenizer created above when tracing the model: {tokenizer}::max_seq_length={max_seq_length}"
)

input_ids, attention_mask = tokenize_text(text_triton, tokenizer, max_length=max_seq_length)

payload = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, max_seq_length],
            "datatype": "INT32",
            "data": input_ids,
        },
        {
            "name": "INPUT__1",
            "shape": [1, max_seq_length],
            "datatype": "INT32",
            "data": attention_mask,
        },
    ]
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel=tar_file_name,  # "roberta-traced-v1.tar.gz", relative to the S3 model prefix
)

output = json.loads(response["Body"].read().decode("utf8"))

print(output.keys())

In [None]:
output["outputs"][0]["data"]

**Invoke using the Binary Format**
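In the binary format, the body starts with a JSON header followed by the raw tensor bytes, and the length of that header is carried in the content type after `json-header-size=`. The following is a minimal sketch of reading that length and splitting a body. The helpers `parse_json_header_size` and `split_binary_body` are hypothetical, and the sketch assumes the response content type uses the same prefix as the request.

```python
import json

PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def parse_json_header_size(content_type):
    """Extract the JSON header length from a binary+json content type string."""
    if not content_type.startswith(PREFIX):
        raise ValueError(f"unexpected content type: {content_type}")
    return int(content_type[len(PREFIX):])


def split_binary_body(body, header_size):
    """Split a raw body into its parsed JSON header and the trailing tensor bytes."""
    header = json.loads(body[:header_size].decode("utf-8"))
    return header, body[header_size:]
```

The header length obtained this way is the value that `httpclient.InferenceServerClient.parse_response_body` accepts through its `header_length` argument, which then decodes the tensors themselves.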

In [None]:
encoded_input = tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    return_tensors="pt",
    add_special_tokens=True,  # Add '<s>' and '</s>'
    max_length=512,
    padding="max_length",  # Pad every sentence to max_length
    truncation=True,  # Truncate anything longer than max_length
)
# The traced model expects both the input IDs and the attention mask
input_ids = encoded_input["input_ids"]
attention_mask = encoded_input["attention_mask"]

triton_request_body, triton_header_length = prepare_roberta_2_inputs(input_ids, attention_mask)

In [None]:
response_binary = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        triton_header_length
    ),
    Body=triton_request_body,
    TargetModel=tar_file_name,
)
print(response_binary)

# # Parse json header size length from the response
# header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
# header_length_str = response_binary["ContentType"][len(header_length_prefix) :]
# # print(response_binary["Body"].read())

# try:
#     # Read response body
#     result = httpclient.InferenceServerClient.parse_response_body(
#         response_binary["Body"].read()  # , header_length=int(header_length_str)
#     )
#     output0_data = result.as_numpy("1634__1")
#     output1_data = result.as_numpy("OUTPUT__0")
#     print(output0_data)
#     print(output1_data)
# except:
#     print("Error in parsing response -- ")

### Stress test

In [None]:
# model_name = "roberta-base"
# print(s3_model_path_triton)
# print(s3_mme_model_path)
# print(model_name)

In [None]:
# text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
# print(
#     f"Leverage the Tokenizer={tokenizer}::max_seq_length={max_seq_length}:: create above when creating the model "
# )

# input_ids, attention_mask = tokenize_text(
#     text_triton, tokenizer, max_length=max_seq_length
# )

# payload = {
#     "inputs": [
#         {
#             "name": "INPUT__0",
#             "shape": [1, max_seq_length],
#             "datatype": "INT32",
#             "data": input_ids,
#         },
#         {
#             "name": "INPUT__1",
#             "shape": [1, max_seq_length],
#             "datatype": "INT32",
#             "data": attention_mask,
#         },
#     ]
# }

In [None]:
# models_loaded = 0
# memory_utilization_threshold = 0.9
# memory_utilization_history = []
# max_models_test = 10
# while models_loaded < max_models_test:
#     # make a copy of the model
#     !aws s3 cp {s3_model_path_triton} {s3_mme_model_path}/{model_name}-v{models_loaded}.tar.gz

#     # make a inference request to load model into memory
#     response = runtime_sm_client.invoke_endpoint(
#         EndpointName=endpoint_name,
#         ContentType="application/octet-stream",
#         Body=json.dumps(payload),
#         TargetModel=f"{model_name}-v{models_loaded}.tar.gz",
#     )

#     models_loaded+=1


#     print(f"loaded {models_loaded} models with memory utilzation of {memory_utilization:.2%}")

### Clean up

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm_client.delete_model(ModelName=endpoint_name)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|nlp|realtime|triton|multi-model|pytorch-traced-roberta|roberta_traced_triton.ipynb)
