# Pre-processing and XGBoost model inference pipeline with NVIDIA Triton Inference Server on Amazon SageMaker using a Multi-Model Endpoint (MME)

With the 22.05 version release of [NVIDIA Triton](https://github.com/triton-inference-server/server/) container image on SageMaker you can now use Triton's Forest Inference Library (FIL) backend to easily serve tree based ML models like XGBoost for high-performance CPU and GPU inference in SageMaker. Using Triton's FIL backend allows you to benefit from performance optimizations like dynamic batching and concurrent execution which help maximize the utilization of GPU and CPU, further lowering the cost of inference. The multi-framework support provided by NVIDIA Triton allows you to seamlessly deploy tree-based ML models alongside deep learning models for fast, unified inference pipelines.

Machine learning applications are complex and often require data pre-processing. In this notebook, we deploy a tree-based ML model like XGBoost using the FIL backend in Triton on a SageMaker endpoint, and we also show how to implement a Python-based data pre-processing pipeline for the model using Triton's ensemble feature. This lets us send raw data from the client side and have both data pre-processing and model inference happen inside the Triton SageMaker endpoint, which avoids extra round trips between the client and the server.

## To Run This Notebook Please Select `Python 3 (Data Science)` Kernel from the Kernel Dropdown menu

**Note:** This notebook was tested with the `Python 3 (Data Science)` kernel on an Amazon SageMaker Studio instance of type `ml.c5.xlarge`.

Alternative Studio instance types are `ml.c5.large` and `ml.c5.2xlarge`.

## Forest Inference Library (FIL)

RAPIDS Forest Inference Library (FIL) is a library to provide high-performance inference for tree-based models. Here are some important FIL features:

* Supports XGBoost, LightGBM, cuML RandomForest, and Scikit Learn Random Forest
* No conversion needed for XGBoost and LightGBM. SKLearn or cuML pickle models need to be converted to Treelite's binary checkpoint format 
* SKLearn Random Forest is supported for single-output regression and multi-class classification
* Both CPU and GPU are supported

Below we show a benchmark highlighting FIL's throughput performance against CPU XGBoost.

"fil-benchmark"

## Triton FIL Backend
FIL is available as a backend in Triton with features to allow for serving XGBoost, LightGBM and RandomForest models both on CPU and GPU with high performance. Here are some important features of the FIL Backend:

* **Shapley Value Support (GPU)**: GPU Shapley Values are supported for Model Explainability
* **Categorical Feature Support**: Models trained on categorical features fully supported.
* **CPU Optimizations**: Optimized CPU mode offers faster execution than native XGBoost.

To learn more about FIL Backend's features please see the [FAQ Notebook](https://github.com/triton-inference-server/fil_backend/blob/fea-faq_nb/notebooks/faq/FAQs.ipynb) and [Triton FIL Backend GitHub.](https://github.com/triton-inference-server/fil_backend/tree/main)

## Triton Model Ensemble Feature
Triton Inference Server greatly simplifies the deployment of AI models at scale in production. Triton Server comes with a convenient solution that simplifies building pre-processing and post-processing pipelines. Triton Server platform provides the ensemble scheduler, which is responsible for pipelining models participating in the inference process while ensuring efficiency and optimizing throughput. Using ensemble models can avoid the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to Triton.

"triton-ensemble"

In this notebook we show how to use the ensemble feature to build a pipeline of data preprocessing followed by XGBoost model inference. You can extrapolate from it to add custom postprocessing to the pipeline.

## Set up Environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton server. We also define the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

In [None]:
!pip install nvidia-pyindex
!pip install tritonclient[http]

In [None]:
import boto3
import json
import sagemaker
import time
import os
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import subprocess

sess = boto3.Session()
sm = sess.client("sagemaker")

default_bucket="" #Enter just the bucket name i.e do not include the s3:// prefix
assert default_bucket != "", "Please enter the bucket you wish to use for this lab. Enter without s3://"
sagemaker_session = sagemaker.Session(default_bucket=default_bucket)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
s3_bucket = sagemaker_session.default_bucket()
print(f"Will use S3 bucket '{s3_bucket}' for storing all resources related to this notebook")
print(f"Using Role '{role}'")

##NOTE : Make sure to have SageMakerFullAccess permission to the above IAM Role

proc=subprocess.Popen('cat /opt/ml/metadata/resource-metadata.json', shell=True, stdout=subprocess.PIPE, )
studio_user_profile_output=json.loads(proc.communicate()[0].decode('utf-8'))['UserProfileName'] # retrieve current Studio User Profile name
studio_user_profile_output

In [None]:
account_id_map = {
 "us-east-1": "785573368785",
 "us-east-2": "007439368137",
 "us-west-1": "710691900526",
 "us-west-2": "301217895009",
 "eu-west-1": "802834080501",
 "eu-west-2": "205493899709",
 "eu-west-3": "254080097072",
 "eu-north-1": "601324751636",
 "eu-south-1": "966458181534",
 "eu-central-1": "746233611703",
 "ap-east-1": "110948597952",
 "ap-south-1": "763008648453",
 "ap-northeast-1": "941853720454",
 "ap-northeast-2": "151534178276",
 "ap-southeast-1": "324986816169",
 "ap-southeast-2": "355873309152",
 "cn-northwest-1": "474822919863",
 "cn-north-1": "472730292857",
 "sa-east-1": "756306329178",
 "ca-central-1": "464438896020",
 "me-south-1": "836785723513",
 "af-south-1": "774647643957",
}

region = boto3.Session().region_name
if region not in account_id_map:
 # raising a bare string is a TypeError in Python 3; raise a real exception
 raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"

triton_image_uri = (
 "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.10-py3".format(
 account_id=account_id_map[region], region=region, base=base
 )
)
triton_image_uri

## Set up pre-processing with Triton Python Backend

We will be using Triton's [Python Backend](https://github.com/triton-inference-server/python_backend) to perform some tabular data preprocessing (categorical encoding) at inference time for raw data requests coming into the server. For more information on the preprocessing that was done during training, take a look at the training notebook [here](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/fil_ensemble/1_prep_rapids_train_xgb.ipynb).


The Python backend enables pre-processing, post-processing and any other custom logic to be implemented in Python and served with Triton.

Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. We have already set up a model for Python data preprocessing called `preprocessing` in the `model_repository`.

"preprocessing-model"

Now Triton has specific requirements for model repository layout. Within the top-level model repository directory each model has its own sub-directory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric sub-directory representing a version of the model. Here that is `1` representing version 1 of our python preprocessing model. Each model is executed by a specific backend so within each version sub-directory there must be the model artifact required by that backend. Here, we are using the Python backend and it requires the python file you are serving to be called `model.py` and the file needs to implement [certain functions](https://github.com/triton-inference-server/python_backend#usage). If we were using a PyTorch backend a `model.pt` file would be required and so on. For more details on naming conventions for model files please see the [model files doc](https://github.com/triton-inference-server/server/blob/185253ce225a0b012e73cade5c9a948ef9e75abd/docs/model_repository.md#model-files).


Our [model.py](model_repository/preprocessing/1/model.py) file implements all the tabular data preprocessing logic needed to convert raw data into features that can be fed into our XGBoost model.

Every Triton model must also provide a `config.pbtxt` file describing the model configuration. To learn more about the config settings please see [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) doc. Our `config.pbtxt` specifies the backend as `python` and specifies all the input columns for raw data along with preprocessed output that consists of 15 features. We also specify we want to run this python preprocessing model on the CPU.
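The layout rules above (one sub-directory per model, a `config.pbtxt` in each, and at least one numeric version sub-directory) can be checked before packaging. The following is a minimal sketch, not part of the original notebook: `check_model_repository` is a hypothetical helper, and it only checks the two rules stated here, not backend-specific artifact names such as `model.py`.

```python
from pathlib import Path
import tempfile


def check_model_repository(repo_root):
    """Return a list of layout problems found under repo_root (empty if none)."""
    problems = []
    for model_dir in sorted(p for p in Path(repo_root).iterdir() if p.is_dir()):
        if model_dir.name.startswith("."):
            continue  # skip hidden folders such as .ipynb_checkpoints
        # Every model needs a config.pbtxt next to its version directories.
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        # At least one sub-directory must have a purely numeric name (the version).
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
    return problems


# Build a throwaway repository to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "preprocessing"
    (good / "1").mkdir(parents=True)
    (good / "config.pbtxt").write_text('backend: "python"\n')
    (good / "1" / "model.py").write_text("# placeholder\n")
    (Path(tmp) / "fil").mkdir()  # deliberately incomplete model folder
    demo_problems = check_model_repository(tmp)

print(demo_problems)
```

In the notebook you could call `check_model_repository("model_cpu_repository")` just before creating the tar file; an empty list means both rules hold for every model folder.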

### Create Conda Env for Preprocessing Dependencies

The Python backend in Triton requires us to use conda environment for any additional dependencies. In this case we are using the Python backend to do preprocessing of the raw data before feeding it into the XGBoost model being run in FIL Backend. Even though we originally used RAPIDS cuDF and cuML to do the data preprocessing here we use Pandas and Scikit-learn as preprocessing dependencies for inference time. We do this for three reasons. 
* Firstly, to show how to create conda environment for your dependencies and how to package it in [format expected](https://github.com/triton-inference-server/python_backend#2-packaging-the-conda-environment) by Triton's Python backend. 
* Secondly, by showing the preprocessing model running in Python backend on the CPU while the XGBoost runs on the GPU in FIL Backend we illustrate how each model in Triton's ensemble pipeline can run on different framework backend as well as different hardware configurations
* Thirdly, it highlights how the RAPIDS libraries (cuDF, cuML) are compatible with their CPU counterparts (Pandas, Scikit-learn). For example this way we get to show how LabelEncoders created in cuML can be used in Scikit-learn and vice-versa

We follow the instructions from the [Triton documentation](https://github.com/triton-inference-server/python_backend#2-packaging-the-conda-environment) for packaging preprocessing dependencies (scikit-learn and pandas) to be used in the python backend as conda env tar file. The bash script [create_prep_env.sh](./create_prep_env.sh) creates the conda environment tar file and then we move it into the preprocessing model directory.

In [None]:
!bash create_prep_env.sh
time.sleep(5)
!cp preprocessing_env.tar.gz model_cpu_repository/preprocessing/

In [None]:
time.sleep(5)
!cp preprocessing_env.tar.gz model_gpu_repository/preprocessinggpu/

After creating the tar file from the conda environment and placing it in model folder, you need to tell Python backend to use that environment for your model. We do this by including the lines below in the model `config.pbtxt` file:

```
parameters: {
 key: "EXECUTION_ENV_PATH",
 value: {string_value: "$$TRITON_MODEL_DIRECTORY/preprocessing_env.tar.gz"}
}
```

Here, `$$TRITON_MODEL_DIRECTORY` helps provide environment path relative to the model folder in model repository and is resolved to `$pwd/model_repository/preprocessing`. Finally `preprocessing_env.tar.gz` is the name we gave to our conda env file. 

### Set up Label Encoders

We also move the label encoders we had serialized earlier into `preprocessing` model folder so that we can use them to encode raw data categorical features at inference time.

In [None]:
!cp label_encoders.pkl model_cpu_repository/preprocessing/1/
!cp label_encoders.pkl model_gpu_repository/preprocessinggpu/1/
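To make concrete what a label encoder does at inference time, here is a minimal pure-Python sketch. It is an illustration only: `SimpleLabelEncoder` is a hypothetical stand-in rather than the cuML or Scikit-learn class stored in `label_encoders.pkl`, the category strings are made-up examples, and mapping unseen categories to `-1` is an assumption, not necessarily what the notebook's `model.py` does.

```python
class SimpleLabelEncoder:
    """Hypothetical stand-in for a fitted label encoder (not the notebook's class)."""

    def __init__(self, classes, unknown_index=-1):
        # Sorted, de-duplicated categories; the position in this list is the code.
        self.classes_ = sorted(set(classes))
        self._index = {value: i for i, value in enumerate(self.classes_)}
        self.unknown_index = unknown_index

    def transform(self, values):
        # Categories never seen at training time map to unknown_index (an assumption).
        return [self._index.get(v, self.unknown_index) for v in values]


# Illustrative category values only.
encoder = SimpleLabelEncoder(["Swipe Transaction", "Chip Transaction", "Online Transaction"])
codes = encoder.transform(["Online Transaction", "Chip Transaction", "Tap"])
print(codes)  # [1, 0, -1]
```

The important property is that the same fitted mapping is applied at training and at inference time, which is why the serialized encoders are copied into the model folder.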

## Set up Tree-based ML Model for FIL Backend

Next, we set up the model directory for tree-based ML model like XGBoost which will be using FIL Backend.

The expected layout for model directory is similar to the one we showed above:

"fil-model"

Here, `fil` is the name of the model. We can give it a different name like xgboost if we want to. `1` is the version sub-directory which contains the model artifact, in this case it's the `xgboost.json` model that we saved at the end of [first notebook](1_prep_rapids_train_xgb.ipynb). Let's create this expected layout.

In [None]:
# move saved xgboost model into the fil model directories (CPU and GPU)
!mkdir -p model_cpu_repository/fil/1
!mkdir -p model_gpu_repository/filgpu/1
!cp xgboost.json model_cpu_repository/fil/1/
!cp xgboost.json model_gpu_repository/filgpu/1/

Finally, we need a configuration file `config.pbtxt` describing the model configuration for the tree-based ML model so that the FIL Backend in Triton knows how to serve it.

### Create Config File for FIL Backend Model

You can read about all generic Triton configuration options [here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md) and about configuration options specific to the FIL backend [here](https://github.com/triton-inference-server/fil_backend#configuration), but we will focus on just a few of the most common and relevant options in this example. Below are general descriptions of these options:

* **max_batch_size:** The maximum batch size that can be passed to this model. In general, the only limit on the size of batches passed to a FIL backend is the memory available with which to process them. 
* **input:** Options in this section tell Triton the number of features to expect for each input sample.
* **output:** Options in this section tell Triton how many output values there will be for each sample. If the "predict_proba" option (described further on) is set to true, then a probability value will be returned for each class. Otherwise, a single value will be returned indicating the class predicted for the given sample.
* **instance_group:** This determines how many instances of this model will be created and whether they will use the GPU or CPU.
* **model_type:** A string indicating what format the model is in ("xgboost_json" in this example, but "xgboost", "lightgbm", and "tl_checkpoint" are valid formats as well).
* **predict_proba:** If set to true, probability values will be returned for each class rather than just a class prediction.
* **output_class:** True for classification models, false for regression models.
* **threshold:** A score threshold for determining classification. When output_class is set to true, this must be provided, although it will not be used if predict_proba is also set to true.
* **storage_type:** In general, using "AUTO" for this setting should meet most usecases. If "AUTO" storage is selected, FIL will load the model using either a sparse or dense representation based on the approximate size of the model. In some cases, you may want to explicitly set this to "SPARSE" in order to reduce the memory footprint of large models.

Here we have 15 input features and 2 classes (FRAUD, NOT FRAUD) that we are doing classification for in our XGBoost model. Based on this information, let's set up the FIL Backend configuration files for our tree-based model: the first cell below writes the configuration for serving on CPU, the second for serving on GPU.
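It helps to see how a memory budget bounds `max_batch_size`. The sketch below assumes every input and output value is an FP32 number (4 bytes) and that each sample carries `num_features` inputs plus at most `num_outputs` outputs; `max_batch_size_for` is a hypothetical helper name used only for illustration.

```python
def max_batch_size_for(memory_bytes, num_features, num_outputs, bytes_per_value=4):
    """Largest batch whose FP32 inputs and outputs fit in memory_bytes."""
    bytes_per_sample = (num_features + num_outputs) * bytes_per_value
    return memory_bytes // bytes_per_sample


budget = 60_000_000  # stays below the 67_108_864-byte default memory pool
batch = max_batch_size_for(budget, num_features=15, num_outputs=2)
print(batch)  # 882352
```

With 15 features and 2 classes, each sample needs (15 + 2) * 4 = 68 bytes, so a 60,000,000-byte budget allows 882,352 samples per batch.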

In [None]:
USE_GPU = False
FIL_MODEL_DIR = "./model_cpu_repository/fil"

# Maximum size in bytes for input and output arrays. If you are
# using Triton 21.11 or higher, all memory allocations will make
# use of Triton's memory pool, which has a default size of
# 67_108_864 bytes
MAX_MEMORY_BYTES = 60_000_000
NUM_FEATURES = 15
NUM_CLASSES = 2
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

IS_CLASSIFIER = True
model_format = "xgboost_json"

# Select deployment hardware (GPU or CPU)
if USE_GPU:
 instance_kind = "KIND_GPU"
else:
 instance_kind = "KIND_CPU"

# whether the model is doing classification or regression
if IS_CLASSIFIER:
 classifier_string = "true"
else:
 classifier_string = "false"

# whether to predict probabilities or not
predict_proba = False

if predict_proba:
 predict_proba_string = "true"
else:
 predict_proba_string = "false"

config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [ 
 {{ 
 name: "input__0"
 data_type: TYPE_FP32
 dims: [ {NUM_FEATURES} ] 
 }} 
]
output [
 {{
 name: "output__0"
 data_type: TYPE_FP32
 dims: [ 1 ]
 }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
 {{
 key: "model_type"
 value: {{ string_value: "{model_format}" }}
 }},
 {{
 key: "predict_proba"
 value: {{ string_value: "{predict_proba_string}" }}
 }},
 {{
 key: "output_class"
 value: {{ string_value: "{classifier_string}" }}
 }},
 {{
 key: "threshold"
 value: {{ string_value: "0.5" }}
 }},
 {{
 key: "storage_type"
 value: {{ string_value: "AUTO" }}
 }},
 {{
 key: "use_experimental_optimizations"
 value: {{ string_value: "true" }}
 }}
]

dynamic_batching {{}}"""

config_path = os.path.join(FIL_MODEL_DIR, "config.pbtxt")
with open(config_path, "w") as file_:
 file_.write(config_text)

In [None]:
USE_GPU = True
FIL_MODEL_DIR = "./model_gpu_repository/filgpu"

# Maximum size in bytes for input and output arrays. If you are
# using Triton 21.11 or higher, all memory allocations will make
# use of Triton's memory pool, which has a default size of
# 67_108_864 bytes
MAX_MEMORY_BYTES = 60_000_000
NUM_FEATURES = 15
NUM_CLASSES = 2
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

IS_CLASSIFIER = True
model_format = "xgboost_json"

# Select deployment hardware (GPU or CPU)
if USE_GPU:
 instance_kind = "KIND_GPU"
else:
 instance_kind = "KIND_CPU"

# whether the model is doing classification or regression
if IS_CLASSIFIER:
 classifier_string = "true"
else:
 classifier_string = "false"

# whether to predict probabilities or not
predict_proba = False

if predict_proba:
 predict_proba_string = "true"
else:
 predict_proba_string = "false"

config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [ 
 {{ 
 name: "input__0"
 data_type: TYPE_FP32
 dims: [ {NUM_FEATURES} ] 
 }} 
]
output [
 {{
 name: "output__0"
 data_type: TYPE_FP32
 dims: [ 1 ]
 }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
 {{
 key: "model_type"
 value: {{ string_value: "{model_format}" }}
 }},
 {{
 key: "predict_proba"
 value: {{ string_value: "{predict_proba_string}" }}
 }},
 {{
 key: "output_class"
 value: {{ string_value: "{classifier_string}" }}
 }},
 {{
 key: "threshold"
 value: {{ string_value: "0.5" }}
 }},
 {{
 key: "storage_type"
 value: {{ string_value: "AUTO" }}
 }}
]

dynamic_batching {{}}"""

config_path = os.path.join(FIL_MODEL_DIR, "config.pbtxt")
with open(config_path, "w") as file_:
 file_.write(config_text)

## Set up Inference Pipeline of Data Preprocessing Python Backend and FIL Backend using Ensemble

Now we are ready to set up the inference pipeline for data preprocessing and tree-based model inference using an [ensemble model](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#ensemble-models). An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. Here we use the ensemble model to build a pipeline of Data Preprocessing in Python backend followed by XGBoost in FIL Backend. 

The expected layout for `ensemble` model directory is similar to the ones we showed above:

"ensemble-model"

In [None]:
# create model version directory for ensemble CPU model
!mkdir -p model_cpu_repository/ensemble/1
# create model version directory for ensemble GPU model
!mkdir -p model_gpu_repository/ensemble/1

We created the ensemble model's [config.pbtxt](model_repository/ensemble/config.pbtxt) following the guidance on [ensemble doc](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md#ensemble-models). Importantly, we need to set up the ensemble scheduler in config.pbtxt which specifies the dataflow between models within the ensemble. The ensemble scheduler collects the output tensors in each step, provides them as input tensors for other steps according to the specification.
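To illustrate the dataflow the ensemble scheduler describes, the sketch below renders an `ensemble_scheduling` section as text. It is not the notebook's actual `config.pbtxt`: the helper `scheduling_block`, the preprocessing output name `OUTPUT`, and the intermediate tensor name `preprocessed_features` are assumptions, and only two raw input columns are listed for brevity. The model names, the FIL tensor names `input__0` and `output__0`, and the ensemble output `predictions` follow the rest of this notebook.

```python
def scheduling_block(steps):
    """Render an ensemble_scheduling section from (model, input_map, output_map) steps."""
    rendered = []
    for model_name, input_map, output_map in steps:
        lines = [f'      model_name: "{model_name}"', "      model_version: -1"]
        # input_map: key is the step model's input name, value is the ensemble tensor.
        for key, value in input_map.items():
            lines.append(f'      input_map {{ key: "{key}" value: "{value}" }}')
        # output_map: key is the step model's output name, value is the ensemble tensor.
        for key, value in output_map.items():
            lines.append(f'      output_map {{ key: "{key}" value: "{value}" }}')
        rendered.append("    {\n" + "\n".join(lines) + "\n    }")
    return "ensemble_scheduling {\n  step [\n" + ",\n".join(rendered) + "\n  ]\n}"


raw_columns = ["Amount", "MCC"]  # subset of the raw input columns, for brevity
steps = [
    # Step 1: raw columns go into the Python preprocessing model.
    ("preprocessing", {c: c for c in raw_columns}, {"OUTPUT": "preprocessed_features"}),
    # Step 2: the preprocessed features feed the FIL model, which yields the final output.
    ("fil", {"input__0": "preprocessed_features"}, {"output__0": "predictions"}),
]
ensemble_text = scheduling_block(steps)
print(ensemble_text)
```

The key point is that the output of one step and the input of the next refer to the same ensemble tensor name, which is how the scheduler connects the two models without returning intermediate tensors to the client.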

## Package model repository and upload to S3

Finally, we end up with the following model repository directory structure, containing a Python preprocessing model and its dependencies along with XGBoost FIL model, and the model ensemble.

"model-repo"

We will package this up as `model.tar.gz` for uploading it to S3.

### Create and Upload the model package for CPU-based instance (optimized for CPU)

In [None]:
!tar --exclude='.ipynb_checkpoints' -czvf model-cpu.tar.gz -C model_cpu_repository .


If you do not have access to the default bucket, you can upload the model tarball to the bucket and prefix of your choice using the following code:

```
model_uri="s3:////model.tar.gz"

!aws s3 cp model.tar.gz "$model_uri"
```

In [None]:
# This method will upload the model tar ball to the SageMaker default bucket for the account in a prefix named as the User Profile for this Studio User. 

model_uri_cpu = sagemaker_session.upload_data(path="model-cpu.tar.gz", key_prefix=f"{studio_user_profile_output}/lab2")
print(model_uri_cpu)


### Create and Upload the model package for GPU-based instance (optimized for GPU)

In [None]:
!tar --exclude='.ipynb_checkpoints' -czvf model-gpu.tar.gz -C model_gpu_repository .

In [None]:
model_uri_gpu = sagemaker_session.upload_data(path="model-gpu.tar.gz", key_prefix=f"{studio_user_profile_output}/lab2") 
print(model_uri_gpu)

In [None]:
# Set the primary path for where all the models are stored on S3 bucket

model_location = f"s3://{s3_bucket}/{studio_user_profile_output}/lab2/"
model_location

## Create SageMaker Endpoint

We start off by creating a SageMaker model from the model repository we uploaded to S3 in the previous step.

In this step we also provide an additional Environment Variable `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` which specifies the name of the model to be loaded by Triton. **The value of this key should match the folder name in the model package uploaded to S3.** This variable is optional in case of a single model. In case of ensemble models, this **key has to be specified** for Triton to startup in SageMaker.

Additionally, customers can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` for optimizing the thread counts.

In [None]:
sm_model_name = f"{studio_user_profile_output}-lab2-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
 "Image": triton_image_uri,
 "ModelDataUrl": model_location,
 "Mode": "MultiModel",
 "Environment": {
 # "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": model_uri.rsplit('/')[-2], #m_name,
 #"SAGEMAKER_TRITON_LOG_VERBOSE": "true", #"200",
 #"SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE" : "20000000", #"1677721600", #"16777216000", "16777216"
 #"SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "1048576"

 }
}

create_model_response = sm.create_model(
 ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Using the model above, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint.

In [None]:
endpoint_config_name = f"{studio_user_profile_output}-lab2-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
 EndpointConfigName=endpoint_config_name,
 ProductionVariants=[
 {
 "InstanceType": "ml.g4dn.2xlarge",
 #"InstanceType": "ml.g4dn.xlarge",
 #"InstanceType": "ml.g4dn.4xlarge",
 #"InstanceType": "ml.g5.xlarge",
 "InitialVariantWeight": 1,
 "InitialInstanceCount": 1,
 "ModelName": sm_model_name,
 "VariantName": "AllTraffic",
 }
 ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the above endpoint configuration we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService once the deployment is successful.

In [None]:
endpoint_name = f"{studio_user_profile_output}-lab2-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm.create_endpoint(
 EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
waiter = sm.get_waiter("endpoint_in_service")
print("Waiting for endpoint to create...")
waiter.wait(EndpointName=endpoint_name)
resp = sm.describe_endpoint(EndpointName=endpoint_name)
print(f"Endpoint Status: {resp['EndpointStatus']}")

print("Arn: " + resp["EndpointArn"])
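The waiter above hides the polling loop. As a rough sketch of what such a loop looks like, the function below polls a describe call until the status is `InService` or `Failed`. The helper name `wait_for_endpoint`, its parameters, and the fake describe function are assumptions made for illustration; the built-in boto3 waiter remains the simpler choice in practice.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll until the endpoint leaves a transitional state; return the final status."""
    for _ in range(max_polls):
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status in ("InService", "Failed"):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} did not settle after {max_polls} polls")


# Stand-in for sm.describe_endpoint so the sketch runs without AWS access.
statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(statuses)}


final_status = wait_for_endpoint(
    fake_describe, "demo-endpoint", poll_seconds=0, sleep=lambda seconds: None
)
print(final_status)  # InService
```

With a real client the call would be `wait_for_endpoint(sm.describe_endpoint, endpoint_name)`.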

## Run Inference

Once we have the endpoint running we can use some sample raw data to do an inference using json as the payload format. For the inference request format, Triton uses the KFServing community standard [inference protocols.](https://github.com/triton-inference-server/server/blob/main/docs/protocol/README.md)

In [None]:
data_infer = pd.read_csv("data_infer.csv")
data_infer

In [None]:
STR_COLUMNS = [
 "Time",
 "Amount",
 "Zip",
 "MCC",
 "Merchant Name",
 "Use Chip",
 "Merchant City",
 "Merchant State",
 "Errors?",
]

batch_size = len(data_infer)

payload = {}
payload["inputs"] = []
data_dict = {}
for col_name in data_infer.columns:
    data_dict[col_name] = {}
    data_dict[col_name]["name"] = col_name
    if col_name in STR_COLUMNS:
        # string-like columns are sent as BYTES tensors
        data_dict[col_name]["data"] = data_infer[col_name].astype(str).tolist()
        data_dict[col_name]["datatype"] = "BYTES"
    else:
        # everything else is sent as 32-bit floats
        data_dict[col_name]["data"] = data_infer[col_name].astype("float32").tolist()
        data_dict[col_name]["datatype"] = "FP32"
    data_dict[col_name]["shape"] = [batch_size, 1]
    payload["inputs"].append(data_dict[col_name])
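Before sending a JSON payload it can be useful to confirm that each input tensor is well formed. The following is a small sketch under stated assumptions: `check_payload` is a hypothetical helper, it expects `data` to be a flat list (as produced by `tolist()` above), and the example values are illustrative rather than taken from `data_infer.csv`.

```python
def check_payload(payload):
    """Return a list of problems in a JSON inference payload (empty if none)."""
    problems = []
    for tensor in payload.get("inputs", []):
        name = tensor.get("name", "<unnamed>")
        missing = {"name", "shape", "datatype", "data"} - tensor.keys()
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
            continue
        # The number of values must equal the product of the shape dimensions.
        expected = 1
        for dim in tensor["shape"]:
            expected *= dim
        if len(tensor["data"]) != expected:
            problems.append(
                f"{name}: shape {tensor['shape']} needs {expected} values, "
                f"got {len(tensor['data'])}"
            )
    return problems


# Illustrative payload: the second tensor is one value short.
example = {
    "inputs": [
        {"name": "Amount", "shape": [2, 1], "datatype": "BYTES", "data": ["$1.00", "$2.50"]},
        {"name": "MCC", "shape": [2, 1], "datatype": "BYTES", "data": ["5411"]},
    ]
}
payload_problems = check_payload(example)
print(payload_problems)
```

Calling `check_payload(payload)` on the payload built above should return an empty list when every column has `batch_size` values.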

### Call Model A (optimized for CPU)

In [None]:
import time
start = time.time()
response = client.invoke_endpoint(
 EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload), TargetModel="model-cpu.tar.gz"
)
end = time.time()
print(end - start)

response_body = json.loads(response["Body"].read().decode("utf8"))
predictions = response_body["outputs"][0]["data"]

CLASS_LABELS = ["NOT FRAUD", "FRAUD"]
predictions = [CLASS_LABELS[int(idx)] for idx in predictions]
print(predictions)

### Call Model B (optimized for GPU)

In [None]:
import time
start = time.time()
response = client.invoke_endpoint(
 EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload),TargetModel="model-gpu.tar.gz"
)
end = time.time()
print(end - start)

response_body = json.loads(response["Body"].read().decode("utf8"))
predictions = response_body["outputs"][0]["data"]

CLASS_LABELS = ["NOT FRAUD", "FRAUD"]
predictions = [CLASS_LABELS[int(idx)] for idx in predictions]
print(predictions)

### Binary + Json Payload

We can also use binary+json as the payload format to get better performance for the inference call. The specification of this format is provided [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md).

**Note:** With the `binary+json` format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header `application/vnd.sagemaker-triton.binary+json;json-header-size={}`.

Please note, this is different from using `Inference-Header-Content-Length` header on a stand-alone Triton server since custom headers are not allowed in SageMaker.
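The header mechanics can be sketched without any Triton dependency. In the binary+json layout the body starts with the JSON header, followed directly by the raw tensor bytes, and the Content-Type carries the header's byte length. The helper names below (`make_content_type`, `parse_header_size`, `split_body`) are hypothetical and the two trailing bytes are dummy data; the sketch only illustrates the framing, not the full specification.

```python
import json

PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def make_content_type(header_length):
    """Build the SageMaker Content-Type that announces the JSON header size."""
    return f"{PREFIX}{int(header_length)}"


def parse_header_size(content_type):
    """Extract the JSON header size, failing loudly on an unexpected Content-Type."""
    if not content_type.startswith(PREFIX):
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    return int(content_type[len(PREFIX):])


def split_body(body, header_length):
    """Split a binary+json body into its JSON header bytes and raw tensor bytes."""
    return body[:header_length], body[header_length:]


# Dummy body: a JSON header followed by two placeholder bytes of "tensor data".
header_bytes = json.dumps({"inputs": []}).encode("utf-8")
body = header_bytes + b"\x00\x01"
content_type = make_content_type(len(header_bytes))

size = parse_header_size(content_type)
parsed_header, raw_bytes = split_body(body, size)
print(json.loads(parsed_header), raw_bytes)
```

Checking the prefix before slicing guards against silently misparsing a response that came back with a plain JSON Content-Type, for example an error message.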

The [tritonclient](https://github.com/triton-inference-server/client) package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference.

In [None]:
import tritonclient.http as httpclient


def get_sample_data_binary(data, output_name):
    inputs = []
    outputs = []
    batch_size = len(data)
    for col_name in data.columns:
        if col_name in STR_COLUMNS:
            # string-like columns are sent as BYTES tensors
            np_data = np.expand_dims(data[col_name], axis=1).astype("object")
            infer_input = httpclient.InferInput(col_name, [batch_size, 1], "BYTES")
        else:
            np_data = np.expand_dims(data[col_name], axis=1).astype("float32")
            infer_input = httpclient.InferInput(col_name, [batch_size, 1], "FP32")
        infer_input.set_data_from_numpy(np_data, binary_data=True)
        inputs.append(infer_input)
    outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    # returns the serialized request and the length of its JSON header
    request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
        inputs, outputs=outputs
    )
    return request_body, header_length

### Call Model A (optimized for CPU)

In [None]:
import time

output_name = "predictions"
request_body, header_length = get_sample_data_binary(data_infer, output_name)
start = time.time()
response = client.invoke_endpoint(
 EndpointName=endpoint_name,
 ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
 header_length
 ),
 Body=request_body,
 TargetModel="model-cpu.tar.gz"
)
end = time.time()
print(end - start)

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]

# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
 response["Body"].read(), header_length=int(header_length_str)
)
predictions = result.as_numpy(output_name)
CLASS_LABELS = ["NOT FRAUD", "FRAUD"]
predictions = [CLASS_LABELS[int(idx)] for idx in predictions]
print(predictions)

### Call Model B (optimized for GPU)

In [None]:
import time

output_name = "predictions"
request_body, header_length = get_sample_data_binary(data_infer, output_name)
start = time.time()
response = client.invoke_endpoint(
 EndpointName=endpoint_name,
 ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
 header_length
 ),
 Body=request_body,
 TargetModel="model-gpu.tar.gz"
)
end = time.time()
print(end - start)

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]

# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
 response["Body"].read(), header_length=int(header_length_str)
)
predictions = result.as_numpy(output_name)
CLASS_LABELS = ["NOT FRAUD", "FRAUD"]
predictions = [CLASS_LABELS[int(idx)] for idx in predictions]
print(predictions)

## Terminate endpoint and clean up artifacts

In [None]:
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)

## Conclusion

In this lab, we used Triton Inference Server to build an ensemble that performs Python preprocessing followed by XGBoost inference, showing how fraud can be detected using Triton and its Python and FIL backends. You can use this example as a guide for creating your own ensembles with the other backends that Triton provides, covering use cases that require scale and performance with hardware acceleration.