# Deploy a TensorFlow Model using NVIDIA Triton on SageMaker

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

---

Amazon SageMaker is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

Now, NVIDIA Triton Inference Server can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

This example will showcase how to deploy a pre-trained TensorFlow model using NVIDIA Triton on SageMaker.

The model used here was pre-trained on the MNIST dataset. See this [Deploy a Trained TensorFlow V2 Model example](https://github.com/aws/amazon-sagemaker-examples/blob/1c5da8941bc933b176b56a93157073d5645d8cdf/frameworks/tensorflow/get_started_mnist_deploy.ipynb) for the training of the model. 

## Contents
1. [Introduction to NVIDIA Triton Server](#Introduction-to-NVIDIA-Triton-Server)
1. [Set up the environment](#Set-up-the-environment)
1. [Transform TensorFlow Model structure](#Transform-TensorFlow-Model-structure)
   1. [Inspect the model using the saved_model_cli](#Inspect-the-model-using-the-saved_model_cli)
   1. [Create the config.pbtxt](#Create-the-config.pbtxt)
   1. [Create the tar ball in the required Triton structure](#Create-the-tar-ball-in-the-required-Triton-structure)
1. [Deploy model to SageMaker Endpoint](#Deploy-model-to-SageMaker-Endpoint)
1. [Clean up](#Clean-up)

## Introduction to NVIDIA Triton Server

[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.

Some key features of Triton are:
* **Support for multiple frameworks**: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree-based models, and OpenVINO model formats.
* **Model pipelines**: Triton model ensemble represents a pipeline of one or more models or pre/post processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.
* **Concurrent model execution**: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.
* **Dynamic batching**: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
* **Diverse CPUs and GPUs**: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.

**Note**: This initial release of NVIDIA Triton on SageMaker will only support a single model. Future releases will have multi-model support. A minimal `config.pbtxt` configuration file is **required** in the model artifacts. This release doesn't support inferring the model config automatically.




## Set up the environment

Download the pre-trained TensorFlow model from a public S3 bucket.
Also define the IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton ECR image.


In [None]:
%%time
import boto3

# use the region-specific saved model object
region = boto3.Session().region_name
!mkdir -p models/SavedModel/
s3 = boto3.client("s3")
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/image/MNIST/model/tensorflow-training-2020-11-20-23-57-13-077/model.tar.gz",
    "models/SavedModel/model.tar.gz",
)

In [None]:
import sagemaker

sm_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket_name = sm_session.default_bucket()

In [None]:
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

In [None]:
if region not in account_id_map:
    # Raising a plain string is a TypeError in Python 3, so raise a real exception.
    raise ValueError(f"Unsupported region: {region}")

In [None]:
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:21.08-py3".format(
    account_id=account_id_map[region], region=region, base=base
)
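
The URI construction above can also be wrapped in a small helper so that an unsupported Region fails early with a clear message. This is only a sketch: `build_triton_image_uri` and `ACCOUNT_IDS` are hypothetical names, and the two-entry map is a stand-in for the full `account_id_map` defined earlier.

```python
# Hypothetical helper; the two account IDs are copied from account_id_map above.
ACCOUNT_IDS = {"us-west-2": "301217895009", "cn-north-1": "472730292857"}


def build_triton_image_uri(region, account_ids=ACCOUNT_IDS, tag="21.08-py3"):
    """Return the ECR URI of the SageMaker Triton image for `region`."""
    if region not in account_ids:
        raise ValueError(f"Unsupported region: {region}")
    # China Regions use a different DNS suffix.
    base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_ids[region]}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:{tag}"


print(build_triton_image_uri("us-west-2"))
```

Keeping the tag as a parameter makes it easier to switch to a different Triton container release, provided that release is published in the target Region.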

In [None]:
!tar -xf models/SavedModel/model.tar.gz -C models/SavedModel/

## Transform TensorFlow Model structure


The model that we want to deploy currently has the following structure:

```
00000000
├── saved_model.pb
├── assets/
└── variables/
    ├── variables.data-00000-of-00001
    └── variables.index
```
For Triton, the model needs to have the following structure:
```
<model-name>
├── config.pbtxt
└── 1/
    └── model.savedmodel
        ├── saved_model.pb
        ├── assets/
        └── variables/
            ├── variables.data-00000-of-00001
            └── variables.index
```

In [None]:
! mkdir -p models/TritonModel/MNIST/1
! cp models/SavedModel/00000000 --recursive ./models/TritonModel/MNIST/1/model.savedmodel/
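
After copying, the directory can be compared against the layout Triton expects. The sketch below is one possible check, not part of the original workflow; `missing_triton_files` is a hypothetical helper name, and the list of expected paths is taken from the tree shown above. At this point `config.pbtxt` would still be reported as missing, because it is written in a later step.

```python
from pathlib import Path


def missing_triton_files(model_dir):
    """Return the expected paths (relative to model_dir) that do not exist."""
    model_dir = Path(model_dir)
    expected = [
        "config.pbtxt",
        "1/model.savedmodel/saved_model.pb",
        "1/model.savedmodel/variables/variables.index",
    ]
    return [rel for rel in expected if not (model_dir / rel).exists()]


# An empty list means the layout matches the tree shown above.
print(missing_triton_files("models/TritonModel/MNIST"))
```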

### Inspect the model using the `saved_model_cli`

In order to create the `config.pbtxt` we need to confirm the model inputs and outputs (Signature).
We use the `saved_model_cli` to inspect the model and take note of the input and output shape.

In [None]:
!saved_model_cli show --all --dir models/SavedModel/00000000

### Create the config.pbtxt 

Triton requires a [Model Configuration file](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) known as a `config.pbtxt`. We create one below in the correct directory.

The `name` in the `config.pbtxt` must match the name of our model directory. In this case we will use `MNIST`.


In [None]:
%%writefile models/TritonModel/MNIST/config.pbtxt
name: "MNIST"
platform: "tensorflow_savedmodel"
max_batch_size: 0

instance_group {
  count: 1
  kind: KIND_GPU
}

dynamic_batching {

}

input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [-1, 28, 28, 1]
  }
]
output [
  {
    name: "output_1"
    data_type: TYPE_FP32
    dims: [-1, 10]
  }
]
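
In the configuration above, a `-1` entry in `dims` marks a dimension of variable size (here, the batch dimension). A request shape can be checked against those `dims` before it is sent. The following is a minimal sketch of that comparison; `shape_matches_dims` is a hypothetical helper and is not part of Triton or SageMaker.

```python
def shape_matches_dims(shape, dims):
    """Return True if `shape` is compatible with config dims, where -1 matches any size."""
    if len(shape) != len(dims):
        return False
    return all(d == -1 or d == s for s, d in zip(shape, dims))


INPUT_DIMS = [-1, 28, 28, 1]  # copied from the config above

print(shape_matches_dims([4, 28, 28, 1], INPUT_DIMS))  # True
print(shape_matches_dims([4, 28, 28], INPUT_DIMS))  # False
```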

In [None]:
model_location = f"s3://{bucket_name}/TritonModel/TritonModel.tar.gz"

### Create the tar ball in the required Triton structure

In [None]:
%%sh
cd models/TritonModel/ 
tar -czvf TritonModel.tar.gz MNIST/
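
The same archive can be built from Python with the standard `tarfile` module, which also makes it easy to confirm that the model directory sits at the root of the archive. This is a sketch under the assumption that the paths match the ones used in this notebook; `make_triton_tarball` and `top_level_entries` are hypothetical helper names.

```python
import tarfile
from pathlib import Path


def make_triton_tarball(model_dir, tar_path):
    """Archive model_dir so that its directory name is the top-level entry."""
    model_dir = Path(model_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname keeps "MNIST/..." at the archive root, without parent folders.
        tar.add(str(model_dir), arcname=model_dir.name)
    return tar_path


def top_level_entries(tar_path):
    """Return the set of first path components stored in the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return {name.split("/")[0] for name in tar.getnames()}


# Example with this notebook's paths (not executed here):
# make_triton_tarball("models/TritonModel/MNIST", "models/TritonModel/TritonModel.tar.gz")
# top_level_entries("models/TritonModel/TritonModel.tar.gz")
```

For this model, the only top-level entry should be `MNIST`, matching the `name` in `config.pbtxt`.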

### Upload the new tar ball containing the Triton model structure to S3

In [None]:
!aws s3 cp models/TritonModel/TritonModel.tar.gz $model_location

## Deploy model to SageMaker Endpoint
We start by creating a SageMaker model from the model files uploaded to S3 in the previous step.

In this step we also set the environment variable `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME`, which specifies the name of the model that Triton loads. Its value must match the folder name in the model package uploaded to S3. The variable is optional for a single model, but for ensemble models it must be set for Triton to start up in SageMaker.

Additionally, you can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` to tune the thread counts.
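
As an illustration of how these variables fit together, the sketch below builds the `env` dictionary for the model. The helper name `triton_env` is hypothetical, and the thread counts are arbitrary placeholders rather than tuned recommendations. Values are converted to strings because container environment variables are string-to-string mappings.

```python
def triton_env(model_name, thread_count=None, buffer_manager_thread_count=None):
    """Build the container environment dict; all values are strings."""
    env = {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": model_name}
    if thread_count is not None:
        env["SAGEMAKER_TRITON_THREAD_COUNT"] = str(thread_count)
    if buffer_manager_thread_count is not None:
        env["SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT"] = str(buffer_manager_thread_count)
    return env


# Placeholder values only; suitable counts depend on the instance type and workload.
print(triton_env("MNIST", thread_count=8, buffer_manager_thread_count=2))
```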

In [None]:
from sagemaker.model import Model

tensorflow_model = Model(
    model_data=model_location,
    role=role,
    env={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "MNIST"},
    image_uri=triton_image_uri,
)

In [None]:
from datetime import datetime

# %M is minutes; %m (month) was previously repeated in the minutes position.
date = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

endpoint_name = f"Triton-MNIST-{date}"

In [None]:
predictor = tensorflow_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=endpoint_name,
)

In [None]:
import numpy as np
import json

payload = {
    "inputs": [
        {
            "name": "input_1",
            "shape": [4, 28, 28, 1],
            "datatype": "FP32",
            "data": np.random.rand(4, 28, 28, 1).tolist(),
        }
    ]
}
runtime_sm_client = boto3.client("sagemaker-runtime")
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
)

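# The output tensor arrives as a flat list, so restore the (batch, 10) shape
# before taking the most likely class for each image.
predictions = json.loads(response["Body"].read())["outputs"][0]["data"]
predictions = np.array(predictions, dtype=np.float32).reshape(-1, 10)
predictions = np.argmax(predictions, axis=1)
predictions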
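
The response handling can be tried without a live endpoint by decoding a hand-written response body. The sketch below assumes the JSON response carries the output tensor as a flat, row-major `data` list with 10 class scores per image; `decode_predictions` is a hypothetical helper, and `fake_body` is made-up data standing in for `response["Body"].read()`.

```python
import json


def decode_predictions(response_body, num_classes=10):
    """Return one predicted class index per input row."""
    data = json.loads(response_body)["outputs"][0]["data"]
    # Split the flat list into rows of `num_classes` scores each.
    rows = [data[i:i + num_classes] for i in range(0, len(data), num_classes)]
    return [row.index(max(row)) for row in rows]


# Made-up scores for two images: the first peaks at class 3, the second at class 7.
row_a = [0.0] * 3 + [1.0] + [0.0] * 6
row_b = [0.0] * 7 + [1.0] + [0.0] * 2
fake_body = json.dumps(
    {"outputs": [{"name": "output_1", "shape": [2, 10], "datatype": "FP32", "data": row_a + row_b}]}
)
print(decode_predictions(fake_body))  # [3, 7]
```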

## Clean up
We strongly recommend deleting the real-time endpoint when you are finished with the example, so that it stops incurring cost.

In [None]:
sm_client = boto3.client("sagemaker")
# Delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)
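
Deleting the endpoint leaves the endpoint configuration and the SageMaker model in place. The sketch below removes all three; it is meant to be run instead of the deletion call above, since it looks up the endpoint before deleting it. `delete_endpoint_resources` is a hypothetical helper that takes the boto3 SageMaker client as an argument.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and models."""
    # Look up the related resource names before anything is deleted.
    config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)
    return config_name, model_names


# Usage with this notebook's variables (not executed here):
# delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)
```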

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-triton|TensorFlow|Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb)
