# Image classification at low latency with TensorFlow Serving on Amazon SageMaker


In this notebook, we walk through three ways of serving an image classification model with TensorFlow Serving (TFS) on SageMaker endpoints.

1. Default, with no custom inference script. This approach is adapted from https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/tensorflow_serving_container and expects preprocessing to run on the client side.
2. Custom inference script that preprocesses the input and calls TFS internally via REST. This allows custom preprocessing of the image byte stream, but is slower than option 3.
3. Custom inference script that preprocesses the input and calls TFS internally via gRPC. This allows the same custom preprocessing and, in the measurements in this notebook, reduces average latency by roughly 75% compared with option 2.

## Setup

First, we check which version of the SageMaker Python SDK is installed, and install the
additional Python packages this notebook needs.

In [1]:
import sagemaker

In [2]:
sagemaker.__version__

'2.33.0'

In [None]:
!pip install tensorflow==2.4.1 -U 

Next, we'll get the IAM execution role from the notebook environment, so that SageMaker can access resources in your AWS account later in the example.

In [None]:
from sagemaker import get_execution_role

sagemaker_role = get_execution_role()

## Download and prepare a pre-trained model

The TensorFlow Serving Container works with any model stored in TensorFlow's [SavedModel format](https://www.tensorflow.org/guide/saved_model). This could be the output of your own training job or a model trained elsewhere. For this example, we will use a pre-trained version of the MobileNet V2 image classification model. There are two ways to retrieve the pre-trained model: 1) [TensorFlow Hub](https://tfhub.dev/) and 2) [Keras applications](https://keras.io/api/applications/mobilenet/#mobilenetv2-function). The differences between the two versions are discussed in [this Stack Overflow question](https://stackoverflow.com/questions/60251715/difference-between-keras-and-tensorflow-hub-version-of-mobilenetv2).
We will use option 2 and get the model from Keras applications.



In [6]:
import tensorflow as tf

In [42]:
tf.__version__

'2.4.1'

In [16]:
# Two options: get the model from TensorFlow Hub or from Keras applications.
# Differences: https://stackoverflow.com/questions/60251715/difference-between-keras-and-tensorflow-hub-version-of-mobilenetv2

# Option 1: TensorFlow Hub (outputs logits, so a softmax layer needs to be added)
# import tensorflow_hub as hub
# hub_url = 'https://tfhub.dev/google/imagenet/mobilenet_v2_140_224/classification/4'
# model = tf.keras.Sequential([
#     hub.KerasLayer(hub_url)
# ])
# model.build([None, 224, 224, 3])  # Batch input shape.


In [61]:
# Option 2: Keras applications
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
model = MobileNetV2()
# Export in SavedModel format; '1' is the model version directory
model.save('model/1/')

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5
INFO:tensorflow:Assets written to: model/1/assets




In [62]:
model_path = 'model/1/'

After exporting the model, we can inspect it using TensorFlow's ``saved_model_cli`` command. In the command output, you should see:

```
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
...
```

The command output should also show details of the model inputs and outputs.

In [None]:
!saved_model_cli show --all --dir {model_path}

In [None]:
model.summary()

Next we need to create a model archive file containing the exported model.

## Create a model archive file

SageMaker models need to be packaged in `.tar.gz` files. When your endpoint is provisioned, the files in the archive will be extracted and put in `/opt/ml/model/` on the endpoint. 

In [66]:
!tar -C "$PWD" -czf model.tar.gz model/
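As a cross-check on the layout, the same kind of archive can be built and listed with Python's standard `tarfile` module. This is a minimal sketch that uses placeholder files in a temporary directory rather than the real exported model; `build_model_archive` is a hypothetical helper, and the file names inside `model/1/` (`saved_model.pb`, `variables/`) are assumed here for illustration.

```python
import os
import tarfile
import tempfile


def build_model_archive(root_dir, archive_path, model_dir="model"):
    """Pack <root_dir>/<model_dir> into a .tar.gz with <model_dir>/ as the top-level folder."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(os.path.join(root_dir, model_dir), arcname=model_dir)
    # Re-open the archive and return its member names for inspection.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Placeholder layout standing in for the exported SavedModel (illustrative only).
workdir = tempfile.mkdtemp()
version_dir = os.path.join(workdir, "model", "1")
os.makedirs(os.path.join(version_dir, "variables"))
with open(os.path.join(version_dir, "saved_model.pb"), "wb") as f:
    f.write(b"placeholder")

members = build_model_archive(workdir, os.path.join(workdir, "model.tar.gz"))
print(members)
```

Listing the members before uploading is a cheap way to confirm that the version directory (`model/1/`) sits where the serving container is expected to look for it.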

## Upload the model archive file to S3

We now have a suitable model archive ready in our notebook. We need to upload it to S3 before we can create a SageMaker Model that uses it. We'll use the SageMaker Python SDK to handle the upload.

In [None]:
from sagemaker.session import Session

model_data = Session().upload_data(path='model.tar.gz', key_prefix='mobilenet')
print('model uploaded to: {}'.format(model_data))
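`upload_data` returns the S3 URI of the uploaded object. If the bucket and key are needed separately (for example, to pass to a `boto3` S3 call), the URI can be split with the standard library. This is a sketch only: `split_s3_uri` is a hypothetical helper and the URI below is a placeholder, not a real bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration.
bucket, key = split_s3_uri("s3://example-bucket/mobilenet/model.tar.gz")
print(bucket, key)
```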

## Create a SageMaker Model and Endpoint

Now that the model archive is in S3, we can create a Model and deploy it to an
Endpoint with a few lines of Python code:

In [69]:
from sagemaker.tensorflow.serving import Model

model = Model(model_data=model_data, role=sagemaker_role, framework_version='2.4.1')
predictor = model.deploy(initial_instance_count=1, instance_type='ml.g4dn.xlarge')

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


-----------------!

## Make predictions using the endpoint

The endpoint is now up and running, and ready to handle inference requests. The `deploy` call above returned a `predictor` object. The `predict` method of this object handles sending requests to the endpoint. It also automatically handles JSON serialization of our input arguments, and JSON deserialization of the prediction results.

We'll use these sample images:

<img src="kitten.jpg" align="left" style="padding: 8px;">
<img src="bee.jpg" style="padding: 8px;">

In [71]:
import numpy as np

In [28]:
from tensorflow.keras.preprocessing import image
from PIL import Image
HEIGHT = 224
WIDTH  = 224
def image_file_to_tensor(path):
    img = Image.open(path).convert('RGB')
    img = img.resize((WIDTH, HEIGHT))
    img_array = image.img_to_array(img)
    # The image is now an array of shape (224, 224, 3) (channels last).
    # Add a leading batch dimension so the shape becomes (1, 224, 224, 3).
    x = np.expand_dims(img_array, axis=0)
    instance = preprocess_input(x)
    return instance
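For MobileNetV2, `preprocess_input` rescales pixel values from the range [0, 255] to [-1, 1]. A custom inference script that does not import Keras would need to reproduce that step itself. The following is a minimal pure-Python sketch of the arithmetic; `scale_pixels` is a hypothetical helper, and a real script would more likely apply the same formula to a NumPy array.

```python
def scale_pixels(values):
    """Map pixel intensities in [0, 255] to [-1, 1] (divide by 127.5, subtract 1)."""
    return [v / 127.5 - 1.0 for v in values]


print(scale_pixels([0, 127.5, 255]))
```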

In [None]:
# read the image files into a tensor (numpy array)
kitten_image = image_file_to_tensor('kitten.jpg')
print(kitten_image.shape)
# get a prediction from the endpoint
# the image input is automatically converted to a JSON request.
# the JSON response from the endpoint is returned as a python dict
result = predictor.predict(kitten_image)
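The notebook does not show how to read `result`. The following is a minimal sketch, assuming the response follows the TensorFlow Serving REST layout `{"predictions": [[...]]}` with one score per class. The `sample_result` values are made up for illustration, and `top_k` is a hypothetical helper, not part of the SageMaker SDK.

```python
def top_k(result, k=3):
    """Return (class_index, score) pairs for the k highest scores of the first prediction."""
    scores = result["predictions"][0]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up response with 5 classes instead of the 1000 ImageNet classes.
sample_result = {"predictions": [[0.01, 0.70, 0.04, 0.20, 0.05]]}
print(top_k(sample_result, k=2))
```

With a real response, the class indices would still need to be mapped to ImageNet labels, for example with `decode_predictions` from `tensorflow.keras.applications.mobilenet_v2`.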

In [80]:
import time
results = []
for i in range(1,1000):
    start = time.time()
    kitten_image = image_file_to_tensor('kitten.jpg')
    predictor.predict(kitten_image)
    results.append((time.time() - start) * 1000)
print("\nPredictions for TF2 serving default: \n")
print('\nP95: ' + str(np.percentile(results, 95)) + ' ms\n')    
print('P90: ' + str(np.percentile(results, 90)) + ' ms\n')
print('Average: ' + str(np.average(results)) + ' ms\n')


Predictions for TF2 serving default: 


P95: 360.88467836380005 ms

P90: 355.2844762802124 ms

Average: 310.48285961151123 ms
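The timing loop above can be wrapped in a small reusable helper. This is a sketch, not the notebook's original code: `measure_latency_ms` and `percentile` are hypothetical names, `time.perf_counter` is used instead of `time.time` because it is a monotonic clock intended for interval timing, and the warm-up count is an arbitrary choice. The percentile uses linear interpolation between ranks, which is also the default behaviour of `np.percentile`.

```python
import time


def measure_latency_ms(call, iterations=1000, warmup=10):
    """Time `call()` repeatedly and return the per-call latencies in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not recorded
        call()
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        call()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def percentile(samples, q):
    """q-th percentile (0-100) with linear interpolation between neighbouring ranks."""
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Stand-in workload; with an endpoint this could be lambda: predictor.predict(kitten_image).
latencies = measure_latency_ms(lambda: sum(range(1000)), iterations=50, warmup=5)
print("P95: {:.3f} ms".format(percentile(latencies, 95)))
print("P50: {:.3f} ms".format(percentile(latencies, 50)))
```

Reporting the median (P50) separately from the mean makes it easier to spot a few slow outliers, which pull the average up but leave the median largely unchanged.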



In [41]:
print(model_data)

s3://sagemaker-us-east-1-436518610213/mobilenet/model.tar.gz


### Custom inference script for preprocessing and REST communication with TensorFlow Serving

In [43]:
from sagemaker.tensorflow.serving import TensorFlowModel

model2 = TensorFlowModel(
    source_dir='code',
    entry_point='inference.py',
    model_data=model_data,
    role=sagemaker_role,
    framework_version='2.4.1',
    env={'PREDICT_USING_GRPC': 'false'},
)
predictor2 = model2.deploy(initial_instance_count=1, instance_type='ml.g4dn.xlarge')

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


-------------------!
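The `env` argument sets environment variables inside the serving container, so `inference.py` can choose between REST and gRPC when it starts. The contents of `code/inference.py` are not shown in this notebook; the following is only a sketch of how such a flag could be read, with `use_grpc` as a hypothetical helper name.

```python
import os


def use_grpc(environ=os.environ):
    """Return True when PREDICT_USING_GRPC is set to 'true' (case-insensitive)."""
    return environ.get("PREDICT_USING_GRPC", "false").strip().lower() == "true"


# Plain dicts stand in for the container environment here.
print(use_grpc({"PREDICT_USING_GRPC": "false"}))
print(use_grpc({"PREDICT_USING_GRPC": "True"}))
```

Environment variables are always strings, so the flag is compared against the text `'true'` rather than treated as a Python boolean.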

In [45]:
import boto3
import numpy as np

# Send the raw JPEG bytes; preprocessing happens in inference.py on the endpoint.
with open('kitten.jpg', 'rb') as f:
    kitten_image = f.read()
endpoint_name = predictor2.endpoint_name
runtime_client = boto3.client('sagemaker-runtime')
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='application/x-image',
                                   Body=kitten_image)
result = response['Body'].read().decode('utf-8')

In [47]:
import time
results = []
for i in range(1,1000):
    start = time.time()
    kitten_image = open('kitten.jpg', 'rb').read()
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='application/x-image', 
                                   Body=kitten_image)
    results.append((time.time() - start) * 1000)
print("\nPredictions for TF2 serving REST API: \n")
print('\nP95: ' + str(np.percentile(results, 95)) + ' ms\n')    
print('P90: ' + str(np.percentile(results, 90)) + ' ms\n')
print('Average: ' + str(np.average(results)) + ' ms\n')


Predictions for TF2 serving REST API: 


P95: 279.7132968902588 ms

P90: 278.2435894012451 ms

Average: 266.48592948913574 ms



### Custom inference script for preprocessing and gRPC communication with TensorFlow Serving

In [20]:
from sagemaker.tensorflow.serving import TensorFlowModel

model2 = TensorFlowModel(
    source_dir='code',
    entry_point='inference.py',
    model_data=model_data,
    role=sagemaker_role,
    framework_version='2.4.1',
    env={'PREDICT_USING_GRPC': 'true'},
)
predictor2 = model2.deploy(initial_instance_count=1, instance_type='ml.g4dn.xlarge')

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


-----------------!

In [21]:
import boto3

# Send the raw JPEG bytes; preprocessing happens in inference.py on the endpoint.
with open('kitten.jpg', 'rb') as f:
    kitten_image = f.read()
endpoint_name = predictor2.endpoint_name
runtime_client = boto3.client('sagemaker-runtime')
response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='application/x-image',
                                   Body=kitten_image)
result = response['Body'].read().decode('utf-8')

In [22]:
import time
results = []
for i in range(1,1000):
    start = time.time()
    kitten_image = open('kitten.jpg', 'rb').read()
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='application/x-image', 
                                   Body=kitten_image)
    results.append((time.time() - start) * 1000)
print("\nPredictions for TF2 serving with gRPC : \n")
print('\nP95: ' + str(np.percentile(results, 95)) + ' ms\n')    
print('P90: ' + str(np.percentile(results, 90)) + ' ms\n')
print('Average: ' + str(np.average(results)) + ' ms\n')


Predictions for TF2 serving with gRPC : 


P95: 76.57002210617065 ms

P90: 74.57253932952881 ms

Average: 58.59267711639404 ms
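Using the average latencies reported in this notebook for the REST and gRPC runs, the relative reduction can be checked with a short calculation. The two numbers are copied from the outputs of those runs and will vary from one run to the next.

```python
rest_avg_ms = 266.48592948913574  # average from the REST run
grpc_avg_ms = 58.59267711639404   # average from the gRPC run

reduction = 1 - grpc_avg_ms / rest_avg_ms
speedup = rest_avg_ms / grpc_avg_ms
print("Latency reduction: {:.1%}".format(reduction))
print("Speed-up: {:.2f}x".format(speedup))
```

For these particular runs the reduction comes out slightly above the roughly 75% quoted in the introduction.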



## Additional Information

The TensorFlow Serving Container supports additional features not covered in this notebook, including support for:

- TensorFlow Serving REST API `classify` and `regress` requests
- CSV input
- Other JSON formats

For information on how to use these features, refer to the documentation in the 
[SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst).

## Cleaning up

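To avoid incurring charges to your AWS account for the resources used in this tutorial, you need to delete every SageMaker Endpoint created above. Because `predictor2` was reassigned for the gRPC deployment, it now refers only to the gRPC endpoint; delete the REST endpoint separately, for example from the SageMaker console.

In [None]:
predictor.delete_endpoint()   # default endpoint (no inference script)
predictor2.delete_endpoint()  # gRPC endpoint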
