# Register pretrained 🤗 models using SageMaker Model Registry - Deploy 🤗 Transformer models for inference with Shadow Deployment
***
This notebook is designed to run on the `Python 3 Data Science 2.0` kernel in Amazon SageMaker Studio.
***

In this notebook, we will use [Hugging Face Inference DLCs and PyTorch DLCs](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) to deploy two pretrained transformer models for real-time inference. You will first register the models in the Amazon SageMaker Model Registry, then deploy each model to a SageMaker real-time endpoint and invoke the endpoint with a test payload.
This example uses [SageMaker boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html) (the low-level API). You can also use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to achieve the same results.

We will perform the following steps:
1. [Introduction](#Introduction) 
2. [Setup](#Setup)
3. [Download and prepare HuggingFace Transformer models](#Download-and-prepare-HuggingFace-Transformer-models)
4. [Deploy the two models as production and shadow variants to a real-time Inference endpoint](#Deploy-the-two-models-as-production-and-shadow-variants-to-a-real-time-Inference-endpoint)

## Introduction

For inference, you can use your trained Hugging Face model or one of the pretrained Hugging Face models to deploy an inference job with SageMaker. You can also run inference jobs without having to write any custom inference code. With custom inference code, you can customize the inference logic by providing your own Python script.

### How to deploy an inference job using the Hugging Face Deep Learning Containers
You have two options for running inference with SageMaker. You can run inference using a model that you trained, or deploy a pre-trained Hugging Face model.

* Run inference with your trained model: You have two options for running inference with your own trained model. You can run inference with a model that you trained using an existing Hugging Face model with the SageMaker Hugging Face Deep Learning Containers, or you can bring your own existing Hugging Face model and deploy it using SageMaker. When you run inference with a model that you trained with the SageMaker Hugging Face Estimator, you can deploy the model immediately after training completes or you can upload the trained model to an Amazon S3 bucket and ingest it when running inference later. If you bring your own existing Hugging Face model, you must upload the trained model to an Amazon S3 bucket and ingest that bucket when running inference.

* Run inference with a pre-trained HuggingFace model: You can use one of the thousands of pre-trained Hugging Face models to run your inference jobs with no additional training needed. We will see this in our lab today.


### SageMaker shadow testing overview
Amazon SageMaker now enables you to evaluate any changes to your model serving infrastructure, consisting of the ML model, the serving container, or the ML instance by shadow testing its performance against the currently deployed one. Shadow testing can help you catch potential configuration errors and performance issues before they impact end users. With SageMaker, you don’t need to invest in building your own shadow testing infrastructure, allowing you to focus on model development.

You can use this to validate changes to any component of your production variant, namely the model, the container, or the instance, without any end user impact. It is useful in situations such as:

- You are considering promoting a new model that has been validated offline to production, but want to evaluate operational performance metrics such as latency and error rate before making this decision.
- You are considering changes to your serving infrastructure container, such as patching vulnerabilities or upgrading to newer versions, and want to assess the impact of these changes prior to promotion.
- You are considering changing your ML instance and want to evaluate how the new instance would perform with live inference requests.

Just select a production variant you want to test against, and SageMaker automatically deploys the new variant in shadow mode and routes a copy of the inference requests to it in real time within the same endpoint. Only the responses of the production variant are returned to the calling application. You can choose to discard or log the responses of the shadow variant for offline comparison.

This notebook provides a walkthrough of the feature using the SageMaker Inference APIs.

**SageMaker Background**

![arch](../img/03_lab_Shadow.png)

A `production variant` consists of the ML model, Serving Container, and ML Instance. Since each variant is independent of others, you can have different models, containers, or instance types across variants. SageMaker lets you specify autoscaling policies on a per-variant basis so they can scale independently based on incoming load. SageMaker supports up to 10 production variants per endpoint. You can either configure a variant to receive a portion of the incoming traffic by setting variant weights or specify the target variant in the incoming request. The response from the production variant is forwarded back to the invoker.

A `shadow variant (new)` has the same components as a production variant. A user-specified portion of the requests, known as the traffic sampling percentage (the `InitialVariantWeight` parameter in the `ShadowProductionVariants` object), is forwarded to the shadow variant. You can choose to log the response of the shadow variant in S3 or discard it. For an endpoint with a shadow variant, you can have a maximum of one production variant.

You can monitor the [invocation metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html) for both production and shadow variants in CloudWatch under the `AWS/SageMaker` namespace.

## Setup

In [None]:
# cell-00
%pip install -U transformers ipywidgets sagemaker torch==1.13.0 -q

In [None]:
# cell-01
import datetime
import json
import os
import shutil
import sys
import tarfile
import time
from pathlib import Path
from uuid import uuid4

import boto3
import numpy as np
import pandas as pd
import sagemaker
import torch
from sagemaker import get_execution_role, image_uris
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.s3 import S3Uploader, s3_path_join
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer, pipeline

p = os.path.abspath("..")
if p not in sys.path:
    sys.path.append(p)
import utils

### Useful objects and variables
Common objects to interact with SageMaker API

In [None]:
# cell-02
sm_session = sagemaker.Session()
role = get_execution_role()
bucket = sm_session.default_bucket()
region = sm_session.boto_region_name
sm_client = sm_session.sagemaker_client
sm_runtime = boto3.client("sagemaker-runtime")
prefix = "sagemaker/huggingface-pytorch-sentiment-analysis"
deploy_instance_type = "ml.m5.xlarge"
%store deploy_instance_type

# The name of the Model Package Group in Amazon SageMaker Model Registry
model_package_group_name = "HuggingFaceModels"
%store model_package_group_name

print(region)
print(role)
print(bucket)

## Download and prepare HuggingFace Transformer models

In [None]:
# cell-03
HF_TASK = "sentiment-analysis"
HF_MODEL_ROBERTA = "cardiffnlp/twitter-roberta-base-sentiment"
HF_MODEL_DISTILBERT = "distilbert-base-uncased-finetuned-sst-2-english"

%store HF_TASK
%store HF_MODEL_ROBERTA
%store HF_MODEL_DISTILBERT

### Download Hugging Face models
#### twitter-roberta-base-sentiment Pretrained Model

In this example we are downloading a pre-trained HuggingFace model - `twitter-roberta-base-sentiment` from the HuggingFace library. We will use this model for classifying the text as `Labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive`.

In [None]:
# cell-04
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_ROBERTA)
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ROBERTA)
model.save_pretrained("model_token_roberta")
tokenizer.save_pretrained("model_token_roberta")

### Package the saved model to tar.gz format
Once the model is downloaded, we need to package the tokenizer and model weights into the `.tar.gz` format expected by Amazon SageMaker.

In [None]:
# cell-05
tar_file_roberta = "model_roberta.tar.gz"
tar_size = utils.create_tar(tar_file_roberta, Path("model_token_roberta"))
print(f"Created {tar_file_roberta}, size {tar_size:.2f} MB")
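
The `utils.create_tar` helper used above lives in the lab's `utils` module and is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch written under assumptions: the function name mirrors the call above, but the body is hypothetical and may differ from the real implementation. It places the files at the root of the archive, which is the layout SageMaker hosting expects for a `model.tar.gz`, and returns the archive size in MB.

```python
import tarfile
from pathlib import Path


def create_tar(tar_name, source_dir):
    """Archive the contents of source_dir at the archive root; return the size in MB.

    Hypothetical stand-in for utils.create_tar, which is not shown in this notebook.
    """
    source_dir = Path(source_dir)
    with tarfile.open(tar_name, "w:gz") as tar:
        for item in sorted(source_dir.iterdir()):
            # arcname drops the parent folder so entries sit at the top level
            tar.add(item, arcname=item.name)
    return Path(tar_name).stat().st_size / (1024 * 1024)
```

Directories such as `code/` are added recursively, so the inference script and `requirements.txt` would end up under `code/` inside the archive.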

#### Download distilbert-base-uncased-finetuned-sst-2-english by initiating a `Huggingface pipeline`

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the [task summary](https://huggingface.co/transformers/task_summary.html) for examples of use.

In [None]:
# cell-06
local_artifact_path = Path("model_token_distilbert")
local_artifact_path.mkdir(exist_ok=True, parents=True)
tar_file_distilbert = "model_distilbert.tar.gz"

In [None]:
# cell-07
sentiment_analysis = pipeline(HF_TASK, model=HF_MODEL_DISTILBERT)
sentiment_analysis.save_pretrained(local_artifact_path)

#### Write the Inference Script

To deploy a pretrained `PyTorch` model, you'll need to use the `PyTorch` estimator object to create a `PyTorchModel` object and set a different `entry_point`.

You'll use the `PyTorchModel` object to deploy a `PyTorchPredictor`. This creates a `SageMaker` Endpoint -- a hosted prediction service that we can use to perform inference.

An implementation of `model_fn` is required for the inference script. We are going to use default implementations of `input_fn`, `predict_fn`, `output_fn` and `model_fn` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).

Here's an example of the inference script:

In [None]:
#!cat ../code/inference.py # uncomment this line of code to see the details in the py file

In [None]:
# !cat ../code/requirements.txt # uncomment this line to show the packages defined in the requirements.txt
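
For readers who do not have the repository open, the following is an illustrative sketch of what an inference script for the PyTorch container could look like. It is not the contents of `../code/inference.py`; the bodies below are assumptions, and only the four handler names (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) come from the text above. It assumes the request carries an `inputs` field (JSON) or an `inputs` column (CSV).

```python
import csv
import io
import json


def model_fn(model_dir):
    """Load the tokenizer and model saved under model_dir into a pipeline."""
    # Imported lazily so the parsing helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_dir, tokenizer=model_dir)


def input_fn(request_body, content_type):
    """Turn the raw request body into a list of input strings."""
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    if content_type == "application/json":
        inputs = json.loads(request_body)["inputs"]
        return [inputs] if isinstance(inputs, str) else list(inputs)
    if content_type == "text/csv":
        # Assumes a header row containing an "inputs" column.
        reader = csv.DictReader(io.StringIO(request_body))
        return [row["inputs"] for row in reader]
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(data, model):
    """Run the pipeline on the parsed inputs."""
    return model(data)


def output_fn(prediction, accept):
    """Serialize the predictions as JSON."""
    return json.dumps(prediction)
```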

#### Create the directory structure for your model files

The directory structure where you saved your PyTorch model should look something like the following:

```
| model
| |--pytorch_model.bin
| |--config.json
| |--vocab.txt
| |--tokenizer.json
| |--tokenizer_config.json
| |--special_tokens_map.json
|
| code
| |--inference.py
| |--requirements.txt
```

Where `requirements.txt` is an optional file that specifies dependencies on third-party libraries.

#### Copy code to the model directory and tar the model and code

In [None]:
# cell-08
shutil.copytree("../code", "model_token_distilbert/code", dirs_exist_ok=True)
tar_size = utils.create_tar(tar_file_distilbert, local_artifact_path)
print(f"Created {tar_file_distilbert}, size {tar_size:.2f} MB")

#### Upload the model to S3

We now have the model archives ready. We need to upload them to S3 before we can use them for hosting.

In [None]:
# cell-09
model_data_path = s3_path_join("s3://", bucket, prefix + "/models")
print(f"Uploading Models to {model_data_path}")
model_roberta_uri = S3Uploader.upload("model_roberta.tar.gz", model_data_path)
print(f"Uploaded roberta model to {model_roberta_uri}")
model_distilbert_uri = S3Uploader.upload("model_distilbert.tar.gz", model_data_path)
print(f"Uploaded distilbert model to {model_distilbert_uri}")
%store model_data_path
%store model_roberta_uri
%store model_distilbert_uri

### Deploy the two models as production and shadow variants to a real-time Inference endpoint
The first step in deploying a trained model to SageMaker Inference is to create a SageMaker Model using the `create_model` API.

#### Prebuilt HuggingFace DLC
You can choose to use a prebuilt HuggingFace DLC as the inference image, which includes the [SageMaker Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) for serving 🤗 Transformers models on Amazon SageMaker. The inference toolkit leverages the `pipeline` API of the transformers library to allow zero-code deployments of models, without requiring any code for pre- or post-processing. See the default [handler service](https://github.com/aws/sagemaker-huggingface-inference-toolkit/blob/main/src/sagemaker_huggingface_inference_toolkit/handler_service.py) provided by the inference toolkit for more information.

In addition to zero-code deployment, the Inference Toolkit supports "bring your own code" methods, where you can override the default methods. You can learn more about "bring your own code" in the documentation [here](https://github.com/aws/sagemaker-huggingface-inference-toolkit#-user-defined-codemodules). In the second lab section, we will use the bring your own code method to deploy models.

In [None]:
# cell-10
ml_domain = "NATURAL_LANGUAGE_PROCESSING"
ml_task = "OTHER"
ml_framework = "PYTORCH"
framework_version = "1.10.2"
# nearest_model = "bert-base-uncased"

In [None]:
# cell-11
framework = "huggingface"
transformer_version = "4.17.0"
py_version = "py38"
instance_type = "ml.g"
image_scope = "inference"

inference_image_roberta = image_uris.retrieve(
    framework=framework,
    base_framework_version=ml_framework.lower() + framework_version,
    region=region,
    version=transformer_version,
    py_version=py_version,
    instance_type=instance_type,
    image_scope=image_scope,
)

print(inference_image_roberta)

#### Prebuilt Pytorch DLC
You can also use a SageMaker prebuilt [PyTorch DLC](https://github.com/aws/deep-learning-containers/tree/master/pytorch) to deploy the Hugging Face model. Because the prebuilt PyTorch container doesn't include the `transformers` package, the model package contains a `requirements.txt` file listing the additional packages that must be installed in the container. See the section [Create the directory structure for your model files](#Create-the-directory-structure-for-your-model-files). We also included the `inference.py` file to define the necessary functions for model loading and model serving.

In [None]:
# cell-12
inference_image_distilbert = image_uris.retrieve(
    framework=ml_framework.lower(),
    region=region,
    version=framework_version,
    py_version=py_version,
    instance_type=instance_type,
    image_scope=image_scope,
)

print(inference_image_distilbert)

#### Create SageMaker models

The first step in deploying a trained model to SageMaker Inference is to create a SageMaker Model using the `create_model` API.

In [None]:
# cell-13
# provide the consistent time stamp for model, endpoint config and endpoint
now = f"{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"

roberta_model_name = f"hf-pytorch-model-roberta-{now}"
print("Model name : {}".format(roberta_model_name))
%store roberta_model_name

In [None]:
# cell-14
distilbert_model_name = f"hf-pytorch-model-distilbert-{now}"
print("Model name : {}".format(distilbert_model_name))
%store distilbert_model_name

In [None]:
# cell-15
resp = sm_client.create_model(
    ModelName=roberta_model_name,
    ExecutionRoleArn=role,
    Containers=[{"Image": inference_image_roberta, "ModelDataUrl": model_roberta_uri}],
)
print(f"Created Model: {resp}")

In [None]:
# cell-16
resp2 = sm_client.create_model(
    ModelName=distilbert_model_name,
    ExecutionRoleArn=role,
    Containers=[{"Image": inference_image_distilbert, "ModelDataUrl": model_distilbert_uri}],
)
print(f"Created Model: {resp2}")


The next step is to create an endpoint config with the production and shadow variants. The ProductionVariants and the ShadowProductionVariants are of particular interest. We set the InitialVariantWeight in the ShadowProductionVariants to sample and send 50% of the production variant requests to the shadow variant. The production variant receives 100% of the traffic.

The production variant runs on two ml.m5.xlarge instances (4 vCPUs and 16 GiB of memory each), and the shadow variant runs on one ml.m5.2xlarge instance (8 vCPUs and 32 GiB of memory).

In [None]:
# cell-17
ep_config_name = f"Shadow-EpConfig-{now}"
production_variant_name = "production"
shadow_variant_name = "shadow"

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=ep_config_name,
    ProductionVariants=[
        {
            "VariantName": production_variant_name,
            "ModelName": roberta_model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 1,
        }
    ],
    ShadowProductionVariants=[
        {
            "VariantName": shadow_variant_name,
            "ModelName": distilbert_model_name,
            "InstanceType": "ml.m5.2xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.5,
        }
    ],
)
print(f"Created EndpointConfig: {create_endpoint_config_response['EndpointConfigArn']}")

In [None]:
# cell-18
endpoint_name = f"hf-prod-shadow-{now}"
create_endpoint_api_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=ep_config_name,
)

Now, wait for the endpoint creation to complete. This should take 2-5 minutes, depending on your model artifact and serving container size.

In [None]:
%%time
# cell-19
utils.endpoint_creation_wait(endpoint_name)
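
`utils.endpoint_creation_wait` is defined in the lab's `utils` module and is not shown here. A minimal sketch of one way to wait for an endpoint is given below; the function name, parameters, and defaults are hypothetical. It polls a `describe_endpoint`-style callable until the status is `InService`, and raises if the endpoint fails or the timeout elapses. boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, that could be used instead.

```python
import time


def wait_for_endpoint(
    describe_endpoint,
    endpoint_name,
    poll_seconds=30,
    timeout_seconds=1800,
    sleep=time.sleep,
):
    """Poll until the endpoint is InService and return that status.

    describe_endpoint is any callable with the shape of
    sm_client.describe_endpoint (hypothetical helper, not utils.endpoint_creation_wait).
    """
    waited = 0
    while True:
        response = describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = response.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`.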

### Invoke Endpoint with `boto3`

After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint.

For an overview of Amazon SageMaker, [see How It Works](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works.html).

Amazon SageMaker strips all POST headers except those supported by the API. Amazon SageMaker might add additional headers. You should not rely on the behavior of headers outside those enumerated in the request syntax.

Calls to `InvokeEndpoint` are authenticated by using AWS Signature Version 4. For information, see Authenticating Requests (AWS Signature Version 4) in the Amazon S3 API Reference.

A customer's model containers must respond to requests within 60 seconds. The model itself can have a maximum processing time of 60 seconds before responding to invocations. If your model is going to take 50-60 seconds of processing time, the SDK socket timeout should be set to be 70 seconds.

More info on `invoke_endpoint` can be found on the [Boto3 `SageMakerRuntime` documentation page](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint).

In [None]:
# cell-20
test_data = pd.read_csv("../sample_payload/test_data.csv", header=None, names=["inputs"])
json_data = dict({"inputs": test_data.iloc[:, 0].to_list()})
print(json_data)
test_data.head()

In [None]:
# cell-21
def invoke_endpoint(endpoint_name, should_raise_exp=False):

    try:
        for i in range(50):  # send the same payload 50 times for testing purpose
            response = sm_runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                Body=test_data.to_csv(header=True, index=False),
                ContentType="text/csv",
            )
            print(response["Body"].read())
    except Exception as e:
        print("E", end="", flush=True)
        if should_raise_exp:
            raise e


invoke_endpoint(endpoint_name)
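
The raw bytes printed above can be hard to read. The sketch below shows one way to summarize a response from the production (RoBERTa) variant. It assumes the body is a JSON list of objects with `label` and `score` fields and that labels look like `LABEL_0`, `LABEL_1`, `LABEL_2`; check the actual output of your endpoint before relying on this. The label names come from the mapping given earlier (0 -> Negative, 1 -> Neutral, 2 -> Positive), and `summarize_predictions` is a hypothetical helper name.

```python
import json
from collections import Counter

# Mapping from the model description: 0 -> Negative, 1 -> Neutral, 2 -> Positive.
ROBERTA_LABELS = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}


def summarize_predictions(body):
    """Decode a JSON response body and count predictions per readable label."""
    predictions = json.loads(body)
    # Unknown labels are kept as-is instead of being dropped.
    return Counter(ROBERTA_LABELS.get(p["label"], p["label"]) for p in predictions)
```

For example, `summarize_predictions(response["Body"].read())` would return a `Counter` keyed by sentiment name.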

Now that the endpoint is InService and has been invoked, the following cells help collect CloudWatch metrics between the production and shadow variants for metrics comparison.

In [None]:
# cell-22
%matplotlib inline
import pandas as pd

cw = boto3.Session().client("cloudwatch", region_name=region)


def get_sagemaker_metrics(
    endpoint_name,
    variant_name,
    metric_name,
    statistic,
    start_time,
    end_time,
):
    dimensions = [
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": variant_name},
    ]
    namespace = "AWS/SageMaker"
    if metric_name in ["CPUUtilization", "MemoryUtilization", "DiskUtilization"]:
        namespace = "/aws/sagemaker/Endpoints"

    metrics = cw.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        StartTime=start_time,
        EndTime=end_time,
        Period=1,
        Statistics=[statistic],
        Dimensions=dimensions,
    )

    if len(metrics["Datapoints"]) == 0:
        return
    return (
        pd.DataFrame(metrics["Datapoints"])
        .sort_values("Timestamp")
        .set_index("Timestamp")
        .drop(["Unit"], axis=1)
        .rename(columns={statistic: variant_name})
    )


def plot_endpoint_invocation_metrics(
    endpoint_name,
    metric_name,
    statistic,
    start_time=None,
):
    from datetime import datetime, timezone, timedelta

    start_time = start_time or datetime.now(timezone.utc) - timedelta(minutes=10)
    end_time = datetime.now(timezone.utc)
    metrics_production = get_sagemaker_metrics(
        endpoint_name,
        production_variant_name,
        metric_name,
        statistic,
        start_time,
        end_time,
    )
    metrics_shadow = get_sagemaker_metrics(
        endpoint_name,
        shadow_variant_name,
        metric_name,
        statistic,
        start_time,
        end_time,
    )
    try:
        metrics_variants = pd.merge(metrics_production, metrics_shadow, on="Timestamp")
        return metrics_variants.plot(y=["production", "shadow"])
    except Exception as e:
        print(e)

#### Metric Comparison
Now that we have deployed both the production and shadow models, let us compare the invocation metrics. Here is a [list](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html) of invocation metrics available for comparison. Let us start by comparing invocations between the production and shadow variants.

In [None]:
# cell-23
invocations = plot_endpoint_invocation_metrics(endpoint_name, "Invocations", "Sum")
invocations_per_instance = plot_endpoint_invocation_metrics(
    endpoint_name, "InvocationsPerInstance", "Sum"
)

The `Invocations` metric refers to the number of invocations sent to the production variant. A fraction of these invocations, specified by the variant weight, is sent to the shadow variant. `InvocationsPerInstance` is calculated by dividing the total number of invocations by the number of instances in a variant. From the charts above, we can confirm that both the production and shadow variants are receiving invocation requests according to the weights specified in the endpoint config.
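
As a rough sanity check on the charts, the expected totals can be worked out from the endpoint configuration used in this notebook: 50 requests, a shadow weight of 0.5, two production instances, and one shadow instance. The helper below is a hypothetical illustration of that arithmetic. Shadow traffic is sampled, so the observed shadow count is expected to be close to, but not necessarily equal to, the value computed here.

```python
def expected_invocations(total_requests, shadow_weight, production_instances, shadow_instances):
    """Estimate Invocations and InvocationsPerInstance for each variant."""
    production_total = total_requests
    # Expected value only; sampling makes the real shadow count vary.
    shadow_total = total_requests * shadow_weight
    return {
        "production": {
            "Invocations": production_total,
            "InvocationsPerInstance": production_total / production_instances,
        },
        "shadow": {
            "Invocations": shadow_total,
            "InvocationsPerInstance": shadow_total / shadow_instances,
        },
    }


estimate = expected_invocations(50, 0.5, production_instances=2, shadow_instances=1)
print(estimate)
```

With these inputs, both variants come out at about 25 invocations per instance, even though the production variant handles twice as many requests in total.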

Next let us compare the model latency between the production and shadow variants. Model latency is the time taken by a model to respond as viewed from SageMaker.

In [None]:
# cell-24
model_latency = plot_endpoint_invocation_metrics(endpoint_name, "ModelLatency", "Average")

Using the chart above, we can observe how the model latency of the shadow variant compares with the production variant without exposing end users to the shadow variant.

We expect the overhead latency to be comparable across production and shadow variants. Overhead latency is the interval measured from the time SageMaker receives the request until it returns a response to the client, minus the model latency.

In [None]:
# cell-25
overhead_latency = plot_endpoint_invocation_metrics(endpoint_name, "OverheadLatency", "Average")

Finally, let us review the 4xx, 5xx and total model errors returned by the model serving container.

In [None]:
# cell-26
Invocation4xxErrors = plot_endpoint_invocation_metrics(endpoint_name, "Invocation4XXErrors", "Sum")
Invocation5xxErrors = plot_endpoint_invocation_metrics(endpoint_name, "Invocation5XXErrors", "Sum")
# Use a separate name so the 5xx plot handle above is not overwritten
InvocationModelErrors = plot_endpoint_invocation_metrics(
    endpoint_name, "InvocationModelErrors", "Sum"
)

We can consider promoting the shadow model if we do not see any differences in 4xx and 5xx errors between the production and shadow variants.

To promote the shadow model to production, create a new endpoint configuration that lists the current ShadowProductionVariant as the new ProductionVariant and omits ShadowProductionVariants. Updating the endpoint with this configuration removes the current ProductionVariant and promotes the shadow variant to become the new production variant. As always, all SageMaker updates are orchestrated as blue/green deployments under the hood and there is no loss of availability while performing the update. Optionally, you can leverage [Deployment Guardrails](https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails.html) if you want to use all-at-once traffic shifting and auto rollbacks during your update.

In [None]:
# cell-27
promote_ep_config_name = f"PromoteShadow-EpConfig-{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=promote_ep_config_name,
    ProductionVariants=[
        {
            "VariantName": shadow_variant_name,
            "ModelName": distilbert_model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 1.0,
        }
    ],
)
print(f"Created EndpointConfig: {create_endpoint_config_response['EndpointConfigArn']}")
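
The cell above writes the promoted variant out by hand. An alternative, sketched here under assumptions, is to derive it from the existing endpoint configuration so the promoted variant keeps the instance type and count that were shadow tested. `promoted_variants` is a hypothetical helper; it expects a dictionary shaped like the response of `sm_client.describe_endpoint_config`, with `ProductionVariants` and `ShadowProductionVariants` keys.

```python
def promoted_variants(endpoint_config):
    """Return a ProductionVariants list that promotes the single shadow variant."""
    shadows = endpoint_config.get("ShadowProductionVariants") or []
    if len(shadows) != 1:
        raise ValueError("Expected exactly one shadow variant to promote")
    promoted = dict(shadows[0])  # copy so the input dictionary is left untouched
    promoted["InitialVariantWeight"] = 1.0  # the promoted variant takes all traffic
    return [promoted]
```

The result could be passed as `ProductionVariants` to `create_endpoint_config`. Note that this differs from the cell above, which promotes the model onto two ml.m5.xlarge instances instead of reusing the shadow variant's instance settings.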

In [None]:
# cell-28
update_endpoint_api_response = sm_client.update_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=promote_ep_config_name,
)

utils.endpoint_creation_wait(endpoint_name)

sm_client.describe_endpoint(EndpointName=endpoint_name)

If you do not want to create multiple endpoint configurations and would rather have SageMaker manage the end-to-end workflow of creating, managing, and acting on the results of shadow tests, consider using the SageMaker Inference Experiment APIs or the console experience. They let you set up shadow tests for a predefined duration, monitor progress through a live dashboard, act on the results, and clean up upon completion. To get started, navigate to the 'Shadow Tests' section of the SageMaker Inference console.

### Cleanup
If you do not plan to use this endpoint further, you should delete the endpoint to avoid incurring additional charges and clean up other resources created in this notebook.



In [None]:
# cell-29
sm_client.delete_endpoint(EndpointName=endpoint_name)
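
Deleting the endpoint stops the instance charges, but the endpoint configurations and models created in this notebook remain. A minimal sketch for removing them is shown below; `delete_hosting_resources` is a hypothetical helper that wraps the boto3 `delete_endpoint_config` and `delete_model` calls. Model artifacts uploaded to S3 are not touched by this sketch.

```python
def delete_hosting_resources(sm_client, config_names, model_names):
    """Delete the given endpoint configurations and models, in that order."""
    for name in config_names:
        sm_client.delete_endpoint_config(EndpointConfigName=name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)
```

In this notebook it could be called as `delete_hosting_resources(sm_client, [ep_config_name, promote_ep_config_name], [roberta_model_name, distilbert_model_name])`.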