# Register pretrained 🤗 models using SageMaker Model Registry - Deploy 🤗 Transformer models for inference
***
This notebook is designed to run on the `Python 3 Data Science 2.0` kernel in Amazon SageMaker Studio.
***

In this notebook, we will use [Hugging Face Inference DLCs and PyTorch DLCs](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) to deploy two pretrained transformer models for real-time inference. You will first register the models in the Amazon SageMaker Model Registry, then deploy each model to a SageMaker real-time endpoint and invoke the endpoint with a test payload.
This example uses [SageMaker boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html) (the low-level API). You can also use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to achieve the same results.

We will perform the following steps:
1. [Introduction](#Introduction) 
2. [Setup](#Setup)
3. [Download and register HuggingFace Transformer models](#Download-and-register-HuggingFace-Transformer-models)
4. [Deploy registered models for real-time inference](#Deploy-registered-models-for-real-time-inference)


## Introduction

For inference, you can use your trained Hugging Face model or one of the pretrained Hugging Face models to deploy an inference job with SageMaker. You can also run inference jobs without having to write any custom inference code. With custom inference code, you can customize the inference logic by providing your own Python script.

### How to deploy an inference job using the Hugging Face Deep Learning Containers
You have two options for running inference with SageMaker. You can run inference using a model that you trained, or deploy a pre-trained Hugging Face model.

* Run inference with your trained model: You have two options for running inference with your own trained model. You can run inference with a model that you trained using an existing Hugging Face model with the SageMaker Hugging Face Deep Learning Containers, or you can bring your own existing Hugging Face model and deploy it using SageMaker. When you run inference with a model that you trained with the SageMaker Hugging Face Estimator, you can deploy the model immediately after training completes or you can upload the trained model to an Amazon S3 bucket and ingest it when running inference later. If you bring your own existing Hugging Face model, you must upload the trained model to an Amazon S3 bucket and ingest that bucket when running inference.

* Run inference with a pre-trained HuggingFace model: You can use one of the thousands of pre-trained Hugging Face models to run your inference jobs with no additional training needed. We will see this in our lab today.

## Setup

In [None]:
%pip install -U transformers ipywidgets sagemaker torch -q

In [None]:
import datetime
import json
import os
import shutil
import sys
import tarfile
import time
from pathlib import Path
from uuid import uuid4

import boto3
import numpy as np
import pandas as pd
import sagemaker
import torch
from sagemaker import get_execution_role, image_uris
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.s3 import S3Uploader, s3_path_join
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer, pipeline

p = os.path.abspath("..")
if p not in sys.path:
 sys.path.append(p)
import utils

### Useful objects and variables
Common objects to interact with SageMaker API

In [None]:
sm_session = sagemaker.Session()
role = get_execution_role()
bucket = sm_session.default_bucket()
region = sm_session.boto_region_name
sm_client = sm_session.sagemaker_client
sm_runtime = boto3.client("sagemaker-runtime")
prefix = "sagemaker/huggingface-pytorch-sentiment-analysis"
deploy_instance_type = "ml.m5.xlarge"
%store deploy_instance_type

# The name of the Model Package Group in Amazon SageMaker Model Registry
model_package_group_name = "HuggingFaceModels"
%store model_package_group_name

print(region)
print(role)
print(bucket)

## Download and register HuggingFace Transformer models

In [None]:
HF_TASK = "sentiment-analysis"
%store HF_TASK

In [None]:
HF_MODEL_ROBERTA = "cardiffnlp/twitter-roberta-base-sentiment"
HF_MODEL_DISTILBERT = "distilbert-base-uncased-finetuned-sst-2-english"
%store HF_MODEL_ROBERTA
%store HF_MODEL_DISTILBERT

### Download Hugging Face models
#### twitter-roberta-base-sentiment Pretrained Model

In this example we are downloading a pre-trained HuggingFace model - `twitter-roberta-base-sentiment` from the HuggingFace library. We will use this model for classifying the text as `Labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive`.
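
As a small illustration of the label scheme above, the sketch below maps a raw prediction to a readable sentiment. It assumes that predictions arrive as dictionaries with a `label` such as `LABEL_2` and a `score`, which is how Hugging Face pipelines typically report classes when a model has no custom label names. `readable_sentiment` is a hypothetical helper and is not part of this notebook's `utils` module.

```python
# Mapping taken from the model description: 0 -> Negative, 1 -> Neutral, 2 -> Positive.
ID_TO_SENTIMENT = {0: "Negative", 1: "Neutral", 2: "Positive"}


def readable_sentiment(prediction):
    """Return a copy of the prediction with a human-readable sentiment added.

    Assumes prediction["label"] looks like "LABEL_<id>" (an assumption about
    the endpoint output, not something this notebook guarantees).
    """
    class_id = int(prediction["label"].rsplit("_", 1)[-1])
    return {**prediction, "sentiment": ID_TO_SENTIMENT[class_id]}
```

If the endpoint returns differently named labels, only the parsing line needs to change.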

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(HF_MODEL_ROBERTA)
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ROBERTA)
model.save_pretrained("model_token_roberta")
tokenizer.save_pretrained("model_token_roberta")

### Package the saved model to tar.gz format
Once the model is downloaded, we need to package it (tokenizer and model weights) in the `.tar.gz` format expected by Amazon SageMaker.
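
The source of `utils.create_tar` is not shown in this notebook. As a rough idea of what such a helper could look like, here is a hypothetical stand-in built on the standard `tarfile` module. It assumes the model files should sit at the root of the archive (with any `code/` folder below it) and that the return value is the archive size in megabytes, matching how the result is printed in this notebook.

```python
import tarfile
from pathlib import Path


def create_tar(tar_name, source_dir):
    """Pack every file under source_dir into a gzip-compressed tar archive.

    Hypothetical sketch, not the notebook's actual utils.create_tar.
    Files are stored relative to source_dir so they sit at the archive root.
    Returns the archive size in megabytes.
    """
    source_dir = Path(source_dir)
    with tarfile.open(tar_name, "w:gz") as tar:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # as_posix() keeps forward slashes inside the archive on every OS
                tar.add(path, arcname=path.relative_to(source_dir).as_posix())
    return Path(tar_name).stat().st_size / (1024 * 1024)
```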

In [None]:
tar_file_roberta = "model_roberta.tar.gz"
tar_size = utils.create_tar(tar_file_roberta, Path("model_token_roberta"))
print(f"Created {tar_file_roberta}, size {tar_size:.2f} MB")

#### Download distilbert-base-uncased-finetuned-sst-2-english by initiating a `Huggingface pipeline`

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the [task summary](https://huggingface.co/transformers/task_summary.html) for examples of use.

In [None]:
local_artifact_path = Path("model_token_distilbert")
local_artifact_path.mkdir(exist_ok=True, parents=True)
tar_file_distilbert = "model_distilbert.tar.gz"

In [None]:
sentiment_analysis = pipeline(HF_TASK, model=HF_MODEL_DISTILBERT)
sentiment_analysis.save_pretrained(local_artifact_path)

#### Write the Inference Script

To deploy a pretrained `PyTorch` model, you'll need to use the `PyTorch` estimator object to create a `PyTorchModel` object and set a different `entry_point`.

You'll use the `PyTorchModel` object to deploy a `PyTorchPredictor`. This creates a `SageMaker` Endpoint -- a hosted prediction service that we can use to perform inference.

An implementation of `model_fn` is required in the inference script. We are going to use the default implementations of `input_fn`, `predict_fn`, `output_fn` and `model_fn` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).

Here's an example of the inference script:

In [None]:
#!cat ../code/inference.py # uncomment this line of code to see the details in the py file

In [None]:
# !cat ../code/requirements.txt # uncomment this line to show the packages defined in the requirements.txt
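
The contents of `../code/inference.py` are not reproduced in this notebook. For orientation only, the following is a hypothetical sketch of the four handler functions named above. It assumes a JSON body of the form `{"inputs": [...]}` or a single-column CSV body without a header row, and that the model directory holds a saved sentiment-analysis pipeline. The actual script may differ or rely on container defaults.

```python
import csv
import io
import json


def model_fn(model_dir):
    """Load the saved pipeline artifacts from the extracted model archive."""
    # Imported lazily so the other handlers can be exercised without transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_dir, tokenizer=model_dir)


def input_fn(request_body, content_type):
    """Turn the request payload into a list of input strings."""
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    if content_type == "application/json":
        inputs = json.loads(request_body)["inputs"]
        return [inputs] if isinstance(inputs, str) else list(inputs)
    if content_type == "text/csv":
        # Assumption: one text per row in the first column, no header row.
        return [row[0] for row in csv.reader(io.StringIO(request_body)) if row]
    raise ValueError(f"Unsupported content type: {content_type}")


def predict_fn(data, model):
    """Run the loaded pipeline on the parsed inputs."""
    return model(data)


def output_fn(prediction, accept):
    """Serialize predictions as JSON regardless of the requested accept type."""
    return json.dumps(prediction)
```

Keeping the parsing in `input_fn` separate from `predict_fn` makes each step easy to test locally with a stand-in model before building the archive.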

#### Create the directory structure for your model files

The directory structure where you saved your PyTorch model should look something like the following:

```
| model
| |--pytorch_model.bin
| |--config.json
| |--vocab.txt
| |--tokenizer.json
| |--tokenizer_config.json
| |--special_tokens_map.json
|
| code
| |--inference.py
| |--requirements.txt
```

Where `requirements.txt` is an optional file that specifies dependencies on third-party libraries.

#### Copy code to the model directory and tar the model and code

In [None]:
shutil.copytree("../code", "model_token_distilbert/code", dirs_exist_ok=True)
tar_size = utils.create_tar(tar_file_distilbert, local_artifact_path)
print(f"Created {tar_file_distilbert}, size {tar_size:.2f} MB")

#### Upload the model to S3

We now have the model archives ready. We need to upload them to S3 before we can use them for hosting.

In [None]:
model_data_path = s3_path_join("s3://", bucket, prefix + "/models")
print(f"Uploading Models to {model_data_path}")
model_roberta_uri = S3Uploader.upload("model_roberta.tar.gz", model_data_path)
print(f"Uploaded roberta model to {model_roberta_uri}")
model_distilbert_uri = S3Uploader.upload("model_distilbert.tar.gz", model_data_path)
print(f"Uploaded distilbert model to {model_distilbert_uri}")

In [None]:
%store model_data_path
%store model_roberta_uri
%store model_distilbert_uri

### Register the models to SageMaker model registry
To use the models, we recommend registering them in the Model Registry. We use `boto3` to register each model as a model package and include the parameters required for future use:
- Domain
- Task
- Framework
- FrameworkVersion

In [None]:
# # uncomment the cell to list the domain, framework, task,
# # and model name of standard machine learning models found in common model zoos.
# df = utils.list_model_metadata_df()

# display(df.sort_values(by=["Domain", "Task", "Framework", "FrameworkVersion"]))

In this example, because we are performing sentiment analysis with `HuggingFace` `BERT`, we select `NATURAL_LANGUAGE_PROCESSING` as the Domain, `OTHER` as the Task, `PYTORCH` as the Framework, and `bert-base-uncased` as the Model.

In [None]:
ml_domain = "NATURAL_LANGUAGE_PROCESSING"
ml_task = "OTHER"
ml_framework = "PYTORCH"
framework_version = "1.10.2"
nearest_model = "bert-base-uncased"

#### Prebuilt HuggingFace DLC
You can choose to use a prebuilt HuggingFace DLC as the inference image, which includes the [SageMaker Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) for serving 🤗 Transformers models on Amazon SageMaker. The inference toolkit leverages the `pipeline` API of the transformers library to allow zero-code deployments of models, without requiring any code for pre- or post-processing. For more information, see the default [handler service](https://github.com/aws/sagemaker-huggingface-inference-toolkit/blob/main/src/sagemaker_huggingface_inference_toolkit/handler_service.py) provided by the inference toolkit.

In addition to zero-code deployment, the Inference Toolkit supports "bring your own code" methods, where you can override the default methods. You can learn more about "bring your own code" in the documentation [here](https://github.com/aws/sagemaker-huggingface-inference-toolkit#-user-defined-codemodules). In the second lab section, we will use the bring your own code method to deploy models.

In [None]:
framework = "huggingface"
transformer_version = "4.17.0"
py_version = "py38"
instance_type = "ml.g"
image_scope = "inference"

inference_image_roberta = image_uris.retrieve(
 framework=framework,
 base_framework_version=ml_framework.lower() + framework_version,
 region=region,
 version=transformer_version,
 py_version=py_version,
 instance_type=instance_type,
 image_scope=image_scope,
)

print(inference_image_roberta)

In [None]:
inference_image_hf_mme = image_uris.retrieve(
 framework=framework,
 base_framework_version=ml_framework.lower() + framework_version,
 region=region,
 version=transformer_version,
 py_version=py_version,
 instance_type="ml.c",
 image_scope=image_scope,
)

print(inference_image_hf_mme)
%store inference_image_hf_mme

#### Prebuilt Pytorch DLC
You can also use a SageMaker prebuilt [PyTorch DLC](https://github.com/aws/deep-learning-containers/tree/master/pytorch) to deploy the Hugging Face model. In this case, because the prebuilt PyTorch container doesn't include the `transformers` package, we have provided a `requirements.txt` file in the model package that lists the additional packages to install in the container. See section [Create the directory structure for your model files](#Create-the-directory-structure-for-your-model-files). We also included the `inference.py` file to define the necessary functions for model loading and model serving.

In [None]:
inference_image_distilbert = image_uris.retrieve(
 framework=ml_framework.lower(),
 region=region,
 version=framework_version,
 py_version=py_version,
 instance_type=instance_type,
 image_scope=image_scope,
)

print(inference_image_distilbert)

#### Create model package group and model packages

In [None]:
try:
    sm_client.describe_model_package_group(ModelPackageGroupName=model_package_group_name)
except sm_client.exceptions.ClientError:
    # The group does not exist yet (or cannot be described), so create it
    model_package_group_response = sm_client.create_model_package_group(
        ModelPackageGroupName=model_package_group_name,
        ModelPackageGroupDescription="My sample HuggingFace PyTorch model package group",
    )
    print(model_package_group_response)

In [None]:
roberta_model_package_response = sm_client.create_model_package(
 ModelPackageGroupName=str(model_package_group_name),
 ModelPackageDescription="Hugging Face Roberta Model - sentiment analysis",
 Domain=ml_domain,
 Task=ml_task,
 InferenceSpecification={
 "Containers": [
 {
 "ContainerHostname": "huggingface-pytorch-roberta",
 "Image": inference_image_roberta,
 "ModelDataUrl": model_roberta_uri,
 "Framework": ml_framework,
 "NearestModelName": nearest_model,
 "Environment": {
 "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
 "SAGEMAKER_REGION": region,
 "SAGEMAKER_SUBMIT_DIRECTORY": model_roberta_uri,
 "HF_TASK": HF_TASK,
 },
 },
 ],
 # "SupportedRealtimeInferenceInstanceTypes": [
 # "ml.c5.large",
 # "ml.c5.xlarge",
 # "ml.c5.2xlarge",
 # "ml.m5.xlarge",
 # "ml.m5.2xlarge",
 # ],
 "SupportedContentTypes": ["application/json"],
 "SupportedResponseMIMETypes": ["application/json"],
 },
)

In [None]:
distilbert_model_package_response = sm_client.create_model_package(
 ModelPackageGroupName=str(model_package_group_name),
 ModelPackageDescription="Hugging Face Distilbert Model - sentiment analysis",
 Domain=ml_domain,
 Task=ml_task,
 InferenceSpecification={
 "Containers": [
 {
 "ContainerHostname": "huggingface-pytorch-distilbert",
 "Image": inference_image_distilbert,
 "ModelDataUrl": model_distilbert_uri,
 "Framework": ml_framework,
 "NearestModelName": nearest_model,
 "Environment": {
 "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
 "SAGEMAKER_PROGRAM": "inference.py",
 "SAGEMAKER_REGION": region,
 "SAGEMAKER_SUBMIT_DIRECTORY": model_distilbert_uri,
 "HF_TASK": HF_TASK,
 },
 },
 ],
 # "SupportedRealtimeInferenceInstanceTypes": [
 # "ml.c5.large",
 # "ml.c5.xlarge",
 # "ml.c5.2xlarge",
 # "ml.m5.xlarge",
 # "ml.m5.2xlarge",
 # ],
 "SupportedContentTypes": ["application/json"],
 "SupportedResponseMIMETypes": ["application/json"],
 },
)
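
The two registration calls above differ only in the container hostname, image, model archive, and environment variables. One way to avoid repeating the rest is a small builder for the `InferenceSpecification` argument. The sketch below is a hypothetical helper (`build_inference_spec` is not part of this notebook or of `boto3`) that only assembles a dictionary with the same keys used above.

```python
def build_inference_spec(hostname, image_uri, model_data_url, framework,
                         nearest_model, environment):
    """Assemble the InferenceSpecification dictionary for create_model_package."""
    return {
        "Containers": [
            {
                "ContainerHostname": hostname,
                "Image": image_uri,
                "ModelDataUrl": model_data_url,
                "Framework": framework,
                "NearestModelName": nearest_model,
                # Copy so later changes to the caller's dict do not leak in.
                "Environment": dict(environment),
            }
        ],
        "SupportedContentTypes": ["application/json"],
        "SupportedResponseMIMETypes": ["application/json"],
    }
```

With the variables defined earlier, the result could be passed as `InferenceSpecification=build_inference_spec("huggingface-pytorch-roberta", inference_image_roberta, model_roberta_uri, ml_framework, nearest_model, {"HF_TASK": HF_TASK})`, adding any further environment variables as needed.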

## Deploy registered models for real-time inference

Next, we will create a SageMaker real-time endpoint for each of the registered model versions.

In [None]:
roberta_model_package_arn = roberta_model_package_response["ModelPackageArn"]
print(f"ModelPackage Version ARN : {roberta_model_package_arn}")
%store roberta_model_package_arn

In [None]:
distilbert_model_package_arn = distilbert_model_package_response["ModelPackageArn"]
print(f"ModelPackage Version ARN : {distilbert_model_package_arn}")
%store distilbert_model_package_arn

### View Model Groups and Versions

You can view details of a specific model version by using either the AWS SDK for Python (Boto3) or by using Amazon SageMaker Studio.
To view the details of a model version by using Boto3, call the `list_model_packages` method to list the model versions in a model group.

In [None]:
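# Sort newest first so that index 0 is the most recently registered version
list_model_packages_response = sm_client.list_model_packages(
    ModelPackageGroupName=model_package_group_name,
    SortBy="CreationTime",
    SortOrder="Descending",
)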
list_model_packages_response

In [None]:
roberta_model_version_arn = list_model_packages_response["ModelPackageSummaryList"][1][
 "ModelPackageArn"
]
print("roberta model: {}".format(roberta_model_version_arn))
distilbert_model_version_arn = list_model_packages_response["ModelPackageSummaryList"][0][
 "ModelPackageArn"
]
print("distilbert model: {}".format(distilbert_model_version_arn))

### View Model Version Details

Call `describe_model_package` to see the details of the model version. Pass in the ARN of a model version that you got from the output of the call to `list_model_packages`.

In [None]:
sm_client.describe_model_package(ModelPackageName=roberta_model_version_arn)

### Update Model Approval Status

After you create a model version, you typically want to evaluate its performance before you deploy it to a production endpoint. If it performs to your requirements, you can update the approval status of the model version to `Approved`. Setting the status to `Approved` can initiate CI/CD deployment for the model. If the model version does not perform to your requirements, you can update the approval status to `Rejected`.
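
To make the status change less error-prone, the request can be built by a small function that rejects unknown values before any API call is made. This is a minimal sketch: `build_approval_update` is a hypothetical helper, and the three status strings are the values accepted by the `ModelApprovalStatus` parameter.

```python
VALID_APPROVAL_STATUSES = ("Approved", "Rejected", "PendingManualApproval")


def build_approval_update(model_package_arn, status):
    """Build the keyword arguments for update_model_package.

    Raises ValueError early if the status is not one SageMaker accepts.
    """
    if status not in VALID_APPROVAL_STATUSES:
        raise ValueError(
            f"Unknown approval status {status!r}; expected one of {VALID_APPROVAL_STATUSES}"
        )
    return {"ModelPackageArn": model_package_arn, "ModelApprovalStatus": status}
```

For example, `sm_client.update_model_package(**build_approval_update(roberta_model_package_arn, "Rejected"))` would mark the RoBERTa version as rejected.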

In [None]:
model_package_update_input_dict = {
 "ModelPackageArn": roberta_model_package_arn,
 "ModelApprovalStatus": "Approved",
}
model_package_update_response1 = sm_client.update_model_package(**model_package_update_input_dict)
model_package_update_response1

In [None]:
model_package_update_input_dict = {
 "ModelPackageArn": distilbert_model_package_arn,
 "ModelApprovalStatus": "Approved",
}
model_package_update_response2 = sm_client.update_model_package(**model_package_update_input_dict)
model_package_update_response2

### Deploy the Roberta Model from the Model Registry

After you register a model version and approve it for deployment, deploy it to a SageMaker endpoint for real-time inference.

When you create an `MLOps` project and choose an `MLOps` project template that includes model deployment, approved model versions in the model registry are automatically deployed to production. For information about using SageMaker `MLOps` projects, see [Automate `MLOps` with SageMaker Projects](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects.html).

To deploy a model version using the AWS SDK for Python (Boto3) we'll create a model object from the model version by calling the [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) method. Pass the Amazon Resource Name (ARN) of the model version as part of the Containers for the model object.

In [None]:
# Use one consistent timestamp for the model, endpoint config, and endpoint names
now_roberta = f"{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"
now_roberta
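
The model, endpoint configuration, and endpoint in this notebook share one timestamp and follow the same naming pattern. As an optional sketch, the hypothetical helper below derives all three names at once and checks them against the SageMaker naming rules (at most 63 characters, letters, digits, and hyphens). The name prefixes mirror the ones used in this notebook.

```python
import datetime
import re

# SageMaker resource names: alphanumerics separated by hyphens, no leading/trailing hyphen.
_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def resource_names(model_key, now=None):
    """Return model, endpoint config, and endpoint names sharing one timestamp."""
    now = now or datetime.datetime.now()
    stamp = f"{now:%Y-%m-%d-%H-%M-%S}"
    names = {
        "model": f"hf-pytorch-model-{model_key}-{stamp}",
        "endpoint_config": f"hf-pytorch-endpoint-config-{model_key}-{stamp}",
        "endpoint": f"hf-pytorch-endpoint-{model_key}-{stamp}",
    }
    for name in names.values():
        if len(name) > 63 or not _NAME_PATTERN.match(name):
            raise ValueError(f"Invalid SageMaker resource name: {name}")
    return names
```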

In [None]:
roberta_model_name = f"hf-pytorch-model-roberta-{now_roberta}"
print("Model name : {}".format(roberta_model_name))
%store roberta_model_name

In [None]:
primary_container_roberta = {
 "ModelPackageName": roberta_model_version_arn,
}

create_model_roberta_response = sm_client.create_model(
    ModelName=roberta_model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container_roberta
)

print("Model arn : {}".format(create_model_roberta_response["ModelArn"]))

### Create an Endpoint Config from the model

This will create an endpoint configuration that Amazon SageMaker hosting services uses to deploy models. In the configuration, you identify one or more models, created using the `CreateModel` API, to deploy and the resources that you want Amazon SageMaker to provision. Then you call the `CreateEndpoint` API.

More info on `create_endpoint_config` can be found on the [Boto3 SageMaker documentation page](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config).

In [None]:
deploy_instance_type = "ml.m5.xlarge"
roberta_endpoint_config_name = f"hf-pytorch-endpoint-config-roberta-{now_roberta}"
roberta_endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=roberta_endpoint_config_name,
 ProductionVariants=[
 {
 "VariantName": "AllTrafficVariant",
 "ModelName": roberta_model_name,
 "InitialInstanceCount": 1,
 "InstanceType": deploy_instance_type,
 "InitialVariantWeight": 1,
 },
 ],
)

roberta_endpoint_config_response

### Deploy the Endpoint Config to a real-time endpoint

This will create an endpoint using the endpoint configuration specified in the request. Amazon SageMaker uses the endpoint to provision resources and deploy models. Note that you have already created the endpoint configuration with the `CreateEndpointConfig` API in the previous step.

More info on `create_endpoint` can be found on the [Boto3 SageMaker documentation page](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint).

In [None]:
roberta_endpoint_name = f"hf-pytorch-endpoint-roberta-{now_roberta}"
roberta_create_endpoint_response = sm_client.create_endpoint(
 EndpointName=roberta_endpoint_name,
 EndpointConfigName=roberta_endpoint_config_name,
)

roberta_create_endpoint_response

### Wait for Endpoint to be ready

In [None]:
%%time
utils.endpoint_creation_wait(roberta_endpoint_name)
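
The implementation of `utils.endpoint_creation_wait` is not shown here. A polling loop along the following lines would achieve the same goal. It is a sketch under stated assumptions: `wait_for_endpoint` is a hypothetical name, and the `describe_endpoint` argument is expected to behave like `sm_client.describe_endpoint`, returning a dictionary with an `EndpointStatus` key and, on failure, a `FailureReason` key.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll the endpoint status until it is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return description
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

A call such as `wait_for_endpoint(sm_client.describe_endpoint, roberta_endpoint_name)` would block until the endpoint is ready. Boto3 also ships a built-in waiter for this, `sm_client.get_waiter("endpoint_in_service").wait(EndpointName=roberta_endpoint_name)`.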

### Invoke Endpoint with `boto3`

After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint.

For an overview of Amazon SageMaker, see [How It Works](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works.html).

Amazon SageMaker strips all POST headers except those supported by the API. Amazon SageMaker might add additional headers. You should not rely on the behavior of headers outside those enumerated in the request syntax.

Calls to `InvokeEndpoint` are authenticated by using AWS Signature Version 4. For information, see Authenticating Requests (AWS Signature Version 4) in the Amazon S3 API Reference.

A customer's model containers must respond to requests within 60 seconds. The model itself can have a maximum processing time of 60 seconds before responding to invocations. If your model is going to take 50-60 seconds of processing time, the SDK socket timeout should be set to be 70 seconds.

More info on `invoke_endpoint` can be found on the [Boto3 `SageMakerRuntime` documentation page](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint).
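
The cells in this notebook send two kinds of request bodies: a JSON document with an `inputs` list, and a single-column CSV (with a header row for the first endpoint and without one for the second). The helpers below are a hypothetical sketch of how such bodies could be built and how a JSON response body could be decoded using only the standard library. The function names are illustrative and are not part of `boto3`.

```python
import csv
import io
import json


def json_payload(texts):
    """Serialize texts as the {"inputs": [...]} JSON body."""
    return json.dumps({"inputs": list(texts)})


def csv_payload(texts, header="inputs"):
    """Serialize texts as a single-column CSV; pass header=None to omit the header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if header is not None:
        writer.writerow([header])
    for text in texts:
        # csv.writer quotes fields that contain commas or quotes
        writer.writerow([text])
    return buffer.getvalue()


def decode_json_body(raw):
    """Decode the bytes returned by response["Body"].read() into Python objects."""
    return json.loads(raw.decode("utf-8"))
```

For instance, `decode_json_body(response["Body"].read())` turns an `application/json` response into a list or dictionary instead of printing raw bytes.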

In [None]:
test_data = pd.read_csv("../sample_payload/test_data.csv", header=None, names=["inputs"])
json_data = dict({"inputs": test_data.iloc[:, 0].to_list()})
print(json_data)
test_data.head()

In [None]:
%%time
response = sm_runtime.invoke_endpoint(
 EndpointName=roberta_endpoint_name,
 Body=test_data.to_csv(header=True, index=False),
 ContentType="text/csv",
)

print(response["Body"].read())

In [None]:
%%time
response = sm_runtime.invoke_endpoint(
 EndpointName=roberta_endpoint_name,
 Body=json.dumps(json_data),
 ContentType="application/json",
)

print(response["Body"].read())

### Deploy the distilbert model to an endpoint

We will follow similar steps to deploy the registered DistilBERT model to a real-time endpoint for inference.

In [None]:
now_distilbert = f"{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"
now_distilbert

In [None]:
distilbert_model_name = f"hf-pytorch-model-distilbert-{now_distilbert}"
print("Model name : {}".format(distilbert_model_name))
%store distilbert_model_name

primary_container = {
 "ModelPackageName": distilbert_model_version_arn,
}

create_model_response = sm_client.create_model(
    ModelName=distilbert_model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print("Model arn : {}".format(create_model_response["ModelArn"]))

In [None]:
distilbert_endpoint_config_name = f"hf-pytorch-endpoint-config-distilbert-{now_distilbert}"

distilbert_endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=distilbert_endpoint_config_name,
 ProductionVariants=[
 {
 "VariantName": "AllTrafficVariant",
 "ModelName": distilbert_model_name,
 "InitialInstanceCount": 1,
 "InstanceType": deploy_instance_type,
 "InitialVariantWeight": 1,
 },
 ],
)

distilbert_endpoint_config_response

In [None]:
%%time
distilbert_endpoint_name = f"hf-pytorch-endpoint-distilbert-{now_distilbert}"

distilbert_create_endpoint_response = sm_client.create_endpoint(
 EndpointName=distilbert_endpoint_name,
 EndpointConfigName=distilbert_endpoint_config_name,
)
utils.endpoint_creation_wait(distilbert_endpoint_name)

In [None]:
%%time
response = sm_runtime.invoke_endpoint(
 EndpointName=distilbert_endpoint_name,
 Body=test_data.to_csv(header=False, index=False),
 ContentType="text/csv",
)

print(response["Body"].read())

In [None]:
%%time
response = sm_runtime.invoke_endpoint(
 EndpointName=distilbert_endpoint_name,
 Body=json.dumps(json_data),
 ContentType="application/json",
)

print(response["Body"].read())

## Delete the endpoint (Optional)

If you do not plan to use these endpoints further, delete them to avoid incurring additional charges.

In [None]:
sm_session.delete_endpoint(roberta_endpoint_name)
sm_session.delete_endpoint(distilbert_endpoint_name)
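
Deleting an endpoint leaves its endpoint configuration and model object in place. If you also want to remove those, a small helper like the one below could be used. It is a hypothetical sketch that assumes `sm_client` is a Boto3 SageMaker client and calls its `delete_endpoint_config` and `delete_model` methods.

```python
def delete_endpoint_resources(sm_client, endpoint_config_name, model_name):
    """Delete the endpoint configuration and model left behind by an endpoint."""
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

For example, `delete_endpoint_resources(sm_client, roberta_endpoint_config_name, roberta_model_name)` would clean up the RoBERTa resources created earlier. The registered model packages in the Model Registry are not affected.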