# Deploy Hugging Face transformer models with multi-model endpoints 

***
This notebook is designed to run on the `Python 3 Data Science 2.0` kernel in Amazon SageMaker Studio.
***

This notebook describes the steps for deploying a multi-model endpoint on Amazon SageMaker with the TorchServe serving stack. Compared with single-model deployment, there is one additional step: a manifest file must be created for each model prior to deployment. For training Hugging Face models on SageMaker, refer to the examples [here](https://github.com/huggingface/notebooks/tree/master/sagemaker).

We will perform the following steps:
1. [Introduction](#Introduction) 
2. [Setup](#Setup)
3. [Register a new HuggingFace Transformer model version](#Register-a-new-HuggingFace-Transformer-model-version)
4. [Create the model metadata for multi-model endpoint](#Create-the-model-metadata-for-multi\-model-endpoint)
5. [Create the multi-model endpoint](#Create-the-multi\-model-endpoint)

## Introduction

In lab 1, we demonstrated how to deploy models to Amazon SageMaker single-model endpoints. SageMaker also supports deploying multiple models to one endpoint. There are three multi-model hosting options:
- [Host multiple models in one container behind one endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html)
- [Host multiple models which use different containers behind one endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html)
- [Host models along with pre-processing logic as serial inference pipeline behind one endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html)
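
As a rough illustration of how these three options differ at the API level, the sketch below builds `create_model` request dictionaries for each of them. The helper names, role ARN, image names and S3 paths are placeholders invented for this example, not values from this lab; the point is only the `Mode` and `InferenceExecutionConfig` fields.

```python
def multi_model_request(name, role_arn, image, s3_prefix):
    """One container serving many artifacts stored under an S3 prefix."""
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": [
            {"Image": image, "ModelDataUrl": s3_prefix, "Mode": "MultiModel"}
        ],
    }


def multi_container_request(name, role_arn, containers, serial):
    """Several containers, either chained ("Serial") or invoked one at a time ("Direct")."""
    return {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "Containers": containers,
        "InferenceExecutionConfig": {"Mode": "Serial" if serial else "Direct"},
    }


# Placeholder values for illustration only.
role_arn = "arn:aws:iam::123456789012:role/example-role"
containers = [
    {"ContainerHostname": "first", "Image": "example-image-1"},
    {"ContainerHostname": "second", "Image": "example-image-2"},
]

mme = multi_model_request("example-mme", role_arn, "example-image", "s3://example-bucket/models/")
direct = multi_container_request("example-mce", role_arn, containers, serial=False)
pipeline = multi_container_request("example-pipeline", role_arn, containers, serial=True)
```

Each dictionary could be passed to `sm_client.create_model(**request)`. This notebook uses the first form.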

This notebook gives step-by-step instructions for deploying multiple pre-trained PyTorch Hugging Face models in one container behind a multi-model endpoint on Amazon SageMaker.


## Setup

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.utils import name_from_base
from sagemaker.pytorch import PyTorchModel
from sagemaker.s3 import S3Uploader, s3_path_join
import boto3
import time
import pandas as pd
from pathlib import Path
import tarfile
import shutil 
import datetime
import json
import os,sys
p = os.path.abspath('..')
if p not in sys.path:
 sys.path.append(p)
import utils

sm_session = sagemaker.Session()
role = get_execution_role()
region = sm_session.boto_region_name
bucket = sm_session.default_bucket()
sm_client = sm_session.sagemaker_client
sm_runtime = sm_session.sagemaker_runtime_client
prefix = "sagemaker/huggingface-pytorch-sentiment-analysis"

In [None]:
%store
%store -r

In [None]:
try:
    describe_model_package_group_response = sm_client.describe_model_package_group(
        ModelPackageGroupName=model_package_group_name
    )
    print(describe_model_package_group_response)
except Exception as e:
    # Covers a missing group (ClientError) as well as a missing stored variable (NameError)
    print(f"model package group could not be described ({e}); please go through lab 1 first")

## Register a new HuggingFace Transformer model version

### Register a new model version for Hugging Face roberta model with entry point script helper function
To deploy the models in one container, we use the Hugging Face prebuilt container, which has the required packages for transformer models. However, we use a custom entry point script for each model and define our own data preprocessing function. We first download the roberta model file and package the model together with the inference script to be used in the endpoint. The updated model tar file is then registered in the model registry as a new model version.
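
The repackaging described above can be sketched as a single function. This is a minimal illustration rather than the implementation of the `utils.create_tar` helper used in this lab (which is not shown here): the function name `repackage_with_code` is hypothetical, and it assumes the inference script belongs in a top-level `code/` folder next to the model files.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def repackage_with_code(src_tar, code_dir, dst_tar):
    """Extract src_tar, add code_dir as a top-level `code/` folder, and re-archive.

    Returns the size of the new archive in megabytes.
    """
    src_tar, code_dir, dst_tar = Path(src_tar), Path(code_dir), Path(dst_tar)
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Unpack the original model artifact
        with tarfile.open(src_tar) as tar:
            tar.extractall(path=workdir)
        # Place the inference script(s) next to the model files
        shutil.copytree(code_dir, workdir / "code", dirs_exist_ok=True)
        with tarfile.open(dst_tar, "w:gz") as tar:
            # Add entries relative to the archive root so that the model files
            # and `code/` sit at the top level of the tarball
            for item in sorted(workdir.iterdir()):
                tar.add(item, arcname=item.name)
    return dst_tar.stat().st_size / (1024 * 1024)
```

Working in a temporary directory keeps the original archive untouched and avoids leaving extracted files behind.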

In [None]:
local_artifact_path = Path("model_artifacts")
local_artifact_path.mkdir(exist_ok=True, parents=True)
model_tar_name = 'model_roberta_MME.tar.gz'
org_model_tar_name = Path(model_roberta_uri).parts[-1]

In [None]:
sm_session.download_data('./', bucket, str(Path(prefix, "models", org_model_tar_name)))

In [None]:
with tarfile.open(org_model_tar_name) as tar:
 tar.extractall(path=local_artifact_path.stem)

In [None]:
shutil.copytree('../code', local_artifact_path / 'code', dirs_exist_ok=True) 

In [None]:
tar_size = utils.create_tar(model_tar_name, local_artifact_path)
print(f"Created {model_tar_name}, size {tar_size:.2f} MB")

In [None]:

model_data_path = s3_path_join("s3://", bucket, prefix + "/models")
model_roberta_mme_uri = S3Uploader.upload(model_tar_name, model_data_path)
print(f"Uploaded roberta MME model to {model_roberta_mme_uri}")
%store model_roberta_mme_uri

Next, prepare the model package parameters based on the existing roberta model package.
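
One way to derive the new parameters from a `describe_model_package` response is sketched below. The helper name `derive_model_package_request` is hypothetical. The sketch deep-copies the reused fields so the original response stays unchanged, and it assumes the first container is the one to update.

```python
import copy


def derive_model_package_request(describe_response, model_data_url, description, hostname=None):
    """Build create_model_package arguments from an existing package description."""
    keys = ["ModelPackageGroupName", "Domain", "Task", "InferenceSpecification"]
    # Deep-copy so edits to the nested container do not alter the original response
    request = {k: copy.deepcopy(describe_response[k]) for k in keys if k in describe_response}
    request["ModelPackageDescription"] = description

    container = request["InferenceSpecification"]["Containers"][0]
    container["ModelDataUrl"] = model_data_url
    if hostname:
        container["ContainerHostname"] = hostname
    # `Environment` may be absent if the original package defined no variables
    container.setdefault("Environment", {})["SAGEMAKER_SUBMIT_DIRECTORY"] = model_data_url
    return request
```

The returned dictionary could then be passed as `sm_client.create_model_package(**request)`.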

In [None]:
describe_model_package_response = sm_client.describe_model_package(
 ModelPackageName=roberta_model_package_arn
)
describe_model_package_response

In [None]:
model_package_keys = ["ModelPackageGroupName", "Domain", "Task", "InferenceSpecification"]
roberta_model_package_mme = dict()
for key in model_package_keys:
 roberta_model_package_mme[key] = describe_model_package_response[key]
roberta_model_package_mme["ModelPackageDescription"] = "Hugging Face Roberta Model MME - sentiment analysis"
roberta_model_package_mme["InferenceSpecification"]["Containers"][0]["ContainerHostname"] = "huggingface-pytorch-roberta-mme"
roberta_model_package_mme["InferenceSpecification"]["Containers"][0]["ModelDataUrl"] = model_roberta_mme_uri
roberta_model_package_mme["InferenceSpecification"]["Containers"][0]["Environment"]["SAGEMAKER_SUBMIT_DIRECTORY"] = model_roberta_mme_uri 
roberta_model_package_mme

In [None]:
model_package_response = sm_client.create_model_package(**roberta_model_package_mme)

In [None]:
list_model_packages_response = sm_client.list_model_packages(
 ModelPackageGroupName=model_package_group_name
)
list_model_packages_response

In [None]:
# we will use the roberta mme model and the distilbert model to create the multi-model endpoint
roberta_mme_model_version_arn = list_model_packages_response["ModelPackageSummaryList"][0]["ModelPackageArn"]
print(f"roberta MME model: {roberta_mme_model_version_arn}")
distilbert_model_version_arn = list_model_packages_response["ModelPackageSummaryList"][1]["ModelPackageArn"]
print(f"distilbert model: {distilbert_model_version_arn}")
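
Selecting versions by list position depends on the order in which `list_model_packages` returns them. A less order-sensitive sketch is shown below; it looks packages up by a keyword in their description. The helper name `find_package_arn` is hypothetical, and the keyword has to match the descriptions actually used when the packages were registered (for example "MME" for the package created above).

```python
def find_package_arn(summaries, keyword):
    """Return the ARN of the newest package whose description contains keyword."""
    matches = [
        s for s in summaries
        if keyword.lower() in (s.get("ModelPackageDescription") or "").lower()
    ]
    if not matches:
        raise LookupError(f"no model package description contains {keyword!r}")
    # Prefer the most recently created match
    newest = max(matches, key=lambda s: s["CreationTime"])
    return newest["ModelPackageArn"]
```

For example, `find_package_arn(list_model_packages_response["ModelPackageSummaryList"], "MME")` would pick the roberta MME version registered in this lab.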

In [None]:
# before deploying the model from model registry, we need to approve the model package version
model_package_update_input_dict = {
 "ModelPackageArn": roberta_mme_model_version_arn,
 "ModelApprovalStatus": "Approved",
}
model_package_update_response = sm_client.update_model_package(**model_package_update_input_dict)
model_package_update_response

#### Create the Roberta MME model object

In [None]:
now_roberta_mme = f'{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}'
roberta_mme_model_name = f"hf-pytorch-model-roberta-mme-{now_roberta_mme}"
print(f"Model name : {roberta_mme_model_name}")
%store roberta_mme_model_name

In [None]:
primary_container_roberta = {
 "ModelPackageName": roberta_mme_model_version_arn,
}

create_model_roberta_response = sm_client.create_model(
    ModelName=roberta_mme_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=primary_container_roberta
)

print(f"Model arn : {create_model_roberta_response['ModelArn']}")

In [None]:
inference_image_hf_mme

## Create the model metadata for multi-model endpoint
Here we use `boto3` to create the model metadata. Instead of describing a single model, this metadata enables multi-model mode and identifies the S3 location of all model artifacts. The `ModelDataUrl` field therefore specifies the S3 prefix under which the model artifacts are stored, rather than the path to a single model artifact as in single-model deployment.
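
Because `ModelDataUrl` is only a prefix, each invocation has to name the artifact relative to it through `TargetModel`. The sketch below computes that relative name, assuming the artifact location is formed by appending `TargetModel` to `ModelDataUrl` (which is consistent with the leading slash used in the invocations later in this notebook). The helper name `target_model_for` and the bucket name are hypothetical.

```python
def target_model_for(model_data_url, artifact_uri):
    """Return the string that, appended to model_data_url, yields artifact_uri."""
    if not artifact_uri.startswith(model_data_url):
        raise ValueError(f"{artifact_uri} is not located under {model_data_url}")
    remainder = artifact_uri[len(model_data_url):]
    # Guard against sibling prefixes such as ".../models2/" matching ".../models"
    if not (model_data_url.endswith("/") or remainder.startswith("/")):
        raise ValueError(f"{artifact_uri} is not located under {model_data_url}")
    return remainder


# Prefix without a trailing slash, as built in this notebook (placeholder bucket)
print(target_model_for(
    "s3://example-bucket/prefix/models",
    "s3://example-bucket/prefix/models/model_roberta_MME.tar.gz",
))
```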

In [None]:
# establish the place in S3 from which the endpoint will pull individual models
multi_model_now = f'{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}'
multi_model_name = f'pytorch-multi-model-senti-{multi_model_now}'
_container = {
 'Image': inference_image_hf_mme,
 'ModelDataUrl': model_data_path,
 'Mode': 'MultiModel'
}
create_model_response = sm_client.create_model(
 ModelName = multi_model_name,
 ExecutionRoleArn = role,
 Containers = [_container])
%store multi_model_name
print(f'Multi Model name {multi_model_name}')

## Create the multi-model endpoint
There is nothing special about the SageMaker endpoint config metadata for a multi-model endpoint. You need to consider the appropriate instance type and number of instances for the projected prediction workload. The number and size of the individual models will drive memory requirements.

Once the endpoint config is in place, the endpoint creation is straightforward.
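
Endpoint creation is asynchronous, so the notebook waits for it with `utils.endpoint_creation_wait`, whose implementation is not shown here. A minimal sketch of such a wait loop follows. The function name `wait_for_endpoint`, the polling interval and the timeout are assumptions made for illustration; `client` is expected to be a boto3 SageMaker client such as `sm_client`.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    waited = 0
    while True:
        description = client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return description
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The boto3 waiter `sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)` is a built-in alternative to a hand-written loop.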

In [None]:
endpoint_config_name = f'pytorch-multi-model-config-{multi_model_now}'
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName = endpoint_config_name,
 ProductionVariants=[{
 'InstanceType': deploy_instance_type,
 'InitialInstanceCount': 1,
 'InitialVariantWeight': 1,
 'ModelName': multi_model_name,
 'VariantName': 'AllTraffic'}])

In [None]:
endpoint_name = f'pytorch-multi-model-endpoint-{multi_model_now}'
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
 EndpointName=endpoint_name,
 EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

In [None]:
%%time
utils.endpoint_creation_wait(endpoint_name)

### Invoke multi-model endpoint

In [None]:
test_data = pd.read_csv("../sample_payload/test_data.csv", header=None)
json_data = dict({'inputs':test_data.iloc[:,0].to_list()})
test_data

In [None]:
def invoke_multi_model_endpoint(model_archive=None, content_type="JSON", test_data=None):
    """Invoke the endpoint for `model_archive` with a JSON (dict) or CSV (DataFrame) payload."""
    if content_type == "JSON":
        body, mime_type = json.dumps(test_data), "application/json"
    elif content_type == "CSV":
        body, mime_type = test_data.to_csv(header=False, index=False), "text/csv"
    else:
        # Fail fast instead of falling through to an undefined `response`
        raise ValueError(f"input content type {content_type} is not supported, please select CSV or JSON.")

    response = sm_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=body,
        ContentType=mime_type,
        TargetModel=model_archive,
    )
    return response["Body"].read()

In [None]:
%%time
model_archive = '/model_roberta_MME.tar.gz'
content_type = "JSON" #"CSV"
payload = json_data #test_data
results = invoke_multi_model_endpoint(model_archive, content_type, payload)
print(results)

In [None]:
%%time
model_archive = '/model_roberta_MME.tar.gz'
content_type = "CSV"
payload = test_data
results = invoke_multi_model_endpoint(model_archive, content_type, payload)
print(results)

In [None]:
%%time
model_archive = '/model_distilbert.tar.gz'
content_type = "JSON" #"CSV"
payload = json_data #test_data
results = invoke_multi_model_endpoint(model_archive, content_type, payload)
print(results)

In [None]:
%%time
model_archive = '/model_distilbert.tar.gz'
content_type = "CSV"
payload = test_data
results = invoke_multi_model_endpoint(model_archive, content_type, payload)
print(results)
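
The `%%time` cells above measure one call each. A multi-model endpoint loads a model when it is first requested, so the first call to each archive is usually slower than later ones. The sketch below times repeated calls per archive to make that visible. The helper name `time_invocations` is hypothetical, and `invoke` is any callable that takes a model archive name.

```python
import time


def time_invocations(invoke, model_archives, repeats=2):
    """Call invoke(archive) `repeats` times per archive and record durations in seconds."""
    timings = {}
    for archive in model_archives:
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            invoke(archive)
            durations.append(time.perf_counter() - start)
        timings[archive] = durations
    return timings
```

In this notebook it could be used as `time_invocations(lambda a: invoke_multi_model_endpoint(a, "JSON", json_data), ['/model_roberta_MME.tar.gz', '/model_distilbert.tar.gz'])`.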

## Delete the endpoint

If you do not plan to use this endpoint further, you should delete the endpoint to avoid incurring additional charges.

In [None]:
sm_session.delete_endpoint(endpoint_name)
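
Deleting the endpoint leaves the endpoint configuration and the model objects in place. If those are no longer needed either, they can be removed as well. The sketch below shows one way to do this with a boto3 SageMaker client; the helper name `delete_endpoint_resources` is hypothetical, and it is meant as an alternative to the cell above rather than a follow-up, since it also deletes the endpoint.

```python
def delete_endpoint_resources(client, endpoint_name, endpoint_config_name=None, model_names=()):
    """Delete an endpoint, then optionally its endpoint config and model objects."""
    client.delete_endpoint(EndpointName=endpoint_name)
    if endpoint_config_name:
        client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    for name in model_names:
        client.delete_model(ModelName=name)
```

For this notebook the call might look like `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, [multi_model_name, roberta_mme_model_name])`.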