# Deploy Hugging Face BART transformer models with multi-model endpoints 

This notebook is a step-by-step tutorial on deploying multiple pre-trained PyTorch Hugging Face model [BART](https://huggingface.co/transformers/model_doc/bart.html) with multi-model endpoint on Amazon SageMaker. Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT). Specifically, we use the BART Model with a language modeling head [BartForConditionalGeneration](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration) for summarization task. 

We will describe the steps for deploying a multi-model endpoint on Amazon SageMaker with TorchServe serving stack. An additional step compared to single model deployment is the requirement to create a manifest file for each model prior to deployment. For training Hugging Face models on SageMaker, refer the examples [here](https://github.com/huggingface/notebooks/tree/master/sagemaker)

The outline of steps is as follows:

1. Download 2 pre-trained Hugging Face model
2. Use torch-archiver to create a manifest file for each model
3. Save and upload model artifact to S3
4. Create an inference entrypoint script
5. Deploy multi-model endpoint
6. Trigger endpoint invocation

In [1]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.utils import name_from_base
from sagemaker.pytorch import PyTorchModel
import boto3
import torch


In [2]:
from sagemaker import get_execution_role

role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'hf-multimodel-deploy-pytorch'
hf_cache_dir = 'hf_cache_dir/'

print(region)
print(role)
print(bucket)

model_data_path = 's3://{0}/{1}/models'.format(bucket,prefix)


us-east-1
arn:aws:iam::208480242416:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml
sagemaker-us-east-1-208480242416


## Download the Hugging Face pretrained model

In [3]:
!pip install transformers==4.5.1 --quiet

In [4]:
!pip freeze | grep transformers

transformers==4.5.1


In [5]:
pip install -U ipywidgets --quiet

Note: you may need to restart the kernel to use updated packages.


In [6]:
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [7]:
#Download a pre-tuned bart transformer and move the model artifact to  S3 bucket
PRE_TRAINED_MODEL_NAME1='facebook/bart-large-cnn'
# Note that we use a specific HF cache dir, to avoid using the default cache dirs that might fill 
# root disk space.
model1 = BartForConditionalGeneration.from_pretrained(PRE_TRAINED_MODEL_NAME1, cache_dir=hf_cache_dir)

Downloading:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

In [8]:
#Create a fine-tuned bart transformer  and host on S3 bucket
#This is generally a training process to fine tune as per your own data
#In this example, we save the exact pretrained model itself
PRE_TRAINED_MODEL_NAME2='sshleifer/distilbart-cnn-12-6'
model2 = BartForConditionalGeneration.from_pretrained(PRE_TRAINED_MODEL_NAME2, cache_dir=hf_cache_dir)


Downloading:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

In [9]:
tokenizer1 = BartTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME1)
tokenizer2 = BartTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME2)

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

## Create and upload model archive to S3 


In [10]:
model1.save_pretrained('./models/model1/bart_model/')
tokenizer1.save_pretrained('./models/model1/bart_tokenizer/')

('./models/model1/bart_tokenizer/tokenizer_config.json',
 './models/model1/bart_tokenizer/special_tokens_map.json',
 './models/model1/bart_tokenizer/vocab.json',
 './models/model1/bart_tokenizer/merges.txt',
 './models/model1/bart_tokenizer/added_tokens.json')

In [11]:
model2.save_pretrained('./models/model2/bart_model/')
tokenizer2.save_pretrained('./models/model2/bart_tokenizer/')

('./models/model2/bart_tokenizer/tokenizer_config.json',
 './models/model2/bart_tokenizer/special_tokens_map.json',
 './models/model2/bart_tokenizer/vocab.json',
 './models/model2/bart_tokenizer/merges.txt',
 './models/model2/bart_tokenizer/added_tokens.json')

In [12]:
!pip install -q torch-model-archiver

In [13]:
from model_archiver import model_packaging
from model_archiver import model_packaging_utils
from model_archiver import arg_parser

#we add the handler from the sagemaker pytorch inference toolkit to the list of known handlers to model archiver
model_packaging_utils.model_handlers['sagemaker_pytorch_serving_container.handler_service'] = 'toolkit_handler'

arguments = '--model-name summarizer1 --version 1.0 \
--export-path models \
--extra-files models/model1/ \
--handler sagemaker_pytorch_serving_container.handler_service --archive-format no-archive'
args = arg_parser.ArgParser.export_model_args_parser().parse_args(arguments.split(' '))
manifest = model_packaging_utils.ModelExportUtils.generate_manifest_json(args)
model_packaging.package_model(args, manifest=manifest)

In [14]:
!rm -rf models/model1

In [15]:
from model_archiver import model_packaging
from model_archiver import model_packaging_utils
from model_archiver import arg_parser

#we add the handler from the sagemaker pytorch inference toolkit to the list of known handlers to model archiver
model_packaging_utils.model_handlers['sagemaker_pytorch_serving_container.handler_service'] = 'toolkit_handler'

arguments = '--model-name summarizer2 --version 1.0 \
--export-path models/ \
--extra-files models/model2/ \
--handler sagemaker_pytorch_serving_container.handler_service --archive-format no-archive'
args = arg_parser.ArgParser.export_model_args_parser().parse_args(arguments.split(' '))
manifest = model_packaging_utils.ModelExportUtils.generate_manifest_json(args)
model_packaging.package_model(args, manifest=manifest)

In [16]:
!rm -rf models/model2

## Add inference code 

We are manually adding the inference code  to the model folder, to avoid the SM Python SDK having to repack the model.tar.gz archive when executing deployment. Since there are large models, the repack operation can take some time (downlaod from S3, repack, re-upload).
The custom inference code must be stored in the code/ folder in the model archive, and the name of the entrypoint module is inference.py by default. You can customize that by passing an environment variable named SAGEMAKER_PROGRAM when creating the Model object (see below).

In [17]:
!mkdir models/summarizer1/code
!mkdir models/summarizer2/code

! cp source_dir/model1/inference.py models/summarizer1/code/inference.py
! cp source_dir/model2/inference.py models/summarizer2/code/inference.py

## Create model archive and upload to S3

In [None]:
!tar -czvf models/summarizer1.tar.gz -C models/summarizer1/ .
!tar -czvf models/summarizer2.tar.gz -C models/summarizer2/ .

./
./code/
./code/inference.py
./bart_tokenizer/
./bart_tokenizer/merges.txt
./bart_tokenizer/tokenizer_config.json
./bart_tokenizer/vocab.json
./bart_tokenizer/special_tokens_map.json
./bart_model/
./bart_model/config.json
./bart_model/pytorch_model.bin


In [None]:
from sagemaker.s3 import S3Uploader
S3Uploader.upload('models/summarizer1.tar.gz', model_data_path)
S3Uploader.upload('models/summarizer2.tar.gz', model_data_path)

## Create a MultiDataModel and deploy to a SageMaker endpoint

In [None]:
from sagemaker.model import Model
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

from sagemaker.image_uris import retrieve

deploy_instance_type = 'ml.m5.4xlarge'

pytorch_inference_image_uri = retrieve('pytorch',
                                       region,
                                       version='1.8.1',
                                       py_version='py3',
                                       instance_type = deploy_instance_type,
                                       accelerator_type=None,
                                       image_scope='inference')
print(pytorch_inference_image_uri)

In [None]:
import time

dummy_model = Model(name = 'dummy_model_pt',
                    image_uri = pytorch_inference_image_uri,
                    role = role,
                    model_data = '')

multi_model = MultiDataModel(name              = 'pytorch-multi-model-summarizer-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()),
                             model             = dummy_model,
                             model_data_prefix = model_data_path)

endpoint_name = 'torchserve-multimodel-summarizer-endpoint-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

predictor = multi_model.deploy(instance_type=deploy_instance_type,
                               initial_instance_count=1,
                               endpoint_name = endpoint_name)

In [29]:
print(model_data_path)

s3://sagemaker-us-east-1-208480242416/hf-multimodel-deploy-pytorch/models


In [30]:
from sagemaker.predictor import Predictor
pred = Predictor(endpoint_name)
pred.serializer = sagemaker.serializers.JSONSerializer()
pred.deserializer = sagemaker.deserializers.JSONDeserializer()

## Invoke endpoint 

In [31]:
with open('article.txt') as f:
    content = f.read()
content = content.replace('\n', ' ')

json_request_data = {"text": "{0}"}
json_request_data["text"] = json_request_data["text"].format(content)

json_request_data

{'text': 'The Amazon Technical Academy upskilling program targets Amazon employees aspiring to become software engineers. Its leader says education is key to long-term success. The benefits are vast for Amazon employees accepted into Amazon Technical Academy, one of six training programs included in Upskilling 2025, Amazon’s $700 million commitment to equip more than 100,000 Amazon employees with new professional skills by 2025.  Amazon Technical Academy trains employees in the essential skills needed to transition to entry-level software developer engineer roles at Amazon. The program requires no previous computer training from applicants, only a high school diploma or GED—and the fortitude to get through a rigorous nine-month, full-time program created by expert Amazon software engineers.  Hundreds of Amazon employees have enrolled in Amazon Technical Academy since its launch in 2017. Amazon Technical Academy has placed 98% of its graduates into software development engineer roles wi

In [32]:
%%time
model_archive = '/summarizer1.tar.gz'
pred.predict(json_request_data, target_model=model_archive, target_variant=None)

CPU times: user 11.2 ms, sys: 0 ns, total: 11.2 ms
Wall time: 3.45 s


'Amazon Technical Academy trains employees in the essential skills needed to transition to entry-level software developer engineer roles at Amazon. The program requires no previous computer training from applicants, only a high school diploma or GED. Hundreds of Amazon employees have enrolled in Amazon Technical Academy since its launch in 2017.'

In [33]:
%%time
model_archive = '/summarizer2.tar.gz'
pred.predict(json_request_data, target_model=model_archive, target_variant=None)

CPU times: user 3.99 ms, sys: 0 ns, total: 3.99 ms
Wall time: 2.61 s


' Amazon upskilling program targets Amazon employees aspiring to become software engineers. The program requires no previous computer training from applicants, only a high school diploma or GED. Amazon Technical Academy has placed 98% of its graduates into software development engineer roles within Amazon. Graduates come from a vast array of professional backgrounds at Amazon.'

In [35]:
import time,numpy as np
results = []
for i in (1,100):
    start = time.time()
    model_archive = '/summarizer1.tar.gz'
    pred.predict(json_request_data, target_model=model_archive, target_variant=None)
    results.append((time.time() - start) * 1000)
print("\nPredictions for HuggingFace PyTorch multi-model endpoints with SageMaker : \n")
print('\nP95: ' + str(np.percentile(results, 95)) + ' ms\n')    
print('P90: ' + str(np.percentile(results, 90)) + ' ms\n')
print('Average: ' + str(np.average(results)) + ' ms\n')


Predictions for HuggingFace PyTorch multi-model endpoints with SageMaker : 


P95: 3644.779336452484 ms

P90: 3639.8558855056763 ms

Average: 3600.4682779312134 ms



## Cleanup

In [None]:

predictor.delete_endpoint(delete_endpoint_config=True)

In [None]:
predictor.delete_model()