# Deploy Hugging Face BART transformer model in Amazon SageMaker 

This notebook is a step-by-step tutorial on deploying a pre-trained Hugging Face model [BART](https://huggingface.co/transformers/model_doc/bart.html) on [PyTorch](https://pytorch.org/) framework. Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT). Specifically, we use the BART Model with a language modeling head [BartForConditionalGeneration](https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration) for summarization task. 

We will describe the steps for deploying this model similar to any other PyTorch model on Amazon SageMaker with TorchServe serving stack. For training Hugging Face models on SageMaker, refer the examples [here](https://github.com/huggingface/notebooks/tree/master/sagemaker)

The outline of steps is as follows:

1. Download pre-trained Hugging Face model
2. Save and upload model artifact to S3
2. Create an inference entrypoint script
3. Deploy endpoint
4. Trigger endpoint invocation

In [2]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.utils import name_from_base
from sagemaker.pytorch import PyTorchModel
import boto3
import torch


In [9]:
from sagemaker import get_execution_role

role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'hfdeploypytorch'
hf_cache_dir = 'hf_cache_dir/'

print(region)
print(role)
print(bucket)

us-east-1
arn:aws:iam::208480242416:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml
sagemaker-us-east-1-208480242416


## Download the Hugging Face pretrained model

In [None]:
!pip install transformers==4.5.1 --quiet

In [22]:
!pip freeze | grep transformers

transformers==4.5.1


In [6]:
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig

In [24]:
#Download a pre-tuned bart transformer and move the model artifact to  S3 bucket
PRE_TRAINED_MODEL_NAME='facebook/bart-large-cnn'
# Note that we use a specific HF cache dir, to avoid using the default cache dirs that might fill 
# root disk space.
model = BartForConditionalGeneration.from_pretrained(PRE_TRAINED_MODEL_NAME, cache_dir=hf_cache_dir)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1625270765.0, style=ProgressStyle(descr…




In [26]:
model.save_pretrained('./models/bart_model/')

In [27]:
tokenizer = BartTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
tokenizer.save_pretrained('./models/bart_tokenizer/')

('./models/bart_tokenizer/tokenizer_config.json',
 './models/bart_tokenizer/special_tokens_map.json',
 './models/bart_tokenizer/vocab.json',
 './models/bart_tokenizer/merges.txt',
 './models/bart_tokenizer/added_tokens.json')

## Add inference code and requirements.txt

We are manually adding the inference code and requirements.txt to the model folder, to avoid the SM Python SDK having to repack the model.tar.gz archive when executing deployment. Since there are large models, the repack operation can take some time (downlaod from S3, repack, re-upload).

In [28]:
! mkdir -p models/code

The custom inference code must be stored in the code/ folder in the model archive, and the name of the entrypoint module is inference.py by default. You can customize that by passing an environment variable named SAGEMAKER_PROGRAM when creating the Model object (see below).

In [29]:
! cp source_dir/inference.py models/code/inference.py
! cp source_dir/requirements.txt models/code/requirements.txt

## Create model archive and upload to S3 


In [30]:
!tar -C models/ -cvzf model.tar.gz bart_model/ bart_tokenizer/ code/

bart_model/
bart_model/config.json
bart_model/pytorch_model.bin
bart_tokenizer/
bart_tokenizer/merges.txt
bart_tokenizer/tokenizer_config.json
bart_tokenizer/vocab.json
bart_tokenizer/special_tokens_map.json
code/
code/inference.py
code/requirements.txt


In [31]:
from sagemaker.s3 import S3Uploader
file_key = 'model.tar.gz'
model_artifact = S3Uploader.upload(file_key,'s3://{}/{}/model'.format(bucket, prefix))
print(model_artifact)

s3://sagemaker-us-east-1-208480242416/hfdeploypytorch/model/model.tar.gz


## Deploy model to a SageMaker endpoint

In [7]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

class Summarizer(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super().__init__(endpoint_name, sagemaker_session=sagemaker_session,
                         serializer=JSONSerializer(), 
                         deserializer=JSONDeserializer())


from sagemaker.image_uris import retrieve

deploy_instance_type = 'ml.g4dn.xlarge'

pytorch_inference_image_uri = retrieve('pytorch',
                                       region,
                                       version='1.8.1',
                                       py_version='py3',
                                       instance_type = deploy_instance_type,
                                       accelerator_type=None,
                                       image_scope='inference')
print(pytorch_inference_image_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.1-gpu-py3


In [33]:
from sagemaker.model import Model

hf_model = Model(model_data=model_artifact,
                 image_uri=pytorch_inference_image_uri,
                 predictor_cls=Summarizer,
                 sagemaker_session=sagemaker_session,
                 #env = {
                 #    'SAGEMAKER_PROGRAM': 'inference.py'
                 #},
                 role=role)

In [34]:
predictor = hf_model.deploy(instance_type=deploy_instance_type,
                            initial_instance_count=1)
predictor

-----------------------------!

<__main__.Summarizer at 0x7f38b5ea2cf8>

In [None]:
endpoint_name = predictor.endpoint_name

## Test inference

In [14]:
with open('article.txt') as f:
    content = f.read()
content = content.replace('\n', ' ')

json_request_data = {"text": "{0}"}
json_request_data["text"] = json_request_data["text"].format(content)

json_request_data

{'text': 'The Amazon Technical Academy upskilling program targets Amazon employees aspiring to become software engineers. Its leader says education is key to long-term success. The benefits are vast for Amazon employees accepted into Amazon Technical Academy, one of six training programs included in Upskilling 2025, Amazon’s $700 million commitment to equip more than 100,000 Amazon employees with new professional skills by 2025.  Amazon Technical Academy trains employees in the essential skills needed to transition to entry-level software developer engineer roles at Amazon. The program requires no previous computer training from applicants, only a high school diploma or GED—and the fortitude to get through a rigorous nine-month, full-time program created by expert Amazon software engineers.  Hundreds of Amazon employees have enrolled in Amazon Technical Academy since its launch in 2017. Amazon Technical Academy has placed 98% of its graduates into software development engineer roles wi

In [36]:
%%time
prediction = predictor.predict(json_request_data)
print(prediction)

Amazon Technical Academy trains employees in the essential skills needed to transition to entry-level software developer engineer roles at Amazon. The program requires no previous computer training from applicants, only a high school diploma or GED. Hundreds of Amazon employees have enrolled in Amazon Technical Academy since its launch in 2017.
CPU times: user 11.1 ms, sys: 0 ns, total: 11.1 ms
Wall time: 2.74 s


In [12]:
predictor = Summarizer(endpoint_name,sagemaker_session)

In [18]:
import time,numpy as np
results = []
for i in (1,100):
    start = time.time()
    prediction = predictor.predict(json_request_data)
    results.append((time.time() - start) * 1000)
print("\nPredictions for HuggingFace PyTorch models with SageMaker : \n")
print('\nP95: ' + str(np.percentile(results, 95)) + ' ms\n')    
print('P90: ' + str(np.percentile(results, 90)) + ' ms\n')
print('Average: ' + str(np.average(results)) + ' ms\n')


Predictions for HuggingFace PyTorch models with SageMaker : 


P95: 967.6558971405029 ms

P90: 967.0803546905518 ms

Average: 962.4760150909424 ms



## Delete endpoint

In [None]:

predictor.delete_endpoint(delete_endpoint_config=True)

In [None]:
predictor.delete_model()