# Compiling a Hugging Face model for AWS Inferentia with SageMaker Neo

This notebook walks through downloading and preparing a pre-trained PyTorch language model from the Hugging Face Model Hub and deploying it on AWS Inferentia, a purpose-built hardware accelerator for inference.

Things to take into account when preparing the PyTorch model:

#### PyTorch versions:

- the model is pre-trained, but it will be traced with `torch.jit.trace` and saved as TorchScript
- the PyTorch version used to save the model is the version that must be passed to the SageMaker `PyTorchModel` and to the Neo compilation job

#### Neo Compilation:

- the compilation job needs to find the model under `model_data` with the filename `model.pth`
- the PyTorch version given to SageMaker Neo must match the PyTorch version used to save the model in this notebook
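
One way to keep the two versions aligned is to derive the version string from the installed PyTorch instead of typing it in several places. The helper below is a hypothetical sketch (`to_framework_version` is not part of PyTorch or the SageMaker SDK). It assumes that a local build suffix such as `+cpu` should be dropped, and it does not check whether Neo supports the resulting version.

```python
def to_framework_version(torch_version: str) -> str:
    """Drop a local build suffix such as '+cpu' or '+cu113' from a version string."""
    return torch_version.split("+", 1)[0]


# In the notebook this would be called as to_framework_version(torch.__version__).
# A literal string is used here so the sketch runs without PyTorch installed.
framework_version = to_framework_version("1.10.1+cpu")
print(framework_version)
```

The resulting string could then be reused for both the model definition and the compilation job.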

#### Inference script:

- the inference script passed to the SageMaker `PyTorchModel` must load the model with `torch.jit.load`
- a `requirements.txt` listing the `transformers` package must be added to the code directory so that the inference script can use the tokenizer from the Hugging Face Transformers library




## Setting up our environment

For the PyTorch versions supported by Neo, please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-cloud.html). For this workshop we will compile the model with a PyTorch version that is available as a standard SageMaker kernel. Please make sure you are using the **SageMaker Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized)** kernel, indicated at the top-right of the JupyterLab interface.

Let's begin by installing the Hugging Face Transformers package so that we can download the pre-trained model and tokenizer and save the model locally as TorchScript.

In [None]:
%pip install -U transformers==4.15.0

In [None]:
import transformers
print(transformers.__version__)

If you run this notebook in SageMaker Studio, make sure ipywidgets is installed and restart the kernel by running the next cell.

In [None]:
%%capture
import IPython
import sys

!{sys.executable} -m pip install ipywidgets
IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used

In [None]:
import transformers
import sagemaker
import torch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
sess_bucket = sagemaker_session.default_bucket()

## Retrieving the model from Hugging Face Model Hub

The model [bert-base-cased-finetuned-mrpc](https://huggingface.co/bert-base-cased-finetuned-mrpc) is one of the most downloaded models from the Hugging Face Model Hub. This model is a fine-tuned version of bert-base-cased on the GLUE MRPC dataset. It achieves the following results on the evaluation set:

- Loss: 0.7132
- Accuracy: 0.8603
- F1: 0.9026
- Combined Score: 0.8814


**Note:** It is important to set the `return_dict` parameter to `False` when instantiating the model. In `transformers` v4.x, this parameter is `True` by default, which makes the model return dict-like Python objects containing the outputs instead of standard tuples. Neuron compilation does not support dictionary-based model outputs, and compilation would fail if we didn't explicitly set it to `False`.

We also get the tokenizer corresponding to this same model, in order to create a sample input to trace our model. 

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased-finetuned-mrpc", return_dict=False
)

## Tracing model with `torch.jit` and uploading to S3 

We use `torch.jit.trace` to create a TorchScript version of the model. This step is required because SageMaker Neo compiles a traced model artifact, which it expects inside a `tar.gz` file.

The `.pth` extension is required when saving the model.

In [None]:
# Prepare sample input for jit model tracing
seq_0 = "This is just sample text for model tracing, the length of the sequence does not matter because we will pad to the max length that Bert accepts."
seq_1 = seq_0
max_length = 512

tokenized_sequence_pair = tokenizer.encode_plus(
    seq_0, seq_1, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt"
)

example = tokenized_sequence_pair["input_ids"], tokenized_sequence_pair["attention_mask"]

traced_model = torch.jit.trace(model.eval(), example)
traced_model.save("model.pth")

In [None]:
import tarfile

with tarfile.open("model.tar.gz", "w:gz") as f:
    f.add("model.pth")
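
Before uploading, it can be useful to confirm that the archive holds a single `.pth` file at its root, mirroring the layout produced by the cell above. The sketch below uses only the standard library and a placeholder file in a temporary directory, so it does not touch the real `model.tar.gz`. `archive_members` is a hypothetical helper name.

```python
import os
import tarfile
import tempfile


def archive_members(archive_path: str) -> list:
    """Return the member names stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Build a throwaway archive with a placeholder file to show the expected layout.
workdir = tempfile.mkdtemp()
placeholder = os.path.join(workdir, "model.pth")
with open(placeholder, "wb") as fh:
    fh.write(b"placeholder bytes, not a real traced model")

archive_path = os.path.join(workdir, "model.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    # arcname keeps the file at the archive root instead of embedding the temp path
    tar.add(placeholder, arcname="model.pth")

members = archive_members(archive_path)
print(members)
```

Calling `archive_members("model.tar.gz")` on the real archive should list the same single entry.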

Next, upload the traced model `tar.gz` file to Amazon S3, where the compilation job will download it from.

In [None]:
traced_model_url = sagemaker_session.upload_data(
    path="model.tar.gz",
    key_prefix="neuron-experiments/bert-seq-classification/traced-model",
)
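
`upload_data` returns the S3 URI of the uploaded file, which is later passed as `model_data`. Assuming the SDK's usual layout of `s3://<bucket>/<key_prefix>/<filename>`, the returned value can be pictured as below. The bucket name is a made-up placeholder and `expected_s3_uri` is a hypothetical helper, not an SDK function.

```python
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Compose the S3 URI that a single uploaded file is expected to receive."""
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{posixpath.join(key_prefix, filename)}"


uri = expected_s3_uri(
    "example-bucket",  # placeholder; the notebook uses the session's default bucket
    "neuron-experiments/bert-seq-classification/traced-model",
    "model.tar.gz",
)
print(uri)
```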

## Understanding the inference code

The inference script and `requirements.txt` are placed in the `code` directory. Both files are uploaded by the SageMaker `PyTorchModel` when it is created with:

- entry_point="inference_inf1.py",
- source_dir="code",

SageMaker automatically pulls a container built for the specified PyTorch version and uses the script given as `entry_point` to override the functions `model_fn`, `input_fn`, `predict_fn`, and `output_fn`:

- `model_fn` receives the model directory and is responsible for loading and returning the model
- `input_fn` and `output_fn` are in charge of pre-processing and post-processing, including checking the content types of the input to and the output from the endpoint
- `predict_fn` receives the outputs of `model_fn` and `input_fn` (that is, the loaded model and the deserialized, pre-processed input data) and defines how the model runs inference
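
The way these four functions fit together can be pictured with a small standalone sketch. It is a simplified model of the request flow, not the serving container's actual code: `handle_request` is a hypothetical name, and the "model" is a toy stand-in that only checks whether the two inputs are equal.

```python
import json

JSON_CONTENT_TYPE = "application/json"


def model_fn(model_dir):
    # Stand-in for torch.jit.load: a "model" that reports whether both inputs match.
    return lambda a, b: int(a == b)


def input_fn(serialized_input_data, content_type=JSON_CONTENT_TYPE):
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(serialized_input_data)


def predict_fn(input_data, model):
    return {"label": model(input_data[0], input_data[1])}


def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    if accept != JSON_CONTENT_TYPE:
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction), accept


def handle_request(body, content_type, accept, model):
    """Chain the handlers: deserialize, predict, then serialize."""
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


model = model_fn("/opt/ml/model")  # loaded once, then reused for every request
response, response_type = handle_request(
    json.dumps(["same text", "same text"]), JSON_CONTENT_TYPE, JSON_CONTENT_TYPE, model
)
print(response, response_type)
```

Keeping the loaded model separate from the per-request functions is what lets the endpoint pay the loading cost only once.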



In [None]:
!mkdir -p code

In [None]:
%%writefile code/inference_inf1.py
import os
import json
import torch
import torch_neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

JSON_CONTENT_TYPE = 'application/json'

def model_fn(model_dir):
    # SageMaker extracts the model artifact into this directory
    model_dir = '/opt/ml/model/'
    dir_contents = os.listdir(model_dir)
    # Pick the first entry whose name contains 'model'
    model_path = next(filter(lambda item: 'model' in item, dir_contents), None)

    tokenizer_init = AutoTokenizer.from_pretrained('bert-base-cased-finetuned-mrpc')
    model = torch.jit.load(os.path.join(model_dir, model_path))

    return (model, tokenizer_init)


def input_fn(serialized_input_data, content_type=JSON_CONTENT_TYPE):
    if content_type == JSON_CONTENT_TYPE:
        input_data = json.loads(serialized_input_data)
        return input_data

    raise Exception('Requested unsupported ContentType: ' + content_type)


def predict_fn(input_data, models):
    model_bert, tokenizer = models
    sequence_0 = input_data[0]
    sequence_1 = input_data[1]

    max_length = 512
    tokenized_sequence_pair = tokenizer.encode_plus(sequence_0,
                                                    sequence_1,
                                                    max_length=max_length,
                                                    padding='max_length',
                                                    truncation=True,
                                                    return_tensors='pt')

    # Convert example inputs to a format that is compatible with TorchScript tracing
    example_inputs = tokenized_sequence_pair['input_ids'], tokenized_sequence_pair['attention_mask']

    with torch.no_grad():
        paraphrase_classification_logits_neuron = model_bert(*example_inputs)

    classes = ['not paraphrase', 'paraphrase']
    paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()
    out_str = 'BERT predicts that "{}" and "{}" are {}'.format(sequence_0, sequence_1, classes[paraphrase_prediction])

    return out_str


def output_fn(prediction_output, accept=JSON_CONTENT_TYPE):
    if accept == JSON_CONTENT_TYPE:
        return json.dumps(prediction_output), accept

    raise Exception('Requested unsupported ContentType in Accept: ' + accept)


Within `model_fn`, the model artifact located in `model_dir` is loaded with `torch.jit.load` (the compilation step names the artifact `model_neuron.pt`).

The `torch_neuron` package must be imported in the inference script so that the Neuron-compiled model can be loaded, and the `transformers` package must be imported for the tokenizer.
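
The artifact lookup in `model_fn` takes the first directory entry whose name contains `model` and would fail with a less obvious error if nothing matched. A slightly more defensive variant is sketched below. `find_model_artifact` is a hypothetical helper, and sorting the candidates is an added assumption to make the choice deterministic. A temporary directory stands in for `/opt/ml/model`.

```python
import os
import tempfile


def find_model_artifact(model_dir: str) -> str:
    """Return the path of the first file (in sorted order) whose name contains 'model'."""
    candidates = sorted(name for name in os.listdir(model_dir) if "model" in name)
    if not candidates:
        raise FileNotFoundError(f"No model artifact found in {model_dir}")
    return os.path.join(model_dir, candidates[0])


# Demonstrate with a temporary directory containing an empty placeholder artifact.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "model_neuron.pt"), "wb").close()

artifact = find_model_artifact(demo_dir)
print(os.path.basename(artifact))
```

In the inference script, the returned path would be handed to `torch.jit.load`.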

In [None]:
%%writefile code/requirements.txt
transformers==4.15.0


## Compiling and deploying model on an Inferentia inf1 instance

The newly created `PyTorchModel` uses `inference_inf1.py` as its entry point script. PyTorch version 1.10.1 is specified because, at the time of writing, it is the latest version supported by Neo.

In [None]:
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from datetime import datetime

prefix = "neuron-experiments/bert-paraphrase"
flavour = "INF"
date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

compiled_sm_model = PyTorchModel(
    model_data=traced_model_url,
    predictor_cls=Predictor,
    framework_version="1.10.1",
    role=role,
    sagemaker_session=sagemaker_session,
    entry_point="inference_inf1.py",
    source_dir="code",
    py_version="py3",
    name=f"{flavour}-bert-mrpc-pt101-{date_string}",
    env={"SAGEMAKER_CONTAINER_LOG_LEVEL": "10"},
)

Finally, we are ready to compile the model. Two notes here:
* Hugging Face models should be compiled to `dtype` `int64`
* the format for `compiler_options` differs from the standard Python `dict` that you can use when compiling for "normal" instance types; for Inferentia, you must provide a JSON string with CLI arguments, which correspond to the ones supported by the [Neuron Compiler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-cc/command-line-reference.html) (read more about `compiler_options` [here](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html#API_OutputConfig_Contents))
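
The difference between the two `compiler_options` formats is easy to miss, so here is a small standard-library illustration of what each value looks like. It only shows the Python values involved and does not call SageMaker.

```python
import json

# Inferentia targets: a JSON-encoded string holding Neuron compiler CLI arguments.
inferentia_options = json.dumps("--dtype int64")

# Other targets, per the note above: a plain Python dict.
standard_options = {"dtype": "int64"}

print(repr(inferentia_options))        # the string now carries its own double quotes
print(json.loads(inferentia_options))  # decoding gives back the raw CLI arguments
```

Note that `json.dumps` on a string wraps it in double quotes, which is what makes the value a valid JSON string rather than bare text.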

Compilation of the model will take ~13 minutes.

In [None]:
%%time
import json

hardware = "inf1"
flavour = "compiled-inf"
compilation_job_name = f"bert-{flavour}-{hardware}-" + date_string

compiled_inf1_model = compiled_sm_model.compile(
    target_instance_family=f"ml_{hardware}",
    input_shape={"input_ids": [1, 512], "attention_mask": [1, 512]},
    job_name=compilation_job_name,
    role=role,
    framework="pytorch",
    framework_version="1.10.1",
    output_path=f"s3://{sess_bucket}/{prefix}/neo-compilations/{flavour}-model",
    compiler_options=json.dumps("--dtype int64"),
    # compiler_options={'dtype': 'int64'}, # For compiling to "normal" instance types, cpu or gpu-based
    compile_max_run=900,
)
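
The `input_shape` passed to `compile` has to agree with the padded length used when tracing the model and again in `predict_fn` (512 in this notebook), since the model is compiled for those shapes. A minimal sketch, assuming a single shared constant is acceptable; `neo_input_shape` is a hypothetical helper, not part of the SageMaker SDK:

```python
MAX_LENGTH = 512  # same value used for padding when the model was traced
BATCH_SIZE = 1    # the model was traced with a single sequence pair


def neo_input_shape(batch_size: int, max_length: int) -> dict:
    """Build the input_shape mapping for the two tensors the traced model takes."""
    shape = [batch_size, max_length]
    return {"input_ids": list(shape), "attention_mask": list(shape)}


input_shape = neo_input_shape(BATCH_SIZE, MAX_LENGTH)
print(input_shape)
```

Deriving the mapping from one constant avoids the shapes drifting apart if the maximum length is changed later.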

After successful compilation, we deploy our model to an inf1.xlarge Inferentia-powered instance. Endpoint deployment will take ~10 minutes.

In [None]:
%%time
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

compiled_inf1_predictor = compiled_inf1_model.deploy(
    instance_type="ml.inf1.xlarge",
    initial_instance_count=1,
    endpoint_name=f"test-neo-{hardware}-{date_string}",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

Next, we submit an inference request to the endpoint.

In [None]:
# Predict with model endpoint
payload = seq_0, seq_1
compiled_inf1_predictor.predict(payload)
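
The payload is a Python tuple, but `input_fn` in the inference script indexes the decoded value as `input_data[0]` and `input_data[1]`. The round trip below shows why that works, assuming the serializer ends up encoding the tuple with `json.dumps`. The two sentences are made-up examples.

```python
import json

seq_a = "The company reported higher profits."  # hypothetical example sentences
seq_b = "Profits at the company went up."

payload = seq_a, seq_b       # a Python tuple, as in the cell above
body = json.dumps(payload)   # JSON has no tuple type, so this becomes an array
received = json.loads(body)  # what json.loads yields inside input_fn

print(type(received).__name__, received[0] == seq_a, received[1] == seq_b)
```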

### Clean up

When you are finished with your Inferentia-based SageMaker endpoint, run the following code to remove the associated resources:

In [None]:
compiled_inf1_predictor.delete_model()
compiled_inf1_predictor.delete_endpoint()