# Fine-tune a PyTorch BERT model and deploy it with Elastic Inference on Amazon SageMaker

Text classification is a technique for putting text into different categories and has a wide range of applications: email providers use text classification to detect spam emails, marketing agencies use it for sentiment analysis of customer reviews, and moderators of discussion forums use it to detect inappropriate comments.

In the past, data scientists used methods such as [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), [word2vec](https://en.wikipedia.org/wiki/Word2vec), or [bag-of-words (BOW)](https://en.wikipedia.org/wiki/Bag-of-words_model) to generate features for training classification models. While these techniques have been very successful in many NLP tasks, they don't always capture the meanings of words accurately when they appear in different contexts. Recently, there has been increasing interest in using Bidirectional Encoder Representations from Transformers (BERT) to achieve better results in text classification tasks, due to its ability to more accurately encode the meaning of words in different contexts.
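To make the context limitation concrete, here is a minimal sketch (not part of the original notebook) of a bag-of-words featurizer built with the standard library. The helper name `bow_vector`, the tiny vocabulary, and the two example sentences are assumptions chosen for illustration. The word "bank" receives the same feature value in both sentences, even though it means different things in each.

```python
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bank", "river", "money"]
river_sentence = "We sat on the river bank"
money_sentence = "The bank holds my money"

river_features = bow_vector(river_sentence, vocabulary)
money_features = bow_vector(money_sentence, vocabulary)

# The "bank" feature (index 0) is identical in both vectors, so the
# feature for the word itself says nothing about which sense is meant.
print(river_features)  # [1, 1, 0]
print(money_features)  # [1, 0, 1]
```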

BERT was trained on BookCorpus and English Wikipedia data, which contain 800 million words and 2,500 million words, respectively. Training BERT from scratch would be prohibitively expensive. By taking advantage of transfer learning, one can quickly fine-tune BERT for another use case with a relatively small amount of training data to achieve state-of-the-art results for common NLP tasks, such as text classification and question answering.

Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK provides open source APIs and containers that make it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks.

Our customers often ask for quick fine-tuning and easy deployment of their NLP models. Furthermore, customers prefer low inference latency and low model inference cost. [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference) enables attaching GPU-powered inference acceleration to endpoints, reducing the cost of deep learning inference without sacrificing performance.

This blog post demonstrates how to use Amazon SageMaker to fine-tune a PyTorch BERT model and deploy it with Elastic Inference. This work is inspired by a post by [Chris McCormick and Nick Ryan](https://mccormickml.com/2019/07/22/BERT-fine-tuning).

In this example, we walk through our dataset, the training process, and finally model deployment. 

# Setup

To start, we import some Python libraries and initialize a SageMaker session, S3 bucket and prefix, and IAM role.

In [None]:
# need torch 1.3.1 for elastic inference
!pip install torch
!pip install transformers

In [2]:
import os
import numpy as np
import pandas as pd
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-bert"

role = sagemaker.get_execution_role()

# Prepare training data

We use the [Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/), a dataset of 10,657 English sentences from published linguistics literature, each labeled as grammatical or ungrammatical. We download and unzip the data using the following code:

### Download data

In [3]:
if not os.path.exists("./cola_public_1.1.zip"):
 !curl -o ./cola_public_1.1.zip https://nyu-mll.github.io/CoLA/cola_public_1.1.zip
if not os.path.exists("./cola_public/"):
 !unzip cola_public_1.1.zip

### Get sentences and labels

Let us take a quick look at our data. First we read in the training data. The only two columns we need are the sentence itself and its label. 

In [4]:
df = pd.read_csv(
 "./cola_public/raw/in_domain_train.tsv",
 sep="\t",
 header=None,
 usecols=[1, 3],
 names=["label", "sentence"],
)
sentences = df.sentence.values
labels = df.label.values

Printing out a few sentences shows how they are labeled for grammatical acceptability: 1 for acceptable and 0 for unacceptable.

In [5]:
print(sentences[20:25])
print(labels[20:25])

['The professor talked us.' 'We yelled ourselves hoarse.'
 'We yelled ourselves.' 'We yelled Harry hoarse.'
 'Harry coughed himself into a fit.']
[0 1 0 0 1]


We then split the dataset for training and testing.

In [6]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df)
train.to_csv("./cola_public/train.csv", index=False)
test.to_csv("./cola_public/test.csv", index=False)
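By default, `train_test_split` shuffles the rows and holds out 25% of them for testing, so the split changes between runs unless a seed is fixed through its `random_state` parameter. The following standard-library sketch mimics that behavior on a toy list to show the idea. The helper `split_rows` and the seed value are hypothetical, and this is not what scikit-learn runs internally.

```python
import math
import random

def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows reproducibly and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 15 5

# The same seed gives the same split on every call.
assert split_rows(rows) == (train_rows, test_rows)
```

Fixing the seed is useful when results need to be compared across runs of the notebook.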

Next, we upload both to Amazon S3 for use later. The SageMaker Python SDK provides a helpful function for uploading to Amazon S3:

In [7]:
inputs_train = sagemaker_session.upload_data("./cola_public/train.csv", bucket=bucket, key_prefix=prefix)
inputs_test = sagemaker_session.upload_data("./cola_public/test.csv", bucket=bucket, key_prefix=prefix)
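`upload_data` returns the S3 URI of the uploaded object. For a single file, that URI is made up of the bucket, the key prefix, and the file name. The sketch below reproduces this layout without calling AWS; `expected_s3_uri` is a hypothetical helper and the bucket name is a placeholder.

```python
import os

def expected_s3_uri(local_path, bucket, key_prefix):
    """Build the URI a single uploaded file is expected to have: s3://bucket/prefix/filename."""
    return f"s3://{bucket}/{key_prefix}/{os.path.basename(local_path)}"

uri = expected_s3_uri(
    "./cola_public/train.csv", "my-example-bucket", "sagemaker/DEMO-pytorch-bert"
)
print(uri)  # s3://my-example-bucket/sagemaker/DEMO-pytorch-bert/train.csv
```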

# Run training

## Training script

We use the [PyTorch-Transformers library](https://pytorch.org/hub/huggingface_pytorch-transformers), which contains PyTorch implementations and pre-trained model weights for many NLP models, including BERT.

Our training script should save model artifacts learned during training to a file path called `model_dir`, as stipulated by the SageMaker PyTorch image. Upon completion of training, model artifacts saved in `model_dir` will be uploaded to S3 by SageMaker and will become available in S3 for deployment.

We save this script in a file named `train_deploy.py`, and put the file in a directory named `code/`. The full training script can be viewed under `code/`.
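Inside the training container, SageMaker exposes the model and data locations through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_<NAME>`, and passes hyperparameters to the entry point as command-line arguments. The following is a minimal sketch of how a script could read them. It is an illustration and not the contents of `train_deploy.py`; the function name `parse_args`, the argument names for the paths, and the fallback paths are assumptions.

```python
import argparse
import os

def parse_args(argv=None):
    """Read hyperparameters from the command line and data locations from the environment."""
    parser = argparse.ArgumentParser()
    # Hyperparameters (these names mirror the ones used later in this notebook).
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--num_labels", type=int, default=2)
    parser.add_argument("--backend", type=str, default=None)
    # Locations provided by SageMaker; the fallbacks are the usual container paths.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument(
        "--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")
    )
    parser.add_argument(
        "--test", default=os.environ.get("SM_CHANNEL_TESTING", "/opt/ml/input/data/testing")
    )
    return parser.parse_args(argv)

args = parse_args(["--epochs", "1", "--num_labels", "2", "--backend", "gloo"])
print(args.epochs, args.num_labels, args.backend, args.model_dir)
```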

In [8]:
!pygmentize code/train_deploy.py

import argparse
import json
import logging
import os
import sys

import numpy as np
import pandas as pd
import torch
import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
from

## Train on Amazon SageMaker

We use Amazon SageMaker to train and deploy a model using our custom PyTorch code. The Amazon SageMaker Python SDK makes it easier to run a PyTorch script in Amazon SageMaker using its PyTorch estimator. After that, we can use the SageMaker Python SDK to deploy the trained model and run predictions. For more information on how to use this SDK with PyTorch, see [the SageMaker Python SDK documentation](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html).

To start, we use the `PyTorch` estimator class to train our model. When creating our estimator, we make sure to specify a few things:

* `entry_point`: the name of our PyTorch script. It contains our training script, which loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model. It also contains code to load and run the model during inference.
* `source_dir`: the location of our training scripts and `requirements.txt` file. `requirements.txt` lists packages you want to use with your script.
* `framework_version`: the PyTorch version we want to use.

The PyTorch estimator supports multi-machine, distributed PyTorch training. To use this, we set `instance_count` to a value greater than one. Our training script supports distributed training only on GPU instances.
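In a multi-instance job, each container can work out its role from the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables that SageMaker sets. The sketch below shows one way to derive the world size and host rank needed to initialize a `torch.distributed` process group. `host_rank_and_world_size` is a hypothetical helper, and the host names are example values.

```python
import json

def host_rank_and_world_size(hosts_json, current_host):
    """Return (rank, world_size) for this host given the JSON list of all hosts."""
    hosts = sorted(json.loads(hosts_json))
    return hosts.index(current_host), len(hosts)

# Example values; in a training container these would come from
# os.environ["SM_HOSTS"] and os.environ["SM_CURRENT_HOST"].
rank, world_size = host_rank_and_world_size('["algo-1", "algo-2"]', "algo-2")
print(rank, world_size)  # 1 2
```

These values could then be passed to `torch.distributed.init_process_group` together with the chosen backend.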

After creating the estimator, we call `fit()`, which launches a training job. We use the Amazon S3 URIs where we uploaded the training data earlier.

In [12]:
from sagemaker.pytorch import PyTorch

# place to save model artifact
output_path = f"s3://{bucket}/{prefix}"

estimator = PyTorch(
 entry_point="train_deploy.py",
 source_dir="code",
 role=role,
 framework_version="1.3.1",
 py_version="py3",
 instance_count=2, # this script only supports distributed training on GPU instances.
 instance_type="ml.c4.4xlarge",
 output_path=output_path,
 hyperparameters={
 "epochs": 1,
 "num_labels": 2,
 "backend": "gloo",
 },
 disable_profiler=True, # disable debugger
)
estimator.fit({"training": inputs_train, "testing": inputs_test})

2022-06-06 10:54:41 Starting - Starting the training job...
2022-06-06 10:55:07 Starting - Preparing the instances for training............
2022-06-06 10:57:08 Downloading - Downloading input data
2022-06-06 10:57:08 Training - Downloading the training image......
2022-06-06 10:57:54 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-06-06 10:57:56,555 sagemaker-containers INFO Imported framework sagemaker_pytorch_container.training
2022-06-06 10:57:56,557 sagemaker-containers INFO No GPUs detected (normal if no gpus installed)
2022-06-06 10:57:56,569 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-06-06 10:57:56,570 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-06-06 10:57:56,895 sagemaker-containers INFO Module default_user

# Host

After training our model, we host it on an Amazon SageMaker Endpoint. To make the endpoint load the model and serve predictions, we implement a few methods in `train_deploy.py`.

* `model_fn()`: loads the saved model and returns a model object that can be used for model serving. The SageMaker PyTorch model server loads our model by invoking `model_fn()`.
* `input_fn()`: deserializes and prepares the prediction input. In this example, our request body is first serialized to JSON and then sent to the model serving endpoint. Therefore, in `input_fn()`, we first deserialize the JSON-formatted request body and return the input as a `torch.tensor`, as required for BERT.
* `predict_fn()`: performs the prediction and returns the result.
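As a rough illustration of the deserialization step, the sketch below shows an `input_fn` that accepts a JSON body containing either one sentence or a list of sentences. It stops before tokenization and tensor conversion, which a real handler for BERT would still need, and it is an assumption about the shape of such a handler rather than the code in `train_deploy.py`.

```python
import json

def input_fn(request_body, request_content_type="application/json"):
    """Deserialize a JSON request body into a list of sentences."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    data = json.loads(request_body)
    # Accept a single sentence or a batch of sentences.
    if isinstance(data, str):
        sentences = [data]
    elif isinstance(data, list) and all(isinstance(s, str) for s in data):
        sentences = data
    else:
        raise ValueError("Expected a JSON string or a JSON list of strings")
    return sentences

print(input_fn(json.dumps("Somebody just left - guess who.")))
print(input_fn(json.dumps(["First sentence.", "Second sentence."])))
```

Normalizing both cases to a list lets the prediction step treat single requests and batches the same way.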

To deploy our endpoint, we call `deploy()` on our PyTorch estimator object, passing in our desired number of instances and instance type:


In [10]:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

------!

We then configure the predictor to use `application/json` for the content type when sending requests to our endpoint:

In [None]:
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

Finally, we use the returned predictor object to call the endpoint:

In [None]:
result = predictor.predict("Somebody just left - guess who.")
print("predicted class: ", np.argmax(result, axis=1))

We can see the predicted class is 1, as expected, because the test sentence is grammatically correct.
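`np.argmax` picks the class with the largest score. If a confidence value is also useful, the raw scores can be turned into probabilities with a softmax. The sketch below assumes the endpoint returns one list of two logits per sentence; the numbers are made up for illustration and are not real endpoint output.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_result = [[-1.2, 2.3]]  # hypothetical logits for one sentence
for logits in example_result:
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    print("predicted class:", predicted, "probability:", round(probs[predicted], 3))
```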

Before moving on, let's delete the Amazon SageMaker endpoint to avoid charges:

In [None]:
predictor.delete_endpoint()

## Use a pretrained model

If you want to reuse a pretrained model, you can create a `PyTorchModel` from existing model artifacts. For example, we can retrieve the model artifacts we just trained.

In [None]:
model_data = estimator.model_data
print(model_data)

In [None]:
from sagemaker.pytorch.model import PyTorchModel 

pytorch_model = PyTorchModel(model_data=model_data,
 role=role,
 framework_version="1.3.1",
 source_dir="code",
 py_version="py3",
 entry_point="train_deploy.py")

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

In [None]:
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

In [None]:
result = predictor.predict("Remember to delete me when are done")
print("predicted class: ", np.argmax(result, axis=1))

In [None]:
# batch inference 
result = predictor.predict([
 "This is how you do batch inference", 
 "Put several sentences in a list",
 "Make sure they are shorter than 64 words"])

In [None]:
print("Predicted class: ", np.argmax(result, axis=1))
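The limit of 64 mentioned in the batch example corresponds to the maximum sequence length used later in this notebook (`max_len = 64`). A common way to enforce a fixed length is to truncate or pad each tokenized sentence and build a matching attention mask. The sketch below works on plain lists of token IDs. The helper `pad_and_mask`, the padding ID of 0, and the example IDs are assumptions; the actual preprocessing lives in the script under `code/`.

```python
def pad_and_mask(token_ids, max_len=64, pad_id=0):
    """Truncate or right-pad token_ids to max_len and return (ids, attention_mask)."""
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

ids, mask = pad_and_mask([101, 2023, 2003, 102])  # hypothetical token IDs
print(len(ids), sum(mask))  # 64 4
```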

In [None]:
predictor.delete_endpoint()

## Elastic Inference

Selecting the right instance type for inference requires deciding between different amounts of GPU, CPU, and memory resources, and optimizing for one of these resources on a standalone GPU instance usually leads to under-utilization of other resources. [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference/) solves this problem by enabling us to attach the right amount of GPU-powered inference acceleration to our endpoint. In March 2020, [Elastic Inference support for PyTorch became available](https://aws.amazon.com/blogs/machine-learning/reduce-ml-inference-costs-on-amazon-sagemaker-for-pytorch-models-using-amazon-elastic-inference/) for both Amazon SageMaker and Amazon EC2.

To use Elastic Inference, we must convert our trained model to TorchScript. The location of the model artifacts is `estimator.model_data`. 

First, we create a folder for the trained model and download the `model.tar.gz` file to that local directory.

In [None]:
%%sh -s $estimator.model_data
mkdir model
aws s3 cp $1 model/ 
tar xvzf model/model.tar.gz --directory ./model

The following code converts our model into the TorchScript format:

In [None]:
import subprocess
import torch
from transformers import BertForSequenceClassification

model_torchScript = BertForSequenceClassification.from_pretrained("model/", torchscript=True)
device = "cpu"
# max length for the sentences: 64
max_len = 64

for_jit_trace_input_ids = [0] * max_len
for_jit_trace_attention_masks = [0] * max_len
for_jit_trace_input = torch.tensor([for_jit_trace_input_ids])
for_jit_trace_masks = torch.tensor([for_jit_trace_attention_masks])

traced_model = torch.jit.trace(
 model_torchScript, [for_jit_trace_input.to(device), for_jit_trace_masks.to(device)]
)
torch.jit.save(traced_model, "traced_bert.pt")

subprocess.call(["tar", "-czvf", "traced_bert.tar.gz", "traced_bert.pt"])
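The archive can also be created with Python's `tarfile` module instead of shelling out to `tar`, which avoids depending on the command being available. This is an optional alternative, sketched here with a placeholder file so it can run on its own; `make_tarball` is a hypothetical helper.

```python
import os
import tarfile
import tempfile

def make_tarball(source_path, tar_path):
    """Write source_path into a gzip-compressed tar archive at its top level."""
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(source_path, arcname=os.path.basename(source_path))
    return tar_path

# Demonstrate on a placeholder file standing in for traced_bert.pt.
workdir = tempfile.mkdtemp()
placeholder = os.path.join(workdir, "traced_bert.pt")
with open(placeholder, "wb") as f:
    f.write(b"placeholder")

archive = make_tarball(placeholder, os.path.join(workdir, "traced_bert.tar.gz"))
with tarfile.open(archive, "r:gz") as tar:
    print(tar.getnames())  # ['traced_bert.pt']
```

Setting `arcname` keeps the file at the top level of the archive, matching what `tar -czvf traced_bert.tar.gz traced_bert.pt` produces when run from the same directory.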

Loading the TorchScript model and using it for prediction requires small changes to our model loading and prediction functions. We create a new script, `deploy_ei.py`, that differs slightly from `train_deploy.py`.

In [None]:
!pygmentize code/deploy_ei.py

Next, we upload the TorchScript model to S3 and deploy it using Elastic Inference. The `accelerator_type='ml.eia2.xlarge'` parameter is how we attach the Elastic Inference accelerator to our endpoint.

In [None]:
from sagemaker.pytorch import PyTorchModel

instance_type = 'ml.m5.large'
accelerator_type = 'ml.eia2.xlarge'

# TorchScript model
tar_filename = 'traced_bert.tar.gz'

# Returns S3 bucket URL
print('Upload tarball to S3')
model_data = sagemaker_session.upload_data(path=tar_filename, bucket=bucket, key_prefix=prefix)

In [None]:
import time

endpoint_name = 'bert-ei-traced-{}-{}-{}'.format(instance_type, 
 accelerator_type, time.time()).replace('.', '').replace('_', '')

pytorch = PyTorchModel(
 model_data=model_data,
 role=role,
 entry_point='deploy_ei.py',
 source_dir='code',
 framework_version='1.3.1',
 py_version='py3',
 sagemaker_session=sagemaker_session
)

# With wait=True, deploy() blocks until the endpoint is in service
predictor = pytorch.deploy(
 initial_instance_count=1,
 instance_type=instance_type,
 accelerator_type=accelerator_type,
 endpoint_name=endpoint_name,
 wait=True,
)
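Endpoint names may contain only alphanumeric characters and hyphens and are limited to 63 characters, which is why the code above strips `.` and `_` from the generated name. The sketch below shows one way to build and check such a name. `make_endpoint_name` is a hypothetical helper, and using whole seconds for the timestamp is a choice made here for readability.

```python
import re
import time

ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")

def make_endpoint_name(prefix, instance_type, accelerator_type, timestamp=None):
    """Build an endpoint name from its parts, dropping characters that are not allowed."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    raw = f"{prefix}-{instance_type}-{accelerator_type}-{timestamp}"
    name = raw.replace(".", "").replace("_", "")
    if len(name) > 63 or not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid endpoint name: {name}")
    return name

print(make_endpoint_name("bert-ei-traced", "ml.m5.large", "ml.eia2.xlarge", timestamp=1654512000))
# bert-ei-traced-mlm5large-mleia2xlarge-1654512000
```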

In [None]:
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

In [None]:
res = predictor.predict('Please remember to delete me when you are done.')
print("Predicted class:", np.argmax(res, axis=1))

# Cleanup

Lastly, please remember to delete the Amazon SageMaker endpoint to avoid charges:

In [None]:
predictor.delete_endpoint()