## Fine-Tune BERT on the Amazon Reviews Dataset


This notebook demonstrates how to use SageMaker with AWS Trainium to train a text classification model. We start with a pretrained BERT model from Hugging Face and fine-tune it on the Amazon Reviews dataset, which consists of reviews labeled as either positive or negative sentiment. The training job runs on an ml.trn1 instance, which hosts the AWS Trainium accelerator.

### Let's begin by installing the required libraries

In [None]:
!pip install -U sagemaker

In [None]:
!pip install torch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0

In [None]:
!pip install transformers==4.21.3 datasets==2.5.2

### Data Preparation

We will use the Amazon polarity reviews dataset, which is available through Hugging Face Datasets. We convert the dataset to CSV format and upload it to S3. In practical use cases, this step can be replaced with your own data in CSV format.

In [None]:
import sagemaker
import transformers
from sagemaker.pytorch import PyTorch
from datasets import load_dataset
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sagemaker import utils
import os
import boto3
import botocore
from datasets.filesystems import S3FileSystem
from pathlib import Path
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from datetime import datetime
import json
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import torch

In [None]:
# import the amazon polarity dataset

dataset = load_dataset("amazon_polarity")

Let's look at the dataset structure.

In [None]:
dataset["train"][1]

The dataset consists of three fields: label, title, and content. For this training run, let's use only 'label', which is the target field, and 'content', the free-text product review from which the model learns its features.


In [None]:
train_ds = dataset["train"]
test_ds = dataset["test"]

In [None]:
import pandas as pd

train_df = pd.DataFrame(train_ds)
test_df = pd.DataFrame(test_ds)

In [None]:
# Use only the label and content fields

train_df = train_df.drop(["title"], axis=1)
test_df = test_df.drop(["title"], axis=1)

In [None]:
num_labels = len(train_df["label"].unique())

print("Total number of labels {}".format(num_labels))

Let's save the train and test datasets as CSV files.

In [None]:
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
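
To swap in your own data, the CSV files only need to match the layout produced above: a header row followed by one review per row. The snippet below is a minimal sketch of that layout using Python's standard `csv` module. The two sample rows are made up for illustration, and the reading of label 1 as positive and 0 as negative is an assumption based on the `amazon_polarity` dataset card rather than anything shown in this notebook.

```python
import csv
import io

# Hypothetical sample rows; real data would come from your own reviews.
rows = [
    {"label": 1, "content": "Great sound quality, would buy again."},
    {"label": 0, "content": 'Stopped working after a week, "sturdy" it is not.'},
]

# Write the rows with the same two columns that train.csv and test.csv use.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["label", "content"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back to confirm that commas and quotes survive the round trip.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

Note that values read back from a CSV file are strings, so a training script typically has to cast the label column to an integer.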

### Upload the data to S3

Let's upload the train.csv and test.csv files to S3 so that the training job can access them.

In [None]:
sess = sagemaker.Session()
# SageMaker session bucket, used for uploading data, models, and logs.
# SageMaker automatically creates this bucket if it does not exist.

sagemaker_session_bucket = (
    None  # Provide a bucket name if you don't want to use the default bucket
)
if sagemaker_session_bucket is None and sess is not None:
    # Fall back to the default bucket if no bucket name is given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

In [None]:
train_data_url = sess.upload_data(
    path="train.csv",
    key_prefix="classification/data/amazon",
)

In [None]:
test_data_url = sess.upload_data(
    path="test.csv",
    key_prefix="classification/data/amazon",
)

In [None]:
print("Training data path - {}".format(train_data_url))
print("Test data path - {}".format(test_data_url))

### Start the training job

Now we are ready to run the training job on a Trn1 instance. The SageMaker PyTorch estimator requires a training script to run a model training job. Below is the script for fine-tuning a pretrained Hugging Face BERT model on the Amazon reviews dataset we just uploaded to S3.

In [None]:
!pygmentize ./code/train.py

In the training script, several details are worth mentioning:

1. Distributed training (hardware): This is an example of data-parallel distributed training. Because the trn1 instance has multiple NeuronCores, each NeuronCore receives a copy of the model and a shard of the data. Each NeuronCore is managed by a worker that runs a copy of the training script. The gradients from all workers are aggregated and averaged, so that every worker applies exactly the same update to the model weights before the next training iteration begins.


2. Distributed training (software): A specialized backend, torch_xla.distributed.xla_backend, is required for PyTorch to run on an XLA device such as Trainium. In the training loop, since each worker computes its own gradients, xm.optimizer_step(optimizer) makes sure all workers receive the same gradient update before the next training iteration.


3. The data from S3 is copied to the training instance, and the local paths are made available through the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_VAL, which are named after the input channels.


4. The trained model config and weights are stored in the path given by the environment variable SM_MODEL_DIR. Once training is complete, Amazon SageMaker copies the files in the SM_MODEL_DIR path to the S3 bucket. We can then deploy the model to any hardware of our choice.
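
The data-parallel scheme described in points 1 and 2 can be illustrated without any accelerator. The following is a toy simulation in plain Python, not the contents of `train.py`: the names `local_gradient` and `all_reduce_mean` are hypothetical stand-ins for the per-worker backward pass and for the gradient averaging that `xm.optimizer_step` triggers on real hardware.

```python
# Toy model y = w * x with squared-error loss, trained by 4 simulated workers.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0),
        (5.0, 10.0), (6.0, 12.0), (7.0, 14.0), (8.0, 16.0)]
num_workers = 4
learning_rate = 0.01


def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across workers."""
    return sum(values) / len(values)


# Each worker gets an equally sized shard of the data and its own model copy.
shards = [data[i::num_workers] for i in range(num_workers)]
weights = [0.0] * num_workers

for step in range(200):
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    avg_grad = all_reduce_mean(grads)
    # Every worker applies the same averaged gradient, so the copies stay in sync.
    weights = [w - learning_rate * avg_grad for w in weights]

print(weights)
```

Because every worker applies the same averaged gradient, all model copies remain identical after each step, and with equally sized shards the averaged gradient equals the gradient over the full batch.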

In [None]:
# Start the training job with Trainium
base_job_name = "amazon-review-classification"

In [None]:
hyperparameters = {}

# This can be changed to any other pretrained BERT base model
hyperparameters["model_name_or_path"] = "bert-base-uncased"
hyperparameters["seed"] = 100
hyperparameters["max_length"] = 128
hyperparameters["per_device_train_batch_size"] = 8
hyperparameters["per_device_eval_batch_size"] = 8
hyperparameters["learning_rate"] = 5e-5
hyperparameters["max_train_steps"] = 2000
hyperparameters["num_train_epochs"] = 1
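
In SageMaker script mode, each entry in `hyperparameters` is passed to the entry-point script as a `--name value` command-line argument, while the channel paths and the model directory arrive as environment variables. The sketch below shows one common way a training script might read them with `argparse`. It is an assumption about the pattern, not a copy of `code/train.py`; the argument names `train_dir`, `val_dir`, and `model_dir` are hypothetical, and the fallback paths are the usual SageMaker container defaults.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: these mirror the keys of the `hyperparameters` dict.
    parser.add_argument("--model_name_or_path", type=str, default="bert-base-uncased")
    parser.add_argument("--seed", type=int, default=100)
    parser.add_argument("--max_length", type=int, default=128)
    parser.add_argument("--per_device_train_batch_size", type=int, default=8)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--max_train_steps", type=int, default=2000)
    parser.add_argument("--num_train_epochs", type=int, default=1)
    # Paths: SageMaker sets these variables inside the training container.
    parser.add_argument(
        "--train_dir",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--val_dir",
        default=os.environ.get("SM_CHANNEL_VAL", "/opt/ml/input/data/val"),
    )
    parser.add_argument(
        "--model_dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser.parse_args(argv)


# Simulate the command line SageMaker would build from two of the hyperparameters.
args = parse_args(["--learning_rate", "3e-5", "--max_length", "64"])
print(args.learning_rate, args.max_length, args.model_dir)
```

Reading paths from environment variables with local fallbacks keeps the same script usable both inside the training container and on a development machine.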

In [None]:
pt_estimator = PyTorch(
    entry_point="train.py",  # Specify your training script
    source_dir="code",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.trn1.32xlarge",
    framework_version="1.11.0",
    py_version="py38",
    disable_profiler=True,
    base_job_name=base_job_name,
    hyperparameters=hyperparameters,
    volume_size=512,
    distribution={"torch_distributed": {"enabled": True}},
)
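
Because the batch size is set per device, the effective global batch size depends on how many workers the job launches. The following back-of-the-envelope calculation assumes one worker per NeuronCore and 32 NeuronCores on an `ml.trn1.32xlarge` (16 Trainium chips with 2 cores each), and further assumes that `max_train_steps` counts optimizer steps per worker. Neither assumption is verified against `train.py` here.

```python
neuron_cores = 32                 # assumption: ml.trn1.32xlarge, one worker per core
per_device_train_batch_size = 8   # from the hyperparameters above
max_train_steps = 2000            # from the hyperparameters above

# Each step, every worker processes its own batch before gradients are averaged.
global_batch_size = neuron_cores * per_device_train_batch_size
samples_seen = global_batch_size * max_train_steps

print(f"global batch size: {global_batch_size}")
print(f"reviews processed: {samples_seen:,}")
```

Under these assumptions the job uses a global batch of 256 reviews and processes 512,000 reviews in total, which is worth keeping in mind when adjusting the learning rate or the step budget.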

In [None]:
pt_estimator.fit({"train": train_data_url, "val": test_data_url})




Now that the model is trained and its weights are stored in S3, we can deploy it on the hardware of our choice, such as a GPU, a CPU, or AWS Inferentia.



### Deploy the trained model

The trained model can be deployed to any instance type, such as CPU, GPU, or AWS Inferentia. In this example, we deploy the trained model to a CPU instance and get some predictions. To deploy a model, we need to complete the following steps:

1. Create a model.tar.gz with all the model files. 
2. Create an inference script to load, process and predict.
3. Create a PyTorch model and deploy it.

#### 1. Create model.tar.gz

The output of the training job above is stored as a tar.gz file in S3, so we can retrieve its URL directly from the estimator and use it.

In [None]:
# Alternatively, retrieve this URL from the training job details in the console.
model_url = pt_estimator.model_data

In [None]:
print(model_url)
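
The training job produces this archive automatically. If you ever need to package a model trained elsewhere, the standard `tarfile` module is enough. The sketch below uses placeholder files in a temporary directory; the file names `config.json` and `pytorch_model.bin` are common Hugging Face names and are an assumption here, not something read from the training output.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model_dir.mkdir()
    # Placeholder files standing in for a real model config and weights.
    (model_dir / "config.json").write_text('{"num_labels": 2}')
    (model_dir / "pytorch_model.bin").write_bytes(b"\x00" * 16)

    archive_path = Path(tmp) / "model.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        for file_path in sorted(model_dir.iterdir()):
            # arcname drops the directory prefix so files sit at the archive root.
            tar.add(file_path, arcname=file_path.name)

    # Re-open the archive to list what was actually stored.
    with tarfile.open(archive_path, "r:gz") as tar:
        archived_names = sorted(tar.getnames())

print(archived_names)
```

Placing the files at the archive root, rather than inside a nested folder, means they land directly in the model directory when the archive is unpacked.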

#### 2. Create an inference script

We need to write an inference script that tells the model server how to load the model and run inference. The inference script should at least include a model_fn function that loads the model. Optionally, you may also implement input_fn and output_fn to process the input and output, and predict_fn to customize how the model server gets predictions from the loaded model.

The inference.py script we use contains implementations of the functions mentioned above. Let's see how it looks.



In [None]:
!pygmentize code/inference.py
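
For orientation, the four handler functions generally have the shape shown below. This is a hedged sketch of the SageMaker PyTorch serving contract, not the contents of `code/inference.py`: the JSON request format, the `max_length` of 128, the response keys, the label order, and the assumption that the tokenizer was saved alongside the model are all made up for illustration. The heavy imports live inside the functions so the module can be imported without `torch` or `transformers` installed.

```python
import json


def model_fn(model_dir):
    """Load the fine-tuned model and tokenizer from the unpacked model.tar.gz."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: the training script saved the tokenizer next to the weights.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return {"model": model, "tokenizer": tokenizer}


def input_fn(request_body, request_content_type):
    """Decode the request; a JSON-encoded string is assumed."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)


def predict_fn(input_data, artifacts):
    """Tokenize the text and return class probabilities as a list."""
    import torch

    encoded = artifacts["tokenizer"](
        input_data, truncation=True, max_length=128, return_tensors="pt"
    )
    with torch.no_grad():
        logits = artifacts["model"](**encoded).logits
    return torch.softmax(logits, dim=-1)[0].tolist()


def output_fn(prediction, accept):
    """Encode the probabilities as JSON; the (negative, positive) order is assumed."""
    negative, positive = prediction
    return json.dumps({"negative": negative, "positive": positive})


# Exercise only the lightweight handlers; the others need a real model artifact.
text = input_fn(json.dumps("Great product!"), "application/json")
response = output_fn([0.1, 0.9], "application/json")
print(text, response)
```

The model server calls these functions in the order `input_fn`, `predict_fn`, `output_fn` for each request, after calling `model_fn` once at startup.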

#### 3. Create a PyTorch model and deploy

Once we have the model and the source files, deploying the model is as simple as creating a model object pointing to the model files and the source files and then deploying to the instance type we need.

In [None]:
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

pytorch_model = PyTorchModel(
    model_data=model_url,
    role=role,
    framework_version="1.12.0",
    py_version="py38",
    source_dir="code",
    entry_point="inference.py",
)

In [None]:
predictor = pytorch_model.deploy(
    instance_type="ml.c5.xlarge",  # can be changed to a GPU instance as well
    initial_instance_count=1,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

##### Note: The instance type can be changed as needed to another CPU-based or GPU-based instance.

### Predict using the model

In [None]:
# Predict with the model endpoint using a positive sample

payload1 = "I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."

In [None]:
import json

# invoke the endpoint
out = predictor.predict(payload1)
print("The prediction from the model --")
print(out)

In [None]:
# Predict using a negative sample

payload2 = "I guess you have to be a romance novel lover for this one, and not a very discerning one. All others beware! It is absolute drivel. I figured I was in trouble when a typo is prominently featured on the back cover, but the first page of the book removed all doubt. Wait - maybe I'm missing the point. A quick re-read of the beginning now makes it clear. This has to be an intentional churning of over-heated prose for satiric purposes. Phew, so glad I didn't waste $10.95 after all."

In [None]:
# invoke the endpoint
out = predictor.predict(payload2)
print("The prediction from the model --")
print(out)

### Clean Up

Now that we have run some predictions, we should free up resources by deleting the model and the endpoint.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()