# Lab1: Finetuning HuggingFace models with Amazon SageMaker
### Multi-class Classification with `Trainer` and `amazon_us_reviews` dataset

# Introduction

Welcome to the first lab and our end-to-end multi-class text classification example. In this lab, we will use the Hugging Face `transformers` and `datasets` libraries together with Amazon SageMaker to fine-tune a pre-trained transformer for text classification. In particular, the pre-trained model will be fine-tuned on the `amazon_us_reviews` dataset. To get started, we need to set up the environment with a few prerequisite steps, such as permissions and configuration.

# Development Environment and Permissions 

## Installation

_*Note:* We install the required libraries from Hugging Face and AWS. Additionally, we make sure we have a compatible PyTorch version installed._

In [None]:
!pip install "sagemaker>=2.48.0" --upgrade

In [None]:
!pip install datasets=='1.8.0'

*Note: Restart the kernel after installing the above packages.*

In [None]:
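from IPython.display import display_html

def restartkernel():
    # Ask the classic Jupyter notebook front end to restart the kernel
    display_html("<script>Jupyter.notebook.kernel.restart()</script>", raw=True)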

In [None]:
restartkernel()

## Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [None]:
import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
 # set to default bucket if a bucket name is not given
 sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# Preparing the dataset

As our dataset, we are going to use [amazon_us_reviews](https://huggingface.co/datasets/amazon_us_reviews).

The dataset contains customer review text with accompanying metadata and consists of three major components:

1. A collection of reviews written in the Amazon.com marketplace and associated metadata from 1995 until 2015. This is intended to facilitate study into the properties (and the evolution) of customer reviews potentially including how people evaluate and express their experiences with respect to products at scale. (130M+ customer reviews)
2. A collection of reviews about products in multiple languages from different Amazon marketplaces, intended to facilitate analysis of customers’ perception of the same products and wider consumer preferences across languages and countries. (200K+ customer reviews in 5 countries)
3. A collection of reviews that have been identified as non-compliant with respect to Amazon policies. This is intended to provide a reference dataset for research on detecting promotional or biased reviews. (several thousand customer reviews). This part of the dataset is distributed separately and is available upon request – please contact the email address below if you are interested in obtaining this dataset. 

_https://s3.amazonaws.com/amazon-reviews-pds/readme.html_

We will use the `create_dataset.py` script to downsample the dataset and split it into a train dataset (`data/amazon_us_reviews_apparel_v1_00_train.json`) and a test dataset (`data/amazon_us_reviews_apparel_v1_00_test.json`). The train dataset contains 29750 rows and the test dataset 5250 rows.
The dataset files were created from the `Apparel_v1_00` split.

You can find the script at `data/create_dataset.py`. You can use it to change, for example, the size or the category split of the reviews.
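The contents of `create_dataset.py` are not reproduced in this text, so the following is only a rough sketch of what a shuffle-and-split step could look like, not the script's actual implementation. The record fields (`review_body`, `label`), the function name `split_records`, and the 15% test fraction (inferred from 29750 train rows and 5250 test rows) are assumptions made for illustration.

```python
import random


def split_records(records, test_fraction=0.15, seed=42):
    """Shuffle records reproducibly and split them into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records standing in for 35000 downsampled reviews (hypothetical fields).
records = [{"review_body": f"review {i}", "label": i % 5} for i in range(35000)]

train, test = split_records(records)
print(len(train), len(test))  # 29750 5250
```

Fixing the seed keeps the split reproducible between runs, which makes evaluation results comparable when you later change hyperparameters.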

In [None]:
!pygmentize ../data/create_dataset.py

### Download the data, keep only the review text and label, then split it into train and test
* This step takes approximately 7 to 10 minutes.

In [None]:
%%time
!python ../data/create_dataset.py

## Uploading data to `sagemaker_session_bucket`

Upload the dataset files to the default bucket in Amazon S3.

In [None]:
import os
from sagemaker.s3 import S3Uploader

local_train_dataset = "amazon_us_reviews_apparel_v1_00_train.json"
local_test_dataset = "amazon_us_reviews_apparel_v1_00_test.json"

# s3 uris for datasets
remote_train_dataset = f"s3://{sess.default_bucket()}/lab1/data"
remote_test_dataset = f"s3://{sess.default_bucket()}/lab1/data"

# upload datasets
S3Uploader.upload(os.path.join('../data', local_train_dataset),remote_train_dataset)
S3Uploader.upload(os.path.join('../data',local_test_dataset),remote_test_dataset)

print(f"train dataset uploaded to: {remote_train_dataset}/{local_train_dataset}")
print(f"test dataset uploaded to: {remote_test_dataset}/{local_test_dataset}")


# Fine-tuning & starting a SageMaker Training Job

To create our SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles all end-to-end Amazon SageMaker training and deployment tasks. In the Estimator we define which fine-tuning script (`entry_point`) should be used, which `instance_type` should be used, which `hyperparameters` are passed in, and so on.



```python
huggingface_estimator = HuggingFace(entry_point='train.py',
 source_dir='./scripts',
 base_job_name='huggingface-sdk-extension',
 instance_type='ml.p3.2xlarge',
 instance_count=1,
 transformers_version='4.4',
 pytorch_version='1.6',
 py_version='py36',
 role=role,
 hyperparameters = {'epochs': 1,
 'train_batch_size': 32,
 'model_name':'distilbert-base-uncased'
 })
```

When we create a SageMaker training job, SageMaker takes care of starting and managing the required EC2 instances for us, providing the fine-tuning script `train.py`, and downloading the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. When starting the training, SageMaker executes the following command:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

The CLI arguments you see are passed in as `hyperparameters` when creating the `HuggingFace` estimator.
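To make the mapping from `hyperparameters` to the command above concrete, here is a minimal sketch that turns a dictionary into `--key value` arguments in sorted key order. The helper `to_cli_args` is hypothetical and written only for this illustration; it is not the code SageMaker runs internally.

```python
def to_cli_args(hyperparameters):
    """Turn a hyperparameter dict into a flat list of `--key value` strings."""
    args = []
    for key in sorted(hyperparameters):
        # every value is passed to the script as a string
        args.extend([f"--{key}", str(hyperparameters[key])])
    return args


hyperparameters = {
    "epochs": 1,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}

command = " ".join(["/opt/conda/bin/python", "train.py"] + to_cli_args(hyperparameters))
print(command)
# /opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

Because every value arrives as a string, the training script has to convert it back to the intended type, for example with `type=int` in `argparse`.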

SageMaker also provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels named `train` and `test` in the HuggingFace estimator’s `fit` call, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
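A training script typically reads these variables as defaults for its command-line arguments. The sketch below shows one common pattern with `argparse`; the argument names (`--model_dir`, `--n_gpus`, `--training_dir`, `--test_dir`) and the local fallback paths are assumptions for illustration and may differ from the `train.py` used in this lab.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()

    # hyperparameters arrive as command-line arguments
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_id", type=str, default="distilbert-base-uncased")

    # SageMaker environment variables, with local fallbacks (assumed paths)
    # so the script can also run outside a training container
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--n_gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", "./data"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST", "./data"))

    # ignore any extra arguments the script does not know about
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "1", "--train_batch_size", "32"])
print(args.epochs, args.train_batch_size, args.training_dir)
```

Reading the variables at parse time means the same script works unchanged in a SageMaker container and on a local machine.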


To run your training job locally, you can define `instance_type='local'` or `instance_type='local_gpu'` for GPU usage. _Note: this does not work within SageMaker Studio._


## Creating an Estimator and starting a training job

The training script that performs the fine-tuning is located here: `training/scripts/train.py`. Navigate to the source code location and open the `train.py` file. You can also go through its contents by executing the cell below.

In [None]:
!pygmentize ../scripts/train.py

In [None]:
from sagemaker.huggingface import HuggingFace
import time

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1, # number of training epochs
 'train_batch_size': 32, # batch size for training
 'eval_batch_size': 64, # batch size for evaluation
 'learning_rate': 3e-5, # learning rate used during training
 'model_id':'distilbert-base-uncased', # pre-trained model
 'fp16': True, # Whether to use 16-bit (mixed) precision training
 'train_file': local_train_dataset, # training dataset
 'test_file': local_test_dataset, # test dataset
 }

List of supported models: https://huggingface.co/models?library=pytorch,transformers&sort=downloads

We create a `metric_definitions` list of dictionaries containing regex-based definitions that are used to parse the job logs and extract metrics. You can read more about parsing the CloudWatch logs [here](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).

In [None]:
# raw strings avoid invalid escape sequences; the dot is escaped so it only matches a decimal point
metric_definitions=[
    {'Name': 'eval_loss', 'Regex': r"'eval_loss': ([0-9]+(\.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': r"'eval_accuracy': ([0-9]+(\.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': r"'eval_f1': ([0-9]+(\.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': r"'eval_precision': ([0-9]+(\.|e\-)[0-9]+),?"}]
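Before launching a job, it can help to check a pattern locally against a sample log line. The sketch below uses a pattern of the same shape as the `eval_accuracy` definition, written as a raw string. The log line is a made-up example in the style of an evaluation dictionary printed by `Trainer`; the numbers are invented and are not results from this lab. It assumes the metric value is taken from the first capture group.

```python
import re

# Hypothetical log line; the values are invented for illustration only.
sample_log = (
    "{'eval_loss': 0.9321, 'eval_accuracy': 0.6104, "
    "'eval_f1': 0.5987, 'eval_precision': 0.6012, 'epoch': 1.0}"
)

pattern = r"'eval_accuracy': ([0-9]+(\.|e\-)[0-9]+),?"

match = re.search(pattern, sample_log)
# group(1) holds the captured number as a string
value = float(match.group(1)) if match else None
print(value)  # 0.6104
```

If `match` is `None`, the pattern does not fit the log format and the metric would not show up for the training job.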

In [None]:
# define Training Job Name 
job_name = f'huggingface-workshop-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point = 'train.py', # fine-tuning script used in the training job
    source_dir = '../scripts', # directory where the fine-tuning script is stored
    instance_type = 'ml.p3.2xlarge', # instance type used for the training job
    instance_count = 1, # the number of instances used for training
    base_job_name = job_name, # prefix for the name of the training job
    role = role, # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.6', # the transformers version used in the training job
    pytorch_version = '1.7', # the PyTorch version used in the training job
    py_version = 'py36', # the Python version used in the training job
    hyperparameters = hyperparameters, # the hyperparameters used for running the training job
    metric_definitions = metric_definitions # the regex definitions used to extract metrics from the logs
)

In [None]:
# define a data input dictionary with our uploaded S3 URIs
training_data = {
 'train': remote_train_dataset,
 'test': remote_test_dataset
}

# start the training job with our uploaded datasets as input
huggingface_estimator.fit(training_data, wait=True)

# Accessing Logs

In [None]:
# access the logs of the training job
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)

# Accessing Training Metrics

In [None]:
from sagemaker import TrainingJobAnalytics

# Captured metrics can be accessed as a Pandas dataframe
training_job_name = huggingface_estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df

## Bonus: Deploying the endpoint

To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type.

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

Then, we use the returned predictor object to call the endpoint.

In [None]:
sentiment_input = {"inputs": "I love using the new Inference DLC."}

predictor.predict(sentiment_input)

Finally, we delete the endpoint again.

In [None]:
predictor.delete_endpoint()