# Fine-tuning and deploying a Hugging Face summarization model on SageMaker with your own scripts and dataset

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

---

For ease of use, we advise opening this notebook in an Amazon SageMaker notebook instance using the `conda_pytorch_latest_p36` kernel, or in Amazon SageMaker Studio using a `Python 3 (PyTorch 1.8 Python 3.6 CPU Optimized)` kernel on a `ml.t3.medium` instance.

In this notebook, we show how to fine-tune and deploy one of the [🤗 Transformers](https://github.com/huggingface/transformers) models for a summarization task on [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html) with your own scripts and data.

In the first part, "Preparing the dataset", we show how to split your own dataset into training, validation and test sets and upload it to S3. We use the [Women's E-Commerce Clothing Reviews dataset](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/), which contains e-commerce clothing reviews and review titles, but we also provide code to do the same for your own custom dataset. In our case the text and summary columns are called `review_text` and `title` respectively, and the data is saved in S3 under the prefix `DEMO-sagemaker-huggingface-summarization`.

Afterwards, we walk you through how to create your own train and inference scripts to fine-tune and deploy a Hugging Face model on Amazon SageMaker.

Make sure that the latest version of the SageMaker SDK is installed:

In [None]:
# Install the required libraries
import sys

!{sys.executable} -m pip install datasets
!{sys.executable} -m pip install py7zr
!{sys.executable} -m pip install -U sagemaker

In [None]:
# Ensure packages are reloaded without having to restart Kernel
import importlib

import datasets
import py7zr
import sagemaker

importlib.reload(datasets)
importlib.reload(py7zr)
importlib.reload(sagemaker)

## Part 1: Preparing the dataset for Hugging Face on Amazon SageMaker

One way to prepare your dataset for training on Amazon SageMaker is to save your training, validation and test datasets separately. This decouples data preparation from training and ensures, for example, that different models can reuse the same datasets with the same split. In this example we download the [Women's E-Commerce Clothing Reviews dataset](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/) and prepare it for Hugging Face using the [`datasets`](https://github.com/huggingface/datasets) library. Any dataset containing text and something that could be considered a summary (e.g. titles) can work here.

We first import required packages and define the prefix where we will save the data:

In [None]:
import os
import json
import io, boto3, sagemaker
import pandas as pd

from datasets import load_dataset, filesystems, DatasetDict


s3_resource = boto3.resource("s3")
session = sagemaker.Session()
session_bucket = session.default_bucket()

s3_prefix = "DEMO-sagemaker-huggingface-summarization"

We download the raw dataset from the SageMaker example files S3 bucket for the current region:

In [None]:
s3_client = boto3.client("s3")
s3_client.download_file(
 f"sagemaker-example-files-prod-{session.boto_region_name}",
 "datasets/tabular/womens_clothing_ecommerce/Womens_Clothing_E-Commerce_Reviews.csv",
 "Womens_Clothing_E-Commerce_Reviews.csv",
)

In [None]:
path_to_input_file = "Womens_Clothing_E-Commerce_Reviews.csv"
df = pd.read_csv(path_to_input_file)

This raw dataset has missing values in the two columns we are interested in: "Review Text" and "Title". We therefore drop rows with missing values in those columns. Additionally, we convert the column names to lowercase and replace spaces with underscores.

In [None]:
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(" ", "_")
df = df.dropna(subset=["title", "review_text"])
df.head()

The cleaned dataset should contain 19675 rows.

In [None]:
path_to_your_file = "Womens_Clothing_E-Commerce_Reviews.csv"
df.to_csv(path_to_your_file, index=False)

Now that we've removed rows with missing reviews and titles, we split the data into train, validation and test sets using the `load_dataset()` function and the `train_test_split()` method from the `datasets` library.

In [None]:
# When using your own custom dataset (single CSV/JSON), you can use the datasets.Dataset.train_test_split() method to shuffle and split your data.
# The splits will be shuffled by default. You can deactivate this behavior by setting shuffle=False

# Replace "csv" with "json" if you are using JSON files; the rest of the steps are exactly the same
data = load_dataset("csv", data_files=path_to_your_file, split="train") # path to your file


# Split into 70% train, 30% test + validation
train_test_validation = data.train_test_split(test_size=0.3)

# Split 30% test + validation into half test, half validation
test_validation = train_test_validation["test"].train_test_split(test_size=0.5)

# Gather the splits to have a single DatasetDict

dataset = DatasetDict(
 {
 "train": train_test_validation["train"],
 "validation": test_validation["train"],
 "test": test_validation["test"],
 }
)

In [None]:
dataset
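As a quick check on the arithmetic, the two-stage split above yields roughly 70% / 15% / 15%. The sketch below is plain Python and does not call `datasets`. The helper name `two_stage_split_sizes` is hypothetical, and the rounding rule (the held-out part is rounded up, the rest goes to the other side) is an assumption modelled on scikit-learn-style splitting, so the exact counts returned by `train_test_split()` may differ by a row.

```python
import math


def two_stage_split_sizes(n_rows, holdout_fraction=0.3, test_share=0.5):
    """Return (train, validation, test) sizes for a two-stage split.

    Assumption: the held-out part is rounded up and the remainder goes to
    the other side, similar to scikit-learn-style splitting.
    """
    # Stage 1: hold out a fraction of all rows for validation + test.
    n_holdout = math.ceil(n_rows * holdout_fraction)
    n_train = n_rows - n_holdout

    # Stage 2: divide the held-out rows between test and validation.
    n_test = math.ceil(n_holdout * test_share)
    n_validation = n_holdout - n_test
    return n_train, n_validation, n_test


n_rows = 19675  # row count of the cleaned dataset stated above
train, validation, test = two_stage_split_sizes(n_rows)
print(train, validation, test)
print(round(train / n_rows, 3), round(validation / n_rows, 3), round(test / n_rows, 3))
```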

We can inspect an example review:

In [None]:
print("Review Text\n{text}".format(text=dataset["train"]["review_text"][12]))
print("\nTitle\n{summary}".format(summary=dataset["train"]["title"][12]))
print("\nRating\n{rating}".format(rating=dataset["train"]["rating"][12]))

Finally, we save the training, validation and test splits and upload them to S3.

Use the `save_to_disk` method to save your dataset directly to S3 in the Hugging Face dataset format. The format is backed by Apache Arrow, which enables zero-copy reads of large datasets for speed and memory efficiency. You can use the `load_from_disk` function in your training script to load the dataset in the format in which it was saved.

In [None]:
s3 = filesystems.S3FileSystem()
dataset.save_to_disk(f"s3://{session_bucket}/{s3_prefix}/train/", fs=s3)

## Part 2: Fine-tune and deploy a Hugging Face model on Amazon SageMaker

Now that the data is ready and saved in s3, we will demonstrate how to fine-tune and deploy a Hugging Face model on Amazon SageMaker with your own scripts.

In [None]:
text_column = "review_text"
target_column = "title"

This notebook is built to run with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a sequence-to-sequence version in the Transformers library. Here we picked the [`pegasus-xsum`](https://huggingface.co/google/pegasus-xsum) checkpoint. 

In [None]:
model_name = "google/pegasus-xsum"

### Write the training script

To fine-tune a Hugging Face model with a custom dataset on Amazon SageMaker, we will write a training script to be used by the Amazon SageMaker Training Job.

The training script will need to do the following steps:
- Load a pretrained Tokenizer and Model
- Load and Tokenize datasets
- Define the Training Arguments
- Define a Trainer
- Train the model and save the checkpoint with the best performance on the validation set
- Evaluate the best checkpoint on the test set

These steps are done in a `train()` function, which uses a few helper functions:
- `tokenize()` takes a batch and the specified text and target columns, and tokenizes them with the Tokenizer loaded in memory
- `load_and_tokenize()` reads data from S3 and applies the `tokenize()` function
- `compute_metrics()` computes ROUGE scores for evaluation
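To give an intuition for what a ROUGE score measures, here is a deliberately simplified ROUGE-1 F1 computation in plain Python. It is not the `compute_metrics()` function from `train.py`: the function name `rouge1_f1` is hypothetical, and lowercasing plus whitespace tokenization are assumptions made for brevity, whereas real ROUGE implementations apply more careful tokenization and optional stemming.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # share of candidate tokens that match
    recall = overlap / sum(ref_counts.values())  # share of reference tokens that are covered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("great summer dress", "great dress")
print(round(score, 3))
```

In this example both candidate tokens appear in the reference (precision 1.0) and two of the three reference tokens are covered (recall 2/3), so the F1 score is 0.8.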

The script uses `AutoTokenizer` and `AutoModelForSeq2SeqLM`, which work with any [🤗 Transformers](https://github.com/huggingface/transformers) model for summarization. You might however want to change some hyperparameters depending on what works best for each model. For example, here we use `adafactor` as the optimizer for Pegasus.

All computations run inside the Amazon SageMaker Hugging Face training and inference containers, which we call using the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html).

In [None]:
!pygmentize source/train.py

By default, the `Trainer` saves several checkpoints before selecting the best one. Once the best checkpoint is loaded in memory and saved, the remaining checkpoints are no longer needed. They can be safely deleted (which we do in the last line of `train()`) to free up space in `SM_MODEL_DIR`, whose content is used later to create a SageMaker Model and deploy it to an endpoint.
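The cleanup step can be sketched as follows, assuming the intermediate checkpoints are written as `checkpoint-<step>` folders inside the model directory (the `Trainer`'s default naming). The helper name `remove_checkpoints` is hypothetical and the demo runs on a temporary directory; the actual code in `train.py` may differ.

```python
import shutil
import tempfile
from pathlib import Path


def remove_checkpoints(model_dir):
    """Delete `checkpoint-*` subdirectories and return their names."""
    removed = []
    for path in sorted(Path(model_dir).glob("checkpoint-*")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path.name)
    return removed


# Demo on a throwaway directory that mimics the output of a finished run.
with tempfile.TemporaryDirectory() as model_dir:
    (Path(model_dir) / "checkpoint-500").mkdir()
    (Path(model_dir) / "checkpoint-1000").mkdir()
    (Path(model_dir) / "config.json").write_text("{}")

    removed = remove_checkpoints(model_dir)
    remaining = sorted(p.name for p in Path(model_dir).iterdir())

print(removed, remaining)
```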

### Fine-tuning the model on SageMaker

We first load a couple of libraries and objects, namely `sagemaker` and the `HuggingFace` SageMaker Estimator which will be used to launch a training job.

In [None]:
role = sagemaker.get_execution_role()

In [None]:
from sagemaker.huggingface import HuggingFace

In [None]:
output_path = f"s3://{session_bucket}/{s3_prefix}"

We define a few arguments to be sent to the training script which will be read by the parser.

In [None]:
# We set the number of epochs to 1 to reduce the training time in this demo.
# For complete fine-tuning of the model please consider increasing the number of epochs to e.g. 5
hyperparameters = {
 "model-name": model_name,
 "text-column": text_column,
 "target-column": target_column,
 "epoch": 1,
}
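SageMaker forwards each hyperparameter to the entry point as a `--key value` command-line argument, so the training script needs a parser that declares matching flags. The sketch below shows one way such a parser could look. It is an assumption about the shape of `train.py`, not a copy of it, and the default values are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters that arrive as `--key value` pairs."""
    parser = argparse.ArgumentParser()
    # argparse turns "--model-name" into the attribute "model_name".
    parser.add_argument("--model-name", type=str, default="google/pegasus-xsum")
    parser.add_argument("--text-column", type=str, default="review_text")
    parser.add_argument("--target-column", type=str, default="title")
    parser.add_argument("--epoch", type=int, default=1)
    # parse_known_args tolerates extra flags that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the command line built from the hyperparameters dictionary.
args = parse_args(
    [
        "--model-name", "google/pegasus-xsum",
        "--text-column", "review_text",
        "--target-column", "title",
        "--epoch", "1",
    ]
)
print(args.model_name, args.text_column, args.target_column, args.epoch)
```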

In [None]:
metric_definitions = [
 {"Name": "training:loss", "Regex": "'loss': (.*?),"},
 {"Name": "validation:loss", "Regex": "'eval_loss': (.*?),"},
 {"Name": "validation:rouge1", "Regex": "'eval_rouge1': (.*?),"},
 {"Name": "validation:rouge2", "Regex": "'eval_rouge2': (.*?),"},
 {"Name": "validation:rougeL", "Regex": "'eval_rougeL': (.*?),"},
 {"Name": "validation:rougeLsum", "Regex": "'eval_rougeLsum': (.*?),"},
 {"Name": "validation:gen_len", "Regex": "'eval_gen_len': (.*?),"},
]
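Each regex in `metric_definitions` is applied to the training job's log output, and the first capture group becomes the metric value. The sketch below applies three of the patterns to a hypothetical log line written in the style of a printed metrics dictionary. The line and its numbers are invented for illustration.

```python
import re

# Hypothetical log line; real values come from the training job's output.
log_line = "{'eval_loss': 2.31, 'eval_rouge1': 24.5, 'eval_rouge2': 9.8, 'epoch': 1.0}"

patterns = {
    "validation:loss": "'eval_loss': (.*?),",
    "validation:rouge1": "'eval_rouge1': (.*?),",
    "validation:rouge2": "'eval_rouge2': (.*?),",
}

extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, log_line)
    if match:
        # The first capture group holds the metric value as text.
        extracted[name] = float(match.group(1))

# A value followed by "}" instead of "," is not captured by this pattern style.
last_item_match = re.search("'epoch': (.*?),", log_line)

print(extracted)
print(last_item_match)
```

Note that `(.*?),` requires a trailing comma, so a metric printed as the last item of the dictionary would not be captured by patterns of this form.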

Thanks to the seamless integration of the [🤗 Transformers](https://github.com/huggingface/transformers) `Trainer` with [SageMaker Distributed Data Parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html), we can use instances with several GPUs to parallelize and speed up training, without any modification to our training script.

When defining the SageMaker Hugging Face Estimator we specify a training script and source directory (here only containing `train.py`, but it could contain any additional modules and a `requirements.txt`), as well as the instance type on which to run the Training Job.

In [None]:
# configuration for running training on smdistributed Data Parallel
# Estimated runtime: 1.5h for 1 epoch
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
huggingface_estimator = HuggingFace(
 entry_point="train.py",
 source_dir="source",
 base_job_name="huggingface-summarizer",
 instance_type="ml.p3.16xlarge",
 instance_count=1,
 volume_size=200,
 transformers_version="4.17.0",
 pytorch_version="1.10.2",
 py_version="py38",
 output_path=output_path,
 role=role,
 hyperparameters=hyperparameters,
 metric_definitions=metric_definitions,
 distribution=distribution,
)

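We then launch the training job by specifying where to read the data from.
Each channel name passed to `fit()` is exposed inside the container running `train.py` as a data directory, with its path stored in the environment variable `SM_CHANNEL_<NAME>`. Here we pass a single `train` channel, available through `SM_CHANNEL_TRAIN`. It points to the `DatasetDict` saved earlier, which holds the train, validation and test splits.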

In [None]:
huggingface_estimator.fit({"train": f"s3://{session_bucket}/{s3_prefix}/train/"})
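Inside the container, a training script typically resolves the local data directory from the channel's environment variable. The following is a minimal sketch of that lookup. The helper name `channel_dir` is hypothetical, the fallback path follows the conventional SageMaker layout under `/opt/ml/input/data/`, and the environment is simulated with a plain dictionary so the example runs anywhere.

```python
import os


def channel_dir(channel, env=None):
    """Return the local directory of a SageMaker input channel."""
    env = os.environ if env is None else env
    # SageMaker sets SM_CHANNEL_<NAME>; the fallback is the conventional path.
    return env.get(f"SM_CHANNEL_{channel.upper()}", f"/opt/ml/input/data/{channel}")


# Simulated container environment for the single "train" channel.
fake_env = {"SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}
train_dir = channel_dir("train", fake_env)
print(train_dir)
```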

With distributed training on a p3.16xlarge instance, the training should take around 6 hours for 5 epochs.

### Bring your own inference script

Our friends at Hugging Face have made inference on SageMaker for Transformers models simpler than ever thanks to the [SageMaker Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit). You can deploy the previously trained model directly, without writing an inference script, by setting the environment variable `"HF_TASK": "summarization"`. To see the instructions, open the model page on the [Hugging Face website](https://huggingface.co/google/pegasus-xsum), select "Deploy" and then "Amazon SageMaker".

However, when you need specific post-processing, for example returning several summaries for the same input based on different text generation parameters, bringing your own `inference.py` script can be useful and is relatively straightforward:

In [None]:
!pygmentize source/inference.py

As we can see, the only requirement for such an inference script for Hugging Face on SageMaker is that it contains the following template functions:
- `model_fn()`, which reads what was saved at the end of the training job inside `SM_MODEL_DIR`, or an existing model weights directory saved as a `tar.gz` in S3. We use it to load the trained model and associated tokenizer.
- `input_fn()`, used here simply to format the data received from a request made to the endpoint.
- `predict_fn()`, which uses the output of `model_fn()` (here the model and tokenizer) to run inference on the output of `input_fn()`.

Optionally, an `output_fn()` can be created to format the inference response, using the output of `predict_fn()`, but we did not use it here.
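The way these functions chain together can be sketched without loading a real model. In the example below, `EchoSummarizer` is a hypothetical stand-in that just returns the first few words of the text, and the handling of `parameters_list` (one list of summaries per parameter set) is an assumption about the request format rather than a copy of `inference.py`. `output_fn()` is included only to show where the optional hook fits.

```python
import json


class EchoSummarizer:
    """Stand-in for a real model: returns the first few words of the text."""

    def summarize(self, text, max_length=5, **_ignored):
        return " ".join(text.split()[:max_length])


def model_fn(model_dir):
    # A real script would load the model and tokenizer from model_dir.
    return EchoSummarizer()


def input_fn(request_body, content_type):
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)


def predict_fn(data, model):
    texts = data["inputs"]
    parameters_list = data.get("parameters_list") or [{}]
    # One list of summaries per parameter set.
    return [[model.summarize(text, **params) for text in texts] for params in parameters_list]


def output_fn(prediction, accept):
    return json.dumps(prediction)


# Simulate one request passing through the four hooks in order.
body = json.dumps(
    {
        "inputs": ["This dress fits well and looks great"],
        "parameters_list": [{"max_length": 3}, {}],
    }
)
model = model_fn("/opt/ml/model")
response = output_fn(predict_fn(input_fn(body, "application/json"), model), "application/json")
print(response)
```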


### Create and deploy a SageMaker Model to an endpoint and test it

This time we will import the SageMaker `HuggingFaceModel` object which will help us create a SageMaker Model and deploy it to an endpoint.

In [None]:
from sagemaker.huggingface import HuggingFaceModel

Again, we specify the inference script that we wrote earlier, a source directory (here containing only `inference.py`, but it could also contain additional modules and a `requirements.txt`) and `model_data`, which specifies where to load the model weights from. `huggingface_estimator.model_data` points directly to the S3 location where the output of the `huggingface_estimator` was saved after training, but any S3 URI pointing to pre-trained weights compressed as a `tar.gz` could work.

In [None]:
model_name = "summarization-model"

model_for_deployment = HuggingFaceModel(
 entry_point="inference.py",
 source_dir="source",
 model_data=huggingface_estimator.model_data,
 role=role,
 pytorch_version="1.10.2",
 py_version="py38",
 transformers_version="4.17.0",
 name=model_name,
)

Finally, we deploy the model to an endpoint by specifying the instance type.

In [None]:
endpoint_name = "summarization-endpoint"

predictor = model_for_deployment.deploy(
 initial_instance_count=1,
 instance_type="ml.g4dn.xlarge",
 endpoint_name=endpoint_name,
 serializer=sagemaker.serializers.JSONSerializer(),
 deserializer=sagemaker.deserializers.JSONDeserializer(),
)

Once the model is deployed, you can test it directly.
Feel free to change the parameters list to see different predictions.

In [None]:
article_index = 12
print("Review Text\n{text}".format(text=dataset["test"]["review_text"][article_index]))
print("\nTitle\n{summary}".format(summary=dataset["test"]["title"][article_index]))
print("\nRating\n{rating}".format(rating=dataset["test"]["rating"][article_index]))

In [None]:
# Examples taken from the test set
texts = [dataset["test"]["review_text"][article_index]]

inputs = {
 "inputs": texts,
 "parameters_list": [
 {"length_penalty": 2, "num_beams": 5, "do_sample": True},
 {"length_penalty": 1, "num_beams": 5, "do_sample": True},
 {"length_penalty": 0.6, "num_beams": 3, "do_sample": True},
 {"max_length": 25, "top_p": 0.92, "top_k": 50, "do_sample": True},
 ],
}

In [None]:
summaries = predictor.predict(inputs)

In [None]:
for s in summaries:
 print(s)

Lastly, please remember to delete the Amazon SageMaker model and endpoint to avoid ongoing charges.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()

## Conclusion

In this notebook, we trained and deployed a Hugging Face model for Text Summarization with custom scripts and data on Amazon SageMaker. You can use this solution to train and deploy other [pretrained models](https://huggingface.co/models) provided by Hugging Face. [Sample notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker) are available on GitHub.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/advanced_functionality|huggingface_byo_scripts_and_data|huggingface-custom-text-summarizer.ipynb)
