# Paraphrase Identification using HuggingFace on SageMaker - Background

Identifying paraphrased text has business value in many use cases. For example, by identifying sentence paraphrases, a text summarization system could remove redundant information. Another application is detecting plagiarized documents. Here, we fine-tune a Hugging Face transformer on SageMaker to identify paraphrased sentence pairs in a few simple steps.

# Setup and install libraries

We will begin by installing the necessary libraries, importing them into the development environment, and selecting the appropriate IAM role and Amazon S3 bucket.

Select the `conda_pytorch_p38` notebook kernel.

Install the required libraries from Hugging Face: `transformers` and `datasets`. We'll also make sure we have an up-to-date version of the `SageMaker Python SDK`.

Documentation on [Installing Transformers](https://huggingface.co/docs/transformers/installation) and [Installing SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk/tree/master)

In [None]:
!pip --quiet install "sagemaker" "transformers==4.6.1" "datasets==1.6.2" --upgrade

In [None]:
import sagemaker.huggingface
import sagemaker
from datasets import load_dataset

In [None]:
sess = sagemaker.Session()

role = sagemaker.get_execution_role()
bucket = sess.default_bucket()

print(f"SageMaker Role Arn: {role}")
print(f"SageMaker - Amazon S3 Bucket: {bucket}")
print(f"SageMaker Session Region: {sess.boto_region_name}")

# Import dataset

We will use the `PAWS (Paraphrase Adversaries from Word Scrambling)` dataset. The final labeled dataset contains pairs generated by both word swapping and back translation. All pairs have been reviewed by humans for both paraphrasing and fluency, and they are split into `Train/Validation/Test` sections. The `Train` dataset contains 49,401 sentence pairs, while the `Validation` and `Test` datasets contain 8,000 sentence pairs each.

![image info](./img/PAWS-dataset-sample.png)

In [None]:
dataset_train, dataset_val, dataset_test = load_dataset("paws", "labeled_final", split=['train', 'validation', 'test'])

# Understand Dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = dataset_train.to_pandas()

ax = sns.countplot(x="label", data=df)
ax.set_title('Label Count for PAWS Dataset', fontsize=15)
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x()+0.4, p.get_height()), ha='center', va='top', color='white', size=13)

The dataset is only slightly imbalanced, so we skip class imbalance mitigation techniques such as _oversampling_, _undersampling_, and _SMOTE_.

In [None]:
df['label'].value_counts(normalize=True)
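The normalized counts above give the share of each class. As a minimal sketch of the same check without pandas, the helpers below compute class proportions and a majority-to-minority ratio. The function names and the toy labels are assumptions for illustration; they are not the PAWS label counts.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: fraction of examples} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class to the least frequent one (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels for illustration only; these are NOT the PAWS label counts.
toy_labels = [0, 0, 0, 1, 1]
print(class_proportions(toy_labels))
print(imbalance_ratio(toy_labels))
```

A ratio close to 1.0 indicates balanced classes; how large a ratio warrants resampling is a judgment call that depends on the task.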

# Tokenize the text

In [None]:
from transformers import AutoTokenizer
tokenizer_and_model_name = 'roberta-large'

# Download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_and_model_name)

# Tokenizer helper function
def tokenize(batch, max_len=128):
    return tokenizer(batch['sentence1'], batch['sentence2'], max_length=max_len, truncation=True)

dataset_train_tokenized = dataset_train.map(tokenize, batched=True, batch_size=len(dataset_train))
dataset_val_tokenized = dataset_val.map(tokenize, batched=True, batch_size=len(dataset_val))
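The `tokenize` helper encodes each sentence pair as one sequence capped at `max_len` tokens. The sketch below is a simplified, pure-Python illustration of that idea. It assumes whitespace tokens instead of RoBERTa's byte-level BPE, and `pack_pair` is a hypothetical helper, not part of `transformers`. It mimics the RoBERTa pair layout `<s> A </s></s> B </s>` and a "longest first" truncation that trims whichever sentence is currently longer.

```python
def pack_pair(tokens_a, tokens_b, max_len):
    """Toy illustration of packing a sentence pair into a single sequence.

    Tokens are dropped one at a time from whichever sentence is currently
    longer (ties trim the second sentence) until the pair, plus the four
    special tokens, fits into max_len.
    """
    num_special = 4  # <s>, </s>, </s>, </s>
    budget = max_len - num_special
    if budget < 0:
        raise ValueError("max_len is too small to hold the special tokens")

    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return ["<s>", *a, "</s>", "</s>", *b, "</s>"]


# Whitespace "tokens" stand in for real subword tokens.
first = "the cat sat on the mat".split()
second = "a cat was on a mat today".split()
packed = pack_pair(first, second, max_len=12)
print(len(packed), packed)
```

The real tokenizer also returns `input_ids` and an `attention_mask` rather than strings; the sketch only shows how the length budget is shared between the two sentences.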

In [None]:
dataset_train_tokenized, dataset_val_tokenized

In [None]:
# convert tokenized datasets to PyTorch tensors
dataset_train_tokenized = dataset_train_tokenized.rename_column("label", "labels")
dataset_train_tokenized.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

dataset_val_tokenized = dataset_val_tokenized.rename_column("label", "labels")
dataset_val_tokenized.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# Upload tokenized dataset to Amazon S3

Now that the datasets are processed, we use the `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) from `datasets` to upload them to S3.

In [None]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()
s3_prefix = 'sts-paws-datasets/paws-tokenized/' + tokenizer_and_model_name

# save train dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
dataset_train_tokenized.save_to_disk(training_input_path,fs=s3)

# save val dataset to s3
val_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/val'
dataset_val_tokenized.save_to_disk(val_input_path,fs=s3)

# Model Training: Amazon SageMaker - Hugging Face

![](img/bert_transfer_learning.png)

This illustration is from Devlin et al. (2019) and shows the overall pre-training and fine-tuning procedure for BERT. Retrieved from https://arxiv.org/pdf/1810.04805.pdf

Other resources for learning more about transfer learning and BERT include:

1. [Recent Advances in Language Model Fine-tuning](https://ruder.io/recent-advances-lm-fine-tuning/) by Sebastian Ruder
2. [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/) by Jay Alammar
3. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
4. [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)
5. [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108)

# Model fine-tuning

Now that data preparation is done, we are ready to fine-tune the pre-trained `roberta-large` model (the value of `tokenizer_and_model_name`) on the task of identifying paraphrased sentences. We can use the `HuggingFace` estimator class in the SageMaker Python SDK to start the fine-tuning process in a few simple steps.


Note: when fine-tuning Hugging Face transformers, ensure that the `transformers_version`, `pytorch_version` and `py_version` are aligned, as described [here](https://huggingface.co/docs/sagemaker/main).

In [None]:
## Optional: View session bucket name
# f's3://{sess.default_bucket()}'

In [None]:
metric_definitions=[
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}
]
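Each metric definition pairs a metric name with a regular expression that is applied to the training logs; the value captured by the first group becomes the metric value. The sketch below applies two of the patterns above to log lines locally. The sample log lines are hypothetical (written in the style of the Hugging Face `Trainer` output) and `extract_metrics` is an illustrative helper, not a SageMaker API.

```python
import re

# Same patterns as the 'loss' and 'epoch' metric definitions, written as raw strings.
patterns = {
    "loss": r"'loss': ([0-9]+(.|e\-)[0-9]+),?",
    "epoch": r"'epoch': ([0-9]+(.|e\-)[0-9]+),?",
}


def extract_metrics(log_line, patterns):
    """Return {metric_name: float} for every pattern that matches the log line."""
    found = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log_line)
        if match:
            found[name] = float(match.group(1))
    return found


# Hypothetical log lines, for illustration only.
train_line = "{'loss': 0.4523, 'learning_rate': 1e-05, 'epoch': 1.0}"
eval_line = "{'eval_loss': 0.31, 'epoch': 2.0}"

print(extract_metrics(train_line, patterns))
print(extract_metrics(eval_line, patterns))
```

Because the `loss` pattern starts with a quote character, it does not match inside `'eval_loss'`, so training and evaluation losses stay separate.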

In [None]:
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

hyperparameters = {'epochs': 4,
                   'train_batch_size': 16,
                   'model_name': tokenizer_and_model_name}


huggingface_estimator = HuggingFace(
                            entry_point='train.py',
                            source_dir='./scripts',
                            output_path=f's3://{sess.default_bucket()}',
                            base_job_name='huggingface-sdk-extension',
                            instance_type='ml.p3.8xlarge',
                            instance_count=1,
                            volume_size=100,
                            transformers_version='4.6.1',
                            pytorch_version='1.7.1',
                            py_version='py36',
                            role=role,
                            hyperparameters=hyperparameters,
                            metric_definitions=metric_definitions
                        )

The estimator runs our `train.py` script, stored in `./scripts/train.py`, as shown below:

In [None]:
## Optional: View training script
# !pygmentize ./scripts/train.py
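The training script itself is not reproduced in this notebook. Since the metric definitions look for `eval_accuracy`, `eval_f1`, `eval_precision` and `eval_recall`, one plausible shape for the script's metric function is sketched below in plain Python. This is an assumption about what `train.py` computes, not its actual contents; the `compute_metrics` signature here is simplified (the Hugging Face `Trainer` passes a single prediction object and adds the `eval_` prefix to the returned keys).

```python
def compute_metrics(logits, labels):
    """Binary-classification metrics from per-example logits and gold labels.

    `logits` is a list of [score_class_0, score_class_1] pairs; the predicted
    class is the index of the larger score. Class 1 is treated as positive.
    """
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(preds, labels))

    tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
    correct = sum(p == y for p, y in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


# Toy logits and labels for illustration only.
toy_logits = [[2.0, -1.0], [0.1, 0.9], [0.3, 0.7], [1.5, 0.2]]
toy_labels = [0, 1, 0, 1]
print(compute_metrics(toy_logits, toy_labels))
```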

In [None]:
import time

# Begin fine-tuning
huggingface_estimator.fit({'train': training_input_path, 'test': val_input_path}, 
                          wait=True, 
                          job_name='sm-sts-blog-{}'.format(int(time.time())))

![](./img/sagemaker-training-job.png)

# Model Deployment

## SageMaker Endpoint Deployment

To deploy the trained model to an endpoint, we call the `deploy()` method on the HuggingFace estimator object.

In [None]:
rt_predictor = huggingface_estimator.deploy(initial_instance_count=1,
                                         instance_type="ml.g4dn.xlarge", 
                                         endpoint_name="sts-sbert-paws-roberta-base-realtime-inference")

**Optional: Alternatively, we can also load a previously fine-tuned model from s3**

In [None]:
# import boto3

# sm_client = boto3.client('sagemaker')
# bucket_name = 'sts-sbert-paws-blog'
# latest_sm_training_job_name = sm_client.list_training_jobs()['TrainingJobSummaries'][0]['TrainingJobName']

# S3_PATH_TRAINED_MODEL_FILE = 's3://' + bucket_name + '/' + latest_sm_training_job_name + '/output/model.tar.gz'
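The path built above follows the layout `<output_path>/<training job name>/output/model.tar.gz`. A small helper makes that convention explicit. `model_artifact_uri` is a hypothetical helper, and the bucket and job names in the example are placeholders.

```python
def model_artifact_uri(output_path, job_name):
    """Build the S3 URI of the model archive written by a training job.

    Assumes the layout <output_path>/<job_name>/output/model.tar.gz.
    """
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Placeholder bucket and job name, for illustration only.
print(model_artifact_uri("s3://my-bucket", "sm-sts-blog-1650000000"))
```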

In [None]:
# print('Selected SageMaker Training Job Name:', latest_sm_training_job_name)

In [None]:
# # recreate the huggingface_model object

# from sagemaker.huggingface import HuggingFaceModel

# huggingface_model = HuggingFaceModel(
#     model_data=S3_PATH_TRAINED_MODEL_FILE,
#     role=role,
#     transformers_version='4.6.1',
#     pytorch_version='1.7.1',
#     py_version='py36',
# )

In [None]:
# # deploy the recreated model object (not the estimator)
# rt_predictor = huggingface_model.deploy(initial_instance_count=1,
#                                      instance_type='ml.g4dn.xlarge', 
#                                      endpoint_name='sts-sbert-paws-roberta-base-realtime-inference',
#                                      wait=True)

Other ways to deploy a model to a SageMaker endpoint include 1) deploying a model stored in the [Hugging Face Hub](https://huggingface.co/models) and 2) using a custom inference container. For more information on these methods, refer to [Announcing managed inference for Hugging Face models in Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/announcing-managed-inference-for-hugging-face-models-in-amazon-sagemaker/).

## SageMaker Serverless Endpoint Deployment

First, create a serverless inference configuration object.

In [None]:
from sagemaker.serverless.serverless_inference_config import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=10,
)
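The configuration above requests 6144 MB of memory. To our knowledge, serverless endpoints accept memory sizes in 1 GB steps from 1024 MB to 6144 MB; treat that list as an assumption and check the current SageMaker documentation before relying on it. Under that assumption, the hypothetical helper below picks the smallest allowed setting that covers a given requirement.

```python
# Assumed set of allowed serverless memory sizes in MB (verify against the AWS docs).
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def smallest_memory_size(required_mb):
    """Return the smallest allowed memory size that is >= required_mb."""
    for size in ALLOWED_MEMORY_MB:
        if size >= required_mb:
            return size
    raise ValueError(
        f"{required_mb} MB exceeds the largest allowed size of {ALLOWED_MEMORY_MB[-1]} MB"
    )


print(smallest_memory_size(1300))
print(smallest_memory_size(6144))
```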

Retrieve the image URI.

In [None]:
image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface",
    base_framework_version="pytorch1.7",
    region=sess.boto_region_name,
    version="4.6",
    py_version="py36",
    instance_type="ml.m5.large",
    image_scope="inference",
)
image_uri

Create serverless inference endpoint.

In [None]:
sl_predictor = huggingface_estimator.deploy(
    endpoint_name= "huggingface-serverless-ep" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()),
    serverless_inference_config=serverless_config,
    image_uri=image_uri
)

# Model Inference

In [None]:
# Re-run this cell to see predictions on alternative sample test inputs

import random 

rand = random.randrange(0, 8000)

true_label = dataset_test[rand]['label']
sent_1 = dataset_test[rand]['sentence1']
sent_2 = dataset_test[rand]['sentence2']

sentence_pair = '[CLS] ' + sent_1 + ' [SEP] ' + sent_2 + ' [SEP]'

## Inference using SageMaker Endpoint

Once the model is deployed, we can send observations from the unseen test dataset, `dataset_test`, to the endpoint to make predictions.

Let's send the randomly selected sentence pair from the test dataset to the real-time endpoint for a prediction.

In [None]:
print('Sentence 1:', sent_1) 
print('Sentence 2:', sent_2)
print()
print('Inference Endpoint:', rt_predictor.endpoint_name)
print('True Label:', true_label)
rt_prediction = rt_predictor.predict({"inputs": sentence_pair})[0]  # call the endpoint once and reuse the result
print('Predicted Label:', rt_prediction['label'])
print('Prediction Confidence:', rt_prediction['score'])

## Inference using SageMaker Serverless Endpoint

Invoking the serverless inference endpoint works the same way.

In [None]:
print('Sentence 1:', sent_1) 
print('Sentence 2:', sent_2)
print()
print('Inference Endpoint:', sl_predictor.endpoint_name)
print('True Label:', true_label)
sl_prediction = sl_predictor.predict({"inputs": sentence_pair})[0]  # call the endpoint once and reuse the result
print('Predicted Label:', sl_prediction['label'])
print('Prediction Confidence:', sl_prediction['score'])

# Model Evaluation

Let's apply the fine-tuned model to the whole unseen test set and evaluate its performance.

In [None]:
print('Number of test set records:', len(dataset_test))

In [None]:
from tqdm import tqdm

preds = []
labels = []

# Inference takes ~5 minutes for 8,000 test records using a fine-tuned roberta-large and ml.g4dn.xlarge instance

for i in tqdm(range(len(dataset_test))):
    true_label = dataset_test[i]['label']
    sent_1 = dataset_test[i]['sentence1']
    sent_2 = dataset_test[i]['sentence2']
    
    sentence_pair = {"inputs": ['[CLS] ' + sent_1 + ' [SEP] ' + sent_2 + ' [SEP]']}
    pred = rt_predictor.predict(sentence_pair)
    
    labels.append(true_label)
    preds.append(int(pred[0]['label'].split('_')[1]))
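The loop above turns a response label such as `LABEL_1` into the integer `1` by splitting on the underscore. A small helper can make that parsing step explicit and easier to test. `parse_prediction` is a hypothetical helper, and the sample response is made up to match the `label`/`score` fields used in this notebook.

```python
def parse_prediction(response):
    """Convert an endpoint response into an (int_label, score) tuple.

    Expects a list whose first item looks like {'label': 'LABEL_1', 'score': 0.98}.
    """
    top = response[0]
    label_id = int(top["label"].split("_")[1])
    return label_id, float(top["score"])


# Made-up response for illustration only.
sample_response = [{"label": "LABEL_1", "score": 0.98}]
print(parse_prediction(sample_response))
```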

In [None]:
from sklearn.metrics import classification_report

print('Endpoint Name:', rt_predictor.endpoint_name)
# PAWS labels: 0 = not paraphrase, 1 = paraphrase (target_names follow sorted label order)
class_names = ['not paraphrase', 'paraphrase']
print(classification_report(labels, preds, target_names=class_names))

We will also evaluate performance on the serverless inference endpoint.

In [None]:
from tqdm import tqdm

preds2 = []
labels2 = []

# Runs all 8,000 test records through the serverless endpoint; timing may differ from the real-time ml.g4dn.xlarge endpoint

for i in tqdm(range(len(dataset_test))):
    true_label = dataset_test[i]['label']
    sent_1 = dataset_test[i]['sentence1']
    sent_2 = dataset_test[i]['sentence2']
    
    sentence_pair = {"inputs": ['[CLS] ' + sent_1 + ' [SEP] ' + sent_2 + ' [SEP]']}
    pred = sl_predictor.predict(sentence_pair)
    
    labels2.append(true_label)
    preds2.append(int(pred[0]['label'].split('_')[1]))

In [None]:
from sklearn.metrics import classification_report

print('Endpoint Name:', sl_predictor.endpoint_name)
# PAWS labels: 0 = not paraphrase, 1 = paraphrase (target_names follow sorted label order)
class_names = ['not paraphrase', 'paraphrase']
print(classification_report(labels2, preds2, target_names=class_names))

# Cleanup

When we are done with the endpoints, we can delete them to save cost.

In [None]:
# delete real-time endpoint
rt_predictor.delete_model()
rt_predictor.delete_endpoint()

# delete serverless endpoint
sl_predictor.delete_model()
sl_predictor.delete_endpoint()

# Additional Resources

1. [Use Hugging Face with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html)
2. [Hugging Face sample notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker)
3. [AWS Blog - AWS and Hugging Face collaborate to simplify and accelerate adoption of Natural Language Processing models](https://aws.amazon.com/blogs/machine-learning/aws-and-hugging-face-collaborate-to-simplify-and-accelerate-adoption-of-natural-language-processing-models/)
4. [AWS Blog - Announcing managed inference for Hugging Face models in Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/announcing-managed-inference-for-hugging-face-models-in-amazon-sagemaker/)
5. [The Partnership: Amazon SageMaker and Hugging Face](https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face)
6. Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J Biomed Inform. 2015;53:196-207. doi:10.1016/j.jbi.2014.11.002
