# Hugging Face Sentiment Classification


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

---

__Binary Classification with `Trainer` and `sst2` dataset__

## Runtime

This notebook takes approximately 45 minutes to run.

## Contents

1. [Introduction](#Introduction)  
2. [Development environment and permissions](#Development-environment-and-permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions)
3. [Pre-processing](#Pre-processing)  
    1. [Download the dataset](#Download-the-dataset)  
    2. [Tokenize sentences](#Tokenize-sentences)  
    3. [Upload data to sagemaker_session_bucket](#Upload-data-to-sagemaker_session_bucket)  
4. [Fine-tune the model and start a SageMaker training job](#Fine-tune-the-model-and-start-a-SageMaker-training-job)  
    1. [Create an Estimator and start a training job](#Create-an-Estimator-and-start-a-training-job)  
    2. [Deploy the endpoint](#Deploy-the-endpoint)  
5. [Extras](#Extras)  
    1. [Estimator Parameters](#Estimator-Parameters)  
    2. [Attach a previous training job to an estimator](#Attach-a-previous-training-job-to-an-estimator)  

## Introduction

Welcome to our end-to-end binary text classification example. This notebook uses Hugging Face's `transformers` library with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer for binary text classification on the `sst2` dataset. Before starting, we set up the environment with a few prerequisite steps for installation and permissions.

This notebook is adapted from Hugging Face's notebook [Huggingface Sagemaker-sdk - Getting Started Demo](https://github.com/huggingface/notebooks/blob/master/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb) and provided here courtesy of Hugging Face.

<img src="text_classification.png" width="700"/>

<i>NOTE: You can run this notebook in SageMaker Studio, a SageMaker notebook instance, or your local machine. This notebook was tested in a notebook instance using the conda\_pytorch\_p36 kernel.</i>


## Development environment and permissions 

### Installation

_*Note:* We install the required libraries from Hugging Face and AWS. You also need PyTorch, if you haven't installed it already._

In [None]:
!pip install "sagemaker" "transformers" "datasets[s3]" "s3fs" --upgrade

### Development environment 

In [None]:
import sagemaker.huggingface

### Permissions

_If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can read more at [SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._

In [None]:
import sagemaker

sess = sagemaker.Session()
# The SageMaker session bucket is used for uploading data, models and logs
# SageMaker will automatically create this bucket if it doesn't exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # Set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"Role arn: {role}")
print(f"Bucket: {sess.default_bucket()}")
print(f"Region: {sess.boto_region_name}")

## Pre-processing

We use the `datasets` library to pre-process the `sst2` (Stanford Sentiment Treebank) dataset. After pre-processing, the dataset is uploaded to the `sagemaker_session_bucket` for use within the training job. The [sst2](https://nlp.stanford.edu/sentiment/index.html) dataset consists of highly polar movie reviews, split into 67,349 training samples and a separate set of testing samples.

### Download the dataset

In [None]:
from datasets import Dataset
from transformers import AutoTokenizer
import pandas as pd
import boto3

# Tokenizer used in pre-processing
tokenizer_name = "distilbert-base-uncased"

# S3 key prefix for the data
s3_prefix = "DEMO-samples/datasets/sst"

# Download the SST2 data from s3
s3 = boto3.client("s3")
s3.download_file(
    f"sagemaker-example-files-prod-{sess.boto_region_name}",
    "datasets/text/SST2/sst2.test",
    "sst2.test",
)
s3.download_file(
    f"sagemaker-example-files-prod-{sess.boto_region_name}",
    "datasets/text/SST2/sst2.train",
    "sst2.train",
)
s3.download_file(
    f"sagemaker-example-files-prod-{sess.boto_region_name}",
    "datasets/text/SST2/sst2.val",
    "sst2.val",
)

### Tokenize sentences

In [None]:
# Download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)


# Tokenizer helper function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)


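# Load dataset: read each raw line whole into a single "line" column
# ("delimiter" is a separator string that is not expected to occur in the data)
test_df = pd.read_csv("sst2.test", sep="delimiter", header=None, engine="python", names=["line"])
train_df = pd.read_csv("sst2.train", sep="delimiter", header=None, engine="python", names=["line"])

# Split each line at the first space into label and text
# (n is passed by keyword because newer pandas versions require it)
test_df[["label", "text"]] = test_df["line"].str.split(" ", n=1, expand=True)
train_df[["label", "text"]] = train_df["line"].str.split(" ", n=1, expand=True)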

test_df.drop("line", axis=1, inplace=True)
train_df.drop("line", axis=1, inplace=True)

test_df["label"] = pd.to_numeric(test_df["label"], downcast="integer")
train_df["label"] = pd.to_numeric(train_df["label"], downcast="integer")

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# Set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
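
The split on the first space in the cell above implies that each raw line has the form `<label> <text>`. The following is a minimal, library-free sketch of that parsing step. The sample sentences and the helper name `parse_line` are made up for illustration and are not taken from the dataset or the notebook.

```python
# Hypothetical sample lines in the "<label> <text>" layout assumed above
sample_lines = [
    "1 a charming and often affecting journey",
    "0 unflinchingly bleak and desperate",
]


def parse_line(line):
    """Split one raw line into an integer label and the sentence text."""
    label, text = line.split(" ", 1)  # split at the first space only
    return int(label), text


records = [parse_line(line) for line in sample_lines]
for label, text in records:
    print(label, text)
```

Splitting only once keeps the rest of the sentence intact, including any further spaces.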

### Upload data to `sagemaker_session_bucket`

After pre-processing, we use the `datasets` library's `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload both datasets to S3.

In [None]:
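from datasets.filesystems import S3FileSystem

# Note: newer `datasets` releases deprecate the `fs` argument used below in
# favor of `storage_options`; adjust the calls if your installed version rejects `fs`
s3 = S3FileSystem()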

# save train_dataset to s3
training_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/train"
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to s3
test_input_path = f"s3://{sess.default_bucket()}/{s3_prefix}/test"
test_dataset.save_to_disk(test_input_path, fs=s3)

## Fine-tune the model and start a SageMaker training job

To create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator, we define which fine-tuning script to use as `entry_point`, which `instance_type` to run on, which `hyperparameters` to pass in, and so on. For example:



```python
from sagemaker.huggingface import HuggingFace

# Versions match the estimator created later in this notebook
hf_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    base_job_name="huggingface-sdk-extension",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py38",
    role=role,
    hyperparameters={
        "epochs": 1,
        "train_batch_size": 32,
        "model_name": "distilbert-base-uncased",
    },
)
```

When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from the `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then, it starts the training job by running:

```bash
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
```

The `hyperparameters` defined in the `HuggingFace` estimator are passed to the script as named command-line arguments, as in the command above.

SageMaker provides useful properties about the training environment through various environment variables, including the following:

* `SM_MODEL_DIR`: A string representing the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.

* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

* `SM_CHANNEL_XXXX`: A string representing the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the Hugging Face estimator's `fit()` call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.
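
To make the hand-off concrete, here is a minimal sketch of how a training script could read the hyperparameters and environment variables described above. It is not the contents of `./scripts/train.py`. The argument names `--model_dir`, `--n_gpus`, `--training_dir`, and `--test_dir`, and the fallback defaults, are assumptions chosen for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths for a training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as named command-line arguments
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")

    # Environment details come from SM_* variables; the fallbacks are
    # assumptions that only matter when running outside SageMaker
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--n_gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))

    # Ignore any extra arguments rather than failing on them
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the command line shown earlier in this section
args = parse_args(["--epochs", "1", "--model_name", "distilbert-base-uncased", "--train_batch_size", "32"])
print(args.epochs, args.train_batch_size, args.model_name, args.training_dir)
```

Reading the `SM_*` values as argument defaults lets the same script run outside SageMaker by passing the paths explicitly.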


To run the training job locally, you can define `instance_type="local"` or `instance_type="local_gpu"` for GPU usage.

_Note: local mode is not supported in SageMaker Studio._


In [None]:
!pygmentize ./scripts/train.py

### Create an Estimator and start a training job

In [None]:
from sagemaker.huggingface import HuggingFace

# Hyperparameters which are passed into the training job
hyperparameters = {"epochs": 1, "train_batch_size": 32, "model_name": "distilbert-base-uncased"}

In [None]:
hf_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters=hyperparameters,
)

In [None]:
# Start the training job with the uploaded dataset as input
hf_estimator.fit({"train": training_input_path, "test": test_input_path})

### Deploy the endpoint

To deploy the endpoint, call `deploy()` on the HuggingFace estimator object, passing in the desired number of instances and instance type.

In [None]:
predictor = hf_estimator.deploy(initial_instance_count=1, instance_type="ml.p3.2xlarge")

Then use the returned predictor object to perform inference.

In [None]:
sentiment_input = {"inputs": "I love using the new Inference DLC."}

predictor.predict(sentiment_input)

We see that the fine-tuned model classifies the test sentence "I love using the new Inference DLC." as having positive sentiment with about 98% probability. The exact score can vary between training runs.
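
The following is a minimal sketch of turning a prediction into a readable result. It assumes the endpoint returns a list of dictionaries with `label` and `score` keys, and that the generic names `LABEL_0` and `LABEL_1` correspond to the `0` (negative) and `1` (positive) labels in the training data. The example response is illustrative rather than captured from a real invocation, and `describe` is a hypothetical helper.

```python
# Assumed response shape for a text-classification endpoint; the values are
# illustrative and not taken from a real invocation
example_response = [{"label": "LABEL_1", "score": 0.98}]

# Assumed mapping from generic label names to sentiment names
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def describe(response):
    """Return a readable summary of the highest-scoring prediction."""
    top = max(response, key=lambda item: item["score"])
    sentiment = LABEL_NAMES.get(top["label"], top["label"])
    return f"{sentiment} ({top['score']:.0%})"


print(describe(example_response))
```

If your model configuration defines its own label names, the lookup falls back to printing the label as returned.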

Finally, delete the endpoint so that it stops incurring charges.

In [None]:
predictor.delete_endpoint()

## Extras

### Estimator Parameters

In [None]:
print(f"Container image used for training job: \n{hf_estimator.image_uri}\n")
print(f"S3 URI where the trained model is located: \n{hf_estimator.model_data}\n")
print(f"Latest training job name for this estimator: \n{hf_estimator.latest_training_job.name}\n")

In [None]:
hf_estimator.sagemaker_session.logs_for_job(hf_estimator.latest_training_job.name)

### Attach a previous training job to an estimator

In SageMaker, you can attach a previous training job to an estimator to continue training, get results, etc.

In [None]:
from sagemaker.estimator import Estimator

# Uncomment the following lines and supply your training job name

# old_training_job_name = "<your-training-job-name>"
# hf_estimator_loaded = Estimator.attach(old_training_job_name)
# hf_estimator_loaded.model_data

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)
