# Classifying news headlines (SageMaker Version)

> This notebook works well with the `Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)` kernel on SageMaker Studio

In this example, you'll train a news headline classifier model using a custom script and the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) framework.

This "SageMaker" notebook will demonstrate training the model on an Amazon SageMaker Training Job, and deploying it to a managed real-time inference endpoint.

> ⚠️ We assume you've already run the companion ["Headline Classifier Local" notebook](Headline%20Classifier%20Local.ipynb), which demonstrates how you'd run training and inference here on the notebook itself.

## Installation and setup

As in the local notebook, we'll make sure that the widgets library is set up before starting out.

🟢 **Unlike** the local notebook, we **do not need to install HF Transformers** here, because the actual training and inference will happen in containerized jobs rather than in this kernel.

> ℹ️ (In fact, when you start multiple SageMaker Studio notebooks on the same kernel image and the same instance, they share an environment. So if you ran the local notebook and selected the same kernel, everything is installed already.)

In [None]:
%pip install "ipywidgets<8" "sagemaker>2.140,<3"

With installs complete, we'll load the libraries and Python built-ins to be used in the rest of the notebook.

The [%autoreload magic](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html) is useful when working with local .py files, because re-loading libraries on each cell execution lets you consume locally edited/updated scripts without having to restart your notebook kernel.

🟢 This time, we'll be using some **AWS libraries** we didn't need in the local notebook:

- `boto3`, the [general-purpose AWS SDK for Python](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
- `sagemaker`, the [high-level Python SDK for Amazon SageMaker](https://sagemaker.readthedocs.io/en/stable/)

Both of these libraries are open-source, published on PyPI and GitHub.

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import os # Operating system utils e.g. file paths

# External Dependencies:
import boto3 # General AWS SDK for Python
import ipywidgets as widgets # Interactive prediction widget
import pandas as pd # Utilities for working with data tables (dataframes)
import sagemaker # High-level Python SDK for Amazon SageMaker

local_dir = "data"

## Prepare and upload the dataset

This example will download the **FastAi AG News** dataset from the [Registry of Open Data on AWS](https://registry.opendata.aws/fast-ai-nlp/) public repository. This dataset contains a table of news headlines and their corresponding topic classes.

In [None]:
%%time
# Download the AG News data from the Registry of Open Data on AWS.
!mkdir -p {local_dir}
!aws s3 cp s3://fast-ai-nlp/ag_news_csv.tgz {local_dir} --no-sign-request

# Un-tar the AG News data.
!tar zxf {local_dir}/ag_news_csv.tgz -C {local_dir}/ --strip-components=1 --no-same-owner

# Push data partitions to separate subfolders, which is useful for local script debugging later
os.renames(f"{local_dir}/test.csv", f"{local_dir}/test/test.csv")
os.renames(f"{local_dir}/train.csv", f"{local_dir}/train/train.csv")
print("Done!")

With the data downloaded and extracted, we can explore some of the examples as shown below:

In [None]:
column_names = ["CATEGORY", "TITLE", "CONTENT"]
# we use the train.csv only
df = pd.read_csv(f"{local_dir}/train/train.csv", names=column_names, header=None, delimiter=",")
# shuffle the DataFrame rows
df = df.sample(frac=1, random_state=1337)

# Make the (1-indexed) category classes more readable:
class_names = ["Other", "World", "Sports", "Business", "Sci/Tech"]
idx2label = {ix: name for ix, name in enumerate(class_names)}
label2idx = {name: ix for ix, name in enumerate(class_names)}

df = df.replace({"CATEGORY": idx2label})
df.head()

For this exercise we'll **only use**:

- The **title** (Headline) of the news story, as our input
- The **category**, as our target variable to predict

This dataset contains 4 evenly distributed topic classes, as shown below.

> ℹ️ **What about 'Other'?:** Since the raw dataset represents categories with a number from 1-4, and our model will expect numbers starting from 0, we've inserted the un-used 'Other' class to keep data preparation simple and avoid introducing an extra, confusing, numeric representation of the classes.

In [None]:
df["CATEGORY"].value_counts()
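
As a small illustration of the 'Other' note above, the sketch below maps a handful of raw category IDs through the same `class_names` list. The `raw_category_ids` sample is made up for illustration; only the class list comes from this notebook.

```python
# Same class list as in the notebook: index 0 is the unused 'Other' placeholder.
class_names = ["Other", "World", "Sports", "Business", "Sci/Tech"]
idx2label = {ix: name for ix, name in enumerate(class_names)}
label2idx = {name: ix for ix, name in enumerate(class_names)}

# Raw category IDs run from 1 to 4, so they index straight into the list.
raw_category_ids = [3, 1, 4, 2]  # Hypothetical sample of raw CATEGORY values
readable = [idx2label[ix] for ix in raw_category_ids]
print(readable)

# Round trip back to the numeric IDs, with no off-by-one adjustment needed.
round_trip = [label2idx[name] for name in readable]
print(round_trip)
```

Because of the placeholder at index 0, the raw ID and the model's class index are the same number, which is the simplification the note describes.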

So far, nothing new...

🟢 The key difference for training on SageMaker is that we'll need to **upload our datasets** [somewhere the training job will have access to them](https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html).

Here we'll upload the data to [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) using the [SageMaker default bucket](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-bucket.html). You can customize the bucket and folder prefix, if you'd like. It will be helpful to separate training and test data into separate S3 folders, rather than just two files in the same folder.

In [None]:
bucket_name = sagemaker.Session().default_bucket()
s3_prefix = "sm101/news"

s3 = boto3.resource("s3")

s3.Bucket(bucket_name).upload_file(f"{local_dir}/train/train.csv", f"{s3_prefix}/train/train.csv")
train_s3_uri = f"s3://{bucket_name}/{s3_prefix}/train"
print(f"train_s3_uri: {train_s3_uri}")

s3.Bucket(bucket_name).upload_file(f"{local_dir}/test/test.csv", f"{s3_prefix}/test/test.csv")
test_s3_uri = f"s3://{bucket_name}/{s3_prefix}/test"
print(f"test_s3_uri: {test_s3_uri}")
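
If you later need to go the other way (from an S3 URI back to the bucket name and key prefix, for example to list or download objects with `boto3`), a minimal sketch could look like the one below. The helper name `split_s3_uri` and the bucket `example-bucket` are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an 's3://bucket/key/prefix' URI into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri}")
    # The URL path starts with '/', which is not part of the S3 key.
    return parsed.netloc, parsed.path.lstrip("/")


# 'example-bucket' is a placeholder, not your real SageMaker default bucket.
bucket, prefix = split_s3_uri("s3://example-bucket/sm101/news/train")
print(bucket, prefix)
```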

## Define training parameters

We'll be fine-tuning a (relatively small) pre-trained model from the [Hugging Face Hub](https://huggingface.co/models), and using their high-level [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) rather than writing a low-level training loop from scratch.

🟢 Our training script will ultimately use similar parameters to before, but this time we'll be passing them **through the training job API**.

We'll define **JSON-serializable parameters** here in the notebook, and then use those to build the `transformers.TrainingArguments` later:

In [None]:
hyperparameters = {
 "model_id": "amazon/bort", # ID of the pre-trained model to start from
 "class_names": ",".join(class_names), # Comma-separated list of category names
 "num_train_epochs": 3, # This time, we'll actually train for a full 3 epochs
 "per_device_train_batch_size": 32, # Note this is higher than we could set on local hardware
 "per_device_eval_batch_size": 64, # Note this is higher than we could set on local hardware
 "warmup_steps": 500, # Higher than we could set with the reduced local training
}
hyperparameters
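
SageMaker framework containers in "script mode" generally pass these hyperparameters to the entry point script as command-line arguments, so the script needs to parse them back into typed values. The sketch below shows one way that could look with `argparse`. It is an illustration only, not the actual contents of [scripts/train.py](scripts/train.py), and the default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse job hyperparameters from CLI-style arguments into typed values."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="amazon/bort")
    # The notebook joins class names with commas, so split them back into a list:
    parser.add_argument("--class_names", type=lambda s: s.split(","), default=[])
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=32)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    # Ignore any arguments this sketch doesn't know about:
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments a job might receive (all values arrive as strings):
args = parse_hyperparameters([
    "--model_id", "amazon/bort",
    "--class_names", "Other,World,Sports,Business,Sci/Tech",
    "--num_train_epochs", "3",
    "--per_device_train_batch_size", "32",
])
print(args.class_names, args.num_train_epochs, args.warmup_steps)
```

Typed values like these could then be passed on to `transformers.TrainingArguments` inside the script.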

## Define metrics

We'd like to define how to measure the quality of our trained model, and make this information visible to SageMaker to enable features like metric logging, automatic model tuning and leaderboards.

🟢 We'll have our training code print metrics as usual, and [use regular expressions](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html#define-train-metrics) to define how SageMaker should scrape structured metrics from the job logs:

In [None]:
metric_definitions = [
 {"Name": "Epoch", "Regex": r"'epoch': ([0-9\.\-e]+)"},
 {"Name": "Train:Loss", "Regex": r"'loss': ([0-9\.\-e]+)"},
 {"Name": "Train:LearningRate", "Regex": r"'learning_rate': ([0-9\.\-e]+)"},
 {"Name": "Validation:Loss", "Regex": r"'eval_loss': ([0-9\.\-e]+)"},
 {"Name": "Validation:Accuracy", "Regex": r"'eval_accuracy': ([0-9\.\-e]+)"},
 {"Name": "Validation:F1", "Regex": r"'eval_f1': ([0-9\.\-e]+)"},
 {"Name": "Validation:Precision", "Regex": r"'eval_precision': ([0-9\.\-e]+)"},
 {"Name": "Validation:Recall", "Regex": r"'eval_recall': ([0-9\.\-e]+)"},
 {"Name": "Validation:Runtime", "Regex": r"'eval_runtime': ([0-9\.\-e]+)"},
 {"Name": "Validation:SamplesPerSecond", "Regex": r"'eval_samples_per_second': ([0-9\.\-e]+)"},
 {"Name": "Validation:StepsPerSecond", "Regex": r"'eval_steps_per_second': ([0-9\.\-e]+)"},
]
metric_definitions
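
To sanity-check regular expressions like these before launching a job, you could run them against a sample log line locally. The sketch below uses a made-up log line in the dict-like style the patterns expect, and a hypothetical helper `scrape_metrics`. It only approximates how SageMaker applies the patterns to job logs.

```python
import re

# A subset of the metric definitions from this notebook:
sample_definitions = [
    {"Name": "Train:Loss", "Regex": r"'loss': ([0-9\.\-e]+)"},
    {"Name": "Validation:Loss", "Regex": r"'eval_loss': ([0-9\.\-e]+)"},
    {"Name": "Validation:Accuracy", "Regex": r"'eval_accuracy': ([0-9\.\-e]+)"},
]

# Hypothetical log line for illustration (not real output from this job):
sample_log_line = "{'eval_loss': 0.3127, 'eval_accuracy': 0.9012, 'epoch': 1.0}"


def scrape_metrics(line, definitions):
    """Return {metric name: value} for every definition that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


scraped = scrape_metrics(sample_log_line, sample_definitions)
print(scraped)
```

Note that the `'loss'` pattern includes the opening quote, so it does not accidentally match inside `'eval_loss'`.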

## Train and validate the model on SageMaker

This time, we'll create a [SageMaker training job](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) to run our training process on a separate instance from the notebook itself. This allows us to right-size temporary training infrastructure independently of the long-lived notebook environment.

🟢 We've factored the actual training code out of the notebook into **[scripts/train.py](scripts/train.py)**, and will use the pre-built [Hugging Face Framework Container through the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html) to train and deploy the model from this script.

### How Amazon SageMaker runs your script with pre-built containers

AWS provides a pre-packaged set of Docker images to help you accelerate building your projects on major ML frameworks: The [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-prebuilt.html).

These containers take care of basic setup like GPU drivers, serving stack implementation, core libraries, and so on - leaving us free to simply inject some Python scripts for the training process and any inference behaviour overrides. We can even provide a *requirements.txt* file to specify additional dependencies to be dynamically installed at start-up - without having to build these into the container image.

**As a result, our first task is to understand the interfaces** between our script(s) and the runtime: How will the script read input data? Parameters? Where should it store results?

#### Your container during training

When the training job container is started, your **code and input data** are downloaded to **local files** under the `/opt/ml` directory. You'll also **save your trained model** and any other file outputs to the local filesystem - as shown below:

```
/opt/ml
|-- code
|   `-- <script files>
|-- input
|   |-- config
|   |   |-- hyperparameters.json
|   |   `-- resourceConfig.json
|   `-- data
|       `-- <channel_name>
|           `-- <input data>
|-- model
|   `-- <model files>
`-- output
    `-- failure
```

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since this example trains on a single instance, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to `CreateTrainingJob`, but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
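
As a rough illustration of the string conversion mentioned for `hyperparameters.json`, the sketch below reads a simulated config file from a temporary folder (not the real `/opt/ml`) and casts selected values. The file contents, the `CONVERTERS` table, and the helper name are assumptions for this example, and the exact string encoding of each value can depend on how the job was launched.

```python
import json
import tempfile
from pathlib import Path

# Assumed converters for this example: which parameters to cast, and how.
CONVERTERS = {"num_train_epochs": int, "warmup_steps": int}


def load_hyperparameters(config_dir):
    """Read hyperparameters.json and cast known numeric values from strings."""
    raw = json.loads((Path(config_dir) / "hyperparameters.json").read_text())
    return {key: CONVERTERS.get(key, str)(value) for key, value in raw.items()}


# Simulate the config folder with a temporary directory:
demo_config_dir = Path(tempfile.mkdtemp())
(demo_config_dir / "hyperparameters.json").write_text(
    json.dumps({"model_id": "amazon/bort", "num_train_epochs": "3", "warmup_steps": "500"})
)

hps = load_hyperparameters(demo_config_dir)
print(hps)
```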

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
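
The `failure` file convention can be sketched as a small wrapper around the training function. This is a simplified illustration using a temporary folder in place of `/opt/ml/output`; the names `run_with_failure_report` and `broken_training` are hypothetical, and the pre-built framework containers may already handle this reporting for you.

```python
import tempfile
import traceback
from pathlib import Path


def run_with_failure_report(train_fn, output_dir):
    """Run train_fn, writing the traceback to '<output_dir>/failure' on error."""
    try:
        train_fn()
    except Exception:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        (Path(output_dir) / "failure").write_text(traceback.format_exc())
        raise  # Re-raise so the process still exits with an error


def broken_training():
    raise RuntimeError("Simulated training error")


# Use a temporary folder to stand in for /opt/ml/output:
demo_output_dir = Path(tempfile.mkdtemp()) / "output"
try:
    run_with_failure_report(broken_training, demo_output_dir)
except RuntimeError:
    failure_text = (demo_output_dir / "failure").read_text()
    print(failure_text.splitlines()[-1])
```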

#### Further information

For more information, you can refer to:

- The [SageMaker Python SDK guide for Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html) and [API doc](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) for HF framework classes. (The equivalent pages for PyTorch may also be useful).
- The [AWS Deep Learning Containers repository](https://github.com/aws/deep-learning-containers) on GitHub, which defines the underlying container images.
- The open source [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) and [SageMaker Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) for more details on the framework code for training and serving. (Some frameworks use variants on these toolkits, e.g. the [sagemaker-pytorch-training-toolkit](https://github.com/aws/sagemaker-pytorch-training-toolkit))

### (Optional) Testing your script

> ℹ️ **Note:** This step is optional because in this example, the training script has already been built and tested for you!

Although the job script [train.py](scripts/train.py) is mainly the same logic and process as previously done in the notebook itself, of course it would be good to **test** the adaptations we made to prepare it for SageMaker.

For initial functional testing and debugging of your script, you may not want to spin up a full SageMaker training job each time, because of the short delay while each new job spins up its on-demand compute resources.

There are multiple ways you can speed up this process. We'd usually recommend [SageMaker Warm Pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html) or [SageMaker Local Mode](https://sagemaker.readthedocs.io/en/stable/overview.html?#local-mode), but these aren't available in the standard workshop environment. Instead, you can **simulate a training job within the notebook** by invoking your training script through the CLI.

You can un-comment (with `Ctrl`+`/`) and run the cell below to try this - ⚠️ but watch out: It's quite memory-intensive, so you'll want to shut down or restart the kernel from the previous [Headline Classifier Local notebook](Headline%20Classifier%20Local.ipynb) first.

In [None]:
# class_names_str = ",".join(class_names) # Comma-separated list for CLI
# !python3 scripts/train.py \
# --train data/train \
# --test data/test \
# --output_data_dir data/local-output \
# --model_dir data/local-model \
# --model_id=amazon/bort --class_names={class_names_str} --train_max_steps=20 \
# --train_batch_size=8 --eval_batch_size=16 --fp16=0

### Creating the job

The actual [SageMaker CreateTrainingJob API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) requires several low-level details that the [high-level 'Estimator' classes](https://sagemaker.readthedocs.io/en/stable/overview.html) in the SageMaker Python SDK help to simplify. In particular:

- Instead of specifying the exact container image URI, the SDK will look this up for us based on the selected framework and version(s)
- The SDK will transparently compress and upload our `scripts` bundle to S3, and configure the training job to load it from there.

So first, we'll create an `estimator` object configuring the job and what infrastructure (how many compute instances and what type) it should run on:

> ℹ️ Like other services that run jobs on your behalf, the training job will assume an [IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) to allow it to access resources, like your input training data on S3. Since SageMaker notebooks themselves already run with an assumed role, we'll set the training job role the same as the notebook role for simplicity.

In [None]:
from sagemaker.huggingface.estimator import HuggingFace as HuggingFaceEstimator

nb_role = sagemaker.get_execution_role()

estimator = HuggingFaceEstimator(
 transformers_version="4.26",
 pytorch_version="1.13",
 py_version="py39",

 source_dir="scripts", # Local folder where fine-tuning script is stored
 entry_point="train.py", # Actual script the training job should run

 base_job_name="news-classifier", # Prefix for the training job name (timestamp will be added)
 instance_count=1, # Number of instances to train on (need to prepare your script for using >1!)
 instance_type="ml.p3.2xlarge", # Type of compute instance to use: p* and g* include GPUs
 role=nb_role, # IAM role the job will use to access AWS resources (e.g. data on S3)

 hyperparameters=hyperparameters, # Training job parameters, as we set up earlier
 metric_definitions=metric_definitions, # RegEx to extract metric data from training job logs
)

Once the configuration is done, you can start the actual training job by running `estimator.fit()` and specifying your input data location(s).

The number, names, and types of your data input "channels" are [up to you](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-InputDataConfig): Just make sure your notebook configures the same channels that your script expects.



In [None]:
%%time

estimator.fit(
 {
 "train": train_s3_uri,
 "test": test_s3_uri,
 },
 wait=True, # Wait for the training to complete (default=True)
 logs=True, # Stream training job logs to the notebook (default=True, requires wait=True)
)
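
On the script side, the SageMaker training toolkit exposes each channel's local folder through environment variables such as `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST`, plus `SM_MODEL_DIR` for the model output folder. The sketch below shows one way a script could use these as defaults while still accepting explicit CLI paths for local testing. It is an assumption-laden illustration, not the actual contents of [scripts/train.py](scripts/train.py), and it sets the environment variables itself only to simulate a job.

```python
import argparse
import os


def parse_channel_args(argv=None):
    """Resolve data and model folders from CLI args, defaulting to SM_* env vars."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the environment inside a training job (set here for illustration only):
os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"
os.environ["SM_CHANNEL_TEST"] = "/opt/ml/input/data/test"
os.environ["SM_MODEL_DIR"] = "/opt/ml/model"

job_args = parse_channel_args([])  # In a job: no explicit paths needed
local_args = parse_channel_args(  # Local testing: override with CLI paths
    ["--train", "data/train", "--test", "data/test", "--model_dir", "data/local-model"]
)
print(job_args.train, local_args.train)
```

The channel names `train` and `test` here mirror the keys passed to `estimator.fit()`, which is why the notebook and the script need to agree on them.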

> ⏰ This training job should take around 10 minutes to complete, but should reach significantly higher accuracy than the 'local' model

Training itself should be much faster than the previous 'local' example, due to running on a GPU-accelerated instance rather than a small CPU-only notebook. However, it will likely take a couple of minutes for the job to provision the infrastructure and start up.

You can also check on the status of current and past jobs in the *Training jobs* page of the [AWS Console for Amazon SageMaker](https://console.aws.amazon.com/sagemaker/home?#/jobs), and in the **Experiments** UI here in SageMaker Studio (From the 🏠 **Home** button on the left sidebar).

🟢 Although the default behaviour of waiting and streaming logs gives a local-like experience, the training job doesn't depend on the notebook:

- If you disconnect or shut down your notebook, the training job will still continue
- A notebook could kick off multiple training jobs in parallel, by setting `wait=False`
- If you ever need to link a restarted notebook to an old training job, you can `.attach()` by training job name as shown below:

In [None]:
# estimator = HuggingFaceEstimator.attach("news-hf-2020-01-01-12-00-00-000")

Once the training job completes, the contents of the container's model output folder will be archived to S3 automatically.

You can refer to this file as shown below, and also import models trained outside of SageMaker for deployment by preparing them in a similar tarball format:

In [None]:
estimator.latest_training_job.describe()["ModelArtifacts"]["S3ModelArtifacts"]
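
For models trained outside SageMaker, a similar tarball could be prepared with Python's built-in `tarfile` module. The sketch below assumes the model files should sit at the root of a gzipped archive named `model.tar.gz`; the file names `config.json` and `weights.bin` are placeholders, and the exact files your inference container expects depend on the framework.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model(model_dir, tarball_path):
    """Archive the contents of model_dir at the root of a .tar.gz file."""
    with tarfile.open(str(tarball_path), "w:gz") as tar:
        for item in sorted(Path(model_dir).iterdir()):
            # arcname drops the parent folder, so files land at the archive root
            tar.add(str(item), arcname=item.name)
    return tarball_path


# Build a dummy model folder with placeholder files:
work_dir = Path(tempfile.mkdtemp())
demo_model_dir = work_dir / "model"
demo_model_dir.mkdir()
(demo_model_dir / "config.json").write_text("{}")
(demo_model_dir / "weights.bin").write_bytes(b"\x00")

tarball = package_model(demo_model_dir, work_dir / "model.tar.gz")
with tarfile.open(str(tarball)) as tar:
    member_names = sorted(tar.getnames())
print(member_names)
```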

## Use the model for inference

Once the model is trained, we're ready to use it for inference on new data.

SageMaker offers multiple fully-managed options for [deploying models for on-demand inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) or [running batch inference jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).

> ℹ️ **Remember:** Choose the right inference option for your use-case - You don't need to deploy a real-time endpoint if you only want to process a batch of data!
>
> See [Using SageMaker Batch Transform](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-batch-transform) for more details on how to run batch inference through the same high-level SageMaker Python SDK we've been using so far.

For this example we'll deploy the model to a [real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html), which will allow us to classify headlines on-demand.

We'll again specify what type of infrastructure we'd like to run the endpoint on, and the start-up will take a few minutes while it's provisioned. Note that we can use a **different type** of instance from training: since this test endpoint will handle very little traffic, it can use smaller/cheaper infrastructure:

In [None]:
predictor = estimator.deploy(
 initial_instance_count=1,
 instance_type="ml.m5.large",
)

After deployment, you should be able to find your endpoint in the *Endpoints* page of the [AWS Console for Amazon SageMaker](https://console.aws.amazon.com/sagemaker/home?#/endpoints) - as well as the **Deployments > Endpoints** section of the SageMaker Studio UI (From the 🏠 **Home** button on the left sidebar).

As with training jobs, endpoints are decoupled from the notebook itself. You can attach a notebook to a previously-deployed endpoint as follows:

In [None]:
# from sagemaker.huggingface import HuggingFacePredictor
# predictor = HuggingFacePredictor("news-classifier-2023-03-24-13-31-09-895")

### Your model should now be in production as a RESTful API!

The [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) doesn't load your model into memory here, but instead wraps HTTPS API calls to the deployed endpoint.

Here we're using the default `application/json` serialization support provided by the Hugging Face framework, but different frameworks have different default formats and it's possible to set up pretty much any request or response format you like with custom [serializers](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) and [deserializers](https://sagemaker.readthedocs.io/en/stable/api/inference/deserializers.html) (on the client/`predictor` side) and custom [`input_fn`s and `output_fn`s](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#process-model-input) (on the endpoint container side): No need to write your own serving stacks from scratch.

Since request de/serialization and processing is already handled for us by the [HuggingFacePredictor](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-predictor) and the pre-built inference container, calling our deployed model from the notebook is just as easy as calling the local in-memory model was:

In [None]:
def classify(text: str) -> dict:
    """Classify a headline and return the result"""
    return predictor.predict({"inputs": [text]})[0]


# Either try out the interactive widget:
interaction = widgets.interact_manual(
 classify,
 text=widgets.Text(
 value="The markets were bullish after news of the merger",
 placeholder="Type a news headline...",
 description="Headline:",
 layout=widgets.Layout(width="99%"),
 ),
)
interaction.widget.children[1].description = "Classify!"

Alternatively (if e.g. you're struggling with the UI widget library), you can call the endpoint directly from code:

In [None]:
classify("Retailers are expanding after the recent economic growth")
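
Since the request payload's `inputs` field is a list, you may be able to classify several headlines in one call. The sketch below shows how such a request could be built and the results paired back to their headlines, without calling a real endpoint. The `sample_responses` values are invented for illustration (real label names and scores depend on your trained model), and both helper functions are hypothetical.

```python
def build_request(headlines):
    """Build a JSON-serializable request body for a batch of headlines."""
    return {"inputs": list(headlines)}


def pair_results(headlines, responses):
    """Pair each headline with its predicted label and score."""
    return [
        {"headline": headline, "label": result["label"], "score": result["score"]}
        for headline, result in zip(headlines, responses)
    ]


headlines = [
    "Retailers are expanding after the recent economic growth",
    "Home team wins the championship in extra time",
]
request = build_request(headlines)

# Invented stand-in for what predictor.predict(request) might return:
sample_responses = [
    {"label": "Business", "score": 0.97},
    {"label": "Sports", "score": 0.99},
]

for row in pair_results(headlines, sample_responses):
    print(f"{row['label']:>10} ({row['score']:.2f}): {row['headline']}")
```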

## Clean-up

Note that while SageMaker jobs (like training, processing, and batch inference) use on-demand compute only for the duration they run, deployed real-time inference endpoints continue to consume resources until you turn them off.

When you're done experimenting, delete endpoints that are no longer needed to avoid unnecessary costs:

In [None]:
# predictor.delete_endpoint(delete_endpoint_config=True)

## Review

In this notebook, we showed how you could train and deploy a text classification model using Hugging Face transformers on Amazon SageMaker.

Some benefits of this approach as compared to the companion [Headline Classifier Local notebook](Headline%20Classifier%20Local.ipynb) are:

- We can automatically provision specialist computing resources (e.g. high-performance, or GPU-accelerated instances) for **only** the duration of the training job: Getting good performance in training, without leaving resources sitting around under-utilized
- The history of training jobs (including parameters, metrics, outputs, etc.) is automatically tracked - unlike local notebook experiments where the user needs to keep notes on what worked and what didn't
- Our trained model can be deployed to a secure, production-ready web endpoint with just one SDK call: No container or web application packaging required, unless we want to deeply customize the behavior

By comparing the local notebook with this SageMaker version and the accompanying [scripts/train.py](scripts/train.py) script file, you can get an idea of how to migrate your own or open-source in-notebook ML workflows into SageMaker "script mode" training jobs.

In the next "migration challenge" exercise of this workshop, you'll try to repeat this process for a different "local" notebook on your own.

You might also be interested in [aws-samples/amazon-sagemaker-from-idea-to-production](https://github.com/aws-samples/amazon-sagemaker-from-idea-to-production), which shows further steps like connecting your SageMaker jobs together into pipelines, and automating workflows with CI/CD.