# Classifying news headlines (Local Version)

> This notebook works well with the `Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)` kernel on SageMaker Studio

In this example, you'll train a news headline classifier model using a custom script and the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) framework.

This "local" notebook will demonstrate training and testing the model here on the notebook itself, while the companion ["SageMaker" notebook](Headline%20Classifier%20SageMaker.ipynb) will repeat the same process using containerized SageMaker training jobs and endpoint deployment.

If you're new to Hugging Face, you might like to read the [Transformers quick tour](https://huggingface.co/docs/transformers/quicktour) before continuing.

## Installation and setup

The PyTorch SageMaker kernel specified above has most of the libraries we need, but not quite everything. First, we'll need to install appropriate versions of HF transformers/datasets, and also IPyWidgets to power our interactive classifier widget later on:

> ⚠️ **Note:** So long as you run this cell first, there should be no need to restart your notebook kernel. If you `import`ed anything already though, you'll need to click the circular 'restart the kernel' button in the toolbar above for these installs to take effect.

You can ignore any *warnings* from pip in the output below, but should not see any *errors*.

In [None]:
%pip install datasets "ipywidgets<8" transformers==4.26

With installs complete, we'll load the libraries and Python built-ins to be used in the rest of the notebook.

The [%autoreload magic](https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html) is useful when working with local .py files, because re-loading libraries on each cell execution lets you consume locally edited/updated scripts without having to restart your notebook kernel.

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import os # Operating system utils e.g. file paths

# External Dependencies:
import datasets # Hugging Face data loading utilities
import ipywidgets as widgets # Interactive prediction widget
import pandas as pd # Utilities for working with data tables (dataframes)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import transformers # Hugging Face Transformers framework

local_dir = "data"

## Prepare the dataset

This example will download the **fast.ai AG News** dataset from the [Registry of Open Data on AWS](https://registry.opendata.aws/fast-ai-nlp/) public repository. This dataset contains a table of news headlines and their corresponding topic classes.

In [None]:
%%time
# Download the AG News data from the Registry of Open Data on AWS.
!mkdir -p {local_dir}
!aws s3 cp s3://fast-ai-nlp/ag_news_csv.tgz {local_dir} --no-sign-request

# Un-tar the AG News data.
!tar zxf {local_dir}/ag_news_csv.tgz -C {local_dir}/ --strip-components=1 --no-same-owner

# Push data partitions to separate subfolders, which is useful for local script debugging later
os.renames(f"{local_dir}/test.csv", f"{local_dir}/test/test.csv")
os.renames(f"{local_dir}/train.csv", f"{local_dir}/train/train.csv")
print("Done!")

With the data downloaded and extracted, we can explore some of the examples as shown below:

In [None]:
column_names = ["CATEGORY", "TITLE", "CONTENT"]
# we use the train.csv only
df = pd.read_csv(f"{local_dir}/train/train.csv", names=column_names, header=None, delimiter=",")
# shuffle the DataFrame rows
df = df.sample(frac=1, random_state=1337)

# Make the (1-indexed) category classes more readable:
class_names = ["Other", "World", "Sports", "Business", "Sci/Tech"]
idx2label = {ix: name for ix, name in enumerate(class_names)}
label2idx = {name: ix for ix, name in enumerate(class_names)}

df = df.replace({"CATEGORY": idx2label})
df.head()

For this exercise we'll **only use**:

- The **title** (Headline) of the news story, as our input
- The **category**, as our target variable to predict

This dataset contains 4 evenly distributed topic classes, as shown below.

> ℹ️ **What about 'Other'?:** Since the raw dataset represents categories with a number from 1-4, and our model will expect numbers starting from 0, we've inserted the un-used 'Other' class to keep data preparation simple and avoid introducing an extra, confusing, numeric representation of the classes.

In [None]:
df["CATEGORY"].value_counts()
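
To make the role of the placeholder class concrete, here is a minimal pure-Python sketch of the same mapping. The `raw_categories` list is made up for illustration and is not taken from the dataset.

```python
# Same class list as above: index 0 is the unused 'Other' placeholder,
# so the raw 1-4 category numbers can be used directly as label indices.
class_names = ["Other", "World", "Sports", "Business", "Sci/Tech"]
idx2label = dict(enumerate(class_names))
label2idx = {name: ix for ix, name in idx2label.items()}

# Hypothetical raw category values, as they would appear in the CSV:
raw_categories = [3, 1, 4, 2]

# No "+1" or "-1" offset is needed when converting in either direction:
readable = [idx2label[c] for c in raw_categories]
round_trip = [label2idx[name] for name in readable]

print(readable)
print(round_trip == raw_categories)
```

The trade-off is that the model ends up with five output classes, one of which should never appear in the training data.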

## Define training parameters

We'll be fine-tuning a (relatively small) pre-trained model from the [Hugging Face Hub](https://huggingface.co/models), and using their high-level [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) rather than writing a low-level training loop from scratch.

Below, we'll set up the basic parameters for the training.

> 🏎️ For this in-notebook example we'll be using **low-cost, CPU-only compute** by default. Although the model we're training is "small" by modern LLM standards, we'll still need to stop the training very early so it can complete in a reasonable time.
>
> The resulting model will be very under-trained, and much less accurate than the same architecture could ultimately get to.
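
As a rough sense of how early "very early" is, the arithmetic below assumes the settings used in the following cell (500 steps, 16 examples per step), a single device with no gradient accumulation, and a training split of 120,000 rows. That row count is an assumption about the AG News training file, so check it against your own download.

```python
import math

max_steps = 500        # training steps before stopping (from the cell below)
batch_size = 16        # examples per step on a single device
train_rows = 120_000   # ASSUMPTION: number of rows in train.csv

examples_seen = max_steps * batch_size
fraction_of_epoch = examples_seen / train_rows
steps_per_epoch = math.ceil(train_rows / batch_size)

print(f"Examples seen:     {examples_seen}")
print(f"Fraction of epoch: {fraction_of_epoch:.1%}")
print(f"Steps per epoch:   {steps_per_epoch}")
```

Under these assumptions, training stops well before the model has seen the whole training set even once.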

In [None]:
model_id = "amazon/bort" # ID of the pre-trained model to start from

training_args = transformers.TrainingArguments(
 output_dir=f"{local_dir}/model", # Where to save trained model snapshots
 #logging_dir=f"{local_dir}/local-logs", # Optionally, save logs too
 max_steps=500, # Maximum number of training steps to run (overrides num_train_epochs when positive)
 num_train_epochs=3, # Maximum passes over the training data (has no effect while max_steps is set)
 per_device_train_batch_size=16, # Examples per mini-batch for training
 per_device_eval_batch_size=32, # Examples per mini-batch for validation
 evaluation_strategy="steps", # Run validation every N 'steps' instead of every 'epoch'
 eval_steps=100, # Number of training steps between validation runs
 save_strategy="steps", # Must be same as evaluation_strategy when load_best_model_at_end=True
 load_best_model_at_end=True, # If current model at end is not the best, load the best
 metric_for_best_model="f1", # Use F1 score for judging which model is 'best'
 learning_rate=5e-5, # Initial learning rate (decays over time by default)
 warmup_steps=100, # Number of steps to gradually increase the learning rate from the start
)
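
To illustrate how `learning_rate` and `warmup_steps` interact, here is a small sketch of a linear warm-up followed by a linear decay to zero. It assumes the Trainer's default linear scheduler is in use; `linear_schedule` is a hypothetical helper written for this illustration, not a library function.

```python
def linear_schedule(step, base_lr=5e-5, warmup_steps=100, total_steps=500):
    """Learning rate at a given step: linear warm-up, then linear decay to 0.

    Hypothetical helper for illustration only.
    """
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warm-up period
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warm-up down to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 300, 500):
    print(step, linear_schedule(step))
```

With the values above, the learning rate peaks at step 100 and falls back to half its peak by step 300.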

## Define metrics

Here, we'll define how we'd like to measure the quality of the trained model by setting up a [callback function](https://huggingface.co/docs/transformers/main_classes/callback) that'll be run every time the model gets validated.

In [None]:
def compute_metrics(pred):
 labels = pred.label_ids
 preds = pred.predictions.argmax(-1)
 precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="micro")
 acc = accuracy_score(labels, preds)
 return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
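
One thing worth knowing about `average="micro"`: when every example has exactly one label, micro-averaged precision, recall and F1 all come out equal to accuracy. The pure-Python sketch below shows why, using made-up labels and predictions. `micro_prf` is a hypothetical stand-in written for this illustration, not the scikit-learn implementation.

```python
from collections import Counter


def micro_prf(labels, preds):
    """Micro-averaged precision/recall/F1 for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, p in zip(labels, preds):
        if y == p:
            tp[y] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[y] += 1  # true class y was missed
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels and predictions (6 of 8 correct):
labels = [1, 2, 3, 4, 1, 2, 3, 4]
preds = [1, 2, 3, 3, 1, 4, 3, 4]

precision, recall, f1 = micro_prf(labels, preds)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(precision, recall, f1, accuracy)
```

Each wrong prediction counts as one false positive and one false negative, so the totals match. If you want a metric that treats the classes separately, `average="macro"` is the usual alternative.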

## Train and validate the model

In this section we'll load up the base model and datasets, and run the actual training and validation process.

First, we'll need to load the pre-trained model and accompanying [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) for the given `model_id` - which will be automatically downloaded from the Hugging Face Hub.

As part of setting up the model, we'll need to specify how many topic classes we'll fine-tune on and set up their human-readable names:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

model = transformers.AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=len(class_names))
model.config.label2id = label2idx
model.config.id2label = idx2label

data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
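
The `data_collator` pads each mini-batch to the length of its longest example, rather than padding everything to one global maximum. A rough pure-Python sketch of that idea follows. The token IDs and the pad ID of `0` are invented for illustration, and `pad_batch` is a hypothetical helper rather than part of Transformers.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    return {
        # Real tokens first, then padding up to the batch maximum
        "input_ids": [seq + [pad_id] * (longest - len(seq)) for seq in sequences],
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [
            [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
        ],
    }


# Two hypothetical tokenized headlines of different lengths:
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short headlines cheap to process, which matters on CPU-only compute.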

We'll use the raw `train.csv` and `test.csv` files already provided in the dataset as inputs for our training, but will need to set up some pre-processing first:

- Since the CSVs don't have column headers, we'll need to specify their `column_names` manually
- The `tokenizer` will convert raw text to the (numeric) inputs the model expects - including truncating any long headlines to the maximum length supported by the model

In [None]:
def preprocess(batch):
 """Tokenize and pre-process raw examples for training/validation"""
 result = tokenizer(batch["title"], truncation=True)
 result["label"] = batch["category"]
 return result


# Load the raw datasets:
raw_train_dataset = datasets.load_dataset(
 "csv",
 data_files=os.path.join(local_dir, "train", "train.csv"),
 column_names=["category", "title", "content"],
 split=datasets.Split.ALL,
)
raw_test_dataset = datasets.load_dataset(
 "csv",
 data_files=os.path.join(local_dir, "test", "test.csv"),
 column_names=["category", "title", "content"],
 split=datasets.Split.ALL,
)

# Run the tokenization/pre-processing, keeping only the output fields from preprocess()
train_dataset = raw_train_dataset.map(
 preprocess, batched=True, batch_size=1000, remove_columns=raw_train_dataset.column_names
)
test_dataset = raw_test_dataset.map(
 preprocess, batched=True, batch_size=1000, remove_columns=raw_test_dataset.column_names
)
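
With `batched=True`, the mapped function receives a dictionary of lists (one list per column) rather than one row at a time. The sketch below shows that shape using two invented rows and a whitespace-splitting `toy_tokenizer`, which is a hypothetical stand-in for the real tokenizer.

```python
def toy_tokenizer(texts, max_length=4):
    """Hypothetical stand-in: lowercase, split on whitespace, truncate."""
    return {"tokens": [text.lower().split()[:max_length] for text in texts]}


def preprocess(batch):
    """Same structure as the notebook's preprocess(), using the toy tokenizer."""
    result = toy_tokenizer(batch["title"])
    result["label"] = batch["category"]
    return result


# A batch as a dict of columns, with made-up rows:
batch = {
    "category": [3, 2],
    "title": ["Stocks rally as merger talks resume today", "Home side wins final"],
    "content": ["...", "..."],
}

out = preprocess(batch)
print(out["tokens"])
print(out["label"])
print(sorted(out.keys()))
```

Only the keys returned by `preprocess()` appear in the output here. In the real pipeline, `remove_columns` is what drops the original `category`, `title` and `content` columns from the mapped dataset.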

With the parameters and pre-processed data loaded, we're ready to train and evaluate the model.

> ⏰ **Note:** On the default `ml.t3.medium` (2 vCPU + 4 GiB RAM) Studio instance type, this process should take about 20 minutes to complete.
>
> While waiting, you can move on to the [SageMaker notebook](Headline%20Classifier%20SageMaker.ipynb) to start exploring how this process would be different when migrated to a SageMaker training job.

In [None]:
%%time

# create Trainer instance
trainer = transformers.Trainer(
 model=model,
 args=training_args,
 compute_metrics=compute_metrics,
 train_dataset=train_dataset,
 eval_dataset=test_dataset,
 tokenizer=tokenizer,
 data_collator=data_collator,
)

# train model
trainer.train()

# evaluate model
eval_result = trainer.evaluate(eval_dataset=test_dataset)

As you'll see from the metrics, the model trained here is likely not very accurate, and its accuracy was probably still increasing rapidly when training was stopped.

## Use the model for inference

Once the model is trained, we're ready to use it for inference on new data.

Here, the model is already loaded in memory from the training process so we can wrap it with a [Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) for easy usage.

The cell below creates an interactive widget where you can enter your own news headline and have the model classify it in real time:

In [None]:
pipe = transformers.pipeline(
 task="text-classification",
 model=model,
 tokenizer=tokenizer,
)


def classify(text: str) -> None:
 """Classify a headline and print the top result"""
 print(pipe(text)[0])


# Either try out the interactive widget:
interaction = widgets.interact_manual(
 classify,
 text=widgets.Text(
 value="The markets were bullish after news of the merger",
 placeholder="Type a news headline...",
 description="Headline:",
 layout=widgets.Layout(width="99%"),
 ),
)
interaction.widget.children[1].description = "Classify!"

Alternatively, you can call the pipeline directly from code:

In [None]:
classify("Retailers are expanding after the recent economic growth")
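
Each result printed above is a dictionary with a `label` and a `score`. As a rough sketch of where those come from, the snippet below turns raw model outputs (logits) into probabilities with a softmax and reports the most likely class. It assumes the pipeline's default behaviour for a multi-class model, and the logits are invented for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


idx2label = {0: "Other", 1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

# Hypothetical logits for one headline, one value per class:
logits = [-2.0, 0.1, -0.5, 2.3, 0.4]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": idx2label[best], "score": probs[best]}
print(result)
```

The `score` is the model's probability for the winning class only, so a low score suggests the headline sits between several topics.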

## Review

In this notebook, we showed how you might train a text classification model using Hugging Face transformers locally in a plain Jupyter environment.

Because the default notebook compute infrastructure (`ml.t3.medium`) was quite small, training took a long time and we had to stop it early for a chance to explore the results.

- You could extend the training epochs/steps cut-offs to train a better model, but then the process would take even longer
- You could toggle your Studio notebook to a higher-resource instance (perhaps even using a GPU), but then that extra resource might sit idle when you're not actually training the model, such as during data exploration or evaluation
- You would also need to manually record your experiments to keep track of the different parameters you tried in the training process

Next, head on over to the [SageMaker notebook](Headline%20Classifier%20SageMaker.ipynb) where we'll show how SageMaker training jobs and endpoint deployments can help you take advantage of on-demand compute for faster training and automatic metadata tracking - while only paying for what you need.