# End-to-End NLP: News Headline Classifier (Local Version)

_**Train a PyTorch-based model to classify news headlines into four categories**_

This notebook works well with the `Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)` kernel on SageMaker Studio, or `conda_pytorch_p38` on classic SageMaker Notebook Instances.

---

In this version, the model is trained and evaluated here on the notebook instance itself. We'll show in the follow-on notebook how to take advantage of Amazon SageMaker to separate these infrastructure needs.

Note that you can safely ignore the WARNING about the pip version.


In [None]:
# First install some libraries which might not be available across all kernels (e.g. in Studio):
!pip install "ipywidgets<8" torchtext==0.6

### Download the AG News Dataset

We will download the **fast.ai AG News** dataset from the [Registry of Open Data on AWS](https://registry.opendata.aws/fast-ai-nlp/) public repository. This dataset contains a table of news headlines and their corresponding classes.


In [None]:
%%time
local_dir = "data"
# Download the AG News data from the Registry of Open Data on AWS.
!mkdir -p {local_dir}
!aws s3 cp s3://fast-ai-nlp/ag_news_csv.tgz {local_dir} --no-sign-request

# Un-tar the AG News data.
!tar zxf {local_dir}/ag_news_csv.tgz -C {local_dir}/ --strip-components=1 --no-same-owner
print("Done!")

### Let's visualize the dataset

We will load the extracted `train.csv` file (in the `data` folder) into a Pandas DataFrame for our data processing work.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import re

import numpy as np
import pandas as pd
import util.preprocessing

In [None]:
column_names = ["CATEGORY", "TITLE", "CONTENT"]
# we use the train.csv only
df = pd.read_csv(f"{local_dir}/train.csv", names=column_names, header=None, delimiter=",")
# shuffle the DataFrame rows
df = df.sample(frac=1, random_state=1337)
# make the category classes more readable
mapping = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
df = df.replace({"CATEGORY": mapping})
df.head()

For this exercise we'll **only use**:

- The **title** (Headline) of the news story, as our input
- The **category**, as our target variable


In [None]:
df["CATEGORY"].value_counts()

The dataset has **four article categories**, each equally represented:

- Business
- Sci/Tech
- Sports
- World


## Natural Language Pre-Processing

We'll do some basic processing of the text data to convert it into a numerical form that the model can consume.

This involves typical pre-processing steps for NLP workloads: dummy encoding the labels, tokenizing the documents, and padding the documents to a fixed sequence length so that every input vector has the same dimension.


### Dummy Encode the Labels


In [None]:
encoded_y, labels = util.preprocessing.dummy_encode_labels(df, "CATEGORY")
print(labels)
print(encoded_y)

For example, looking at the first record in our (shuffled) dataframe:

In [None]:
df["CATEGORY"].iloc[0]

In [None]:
encoded_y[0]
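
To make the encoding concrete, here is a minimal plain-Python sketch of dummy (one-hot) encoding. It is not the implementation of `util.preprocessing.dummy_encode_labels`, whose internals are not shown in this notebook: the function name `one_hot_encode`, the `example_` variables, and the alphabetical label ordering are all assumptions made for illustration.

```python
def one_hot_encode(categories):
    """Return (encoded rows, label list) for a list of category strings."""
    labels = sorted(set(categories))  # assumed ordering: alphabetical
    index = {label: i for i, label in enumerate(labels)}
    encoded = []
    for category in categories:
        row = [0.0] * len(labels)
        row[index[category]] = 1.0  # exactly one "hot" position per record
        encoded.append(row)
    return encoded, labels


example_y, example_labels = one_hot_encode(
    ["World", "Sports", "Business", "Sci/Tech", "Sports"]
)
print(example_labels)
print(example_y[0])  # the row for "World"
```

Each row contains a single 1.0 in the column of its category, which is why the position of the largest value in `encoded_y[0]` identifies the class of the first record.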

### Tokenize and Set Fixed Sequence Lengths

We want to describe our inputs at the more meaningful word level (rather than individual characters), and ensure a fixed length of the input feature dimension.


In [None]:
processed_docs, tokenizer = util.preprocessing.tokenize_and_pad_docs(df, "TITLE")

In [None]:
df["TITLE"].iloc[0]

In [None]:
processed_docs[0]
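
As a rough picture of what tokenizing and padding does, the sketch below turns headlines into fixed-length lists of integer word IDs. It is only an illustration, not the code behind `util.preprocessing.tokenize_and_pad_docs`: the function `simple_tokenize_and_pad`, the `<pad>`/`<unk>` tokens, their IDs, and the `max_len=8` value are all hypothetical choices.

```python
import re


def simple_tokenize_and_pad(docs, max_len=8):
    """Lower-case, split into word tokens, map to integer IDs, then pad/truncate."""
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in docs]
    # Reserve ID 0 for padding and ID 1 for unknown words (assumed convention).
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in tokenized:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    padded = []
    for tokens in tokenized:
        ids = [vocab[t] for t in tokens][:max_len]  # truncate long headlines
        ids += [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones
        padded.append(ids)
    return padded, vocab


example_docs, example_vocab = simple_tokenize_and_pad(
    ["Stocks rally as oil prices fall", "Oil prices fall"], max_len=8
)
print(example_docs)
```

Both headlines come out with the same length, and the words they share ("oil", "prices", "fall") map to the same IDs in each.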

### Import Word Embeddings

To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary. In this case, we'll use [pre-trained word embeddings from FastText](https://fasttext.cc/docs/en/crawl-vectors.html), which are also available for a broad range of languages other than English.

You could also explore training custom, domain-specific word embeddings using SageMaker's built-in [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html). See the official [blazingtext_word2vec_text8 sample](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/blazingtext_word2vec_text8) for an example notebook showing how.


In [None]:
%%time
embedding_matrix = util.preprocessing.get_word_embeddings(tokenizer, f"{local_dir}/embeddings")

In [None]:
np.save(
    file=f"{local_dir}/embeddings/docs-embedding-matrix",
    arr=embedding_matrix,
    allow_pickle=False,
)
vocab_size = embedding_matrix.shape[0]
print(embedding_matrix.shape)
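
Conceptually, an embedding matrix has one row per vocabulary entry and one column per embedding dimension, with row `i` holding the pre-trained vector for the word whose ID is `i`. The sketch below shows that idea with plain Python lists. It is an assumption-laden illustration rather than what `util.preprocessing.get_word_embeddings` does: `build_embedding_matrix`, the toy vocabulary, the made-up vectors, and the choice to leave unknown words as zeros are all hypothetical.

```python
def build_embedding_matrix(vocab, vectors, emb_dim):
    """Row i holds the vector for the word with ID i; words without a vector stay all-zero."""
    matrix = [[0.0] * emb_dim for _ in range(len(vocab))]
    for word, idx in vocab.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix


toy_vocab = {"<pad>": 0, "oil": 1, "prices": 2, "zzyzx": 3}
toy_vectors = {"oil": [0.1, 0.2, 0.3], "prices": [0.4, 0.5, 0.6]}  # made-up values
toy_matrix = build_embedding_matrix(toy_vocab, toy_vectors, emb_dim=3)
# Rows = vocabulary size, columns = embedding dimension
print(len(toy_matrix), len(toy_matrix[0]))
```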

### Split Train and Test Sets

Finally, we need to divide our data into model training and evaluation sets:


In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    processed_docs, encoded_y, test_size=0.2, random_state=42
)

In [None]:
# Do you always remember to save your datasets for traceability when experimenting locally? ;-)
os.makedirs(f"{local_dir}/train", exist_ok=True)
np.save(f"{local_dir}/train/train_X.npy", x_train)
np.save(f"{local_dir}/train/train_Y.npy", y_train)
os.makedirs(f"{local_dir}/test", exist_ok=True)
np.save(f"{local_dir}/test/test_X.npy", x_test)
np.save(f"{local_dir}/test/test_Y.npy", y_test)

## Define the Model


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

seed = 42
np.random.seed(seed)
torch.manual_seed(seed)  # also seed PyTorch (weight init, dropout, DataLoader shuffling)
num_classes = len(labels)

In [None]:
class Net(nn.Module):
    def __init__(self, vocab_size=400000, emb_dim=300, num_classes=4):
        super(Net, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv1 = nn.Conv1d(emb_dim, 128, kernel_size=3)
        self.max_pool1d = nn.MaxPool1d(5)
        self.flatten1 = nn.Flatten()
        self.dropout1 = nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(896, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.embedding(x)
        x = torch.transpose(x, 1, 2)
        x = self.flatten1(self.max_pool1d(self.conv1(x)))
        x = self.dropout1(x)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=-1)

## Define Train and Helper Functions


In [None]:
def test(model, test_loader, device):
    model.eval()
    test_loss = 0.0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.binary_cross_entropy(output, target, reduction="sum").item()
            pred = output.max(1, keepdim=True)[1]  # get the index of the max probability
            target_index = target.max(1, keepdim=True)[1]
            correct += pred.eq(target_index).sum().item()

    test_loss /= len(test_loader.dataset)  # Average loss over dataset samples
    print(f"val_loss: {test_loss:.4f}, val_acc: {correct/len(test_loader.dataset):.4f}")


def train(
    train_loader, test_loader, embedding_matrix, num_classes=4, epochs=12, learning_rate=0.001
):
    ###### Setup model architecture ############
    model = Net(
        vocab_size=embedding_matrix.shape[0],
        emb_dim=embedding_matrix.shape[1],
        num_classes=num_classes,
    )
    model.embedding.weight = torch.nn.parameter.Parameter(
        torch.FloatTensor(embedding_matrix), False
    )
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

    for epoch in range(1, epochs + 1):
        model.train()
        running_loss = 0.0
        n_batches = 0
        for batch_idx, (X_train, y_train) in enumerate(train_loader, 1):
            data, target = X_train.to(device), y_train.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.binary_cross_entropy(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            n_batches += 1
        print(f"epoch: {epoch}, train_loss: {running_loss / n_batches:.6f}")  # (Avg over batches)
        print("Evaluating model")
        test(model, test_loader, device)
    return model
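
One detail worth knowing when reading the printed metrics: `train()` uses the default `mean` reduction of `F.binary_cross_entropy`, which averages over every element (samples × classes), while `test()` sums over all elements and divides by the number of samples. The plain-Python sketch below illustrates the resulting difference in scale. The probabilities and targets are made-up values, and the `binary_cross_entropy` helper is a simplified stand-in (it does not clamp the logarithm the way PyTorch does, so probabilities must lie strictly between 0 and 1).

```python
import math


def binary_cross_entropy(probs, targets, reduction="mean"):
    """Element-wise BCE over rows of probabilities and one-hot target rows."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for prob_row, target_row in zip(probs, targets)
        for p, t in zip(prob_row, target_row)
    ]
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total


# Made-up softmax outputs and one-hot targets for a batch of two headlines
batch_probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
batch_targets = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

mean_loss = binary_cross_entropy(batch_probs, batch_targets)  # as in train()
sum_loss = binary_cross_entropy(batch_probs, batch_targets, reduction="sum")
per_sample_loss = sum_loss / len(batch_probs)  # as in test()
print(per_sample_loss / mean_loss)  # ratio equals the number of classes
```

So on identical predictions, `val_loss` would be reported on a scale `num_classes` times larger than `train_loss`. Compare trends across epochs rather than the absolute values of the two.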

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, data, labels):
        """Initialization"""
        self.labels = labels
        self.data = data

    def __len__(self):
        """Denotes the total number of samples"""
        return len(self.data)

    def __getitem__(self, index):
        # Load data and get label
        X = torch.as_tensor(self.data[index]).long()
        y = torch.as_tensor(self.labels[index])
        return X, y

## Fit (Train) and Evaluate the Model


In [None]:
%%time
# fit the model here in the notebook:
epochs = 5
learning_rate = 0.001
model_dir = "model/"
trainloader = torch.utils.data.DataLoader(Dataset(x_train, y_train), batch_size=16, shuffle=True)
testloader = torch.utils.data.DataLoader(Dataset(x_test, y_test), batch_size=32, shuffle=False)

print("Training model")
model = train(
    trainloader,
    testloader,
    embedding_matrix,
    num_classes=num_classes,
    epochs=epochs,
    learning_rate=learning_rate,
)

## Use the Model (Locally)

Let's evaluate our model with some example headlines...

If the widget doesn't work for you, you can always call the `classify()` function directly from Python. Feel free to be creative with your headlines!


In [None]:
import ipywidgets as widgets
from IPython import display


def classify(text):
    """Classify a headline and print the results"""
    processed = tokenizer.preprocess(text)
    padded = tokenizer.pad([processed])
    final_text = []
    for w in padded[0]:
        final_text.append(tokenizer.vocab.stoi[w])
    final_text = torch.tensor([final_text])
    model.cpu()
    model.eval()
    with torch.no_grad():
        result = model(final_text)
    print(result)
    ix = np.argmax(result.detach())
    print(f"Predicted class: '{labels[ix]}' with confidence {result[0][ix]:.2%}")


# Either try out the interactive widget:
interaction = widgets.interact_manual(
    classify,
    text=widgets.Text(
        value="The markets were bullish after news of the merger",
        placeholder="Type a news headline...",
        description="Headline:",
        layout=widgets.Layout(width="99%"),
    ),
)
interaction.widget.children[1].description = "Classify!"

In [None]:
# Or just use the function to classify your own headline:
classify("Retailers are expanding after the recent economic growth")
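
The tensor printed by `classify()` is the output of the model's final softmax, so its four values are non-negative and sum to 1. The sketch below shows, in plain Python, how raw scores become such probabilities and how the predicted class and confidence are read from them. The scores and the label order here are made-up assumptions for illustration only.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


demo_labels = ["Business", "Sci/Tech", "Sports", "World"]  # assumed label order
demo_scores = [2.0, 0.5, -1.0, 0.1]  # made-up raw scores for one headline
demo_probs = softmax(demo_scores)

# The predicted class is the position of the largest probability
best = max(range(len(demo_probs)), key=demo_probs.__getitem__)
print(f"Predicted class: '{demo_labels[best]}' with confidence {demo_probs[best]:.2%}")
```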

## Review

In this notebook we pre-processed publicly downloadable data and trained a neural news headline classifier, as a data scientist might normally do when working on a local machine.

But can we use the cloud more effectively to allocate high-performance resources, and easily deploy our trained models for use by other applications?

Head on over to the next notebook, [Headline Classifier SageMaker.ipynb](Headline%20Classifier%20SageMaker.ipynb), where we'll show how the same model can be trained and then deployed on specific target infrastructure with Amazon SageMaker.
