# End-to-End NLP: News Headline Classifier (Local Version)

_**Train a Keras-based model to classify news headlines into four categories**_

This notebook works well with the `Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)` kernel on SageMaker Studio, or `conda_tensorflow2_p37` on classic SageMaker Notebook Instances.


---

In this version, the model is trained and evaluated here on the notebook instance itself. We'll show in the follow-on notebook how to take advantage of Amazon SageMaker to separate these infrastructure needs.

Note that you can safely ignore the WARNING about the pip version.


In [None]:
# First install some libraries which might not be available across all kernels (e.g. in Studio):
!pip install "ipywidgets<8"

### Download the AG News Dataset

We will download the **FastAI AG News** dataset from the [Registry of Open Data on AWS](https://registry.opendata.aws/fast-ai-nlp/) public repository. This dataset contains a table of news headlines and their corresponding classes.


In [None]:
%%time
local_dir = "data"
# Download the AG News data from the Registry of Open Data on AWS.
!mkdir -p {local_dir}
!aws s3 cp s3://fast-ai-nlp/ag_news_csv.tgz {local_dir} --no-sign-request

# Un-tar the AG News data.
!tar zxf {local_dir}/ag_news_csv.tgz -C {local_dir}/ --strip-components=1 --no-same-owner
print("Done!")

### Let's visualize the dataset

We will load the extracted `train.csv` file into a pandas DataFrame for our data processing work.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import re

import numpy as np
import pandas as pd
import util.preprocessing

In [None]:
column_names = ["CATEGORY", "TITLE", "CONTENT"]
# we use the train.csv only
df = pd.read_csv(f"{local_dir}/train.csv", names=column_names, header=None, delimiter=",")
# shuffle the DataFrame rows
df = df.sample(frac=1, random_state=1337)
# make the category classes more readable
mapping = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
df = df.replace({"CATEGORY": mapping})
df.head()

For this exercise we'll **only use**:

- The **title** (Headline) of the news story, as our input
- The **category**, as our target variable


In [None]:
df["CATEGORY"].value_counts()

The dataset has **four article categories** with equal weighting:

- Business
- Sci/Tech
- Sports
- World


## Natural Language Pre-Processing

We'll do some basic processing of the text data to convert it into numerical form that the algorithm will be able to consume to create a model.

We will apply typical pre-processing steps for NLP workloads: dummy encoding the labels, tokenizing the documents, and padding the documents to a fixed sequence length so that every input vector has the same dimension.


### Dummy Encode the Labels


In [None]:
encoded_y, labels = util.preprocessing.dummy_encode_labels(df, "CATEGORY")
print(labels)
print(encoded_y)

For example, looking at the first record in our (shuffled) dataframe:

In [None]:
df["CATEGORY"].iloc[0]

In [None]:
encoded_y[0]
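
The `dummy_encode_labels` helper lives in the notebook's `util/preprocessing.py` module, which is not shown here. As a rough illustration only, dummy (one-hot) encoding could be sketched with NumPy as below. The function name `one_hot_encode` and the toy category list are hypothetical, and the real helper may order or return its outputs differently.

```python
import numpy as np


def one_hot_encode(categories):
    """Return (one-hot matrix, sorted label list) for a sequence of class names.

    Hypothetical stand-in for illustration; not the notebook's actual helper.
    """
    # np.unique sorts the distinct labels and, with return_inverse=True,
    # also gives each row's index into that sorted list.
    labels, class_ids = np.unique(np.asarray(categories), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i.
    encoded = np.eye(len(labels), dtype=np.float32)[class_ids]
    return encoded, labels.tolist()


encoded, label_names = one_hot_encode(["Sports", "World", "Business", "Sports"])
print(label_names)
print(encoded)
```

Each row contains a single 1 in the column of its class, which is the shape of target a softmax output layer expects.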

### Tokenize and Set Fixed Sequence Lengths

We want to describe our inputs at the more meaningful word level (rather than individual characters), and ensure a fixed length of the input feature dimension.


In [None]:
processed_docs, tokenizer = util.preprocessing.tokenize_and_pad_docs(df, "TITLE")

In [None]:
df["TITLE"].iloc[0]

In [None]:
processed_docs[0]
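
To make the idea concrete, here is a simplified pure-Python sketch of word-level tokenization followed by padding to a fixed length. It is not the implementation behind `tokenize_and_pad_docs` (that code is not shown in this notebook); the function names `fit_vocabulary` and `encode_and_pad` are hypothetical, and details such as punctuation filtering and truncation direction are deliberately simplified compared to the Keras utilities.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map each word to an integer ID, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def encode_and_pad(text, vocab, max_length=40):
    """Convert a headline to word IDs, then truncate or right-pad with 0 to max_length."""
    # Words that were never seen when fitting the vocabulary are simply dropped.
    ids = [vocab[word] for word in text.lower().split() if word in vocab]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))


headlines = ["Stocks rally as markets open", "Markets slide as oil prices jump"]
vocab = fit_vocabulary(headlines)
print(encode_and_pad("Markets rally", vocab, max_length=5))
```

Every headline ends up as a list of exactly `max_length` integers, so a batch of them forms a rectangular array that the embedding layer can consume.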

### Import Word Embeddings

To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary. In this case we'll be using [pre-trained word embeddings from FastText](https://fasttext.cc/docs/en/crawl-vectors.html), which are also available for a broad range of languages other than English.

You could also explore training custom, domain-specific word embeddings using SageMaker's built-in [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html). See the official [blazingtext_word2vec_text8 sample](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/blazingtext_word2vec_text8) for an example notebook showing how.


In [None]:
%%time
embedding_matrix = util.preprocessing.get_word_embeddings(tokenizer, f"{local_dir}/embeddings")

In [None]:
np.save(
    file=f"{local_dir}/embeddings/docs-embedding-matrix",
    arr=embedding_matrix,
    allow_pickle=False,
)
vocab_size = embedding_matrix.shape[0]
print(embedding_matrix.shape)
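
For intuition, an embedding matrix of this kind can be thought of as one row per word ID, filled from a lookup of pre-trained vectors. The sketch below assumes a plain `dict` of word vectors and a word-to-ID mapping; `build_embedding_matrix` and the toy data are hypothetical, and the notebook's own `get_word_embeddings` helper (not shown) may handle downloads, unknown words, and dimensions differently.

```python
import numpy as np


def build_embedding_matrix(word_index, word_vectors, dim):
    """Stack one vector per vocabulary word; row 0 (padding) and unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vector = word_vectors.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix


# Toy 3-dimensional vectors; real FastText vectors are much larger.
toy_vectors = {"markets": [0.1, 0.2, 0.3], "rally": [0.4, 0.5, 0.6]}
toy_index = {"markets": 1, "rally": 2, "zzyzx": 3}
matrix = build_embedding_matrix(toy_index, toy_vectors, dim=3)
print(matrix.shape)
```

The first dimension is the vocabulary size plus one (for the padding ID), and the second is the word vector dimension; these correspond to the two sizes passed to the `Embedding` layer later on.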

### Split Train and Test Sets

Finally we need to divide our data into model training and evaluation sets:


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    processed_docs,
    encoded_y,
    test_size=0.2,
    random_state=42,
)

In [None]:
# Do you always remember to save your datasets for traceability when experimenting locally? ;-)
os.makedirs(f"{local_dir}/train", exist_ok=True)
np.save(f"{local_dir}/train/train_X.npy", X_train)
np.save(f"{local_dir}/train/train_Y.npy", y_train)
os.makedirs(f"{local_dir}/test", exist_ok=True)
np.save(f"{local_dir}/test/test_X.npy", X_test)
np.save(f"{local_dir}/test/test_Y.npy", y_test)
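
If you later want to reload arrays saved this way, `np.load` reads them back. The small round trip below uses a temporary directory and a made-up array purely for illustration. It also shows that `np.save` appends `.npy` when the file name has no extension, which is why a name such as `docs-embedding-matrix` ends up on disk as `docs-embedding-matrix.npy`.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    original = np.arange(12, dtype=np.int32).reshape(3, 4)
    # np.save appends ".npy" when the target name lacks that extension.
    np.save(os.path.join(tmp_dir, "example-matrix"), original, allow_pickle=False)
    saved_files = sorted(os.listdir(tmp_dir))
    restored = np.load(os.path.join(tmp_dir, "example-matrix.npy"), allow_pickle=False)

print(saved_files)
print(np.array_equal(original, restored))
```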

## Define the Model


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Embedding, Flatten, MaxPooling1D
from tensorflow.keras.models import Sequential

seed = 42
# Seed both NumPy and TensorFlow so that runs are more repeatable
np.random.seed(seed)
tf.random.set_seed(seed)
num_classes = len(labels)

In [None]:
model = Sequential()
model.add(
    Embedding(
        embedding_matrix.shape[0],  # Final vocabulary size
        embedding_matrix.shape[1],  # Word vector dimensions
        weights=[embedding_matrix],
        input_length=40,
        trainable=False,
        name="embed",
    )
)
model.add(Conv1D(filters=128, kernel_size=3, activation="relu", name="conv_1"))
model.add(MaxPooling1D(pool_size=5, name="maxpool_1"))
model.add(Flatten(name="flat_1"))
model.add(Dropout(0.3, name="dropout_1"))
model.add(Dense(128, activation="relu", name="dense_1"))
model.add(Dense(num_classes, activation="softmax", name="out_1"))

# Compile the model
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
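# The targets are one-hot vectors and the output layer is a softmax,
# so categorical cross-entropy is the matching loss:
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["acc"])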

model.summary()
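
The output shapes reported by `model.summary()` can be checked by hand. The sketch below works through the arithmetic for this architecture, assuming the Keras defaults for `Conv1D` (stride 1, `"valid"` padding) and `MaxPooling1D` (stride equal to the pool size, `"valid"` padding). The helper function names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (input_length - kernel_size) // stride + 1


def maxpool1d_output_length(input_length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride equal to pool_size."""
    return (input_length - pool_size) // pool_size + 1


sequence_length = 40  # matches input_length in the Embedding layer
filters = 128  # number of Conv1D filters

after_conv = conv1d_output_length(sequence_length, kernel_size=3)
after_pool = maxpool1d_output_length(after_conv, pool_size=5)
flattened = after_pool * filters
dense_params = flattened * 128 + 128  # weights plus biases of dense_1

print(after_conv, after_pool, flattened, dense_params)
```

Under those assumptions the sequence shrinks from 40 to 38 positions after the convolution and to 7 after pooling, so the flattened vector feeding `dense_1` has 896 values.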

## Fit (Train) and Evaluate the Model


In [None]:
%%time
# fit the model here in the notebook:
print("Training model")
model.fit(X_train, y_train, batch_size=16, epochs=5, verbose=1)
print("Evaluating model")
# TODO: Better differentiate train vs val loss in logs
scores = model.evaluate(X_test, y_test, verbose=2)
print(
    "Validation results: "
    + "; ".join(f"{name}={score:.5f}" for name, score in zip(model.metrics_names, scores))
)
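
To see what an accuracy figure means for one-hot targets and softmax outputs, it can be recomputed by hand from predicted probabilities. The sketch below uses made-up arrays rather than the real model; `categorical_accuracy` here is a hypothetical NumPy helper written for illustration, not a call into Keras.

```python
import numpy as np


def categorical_accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the highest-probability class matches the one-hot label."""
    true_ids = np.argmax(y_true_one_hot, axis=1)
    predicted_ids = np.argmax(y_prob, axis=1)
    return float(np.mean(true_ids == predicted_ids))


y_true = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
y_prob = np.array(
    [
        [0.7, 0.1, 0.1, 0.1],  # correct
        [0.2, 0.5, 0.2, 0.1],  # correct
        [0.1, 0.1, 0.6, 0.2],  # wrong: predicts class 2, label is class 3
    ]
)
print(f"accuracy={categorical_accuracy(y_true, y_prob):.3f}")
```

With the real model, the same function could in principle be applied to `y_test` and the output of `model.predict(X_test)`.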

## Use the Model (Locally)

Let's evaluate our model with some example headlines...

If you struggle with the widget, you can always simply call the `classify()` function from Python. You can be creative with your headlines!


In [None]:
import ipywidgets as widgets
from IPython import display
from tensorflow.keras.preprocessing.sequence import pad_sequences


def classify(text):
    """Classify a headline and print the results"""
    encoded_example = tokenizer.texts_to_sequences([text])
    # Pad documents to a max length of 40 words
    max_length = 40
    padded_example = pad_sequences(encoded_example, maxlen=max_length, padding="post")
    result = model.predict(padded_example)
    print(result)
    ix = np.argmax(result)
    print(f"Predicted class: '{labels[ix]}' with confidence {result[0][ix]:.2%}")


interaction = widgets.interact_manual(
    classify,
    text=widgets.Text(
        value="The markets were bullish after news of the merger",
        placeholder="Type a news headline...",
        description="Headline:",
        layout=widgets.Layout(width="99%"),
    ),
)
interaction.widget.children[1].description = "Classify!"

In [None]:
# Or just use the function to classify your own headline:
classify("Retailers are expanding after the recent economic growth")
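
The raw probability vector printed by `classify()` can be easier to read when ranked. As an optional sketch, the hypothetical helper below lists the most likely classes for a single prediction. The label order and probabilities are made up for illustration and are not output from the trained model.

```python
import numpy as np


def top_k_classes(probabilities, class_names, k=2):
    """Return the k most likely (class name, probability) pairs, best first."""
    probabilities = np.asarray(probabilities)
    # argsort is ascending, so reverse it and keep the first k indices.
    best = np.argsort(probabilities)[::-1][:k]
    return [(class_names[i], float(probabilities[i])) for i in best]


# Illustrative values only; the real label order comes from the encoding step.
example_labels = ["Business", "Sci/Tech", "Sports", "World"]
example_probs = [0.62, 0.25, 0.03, 0.10]
for name, prob in top_k_classes(example_probs, example_labels):
    print(f"{name}: {prob:.2%}")
```

In the notebook, the equivalent inputs would be `result[0]` from `model.predict` and the `labels` list.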

## Review

In this notebook we pre-processed publicly downloadable data and trained a neural news headline classifier model, as a data scientist might normally do when working on a local machine.

...But can we use the cloud more effectively to allocate high-performance resources, and easily deploy our trained models for use by other applications?

Head on over to the next notebook, [Headline Classifier SageMaker.ipynb](Headline%20Classifier%20SageMaker.ipynb), where we'll show how the same model can be trained and then deployed on specific target infrastructure with Amazon SageMaker.
