{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# End-to-End NLP: News Headline Classifier (Local Version)\n", "\n", "_**Train a Keras-based model to classify news headlines between four domains**_\n", "\n", "This notebook works well with the `Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)` kernel on SageMaker Studio, or `conda_tensorflow2_p37` on classic SageMaker Notebook Instances.\n", "\n", "\n", "---\n", "\n", "In this version, the model is trained and evaluated here on the notebook instance itself. We'll show in the follow-on notebook how to take advantage of Amazon SageMaker to separate these infrastructure needs.\n", "\n", "Note that you can safely ignore the WARNING about the pip version.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# First install some libraries which might not be available across all kernels (e.g. in Studio):\n", "!pip install \"ipywidgets<8\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download News Aggregator Dataset\n", "\n", "We will download **FastAI AG News** dataset from the [Registry of Open Data on AWS](https://registry.opendata.aws/fast-ai-nlp/) public repository. This dataset contains a table of news headlines and their corresponding classes.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "local_dir = \"data\"\n", "# Download the AG News data from the Registry of Open Data on AWS.\n", "!mkdir -p {local_dir}\n", "!aws s3 cp s3://fast-ai-nlp/ag_news_csv.tgz {local_dir} --no-sign-request\n", "\n", "# Un-tar the AG News data.\n", "!tar zxf {local_dir}/ag_news_csv.tgz -C {local_dir}/ --strip-components=1 --no-same-owner\n", "print(\"Done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's visualize the dataset\n", "\n", "We will load the ag_news_csv/train.csv file to a Pandas dataframe for our data processing work." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import os\n", "import re\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import util.preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column_names = [\"CATEGORY\", \"TITLE\", \"CONTENT\"]\n", "# we use the train.csv only\n", "df = pd.read_csv(f\"{local_dir}/train.csv\", names=column_names, header=None, delimiter=\",\")\n", "# shuffle the DataFrame rows\n", "df = df.sample(frac=1, random_state=1337)\n", "# make the category classes more readable\n", "mapping = {1: \"World\", 2: \"Sports\", 3: \"Business\", 4: \"Sci/Tech\"}\n", "df = df.replace({\"CATEGORY\": mapping})\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise we'll **only use**:\n", "\n", "- The **title** (Headline) of the news story, as our input\n", "- The **category**, as our target variable\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"CATEGORY\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset has **four article categories** with equal weighting:\n", "\n", "- Business\n", "- Sci/Tech\n", "- Sports\n", "- World\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Pre-Processing\n", "\n", "We'll do some basic processing of the text data to convert it into numerical form that the algorithm will be able to consume to create a model.\n", "\n", "We will do typical pre processing for NLP workloads such as: dummy encoding the labels, tokenizing the documents and set fixed sequence lengths for input feature dimension, padding documents to have fixed length input vectors.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dummy Encode the Labels\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoded_y, labels = util.preprocessing.dummy_encode_labels(df, \"CATEGORY\")\n", "print(labels)\n", "print(encoded_y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, looking at the first record in our (shuffled) dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"CATEGORY\"].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoded_y[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenize and Set Fixed Sequence Lengths\n", "\n", "We want to describe our inputs at the more meaningful word level (rather than individual characters), and ensure a fixed length of the input feature dimension.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "7bcf422f-0e75-4d49-b3b1-12553fcaf4ff", "_uuid": "46b7fc9aef5a519f96a295e980ba15deee781e97" }, "outputs": [], "source": [ "processed_docs, tokenizer = util.preprocessing.tokenize_and_pad_docs(df, \"TITLE\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"TITLE\"].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processed_docs[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Word Embeddings\n", "\n", "To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary: In this 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Import Word Embeddings\n", "\n", "To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary. In this case we'll be using [pre-trained word embeddings from FastText](https://fasttext.cc/docs/en/crawl-vectors.html), which are also available for a broad range of languages other than English.\n", "\n", "You could also explore training custom, domain-specific word embeddings using SageMaker's built-in [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html). See the official [blazingtext_word2vec_text8 sample](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/blazingtext_word2vec_text8) for an example notebook showing how.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "embedding_matrix = util.preprocessing.get_word_embeddings(tokenizer, f\"{local_dir}/embeddings\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.save(\n", "    file=f\"{local_dir}/embeddings/docs-embedding-matrix\",\n", "    arr=embedding_matrix,\n", "    allow_pickle=False,\n", ")\n", "vocab_size = embedding_matrix.shape[0]\n", "print(embedding_matrix.shape)" ] },
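{ "cell_type": "markdown", "metadata": {}, "source": [ "Here too, `util.preprocessing.get_word_embeddings` hides the details: conceptually, it fetches pre-trained FastText vectors and builds a matrix with one row per token ID in our tokenizer's vocabulary. A simplified sketch of that idea is below; it assumes the vectors are already on disk in FastText's plain-text `.vec` format, and it is not the helper's actual code.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: build an embedding matrix from a FastText-style .vec\n", "# file (first line is a header; every other line is a word then its vector).\n", "# Assumes the file is already downloaded; NOT the actual helper implementation.\n", "def sketch_get_word_embeddings(tokenizer, vec_path, n_dims=300):\n", "    vectors = {}\n", "    with open(vec_path, encoding=\"utf-8\", errors=\"ignore\") as f:\n", "        next(f)  # Skip the 'num_words num_dims' header line\n", "        for line in f:\n", "            parts = line.rstrip().split(\" \")\n", "            vectors[parts[0]] = np.asarray(parts[1:], dtype=\"float32\")\n", "    # Row 0 is reserved for padding; words without a pre-trained vector stay all-zero:\n", "    matrix = np.zeros((len(tokenizer.word_index) + 1, n_dims))\n", "    for word, ix in tokenizer.word_index.items():\n", "        if word in vectors:\n", "            matrix[ix] = vectors[word]\n", "    return matrix" ] },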
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Split Train and Test Sets\n", "\n", "Finally, we need to divide our data into model training and evaluation sets:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    processed_docs,\n", "    encoded_y,\n", "    test_size=0.2,\n", "    random_state=42,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Do you always remember to save your datasets for traceability when experimenting locally? ;-)\n", "os.makedirs(f\"{local_dir}/train\", exist_ok=True)\n", "np.save(f\"{local_dir}/train/train_X.npy\", X_train)\n", "np.save(f\"{local_dir}/train/train_Y.npy\", y_train)\n", "os.makedirs(f\"{local_dir}/test\", exist_ok=True)\n", "np.save(f\"{local_dir}/test/test_X.npy\", X_test)\n", "np.save(f\"{local_dir}/test/test_Y.npy\", y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the Model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "from tensorflow.keras.layers import Conv1D, Dense, Dropout, Embedding, Flatten, MaxPooling1D\n", "from tensorflow.keras.models import Sequential\n", "\n", "seed = 42\n", "np.random.seed(seed)\n", "num_classes = len(labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = Sequential()\n", "model.add(\n", "    Embedding(\n", "        embedding_matrix.shape[0],  # Final vocabulary size\n", "        embedding_matrix.shape[1],  # Word vector dimensions\n", "        weights=[embedding_matrix],\n", "        input_length=40,\n", "        trainable=False,\n", "        name=\"embed\",\n", "    )\n", ")\n", "model.add(Conv1D(filters=128, kernel_size=3, activation=\"relu\", name=\"conv_1\"))\n", "model.add(MaxPooling1D(pool_size=5, name=\"maxpool_1\"))\n", "model.add(Flatten(name=\"flat_1\"))\n", "model.add(Dropout(0.3, name=\"dropout_1\"))\n", "model.add(Dense(128, activation=\"relu\", name=\"dense_1\"))\n", "model.add(Dense(num_classes, activation=\"softmax\", name=\"out_1\"))\n", "\n", "# Compile with a multi-class loss, to match the one-hot labels and softmax output\n", "optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)\n", "model.compile(optimizer=optimizer, loss=\"categorical_crossentropy\", metrics=[\"acc\"])\n", "\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit (Train) and Evaluate the Model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Fit the model here in the notebook:\n", "print(\"Training model\")\n", "model.fit(X_train, y_train, batch_size=16, epochs=5, verbose=1)\n", "print(\"Evaluating model\")\n", "# TODO: Better differentiate train vs val loss in logs\n", "scores = model.evaluate(X_test, y_test, verbose=2)\n", "print(\n", "    \"Test results: \"\n", "    + \"; \".join(\n", "        map(lambda i: f\"{model.metrics_names[i]}={scores[i]:.5f}\", range(len(model.metrics_names)))\n", "    )\n", ")" ] },
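{ "cell_type": "markdown", "metadata": {}, "source": [ "One simple way to address the TODO above (a suggestion, not part of the original flow) is to pass `validation_split` to `model.fit()`, so Keras holds out a fraction of the training data and logs `val_loss` / `val_acc` alongside the training metrics each epoch:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: log validation metrics during training. Note this continues\n", "# training the already-fitted model above, and the 10% fraction is an\n", "# arbitrary choice for illustration.\n", "model.fit(X_train, y_train, batch_size=16, epochs=1, verbose=1, validation_split=0.1)" ] },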
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Use the Model (Locally)\n", "\n", "Let's evaluate our model with some example headlines...\n", "\n", "If you struggle with the widget, you can always simply call the `classify()` function from Python. You can be creative with your headlines!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ipywidgets as widgets\n", "from IPython import display\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "\n", "\n", "def classify(text):\n", "    \"\"\"Classify a headline and print the results\"\"\"\n", "    encoded_example = tokenizer.texts_to_sequences([text])\n", "    # Pad documents to a max length of 40 words\n", "    max_length = 40\n", "    padded_example = pad_sequences(encoded_example, maxlen=max_length, padding=\"post\")\n", "    result = model.predict(padded_example)\n", "    print(result)\n", "    ix = np.argmax(result)\n", "    print(f\"Predicted class: '{labels[ix]}' with confidence {result[0][ix]:.2%}\")\n", "\n", "\n", "interaction = widgets.interact_manual(\n", "    classify,\n", "    text=widgets.Text(\n", "        value=\"The markets were bullish after news of the merger\",\n", "        placeholder=\"Type a news headline...\",\n", "        description=\"Headline:\",\n", "        layout=widgets.Layout(width=\"99%\"),\n", "    ),\n", ")\n", "interaction.widget.children[1].description = \"Classify!\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Or just use the function to classify your own headline:\n", "classify(\"Retailers are expanding after the recent economic growth\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "In this notebook we pre-processed publicly downloadable data and trained a neural news headline classifier model, much as a data scientist might normally do when working on a local machine.\n", "\n", "...But can we use the cloud more effectively to allocate high-performance resources, and easily deploy our trained models for use by other applications?\n", "\n", "Head on over to the next notebook, [Headline Classifier SageMaker.ipynb](Headline%20Classifier%20SageMaker.ipynb), where we'll show how the same model can be trained and then deployed on specific target infrastructure with Amazon SageMaker.\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-southeast-1:492261229750:image/tensorflow-2.3-cpu-py37-ubuntu18.04-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }