{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# End-to-End NLP: News Headline Classifier (Local Version)\n", "\n", "_**Train a Keras-based model to classify news headlines between four domains**_\n", "\n", "This notebook works well with the `Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)` kernel on SageMaker Studio, or `conda_tensorflow2_p37` on classic SageMaker Notebook Instances.\n", "\n", "\n", "---\n", "\n", "In this version, the model is trained and evaluated here on the notebook instance itself. We'll show in the follow-on notebook how to take advantage of Amazon SageMaker to separate these infrastructure needs.\n", "\n", "Note that you can safely ignore the WARNING about the pip version.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# First install some libraries which might not be available across all kernels (e.g. in Studio):\n", "!pip install \"ipywidgets<8\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download News Aggregator Dataset\n", "\n", "We will download **FastAI AG News** dataset from the [Registry of Open Data on AWS](https://registry.opendata.aws/fast-ai-nlp/) public repository. This dataset contains a table of news headlines and their corresponding classes.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "local_dir = \"data\"\n", "# Download the AG News data from the Registry of Open Data on AWS.\n", "!mkdir -p {local_dir}\n", "!aws s3 cp s3://fast-ai-nlp/ag_news_csv.tgz {local_dir} --no-sign-request\n", "\n", "# Un-tar the AG News data.\n", "!tar zxf {local_dir}/ag_news_csv.tgz -C {local_dir}/ --strip-components=1 --no-same-owner\n", "print(\"Done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's visualize the dataset\n", "\n", "We will load the ag_news_csv/train.csv file to a Pandas dataframe for our data processing work." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import os\n", "import re\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import util.preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column_names = [\"CATEGORY\", \"TITLE\", \"CONTENT\"]\n", "# we use the train.csv only\n", "df = pd.read_csv(f\"{local_dir}/train.csv\", names=column_names, header=None, delimiter=\",\")\n", "# shuffle the DataFrame rows\n", "df = df.sample(frac=1, random_state=1337)\n", "# make the category classes more readable\n", "mapping = {1: \"World\", 2: \"Sports\", 3: \"Business\", 4: \"Sci/Tech\"}\n", "df = df.replace({\"CATEGORY\": mapping})\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this exercise we'll **only use**:\n", "\n", "- The **title** (Headline) of the news story, as our input\n", "- The **category**, as our target variable\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"CATEGORY\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset has **four article categories** with equal weighting:\n", "\n", "- Business\n", "- Sci/Tech\n", "- Sports\n", "- World\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Pre-Processing\n", "\n", "We'll do some basic processing of the text data to convert it into numerical form that the algorithm will be able to consume to create a model.\n", "\n", "We will do typical pre processing for NLP workloads such as: dummy encoding the labels, tokenizing the documents and set fixed sequence lengths for input feature dimension, padding documents to have fixed length input vectors.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dummy Encode the Labels\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoded_y, labels = util.preprocessing.dummy_encode_labels(df, \"CATEGORY\")\n", "print(labels)\n", "print(encoded_y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, looking at the first record in our (shuffled) dataframe:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"CATEGORY\"].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoded_y[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenize and Set Fixed Sequence Lengths\n", "\n", "We want to describe our inputs at the more meaningful word level (rather than individual characters), and ensure a fixed length of the input feature dimension.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "7bcf422f-0e75-4d49-b3b1-12553fcaf4ff", "_uuid": "46b7fc9aef5a519f96a295e980ba15deee781e97" }, "outputs": [], "source": [ "processed_docs, tokenizer = util.preprocessing.tokenize_and_pad_docs(df, \"TITLE\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"TITLE\"].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processed_docs[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Word Embeddings\n", "\n", "To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary: In this 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Import Word Embeddings\n", "\n", "To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary. In this case we'll be using [pre-trained word embeddings from FastText](https://fasttext.cc/docs/en/crawl-vectors.html), which are also available for a broad range of languages other than English.\n", "\n", "You could also explore training custom, domain-specific word embeddings using SageMaker's built-in [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html). See the official [blazingtext_word2vec_text8 sample](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/blazingtext_word2vec_text8) for an example notebook showing how.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "embedding_matrix = util.preprocessing.get_word_embeddings(tokenizer, f\"{local_dir}/embeddings\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.save(\n", "    file=f\"{local_dir}/embeddings/docs-embedding-matrix\",\n", "    arr=embedding_matrix,\n", "    allow_pickle=False,\n", ")\n", "vocab_size = embedding_matrix.shape[0]\n", "print(embedding_matrix.shape)" ] },
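{ "cell_type": "markdown", "metadata": {}, "source": [ "Here too, `util.preprocessing.get_word_embeddings` hides the details: conceptually, it fetches pre-trained FastText vectors and builds a matrix with one row per token ID in our tokenizer's vocabulary. A simplified sketch of that idea is below; it assumes the vectors are already on disk in FastText's plain-text `.vec` format, and it is not the helper's actual code.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: build an embedding matrix from a FastText-style .vec\n", "# file (first line is a header; every other line is a word then its vector).\n", "# Assumes the file is already downloaded; NOT the actual helper implementation.\n", "def sketch_get_word_embeddings(tokenizer, vec_path, n_dims=300):\n", "    vectors = {}\n", "    with open(vec_path, encoding=\"utf-8\", errors=\"ignore\") as f:\n", "        next(f)  # Skip the 'num_words num_dims' header line\n", "        for line in f:\n", "            parts = line.rstrip().split(\" \")\n", "            vectors[parts[0]] = np.asarray(parts[1:], dtype=\"float32\")\n", "    # Row 0 is reserved for padding; words without a pre-trained vector stay all-zero:\n", "    matrix = np.zeros((len(tokenizer.word_index) + 1, n_dims))\n", "    for word, ix in tokenizer.word_index.items():\n", "        if word in vectors:\n", "            matrix[ix] = vectors[word]\n", "    return matrix" ] },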
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Split Train and Test Sets\n", "\n", "Finally, we need to divide our data into model training and evaluation sets:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    processed_docs,\n", "    encoded_y,\n", "    test_size=0.2,\n", "    random_state=42,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Do you always remember to save your datasets for traceability when experimenting locally? ;-)\n", "os.makedirs(f\"{local_dir}/train\", exist_ok=True)\n", "np.save(f\"{local_dir}/train/train_X.npy\", X_train)\n", "np.save(f\"{local_dir}/train/train_Y.npy\", y_train)\n", "os.makedirs(f\"{local_dir}/test\", exist_ok=True)\n", "np.save(f\"{local_dir}/test/test_X.npy\", X_test)\n", "np.save(f\"{local_dir}/test/test_Y.npy\", y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define the Model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf\n", "from tensorflow.keras.layers import Conv1D, Dense, Dropout, Embedding, Flatten, MaxPooling1D\n", "from tensorflow.keras.models import Sequential\n", "\n", "seed = 42\n", "np.random.seed(seed)\n", "num_classes = len(labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = Sequential()\n", "model.add(\n", "    Embedding(\n", "        embedding_matrix.shape[0],  # Final vocabulary size\n", "        embedding_matrix.shape[1],  # Word vector dimensions\n", "        weights=[embedding_matrix],\n", "        input_length=40,\n", "        trainable=False,\n", "        name=\"embed\",\n", "    )\n", ")\n", "model.add(Conv1D(filters=128, kernel_size=3, activation=\"relu\", name=\"conv_1\"))\n", "model.add(MaxPooling1D(pool_size=5, name=\"maxpool_1\"))\n", "model.add(Flatten(name=\"flat_1\"))\n", "model.add(Dropout(0.3, name=\"dropout_1\"))\n", "model.add(Dense(128, activation=\"relu\", name=\"dense_1\"))\n", "model.add(Dense(num_classes, activation=\"softmax\", name=\"out_1\"))\n", "\n", "# Compile with a multi-class loss, to match the one-hot labels and softmax output\n", "optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)\n", "model.compile(optimizer=optimizer, loss=\"categorical_crossentropy\", metrics=[\"acc\"])\n", "\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit (Train) and Evaluate the Model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Fit the model here in the notebook:\n", "print(\"Training model\")\n", "model.fit(X_train, y_train, batch_size=16, epochs=5, verbose=1)\n", "print(\"Evaluating model\")\n", "# TODO: Better differentiate train vs val loss in logs\n", "scores = model.evaluate(X_test, y_test, verbose=2)\n", "print(\n", "    \"Test results: \"\n", "    + \"; \".join(\n", "        map(lambda i: f\"{model.metrics_names[i]}={scores[i]:.5f}\", range(len(model.metrics_names)))\n", "    )\n", ")" ] },
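{ "cell_type": "markdown", "metadata": {}, "source": [ "One simple way to address the TODO above (a suggestion, not part of the original flow) is to pass `validation_split` to `model.fit()`, so Keras holds out a fraction of the training data and logs `val_loss` / `val_acc` alongside the training metrics each epoch:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: log validation metrics during training. Note this continues\n", "# training the already-fitted model above, and the 10% fraction is an\n", "# arbitrary choice for illustration.\n", "model.fit(X_train, y_train, batch_size=16, epochs=1, verbose=1, validation_split=0.1)" ] },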
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Use the Model (Locally)\n", "\n", "Let's evaluate our model with some example headlines...\n", "\n", "If you struggle with the widget, you can always simply call the `classify()` function from Python. You can be creative with your headlines!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ipywidgets as widgets\n", "from IPython import display\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "\n", "\n", "def classify(text):\n", "    \"\"\"Classify a headline and print the results\"\"\"\n", "    encoded_example = tokenizer.texts_to_sequences([text])\n", "    # Pad documents to a max length of 40 words\n", "    max_length = 40\n", "    padded_example = pad_sequences(encoded_example, maxlen=max_length, padding=\"post\")\n", "    result = model.predict(padded_example)\n", "    print(result)\n", "    ix = np.argmax(result)\n", "    print(f\"Predicted class: '{labels[ix]}' with confidence {result[0][ix]:.2%}\")\n", "\n", "\n", "interaction = widgets.interact_manual(\n", "    classify,\n", "    text=widgets.Text(\n", "        value=\"The markets were bullish after news of the merger\",\n", "        placeholder=\"Type a news headline...\",\n", "        description=\"Headline:\",\n", "        layout=widgets.Layout(width=\"99%\"),\n", "    ),\n", ")\n", "interaction.widget.children[1].description = \"Classify!\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Or just use the function to classify your own headline:\n", "classify(\"Retailers are expanding after the recent economic growth\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "In this notebook we pre-processed publicly downloadable data and trained a neural news headline classifier model, much as a data scientist might normally do when working on a local machine.\n", "\n", "...But can we use the cloud more effectively to allocate high-performance resources, and easily deploy our trained models for use by other applications?\n", "\n", "Head on over to the next notebook, [Headline Classifier SageMaker.ipynb](Headline%20Classifier%20SageMaker.ipynb), where we'll show how the same model can be trained and then deployed on specific target infrastructure with Amazon SageMaker.\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-southeast-1:492261229750:image/tensorflow-2.3-cpu-py37-ubuntu18.04-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }