{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 3\n", "\n", "## Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not\n", "\n", "In this exercise, we will learn how to use Recurrent Neural Networks. \n", "\n", "We will follow these steps:\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Train-validation dataset split\n", "4. Text processing and Transformation\n", "5. Generating data batch and iterator\n", "6. Using pre-trained GloVe Word Embeddings\n", "7. Setting Hyperparameters and Building the Network\n", "8. Training the Network\n", "9. Test the classifier on the validation data\n", "10. Improvement ideas\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n", "\n", "__Important note:__ One big distinction between regular neural networks and RNNs is that RNNs work with sequential data. In our case, the RNN will help us with the text field. If we also want to consider other fields such as time, log_votes, verified, etc., we need to combine a regular neural network with the RNN network." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:47.162268Z", "start_time": "2021-01-09T05:02:47.160085Z" } }, "outputs": [], "source": [ "%pip install -q -r ../../requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:48.342987Z", "start_time": "2021-01-09T05:02:47.164823Z" } }, "outputs": [], "source": [ "import time\n", "import numpy as np\n", "import torch, torchtext\n", "import pandas as pd\n", "from collections import Counter\n", "from torch import nn, optim\n", "from torch.nn import BCEWithLogitsLoss\n", "from torch.utils.data import TensorDataset, DataLoader\n", "from torchtext.data.utils import get_tokenizer\n", "from torchtext.vocab import GloVe\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the dataset below. We will use the reviewText field as input to our ML model." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:48.995226Z", "start_time": "2021-01-09T05:02:48.344888Z" } }, "outputs": [], "source": [ "df = pd.read_csv('../../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five rows in the dataset. As you can see, the __isPositive__ field takes the values 1.0 or 0.0. That's why we will build a binary classification model." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:49.015545Z", "start_time": "2021-01-09T05:02:48.997444Z" } }, "outputs": [ { "data": {
"text/plain": [ " reviewText \\\n", "0 PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO... \n", "1 unable to open or use \n", "2 Waste of money!!! It wouldn't load to my system. \n", "3 I attempted to install this OS on two differen... \n", "4 I've spent 14 fruitless hours over the past tw... \n", "\n", " summary verified time \\\n", "0 IDEAL FOR BEGINNER! True 1361836800 \n", "1 Two Stars True 1452643200 \n", "2 Dont buy it! True 1433289600 \n", "3 I attempted to install this OS on two differen... True 1518912000 \n", "4 Do NOT Download. True 1441929600 \n", "\n", " log_votes isPositive \n", "0 0.000000 1.0 \n", "1 0.000000 0.0 \n", "2 0.000000 0.0 \n", "3 0.000000 0.0 \n", "4 1.098612 0.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Exploratory Data Analysis\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the distribution of the __isPositive__ field (our target)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:49.024615Z", "start_time": "2021-01-09T05:02:49.017492Z" } }, "outputs": [ { "data": { "text/plain": [ "1.0 43692\n", "0.0 26308\n", "Name: isPositive, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"isPositive\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the number of missing values for each column below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:49.040120Z", "start_time": "2021-01-09T05:02:49.026288Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reviewText 11\n", "summary 14\n", "verified 0\n", "time 0\n", "log_votes 0\n", "isPositive 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isna().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have missing values in our text fields. We can drop the rows with the __reviewText__ field missing, as we will use that field with the model afterwards." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df = df.dropna(subset=['reviewText'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check again." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reviewText 0\n", "summary 12\n", "verified 0\n", "time 0\n", "log_votes 0\n", "isPositive 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isna().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Train-validation split\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the dataset into training and validation sets." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:49.098503Z", "start_time": "2021-01-09T05:02:49.041948Z" } }, "outputs": [], "source": [ "# This separates 10% of the entire dataset into a validation dataset.\n", "train_text, val_text, train_label, val_label = \\\n", "    train_test_split(df[\"reviewText\"].tolist(),\n", "                     df[\"isPositive\"].tolist(),\n", "                     test_size=0.10,\n", "                     shuffle=True,\n", "                     random_state=324)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Text processing and Transformation\n",
"(Go to top)\n", "\n", "We will apply the following processes here:\n", "1. Creating a vocabulary\n", "2. Text transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__1. Creating a vocabulary:__ \n", "\n", "We will create a vocabulary from the tokens in the training text. We use a simple English tokenizer to split the text into tokens and build our vocabulary from them. In this vocabulary, tokens map to unique ids, such as \"car\"->32, \"house\"->651, etc. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "tokenizer = get_tokenizer(\"basic_english\")\n", "counter = Counter()\n", "for line in train_text:\n", "    counter.update(tokenizer(line))\n", "\n", "# Create a vocabulary with words seen at least 5 (min_freq) times\n", "vocab = torchtext.vocab.vocab(counter, min_freq=5)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:50.025716Z", "start_time": "2021-01-09T05:02:49.274150Z" } }, "outputs": [], "source": [ "# Add the unknown token at index 0\n", "# and use it by default for unknown words\n", "unk_token = '<unk>'\n", "vocab.insert_token(unk_token, 0)\n", "vocab.set_default_index(0)\n", "\n", "# Add the pad token at index 1\n", "pad_token = '<pad>'\n", "vocab.insert_token(pad_token, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some examples." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:03:26.376574Z", "start_time": "2021-01-09T05:02:50.027397Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'home' -> 524\n", "'wash' -> 13931\n", "'fhshbasdhb' -> 0\n" ] } ], "source": [ "print(f\"'home' -> {vocab['home']}\")\n", "print(f\"'wash' -> {vocab['wash']}\")\n", "# unknown word (assume from test set)\n", "print(f\"'fhshbasdhb' -> {vocab['fhshbasdhb']}\")" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.369651Z", "start_time": "2021-01-09T05:03:26.378501Z" } }, "source": [ "__2. Text transformation:__ \n", "\n", "We will use the vocabulary to map the tokens in the text to their unique ids. For example: `[\"this\", \"is\", \"a\", \"sentence\"] -> [14, 12, 9, 2066]`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Let's create a mapper to transform our text data\n", "text_transform_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.374706Z", "start_time": "2021-01-09T05:04:29.371615Z" } }, "source": [ "Let's see some text before and after transformation." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before transform:\tvery easy to install. purchased for the XP factor and it works great. haven't tried all upgrades yet but plan to.\n", "After transform:\t[62, 382, 17, 388, 12, 178, 39, 13, 797, 863, 9, 45, 490, 2, 12, 311, 34, 174, 303, 412, 864, 399, 21, 391, 17, 12]\n" ] } ], "source": [ "print(f\"Before transform:\\t{train_text[37]}\")\n", "print(f\"After transform:\\t{text_transform_pipeline(train_text[37])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a function for this. In this function, we transform and pad (if necessary) our text data. 
We cut the series of words at the point where it reaches a certain length (we used `max_len=50` here). If the text is shorter than max_len, we pad `1`s to the end (the id of the `<pad>` token)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def transformText(text_list, max_len):\n", "    # Transform the text\n", "    transformed_data = [text_transform_pipeline(text)[:max_len] for text in text_list]\n", "\n", "    # Pad with 1s (the pad token id) if the text is shorter than max_len\n", "    for data in transformed_data:\n", "        data[len(data) : max_len] = np.ones(max_len - len(data))\n", "\n", "    return torch.tensor(transformed_data, dtype=torch.int64)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'I have been using Webroot for several years now. I like how easy it is to update my subscription, conduct my own scans if needed, and set up regular scans. There is a fantastic feature that checks websites while I am browsing the web using search engines. Webroot will color coat the security \"safeness\" for the website and place the appropriate color bubble with a check-mark or \\'x\\' to the left of the name. Webroot provides a detail explanation on why a particular website is deemed safe or unsafe and some of those settings can be changed to fit the user\\'s preferences.\\n\\nThis particular purchase: \"Webroot SecureAnywhere Internet Security Plus 3 Device Download\" provides the download code once purchase has been completed. It\\'s great because, like I said I\\'ve been using Webroot for several years and already have the software on my computer. I didn\\'t need to buy and wait for a disc to ship to me. This download enabled me to install if I needed too (which I didn\\'t need) or I could pull the code to renew my current subscription (which is what I did), completely disc free. Like the description states, it can be used for 3 devices. So if you need to install Webroot or just need the code to renew, this particular product will fit your needs.\\n\\nI am completely satisfied with Webroot\\'s performance. I not only browse on my own home internet but I have used wifi provided by airports, coffee shops, libraries, and office buildings. I\\'ve even used international wifi and Webroot does a fantastic job keeping my computer and my information safe from preying eyes, viruses, and mal-ware.'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_text[129]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text: [\"We have recently installed this software at work. Its terrible. It overreacts to everything. It has even blocked us out of Yahoo sports, claiming that it was a malicious website. We have also been locked out of news sites, health sites and other benign websites. My guess is that it is detecting website advertisements as viruses. I 'm going to advocate that my office remove this frustrating garbage. 
Even my supervisor is getting frustrated with this software..\", \"Don't have it\"]\n", "\n", "Num sentences: 2\n", "\n", "Transformed text: \n", "tensor([[137, 56, 138, 139, 37, 64, 53, 140, 12, 141, 142, 12, 45, 0,\n", " 17, 52, 12, 45, 143, 59, 144, 145, 108, 7, 146, 147, 4, 148,\n", " 82, 45, 149, 5, 150, 151, 12, 137, 56, 94, 57, 152, 108, 7,\n", " 153, 154, 4, 155, 154, 9, 156, 157],\n", " [173, 34, 174, 56, 45, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1]])\n", "\n", "Shape of transformed text: torch.Size([2, 50])\n" ] } ], "source": [ "text = train_text[5:7]\n", "print(f\"Text: {text}\\n\")\n", "print(f\"Num sentences: {len(text)}\\n\")\n", "tt = transformText(text, max_len=50)\n", "print(f\"Transformed text: \\n{tt}\\n\")\n", "print(f\"Shape of transformed text: {tt.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Generating data batch and iterator\n", "(Go to top)\n", "\n", "Let's use the transformText() function and create the data loaders. Here, we use __max_len=100__ to consider the first 100 words in the text." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.864398Z", "start_time": "2021-01-09T05:04:29.376025Z" } }, "outputs": [], "source": [ "max_len = 100\n", "batch_size = 16\n", "\n", "# Pass transformed and padded data to dataset\n", "# Create data loaders\n", "train_dataset = TensorDataset(\n", "    transformText(train_text, max_len), torch.tensor(train_label)\n", ")\n", "train_loader = DataLoader(train_dataset, batch_size=batch_size)\n", "\n", "val_dataset = TensorDataset(transformText(val_text, max_len), torch.tensor(val_label))\n", "val_loader = DataLoader(val_dataset, batch_size=batch_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Using pre-trained GloVe Word Embeddings\n", "(Go to top)\n", "\n", "In this example, we will use GloVe word vectors. `name='6B'` selects the GloVe vectors trained on a corpus of 6 billion tokens, and `dim=300` means each word vector has 300 numbers in it. The following code shows how to get the word vectors and create an embedding matrix from them. We will connect our vocabulary indexes to the GloVe embedding with the `get_vecs_by_tokens()` function." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.868989Z", "start_time": "2021-01-09T05:04:29.866241Z" } }, "outputs": [], "source": [ "glove = GloVe(name=\"6B\", dim=300)\n", "embedding_matrix = glove.get_vecs_by_tokens(vocab.get_itos())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Setting Hyperparameters and Building the Network\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will set our hyperparameters below."
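, "\n", "\n", "Note that the final linear layer of the network defined below will take the concatenated RNN outputs from all time steps, so its input size is `hidden_size * max_len` = 8 * 100 = 800 with the values used here (`max_len` was set to 100 in Section 5)."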
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.880059Z", "start_time": "2021-01-09T05:04:29.871107Z" } }, "outputs": [], "source": [ "# Size of the state vectors\n", "hidden_size = 8\n", "\n", "# General NN training parameters\n", "learning_rate = 0.001\n", "epochs = 25\n", "\n", "# Embedding vector and vocabulary sizes\n", "embed_size = 300 # glove.6B.300d.txt\n", "vocab_size = len(vocab.get_itos())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that our data is in the correct format, we can build the network.\n", "Our model is made of these layers:\n", "* Embedding layer: This is where our words/tokens are mapped to word vectors.\n", "* RNN layer: We are using a simple RNN model. We stack 2 RNN layers in this example. More details about the RNN are available [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).\n", "* Linear layer: A linear layer with a single neuron is used to output the `isPositive` prediction." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.892791Z", "start_time": "2021-01-09T05:04:29.881808Z" } }, "outputs": [], "source": [ "class Net(nn.Module):\n", "    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):\n", "        super().__init__()\n", "        self.embedding = nn.Embedding(vocab_size, embed_size)\n", "        # batch_first=True because our inputs have shape (batch_size, max_len)\n", "        self.rnn = nn.RNN(\n", "            embed_size, hidden_size, num_layers=num_layers, batch_first=True\n", "        )\n", "\n", "        self.linear = nn.Linear(hidden_size*max_len, 1)\n", "        self.act = nn.Sigmoid()\n", "\n", "    def forward(self, inputs):\n", "        embeddings = self.embedding(inputs)\n", "        # Call RNN layer\n", "        outputs, _ = self.rnn(embeddings)\n", "        # Use the output of each time step\n", "        # Send it all together to the linear layer\n", "        outs = self.linear(outputs.reshape(outputs.shape[0], -1))\n", "        return self.act(outs)\n", "\n", "model = Net(vocab_size, embed_size, hidden_size, num_layers=2)\n", "\n", "# Initialize the weights\n", "def init_weights(m):\n", "    if type(m) == nn.Linear:\n", "        nn.init.xavier_uniform_(m.weight)\n", "    if type(m) == nn.RNN:\n", "        for param in m._flat_weights_names:\n", "            if \"weight\" in param:\n", "                nn.init.xavier_uniform_(m._parameters[param])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's initialize this network. Then, we will need to make the embedding layer use our GloVe word vectors." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.902048Z", "start_time": "2021-01-09T05:04:29.899284Z" } }, "outputs": [], "source": [ "# We set the embedding layer's parameters from GloVe\n", "model.embedding.weight.data.copy_(embedding_matrix)\n", "# We won't change/train the embedding layer\n", "model.embedding.weight.requires_grad = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Training the Network\n", "(Go to top)\n", "\n", "Now, it is time to start our training. We define the loss function and training algorithm first. Then, training starts!" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:04:29.906415Z", "start_time": "2021-01-09T05:04:29.903716Z" } }, "source": [ "We will define the trainer and loss function below.
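We use plain SGD and `nn.BCELoss`, matching the `Sigmoid` at the end of the network. As a hedged aside (not part of the lecture code), the already-imported `BCEWithLogitsLoss` combines the sigmoid and the binary cross-entropy in a single, numerically more stable step, and `Adam` often converges faster than SGD on this kind of task:\n\n```python\n# Hedged alternative setup; BCEWithLogitsLoss expects raw logits,\n# so the Sigmoid would need to be dropped from Net's forward pass\ntrainer = torch.optim.Adam(model.parameters(), lr=learning_rate)\ncross_ent_loss = BCEWithLogitsLoss(reduction=\"sum\")\n```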
\n", "\n", "__Binary cross-entropy loss__ is used as this is a binary classification problem.\n", "\n", "$$\n", "\\mathrm{BinaryCrossEntropyLoss} = -\\sum_{examples}{(y\\log(p) + (1 - y)\\log(1 - p))}\n", "$$" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Setting our trainer\n", "trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)\n", "\n", "# We will use Binary Cross-entropy loss\n", "# reduction=\"sum\" sums the losses for given output and target\n", "cross_ent_loss = nn.BCELoss(reduction=\"sum\")" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:06:35.434926Z", "start_time": "2021-01-09T05:04:29.908071Z" }, "scrolled": true }, "source": [ "Now, it is time to start the training process. We will print the Binary cross-entropy loss after each epoch." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see some validation results below." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:06:36.046946Z", "start_time": "2021-01-09T05:06:35.436633Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0. Train_loss 0.565565640170492. Val_loss 0.5057228901229904. Seconds 14.46343994140625\n", "Epoch 1. Train_loss 0.48193354848491365. Val_loss 0.4710737303131426. Seconds 14.414626359939575\n", "Epoch 2. Train_loss 0.45539847189026195. Val_loss 0.4506763486389365. Seconds 14.43692421913147\n", "Epoch 3. Train_loss 0.4381956004399991. Val_loss 0.43892509767985. Seconds 14.420343399047852\n", "Epoch 4. Train_loss 0.42654436279580904. Val_loss 0.4318110432211272. Seconds 14.416456937789917\n", "Epoch 5. Train_loss 0.4179300016421215. Val_loss 0.42749577591224713. Seconds 14.436838865280151\n", "Epoch 6. Train_loss 0.41113823203363764. Val_loss 0.4250155760400856. Seconds 14.415632724761963\n", "Epoch 7. Train_loss 0.40561350161243115. Val_loss 0.4235686122187377. Seconds 14.442252397537231\n", "Epoch 8. Train_loss 0.400997344562602. Val_loss 0.4223748551281644. Seconds 14.412562131881714\n", "Epoch 9. Train_loss 0.397065539376322. Val_loss 0.4210442323857741. Seconds 14.469841718673706\n", "Epoch 10. Train_loss 0.39365788666023716. Val_loss 0.41945044949048244. Seconds 15.151393413543701\n", "Epoch 11. Train_loss 0.39065901268889247. Val_loss 0.4176639952579896. Seconds 14.733606338500977\n", "Epoch 12. Train_loss 0.3879863079870896. Val_loss 0.41584413094867345. Seconds 14.412443161010742\n", "Epoch 13. Train_loss 0.38557515267739884. Val_loss 0.4141347120959242. Seconds 14.503468751907349\n", "Epoch 14. Train_loss 0.38337428946024954. Val_loss 0.4125501166617705. Seconds 14.388250350952148\n", "Epoch 15. Train_loss 0.38135641634057377. Val_loss 0.4111177539600613. Seconds 14.407227277755737\n", "Epoch 16. Train_loss 0.37949003540612264. Val_loss 0.40982920644214826. Seconds 14.403850317001343\n", "Epoch 17. Train_loss 0.3777463913020643. Val_loss 0.40862205335456553. Seconds 14.436709642410278\n", "Epoch 18. Train_loss 0.376106850414697. Val_loss 0.4074430435040045. Seconds 14.461316108703613\n", "Epoch 19. Train_loss 0.3745616044973945. Val_loss 0.4062969534205341. Seconds 14.42183542251587\n", "Epoch 20. Train_loss 0.3731047643160211. Val_loss 0.4052028246207959. Seconds 14.499522686004639\n", "Epoch 21. Train_loss 0.3717317574552968. Val_loss 0.4041522572527342. Seconds 14.428503513336182\n", "Epoch 22. Train_loss 0.3704353946450211. Val_loss 0.4031779228372052. Seconds 14.424946308135986\n",
"Epoch 23. Train_loss 0.3692140730189263. Val_loss 0.40227297883933744. Seconds 15.147498846054077\n", "Epoch 24. Train_loss 0.368064061021555. Val_loss 0.40140937377527997. Seconds 14.439029932022095\n" ] } ], "source": [ "# Get the compute device\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", "\n", "model.apply(init_weights)\n", "model.to(device)\n", "\n", "for epoch in range(epochs):\n", "    start = time.time()\n", "    training_loss = 0\n", "    val_loss = 0\n", "    # Training loop, train the network\n", "    for data, target in train_loader:\n", "        trainer.zero_grad()\n", "        data = data.to(device)\n", "        target = target.to(device)\n", "        output = model(data)\n", "        L = cross_ent_loss(output.squeeze(1), target)\n", "        training_loss += L.item()\n", "        L.backward()\n", "        trainer.step()\n", "\n", "    # Validate the network, no training (no weight update, no gradients needed)\n", "    with torch.no_grad():\n", "        for data, target in val_loader:\n", "            val_predictions = model(data.to(device))\n", "            L = cross_ent_loss(val_predictions.squeeze(1), target.to(device))\n", "            val_loss += L.item()\n", "\n", "    # Let's take the average losses\n", "    training_loss = training_loss / len(train_label)\n", "    val_loss = val_loss / len(val_label)\n", "\n", "    end = time.time()\n", "    print(\n", "        f\"Epoch {epoch}. Train_loss {training_loss}. Val_loss {val_loss}. Seconds {end-start}\"\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Test the classifier on the validation data\n", "(Go to top)\n", "\n", "Let's get the validation predictions. Earlier we made predictions on the validation set with this line: ```model(data.to(device))```." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]\n" ] } ], "source": [ "val_predictions = []\n", "for data, target in val_loader:\n", "    val_preds = model(data.to(device))\n", "    val_predictions.extend(\n", "        [np.rint(val_pred)[0] for val_pred in val_preds.detach().cpu().numpy()]\n", "    )\n", "print(val_predictions[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confusion matrix, classification report and accuracy score are printed below." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[2122 508]\n", " [ 706 3663]]\n", " precision recall f1-score support\n", "\n", " 0.0 0.75 0.81 0.78 2630\n", " 1.0 0.88 0.84 0.86 4369\n", "\n", " accuracy 0.83 6999\n", " macro avg 0.81 0.82 0.82 6999\n", "weighted avg 0.83 0.83 0.83 6999\n", "\n", "Accuracy (validation): 0.8265466495213601\n" ] } ], "source": [ "# Print evaluation metrics for the predictions on the validation dataset\n", "print(confusion_matrix(val_label, val_predictions))\n", "print(classification_report(val_label, val_predictions))\n", "print(\"Accuracy (validation):\", accuracy_score(val_label, val_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Improvement ideas\n", "(Go to top)\n", "\n", "We can improve our model by:\n", "* Changing hyper-parameters: Learning rate, batch size and hidden size\n", "* Increasing the number of layers: num_layers\n", "* Using more advanced architectures such as Gated Recurrent Units (GRU) and Long Short-Term Memory networks (LSTM)."
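, "\n", "\n", "For example, here is a minimal sketch (not part of the lecture code, and untrained) of the same network with the simple RNN swapped for an LSTM; the class name `LSTMNet` is just for illustration, and `nn.GRU` could be dropped in the same way:\n", "\n", "```python\n", "class LSTMNet(nn.Module):\n", "    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):\n", "        super().__init__()\n", "        self.embedding = nn.Embedding(vocab_size, embed_size)\n", "        # nn.LSTM (or nn.GRU) is a drop-in replacement for nn.RNN here\n", "        self.lstm = nn.LSTM(\n", "            embed_size, hidden_size, num_layers=num_layers, batch_first=True\n", "        )\n", "        self.linear = nn.Linear(hidden_size * max_len, 1)\n", "        self.act = nn.Sigmoid()\n", "\n", "    def forward(self, inputs):\n", "        embeddings = self.embedding(inputs)\n", "        # The LSTM also returns (h_n, c_n) as its second output; we only need outputs\n", "        outputs, _ = self.lstm(embeddings)\n", "        outs = self.linear(outputs.reshape(outputs.shape[0], -1))\n", "        return self.act(outs)\n", "```"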
] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }