{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 1\n", "\n", "## K Nearest Neighbors Model for a Classification Problem: Classify Product Reviews as Positive or Negative\n", "\n", "In this notebook, we use the K Nearest Neighbors method to build a classifier to predict the __isPositive__ field of our review dataset (that is very similar to the final project dataset).\n", "\n", "\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Text Processing: Stop words removal and stemming\n", "4. Train - Validation Split\n", "5. Data processing with Pipeline\n", "6. Train the classifier\n", "7. Test the classifier Find more details on the KNN Classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html\n", "8. Ideas for improvement\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes). *This field is a processed version of the votes field. People can click on the \"helpful\" button when they find a customer review helpful. This increases the vote by 1. __log_votes__ is calculated like this log(1+votes). This formulation helps us get a smaller range for votes.*\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "(Go to top)\n", "\n", "We will use the __pandas__ library to read our dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of the dataset is: (70000, 6)\n" ] } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')\n", "\n", "print('The shape of the dataset is:', df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first 10 rows of the dataset. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reviewTextsummaryverifiedtimelog_votesisPositive
0PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO...IDEAL FOR BEGINNER!True13618368000.0000001.0
1unable to open or useTwo StarsTrue14526432000.0000000.0
2Waste of money!!! It wouldn't load to my system.Dont buy it!True14332896000.0000000.0
3I attempted to install this OS on two differen...I attempted to install this OS on two differen...True15189120000.0000000.0
4I've spent 14 fruitless hours over the past tw...Do NOT Download.True14419296001.0986120.0
5I purchased the home and business because I wa...Quicken home and business not for amaturesTrue13353120000.0000000.0
6The download doesn't take long at all. And it'...Great!True13779936000.0000001.0
7This program is positively wonderful for word ...Terrific for practice.False11583648002.3978951.0
8Fantastic protection!! Great customer support!!Five StarsTrue14784768000.0000001.0
9Obviously Win 7 now the last great operating s...Five StarsTrue14714784000.0000001.0
\n", "
" ], "text/plain": [ " reviewText \\\n", "0 PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO... \n", "1 unable to open or use \n", "2 Waste of money!!! It wouldn't load to my system. \n", "3 I attempted to install this OS on two differen... \n", "4 I've spent 14 fruitless hours over the past tw... \n", "5 I purchased the home and business because I wa... \n", "6 The download doesn't take long at all. And it'... \n", "7 This program is positively wonderful for word ... \n", "8 Fantastic protection!! Great customer support!! \n", "9 Obviously Win 7 now the last great operating s... \n", "\n", " summary verified time \\\n", "0 IDEAL FOR BEGINNER! True 1361836800 \n", "1 Two Stars True 1452643200 \n", "2 Dont buy it! True 1433289600 \n", "3 I attempted to install this OS on two differen... True 1518912000 \n", "4 Do NOT Download. True 1441929600 \n", "5 Quicken home and business not for amatures True 1335312000 \n", "6 Great! True 1377993600 \n", "7 Terrific for practice. False 1158364800 \n", "8 Five Stars True 1478476800 \n", "9 Five Stars True 1471478400 \n", "\n", " log_votes isPositive \n", "0 0.000000 1.0 \n", "1 0.000000 0.0 \n", "2 0.000000 0.0 \n", "3 0.000000 0.0 \n", "4 1.098612 0.0 \n", "5 0.000000 0.0 \n", "6 0.000000 1.0 \n", "7 2.397895 1.0 \n", "8 0.000000 1.0 \n", "9 0.000000 1.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Exploratory data analysis\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the distribution of __isPositive__ field." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 43692\n", "0.0 26308\n", "Name: isPositive, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"isPositive\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the number of missing values for each columm below." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reviewText 11\n", "summary 14\n", "verified 0\n", "time 0\n", "log_votes 0\n", "isPositive 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isna().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have missing values in our text fields." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Text Processing: Stop words removal and stemming\n", "(Go to top)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /home/ec2-user/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Install the library and functions\n", "import nltk\n", "\n", "nltk.download('punkt')\n", "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create the stop word removal and text cleaning processes below. NLTK library provides a list of common stop words. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Text Processing: Stop words removal and stemming\n", "(Go to top)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n", "[nltk_data]   Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to\n", "[nltk_data]     /home/ec2-user/nltk_data...\n", "[nltk_data]   Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import NLTK and download the tokenizer and stop word resources\n", "import nltk\n", "\n", "nltk.download('punkt')\n", "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create the stop word removal and text cleaning processes below. The NLTK library provides a list of common stop words. We will use that list, but remove some of the words from it, because those words are actually useful for understanding the sentiment of a sentence." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from nltk.corpus import stopwords\n", "from nltk.stem import SnowballStemmer\n", "from nltk.tokenize import word_tokenize\n", "\n", "# Let's get a list of stop words from the NLTK library\n", "stop = stopwords.words('english')\n", "\n", "# These words are important for our problem. We don't want to remove them.\n", "excluding = ['against', 'not', 'don', \"don't\", 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\",\n", "             'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\",\n", "             'haven', \"haven't\", 'isn', \"isn't\", 'mightn', \"mightn't\", 'mustn', \"mustn't\",\n", "             'needn', \"needn't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren',\n", "             \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n", "\n", "# New stop word list\n", "stop_words = [word for word in stop if word not in excluding]\n", "\n", "snow = SnowballStemmer('english')\n", "\n", "def process_text(texts):\n", "    final_text_list = []\n", "    for sent in texts:\n", "\n", "        # Treat missing values (anything that is not a string) as empty text\n", "        if not isinstance(sent, str):\n", "            sent = \"\"\n", "\n", "        filtered_sentence = []\n", "\n", "        sent = sent.lower()               # Lowercase\n", "        sent = sent.strip()               # Remove leading/trailing whitespace\n", "        sent = re.sub(r'\\s+', ' ', sent)  # Collapse extra spaces and tabs\n", "        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup\n", "\n", "        for w in word_tokenize(sent):\n", "            # We are applying some custom filtering here, feel free to try different things\n", "            # Keep the token if it is not numeric, its length is > 2, and it is not a stop word\n", "            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):\n", "                # Stem and add to the filtered list\n", "                filtered_sentence.append(snow.stem(w))\n", "        final_string = \" \".join(filtered_sentence)  # Final string of cleaned words\n", "\n", "        final_text_list.append(final_string)\n", "\n", "    return final_text_list" ] }
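, { "cell_type": "markdown", "metadata": {}, "source": [ "Before processing the whole dataset, we can sanity-check __process_text()__ on a couple of made-up inputs (the review below is our own example, not from the dataset). Note how a missing value comes back as an empty string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quick sanity check: one made-up review and one missing value\n", "print(process_text([\"This product is NOT working... I couldn't install it!\", None]))" ] }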
, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Train - Validation Split\n", "(Go to top)\n", "\n", "Let's split our dataset into training (90%) and validation (10%) subsets." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_val, y_train, y_val = train_test_split(df[[\"reviewText\"]],\n", "                                                  df[\"isPositive\"],\n", "                                                  test_size=0.10,\n", "                                                  shuffle=True,\n", "                                                  random_state=324\n", "                                                  )" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing the reviewText fields\n" ] } ], "source": [ "print(\"Processing the reviewText fields\")\n", "train_text_list = process_text(X_train[\"reviewText\"].tolist())\n", "val_text_list = process_text(X_val[\"reviewText\"].tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our __process_text()__ method from section 3 substitutes an empty string for each missing value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Data processing with Pipeline\n", "(Go to top)\n", "\n", "Today we will use a simple pipeline that vectorizes our text field and fits a simple K Nearest Neighbors classifier. This example only uses a single field (reviewText); in the next lecture, we will see how to combine multiple fields.\n", "\n", "Our CountVectorizer() will return binary values (1 if a word appears in a review, 0 otherwise) and keep a vocabulary of at most 15 words. Feel free to experiment with different numbers here. The toy example below illustrates this behavior before we wire the vectorizer into the pipeline." ] }
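, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a minimal sketch on a two-sentence corpus of our own (not from the dataset): with binary=True each entry of the matrix is 0 or 1 rather than a count, and max_features caps the vocabulary at the most frequent words." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy example: binary bag-of-words with a capped vocabulary\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "toy_corpus = [\"not a good product\", \"good product, really good product\"]\n", "toy_vect = CountVectorizer(binary=True, max_features=3)\n", "print(toy_vect.fit_transform(toy_corpus).toarray())\n", "print(toy_vect.get_feature_names_out())" ] }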
, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": {
" ], "text/plain": [ "Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=15)),\n", " ('knn', KNeighborsClassifier())])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "### PIPELINE ###\n", "##########################\n", "\n", "pipeline = Pipeline([\n", " ('text_vect', CountVectorizer(binary=True,\n", " max_features=15)),\n", " ('knn', KNeighborsClassifier()) \n", " ])\n", "\n", "# Visualize the pipeline\n", "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n", "from sklearn import set_config\n", "set_config(display='diagram')\n", "pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Train the classifier\n", "(Go to top)\n", "\n", "We train our classifier with __.fit()__ on our training dataset. " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=15)),\n",
       "                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=15)
KNeighborsClassifier()
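, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the vectorizer keeps only 15 words, it can be instructive to see which words it actually selected. The sketch below reads the fitted vocabulary back out of the pipeline; it assumes a scikit-learn version that provides get_feature_names_out()." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Peek at the vocabulary the fitted CountVectorizer selected\n", "print(pipeline.named_steps['text_vect'].get_feature_names_out())" ] }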
" ], "text/plain": [ "Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=15)),\n", " ('knn', KNeighborsClassifier())])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We using lists of processed text fields \n", "X_train = train_text_list\n", "X_val = val_text_list\n", "\n", "# Fit the Pipeline to training data\n", "pipeline.fit(X_train, y_train.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Test the classifier\n", "(Go to top)\n", "\n", "Let's evaluate the performance of the trained classifier. We use __.predict()__ this time. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1325 1280]\n", " [ 902 3493]]\n", " precision recall f1-score support\n", "\n", " 0.0 0.59 0.51 0.55 2605\n", " 1.0 0.73 0.79 0.76 4395\n", "\n", " accuracy 0.69 7000\n", " macro avg 0.66 0.65 0.66 7000\n", "weighted avg 0.68 0.69 0.68 7000\n", "\n", "Accuracy (validation): 0.6882857142857143\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n", "\n", "# Use the fitted pipeline to make predictions on the validation dataset\n", "val_predictions = pipeline.predict(X_val)\n", "print(confusion_matrix(y_val.values, val_predictions))\n", "print(classification_report(y_val.values, val_predictions))\n", "print(\"Accuracy (validation):\", accuracy_score(y_val.values, val_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Ideas for improvement\n", "(Go to top)\n", "\n", "We can usually improve performance with some additional work. You can try the following:\n", "* Using your validation data, try different K values and look at accuracy for them.\n", "* Change the feature extractor to TF, TF-IDF. Also experiment with different vocabulary sizes.\n", "* Come up with some other features such as having certain punctuations, all-capitalized words or some words that might be useful in this problem." ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }