{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![MLU Logo](../data/MLU_Logo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a name=\"0\">Machine Learning Accelerator - Natural Language Processing - Lecture 1</a>\n",
    "\n",
    "## Text Processing\n",
    "\n",
    "In this notebook, we go over some simple techniques to clean and prepare text data for modeling with machine learning.\n",
    "\n",
    "1. <a href=\"#1\">Simple text cleaning processes</a>\n",
    "2. <a href=\"#2\">Lexicon-based text processing</a>\n",
    "    * Stop words removal \n",
    "    * Stemming   \n",
    "    * Lemmatization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q -r ../requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. <a name=\"1\">Simple text cleaning processes</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "In this section, we will do some general purpose text cleaning. The following methods for cleaning can be extended depending on the application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  \n"
     ]
    }
   ],
   "source": [
    "text = \"   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  \"\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first lowercase our text. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  \n"
     ]
    }
   ],
   "source": [
    "text = text.lower()\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get rid of leading/trailing whitespace with the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .\n"
     ]
    }
   ],
   "source": [
    "text = text.strip()\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove HTML tags/markups:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "this is a message to be cleaned. it may involve some things like: , ?, :, ''  adjacent spaces and tabs     .\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = re.compile('<.*?>').sub('', text)\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace punctuation with space. Be careful with this one, depending on the application, punctuations can actually be useful. For example positive vs negative meanining of a sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "this is a message to be cleaned  it may involve some things like              adjacent spaces and tabs      \n"
     ]
    }
   ],
   "source": [
    "import re, string\n",
    "\n",
    "text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove extra space and tabs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = re.sub('\\s+', ' ', text)\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. <a name=\"2\">Lexicon-based text processing</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We saw some general purpose text pre-processing methods in the previous section. Lexicon based methods are usually applied after the common text processing methods. They are used to normalize sentences in our dataset. By normalization, here, we mean putting words into a similar format that will also enhace similarities (if any) between sentences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to install some packages for this example. Run the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     /home/ec2-user/nltk_data...\n",
      "[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.\n",
      "[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import nltk\n",
    "\n",
    "nltk.download('punkt')\n",
    "nltk.download('averaged_perceptron_tagger')\n",
    "nltk.download('wordnet')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stop word removal\n",
    "There can be some words in our sentences that occur very frequently and don't contribute too much to the overall meaning of the sentences. We usually have a list of these words and remove them from each our sentences. For example: \"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\" in this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will use a tokenizer from the NLTK library\n",
    "import nltk\n",
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "filtered_sentence = []\n",
    "\n",
    "# Stop word lists can be adjusted for your problem\n",
    "stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n",
    "\n",
    "# Tokenize the sentence\n",
    "words = word_tokenize(text)\n",
    "for w in words:\n",
    "    if w not in stop_words:\n",
    "        filtered_sentence.append(w)\n",
    "text = \" \".join(filtered_sentence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "message be cleaned may involve some things like adjacent spaces tabs\n"
     ]
    }
   ],
   "source": [
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stemming\n",
    "Stemming is a rule-based system to convert words into their root form. It removes suffixes from words. This helps us enhace similarities (if any) between sentences. \n",
    "\n",
    "Example:\n",
    "\n",
    "\"jumping\", \"jumped\" -> \"jump\"\n",
    "\n",
    "\"cars\" -> \"car\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We will use a tokenizer and stemmer from the NLTK library\n",
    "import nltk\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.stem import SnowballStemmer\n",
    "\n",
    "# Initialize the stemmer\n",
    "snow = SnowballStemmer('english')\n",
    "\n",
    "stemmed_sentence = []\n",
    "# Tokenize the sentence\n",
    "words = word_tokenize(text)\n",
    "for w in words:\n",
    "    # Stem the word/token\n",
    "    stemmed_sentence.append(snow.stem(w))\n",
    "stemmed_text = \" \".join(stemmed_sentence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "messag be clean may involv some thing like adjac space tab\n"
     ]
    }
   ],
   "source": [
    "print(stemmed_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see above that stemming operation is NOT perfect. We have mistakes such as \"messag\", \"involv\", \"adjac\". It is a rule based method that sometimes mistakely remove suffixes from words. Nevertheless, it runs fast."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Lemmatization\n",
    "If we are not satisfied with the result of stemming, we can use the Lemmatization instead. It usually requires more work, but gives better results. As mentioned in the class, lemmatization needs to know the correct word position tags such as \"noun\", \"verb\", \"adjective\", etc. and we will use another NLTK function to feed this information to the lemmatizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Importing the necessary functions\n",
    "import nltk\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.corpus import wordnet\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "\n",
    "# Initialize the lemmatizer\n",
    "wl = WordNetLemmatizer()\n",
    "\n",
    "# This is a helper function to map NTLK position tags\n",
    "# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html\n",
    "def get_wordnet_pos(tag):\n",
    "    if tag.startswith('J'):\n",
    "        return wordnet.ADJ\n",
    "    elif tag.startswith('V'):\n",
    "        return wordnet.VERB\n",
    "    elif tag.startswith('N'):\n",
    "        return wordnet.NOUN\n",
    "    elif tag.startswith('R'):\n",
    "        return wordnet.ADV\n",
    "    else:\n",
    "        return wordnet.NOUN\n",
    "\n",
    "lemmatized_sentence = []\n",
    "# Tokenize the sentence\n",
    "words = word_tokenize(text)\n",
    "# Get position tags\n",
    "word_pos_tags = nltk.pos_tag(words)\n",
    "# Map the position tag and lemmatize the word/token\n",
    "for idx, tag in enumerate(word_pos_tags):\n",
    "    lemmatized_sentence.append(wl.lemmatize(tag[0], get_wordnet_pos(tag[1])))\n",
    "\n",
    "lemmatized_text = \" \".join(lemmatized_sentence)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "message be clean may involve some thing like adjacent space tabs\n"
     ]
    }
   ],
   "source": [
    "print(lemmatized_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks better than the stemming result."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p39",
   "language": "python",
   "name": "conda_pytorch_p39"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}