{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Tabular Data - Lecture 2\n", "\n", "\n", "## Text Preprocessing\n", "\n", "In this notebook we explore techniques to clean and convert text features into numerical features that machine learning algorithms can work with. \n", "\n", "1. Common text pre-processing\n", "2. Lexicon-based text processing\n", "3. Feature Extraction - Bag of Words\n", "4. Putting it all together\n", "\n" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.3.1; however, version 22.3.1 is available.\n", "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python -m pip install --upgrade pip' command.\u001b[0m\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Common text pre-processing\n", "(Go to top)\n", "\n", "In this section, we will do some general purpose text cleaning." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "text = \" This is a message to be cleaned. It may involve some things like: <br>, ?, :, '' adjacent spaces and tabs . \"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's first lowercase our text. " ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " this is a message to be cleaned. it may involve some things like: <br>, ?, :, '' adjacent spaces and tabs . \n" ] } ], "source": [ "text = text.lower()\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can get rid of leading/trailing whitespace with the following:" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned. it may involve some things like: <br>, ?, :, '' adjacent spaces and tabs .\n" ] } ], "source": [ "text = text.strip()\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Remove HTML tags/markups:" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned. it may involve some things like: , ?, :, '' adjacent spaces and tabs .\n" ] } ], "source": [ "import re\n", "\n", "text = re.compile('<.*?>').sub('', text)\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Replace punctuation with a space:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n" ] } ], "source": [ "import re, string\n", "\n", "text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Remove extra spaces and tabs:" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n" ] } ], "source": [ "import re\n", "\n", "text = re.sub('\s+', ' ', text)\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Lexicon-based text processing\n", "(Go to top)\n", "\n", "In section 1, we saw some general purpose text pre-processing methods. Lexicon-based methods are usually used __to normalize sentences in our dataset__ and later in section 3, we will use these normalized sentences for feature extraction. <br>\n",
"By normalization, here, __we mean putting words in the sentences into a similar format that will enhance similarities (if any) between sentences__. \n", "\n", "__Stop word removal:__ There can be some words in our sentences that occur very frequently and don't contribute too much to the overall meaning of the sentences. We usually have a list of these words and remove them from each of our sentences. For example: \"a\", \"an\", \"the\", \"this\", \"that\", \"is\"" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n", "\n", "filtered_sentence = []\n", "words = text.split(\" \")\n", "for w in words:\n", "    if w not in stop_words:\n", "        filtered_sentence.append(w)\n", "text = \" \".join(filtered_sentence)" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "message be cleaned may involve some things like adjacent spaces tabs \n" ] } ], "source": [ "print(text)" ] },
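{ "cell_type": "markdown", "metadata": {}, "source": [ "We wrote this stop word list by hand. As an optional alternative (a small sketch, assuming the NLTK stopwords corpus can be downloaded in your environment), NLTK also ships a ready-made English stop word list:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional alternative: NLTK's built-in English stop word list (instead of the hand-written list above)\n", "import nltk\n", "\n", "nltk.download('stopwords', quiet=True)  # one-time download of the stopwords corpus\n", "from nltk.corpus import stopwords\n", "\n", "print(stopwords.words('english')[:10])" ] },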
{ "cell_type": "markdown", "metadata": {}, "source": [ "__Stemming:__ Stemming is a rule-based system to __convert words into their root form__. <br>\n", "It removes suffixes from words. This helps us enhance similarities (if any) between sentences. \n", "\n", "Example:\n", "\n", "\"jumping\", \"jumped\" -> \"jump\"\n", "\n", "\"cars\" -> \"car\"" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# We use the NLTK library\n", "import nltk\n", "from nltk.stem import SnowballStemmer\n", "\n", "# Initialize the stemmer\n", "snow = SnowballStemmer('english')\n", "\n", "stemmed_sentence = []\n", "words = text.split(\" \")\n", "for w in words:\n", "    stemmed_sentence.append(snow.stem(w))\n", "text = \" \".join(stemmed_sentence)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "messag be clean may involv some thing like adjac space tab \n" ] } ], "source": [ "print(text)" ] },
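{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, we can stem the example words mentioned above with the same stemmer:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quick check: stem the example words from the markdown above\n", "print([snow.stem(w) for w in [\"jumping\", \"jumped\", \"cars\"]])" ] },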
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Feature Extraction - Bag of Words\n", "(Go to top)\n", "\n", "In this section, we assume we will first apply the common and lexicon-based pre-processing to our text. After those, we will convert our text data into numerical data with the __Bag of Words (BoW)__ representation. \n", "\n", "__Bag of Words (BoW)__: A modeling technique to convert text information into numerical representation. <br>\n", "__Machine learning models expect numerical or categorical values as input and won't work with raw text data__. \n", "\n", "Steps:\n", "1. Create vocabulary of known words\n", "2. Measure presence of the known words in sentences\n", "\n", "Let's see an interactive example for ourselves:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mluvisuals import *\n", "\n", "BagOfWords()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We will use the sklearn library's Bag of Words implementation:\n", "\n", "`from sklearn.feature_extraction.text import CountVectorizer`\n", "\n", "`countVectorizer = CountVectorizer(binary=True)`" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "countVectorizer = CountVectorizer(binary=True)\n", "\n", "sentences = [\n", "    'This is the first document.',\n", "    'This is the second second document.',\n", "    'And the third one.',\n", "    'Is this the first document?'\n", "]\n", "X = countVectorizer.fit_transform(sentences)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's print the vocabulary below. <br>\n", "Each number next to a word shows its index in the vocabulary (from 0 to 8 here). <br>\n", "The words are alphabetically ordered -> and:0, document:1, first:2, ..." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}\n" ] } ], "source": [ "print(countVectorizer.vocabulary_)" ] },
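{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional check (a small sketch, assuming scikit-learn 1.0 or newer, where `CountVectorizer` provides `get_feature_names_out()`), we can also list the vocabulary in column/index order:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: list the vocabulary in column/index order (requires scikit-learn >= 1.0)\n", "print(countVectorizer.get_feature_names_out())" ] },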
{ "cell_type": "markdown", "metadata": {}, "source": [ "__Note:__ Sklearn automatically removes punctuation, but doesn't do the other extra pre-processing methods we discussed here. <br>\n", "Lexicon-based methods are also not automatically applied; we need to call those methods ourselves before feature extraction." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 1 0 0 1 0 1]\n", " [0 1 0 1 0 1 1 0 1]\n", " [1 0 0 0 1 0 1 1 0]\n", " [0 1 1 1 0 0 1 0 1]]\n" ] } ], "source": [ "print(X.toarray())" ] },
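{ "cell_type": "markdown", "metadata": {}, "source": [ "With `binary=True`, each feature only records whether a word is present (0/1). By default, `CountVectorizer` counts occurrences instead; for example, \"second\" appears twice in the second sentence, so that entry would become 2. A small sketch for comparison:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For comparison (illustrative only): the default CountVectorizer keeps word counts instead of 0/1 indicators\n", "countVectorizerCounts = CountVectorizer(binary=False)\n", "X_counts = countVectorizerCounts.fit_transform(sentences)\n", "print(X_counts.toarray())" ] },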
{ "cell_type": "markdown", "metadata": {}, "source": [ "__What happens when we encounter a new word during prediction?__ \n", "\n", "__New words will be skipped__. <br>\n", "This usually happens when we are making predictions. For our test and validation data/text, we need to use the __.transform()__ function this time. <br>\n", "This simulates a real-time prediction case where we cannot re-train the model quickly whenever we receive new words." ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 0 0 0 0 0 0 1]\n", " [0 0 0 1 1 0 0 0 1]]\n" ] } ], "source": [ "test_sentences = [\"this document has some new words\",\n", "                  \"this one is new too\"]\n", "\n", "count_vectors = countVectorizer.transform(test_sentences)\n", "print(count_vectors.toarray())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "See that these last two vectors have the same length 9 (same vocabulary) as the ones before." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Putting it all together\n", "(Go to top)\n", "\n", "Let's have a full example here. We will apply everything discussed in this notebook." ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Prepare cleaning functions\n", "import re, string\n", "import nltk\n", "from nltk.stem import SnowballStemmer\n", "\n", "stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n", "\n", "stemmer = SnowballStemmer('english')\n", "\n", "def preProcessText(text):\n", "    # lowercase and strip leading/trailing white space\n", "    text = text.lower().strip()\n", "    \n", "    # remove HTML tags\n", "    text = re.compile('<.*?>').sub('', text)\n", "    \n", "    # remove punctuation\n", "    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n", "    \n", "    # remove extra white space\n", "    text = re.sub('\s+', ' ', text)\n", "    \n", "    return text\n", "\n", "def lexiconProcess(text, stop_words, stemmer):\n", "    filtered_sentence = []\n", "    words = text.split(\" \")\n", "    for w in words:\n", "        if w not in stop_words:\n", "            filtered_sentence.append(stemmer.stem(w))\n", "    text = \" \".join(filtered_sentence)\n", "    \n", "    return text\n", "\n", "def cleanSentence(text, stop_words, stemmer):\n", "    return lexiconProcess(preProcessText(text), stop_words, stemmer)" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Prepare the vectorizer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "textvectorizer = CountVectorizer(binary=True)  # can also limit vocabulary size here, with say, max_features=50" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n", "Vocabulary: \n", " {'like': 11, 'materi': 13, 'color': 4, 'overal': 19, 'how': 10, 'look': 12, 'work': 29, 'okay': 18, 'first': 7, 'two': 27, 'time': 26, 'use': 28, 'but': 3, 'third': 24, 'burn': 2, 'my': 15, 'face': 6, 'am': 1, 'not': 17, 'sure': 23, 'about': 0, 'product': 21, 'never': 16, 'thought': 25, 'would': 30, 'pay': 20, 'so': 22, 'much': 14, 'for': 8, 'hair': 9, 'dryer': 5}\n", "Bag of Words Binary Features: \n", " [[0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]\n", " [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0]\n", " [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0]\n", " [0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1]]\n", "(4, 31)\n" ] } ], "source": [ "# Clean and vectorize a text feature with four samples\n", "text_feature = [\"I liked the material, color and overall how it looks. <br /><br />\",\n", "                \"Worked okay first two times I used it, but third time burned my face.\",\n", "                \"I am not sure about this product.\",\n", "                \"I never thought I would pay so much for a hair dryer.\",\n", "               ]\n", "\n", "print(len(text_feature))\n", "\n", "# Clean up the text\n", "text_feature_cleaned = [cleanSentence(item, stop_words, stemmer) for item in text_feature]\n", "\n", "# Vectorize the cleaned text\n", "text_feature_vectorized = textvectorizer.fit_transform(text_feature_cleaned)\n", "print('Vocabulary: \\n', textvectorizer.vocabulary_)\n", "print('Bag of Words Binary Features: \\n', text_feature_vectorized.toarray())\n", "\n", "print(text_feature_vectorized.shape)" ] }
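, { "cell_type": "markdown", "metadata": {}, "source": [ "At prediction time we would clean any new text with the same functions and call __.transform()__ (not `fit_transform()`) on the already-fitted vectorizer, so the learned vocabulary stays fixed. Below is a minimal sketch with a made-up review:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: clean and vectorize a new (unseen) review with the already-fitted vectorizer\n", "# As in section 3, words outside the learned vocabulary are simply skipped\n", "new_reviews = [\"The hair dryer looks nice but burned my hair!\"]\n", "new_reviews_cleaned = [cleanSentence(item, stop_words, stemmer) for item in new_reviews]\n", "new_reviews_vectorized = textvectorizer.transform(new_reviews_cleaned)\n", "print(new_reviews_vectorized.toarray())" ] }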
], "metadata": { "kernelspec": { "display_name": "Python [conda env:conda_pytorch_p39] *", "language": "python", "name": "conda-env-conda_pytorch_p39-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 4 }