![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 2</a>


## Text Preprocessing

In this notebook, we explore techniques to clean text features and convert them into numerical features that machine learning algorithms can work with. 

1. <a href="#1">Common text pre-processing</a>
2. <a href="#2">Lexicon-based text processing</a>
3. <a href="#3">Feature Extraction - Bag of Words</a>
4. <a href="#4">Putting it all together</a>



In [1]:
%pip install -q -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


## 1. <a name="1">Common text pre-processing</a>
(<a href="#0">Go to top</a>)

In this section, we will do some general purpose text cleaning.

In [2]:
text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "

Let's first lowercase our text. 

In [3]:
text = text.lower()
print(text)

   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


We can get rid of leading/trailing whitespace with the following:

In [4]:
text = text.strip()
print(text)

this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .


Remove HTML tags/markups:

In [5]:
import re

text = re.compile('<.*?>').sub('', text)
print(text)

this is a message to be cleaned. it may involve some things like: , ?, :, ''  adjacent spaces and tabs     .


Replace punctuation with spaces:

In [6]:
import re, string

text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
print(text)

this is a message to be cleaned  it may involve some things like              adjacent spaces and tabs      


Collapse runs of extra spaces and tabs into a single space:

In [7]:
import re

# Raw string, so that \s is passed to the regex engine unchanged
text = re.sub(r'\s+', ' ', text)
print(text)

this is a message to be cleaned it may involve some things like adjacent spaces and tabs 


## 2. <a name="2">Lexicon-based text processing</a>
(<a href="#0">Go to top</a>)

In section 1, we saw some general-purpose text pre-processing methods. Lexicon-based methods are usually used __to normalize sentences in our dataset__, and later, in section 3, we will use these normalized sentences for feature extraction. <br/>
By normalization, here, __we mean putting the words in the sentences into a similar format that will enhance similarities (if any) between sentences__. 

__Stop word removal:__ Some words occur very frequently in our sentences and don't contribute much to their overall meaning. We usually keep a list of these words and remove them from each of our sentences. For example: "a", "an", "the", "this", "that", "is"

In [8]:
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

filtered_sentence = []
words = text.split(" ")
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
text = " ".join(filtered_sentence)

In [9]:
print(text)

message be cleaned may involve some things like adjacent spaces tabs 


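__Stemming:__ Stemming is a rule-based method to __convert words into their root form__ by removing suffixes. <br/>
The resulting stem is not always a dictionary word (for example, "message" becomes "messag"), but mapping related words to the same stem helps us enhance similarities (if any) between sentences. 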

Example:

"jumping", "jumped" -> "jump"

"cars" -> "car"

In [10]:
# We use the NLTK library
import nltk
from nltk.stem import SnowballStemmer

# Initialize the stemmer
snow = SnowballStemmer('english')

stemmed_sentence = []
words = text.split(" ")
for w in words:
    stemmed_sentence.append(snow.stem(w))
text = " ".join(stemmed_sentence)

In [11]:
print(text)

messag be clean may involv some thing like adjac space tab 
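
The suffix-removal idea behind stemming can be illustrated with a toy sketch. This is not the Snowball algorithm used above: the suffix list, the minimum stem length, and the name `toy_stem` are assumptions made only for this illustration, and real stemmers apply many more rules.

```python
# Toy suffix-stripping stemmer (illustrative only; NOT the Snowball algorithm).
# The suffix list and the minimum stem length are assumptions for this sketch.
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3

def toy_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

toy_stems = {w: toy_stem(w) for w in ["jumping", "jumped", "cars", "is"]}
print(toy_stems)
# {'jumping': 'jump', 'jumped': 'jump', 'cars': 'car', 'is': 'is'}
```

The minimum-length check keeps short words such as "is" from being reduced to a single letter.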


## 3. <a name="3">Feature Extraction - Bag of Words</a>
(<a href="#0">Go to top</a>)

In this section, we assume that the common and lexicon-based pre-processing steps have already been applied to our text. We then convert the text data into numerical data with the __Bag of Words (BoW)__ representation. 

__Bag of Words (BoW)__: A modeling technique to convert text information into numerical representation. <br/>
__Machine learning models expect numerical or categorical values as input and won't work with raw text data__. 

Steps:
1. Create vocabulary of known words
2. Measure presence of the known words in sentences
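
The two steps above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the helper names (`tokenize`, `to_binary_vector`) are hypothetical, and the tokenization rule (lowercase words of two or more characters) is an assumption chosen for this sketch.

```python
import re

def tokenize(sentence):
    # Assumption for this sketch: lowercase, keep words of 2+ characters
    return re.findall(r"\b\w\w+\b", sentence.lower())

toy_sentences = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Step 1: create the vocabulary of known words (alphabetical order -> index)
all_tokens = {tok for s in toy_sentences for tok in tokenize(s)}
toy_vocabulary = {tok: idx for idx, tok in enumerate(sorted(all_tokens))}

# Step 2: measure the presence (1) or absence (0) of each known word
def to_binary_vector(sentence, vocabulary):
    vector = [0] * len(vocabulary)
    for tok in tokenize(sentence):
        if tok in vocabulary:          # words outside the vocabulary are skipped
            vector[vocabulary[tok]] = 1
    return vector

toy_vectors = [to_binary_vector(s, toy_vocabulary) for s in toy_sentences]
print(toy_vocabulary)
for row in toy_vectors:
    print(row)
```

Each sentence becomes a vector with one position per vocabulary word, so every vector has the same length regardless of sentence length.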

Let's see an interactive example for ourselves:

In [1]:
from mluvisuals import *

BagOfWords()

We will use the sklearn library's Bag of Words implementation:

`from sklearn.feature_extraction.text import CountVectorizer`

`countVectorizer = CountVectorizer(binary=True)`

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer(binary=True)

sentences = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?'
]
X = countVectorizer.fit_transform(sentences)

Let's print the vocabulary below. <br/>
The number next to each word is its index in the vocabulary (from 0 to 8 here).<br/>
The indices follow alphabetical order: and:0, document:1, first:2, ...

In [13]:
print(countVectorizer.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


__Note:__ With its default settings, sklearn's `CountVectorizer` lowercases the text and ignores punctuation, but it doesn't apply the other pre-processing steps we discussed here (such as removing HTML tags). <br/>
Lexicon-based methods are also not applied automatically; we need to call those methods before feature extraction.
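
A small check of this behavior, as a sketch: the input string and the variable names below are made up for illustration, and the vectorizer is used with its default tokenization settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example containing capital letters, punctuation, an HTML tag,
# and two inflections of the same verb
raw_example = ["Jumping, jumped!<br>"]

default_vectorizer = CountVectorizer(binary=True)
default_vectorizer.fit(raw_example)

default_tokens = sorted(default_vectorizer.vocabulary_)
print(default_tokens)
# ['br', 'jumped', 'jumping']
```

The tokens are lowercased and the punctuation is gone, but the tag name `br` survives as a word and "jumping" and "jumped" remain separate entries, which is why the cleaning and stemming steps need to run first.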

In [14]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
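
Note that the second row holds a 1 for "second" even though the word appears twice in that sentence: with `binary=True`, only presence is recorded. The sketch below contrasts this with the default counting behavior (`binary=False`); the variable names are chosen for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_sentence = ["This is the second second document."]

binary_vectorizer = CountVectorizer(binary=True)  # presence/absence only
count_vectorizer = CountVectorizer()              # default: raw occurrence counts

binary_row = binary_vectorizer.fit_transform(demo_sentence).toarray()[0]
count_row = count_vectorizer.fit_transform(demo_sentence).toarray()[0]

# Look up the column that corresponds to the word "second"
second_idx = count_vectorizer.vocabulary_["second"]
print(binary_row[second_idx], count_row[second_idx])
# 1 2
```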


__What happens when we encounter a new word during prediction?__ 

__New words will be skipped__. <br/>
This usually happens when we are making predictions. For our test and validation data/text, we need to use the __.transform()__ function this time. <br/>
This simulates a real-time prediction case where we cannot re-train the model quickly whenever we receive new words.

In [15]:
test_sentences = ["this document has some new words",
                 "this one is new too"]

count_vectors = countVectorizer.transform(test_sentences)
print(count_vectors.toarray())

[[0 1 0 0 0 0 0 0 1]
 [0 0 0 1 1 0 0 0 1]]


Note that these last two vectors have the same length of 9 (the same vocabulary) as the ones before.
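
To see why `.transform()` is used here rather than fitting again, the sketch below compares the two on made-up sentences (the documents and variable names are assumptions for illustration). Re-fitting on new text builds a different vocabulary, so the columns would no longer line up with what a model was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat", "the dog sat"]
new_docs = ["the bird sat down quietly"]

# Fit once on the training text, then reuse the same vocabulary
fitted_vectorizer = CountVectorizer(binary=True)
fitted_vectorizer.fit(train_docs)
reused = fitted_vectorizer.transform(new_docs).toarray()

# Fitting again on the new text creates a new, incompatible vocabulary
refit_vectorizer = CountVectorizer(binary=True)
refit = refit_vectorizer.fit_transform(new_docs).toarray()

print(sorted(fitted_vectorizer.vocabulary_), reused)
# ['cat', 'dog', 'sat', 'the'] [[0 0 1 1]]
print(sorted(refit_vectorizer.vocabulary_), refit)
# ['bird', 'down', 'quietly', 'sat', 'the'] [[1 1 1 1 1]]
```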

## 4. <a name="4">Putting it all together</a>
(<a href="#0">Go to top</a>)

Let's have a full example here. We will apply everything discussed in this notebook.

In [16]:
# Prepare cleaning functions
import re, string
import nltk
from nltk.stem import SnowballStemmer

stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

stemmer = SnowballStemmer('english')

def preProcessText(text):
    # lowercase and strip leading/trailing white space
    text = text.lower().strip()
    
    # remove HTML tags
    text = re.compile('<.*?>').sub('', text)
    
    # remove punctuation
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    
    # remove extra white space
    text = re.sub(r'\s+', ' ', text)
    
    return text

def lexiconProcess(text, stop_words, stemmer):
    filtered_sentence = []
    words = text.split(" ")
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(stemmer.stem(w))
    text = " ".join(filtered_sentence)
    
    return text

def cleanSentence(text, stop_words, stemmer):
    return lexiconProcess(preProcessText(text), stop_words, stemmer)

In [17]:
# Prepare vectorizer 
from sklearn.feature_extraction.text import CountVectorizer

textvectorizer = CountVectorizer(binary=True)  # can also limit the vocabulary size here with, say, max_features=50
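
As a brief sketch of the `max_features` option mentioned in the comment above: the toy reviews and variable names below are made up for illustration. With `max_features=2`, only the two most frequent words across the corpus are kept in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "good" appears in 3 documents, "product" in 2, the rest in 1
toy_reviews = ["good product", "good price", "good product quality"]

limited_vectorizer = CountVectorizer(binary=True, max_features=2)
limited_features = limited_vectorizer.fit_transform(toy_reviews)

print(sorted(limited_vectorizer.vocabulary_))
# ['good', 'product']
print(limited_features.shape)
# (3, 2)
```

Limiting the vocabulary keeps the feature matrix small, at the cost of discarding rarer words.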

In [18]:
# Clean and vectorize a text feature with four samples
text_feature = ["I liked the material, color and overall how it looks.<br /><br />",
             "Worked okay first two times I used it, but third time burned my face.",
             "I am not sure about this product.",
             "I never thought I would pay so much for a hair dryer.",
            ]

print(len(text_feature))

# Clean up the text
text_feature_cleaned = [cleanSentence(item, stop_words, stemmer) for item in text_feature]

# Vectorize the cleaned text
text_feature_vectorized = textvectorizer.fit_transform(text_feature_cleaned)
print('Vocabulary: \n', textvectorizer.vocabulary_)
print('Bag of Words Binary Features: \n', text_feature_vectorized.toarray())

print(text_feature_vectorized.shape)

4
Vocabulary: 
 {'like': 11, 'materi': 13, 'color': 4, 'overal': 19, 'how': 10, 'look': 12, 'work': 29, 'okay': 18, 'first': 7, 'two': 27, 'time': 26, 'use': 28, 'but': 3, 'third': 24, 'burn': 2, 'my': 15, 'face': 6, 'am': 1, 'not': 17, 'sure': 23, 'about': 0, 'product': 21, 'never': 16, 'thought': 25, 'would': 30, 'pay': 20, 'so': 22, 'much': 14, 'for': 8, 'hair': 9, 'dryer': 5}
Bag of Words Binary Features: 
 [[0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0]
 [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1]]
(4, 31)
