{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning Accelerator - Tabular Data - Lecture 2\n",
"\n",
"\n",
"## Text Preprocessing\n",
"\n",
"In this notebok we explore techniques to clean and convert text features into numerical features that machine learning algoritms can work with. \n",
"\n",
"1. Common text pre-processing\n",
"2. Lexicon-based text processing\n",
"3. Feature Extraction - Bag of Words\n",
"4. Putting it all together\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARNING: You are using pip version 21.3.1; however, version 22.3.1 is available.\n",
"You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python -m pip install --upgrade pip' command.\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -q -r ../requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Common text pre-processing\n",
"(Go to top)\n",
"\n",
"In this section, we will do some general purpose text cleaning."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"text = \" This is a message to be cleaned. It may involve some things like:
, ?, :, '' adjacent spaces and tabs . \""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first lowercase our text. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" this is a message to be cleaned. it may involve some things like:
, ?, :, '' adjacent spaces and tabs . \n"
]
}
],
"source": [
"text = text.lower()\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get rid of leading/trailing whitespace with the following:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"this is a message to be cleaned. it may involve some things like:
, ?, :, '' adjacent spaces and tabs .\n"
]
}
],
"source": [
"text = text.strip()\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove HTML tags/markups:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"this is a message to be cleaned. it may involve some things like: , ?, :, '' adjacent spaces and tabs .\n"
]
}
],
"source": [
"import re\n",
"\n",
"text = re.compile('<.*?>').sub('', text)\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace punctuation with space"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n"
]
}
],
"source": [
"import re, string\n",
"\n",
"text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove extra space and tabs"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n"
]
}
],
"source": [
"import re\n",
"\n",
"text = re.sub('\\s+', ' ', text)\n",
"print(text)"
]
},
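{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, here is a minimal sketch that chains the steps above into a single helper function. The name `clean_text` is our own choice for illustration; the notebook builds a fuller pipeline in section 4."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re, string\n",
"\n",
"def clean_text(text):\n",
"    # Illustrative helper (not part of the original lesson flow):\n",
"    # chains the common pre-processing steps from this section in order\n",
"    text = text.lower()  # lowercase\n",
"    text = text.strip()  # remove leading/trailing whitespace\n",
"    text = re.compile('<.*?>').sub('', text)  # remove HTML tags/markups\n",
"    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # punctuation -> space\n",
"    text = re.sub(r'\\s+', ' ', text)  # collapse extra spaces and tabs\n",
"    return text\n",
"\n",
"print(clean_text(\"  Another <br> example, with: punctuation.  \"))"
]
},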
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Lexicon-based text processing\n",
"(Go to top)\n",
"\n",
"In section 1, we saw some general purpose text pre-processing methods. Lexicon based methods are usually used __to normalize sentences in our dataset__ and later in section 3, we will use these normalized sentences for feature extraction.
\n",
"By normalization, here, __we mean putting words in the sentences into a similar format that will enhance similarities (if any) between sentences__. \n",
"\n",
"__Stop word removal:__ There can be some words in our sentences that occur very frequently and don't contribute too much to the overall meaning of the sentences. We usually have a list of these words and remove them from each our sentences. For example: \"a\", \"an\", \"the\", \"this\", \"that\", \"is\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n",
"\n",
"filtered_sentence = []\n",
"words = text.split(\" \")\n",
"for w in words:\n",
" if w not in stop_words:\n",
" filtered_sentence.append(w)\n",
"text = \" \".join(filtered_sentence)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"message be cleaned may involve some things like adjacent spaces tabs \n"
]
}
],
"source": [
"print(text)"
]
},
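{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hand-written stop word list above is enough for this example. In practice, a curated list is often used instead; as a sketch, NLTK ships one (assuming the `stopwords` corpus can be downloaded in your environment):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"\n",
"# One-time download of NLTK's curated stop word list\n",
"nltk.download('stopwords', quiet=True)\n",
"from nltk.corpus import stopwords\n",
"\n",
"nltk_stop_words = set(stopwords.words('english'))\n",
"print(len(nltk_stop_words), 'English stop words, e.g.:', sorted(nltk_stop_words)[:5])"
]
},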
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Stemming:__ Stemming is a rule-based system to __convert words into their root form__.
\n",
"It removes suffixes from words. This helps us enhace similarities (if any) between sentences. \n",
"\n",
"Example:\n",
"\n",
"\"jumping\", \"jumped\" -> \"jump\"\n",
"\n",
"\"cars\" -> \"car\""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# We use the NLTK library\n",
"import nltk\n",
"from nltk.stem import SnowballStemmer\n",
"\n",
"# Initialize the stemmer\n",
"snow = SnowballStemmer('english')\n",
"\n",
"stemmed_sentence = []\n",
"words = text.split(\" \")\n",
"for w in words:\n",
" stemmed_sentence.append(snow.stem(w))\n",
"text = \" \".join(stemmed_sentence)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"messag be clean may involv some thing like adjac space tab \n"
]
}
],
"source": [
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Feature Extraction - Bag of Words\n",
"(Go to top)\n",
"\n",
"In this section, we assume we will first apply the common and lexicon based pre-processing to our text. After those, we will convert our text data into numerical data with the __Bag of Words (BoW)__ representation. \n",
"\n",
"__Bag of Words (BoW)__: A modeling technique to convert text information into numerical representation.
\n",
"__Machine learning models expect numerical or categorical values as input and won't work with raw text data__. \n",
"\n",
"Steps:\n",
"1. Create vocabulary of known words\n",
"2. Measure presence of the known words in sentences\n",
"\n",
"Let's seen an interactive example for ourselves:"
]
},
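{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal BoW sketch with scikit-learn's `CountVectorizer` (assuming scikit-learn is installed via the requirements file; on versions older than 1.0, `get_feature_names_out()` was called `get_feature_names()`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Two toy sentences, already pre-processed as in sections 1 and 2\n",
"sentences = ['messag be clean', 'clean space tab']\n",
"\n",
"# Step 1: build the vocabulary of known words\n",
"# Step 2: count each known word's occurrences per sentence\n",
"vectorizer = CountVectorizer()\n",
"X = vectorizer.fit_transform(sentences)\n",
"\n",
"print(vectorizer.get_feature_names_out())  # the vocabulary\n",
"print(X.toarray())  # word counts per sentence"
]
},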
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
"