![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 1</a>

## Bag of Words Method

In this notebook, we go over the Bag of Words (BoW) method, which converts text data into numerical values that are later used for predictions with machine learning algorithms.

To convert text data to vectors of numbers, a vocabulary of known words (tokens) is extracted from the text, the occurrence of each word is scored, and the resulting numerical values are saved in vectors that are as long as the vocabulary.
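The three steps described above (extract a vocabulary, score word occurrences, store the scores in vocabulary-long vectors) can be sketched in plain Python. This is a minimal illustration, not Sklearn's implementation: the two example documents, the `tokenize` helper, and the `bow_counts` helper are assumptions made up for this sketch, and the regular expression is only modeled on the idea of keeping lowercase tokens of two or more characters.

```python
import re

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Made-up example documents (not from the notebook)
docs = ["the cat sat", "the cat saw the dog"]

# Step 1: vocabulary = sorted unique tokens, each mapped to a column index
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

def bow_counts(text, vocab):
    # Steps 2 and 3: count each known token into a vocabulary-long vector
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

vectors = [bow_counts(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each vector has one slot per vocabulary entry, so two documents of different lengths end up with vectors of the same size.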



In [1]:
from mluvisuals import *

BagOfWords()

There are a few versions of BoW, corresponding to different word-scoring methods. We use the Sklearn library to calculate the BoW numerical values using:

1. <a href="#1">Binary</a>
2. <a href="#2">Word Counts</a>
3. <a href="#3">Term Frequencies</a>
4. <a href="#4">Term Frequency-Inverse Document Frequencies</a>


In [1]:
%pip install -q -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


## 1. <a name="1">Binary</a>
(<a href="#0">Go to top</a>)

Let's calculate the first type of BoW, recording whether the word is in the sentence or not. We will also go over some useful features of Sklearn's vectorizers here.


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

# Initialize the count vectorizer with the parameter: binary=True
binary_vectorizer = CountVectorizer(binary=True)

# fit_transform() function fits the text data and gets the binary BoW vectors
x = binary_vectorizer.fit_transform(sentences)

As the vocabulary size grows, the BoW vectors also become very long. They usually consist of many zeros and very few non-zero values, so Sklearn stores them in a compressed (sparse) form. If we want to use them as Numpy arrays, we call the __toarray()__ function. Here are our binary BoW features. Each row corresponds to a single document and each column to a word in the vocabulary.

In [3]:
x.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])
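To get a feel for why a compressed form helps, the sketch below stores only the non-zero entries of the matrix above in a dictionary keyed by (row, column). This coordinate dictionary is a deliberate simplification chosen for illustration; it is not the storage layout Sklearn or SciPy actually use, and the `to_dense` helper is a hypothetical name.

```python
# Dense binary BoW matrix copied from the output above
dense = [[0, 1, 1, 1, 0, 0, 1, 0, 1],
         [0, 1, 0, 1, 0, 1, 1, 0, 1],
         [1, 0, 0, 1, 1, 0, 1, 1, 1]]

# Keep only the non-zero cells, keyed by their (row, column) position
sparse = {(r, c): v
          for r, row in enumerate(dense)
          for c, v in enumerate(row)
          if v != 0}

def to_dense(sparse, n_rows, n_cols):
    # Hypothetical helper: rebuild the full matrix, filling missing cells with 0
    return [[sparse.get((r, c), 0) for c in range(n_cols)] for r in range(n_rows)]

n_cells = len(dense) * len(dense[0])
print(f"stored {len(sparse)} of {n_cells} cells")
restored = to_dense(sparse, len(dense), len(dense[0]))
```

With only three short documents the saving is small, but with a vocabulary of tens of thousands of words almost every cell is zero and only the few non-zero ones need to be kept.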

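Let's check out our vocabulary. We can use the __vocabulary___ attribute. This returns a dictionary with each word as key and its column index as value. Notice that the indices follow the alphabetical order of the words, even though the dictionary itself is printed in the order the words were first seen.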

In [4]:
binary_vectorizer.vocabulary_

{'this': 8,
 'document': 1,
 'is': 3,
 'the': 6,
 'first': 2,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}
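To see the alphabetical assignment of indices more directly, we can sort the dictionary entries by index. The sketch below copies the dictionary printed above into a plain Python variable (the names `vocabulary` and `ordered_terms` are just for this illustration) and does not require Sklearn.

```python
# Vocabulary dictionary copied from the output above
vocabulary = {'this': 8, 'document': 1, 'is': 3, 'the': 6, 'first': 2,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}

# Sort the (word, index) pairs by index to recover the column order
ordered_terms = [term for term, idx in sorted(vocabulary.items(), key=lambda kv: kv[1])]

for idx, term in enumerate(ordered_terms):
    print(idx, term)
```

Sorting by index gives the same sequence as sorting the words alphabetically, which is what determines the column order of the BoW matrix.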

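Similar information is available through the __get_feature_names()__ function. The position of each term in the returned list corresponds to the column position of that term in the BoW matrix. Note that this function was deprecated in scikit-learn 1.0 and removed in version 1.2, where __get_feature_names_out()__ takes its place.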

In [5]:
print(binary_vectorizer.get_feature_names())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']




How can we calculate BoW for a new text? We use the __transform()__ function this time. As you can see below, this doesn't change the vocabulary: words that were not seen during fitting are simply skipped.

In [6]:
new_sentence = ["This is the new sentence"]

new_vectors = binary_vectorizer.transform(new_sentence)

In [7]:
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])
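The skipping behavior can be reproduced by hand with a fixed vocabulary. The following is a rough sketch under stated assumptions: `binary_transform` is a hypothetical helper, and the tokenization (lowercasing, tokens of two or more word characters) is a simplification rather than a copy of Sklearn's internals.

```python
import re

# Vocabulary learned from the three training sentences (copied from above)
vocabulary = {'this': 8, 'document': 1, 'is': 3, 'the': 6, 'first': 2,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}

def binary_transform(text, vocabulary):
    # Hypothetical helper: mark known words with 1, collect unknown words separately
    vec = [0] * len(vocabulary)
    skipped = []
    for tok in re.findall(r"\b\w\w+\b", text.lower()):
        if tok in vocabulary:
            vec[vocabulary[tok]] = 1
        else:
            skipped.append(tok)
    return vec, skipped

new_vector, skipped_words = binary_transform("This is the new sentence", vocabulary)
print(new_vector)
print(skipped_words)
```

The words "new" and "sentence" have no column in the fitted vocabulary, so they leave no trace in the resulting vector.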

## 2. <a name="2">Word Counts</a>
(<a href="#0">Go to top</a>)

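Word counts can be calculated with the same __CountVectorizer()__ function, this time __without__ the __binary__ parameter. Each entry then records how many times the word occurs in the document rather than only whether it occurs.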



In [8]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This document is the first document", "This document is the second document", "and this is the third one"]

# Initialize the count vectorizer
count_vectorizer = CountVectorizer()

xc = count_vectorizer.fit_transform(sentences)

xc.toarray()

array([[0, 2, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])
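The relationship between the two representations is simple: the binary vectors are the count vectors with every non-zero entry clipped to 1. A small check in plain Python, using the count matrix printed above (variable names are chosen for this sketch only):

```python
# Word-count matrix copied from the output above
counts = [[0, 2, 1, 1, 0, 0, 1, 0, 1],
          [0, 2, 0, 1, 0, 1, 1, 0, 1],
          [1, 0, 0, 1, 1, 0, 1, 1, 1]]

# Clip every positive count to 1 to obtain the binary representation
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

for row in binary:
    print(row)
```

Only the "document" column differs between the two matrices here, because it is the only word that appears more than once in a sentence.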

In [9]:
new_sentence = ["This is the new sentence"]
new_vectors = count_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])

## 3. <a name="3">Term Frequency (TF)</a>
(<a href="#0">Go to top</a>)

Term Frequency (TF) vectors, which show how important words are to the documents, are computed using

$$tf(term, doc) = \frac{number\, of\, times\, the\, term\, occurs\, in\, the\, doc}{total\, number\, of\, terms\, in\, the\, doc}$$

In sklearn, we use the __TfidfVectorizer()__ function with the parameter __use_idf=False__, which additionally *automatically normalizes the term frequency vectors by their Euclidean ($l2$) norm*. As a result, each row of the output has unit Euclidean length rather than entries that sum to 1.


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(use_idf=False)

x = tf_vectorizer.fit_transform(sentences)

x.toarray()

array([[0.        , 0.70710678, 0.35355339, 0.35355339, 0.        ,
        0.        , 0.35355339, 0.        , 0.35355339],
       [0.        , 0.70710678, 0.        , 0.35355339, 0.        ,
        0.35355339, 0.35355339, 0.        , 0.35355339],
       [0.40824829, 0.        , 0.        , 0.40824829, 0.40824829,
        0.        , 0.40824829, 0.40824829, 0.40824829]])
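The first row of this output can be reproduced by hand. The sketch below starts from the word counts of the first sentence, applies the TF formula, and then divides by the Euclidean norm. It also shows that normalizing the raw counts directly gives the same result, since dividing by the document length only rescales the vector. Variable names are illustrative.

```python
import math

# Word counts of "This document is the first document" (first row of the count matrix)
counts = [0, 2, 1, 1, 0, 0, 1, 0, 1]

# Term frequency: count divided by the total number of terms in the document
total_terms = sum(counts)
tf = [c / total_terms for c in counts]

# Normalize the TF vector by its Euclidean (l2) norm
tf_norm = math.sqrt(sum(v * v for v in tf))
tf_l2 = [v / tf_norm for v in tf]

# Normalizing the raw counts directly leads to the same vector
counts_norm = math.sqrt(sum(c * c for c in counts))
counts_l2 = [c / counts_norm for c in counts]

print([round(v, 8) for v in tf_l2])
```

Here the norm of the count vector is $\sqrt{2^2 + 1 + 1 + 1 + 1} = \sqrt{8}$, so "document" gets $2/\sqrt{8} \approx 0.7071$ and every other present word gets $1/\sqrt{8} \approx 0.3536$.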

In [11]:
new_sentence = ["This is the new sentence"]
new_vectors = tf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.57735027, 0.        ,
        0.        , 0.57735027, 0.        , 0.57735027]])

## 4. <a name="4">Term Frequency Inverse Document Frequency (TF-IDF)</a>
(<a href="#0">Go to top</a>)

Term Frequency Inverse Document Frequency (TF-IDF) vectors are computed using the __TfidfVectorizer()__ function with the parameter __use_idf=True__. We can also skip this parameter as it is already __True__ by default.


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True)

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

xf = tfidf_vectorizer.fit_transform(sentences)

xf.toarray()

array([[0.        , 0.7284449 , 0.47890875, 0.28285122, 0.        ,
        0.        , 0.28285122, 0.        , 0.28285122],
       [0.        , 0.7284449 , 0.        , 0.28285122, 0.        ,
        0.47890875, 0.28285122, 0.        , 0.28285122],
       [0.49711994, 0.        , 0.        , 0.29360705, 0.49711994,
        0.        , 0.29360705, 0.49711994, 0.29360705]])

In [13]:
new_sentence = ["This is the new sentence"]
new_vectors = tfidf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.57735027, 0.        ,
        0.        , 0.57735027, 0.        , 0.57735027]])

__Note 1__: In addition to *automatically normalizing the term frequency vectors by their Euclidean ($l2$) norm*, sklearn also uses a *smoothed version of idf*, computing

$$idf(term) = \ln \Big( \frac{n_{documents} +1}{n_{documents\,containing\,the\,term}+1}\Big) + 1$$

In [14]:
tfidf_vectorizer.idf_

array([1.69314718, 1.28768207, 1.69314718, 1.        , 1.69314718,
       1.69314718, 1.        , 1.69314718, 1.        ])
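These values, and the TF-IDF vectors shown earlier, can be checked by hand. The sketch below applies the smoothed idf formula to the document frequencies of our nine vocabulary words, then multiplies the word counts of the first sentence by the idf values and normalizes by the $l2$ norm. The document frequencies are counted manually from the three sentences, and the variable names are illustrative.

```python
import math

n_documents = 3

# Number of sentences containing each term, in column order:
# and, document, first, is, one, second, the, third, this
doc_freq = [1, 2, 1, 3, 1, 1, 3, 1, 3]

# Smoothed idf: ln((n + 1) / (df + 1)) + 1
idf = [math.log((n_documents + 1) / (df + 1)) + 1 for df in doc_freq]

# Word counts of "This document is the first document"
counts = [0, 2, 1, 1, 0, 0, 1, 0, 1]

# Weight the counts by idf, then normalize by the Euclidean (l2) norm
weighted = [c * w for c, w in zip(counts, idf)]
norm = math.sqrt(sum(v * v for v in weighted))
tfidf_row = [v / norm for v in weighted]

print([round(w, 8) for w in idf])
print([round(v, 8) for v in tfidf_row])
```

Words that occur in every sentence ("is", "the", "this") get the minimum idf of 1, while words found in a single sentence get the largest weight, which is why "first" outweighs "is" in the first row even though both occur once.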

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

tfidf_vectorizer = TfidfVectorizer()
xf = tfidf_vectorizer.fit_transform(sentences)
xf.toarray()

array([[0.        , 0.7284449 , 0.47890875, 0.28285122, 0.        ,
        0.        , 0.28285122, 0.        , 0.28285122],
       [0.        , 0.7284449 , 0.        , 0.28285122, 0.        ,
        0.47890875, 0.28285122, 0.        , 0.28285122],
       [0.49711994, 0.        , 0.        , 0.29360705, 0.49711994,
        0.        , 0.29360705, 0.49711994, 0.29360705]])