![MLU Logo](../../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 3</a>

## Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not

In this exercise, we will learn how to use Recurrent Neural Networks. 

We will follow these steps:
1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Exploratory data analysis</a>
3. <a href="#3">Train-validation dataset split</a>
4. <a href="#4">Text processing and Transformation</a>
5. <a href="#5">Generating data batch and iterator</a>
6. <a href="#6">Using pre-trained GloVe Word Embeddings</a>
7. <a href="#7">Setting Hyperparameters and Bulding the Network</a>
8. <a href="#8">Training the Network</a>
9. <a href="#9">Test the classifier on the validation data</a>
10. <a href="#10">Improvement ideas</a>

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Whether the review is positive or negative (1 or 0)

__Important note:__ One big distinction betweeen the regular neural networks and RNNs is that RNNs work with sequential data. In our case, RNNs will help us with the text field. If we also want to consider other fields such as time, log_votes, verified, etc. , we need to use the regular neural networks with the RNN network.

In [1]:
%pip install -q -r ../../requirements.txt

In [2]:
import time
import numpy as np
import torch, torchtext
import pandas as pd
from collections import Counter
from torch import nn, optim
from torch.nn import BCEWithLogitsLoss
from torch.utils.data import TensorDataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import GloVe
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

Let's read the dataset below and fill-in the reviewText field. We will use this field as input to our ML model.

In [3]:
import pandas as pd

df = pd.read_csv('../../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

Let's look at the first five rows in the dataset. As you can see the __log_votes__ field is numeric. That's why we will build a regression model.

In [4]:
df.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


## 2. <a name="2">Exploratory Data Analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the range and distribution of log_votes

In [5]:
df["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

We can check the number of missing values for each columm below.

In [6]:
print(df.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


We have missing values in our text fields. We can the rows with __reviewText__ field missing as we will use that field with the model aftwerwards.

In [7]:
df = df.dropna(subset=['reviewText'])

Let's check again.

In [8]:
print(df.isna().sum())

reviewText     0
summary       12
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


## 3. <a name="3">Train-validation split</a>
(<a href="#0">Go to top</a>)

Let's split the dataset into training and validation

In [9]:
# This separates 10% of the entire dataset into validation dataset.
train_text, val_text, train_label, val_label = \
    train_test_split(df["reviewText"].tolist(),
                     df["isPositive"].tolist(),
                     test_size=0.10,
                     shuffle=True,
                     random_state=324)

## 4. <a name="4">Text processing and Transformation</a>
(<a href="#0">Go to top</a>)

We will apply the following processes here:
1. Creating a vocabulary
2. Text transformation

__1. Creating a vocabulary:__ 

We will create a vocabulary with the tokens from the text data. We use a simple english tokenizer and use these tokens to create our vocabulary. In this vocabulary, tokens will map to unique ids, such as "car"->32, "house"->651, etc. 

In [10]:
tokenizer = get_tokenizer("basic_english")
counter = Counter()
for line in train_text:
    counter.update(tokenizer(line))
    
# Create a vocabulary with words seen at least 5 (min_freq) times
vocab = torchtext.vocab.vocab(counter, min_freq=5)

In [11]:
# Add the unknown token
# and use it by default for unknown words
unk_token = '<unk>'
vocab.insert_token(unk_token, 0)
vocab.set_default_index(0)

# Add the pad token
pad_token = '<pad>'
vocab.insert_token(pad_token, 1)

Here are some examples.

In [12]:
print(f"'home' -> {vocab['home']}")
print(f"'wash' -> {vocab['wash']}")
# unknown word (assume from test set)
print(f"'fhshbasdhb' -> {vocab['fhshbasdhb']}")

'home' -> 524
'wash' -> 13931
'fhshbasdhb' -> 0


__2. Text transformation:__ 

We will use the vocabulary and map tokens in the text to unique ids of the tokens. For example: `["this", "is", "a", "sentence"] -> [14, 12, 9, 2066]`.

In [13]:
# Let's create a mapper to transform our text data
text_transform_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]

Let's see some text before and after transformation.

In [14]:
print(f"Before transform:\t{train_text[37]}")
print(f"After transform:\t{text_transform_pipeline(train_text[37])}")

Before transform:	very easy to install. purchased for the XP factor and it works great. haven't tried all upgrades yet but plan to.
After transform:	[62, 382, 17, 388, 12, 178, 39, 13, 797, 863, 9, 45, 490, 2, 12, 311, 34, 174, 303, 412, 864, 399, 21, 391, 17, 12]


Let's create a function for this. In this function, we transform and pad (if necessary) our text data. We cut the series of words at the point where it reaches a certain lenght (we used `max_len=50` here). If the text is shorter than max_len, we `pad 1s` to the end (corresponding to the pad token).

In [15]:
def transformText(text_list, max_len):
    # Transform the text
    transformed_data = [text_transform_pipeline(text)[:max_len] for text in text_list]

    # Pad zeros if the text is shoter than max_len
    for data in transformed_data:
        data[len(data) : max_len] = np.ones(max_len - len(data))

    return torch.tensor(transformed_data, dtype=torch.int64)

In [16]:
train_text[129]

'I have been using Webroot for several years now. I like how easy it is to update my subscription, conduct my own scans if needed, and set up regular scans. There is a fantastic feature that checks websites while I am browsing the web using search engines. Webroot will color coat the security "safeness" for the website and place the appropriate color bubble with a check-mark or \'x\' to the left of the name. Webroot provides a detail explanation on why a particular website is deemed safe or unsafe and some of those settings can be changed to fit the user\'s preferences.\n\nThis particular purchase: "Webroot SecureAnywhere Internet Security Plus 3 Device Download" provides the download code once purchase has been completed. It\'s great because, like I said I\'ve been using Webroot for several years and already have the software on my computer. I didn\'t need to buy and wait for a disc to ship to me. This download enabled me to install if I needed too (which I didn\'t need) or I could pu

In [17]:
text = train_text[5:7]
print(f"Text: {text}\n")
print(f"Num sentences: {len(text)}\n")
tt = transformText(text, max_len=50)
print(f"Transformed text: \n{tt}\n")
print(f"Shape of transformed text: {tt.shape}")

Text: ["We have recently installed this software at work. Its terrible. It overreacts to everything. It has even blocked us out of Yahoo sports, claiming that it was a malicious website. We have also been locked out of news sites, health sites and other benign websites. My guess is that it is detecting website advertisements as viruses. I 'm going to advocate that my office remove this frustrating garbage. Even my supervisor is getting frustrated with this software..", "Don't have it"]

Num sentences: 2

Transformed text: 
tensor([[137,  56, 138, 139,  37,  64,  53, 140,  12, 141, 142,  12,  45,   0,
          17,  52,  12,  45, 143,  59, 144, 145, 108,   7, 146, 147,   4, 148,
          82,  45, 149,   5, 150, 151,  12, 137,  56,  94,  57, 152, 108,   7,
         153, 154,   4, 155, 154,   9, 156, 157],
        [173,  34, 174,  56,  45,   1,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   

## 5. <a name="5">Generating data batch and iterator</a>
(<a href="#0">Go to top</a>)

Let's use the transformText() function and create the data loaders. Here, we use __max_len=100__ to consider the first 100 words in the text.

In [18]:
max_len = 100
batch_size = 16

# Pass transformed and padded data to dataset
# Create data loaders
train_dataset = TensorDataset(
    transformText(train_text, max_len), torch.tensor(train_label)
)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

val_dataset = TensorDataset(transformText(val_text, max_len), torch.tensor(val_label))
val_loader = DataLoader(val_dataset, batch_size=batch_size)

## 6. <a name="6">Using pre-trained GloVe Word Embeddings</a>
(<a href="#0">Go to top</a>)

In this example, we will use GloVe word vectors. `name='6B'` `dim=300` gives us 6 billion words/phrases vectors. Each word vector has 300 numbers in it. The following code shows how to get the word vectors and create an embedding matrix from them. We will connect our vocabulary indexes to the GloVe embedding with the `get_vecs_by_tokens()` function.

In [19]:
glove = GloVe(name="6B", dim=300)
embedding_matrix = glove.get_vecs_by_tokens(vocab.get_itos())

## 7. <a name="7">Setting Hyperparameters and Bulding the Network</a>
(<a href="#0">Go to top</a>)

We will set our parameters like below.

In [20]:
# Size of the state vectors
hidden_size = 8

# General NN training parameters
learning_rate = 0.001
epochs = 25

# Embedding vector and vocabulary sizes
embed_size = 300  # glove.6B.300d.txt
vocab_size = len(vocab.get_itos())

We need to put our data into correct format before the process.
Our model is made of these layers:
* Embedding layer: This is where our words/tokens are mapped to word vectors.
* RNN layer: We are using a simple RNN model. We stack 2 RNN layers in this example. More details about the RNN are available [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).
* Linear layer: A linear layer with a single neuron is used to output the `isPositive` prediction.

In [21]:
class Net(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(
            embed_size, hidden_size, num_layers=num_layers
        )

        self.linear = nn.Linear(hidden_size*max_len, 1)
        self.act = nn.Sigmoid()

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        # Call RNN layer
        outputs, _ = self.rnn(embeddings)
        # Use the output of each time step
        # Send it all together to the linear layer
        outs = self.linear(outputs.reshape(outputs.shape[0], -1))
        return self.act(outs)
    
model = Net(vocab_size, embed_size, hidden_size, num_layers=2)

# Initialize the weights
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)
    if type(m) == nn.RNN:
        for param in m._flat_weights_names:
            if "weight" in param:
                nn.init.xavier_uniform_(m._parameters[param])

Let's initialize this network. Then, we will need to make the embedding layer use our GloVe word vectors.

In [22]:
# We set the embedding layer's parameters from GloVe
model.embedding.weight.data.copy_(embedding_matrix)
# We won't change/train the embedding layer
model.embedding.weight.requires_grad = False

## 8. <a name="8">Training the Network</a>
(<a href="#0">Go to top</a>)

Now, it is time to start our training. We define the loss function and training algorithm first. Then, training starts!

We will define the trainer and loss function below. 

__Binary cross-entropy loss__ is used as this is a binary classification problem.

$$
\mathrm{BinaryCrossEntropyLoss} = -\sum_{examples}{(y\log(p) + (1 - y)\log(1 - p))}
$$

In [23]:
# Setting our trainer
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# We will use Binary Cross-entropy loss
# reduction="sum" sums the losses for given output and target
cross_ent_loss = nn.BCELoss(reduction="sum")

Now, it is time to start the training process. We will print the Binary cross-entropy loss loss after each epoch.

Let's see some validation results below

In [24]:
# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.apply(init_weights)
model.to(device)

for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    # Training loop, train the network
    for data, target in train_loader:
        trainer.zero_grad()
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        L = cross_ent_loss(output.squeeze(1), target)
        training_loss += L.item()
        L.backward()
        trainer.step()

    # Validate the network, no training (no weight update)
    for data, target in val_loader:
        val_predictions = model(data.to(device))
        L = cross_ent_loss(val_predictions.squeeze(1), target.to(device))
        val_loss += L.item()

    # Let's take the average losses
    training_loss = training_loss / len(train_label)
    val_loss = val_loss / len(val_label)

    end = time.time()
    print(
        f"Epoch {epoch}. Train_loss {training_loss}. Val_loss {val_loss}. Seconds {end-start}"
    )

Epoch 0. Train_loss 0.565565640170492. Val_loss 0.5057228901229904. Seconds 14.46343994140625
Epoch 1. Train_loss 0.48193354848491365. Val_loss 0.4710737303131426. Seconds 14.414626359939575
Epoch 2. Train_loss 0.45539847189026195. Val_loss 0.4506763486389365. Seconds 14.43692421913147
Epoch 3. Train_loss 0.4381956004399991. Val_loss 0.43892509767985. Seconds 14.420343399047852
Epoch 4. Train_loss 0.42654436279580904. Val_loss 0.4318110432211272. Seconds 14.416456937789917
Epoch 5. Train_loss 0.4179300016421215. Val_loss 0.42749577591224713. Seconds 14.436838865280151
Epoch 6. Train_loss 0.41113823203363764. Val_loss 0.4250155760400856. Seconds 14.415632724761963
Epoch 7. Train_loss 0.40561350161243115. Val_loss 0.4235686122187377. Seconds 14.442252397537231
Epoch 8. Train_loss 0.400997344562602. Val_loss 0.4223748551281644. Seconds 14.412562131881714
Epoch 9. Train_loss 0.397065539376322. Val_loss 0.4210442323857741. Seconds 14.469841718673706
Epoch 10. Train_loss 0.39365788666023716.

## 9. <a name="9">Test the classifier on the validation data</a>
(<a href="#0">Go to top</a>)

Let's get the validation predictions. Earlier we made predictions on the validation set with this line: ```model(data.to(device))```.

In [25]:
val_predictions = []
for data, target in val_loader:
    val_preds = model(data.to(device))
    val_predictions.extend(
        [np.rint(val_pred)[0] for val_pred in val_preds.detach().cpu().numpy()]
    )
print(val_predictions[:10])

[0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]


Confusion matrix, classification report and accuracy score are printed below.

In [26]:
# Use the fitted pipeline to make predictions on the validation dataset
print(confusion_matrix(val_label, val_predictions))
print(classification_report(val_label, val_predictions))
print("Accuracy (validation):", accuracy_score(val_label, val_predictions))

[[2122  508]
 [ 706 3663]]
              precision    recall  f1-score   support

         0.0       0.75      0.81      0.78      2630
         1.0       0.88      0.84      0.86      4369

    accuracy                           0.83      6999
   macro avg       0.81      0.82      0.82      6999
weighted avg       0.83      0.83      0.83      6999

Accuracy (validation): 0.8265466495213601


## 10. <a name="10">Improvement ideas</a>
(<a href="#0">Go to top</a>)

We can improve our model by
* Changing hyper-parameters: Learning rate, batch size and hidden size
* Increasing the number of layers: num_layers
* Using more advanced architetures such as Gated Recurrent Units (GRU) and Long Short-term Memory networks (LSTM).