{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 3\n", "\n", "## Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not\n", "\n", "In this exercise, we will learn how to use Recurrent Neural Networks. \n", "\n", "We will follow these steps:\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Train-validation dataset split\n", "4. Text processing and transformation\n", "5. Using GloVe Word Embeddings\n", "6. Training and validating model\n", "7. Improvement ideas\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n", "\n", "__Important note:__ One big distinction between regular neural networks and RNNs is that RNNs work with sequential data. In our case, the RNN will help us with the text field. If we also want to consider other fields such as time, log_votes, verified, etc., we need to combine a regular neural network with the RNN." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import re\n", "import numpy as np\n", "import mxnet as mx\n", "from mxnet import gluon, nd, autograd\n", "from mxnet.gluon import nn, rnn, Trainer\n", "from mxnet.gluon.loss import SigmoidBinaryCrossEntropyLoss \n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the dataset below. We will use the reviewText field as input to our ML model." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five rows in the dataset. The __isPositive__ field is binary (1 or 0), so we will build a binary classification model." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>reviewText</th><th>summary</th><th>verified</th><th>time</th><th>log_votes</th><th>isPositive</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO...</td><td>IDEAL FOR BEGINNER!</td><td>True</td><td>1361836800</td><td>0.000000</td><td>1.0</td></tr>\n",
"    <tr><th>1</th><td>unable to open or use</td><td>Two Stars</td><td>True</td><td>1452643200</td><td>0.000000</td><td>0.0</td></tr>\n",
"    <tr><th>2</th><td>Waste of money!!! It wouldn't load to my system.</td><td>Dont buy it!</td><td>True</td><td>1433289600</td><td>0.000000</td><td>0.0</td></tr>\n",
"    <tr><th>3</th><td>I attempted to install this OS on two differen...</td><td>I attempted to install this OS on two differen...</td><td>True</td><td>1518912000</td><td>0.000000</td><td>0.0</td></tr>\n",
"    <tr><th>4</th><td>I've spent 14 fruitless hours over the past tw...</td><td>Do NOT Download.</td><td>True</td><td>1441929600</td><td>1.098612</td><td>0.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " reviewText \\\n", "0 PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO... \n", "1 unable to open or use \n", "2 Waste of money!!! It wouldn't load to my system. \n", "3 I attempted to install this OS on two differen... \n", "4 I've spent 14 fruitless hours over the past tw... \n", "\n", " summary verified time \\\n", "0 IDEAL FOR BEGINNER! True 1361836800 \n", "1 Two Stars True 1452643200 \n", "2 Dont buy it! True 1433289600 \n", "3 I attempted to install this OS on two differen... True 1518912000 \n", "4 Do NOT Download. True 1441929600 \n", "\n", " log_votes isPositive \n", "0 0.000000 1.0 \n", "1 0.000000 0.0 \n", "2 0.000000 0.0 \n", "3 0.000000 0.0 \n", "4 1.098612 0.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Exploratory Data Analysis\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the distribution of the target field, isPositive." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 43692\n", "0.0 26308\n", "Name: isPositive, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"isPositive\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the number of missing values for each column below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reviewText 11\n", "summary 14\n", "verified 0\n", "time 0\n", "log_votes 0\n", "isPositive 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isna().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have missing values in our text fields (reviewText and summary)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Train-validation split\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the dataset into training and validation sets." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# This separates 10% of the entire dataset into validation dataset.\n", "train_text, val_text, train_label, val_label = \\\n", "    train_test_split(df[\"reviewText\"].tolist(),\n", "                     df[\"isPositive\"].tolist(),\n", "                     test_size=0.10,\n", "                     shuffle=True,\n", "                     random_state=324)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Text processing and Transformation\n", "(Go to top)\n", "\n", "We will apply the following processes here:\n", "* __Text cleaning:__ Simple text cleaning operations. We won't do stemming or lemmatization as our word vectors already cover different forms of words. We are using GloVe word embeddings trained on a corpus of 6 billion tokens in this example.\n", "* __Tokenization:__ Tokenizing all sentences\n", "* __Creating vocabulary:__ We will create a vocabulary of the tokens. In this vocabulary, tokens will map to unique ids, such as \"car\"->32, \"house\"->651, etc.\n", "* __Transforming text:__ Tokenized sentences will be mapped to unique ids. For example: [\"this\", \"is\", \"sentence\"] -> [13, 54, 412]." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n" ] } ], "source": [ "import nltk, gluonnlp\n", "from nltk.tokenize import word_tokenize\n", "\n", "nltk.download('punkt')\n", "\n", "def cleanStr(text):\n", "    \n", "    # Check if the sentence is a missing value\n", "    if not isinstance(text, str):\n", "        text = \"\"\n", "    \n", "    # Lowercase and remove leading/trailing whitespace\n", "    text = text.lower().strip()\n", "    # Remove extra space and tabs\n", "    text = re.sub(r'\\s+', ' ', text)\n", "    # Remove HTML tags/markups\n", "    text = re.compile('<.*?>').sub('', text)\n", "    return text\n", "\n", "def tokenize(text):\n", "    tokens = []\n", "    text = cleanStr(text)\n", "    words = word_tokenize(text)\n", "    for word in words:\n", "        tokens.append(word)\n", "    return tokens\n", "\n", "def createVocabulary(text_list, min_freq):\n", "    all_tokens = []\n", "    for sentence in text_list:\n", "        all_tokens += tokenize(sentence)\n", "    # Calculate token frequencies\n", "    counter = gluonnlp.data.count_tokens(all_tokens)\n", "    # Create the vocabulary\n", "    vocab = gluonnlp.Vocab(counter,\n", "                           min_freq = min_freq,\n", "                           unknown_token = '<unk>',\n", "                           padding_token = None,\n", "                           bos_token = None,\n", "                           eos_token = None)\n", "    \n", "    return vocab\n", "\n", "def transformText(text, vocab, max_length):\n", "    token_arr = np.zeros((max_length,))\n", "    tokens = tokenize(text)[0:max_length]\n", "    for idx, token in enumerate(tokens):\n", "        try:\n", "            # Use the vocabulary index of the token\n", "            token_arr[idx] = vocab.token_to_idx[token]\n", "        except KeyError:\n", "            token_arr[idx] = 0 # Unknown word\n", "    return token_arr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to keep the training time low, we only consider the first 250 tokens (max_length) of each review. 
We also only use words that occur at least 5 times across all sentences (min_freq)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating the vocabulary\n", "Transforming training texts\n", "Transforming validation texts\n" ] } ], "source": [ "min_freq = 5\n", "max_length = 250\n", "\n", "print(\"Creating the vocabulary\")\n", "vocab = createVocabulary(train_text, min_freq)\n", "print(\"Transforming training texts\")\n", "train_text_transformed = nd.array([transformText(text, vocab, max_length) for text in train_text])\n", "print(\"Transforming validation texts\")\n", "val_text_transformed = nd.array([transformText(text, vocab, max_length) for text in val_text])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see some unique ids for some words." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vocabulary index for computer: 67\n", "Vocabulary index for beautiful: 1931\n", "Vocabulary index for code: 403\n" ] } ], "source": [ "print(\"Vocabulary index for computer:\", vocab['computer'])\n", "print(\"Vocabulary index for beautiful:\", vocab['beautiful'])\n", "print(\"Vocabulary index for code:\", vocab['code'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Using pre-trained GloVe Word Embeddings\n", "(Go to top)\n", "\n", "In this example, we will use GloVe word vectors. The `'glove.6B.50d.txt'` file contains word vectors trained on a corpus of 6 billion tokens. Each word vector has 50 numbers in it. The following code shows how to get the word vectors and create an embedding matrix from them. We will connect our vocabulary indexes to the GloVe embedding with the `get_vecs_by_tokens()` function." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading /home/ec2-user/.mxnet/embeddings/glove/glove.6B.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/embeddings/glove/glove.6B.zip...\n" ] } ], "source": [ "from mxnet.contrib import text\n", "glove = text.embedding.create('glove',\n", "                              pretrained_file_name = 'glove.6B.50d.txt')\n", "embedding_matrix = glove.get_vecs_by_tokens(vocab.idx_to_token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Training and validation\n", "(Go to top)\n", "\n", "We have processed our text data and created our embedding matrix from GloVe. Now, it is time to start the training process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will set our parameters below." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Size of the state vectors\n", "hidden_size = 12\n", "\n", "# General NN training parameters\n", "learning_rate = 0.01\n", "epochs = 15\n", "batch_size = 32\n", "\n", "# Embedding vector and vocabulary sizes\n", "num_embed = 50 # glove.6B.50d.txt\n", "vocab_size = len(vocab.token_to_idx.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to put our data into the correct format before training." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from mxnet.gluon.data import ArrayDataset, DataLoader\n", "\n", "train_label = nd.array(train_label)\n", "val_label = nd.array(val_label)\n", "\n", "train_dataset = ArrayDataset(train_text_transformed, train_label)\n", "train_loader = DataLoader(train_dataset, batch_size=batch_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our sequential model is made of these layers:\n", "* Embedding layer: This is where our words/tokens are mapped to word vectors.\n", "* RNN layer: We will be using a simple RNN model. We won't stack RNN units in this example. It uses a single RNN unit with a hidden state size of 12. More details about the RNN are available [here](https://mxnet.incubator.apache.org/api/python/docs/api/gluon/rnn/index.html#mxnet.gluon.rnn.RNN).\n", "* Dense layer: A dense layer with a single neuron and a sigmoid activation is used to output our isPositive prediction, the probability that a review is positive." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "context = mx.cpu() # use mx.gpu() if you are using GPU\n", "\n", "model = nn.Sequential()\n", "model.add(nn.Embedding(vocab_size, num_embed), # Embedding layer\n", "          rnn.RNN(hidden_size, num_layers=1), # Recurrent layer\n", "          nn.Dense(1, activation='sigmoid')) # Output layer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's initialize this network. Then, we will need to make the embedding layer use our GloVe word vectors." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Initialize the network's parameters\n", "model.collect_params().initialize(mx.init.Xavier(), ctx=context)\n", "\n", "# We set the embedding layer's parameters from GloVe\n", "model[0].weight.set_data(embedding_matrix.as_in_context(context))\n", "# We won't change/train the embedding layer\n", "model[0].collect_params().setattr('grad_req', 'null')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will define the trainer and loss function below. __Binary cross-entropy loss__ is used as this is a binary classification problem.\n", "$$\n", "\\mathrm{BinaryCrossEntropyLoss} = -\\sum_{examples}{(y\\log(p) + (1 - y)\\log(1 - p))}\n", "$$" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Setting our trainer\n", "trainer = Trainer(model.collect_params(),\n", "                  'sgd',\n", "                  {'learning_rate': learning_rate})\n", "\n", "# We will use Binary Cross-entropy loss\n", "cross_ent_loss = SigmoidBinaryCrossEntropyLoss(from_sigmoid=True) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, it is time to start the training process. We will print the binary cross-entropy loss after each epoch." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0. Train_loss 0.603717 Validation_loss 0.553842 Seconds 11.135917\n", "Epoch 1. Train_loss 0.534499 Validation_loss 0.509175 Seconds 11.703702\n", "Epoch 2. Train_loss 0.504744 Validation_loss 0.489113 Seconds 11.794788\n", "Epoch 3. Train_loss 0.486071 Validation_loss 0.474568 Seconds 11.669017\n", "Epoch 4. Train_loss 0.471923 Validation_loss 0.463993 Seconds 12.016294\n", "Epoch 5. Train_loss 0.460395 Validation_loss 0.455817 Seconds 11.684173\n", "Epoch 6. Train_loss 0.451343 Validation_loss 0.449823 Seconds 11.397681\n", "Epoch 7. 
Train_loss 0.444055 Validation_loss 0.445230 Seconds 11.690629\n", "Epoch 8. Train_loss 0.438143 Validation_loss 0.442019 Seconds 12.062137\n", "Epoch 9. Train_loss 0.433309 Validation_loss 0.439299 Seconds 12.089358\n", "Epoch 10. Train_loss 0.428902 Validation_loss 0.437187 Seconds 11.639935\n", "Epoch 11. Train_loss 0.425322 Validation_loss 0.435542 Seconds 11.373106\n", "Epoch 12. Train_loss 0.422239 Validation_loss 0.434210 Seconds 11.662943\n", "Epoch 13. Train_loss 0.419500 Validation_loss 0.432850 Seconds 11.755149\n", "Epoch 14. Train_loss 0.417022 Validation_loss 0.431435 Seconds 11.742159\n" ] } ], "source": [ "import time\n", "for epoch in range(epochs):\n", "    start = time.time()\n", "    training_loss = 0\n", "    # Training loop, train the network\n", "    for idx, (data, target) in enumerate(train_loader):\n", "\n", "        data = data.as_in_context(context)\n", "        target = target.as_in_context(context)\n", "        \n", "        # Record the forward pass so gradients can be computed\n", "        with autograd.record():\n", "            output = model(data)\n", "            L = cross_ent_loss(output, target)\n", "        training_loss += nd.sum(L).asscalar()\n", "        L.backward()\n", "        trainer.step(data.shape[0])\n", "    \n", "    # Calculate validation loss\n", "    val_predictions = model(val_text_transformed.as_in_context(context))\n", "    val_loss = nd.sum(cross_ent_loss(val_predictions, val_label)).asscalar()\n", "    \n", "    # Let's take the average losses\n", "    training_loss = training_loss / len(train_label)\n", "    val_loss = val_loss / len(val_label)\n", "    \n", "    end = time.time()\n", "    print(\"Epoch %s. 
Train_loss %f Validation_loss %f Seconds %f\" % \\\n", "          (epoch, training_loss, val_loss, end-start))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see some validation results below." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Classification Report\n", " precision recall f1-score support\n", "\n", " 0.0 0.75 0.70 0.73 2605\n", " 1.0 0.83 0.86 0.84 4395\n", "\n", " accuracy 0.80 7000\n", " macro avg 0.79 0.78 0.79 7000\n", "weighted avg 0.80 0.80 0.80 7000\n", "\n", "Accuracy\n", "0.8018571428571428\n" ] } ], "source": [ "from sklearn.metrics import classification_report, accuracy_score\n", "\n", "# Get validation predictions\n", "val_predictions = model(val_text_transformed.as_in_context(context))\n", "\n", "val_label = nd.array(val_label)\n", "\n", "# Round predictions: 1 if pred>0.5, 0 otherwise\n", "val_predictions = np.round(val_predictions.asnumpy())\n", "\n", "print(\"Classification Report\")\n", "print(classification_report(val_label.asnumpy(), val_predictions))\n", "print(\"Accuracy\")\n", "print(accuracy_score(val_label.asnumpy(), val_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Improvement ideas\n", "(Go to top)\n", "\n", "We can improve our model by\n", "* Changing hyper-parameters\n", "* Using more advanced architectures such as Gated Recurrent Units (GRU) and Long Short-term Memory networks (LSTM)." ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }