{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 3\n", "\n", "## Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not\n", "\n", "In this exercise, we will learn how to use Recurrent Neural Networks. \n", "\n", "We will follow these steps:\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Train-validation dataset split\n", "4. Text processing and Transformation\n", "5. Generating data batch and iterator\n", "6. Using pre-trained GloVe Word Embeddings\n", "7. Setting Hyperparameters and Building the Network\n", "8. Training the Network\n", "9. Test the classifier on the validation data\n", "10. Improvement ideas\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n", "\n", "__Important note:__ One big distinction between regular neural networks and RNNs is that RNNs work with sequential data. In our case, the RNN will help us with the text field. If we also want to consider other fields such as time, log_votes, verified, etc., we need to combine a regular neural network with the RNN." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:47.162268Z", "start_time": "2021-01-09T05:02:47.160085Z" } }, "outputs": [], "source": [ "%pip install -q -r ../../requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:48.342987Z", "start_time": "2021-01-09T05:02:47.164823Z" } }, "outputs": [], "source": [ "import time\n", "import numpy as np\n", "import torch, torchtext\n", "import pandas as pd\n", "from collections import Counter\n", "from torch import nn, optim\n", "from torch.nn import BCEWithLogitsLoss\n", "from torch.utils.data import TensorDataset, DataLoader\n", "from torchtext.data.utils import get_tokenizer\n", "from torchtext.vocab import GloVe\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the dataset below and fill in the reviewText field. We will use this field as input to our ML model." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:48.995226Z", "start_time": "2021-01-09T05:02:48.344888Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five rows in the dataset. As you can see, the __isPositive__ field is binary (1 or 0). That's why we will build a classification model." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-01-09T05:02:49.015545Z", "start_time": "2021-01-09T05:02:48.997444Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | reviewText | \n", "summary | \n", "verified | \n", "time | \n", "log_votes | \n", "isPositive | \n", "
---|---|---|---|---|---|---|
0 | \n", "PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO... | \n", "IDEAL FOR BEGINNER! | \n", "True | \n", "1361836800 | \n", "0.000000 | \n", "1.0 | \n", "
1 | \n", "unable to open or use | \n", "Two Stars | \n", "True | \n", "1452643200 | \n", "0.000000 | \n", "0.0 | \n", "
2 | \n", "Waste of money!!! It wouldn't load to my system. | \n", "Dont buy it! | \n", "True | \n", "1433289600 | \n", "0.000000 | \n", "0.0 | \n", "
3 | \n", "I attempted to install this OS on two differen... | \n", "I attempted to install this OS on two differen... | \n", "True | \n", "1518912000 | \n", "0.000000 | \n", "0.0 | \n", "
4 | \n", "I've spent 14 fruitless hours over the past tw... | \n", "Do NOT Download. | \n", "True | \n", "1441929600 | \n", "1.098612 | \n", "0.0 | \n", "