{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 3\n", "\n", "## Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not\n", "\n", "In this exercise, we will learn how to use Recurrent Neural Networks.\n", "\n", "We will follow these steps:\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Train-validation dataset split\n", "4. Text processing and transformation\n", "5. Using GloVe Word Embeddings\n", "6. Training and validating the model\n", "7. Improvement ideas\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n", "\n", "__Important note:__ One big distinction between regular neural networks and RNNs is that RNNs work with sequential data. In our case, the RNN will help us with the text field. If we also want to consider other fields such as time, log_votes, verified, etc., we need to combine a regular neural network with the RNN." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import re\n", "import numpy as np\n", "import mxnet as mx\n", "from mxnet import gluon, nd, autograd\n", "from mxnet.gluon import nn, rnn, Trainer\n", "from mxnet.gluon.loss import SigmoidBinaryCrossEntropyLoss\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the dataset below and fill in the reviewText field. We will use this field as input to our ML model." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five rows in the dataset. As you can see, the __isPositive__ field is a binary label (1 or 0). That's why we will build a classification model." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|   | reviewText | summary | verified | time | log_votes | isPositive |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO... | IDEAL FOR BEGINNER! | True | 1361836800 | 0.000000 | 1.0 |\n",
"| 1 | unable to open or use | Two Stars | True | 1452643200 | 0.000000 | 0.0 |\n",
"| 2 | Waste of money!!! It wouldn't load to my system. | Dont buy it! | True | 1433289600 | 0.000000 | 0.0 |\n",
"| 3 | I attempted to install this OS on two differen... | I attempted to install this OS on two differen... | True | 1518912000 | 0.000000 | 0.0 |\n",
"| 4 | I've spent 14 fruitless hours over the past tw... | Do NOT Download. | True | 1441929600 | 1.098612 | 0.0 |\n", "