{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 1\n", "\n", "## K Nearest Neighbors Model for a Classification Problem: Classify Product Reviews as Positive or Negative\n", "\n", "In this notebook, we use the K Nearest Neighbors (KNN) method to build a classifier that predicts the __isPositive__ field of our review dataset, which is very similar to the final project dataset.\n", "\n", "Find more details on the KNN Classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html\n", "\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Text Processing: Stop words removal and stemming\n", "4. Train - Validation Split\n", "5. Data processing with Pipeline\n", "6. Train the classifier\n", "7. Test the classifier\n", "8. Ideas for improvement\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes, log(1+votes). *This field is a processed version of the votes field. Each click on the \"helpful\" button of a customer review increases its vote count by 1, and __log_votes__ is computed as log(1+votes). This transformation compresses the range of the vote counts.*\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Reading the dataset\n", "\n", "We will use the __pandas__ library to read our dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of the dataset is: (70000, 6)\n" ] } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')\n", "\n", "print('The shape of the dataset is:', df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first 10 rows of the dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | reviewText | summary | verified | time | log_votes | isPositive |
|---|---|---|---|---|---|---|
| 0 | PURCHASED FOR YOUNGSTER WHO\nINHERITED MY "TOO... | IDEAL FOR BEGINNER! | True | 1361836800 | 0.000000 | 1.0 |
| 1 | unable to open or use | Two Stars | True | 1452643200 | 0.000000 | 0.0 |
| 2 | Waste of money!!! It wouldn't load to my system. | Dont buy it! | True | 1433289600 | 0.000000 | 0.0 |
| 3 | I attempted to install this OS on two differen... | I attempted to install this OS on two differen... | True | 1518912000 | 0.000000 | 0.0 |
| 4 | I've spent 14 fruitless hours over the past tw... | Do NOT Download. | True | 1441929600 | 1.098612 | 0.0 |
| 5 | I purchased the home and business because I wa... | Quicken home and business not for amatures | True | 1335312000 | 0.000000 | 0.0 |
| 6 | The download doesn't take long at all. And it'... | Great! | True | 1377993600 | 0.000000 | 1.0 |
| 7 | This program is positively wonderful for word ... | Terrific for practice. | False | 1158364800 | 2.397895 | 1.0 |
| 8 | Fantastic protection!! Great customer support!! | Five Stars | True | 1478476800 | 0.000000 | 1.0 |
| 9 | Obviously Win 7 now the last great operating s... | Five Stars | True | 1471478400 | 0.000000 | 1.0 |
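The `log_votes` values in the rows above follow the log(1+votes) transformation described in the dataset schema. The sketch below shows that computation with NumPy. The raw vote counts (0, 2, and 10) are an inference from the displayed values 0.000000, 1.098612, and 2.397895; they are not a column of the dataset.

```python
import numpy as np

# Hypothetical raw vote counts, inferred from the log_votes values shown above
votes = np.array([0, 2, 10])

# log(1 + votes); log1p is the numerically stable form of this expression
log_votes = np.log1p(votes)

# expm1 inverts the transformation and recovers the raw counts
recovered_votes = np.expm1(log_votes)

print(np.round(log_votes, 6))
print(np.round(recovered_votes))
```

Adding 1 before taking the logarithm keeps reviews with zero votes well defined, since log(1) = 0.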
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=15)),
                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=15)
KNeighborsClassifier()
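The estimator printed above chains a binary bag-of-words vectorizer with a KNN classifier. A minimal sketch of how such a pipeline might be assembled, fitted, and used for prediction follows. The toy reviews, their labels, and the choice of `n_neighbors=3` are illustrative assumptions and do not come from the review dataset or from the notebook's own training cell.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = positive review, 0 = negative review)
train_texts = [
    "great product works well",
    "love it great value",
    "works great love it",
    "waste of money",
    "does not work waste",
    "does not work waste of money",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Same step names and vectorizer settings as the printed pipeline;
# n_neighbors=3 is an assumption chosen to suit the tiny toy set
pipeline = Pipeline(steps=[
    ("text_vect", CountVectorizer(binary=True, max_features=15)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# fit() learns the vocabulary and stores the vectorized training reviews
pipeline.fit(train_texts, train_labels)

# predict() vectorizes new text with the learned vocabulary, then takes
# a majority vote among the 3 nearest training reviews
predictions = pipeline.predict([
    "great value love it",
    "waste of money does not work",
])
print(predictions)
```

With `binary=True`, each feature records only whether a word occurs in a review, so the distance between two reviews depends on which of the retained words they do not share.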