{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 2\n", "\n", "## Tree-Based Models for a Classification Problem, and Hyperparameter Tuning\n", "\n", "We continue to work with our review dataset to see how Tree-based classifiers (Decision Tree, Random Forest), along with efficient optimization techniques (GridSearch, RandomizedSearch), perform to predict the __isPositive__ field of our review dataset (that is very similar to the final project dataset).\n", "\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Stop word removal and stemming\n", "4. Train - Validation Split\n", "5. Data processing with Pipeline and ColumnTransform\n", "6. Fit the Decision Tree classifier Find more details on the __DecisionTreeClassifier__ here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html \n", "7. Test the classifier\n", "8. Fit and test the Random Forest classifier Find more details on the __RandomForestClassifier__ here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n", "9. Hyperparameter Tuning\n", " * Find more details on the __GridSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\n", " * Find more details on the __RandomizedSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html\n", "10. Ideas for improvement\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "(Go to top)\n", "\n", "We will use the __pandas__ library to read our dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of the dataset is: (70000, 6)\n" ] } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')\n", "\n", "print('The shape of the dataset is:', df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first 10 rows of the dataset. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | reviewText | \n", "summary | \n", "verified | \n", "time | \n", "log_votes | \n", "isPositive | \n", "
---|---|---|---|---|---|---|
0 | \n", "PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO... | \n", "IDEAL FOR BEGINNER! | \n", "True | \n", "1361836800 | \n", "0.000000 | \n", "1.0 | \n", "
1 | \n", "unable to open or use | \n", "Two Stars | \n", "True | \n", "1452643200 | \n", "0.000000 | \n", "0.0 | \n", "
2 | \n", "Waste of money!!! It wouldn't load to my system. | \n", "Dont buy it! | \n", "True | \n", "1433289600 | \n", "0.000000 | \n", "0.0 | \n", "
3 | \n", "I attempted to install this OS on two differen... | \n", "I attempted to install this OS on two differen... | \n", "True | \n", "1518912000 | \n", "0.000000 | \n", "0.0 | \n", "
4 | \n", "I've spent 14 fruitless hours over the past tw... | \n", "Do NOT Download. | \n", "True | \n", "1441929600 | \n", "1.098612 | \n", "0.0 | \n", "
5 | \n", "I purchased the home and business because I wa... | \n", "Quicken home and business not for amatures | \n", "True | \n", "1335312000 | \n", "0.000000 | \n", "0.0 | \n", "
6 | \n", "The download doesn't take long at all. And it'... | \n", "Great! | \n", "True | \n", "1377993600 | \n", "0.000000 | \n", "1.0 | \n", "
7 | \n", "This program is positively wonderful for word ... | \n", "Terrific for practice. | \n", "False | \n", "1158364800 | \n", "2.397895 | \n", "1.0 | \n", "
8 | \n", "Fantastic protection!! Great customer support!! | \n", "Five Stars | \n", "True | \n", "1478476800 | \n", "0.000000 | \n", "1.0 | \n", "
9 | \n", "Obviously Win 7 now the last great operating s... | \n", "Five Stars | \n", "True | \n", "1471478400 | \n", "0.000000 | \n", "1.0 | \n", "
Fitted Decision Tree pipeline (text form of the notebook's HTML output):

```
Pipeline(steps=[('data_preprocessing',
                 ColumnTransformer(transformers=[('numerical_pre',
                                                  Pipeline(steps=[('num_scaler',
                                                                   MinMaxScaler())]),
                                                  ['time', 'log_votes']),
                                                 ('text_pre_0',
                                                  Pipeline(steps=[('text_vect_0',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=50))]),
                                                  'summary'),
                                                 ('text_pre_1',
                                                  Pipeline(steps=[('text_vect_1',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=150))]),
                                                  'reviewText')])),
                ('decision_tree',
                 DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))])
```
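Step 7 tests the fitted classifier on the held-out validation data. A minimal sketch, assuming the `pipeline`, `X_val`, and `y_val` names from the sketch above:

```python
from sklearn.metrics import accuracy_score, classification_report

# Predict on the validation set and report standard classification metrics.
val_predictions = pipeline.predict(X_val)
print("Validation accuracy:", accuracy_score(y_val, val_predictions))
print(classification_report(y_val, val_predictions))
```

For step 8, the tree is swapped for a Random Forest: the fitted pipeline below is identical except that the final step is `RandomForestClassifier(n_estimators=150, max_depth=10, min_samples_leaf=15)` (note the step name `decision_tree` is reused).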
Fitted Random Forest pipeline (text form of the notebook's HTML output):

```
Pipeline(steps=[('data_preprocessing',
                 ColumnTransformer(transformers=[('numerical_pre',
                                                  Pipeline(steps=[('num_scaler',
                                                                   MinMaxScaler())]),
                                                  ['time', 'log_votes']),
                                                 ('text_pre_0',
                                                  Pipeline(steps=[('text_vect_0',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=50))]),
                                                  'summary'),
                                                 ('text_pre_1',
                                                  Pipeline(steps=[('text_vect_1',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=150))]),
                                                  'reviewText')])),
                ('decision_tree',
                 RandomForestClassifier(max_depth=10, min_samples_leaf=15,
                                        n_estimators=150))])
```
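Step 9 tunes hyperparameters with GridSearchCV or RandomizedSearchCV. Parameters of pipeline steps are addressed as `<step_name>__<parameter>`. A minimal sketch, assuming the `pipeline`, `X_train`, and `y_train` names above; the grid values, scoring, and fold count are illustrative assumptions, not the lecture's settings:

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Candidate values for the tree hyperparameters (hypothetical grid).
param_grid = {
    "decision_tree__max_depth": [5, 10, 20],
    "decision_tree__min_samples_leaf": [5, 15, 30],
}

# GridSearchCV exhaustively tries every combination with cross-validation.
grid_search = GridSearchCV(pipeline, param_grid, cv=5,
                           scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)

# RandomizedSearchCV samples a fixed number of combinations instead,
# which is much cheaper when the grid is large.
random_search = RandomizedSearchCV(pipeline, param_grid, n_iter=5, cv=5,
                                   scoring="accuracy", random_state=23)
random_search.fit(X_train, y_train)
print("Best parameters:", random_search.best_params_)
```

The best model found (`grid_search.best_estimator_`) can then be evaluated on the validation set exactly as in step 7.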