{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 2\n", "\n", "## Logistic Regression Model and Threshold Calibration\n", "\n", "In this notebook, we use the Logistic Regression method to predict the __isPositive__ field of our final dataset, and we look at how probability threshold calibration can help improve a classifier's performance.\n", "\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Stop word removal and stemming\n", "4. Train - Validation Split\n", "5. Data processing with Pipeline and ColumnTransformer\n", "6. Fit the classifier\n", "Find more details on the __LogisticRegression__ here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", "7. Test the classifier\n", "8. Ideas for improvement: Probability threshold calibration (optional)\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "\n", "We will use the __pandas__ library to read our dataset." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five rows of the dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | reviewText | summary | verified | time | log_votes | isPositive |
|---|---|---|---|---|---|---|
| 0 | PURCHASED FOR YOUNGSTER WHO\nINHERITED MY "TOO... | IDEAL FOR BEGINNER! | True | 1361836800 | 0.000000 | 1.0 |
| 1 | unable to open or use | Two Stars | True | 1452643200 | 0.000000 | 0.0 |
| 2 | Waste of money!!! It wouldn't load to my system. | Dont buy it! | True | 1433289600 | 0.000000 | 0.0 |
| 3 | I attempted to install this OS on two differen... | I attempted to install this OS on two differen... | True | 1518912000 | 0.000000 | 0.0 |
| 4 | I've spent 14 fruitless hours over the past tw... | Do NOT Download. | True | 1441929600 | 1.098612 | 0.0 |
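To make the __time__ and __log_votes__ columns easier to read, here is a minimal sketch that decodes them for one row. It assumes only what the schema states: __time__ is a UNIX timestamp in seconds and __log_votes__ is log(1+votes). The small `row` DataFrame and the `review_date` and `votes` column names are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Values copied from row 4 of the table; the DataFrame itself is a stand-in.
row = pd.DataFrame({"time": [1441929600], "log_votes": [1.098612]})

# UNIX timestamps count seconds since 1970-01-01 (UTC).
row["review_date"] = pd.to_datetime(row["time"], unit="s")

# log_votes = log(1 + votes), so votes = exp(log_votes) - 1.
row["votes"] = np.expm1(row["log_votes"]).round().astype(int)

print(row)
```

For this row the recovered vote count is 2, since log(1 + 2) is approximately 1.098612.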
Pipeline(steps=[('data_preprocessing',
                 ColumnTransformer(transformers=[('numerical_pre',
                                                  Pipeline(steps=[('num_scaler',
                                                                   MinMaxScaler())]),
                                                  ['time', 'log_votes']),
                                                 ('text_pre_0',
                                                  Pipeline(steps=[('text_vect_0',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=50))]),
                                                  'summary'),
                                                 ('text_pre_1',
                                                  Pipeline(steps=[('text_vect_1',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=150))]),
                                                  'reviewText')])),
                ('logistic_regression', LogisticRegression(C=0.1))])
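The pipeline printed above can be rebuilt with a few lines of scikit-learn. The following is a minimal sketch, not the notebook's original cell: the step names and hyperparameters are taken from the printed representation, while the `toy` DataFrame, the threshold grid, and the choice of F1 as the selection metric are assumptions made for illustration. It also shows one simple way to do probability threshold calibration: sweep candidate thresholds over the predicted probabilities and keep the one with the best score.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for the review dataset (not the real file).
toy = pd.DataFrame({
    "reviewText": ["works great love it", "unable to open or use",
                   "waste of money would not load", "great product easy install",
                   "love it works fine", "do not download waste of time"] * 5,
    "summary": ["great", "two stars", "dont buy it",
                "five stars", "love it", "do not download"] * 5,
    "time": [1361836800, 1452643200, 1433289600,
             1518912000, 1441929600, 1400000000] * 5,
    "log_votes": [0.0, 0.0, 0.0, 0.0, 1.098612, 0.693147] * 5,
    "isPositive": [1, 0, 0, 1, 1, 0] * 5,
})

# Numerical columns are scaled to [0, 1]; each text column gets its own
# binary bag-of-words vectorizer. Text columns are passed as a plain string
# (not a list) so that CountVectorizer receives a 1-D column of documents.
data_preprocessor = ColumnTransformer(transformers=[
    ("numerical_pre", Pipeline([("num_scaler", MinMaxScaler())]),
     ["time", "log_votes"]),
    ("text_pre_0", Pipeline([("text_vect_0",
                              CountVectorizer(binary=True, max_features=50))]),
     "summary"),
    ("text_pre_1", Pipeline([("text_vect_1",
                              CountVectorizer(binary=True, max_features=150))]),
     "reviewText"),
])

pipeline = Pipeline([
    ("data_preprocessing", data_preprocessor),
    ("logistic_regression", LogisticRegression(C=0.1)),
])

X = toy.drop(columns="isPositive")
y = toy["isPositive"]
pipeline.fit(X, y)

# Predicted probability of the positive class (second column of predict_proba).
proba = pipeline.predict_proba(X)[:, 1]

# Threshold sweep. For brevity this reuses the toy training data; in practice
# the sweep should run on a held-out validation split.
thresholds = np.linspace(0.1, 0.9, 9)
scores = [f1_score(y, (proba >= t).astype(int), zero_division=0)
          for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print("Best threshold by F1:", best_threshold)
```

The default `predict` call uses a 0.5 cut-off on the positive-class probability. Sweeping the threshold instead lets us trade precision against recall, which can help when the classes are imbalanced.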