{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 2\n", "\n", "## Linear Regression Models and Regularization\n", "\n", "In this notebook, we fit linear regression models with and without regularization (LinearRegression, Ridge, Lasso, ElasticNet) to predict the __log_votes__ field of our review dataset.\n", "\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Stop word removal and stemming\n", "4. Train - Validation Split\n", "5. Data processing with Pipeline and ColumnTransformer\n", "6. Train the regressor\n", "7. Fitting Linear Regression models and checking the validation performance. Find more details on the classical Linear Regression models with and without regularization here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model\n", "8. Ideas for improvement\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __rating:__ Rating of the review\n", "* __log_votes:__ Log-transformed vote count, log(1 + votes)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "\n", "We will use the __pandas__ library to read our dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | reviewText | summary | verified | time | rating | log_votes |
|---|---|---|---|---|---|---|
| 0 | Stuck with this at work, slow and we still got... | Use SEP or Mcafee | False | 1464739200 | 1.0 | 0.0 |
| 1 | I use parallels every day with both my persona... | Use it daily | False | 1332892800 | 5.0 | 0.0 |
| 2 | Barbara Robbins\\n\\nI've used TurboTax to do ou... | Helpful Product | True | 1398816000 | 4.0 | 0.0 |
| 3 | I have been using this software security for y... | Five Stars | True | 1430784000 | 5.0 | 0.0 |
| 4 | If you want your computer hijacked and slowed ... | ... hijacked and slowed to a crawl Windows 10 ... | False | 1508025600 | 1.0 | 0.0 |
Pipeline(steps=[('data_preprocessing',
                 ColumnTransformer(transformers=[('numerical_pre',
                                                  Pipeline(steps=[('num_scaler',
                                                                   MinMaxScaler())]),
                                                  ['time', 'rating']),
                                                 ('text_pre_0',
                                                  Pipeline(steps=[('text_vect_0',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=50))]),
                                                  'summary'),
                                                 ('text_pre_1',
                                                  Pipeline(steps=[('text_vect_1',
                                                                   CountVectorizer(binary=True,
                                                                                   max_features=150))]),
                                                  'reviewText')])),
                ('lr', LinearRegression())])