{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 2\n", "\n", "## Linear Regression Models and Regularization\n", "\n", "In this notebook, we go over Linear Regression methods (with and without regularization: LinearRegression, Ridge, Lasso, ElasticNet) to predict the __log_votes__ field of our review dataset.\n", "\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Stop word removal and stemming\n", "4. Train - Validation Split\n", "5. Data processing with Pipeline and ColumnTransformer\n", "6. Train the regressor\n", "7. Fitting Linear Regression models and checking the validation performance. Find more details on the classical Linear Regression models, with and without regularization, here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model\n", "8. Ideas for improvement\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __rating:__ Rating of the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "(Go to top)\n", "\n", "We will use the __pandas__ library to read our dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>reviewText</th><th>summary</th><th>verified</th><th>time</th><th>rating</th><th>log_votes</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Stuck with this at work, slow and we still got...</td><td>Use SEP or Mcafee</td><td>False</td><td>1464739200</td><td>1.0</td><td>0.0</td></tr>\n",
"    <tr><th>1</th><td>I use parallels every day with both my persona...</td><td>Use it daily</td><td>False</td><td>1332892800</td><td>5.0</td><td>0.0</td></tr>\n",
"    <tr><th>2</th><td>Barbara Robbins\\n\\nI've used TurboTax to do ou...</td><td>Helpful Product</td><td>True</td><td>1398816000</td><td>4.0</td><td>0.0</td></tr>\n",
"    <tr><th>3</th><td>I have been using this software security for y...</td><td>Five Stars</td><td>True</td><td>1430784000</td><td>5.0</td><td>0.0</td></tr>\n",
"    <tr><th>4</th><td>If you want your computer hijacked and slowed ...</td><td>... hijacked and slowed to a crawl Windows 10 ...</td><td>False</td><td>1508025600</td><td>1.0</td><td>0.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " reviewText \\\n", "0 Stuck with this at work, slow and we still got... \n", "1 I use parallels every day with both my persona... \n", "2 Barbara Robbins\\n\\nI've used TurboTax to do ou... \n", "3 I have been using this software security for y... \n", "4 If you want your computer hijacked and slowed ... \n", "\n", " summary verified time \\\n", "0 Use SEP or Mcafee False 1464739200 \n", "1 Use it daily False 1332892800 \n", "2 Helpful Product True 1398816000 \n", "3 Five Stars True 1430784000 \n", "4 ... hijacked and slowed to a crawl Windows 10 ... False 1508025600 \n", "\n", " rating log_votes \n", "0 1.0 0.0 \n", "1 5.0 0.0 \n", "2 4.0 0.0 \n", "3 5.0 0.0 \n", "4 1.0 0.0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-REGRESSION.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five rows in the dataset. As you can see the __log_votes__ field is numeric. That's why we will build a regression model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Exploratory data analysis\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the range and distribution of log_votes" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"log_votes\"].min()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.799753318287247" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"log_votes\"].max()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAk0AAAGdCAYAAAAPLEfqAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAA9hAAAPYQGoP6dpAAA58klEQVR4nO3df1hUdd7/8RegA/4aTA2QFX+kppG/EhOnzM1kHZO6M929tSxJqS690VUof7D5Rat7w+zWdFfTWluxa3VTd9MtSYgwdUtMxcgfG1SmoSsDbiqjpKDMfP/o5txOuHakoRnx+biuc+U55z2feX/GuubVmc+cCXC73W4BAADgigJ93QAAAMC1gNAEAABgAqEJAADABEITAACACYQmAAAAEwhNAAAAJhCaAAAATCA0AQAAmNDI1w00FC6XS8ePH1eLFi0UEBDg63YAAIAJbrdbZ86cUWRkpAIDr3wtidDkJcePH1dUVJSv2wAAAHVw9OhRtWvX7oo1hCYvadGihaTvXnSr1erjbgAAgBlOp1NRUVHG+/iVEJq8pOYjOavVSmgCAOAaY2ZpDQvBAQAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABMITQAAACb4TWiaN2+eAgICNG3aNOPY+fPnlZSUpNatW6t58+YaNWqUSktLPR5XXFys+Ph4NW3aVGFhYZo+fbouXrzoUbN161b17dtXwcHB6tKlizIyMmo9/9KlS9WxY0eFhIQoNjZWu3btqo9pAgCAa5RfhKbdu3fr1VdfVa9evTyOJycn65133tH69eu1bds2HT9+XCNHjjTOV1dXKz4+XlVVVdqxY4dWrVqljIwMpaWlGTWHDx9WfHy8Bg8erIKCAk2bNk2PP/64srOzjZq1a9cqJSVFc+bM0d69e9W7d2/Z7XaVlZXV/+QBAMC1we1jZ86ccXft2tWdk5Pj/vnPf+6eOnWq2+12u0+fPu1u3Lixe/369UbtZ5995pbkzsvLc7vdbve7777rDgwMdDscDqNm2bJlbqvV6q6srHS73W73jBkz3LfeeqvHc44ePdptt9uN/f79+7uTkpKM/erqandkZKQ7PT3d9DzKy8vdktzl5eXmJw8AAHzqat6/fX6lKSkpSfHx8YqLi/M4np+frwsXLngc7969u9q3b6+8vDxJUl5ennr27Knw8HCjxm63y+l06uDBg0bN98e22+3GGFVVVcrPz/eoCQwMVFxcnFFzOZWVlXI6nR4
bAABouHz6g71vvvmm9u7dq927d9c653A4ZLFY1LJlS4/j4eHhcjgcRs2lganmfM25K9U4nU6dO3dOp06dUnV19WVrCgsL/23v6enpevbZZ81NFAAAXPN8dqXp6NGjmjp1qlavXq2QkBBftVFnqampKi8vN7ajR4/6uiUAAFCPfHalKT8/X2VlZerbt69xrLq6Wtu3b9eSJUuUnZ2tqqoqnT592uNqU2lpqSIiIiRJERERtb7lVvPtuktrvv+Nu9LSUlmtVjVp0kRBQUEKCgq6bE3NGJcTHBys4ODgq594HXWclfmTPZe3HJkX7+sWAADwGp9daRoyZIj279+vgoICY+vXr5/Gjh1r/Llx48bKzc01HlNUVKTi4mLZbDZJks1m0/79+z2+5ZaTkyOr1aro6Gij5tIxampqxrBYLIqJifGocblcys3NNWoAAAB8dqWpRYsW6tGjh8exZs2aqXXr1sbxxMREpaSkqFWrVrJarZoyZYpsNpsGDBggSRo6dKiio6P16KOPav78+XI4HJo9e7aSkpKMq0ATJ07UkiVLNGPGDE2YMEFbtmzRunXrlJn5f1duUlJSlJCQoH79+ql///5atGiRKioqNH78+J/o1QAAAP7OpwvBf8jLL7+swMBAjRo1SpWVlbLb7XrllVeM80FBQdq0aZMmTZokm82mZs2aKSEhQc8995xR06lTJ2VmZio5OVmLFy9Wu3bttGLFCtntdqNm9OjROnHihNLS0uRwONSnTx9lZWXVWhwOAACuXwFut9vt6yYaAqfTqdDQUJWXl8tqtXp9fNY0AQDgfVfz/u3z+zQBAABcCwhNAAAAJhCaAAAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABMITQAAACYQmgAAAEwgNAEAAJhAaAIAADCB0AQAAGACoQkAAMAEQhMAAIAJhCYAAAATCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAEwhNAAAAJhCaAAAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABN8GpqWLVumXr16yWq1ymq1ymazafPmzcb5u+++WwEBAR7bxIkTPcYoLi5WfHy8mjZtqrCwME2fPl0XL170qNm6dav69u2r4OBgdenSRRkZGbV6Wbp0qTp27KiQkBDFxsZq165d9TJnAABwbfJpaGrXrp3mzZun/Px87dmzR/fcc48eeOABHTx40Kh54oknVFJSYmzz5883zlVXVys+Pl5VVVXasWOHVq1apYyMDKWlpRk1hw8fVnx8vAYPHqyCggJNmzZNjz/+uLKzs42atWvXKiUlRXPmzNHevXvVu3dv2e12lZWV/TQvBAAA8HsBbrfb7esmLtWqVSu99NJLSkxM1N13360+ffpo0aJFl63dvHmz7rvvPh0/flzh4eGSpOXLl2vmzJk6ceKELBaLZs6cqczMTB04cMB43JgxY3T69GllZWVJkmJjY3X77bdryZIlkiSXy6WoqChNmTJFs2bNMtW30+lUaGioysvLZbVaf8QrcHkdZ2V6fcz6dmRevK9bAADgiq7m/dtv1jRVV1frzTffVEVFhWw2m3F89erVatOmjXr06KHU1FR9++23xrm8vDz17NnTCEySZLfb5XQ6jatVeXl5iouL83guu92uvLw8SVJVVZXy8/M9agIDAxUXF2fUXE5lZaWcTqfHBgAAGq5Gvm5g//79stlsOn/+vJo3b64NGzYoOjpakvTwww+rQ4cOioyM1L59+zRz5kwVFRXprbfekiQ5HA6PwCTJ2Hc4HFescTqdOnfunE6dOqXq6urL1hQWFv7bvtPT0/Xss8/+uMkDAIBrhs9DU7du3VRQUKDy8nL95S9/UUJCgrZt26bo6Gg9+eSTRl3Pnj3Vtm1bDRkyRIcOHVLnzp192LWUmpqqlJQUY9/pdCoqKsqHHQEAgPrk89BksVj
UpUsXSVJMTIx2796txYsX69VXX61VGxsbK0n68ssv1blzZ0VERNT6lltpaakkKSIiwvhnzbFLa6xWq5o0aaKgoCAFBQVdtqZmjMsJDg5WcHDwVc4WAABcq/xmTVMNl8ulysrKy54rKCiQJLVt21aSZLPZtH//fo9vueXk5MhqtRof8dlsNuXm5nqMk5OTY6ybslgsiomJ8ahxuVzKzc31WFsFAACubz690pSamqp7771X7du315kzZ7RmzRpt3bpV2dnZOnTokNasWaPhw4erdevW2rdvn5KTkzVo0CD16tVLkjR06FBFR0fr0Ucf1fz58+VwODR79mwlJSUZV4EmTpyoJUuWaMaMGZowYYK2bNmidevWKTPz/76NlpKSooSEBPXr10/9+/fXokWLVFFRofHjx/vkdQEAAP7Hp6GprKxM48aNU0lJiUJDQ9WrVy9lZ2frF7/4hY4ePar333/fCDBRUVEaNWqUZs+ebTw+KChImzZt0qRJk2Sz2dSsWTMlJCToueeeM2o6deqkzMxMJScna/HixWrXrp1WrFghu91u1IwePVonTpxQWlqaHA6H+vTpo6ysrFqLwwEAwPXL7+7TdK3iPk21cZ8mAIC/uybv0wQAAODPCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAEwhNAAAAJhCaAAAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABMITQAAACYQmgAAAEwgNAEAAJhAaAIAADCB0AQAAGACoQkAAMAEQhMAAIAJhCYAAAATCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAE3wampYtW6ZevXrJarXKarXKZrNp8+bNxvnz588rKSlJrVu3VvPmzTVq1CiVlpZ6jFFcXKz4+Hg1bdpUYWFhmj59ui5evOhRs3XrVvXt21fBwcHq0qWLMjIyavWydOlSdezYUSEhIYqNjdWuXbvqZc4AAODa5NPQ1K5dO82bN0/5+fnas2eP7rnnHj3wwAM6ePCgJCk5OVnvvPOO1q9fr23btun48eMaOXKk8fjq6mrFx8erqqpKO3bs0KpVq5SRkaG0tDSj5vDhw4qPj9fgwYNVUFCgadOm6fHHH1d2drZRs3btWqWkpGjOnDnau3evevfuLbvdrrKysp/uxQAAAH4twO12u33dxKVatWqll156Sb/85S914403as2aNfrlL38pSSosLNQtt9yivLw8DRgwQJs3b9Z9992n48ePKzw8XJK0fPlyzZw5UydOnJDFYtHMmTOVmZmpAwcOGM8xZswYnT59WllZWZKk2NhY3X777VqyZIkkyeVyKSoqSlOmTNGsWbNM9e10OhUaGqry8nJZrVZvviSSpI6zMr0+Zn07Mi/e1y0AAHBFV/P+7Tdrmqqrq/Xmm2+qoqJCNptN+fn5unDhguLi4oya7t27q3379srLy5Mk5eXlqWfPnkZgkiS73S6n02lcrcrLy/MYo6amZoyqqirl5+d71AQGBiouLs6oAQAAaOTrBvbv3y+bzabz58+refPm2rBhg6Kjo1VQUCCLxaKWLVt61IeHh8vhcEiSHA6HR2CqOV9z7ko1TqdT586d06lTp1RdXX3ZmsLCwn/bd2VlpSorK419p9N5dRMHAADXFJ9faerWrZsKCgr08ccfa9KkSUpISNA//vEPX7f1g9LT0xUaGmpsUVFRvm4JAADUI5+HJovFoi5duigmJkbp6enq3bu3Fi9erIiICFVVVen06dMe9aWlpYqIiJAkRURE1Po2Xc3+D9VYrVY1adJEbdq0UVBQ0GVrasa4nNTUVJWXlxvb0aNH6zR/AABwbfB5aPo+l8ulyspKxcTEqHHjxsrNzTXOFRUVqbi4WDabTZJks9m0f/9+j2+55eTkyGq1Kjo62qi5dIyampoxLBaLYmJiPGpcLpdyc3O
NmssJDg42bpVQswEAgIbLp2uaUlNTde+996p9+/Y6c+aM1qxZo61btyo7O1uhoaFKTExUSkqKWrVqJavVqilTpshms2nAgAGSpKFDhyo6OlqPPvqo5s+fL4fDodmzZyspKUnBwcGSpIkTJ2rJkiWaMWOGJkyYoC1btmjdunXKzPy/b6OlpKQoISFB/fr1U//+/bVo0SJVVFRo/PjxPnldAACA//FpaCorK9O4ceNUUlKi0NBQ9erVS9nZ2frFL34hSXr55ZcVGBioUaNGqbKyUna7Xa+88orx+KCgIG3atEmTJk2SzWZTs2bNlJCQoOeee86o6dSpkzIzM5WcnKzFixerXbt2WrFihex2u1EzevRonThxQmlpaXI4HOrTp4+ysrJqLQ4HAADXL7+7T9O1ivs01cZ9mgAA/u6avE8TAACAPyM0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABMITQAAACYQmgAAAEwgNAEAAJhAaAIAADCB0AQAAGACoQkAAMAEQhMAAIAJhCYAAAATCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAEwhNAAAAJhCaAAAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABMITQAAACYQmgAAAEwgNAEAAJjg09CUnp6u22+/XS1atFBYWJhGjBihoqIij5q7775bAQEBHtvEiRM9aoqLixUfH6+mTZsqLCxM06dP18WLFz1qtm7dqr59+yo4OFhdunRRRkZGrX6WLl2qjh07KiQkRLGxsdq1a5fX5wwAAK5NPg1N27ZtU1JSknbu3KmcnBxduHBBQ4cOVUVFhUfdE088oZKSEmObP3++ca66ulrx8fGqqqrSjh07tGrVKmVkZCgtLc2oOXz4sOLj4zV48GAVFBRo2rRpevzxx5WdnW3UrF27VikpKZozZ4727t2r3r17y263q6ysrP5fCAAA4PcC3G6329dN1Dhx4oTCwsK0bds2DRo0SNJ3V5r69OmjRYsWXfYxmzdv1n333afjx48rPDxckrR8+XLNnDlTJ06ckMVi0cyZM5WZmakDBw4YjxszZoxOnz6trKwsSVJsbKxuv/12LVmyRJLkcrkUFRWlKVOmaNasWT/Yu9PpVGhoqMrLy2W1Wn/My3BZHWdlen3M+nZkXryvWwAA4Iqu5v3br9Y0lZeXS5JatWrlcXz16tVq06aNevToodTUVH377bfGuby8PPXs2dMITJJkt9vldDp18OBBoyYuLs5jTLvdrry8PElSVVWV8vPzPWoCAwMVFxdn1HxfZWWlnE6nxwYAABquRr5uoIbL5dK0adN05513qkePHsbxhx9+WB06dFBkZKT27dunmTNnqqioSG+99ZYkyeFweAQmSca+w+G4Yo3T6dS5c+d06tQpVVdXX7amsLDwsv2mp6fr2Wef/XGTBgAA1wy/CU1JSUk6cOCAPvzwQ4/jTz75pPHnnj17qm3bthoyZIgOHTqkzp07/9RtGlJTU5WSkmLsO51ORUVF+awfAABQv/wiNE2ePFmbNm3S9u3b1a5duyvWxsbGSpK+/PJLde7cWREREbW+5VZaWipJioiIMP5Zc+zSGqvVqiZNmigoKEhBQUGXrakZ4/uCg4MVHBxsfpIAAOCa5tM1TW63W5MnT9aGDRu0ZcsWderU6QcfU1BQIElq27atJMlms2n//v0e33LLycmR1WpVdHS0UZObm+sxTk5Ojmw2myTJYrEoJibGo8blcik3N9eoAQAA17c6haavvvrKK0+elJSkP/3pT1qzZo1atGghh8Mhh8Ohc+fOSZIOHTqk559/Xvn5+Tpy5IjefvttjRs3ToMGDVKvXr0kSUOHDlV0dLQeffRRffrpp8rOztbs2bOVlJRkXAmaOHGivvrqK82YMUOFhYV65ZVXtG7dOiUnJxu9pKSk6A9/+INWrVqlzz77TJM
mTVJFRYXGjx/vlbkCAIBrW51CU5cuXTR48GD96U9/0vnz5+v85MuWLVN5ebnuvvtutW3b1tjWrl0r6bsrQO+//76GDh2q7t2766mnntKoUaP0zjvvGGMEBQVp06ZNCgoKks1m0yOPPKJx48bpueeeM2o6deqkzMxM5eTkqHfv3lqwYIFWrFghu91u1IwePVr/8z//o7S0NPXp00cFBQXKysqqtTgcAABcn+p0n6aCggKtXLlSf/7zn1VVVaXRo0crMTFR/fv3r48erwncp6k27tMEAPB39X6fpj59+mjx4sU6fvy4/vjHP6qkpEQDBw5Ujx49tHDhQp04caJOjQMAAPirH7UQvFGjRho5cqTWr1+vF198UV9++aWefvppRUVFady4cSopKfFWnwAAAD71o0LTnj179F//9V9q27atFi5cqKefflqHDh1STk6Ojh8/rgceeMBbfQIAAPhUne7TtHDhQq1cuVJFRUUaPny43njjDQ0fPlyBgd9lsE6dOikjI0MdO3b0Zq8AAAA+U6fQtGzZMk2YMEGPPfaYcb+k7wsLC9Prr7/+o5oDAADwF3UKTV988cUP1lgsFiUkJNRleAAAAL9TpzVNK1eu1Pr162sdX79+vVatWvWjmwIAAPA3dQpN6enpatOmTa3jYWFheuGFF350UwAAAP6mTqGpuLj4sr8T16FDBxUXF//opgAAAPxNnUJTWFiY9u3bV+v4p59+qtatW//opgAAAPxNnULTQw89pF//+tf64IMPVF1drerqam3ZskVTp07VmDFjvN0jAACAz9Xp23PPP/+8jhw5oiFDhqhRo++GcLlcGjduHGuaAABAg1Sn0GSxWLR27Vo9//zz+vTTT9WkSRP17NlTHTp08HZ/AAAAfqFOoanGzTffrJtvvtlbvQAAAPitOoWm6upqZWRkKDc3V2VlZXK5XB7nt2zZ4pXmAAAA/EWdQtPUqVOVkZGh+Ph49ejRQwEBAd7uCwAAwK/UKTS9+eabWrdunYYPH+7tfgAAAPxSnW45YLFY1KVLF2/3AgAA4LfqFJqeeuopLV68WG6329v9AAAA+KU6fTz34Ycf6oMPPtDmzZt16623qnHjxh7n33rrLa80BwAA4C/qFJpatmypBx980Nu9AAAA+K06haaVK1d6uw8AAAC/Vqc1TZJ08eJFvf/++3r11Vd15swZSdLx48d19uxZrzUHAADgL+p0penrr7/WsGHDVFxcrMrKSv3iF79QixYt9OKLL6qyslLLly/3dp8AAAA+VacrTVOnTlW/fv106tQpNWnSxDj+4IMPKjc312vNAQAA+Is6XWn6+9//rh07dshisXgc79ixo/75z396pTEAAAB/UqcrTS6XS9XV1bWOHzt2TC1atPjRTQEAAPibOoWmoUOHatGiRcZ+QECAzp49qzlz5vDTKgAAoEGq08dzCxYskN1uV3R0tM6fP6+HH35YX3zxhdq0aaM///nP3u4RAADA5+oUmtq1a6dPP/1Ub775pvbt26ezZ88qMTFRY8eO9VgYDgAA0FDUKTRJUqNGjfTII494sxcAAAC/VafQ9MYbb1zx/Lhx4+rUDAAAgL+qU2iaOnWqx/6FCxf07bffymKxqGnTpoQmAADQ4NTp23OnTp3y2M6ePauioiINHDjwqhaCp6en6/bbb1eLFi0UFhamESNGqKioyKPm/PnzSkpKUuvWrdW8eXONGjVKpaWlHjXFxcWKj49X06ZNFRYWpunTp+vixYseNVu3blXfvn0VHBysLl26KCMjo1Y/S5cuVceOHRUSEqLY2Fjt2rXL/IsCAAAatDr/9tz3de3aVfPmzat1FepKtm3bpqSkJO3cuVM5OTm6cOGChg4dqoqKCqMmOTlZ77zzjtavX69t27bp+PHjGjlypHG+urpa8fHxqqqq0o4dO7Rq1SplZGQoLS3NqDl8+LDi4+M1ePBgFRQUaNq0aXr88ceVnZ1t1Kxdu1YpKSmaM2eO9u7dq969e8t
ut6usrOxHvjIAAKAhCHC73W5vDVZQUKBBgwbJ6XTW6fEnTpxQWFiYtm3bpkGDBqm8vFw33nij1qxZo1/+8peSpMLCQt1yyy3Ky8vTgAEDtHnzZt133306fvy4wsPDJUnLly/XzJkzdeLECVksFs2cOVOZmZk6cOCA8VxjxozR6dOnlZWVJUmKjY3V7bffriVLlkj67gaeUVFRmjJlimbNmvWDvTudToWGhqq8vFxWq7VO87+SjrMyvT5mfTsyL97XLQAAcEVX8/5dpzVNb7/9tse+2+1WSUmJlixZojvvvLMuQ0qSysvLJUmtWrWSJOXn5+vChQuKi4szarp376727dsboSkvL089e/Y0ApMk2e12TZo0SQcPHtRtt92mvLw8jzFqaqZNmyZJqqqqUn5+vlJTU43zgYGBiouLU15e3mV7raysVGVlpbFf16AIAACuDXUKTSNGjPDYDwgI0I033qh77rlHCxYsqFMjLpdL06ZN05133qkePXpIkhwOhywWi1q2bOlRGx4eLofDYdRcGphqztecu1KN0+nUuXPndOrUKVVXV1+2prCw8LL9pqen69lnn63TXAEAwLWnTqHJ5XJ5uw8lJSXpwIED+vDDD70+dn1ITU1VSkqKse90OhUVFeXDjgAAQH2q880tvWny5MnatGmTtm/frnbt2hnHIyIiVFVVpdOnT3tcbSotLVVERIRR8/1vudV8u+7Smu9/4660tFRWq1VNmjRRUFCQgoKCLltTM8b3BQcHKzg4uG4TBgAA15w6haZLr7D8kIULF/7bc263W1OmTNGGDRu0detWderUyeN8TEyMGjdurNzcXI0aNUqSVFRUpOLiYtlsNkmSzWbTb3/7W5WVlSksLEySlJOTI6vVqujoaKPm3Xff9Rg7JyfHGMNisSgmJka5ubnGR48ul0u5ubmaPHmy6bkCAICGq06h6ZNPPtEnn3yiCxcuqFu3bpKkzz//XEFBQerbt69RFxAQcMVxkpKStGbNGv3tb39TixYtjDVIoaGhatKkiUJDQ5WYmKiUlBS1atVKVqtVU6ZMkc1m04ABAyRJQ4cOVXR0tB599FHNnz9fDodDs2fPVlJSknElaOLEiVqyZIlmzJihCRMmaMuWLVq3bp0yM//vG2kpKSlKSEhQv3791L9/fy1atEgVFRUaP358XV4iAADQwNQpNN1///1q0aKFVq1apRtuuEHSdze8HD9+vO666y499dRTpsZZtmyZJOnuu+/2OL5y5Uo99thjkqSXX35ZgYGBGjVqlCorK2W32/XKK68YtUFBQdq0aZMmTZokm82mZs2aKSEhQc8995xR06lTJ2VmZio5OVmLFy9Wu3bttGLFCtntdqNm9OjROnHihNLS0uRwONSnTx9lZWXVWhwOAACuT3W6T9PPfvYzvffee7r11ls9jh84cEBDhw7V8ePHvdbgtYL7NNXGfZoAAP7uat6/63RHcKfTqRMnTtQ6fuLECZ05c6YuQwIAAPi1OoWmBx98UOPHj9dbb72lY8eO6dixY/rrX/+qxMREj584AQAAaCjqtKZp+fLlevrpp/Xwww/rwoUL3w3UqJESExP10ksvebVBAAAAf1Cn0NS0aVO98soreumll3To0CFJUufOndWsWTOvNgcAAOAv6vTxXI2SkhKVlJSoa9euatasmbz4278AAAB+pU6h6ZtvvtGQIUN08803a/jw4SopKZEkJSYmmr7dAAAAwLWkTqEpOTlZjRs3VnFxsZo2bWocHz16tLKysrzWHAAAgL+o05qm9957T9nZ2R6/EydJXbt21ddff+2VxgAAAPxJna40VVRUeFxhqnHy5El+xBYAADRIdQpNd911l9544w1jPyAgQC6XS/Pnz9fgwYO91hwAAIC/qNPHc/Pnz9eQIUO0Z88eVVVVacaMGTp48KBOnjypjz76yNs9AgAA+FydrjT16NFDn3/+uQYOHKgHHnhAFRUVGjlypD755BN17tzZ2z0CAAD43FV
fabpw4YKGDRum5cuX65lnnqmPngAAAPzOVV9paty4sfbt21cfvQAAAPitOn0898gjj+j111/3di8AAAB+q04LwS9evKg//vGPev/99xUTE1PrN+cWLlzoleYAAAD8xVWFpq+++kodO3bUgQMH1LdvX0nS559/7lETEBDgve4AAAD8xFWFpq5du6qkpEQffPCBpO9+NuV3v/udwsPD66U5AAAAf3FVa5rcbrfH/ubNm1VRUeHVhgAAAPxRnRaC1/h+iAIAAGiorio0BQQE1FqzxBomAABwPbiqNU1ut1uPPfaY8aO858+f18SJE2t9e+6tt97yXocAAAB+4KpCU0JCgsf+I4884tVmAAAA/NVVhaaVK1fWVx8AAAB+7UctBAcAALheEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAEwhNAAAAJvg0NG3fvl3333+/IiMjFRAQoI0bN3qcf+yxx4wfCa7Zhg0b5lFz8uRJjR07VlarVS1btlRiYqLOnj3rUbNv3z7dddddCgkJUVRUlObPn1+rl/Xr16t79+4KCQlRz5499e6773p9vgAA4Nrl09BUUVGh3r17a+nSpf+2ZtiwYSopKTG2P//5zx7nx44dq4MHDyonJ0ebNm3S9u3b9eSTTxrnnU6nhg4dqg4dOig/P18vvfSS5s6dq9dee82o2bFjhx566CElJibqk08+0YgRIzRixAgdOHDA+5MGAADXpAC32+32dROSFBAQoA0bNmjEiBHGsccee0ynT5+udQWqxmeffabo6Gjt3r1b/fr1kyRlZWVp+PDhOnbsmCIjI7Vs2TI988wzcjgcslgskqRZs2Zp48aNKiwslCSNHj1aFRUV2rRpkzH2gAED1KdPHy1fvtxU/06nU6GhoSovL5fVaq3DK3BlHWdlen3M+nZkXryvWwAA4Iqu5v3b79c0bd26VWFhYerWrZsmTZqkb775xjiXl5enli1bGoFJkuLi4hQYGKiPP/7YqBk0aJARmCTJbrerqKhIp06dMmri4uI8ntdutysvL+/f9lVZWSmn0+mxAQCAhsuvQ9OwYcP0xhtvKDc3Vy+++KK2bdume++9V9XV1ZIkh8OhsLAwj8c0atRIrVq1ksPhMGrCw8M9amr2f6im5vzlpKenKzQ01NiioqJ+3GQBAIBfa+TrBq5kzJgxxp979uypXr16qXPnztq6dauGDBniw86k1NRUpaSkGPtOp5PgBABAA+bXV5q+76abblKbNm305ZdfSpIiIiJUVlbmUXPx4kWdPHlSERERRk1paalHTc3+D9XUnL+c4OBgWa1Wjw0AADRc11RoOnbsmL755hu1bdtWkmSz2XT69Gnl5+cbNVu2bJHL5VJsbKxRs337dl24cMGoycnJUbdu3XTDDTcYNbm5uR7PlZOTI5vNVt9TAgAA1wifhqazZ8+qoKBABQUFkqTDhw+roKBAxcXFOnv2rKZPn66dO3fqyJEjys3N1QMPPKAuXbrIbrdLkm655RYNGzZMTzzxhHbt2qWPPvpIkydP1pgxYxQZGSlJevjhh2WxWJSYmKiDBw9q7dq1Wrx4scdHa1OnTlVWVpYWLFigwsJCzZ07V3v27NHkyZN/8tcEAAD4J5+Gpj179ui2227TbbfdJklKSUnRbbfdprS0NAUFBWnfvn36j//4D918881KTExUTEyM/v73vys4ONgYY/Xq1erevbuGDBmi4cOHa+DAgR73YAoNDdV7772nw4cPKyYmRk899ZTS0tI87uV0xx13aM2aNXrttdfUu3dv/eUvf9HGjRvVo0ePn+7FAAAAfs1v7tN0reM+TbVxnyYAgL9rUPdpAgAA8AeEJgAAABMITQAAACYQmgAAAEwgNAEAAJhAaAIAADCB0AQAAGACoQkAAMAEQhMAAIAJhCYAAAATCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAA
AEwhNAAAAJhCaAAAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABMITQAAACYQmgAAAEwgNAEAAJhAaAIAADCB0AQAAGACoQkAAMAEQhMAAIAJhCYAAAATfBqatm/frvvvv1+RkZEKCAjQxo0bPc673W6lpaWpbdu2atKkieLi4vTFF1941Jw8eVJjx46V1WpVy5YtlZiYqLNnz3rU7Nu3T3fddZdCQkIUFRWl+fPn1+pl/fr16t69u0JCQtSzZ0+9++67Xp8vAAC4dvk0NFVUVKh3795aunTpZc/Pnz9fv/vd77R8+XJ9/PHHatasmex2u86fP2/UjB07VgcPHlROTo42bdqk7du368knnzTOO51ODR06VB06dFB+fr5eeuklzZ07V6+99ppRs2PHDj300ENKTEzUJ598ohEjRmjEiBE6cOBA/U0eAABcUwLcbrfb101IUkBAgDZs2KARI0ZI+u4qU2RkpJ566ik9/fTTkqTy8nKFh4crIyNDY8aM0Weffabo6Gjt3r1b/fr1kyRlZWVp+PDhOnbsmCIjI7Vs2TI988wzcjgcslgskqRZs2Zp48aNKiwslCSNHj1aFRUV2rRpk9HPgAED1KdPHy1fvtxU/06nU6GhoSovL5fVavXWy2LoOCvT62PWtyPz4n3dAgAAV3Q1799+u6bp8OHDcjgciouLM46FhoYqNjZWeXl5kqS8vDy1bNnSCEySFBcXp8DAQH388cdGzaBBg4zAJEl2u11FRUU6deqUUXPp89TU1DzP5VRWVsrpdHpsAACg4fLb0ORwOCRJ4eHhHsfDw8ONcw6HQ2FhYR7nGzVqpFatWnnUXG6MS5/j39XUnL+c9PR0hYaGGltUVNTVThEAAFxD/DY0+bvU1FSVl5cb29GjR33dEgAAqEd+G5oiIiIkSaWlpR7HS0tLjXMREREqKyvzOH/x4kWdPHnSo+ZyY1z6HP+upub85QQHB8tqtXpsAACg4fLb0NSpUydFREQoNzfXOOZ0OvXxxx/LZrNJkmw2m06fPq38/HyjZsuWLXK5XIqNjTVqtm/frgsXLhg1OTk56tatm2644Qaj5tLnqampeR4AAACfhqazZ8+qoKBABQUFkr5b/F1QUKDi4mIFBARo2rRp+u///m+9/fbb2r9/v8aNG6fIyEjjG3a33HKLhg0bpieeeEK7du3SRx99pMmTJ2vMmDGKjIyUJD388MOyWCxKTEzUwYMHtXbtWi1evFgpKSlGH1OnTlVWVpYWLFigwsJCzZ07V3v27NHkyZN/6pcEAAD4qUa+fPI9e/Zo8ODBxn5NkElISFBGRoZmzJihiooKPfnkkzp9+rQGDhyorKwshYSEGI9ZvXq1Jk+erCFDhigwMFCjRo3S7373O+N8aGio3nvvPSUlJSkmJkZt2rRRWlqax72c7rjjDq1Zs0azZ8/Wb37zG3Xt2lUbN25Ujx49foJXAQAAXAv85j5N1zru01Qb92kCAPi7BnGfJgAAAH9CaAIAADCB0AQAAGACoQkAAMAEQhMAAIAJhCYAAAATCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAEwhNAAAAJhCaAAAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABMITQAAACYQmgAAAEwgNAEAAJhAaAIAADChka8bQMPVcVamr1u4akfmxfu6BQCAn+JKEwAAgAmEJgAAABMITQAAACYQmgAAAEwgNAEAAJhAaAIAADDBr0PT3LlzFRAQ4LF1797dOH/+/HklJSWpdevWat68uUaNGqXS0lKPMYqLixUfH6+mTZsqLCxM06dP18WLFz1qtm7dqr59+yo4OFhdunRRRkbGTzE9AABwDfH7+zTdeuutev/99439Ro3+r+Xk5GRlZmZq/fr1Cg0N1eTJkzVy5Eh99NFHkqTq6mrFx8crIiJCO3bsUEl
JicaNG6fGjRvrhRdekCQdPnxY8fHxmjhxolavXq3c3Fw9/vjjatu2rex2+087Wfgc95YCAPw7fh+aGjVqpIiIiFrHy8vL9frrr2vNmjW65557JEkrV67ULbfcop07d2rAgAF677339I9//EPvv/++wsPD1adPHz3//POaOXOm5s6dK4vFouXLl6tTp05asGCBJOmWW27Rhx9+qJdffpnQBAAADH798ZwkffHFF4qMjNRNN92ksWPHqri4WJKUn5+vCxcuKC4uzqjt3r272rdvr7y8PElSXl6eevbsqfDwcKPGbrfL6XTq4MGDRs2lY9TU1Izx71RWVsrpdHpsAACg4fLr0BQbG6uMjAxlZWVp2bJlOnz4sO666y6dOXNGDodDFotFLVu29HhMeHi4HA6HJMnhcHgEpprzNeeuVON0OnXu3Ll/21t6erpCQ0ONLSoq6sdOFwAA+DG//nju3nvvNf7cq1cvxcbGqkOHDlq3bp2aNGniw86k1NRUpaSkGPtOp5PgBABAA+bXV5q+r2XLlrr55pv15ZdfKiIiQlVVVTp9+rRHTWlpqbEGKiIiota36Wr2f6jGarVeMZgFBwfLarV6bAAAoOG6pkLT2bNndejQIbVt21YxMTFq3LixcnNzjfNFRUUqLi6WzWaTJNlsNu3fv19lZWVGTU5OjqxWq6Kjo42aS8eoqakZAwAAQPLz0PT0009r27ZtOnLkiHbs2KEHH3xQQUFBeuihhxQaGqrExESlpKTogw8+UH5+vsaPHy+bzaYBAwZIkoYOHaro6Gg9+uij+vTTT5Wdna3Zs2crKSlJwcHBkqSJEyfqq6++0owZM1RYWKhXXnlF69atU3Jysi+nDgAA/Ixfr2k6duyYHnroIX3zzTe68cYbNXDgQO3cuVM33nijJOnll19WYGCgRo0apcrKStntdr3yyivG44OCgrRp0yZNmjRJNptNzZo1U0JCgp577jmjplOnTsrMzFRycrIWL16sdu3aacWKFdxuAAAAeAhwu91uXzfREDidToWGhqq8vLxe1jddizddxE+Dm1sCQN1dzfu3X388BwAA4C8ITQAAACYQmgAAAEwgNAEAAJhAaAIAADCB0AQAAGACoQkAAMAEQhMAAIAJhCYAAAATCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAExr5ugEAP07HWZm+buGqHZkX7+sWAOCqcaUJAADABEITAACACYQmAAAAEwhNAAAAJhCaAAAATCA0AQAAmEBoAgAAMIHQBAAAYAKhCQAAwARCEwAAgAmEJgAAABP47TkAPzl+Lw/AtYgrTQAAACYQmgAAAEwgNAEAAJhAaPqepUuXqmPHjgoJCVFsbKx27drl65YAAIAfYCH4JdauXauUlBQtX75csbGxWrRokex2u4qKihQWFubr9gD4EIvXAXCl6RILFy7UE088ofHjxys6OlrLly9X06ZN9cc//tHXrQEAAB/jStP/qqqqUn5+vlJTU41jgYGBiouLU15eXq36yspKVVZWGvvl5eWSJKfTWS/9uSq/rZdxATRc7ZPX+7qFq3bgWbuvW8B1puZ92+12/2Atoel//etf/1J1dbXCw8M9joeHh6uwsLBWfXp6up599tlax6OiouqtRwBo6EIX+boDXK/OnDmj0NDQK9YQmuooNTVVKSkpxr7L5dLJkyfVunVrBQQEePW5nE6noqKidPToUVmtVq+O7U+uh3leD3OUmGdDwzwbjuthjtLVzdPtduvMmTOKjIz8wXEJTf+rTZs2CgoKUmlpqcfx0tJSRURE1KoPDg5WcHCwx7GWLVvWZ4uyWq0N+l/yGtfDPK+HOUrMs6Fhng3H9TBHyfw8f+gKUw0Wgv8vi8WimJgY5ebmGsdcLpdyc3Nls9l82BkAAPAHXGm6REpKihISEtSvXz/1799fixYtUkVFhcaPH+/r1gAAgI8Rmi4xevRonThxQmlpaXI
4HOrTp4+ysrJqLQ7/qQUHB2vOnDm1Pg5saK6HeV4Pc5SYZ0PDPBuO62GOUv3NM8Bt5jt2AAAA1znWNAEAAJhAaAIAADCB0AQAAGACoQkAAMAEQpOfW7p0qTp27KiQkBDFxsZq165dvm7J67Zv3677779fkZGRCggI0MaNG33dktelp6fr9ttvV4sWLRQWFqYRI0aoqKjI12153bJly9SrVy/jhnI2m02bN2/2dVv1at68eQoICNC0adN83YpXzZ07VwEBAR5b9+7dfd1WvfjnP/+pRx55RK1bt1aTJk3Us2dP7dmzx9dteVXHjh1r/X0GBAQoKSnJ1615VXV1tf7f//t/6tSpk5o0aaLOnTvr+eefN/W7cmYQmvzY2rVrlZKSojlz5mjv3r3q3bu37Ha7ysrKfN2aV1VUVKh3795aunSpr1upN9u2bVNSUpJ27typnJwcXbhwQUOHDlVFRYWvW/Oqdu3aad68ecrPz9eePXt0zz336IEHHtDBgwd93Vq92L17t1599VX16tXL163Ui1tvvVUlJSXG9uGHH/q6Ja87deqU7rzzTjVu3FibN2/WP/7xDy1YsEA33HCDr1vzqt27d3v8Xebk5EiSfvWrX/m4M+968cUXtWzZMi1ZskSfffaZXnzxRc2fP1+///3vvfMEbvit/v37u5OSkoz96upqd2RkpDs9Pd2HXdUvSe4NGzb4uo16V1ZW5pbk3rZtm69bqXc33HCDe8WKFb5uw+vOnDnj7tq1qzsnJ8f985//3D116lRft+RVc+bMcffu3dvXbdS7mTNnugcOHOjrNn5yU6dOdXfu3Nntcrl83YpXxcfHuydMmOBxbOTIke6xY8d6ZXyuNPmpqqoq5efnKy4uzjgWGBiouLg45eXl+bAzeEN5ebkkqVWrVj7upP5UV1frzTffVEVFRYP8KaKkpCTFx8d7/Dfa0HzxxReKjIzUTTfdpLFjx6q4uNjXLXnd22+/rX79+ulXv/qVwsLCdNttt+kPf/iDr9uqV1VVVfrTn/6kCRMmeP0H5n3tjjvuUG5urj7//HNJ0qeffqoPP/xQ9957r1fG547gfupf//qXqqura92NPDw8XIWFhT7qCt7gcrk0bdo03XnnnerRo4ev2/G6/fv3y2az6fz582revLk2bNig6OhoX7flVW+++ab27t2r3bt3+7qVehMbG6uMjAx169ZNJSUlevbZZ3XXXXfpwIEDatGiha/b85qvvvpKy5YtU0pKin7zm99o9+7d+vWvfy2LxaKEhARft1cvNm7cqNOnT+uxxx7zdSteN2vWLDmdTnXv3l1BQUGqrq7Wb3/7W40dO9Yr4xOagJ9YUlKSDhw40CDXh0hSt27dVFBQoPLycv3lL39RQkKCtm3b1mCC09GjRzV16lTl5OQoJCTE1+3Um0v/z7xXr16KjY1Vhw4dtG7dOiUmJvqwM+9yuVzq16+fXnjhBUnSbbfdpgMHDmj58uUNNjS9/vrruvfeexUZGenrVrxu3bp1Wr16tdasWaNbb71VBQUFmjZtmiIjI73y90lo8lNt2rRRUFCQSktLPY6XlpYqIiLCR13hx5o8ebI2bdqk7du3q127dr5up15YLBZ16dJFkhQTE6Pdu3dr8eLFevXVV33cmXfk5+errKxMffv2NY5VV1dr+/btWrJkiSorKxUUFOTDDutHy5YtdfPNN+vLL7/0dSte1bZt21qB/pZbbtFf//pXH3VUv77++mu9//77euutt3zdSr2YPn26Zs2apTFjxkiSevbsqa+//lrp6eleCU2safJTFotFMTExys3NNY65XC7l5uY2yPUhDZ3b7dbkyZO1YcMGbdmyRZ06dfJ1Sz8Zl8ulyspKX7fhNUOGDNH+/ftVUFBgbP369dPYsWNVUFDQIAOTJJ09e1aHDh1S27Ztfd2KV9155521bv/x+eefq0OHDj7qqH6tXLlSYWFhio+P93Ur9eLbb79VYKBntAkKCpLL5fLK+Fxp8mM
pKSlKSEhQv3791L9/fy1atEgVFRUaP368r1vzqrNnz3r83+vhw4dVUFCgVq1aqX379j7szHuSkpK0Zs0a/e1vf1OLFi3kcDgkSaGhoWrSpImPu/Oe1NRU3XvvvWrfvr3OnDmjNWvWaOvWrcrOzvZ1a17TokWLWmvRmjVrptatWzeoNWpPP/207r//fnXo0EHHjx/XnDlzFBQUpIceesjXrXlVcnKy7rjjDr3wwgv6z//8T+3atUuvvfaaXnvtNV+35nUul0srV65UQkKCGjVqmG//999/v37729+qffv2uvXWW/XJJ59o4cKFmjBhgneewCvfwUO9+f3vf+9u376922KxuPv37+/euXOnr1vyug8++MAtqdaWkJDg69a85nLzk+ReuXKlr1vzqgkTJrg7dOjgtlgs7htvvNE9ZMgQ93vvvefrtupdQ7zlwOjRo91t27Z1WywW989+9jP36NGj3V9++aWv26oX77zzjrtHjx7u4OBgd/fu3d2vvfaar1uqF9nZ2W5J7qKiIl+3Um+cTqd76tSp7vbt27tDQkLcN910k/uZZ55xV1ZWemX8ALfbS7fJBAAAaMBY0wQAAGACoQkAAMAEQhMAAIAJhCYAAAATCE0AAAAmEJoAAABMIDQBAACYQGgCAAAwgdAEAABgAqEJAADABEITAACACYQmAAAAE/4/4ebMwtFvOG0AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "df[\"log_votes\"].plot.hist()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the number of missing values for each columm below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reviewText 6\n", "summary 7\n", "verified 0\n", "time 0\n", "rating 0\n", "log_votes 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isna().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Text Processing: Stop words removal and stemming\n", "(Go to top)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /home/ec2-user/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Install the library and functions\n", "import nltk\n", "\n", "nltk.download('punkt')\n", "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create the stop word removal and text cleaning processes below. NLTK library provides a list of common stop words. We will use the list, but remove some of the words from that list (because those words are actually useful to understand the sentiment in the sentence)." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.stem import SnowballStemmer\n", "from nltk.tokenize import word_tokenize\n", "\n", "# Let's get a list of stop words from the NLTK library\n", "stop = stopwords.words('english')\n", "\n", "# These words are important for our problem. We don't want to remove them.\n", "excluding = ['against', 'not', 'don', \"don't\", 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\",\n", "             'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\",\n", "             'haven', \"haven't\", 'isn', \"isn't\", 'mightn', \"mightn't\", 'mustn', \"mustn't\",\n", "             'needn', \"needn't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren',\n", "             \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n", "\n", "# New stop word collection (a set, so membership checks are fast)\n", "stop_words = {word for word in stop if word not in excluding}\n", "\n", "snow = SnowballStemmer('english')\n", "\n", "def process_text(texts):\n", "    final_text_list = []\n", "    for sent in texts:\n", "\n", "        # Treat missing values (anything that is not a string) as empty text\n", "        if not isinstance(sent, str):\n", "            sent = \"\"\n", "\n", "        filtered_sentence = []\n", "\n", "        sent = sent.lower()  # Lowercase\n", "        sent = sent.strip()  # Remove leading/trailing whitespace\n", "        sent = re.sub(r'\\s+', ' ', sent)  # Collapse runs of spaces, tabs and newlines\n", "        sent = re.sub(r'<.*?>', '', sent)  # Remove HTML tags/markup\n", "\n", "        for w in word_tokenize(sent):\n", "            # We are applying some custom filtering here; feel free to try different things.\n", "            # Keep a token only if it is not numeric, is longer than 2 characters and is not a stop word\n", "            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):\n", "                # Stem and add to filtered list\n", "                filtered_sentence.append(snow.stem(w))\n", "        final_string = \" \".join(filtered_sentence)  # Final string of cleaned words\n", "\n", "        final_text_list.append(final_string)\n", "\n", "    return 
final_text_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Train - Validation Split\n", "(Go to top)\n", "\n", "Let's split our dataset into training (90%) and validation (10%) sets. We will use the \"reviewText\", \"summary\", \"time\" and \"rating\" fields to predict the \"log_votes\" field." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "\n", "X_train, X_val, y_train, y_val = train_test_split(\n", "    df[[\"reviewText\", \"summary\", \"time\", \"rating\"]],\n", "    df[\"log_votes\"],\n", "    test_size=0.10,\n", "    shuffle=True,\n", "    random_state=324,\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing the reviewText fields\n", "Processing the summary fields\n" ] } ], "source": [ "print(\"Processing the reviewText fields\")\n", "X_train[\"reviewText\"] = process_text(X_train[\"reviewText\"].tolist())\n", "X_val[\"reviewText\"] = process_text(X_val[\"reviewText\"].tolist())\n", "\n", "print(\"Processing the summary fields\")\n", "X_train[\"summary\"] = process_text(X_train[\"summary\"].tolist())\n", "X_val[\"summary\"] = process_text(X_val[\"summary\"].tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our __process_text()__ function from section 3 replaces missing values with an empty string, so the missing __reviewText__ and __summary__ entries found in section 2 need no separate imputation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Data processing with Pipeline and ColumnTransformer\n", "(Go to top)\n", "\n", "In the previous examples, we have seen how to use a pipeline to prepare a single data field for our machine learning model. This time, we will handle multiple fields: numeric and text fields. We are using the linear regression models from scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. 
\n", "\n", " * For the numerical features pipeline, the __numerical_processor__ below, we use a MinMaxScaler (unlike decision trees, linear models are sensitive to feature scale, especially regularized ones such as Ridge and Lasso). If different processing is desired for different numerical features, separate pipelines should be built, as shown below for the two text features.\n", " * For the text features pipeline, the __text_processor__ below, we use CountVectorizer() for the text fields.\n", " \n", "These per-feature preprocessing steps are then combined in a single ColumnTransformer, which is finally used in a Pipeline along with an estimator. This ensures that the transforms are applied automatically to the raw data both when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Grab model features/inputs and target/output\n", "numerical_features = ['time',\n", "                      'rating']\n", "\n", "text_features = ['summary',\n", "                 'reviewText']\n", "\n", "model_features = numerical_features + text_features\n", "model_target = 'log_votes'" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('data_preprocessing',\n",
       "                 ColumnTransformer(transformers=[('numerical_pre',\n",
       "                                                  Pipeline(steps=[('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['time', 'rating']),\n",
       "                                                 ('text_pre_0',\n",
       "                                                  Pipeline(steps=[('text_vect_0',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  'summary'),\n",
       "                                                 ('text_pre_1',\n",
       "                                                  Pipeline(steps=[('text_vect_1',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  'reviewText')])),\n",
       "                ('lr', LinearRegression())])
" ], "text/plain": [ "Pipeline(steps=[('data_preprocessing',\n", " ColumnTransformer(transformers=[('numerical_pre',\n", " Pipeline(steps=[('num_scaler',\n", " MinMaxScaler())]),\n", " ['time', 'rating']),\n", " ('text_pre_0',\n", " Pipeline(steps=[('text_vect_0',\n", " CountVectorizer(binary=True,\n", " max_features=50))]),\n", " 'summary'),\n", " ('text_pre_1',\n", " Pipeline(steps=[('text_vect_1',\n", " CountVectorizer(binary=True,\n", " max_features=150))]),\n", " 'reviewText')])),\n", " ('lr', LinearRegression())])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.linear_model import LinearRegression\n", "\n", "### COLUMN_TRANSFORMER ###\n", "##########################\n", "\n", "# Preprocess the numerical features\n", "numerical_processor = Pipeline([\n", " ('num_scaler', MinMaxScaler())\n", "])\n", "# Preprocess 1st text feature\n", "text_processor_0 = Pipeline([\n", " ('text_vect_0', CountVectorizer(binary=True, max_features=50))\n", "])\n", "\n", "# Preprocess 2nd text feature (larger vocabulary)\n", "text_processor_1 = Pipeline([\n", " ('text_vect_1', CountVectorizer(binary=True, max_features=150))\n", "])\n", "\n", "# Combine all data preprocessors from above (add more, if you choose to define more!)\n", "# For each processor/step specify: a name, the actual process, and finally the features to be processed\n", "data_preprocessor = ColumnTransformer([\n", " ('numerical_pre', numerical_processor, numerical_features),\n", " ('text_pre_0', text_processor_0, text_features[0]),\n", " ('text_pre_1', text_processor_1, text_features[1])\n", "]) \n", "\n", "### PIPELINE ###\n", "################\n", "\n", "# Pipeline all desired data transformers, along with an estimator at the end\n", "# Later you can set/reach the parameters using the names issued - for hyperparameter tuning, for example\n", "pipeline = Pipeline([\n", " ('data_preprocessing', data_preprocessor),\n", " ('lr', LinearRegression())\n", "])\n", "\n", "# Visualize the pipeline\n", "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n", "from sklearn import set_config\n", "set_config(display='diagram')\n", "pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Train the regressor\n", "(Go to top)\n", "\n", "We train our model by using __.fit()__ on our training dataset. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('data_preprocessing',\n",
       "                 ColumnTransformer(transformers=[('numerical_pre',\n",
       "                                                  Pipeline(steps=[('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['time', 'rating']),\n",
       "                                                 ('text_pre_0',\n",
       "                                                  Pipeline(steps=[('text_vect_0',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  'summary'),\n",
       "                                                 ('text_pre_1',\n",
       "                                                  Pipeline(steps=[('text_vect_1',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  'reviewText')])),\n",
       "                ('lr', LinearRegression())])
" ], "text/plain": [ "Pipeline(steps=[('data_preprocessing',\n", " ColumnTransformer(transformers=[('numerical_pre',\n", " Pipeline(steps=[('num_scaler',\n", " MinMaxScaler())]),\n", " ['time', 'rating']),\n", " ('text_pre_0',\n", " Pipeline(steps=[('text_vect_0',\n", " CountVectorizer(binary=True,\n", " max_features=50))]),\n", " 'summary'),\n", " ('text_pre_1',\n", " Pipeline(steps=[('text_vect_1',\n", " CountVectorizer(binary=True,\n", " max_features=150))]),\n", " 'reviewText')])),\n", " ('lr', LinearRegression())])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit the Pipeline to training data\n", "pipeline.fit(X_train[model_features], y_train.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Fitting Linear Regression models and checking the validation performance\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.1 LinearRegression\n", "Let's first fit __LinearRegression__ from Sklearn library, and check the performance on the validation dataset. 
Using the __coef___ attribute, we can also print the learned weights of the model.\n", "\n", "Find more details on __LinearRegression__ here:\n", "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression on Validation: Mean_squared_error: 0.591002, R_square_score: 0.356090\n", "LinearRegression model weights: \n", " [-1.74577314e+00 -4.16699072e-01 6.55550275e-02 -2.35263228e-02\n", " 8.61220740e-02 -1.01349476e-02 7.93537101e-02 1.00580696e-01\n", " 5.29633789e-02 2.05181032e-02 5.73207961e-02 2.22064160e-01\n", " 9.93746605e-02 -1.37704787e-02 -4.06298595e-02 -1.02553754e-04\n", " 4.40499946e-02 9.05559004e-02 2.17966121e-02 1.26253656e-02\n", " -2.25983349e-02 8.51472731e-03 2.46086741e-02 -2.33539637e-02\n", " -3.30884571e-02 -9.93520115e-03 -1.51318120e-01 -3.76403719e-02\n", " 6.49345534e-02 -1.17077198e-02 1.08909214e-02 1.58695722e-02\n", " -4.96625564e-02 -5.60339836e-02 5.79985155e-02 -8.74198875e-02\n", " 1.68564961e-02 -4.05450902e-02 2.67895702e-02 4.66004523e-02\n", " -9.30063028e-03 1.16291360e-01 2.62058316e-02 -2.14182250e-02\n", " -1.19422704e-02 -4.02101945e-02 -5.45593128e-02 -1.20283610e-01\n", " 1.28812660e-02 5.21457390e-02 -1.36325031e-02 8.86600753e-02\n", " 5.42079431e-03 -4.70404179e-02 9.00120138e-02 6.50392182e-02\n", " 6.40209412e-02 -4.80337724e-02 5.21237257e-02 2.77769541e-02\n", " -1.33472735e-02 2.49326537e-02 1.52713526e-02 4.96132410e-03\n", " 2.69374361e-02 4.02838081e-02 3.99136935e-02 1.09036222e-01\n", " 6.67872768e-02 8.41276207e-02 8.75869336e-02 -1.99736843e-02\n", " -4.10805883e-02 1.20796345e-01 1.03699122e-01 2.50625700e-02\n", " 5.49379324e-02 1.33983956e-02 9.03568378e-04 8.61970536e-03\n", " 1.03039754e-02 1.76767186e-02 1.58121513e-02 3.10935235e-02\n", " 8.68812511e-02 2.62193135e-02 6.90466196e-02 9.47787414e-03\n", " 
-5.32796735e-03 -2.52701243e-02 7.86172025e-02 4.59835043e-02\n", " 2.21098468e-02 1.36482974e-02 -1.70524769e-02 1.81249397e-02\n", " 2.28481058e-02 3.49090942e-02 1.92472778e-03 -7.10653542e-03\n", " 5.47236007e-03 -1.32636957e-02 1.70598083e-02 6.10152748e-02\n", " -5.83130587e-03 6.08769223e-02 4.66497862e-03 1.25991440e-01\n", " 2.07886715e-02 3.18103212e-02 1.37392505e-01 3.61222035e-04\n", " 2.82119724e-03 -1.89558194e-02 2.59798664e-02 1.17412865e-01\n", " 1.52362853e-02 -3.92003538e-02 1.76706026e-02 3.49413161e-02\n", " 1.10495609e-01 -2.62584609e-02 1.74357004e-02 -2.61435462e-02\n", " 2.30332076e-02 5.66528790e-02 5.47466918e-02 2.12465772e-02\n", " 7.42130409e-02 5.54875149e-02 2.96635678e-02 2.42994427e-02\n", " 1.24573503e-02 -3.66267577e-02 5.10829404e-02 -9.19411300e-02\n", " 2.57375037e-02 -1.59823270e-02 4.48793990e-02 -4.43223613e-02\n", " -8.83611002e-04 1.04401352e-02 4.80105825e-02 6.02075576e-02\n", " 3.17552705e-02 9.05583387e-03 -3.71826118e-02 6.11118328e-03\n", " 5.73208843e-02 6.00443297e-02 3.66195424e-02 1.28904881e-02\n", " -8.39218163e-02 9.64028148e-02 6.08002240e-02 1.50603031e-02\n", " 2.91222041e-02 1.49478153e-02 1.21161957e-01 3.57855324e-02\n", " 1.68262163e-02 -3.11501113e-02 1.75311652e-02 4.75839757e-02\n", " -3.34246260e-02 -5.20256808e-02 1.90839522e-02 -2.86798642e-02\n", " 9.26662160e-03 4.65491098e-02 3.57859859e-02 -3.84354749e-02\n", " 1.98149122e-02 -7.98392531e-02 3.79819451e-05 4.44709459e-02\n", " -1.08130571e-02 -7.87849503e-03 -1.15668044e-02 -1.11208300e-01\n", " 1.54510731e-02 8.73681550e-03 2.00028343e-02 -2.14026797e-02\n", " 4.16388636e-03 3.86776539e-02 1.37940318e-02 6.06018219e-02\n", " 7.20383295e-03 3.61070560e-02 6.47849492e-02 5.97982740e-02\n", " 5.21145861e-02 4.88811939e-02 1.04418306e-02 2.96479498e-02\n", " 5.59811016e-02 1.21381519e-01 1.10283847e-03 4.90773178e-03\n", " -2.28286540e-02 1.79477653e-02]\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from 
sklearn.metrics import r2_score, mean_squared_error\n", "\n", "lrRegressor_val_predictions = pipeline.predict(X_val[model_features])\n", "print(\"LinearRegression on Validation: Mean_squared_error: %f, R_square_score: %f\" % \\\n", " (mean_squared_error(y_val, lrRegressor_val_predictions),r2_score(y_val, lrRegressor_val_predictions)))\n", "print(\"LinearRegression model weights: \\n\", pipeline.named_steps['lr'].coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.2 Ridge (Linear Regression with L2 regularization)\n", "Let's now fit __Ridge__ from the Sklearn library, and check the performance on the validation dataset.\n", "\n", "Find more details on __Ridge__ here:\n", "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html\n", "\n", "To improve the performance of a LinearRegression model, __Ridge__ tunes model complexity by adding an $L_2$ penalty on the weights to the model cost function:\n", "\n", "$$\\text{C}_{\\text{regularized}}(\\textbf{w}) = \\text{C}(\\textbf{w}) + {alpha}∗||\\textbf{w}||_2^2$$\n", "\n", "where $\\textbf{w}$ is the model weights vector, and $||\\textbf{w}||_2^2 = \\sum \\textbf{w}_i^2$.\n", "\n", "The strength of the regularization is controlled by the regularization parameter, $alpha$: a smaller value of $alpha$ gives weaker regularization; a larger value of $alpha$ gives stronger regularization. 
\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ridge on Validation: Mean_squared_error: 0.589688, R_square_score: 0.357521\n", "Ridge model weights: \n", " [-1.62654931e+00 -4.03980500e-01 5.81220902e-02 -2.25711041e-02\n", " 7.88328713e-02 -1.08712774e-02 5.87454888e-02 8.20379113e-02\n", " 4.34748520e-02 1.30718660e-02 4.92064953e-02 1.56653709e-01\n", " 3.40521265e-02 -1.37047185e-02 -4.37902511e-02 -4.00150710e-03\n", " 3.66506119e-02 7.16482168e-02 1.45858409e-02 4.34449427e-03\n", " -2.20149581e-02 -2.25600565e-03 8.19135963e-03 -2.43087375e-02\n", " -3.32077788e-02 -9.60478980e-03 -1.21408190e-01 -7.16105716e-02\n", " 5.40908434e-02 -8.04966729e-03 8.44027269e-03 1.24341154e-02\n", " -4.83056785e-02 -5.35369894e-02 5.08590586e-02 -4.52176597e-02\n", " 1.90388401e-02 -4.23436565e-02 -2.94076371e-02 3.36209818e-02\n", " -1.18991231e-02 9.43684398e-02 -2.23346459e-02 -1.74423405e-02\n", " -1.46565163e-02 -3.34405210e-02 -5.02863381e-02 -8.35639539e-02\n", " 6.56486970e-03 4.45007519e-02 -1.77680247e-02 7.81482628e-02\n", " 4.97927904e-03 -4.80963778e-02 8.47090401e-02 6.29460655e-02\n", " 6.53205098e-02 -4.79727793e-02 4.66023553e-02 2.79484450e-02\n", " -1.17575875e-02 2.64193698e-02 1.68832290e-02 8.01533362e-03\n", " 2.67844455e-02 4.32633929e-02 3.79657411e-02 1.03874963e-01\n", " 6.48396191e-02 8.21807405e-02 8.55962826e-02 -2.02461793e-02\n", " -3.94784910e-02 1.17263960e-01 1.00835548e-01 2.55930293e-02\n", " 5.28142874e-02 9.45265464e-03 2.21566733e-03 1.13596966e-02\n", " 1.36937682e-02 1.70790310e-02 1.53268325e-02 3.29887677e-02\n", " 8.30255263e-02 2.52772590e-02 6.93033099e-02 9.36214427e-03\n", " -3.58456009e-03 -2.37789709e-02 7.88116054e-02 4.21964660e-02\n", " 2.37008621e-02 1.14793753e-02 -3.40475479e-03 1.97052950e-02\n", " 2.37479985e-02 3.48860874e-02 2.84385769e-03 -6.23881008e-03\n", " 6.39140349e-03 -1.03076142e-02 1.89189355e-02 
5.96529854e-02\n", " -2.96285413e-03 5.96278846e-02 6.51257550e-03 1.21468297e-01\n", " 2.29779822e-02 3.35291301e-02 1.31797282e-01 -3.69072675e-03\n", " 2.38394158e-04 -1.68113652e-02 2.52795364e-02 1.15885409e-01\n", " 1.61784887e-02 -3.64252648e-02 1.72712231e-02 3.14786817e-02\n", " 1.09530994e-01 -2.44276040e-02 1.56694940e-02 -2.33400821e-02\n", " 2.31731767e-02 5.77542149e-02 5.61254467e-02 2.07419400e-02\n", " 7.36677216e-02 5.79194588e-02 2.92110614e-02 2.45904683e-02\n", " 1.17346722e-02 -3.41900531e-02 5.20640373e-02 -8.62899691e-02\n", " 2.72612015e-02 -1.44112357e-02 4.69216838e-02 -4.23409001e-02\n", " -2.56763095e-03 8.26979815e-03 4.55174780e-02 5.83923851e-02\n", " 3.45502411e-02 9.16790259e-03 -3.29566214e-02 1.14049197e-02\n", " 6.09025473e-02 6.05713340e-02 3.62507711e-02 1.21651047e-02\n", " -7.88952321e-02 9.35845486e-02 5.98854008e-02 1.67738024e-02\n", " 2.97146399e-02 1.61466753e-02 1.19440166e-01 3.64894193e-02\n", " 1.82890723e-02 -2.86903851e-02 1.81400260e-02 4.66683649e-02\n", " -3.11908826e-02 -4.59074292e-02 1.90952004e-02 -2.52876919e-02\n", " 1.05388344e-02 4.50892554e-02 4.04079976e-02 -3.68933489e-02\n", " 2.00643735e-02 -7.38852595e-02 1.06757103e-03 4.80133393e-02\n", " -1.19690383e-02 -2.12773490e-03 -9.23910621e-03 -1.07697471e-01\n", " 1.39679499e-02 1.08400317e-02 2.24968845e-02 -1.94375049e-02\n", " 5.19196268e-03 3.92066545e-02 1.28766699e-02 6.09577345e-02\n", " 7.95772110e-03 3.74589691e-02 6.43038952e-02 6.08809993e-02\n", " 5.21411158e-02 4.94554112e-02 1.15281299e-02 2.69802561e-02\n", " 5.46104077e-02 1.16341171e-01 6.07930760e-04 7.25457407e-03\n", " -2.07763124e-02 1.56772576e-02]\n" ] } ], "source": [ "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import r2_score, mean_squared_error\n", "\n", "# Let's update the pipeline with Ridge regression model\n", "ridge_pipeline = Pipeline([\n", " ('data_preprocessing', data_preprocessor),\n", " ('ridge', Ridge(alpha = 100))\n", "])\n", "\n", 
"ridge_pipeline.fit(X_train[model_features], y_train.values)\n", "ridgeRegressor_val_predictions = ridge_pipeline.predict(X_val[model_features])\n", "\n", "print(\"Ridge on Validation: Mean_squared_error: %f, R_square_score: %f\" % \\\n", " (mean_squared_error(y_val, ridgeRegressor_val_predictions),r2_score(y_val, ridgeRegressor_val_predictions)))\n", "\n", "print(\"Ridge model weights: \\n\", ridge_pipeline.named_steps['ridge'].coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.3 LASSO (Linear Regression with L1 regularization)\n", "Let's also fit __Lasso__ from the Sklearn library, and check the performance on the validation dataset.\n", "\n", "Find more details on __Lasso__ here:\n", "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html\n", "\n", "__Lasso__ tunes model complexity by adding an $L_1$ penalty on the weights to the model cost function:\n", "\n", "$$\\text{C}_{\\text{regularized}}(\\textbf{w}) = \\text{C}(\\textbf{w}) + alpha∗||\\textbf{w}||_1$$\n", "\n", "where $\\textbf{w}$ is the model weights vector, and $||\\textbf{w}||_1 = \\sum |\\textbf{w}_i|$. \n", "\n", "Again, the strength of the regularization is controlled by the regularization parameter, $alpha$. Due to the geometry of the $L_1$ norm, with __Lasso__ some of the weights shrink all the way to 0, leading to sparsity: some of the features do not contribute to the model after all!" 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lasso on Validation: Mean_squared_error: 0.589867, R_square_score: 0.357327\n", "Lasso model weights: \n", " [-1.72010524e+00 -3.89067686e-01 2.58213525e-02 -0.00000000e+00\n", " 3.82842552e-02 -0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00 1.06602376e-02 1.26472385e-01\n", " 0.00000000e+00 -0.00000000e+00 -2.50955076e-02 0.00000000e+00\n", " 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " -0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00\n", " -0.00000000e+00 -0.00000000e+00 -4.63960564e-02 -5.02315491e-02\n", " 0.00000000e+00 -0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " -0.00000000e+00 -0.00000000e+00 3.47943828e-02 -1.61006740e-02\n", " 0.00000000e+00 -7.44345517e-03 -0.00000000e+00 0.00000000e+00\n", " -0.00000000e+00 1.95457533e-02 -0.00000000e+00 -0.00000000e+00\n", " 0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00\n", " -0.00000000e+00 0.00000000e+00 -5.88222939e-05 2.42462224e-02\n", " 0.00000000e+00 -1.67083722e-02 7.65288954e-02 5.11191702e-02\n", " 6.52495708e-02 -1.95841027e-02 4.24640009e-02 1.94354445e-02\n", " -0.00000000e+00 2.31689366e-02 6.60345613e-03 0.00000000e+00\n", " 7.77241315e-03 4.24802208e-02 2.46611963e-02 9.84988824e-02\n", " 5.66360446e-02 7.04583867e-02 7.39390006e-02 -7.30202686e-03\n", " -2.01568144e-02 1.09070226e-01 9.85161084e-02 1.89173069e-02\n", " 4.21733593e-02 8.92224900e-03 0.00000000e+00 1.71690041e-03\n", " 1.12690153e-02 8.37400874e-03 0.00000000e+00 2.08424501e-02\n", " 6.58613271e-02 1.11684319e-02 6.50926378e-02 0.00000000e+00\n", " 0.00000000e+00 -0.00000000e+00 7.75635511e-02 1.81639732e-02\n", " 1.34094779e-02 3.47083919e-03 0.00000000e+00 1.94450618e-02\n", " 1.64302217e-02 2.08545783e-02 0.00000000e+00 -0.00000000e+00\n", " 4.76631063e-04 -0.00000000e+00 1.11247950e-02 4.88585581e-02\n", " 
0.00000000e+00 4.50304949e-02 0.00000000e+00 1.18703397e-01\n", " 0.00000000e+00 3.08937134e-02 1.23558636e-01 0.00000000e+00\n", " 0.00000000e+00 -0.00000000e+00 1.16569677e-02 1.21335603e-01\n", " 1.45310173e-02 -5.73789112e-03 0.00000000e+00 1.61575418e-02\n", " 1.09974422e-01 -0.00000000e+00 5.62051077e-03 -4.01496300e-03\n", " 7.70560592e-03 5.84355530e-02 5.64992818e-02 4.38075213e-03\n", " 6.32725981e-02 4.04038476e-02 1.22287629e-02 2.22709234e-02\n", " 1.04258533e-02 -1.31483697e-02 4.90841487e-02 -7.81193766e-02\n", " 2.65567658e-02 -0.00000000e+00 4.76621958e-02 -7.68826444e-03\n", " 0.00000000e+00 0.00000000e+00 2.78708595e-02 4.61145064e-02\n", " 4.89654813e-03 7.44021251e-03 -2.00508423e-02 9.89654553e-03\n", " 6.13162205e-02 5.69293636e-02 1.80026670e-02 0.00000000e+00\n", " -6.68203206e-02 8.53132296e-02 5.07280820e-02 1.31730549e-02\n", " 2.11212109e-02 0.00000000e+00 1.17546023e-01 2.52114748e-02\n", " 7.19393639e-03 -4.51436030e-05 1.42525486e-02 4.03626273e-02\n", " -8.38373569e-03 -6.87175976e-03 4.08641734e-03 -0.00000000e+00\n", " 0.00000000e+00 3.92851230e-02 4.45392213e-02 -5.89714208e-05\n", " 1.65714427e-02 -3.69897761e-02 0.00000000e+00 5.00988888e-02\n", " -0.00000000e+00 0.00000000e+00 0.00000000e+00 -9.13432006e-02\n", " 0.00000000e+00 4.31028874e-03 1.26817405e-02 -2.24943717e-03\n", " 4.04191830e-03 3.34928099e-02 2.29938781e-03 5.26626213e-02\n", " 4.47070643e-03 3.24783959e-02 6.23245733e-02 5.60383137e-02\n", " 5.44973737e-02 4.95496624e-02 7.21334713e-03 2.96171512e-02\n", " 4.54341406e-02 1.06924613e-01 -0.00000000e+00 0.00000000e+00\n", " -5.10833926e-03 1.40971226e-02]\n" ] } ], "source": [ "from sklearn.linear_model import Lasso\n", "from sklearn.metrics import r2_score, mean_squared_error\n", "\n", "# Let's update the pipeline with Lasso regression model\n", "lasso_pipeline = Pipeline([\n", " ('data_preprocessing', data_preprocessor),\n", " ('lasso', Lasso(alpha = 0.001))\n", "])\n", "\n", 
"lasso_pipeline.fit(X_train[model_features], y_train.values)\n", "lassoRegressor_val_predictions = lasso_pipeline.predict(X_val[model_features])\n", "\n", "print(\"Lasso on Validation: Mean_squared_error: %f, R_square_score: %f\" % \\\n", " (mean_squared_error(y_val, lassoRegressor_val_predictions),r2_score(y_val, lassoRegressor_val_predictions)))\n", "\n", "print(\"Lasso model weights: \\n\", lasso_pipeline.named_steps['lasso'].coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.4 ElasticNet (Linear Regression with L2 and L1 regularization)\n", "Let's finally try __ElasticNet__ from Sklearn library, and check the performance on the validation dataset.\n", "\n", "Find more details on __ElasticNet__ here:\n", "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html\n", "\n", "__ElasticNet__ is tuning model complexity by adding both $L_2$ and $L_1$ penalty scores for complexity to the model's cost function:\n", "\n", "$$\\text{C}_{\\text{regularized}}(\\textbf{w}) = \\text{C}(\\textbf{w}) + 0.5*alpha∗(1-\\textit{l1}_{ratio})||\\textbf{w}||_2^2 + alpha∗\\textit{l1}_{ratio}∗||\\textbf{w}||_1$$\n", "\n", "and using two parameters, $alpha$ and $\\textit{l1}_{ratio}$, to control the strength of the regularization." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ElasticNet on Validation: Mean_squared_error: 0.589963, R_square_score: 0.357222\n", "ElasticNet model weights: \n", " [-1.68706874e+00 -4.07625447e-01 5.80533276e-02 -1.74342339e-02\n", " 7.71012102e-02 -2.79938007e-03 6.08704201e-02 8.27085819e-02\n", " 4.01079987e-02 1.22164446e-02 4.90111408e-02 1.81185231e-01\n", " 5.72731714e-02 -7.48911322e-03 -4.03361893e-02 -4.30131224e-04\n", " 3.32676913e-02 7.24489929e-02 1.04312914e-02 1.44020260e-03\n", " -1.70751670e-02 0.00000000e+00 5.69117077e-03 -1.75438933e-02\n", " -2.71578678e-02 -7.12566653e-03 -1.26970793e-01 -5.45150508e-02\n", " 5.15288749e-02 -8.76079762e-04 7.65163810e-03 9.88670210e-03\n", " -4.33266719e-02 -4.74613861e-02 5.27279477e-02 -6.00927840e-02\n", " 1.05449510e-02 -4.03282619e-02 -0.00000000e+00 3.02110416e-02\n", " -2.58696608e-03 9.72705366e-02 -0.00000000e+00 -1.08439478e-02\n", " -8.05392460e-03 -2.82054308e-02 -4.79513248e-02 -8.53745513e-02\n", " 0.00000000e+00 4.13017601e-02 -1.32683702e-02 7.47598512e-02\n", " 4.29318642e-03 -4.45466901e-02 8.62029311e-02 6.26287563e-02\n", " 6.45280064e-02 -4.50540665e-02 4.86399069e-02 2.68667074e-02\n", " -1.01189203e-02 2.55678660e-02 1.50411873e-02 5.94743167e-03\n", " 2.45497805e-02 4.18197492e-02 3.75863429e-02 1.05350785e-01\n", " 6.45827819e-02 8.16920678e-02 8.50396766e-02 -1.92542247e-02\n", " -3.84084998e-02 1.18160062e-01 1.01825789e-01 2.45545932e-02\n", " 5.25886285e-02 1.15784649e-02 1.22055403e-03 9.17429902e-03\n", " 1.19036314e-02 1.64724437e-02 1.35701623e-02 3.11118328e-02\n", " 8.30514230e-02 2.43230082e-02 6.87413774e-02 8.40801281e-03\n", " -1.89335613e-03 -2.18277179e-02 7.88095533e-02 4.11053130e-02\n", " 2.18727044e-02 1.17358428e-02 -5.38738308e-03 1.87305169e-02\n", " 2.24699570e-02 3.31072597e-02 1.69661089e-03 -4.62130160e-03\n", " 5.04998495e-03 -8.68897055e-03 1.68907133e-02 
5.91045780e-02\n", " -7.87592543e-04 5.86171776e-02 4.09090917e-03 1.22948260e-01\n", " 1.93750677e-02 3.23713804e-02 1.33235522e-01 -0.00000000e+00\n", " 1.04567821e-04 -1.52113462e-02 2.39688365e-02 1.17479237e-01\n", " 1.60226689e-02 -3.45103379e-02 1.46060847e-02 3.13436287e-02\n", " 1.10006644e-01 -2.29315833e-02 1.60635679e-02 -2.33524240e-02\n", " 2.13945444e-02 5.73501143e-02 5.54829406e-02 1.90147288e-02\n", " 7.28260875e-02 5.60818827e-02 2.76391306e-02 2.42511949e-02\n", " 1.15196184e-02 -3.36649539e-02 5.10161063e-02 -8.89813950e-02\n", " 2.64829788e-02 -1.25957025e-02 4.62400475e-02 -3.97579749e-02\n", " -0.00000000e+00 7.90135047e-03 4.52666170e-02 5.78268289e-02\n", " 2.84180954e-02 8.85493041e-03 -3.37988175e-02 8.80560276e-03\n", " 5.99676159e-02 5.96939271e-02 3.44039817e-02 1.12371157e-02\n", " -8.01106811e-02 9.35775293e-02 5.93942119e-02 1.54218302e-02\n", " 2.84324048e-02 1.35352644e-02 1.19857160e-01 3.49082498e-02\n", " 1.61413198e-02 -2.68365149e-02 1.74494864e-02 4.60901761e-02\n", " -3.02599073e-02 -4.46047674e-02 1.73595678e-02 -2.36046871e-02\n", " 7.97216178e-03 4.48052587e-02 3.89272497e-02 -3.40282621e-02\n", " 1.97509694e-02 -7.30542845e-02 0.00000000e+00 4.74082231e-02\n", " -7.51778123e-03 -2.05099807e-03 -7.61479018e-03 -1.07341257e-01\n", " 1.23865483e-02 9.08048512e-03 2.03503967e-02 -1.88313074e-02\n", " 4.77047920e-03 3.84286106e-02 1.19681889e-02 5.90752754e-02\n", " 7.18757654e-03 3.60911593e-02 6.44898529e-02 5.95977797e-02\n", " 5.22282033e-02 4.88821127e-02 1.09551351e-02 2.82955014e-02\n", " 5.41002158e-02 1.17644956e-01 0.00000000e+00 4.00313384e-03\n", " -2.03658410e-02 1.64129324e-02]\n" ] } ], "source": [ "from sklearn.linear_model import ElasticNet\n", "from sklearn.metrics import r2_score, mean_squared_error\n", "\n", "# Let's update the pipeline with ElasticNet regression model\n", "elastic_net_pipeline = Pipeline([\n", " ('data_preprocessing', data_preprocessor),\n", " ('elastic_net', ElasticNet(alpha = 0.001, 
l1_ratio = 0.1))\n", "])\n", "\n", "elastic_net_pipeline.fit(X_train[model_features], y_train.values)\n", "enRegressor_val_predictions = elastic_net_pipeline.predict(X_val[model_features])\n", "\n", "print(\"ElasticNet on Validation: Mean_squared_error: %f, R_square_score: %f\" % \\\n", " (mean_squared_error(y_val, enRegressor_val_predictions),r2_score(y_val, enRegressor_val_predictions)))\n", "\n", "print(\"ElasticNet model weights: \\n\", elastic_net_pipeline.named_steps['elastic_net'].coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.5 Weight shrinkage and sparsity\n", "\n", "Let's compare the ranges of the absolute weights for all these regression models:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression weights range: \n", " 3.798194508242648e-05 1.745773143590561\n", "Ridge weights range: \n", " 0.00023839415841929562 1.626549306436705\n", "Lasso weights range: \n", " 0.0 1.7201052442656493\n", "ElasticNet weights range: \n", " 0.0 1.6870687401639743\n" ] } ], "source": [ "import numpy as np\n", "\n", "lin_regression_coeffs = pipeline.named_steps['lr'].coef_\n", "ridge_regression_coeffs = ridge_pipeline.named_steps['ridge'].coef_\n", "lasso_regression_coeffs = lasso_pipeline.named_steps['lasso'].coef_\n", "enet_regression_coeffs = elastic_net_pipeline.named_steps['elastic_net'].coef_\n", "\n", "print('LinearRegression weights range: \\n', np.abs(lin_regression_coeffs).min(), np.abs(lin_regression_coeffs).max())\n", "print('Ridge weights range: \\n', np.abs(ridge_regression_coeffs).min(), np.abs(ridge_regression_coeffs).max())\n", "print('Lasso weights range: \\n', np.abs(lasso_regression_coeffs).min(), np.abs(lasso_regression_coeffs).max())\n", "print('ElasticNet weights range: \\n', np.abs(enet_regression_coeffs).min(), np.abs(enet_regression_coeffs).max())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The weights of all regularized 
models are lowered compared to __LinearRegression__, with some of the weights of __Lasso__ and __ElasticNet__ shrunk all the way to 0. Through this sparsity, the __Lasso__ regularization reduces the number of features that contribute to the model, performing feature selection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Ideas for improvement\n", "(Go to top)\n", "\n", "One way to improve the performance of a linear regression model is to try different strengths of regularization, here controlled by the parameters $alpha$ and $\\textit{l1}_{ratio}$." ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }
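
As a rough illustration of the idea in section 8, the sketch below tries several regularization strengths for __Ridge__ and compares the validation error for each. It is only a minimal sketch under stated assumptions: the data is synthetic (it is not the review dataset), and the names `X`, `y`, `true_w`, `alphas`, `val_mse` and `coef_norm`, as well as the grid of alpha values, are hypothetical choices made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the review dataset used above)
rng = np.random.RandomState(324)
X = rng.normal(size=(500, 20))
true_w = np.zeros(20)
true_w[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]  # only 5 informative features
y = X @ true_w + rng.normal(scale=1.0, size=500)

# Same 90% / 10% train - validation split as in section 4
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.10, shuffle=True, random_state=324
)

# Hypothetical grid of regularization strengths to try
alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
val_mse = {}    # validation MSE for each alpha
coef_norm = {}  # L2 norm of the learned weights for each alpha

for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_mse[alpha] = mean_squared_error(y_va, model.predict(X_va))
    coef_norm[alpha] = np.linalg.norm(model.coef_)
    print("alpha=%g, validation MSE: %f, weights L2 norm: %f"
          % (alpha, val_mse[alpha], coef_norm[alpha]))

# Keep the alpha with the lowest validation error
best_alpha = min(val_mse, key=val_mse.get)
print("Best alpha on validation:", best_alpha)
```

In this notebook, the same loop could instead update the existing pipeline with `ridge_pipeline.set_params(ridge__alpha=alpha)` before refitting on `X_train[model_features]`, reusing the step name `'ridge'` from section 7.2. Which value of $alpha$ works best depends on the data, so the grid above should be treated as a starting point only.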