{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Natural Language Processing - Lecture 2\n", "\n", "## Logistic Regression Model and Threshold Calibration\n", "\n", "In this notebook, we use Logistic Regression to predict the __isPositive__ field of our final dataset, and we look at how probability threshold calibration can help improve a classifier's performance.\n", "\n", "1. Reading the dataset\n", "2. Exploratory data analysis\n", "3. Stop word removal and stemming\n", "4. Train - Validation Split\n", "5. Data processing with Pipeline and ColumnTransformer\n", "6. Fit the classifier (find more details on __LogisticRegression__ here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)\n", "7. Test the classifier\n", "8. Ideas for improvement: Probability threshold calibration (optional)\n", "\n", "Overall dataset schema:\n", "* __reviewText:__ Text of the review\n", "* __summary:__ Summary of the review\n", "* __verified:__ Whether the purchase was verified (True or False)\n", "* __time:__ UNIX timestamp for the review\n", "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n", "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Reading the dataset\n", "(Go to top)\n", "\n", "We will use the __pandas__ library to read our dataset." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five rows of the dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " reviewText \\\n", "0 PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO... \n", "1 unable to open or use \n", "2 Waste of money!!! It wouldn't load to my system. \n", "3 I attempted to install this OS on two differen... \n", "4 I've spent 14 fruitless hours over the past tw... \n", "\n", " summary verified time \\\n", "0 IDEAL FOR BEGINNER! True 1361836800 \n", "1 Two Stars True 1452643200 \n", "2 Dont buy it! True 1433289600 \n", "3 I attempted to install this OS on two differen... True 1518912000 \n", "4 Do NOT Download. True 1441929600 \n", "\n", " log_votes isPositive \n", "0 0.000000 1.0 \n", "1 0.000000 0.0 \n", "2 0.000000 0.0 \n", "3 0.000000 0.0 \n", "4 1.098612 0.0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Exploratory data analysis\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the target distribution of our dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 43692\n", "0.0 26308\n", "Name: isPositive, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"isPositive\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the number of missing values in each column:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reviewText 11\n", "summary 14\n", "verified 0\n", "time 0\n", "log_votes 0\n", "isPositive 0\n", "dtype: int64\n" ] } ], "source": [ "print(df.isna().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two text fields have a few missing values: 11 in reviewText and 14 in summary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Text Processing: Stop word removal and stemming\n", "(Go to top)\n", "\n", "We will apply the text processing methods discussed in class. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /home/ec2-user/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import NLTK and download the tokenizer and stop word data\n", "import nltk\n", "\n", "nltk.download('punkt')\n", "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will create the stop word removal and text cleaning processes below. The NLTK library provides a list of common stop words. We will use that list, but keep some of its words in the text, because those words (mostly negations) are useful for understanding the sentiment of a sentence." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import nltk, re\n", "from nltk.corpus import stopwords\n", "from nltk.stem import SnowballStemmer\n", "from nltk.tokenize import word_tokenize\n", "\n", "# Let's get a list of stop words from the NLTK library\n", "stop = stopwords.words('english')\n", "\n", "# These words are important for our problem. 
We don't want to remove them.\n", "excluding = ['against', 'not', 'don', \"don't\", 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\",\n", "             'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\",\n", "             'haven', \"haven't\", 'isn', \"isn't\", 'mightn', \"mightn't\", 'mustn', \"mustn't\",\n", "             'needn', \"needn't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren',\n", "             \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n", "\n", "# New stop word list\n", "stop_words = [word for word in stop if word not in excluding]\n", "\n", "snow = SnowballStemmer('english')\n", "\n", "def process_text(texts):\n", "    final_text_list = []\n", "    for sent in texts:\n", "\n", "        # Treat missing values (anything that is not a string) as empty text\n", "        if not isinstance(sent, str):\n", "            sent = \"\"\n", "\n", "        filtered_sentence = []\n", "\n", "        sent = sent.lower()  # Lowercase\n", "        sent = sent.strip()  # Remove leading/trailing whitespace\n", "        sent = re.sub(r'\\s+', ' ', sent)  # Collapse runs of whitespace (spaces, tabs, newlines)\n", "        sent = re.sub('<.*?>', '', sent)  # Remove HTML tags/markup\n", "\n", "        for w in word_tokenize(sent):\n", "            # We are applying some custom filtering here, feel free to try different things\n", "            # Keep the token if it is not numeric, is longer than 2 characters, and is not a stop word\n", "            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):\n", "                # Stem and add to filtered list\n", "                filtered_sentence.append(snow.stem(w))\n", "        final_string = \" \".join(filtered_sentence)  # Final string of cleaned words\n", "\n", "        final_text_list.append(final_string)\n", "\n", "    return final_text_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Train - Validation Split\n", "(Go to top)\n", "\n", "Let's split our dataset into training (90%) and validation (10%). 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_val, y_train, y_val = train_test_split(df[[\"reviewText\", \"summary\", \"time\", \"log_votes\"]],\n", "                                                  df[\"isPositive\"],\n", "                                                  test_size=0.10,\n", "                                                  shuffle=True,\n", "                                                  random_state=324\n", "                                                 )" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processing the reviewText fields\n", "Processing the summary fields\n" ] } ], "source": [ "print(\"Processing the reviewText fields\")\n", "X_train[\"reviewText\"] = process_text(X_train[\"reviewText\"].tolist())\n", "X_val[\"reviewText\"] = process_text(X_val[\"reviewText\"].tolist())\n", "\n", "print(\"Processing the summary fields\")\n", "X_train[\"summary\"] = process_text(X_train[\"summary\"].tolist())\n", "X_val[\"summary\"] = process_text(X_val[\"summary\"].tolist())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our process_text() function from section 3 replaces missing values with an empty string." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Data processing with Pipeline and ColumnTransformer\n", "(Go to top)\n", "\n", "In the previous examples, we have seen how to use a Pipeline to prepare a single data field for our machine learning model. This time, we will work with multiple fields: numeric and text fields. Find more details on __LogisticRegression__ here:\n", "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", "\n", " * For the numerical features pipeline, the __numerical_processor__ below, we use a MinMaxScaler (scaling is not required for tree-based models such as Decision Trees, but a regularized Logistic Regression benefits from features on a comparable scale, and it shows how to use more data transforms). 
If different processing is desired for different numerical features, different pipelines should be built - just like shown below for the two text features.\n", " * For the text features pipeline, the __text_processor__ below, we use CountVectorizer() for the text fields.\n", " \n", "The selective preparations of the dataset features are then put together into a collective ColumnTransformer, to be finally used in a Pipeline along with an estimator. This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Grab model features/inputs and target/output\n", "numerical_features = ['time',\n", " 'log_votes']\n", "\n", "text_features = ['summary',\n", " 'reviewText']\n", "\n", "model_features = numerical_features + text_features\n", "model_target = 'isPositive'" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(steps=[('data_preprocessing',\n", " ColumnTransformer(transformers=[('numerical_pre',\n", " Pipeline(steps=[('num_scaler',\n", " MinMaxScaler())]),\n", " ['time', 'log_votes']),\n", " ('text_pre_0',\n", " Pipeline(steps=[('text_vect_0',\n", " CountVectorizer(binary=True,\n", " max_features=50))]),\n", " 'summary'),\n", " ('text_pre_1',\n", " Pipeline(steps=[('text_vect_1',\n", " CountVectorizer(binary=True,\n", " max_features=150))]),\n", " 'reviewText')])),\n", " ('logistic_regression', LogisticRegression(C=0.1))])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "### COLUMN_TRANSFORMER ###\n", "##########################\n", "\n", "# Preprocess the numerical features\n", "numerical_processor = Pipeline([\n", "    ('num_scaler', MinMaxScaler())\n", "])\n", "\n", "# Preprocess 1st text feature (summary)\n", "text_processor_0 = Pipeline([\n", "    ('text_vect_0', CountVectorizer(binary=True, max_features=50))\n", "])\n", "\n", "# Preprocess 2nd text feature (reviewText, larger vocabulary)\n", "text_processor_1 = Pipeline([\n", "    ('text_vect_1', CountVectorizer(binary=True, max_features=150))\n", "])\n", "\n", "# Combine all data preprocessors from above (add more, if you choose to define more!)\n", "# For each processor/step specify: a name, the actual process, and finally the features to be processed\n", "data_preprocessor = ColumnTransformer([\n", "    ('numerical_pre', numerical_processor, numerical_features),\n", "    ('text_pre_0', text_processor_0, text_features[0]),\n", "    ('text_pre_1', text_processor_1, text_features[1])\n", "])\n", "\n", "### PIPELINE ###\n", "################\n", "\n", "# Pipeline with all desired data transformers, along with an estimator at the end\n", "# Later you can set/reach the parameters using the step names given here - for hyperparameter tuning, for example\n", "pipeline = Pipeline([\n", "    ('data_preprocessing', data_preprocessor),\n", "    ('logistic_regression', LogisticRegression(penalty='l2',\n", "                                               C=0.1))\n", "])\n", "\n", "# Visualize the pipeline\n", "# This comes in handy especially when building more complex pipelines that string together multiple preprocessing steps\n", "from sklearn import set_config\n", "set_config(display='diagram')\n", "pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Fit the classifier\n", "(Go to top)\n", "\n", "We train our model by using __.fit()__ on our training dataset. \n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(steps=[('data_preprocessing',\n", " ColumnTransformer(transformers=[('numerical_pre',\n", " Pipeline(steps=[('num_scaler',\n", " MinMaxScaler())]),\n", " ['time', 'log_votes']),\n", " ('text_pre_0',\n", " Pipeline(steps=[('text_vect_0',\n", " CountVectorizer(binary=True,\n", " max_features=50))]),\n", " 'summary'),\n", " ('text_pre_1',\n", " Pipeline(steps=[('text_vect_1',\n", " CountVectorizer(binary=True,\n", " max_features=150))]),\n", " 'reviewText')])),\n", " ('logistic_regression', LogisticRegression(C=0.1))])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit the Pipeline to training data\n", "pipeline.fit(X_train, y_train.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Test the classifier\n", "(Go to top)\n", "\n", "Let's evaluate the performance of the trained classifier. We use __.predict()__ this time. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1995 610]\n", " [ 450 3945]]\n", " precision recall f1-score support\n", "\n", " 0.0 0.82 0.77 0.79 2605\n", " 1.0 0.87 0.90 0.88 4395\n", "\n", " accuracy 0.85 7000\n", " macro avg 0.84 0.83 0.84 7000\n", "weighted avg 0.85 0.85 0.85 7000\n", "\n", "Accuracy (validation): 0.8485714285714285\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n", "\n", "# Use the fitted pipeline to make predictions on the validation dataset\n", "val_predictions = pipeline.predict(X_val)\n", "print(confusion_matrix(y_val.values, val_predictions))\n", "print(classification_report(y_val.values, val_predictions))\n", "print(\"Accuracy (validation):\", accuracy_score(y_val.values, val_predictions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. 
Ideas for improvement: Probability threshold calibration (optional)\n", "(Go to top)\n", "\n", "Besides tuning __LogisticRegression__ hyperparameter values, one other path to improve a classifier's performance is to dig deeper into how the classifier actually assigns class membership.\n", "\n", "**Binary predictions versus probability predictions.** We often use __classifier.predict()__ to examine classifier binary predictions, while in fact the outputs of most classifiers are real-valued, not binary. For most classifiers in sklearn, the method __classifier.predict_proba()__ returns class probabilities as a two-dimensional numpy array of shape (n_samples, n_classes) where the classes are lexicographically ordered. \n", "\n", "For our example, let's look at the first 5 predictions we made, in binary format and in real-valued probability format:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1., 1., 1., 1., 0.])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.predict(X_val)[0:5]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.03677896, 0.96322104],\n", " [0.03782395, 0.96217605],\n", " [0.09222425, 0.90777575],\n", " [0.01059353, 0.98940647],\n", " [0.84317225, 0.15682775]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.predict_proba(X_val)[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How are the predicted probabilities used to decide class membership?** On each row of predict_proba output, the probabilities values sum to 1. There are two columns, one for each response class: column 0 - predicted probability that each observation is a member of class 0; column 1 - predicted probability that each observation is a member of class 1. 
The predicted class is the one with the highest probability.\n", "\n", "The key here is that a **threshold of 0.5** is used by default (for binary problems) to convert predicted probabilities into class predictions: class 1 if the predicted probability of class 1 (column 1) is greater than 0.5, and class 0 otherwise.\n", "\n", "**Can we improve classifier performance by changing the classification threshold?** Let's **adjust** the classification threshold to influence the performance of the classifier. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.1 Threshold calibration to improve model accuracy\n", "\n", "We calculate the accuracy using different values for the classification threshold, and pick the threshold that results in the highest accuracy." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Highest Accuracy on Validation: 0.8491428571428571 , Threshold for the highest Accuracy: 0.51\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Calculate the accuracy using different values for the classification threshold,\n", "# and pick the threshold that resulted in the highest accuracy.\n", "highest_accuracy = 0\n", "threshold_highest_accuracy = 0\n", "\n", "# Predicted probabilities of class 1; computed once and reused for every threshold\n", "val_probs = pipeline.predict_proba(X_val)[:, 1]\n", "\n", "thresholds = np.arange(0, 1, 0.01)\n", "scores = []\n", "for t in thresholds:\n", "    # Set the threshold to 't' instead of 0.5\n", "    y_val_other = (val_probs >= t).astype(float)\n", "    score = accuracy_score(y_val, y_val_other)\n", "    scores.append(score)\n", "    if score > highest_accuracy:\n", "        highest_accuracy = score\n", "        threshold_highest_accuracy = t\n", "print(\"Highest Accuracy on Validation:\", highest_accuracy,\n", "      \", Threshold for the highest Accuracy:\", threshold_highest_accuracy)\n", "\n", "# Let's plot the accuracy versus different choices of thresholds\n", "plt.plot([0.5, 0.5], [np.min(scores), np.max(scores)], linestyle='--')\n", "plt.plot(thresholds, scores, marker='.')\n", "plt.title('Accuracy versus different choices of thresholds')\n", "plt.xlabel('Threshold')\n", "plt.ylabel('Accuracy')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.2 Threshold calibration to improve model F1 score\n", "\n", "Similarly, the choice of classification threshold affects the Precision and Recall metrics. Precision and Recall usually trade off against each other: raising the threshold tends to increase Precision and lower Recall. A change that improves both at the same time is therefore a genuine improvement in the model's overall performance. To choose a threshold that balances Precision and Recall, we can compute the Precision-Recall curve and pick the point with the highest F1 score. 
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import precision_recall_curve\n", "\n", "# Calculate the precision and recall using different values for the classification threshold\n", "val_predictions_probs = pipeline.predict_proba(X_val)\n", "precisions, recalls, thresholds = precision_recall_curve(y_val, val_predictions_probs[:, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the Precision and Recall values from the curve above, we calculate the F1 scores using:\n", "\n", "$$\\text{F1_score} = \\frac{2*(\\text{Precision} * \\text{Recall})}{(\\text{Precision} + \\text{Recall})}$$\n", "\n", "and pick the threshold that gives the highest F1 score." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Highest F1 score on Validation: 0.8826827690317015 , Threshold for the highest F1 score: 0.4090877401871203\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline \n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Calculate the F1 score using different values for the classification threshold, \n", "# and pick the threshold that resulted in the highest F1 score.\n", "highest_f1 = 0\n", "threshold_highest_f1 = 0\n", "\n", "f1_scores = []\n", "for id, threhold in enumerate(thresholds):\n", " f1_score = 2*precisions[id]*recalls[id]/(precisions[id]+recalls[id])\n", " f1_scores.append(f1_score)\n", " if(f1_score > highest_f1):\n", " highest_f1 = f1_score\n", " threshold_highest_f1 = threhold\n", "print(\"Highest F1 score on Validation:\", highest_f1, \\\n", " \", Threshold for the highest F1 score:\", threshold_highest_f1)\n", "\n", "# Let's plot the F1 score versus different choices of thresholds\n", "plt.plot([0.5, 0.5], [np.min(f1_scores), np.max(f1_scores)], linestyle='--')\n", "plt.plot(thresholds, f1_scores, marker='.')\n", "plt.title('F1 Score versus different choices of thresholds')\n", "plt.xlabel('Threshold')\n", "plt.ylabel('F1 Score')\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }