{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning Accelerator - Tabular Data - Lecture 1\n",
"\n",
"\n",
"## ML Model Development\n",
"\n",
"In this notebook we start from a simple machine learning problem, John's food delivery problem, and shape a machine learning solution with __sklearn__ library. Given John's dataset of food delivery recordings, the task is to predict whether an order will be on time or delayed. \n",
"\n",
"1. The dataset\n",
"2. Select features to build the model\n",
"3. Train a classifier ([__K Nearest Neighbors Classifier__](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))\n",
"4. Use the trained classifier to make predictions\n",
"5. Model evaluation\n",
"6. Training and test datasets\n",
"7. Overfitting\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARNING: You are using pip version 21.3.1; however, version 22.3.1 is available.\n",
"You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python -m pip install --upgrade pip' command.\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -q -r ../requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Jupiter notebooks environment__: \n",
"* Jupiter notebooks allow creating and sharing documents that contain both code and rich text element, such as equations. If you are not familiar with Jupiter notebooks, read more [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). \n",
"* This is a quick-start demo to bring you up to speed on coding and experimenting with machine learning. Move through the notebook from top to bottom. Run each code cell to see its output. To run a cell, click within the cell and press __Shift__+__Enter__, or click __Run__ at the top of the page. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. The dataset\n",
"Go to top\n",
"\n",
"Let's enter John's food deliveries dataset here, using the __numpy__ high-performance array-processing package. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0. 1. 5. 1. 0. ]\n",
" [1. 0. 7. 0. 1. ]\n",
" [0. 1. 2. 1. 0. ]\n",
" [1. 1. 4.2 1. 0. ]\n",
" [0. 0. 7.8 0. 1. ]\n",
" [1. 0. 3.9 1. 0. ]\n",
" [0. 1. 4. 1. 0. ]\n",
" [1. 1. 2. 0. 0. ]\n",
" [0. 0. 3.5 0. 1. ]\n",
" [1. 0. 2.6 1. 0. ]\n",
" [0. 0. 4.1 0. 1. ]\n",
" [0. 1. 1.5 0. 1. ]\n",
" [1. 1. 1.75 1. 0. ]\n",
" [1. 0. 1.3 0. 0. ]\n",
" [1. 1. 2.1 0. 0. ]\n",
" [1. 1. 0.2 1. 0. ]\n",
" [1. 1. 5.2 0. 1. ]\n",
" [0. 1. 2. 1. 0. ]\n",
" [1. 0. 5.5 0. 1. ]\n",
" [0. 0. 2. 1. 0. ]\n",
" [1. 1. 1.7 0. 0. ]\n",
" [0. 1. 3. 1. 1. ]\n",
" [1. 1. 1.9 1. 0. ]\n",
" [0. 1. 3.1 0. 1. ]\n",
" [0. 1. 2.3 0. 0. ]\n",
" [0. 0. 1.1 1. 0. ]\n",
" [1. 1. 2.5 1. 1. ]\n",
" [1. 1. 5. 0. 1. ]\n",
" [1. 0. 7.5 1. 1. ]\n",
" [0. 0. 0.5 1. 0. ]\n",
" [0. 0. 1.5 1. 0. ]\n",
" [1. 0. 3.2 1. 0. ]\n",
" [0. 0. 2.15 1. 0. ]\n",
" [1. 1. 4.2 0. 1. ]\n",
" [1. 0. 6.5 0. 1. ]\n",
" [1. 0. 0.5 0. 0. ]\n",
" [0. 0. 3.5 0. 1. ]\n",
" [0. 0. 1.75 0. 0. ]\n",
" [1. 1. 5. 0. 1. ]\n",
" [0. 0. 2. 1. 0. ]\n",
" [0. 1. 1.3 1. 1. ]\n",
" [0. 1. 0.2 0. 0. ]\n",
" [1. 1. 2.2 0. 0. ]\n",
" [0. 1. 1.2 1. 0. ]\n",
" [1. 1. 4.2 0. 1. ]]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"# This is John's data\n",
"data = np.array([[0, 1, 5, 1, 0], # record of John's 1st food delivery\n",
" [1, 0, 7, 0, 1], # record of John's 2nd food delivery\n",
" [0, 1, 2, 1, 0], # record of John's 3rd food delivery\n",
" [1, 1, 4.2, 1, 0], # record of John's 4th food delivery\n",
" [0, 0, 7.8, 0, 1], # ...\n",
" [1, 0, 3.9, 1, 0],\n",
" [0, 1, 4, 1, 0],\n",
" [1, 1, 2, 0, 0],\n",
" [0, 0, 3.5, 0, 1],\n",
" [1, 0, 2.6, 1, 0],\n",
" [0, 0, 4.1, 0, 1],\n",
" [0, 1, 1.5, 0, 1],\n",
" [1, 1, 1.75, 1, 0],\n",
" [1, 0, 1.3, 0, 0],\n",
" [1, 1, 2.1, 0, 0],\n",
" [1, 1, 0.2, 1, 0],\n",
" [1, 1, 5.2, 0, 1],\n",
" [0, 1, 2, 1, 0],\n",
" [1, 0, 5.5, 0, 1],\n",
" [0, 0, 2, 1, 0],\n",
" [1, 1, 1.7, 0, 0],\n",
" [0, 1, 3, 1, 1],\n",
" [1, 1, 1.9, 1, 0],\n",
" [0, 1, 3.1, 0, 1],\n",
" [0, 1, 2.3, 0, 0],\n",
" [0, 0, 1.1, 1, 0],\n",
" [1, 1, 2.5, 1, 1],\n",
" [1, 1, 5, 0, 1],\n",
" [1, 0, 7.5, 1, 1],\n",
" [0, 0, 0.5, 1, 0],\n",
" [0, 0, 1.5, 1, 0],\n",
" [1, 0, 3.2, 1, 0],\n",
" [0, 0, 2.15, 1, 0],\n",
" [1, 1, 4.2, 0, 1],\n",
" [1, 0, 6.5, 0, 1],\n",
" [1, 0, 0.5, 0, 0],\n",
" [0, 0, 3.5, 0, 1],\n",
" [0, 0, 1.75, 0, 0],\n",
" [1, 1, 5, 0, 1],\n",
" [0, 0, 2, 1, 0],\n",
" [0, 1, 1.3, 1, 1],\n",
" [0, 1, 0.2, 0, 0],\n",
" [1, 1, 2.2, 0, 0],\n",
" [0, 1, 1.2, 1, 0],\n",
" [1, 1, 4.2, 0, 1]])\n",
"\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now write our toy dataset into a __pandas__ dataframe, labeling the columns for easier access. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Create the dataframe with this data, labeling the columns\n",
"delivery_data = pd.DataFrame(data, columns=[\"bad_weather\", \"is_rush_hour\", \"mile_distance\", \"urban_address\", \"late\"])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at our dataset as a dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bad_weather | \n",
" is_rush_hour | \n",
" mile_distance | \n",
" urban_address | \n",
" late | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 5.00 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 7.00 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.00 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 4.20 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 7.80 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 3.90 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 6 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 4.00 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 7 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 2.00 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 8 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 3.50 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 2.60 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 10 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 4.10 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 11 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.50 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 12 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 1.75 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 13 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.30 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 14 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 2.10 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bad_weather is_rush_hour mile_distance urban_address late\n",
"0 0.0 1.0 5.00 1.0 0.0\n",
"1 1.0 0.0 7.00 0.0 1.0\n",
"2 0.0 1.0 2.00 1.0 0.0\n",
"3 1.0 1.0 4.20 1.0 0.0\n",
"4 0.0 0.0 7.80 0.0 1.0\n",
"5 1.0 0.0 3.90 1.0 0.0\n",
"6 0.0 1.0 4.00 1.0 0.0\n",
"7 1.0 1.0 2.00 0.0 0.0\n",
"8 0.0 0.0 3.50 0.0 1.0\n",
"9 1.0 0.0 2.60 1.0 0.0\n",
"10 0.0 0.0 4.10 0.0 1.0\n",
"11 0.0 1.0 1.50 0.0 1.0\n",
"12 1.0 1.0 1.75 1.0 0.0\n",
"13 1.0 0.0 1.30 0.0 0.0\n",
"14 1.0 1.0 2.10 0.0 0.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Print the first 5 rows\n",
"delivery_data.head(15)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataframes are not just more meaningful to look at, are also powerful, expressive and flexible data structures that make data manipulation and analysis much easier. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Select features to build the model\n",
"Go to top\n",
"\n",
"Let's start using the dataframe, by first grabing the input and output of our machine learning problem."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"input_data = delivery_data[[\"bad_weather\", \"is_rush_hour\", \"mile_distance\", \"urban_address\"]]\n",
"target = delivery_data[\"late\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On this dataset containing samples of each of the two possible classes, we fit an estimator from the __sklearn__ library to best capture the relationship between the input and the output, and further explore that learned relationship to predict the classes to which unseen samples belong.\n",
"\n",
"In __sklearn__, an estimator is a Python object that implements the methods __.fit()__ and __.predict()__. The estimator’s constructor takes as arguments the model’s parameters.\n"
]
},
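{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this uniform interface (using `LogisticRegression` purely for illustration; the rest of the notebook uses a K Nearest Neighbors classifier), every sklearn estimator follows the same construct, fit, predict pattern:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Construct the estimator with its parameters, fit it on (inputs, targets),\n",
"# then predict on (possibly new) inputs\n",
"estimator = LogisticRegression()\n",
"estimator.fit(input_data, target)\n",
"print(estimator.predict(input_data.head()))"
]
},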
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Train a classifier\n",
"Go to top\n",
"\n",
"\n",
"Let's fit a K Nearest Neighbors (KNN) model to our data. We use the sklearn library's [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) here."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"KNeighborsClassifier(n_neighbors=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"KNeighborsClassifier(n_neighbors=1)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# Use n_neighbors = 1\n",
"# This means the KNN will consider the \"closest\" record to make a decision.\n",
"classifier = KNeighborsClassifier(n_neighbors = 1)\n",
"\n",
"# Fit the model to our data\n",
"classifier.fit(input_data, target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Use the trained classifier to make predictions\n",
"Go to top\n",
"\n",
"Let's make some prediction with our fitted model. Assume we have the following data:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names\n",
" warnings.warn(\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"some_data = np.array([[0, 0, 2.1, 1]]) # bad_weather->0, is_rush_hour->0, mile_distance->2.1 and urban_address->1\n",
"\n",
"# Use the fitted model to make predictions on new data\n",
"print(classifier.predict(some_data))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We predicted this delivery to be on time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also predict multiple records, as shown below."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0. 0. 1.]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names\n",
" warnings.warn(\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"some_data = np.array([[0, 0, 2.1, 1], # bad_weather->0, is_rush_hour->0, mile_distance->2.1 and urban_address->1\n",
" [0, 1, 5, 0], # bad_weather->0, is_rush_hour->1, mile_distance->5.0 and urban_address->0\n",
" [1, 1, 3.1, 1] # bad_weather->1, is_rush_hour->1, mile_distance->3.1 and urban_address->1\n",
" ])\n",
"\n",
"# Use the fitted model to make predictions on more new data\n",
"print(classifier.predict(some_data))"
]
},
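{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the `UserWarning` above: the classifier was fitted on a dataframe with named columns, but we passed a plain numpy array at prediction time. As a small sketch (the variable name `some_data_df` is just illustrative), we can wrap the new records in a pandas dataframe with the same column names to avoid the warning:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Wrap the new records in a dataframe with the same feature names used during fitting,\n",
"# so sklearn no longer warns about missing feature names\n",
"some_data_df = pd.DataFrame(\n",
"    [[0, 0, 2.1, 1],\n",
"     [0, 1, 5, 0],\n",
"     [1, 1, 3.1, 1]],\n",
"    columns=[\"bad_weather\", \"is_rush_hour\", \"mile_distance\", \"urban_address\"],\n",
")\n",
"\n",
"print(classifier.predict(some_data_df))"
]
},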
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last delivery is predicted to be late. The first two will be on time (Hopefully!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Model evaluation\n",
"Go to top\n",
"\n",
"__How do we know whether our predictions were good or bad predictions?__
\n",
"If we don't have the correct label for this input, we won't know. Similarly, we won't have any idea about how good this model is. \n",
"\n",
"One thing we can do is to test the model with the data we used to train it, and use sklearn's metrics functions to examine the performance of the classifier."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Use the fitted model to make predictions on our training dataset\n",
"predictions = classifier.predict(input_data)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__Confusion matrix__: The diagonals show us correct classifications. Each row and column belongs to a class (late and on time). The first column and row correspond to \"on time\" case, the second column-rows are \"late\" cases.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[27 0]\n",
" [ 0 18]]\n"
]
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"print(confusion_matrix(target, predictions))"
]
},
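{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch, we can also unpack the four cells of the confusion matrix (treating \"late\" = 1 as the positive class) and recompute accuracy by hand; it should match sklearn's `accuracy_score` used below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Unpack the confusion matrix: with labels ordered [0, 1], .ravel() returns\n",
"# true negatives, false positives, false negatives, true positives\n",
"tn, fp, fn, tp = confusion_matrix(target, predictions).ravel()\n",
"print(\"TN:\", tn, \"FP:\", fp, \"FN:\", fn, \"TP:\", tp)\n",
"\n",
"# Accuracy is simply the fraction of correct predictions\n",
"print(\"Accuracy (by hand):\", (tn + tp) / (tn + fp + fn + tp))"
]
},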
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we look at the confusion matrix, we can quickly see that all predictions were correct, so our classifier should have a high score.\n",
"\n",
"__Classification metrics__: We use here the __accuracy__ metric, that measures how correctly the trained model predicts the late or not late outcomes. Let's look at the classification report and the accuracy score below."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0.0 1.00 1.00 1.00 27\n",
" 1.0 1.00 1.00 1.00 18\n",
"\n",
" accuracy 1.00 45\n",
" macro avg 1.00 1.00 1.00 45\n",
"weighted avg 1.00 1.00 1.00 45\n",
"\n",
"Accuracy: 1.0\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"print(classification_report(target, predictions))\n",
"\n",
"print(\"Accuracy:\", accuracy_score(target, predictions))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indeed, we predicted all outcomes with 100% accuracy!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Training and test datasets\n",
"Go to top\n",
"\n",
"John's model worked with 100% accuracy on the whole dataset. This might seem promising, but this doesn't tell us anything about performance on future orders. One way to test whether this model works on new \"unseen\" orders, is to reserve some data from out original dataset for test purposes. \n",
"\n",
"__Let's split our data into two sets: Training (85%) and Test (15%)__. This will give us 38 training records and 7 test records (of the total 45 records)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"(45, 5)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"delivery_data.shape"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bad_weather | \n",
" is_rush_hour | \n",
" mile_distance | \n",
" urban_address | \n",
" late | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 7.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 4.2 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 7.8 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bad_weather is_rush_hour mile_distance urban_address late\n",
"0 0.0 1.0 5.0 1.0 0.0\n",
"1 1.0 0.0 7.0 0.0 1.0\n",
"2 0.0 1.0 2.0 1.0 0.0\n",
"3 1.0 1.0 4.2 1.0 0.0\n",
"4 0.0 0.0 7.8 0.0 1.0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's split our data into two sets: Training (85%) and Test (15%)\n",
"# This gives us 38 training records and 7 test records (total 45 records)\n",
"\n",
"training_data = delivery_data.iloc[:38, :] # First 38\n",
"test_data = delivery_data.iloc[38:, :] # Remaining\n",
"\n",
"# Print the first 5 rows\n",
"training_data.head()"
]
},
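{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on this split: we simply took the first 38 rows for training and the last 7 for test. In practice the split is usually done on shuffled data; as a small sketch (not used in the rest of this notebook), sklearn's `train_test_split` can do a shuffled 85%/15% split with a fixed random seed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Shuffled 85%/15% split; random_state makes the shuffle reproducible\n",
"train_df, test_df = train_test_split(delivery_data, test_size=0.15, random_state=42)\n",
"\n",
"print(train_df.shape, test_df.shape)"
]
},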
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fitting the KNN on training dataset this time."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"KNeighborsClassifier(n_neighbors=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org. "
],
"text/plain": [
"KNeighborsClassifier(n_neighbors=1)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"X_train = training_data[[\"bad_weather\", \"is_rush_hour\", \"mile_distance\", \"urban_address\"]].values\n",
"y_train = training_data[\"late\"].tolist()\n",
"\n",
"# Use n_neighbors = 1\n",
"# This means the KNN will consider two other \"closest\" records to make a decision.\n",
"classifier = KNeighborsClassifier(n_neighbors = 1)\n",
"\n",
"# Fit the model to our training data\n",
"classifier.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check the accuracy on the training data."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model evaluation on the training set: \n",
"\n",
"[[23 0]\n",
" [ 0 15]]\n",
" precision recall f1-score support\n",
"\n",
" 0.0 1.00 1.00 1.00 23\n",
" 1.0 1.00 1.00 1.00 15\n",
"\n",
" accuracy 1.00 38\n",
" macro avg 1.00 1.00 1.00 38\n",
"weighted avg 1.00 1.00 1.00 38\n",
"\n",
"Training accuracy: 1.0\n"
]
}
],
"source": [
"from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n",
"\n",
"# Use the fitted model to make predictions on the same dataset we trained the model on\n",
"train_predictions = classifier.predict(X_train)\n",
"\n",
"print('Model evaluation on the training set: \\n')\n",
"print(confusion_matrix(y_train, train_predictions))\n",
"print(classification_report(y_train, train_predictions))\n",
"print(\"Training accuracy:\", accuracy_score(y_train, train_predictions))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now, let's check the accuracy on the test data."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model evaluation on the training set: \n",
"\n",
"[[3 1]\n",
" [1 2]]\n",
" precision recall f1-score support\n",
"\n",
" 0.0 0.75 0.75 0.75 4\n",
" 1.0 0.67 0.67 0.67 3\n",
"\n",
" accuracy 0.71 7\n",
" macro avg 0.71 0.71 0.71 7\n",
"weighted avg 0.71 0.71 0.71 7\n",
"\n",
"Training accuracy: 0.7142857142857143\n"
]
}
],
"source": [
"from sklearn.metrics import confusion_matrix, classification_report, accuracy_score\n",
"\n",
"X_test = test_data[[\"bad_weather\", \"is_rush_hour\", \"mile_distance\", \"urban_address\"]].values\n",
"y_test = test_data[\"late\"].tolist()\n",
"\n",
"# Use the fitted model to make predictions on the test dataset\n",
"test_predictions = classifier.predict(X_test)\n",
"\n",
"print('Model evaluation on the training set: \\n')\n",
"print(confusion_matrix(y_test, test_predictions))\n",
"print(classification_report(y_test, test_predictions))\n",
"print(\"Training accuracy:\", accuracy_score(y_test, test_predictions))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Overfitting\n",
"Go to top\n",
"\n",
"### This doesn't look good! \n",
"We only achieved 71% accuracy on data that the model hasn't seen before.
\n",
"Can we trust this model? Probably not.\n",
"\n",
"### Let's explain what happened here. \n",
"We experienced a common problem called __\"Overfitting\"__. This means our model \"over-learned\" or memorized our training data, and failed on the new data it hasn't seen before.\n",
"\n",
"Experienced people would have spotted the problem even before fitting the classifier, the K parameter we chose as 1 here looks at the closest one record and assign the class of that record. This doesn't generalize well to our overall dataset and \"overfits\" the dataset.\n",
"\n",
"### Where is the validation subset?\n",
"\n",
"If we want to optimize the performance of our algorithm, and therefore reduce the so-called *generalization gap*, we need to look for the best performing K value, using a validation set. We pick the K value that results in the best validation performance metric of our choice, and then we finally check model performance on the test dataset."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Let's further split our training data into two sets: Training (80%) and Validation (20%)\n",
"# This gives us 30 training records and 8 test records \n",
"\n",
"train_data = training_data.iloc[:30, :] # First 30\n",
"val_data = training_data.iloc[30:, :] # Remaining\n",
"\n",
"X_train = train_data[[\"bad_weather\", \"is_rush_hour\", \"mile_distance\", \"urban_address\"]].values\n",
"y_train = train_data[\"late\"].tolist()\n",
"\n",
"X_val = val_data[[\"bad_weather\", \"is_rush_hour\", \"mile_distance\", \"urban_address\"]].values\n",
"y_val =val_data[\"late\"].tolist()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Trying different K values.\n",
"\n",
"Let's try different K values and see how the model performs with each one, on the validation set.\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"K=1, Validation accuracy score: 0.875000\n",
"K=2, Validation accuracy score: 0.875000\n",
"K=3, Validation accuracy score: 1.000000\n",
"K=4, Validation accuracy score: 1.000000\n",
"K=5, Validation accuracy score: 1.000000\n",
"K=6, Validation accuracy score: 0.875000\n"
]
}
],
"source": [
"K_values = [1, 2, 3, 4, 5, 6]\n",
"\n",
"for K in K_values:\n",
" classifier = KNeighborsClassifier(n_neighbors = K)\n",
" classifier.fit(X_train, y_train)\n",
" val_predictions = classifier.predict(X_val)\n",
" print(\"K=%d, Validation accuracy score: %f\" % (K, accuracy_score(y_val, val_predictions)))"
]
},
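{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch, we can also pick the best-performing K programmatically instead of reading it off the printout (ties are broken toward the smallest K here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Collect the validation accuracy for each candidate K, then take the argmax\n",
"val_scores = {}\n",
"for K in K_values:\n",
"    classifier = KNeighborsClassifier(n_neighbors = K)\n",
"    classifier.fit(X_train, y_train)\n",
"    val_scores[K] = accuracy_score(y_val, classifier.predict(X_val))\n",
"\n",
"best_K = max(val_scores, key=val_scores.get)\n",
"print(\"Best K:\", best_K, \"with validation accuracy:\", val_scores[best_K])"
]
},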
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like K = 3 or K = 4 or K = 5 are optimal choices for K. Let's choose K = 4 to build the classifier, train on the train set, and finally test on the test set."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test accuracy score: 0.857143\n"
]
}
],
"source": [
"classifier = KNeighborsClassifier(n_neighbors = 4)\n",
"classifier.fit(X_train, y_train)\n",
"test_predictions = classifier.predict(X_test)\n",
"print(\"Test accuracy score: %f\" % (accuracy_score(y_test, test_predictions)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Indeed, accuracy on the test set improved from 71% to 86%, reducing the generalization gap."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_pytorch_p39",
"language": "python",
"name": "conda_pytorch_p39"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}