{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning Accelerator - Tabular Data - Lecture 2\n",
"\n",
"\n",
"## SageMaker build-in LinearLearner\n",
"\n",
"In this notebook, we use Sagemaker's built-in machine learning model __LinearLearner__ to predict the __Outcome Type__ field of our review dataset.\n",
"\n",
"__Notes on AWS SageMaker__\n",
"\n",
"* Fully managed machine learning service, to quickly and easily get you started on building and training machine learning models - we have seen that already! Integrated Jupyter notebook instances, with easy access to data sources for exploration and analysis, abstract away many of the messy infrastructural details needed for hands-on ML - you don't have to manage servers, install libraries/dependencies, etc.!\n",
"\n",
"\n",
"* Apart from easily building end-to-end machine learning models in SageMaker notebooks, like we did so far, SageMaker also provides a few __build-in common machine learning algorithms__ (check \"SageMaker Examples\" from your SageMaker instance top menu for a complete updated list) that are optimized to run efficiently against extremely large data in a distributed environment. __LinearLearner__ build-in algorithm in SageMaker is extremely fast at inference and can be trained at scale, in mini-batch fashion over GPU(s). The trained model can then be directly deployed into a production-ready hosted environment for easy access at inference. \n",
"\n",
"\n",
"1. Read the dataset\n",
"2. Exploratory Data Analysis\n",
"3. Select features to build the model\n",
"4. Training and test datasets\n",
"5. Data processing with Pipeline and ColumnTransformer\n",
"6. Train a classifier with SageMaker build-in algorithm\n",
"7. Model evaluation\n",
"8. Deploy the model to an endpoint\n",
"9. Test the enpoint\n",
"10. Clean up model artifacts\n",
"\n",
"__Austin Animal Center Dataset__:\n",
"\n",
"In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). \n",
"\n",
"In order to work with a single table, we joined the intake and outcome tables using the \"Animal ID\" column and created a single __review.csv__ file. We also didn't consider animals with multiple entries to the facility to keep our dataset simple. If you want to see the original datasets and the merged data with multiple entries, they are available under data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv and Austin_Animal_Center_Intakes_Outcomes.csv.\n",
"\n",
"__Dataset schema:__ \n",
"- __Pet ID__ - Unique ID of pet\n",
"- __Outcome Type__ - State of pet at the time of recording the outcome (0 = not placed, 1 = placed). This is the field to predict.\n",
"- __Sex upon Outcome__ - Sex of pet at outcome\n",
"- __Name__ - Name of pet \n",
"- __Found Location__ - Found location of pet before entered the center\n",
"- __Intake Type__ - Circumstances bringing the pet to the center\n",
"- __Intake Condition__ - Health condition of pet when entered the center\n",
"- __Pet Type__ - Type of pet\n",
"- __Sex upon Intake__ - Sex of pet when entered the center\n",
"- __Breed__ - Breed of pet \n",
"- __Color__ - Color of pet \n",
"- __Age upon Intake Days__ - Age of pet when entered the center (days)\n",
"- __Age upon Outcome Days__ - Age of pet at outcome (days)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARNING: You are using pip version 21.3.1; however, version 22.3.1 is available.\n",
"You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python -m pip install --upgrade pip' command.\u001b[0m\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -q -r ../requirements.txt"
]
},
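{
"cell_type": "markdown",
"metadata": {},
"source": [
"The join described above can be sketched on toy data. This is only an illustration, not the code that produced the review dataset: the miniature tables, their values, and every column other than \"Animal ID\" are assumptions. The sketch drops every animal that appears more than once in either table and then inner-joins the remaining rows on \"Animal ID\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical miniature intake and outcome tables (values are made up)\n",
"toy_intakes = pd.DataFrame({\n",
"    'Animal ID': ['A1', 'A2', 'A2', 'A3'],\n",
"    'Intake Type': ['Stray', 'Stray', 'Owner Surrender', 'Stray'],\n",
"})\n",
"toy_outcomes = pd.DataFrame({\n",
"    'Animal ID': ['A1', 'A2', 'A2', 'A3'],\n",
"    'Outcome Type': ['Adoption', 'Transfer', 'Adoption', 'Return to Owner'],\n",
"})\n",
"\n",
"# keep=False removes every row whose Animal ID occurs more than once,\n",
"# so only animals with a single visit remain\n",
"intakes_single = toy_intakes.drop_duplicates(subset='Animal ID', keep=False)\n",
"outcomes_single = toy_outcomes.drop_duplicates(subset='Animal ID', keep=False)\n",
"\n",
"# Inner join on the shared key gives one row per remaining animal\n",
"toy_merged = intakes_single.merge(outcomes_single, on='Animal ID', how='inner')\n",
"print(toy_merged)"
]
},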
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Read the dataset\n",
"(Go to top)\n",
"\n",
"Let's read the dataset into a dataframe, using Pandas."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The shape of the dataset is: (95485, 13)\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
" \n",
"df = pd.read_csv('../data/review/review_dataset.csv')\n",
"\n",
"print('The shape of the dataset is:', df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Exploratory Data Analysis\n",
"(Go to top)\n",
"\n",
"We will look at number of rows, columns and some simple statistics of the dataset."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Pet ID
\n",
"
Outcome Type
\n",
"
Sex upon Outcome
\n",
"
Name
\n",
"
Found Location
\n",
"
Intake Type
\n",
"
Intake Condition
\n",
"
Pet Type
\n",
"
Sex upon Intake
\n",
"
Breed
\n",
"
Color
\n",
"
Age upon Intake Days
\n",
"
Age upon Outcome Days
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
A794011
\n",
"
1.0
\n",
"
Neutered Male
\n",
"
Chunk
\n",
"
Austin (TX)
\n",
"
Owner Surrender
\n",
"
Normal
\n",
"
Cat
\n",
"
Neutered Male
\n",
"
Domestic Shorthair Mix
\n",
"
Brown Tabby/White
\n",
"
730
\n",
"
730
\n",
"
\n",
"
\n",
"
1
\n",
"
A776359
\n",
"
1.0
\n",
"
Neutered Male
\n",
"
Gizmo
\n",
"
7201 Levander Loop in Austin (TX)
\n",
"
Stray
\n",
"
Normal
\n",
"
Dog
\n",
"
Intact Male
\n",
"
Chihuahua Shorthair Mix
\n",
"
White/Brown
\n",
"
365
\n",
"
365
\n",
"
\n",
"
\n",
"
2
\n",
"
A674754
\n",
"
0.0
\n",
"
Intact Male
\n",
"
NaN
\n",
"
12034 Research in Austin (TX)
\n",
"
Stray
\n",
"
Nursing
\n",
"
Cat
\n",
"
Intact Male
\n",
"
Domestic Shorthair Mix
\n",
"
Orange Tabby
\n",
"
6
\n",
"
6
\n",
"
\n",
"
\n",
"
3
\n",
"
A689724
\n",
"
1.0
\n",
"
Neutered Male
\n",
"
*Donatello
\n",
"
2300 Waterway Bnd in Austin (TX)
\n",
"
Stray
\n",
"
Normal
\n",
"
Cat
\n",
"
Intact Male
\n",
"
Domestic Shorthair Mix
\n",
"
Black
\n",
"
60
\n",
"
60
\n",
"
\n",
"
\n",
"
4
\n",
"
A680969
\n",
"
1.0
\n",
"
Neutered Male
\n",
"
*Zeus
\n",
"
4701 Staggerbrush Rd in Austin (TX)
\n",
"
Stray
\n",
"
Nursing
\n",
"
Cat
\n",
"
Intact Male
\n",
"
Domestic Shorthair Mix
\n",
"
White/Orange Tabby
\n",
"
7
\n",
"
60
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Pet ID Outcome Type Sex upon Outcome Name \\\n",
"0 A794011 1.0 Neutered Male Chunk \n",
"1 A776359 1.0 Neutered Male Gizmo \n",
"2 A674754 0.0 Intact Male NaN \n",
"3 A689724 1.0 Neutered Male *Donatello \n",
"4 A680969 1.0 Neutered Male *Zeus \n",
"\n",
" Found Location Intake Type Intake Condition \\\n",
"0 Austin (TX) Owner Surrender Normal \n",
"1 7201 Levander Loop in Austin (TX) Stray Normal \n",
"2 12034 Research in Austin (TX) Stray Nursing \n",
"3 2300 Waterway Bnd in Austin (TX) Stray Normal \n",
"4 4701 Staggerbrush Rd in Austin (TX) Stray Nursing \n",
"\n",
" Pet Type Sex upon Intake Breed Color \\\n",
"0 Cat Neutered Male Domestic Shorthair Mix Brown Tabby/White \n",
"1 Dog Intact Male Chihuahua Shorthair Mix White/Brown \n",
"2 Cat Intact Male Domestic Shorthair Mix Orange Tabby \n",
"3 Cat Intact Male Domestic Shorthair Mix Black \n",
"4 Cat Intact Male Domestic Shorthair Mix White/Orange Tabby \n",
"\n",
" Age upon Intake Days Age upon Outcome Days \n",
"0 730 730 \n",
"1 365 365 \n",
"2 6 6 \n",
"3 60 60 \n",
"4 7 60 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Print the first five rows\n",
"# NaN means missing data\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 95485 entries, 0 to 95484\n",
"Data columns (total 13 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Pet ID 95485 non-null object \n",
" 1 Outcome Type 95485 non-null float64\n",
" 2 Sex upon Outcome 95484 non-null object \n",
" 3 Name 59138 non-null object \n",
" 4 Found Location 95485 non-null object \n",
" 5 Intake Type 95485 non-null object \n",
" 6 Intake Condition 95485 non-null object \n",
" 7 Pet Type 95485 non-null object \n",
" 8 Sex upon Intake 95484 non-null object \n",
" 9 Breed 95485 non-null object \n",
" 10 Color 95485 non-null object \n",
" 11 Age upon Intake Days 95485 non-null int64 \n",
" 12 Age upon Outcome Days 95485 non-null int64 \n",
"dtypes: float64(1), int64(2), object(10)\n",
"memory usage: 9.5+ MB\n"
]
}
],
"source": [
"# Let's see the data types and non-null values for each column\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Outcome Type
\n",
"
Age upon Intake Days
\n",
"
Age upon Outcome Days
\n",
"
\n",
" \n",
" \n",
"
\n",
"
count
\n",
"
95485.000000
\n",
"
95485.000000
\n",
"
95485.000000
\n",
"
\n",
"
\n",
"
mean
\n",
"
0.564005
\n",
"
703.436959
\n",
"
717.757313
\n",
"
\n",
"
\n",
"
std
\n",
"
0.495889
\n",
"
1052.252197
\n",
"
1055.023160
\n",
"
\n",
"
\n",
"
min
\n",
"
0.000000
\n",
"
0.000000
\n",
"
0.000000
\n",
"
\n",
"
\n",
"
25%
\n",
"
0.000000
\n",
"
30.000000
\n",
"
60.000000
\n",
"
\n",
"
\n",
"
50%
\n",
"
1.000000
\n",
"
365.000000
\n",
"
365.000000
\n",
"
\n",
"
\n",
"
75%
\n",
"
1.000000
\n",
"
730.000000
\n",
"
730.000000
\n",
"
\n",
"
\n",
"
max
\n",
"
1.000000
\n",
"
9125.000000
\n",
"
9125.000000
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Outcome Type Age upon Intake Days Age upon Outcome Days\n",
"count 95485.000000 95485.000000 95485.000000\n",
"mean 0.564005 703.436959 717.757313\n",
"std 0.495889 1052.252197 1055.023160\n",
"min 0.000000 0.000000 0.000000\n",
"25% 0.000000 30.000000 60.000000\n",
"50% 1.000000 365.000000 365.000000\n",
"75% 1.000000 730.000000 730.000000\n",
"max 1.000000 9125.000000 9125.000000"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This prints basic statistics for numerical columns\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's separate model features and model target. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Pet ID', 'Outcome Type', 'Sex upon Outcome', 'Name', 'Found Location',\n",
" 'Intake Type', 'Intake Condition', 'Pet Type', 'Sex upon Intake',\n",
" 'Breed', 'Color', 'Age upon Intake Days', 'Age upon Outcome Days'],\n",
" dtype='object')\n"
]
}
],
"source": [
"print(df.columns)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model features: Index(['Pet ID', 'Sex upon Outcome', 'Name', 'Found Location', 'Intake Type',\n",
" 'Intake Condition', 'Pet Type', 'Sex upon Intake', 'Breed', 'Color',\n",
" 'Age upon Intake Days', 'Age upon Outcome Days'],\n",
" dtype='object')\n",
"Model target: Outcome Type\n"
]
}
],
"source": [
"model_features = df.columns.drop('Outcome Type')\n",
"model_target = 'Outcome Type'\n",
"\n",
"print('Model features: ', model_features)\n",
"print('Model target: ', model_target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can explore the features set further, figuring out first what features are numerical or categorical. Beware that some integer-valued features could actually be categorical features, and some categorical features could be text features. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Numerical columns: Index(['Age upon Intake Days', 'Age upon Outcome Days'], dtype='object')\n",
"\n",
"Categorical columns: Index(['Pet ID', 'Sex upon Outcome', 'Name', 'Found Location', 'Intake Type',\n",
" 'Intake Condition', 'Pet Type', 'Sex upon Intake', 'Breed', 'Color'],\n",
" dtype='object')\n"
]
}
],
"source": [
"import numpy as np\n",
"numerical_features_all = df[model_features].select_dtypes(include=np.number).columns\n",
"print('Numerical columns:',numerical_features_all)\n",
"\n",
"print('')\n",
"\n",
"categorical_features_all = df[model_features].select_dtypes(include='object').columns\n",
"print('Categorical columns:',categorical_features_all)\n"
]
},
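{
"cell_type": "markdown",
"metadata": {},
"source": [
"One rough way to act on this caveat, assuming that a column with very many distinct values behaves more like an identifier or free text than like a category, is to count the distinct values per column. This check is an addition for illustration; where to draw the line between \"categorical\" and \"text\" remains a judgment call."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count distinct values in each non-numerical column, reusing df and\n",
"# categorical_features_all from the cells above\n",
"cardinality = df[categorical_features_all].nunique().sort_values(ascending=False)\n",
"print(cardinality)"
]
},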
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Target distribution\n",
"\n",
"Let's check our target distribution."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYMAAAD+CAYAAADYr2m5AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAPLUlEQVR4nO3df6zddX3H8efLVpTMISB3Hestu2TcxFQTEZvSxf2xSVZaNCt/KIEsa0MauwRINFky6/5pUEngn7GRoFszOluzWYkbo8OyrimaZVkKvSgDC2O9QwltkF5pgRkjDnzvj/spHK7n9p5C7zmXnucjOTnf7/vz+X7P+yQ3fZ3vj3OaqkKSNNzeMegGJEmDZxhIkgwDSZJhIEnCMJAkAYsH3cCbdcEFF9TY2Nig25Ckt42HH374x1U10m3sbRsGY2NjTExMDLoNSXrbSPL0bGOeJpIkGQaSJMNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEm/jbyC/HYxt/tagWzij/PDWjw+6BemM5ZGBJMkwkCQZBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJ9BgGSX6Y5LEkjySZaLXzk+xNcqg9n9fqSXJHkskkjya5rGM/G9r8Q0k2dNQ/0vY/2bbN6X6jkqTZncqRwe9V1aVVtaKtbwb2VdU4sK+tA6wFxttjE/AVmA4PYAtwObAS2HIiQNqcT3dst+ZNvyNJ0il7K6eJ1gHb2/J24OqO+o6ath84N8mFwJXA3qo6VlXHgb3AmjZ2TlXtr6oCdnTsS5LUB72GQQH/muThJJtabUlVPduWfwQsactLgWc6tj3caierH+5SlyT1Sa8/Yf07VXUkya8Be5P8V+dgVVWSOv3tvVELok0AF1100Xy/nCQNjZ6ODKrqSHs+CtzD9Dn/59opHtrz0Tb9CLCsY/PRVjtZfbRLvVsfW6tqRVWtGBkZ6aV1SVIP5gyDJL+S5FdPLAOrge8Du4ATdwRtAO5ty7uA9e2uolXAi+100h5gdZLz2oXj1cCeNvZSklXtLqL1HfuSJPVBL6eJlgD3tLs9FwN/X1X/kuQAcHeSjcDTwDVt/m7gKmAS+ClwPUBVHUvyReBAm/eFqjrWlm8AvgqcDdzfHpKkPpkzDKrqKeBDXerPA1d0qRdw4yz72gZs61KfAD7YQ7+SpHngN5AlSYaBJMkwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKGgSQJw0CShGEgScIwkCQBiwfdgKTBGNv8rUG3cEb54a0fH3QLb4lHBpIkw0CSdAphkGRRku8lua+tX5zkwSSTSb6R5KxWf1dbn2zjYx37+HyrP5nkyo76mlabTLL5NL4/SVIPTuXI4DPAEx3rtwG3V9UlwHFgY6tvBI63+u1tHkmWA9cCHwDWAF9uAbMIuBNYCywHrmtzJUl90lMYJBkFPg78TVsP8DHgm23KduDqtryurdPGr2jz1wE7q+rlqvoBMAmsbI/Jqnqqqn4O7GxzJUl90uuRwV8Afwr8oq2/D3ihql5p64eBpW15KfAMQBt/sc1/rT5jm9nqvyTJpiQTSSampqZ6bF2SNJc5wyDJJ4CjVfVwH/o5qaraWlUrqmrFyMjIoNuRpDNGL98z+CjwB0muAt4NnAP8JXBuksXt0/8ocKTNPwIsAw4nWQy8F3i+o35C5zaz1SVJfTDnkUFVfb6qRqtqjOkLwA9U1R8C3wY+2aZtAO5ty7vaOm38gaqqVr+23W10MTAOPAQcAMbb3UlntdfYdVrenSSpJ2/lG8ifA3Ym+RLwPeCuVr8L+FqSSeAY0/+4U1UHk9wNPA68AtxYVa8CJLkJ2AMsArZV1cG30Jck6RSdUhhU1XeA77Tlp5i+E2jmnJ8Bn5pl+1uAW7rUdwO7T6UXSdLp4z
eQJUmGgSTJMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEn0EAZJ3p3koST/meRgkptb/eIkDyaZTPKNJGe1+rva+mQbH+vY1+db/ckkV3bU17TaZJLN8/A+JUkn0cuRwcvAx6rqQ8ClwJokq4DbgNur6hLgOLCxzd8IHG/129s8kiwHrgU+AKwBvpxkUZJFwJ3AWmA5cF2bK0nqkznDoKb9pK2+sz0K+BjwzVbfDlzdlte1ddr4FUnS6jur6uWq+gEwCaxsj8mqeqqqfg7sbHMlSX3S0zWD9gn+EeAosBf4H+CFqnqlTTkMLG3LS4FnANr4i8D7OusztpmtLknqk57CoKperapLgVGmP8m/fz6bmk2STUkmkkxMTU0NogVJOiOd0t1EVfUC8G3gt4FzkyxuQ6PAkbZ8BFgG0MbfCzzfWZ+xzWz1bq+/tapWVNWKkZGRU2ldknQSvdxNNJLk3LZ8NvD7wBNMh8In27QNwL1teVdbp40/UFXV6te2u40uBsaBh4ADwHi7O+kspi8y7zoN702S1KPFc0/hQmB7u+vnHcDdVXVfkseBnUm+BHwPuKvNvwv4WpJJ4BjT/7hTVQeT3A08DrwC3FhVrwIkuQnYAywCtlXVwdP2DiVJc5ozDKrqUeDDXepPMX39YGb9Z8CnZtnXLcAtXeq7gd099CtJmgd+A1mSZBhIkgwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKGgSQJw0CShGEgSaKHMEiyLMm3kzye5GCSz7T6+Un2JjnUns9r9SS5I8lkkkeTXNaxrw1t/qEkGzrqH0nyWNvmjiSZjzcrSequlyODV4A/qarlwCrgxiTLgc3AvqoaB/a1dYC1wHh7bAK+AtPhAWwBLgdWAltOBEib8+mO7da89bcmSerVnGFQVc9W1Xfb8v8CTwBLgXXA9jZtO3B1W14H7Khp+4Fzk1wIXAnsrapjVXUc2AusaWPnVNX+qipgR8e+JEl9cErXDJKMAR8GHgSWVNWzbehHwJK2vBR4pmOzw612svrhLvVur78pyUSSiampqVNpXZJ0Ej2HQZL3AP8AfLaqXuoca5/o6zT39kuqamtVraiqFSMjI/P9cpI0NHoKgyTvZDoI/q6q/rGVn2uneGjPR1v9CLCsY/PRVjtZfbRLXZLUJ73cTRTgLuCJqvrzjqFdwIk7gjYA93bU17e7ilYBL7bTSXuA1UnOaxeOVwN72thLSVa111rfsS9JUh8s7mHOR4E/Ah5L8kir/RlwK3B3ko3A08A1bWw3cBUwCfwUuB6gqo4l+SJwoM37QlUda8s3AF8Fzgbubw9JUp/MGQZV9e/AbPf9X9FlfgE3zrKvbcC2LvUJ4INz9SJJmh9+A1mSZBhIkgwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKGgSQJw0CSRA9hkGRbkqNJvt9ROz/J3iSH2vN5rZ4kdySZTPJokss6ttnQ5h9KsqGj/pEkj7Vt7kiS0/0mJUkn18uRwVeBNTNqm4F9VTUO7GvrAGuB8fbYBHwFpsMD2AJcDqwEtpwIkDbn0x3bzXwtSdI8mzMMqurfgGMzyuuA7W15O3B1R31HTdsPnJvkQuBKYG9VHauq48BeYE0bO6eq9ldVATs69iVJ6pM3e81gSVU925Z/BCxpy0uBZzrmHW61k9UPd6l3lWRTkokkE1NTU2+ydUnSTG/5An
L7RF+noZdeXmtrVa2oqhUjIyP9eElJGgpvNgyea6d4aM9HW/0IsKxj3mirnaw+2qUuSeqjNxsGu4ATdwRtAO7tqK9vdxWtAl5sp5P2AKuTnNcuHK8G9rSxl5KsancRre/YlySpTxbPNSHJ14HfBS5Icpjpu4JuBe5OshF4GrimTd8NXAVMAj8FrgeoqmNJvggcaPO+UFUnLkrfwPQdS2cD97eHJKmP5gyDqrpulqEruswt4MZZ9rMN2NalPgF8cK4+JEnzx28gS5IMA0mSYSBJwjCQJGEYSJIwDCRJGAaSJAwDSRKGgSQJw0CShGEgScIwkCRhGEiSMAwkSRgGkiQMA0kShoEkCcNAkoRhIEnCMJAkYRhIkjAMJEkYBpIkDANJEoaBJAnDQJKEYSBJwjCQJGEYSJIwDCRJLKAwSLImyZNJJpNsHnQ/kjRMFkQYJFkE3AmsBZYD1yVZPtiuJGl4LIgwAFYCk1X1VFX9HNgJrBtwT5I0NBYPuoFmKfBMx/ph4PKZk5JsAja11Z8kebIPvQ2DC4AfD7qJueS2QXegAfHv8/T5zdkGFkoY9KSqtgJbB93HmSbJRFWtGHQfUjf+ffbHQjlNdARY1rE+2mqSpD5YKGFwABhPcnGSs4BrgV0D7kmShsaCOE1UVa8kuQnYAywCtlXVwQG3NUw89aaFzL/PPkhVDboHSdKALZTTRJKkATIMJEmGgSTJMJC0ACU5P8n5g+5jmBgGkhaEJBcl2ZlkCngQeCjJ0VYbG3B7ZzzDYEglWZLksvZYMuh+JOAbwD3Ar1fVeFVdAlwI/BPTv1emeeStpUMmyaXAXwHv5fVveY8CLwA3VNV3B9OZhl2SQ1U1fqpjOj0MgyGT5BHgj6vqwRn1VcBfV9WHBtKYhl6SncAxYDuv/3DlMmADcEFVXTOo3oaBYTBk5vj0NdkOzaW+az9Fs5Hpn69f2sqHgX8G7qqqlwfV2zAwDIZMkjuA3wJ28MZPX+uBH1TVTYPqTdLgGAZDKMla3vjp6wiwq6p2D64raXZJPlFV9w26jzOZYSBpwUtyc1VtGXQfZzLDQK9Jsqn9B0LSQCR5P92PWp8YXFfDwe8ZqFMG3YCGV5LPMf19ggAPtUeAryfZPMjehoFHBnpNkuur6m8H3YeGU5L/Bj5QVf83o34WcNDvGcwvjwzU6eZBN6Ch9gvgN7rUL2xjmkcL4n86U/8keXS2IcCfpdAgfRbYl+QQr9/2fBFwCeAtz/PM00RDJslzwJXA8ZlDwH9UVbdPZlJfJHkHsJI3XkA+UFWvDq6r4eCRwfC5D3hPVT0ycyDJd/rejdShqn4B7B90H8PIIwNJkheQJUmGgSQJw0CShGEgSQL+HwXgTl2dbdU4AAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"df[model_target].value_counts().plot.bar()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the target plots we can identify whether or not we are dealing with imbalanced datasets - this means one result type is dominating the other one(s). \n",
"\n",
"Handling class imbalance is highly recommended, as the model performance can be greatly impacted. In particular the model may not work well for the infrequent classes, as there are not enough samples to learn patterns from, and so it would be hard for the classifier to identify and match those patterns. \n",
"\n",
"We might want to downsample the dominant class or upsample the rare the class, to help with learning its patterns. However, we should only fix the imbalance in training set, without changing the validation and test sets, as these should follow the original distribution. We will perform this task after train/test split. \n"
]
},
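{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the upsampling idea, the cell below balances a toy training set with `sklearn.utils.resample`. The toy DataFrame, its column names, and the random seed are assumptions made for illustration; this is not necessarily the approach applied to the review dataset after the split."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.utils import resample, shuffle\n",
"\n",
"# Hypothetical toy training set: six samples of class 1, two of class 0\n",
"toy_train = pd.DataFrame({\n",
"    'feature': range(8),\n",
"    'label': [1, 1, 1, 1, 1, 1, 0, 0],\n",
"})\n",
"\n",
"dominant = toy_train[toy_train['label'] == 1]\n",
"rare = toy_train[toy_train['label'] == 0]\n",
"\n",
"# Sample the rare class with replacement until it matches the dominant class\n",
"rare_upsampled = resample(rare, replace=True, n_samples=len(dominant), random_state=23)\n",
"\n",
"# Recombine and shuffle; validation and test sets would be left untouched\n",
"toy_balanced = shuffle(pd.concat([dominant, rare_upsampled]), random_state=23)\n",
"print(toy_balanced['label'].value_counts())"
]
},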
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Select features to build the model\n",
"(Go to top)\n",
"\n",
"This time we build a model using all features. That is, we build a classifier including __numerical, categorical__ and __text__ features. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Grab model features/inputs and target/output\n",
"\n",
"# can also grab less numerical features, as some numerical data might not be very useful\n",
"numerical_features = ['Age upon Intake Days', 'Age upon Outcome Days']\n",
"\n",
"# dropping the IDs features, RescuerID and PetID here \n",
"categorical_features = ['Sex upon Outcome', 'Intake Type',\n",
" 'Intake Condition', 'Pet Type', 'Sex upon Intake']\n",
"\n",
"# from EDA, select the text features\n",
"text_features = ['Name', 'Found Location', 'Breed', 'Color']\n",
" \n",
"model_features = numerical_features + categorical_features + text_features\n",
"model_target = 'Outcome Type'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cleaning numerical features "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Age upon Intake Days\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"