{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# 0. Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# -*- coding: utf-8 -*-"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import numpy as np\n",
"import pandas as pd\n",
"import sklearn as sk\n",
"import pickle as pkl"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"---\n",
"\n",
"# 1. Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### About the dataset\n",
"\n",
"Predicting the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope -- a tedious and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem fully. Below, the task is framed as binary classification: predicting whether an abalone has at least 10 rings.\n",
"\n",
"+ Sex / nominal / -- / M, F, and I (infant)\n",
"+ Length / continuous / mm / longest shell measurement\n",
"+ Diameter / continuous / mm / perpendicular to length\n",
"+ Height / continuous / mm / with meat in shell\n",
"+ Whole weight / continuous / grams / whole abalone\n",
"+ Shucked weight / continuous / grams / weight of meat\n",
"+ Viscera weight / continuous / grams / gut weight (after bleeding)\n",
"+ Shell weight / continuous / grams / after being dried\n",
"+ Rings / integer / -- / +1.5 gives the age in years"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# variable names\n",
"names = [\n",
" 'sex',\n",
" 'length',\n",
" 'diameter',\n",
" 'height',\n",
" 'whole_weight',\n",
" 'shucked_weight',\n",
" 'viscera_weight',\n",
" 'shell_weight',\n",
" 'rings'\n",
"]\n",
"\n",
"# reading dataset\n",
"df = pd.read_csv('data/abalone.data', header=None, names=names)\n",
"\n",
"# building prediction target\n",
"df['target'] = (df['rings'] >= 10).astype(int)\n",
"df = df.drop('rings', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" sex length diameter height whole_weight shucked_weight viscera_weight \\\n",
"0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 \n",
"1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 \n",
"2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 \n",
"3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 \n",
"4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 \n",
"\n",
" shell_weight target \n",
"0 0.150 1 \n",
"1 0.070 0 \n",
"2 0.210 0 \n",
"3 0.155 1 \n",
"4 0.055 0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"---\n",
"\n",
"# 2. Feature Preparation"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# separating target from features\n",
"y = np.array(df['target'])\n",
"X = df.drop('target', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# shuffling and splitting data into training and test sets\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"SPLIT = 0.33\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=SPLIT, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Pre-processing Categorical Features\n",
"Example: one-hot encoding"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
" length diameter height whole_weight shucked_weight viscera_weight \\\n",
"1593 0.525 0.380 0.135 0.6150 0.2610 0.1590 \n",
"111 0.465 0.360 0.105 0.4310 0.1720 0.1070 \n",
"3271 0.520 0.425 0.155 0.7735 0.2970 0.1230 \n",
"1089 0.450 0.330 0.105 0.3715 0.1865 0.0785 \n",
"2918 0.600 0.445 0.135 0.9205 0.4450 0.2035 \n",
"\n",
" shell_weight sex__M sex__F sex__I \n",
"1593 0.1750 0 0 1 \n",
"111 0.1750 1 0 0 \n",
"3271 0.2550 1 0 0 \n",
"1089 0.0975 0 0 1 \n",
"2918 0.2530 0 0 1 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# one-hot encoding categorical features\n",
"CATEGORICALS = ['sex']\n",
"DUMMIES = {\n",
" 'sex':['M','F','I']\n",
"}\n",
"\n",
"def dummy_encode(in_df, dummies):\n",
" out_df = in_df.copy()\n",
" \n",
" for feature, values in dummies.items():\n",
" for value in values:\n",
" dummy_name = '{}__{}'.format(feature, value)\n",
" out_df[dummy_name] = (out_df[feature] == value).astype(int)\n",
" \n",
" del out_df[feature]\n",
" # print('Dummy-encoded feature\\t\\t{}'.format(feature))\n",
" return out_df\n",
" \n",
"X_train = dummy_encode(in_df=X_train, dummies=DUMMIES)\n",
"\n",
"X_test = dummy_encode(in_df=X_test, dummies=DUMMIES)\n",
"\n",
"X_train.head()"
]
},
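{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"*Sketch:* a side benefit of the hand-rolled encoder above is that a category value unseen during training (the hypothetical `'X'` below) simply maps to all-zero dummies instead of raising an error:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def dummy_encode(in_df, dummies):\n",
"    out_df = in_df.copy()\n",
"    for feature, values in dummies.items():\n",
"        for value in values:\n",
"            dummy_name = '{}__{}'.format(feature, value)\n",
"            out_df[dummy_name] = (out_df[feature] == value).astype(int)\n",
"        del out_df[feature]\n",
"    return out_df\n",
"\n",
"sample = pd.DataFrame({'sex': ['M', 'X']})  # 'X' was never seen in training\n",
"encoded = dummy_encode(sample, {'sex': ['M', 'F', 'I']})\n",
"\n",
"# the 'X' row is all zeros across sex__M, sex__F, sex__I\n",
"print(encoded)\n",
"```"
]
},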
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Pre-processing Numeric Features\n",
"Example: min-max scaling"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# rescaling numerical features\n",
"NUMERICS = ['length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight']\n",
"BOUNDARIES = {\n",
" 'length': (0.075000, 0.815000),\n",
" 'diameter': (0.055000, 0.650000),\n",
" 'height': (0.000000, 1.130000),\n",
" 'whole_weight': (0.002000, 2.825500),\n",
" 'shucked_weight': (0.001000, 1.488000),\n",
" 'viscera_weight': (0.000500, 0.760000),\n",
" 'shell_weight': (0.001500, 1.005000)\n",
"}\n",
"\n",
"def minmax_scale(in_df, boundaries):\n",
" out_df = in_df.copy()\n",
" \n",
" for feature, (min_val, max_val) in boundaries.items(): \n",
" col_name = '{}__norm'.format(feature)\n",
" \n",
"out_df[col_name] = round((out_df[feature] - min_val) / (max_val - min_val), 3)\n",
" out_df.loc[out_df[col_name] < 0, col_name] = 0\n",
" out_df.loc[out_df[col_name] > 1, col_name] = 1\n",
" \n",
" del out_df[feature]\n",
" # print('MinMax Scaled feature\\t\\t{}'.format(feature))\n",
" return out_df\n",
" \n",
"X_train = minmax_scale(in_df=X_train, boundaries=BOUNDARIES)\n",
"\n",
"X_test = minmax_scale(in_df=X_test, boundaries=BOUNDARIES)\n",
"\n",
"X_train.head()"
]
},
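{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"*Sketch:* the two `.loc` lines act as a clip, so a measurement outside the training-time boundaries (the hypothetical out-of-range values below) still lands inside [0, 1]:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"def minmax_scale(in_df, boundaries):\n",
"    out_df = in_df.copy()\n",
"    for feature, (min_val, max_val) in boundaries.items():\n",
"        col_name = '{}__norm'.format(feature)\n",
"        out_df[col_name] = round((out_df[feature] - min_val) / (max_val - min_val), 3)\n",
"        out_df.loc[out_df[col_name] < 0, col_name] = 0\n",
"        out_df.loc[out_df[col_name] > 1, col_name] = 1\n",
"        del out_df[feature]\n",
"    return out_df\n",
"\n",
"# 2.0 and -1.0 fall outside the training boundaries for 'length'\n",
"sample = pd.DataFrame({'length': [0.5, 2.0, -1.0]})\n",
"scaled = minmax_scale(sample, {'length': (0.075, 0.815)})\n",
"print(scaled)\n",
"```"
]
},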
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Notes\n",
"+ *Scikit-Learn* already implements the most common variable transformers; however, they tend to break down or behave unexpectedly when they encounter values different from those they were fitted on.\n",
"+ Transformations must be applied in the same order in production, because *Scikit-Learn* silently assumes that the columns at prediction time are in the same order as they were during training."
]
},
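{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A defensive pattern for the second note (a sketch using pandas `reindex`; the notebook does not store a column list, so `train_columns` here is an assumption): freeze the training-time column order and re-apply it before every prediction. Missing columns are filled with 0, extras are dropped, and order is enforced:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# column order captured at training time (illustrative subset)\n",
"train_columns = ['length__norm', 'sex__M', 'sex__F', 'sex__I']\n",
"\n",
"# prediction-time frame: different order, one column missing\n",
"X_new = pd.DataFrame({'sex__M': [1], 'length__norm': [0.5], 'sex__F': [0]})\n",
"\n",
"# enforce the training order; absent columns become 0\n",
"X_new = X_new.reindex(columns=train_columns, fill_value=0)\n",
"print(list(X_new.columns))\n",
"```"
]
},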
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"---\n",
"\n",
"# 3. Training"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"clf = RandomForestClassifier(\n",
" n_estimators=100, # number of trees\n",
" n_jobs=-1, # parallelization\n",
" random_state=1337, # random seed\n",
" max_depth=10, # maximum tree depth\n",
" min_samples_leaf=10\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 428 ms, sys: 16 ms, total: 444 ms\n",
"Wall time: 306 ms\n"
]
}
],
"source": [
"%time model = clf.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"---\n",
"\n",
"# 4. Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training ROC AUC:\t 0.851\n"
]
}
],
"source": [
"# computing ROC AUC over the training set\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"train_auc = roc_auc_score(y_train, model.predict(X_train))\n",
"print('Training ROC AUC:\\t', round(train_auc, 3))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test ROC AUC:\t\t 0.783\n"
]
}
],
"source": [
"# computing ROC AUC over the test set\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"test_auc = roc_auc_score(y_test, model.predict(X_test))\n",
"print('Test ROC AUC:\\t\\t', round(test_auc, 3))"
]
},
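{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"*Note (sketch):* `roc_auc_score` on hard `predict` labels evaluates a single 0.5 threshold; scoring on `predict_proba` uses the full ranking and usually gives a more faithful AUC. A toy illustration (synthetic data, not the abalone model):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.rand(200, 3)\n",
"y = (X[:, 0] > 0.5).astype(int)\n",
"\n",
"clf = RandomForestClassifier(n_estimators=50, random_state=0)\n",
"clf.fit(X, y)\n",
"\n",
"# probability of the positive class, used for ranking-based AUC\n",
"scores = clf.predict_proba(X)[:, 1]\n",
"auc = roc_auc_score(y, scores)\n",
"print(round(auc, 3))\n",
"```"
]
},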
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"---\n",
"# 5. Storing Model"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# persisting the trained model to disk\n",
"with open('pickles/model_v1.pkl', 'wb') as f:\n",
"    pkl.dump(model, f)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"---\n",
"# 6. Loading Model"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=10, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_split=1e-07, min_samples_leaf=10,\n",
" min_samples_split=2, min_weight_fraction_leaf=0.0,\n",
" n_estimators=100, n_jobs=-1, oob_score=False,\n",
" random_state=1337, verbose=0, warm_start=False)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# restoring the model from disk\n",
"with open('pickles/model_v1.pkl', 'rb') as f:\n",
"    m = pkl.load(f)\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"### All good. Let's build this bad boy into an API!"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}