# Train a binary classifier to classify income as below 50k or above 50k from census data

## Dataset Information:

The dataset is the [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/Adult) from the UCI Machine Learning Repository. It contains demographic and employment features for each person, along with an income label. The task is to predict whether a person's income is above or below $50k.

### Setup

To get started, make sure you have these prerequisites completed.

* Specify an AWS Region to host your model.
* Have an IAM role ARN that gives Amazon SageMaker access to your data in Amazon Simple Storage Service (Amazon S3). See the documentation for how to fine-tune the permissions needed.
* Create an S3 bucket to store the training data, any additional model data, and the data captured from model invocations. For demonstration purposes, this notebook uses the same bucket for all of these. In production, you might want to separate them and apply different security policies.


In [None]:
import datetime
import time
import tarfile
import os
import pickle

import boto3
import pandas as pd
import numpy as np
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput

import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.base import TransformerMixin

sm_boto3 = boto3.client('sagemaker')

sess = sagemaker.Session()

region = sess.boto_session.region_name

bucket = sess.default_bucket()  # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)

### Download data from UCI repository

In [None]:
%%bash

# Create the target directory first so the download does not fail on a fresh checkout
mkdir -p data
wget -q -O data/adult.csv https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

## Pre-process

Create a scikit-learn pipeline to handle pre-processing. It consists of the following steps:
* Create a train-test split.
* Use a simple imputer to fill missing values with the most frequent value for categorical features and the mean for numerical features.
* Use one-hot encoding for categorical features.
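
To make the imputation and encoding steps concrete, here is a minimal plain-Python sketch of what they do to a single column. It is an illustration only, not the scikit-learn implementation: the helper names (`impute_most_frequent`, `impute_mean`, `one_hot`) and the toy values are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None entries with the most common non-missing value."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


def impute_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Toy columns with one missing entry each (hypothetical values)
workclass = ["Private", None, "State-gov", "Private"]
age = [39, 50, None, 28]

workclass_filled = impute_most_frequent(workclass)
age_filled = impute_mean(age)
categories, workclass_encoded = one_hot(workclass_filled)

print(workclass_filled)
print(age_filled)
print(categories, workclass_encoded)
```

Each categorical column expands into one indicator column per category, which is why the transformed matrix is wider than the original frame.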

In [None]:
# Read csv with column names
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
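# The raw file separates fields with ", ", so strip the space after each comma;
# otherwise missing values are read as ' ?' and the replace below never matches them
df = pd.read_csv('data/adult.csv', names=column_names, skipinitialspace=True)

# Mark missing values (encoded as '?' in the raw data) as NaN
df.replace('?', np.nan, inplace=True)

# Hold out 10% of the rows as a test set
df_train_val, df_test = train_test_split(df, test_size=0.1, random_state=42)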
df_train_val_no_target = df_train_val.drop('income', axis=1)

df_test.to_csv('data/test.csv', index=False)

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_ind = [i for i, x in enumerate(df_train_val_no_target.dtypes) if x != object]
cat_ind = [i for i, x in enumerate(df_train_val_no_target.dtypes) if x == object]

numeric_transformer = SimpleImputer(strategy='mean')
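categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# sparse_threshold=0 makes the ColumnTransformer always return a dense array.
# This avoids the encoder's `sparse` argument, which scikit-learn 1.2 renamed to `sparse_output`.
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_ind),
    ('cat', categorical_transformer, cat_ind)
], sparse_threshold=0)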

X = preprocessor.fit_transform(df_train_val_no_target)

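# Encode the target as integers and place it in the first column,
# which is where the built-in XGBoost algorithm expects the label in CSV input
y = LabelEncoder().fit_transform(df_train_val.income)
X = np.insert(X, 0, y, axis=1)

# Save the ColumnTransformer to be used during inference
os.makedirs('script', exist_ok=True)
with open('script/preprocess.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)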

### Train-val split
Split the training set again to create a validation set, then upload both files to S3.

In [None]:
X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

np.savetxt("data/train.csv", X_train, delimiter=",", fmt='%f')
np.savetxt("data/val.csv", X_val, delimiter=",", fmt='%f')

prefix = 'sagemaker/blog'

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv'))\
.upload_file('data/train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv'))\
.upload_file('data/val.csv')
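
The uploads above place each file under its own key prefix (`train/` and `validation/`), and the training job later reads from those same prefixes. The following sketch shows how those S3 URIs can be built. It assumes a hypothetical helper `s3_uri` and a placeholder bucket name, and it uses `posixpath.join` so that keys always use forward slashes, which `os.path.join` does not guarantee on Windows.

```python
import posixpath


def s3_uri(bucket, *parts):
    """Join key parts with forward slashes and prepend the s3:// scheme."""
    return "s3://" + posixpath.join(bucket, *parts)


bucket = "example-bucket"  # hypothetical bucket name for illustration
prefix = "sagemaker/blog"  # same prefix as used in the upload cell

train_uri = s3_uri(bucket, prefix, "train")
validation_uri = s3_uri(bucket, prefix, "validation")

print(train_uri)
print(validation_uri)
```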

## Model training

We train the model using the SageMaker built-in XGBoost algorithm.

In [None]:
container = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, '1.2-1')

hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":"50"}

xgb = sagemaker.estimator.Estimator(container,
                                    get_execution_role(), 
                                    hyperparameters=hyperparameters,                                    
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

s3_input_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})
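
With the `binary:logistic` objective, the trained model returns a probability for the positive class rather than a hard label. Because `LabelEncoder` assigns integers in sorted order, class 1 corresponds to an income above 50k. A minimal sketch of turning scores into labels follows. The helper `to_labels`, the 0.5 threshold, and the scores are assumptions for illustration, not part of the notebook.

```python
def to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a fixed threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical model outputs, not real predictions
scores = [0.08, 0.51, 0.93, 0.49]
labels = to_labels(scores)

print(labels)
```

The threshold is a design choice: raising it trades recall for precision on the above-50k class.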

In [None]:
# Save the model to be used during inference
!aws s3 cp {xgb.model_data} model/
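
The file copied from S3 is a gzip-compressed tar archive. If you want to inspect or load the model locally, the sketch below unpacks it with the standard library. The helper name `extract_model` and the local path `model/model.tar.gz` are assumptions based on the copy command above, so adjust them to match your setup.

```python
import os
import tarfile


def extract_model(archive_path, target_dir):
    """Unpack a .tar.gz model archive and return the names of its members."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=target_dir)
    return names


# Assumed download location; the call is skipped if the file is not there
archive = os.path.join("model", "model.tar.gz")
if os.path.exists(archive):
    print(extract_model(archive, "model"))
```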