# Module 3: Model Training
**This notebook uses the feature set extracted by `module-2` to create an XGBoost-based machine learning model for binary classification**

**Note:** Please set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Load transformed feature set](#Load-transformed-feature-set)
1. [Split data](#Split-data)
1. [Train a model using SageMaker built-in XgBoost algorithm](#Train-a-model-using-SageMaker-built-in-XgBoost-algorithm)
1. [Real time inference using the deployed endpoint](#Real-time-inference-using-the-deployed-endpoint)


# Background

In this notebook, we use the feature set derived in `module-2` to build a machine learning model that predicts, from historical records, whether a customer will reorder a product. Because this is a supervised binary classification problem, we use XGBoost, a SageMaker built-in algorithm, to build the classifier. Once the model is trained, we deploy it as a SageMaker endpoint for real-time inference.

# Setup

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput
from sagemaker.predictor import Predictor
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import sagemaker
import logging
import boto3
import json
import os

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

#### Essentials

In [None]:
sagemaker_execution_role = get_execution_role()
logger.info(f'Role = {sagemaker_execution_role}')
session = boto3.Session()
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-featurestore-workshop'
s3 = session.resource('s3')

# Load transformed feature set

In [None]:
df = pd.read_csv('../data/train/transformed.csv')
df.head(5)

In [None]:
df.shape

Move the `is_reordered` column to the first position, because the SageMaker built-in XGBoost algorithm expects the target to be the first column of CSV training data.

In [None]:
first_column = df.pop('is_reordered')
df.insert(0, 'is_reordered', first_column)
df.head()

# Split data

We will shuffle the whole dataset first (`df.sample(frac=1, random_state=123)`) and then split it into the following parts:

* 70% - train set,
* 20% - validation set,
* 10% - test set

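**Note:** In the code below, the two list elements are the cut points passed to `np.split`, expressed as row positions. The first (0.7) ends the train set at 70% of the rows and the second (0.9) ends the validation set at 90% of the rows. Validation therefore receives the rows in between (0.9 - 0.7 = 0.2 = 20%) and test receives the remainder (1 - 0.9 = 0.1 = 10%).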

In [None]:
train_df, validation_df, test_df = np.split(df.sample(frac=1, random_state=123), [int(.7*len(df)), int(.9*len(df))])

In [None]:
train_df.shape

In [None]:
validation_df.shape

In [None]:
test_df.shape
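
The cut-point logic described in the note above can be illustrated with plain Python lists. This is a minimal sketch, not the notebook's own method: `split_by_cut_points` is a hypothetical helper, the 1000-row input is made-up illustration data, and the standard library shuffle stands in for `df.sample(frac=1, random_state=123)`, so the row order will differ from the pandas result.

```python
import random


def split_by_cut_points(rows, train_end=0.7, validation_end=0.9, seed=123):
    """Shuffle rows, then cut them at two positions into three parts.

    train_end and validation_end are cumulative fractions, mirroring the
    [int(.7*len(df)), int(.9*len(df))] cut points used with np.split.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n = len(shuffled)
    first_cut = int(train_end * n)        # end of the train set
    second_cut = int(validation_end * n)  # end of the validation set

    train = shuffled[:first_cut]
    validation = shuffled[first_cut:second_cut]
    test = shuffled[second_cut:]
    return train, validation, test


# Hypothetical data: 1000 row identifiers
rows = list(range(1000))
train, validation, test = split_by_cut_points(rows)
print(len(train), len(validation), len(test))  # 700 200 100
```

Because the three slices are contiguous and non-overlapping, every row lands in exactly one of the sets.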

Save the split datasets locally

In [None]:
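# Built-in XGBoost expects CSV training data without a header row
train_df.to_csv('../data/train/train.csv', index=False, header=False)
validation_df.to_csv('../data/validation/validation.csv', index=False, header=False)
# The test set keeps its header because it is read back with pandas for inference
test_df.to_csv('../data/test/test.csv', index=False)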

Copy the datasets from local storage to S3

In [None]:
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('../data/train/train.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('../data/validation/validation.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('../data/test/test.csv')

Create pointers to the uploaded files

In [None]:
train_set_location = 's3://{}/{}/train/'.format(default_bucket, prefix)
validation_set_location = 's3://{}/{}/validation/'.format(default_bucket, prefix)
test_set_location = 's3://{}/{}/test/'.format(default_bucket, prefix)
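
The relationship between the object keys used for the upload and the `s3://` locations above can be sketched as follows. This is an illustrative sketch only: `s3_key` and `s3_uri` are hypothetical helpers and `example-bucket` is a placeholder bucket name. It uses `posixpath.join` so that keys contain forward slashes on any operating system, whereas `os.path.join` gives the same result only on POSIX systems such as the SageMaker notebook instance.

```python
import posixpath


def s3_key(prefix, *parts):
    """Join a prefix and path parts into an S3 object key using '/'."""
    return posixpath.join(prefix, *parts)


def s3_uri(bucket, key):
    """Build an s3:// URI from a bucket name and an object key or prefix."""
    return 's3://{}/{}'.format(bucket, key)


prefix = 'sagemaker-featurestore-workshop'
bucket = 'example-bucket'  # placeholder, not the real default bucket

train_key = s3_key(prefix, 'train', 'train.csv')
train_location = s3_uri(bucket, s3_key(prefix, 'train') + '/')

print(train_key)
print(train_location)
```

The object key identifies a single file, while the location with a trailing slash covers everything uploaded under that prefix.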

In [None]:
train_set_pointer = TrainingInput(s3_data=train_set_location, content_type='csv')
validation_set_pointer = TrainingInput(s3_data=validation_set_location, content_type='csv')
test_set_pointer = TrainingInput(s3_data=test_set_location, content_type='csv')

In [None]:
print(json.dumps(train_set_pointer.__dict__, indent=2))

# Train a model using SageMaker built-in XgBoost algorithm

In [None]:
container_uri = sagemaker.image_uris.retrieve(region=session.region_name, 
 framework='xgboost', 
 version='1.0-1', 
 image_scope='training')

In [None]:
xgb = sagemaker.estimator.Estimator(image_uri=container_uri,
 role=sagemaker_execution_role, 
 instance_count=2, 
 instance_type='ml.m5.xlarge',
 output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),
 sagemaker_session=sagemaker_session,
 base_job_name='reorder-classifier')

xgb.set_hyperparameters(objective='binary:logistic',
 num_round=100)

In [None]:
xgb.fit({'train': train_set_pointer, 'validation': validation_set_pointer})

#### Saving Training Job Information

In [None]:
# Saving training job information to be used in the ML lineage module
training_job_info = xgb.latest_training_job.describe()
if training_job_info is not None:
    training_jobName = training_job_info["TrainingJobName"]
    %store training_jobName

#### Host the trained XGBoost model as a SageMaker Endpoint

**Note:** The deployment usually takes ~10 mins - good time to take a coffee break :)

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=2,
 instance_type='ml.m5.xlarge')

# Real time inference using the deployed endpoint

In [None]:
csv_serializer = CSVSerializer()
endpoint_name = xgb_predictor.endpoint_name
%store endpoint_name
predictor = Predictor(endpoint_name=endpoint_name, 
 serializer=csv_serializer)

In [None]:
test_df = pd.read_csv('../data/test/test.csv')
record = test_df.sample(1)
record

In [None]:
X = record.values[0]
payload = X[1:]
payload

In [None]:
%%time

predicted_class_prob = predictor.predict(payload).decode('utf-8')
if float(predicted_class_prob) < 0.5:
    logger.info('Prediction (y) = Will not reorder')
else:
    logger.info('Prediction (y) = Will reorder')
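
The request and response handling above can be tried without a live endpoint. The following is a minimal sketch, assuming the endpoint returns a single probability as UTF-8 text, as the cell above does. `to_csv_payload` and `label_from_response` are hypothetical helper names, not part of the SageMaker SDK; the first only approximates the comma-separated line a CSV serializer sends, and the feature values and response bytes are made up for illustration.

```python
def to_csv_payload(features):
    """Render one record's feature values as a single comma-separated line."""
    return ','.join(str(value) for value in features)


def label_from_response(body, threshold=0.5):
    """Decode a probability returned as bytes and map it to a class label.

    Probabilities below the threshold map to 'Will not reorder',
    everything else maps to 'Will reorder', matching the cell above.
    """
    probability = float(body.decode('utf-8'))
    if probability < threshold:
        return 'Will not reorder'
    return 'Will reorder'


# Made-up feature values (the target column is already dropped)
payload = to_csv_payload([1, 2.5, 0])
print(payload)

# Made-up endpoint responses
print(label_from_response(b'0.73'))
print(label_from_response(b'0.12'))
```

Keeping the threshold as a parameter makes it easy to trade precision against recall later, instead of hard-coding 0.5 at every call site.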