# Model Profiler with XGBoost, Linear-Learner and Job Scheduler
Now that we can easily schedule notebooks on the fly, let's see how to set up our own model profiler. In this case, we'll be using the direct marketing dataset used throughout the SageMaker examples to train multiple binary classifiers - XGBoost and Linear Learner are great places to start. Next, we'll spin up endpoints for these models to get some prediction results on our test data set. Finally, we'll consolidate all of the results below into a multi-model ROC curve. Notice that you can run this entire notebook on an ephemeral SageMaker Job, and still get the updated chart below! 

*Important* - To run this notebook, you'll need to download and prepare a train, validation, and test data set. We used the one available below. You're welcome to use a different set if you prefer!
- https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_direct_marketing/xgboost_direct_marketing_sagemaker.ipynb 

You can run this notebook step by step, or use the notebook scheduler to run it on its own ephemeral instance. Either way, make sure the parameters below contain valid S3 paths for your train, test, and validation data sets.

Paste in the base ECR image, which you can find in the ECS/ECR console after you've fully installed the toolkit. Paste in your execution role ARN, and type in the instance type you prefer. This was developed on an ml.m4.xlarge. Then click "Run Now," and it's running on a dedicated host! You can also click 

In [None]:
# you won't need to install the sagemaker sdk on Studio, but since we're running this on a base image, we'll need to do that here
!pip install sagemaker
!pip install s3fs

First, let's get our imports and configs set up. 

In [None]:
import os
import time
import boto3
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
bucket=sess.default_bucket()

prefix = 'sagemaker/model-profiler' 

In [None]:
# remember, Papermill will add a new cell just below this one, because it is tagged "parameters".
# the values below are defaults; any key-value pairs you pass in when scheduling the notebook override them
s3_prefix = 'sagemaker/model-profiler'
s3_absolute = 's3://{}/{}'.format(bucket, s3_prefix)
test_relative = 'test/test_data.csv'
train_relative = 'train/train.csv'
validation_relative = 'validation/validation.csv'

In [None]:
s3_test_data = '{}/{}'.format(s3_absolute, test_relative)
s3_train_data = '{}/{}'.format(s3_absolute, train_relative)
s3_validation_data = '{}/{}'.format(s3_absolute, validation_relative)

In [None]:
print ('pointing to data paths ')
print (s3_test_data)
print (s3_train_data)
print (s3_validation_data)
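
Because these paths are assembled from parameters that may be overridden at schedule time, a stray slash or an accidental `s3://` inside the bucket name is easy to miss until a training job fails. The following is a minimal sketch of a defensive path builder. `build_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and the bucket name in the usage lines is a placeholder.

```python
def build_s3_uri(bucket, *parts):
    """Join a bucket name and key parts into a single s3:// URI.

    Hypothetical helper for illustration: strips stray slashes so that
    'sagemaker/model-profiler/' and '/train/train.csv' join cleanly.
    """
    if not bucket or "/" in bucket or bucket.startswith("s3:"):
        raise ValueError("bucket must be a bare bucket name, got {!r}".format(bucket))
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    if not cleaned:
        raise ValueError("at least one non-empty key part is required")
    return "s3://{}/{}".format(bucket, "/".join(cleaned))


# Placeholder bucket name; in the notebook this would be sess.default_bucket()
example_uri = build_s3_uri("my-example-bucket", "sagemaker/model-profiler/", "/train/train.csv")
print(example_uri)
```

In the notebook, the same helper could be called with `bucket`, `s3_prefix`, and each of the relative paths in place of the `format` calls.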

---

## Training job and model creation
Assuming you have a valid dataset stored in S3, let's continue. We're going to loop through classifiers, here both XGBoost and Linear Learner. We'll train them both simultaneously, because each is running on its own instance.

In [None]:
from time import gmtime, strftime
from sagemaker.amazon.amazon_estimator import get_image_uri

s3_input_train = sagemaker.s3_input(s3_data=s3_train_data, content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data=s3_validation_data, content_type='text/csv')

output_location = 's3://{}/{}/output'.format(bucket, prefix)

data_channels = {'train': s3_input_train, 'validation': s3_input_validation}


Next, here's the fun part. We need a base estimator function that can take the name of a model as a parameter. Then, we're going to loop through our two classifiers and run a training job for both of them.

In [None]:
def get_base_estimator(model):
    '''
    This function takes the name of a model, grabs the image from ECR, and builds a base SageMaker estimator
    '''
    image = get_image_uri(boto3.Session().region_name, model)

    est = sagemaker.estimator.Estimator(image,
                                        role,
                                        train_instance_count=1,
                                        train_instance_type='ml.m5.4xlarge',
                                        train_volume_size=50,
                                        input_mode='File',
                                        output_path=output_location,
                                        sagemaker_session=sess)
    return est

def run_training_job(clf, is_last):
    '''
    This function takes the name of the classifier you want to train, in addition to a boolean indicating whether it is the last model
    We need to halt the Jupyter execution on the last model so we wait for the jobs to finish
    '''
    est = get_base_estimator(clf)

    if clf == 'xgboost':
        est.set_hyperparameters(objective="binary:logistic",
                                max_depth=5,
                                eta=0.2,
                                gamma=4,
                                min_child_weight=6,
                                subsample=0.8,
                                silent=0,
                                num_round=100)

    elif clf == 'linear-learner':
        est.set_hyperparameters(predictor_type='binary_classifier',
                                num_classes=2,
                                mini_batch_size=50)

    # the wait parameter will halt the local Jupyter env if it's set to True
    # we want this to happen for the last job, but not the previous ones
    est.fit(inputs=data_channels, logs=False, wait=is_last)

    return est

def get_estimators(models_to_run):
    '''
    Takes a list of classifiers you want to run
    Runs a training job for each and returns the list of estimators
    Waits until the last estimator has finished to give you back the Jupyter kernel
    '''
    rt = []

    for i, model in enumerate(models_to_run):
        est = run_training_job(model, i == len(models_to_run) - 1)
        rt.append(est)

    return rt

models_to_run = ['xgboost', 'linear-learner']
estimators = get_estimators(models_to_run)
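
One caveat with blocking only on the last job: if an earlier job takes longer than the last one, it may still be running when the kernel comes back. A minimal sketch of a polling loop that waits for every job follows. `wait_for_jobs` is a hypothetical helper, and the status source is injected so the sketch runs without AWS access. In the notebook, `get_status` could plausibly wrap `boto3.client('sagemaker').describe_training_job(TrainingJobName=name)['TrainingJobStatus']`, with the job names taken from each estimator's `latest_training_job`; treat that wiring as an assumption to verify against your SDK version.

```python
import time


def wait_for_jobs(job_names, get_status, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll every job until each one reaches a terminal state.

    get_status is any callable mapping a job name to a status string.
    Returns a dict of job name -> final status.
    Raises TimeoutError if jobs are still running after timeout_seconds.
    """
    terminal = {"Completed", "Failed", "Stopped"}
    final = {}
    waited = 0
    pending = list(job_names)
    while pending:
        still_running = []
        for name in pending:
            status = get_status(name)
            if status in terminal:
                final[name] = status
            else:
                still_running.append(name)
        pending = still_running
        if pending:
            if waited >= timeout_seconds:
                raise TimeoutError("jobs still running: {}".format(pending))
            sleep(poll_seconds)
            waited += poll_seconds
    return final


# Stand-in status source: each job reports "InProgress" a fixed number
# of times and then "Completed". The job names are placeholders.
_remaining = {"xgboost-job": 2, "linear-learner-job": 0}


def fake_status(name):
    if _remaining[name] > 0:
        _remaining[name] -= 1
        return "InProgress"
    return "Completed"


result = wait_for_jobs(["xgboost-job", "linear-learner-job"], fake_status,
                       poll_seconds=0, sleep=lambda seconds: None)
print(result)
```

Checking the returned statuses before deploying also surfaces a failed job early, rather than at endpoint creation.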

Pretty fun, right? After that, we'll spin up an endpoint for each model. We'll make sure to tear these down afterward, so the costs stay low. 

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer


def get_endpoints(estimators):
    '''
    Takes a list of estimators and turns each into an endpoint.
    Waits until the last one is complete.
    '''

    rt = []

    for i, est in enumerate(estimators):

        endpoint = est.deploy(1, 'ml.m4.xlarge', wait=(i == len(estimators) - 1))
        endpoint.content_type = 'text/csv'
        endpoint.serializer = csv_serializer
        endpoint.deserializer = json_deserializer

        rt.append(endpoint)

    return rt

endpoints = get_endpoints(estimators)

With the endpoints online, we'll take our test data and package it as a request up to each endpoint. Then, with the model responses, we can build our multi-model ROC curve.

In [None]:
import json 
import s3fs


def get_pred_array(array):
    '''
    Takes a list of floats, or features, and converts these into a single string separated by commas
    '''
    # join adds the separator between items only, so there is no trailing comma
    return ', '.join(str(x) for x in array)


def get_y_pred_matrix(endpoints, test_data):
    '''
    Takes a list of endpoints (XGBoost first, Linear Learner second) and a pandas data frame with the test data.
    Loops through the test data, builds prediction arrays,
    Hits both endpoints, and consolidates the results.
    Returns a pandas data frame.
    '''

    rt = []

    for idx, row in test_data.iterrows():

        # every column except the last one (the y_yes label) is a feature
        pred_array = get_pred_array(row[:-1].to_numpy())

        xgb_prediction = endpoints[0].predict(pred_array)

        ll_prediction = endpoints[1].predict(pred_array)['predictions'][0]['score']

        y_true = row['y_yes']

        new_row = [xgb_prediction, ll_prediction, y_true]
        rt.append(new_row)

    return pd.DataFrame(rt, columns=['xgboost', 'linear-learner', 'y_true'])
 
test_data = pd.read_csv(s3_test_data)
td = test_data.drop(['y_no'], axis=1)

predictions_df = get_y_pred_matrix(endpoints, td)
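
The loop above sends one request per test row, which can be slow on a large test set. A possible alternative is to pack several records into each request, one CSV record per line. The sketch below only builds the payloads; it assumes the endpoints accept multi-record CSV bodies and that each payload stays under the endpoint's request size limit, both of which you should confirm for your algorithms. `make_csv_batches` is a hypothetical helper, and parsing the batched responses is not shown because the response shape differs per algorithm.

```python
def make_csv_batches(rows, batch_size=100):
    """Yield CSV payload strings, each holding up to batch_size records.

    rows is an iterable of feature sequences (no label column).
    Each record becomes one comma-separated line.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for row in rows:
        batch.append(",".join(str(x) for x in row))
        if len(batch) == batch_size:
            yield "\n".join(batch)
            batch = []
    # emit whatever is left over after the last full batch
    if batch:
        yield "\n".join(batch)


# Made-up feature rows, purely for illustration
features = [[1, 0.5, 3], [2, 0.1, 4], [0, 0.9, 7]]
payloads = list(make_csv_batches(features, batch_size=2))
print(payloads)
```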

Now, let's take that matrix of prediction responses and get our ROC curve.

In [None]:
from sklearn.metrics import roc_curve

fpr_rt_xgb, tpr_rt_xgb, _ = roc_curve(predictions_df['y_true'], predictions_df['xgboost'])
fpr_rt_ll, tpr_rt_ll, _ = roc_curve(predictions_df['y_true'], predictions_df['linear-learner'])

In [None]:
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_xgb, tpr_rt_xgb, label='XGBoost')
plt.plot(fpr_rt_ll, tpr_rt_ll, label='Linear Learner')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
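
To compare the models with a single number as well as a chart, you can compute the area under each ROC curve. The plain-Python sketch below shows what that calculation involves: sweep a threshold from the highest score to the lowest, record a (false positive rate, true positive rate) point at each distinct score, and integrate with the trapezoid rule. `roc_points` and `trapezoid_auc` are hypothetical names used for illustration, and the labels and scores are made up; in the notebook, `sklearn.metrics.roc_auc_score` on the columns of `predictions_df` would be the more practical choice.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists by sweeping a threshold from high to low.

    Tied scores are processed together so they produce a single point.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # consume every example that shares this score
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


# Made-up labels and scores, purely for illustration
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(labels, scores)
auc_value = trapezoid_auc(fpr, tpr)
print(auc_value)
```

An area of 0.5 corresponds to the dashed diagonal in the chart, and values closer to 1.0 indicate better ranking of positives above negatives.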

In [None]:
# make sure to run this step and delete your endpoints!
for e in endpoints:
    e.delete_endpoint()
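
When this notebook runs unattended on a scheduled job, a failure while deleting one endpoint would skip the rest and leave them running. A minimal sketch of a more forgiving cleanup follows. `delete_all` is a hypothetical helper, and `FakeEndpoint` is a stand-in so the sketch runs without AWS access; the only thing it assumes about real endpoints is the `delete_endpoint()` method already used above.

```python
def delete_all(endpoints):
    """Call delete_endpoint() on every endpoint, even if some calls fail.

    Returns a list of (endpoint, exception) pairs for the failures.
    """
    failures = []
    for e in endpoints:
        try:
            e.delete_endpoint()
        except Exception as err:  # keep going so the remaining endpoints are still deleted
            failures.append((e, err))
    return failures


class FakeEndpoint:
    """Stand-in for a predictor, used only to exercise delete_all."""

    def __init__(self, name, should_fail=False):
        self.name = name
        self.should_fail = should_fail
        self.deleted = False

    def delete_endpoint(self):
        if self.should_fail:
            raise RuntimeError("could not delete " + self.name)
        self.deleted = True


fakes = [FakeEndpoint("a", should_fail=True), FakeEndpoint("b")]
failed = delete_all(fakes)
print([e.name for e, _ in failed])
```

Wrapping the prediction and plotting steps in `try` / `finally` and calling the cleanup in the `finally` branch would additionally cover failures that happen before this cell is reached.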