# Amazon SageMaker Workshop
## _**Modeling**_

---
In this part of the workshop we will **train** multiple Gradient Boosted Trees models to predict mobile customer departure (churn).

---

## Contents

1. [Background](#Background) - Getting the data prepared in the previous lab.
2. [Train](#Train) - Train with the Amazon SageMaker XGBoost algorithm
    * [Creating Amazon SageMaker Training Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html)
    * [Algorithm implementations](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html#algorithms-choose-implementation) - Built-in and Framework mode (pre-made Docker images)
    * [Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html)
        * Manage multiple Runs
        * Experiment with hyperparameters and charting
    * [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html)
        * Debug your model
 
3. [Exercise](#a_Exercise) - Your turn to create multiple training jobs
4. [Wrap-up - end of Modeling Lab](#Wrap-up)
 

---

## Background

In the previous [Data Preparation](../1-DataPrep/data_preparation.ipynb) lab, we prepared the data using a SageMaker notebook (with the Pandas library) and/or SageMaker Data Wrangler.

Let's import the libraries for this lab:

In [None]:
# Suppress default INFO logging
import logging
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

In [None]:
import pprint
from time import strftime, gmtime

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig

from sagemaker.experiments.run import Run, load_run

In [None]:
role = sagemaker.get_execution_role()
sm_sess = sagemaker.session.Session()

---
### - if you _**skipped**_ the lab `1-DataPrep` follow instructions:

 - **run this [notebook](./config/pre_setup.ipynb)**
---
Now, load variables for this lab:

In [None]:
%store -r bucket
%store -r prefix
%store -r docker_image_name
%store -r framework_version
%store -r s3uri_train
%store -r s3uri_validation

In [None]:
bucket, prefix, docker_image_name, framework_version, s3uri_train, s3uri_validation

---
## Train

We'll use the XGBoost library to train a class of models known as gradient boosted decision trees on the data that we just uploaded. 

Because we're using XGBoost, we first need to specify the locations of the XGBoost algorithm containers.

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3.

In [None]:
from sagemaker.inputs import TrainingInput

In [None]:
s3_input_train = TrainingInput(s3_data=s3uri_train, content_type='csv')
s3_input_validation = TrainingInput(s3_data=s3uri_validation, content_type='csv')

In [None]:
pprint.pprint(s3_input_train.config)

In [None]:
pprint.pprint(s3_input_validation.config)

### Amazon SageMaker Experiments

Amazon SageMaker Experiments allows us to keep track of model training; organize related models together and log model configuration, parameters, and metrics to reproduce and iterate on previous models and compare models. We'll create a single experiment (during training job) to keep track of the different approaches we'll try to train the model.

Each approach or block of training code that we run will be an experiment run. Later, we'll be able to compare different runs in Amazon SageMaker Studio.

Let's choose a name for the experiment.

In [None]:
# Helper to create timestamps
create_date = lambda: strftime("%Y-%m-%d-%H-%M-%S", gmtime())
experiment_name=f"customer-churn-prediction-xgboost-{create_date()}"

#### Hyperparameters
Now we can specify our XGBoost hyperparameters. Among them are these key hyperparameters:
- `max_depth` Controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. Typically, you need to explore some trade-offs in model performance between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` Controls sampling of the training data. This hyperparameter can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` Controls the number of boosting rounds. Each round trains an additional tree on the residuals of the previous iterations. Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` Controls how aggressive each round of boosting is (the step size, or learning rate). Smaller values lead to more conservative boosting.
- `gamma` Controls how aggressively trees are grown. Larger values lead to more conservative models.
- `min_child_weight` Also controls how aggressively trees are grown. Larger values lead to a more conservative model.

For more information about these hyperparameters, see [XGBoost's hyperparameters GitHub page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst).
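
To build intuition for how `eta` and `num_round` interact, here is a toy sketch. It is not XGBoost's algorithm: each round "fits" the mean of the current residuals (a deliberately weak, hypothetical learner named `toy_boost` here) and adds `eta` times that fit to the prediction. Under these assumptions, a smaller `eta` takes smaller steps and needs more rounds to reach the same training error.

```python
def toy_boost(targets, eta, num_round):
    """Toy additive boosting with a constant (mean-of-residuals) learner.

    Illustrates step-size shrinkage only; this is not how XGBoost builds trees.
    """
    prediction = 0.0
    for _ in range(num_round):
        residuals = [t - prediction for t in targets]
        weak_fit = sum(residuals) / len(residuals)  # the "weak learner"
        prediction += eta * weak_fit                # shrink each step by eta
    return prediction


targets = [10.0, 10.0, 10.0]
for eta in (0.05, 0.2, 1.0):
    result = toy_boost(targets, eta, num_round=10)
    print(f"eta={eta}: prediction after 10 rounds = {result:.3f}")
```

With the same number of rounds, the smaller step sizes stay further from the target, which is why a low `eta` is usually paired with a higher `num_round`.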

In [None]:
hyperparams = {"max_depth":5,
 "subsample":0.8,
 "num_round":600,
 "eta":0.2,
 "gamma":4,
 "min_child_weight":6,
 "objective":'binary:logistic',
 "verbosity": 0
 }

### Run 1 - XGBoost in algorithm mode

For our first run, we'll use the built-in XGBoost algorithm to train a model without supplying any additional code. This way, we can use XGBoost to train and deploy a model as we would with other Amazon SageMaker built-in algorithms.

We'll create a new `Run` object. To train the model, we'll create an estimator and specify a few parameters, like the type of training instances we'd like to use and how many, and where the artifacts of the trained model should be stored. 

We'll also associate the training job with the experiment run that we just created when we call the `fit` method of the `estimator`.

In [None]:
xgb = sagemaker.estimator.Estimator(image_uri=docker_image_name,
 role=role,
 hyperparameters=hyperparams,
 instance_count=1, 
 instance_type='ml.m4.xlarge',
 output_path=f's3://{bucket}/{prefix}/output',
 base_job_name='workshop-xgboost-customer-churn',
 sagemaker_session=sm_sess)

In [None]:
with Run(
    experiment_name=experiment_name,
    run_name=f"algorithm-mode-run-{create_date()}",
    sagemaker_session=sm_sess,
) as run:
    xgb.fit(
        inputs={
            'train': s3_input_train,
            'validation': s3_input_validation
        }
    )

### Review the results

After the training job completes successfully, you can view metrics, logs, and graphs related to the run on the **SageMaker resources** tab in Amazon SageMaker Studio. 

To see the experiments, choose **Experiments** on the left menu, and select the experiment with the name you created in the previous step.

![experiments_button](media/10-experiments.png)

To see detailed metrics about your training job, click on the respective run name:
 
![Run_Detail](media/20-run.png)

#### Download the model

You can find and download the model in Amazon SageMaker Studio. To find the model, choose the **Output Artifacts** tab on the left, and check the S3 URI.

![Run_Detail](media/30-download.png)

#### Trying other hyperparameter values

To improve a model, you typically try other hyperparameter values to see if they affect the final validation error. Let's vary the `min_child_weight` parameter and start other training jobs with those different values to see how they affect the validation error. For each value, we'll create a separate run so that we can compare the results in Amazon SageMaker Studio later.
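
One way to organize such a sweep is to build one hyperparameter dictionary per candidate value before launching anything. The sketch below is plain Python; the helper name `build_sweep`, the run-label format, and the candidate values are all hypothetical choices for illustration, not part of the SageMaker SDK.

```python
import copy

# Same baseline values as the hyperparameters used for Run 1.
base_hyperparams = {
    "max_depth": 5,
    "subsample": 0.8,
    "num_round": 600,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 6,
    "objective": "binary:logistic",
    "verbosity": 0,
}


def build_sweep(base, name, values):
    """Return one (run_label, hyperparameters) pair per candidate value."""
    sweep = []
    for value in values:
        params = copy.deepcopy(base)  # never mutate the shared baseline
        params[name] = value
        label = f"{name.replace('_', '-')}-{value}"  # assumed label format
        sweep.append((label, params))
    return sweep


# Candidate values are arbitrary examples.
sweep = build_sweep(base_hyperparams, "min_child_weight", [1, 2, 4, 8, 10])
for label, params in sweep:
    print(label, params["min_child_weight"])
```

Each pair can then feed its own estimator and experiment run; launching the jobs themselves is left for the exercise below.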



# 2a. Exercise

### >>> Your turn! 

Run 5 **- simultaneous -** training jobs with different values for `min_child_weight` in XGBoost.

If you'd like to try other hyperparameters, choose them from here:

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html 

In [None]:
# YOUR SOLUTION HERE



Want a hint? [Look here](./solutions/a-hint.md)

### Visualize and compare multiple training jobs

#### Create charts

To create a chart, multi-select the components. Because this is a sample training run and the data is sparse, there's not much to chart in a time series. However, we can create a scatter plot for the parameter sweep. 

In your experiment, select the runs that you would like to compare. The following image is an example.

![Run_Detail](media/40-compare.png)

On the following screen, click the "Add Graph" button and then choose the "Scatter plot" type:

![Run_Detail](media/50-add-graph.png)

On the graph screen, choose:

* "Wall time" as the dimension
* "validation:logloss" on both axes, then click the "Update" button:

![scatter plot example](media/60-graph-detail.png)

### Amazon SageMaker Debugger

Amazon SageMaker Debugger lets you debug a model during training. As you train, Debugger periodically saves tensors, which fully specify the state of the model at that point in time. Debugger saves the tensors to an Amazon S3 bucket. You can then use Amazon SageMaker Studio for analysis and visualization to diagnose training issues.

#### Specify SageMaker Debugger Rules

To enable automated detection of common issues during training, Amazon SageMaker Debugger also allows you to attach a list of rules to evaluate the training job against.

Some rules that apply to XGBoost include `AllZero`, `ClassImbalance`, `Confusion`, `LossNotDecreasing`, `Overfit`, `Overtraining`, `SimilarAcrossRuns`, `TensorVariance`, `UnchangedTensor`, and `TreeDepth`. 

[More information here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#debugger-built-in-rules-Rule)

We'll use the `LossNotDecreasing` rule--which is triggered if the loss doesn't decrease monotonically at any point during training--the `Overtraining` rule, and the `Overfit` rule. Let's create the rules.

In [None]:
debug_rules = [Rule.sagemaker(rule_configs.loss_not_decreasing()),
 Rule.sagemaker(rule_configs.overtraining()),
 Rule.sagemaker(rule_configs.overfit())
 ]
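
To make the idea behind a loss-not-decreasing check concrete, here is a simplified sketch. It is not Debugger's implementation: the function name `loss_stalled` and the `patience` and `min_improvement` parameters are hypothetical, chosen only to show the kind of condition such a rule evaluates.

```python
def loss_stalled(losses, patience=3, min_improvement=0.0):
    """Return the step at which the loss has failed to improve for
    `patience` consecutive steps, or None if that never happens."""
    best = float("inf")
    steps_without_improvement = 0
    for step, loss in enumerate(losses):
        if best - loss > min_improvement:
            best = loss
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                return step
    return None


# Made-up loss curves for illustration.
healthy = [0.69, 0.52, 0.41, 0.35, 0.31, 0.29]
stalled = [0.69, 0.52, 0.50, 0.51, 0.50, 0.52]

print(loss_stalled(healthy))
print(loss_stalled(stalled))
```

The built-in rules run as managed jobs alongside training, so you do not write this logic yourself; the sketch only shows what "triggered" means for a stalled loss curve.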

### Run 2 - XGBoost in framework mode

For the next run, we'll train a similar model, but use XGBoost in framework mode. If you've worked with the open source XGBoost, using XGBoost this way will be familiar to you. 

Using XGBoost as a framework provides more flexibility than using it as a built-in algorithm because it enables more advanced scenarios that **allow incorporating custom pre-processing, training and post-processing scripts.**

In addition, we'll be able to specify a list of rules that we want Amazon SageMaker Debugger to evaluate our training against.

#### Fit estimator

To use XGBoost as a framework, you need to specify an entry-point script that can incorporate additional processing into your training jobs.

We've made a couple of simple changes to enable the Amazon SageMaker Debugger `smdebug` library. We create a `Hook` (from `smdebug.xgboost`), which we pass as a callback function to `xgboost.train`. We pass it a `SaveConfig` object that tells the hook to save the requested collections at regular intervals; in this script the `--smdebug_collections` argument defaults to `metrics`, and Debugger can also save other data such as feature importances and SHAP values. (Debugger is highly configurable. You can choose exactly what to save.) We describe the changes in more detail after we train this example. For even more detail, see the [Developer Guide for XGBoost](https://github.com/awslabs/sagemaker-debugger/tree/master/docs/xgboost).

In [None]:
%%writefile xgboost_customer_churn.py
import argparse
import os
import pickle
from io import StringIO

import pandas as pd
import xgboost
from smdebug import SaveConfig
from smdebug.xgboost import Hook


def parse_args():

    parser = argparse.ArgumentParser()

    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--gamma", type=int, default=4)
    parser.add_argument("--min_child_weight", type=int, default=6)
    parser.add_argument("--subsample", type=float, default=0.8)
    parser.add_argument("--verbosity", type=int, default=0)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--num_round", type=int, default=50)
    parser.add_argument("--smdebug_path", type=str, default=None)
    parser.add_argument("--smdebug_frequency", type=int, default=1)
    parser.add_argument("--smdebug_collections", type=str, default='metrics')
    parser.add_argument("--output_uri", type=str, default="/opt/ml/output/tensors",
                        help="S3 URI of the bucket where tensor data will be stored.")

    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

    args = parser.parse_args()

    return args


def create_smdebug_hook(out_dir, train_data=None, validation_data=None, frequency=1, collections=None,):

    save_config = SaveConfig(save_interval=frequency)
    hook = Hook(
        out_dir=out_dir,
        train_data=train_data,
        validation_data=validation_data,
        save_config=save_config,
        include_collections=collections,
    )

    return hook


def main():

    args = parse_args()

    train, validation = args.train, args.validation
    parse_csv = "?format=csv&label_column=0"
    dtrain = xgboost.DMatrix(train+parse_csv)
    dval = xgboost.DMatrix(validation+parse_csv)

    watchlist = [(dtrain, "train"), (dval, "validation")]

    params = {
        "max_depth": args.max_depth,
        "eta": args.eta,
        "gamma": args.gamma,
        "min_child_weight": args.min_child_weight,
        "subsample": args.subsample,
        "verbosity": args.verbosity,
        "objective": args.objective}

    # The output_uri is the URI for the S3 bucket where the metrics will be
    # saved.
    output_uri = (
        args.smdebug_path
        if args.smdebug_path is not None
        else args.output_uri
    )

    collections = (
        args.smdebug_collections.split(',')
        if args.smdebug_collections is not None
        else None
    )

    hook = create_smdebug_hook(
        out_dir=output_uri,
        frequency=args.smdebug_frequency,
        collections=collections,
        train_data=dtrain,
        validation_data=dval,
    )

    bst = xgboost.train(
        params=params,
        dtrain=dtrain,
        evals=watchlist,
        num_boost_round=args.num_round,
        callbacks=[hook])

    if not os.path.exists(args.model_dir):
        os.makedirs(args.model_dir)

    # Persist the trained booster so SageMaker can package it as the model artifact.
    model_location = os.path.join(args.model_dir, 'xgboost-model')
    with open(model_location, 'wb') as model_file:
        pickle.dump(bst, model_file)


if __name__ == "__main__":

    main()


def model_fn(model_dir):
    """Load a model. For XGBoost Framework, a default function to load a model is not provided.
    Users should provide customized model_fn() in script.
    Args:
        model_dir: a directory where model is saved.
    Returns:
        A XGBoost model.
        XGBoost model format type.
    """
    model_files = (file for file in os.listdir(model_dir) if os.path.isfile(os.path.join(model_dir, file)))
    model_file = next(model_files)
    try:
        # First try the pickle format written by main().
        with open(os.path.join(model_dir, model_file), 'rb') as f:
            booster = pickle.load(f)
        format = 'pkl_format'
    except Exception as exp_pkl:
        try:
            # Fall back to XGBoost's native model format.
            booster = xgboost.Booster()
            booster.load_model(os.path.join(model_dir, model_file))
            format = 'xgb_format'
        except Exception as exp_xgb:
            # Surface both load failures in a single error.
            raise RuntimeError("Unable to load model: {} {}".format(str(exp_pkl), str(exp_xgb)))
    booster.set_param('nthread', 1)
    return booster, format


def predict_fn(input_object, model):
    """
    Perform prediction on the deserialized object, with the loaded model.
    """
    # model_fn returns a (booster, format) tuple, so take the booster from it.
    booster = model[0] if isinstance(model, tuple) else model
    X_test = xgboost.DMatrix(input_object.values)
    predictions_probs = booster.predict(X_test)
    predictions = predictions_probs.round()
    return {"predictions": predictions}


def input_fn(request_body, content_type):
    """
    Perform preprocessing task on inference dataset.
    """
    if content_type == "text/csv":
        df = pd.read_csv(StringIO(request_body), header=None)
        return df
    else:
        raise ValueError("{} not supported by script!".format(content_type))
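
The `save_interval` passed to `SaveConfig` in the script above determines how often the hook writes data. As a rough sketch, assuming the hook saves whenever the step number is a multiple of the interval (our understanding of `smdebug`, not something verified here), the saved steps can be enumerated as follows. The helper name `saved_steps` is hypothetical.

```python
def saved_steps(num_round, save_interval):
    """Boosting rounds (0-indexed) at which a hook with this interval would save,
    assuming a save happens whenever step % save_interval == 0."""
    return [step for step in range(num_round) if step % save_interval == 0]


print(saved_steps(num_round=10, save_interval=1))
print(saved_steps(num_round=10, save_interval=4))
```

With `num_round` set to 600 and the script's default frequency of 1, every round is saved; a larger `--smdebug_frequency` reduces the amount of data written at the cost of coarser diagnostics.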

### Tips about predict_fn & input_fn on code
Although we are developing the training script, we can add functions for the inference logic. SageMaker provides a simple way for users to customize not only the training logic, but also the inference logic. All you have to do is create functions called `model_fn`, `input_fn`, `predict_fn` and `output_fn` ([more details here](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html#serve-a-model)).
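
The script in this lab does not define `output_fn`. A minimal sketch of what one might look like is shown below, assuming it receives the dictionary returned by `predict_fn` plus the requested response type. The argument names and the exact return type the serving container expects are assumptions; check the linked documentation before relying on this.

```python
import json


def output_fn(prediction, accept):
    """Serialize the dict returned by predict_fn into the requested format."""
    values = [float(p) for p in prediction["predictions"]]
    if accept == "application/json":
        return json.dumps({"predictions": values})
    if accept == "text/csv":
        # One rounded class label per line.
        return "\n".join(str(int(v)) for v in values)
    raise ValueError("{} not supported by script!".format(accept))


print(output_fn({"predictions": [0.0, 1.0, 1.0]}, "text/csv"))
```

Converting each value with `float` keeps the function working whether the predictions arrive as a plain list or as a NumPy array.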

Let's create our framework estimator and call `fit` to start the training job. As before, we'll create a separate experiment run for this job so that we can use Amazon SageMaker Studio to compare it with other runs later. Because we are running in framework mode, we also need to pass additional parameters, like the entry point script and the framework version, to the estimator. 

As training progresses, you'll be able to see Amazon SageMaker Debugger logs that evaluate the rules against the training job.

In [None]:
from sagemaker.xgboost.estimator import XGBoost

In [None]:
train_script_name = 'xgboost_customer_churn.py'

In [None]:
framework_xgb = XGBoost(image_uri=docker_image_name,
 entry_point=train_script_name,
 role=role,
 framework_version=framework_version,
 py_version="py3",
 hyperparameters=hyperparams,
 instance_count=1, 
 instance_type='ml.m4.xlarge',
 output_path=f's3://{bucket}/{prefix}/output',
 base_job_name='workshop-framework-xgboost-customer-churn',
 sagemaker_session=sm_sess,
 rules=debug_rules
 )

In [None]:
with Run(
    experiment_name=experiment_name,
    run_name=f"framework-mode-run-{create_date()}",
    sagemaker_session=sm_sess,
) as run:
    framework_xgb.fit(
        inputs={
            'train': s3_input_train,
            'validation': s3_input_validation
        }
    )

After the training job has been running for a while, you can view debug information in the Debugger panel. To get to this panel, click through the experiment and the respective run, and then click the "debugs" option on the left.

![debugger panel](media/70-debug.png)

You can also open the debugging details by clicking the name inside "Debug Insights" (near the top of the run page):

![view_debugger panel](media/71-debug-insights.png)

On this page, you can see detailed metrics about the debugging process:

![Debugger detail](media/72-debug-det.png)

#### Let's save `training_job_name` for later (Lab4-Deployment)

In [None]:
training_job_name = framework_xgb.latest_training_job.job_name
training_job_name

In [None]:
%store training_job_name

#### Let's also store the S3 URI and training script name where our training script was saved for later

(Observe that SageMaker in Framework mode saves the training script in S3 and puts the S3 URI in a hyperparameter called `sagemaker_submit_directory`; the `sagemaker_program` key points to the training script.)

In [None]:
framework_xgb.hyperparameters()

In [None]:
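import json

# Hyperparameter values are JSON-encoded strings; decode with json.loads rather than eval.
s3_modeling_code_uri = json.loads(framework_xgb.hyperparameters()["sagemaker_submit_directory"])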
s3_modeling_code_uri
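
The values returned by `hyperparameters()` appear to be JSON-encoded, so a string value arrives wrapped in an extra pair of quotes and needs decoding before it can be used as a URI. The sketch below shows this with a made-up URI (the bucket and prefix are hypothetical); `json.loads` decodes the value without executing it as code, which makes it a safer choice than `eval`.

```python
import json

# Hypothetical value, shaped like the entry printed by the cell above.
encoded = '"s3://my-bucket/my-prefix/source/sourcedir.tar.gz"'

decoded = json.loads(encoded)  # removes the JSON quoting, nothing is executed

print(encoded)
print(decoded)
```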

In [None]:
%store s3_modeling_code_uri

In [None]:
%store train_script_name

# Wrap-up

Now that we finished the **modeling lab**, let's make everything here re-usable. It may come in handy later (spoiler alert - when creating Pipelines)...

**Note:** We will use the `sagemaker.estimator.Estimator` class, but we will also pass our custom training script as a hyperparameter.

(yes, it is possible to do that, like we saw when inspecting the hyperparameters and seeing the `sagemaker_program` and `sagemaker_submit_directory` keys!).

In [None]:
%%writefile ../6-Pipelines/my_labs_solutions/modeling_solution.py

import boto3
import sagemaker
from sagemaker.inputs import TrainingInput

from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig

def get_modeling_estimator(bucket,
                           prefix,
                           s3_modeling_code_uri,
                           docker_image_name,
                           entry_point_script = 'xgboost_customer_churn.py') -> sagemaker.estimator.Estimator:

    role = sagemaker.get_execution_role()
    sm_sess = sagemaker.session.Session()

    # Input configs
    hyperparams = {"sagemaker_program": entry_point_script,
                   "sagemaker_submit_directory": s3_modeling_code_uri,
                   "max_depth": 5,
                   "subsample": 0.8,
                   "num_round": 600,
                   "eta": 0.2,
                   "gamma": 4,
                   "min_child_weight": 6,
                   "objective": 'binary:logistic',
                   "verbosity": 0
                   }

    # Debugger configs
    debug_rules = [
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        Rule.sagemaker(rule_configs.overtraining()),
        Rule.sagemaker(rule_configs.overfit())
    ]

    # Estimator configs
    xgb = sagemaker.estimator.Estimator(image_uri=docker_image_name,
                                        role=role,
                                        hyperparameters=hyperparams,
                                        instance_count=1,
                                        instance_type='ml.m4.xlarge',
                                        output_path=f's3://{bucket}/{prefix}/output',
                                        base_job_name='pipeline-xgboost-customer-churn',
                                        sagemaker_session=sm_sess,
                                        rules=debug_rules)

    return xgb


---
# [You can now go to the lab 3-Evaluation](../3-Evaluation/evaluation.ipynb)