## Amazon SageMaker with XGBoost and Hyperparameter Tuning for Taxi Trip Fare Prediction
#### Supervised Learning with Gradient Boosted Trees
This notebook works well with the **Python 3 (Data Science)** kernel on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances.

---

## Objective

This workshop gives you an **example of using and tuning a SageMaker built-in algorithm**, focusing on the **data interfaces** and SageMaker's automatic **Hyperparameter Optimization** (HPO) capabilities.

Teaching in-depth data science approaches for tabular data is outside the scope of this workshop, but we hope you can use this notebook as a starting point and adapt it to the needs of your future projects.

---

## Prepare our Environment

We'll need to:

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services

While `boto3` is the general AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.

**Note: complete Lab 1 before running this notebook, because the training, validation and test datasets are prepared there.**

In [None]:
import os
import boto3
import re
import copy
import time
from time import gmtime, strftime
import sagemaker
from sagemaker import get_execution_role
from sagemaker.debugger import Rule, rule_configs

role = get_execution_role()

region = boto3.Session().region_name

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-xgboost-tripfare'

In [None]:
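# List the variables saved by Lab 1, then restore them into this notebook.
# The cells below rely on train_path, validation_path and test_path being restored.
%store
%store -r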

---

## Understand the Algorithm

We'll be using SageMaker's [built-in **XGBoost Algorithm**](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html), which offers performance-optimized, pre-implemented functionality such as multi-instance parallelization and support for multiple input formats.

In general, to use the pre-built algorithms, we need to:

- Refer to the [Common Parameters docs](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) to see the **high-level configuration** and what features each algorithm has
- Refer to the [algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to understand the **detail** of the **data formats** and **(hyper)-parameters** it supports

From these docs, we'll understand what data format we need to upload to S3 (next), and how to get the container image URI of the algorithm, which is listed on the Common Parameters page but can also be retrieved through the SDK.

We know from [the algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) that SageMaker XGBoost expects data in the **libSVM** or **CSV** formats, with:

- The target variable in the first column, and
- No header row

The data used in training has already been prepared in Lab 1.
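
As a reminder of what that layout looks like, the sketch below writes a few rows in the expected shape using only the standard library. The column names and values (`fare_amount`, `trip_distance`, `passenger_count`) are hypothetical and are not taken from the Lab 1 dataset.

```python
import csv
import io

# Hypothetical records; the real columns come from Lab 1.
records = [
    {"trip_distance": 2.5, "passenger_count": 1, "fare_amount": 11.0},
    {"trip_distance": 7.1, "passenger_count": 2, "fare_amount": 24.5},
]

target = "fare_amount"
features = [name for name in records[0] if name != target]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for record in records:
    # Target first, then the features; no header row is written.
    writer.writerow([record[target]] + [record[name] for name in features])

csv_text = buffer.getvalue()
print(csv_text)
```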

In [None]:
# training step for generating model artifacts
training_instance_type = "ml.m5.xlarge"
model_output = f"s3://{bucket}/{prefix}/model"

# Define the XGBoost training report rules
# see: https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-training-xgboost-report.html
rules = [Rule.sagemaker(rule_configs.create_xgboost_report())]

image_uri = sagemaker.image_uris.retrieve(
 framework="xgboost",
 region=region,
 version="1.2-2",
 py_version="py3",
 instance_type=training_instance_type,
)

In [None]:
from sagemaker.inputs import TrainingInput
content_type = "csv"
train_input = TrainingInput(
 train_path, content_type=content_type, distribution='ShardedByS3Key'
)
validation_input = TrainingInput(
 validation_path, content_type=content_type, distribution='FullyReplicated'
)
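
The two `distribution` values used above differ in how the S3 objects reach the training instances: `FullyReplicated` copies every object to every instance, while `ShardedByS3Key` gives each instance only a subset of the objects. The toy function below illustrates the difference. The object keys are hypothetical, and the round-robin assignment is an assumption made for this sketch, not a description of how SageMaker actually assigns objects.

```python
def assign_objects(keys, instance_count, distribution):
    """Return one list of S3 keys per training instance (illustrative only)."""
    if distribution == "FullyReplicated":
        # Every instance receives a copy of every object.
        return [list(keys) for _ in range(instance_count)]
    if distribution == "ShardedByS3Key":
        # Each instance receives a disjoint subset of the objects.
        # Round-robin is an assumption made for this sketch.
        return [list(keys[i::instance_count]) for i in range(instance_count)]
    raise ValueError(f"unknown distribution: {distribution}")


# Hypothetical object keys under the training prefix.
keys = ["train/part-0.csv", "train/part-1.csv", "train/part-2.csv", "train/part-3.csv"]

sharded = assign_objects(keys, 2, "ShardedByS3Key")
replicated = assign_objects(keys, 2, "FullyReplicated")
print(sharded)
print(replicated)
```

With sharding, it is worth checking that the training prefix holds at least as many objects as there are instances, so that no instance is left without data.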

---

## Train the Model

Training a model on SageMaker follows the same steps as with other ML libraries (e.g. scikit-learn):
1. Initiate a session (we did this up top).
2. Instantiate an estimator object for our algorithm (XGBoost).
3. Define its hyperparameters.
4. Start the training job.

#### A small competition!
SageMaker's XGBoost includes 38 parameters. You can find more information about them [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).
For simplicity, we choose to experiment only with a few of them.


In [None]:
xgb_train = sagemaker.estimator.Estimator(
 image_uri=image_uri,
 instance_type=training_instance_type,
 instance_count=2,
 output_path=model_output,
 base_job_name=f"{prefix.split('/')[-1]}-train",
 sagemaker_session=sagemaker_session,
 role=role,
 disable_profiler=False, # Keep Debugger profiling enabled for the training job
 rules=rules, # Generate the XGBoost training report
)

In [None]:
# Set some hyper parameters
# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
xgb_train.set_hyperparameters(
 objective="reg:squarederror",
 num_round=100,
 early_stopping_rounds=10,
 max_depth=9,
 eta=0.2,
 gamma=4,
 min_child_weight=300,
 subsample=0.8,
)

In [None]:
xgb_train.fit({'train': train_input, 'validation': validation_input})

Before you run the next cell, wait for the training job to finish. You should see output summarizing the training seconds and billable seconds at the end of the training.

In [None]:
training_job_name = xgb_train.latest_training_job.job_name
model_url = xgb_train.model_data

In [None]:
%store training_job_name
%store model_url

---

## Deploy and Evaluate the Model

### Deployment

Now that we've trained the XGBoost algorithm on our data, deploying the model (hosting it behind a real-time endpoint) takes just one function call!

This deployment might take **up to 10 minutes**, and by default the code will wait for the deployment to complete.

If you like, you can instead:

- Add the `wait=False` parameter to the `deploy` call
- Use the [Endpoints page of the SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/endpoints) to check the status of the deployment
- Skip over the *Evaluation* section below (which won't run until the deployment is complete), and start the Hyperparameter Optimization job - which will take a while to run too, so can be started in parallel

In [None]:
xgboost_endpoint_name = "xgboost-endpoint-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
xgboost_predictor = xgb_train.deploy(
 initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=xgboost_endpoint_name
)

In [None]:
# read a few rows from the test dataset in s3
import awswrangler as wr
test_df = wr.s3.read_csv(
 path=test_path, dataset=True, nrows=5, header=None
 )

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

xgboost_predictor.serializer = CSVSerializer()
xgboost_predictor.deserializer = CSVDeserializer()
xgboost_predictor.predict(test_df.iloc[:,1:].values)
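
To make the request and response formats concrete, here is a rough standard-library sketch of the CSV round trip. It illustrates the wire format only and is not the SDK's implementation; the feature values and the response body are made up.

```python
import csv
import io

# Hypothetical feature rows (target column already removed).
rows = [[2.5, 1], [7.1, 2]]

# Request body: one comma-separated line per record.
payload = "\n".join(",".join(str(value) for value in row) for row in rows)

# Hypothetical response body: one prediction per input record.
response_body = "11.2,23.9"

# Parse the CSV text, then convert the string fields to floats.
parsed = list(csv.reader(io.StringIO(response_body)))
predictions = [float(field) for line in parsed for field in line]
print(payload)
print(predictions)
```

In this sketch the parsed fields are strings until they are converted, so expect to do a similar conversion on the predictor's output before computing any metrics.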

---

## Hyperparameter Optimization (HPO)
*Note: with the default settings below, the hyperparameter tuning job can take up to about 20 minutes to complete.*

We will use SageMaker Hyperparameter Optimization (HPO) to automate the search. Specifically, we **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each hyperparameter that we plan to tune.

SageMaker hyperparameter tuning will automatically launch **multiple training jobs** with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (`max_jobs`) and how many of these can happen in parallel (`max_parallel_jobs`).

Tip: `max_parallel_jobs` creates a **trade-off between performance and speed** (better hyperparameter values vs how long it takes to find them). If `max_parallel_jobs` is large, HPO finishes sooner, but each new job has fewer completed results to learn from, so the discovered values may not be optimal. A smaller `max_parallel_jobs` increases the chance of finding optimal values, but HPO takes longer to finish.
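
One rough way to see this trade-off is to count how many sequential "waves" of jobs a tuning run needs. The helper below is a simplification that assumes jobs start and finish in synchronized batches, which real tuning jobs need not do; `sequential_rounds` is a name made up for this sketch.

```python
import math


def sequential_rounds(max_jobs, max_parallel_jobs):
    """Number of batches if jobs ran in fully synchronized waves (a simplification)."""
    return math.ceil(max_jobs / max_parallel_jobs)


for parallel in (1, 3, 10):
    rounds = sequential_rounds(10, parallel)
    # Only rounds after the first can use results from earlier jobs.
    print(f"max_parallel_jobs={parallel}: {rounds} rounds, {rounds - 1} informed by earlier results")
```

Under this simplification, `max_jobs=10` with `max_parallel_jobs=3` gives four waves, while running all ten jobs at once leaves no earlier results to learn from.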

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using the built-in XGBoost algorithm here, it emits predefined metrics, including **validation:rmse** and **train:rmse** for regression objectives.

Root Mean Squared Error (RMSE) measures the typical size of the difference between the predicted and the actual fare, in the same units as the target; lower is better. [See Machine Learning Key Concepts](https://docs.aws.amazon.com/machine-learning/latest/dg/amazon-machine-learning-key-concepts.html)

We elected to monitor *validation:rmse* and to minimize it, as you can see below. In this case (because it's pre-built for us), we only need to specify the metric name.

For more information, refer to the [SageMaker HPO documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).
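
The tuning cells below set `objective_metric_name` to `validation:rmse`. As a reminder of what that number means, here is a minimal pure-Python sketch of the RMSE calculation. The fares and predictions are hypothetical, and the training container computes the metric itself, so this is only for intuition.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical fares and predictions, in the same currency unit.
actual_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 18.0, 33.0]
print(rmse(actual_fares, predicted_fares))
```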

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
 'min_child_weight': ContinuousParameter(1, 10),
 'alpha': ContinuousParameter(0, 2),
 'max_depth': IntegerParameter(1, 10)}

In [None]:
objective_metric_name = 'validation:rmse'
objective_type = 'Minimize'

In [None]:
tuner = HyperparameterTuner(xgb_train,
 objective_metric_name,
 hyperparameter_ranges,
 objective_type=objective_type,
 max_jobs=10,
 max_parallel_jobs=3)

### Launch HPO
Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the job is created, we can go to the SageMaker console to track its progress until it completes.

In [None]:
tuner.fit({'train': train_input, 'validation': validation_input})

In [None]:
# Check the current status of the tuning job
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
 HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)['HyperParameterTuningJobStatus']

In [None]:
# return the best training job name
tuner.best_training_job()

In [None]:
# Deploy the best trained or user specified model to an Amazon SageMaker endpoint
tuner_predictor = tuner.deploy(initial_instance_count=1,
 instance_type='ml.m4.xlarge')

In [None]:
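# Attach the CSV serializer and deserializer (re-imported in case the evaluation cells were skipped)
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer
tuner_predictor.serializer = CSVSerializer()
tuner_predictor.deserializer = CSVDeserializer()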

In [None]:
# Predict
tuner_predictor.predict(test_df.iloc[:,1:].values)[0]

### (Optional) Delete the Endpoint
If you're done with this exercise, run the `delete_endpoint` cells below. This removes the hosted endpoints and avoids charges from instances left running.

In [None]:
xgboost_predictor.delete_endpoint(delete_endpoint_config=True)

In [None]:
tuner_predictor.delete_endpoint(delete_endpoint_config=True)

---

## End of Lab 2