# Music Recommender Lab 2: Model Training using Amazon SageMaker XGBoost Script Mode and Hyperparameter Tuning

#### Supervised Learning with Gradient Boosted Trees
This notebook works well with the **Python 3 (Data Science)** kernel on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances.

---

## Objective

This workshop gives you an **example of training and tuning a machine learning model on SageMaker with [XGBoost as a framework](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#xgboost-modes)**, focusing on the **data interfaces** and SageMaker's automatic **Hyperparameter Optimization** (HPO) capabilities.

This notebook will demonstrate how you can bring your own model by using custom training and inference scripts, similar to those you would use outside of SageMaker, with SageMaker's prebuilt containers for various frameworks like Scikit-learn, PyTorch, and XGBoost.

SageMaker Script Mode is flexible, so you'll also see examples of how to include your own dependencies, such as a custom Python library, in your training and inference code.

The following diagram provides a solution overview:

<img title="SageMaker Script Mode" alt="Solution diagram" src="images/solution-diagram.jpg">

Teaching in-depth data science approaches for tabular data is outside the scope of this lab, but we hope you can use this notebook as a starting point and adapt it to the needs of your future projects.


---

## Prepare our Environment

We'll need to:

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services

While `boto3` is the general AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.

**Note: you must complete Lab 1 before running this notebook, because the training dataset is prepared in Lab 1.**

In [None]:
!pip install awswrangler -q

In [None]:
import os
import boto3
import re
import copy
import datetime
import sagemaker
from sagemaker import get_execution_role
from sagemaker.debugger import Rule, rule_configs
from sagemaker.xgboost import XGBoost
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

import awswrangler as wr

role = get_execution_role()
sagemaker_session = sagemaker.Session()

region = sagemaker_session.boto_region_name
client = sagemaker_session.sagemaker_client

bucket = sagemaker_session.default_bucket()
prefix = 'music-recommendation-workshop'

In [None]:
%store
%store -r

In [None]:
if 'train_path' not in locals() or 'test_path' not in locals():
    # Nothing restored by %store -r: look up the most recent Lab 1 processing job
    response = client.list_processing_jobs(NameContains='music-processing',
                                           SortBy='CreationTime',
                                           SortOrder='Descending')
    job_name = response['ProcessingJobSummaries'][0]['ProcessingJobName']
    outputs = client.describe_processing_job(
        ProcessingJobName=job_name
    )['ProcessingOutputConfig']['Outputs']

    # Pick the S3 URI of the outputs whose names contain 'train' and 'test'
    train_path = [x for x in outputs if 'train' in x['OutputName']][0]['S3Output']['S3Uri']
    test_path = [x for x in outputs if 'test' in x['OutputName']][0]['S3Output']['S3Uri']

    %store train_path
    %store test_path

else:
    print(f'input source is available: {train_path} & {test_path}')


---

## Understand the Algorithm

### Use XGBoost as a built-in algorithm
You can use SageMaker's [**XGBoost Algorithm** as a built-in algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to benefit from performance-optimized, pre-implemented functionality such as multi-instance parallelization and support for multiple input formats.

In general, to use the pre-built algorithms, we'll need to:

- Refer to the [Common Parameters docs](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) to see the **high-level configuration** and what features each algorithm has
- Refer to the [algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to understand the **detail** of the **data formats** and **(hyper)-parameters** it supports

From these docs, we'll understand what data format we need to upload to S3 (next) and how to get the container image URI of the algorithm, which is listed on the Common Parameters page but can also be retrieved through the SDK.

We know from [the algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) that SageMaker XGBoost expects data in the **libSVM** or **CSV** formats, with:

- The target variable in the first column, and
- No header row
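
As a minimal sketch of that layout, the standard-library `csv` module can write rows with the target first and no header. The column names and values below are made up for illustration and are not taken from the Lab 1 dataset.

```python
import csv
import io

# Hypothetical records: "rating" is the target, the rest are features.
records = [
    {"rating": 4.0, "danceability": 0.61, "energy": 0.83},
    {"rating": 2.5, "danceability": 0.32, "energy": 0.47},
]


def to_xgboost_csv(rows, target, features):
    """Serialize rows as header-less CSV with the target in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target]] + [row[name] for name in features])
    return buffer.getvalue()


csv_text = to_xgboost_csv(records, target="rating", features=["danceability", "energy"])
print(csv_text)
```

Each output line starts with the target value, followed by the feature values in a fixed order.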

### Use XGBoost as a framework

Use XGBoost as a framework to run your own training scripts, which can incorporate additional data processing into your training jobs. This is also known as **SageMaker Script Mode**.

- The first "level" of script mode is the ability to define your own training job, model, and inference process without any dependencies. This is done by writing a custom Python script and pointing to that script as the "entry point" when defining your SageMaker training estimator.
- The second "level" of script mode is the ability to modularize and logically organize your custom training jobs, models, and inference processes.
- The third "level" of script mode is the ability to bring your own libraries and dependencies to support custom functionality within your models, training jobs, and inference processes. This widens your customization options and lets you import libraries you have created yourself or Python packages hosted on PyPI.

For examples of using Script Mode with the different frameworks that SageMaker supports, see [this GitHub repo](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-script-mode).
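
To make the "entry point" idea more concrete, here is a hedged sketch of how a script-mode training script typically receives its inputs: hyperparameters arrive as command-line arguments, and data and model locations arrive through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The function name `parse_args` and the defaults are assumptions for illustration; this is not the workshop's `train_deploy_xgboost_with_dependencies.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and container paths for a script-mode training run."""
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed on the command line (names mirror the estimator cell).
    parser.add_argument("--num_round", type=int, default=10)
    parser.add_argument("--K", type=int, default=5)
    # Paths are read from environment variables set inside the training container.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    # Ignore any extra arguments the container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--num_round", "6", "--K", "5"])
print(args.num_round, args.K, args.model_dir, args.train)
```

Outside a training container the environment variables are unset, so the sketch falls back to the default container paths.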


The data used in training was already prepared in Lab 1.



In this example, we implement k-fold cross validation for an XGBoost model using a custom-built library called `my_custom_library`. While XGBoost is supported "out-of-the-box" on SageMaker, the built-in version does not support k-fold cross validation for training. We therefore use script mode, which lets us keep the supported XGBoost container while including our own libraries and dependencies.
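
For readers new to k-fold cross validation, the sketch below shows only the splitting step in plain Python: the data is cut into `k` folds, and each fold takes one turn as the validation set while the rest is used for training. The helper `k_fold_indices` is hypothetical and is not the implementation in `my_custom_library`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

A model would be trained once per fold and the validation scores averaged; in practice the rows are usually shuffled before splitting.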

**Task 1**:
Complete the `train_deploy_xgboost_with_dependencies.py` script file in the code folder.

**Task 2**:
Fill in the parameters of the estimator object below.

In [None]:
hyperparameters = {"num_round": 6, "K": 5}

enable_local_mode_training = False
if enable_local_mode_training:
    train_instance_type = "local"
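    # train_dir is not defined in this notebook: set it to a local data folder before enabling local mode
    inputs = {"train": f"file://{train_dir}"}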
else:
    train_instance_type = "ml.c5.xlarge"
    inputs = {"train": train_path}

estimator_parameters = {
    "entry_point": # FILL IN,
    "source_dir": # FILL IN,
    "dependencies": #[<FILL IN>],
    "instance_type": train_instance_type,
    "instance_count": 1,
    "hyperparameters": hyperparameters,
    "role": role,
    "base_job_name": "xgboost-model",
    "framework_version": "1.0-1",
    "py_version": "py3",
}

estimator = XGBoost(**estimator_parameters)
estimator.fit(inputs)

In [None]:
training_job_name = estimator.latest_training_job.job_name
model_url = estimator.model_data

In [None]:
%store training_job_name
%store model_url

---

## Deploy and Evaluate the Model

### Deployment

Now that we've trained the XGBoost model on our data, deploying it (hosting it behind a real-time endpoint) takes just one function call.

This deployment might take **up to 10 minutes**, and by default the code will wait for the deployment to complete.

If you like, you can instead:

- Add the `wait=False` parameter to the `deploy` call
- Use the [Endpoints page of the SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/endpoints) to check the status of the deployment
- Skip over the *Evaluation* section below (which won't run until the deployment is complete) and start the Hyperparameter Optimization job, which also takes a while to run and so can be started in parallel
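
If you deploy without waiting, you can also check the status from code instead of the console. The sketch below is a generic polling helper; `wait_for_status` and its parameters are hypothetical names, and a stand-in status source is used so the example runs without AWS access.

```python
import time


def wait_for_status(get_status, done=("InService", "Failed"), poll_seconds=30, max_polls=40):
    """Call get_status() until it returns a terminal status or the polls run out."""
    status = get_status()
    for _ in range(max_polls - 1):
        if status in done:
            break
        time.sleep(poll_seconds)
        status = get_status()
    return status


# Stand-in for a real status lookup so the sketch runs anywhere.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In this notebook, `get_status` could be `lambda: client.describe_endpoint(EndpointName=xgboost_endpoint_name)["EndpointStatus"]`, which reads the status through the SageMaker client created earlier.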

In [None]:
time_now = f"{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"
xgboost_endpoint_name = "xgb-endpoint-" + time_now
xgboost_predictor = estimator.deploy(
    initial_instance_count=1, instance_type="ml.m5.xlarge", endpoint_name=xgboost_endpoint_name
)

In [None]:
# Read a few rows from the test dataset in S3 (no header row, target in the first column)
test_df = wr.s3.read_csv(path=test_path, dataset=True, nrows=5, header=None)

In [None]:
xgboost_predictor.serializer = CSVSerializer()
xgboost_predictor.deserializer = CSVDeserializer()
xgboost_predictor.predict(test_df.iloc[:,1:].values)

---

## (Optional) Hyperparameter Optimization (HPO)
*Note, with the default settings below, the hyperparameter tuning job can take up to ~20 minutes to complete.*

We will use SageMaker Hyperparameter Optimization (HPO) to automate the search. Specifically, we **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each hyperparameter that we plan to tune.

SageMaker hyperparameter tuning will automatically launch **multiple training jobs** with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (`max_jobs`) and how many of these can happen in parallel (`max_parallel_jobs`).
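
The loop described above can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in: it uses random search and a toy objective function instead of real training jobs, and it does not use earlier results to choose later candidates, which a Bayesian tuning strategy would do. All names here (`search_space`, `toy_objective`, `random_search`) are hypothetical.

```python
import random

# Hypothetical search space: (low, high) bounds per hyperparameter.
search_space = {"eta": (0.0, 1.0), "alpha": (0.0, 2.0)}


def toy_objective(params):
    """Stand-in for a training job's validation error (lower is better)."""
    return (params["eta"] - 0.3) ** 2 + (params["alpha"] - 1.0) ** 2


def random_search(space, objective, max_jobs, seed=0):
    """Sample max_jobs configurations and return the best (score, params) pair."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_jobs):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


best_score, best_params = random_search(search_space, toy_objective, max_jobs=10)
print(best_score, best_params)
```

Each loop iteration plays the role of one training job, and the returned pair corresponds to the best training job of the tuning run.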

Tip: `max_parallel_jobs` creates a **trade-off between performance and speed** (better hyperparameter values vs how long it takes to find these values). If `max_parallel_jobs` is large, then HPO is faster, but the discovered values may not be optimal. Smaller `max_parallel_jobs` will increase the chance of finding optimal values, but HPO will take more time to finish.
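
A rough way to see the trade-off is to count how many back-to-back batches of jobs are needed. The sketch assumes jobs run in synchronized batches of equal duration, which is a simplification of how a tuning job actually schedules work; `sequential_rounds` is a hypothetical helper.

```python
import math


def sequential_rounds(max_jobs, max_parallel_jobs):
    """Number of back-to-back batches needed to run all training jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


for parallel in (1, 3, 10):
    print(parallel, "parallel ->", sequential_rounds(10, parallel), "round(s)")
```

With 10 jobs, running 3 at a time needs 4 rounds, so later rounds can still learn from earlier results. Running all 10 at once finishes in a single round, but no job can benefit from the outcome of another.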

Next we'll specify the objective metric that we'd like to tune. A full metric definition includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. The SageMaker XGBoost container emits a set of predefined metrics, such as **validation:rmse** and **validation:auc**.

Area Under the ROC Curve (AUC) measures the ability of a binary ML model to predict a higher score for positive examples as compared to negative examples. [See Machine Learning Key Concepts](https://docs.aws.amazon.com/machine-learning/latest/dg/amazon-machine-learning-key-concepts.html). Root mean squared error (RMSE) measures the typical size of the prediction error, so lower values are better.

We elected to monitor *validation:rmse* and to minimize it, as you can see below. Because the metric is predefined, we only need to specify its name.
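
For a metric that is not predefined, the regex does the work of pulling a number out of each log line. The sketch below shows that extraction step with the standard-library `re` module. The log line and the regex are made up for illustration; they are not the exact pattern SageMaker uses for XGBoost.

```python
import re

# Hypothetical metric definition in the Name/Regex shape used for custom metrics.
metric_definition = {
    "Name": "validation:rmse",
    "Regex": r"validation-rmse:([0-9]*\.?[0-9]+)",
}

# Made-up training log line, for illustration only.
log_line = "[5] train-rmse:0.4312 validation-rmse:0.5127"

# The first capture group holds the metric value.
match = re.search(metric_definition["Regex"], log_line)
metric_value = float(match.group(1)) if match else None
print(metric_value)
```

A list of such dictionaries can be passed to the tuner through its `metric_definitions` argument when the training script logs its own metrics.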

For more information, see the [SageMaker HPO documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

In [None]:
hyperparameter_ranges = {
    'eta': ContinuousParameter(0, 1),
    'min_child_weight': ContinuousParameter(1, 10),
    'alpha': ContinuousParameter(0, 2),
    'max_depth': IntegerParameter(1, 10),
}

In [None]:
objective_metric_name = 'validation:rmse'
objective_type = 'Minimize'

In [None]:
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type=objective_type,
                            max_jobs=10,
                            max_parallel_jobs=3)

### Launch HPO
Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the job is created, we can track its progress in the SageMaker console until it completes.

In [None]:
tuner.fit({'train': train_path, 'validation': test_path})

Wait until the HPO job finishes, then proceed to the following cells.

In [None]:
client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)['HyperParameterTuningJobStatus']

In [None]:
# return the best training job name
tuner.best_training_job()

In [None]:
# Deploy the model from the best training job to an Amazon SageMaker endpoint
tuner_predictor = tuner.deploy(initial_instance_count=1,
                           instance_type='ml.m5.xlarge')

In [None]:
# Run predictions on the feature columns (the first column is the target)
tuner_predictor.serializer = CSVSerializer()
tuner_predictor.deserializer = CSVDeserializer()
tuner_predictor.predict(test_df.iloc[:,1:].values)

### (Optional) Delete the Endpoint
If you're done with this exercise, run the `delete_endpoint` cells below. This removes the hosted endpoints and avoids charges from instances left running.

In [None]:
xgboost_predictor.delete_endpoint(delete_endpoint_config=True)

In [None]:
tuner_predictor.delete_endpoint(delete_endpoint_config=True)

---

## End of Lab 2