# MLOps: Initial - Enable experimentation at scale

<div class="alert alert-warning"> 
	⚠️ <strong>PRE-REQUISITE:</strong> Before proceeding with this notebook, please ensure that you have executed the <code>1-data-prep-feature-store.ipynb</code> notebook.
</div>

## Contents

- [Introduction](#Introduction)
- [Recap](#Recap)
- [Experiment tracking](#Experiment-tracking)
- [SageMaker Training](#SageMaker-Training)
- [SageMaker Training with Automatic Model Tuning (HPO)](#SageMaker-Training-with-Automatic-Model-Tuning-(HPO))
- [Model Registry](#Model-Registry)

## Introduction

This is our second notebook, which explores the model training stage of the ML workflow.

Here, we will put on the hat of the `Data Scientist` and perform the modeling task, which includes training a model, performing hyperparameter tuning, evaluating the model, and registering high-performing candidate models in a model registry. This task is highly iterative, so we also need to track our experiments until we reach the desired results.

Similar to the previous notebook on preprocessing datasets, a data scientist could perform the training task using the notebook's local compute and storage with the local data generated during the previous steps. However, this is typically only useful for quick iteration on small datasets. In the interest of time, we do not cover that option in this notebook. Instead, we'll jump into learning how to bring scale to model development tasks using managed SageMaker training and experiment tracking capabilities, combined with curated feature data pulled from SageMaker Feature Store. You'll also perform tuning at scale using SageMaker's automatic hyperparameter tuning capabilities. Finally, you'll register the best-performing model in SageMaker Model Registry.

![Notebook2](images/Notebook2.png)



Let's get started!

**Important:** for this example, we will train a Random Forest Regressor from the Scikit-Learn library using SageMaker's pre-built Scikit-Learn container, with the training logic supplied as a script. We will use CSV data as input. The CSV files are expected to have the target variable in the first column and no header record. Let's query our Feature Store feature groups to get the necessary data.
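
To make the CSV layout concrete, here is a minimal sketch of how rows could be rearranged so that the target comes first and no header is written. The column names and values are hypothetical and only serve as an illustration; the notebook itself gets the same layout from the column order of the Feature Store query.

```python
import csv
import io


def to_training_csv(rows, header, target):
    """Return CSV text with the target column first and no header row."""
    target_idx = header.index(target)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Move the target value to the front; keep the other columns in order.
        reordered = [row[target_idx]] + [v for i, v in enumerate(row) if i != target_idx]
        writer.writerow(reordered)
    return out.getvalue()


# Hypothetical column names and values, for illustration only.
header = ["SQUARE_FEET", "NUM_BEDROOMS", "PRICE"]
rows = [[1200, 3, 250000], [850, 2, 180000]]
training_csv = to_training_csv(rows, header, target="PRICE")
print(training_csv)
```

The header list is used only to locate the target column; it is never written to the output.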

**Imports**

Let's first restore the variables stored by the previous notebook, then install the latest version of the SageMaker SDK to ensure we have the current SageMaker Experiments functionality.

In [None]:
%store -r

In [None]:
!pip install -U sagemaker

In [None]:
import json
import os
import sys
from time import gmtime, strftime

import boto3
import pandas as pd
import sagemaker
from sagemaker import image_uris
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.experiments.run import Run  # SageMaker Experiments
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.inputs import TrainingInput
from sagemaker.model_metrics import ModelMetrics, MetricsSource
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.utils import unique_name_from_base

from helper_library import *

**Session variables**

In [None]:
# Useful SageMaker variables
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role_arn = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name
s3_client = boto3.client('s3', region_name=region)
sagemaker_client = boto3.client('sagemaker', region_name=region)

enable_local_mode_training = False
model_package_group_name = 'synthetic-housing-models'
model_name = 'xgboost-model'


fs_dir = os.path.join(os.getcwd(), 'data/fs_data')
os.makedirs(fs_dir, exist_ok=True)

fs_train_dir = os.path.join(os.getcwd(), 'data/fs_data/train')
os.makedirs(fs_train_dir, exist_ok=True)

fs_validation_dir = os.path.join(os.getcwd(), 'data/fs_data/validation')
os.makedirs(fs_validation_dir, exist_ok=True)

experiment_name = unique_name_from_base('synthetic-housing-regression')

## Recap

So we've processed our data and now have training and validation sets available in Feature Store to be used for training. Since SageMaker training jobs expect the training data to be in S3, let's first export our Feature Store data to S3.

In [None]:
def save_fs_data_to_s3(fg_name, features_to_select, sm_session, file_name, local_path, bucket, bucket_prefix):
    """Query a feature group's offline store, save the result as a header-less CSV, and upload it to S3."""
    fs_group = FeatureGroup(name=fg_name, sagemaker_session=sm_session)
    query = fs_group.athena_query()
    table = query.table_name
    # Order by record_id so the exported rows come back in a deterministic order
    query_string = f'SELECT {features_to_select} FROM "sagemaker_featurestore"."{table}" ORDER BY record_id'
    output_location = f's3://{bucket}/sagemaker-featurestore/query_results/'
    query.run(query_string=query_string, output_location=output_location)
    query.wait()
    df = query.as_dataframe()
    # Write without index or header, matching the CSV layout used for training
    local_file = os.path.join(local_path, file_name)
    df.to_csv(local_file, index=False, header=False)
    # Upload using the s3_client created in the session variables cell
    s3_client.upload_file(local_file, bucket, f'{bucket_prefix}/{file_name}')
    # Return the S3 prefix (not the object key) that contains the uploaded file
    return f's3://{bucket}/{bucket_prefix}'

train_data = save_fs_data_to_s3(
    train_feature_group_name, 
    features_to_select, 
    sagemaker_session, 
    "train.csv", 
    fs_train_dir, 
    bucket, 
    s3_prefix+"/data/fs_data/train"
)
val_data = save_fs_data_to_s3(
    validation_feature_group_name, 
    features_to_select, 
    sagemaker_session, 
    "validation.csv", 
    fs_validation_dir, 
    bucket, 
    s3_prefix+"/data/fs_data/validation"
)
train_data, val_data
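
The order of the columns in the `SELECT` statement determines the order of the columns in the exported CSV, so the target has to be listed first. As a rough sketch of how such a query string could be assembled from a list of feature names: the table and feature names below are hypothetical, and `features_to_select` from the previous notebook is assumed to already be a comma-separated string of this kind.

```python
def build_feature_query(table, features, order_by="record_id", database="sagemaker_featurestore"):
    """Build an Athena SELECT statement for a feature group's offline-store table."""
    if not features:
        raise ValueError("At least one feature name is required")
    # The first entry in `features` becomes the first CSV column (the target).
    columns = ", ".join(features)
    return f'SELECT {columns} FROM "{database}"."{table}" ORDER BY {order_by}'


# Hypothetical table and feature names, for illustration only.
query_string = build_feature_query(
    "train_table_123", ["price", "square_feet", "num_bedrooms"]
)
print(query_string)
```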

Let's compare the dataset distribution of our original dataset and the one read from Feature Store.

In [None]:
# read original training data
df_train_orig = pd.read_csv(sm_processed_train_dir+'/train.csv', header=None)
df_train_orig.describe()

In [None]:
# reading training data from Feature Store
df_train_fs = pd.read_csv(fs_train_dir+'/train.csv', header=None)
df_train_fs.describe()

Great! Our dataset distribution seems intact!
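
Eyeballing two `describe()` tables works for a handful of columns. For wider datasets, a programmatic check may be more convenient. The sketch below compares per-column mean and standard deviation using only the standard library; the rows are hypothetical, and matching summary statistics are a sanity check rather than proof that the two datasets are identical.

```python
import math
import statistics


def column_stats(rows):
    """Return a (mean, population standard deviation) pair for each column."""
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in zip(*rows)]


def same_distribution(rows_a, rows_b, rel_tol=1e-6):
    """Return True if every column's mean and std agree within rel_tol."""
    stats_a, stats_b = column_stats(rows_a), column_stats(rows_b)
    if len(stats_a) != len(stats_b):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9)
        for pair_a, pair_b in zip(stats_a, stats_b)
        for a, b in zip(pair_a, pair_b)
    )


# Hypothetical rows (target, feature), for illustration only.
original = [[250000.0, 1200.0], [180000.0, 850.0], [320000.0, 1600.0]]
reordered = [original[2], original[0], original[1]]
shifted = [[price + 1000.0, sqft] for price, sqft in original]

print(same_distribution(original, reordered))
print(same_distribution(original, shifted))
```

Row order does not affect the result, which matters here because the Feature Store export is ordered by `record_id` and may not match the order of the original file.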

We are ready to train a SageMaker Scikit-Learn model with it! At this point, you could choose to train locally if you're still experimenting with small datasets; however, we're going to move on to training at scale in transient compute environments with SageMaker Training Jobs.

## SageMaker Training

Now that we've prepared our training and validation data, we can move on to use SageMaker's hosted training functionality - [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html). Hosted training is preferred for doing actual training, especially large-scale, distributed training. Unlike training a model on a local computer or server, SageMaker hosted training will spin up a separate cluster of machines managed by SageMaker to train your model. Before starting hosted training, the data must be in S3, or an EFS or FSx for Lustre file system. We uploaded the Feature Store data to S3 in the previous section, so we're good to go here.

Let's go ahead and create a Random Forest Regressor model from the Scikit-Learn library.

In [None]:
from sagemaker.sklearn.estimator import SKLearn

hyperparameters = {'max_depth': 8, 'n_jobs': 4, 'n_estimators': 80}
train_instance_type = 'ml.c5.xlarge'
inputs = {'train': train_data, 'validation': val_data}
    
# Metrics to be captured from logs.
metric_definitions = [{'Name': 'r_squared',
                       'Regex': 'r-squared: ([0-9\\.]+)'},
                      {'Name': 'mse',
                       'Regex': 'MSE: ([0-9\\.]+)'}]

estimator_parameters = {
    'entry_point': './pipeline_scripts/train_deploy_scikitlearn_randomforestregressor.py',
    'framework_version': '0.23-1',
    'py_version': 'py3',
    'instance_type': train_instance_type,
    'instance_count': 1,
    'hyperparameters': hyperparameters,
    'role': role_arn,
    'metric_definitions': metric_definitions,
    'base_job_name': 'randomforestregressor-model',
    'output_path': f's3://{bucket}/{s3_prefix}/',
    'image_scope': 'training'
}
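
The `metric_definitions` tell SageMaker which regular expressions to apply to the training job's log output in order to capture metrics. To see what those two patterns would capture, here is a small offline sketch that applies them to a sample log. The log lines are hypothetical (the real format depends on what the training script prints), and keeping the last match per metric is a choice made for this sketch.

```python
import re

# Same patterns as in the estimator configuration.
metric_definitions = [
    {"Name": "r_squared", "Regex": "r-squared: ([0-9\\.]+)"},
    {"Name": "mse", "Regex": "MSE: ([0-9\\.]+)"},
]


def extract_metrics(log_text, definitions):
    """Return {metric name: last captured value} for each pattern that matches."""
    metrics = {}
    for definition in definitions:
        matches = re.findall(definition["Regex"], log_text)
        if matches:
            metrics[definition["Name"]] = float(matches[-1])
    return metrics


# Hypothetical log output, for illustration only.
sample_log = "training complete\nr-squared: 0.8712\nMSE: 1234.5\n"
extracted = extract_metrics(sample_log, metric_definitions)
print(extracted)
```

One thing to watch for: the character class `[0-9\.]` matches only digits and dots, so a value printed in scientific notation such as `1.2e+05` would be captured as `1.2`, and a negative value would lose its sign.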

Before we actually train the Scikit-Learn model, we'll want to make sure we track experiments.

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) can track all the model training iterations. Experiments are a great way to organize your data science work. You can create experiments to organize all your model development work for:

1. A business use case you are addressing (e.g. create experiment named "customer churn prediction"), or
2. A data science team that owns the experiment (e.g. create experiment named "marketing analytics experiment"), or
3. A specific data science and ML project. Think of it as a "folder" for organizing your "files".

To both train our model and track experiments, we'll just need a few lines of code below.

In [None]:
from IPython.display import display, HTML

# Link to the list of training jobs; the job name is only generated when fit() is called
display(
    HTML(
        '<b>Review the <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs">Training Jobs</a> After About 5 Minutes</b>'.format(
            region
        )
    )
)

with Run(experiment_name=experiment_name, run_name='12345-test') as run:
    estimator = SKLearn(**estimator_parameters)
    estimator.fit(inputs)

Now that training has finished, we can navigate to the Experiments menu in SageMaker Studio and take a gander at our model's MSE. You can find **SageMaker Experiments** on the left menu under **Home**.

If we don't like the MSE, we could improve it by adjusting model hyperparameters. But instead of guessing what hyperparameters we should have, we can let SageMaker search the hyperparameter space in an intelligent way on our behalf.

## SageMaker Training with Automatic Model Tuning (HPO)

[Amazon SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), also known as hyperparameter tuning/optimization, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

You can use SageMaker automatic model tuning with built-in algorithms, custom algorithms, and SageMaker pre-built containers for machine learning frameworks.

We begin by specifying the hyperparameters we wish to tune and the range of values over which to tune each one. We must also specify an objective metric to optimize: in this use case, we'd like to minimize the mean squared error (the `mse` metric defined above).

In [None]:
hyperparameter_ranges = {
  'max_depth': IntegerParameter(6, 10),
  'n_jobs': IntegerParameter(4, 6),
  'n_estimators': IntegerParameter(80, 130),
}

objective_metric_name = 'mse'
objective_type = 'Minimize'
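
To get a feel for the size of this search space, we can count the possible combinations. The sketch below assumes that the bounds of each `IntegerParameter` are inclusive and copies them by hand into a plain dictionary.

```python
from math import prod

# Inclusive (min, max) bounds copied from hyperparameter_ranges.
integer_ranges = {
    "max_depth": (6, 10),
    "n_jobs": (4, 6),
    "n_estimators": (80, 130),
}


def search_space_size(ranges):
    """Count the distinct combinations of inclusive integer ranges."""
    return prod(high - low + 1 for low, high in ranges.values())


total_combinations = search_space_size(integer_ranges)
print(total_combinations)  # 5 * 3 * 51 = 765
```

With 765 candidate combinations, training every one of them would be wasteful, which is why a guided search over the space is attractive.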

Next we specify a HyperparameterTuner object that takes the above definitions as parameters.  Each tuning job must be given a budget:  a maximum number of training jobs.  A tuning job will complete after that many training jobs have been executed.  

We also can specify how much parallelism to employ, in this case two jobs, meaning that the tuning job will complete after two series of two jobs in parallel have completed (so, a total of 4 jobs as set by `max_jobs`).  For the default Bayesian Optimization tuning strategy used here, the tuning search is informed by the results of previous groups of training jobs, so we don't run all of the jobs in parallel, but rather divide the jobs into groups of parallel jobs.  There is a trade-off: using more parallel jobs will finish tuning sooner, but likely will sacrifice tuning search accuracy. 
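
The relationship between the job budget and the degree of parallelism can be sketched with a little arithmetic. This is a simplification of the description above: it assumes jobs run in strict rounds, whereas the service may start a new job as soon as a slot frees up.

```python
import math


def sequential_rounds(max_jobs, max_parallel_jobs):
    """Number of rounds needed if jobs run in groups of max_parallel_jobs."""
    if max_jobs < 1 or max_parallel_jobs < 1:
        raise ValueError("Both arguments must be at least 1")
    return math.ceil(max_jobs / max_parallel_jobs)


# With a budget of 4 jobs, compare different levels of parallelism.
for parallel in (1, 2, 4):
    print(parallel, sequential_rounds(4, parallel))
```

With `max_parallel_jobs=4` there is only a single round, so no job can benefit from the results of an earlier one; with `max_parallel_jobs=1` every job after the first can, at the cost of a longer wall-clock time.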

Now we can launch a hyperparameter tuning job by calling the `fit` method of the HyperparameterTuner object. The tuning job may take several minutes to finish. While you're waiting, the status of the tuning job, including metadata and results for individual training jobs within the tuning job, can be checked in the SageMaker console in the **Hyperparameter tuning jobs** panel.

In [None]:
tuner_parameters = {
    'estimator': estimator,
    'objective_metric_name': objective_metric_name,
    'hyperparameter_ranges': hyperparameter_ranges,
    'metric_definitions': metric_definitions,
    'max_jobs': 4,
    'max_parallel_jobs': 2,
    'objective_type': objective_type,
}

tuner = HyperparameterTuner(**tuner_parameters)

tuning_job_name = f'rf-model-tuning-{strftime("%d-%H-%M-%S", gmtime())}'
display(
    HTML(
        '<b>Review the <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/hyper-tuning-jobs/{}">Tuning Job</a> After About 5 Minutes</b>'.format(
            region, tuning_job_name
        )
    )
)
tuner.fit(inputs, job_name=tuning_job_name)
tuner.wait()
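
Tuning job names are fairly constrained: to the best of our knowledge they are limited to 32 characters and may contain only letters, digits, and hyphens, without a leading or trailing hyphen. If you change the prefix used above, a small check like the following sketch could catch an invalid name before the job is submitted. The helper name and the constants are our own, not part of the SageMaker SDK.

```python
import re
from time import gmtime, strftime

# Assumed limits for tuning job names; check the API reference if in doubt.
MAX_TUNING_JOB_NAME_LENGTH = 32
NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def make_tuning_job_name(prefix, now=None):
    """Append a day-hour-minute-second timestamp to prefix and validate the result."""
    timestamp = strftime("%d-%H-%M-%S", now if now is not None else gmtime())
    name = f"{prefix}-{timestamp}"
    if len(name) > MAX_TUNING_JOB_NAME_LENGTH:
        raise ValueError(f"'{name}' is longer than {MAX_TUNING_JOB_NAME_LENGTH} characters")
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"'{name}' contains characters that are not allowed")
    return name


# A fixed timestamp (the Unix epoch) keeps this example reproducible.
example_name = make_tuning_job_name("rf-model-tuning", gmtime(0))
print(example_name)
```

The timestamp takes 12 characters including the separating hyphen, which leaves at most 20 characters for the prefix.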

After the tuning job is finished, we can use the `HyperparameterTuningJobAnalytics` object from the SageMaker Python SDK to list up to 5 of the training jobs with the best performance. Although the results vary from tuning job to tuning job, the best MSE from the tuning job (under the FinalObjectiveValue column) will likely be lower than the MSE from the hosted training job above, where we used a single set of manually chosen hyperparameters.

In [None]:
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
tuner_metrics.dataframe().sort_values(['FinalObjectiveValue'], ascending=True).head(5)
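
Sorting in ascending order is the right choice here only because our objective type is `Minimize`. A small helper can make that dependency explicit. The sketch below works on plain dictionaries with hypothetical job names and values; the keys mirror the `TrainingJobName` and `FinalObjectiveValue` columns of the analytics dataframe.

```python
def best_training_jobs(records, objective_type="Minimize", k=5):
    """Return the k best records, respecting the direction of the objective."""
    if objective_type not in ("Minimize", "Maximize"):
        raise ValueError(f"Unknown objective type: {objective_type}")
    # Larger is better only when maximizing.
    descending = objective_type == "Maximize"
    ranked = sorted(records, key=lambda r: r["FinalObjectiveValue"], reverse=descending)
    return ranked[:k]


# Hypothetical results, for illustration only.
hypothetical_results = [
    {"TrainingJobName": "job-a", "FinalObjectiveValue": 3.2},
    {"TrainingJobName": "job-b", "FinalObjectiveValue": 1.7},
    {"TrainingJobName": "job-c", "FinalObjectiveValue": 2.4},
]
best = best_training_jobs(hypothetical_results, "Minimize", k=2)
print([r["TrainingJobName"] for r in best])
```

In the notebook, the same helper could be fed with `tuner_metrics.dataframe().to_dict("records")`.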

## Model Registry

With the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) you can do the following:

- Catalog models for production.
- Manage model versions.
- Associate metadata, such as training metrics, with a model.
- Manage the approval status of a model.
- Deploy models to production.
- Automate model deployment with CI/CD.

You can catalog models by creating model package groups that contain different versions of a model. You can create a model group that tracks all of the models that you train to solve a particular problem. You can then register each model you train and the model registry adds it to the model group as a new model version. A typical workflow might look like the following:

- Create a model group.
- Create an ML pipeline that trains a model.
- For each run of the ML pipeline, create a model version that you register in the model group you created in the first step.

So first we'll create a [Model Package Group](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-model-group.html) in which we can store/group all related models and their versions.

In [None]:
sagemaker_client.create_model_package_group(ModelPackageGroupName=model_package_group_name,
                                            ModelPackageGroupDescription='Models predicting synthetic housing prices')                                            
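
If you run this notebook more than once, creating a group whose name is already taken will likely fail. One way to make the step repeatable is to check for an existing group first. The following is a sketch under that assumption; the helper name is our own, and it takes the boto3 SageMaker client as an argument so that it can be exercised with a stand-in object.

```python
def ensure_model_package_group(client, name, description):
    """Create the model package group unless one with this exact name exists.

    Returns True if the group was created, False if it already existed.
    """
    # NameContains is a substring filter, so the exact name is checked below.
    # Only the first page of results is inspected (up to 100 groups).
    response = client.list_model_package_groups(NameContains=name, MaxResults=100)
    existing = {
        group["ModelPackageGroupName"]
        for group in response["ModelPackageGroupSummaryList"]
    }
    if name in existing:
        return False
    client.create_model_package_group(
        ModelPackageGroupName=name,
        ModelPackageGroupDescription=description,
    )
    return True
```

In this notebook, it could be called as `ensure_model_package_group(sagemaker_client, model_package_group_name, 'Models predicting synthetic housing prices')`.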

Next we'll register the best model found by the tuning job.

In [None]:
# Register model
best_estimator = tuner.best_estimator()
model_metrics = create_training_job_metrics(best_estimator, s3_prefix, region, bucket)

In [None]:
model_package = best_estimator.register(content_types=['text/csv'],
                                        response_types=['application/json'],
                                        inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
                                        transform_instances=['ml.m5.xlarge'],
                                        image_uri=best_estimator.image_uri,
                                        model_package_group_name=model_package_group_name,
                                        #model_metrics=model_metrics,
                                        approval_status='PendingManualApproval',
                                        description='Random Forest model to predict synthetic housing prices',
                                        model_name=model_name,
                                        name=model_name)
model_package_arn = model_package.model_package_arn
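
The model version is registered as `PendingManualApproval`, so someone still has to review it before it can move forward. As a hedged sketch of what that later step could look like, the helper below changes the approval status through the boto3 SageMaker client's `update_model_package` call. The helper name and the validation are our own additions.

```python
VALID_APPROVAL_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}


def set_approval_status(client, model_package_arn, status, note=""):
    """Update the approval status of a registered model version."""
    if status not in VALID_APPROVAL_STATUSES:
        raise ValueError(f"Unknown approval status: {status}")
    request = {"ModelPackageArn": model_package_arn, "ModelApprovalStatus": status}
    if note:
        # Optional free-text explanation stored alongside the status.
        request["ApprovalDescription"] = note
    return client.update_model_package(**request)
```

For example, after reviewing the metrics, a reviewer could run `set_approval_status(sagemaker_client, model_package_arn, 'Approved', note='Reviewed tuning results')`.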

We'll store relevant variables to be used in the next notebooks.

In [None]:
%store model_package_arn
%store model_name
%store model_package_group_name
#%store model_metrics