# SageMaker Autopilot Model Training Notebook for Automated Machine Learning
This notebook uses SageMaker Autopilot to train a model on the training dataset. The [Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/) service automates much of the machine learning lifecycle. ML model development is an iterative process with several tasks that data scientists work through to produce an effective model that solves a business problem. The process typically involves:
* Data exploration and analysis
* Feature engineering
* Model development
* Model training and tuning
* Model deployment
* Model monitoring and retraining 

Model development can be a time-consuming process. To address the challenges of model development, AWS introduced [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html), an Automated Machine Learning (AutoML) service, at AWS re:Invent 2019. Amazon SageMaker Autopilot takes a white-box approach to AutoML, producing Python notebooks for data analysis, feature engineering and model training. Data scientists can examine these notebooks, which gives them full control over and visibility into model development. The image below describes how SageMaker Autopilot works.

![SageMaker Autopilot](../images/iso20022-prototype-AutoML.png)

Note that model monitoring and retraining are needed to detect data and model drift in real-world usage, to learn how effective the model is, and to retrain it to correct the errors it may be making. The Amazon SageMaker [Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) service can help with monitoring.

In this notebook we use Amazon SageMaker Autopilot to **train multiple models** and select the best-performing model using the model evaluation metric `Accuracy`.

The problem is defined as a `binary classification` problem: predicting whether a pacs.008 XML message will be processed successfully or lead to exception processing. The model predicts `Success` (1) or `Failure` (0).
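
To make the label encoding and the `Accuracy` metric concrete, here is a minimal sketch. The `LABELS` mapping and the `accuracy` helper are hypothetical names introduced for illustration, and the four outcomes are made up; only the Success = 1 / Failure = 0 encoding comes from the text above.

```python
# Hypothetical names for illustration; only the 1/0 encoding is from the notebook.
LABELS = {"Success": 1, "Failure": 0}


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up outcomes for four messages.
y_true = [LABELS[s] for s in ["Success", "Failure", "Success", "Success"]]
y_pred = [LABELS[s] for s in ["Success", "Success", "Success", "Failure"]]
print(accuracy(y_true, y_pred))
```

Accuracy weights both classes by their frequency, so on a heavily imbalanced dataset a model that always predicts the majority class can still score well.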

## Environment Setup

In [None]:
import os
import boto3
import pandas as pd
import numpy as np
import sagemaker
from sagemaker import get_execution_role

sm = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
region = boto3.session.Session().region_name

role = get_execution_role()
print("Notebook is running with assumed role {}".format(role))
print("Working with AWS services in the {} region".format(region))

### Provide S3 Bucket Name

In [None]:
# Working directory for the notebook
WORKDIR = os.getcwd()
BASENAME = os.path.dirname(WORKDIR)

# Store all prototype assets in this bucket
s3_bucket_name = 'iso20022-prototype-t3'
s3_bucket_uri = 's3://' + s3_bucket_name

# Prefix for all files in this prototype
prefix = 'iso20022'

pacs008_prefix = prefix + '/pacs008'
raw_data_prefix = pacs008_prefix + '/raw-data'
labeled_data_prefix = pacs008_prefix + '/labeled-data'
training_data_prefix = pacs008_prefix + '/automl/training-data'
training_headers_prefix = pacs008_prefix + '/automl/training-headers'
test_data_prefix = pacs008_prefix + '/automl/test-data'
training_job_output_prefix = pacs008_prefix + '/training-output'

print(f"Training data will be uploaded to {s3_bucket_uri + '/' + training_data_prefix}")
print(f"Test data will be uploaded to {s3_bucket_uri + '/' + test_data_prefix}")
print(f"Training job output will be stored in {s3_bucket_uri + '/' + training_job_output_prefix}")

In [None]:
labeled_data_location = s3_bucket_uri + '/' + labeled_data_prefix
training_data_location = s3_bucket_uri + '/' + training_data_prefix
test_data_location = s3_bucket_uri + '/' + test_data_prefix
print(f"Raw labeled data location = {labeled_data_location}")
print(f"Training data location = {training_data_location}")
print(f"Test data location = {test_data_location}")
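
The cells above build S3 locations by concatenating strings with `'/'`. As a sketch of an alternative, the hypothetical helper `s3_uri` below (not part of the notebook or of any AWS library) joins a bucket and key parts while guarding against doubled or trailing slashes.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI without doubled slashes."""
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "s3://" + "/".join([bucket.strip("/")] + cleaned)


# Same bucket and prefixes as used in this notebook.
print(s3_uri("iso20022-prototype-t3", "iso20022/pacs008", "automl/training-data"))
```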

# Split Labeled Dataset to Training and Test Datasets

In [None]:
# Download labeled raw dataset from S3
s3 = boto3.client('s3')
s3.download_file(s3_bucket_name, labeled_data_prefix + '/labeled_data.csv', 'labeled_data.csv')
df = pd.read_csv('labeled_data.csv')
df.shape

In [None]:
df.head()

### Select Features

Autopilot performs its own feature engineering, so this step is not required. This section is here only for experimentation, to see whether providing selected features improves the model. Uncomment the cell below to experiment.

In [None]:
# Training features
# fts=[
# 'y_target', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd',
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
# ]

# # New data frame with selected features
# selected_df = df[fts]
 
# print(f"selected_df shape: {selected_df.shape}") 
# selected_df.head()
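
Indexing a DataFrame with a column list raises an error if any listed column is absent. A small pre-check, sketched below, separates the requested feature names into those present and those missing. `split_known_columns` is a hypothetical helper, and the short `available` list stands in for `list(df.columns)`.

```python
def split_known_columns(requested, available):
    """Return (present, missing) column names, preserving the requested order."""
    available_set = set(available)
    present = [c for c in requested if c in available_set]
    missing = [c for c in requested if c not in available_set]
    return present, missing


# Illustrative stand-in; in the notebook this would be list(df.columns).
available = ["y_target", "Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry"]
requested = ["y_target", "Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry"]

present, missing = split_known_columns(requested, available)
print("present:", present)
print("missing:", missing)
```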

### Split into Training and Test Datasets

In [None]:
from sklearn.model_selection import train_test_split
# Split raw labeled data to training and test datasets
print('Splitting processed dataset into training and test datasets...')

X_train, X_test, y_train, y_test = train_test_split(df, df['y_target'], test_size=0.2, random_state=20, shuffle=True)
# Uncomment if experimenting by selecting features for training.
#X_train, X_test, y_train, y_test = train_test_split(selected_df, selected_df['y_target'], test_size=0.2, random_state=20, shuffle=True)

print(f"X_train shape: {X_train.shape}") 
print(f"X_test shape: {X_test.shape}") 
print(f"y_train shape: {y_train.shape}") 
print(f"y_test shape: {y_test.shape}") 

train_data_output_path = WORKDIR + '/train_data.csv'

test_data_output_path = WORKDIR + '/test_data.csv'

print("Saving training data with headers to {}".format(train_data_output_path))
X_train.to_csv(train_data_output_path, index=False)

print('Saving test data with headers to {}'.format(test_data_output_path))
X_test.to_csv(test_data_output_path, index=False)

s3.upload_file(train_data_output_path, s3_bucket_name, training_data_prefix + '/train_data.csv')
s3.upload_file(test_data_output_path, s3_bucket_name, test_data_prefix + '/test_data.csv')

print(f'Uploaded train data with headers to {training_data_location}')
print(f'Uploaded test data with headers to {test_data_location}')

print("Pre-processing Complete.")
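
Because the job objective is `Accuracy`, it can be worth checking that the training and test sets have similar class proportions after the split. The sketch below uses a hypothetical `class_shares` helper and made-up labels; in the notebook the inputs would be `y_train.tolist()` and `y_test.tolist()`.

```python
from collections import Counter


def class_shares(labels):
    """Return the share of each class label as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only.
train_labels = [1, 1, 1, 0, 1, 1, 1, 0]
test_labels = [1, 0]

print("train:", class_shares(train_labels))
print("test: ", class_shares(test_labels))
```

If the shares differ noticeably, passing the label column through the `stratify` argument of `train_test_split` is one option to consider.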

# Create a Model Using SageMaker Autopilot

## Setting Up the SageMaker Autopilot Job

In [None]:
input_data_config = [{
 'DataSource': {
 'S3DataSource': {
 'S3DataType': 'S3Prefix',
 'S3Uri': training_data_location ## Where the training data is stored 
 }
 },
 'TargetAttributeName': 'y_target' ## Name of the target value 
 }
 ]

output_data_config = {
 'S3OutputPath': s3_bucket_uri + '/' + training_job_output_prefix ## Where Autopilot stores the job output (model artifacts, notebooks, metrics)
 }

autoMLJobConfig={
 'CompletionCriteria': {
 'MaxCandidates': 5 ## Number of models you want to try 
 }
}

autoMLJobObjective = {
 "MetricName": "Accuracy" ## Metric we want to use to evaluate the model
}
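
The completion criteria above cap only the number of candidates. As a sketch, the hypothetical builder below also accepts optional time limits and maps them to the `MaxAutoMLJobRuntimeInSeconds` and `MaxRuntimePerTrainingJobInSeconds` keys of `CompletionCriteria`. The numeric values in the example are arbitrary, not recommendations.

```python
def build_automl_job_config(max_candidates,
                            max_job_runtime_s=None,
                            max_training_job_runtime_s=None):
    """Build an AutoMLJobConfig dict with optional runtime limits."""
    if max_candidates < 1:
        raise ValueError("max_candidates must be at least 1")
    criteria = {"MaxCandidates": max_candidates}
    if max_job_runtime_s is not None:
        criteria["MaxAutoMLJobRuntimeInSeconds"] = max_job_runtime_s
    if max_training_job_runtime_s is not None:
        criteria["MaxRuntimePerTrainingJobInSeconds"] = max_training_job_runtime_s
    return {"CompletionCriteria": criteria}


# Arbitrary example limits: 5 candidates, 2 hours overall, 20 minutes per training job.
example_config = build_automl_job_config(5, max_job_runtime_s=7200,
                                         max_training_job_runtime_s=1200)
print(example_config)
```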

In [None]:
print(input_data_config)

## Start SageMaker Autopilot job

In [None]:
from time import gmtime, strftime, sleep
timestamp_suffix = strftime('%Y-%m-%d-%H-%M', gmtime())

auto_ml_job_name = 'pacs008-automl-' + timestamp_suffix
print('AutoMLJobName: ' + auto_ml_job_name)

#auto_ml_job_name = 'automl-iso20022-2021-11-25-16-13' # selected_df
#auto_ml_job_name = '' #full labeled df

sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
 InputDataConfig=input_data_config,
 OutputDataConfig=output_data_config,
 AutoMLJobConfig=autoMLJobConfig,
 AutoMLJobObjective=autoMLJobObjective,
 ProblemType="BinaryClassification", ## Here we specify what type of problem we have
 RoleArn=role)
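
The job name is built from a fixed prefix and a timestamp. To my knowledge the API limits `AutoMLJobName` to 32 characters made of letters, digits and hyphens; treat both the limit and the pattern below as assumptions to check against the current API reference. `make_job_name` is a hypothetical helper that fails early instead of letting the API call reject the name.

```python
import re
from time import gmtime, strftime

# Assumptions: verify both against the current SageMaker API reference.
MAX_JOB_NAME_LENGTH = 32
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def make_job_name(prefix, when=None):
    """Return '<prefix>-<UTC timestamp>' or raise ValueError if it looks invalid."""
    suffix = strftime("%Y-%m-%d-%H-%M", when if when is not None else gmtime())
    name = f"{prefix}-{suffix}"
    if len(name) > MAX_JOB_NAME_LENGTH:
        raise ValueError(f"job name {name!r} exceeds {MAX_JOB_NAME_LENGTH} characters")
    if not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"job name {name!r} contains unsupported characters")
    return name


print(make_job_name("pacs008-automl"))
```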

### Monitor Training Job

In [None]:
from time import sleep
print('JobStatus - Secondary Status')
print('------------------------------')

#auto_ml_job_name = 'automl-iso20022-2021-11-09-22-42'
print('AutoMLJobName: ' + auto_ml_job_name)
describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
job_run_status = describe_response['AutoMLJobStatus']
 
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
 describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
 job_run_status = describe_response['AutoMLJobStatus']
 
 print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
 sleep(30)
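
The loop above polls until the job reaches a terminal status, with no upper bound on the wait. A sketch of the same pattern with a timeout follows. `wait_for_job` is a hypothetical helper that takes the describe call as a function, so it can be tried without AWS access; the fake responses are made up.

```python
import time

TERMINAL_STATUSES = ("Completed", "Failed", "Stopped")


def wait_for_job(describe, poll_seconds=30, timeout_seconds=4 * 3600, sleep=time.sleep):
    """Call describe() until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe()
        status = response["AutoMLJobStatus"]
        print(status + " - " + response.get("AutoMLJobSecondaryStatus", ""))
        if status in TERMINAL_STATUSES:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Offline demonstration with made-up responses instead of a real describe call.
fake_responses = iter([
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
    {"AutoMLJobStatus": "Completed", "AutoMLJobSecondaryStatus": "Completed"},
])
final = wait_for_job(lambda: next(fake_responses), poll_seconds=0, sleep=lambda s: None)
```

In this notebook the call would look like `wait_for_job(lambda: sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name))`.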

### Autopilot Training Results

In [None]:
import pprint

best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
pprint.pprint(best_candidate)
print('\n')
print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

In [None]:
best_candidate['InferenceContainers'][1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,probability'})
pprint.pprint(best_candidate['InferenceContainers'][1])

best_candidate['InferenceContainers'][2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label,probability'})
best_candidate['InferenceContainers'][2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,probability'})
pprint.pprint(best_candidate['InferenceContainers'][2])

pprint.pprint(best_candidate)
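
The cell above updates containers at fixed positions 1 and 2, which assumes the best candidate has a three-container inference pipeline. The sketch below makes that assumption explicit and works on a copy rather than mutating the job description. `with_probability_output` is a hypothetical helper and the container dictionaries are made-up stand-ins.

```python
import copy


def with_probability_output(containers, keys="predicted_label,probability"):
    """Return a copy of a three-container pipeline configured to emit the given keys."""
    if len(containers) != 3:
        raise ValueError(f"expected 3 inference containers, got {len(containers)}")
    updated = copy.deepcopy(containers)
    # Second container: emit both the label and its probability.
    updated[1].setdefault("Environment", {})["SAGEMAKER_INFERENCE_OUTPUT"] = keys
    # Third container: accept and pass through both values.
    updated[2].setdefault("Environment", {})["SAGEMAKER_INFERENCE_INPUT"] = keys
    updated[2]["Environment"]["SAGEMAKER_INFERENCE_OUTPUT"] = keys
    return updated


# Made-up stand-ins for best_candidate['InferenceContainers'].
original = [{"Environment": {}}, {"Environment": {}}, {"Environment": {}}]
updated = with_probability_output(original)
print(updated[1]["Environment"])
print(updated[2]["Environment"])
```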

In [None]:
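# Sort explicitly in descending order so that candidates[0] is the best-scoring candidate
candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue', SortOrder='Descending')['Candidates']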
for index, candidate in enumerate(candidates):
 print(str(index) + " " + candidate['CandidateName'] + " " + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))
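
The position of the best candidate in the returned list depends on the sort order applied by the API. As a sketch, candidates can also be ranked locally so the ordering does not rely on that. `rank_candidates` is a hypothetical helper, and the candidate names and metric values below are made up.

```python
def rank_candidates(candidates, higher_is_better=True):
    """Return the candidates that have a final objective metric, best first."""
    scored = [c for c in candidates if "FinalAutoMLJobObjectiveMetric" in c]
    return sorted(
        scored,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=higher_is_better,
    )


# Made-up candidates for illustration only.
example_candidates = [
    {"CandidateName": "candidate-a", "FinalAutoMLJobObjectiveMetric": {"Value": 0.91}},
    {"CandidateName": "candidate-b"},  # e.g. a candidate without a final metric
    {"CandidateName": "candidate-c", "FinalAutoMLJobObjectiveMetric": {"Value": 0.95}},
]

ranked = rank_candidates(example_candidates)
for index, candidate in enumerate(ranked):
    print(index, candidate["CandidateName"], candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
```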

### Autopilot Generated Candidate Model Training Notebook

In [None]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']

### Autopilot Generated Data Exploration Notebook

In [None]:
sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
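
The two notebook locations are S3 URIs. To download them with `s3.download_file`, as done earlier for the labeled data, the URI first has to be split into a bucket and a key. `split_s3_uri` is a hypothetical helper, and the URI in the example is a placeholder, not an actual Autopilot output path.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/some/prefix/notebook.ipynb")
print(bucket, key)
# In this notebook: s3.download_file(bucket, key, 'notebook.ipynb')
```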

## Store Training Job Name for use during Model Deployment
The model deployment notebook uses this name to deploy a SageMaker inference endpoint: it uses SageMaker hosting services to deploy the model and expose an endpoint that users can call to make predictions.

In [None]:
# Use the best candidate reported by the job rather than relying on list order
training_job_name = best_candidate_name
%store training_job_name

In [None]:
%store

In [None]:
# timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
#model_name = 'pacs008-automl-' + timestamp_suffix

model_name = auto_ml_job_name
model = sm.create_model(Containers=best_candidate['InferenceContainers'],
 ModelName=model_name,
 ExecutionRoleArn=role)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

In [None]:
%store model_name

In [None]:
%store