# Notebook 2 - Run Processing and Training using SageMaker Resources

---

## Contents

1. [Introduction](#Introduction)
1. [Sourcing the libraries](#Sourcing-the-libraries)
1. [Task 1: Set Up Global Variables](#Task-1:-Set-Up-Global-Variables)
1. [Task 2: Run the Data Processing Job](#Task-2:-Run-the-Data-Processing-Job)
1. [Task 3: Run the Model Training Job](#Task-3:-Run-the-Model-Training-Job)
1. [Task 4: Deploy the Endpoint](#Task-4:-Deploy-the-Endpoint)
---

## Introduction

This notebook demonstrates the use of Amazon SageMaker XGBoost to process data, train a model and host an endpoint using [SageMaker](https://docs.aws.amazon.com/sagemaker/index.html) Resources.

We use the [Predictive Maintenance Dataset](https://static.us-east-1.prod.workshops.aws/public/6f2f7cb1-bfda-4b34-ae39-928502784393/static/datasets/maintenance_dataset.csv), originally from the [UCI data repository](http://archive.ics.uci.edu/ml). More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset).

---
## Sourcing the libraries

In [None]:
# cell 1

import sagemaker
from sagemaker import get_execution_role
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
import pandas as pd
import numpy as np
import urllib
import boto3
import json
import os
import time

---
## Task 1: Set Up Global Variables
**Steps**
* Put the name of your bucket and the file name into `S3_BUCKET` and `KEY` respectively, as you did in the previous notebook (`cell 2`)
* Optional: change the `PREFIX` name (`cell 2`)

### Set S3 bucket and data prefix

In [None]:
# cell 2

# Specify where the training and validation data will be uploaded
S3_BUCKET = 'YOUR_S3_BUCKET' # YOUR_S3_BUCKET
KEY = "YOUR_OBJECT_ON_S3" # YOUR_OBJECT_ON_S3
PREFIX = 'pred-maintenance'
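
The jobs in this notebook build S3 locations from these three variables. As a minimal sketch, the helper below joins a bucket name and key parts into an `s3://` URI and fails early if the placeholder bucket name was left unchanged. The name `s3_uri` and the example bucket are hypothetical; this helper is not part of the SageMaker SDK.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI (hypothetical helper)."""
    # Catch the unedited placeholder from cell 2 before any job is started.
    if not bucket or bucket.startswith("YOUR_"):
        raise ValueError("Replace the placeholder bucket name before running the jobs.")
    # Strip stray slashes so the parts can be joined with exactly one "/".
    key = "/".join(p.strip("/") for p in parts if p and p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# Example with made-up values; in the notebook you would pass S3_BUCKET, PREFIX, KEY.
print(s3_uri("example-bucket", "pred-maintenance", "model"))
```

Checking the placeholder up front is a design choice: a wrong bucket name otherwise only surfaces after a job has been submitted.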

### Set role and global vars

In `cell 3`, we set the global variables used by the SageMaker Resources:
* `sagemaker_session` is an object that manages interactions with the Amazon SageMaker APIs and any other AWS services needed.
* `region` is the AWS Region where you want to create your object.
* `role` is the IAM role that the instances will use when executing a job.

In this notebook, we will move data processing, model training and inference into [SageMaker Docker containers](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html). For each SageMaker job, a new instance is started to execute the code. The instance specifications are defined in `cell 3`.

To get familiar with the [SageMaker](https://docs.aws.amazon.com/sagemaker/index.html) resources, also look at:
* [SageMaker Python SDK Documentation](https://sagemaker.readthedocs.io/en/stable/#)
* [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)


In [None]:
# cell 3 

# Create the SageMaker session, then look up its region and the execution role
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()

# Set your instance count and type
instance_type = 'ml.m5.xlarge'
instance_count = 1

---
## Task 2: Run the Data Processing Job

In this task, we are going to run a processing script using [SageMaker Processing Job](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html). To do so, we:
1. Instantiate the Processor by specifying the Docker image location and the instance specifications
1. Start the Processing Job by specifying the script to run, the inputs and the outputs.

Before you can run this code, please update the file found under `src/preprocess.py` by adding the required code snippets from your notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`.

**Steps**

* Open the notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`
* Open the file `src/preprocess.py`
* Fill the `ACTION` parts of the `preprocess.py` file by copying and pasting your code from the notebook
    * Here we copy the `feature_column_names` and `label_column` as well as the `train_test_split`
    
### Retrieve the XGBoost image from SageMaker and instantiate the Processor
In `cell 4`, we are:
* Retrieving the location of the XGBoost Docker image from [Amazon Elastic Container Registry (Amazon ECR)](https://aws.amazon.com/ecr/).
* Defining the Processor by specifying the location of the Docker image and the instance specification.

In [None]:
# cell 4 

# retrieving the image location on Amazon ECR
framework_version = "1.3-1"
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version=framework_version,
    py_version="py3",
    instance_type=instance_type)

# Defining the Processor
script_processor = ScriptProcessor(
    image_uri=image_uri,
    instance_type=instance_type,
    instance_count=instance_count,
    base_job_name=PREFIX,
    command=["python3"],
    sagemaker_session=sagemaker_session,
    role=role,
    volume_size_in_gb=10,
)

### Running the Processing Job

Running `cell 5` will start a [SageMaker Processing Job](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.html). The processing job will take about 4-5 minutes to complete.

To start the SageMaker Processing job, we specify:
* `code` - an S3 URI or a local path to the script file to run.
* `arguments` - a list of string arguments to be passed to the processing job.
* `outputs` - a list of the outputs for the processing job.

You can refer to the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html) to find out about the other possible arguments.

In [None]:
# cell 5

script_processor.run(
    code="src/preprocess.py", # point to your file
    arguments=["--input-data", f"s3://{S3_BUCKET}/{KEY}"], # tell the script where to find your input data
    outputs=[ # make sure that all files saved in the container are properly synced back to S3
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation")])
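
The `arguments` list in `cell 5` reaches the script as ordinary command-line arguments. The sketch below shows one way `src/preprocess.py` could read the `--input-data` value and name the two output directories from `cell 5`. The function name `parse_preprocess_args` and the example URI are hypothetical, and the script shipped with the workshop may be organized differently.

```python
import argparse

# Directories listed as ProcessingOutput sources in cell 5. Files written
# here inside the container are uploaded to S3 when the job finishes.
TRAIN_DIR = "/opt/ml/processing/train"
VALIDATION_DIR = "/opt/ml/processing/validation"


def parse_preprocess_args(argv=None):
    """Parse the command-line arguments passed by the processing job."""
    parser = argparse.ArgumentParser()
    # argparse exposes "--input-data" as the attribute "input_data".
    parser.add_argument("--input-data", type=str, required=True)
    return parser.parse_args(argv)


# Example with a made-up URI; inside the job, argv comes from the container.
args = parse_preprocess_args(["--input-data", "s3://example-bucket/maintenance_dataset.csv"])
print(args.input_data)
```

After parsing, the script would load the data, apply the split copied from the first notebook, and write the resulting files into `TRAIN_DIR` and `VALIDATION_DIR`.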

---
## Task 3: Run the Model Training Job

Before you can run this code, please update the file found under `src/train.py` by adding the required code snippets from your notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`.

**Steps**

* Open the notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`
* Open the file `src/train.py`
* Fill the `ACTION` parts of the `train.py` file by copying and pasting your code from the notebook
    * Here we copy the `numeric_features` and `categorical_features`
    * Create an `X_train`, `y_train`, `X_val` and `y_val` from your datasets
    * Add your model code

### Create your training container/estimator in SageMaker

In `cell 6`, we define an XGBoost Estimator by specifying:
* the location of the training script
* the hyperparameters
* the role and instance specification
* the output path on S3

In [None]:
# cell 6

# Set your code folder
source_dir = 'src/'
entry_point = 'train.py'

# Set the hyperparameters for this estimator
hyperparameters = {
    'n_estimators': 100,
    'max_depth': 10,
    'random_state': 42
}

xgb_estimator = XGBoost(
    entry_point=entry_point,
    source_dir=source_dir,
    hyperparameters=hyperparameters,
    role=role,
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version=framework_version,
    sagemaker_session=sagemaker_session,
    output_path=f's3://{S3_BUCKET}/{PREFIX}/model/',
)
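
In script mode, SageMaker passes each entry of `hyperparameters` to the entry point as a command-line argument and exposes the model and data channel directories through `SM_*` environment variables. The sketch below shows how `train.py` might read them. It assumes the channels are named `train` and `validation`, as in `cell 7`; the function name `parse_train_args` is hypothetical and the workshop's `train.py` may differ.

```python
import argparse
import os


def parse_train_args(argv=None):
    """Read hyperparameters and container paths for a script-mode training job."""
    parser = argparse.ArgumentParser()
    # Hyperparameters mirror the keys of the `hyperparameters` dict in cell 6.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    parser.add_argument("--random_state", type=int, default=42)
    # Directories come from environment variables set inside the container.
    # The fallbacks are the usual container paths and let the script run locally.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument(
        "--validation",
        default=os.environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"),
    )
    # parse_known_args ignores any extra arguments this sketch does not declare.
    args, _ = parser.parse_known_args(argv)
    return args


# Example with made-up values; inside the job, argv comes from the container.
args = parse_train_args(["--n_estimators", "50", "--max_depth", "3"])
print(args.n_estimators, args.max_depth, args.random_state)
```

The trained model would then be saved under `args.model_dir`, which is the directory SageMaker packages and uploads to the estimator's `output_path`.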

### Run the Training Job

Running `cell 7` will start a [SageMaker Training Job](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-mkt-algo-train.html). The training job will take about 4-5 minutes to complete.

To start the SageMaker Training job, we specify:
* `inputs` - the inputs for the training job. Refer to the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.fit) for additional information regarding the different supported formats.

You can refer to the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.fit) to find out about the other possible arguments.

In [None]:
# cell 7

# Train the model
xgb_estimator.fit({
    'train': script_processor.jobs[-1].outputs[0].destination,
    'validation': script_processor.jobs[-1].outputs[1].destination})

---
## Task 4: Deploy the Endpoint

In `cell 8`, we deploy a [SageMaker Endpoint](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.deploy). This will take another 4-5 minutes.

In [None]:
# cell 8

endpoint_name = "predictive-maintenance-endpoint"
xgb_estimator.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    endpoint_name=endpoint_name)
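
By default, `deploy` waits until the endpoint is ready. If you reconnect to the notebook later and want to re-check the endpoint before invoking it, a polling loop like the sketch below is one option. The function name `wait_for_endpoint` and its parameters are hypothetical; it relies only on the `describe_endpoint` call of the Boto3 SageMaker client and the `EndpointStatus` field in its response.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=40):
    """Poll the endpoint status until it is InService (hypothetical helper)."""
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy.")
        # Any other status (for example "Creating") means we keep waiting.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was not ready after {max_polls} polls.")
```

In the notebook this could be called as `wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)`.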

### Test the endpoint

Let's test the endpoint. 

In `cell 9`, we read the data in. In `cell 10`, we drop the target column, since the model expects a DataFrame with all columns except for the target.

In [None]:
# cell 9

print(f"Downloading data from bucket: {S3_BUCKET}, key: {KEY}")
fn = "maintenance_dataset.csv"
s3 = boto3.resource("s3")
s3.Bucket(S3_BUCKET).download_file(KEY, fn)

print("Reading downloaded data.")
df = pd.read_csv(fn)
os.unlink(fn)

In [None]:
# cell 10

df = df.drop("Failure Type", axis=1)

In `cell 11`, we test the endpoint using [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html).
1. We create a [SageMaker Runtime Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html).
2. We use the client to invoke the endpoint. Please refer to the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) for additional information.

In [None]:
# cell 11

client = boto3.client("sagemaker-runtime")

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="text/csv",
    Body=df.iloc[:10, :].to_csv(index=False))

In [None]:
# cell 12

response["Body"].read()
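
`cell 12` returns the raw response as bytes. As a minimal sketch, the helper below decodes those bytes and splits them into individual predictions. It assumes the endpoint answers with plain text separated by commas or newlines; the actual format depends on the inference code in your container, so treat `parse_predictions` (a hypothetical name) as a starting point only.

```python
def parse_predictions(raw: bytes) -> list:
    """Split a plain-text endpoint response into a list of prediction strings."""
    text = raw.decode("utf-8").strip()
    # Treat newlines like commas so both separators are handled the same way.
    tokens = [token.strip() for token in text.replace("\n", ",").split(",")]
    # Drop empty tokens left over from trailing separators.
    return [token for token in tokens if token]


# Example with made-up bytes standing in for response["Body"].read().
print(parse_predictions(b"0,1\n2\n"))
```

The response body is a stream that can be read only once, so store the result of `response["Body"].read()` in a variable if you need it again.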