# Build a SageMaker Pipeline

---

## Contents

1. [Introduction](#Introduction)
1. [Source the libraries](#Source-the-libraries)
1. [Task 1: Set Up Global Variables](#Task-1:-Set-Up-Global-Variables)
1. [Task 2: Define the Processing Step](#Task-2:-Define-the-Processing-Step)
1. [Task 3: Define the Training Step](#Task-3:-Define-the-Training-Step)
1. [Task 4: Register your model](#Task-4:-Register-your-model)
1. [Task 5: Define the SageMaker Pipeline](#Task-5:-Define-the-SageMaker-Pipeline)
1. [Task 6: Start and Monitor the SageMaker Pipeline Execution](#Task-6:-Start-and-Monitor-the-SageMaker-Pipeline-Execution)


---

## Introduction

This notebook demonstrates how to use Amazon SageMaker XGBoost to process data, train a model, and register the model. We orchestrate these steps with [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/). To get familiar with SageMaker Pipelines, see also:
* [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)
* [SageMaker Python SDK Documentation](https://sagemaker.readthedocs.io/en/stable/#)

We use the [Predictive Maintenance Dataset](https://static.us-east-1.prod.workshops.aws/public/6f2f7cb1-bfda-4b34-ae39-928502784393/static/datasets/maintenance_dataset.csv), originally from the [UCI data repository](http://archive.ics.uci.edu/ml). More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset).

---
## Source the libraries

In [None]:
# cell 1

import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.pipeline import PipelineModel

---
## Task 1: Set Up Global Variables

**Steps**
* Put the name of your bucket and the file name into `S3_BUCKET` and `KEY` respectively, as you did in the previous notebook (`cell 2`)
* Optional: change the `PREFIX` name (`cell 2`)

### Set S3 bucket and data prefix

In [None]:
# cell 2

# Specify where the training and validation data will be uploaded
S3_BUCKET = 'YOUR_S3_BUCKET' # YOUR_S3_BUCKET
KEY = "YOUR_OBJECT_ON_S3" # YOUR_OBJECT_ON_S3
PREFIX = 'pred-maintenance'
input_data = f"s3://{S3_BUCKET}/{KEY}"
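
Because `input_data` is assembled from two placeholders, a quick sanity check can catch a missing bucket or key before the pipeline runs. The following is a minimal sketch that uses only the Python standard library; the helper `split_s3_uri` and the example bucket and key are hypothetical and are not part of the workshop code.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key), rejecting malformed input."""
    parsed = urlparse(uri)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not bucket or not key:
        raise ValueError(f"Expected s3://<bucket>/<key>, got {uri!r}")
    return bucket, key


# Hypothetical values standing in for S3_BUCKET and KEY from cell 2.
example_uri = "s3://my-example-bucket/data/maintenance_dataset.csv"
bucket, key = split_s3_uri(example_uri)
```

In the notebook, calling `split_s3_uri(input_data)` right after `cell 2` would raise a `ValueError` if either placeholder was left empty or the `s3://` scheme is missing.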

### Set role and global vars

In `cell 3`, we set the global variables used by the SageMaker resources:
* `sagemaker_session` is an object that manages interactions with the Amazon SageMaker APIs and any other AWS services needed.
* `region` is the AWS Region in which the resources are created.
* `role` is the IAM role that the instances use when executing a job.

As in `ImmersionDay_Notebook2_SageMaker_Resources.ipynb`, the data processing, model training, and inference steps in this notebook run in [SageMaker Docker containers](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html). For each step, a new instance is started to execute the code. The instance specifications are defined in `cell 3`.

To get familiar with the [SageMaker](https://docs.aws.amazon.com/sagemaker/index.html) resources, also look at:
* [SageMaker Python SDK Documentation](https://sagemaker.readthedocs.io/en/stable/#)
* [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)

In [None]:
# cell 3

# Get a SageMaker-compatible role used by this function and the session.
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()

# Set your instance count and type
framework_version = "1.3-1"
instance_type = 'ml.m5.xlarge'
model_package_group_name = 'pred-maintenance-model'
instance_count = 1

## Task 2: Define the Processing Step

In this task, we are going to run a processing script using [SageMaker Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing). To do so, we:
1. Instantiate the Processor by specifying the Docker image location and the instance specifications.
1. Define the Processing Step by specifying the script to run, the inputs, and the outputs.

Before you can run this code, please update the file `src/preprocess.py` by adding the required code snippets from your notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`.

**Steps**

* Open the notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`
* Open the file `src/preprocess.py`
* Fill in the `ACTION` parts of the `preprocess.py` file by copying and pasting your code from the notebook
    * Here we copy the `feature_column_names` and `label_column` as well as the `train_test_split`

### Retrieve the XGBoost image from SageMaker and instantiate the Processor

In `cell 4`, we:
* Retrieve the location of the XGBoost Docker image from [Amazon Elastic Container Registry (Amazon ECR)](https://aws.amazon.com/ecr/).
* Define the Processor by specifying the location of the Docker image and the instance specification.

In [None]:
# cell 4 

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version=framework_version,
    py_version="py3",
    instance_type=instance_type,
)

script_processor = ScriptProcessor(
    image_uri=image_uri,
    instance_type=instance_type,
    instance_count=instance_count,
    base_job_name=PREFIX,
    command=["python3"],
    sagemaker_session=sagemaker_session,
    role=role,
    volume_size_in_gb=10,
)

### Creating a Processing Step

In [None]:
# cell 5

# Define the processing step
step_process = ProcessingStep(
    name="PreprocessData",
    processor=script_processor,
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
    ],
    code="src/preprocess.py",
    job_arguments=["--input-data", input_data],
)

## Task 3: Define the Training Step

Before you can run this code, please update the file found under `src/train.py` by adding the required code snippets from your notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`.

**Steps**

* Open the notebook `ImmersionDay_Notebook1_GettingStarted.ipynb`
* Open the file `src/train.py`
* Fill the `ACTION` parts of the `train.py` file by copy & pasting your code from the notebook
    * Here we copy the `numeric_features` and `categorical_features`
    * Create an `X_train`, `y_train`, `X_val` and `y_val` from your datasets
    * Add your model code

### Create your training container/estimator in SageMaker

In `cell 6`, we define an XGBoost Estimator by specifying:
* the location of the training script
* the hyperparameters
* the role and instance specification
* output path on S3

In [None]:
# cell 6

# Set your code folder
source_dir = 'src/'
entry_point = 'train.py'

# Set the hyperparameters for this estimator
hyperparameters = {
    'n_estimators': 100,
    'max_depth': 10,
    'random_state': 42
}

xgb_estimator = XGBoost(
    entry_point=entry_point,
    source_dir=source_dir,
    hyperparameters=hyperparameters,
    role=role,
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version=framework_version,
    sagemaker_session=sagemaker_session,
    output_path=f's3://{S3_BUCKET}/{PREFIX}/model/',
)

### Creating a Training Step

To get familiar with the pipeline, see also the [SageMaker Model Building Pipelines documentation](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html).

In [None]:
# cell 7

# Define the training step
step_train = TrainingStep(
    name="TrainModel",
    estimator=xgb_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

## Task 4: Register your model

In this task, we register the trained model with the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html). With the SageMaker Model Registry you can manage model versions and deploy models to production as SageMaker Endpoints.

In [None]:
# cell 8 

# Register your model
step_register = RegisterModel(
    name="RegisterModel",
    estimator=xgb_estimator,
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.t2.medium", "ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name=model_package_group_name,
    approval_status="PendingManualApproval",
)


## Task 5: Define the SageMaker Pipeline

Run through the cells below to:
1. Define the SageMaker Pipeline `cell 9`
2. Submit the pipeline definition to the SageMaker Pipelines service to create a pipeline if it doesn't exist, or update the pipeline if it does `cell 10`.

In [None]:
# cell 9

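# Pipeline instance
# The `parameters` argument expects pipeline parameter objects such as
# `ParameterString`. The plain Python values from cells 2 and 3 are already
# built into the steps, so they are not passed here.
pipeline = Pipeline(
    name="PredMainPipeline",
    steps=[step_process, step_train, step_register],
    sagemaker_session=sagemaker_session,
)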

In [None]:
# cell 10

# Create or upsert the pipeline definition
pipeline.upsert(role_arn=role)

## Task 6: Start and Monitor the SageMaker Pipeline Execution

In this task we are going to:
1. Start the pipeline `cell 11`
2. Examine and monitor a pipeline execution
    * Run `cell 12` to describe the pipeline execution status to ensure that it has been created and started successfully.
    * Run `cell 13` to list the execution steps and their status.
    * (Optional) If you are using SageMaker Studio, visualise the pipeline execution as described in the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-studio-view-execution.html)

In [None]:
# cell 11

# Start the pipeline
execution = pipeline.start()

In [None]:
# cell 12

execution.describe()

In [None]:
# cell 13

execution.list_steps()
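
Re-running `cell 12` and `cell 13` by hand is one way to follow the execution. The monitoring can also be scripted, as in the sketch below. It assumes that `execution.describe()` returns a dictionary with a `PipelineExecutionStatus` key and that `execution.list_steps()` returns a list of dictionaries with `StepName` and `StepStatus` keys. The helpers `wait_for_status` and `summarize_steps` are hypothetical, and fake responses stand in for the real API calls so that the sketch runs without AWS access.

```python
import time
from typing import Any, Callable, Dict, List

# Statuses after which a pipeline execution no longer changes.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_status(
    describe: Callable[[], Dict[str, Any]],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Call describe() repeatedly until the execution reaches a terminal status."""
    for _ in range(max_polls):
        status = describe()["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Execution still running after {max_polls} polls")


def summarize_steps(steps: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each step name to its current status."""
    return {step["StepName"]: step["StepStatus"] for step in steps}


# Fake responses standing in for execution.describe().
fake_responses = iter(
    [
        {"PipelineExecutionStatus": "Executing"},
        {"PipelineExecutionStatus": "Succeeded"},
    ]
)
final_status = wait_for_status(lambda: next(fake_responses), poll_seconds=0)
```

In the notebook, the equivalent calls would be `wait_for_status(execution.describe)` followed by `summarize_steps(execution.list_steps())`. The SDK's execution object also provides a `wait()` method that blocks until the execution finishes, which may be sufficient if no custom handling is needed.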