# Notebook 1 - Use SageMaker to develop a model

## Contents
1. [Introduction](#Introduction)
1. [Source libraries](#Source-libraries)
1. [Task 1: Upload the dataset to Amazon S3](#Task-1:-Upload-the-dataset-to-Amazon-S3)
1. [Task 2: Download the dataset to the notebook instance](#Task-2:-Download-the-dataset-to-the-notebook-instance)
1. [Optional Task: Run Exploratory Data Analysis (EDA) on your dataset](#Optional-Task:-Run-Exploratory-Data-Analysis-(EDA)-on-your-dataset)
1. [Task 3: Data Preprocessing](#Task-3:-Data-Preprocessing)
1. [Task 4: Model Training and Evaluation](#Task-4:-Model-Training-and-Evaluation)
1. [Task 5: Inference and Model Evaluation](#Task-5:-Inference-and-Model-Evaluation)
---


## Introduction

This notebook demonstrates how to use XGBoost on Amazon SageMaker to preprocess data and to train and evaluate a model in a Jupyter notebook. You can run this solution using an [Amazon SageMaker Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) or [Amazon SageMaker Studio Notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html).

We use the [Predictive Maintenance Dataset](https://static.us-east-1.prod.workshops.aws/public/6f2f7cb1-bfda-4b34-ae39-928502784393/static/datasets/maintenance_dataset.csv), originally from the [UCI data repository](http://archive.ics.uci.edu/ml). More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset).

---

## Source libraries


In [None]:
# cell 1 

import pandas as pd
import numpy as np
import boto3
import os

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, f1_score

---

## Task 1: Upload the dataset to Amazon S3

Before running the cells below:
1. Create an Amazon S3 bucket. The bucket name must be globally unique.
1. Download the [Predictive Maintenance Dataset](https://static.us-east-1.prod.workshops.aws/public/6f2f7cb1-bfda-4b34-ae39-928502784393/static/datasets/maintenance_dataset.csv).
1. Upload the file to the bucket you created.
1. In `cell 2`, set `S3_BUCKET` to the name of your bucket and `KEY` to the object key (file name) of the uploaded file.
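
The bucket creation and upload steps can also be scripted with `boto3` instead of the console. The following is a minimal sketch and not part of the original workshop: the helper name `create_bucket_and_upload` and its parameters are hypothetical, and it assumes your credentials are allowed to create buckets and that the bucket does not exist yet. The S3 client is passed in as an argument so the helper can be tried with a stub. Outside `us-east-1`, `create_bucket` needs an explicit `LocationConstraint`.

```python
def create_bucket_and_upload(s3_client, bucket, key, local_path, region="us-east-1"):
    """Create `bucket` in `region`, upload `local_path` as `key`, return the S3 URI.

    Hypothetical helper for illustration. `s3_client` is expected to be a
    boto3 S3 client, e.g. boto3.client("s3", region_name=region).
    """
    if region == "us-east-1":
        # us-east-1 is the default location and is not passed as a constraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    # upload_file(Filename, Bucket, Key) copies the local file to S3
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (requires AWS credentials; the names below are placeholders):
# import boto3
# uri = create_bucket_and_upload(
#     boto3.client("s3", region_name="eu-west-1"),
#     bucket="YOUR_S3_BUCKET",
#     key="maintenance_dataset.csv",
#     local_path="maintenance_dataset.csv",
#     region="eu-west-1",
# )
```

The `bucket` and `key` values you pass here are the ones to enter as `S3_BUCKET` and `KEY` in `cell 2`.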

In [None]:
# cell 2

# Provide the S3 bucket and object key of the dataset you uploaded in Task 1
S3_BUCKET = "YOUR_S3_BUCKET"  # replace with your bucket name
KEY = "YOUR_OBJECT_ON_S3"  # replace with the object key (file name) of the uploaded dataset

---
## Task 2: Download the dataset to the notebook instance

Download and read the file from Amazon S3 and take a look at the DataFrame.

In [None]:
# cell 3 

print(f"Downloading data from bucket: {S3_BUCKET}, key: {KEY}")
fn = "maintenance_dataset.csv"
s3 = boto3.resource("s3")
s3.Bucket(S3_BUCKET).download_file(KEY, fn)

print("Reading downloaded data.")
df = pd.read_csv(fn)
os.unlink(fn)
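
Cell 3 removes the downloaded file with `os.unlink`, but only if `pd.read_csv` succeeds. As a hedged alternative, the sketch below downloads into a temporary directory that is cleaned up even when reading fails. The helper name `read_csv_from_bucket` is hypothetical, and `bucket` is assumed to be a boto3 `Bucket` resource such as `s3.Bucket(S3_BUCKET)`.

```python
import os
import tempfile

import pandas as pd


def read_csv_from_bucket(bucket, key):
    """Download `key` from `bucket` into a temp directory and load it as a DataFrame.

    Hypothetical helper for illustration; `bucket` only needs a
    `download_file(key, filename)` method, as boto3 Bucket resources have.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "dataset.csv")
        bucket.download_file(key, local_path)
        # The DataFrame is fully loaded before the directory is removed
        return pd.read_csv(local_path)


# Usage in this notebook (assumes cell 1 and cell 2 have been run):
# df = read_csv_from_bucket(boto3.resource("s3").Bucket(S3_BUCKET), KEY)
```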

In [None]:
# cell 4

df.shape # Example

In [None]:
# cell 5

df.head() # Example

In [None]:
# cell 6

df.describe() # Example

---
## Optional Task: Run Exploratory Data Analysis (EDA) on your dataset

Run Exploratory Data Analysis (EDA) on your dataset to dive deeper into the data. We skip this step during the Immersion Day and leave it to participants as a call to action. The cells below are a starting point: they count the rows for each value of a few columns.

In [None]:
# cell 7

df.groupby("Failure Type").size()

In [None]:
# cell 8

df.groupby("Type").size()

In [None]:
# cell 9

df.groupby("Product ID").size()

In [None]:
# cell 10

df.groupby("UDI").size()
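
Two checks that often help before modeling are the relative class frequencies of the label and whether a column is merely a row identifier. Below is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset); the helper `identifier_like_columns` is hypothetical. In the notebook you would pass `df` instead of `toy`.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "UDI": [1, 2, 3, 4, 5, 6],
        "Type": ["L", "L", "M", "H", "L", "M"],
        "Failure Type": ["No Failure"] * 5 + ["Power Failure"],
    }
)

# Share of each class in the label column (the fractions sum to 1)
class_share = toy["Failure Type"].value_counts(normalize=True)
print(class_share)


def identifier_like_columns(frame):
    """Return columns whose number of distinct values equals the number of rows."""
    return [col for col in frame.columns if frame[col].nunique() == len(frame)]


# Columns listed here carry one distinct value per row
print(identifier_like_columns(toy))
```

A strongly skewed `class_share` is one reason to stratify the train/validation split and to look at macro-averaged metrics rather than accuracy alone.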

---
## Task 3: Data Preprocessing 
We will split the data into train and test (validation) sets and then preprocess it using the `feature_columns_names` and `label_column` defined in `cell 11`. To preprocess the data:
1. Create a train/test split with sklearn's `train_test_split` in `cell 12`. This code is part of the preprocessing step, which means it will later be reused in our processing container.
1. Run `cell 16` to define a preprocessor that scales your numerical features and encodes your categorical features.
1. Run `cell 17` to encode your label.

In [None]:
# cell 11

feature_columns_names = [
    'UDI',
    'Product ID',
    'Type',
    'Air temperature [K]',
    'Process temperature [K]',
    'Rotational speed [rpm]',
    'Torque [Nm]',
    'Tool wear [min]']
label_column = 'Failure Type'

### Train Test Split

In [None]:
# cell 12

# Your split here...


Possible solution:

<code>
    X_train, X_val, y_train, y_val = train_test_split(
        df[feature_columns_names],
        df[label_column],
        random_state=42,
        train_size=0.8,
        shuffle=True,
        stratify=df[label_column])
</code>
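
To see what `stratify` does, the sketch below splits a made-up, imbalanced label (90 negatives, 10 positives) and compares the share of the rare class in both parts. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 10% of the rows belong to class 1
labels = np.array([0] * 90 + [1] * 10)
features = np.arange(len(labels)).reshape(-1, 1)

_, _, y_tr, y_va = train_test_split(
    features,
    labels,
    random_state=42,
    train_size=0.8,
    shuffle=True,
    stratify=labels,
)

# With stratification, both parts keep roughly the original class share
train_share = y_tr.mean()
val_share = y_va.mean()
print(f"rare-class share, train: {train_share:.2f}, validation: {val_share:.2f}")
```

Without `stratify`, a rare class can end up under-represented in, or missing from, the validation set purely by chance.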

In [None]:
# cell 13 

# Putting the data together again so that data can be saved to S3 or any other place
train = pd.concat(objs=[y_train, X_train], axis=1)
validation = pd.concat(objs=[y_val, X_val], axis=1)
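
Cell 13 places the label in the first column. Some training tools expect this layout for CSV input (the SageMaker built-in XGBoost algorithm is one example), so it is a convenient convention when the data is saved later. As a hedged illustration, the sketch below builds a made-up frame the same way and writes it to CSV text without the index; the toy values and names are assumptions, and uploading the result to S3 is left out.

```python
import io

import pandas as pd

# Made-up features and label for illustration only
toy_X = pd.DataFrame({"Torque [Nm]": [40.0, 42.0], "Type": ["L", "M"]})
toy_y = pd.Series(["No Failure", "Power Failure"], name="Failure Type")

# Same pattern as cell 13: label first, then the features
toy_train = pd.concat(objs=[toy_y, toy_X], axis=1)

# index=False keeps the row index out of the file
buffer = io.StringIO()
toy_train.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```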

In [None]:
# cell 14

train.shape

In [None]:
# cell 15

validation.shape

### Scaling and Encoding the Features


In [None]:
# cell 16

# Scaling the numerical features
numeric_features = [
    'Air temperature [K]',
    'Process temperature [K]',
    'Rotational speed [rpm]',
    'Torque [Nm]',
    'Tool wear [min]']

# TODO get rid of Pipeline 
numeric_transformer = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
    ]
)

# Encoding the categorical features
categorical_features = ['Type']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# Combining both transformers
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
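
Note that `feature_columns_names` also contains `UDI` and `Product ID`, which appear in neither `numeric_features` nor `categorical_features`. With its default `remainder="drop"`, `ColumnTransformer` discards columns that are not assigned to any transformer. The sketch below shows this on made-up rows, using only two of the numeric columns for brevity; the toy values and the `toy_` names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "UDI": [1, 2, 3, 4],
        "Type": ["L", "M", "H", "L"],
        "Torque [Nm]": [40.0, 42.0, 38.0, 44.0],
        "Tool wear [min]": [0.0, 50.0, 100.0, 150.0],
    }
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Torque [Nm]", "Tool wear [min]"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
    ]
)

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse depending on its density; convert for inspection
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()
transformed = np.asarray(transformed)

# 2 scaled numeric columns + 3 one-hot columns (H, L, M); UDI is dropped
print(transformed.shape)
```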

### Label Encoding

In [None]:
# cell 17

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_val = le.transform(y_val)
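
`LabelEncoder` maps each distinct label to an integer in sorted order and keeps the mapping in `classes_`, so predictions can be turned back into readable labels with `inverse_transform`. A minimal sketch with made-up labels (the `toy_` names are assumptions):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration only
toy_labels = ["No Failure", "Power Failure", "No Failure", "Tool Wear Failure"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the sorted distinct labels; position i is encoded as integer i
print(list(toy_le.classes_))
print(encoded.tolist())

# Map integer predictions back to the original strings
decoded = toy_le.inverse_transform(encoded)
print(list(decoded))
```

Calling `transform` on a label that was not present during `fit` raises a `ValueError`, which is why the encoder is fitted on `y_train` and only applied to `y_val`.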

---
## Task 4: Model Training and Evaluation

1. Install XGBoost with the `pip install` cell, then import it in `cell 18`
1. Set the hyperparameters and define the estimator in `cell 19`
1. Train the model in `cell 20`

In [None]:
!pip install xgboost

In [None]:
# cell 18

import xgboost as xgb

### Model Definition and Training

In [None]:
# cell 19

clf = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-1)

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", clf)]
)

In [None]:
# cell 20

model.fit(X_train, y_train)

---
## Task 5: Inference and Model Evaluation
1. Generate the predictions and evaluate the model performance in `cell 21`

In [None]:
# cell 21

y_pred = model.predict(X_val)
y_hat = model.predict(X_train)

print("In Sample")
print(classification_report(y_train, y_hat, zero_division=1))
print(confusion_matrix(y_train, y_hat))
print("Out of Sample")
print(classification_report(y_val, y_pred, zero_division=1))
print(confusion_matrix(y_val, y_pred), "\n")

# zero_division=1 matches the classification_report calls above
print(f"train-recall:{recall_score(y_train, y_hat, average='macro', zero_division=1)};")
print(f"validation-recall:{recall_score(y_val, y_pred, average='macro', zero_division=1)};")
print(f"train-precision:{precision_score(y_train, y_hat, average='macro', zero_division=1)};")
print(f"validation-precision:{precision_score(y_val, y_pred, average='macro', zero_division=1)};")
print(f"train-f1:{f1_score(y_train, y_hat, average='macro', zero_division=1)};")
print(f"validation-f1:{f1_score(y_val, y_pred, average='macro', zero_division=1)};")

# Optional - retrain the model on all data
# Leave this section commented for now:

"""
X = pd.concat(objs=[X_train, X_val], axis=0)
# y_train and y_val are 1-D NumPy arrays after label encoding in cell 17
y = np.concatenate([y_train, y_val])

model = model.fit(X, y)
"""

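Cell 21 reports macro-averaged metrics: the metric is computed per class and then averaged with equal weight, so a rare failure class counts as much as the majority class. The sketch below computes macro recall by hand on made-up predictions and compares it with `recall_score`; the toy arrays and variable names are assumptions.

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up labels and predictions for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1])
y_hat_toy = np.array([0, 0, 0, 1, 1, 0])

# Recall per class: fraction of rows of that class that were predicted correctly
per_class = [
    float(np.mean(y_hat_toy[y_true == cls] == cls)) for cls in np.unique(y_true)
]

# Macro average: unweighted mean over classes
macro_by_hand = float(np.mean(per_class))
macro_sklearn = recall_score(y_true, y_hat_toy, average="macro", zero_division=1)

print(per_class)       # class 0: 3 of 4 correct, class 1: 1 of 2 correct
print(macro_by_hand)   # (0.75 + 0.5) / 2 = 0.625
print(macro_sklearn)
```

Accuracy on the same toy data is 4 of 6, which is higher than the macro recall because the larger class dominates it.
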
---
## Congratulations! 
You have successfully trained your first model on an Amazon SageMaker Notebook instance!

(Optional) You can now go back over [Task 4](#Task-4:-Model-Training-and-Evaluation) and [Task 5](#Task-5:-Inference-and-Model-Evaluation) to:
1. Tune the hyperparameters to see if you can find a (slightly) better model.
1. Take 15 minutes to experiment with the algorithm, or build your own model with a different one, such as the `RandomForestClassifier` imported in `cell 1`.
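
As a starting point for the last suggestion in the list above, the sketch below swaps XGBoost for scikit-learn's `RandomForestClassifier` inside the same kind of `Pipeline`. It runs on made-up data so that it is self-contained; the toy values, the `rf_` and `toy_` names, and the hyperparameters are assumptions. In the notebook you would reuse `preprocessor`, `X_train`, and `y_train` instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only
toy_X = pd.DataFrame(
    {
        "Type": ["L", "M", "H", "L", "M", "H", "L", "M"],
        "Torque [Nm]": [40.0, 42.0, 38.0, 70.0, 41.0, 72.0, 39.0, 71.0],
    }
)
toy_y = [0, 0, 0, 1, 0, 1, 0, 1]

rf_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Torque [Nm]"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
    ]
)

# Same pipeline structure as cell 19, with a different classifier
rf_model = Pipeline(
    steps=[
        ("preprocessor", rf_preprocessor),
        ("classifier", RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)),
    ]
)

rf_model.fit(toy_X, toy_y)
rf_pred = rf_model.predict(toy_X)
print(rf_pred)
```

Because the classifier is the last pipeline step, the evaluation code in `cell 21` works unchanged with either model.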