# Lab: SKLearn Migration Challenge

> *Run this notebook in the same environment you used for the 'Local Notebook': `Python 3 (Data Science 3.0)` on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances*

## Introduction
Your new colleague in the data science team (who isn't very familiar with SageMaker) has written a nice notebook to tackle a classification problem with scikit-learn: [Local Notebook.ipynb](Local%20Notebook.ipynb).

It works OK with the simple Iris data set they were working on before, but now they'd like to take advantage of some of the features of SageMaker to tackle bigger and harder challenges.

Can you help refactor the Local Notebook code, to show them how to use SageMaker effectively?

## Getting Started

First, check that you can run the [Local Notebook.ipynb](Local%20Notebook.ipynb) notebook all the way through, reviewing the steps it takes.

This notebook sets out a structure you can migrate code into, and lists at a high level some of the changes you'll need to make. You can either work directly in here, or duplicate this notebook so you still have an unchanged copy of the original.

Try to work through the sections first with an MVP goal in mind (fitting the model to data in S3 via a SageMaker Training Job, and deploying/using the model through a SageMaker Endpoint). The goal is to understand the big picture of how you can bring your own code to SageMaker and scale your training and deployment. You can always build more advanced models or more complex training code later.

## SKLearn "script mode" training and serving

SageMaker provides [pre-built container images](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-prebuilt.html) for a range of ML frameworks, including Scikit-Learn, which allow you to bring custom models without worrying about building and maintaining your own container images or serving stacks: You can even install extra libraries by [providing a requirements.txt file](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-third-party-libraries) if you want.

This pattern is sometimes called "framework mode" or ["script mode"](https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/) - separate from building fully-custom containers or using the pre-built algorithms.

The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native Scikit-Learn support sets up training-related environment variables and executes your training script. Script mode supports training with a Python script, a Python module, or a shell script.

## Dependencies
Listing all our imports at the start keeps the requirements for running any script/file transparent up-front, and is recommended by nearly every style guide, including Python's official [PEP 8](https://peps.python.org/pep-0008/#imports).

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import os

# External Dependencies:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# TODO: What else will you need?
# Have a look at the documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html
# to see which libraries need to be imported to use SageMaker and the SKLearn estimator


## Prepare the Data

Initial data preparation will be similar to what we did in the [Local Notebook.ipynb](Local%20Notebook.ipynb).

In [None]:
# TODO: Fetch the sample dataset


In [None]:
# TODO: Read in the data file and set headers


In [None]:
# TODO: Check class distribution


If you want your model to be aware of class names, not just numeric IDs, you might want to represent these in a hyperparameter. Since hyperparameters [must be representable as strings](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-HyperParameters), you'd need to represent the array somehow. In this example we'll just use a comma-separated string:

In [None]:
class_names_str = ",".join(class_names)
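
As a minimal sketch of the round trip, the snippet below joins a list into one string on the notebook side and splits it back into a list on the script side. The label values and the `parse_class_names` helper are hypothetical, for illustration only. The approach assumes no class name itself contains a comma.

```python
# Hypothetical class labels for illustration; use the ones from your own dataset.
class_names = ["setosa", "versicolor", "virginica"]

# Notebook side: flatten the list into one string-valued hyperparameter.
class_names_str = ",".join(class_names)


def parse_class_names(raw):
    """Script side: recover the list from the comma-separated string."""
    return [name.strip() for name in raw.split(",") if name.strip()]


recovered = parse_class_names(class_names_str)
print(recovered)
```

If your labels might contain commas, a JSON-encoded string (`json.dumps` / `json.loads`) would be a safer encoding than a plain join.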

In [None]:
# TODO: Split the data into train and test CSVs (with headers)


## Upload Data to Amazon S3

To train in a SageMaker training job, rather than locally, we'll need the train and test datasets to be staged somewhere the job can access: Usually in Amazon S3.

Modern versions of Pandas should support saving to S3 directly with `dataframe.to_csv("s3://{bucket_name}/{file_path}")`

> Alternatively, you can refer to the previous exercises for examples of copying files between S3 and local storage using the `aws s3 sync` CLI command or the boto3 SDK.
>
> The high-level [`aws s3 sync` command](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) synchronizes the contents of a local folder to or from an S3 bucket/folder. You can use options like `--delete` to remove objects from the target that are not present in the source, and `--include` or `--exclude` to filter which files get copied.

But what should your `bucket_name` be? Use the **default SageMaker bucket** for your bucket name, as shown in previous labs.

In [None]:
# TODO: Look up the default SageMaker bucket for your bucket_name


In [None]:
# TODO: Upload your `test` and `train` CSV data splits to your SageMaker default S3 bucket
# You can use pandas to_csv("s3://...") directly, the '!aws s3' CLI, or boto3 S3 as you prefer


## Algorithm ("Estimator") Configuration and Run

Instead of loading and fitting this data here in the notebook, we'll be creating a SKLearn Estimator through the SageMaker SDK, to run the code on a separate container that can be scaled as required.

The ["Using Scikit-learn with the SageMaker Python SDK"](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-scikit-learn-with-the-sagemaker-python-sdk) docs give a good overview of this process. You should run your estimator in script mode (which is easier to follow than the old default legacy mode) and with Python 3.

One thing you'll need to set up your training job is an [execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) the job should run with - to give it access to your S3 data. Let's look that role up first:

In [None]:
# TODO: Look up the SageMaker Execution role to use for training, as in previous labs


Next, you're ready to:

▶️ Use the **[src/main.py file](src/main.py) already prepared for you** in your local directory as your entry point to port code into. This includes a basic template, but with more TODOs you'll need to fill in.

▶️ Define the 'estimator' here in the notebook, to configure how the training job should run your script.

Remember these two sides connect together: The script receives parameters, local input data folders, and the target model output folder as CLI parameters and environment variables from the estimator.
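
One way the script side of this connection could look is sketched below. It is not the content of the provided `src/main.py`. The argument names are assumptions copied from the local test command in this notebook, and the defaults read the `SM_MODEL_DIR` and `SM_CHANNEL_<NAME>` environment variables that SageMaker sets inside the training container. This version assumes channels named `train` and `test`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and data locations for a script-mode training script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as CLI arguments (as strings), so each one gets a type.
    parser.add_argument("--class_names", type=lambda s: s.split(","), default=[])
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--min_samples_leaf", type=int, default=1)

    # Folder locations fall back to the environment variables set in the container.
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))

    return parser.parse_args(argv)


# Simulate a local run by passing the arguments explicitly.
args = parse_args([
    "--train", "./data/train",
    "--test", "./data/test",
    "--model_dir", "./data/model",
    "--class_names", "a,b",
    "--n_estimators=100",
    "--min_samples_leaf=3",
])
print(args.n_estimators, args.min_samples_leaf, args.class_names)
```

Because the folders default to environment variables, the same script can run locally with explicit paths and inside a training job with no path arguments at all.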

In [None]:
# TODO: Define your estimator using the SKLearn framework


> ⚠️ **Before running the actual training job** on SageMaker, we suggest running your script locally using the example command below.
>
> This can help you find and fix errors faster, because you won't need to wait for the job to start up each time.

Do the number and names of the data **'channels'** in this command match your script?

In [None]:
os.makedirs("data/model", exist_ok=True)

!python3 src/main.py \
 --train ./data/train \
 --test ./data/test \
 --model_dir ./data/model \
 --class_names {class_names_str} \
 --n_estimators=100 \
 --min_samples_leaf=3

## Run the SageMaker Training Job

When you're ready to try your script in a SageMaker training job, you can call `estimator.fit(...)` as we did in previous exercises, specifying your input data location(s).

Your job should have 2 input datasets: One for training, and one for test/validation. In SageMaker terminology, each input data set is a "channel" and we can name them however we like... Just make sure you're consistent about what you call each one!
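
To make the naming consistency concrete, here is a small sketch of a channel dictionary and the environment variable each channel maps to inside the job. The bucket name, prefix, and `channel_env_var` helper are hypothetical. The `fit` call is left commented out because it needs an estimator and AWS credentials.

```python
# Hypothetical bucket and prefix for illustration; substitute your own S3 locations.
bucket_name = "example-bucket"
prefix = "sklearn-migration"

# Keys are channel names (your choice); values are the S3 locations to read from.
inputs = {
    "train": f"s3://{bucket_name}/{prefix}/train",
    "test": f"s3://{bucket_name}/{prefix}/test",
}


def channel_env_var(channel_name):
    """Name of the environment variable holding a channel's local folder in the job."""
    return f"SM_CHANNEL_{channel_name.upper()}"


for name, s3_uri in inputs.items():
    print(f"{s3_uri} -> ${channel_env_var(name)}")

# With an estimator already defined, the job would then be started with:
# estimator.fit(inputs)
```

If you rename a channel here (for example `test` to `validation`), the script must read the matching variable (`SM_CHANNEL_VALIDATION`) instead.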

When training is complete, the training job will automatically upload the saved model to Amazon S3 ready for deployment.

In [None]:
# TODO: Call the fit function, passing in the data you uploaded to S3 earlier


## Deploy and Use Your Model (Real-Time Inference)
We are now ready to deploy our model to SageMaker hosting services and make [real-time predictions](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html).

**Hint:** With the Scikit-Learn framework (where models may come in many different formats), you need to [define a `model_fn`](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#load-a-model) for loading your model at inference time. The script provided for training isn't automatically passed at inference time too, so the easiest way to link your script to the model endpoint is probably to create a `SKLearnModel` - rather than trying to `deploy()` your estimator directly.
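
A minimal sketch of a matching save/load pair is shown below, assuming the model is persisted with `joblib` under the file name `model.joblib`. Both the file name and the `save_model` helper are assumptions for illustration. What matters is that `model_fn` loads exactly what the training script saved into the model directory.

```python
import os

import joblib

# Assumed file name; it must match whatever your training script actually saves.
MODEL_FILE_NAME = "model.joblib"


def save_model(model, model_dir):
    """Training side: write the fitted model into the model output folder."""
    joblib.dump(model, os.path.join(model_dir, MODEL_FILE_NAME))


def model_fn(model_dir):
    """Inference side: load the model from the folder where it was unpacked."""
    return joblib.load(os.path.join(model_dir, MODEL_FILE_NAME))
```

Keeping both functions in the same script makes it harder for the saved file name and the loaded file name to drift apart.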

In [None]:
# TODO: Create a SKLearnModel from your training job


In [None]:
# TODO: Deploy your trained model to a real-time endpoint


Let's now send some data to our model to predict.

Note that you'll need to send only the input fields the model expects (the `X_test` features, excluding the label column), in a [format supported](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#get-predictions) by the deployed endpoint.
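
As one illustration, the sketch below drops the label column from a few rows and renders the remaining features as a header-less CSV string. It assumes your endpoint accepts `text/csv` (check the linked docs for the formats your container supports). The sample rows and the `rows_to_csv` helper are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, label_index=None):
    """Render rows as header-less CSV text, skipping the label column if given."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        features = [value for i, value in enumerate(row) if i != label_index]
        writer.writerow(features)
    return buffer.getvalue().rstrip("\n")


# Illustrative rows: four numeric features followed by a label in position 4.
sample = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [6.2, 2.9, 4.3, 1.3, "versicolor"],
]
payload = rows_to_csv(sample, label_index=4)
print(payload)
```

In practice you may not need to build the string by hand: a predictor configured with a CSV serializer from the SageMaker SDK could take a NumPy array or list of feature rows directly.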

In [None]:
# TODO: Load some test data to test your model with, in the same format as it was trained on


In [None]:
# TODO: Invoke your endpoint and return the predictions


## (Optional) Extension exercises

By getting this far, hopefully you've been able to train your model on SageMaker, deploy it to an inference endpoint, and make some test predictions. Great going!

If you have some extra time, try exploring these extension exercises for an extra challenge:

- **Cut training costs easily with SageMaker Managed Spot Mode**: Spot Instances let you take advantage of unused capacity in the AWS cloud, at up to a 90% discount versus standard on-demand pricing! For small jobs like this, taking advantage of this discount is as easy as adding a couple of parameters to the [Estimator constructor](https://sagemaker.readthedocs.io/en/stable/estimators.html)

> **Note** that in general, spot capacity is offered at a discounted rate because it's interruptible based on instantaneous demand... Longer-running training jobs should implement checkpoint saving and loading, so that they can efficiently resume if interrupted part way through. More information can be found on the [Managed Spot Training in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) page of the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/).

- **Batch Inference**: Many tabular data use-cases make predictions on batches of data, rather than real-time requests. [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) helps you run batches of data through your model and automatically spin up + shut down infrastructure when needed: There's no need to deploy an endpoint and orchestrate data batches yourself! See if you can run a batch transform job using your previously trained model.

> **Hint:** There's a batch transform example in the [built-in XGBoost algorithm example notebook](../../builtin_algorithm_hpo_tabular/1%20Autopilot%20and%20XGBoost.ipynb) that might be useful to refer to, and check out the [SageMaker Python SDK user guide on Batch Transform](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-batch-transform) as well.

- **Automatic Model Tuning**: We already showed some model hyper-parameters in the local notebook. Can you connect these up as training job hyperparameters; set up metric scraping from your job logs; and use those hyperparameters + metrics to run a [SageMaker Automatic Model Tuning](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-automatic-model-tuning) run?

> **Hint:** There's an HPO example in the [built-in XGBoost algorithm example notebook](../../builtin_algorithm_hpo_tabular/1%20Autopilot%20and%20XGBoost.ipynb), but for a custom algorithm you'll also need to supply `metric_definitions` to tell SageMaker how to read accuracy metrics from your training job logs.
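
A `metric_definitions` entry pairs a metric name with a regular expression whose capture group extracts the value from a log line. The sketch below checks such a regex locally before using it in a job. The log format and the metric name `test:accuracy` are assumptions, so your training script would need to print a matching line itself.

```python
import re

# Assumed log format; the training script must print a line like this itself.
sample_log_line = "Test accuracy: 0.9667"

# Each definition names a metric and gives a regex with one capture group.
metric_definitions = [
    {"Name": "test:accuracy", "Regex": r"Test accuracy: ([0-9.]+)"},
]

# Dry-run the regexes against the sample line before launching a job.
for definition in metric_definitions:
    match = re.search(definition["Regex"], sample_log_line)
    if match:
        print(definition["Name"], float(match.group(1)))
```

Testing the regex locally like this is much quicker than discovering after a tuning run that no metric values were captured.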

## Clean-Up

Remember to clean up any persistent resources that aren't needed anymore to save costs: The most significant of these are real-time prediction endpoints, and this SageMaker Notebook Instance.

The SageMaker SDK [Predictor](https://sagemaker.readthedocs.io/en/stable/predictors.html) class provides an interface to clean up real-time prediction endpoints; and SageMaker Notebook Instances can be stopped through the SageMaker Console when you're finished.

You might also like to clean up any S3 buckets / content we created, to prevent ongoing storage costs.


In [None]:
# TODO: Clean up any endpoints/etc to release resources
