## Develop, Train, Optimize and Deploy Scikit-Learn Random Forest

> *This notebook should work well with the `Python 3 (Data Science)` kernel in SageMaker Studio, or the `conda_python3` kernel in SageMaker Notebook Instances*

In this notebook we show how to use Amazon SageMaker to develop, train, tune and deploy a Random Forest model using the popular ML framework [Scikit-Learn](https://scikit-learn.org/stable/index.html).

The example uses the *California Housing dataset* (provided by Scikit-Learn) - more details of which can be found [here](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html).

To understand the code, you might also find it useful to refer to:

* The guide on [Using Scikit-Learn with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html)
* The API doc for [Scikit-Learn classes in the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html)
* The [SageMaker reference for Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client) (The general AWS SDK for Python, including low-level bindings for SageMaker as well as many other AWS services)


## Setup libraries and environment


In [None]:
# Python Built-Ins:
import datetime
import tarfile

# External Dependencies:
import boto3
import numpy as np
import pandas as pd
from sagemaker import get_execution_role
import sagemaker
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

sm_boto3 = boto3.client('sagemaker')
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = sess.default_bucket() # this could also be a hard-coded bucket name

print('Using bucket ' + bucket)


## Prepare data
We load a dataset from Scikit-Learn, split it into training and test sets, and save both as CSV files (they are uploaded to S3 later in the notebook).

In [None]:
# we use the California housing dataset 
data = fetch_california_housing()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [None]:
trainX.head()

In [None]:
# create directories
! mkdir -p data
! mkdir -p source
! mkdir -p model

# save data as csv
trainX.to_csv('data/california_housing_train.csv')
testX.to_csv('data/california_housing_test.csv')

## Create a training script

The SageMaker Scikit-Learn Framework Container provides the basic runtime, and we as users specify the actual training steps to run as a script file (or even a folder of several, perhaps including a *requirements.txt* file).

The cell below writes the script to a `.py` file directly from the notebook.

The same script can be used at training time (run as a script) and at inference time (imported as a module), so below we:

- Define some specific functions to override default inference behavior (e.g. `model_fn()`), and
- Enclose the training entry point in an `if __name__ == '__main__'` *guard clause* so it only executes when the module is run as a script.

You can find detailed guidance in the documentation on [Preparing a Scikit-Learn training script](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#prepare-a-scikit-learn-training-script) (for training) and the [SageMaker Scikit-Learn model server](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#sagemaker-scikit-learn-model-server) (for inference).
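
The dual-use layout described above can be sketched with the standard library alone. This is a toy illustration rather than the notebook's script: `MeanModel`, `train()` and the `model.json` file name are hypothetical stand-ins for a real estimator and its `joblib` file, while `model_fn` and `predict_fn` follow the hook names used by the SageMaker Scikit-Learn model server.

```python
import json
import os
import tempfile


class MeanModel:
    """Hypothetical stand-in for a fitted estimator: always predicts the training mean."""

    def __init__(self, mean):
        self.mean = mean

    def predict(self, rows):
        return [self.mean for _ in rows]


def model_fn(model_dir):
    """Inference hook: load the saved model from model_dir."""
    with open(os.path.join(model_dir, "model.json")) as f:
        return MeanModel(json.load(f)["mean"])


def predict_fn(input_data, model):
    """Optional inference hook: run the loaded model on deserialized input."""
    return model.predict(input_data)


def train(model_dir, targets):
    """Training step: fit the toy model and save it where model_fn looks for it."""
    mean = sum(targets) / len(targets)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"mean": mean}, f)


if __name__ == "__main__":
    # Executed only when the file is run as a script (training),
    # not when it is imported as a module (inference).
    with tempfile.TemporaryDirectory() as tmp:
        train(tmp, [1.0, 2.0, 3.0])
        print(predict_fn([[0], [1]], model_fn(tmp)))
```

Importing this file gives access to `model_fn` and `predict_fn` without triggering training, which is the property the guard clause is there to provide.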

In [None]:
%%writefile source/sklearn_training_script.py
# Python Built-Ins:
import argparse
import os

# External Dependencies:
# joblib is a set of tools for lightweight pipelining in Python; here it saves and loads the model.
# NumPy adds support for large, multi-dimensional arrays and matrices,
# along with a large collection of high-level mathematical functions to operate on them.
# pandas is a library for data manipulation and analysis; in particular, it offers
# data structures and operations for manipulating numerical tables and time series.

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# A random forest regressor is a meta estimator that fits a number of decision trees
# on various sub-samples of the dataset and averages their predictions to improve
# predictive accuracy and control over-fitting.

# inference functions ---------------
def model_fn(model_dir):
    """Load the trained model from model_dir (used at inference time)."""
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ == '__main__':

    # ------------------------------- parsing input parameters (from command line)
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # RandomForest hyperparameters
    parser.add_argument('--n_estimators', type=int, default=10)
    parser.add_argument('--min_samples_leaf', type=int, default=3)

    # Data, model, and output directories
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test_dir', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train_file', type=str, default='california_housing_train.csv')
    parser.add_argument('--test_file', type=str, default='california_housing_test.csv')
    parser.add_argument('--features', type=str)  # explicitly name which features to use
    parser.add_argument('--target_variable', type=str)  # explicitly name the column to be used as target

    args, _ = parser.parse_known_args()

    # ------------------------------- data preparation
    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train_dir, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test_dir, args.test_file))

    print('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target_variable]
    y_test = test_df[args.target_variable]

    # ------------------------------- model training
    print('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1)

    model.fit(X_train, y_train)

    # ------------------------------- model testing
    print('testing model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    # Absolute error is an evaluation metric for regression models.
    # We report it at several percentiles (the 50th is the median absolute error).
    for q in [10, 50, 90]:
        print('AE-at-' + str(q) + 'th-percentile: '
              + str(np.percentile(a=abs_err, q=q)))

    # ------------------------------- save model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model saved at ' + path)


## Local training
Script arguments allow us to keep SageMaker-specific configuration out of the script, so we can also run it locally:

In [None]:
! python source/sklearn_training_script.py \
 --n_estimators 100 \
 --min_samples_leaf 3 \
 --model_dir 'model/' \
 --train_dir 'data/' \
 --test_dir 'data/' \
 --train_file 'california_housing_train.csv' \
 --test_file 'california_housing_test.csv' \
 --features 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude' \
 --target_variable 'target'
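
The local run works because every setting has two possible sources: an explicit command-line flag, or a default read from an environment variable such as `SM_MODEL_DIR`. Below is a minimal sketch of that pattern using a subset of the script's arguments. The `parse_args` helper and the `--some_other_flag` argument are hypothetical and only there to show what `parse_known_args` does with flags the script does not declare.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse known arguments and return them with any unrecognized leftovers."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=10)
    # Falls back to the SM_MODEL_DIR environment variable when no flag is given
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--features", type=str)
    # parse_known_args collects undeclared arguments instead of raising an error
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_args([
    "--n_estimators", "100",
    "--model_dir", "model/",
    "--features", "MedInc HouseAge",
    "--some_other_flag", "1",  # hypothetical flag the script does not declare
])
print(args.n_estimators, args.model_dir, args.features.split(), unknown)
```

Hyperparameters given to a SageMaker estimator reach the script as command-line arguments of the same form, which is why one parser can serve both local and remote runs.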

## SageMaker Training

### Creating data input channels (copy to S3)

In [None]:
# Send data to S3; SageMaker reads the training data from S3
train_path_s3 = sess.upload_data(
    path='data/california_housing_train.csv',  # source
    bucket=bucket,
    key_prefix='sm101/sklearn'  # destination path in S3
)

test_path_s3 = sess.upload_data(
    path='data/california_housing_test.csv',  # source
    bucket=bucket,
    key_prefix='sm101/sklearn'  # destination path in S3
)

print('Train set URI:', train_path_s3)
print('Test set URI:', test_path_s3)

### Launching a training job with the Python SDK

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='source/sklearn_training_script.py',
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    base_job_name='rf-scikit',
    metric_definitions=[
        {'Name': 'median-AE', 'Regex': 'AE-at-50th-percentile: ([0-9.]+).*$'},
    ],
    hyperparameters={
        'n_estimators': 100,
        'min_samples_leaf': 3,
        'features': 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude',
        'target_variable': 'target',
    },
    max_run=20*60,  # Maximum allowed active runtime (in seconds)
    use_spot_instances=True,  # Use spot instances to reduce cost
    max_wait=30*60,  # Maximum clock time in seconds (including spot delays)
)
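
The `metric_definitions` entry works by matching a regular expression against the job's log output and capturing the number in the first group. The sketch below applies the same pattern to a few log lines in the format the training script prints. The numeric values are made up for illustration, and `extract_metric` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# Same pattern as the 'median-AE' metric definition
METRIC_REGEX = r"AE-at-50th-percentile: ([0-9.]+).*$"

# Log lines in the training script's output format (values are invented)
log_lines = [
    "AE-at-10th-percentile: 0.031",
    "AE-at-50th-percentile: 0.205",
    "AE-at-90th-percentile: 0.874",
]


def extract_metric(lines, pattern):
    """Return the first captured group of every matching line, as floats."""
    values = []
    for line in lines:
        match = re.search(pattern, line)
        if match:
            values.append(float(match.group(1)))
    return values


median_ae = extract_metric(log_lines, METRIC_REGEX)
print(median_ae)
```

Only the 50th-percentile line matches, so the other two printed percentiles are not recorded as metrics unless further definitions are added for them.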

In [None]:
sklearn_estimator.fit({'train':train_path_s3, 'test': test_path_s3}, wait=True)


Remember that the training job we ran is very "light" because the dataset is very small. Running locally on the notebook instance is therefore faster than running on SageMaker, which first has to provision the training infrastructure. Since this example training job is not very resource-intensive, provisioning adds more overhead than the training itself.

In a real situation, where datasets are large, running on SageMaker can considerably speed up the execution process - and help us optimize costs, by keeping this interactive notebook environment modest and spinning up more powerful training job resources on-demand.

Note that this training job *did not run here on the notebook itself*. You'll be able to see the history in the [AWS Console for SageMaker - Training Jobs tab](https://console.aws.amazon.com/sagemaker/home?#/jobs).

> ℹ️ **Tip:** There's **no need to re-run** a training job if your notebook kernel restarts or the estimator state is lost for some other reason... You can just *attach* to a previous training job by name - for example:
>
> ```python
> estimator = SKLearn.attach('rf-scikit-2025-01-01-00-00-00-000')
> ```

## Deploy to a real-time endpoint

### Deploy with Python SDK

It's possible to deploy a trained `Estimator` to a SageMaker endpoint for real-time inference in one line of code, with `Estimator.deploy(...)` - which implicitly creates a SageMaker [Model](https://console.aws.amazon.com/sagemaker/home?#/models), [Endpoint Configuration](https://console.aws.amazon.com/sagemaker/home?#/endpointConfig), and [Endpoint](https://console.aws.amazon.com/sagemaker/home?#/endpoints).
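
As a rough sketch, the one-line route looks like the following. The `deploy_directly` wrapper is hypothetical and exists only to keep the example free of side effects; the call inside it is the `deploy(...)` method mentioned above, and the instance type is an assumption chosen to match the rest of this notebook.

```python
def deploy_directly(estimator, instance_type="ml.c5.large", instance_count=1):
    """Deploy a fitted estimator straight to a real-time endpoint.

    `estimator` is assumed to be a fitted SageMaker Estimator,
    for example the `sklearn_estimator` trained earlier.
    """
    return estimator.deploy(
        initial_instance_count=instance_count,
        instance_type=instance_type,
    )


# Usage (creates billable AWS resources, so it is left commented out):
# predictor = deploy_directly(sklearn_estimator)
```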

For more fine-grained control though, you can choose to create a `Model` object through the SageMaker Python SDK - referencing the `model.tar.gz` produced on Amazon S3 by the training job. This would allow us to, for example:

- Modify environment variables or the Python files used between training and inference
- Import a model trained outside SageMaker that's been packaged to a compatible `model.tar.gz` on Amazon S3
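
For the second case, packaging a model trained elsewhere could look roughly like this. The sketch assumes the inference script loads `model.joblib` from the top level of the extracted archive, as the `model_fn()` in this notebook does. `package_model` is a hypothetical helper, and placeholder bytes stand in for a real serialized model.

```python
import os
import tarfile
import tempfile


def package_model(model_file, output_dir):
    """Bundle a serialized model into model.tar.gz with the file at the archive root."""
    archive_path = os.path.join(output_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops local folders so the file sits at the top level of the archive
        tar.add(model_file, arcname=os.path.basename(model_file))
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder content; a real workflow would call joblib.dump(model, model_file)
    model_file = os.path.join(tmp, "model.joblib")
    with open(model_file, "wb") as f:
        f.write(b"placeholder")

    archive = package_model(model_file, tmp)
    with tarfile.open(archive) as tar:
        names = tar.getnames()
    print(names)
```

The resulting archive would then be uploaded to Amazon S3 and its URI passed as `model_data` when creating the `Model` object.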

We'll demonstrate the longer route here:

In [None]:
sklearn_estimator.latest_training_job.wait(logs='None') # Check the job is finished

model_artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name
)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact saved at:', model_artifact)

In [None]:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data=model_artifact,
    framework_version='0.23-1',
    py_version='py3',
    role=get_execution_role(),
    entry_point='source/sklearn_training_script.py',
)

In [None]:
predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
)

### Real-time inference

The [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) class from the SageMaker Python SDK provides a Python wrapper around the endpoint which also handles (configurable) de/serialization of the request and response.

Alternatively, clients that cannot use the SageMaker Python SDK (for example non-Python clients, or Python environments where the PyPI [sagemaker](https://pypi.org/project/sagemaker/) package can't be installed for some reason) can use the general AWS SDKs to call the lower-level [SageMaker InvokeEndpoint API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html).
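
A minimal sketch of the lower-level route with Boto3 follows. It assumes the endpoint accepts `text/csv` request bodies, which depends on the input handling configured in the container. `rows_to_csv` and `invoke` are hypothetical helper names, and the response is returned as raw bytes because its format depends on the endpoint's output handling.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of feature rows to a header-less CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def invoke(endpoint_name, rows, runtime_client=None):
    """Call InvokeEndpoint with a CSV payload and return the raw response body."""
    if runtime_client is None:
        # Imported lazily so that rows_to_csv can be used without AWS access
        import boto3
        runtime_client = boto3.client("sagemaker-runtime")
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # assumption: the endpoint accepts CSV input
        Body=rows_to_csv(rows),
    )
    return response["Body"].read()


# Usage against the endpoint deployed above (requires AWS credentials):
# raw = invoke(predictor.endpoint_name, testX[data.feature_names].values.tolist())
```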

In [None]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

In [None]:
print(data.feature_names)
print()  # blank line between the two outputs
print(testX[data.feature_names])

### Delete endpoint

While training job infrastructure is started on-demand and terminated as soon as the job stops, endpoints are live until we turn them off. Delete unused endpoints to prevent ongoing costs:

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)
