# Develop, Train, Optimize and Deploy Scikit-Learn Random Forest

> *This notebook should work well with the `Python 3 (Data Science 2.0)` kernel on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances*

In this notebook we show how to use Amazon SageMaker to develop, train, tune and deploy a Random Forest model using the popular ML framework [Scikit-Learn](https://scikit-learn.org/stable/index.html).

The example uses the *California Housing dataset* (provided by Scikit-Learn) - more details of which can be found [here](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html).

To understand the code, you might also find it useful to refer to:

* The guide on [Using Scikit-Learn with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html)
* The API doc for [Scikit-Learn classes in the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html)
* The [SageMaker reference for Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client) (The general AWS SDK for Python, including low-level bindings for SageMaker as well as many other AWS services)


We start by importing dependencies and setting up basic configurations like the current AWS Region and target S3 bucket:

In [None]:
# Python Built-Ins:
import os

# External Dependencies:
import boto3 # General-purpose AWS SDK for Python
import numpy as np # Tools for working with numeric arrays
import pandas as pd # Tools for working with data tables (dataframes)
import sagemaker # High-level SDK for Amazon SageMaker in particular
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
region = sess.boto_session.region_name
bucket = sess.default_bucket() # this could also be a hard-coded bucket name

print(f"Using bucket {bucket}")

## Prepare data

Next, we'll load our raw example dataset from SKLearn and prepare it into the format the training job will use: A separate CSV for training and for validation/test.

In [None]:
data = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(
 data.data, data.target, test_size=0.25, random_state=42
)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX["target"] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX["target"] = y_test

trainX.head()

In [None]:
# create directories
os.makedirs("data", exist_ok=True)
os.makedirs("src", exist_ok=True)
os.makedirs("model", exist_ok=True)

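# save data as CSV (index=False avoids writing the DataFrame index as an extra unnamed column)
trainX.to_csv("data/california_train.csv", index=False)
testX.to_csv("data/california_test.csv", index=False)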

## Create a training script

The SageMaker Scikit-Learn [Framework Container](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-docker-containers-scikit-learn-spark.html) provides the basic runtime, and we as users specify the actual training steps to run as a script file (or even a folder of several, perhaps including a `requirements.txt`).

The below code initializes a [`src/main.py`](src/main.py) file from here in the notebook. You can also create Python scripts and other files from the launcher or the File menu.

In this example, the same file will be used at training time (run as a script), and at inference time (imported as a [module](https://docs.python.org/3/tutorial/modules.html)) - So below we:

- Define some specific **inference functions** to override default behavior (e.g. `model_fn()`), and
- Enclose the **training entry point** in an `if __name__ == '__main__'` [guard clause](https://docs.python.org/3/library/__main__.html) so it only executes when the module is run as a script.

You can find detailed guidance in the documentation on [Preparing a Scikit-Learn training script](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#prepare-a-scikit-learn-training-script) (for training) and the [SageMaker Scikit-Learn model server](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#sagemaker-scikit-learn-model-server) (for inference).

In [None]:
%%writefile src/main.py
# Python Built-Ins:
import argparse
import os

# External Dependencies:
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


# ---- INFERENCE FUNCTIONS ----
def model_fn(model_dir):
 clf = joblib.load(os.path.join(model_dir, "model.joblib"))
 return clf


if __name__ == "__main__":
 # ---- TRAINING ENTRY POINT ----
 
 # Arguments like data location and hyper-parameters are passed from SageMaker to your script
 # via command line arguments and/or environment variables. You can use Python's built-in
 # argparse module to parse them:
 print("Parsing training arguments")
 parser = argparse.ArgumentParser()

 # RandomForest hyperparameters
 parser.add_argument("--n_estimators", type=int, default=10)
 parser.add_argument("--min_samples_leaf", type=int, default=3)

 # Data, model, and output directories
 parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
 parser.add_argument("--train_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
 parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
 parser.add_argument("--train_file", type=str, default="train.csv")
 parser.add_argument("--test_file", type=str, default="test.csv")
 parser.add_argument("--features", type=str) # explicitly name which features to use
 parser.add_argument("--target_variable", type=str) # name the column to be used as target

 args, _ = parser.parse_known_args()

 # -- DATA PREPARATION --
 # Load the data from the local folder(s) SageMaker pointed us to:
 print("Reading data")
 train_df = pd.read_csv(os.path.join(args.train_dir, args.train_file))
 test_df = pd.read_csv(os.path.join(args.test_dir, args.test_file))

 print("building training and testing datasets")
 X_train = train_df[args.features.split()]
 X_test = test_df[args.features.split()]
 y_train = train_df[args.target_variable]
 y_test = test_df[args.target_variable]

 # -- MODEL TRAINING --
 print("Training model")
 model = RandomForestRegressor(
     n_estimators=args.n_estimators,
     min_samples_leaf=args.min_samples_leaf,
     n_jobs=-1,
 )

 model.fit(X_train, y_train)

 # -- MODEL EVALUATION --
 print("Testing model")
 abs_err = np.abs(model.predict(X_test) - y_test)
 # Output metrics to the console (in this case, percentile absolute errors):
 for q in [10, 50, 90]:
     print(f"AE-at-{q}th-percentile: {np.percentile(a=abs_err, q=q)}")

 # -- SAVE THE MODEL --
 # ...To the specific folder SageMaker pointed us to:
 path = os.path.join(args.model_dir, "model.joblib")
 joblib.dump(model, path)
 print(f"model saved at {path}")


## Local training

Since configuration is by command line arguments, we can test our training script locally before uploading to a SageMaker training job.
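
As a rough illustration of why this works, the sketch below mimics how hyperparameters reach the script as `--name value` pairs, while data and model locations fall back to environment variables. The helper names `hyperparameters_to_argv` and `parse_args`, the `some_other_flag` entry, and the example paths are hypothetical and only serve the demonstration; the parser repeats a subset of the arguments declared in `src/main.py`.

```python
import argparse


def hyperparameters_to_argv(hyperparameters):
    """Flatten a dict into ['--name', 'value', ...], the style of arguments the script parses."""
    argv = []
    for name, value in hyperparameters.items():
        argv += [f"--{name}", str(value)]
    return argv


def parse_args(argv, environ):
    """Parse a subset of the training script's arguments from an explicit argv list."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=10)
    parser.add_argument("--min_samples_leaf", type=int, default=3)
    # Locations default to environment variables, as in src/main.py
    parser.add_argument("--model_dir", type=str, default=environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train_dir", type=str, default=environ.get("SM_CHANNEL_TRAIN"))
    # parse_known_args() tolerates arguments that this parser does not declare
    return parser.parse_known_args(argv)


# Hypothetical inputs: one declared hyperparameter and one the parser does not know
argv = hyperparameters_to_argv({"n_estimators": 100, "some_other_flag": "x"})
# Example environment (illustrative paths, passed in explicitly rather than read from os.environ)
example_env = {"SM_MODEL_DIR": "/opt/ml/model", "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}

args, unknown = parse_args(argv, example_env)
print(args.n_estimators, args.min_samples_leaf, args.model_dir, unknown)
```

Because everything arrives through arguments and environment variables, the same script can be pointed at local folders simply by passing `--model_dir`, `--train_dir` and `--test_dir` explicitly.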

> ⚠️ **Note:** This is good for quick, functional tests of your script against small sample datasets... But once you're confident your script *functionally* works, you probably want to move your experiments to reproducible, trackable, SageMaker training jobs. Be aware that the libraries in your notebook kernel may not exactly match the container image you configure for the training job later.

In [None]:
!python src/main.py \
 --n_estimators 100 \
 --min_samples_leaf 3 \
 --model_dir model/ \
 --train_dir data/ \
 --test_dir data/ \
 --train_file california_train.csv \
 --test_file california_test.csv \
 --features 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude' \
 --target_variable target

## SageMaker Training

To run your script in a training job, first we need to upload the data somewhere SageMaker can access it: Typically this will be [Amazon S3](https://aws.amazon.com/s3/).

### Creating data input "channels" (copy to S3)

Note that the number and naming of multiple data "channels" for SageMaker is up to you: You don't need to have exactly 2, and they don't need to be called "train" and "test".
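
To make the naming convention concrete, here is a minimal sketch of how a channel name is expected to surface inside the training container: as an `SM_CHANNEL_<NAME>` environment variable pointing to a local folder. The helper functions and the extra `holdout` channel are hypothetical, and the base folder shown is the location SageMaker typically uses, so treat it as an assumption rather than something to hard-code.

```python
def channel_env_var(channel_name):
    """Name of the environment variable expected to hold a channel's local folder."""
    return f"SM_CHANNEL_{channel_name.upper()}"


def channel_local_dir(channel_name, base="/opt/ml/input/data"):
    """Folder where the channel's files are expected to be downloaded (assumed default base)."""
    return f"{base}/{channel_name}"


# "holdout" is a made-up third channel, to show names are not limited to train/test
for name in ["train", "test", "holdout"]:
    print(f"{name}: {channel_env_var(name)} -> {channel_local_dir(name)}")
```

This is why the training script in this notebook reads `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST`: those names follow from the `train` and `test` channels it is launched with.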

In [None]:
train_data_s3uri = sess.upload_data(
 path="data/california_train.csv", # Local source
 bucket=bucket,
 key_prefix="sm101/sklearn", # Destination path in S3 bucket
)

test_data_s3uri = sess.upload_data(
 path="data/california_test.csv", # Local source
 bucket=bucket,
 key_prefix="sm101/sklearn", # Destination path in S3 bucket
)

print("Train set URI:", train_data_s3uri)
print("Test set URI:", test_data_s3uri)

### Launching a training job with the Python SDK

With the data uploaded and script prepared, you're ready to configure your SageMaker training job:

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
 entry_point="main.py",
 source_dir="src", # To upload the whole folder - or instead set entry_point="src/main.py"
 role=sagemaker.get_execution_role(), # Use same IAM role as notebook is currently using
 instance_count=1,
 instance_type="ml.m5.large",
 framework_version="1.0-1",
 base_job_name="rf-scikit",
 metric_definitions=[
 # SageMaker can extract metrics from your console logs via Regular Expressions:
 {"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"},
 ],
 hyperparameters={
 "n_estimators": 100,
 "min_samples_leaf": 3,
 "features": "MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude",
 "target_variable": "target",
 # SageMaker data channels are always folders. Even if you point to a particular object
 # S3URI, you'll need to either: Properly support loading folder inputs in your script; or
 # use extra configuration parameters to identify specific filename(s):
 "train_file": "california_train.csv",
 "test_file": "california_test.csv",
 },
 # Optional settings to run with SageMaker Managed Spot:
 max_run=20*60, # Maximum allowed active runtime (in seconds)
 use_spot_instances=True, # Use spot instances to reduce cost
 max_wait=30*60, # Maximum clock time (including spot delays)
)

In [None]:
sklearn_estimator.fit({"train": train_data_s3uri, "test": test_data_s3uri}, wait=True)

Remember that the training job we ran is very "light" because the dataset is very small. As such, running locally on the notebook instance is faster than running on SageMaker, which takes longer because it first has to provision the training infrastructure. Since this example training job is not very resource-intensive, provisioning adds more overhead than the training itself.

In a real situation, where datasets are large, running on SageMaker can considerably speed up the execution process - and help us optimize costs, by keeping this interactive notebook environment modest and spinning up more powerful training job resources on-demand.

Note that this training job *did not run here on the notebook itself*. You'll be able to see the history in the [AWS Console for SageMaker - Training Jobs tab](https://console.aws.amazon.com/sagemaker/home?#/jobs) and also the [SageMaker Studio Experiments and Trials UI](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-view-compare.html).
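
The `median-AE` metric reported for the job is extracted from the console logs by the regular expression given in `metric_definitions`. As a quick local sanity check of that pattern, the sketch below applies it to a few log lines with Python's `re` module. The sample lines and their numeric values are invented for illustration and are not results from an actual run.

```python
import re

# Same pattern as in the estimator's metric_definitions
pattern = re.compile(r"AE-at-50th-percentile: ([0-9.]+).*$")

# Invented log lines in the format printed by src/main.py (values are placeholders)
sample_logs = [
    "Testing model",
    "AE-at-10th-percentile: 0.0312",
    "AE-at-50th-percentile: 0.2041",
    "AE-at-90th-percentile: 0.7850",
]

median_values = []
for line in sample_logs:
    match = pattern.search(line)
    if match:
        # Group 1 is the captured number that would be recorded as the metric value
        median_values.append(float(match.group(1)))

print(median_values)
```

Only the 50th-percentile line matches, so the other percentiles would need their own entries in `metric_definitions` to be tracked.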

> ℹ️ **Tip:** There's **no need to re-run** a training job if your notebook kernel restarts or the estimator state is lost for some other reason... You can just *attach* to a previous training job by name - for example:
>
> ```python
> estimator = SKLearn.attach("rf-scikit-2025-01-01-00-00-00-000")
> ```

## Deploy to a real-time endpoint

### Deploy with Python SDK

It's possible to deploy a trained `Estimator` to a SageMaker endpoint for real-time inference in one line of code, with `Estimator.deploy(...)` - which implicitly creates a SageMaker [Model](https://console.aws.amazon.com/sagemaker/home?#/models), [Endpoint Configuration](https://console.aws.amazon.com/sagemaker/home?#/endpointConfig), and [Endpoint](https://console.aws.amazon.com/sagemaker/home?#/endpoints).

For more fine-grained control though, you can choose to create a `Model` object through the SageMaker Python SDK - referencing the `model.tar.gz` produced on Amazon S3 by the training job. This would allow us to, for example:

- Modify environment variables or the Python files used between training and inference
- Import a model trained outside SageMaker that's been packaged to a compatible `model.tar.gz` on Amazon S3
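
For the second case, a minimal sketch of the packaging step might look like the following. It assumes the archive only needs the serialized model at its root under the file name `model.joblib`, which is what the `model_fn()` in this notebook loads from `model_dir`. The `package_model` helper and the placeholder file are hypothetical; a real workflow would archive the output of `joblib.dump()` instead.

```python
import os
import tarfile
import tempfile


def package_model(model_file, output_tar):
    """Bundle a serialized model into a gzipped tar, stored as model.joblib at the root."""
    with tarfile.open(output_tar, "w:gz") as tar:
        tar.add(model_file, arcname="model.joblib")
    return output_tar


# Placeholder file standing in for a real joblib.dump() output
workdir = tempfile.mkdtemp()
placeholder = os.path.join(workdir, "my_model.bin")
with open(placeholder, "wb") as f:
    f.write(b"placeholder bytes, not a real model")

archive = package_model(placeholder, os.path.join(workdir, "model.tar.gz"))

# Inspect the archive to confirm the layout
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print(members)
```

The resulting file could then be uploaded to Amazon S3 (for example with `sess.upload_data(...)`) and referenced as `model_data`.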

We'll demonstrate the longer route here:

In [None]:
sklearn_estimator.latest_training_job.wait(logs="None") # Check the job is finished

# describe() here is equivalent to low-level boto3 SageMaker describe_training_job
job_desc = sklearn_estimator.latest_training_job.describe()
model_s3uri = job_desc["ModelArtifacts"]["S3ModelArtifacts"]

print("Model artifact saved at:", model_s3uri)

In [None]:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
 model_data=model_s3uri,
 framework_version="1.0-1",
 py_version="py3",
 role=sagemaker.get_execution_role(),
 entry_point="src/main.py",
)

In [None]:
predictor = model.deploy(
 instance_type="ml.c5.large",
 initial_instance_count=1,
)

### Realtime inference

The [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) class from the SageMaker Python SDK provides a Python wrapper around the endpoint which also handles (configurable) de/serialization of the request and response.

Alternatively for clients which cannot use the SageMaker Python SDK (for example non-Python clients, or Python environments where the PyPI [sagemaker](https://pypi.org/project/sagemaker/) package can't be installed for some reason): The general AWS SDKs can be used to call the lower-level [SageMaker InvokeEndpoint API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html).
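
As a hedged sketch of that lower-level route, the function below sends headerless CSV rows through `invoke_endpoint` on a `sagemaker-runtime` client and returns the raw response text. It assumes the container's default handlers accept `text/csv` input and can return `application/json`. So that the block runs offline, it is exercised against `StubRuntimeClient`, a hypothetical stand-in that returns a canned body rather than a real prediction.

```python
import io


def rows_to_csv(rows):
    """Serialize rows of numbers as headerless CSV text, one record per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke(runtime_client, endpoint_name, rows):
    """Call InvokeEndpoint with a CSV payload and return the response body as text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # Format of the request body
        Accept="application/json",  # Requested response format (assumed to be supported)
        Body=rows_to_csv(rows).encode("utf-8"),
    )
    return response["Body"].read().decode("utf-8")


class StubRuntimeClient:
    """Offline stand-in for boto3.client("sagemaker-runtime"); returns a canned body."""

    def invoke_endpoint(self, **kwargs):
        self.last_request = kwargs
        return {"Body": io.BytesIO(b"[0.0]")}  # Placeholder, not a real prediction


stub = StubRuntimeClient()
print(invoke(stub, "example-endpoint-name", [[1, 2], [3, 4]]))
print(stub.last_request["Body"])

# Against a live endpoint, the client would instead be created with boto3:
#   runtime = boto3.client("sagemaker-runtime")
#   invoke(runtime, predictor.endpoint_name, rows)
```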

In [None]:
# the SKLearnPredictor does the serialization from pandas for us
print(predictor.predict(testX[data.feature_names]))

### Delete endpoint

While training job infrastructure is started on-demand and terminated as soon as the job stops, endpoints are live until we turn them off. Delete unused endpoints to prevent ongoing costs:

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)

## (Optional) Batch inference

Above we saw how you can deploy your trained model to a real-time API. But what if you want to process a whole batch of data at once? There's no need to manually orchestrate sending this data through an endpoint: You can use [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).

Like with training, your input data for a batch transform job needs to be accessible to SageMaker (i.e. uploaded to Amazon S3) and the result will be stored to S3. The compute infrastructure spun up for the job will be released as soon as the data is processed.

Unlike with training, the input data in S3 needs to match the format your model expects for *inference*. This means we'll need to remove `target`, any unused features, and also column headers (although we could have instead overridden [input_fn](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#process-input) to make our model handle more input shapes).
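
For reference, an `input_fn` override along those lines could look roughly like the sketch below, which would sit next to `model_fn()` in `src/main.py`. It is not used in this notebook. It assumes CSV requests only, detects a header row by looking for the `MedInc` column name, and assumes headerless input already contains exactly the eight features in training order. The `FEATURES` constant is introduced here for illustration.

```python
import csv
import io

# Feature columns in the order used for training (illustrative constant)
FEATURES = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]


def input_fn(request_body, request_content_type):
    """Parse a CSV request, with or without a header row, into rows of feature values."""
    if request_content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")

    rows = [row for row in csv.reader(io.StringIO(request_body)) if row]
    if rows and "MedInc" in rows[0]:
        # Header present: select features by name, dropping other columns such as target
        header = rows[0]
        positions = [header.index(name) for name in FEATURES]
        rows = [[row[i] for i in positions] for row in rows[1:]]
    return [[float(value) for value in row] for row in rows]


with_header = ",".join(FEATURES + ["target"]) + "\n1,2,3,4,5,6,7,8,9\n"
without_header = "1,2,3,4,5,6,7,8\n"
print(input_fn(with_header, "text/csv"))
print(input_fn(without_header, "text/csv"))
```

Both calls yield the same single row of eight floats, so either input shape would reach the model in the same form.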

In [None]:
testX[data.feature_names].to_csv("data/transform_input.csv", header=False, index=False)

transform_input_s3uri = sess.upload_data(
 path="data/transform_input.csv", # Local source
 bucket=bucket,
 key_prefix="sm101/sklearn", # Destination path in S3 bucket
)

With the input data uploaded, you're ready to run a transform job using the `model` from before:

In [None]:
transformer = model.transformer(
 instance_count=1,
 instance_type="ml.m5.xlarge",
 # Input Parameters:
 strategy="MultiRecord", # Batch multiple records per request to the endpoint
 max_payload=2, # Max 2MB payload per request
 max_concurrent_transforms=2, # 2 concurrent request threads per instance
 # Output Parameters:
 output_path=f"s3://{bucket}/sm101/sklearn-transforms",
 accept="text/csv", # Request CSV output format
 assemble_with="Line", # Records in CSV output are newline-separated
)

In [None]:
transformer.transform(
 transform_input_s3uri,
 content_type="text/csv", # Input files are CSV format
 split_type="Line", # Interpret each line of the CSV as a separate record
 join_source="Input", # Bring input features through to the output file
 wait=True, # Keep this notebook blocked until the job completes
 logs=True, # Stream logs to the notebook
)

For each input object in S3, Batch Transform will generate a similar object under the output folder with `.out` appended to the file name. In our simple example, there was just one input CSV so there will be one `.csv.out` result file:

In [None]:
job_desc = sm_boto3.describe_transform_job(TransformJobName=transformer.latest_transform_job.name)
output_s3uri = job_desc["TransformOutput"]["S3OutputPath"]

# pd.read_csv() can take an "s3://.../.../" folder, but doesn't like that our Batch Transform
# results have .csv.out extension instead of .csv: So instead manually specify which file we want:
!echo "Output folder contents:" && aws s3 ls {output_s3uri}/

input_filename = transform_input_s3uri.rpartition("/")[2]
output_file_s3uri = f"{output_s3uri}/{input_filename}.out"

print(f"\nReading {output_file_s3uri} from S3")
pd.read_csv(output_file_s3uri, names=data.feature_names + ["prediction"])

## Conclusions

In this notebook, we saw an example of:

- Running your own Scikit-Learn-based model training script as a SageMaker training job, with configurable parameters and output accuracy metrics
- Deploying the trained model to a real-time inference API
- Using the model for batch inference

SageMaker took care of the model serving stack for us with no boilerplate code required: Just define [override functions](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#sagemaker-scikit-learn-model-server) if needed (like `input_fn` and `model_fn`) to customize the default behaviour. At training time, our script read parameters from the command line arguments and environment variables provided through SageMaker - and loaded data from a local folder because download from S3 is taken care of by SageMaker too.

By using the SageMaker APIs (instead of just working locally in the notebook), we can improve the traceability and reproducibility of experiments; optimize our compute resource usage; and accelerate the path from trained model to production deployment.