{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Develop, Train, Optimize and Deploy Scikit-Learn Random Forest\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "* Doc https://sagemaker.readthedocs.io/en/stable/using_sklearn.html\n", "* SDK https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html\n", "* boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client\n", "\n", "In this notebook we show how to use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn based ML model (Random Forest). More info on Scikit-Learn can be found here https://scikit-learn.org/stable/index.html. We use the California Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:\n", "\n", "> Pace, R. Kelley, and Ronald Barry. 
"Sparse spatial autoregressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", " \n", "**This sample is provided for demonstration purposes; make sure to conduct appropriate testing if adapting this code for your own use cases!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import time\n", "import tarfile\n", "\n", "import boto3\n", "import pandas as pd\n", "import numpy as np\n", "from sagemaker import get_execution_role\n", "import sagemaker\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import fetch_california_housing\n", "\n", "\n", "sm_boto3 = boto3.client(\"sagemaker\")\n", "\n", "sess = sagemaker.Session()\n", "\n", "region = sess.boto_session.region_name\n", "\n", "bucket = sess.default_bucket() # this could also be a hard-coded bucket name\n", "\n", "print(\"Using bucket \" + bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare data\n", "We load the California Housing dataset from sklearn, split it, and upload it to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we use the California housing dataset\n", "data = fetch_california_housing()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " data.data, data.target, test_size=0.25, random_state=42\n", ")\n", "\n", "trainX = pd.DataFrame(X_train, columns=data.feature_names)\n", "trainX[\"target\"] = y_train\n", "\n", "testX = pd.DataFrame(X_test, columns=data.feature_names)\n", "testX[\"target\"] = y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainX.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainX.to_csv(\"california_housing_train.csv\")\n", "testX.to_csv(\"california_housing_test.csv\")" ] }, { 
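"cell_type": "markdown", "metadata": {}, "source": [ "As an optional sanity check, we can read one of the CSV files back and confirm its shape and columns before uploading. This cell is illustrative only; the `check` variable is not used elsewhere in the notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# optional sanity check: re-read the training CSV and inspect it\n", "check = pd.read_csv(\"california_housing_train.csv\")\n", "print(check.shape)\n", "print(list(check.columns))" ] }, {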
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# send data to S3; SageMaker will read training data from S3\n", "trainpath = sess.upload_data(\n", " path=\"california_housing_train.csv\", bucket=bucket, key_prefix=\"sagemaker/sklearncontainer\"\n", ")\n", "\n", "testpath = sess.upload_data(\n", " path=\"california_housing_test.csv\", bucket=bucket, key_prefix=\"sagemaker/sklearncontainer\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing a *Script Mode* script\n", "The script below contains both training and inference functionality, and can run both on SageMaker training hardware and locally (desktop, SageMaker notebook, on premises, etc.). Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile script.py\n", "\n", "import argparse\n", "import joblib\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.ensemble import RandomForestRegressor\n", "\n", "\n", "# inference functions ---------------\n", "def model_fn(model_dir):\n", " clf = joblib.load(os.path.join(model_dir, \"model.joblib\"))\n", " return clf\n", "\n", "\n", "if __name__ == \"__main__\":\n", " print(\"extracting arguments\")\n", " parser = argparse.ArgumentParser()\n", "\n", " # hyperparameters sent by the client are passed as command-line arguments to the script.\n", " # to simplify the demo we don't use all sklearn RandomForest hyperparameters\n", " parser.add_argument(\"--n-estimators\", type=int, default=10)\n", " parser.add_argument(\"--min-samples-leaf\", type=int, default=3)\n", "\n", " # Data, model, and output directories\n", " parser.add_argument(\"--model-dir\", type=str, default=os.environ.get(\"SM_MODEL_DIR\"))\n", " parser.add_argument(\"--train\", type=str, default=os.environ.get(\"SM_CHANNEL_TRAIN\"))\n", " 
parser.add_argument(\"--test\", type=str, default=os.environ.get(\"SM_CHANNEL_TEST\"))\n", " parser.add_argument(\"--train-file\", type=str, default=\"california_housing_train.csv\")\n", " parser.add_argument(\"--test-file\", type=str, default=\"california_housing_test.csv\")\n", " parser.add_argument(\n", " \"--features\", type=str\n", " ) # in this script we ask the user to explicitly name features\n", " parser.add_argument(\n", " \"--target\", type=str\n", " ) # in this script we ask the user to explicitly name the target\n", "\n", " args, _ = parser.parse_known_args()\n", "\n", " print(\"reading data\")\n", " train_df = pd.read_csv(os.path.join(args.train, args.train_file))\n", " test_df = pd.read_csv(os.path.join(args.test, args.test_file))\n", "\n", " print(\"building training and testing datasets\")\n", " X_train = train_df[args.features.split()]\n", " X_test = test_df[args.features.split()]\n", " y_train = train_df[args.target]\n", " y_test = test_df[args.target]\n", "\n", " # train\n", " print(\"training model\")\n", " model = RandomForestRegressor(\n", " n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1\n", " )\n", "\n", " model.fit(X_train, y_train)\n", "\n", " # print abs error\n", " print(\"validating model\")\n", " abs_err = np.abs(model.predict(X_test) - y_test)\n", "\n", " # print a couple of performance metrics\n", " for q in [10, 50, 90]:\n", " print(\"AE-at-\" + str(q) + \"th-percentile: \" + str(np.percentile(a=abs_err, q=q)))\n", "\n", " # persist model\n", " path = os.path.join(args.model_dir, \"model.joblib\")\n", " joblib.dump(model, path)\n", " print(\"model persisted at \" + path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local training\n", "Script arguments allow us to keep any SageMaker-specific configuration out of the script, so it can also run locally." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!
python script.py --n-estimators 100 \\\n", " --min-samples-leaf 2 \\\n", " --model-dir ./ \\\n", " --train ./ \\\n", " --test ./ \\\n", " --features 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude' \\\n", " --target target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Launching a training job with the Python SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We use the Estimator from the SageMaker Python SDK\n", "from sagemaker.sklearn.estimator import SKLearn\n", "\n", "FRAMEWORK_VERSION = \"0.23-1\"\n", "\n", "sklearn_estimator = SKLearn(\n", " entry_point=\"script.py\",\n", " role=get_execution_role(),\n", " instance_count=1,\n", " instance_type=\"ml.c5.xlarge\",\n", " framework_version=FRAMEWORK_VERSION,\n", " base_job_name=\"rf-scikit\",\n", " metric_definitions=[{\"Name\": \"median-AE\", \"Regex\": \"AE-at-50th-percentile: ([0-9.]+).*$\"}],\n", " hyperparameters={\n", " \"n-estimators\": 100,\n", " \"min-samples-leaf\": 3,\n", " \"features\": \"MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude\",\n", " \"target\": \"target\",\n", " },\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# launch training job (wait=True blocks until completion; set wait=False for an asynchronous call)\n", "sklearn_estimator.fit({\"train\": trainpath, \"test\": testpath}, wait=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternative: launching a training job with `boto3`\n", "`boto3` is more verbose, yet gives more visibility into the low-level details of Amazon SageMaker." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first compress the code and send to S3\n", "\n", "source = \"source.tar.gz\"\n", "project = \"scikitlearn-train-from-boto3\"\n", "\n", "tar = tarfile.open(source, \"w:gz\")\n", 
"tar.add(\"script.py\")\n", "tar.close()\n", "\n", "s3 = boto3.client(\"s3\")\n", "s3.upload_file(source, bucket, project + \"/\" + source)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using `boto3` to launch a training job, we must explicitly point to a Docker image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris\n", "\n", "\n", "training_image = image_uris.retrieve(\n", " framework=\"sklearn\",\n", " region=region,\n", " version=FRAMEWORK_VERSION,\n", " py_version=\"py3\",\n", " instance_type=\"ml.c5.xlarge\",\n", ")\n", "print(training_image)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# launch training job\n", "\n", "response = sm_boto3.create_training_job(\n", " TrainingJobName=\"sklearn-boto3-\" + datetime.datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\"),\n", " HyperParameters={\n", " # hyperparameter names must match the argparse flags in script.py (dashes, not underscores)\n", " \"n-estimators\": \"300\",\n", " \"min-samples-leaf\": \"3\",\n", " \"sagemaker_program\": \"script.py\",\n", " \"features\": \"MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude\",\n", " \"target\": \"target\",\n", " \"sagemaker_submit_directory\": \"s3://\" + bucket + \"/\" + project + \"/\" + source,\n", " },\n", " AlgorithmSpecification={\n", " \"TrainingImage\": training_image,\n", " \"TrainingInputMode\": \"File\",\n", " \"MetricDefinitions\": [\n", " {\"Name\": \"median-AE\", \"Regex\": \"AE-at-50th-percentile: ([0-9.]+).*$\"},\n", " ],\n", " },\n", " RoleArn=get_execution_role(),\n", " InputDataConfig=[\n", " {\n", " \"ChannelName\": \"train\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": trainpath,\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " },\n", " {\n", " \"ChannelName\": \"test\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": testpath,\n", " 
\"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " },\n", " ],\n", " OutputDataConfig={\"S3OutputPath\": \"s3://\" + bucket + \"/sagemaker-sklearn-artifact/\"},\n", " ResourceConfig={\"InstanceType\": \"ml.c5.xlarge\", \"InstanceCount\": 1, \"VolumeSizeInGB\": 10},\n", " StoppingCondition={\"MaxRuntimeInSeconds\": 86400},\n", " EnableNetworkIsolation=False,\n", ")\n", "\n", "print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Launching a tuning job with the Python SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we use the Hyperparameter Tuner\n", "from sagemaker.tuner import IntegerParameter\n", "\n", "# Define exploration boundaries\n", "hyperparameter_ranges = {\n", " \"n-estimators\": IntegerParameter(20, 100),\n", " \"min-samples-leaf\": IntegerParameter(2, 6),\n", "}\n", "\n", "# create Optimizer\n", "Optimizer = sagemaker.tuner.HyperparameterTuner(\n", " estimator=sklearn_estimator,\n", " hyperparameter_ranges=hyperparameter_ranges,\n", " base_tuning_job_name=\"RF-tuner\",\n", " objective_type=\"Minimize\",\n", " objective_metric_name=\"median-AE\",\n", " metric_definitions=[\n", " {\"Name\": \"median-AE\", \"Regex\": \"AE-at-50th-percentile: ([0-9.]+).*$\"}\n", " ], # extract tracked metric from logs with regexp\n", " max_jobs=10,\n", " max_parallel_jobs=2,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Optimizer.fit({\"train\": trainpath, \"test\": testpath})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get tuner results in a df\n", "results = Optimizer.analytics().dataframe()\n", "while results.empty:\n", " time.sleep(1)\n", " results = Optimizer.analytics().dataframe()\n", "results.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy to a real-time endpoint" ] }, { "cell_type": "markdown", "metadata": {}, 
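"source": [ "Before creating an endpoint, we can retrieve the name of the best training job found by the tuner. A minimal, optional sketch (assuming the tuning job above has completed); `best_job` is only printed here, not used later in the notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# name of the best training job found by the tuner\n", "best_job = Optimizer.best_training_job()\n", "print(best_job)" ] }, { "cell_type": "markdown", "metadata": {}, 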
"source": [ "### Deploy with Python SDK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An `Estimator` can be deployed directly after training with `Estimator.deploy()`, but here we showcase the more extensive process of creating a model from S3 artifacts. This process can also be used to deploy a model that was trained in a different session, or even outside of SageMaker." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sklearn_estimator.latest_training_job.wait(logs=\"None\")\n", "artifact = sm_boto3.describe_training_job(\n", " TrainingJobName=sklearn_estimator.latest_training_job.name\n", ")[\"ModelArtifacts\"][\"S3ModelArtifacts\"]\n", "\n", "print(\"Model artifact persisted at \" + artifact)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.sklearn.model import SKLearnModel\n", "\n", "model = SKLearnModel(\n", " model_data=artifact,\n", " role=get_execution_role(),\n", " entry_point=\"script.py\",\n", " framework_version=FRAMEWORK_VERSION,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = model.deploy(instance_type=\"ml.c5.large\", initial_instance_count=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Invoke with the Python SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the SKLearnPredictor does the serialization from pandas for us\n", "print(predictor.predict(testX[data.feature_names]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alternative: invoke with `boto3`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "runtime = boto3.client(\"sagemaker-runtime\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Option 1: `csv` serialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [ "# csv serialization\n", "response = runtime.invoke_endpoint(\n", " EndpointName=predictor.endpoint_name,\n", " Body=testX[data.feature_names].to_csv(header=False, index=False).encode(\"utf-8\"),\n", " ContentType=\"text/csv\",\n", ")\n", "\n", "print(response[\"Body\"].read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Option 2: `npy` serialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# npy serialization\n", "from io import BytesIO\n", "\n", "\n", "# Serialize the numpy ndarray as bytes\n", "buffer = BytesIO()\n", "# Assuming testX is a pandas DataFrame\n", "np.save(buffer, testX[data.feature_names].values)\n", "\n", "response = runtime.invoke_endpoint(\n", " EndpointName=predictor.endpoint_name, Body=buffer.getvalue(), ContentType=\"application/x-npy\"\n", ")\n", "\n", "print(response[\"Body\"].read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Don't forget to delete the endpoint!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm_boto3.delete_endpoint(EndpointName=predictor.endpoint_name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This us-east-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-python-sdk|scikit_learn_randomforest|Sklearn_on_SageMaker_end2end.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 2 }