{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Develop, Train, Optimize and Deploy Scikit-Learn Random Forest\n", "\n", "> *This notebook should work well with the `Python 3 (Data Science)` kernel in SageMaker Studio, or the `conda_python3` kernel in SageMaker Notebook Instances*\n", "\n", "In this notebook we show how to use Amazon SageMaker to develop, train, tune and deploy a Random Forest model based using the popular ML framework [Scikit-Learn](https://scikit-learn.org/stable/index.html).\n", "\n", "The example uses the *California Housing dataset* (provided by Scikit-Learn) - more details of which can be found [here](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html).\n", "\n", "To understand the code, you might also find it useful to refer to:\n", "\n", "* The guide on [Using Scikit-Learn with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html)\n", "* The API doc for [Scikit-Learn classes in the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html)\n", "* The [SageMaker reference for Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#client) (The general AWS SDK for Python, including low-level bindings for SageMaker as well as many other AWS services)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup libraries and environment\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Python Built-Ins:\n", "import datetime\n", "import tarfile\n", "\n", "# External Dependencies:\n", "import boto3\n", "import numpy as np\n", "import pandas as pd\n", "from sagemaker import get_execution_role\n", "import sagemaker\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import fetch_california_housing\n", "\n", "sm_boto3 = boto3.client('sagemaker')\n", "sess = sagemaker.Session()\n", "region = sess.boto_session.region_name\n", "bucket = sess.default_bucket() # this could also be a hard-coded bucket name\n", "\n", "print('Using bucket ' + bucket)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare data\n", "We load a dataset from sklearn, split it and send it to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# we use the California housing dataset \n", "data = fetch_california_housing()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " data.data, data.target, test_size=0.25, random_state=42)\n", "\n", "trainX = pd.DataFrame(X_train, columns=data.feature_names)\n", "trainX['target'] = y_train\n", "\n", "testX = pd.DataFrame(X_test, columns=data.feature_names)\n", "testX['target'] = y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainX.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create directories\n", "! mkdir -p data\n", "! mkdir -p source\n", "! 
mkdir -p model\n", "\n", "# save data as csv\n", "trainX.to_csv('data/california_housing_train.csv')\n", "testX.to_csv('data/california_housing_test.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a training script\n", "\n", "The SageMaker Scikit-Learn Framework Container provides the basic runtime, and we as users specify the actual training steps to run as a script file (or even a folder of several, perhaps including a *requirements.txt* file).\n", "\n", "The below code initializes a `.py` file from here in the notebook.\n", "\n", "The same script can be used at training time (run as a script) and inference time (imported as a module) - So below we:\n", "\n", "- Define some specific functions to override default inference behavior (e.g. `model_fn()`), and\n", "- Enclose the training entry point in an `if __name__ == '__main__'` *guard clause* so it only executes when the module is run as a script.\n", "\n", "You can find detailed guidance in the documentation on [Preparing a Scikit-Learn training script](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#prepare-a-scikit-learn-training-script) (for training) and the [SageMaker Scikit-Learn model server](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#sagemaker-scikit-learn-model-server) (for inference)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile source/sklearn_training_script.py\n", "# Python Built-Ins:\n", "import argparse\n", "import os\n", "\n", "# External Dependencies:\n", "#Joblib is a set of tools to provide lightweight pipelining in Python.\n", "# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, \n", "# along with a large collection of high-level mathematical functions to operate on these arrays.\n", "\n", "# pandas is a software library written for the Python programming language for data manipulation and analysis. 
\n", "#In particular, it offers data structures and operations for manipulating numerical tables and time series.\n", "\n", "import joblib\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.ensemble import RandomForestRegressor\n", "#A random forest is a meta estimator that fits a number of classifying decision trees \n", "#on various sub-samples of the dataset and uses averaging to improve the predictive accuracy \n", "#and control over-fitting.\n", "\n", "# inference functions ---------------\n", "def model_fn(model_dir):\n", " clf = joblib.load(os.path.join(model_dir, \"model.joblib\"))\n", " return clf\n", "\n", "\n", "if __name__ == '__main__':\n", "\n", " #------------------------------- parsing input parameters (from command line)\n", " print('extracting arguments')\n", " parser = argparse.ArgumentParser()\n", "\n", " # RandomForest hyperparameters\n", " parser.add_argument('--n_estimators', type=int, default=10)\n", " parser.add_argument('--min_samples_leaf', type=int, default=3)\n", "\n", " # Data, model, and output directories\n", " parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))\n", " parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))\n", " parser.add_argument('--test_dir', type=str, default=os.environ.get('SM_CHANNEL_TEST'))\n", " parser.add_argument('--train_file', type=str, default='california_housing_train.csv')\n", " parser.add_argument('--test_file', type=str, default='california_housing_test.csv')\n", " parser.add_argument('--features', type=str) # explicitly name which features to use\n", " parser.add_argument('--target_variable', type=str) # explicitly name the column to be used as target\n", "\n", " args, _ = parser.parse_known_args()\n", "\n", " #------------------------------- data preparation\n", " print('reading data')\n", " train_df = pd.read_csv(os.path.join(args.train_dir, args.train_file))\n", " test_df = pd.read_csv(os.path.join(args.test_dir, args.test_file))\n", "\n", " print('building training and testing datasets')\n", " X_train = train_df[args.features.split()]\n", " X_test = test_df[args.features.split()]\n", " y_train = train_df[args.target_variable]\n", " y_test = test_df[args.target_variable]\n", "\n", " #------------------------------- model training\n", " print('training model')\n", " model = RandomForestRegressor(\n", " n_estimators=args.n_estimators,\n", " min_samples_leaf=args.min_samples_leaf,\n", " n_jobs=-1)\n", "\n", " model.fit(X_train, y_train)\n", "\n", " #------------------------------- model testing\n", " print('testing model')\n", " abs_err = np.abs(model.predict(X_test) - y_test)\n", "\n", " # percentile absolute errors\n", " for q in [10, 50, 90]:\n", " print('AE-at-' + str(q) + 'th-percentile: '\n", " + str(np.percentile(a=abs_err, q=q)))\n", " \n", "#Mean Absolute Error is a model evaluation metric used with regression models. \n", "\n", "#------------------------------- save model\n", " path = os.path.join(args.model_dir, \"model.joblib\")\n", " joblib.dump(model, path)\n", " print('model saved at ' + path)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local training\n", "Script arguments allows us to remove from the script any SageMaker-specific configuration, and run locally" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
python source/sklearn_training_script.py \\\n", " --n_estimators 100 \\\n", " --min_samples_leaf 3 \\\n", " --model_dir 'model/' \\\n", " --train_dir 'data/' \\\n", " --test_dir 'data/' \\\n", " --train_file 'california_housing_train.csv' \\\n", " --test_file 'california_housing_test.csv' \\\n", " --features 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude' \\\n", " --target_variable 'target'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating data input channels (copy to S3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# send data to S3. SageMaker will take training data from s3\n", "train_path_s3 = sess.upload_data(\n", " path='data/california_housing_train.csv', # source\n", " bucket=bucket,\n", " key_prefix='sm101/sklearn' # destination path in S3\n", ")\n", "\n", "test_path_s3 = sess.upload_data(\n", " path='data/california_housing_test.csv', # source\n", " bucket=bucket,\n", " key_prefix='sm101/sklearn' # destination path in S3\n", ")\n", "\n", "print('Train set URI:', train_path_s3)\n", "print('Test set URI:', test_path_s3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Launching a training job with the Python SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We use the Estimator from the SageMaker Python SDK\n", "from sagemaker.sklearn.estimator import SKLearn\n", "\n", "sklearn_estimator = SKLearn(\n", " entry_point='source/sklearn_training_script.py',\n", " role=get_execution_role(),\n", " instance_count=1,\n", " instance_type='ml.m5.large',\n", " framework_version='0.23-1',\n", " base_job_name='rf-scikit',\n", " metric_definitions=[\n", " { 'Name': 'median-AE', 'Regex': 'AE-at-50th-percentile: ([0-9.]+).*$' },\n", " ],\n", " hyperparameters={\n", " 'n_estimators': 100,\n", " 'min_samples_leaf': 3,\n", " 'features': 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude',\n", " 'target_variable': 'target',\n", " },\n", " max_run=20*60, # Maximum allowed active runtime (in seconds)\n", " use_spot_instances=True, # Use spot instances to reduce cost\n", " max_wait=30*60, # Maximum clock time (including spot delays)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sklearn_estimator.fit({'train':train_path_s3, 'test': test_path_s3}, wait=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that the training job that we ran is very \"light\", due to the very small dataset. As such, running locally on the notebook instance results in a faster execution time, compared to SageMaker. SageMaker takes longer time to run the job because it has to provision the training infrastructure. Since this example training job not very resource-intensive, the infrastructure provisioning process adds more overhead, compared to the training job itself. \n", "\n", "In a real situation, where datasets are large, running on SageMaker can considerably speed up the execution process - and help us optimize costs, by keeping this interactive notebook environment modest and spinning up more powerful training job resources on-demand.\n", "\n", "Note that this training job *did not run here on the notebook itself*. 
{ "cell_type": "markdown", "metadata": {}, "source": [
 "## Deploy to a real-time endpoint\n",
 "\n",
 "### Deploy with Python SDK\n",
 "\n",
 "It's possible to deploy a trained `Estimator` to a SageMaker endpoint for real-time inference in one line of code, with `Estimator.deploy(...)` - which implicitly creates a SageMaker [Model](https://console.aws.amazon.com/sagemaker/home?#/models), [Endpoint Configuration](https://console.aws.amazon.com/sagemaker/home?#/endpointConfig), and [Endpoint](https://console.aws.amazon.com/sagemaker/home?#/endpoints).\n",
 "\n",
 "For more fine-grained control though, you can choose to create a `Model` object through the SageMaker Python SDK - referencing the `model.tar.gz` produced on Amazon S3 by the training job. This would allow us to, for example:\n",
 "\n",
 "- Modify environment variables or the Python files used between training and inference\n",
 "- Import a model trained outside SageMaker that's been packaged to a compatible `model.tar.gz` on Amazon S3\n",
 "\n",
 "We'll demonstrate the longer route here:"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "sklearn_estimator.latest_training_job.wait(logs='None')  # wait until the job is finished (without streaming logs)\n",
 "\n",
 "model_artifact = sm_boto3.describe_training_job(\n",
 "    TrainingJobName=sklearn_estimator.latest_training_job.name\n",
 ")['ModelArtifacts']['S3ModelArtifacts']\n",
 "\n",
 "print('Model artifact saved at:', model_artifact)"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "from sagemaker.sklearn.model import SKLearnModel\n",
 "\n",
 "model = SKLearnModel(\n",
 "    model_data=model_artifact,\n",
 "    framework_version='0.23-1',\n",
 "    py_version='py3',\n",
 "    role=get_execution_role(),\n",
 "    entry_point='source/sklearn_training_script.py',\n",
 ")"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "predictor = model.deploy(\n",
 "    instance_type='ml.c5.large',\n",
 "    initial_instance_count=1,\n",
 ")"
] },
{ "cell_type": "markdown", "metadata": {}, "source": [
 "### Realtime inference\n",
 "\n",
 "The [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) class from the SageMaker Python SDK provides a Python wrapper around the endpoint which also handles (configurable) de/serialization of the request and response.\n",
 "\n",
 "Alternatively, for clients that cannot use the SageMaker Python SDK (for example non-Python clients, or Python environments where the PyPI [sagemaker](https://pypi.org/project/sagemaker/) package can't be installed for some reason): the general AWS SDKs can be used to call the lower-level [SageMaker InvokeEndpoint API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) - sketched after the `Predictor` examples below."
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "# the SKLearnPredictor does the serialization from pandas for us\n",
 "print(predictor.predict(testX[data.feature_names]))"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "print(data.feature_names)\n",
 "print()  # blank line separator\n",
 "print(testX[data.feature_names])"
] },
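{ "cell_type": "markdown", "metadata": {}, "source": [
 "For reference, here's a minimal sketch of the lower-level route through boto3 - assuming CSV serialization, which the SageMaker Scikit-Learn container's default request handler accepts. The endpoint name is taken from the `predictor` we just created:"
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
 "import io\n",
 "\n",
 "smrt_boto3 = boto3.client('sagemaker-runtime')\n",
 "\n",
 "# Serialize a few test rows to CSV (no header or index) for the request body:\n",
 "csv_buf = io.StringIO()\n",
 "testX[data.feature_names].head().to_csv(csv_buf, header=False, index=False)\n",
 "\n",
 "response = smrt_boto3.invoke_endpoint(\n",
 "    EndpointName=predictor.endpoint_name,\n",
 "    ContentType='text/csv',\n",
 "    Body=csv_buf.getvalue(),\n",
 ")\n",
 "print(response['Body'].read())"
] },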
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the SKLearnPredictor does the serialization from pandas for us\n", "print(predictor.predict(testX[data.feature_names]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(data.feature_names)\n", "print('nl')\n", "print(testX[data.feature_names])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Delete endpoint\n", "\n", "While training job infrastructure is started on-demand and terminated as soon as the job stops, endpoints are live until we turn them off. Delete unused endpoints to prevent ongoing costs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint(delete_endpoint_config=True)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-southeast-2:452832661640:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }