{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# BYO Container Example: lightGBM\n", "\n", "In this notebook we'll examine how to BYO container in Amazon SageMaker. This is an option for algorithms and frameworks not directly supported in Amazon SageMaker as either (1) built-in algorithms, or (2) prebuilt Amazon SageMaker containers (such as the ones for TensorFlow, PyTorch, Apache MXNet, Scikit-learn, and XGBoost). As an example, we'll containerize the popular lightGBM gradient boosting framework, which is not supported off-the-shelf in Amazon SageMaker, and apply it to a public dataset from UCI's Machine Learning Repository. The dataset, which relates to predicting purchase intent by online shoppers, is at https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset. \n", "\n", "Besides BYO container, we'll also employ the following features: SageMaker Processing for data preprocessing, SageMaker hosted training, SageMaker Processing for model evaluation/batch scoring, and endpoints for real time inference. More specifically, we'll perform the following steps:\n", "\n", "- Obtain the dataset.\n", "- Build a Docker image for lightGBM to be run as a container in SageMaker Processing.\n", "- Preprocess the data with that image in SageMaker Processing.\n", "- Build a separate Docker image for lightGBM for training models with SageMaker hosted training.\n", "- Train a lightGBM model with that separate Docker image in SageMaker hosted training.\n", "- Evaluate the model / do batch scoring in SageMaker Processing with the same container used for preprocessing.\n", "- Build a separate Docker container for model serving using [Multi-Model Server](https://github.com/awslabs/multi-model-server/).\n", "- Deploy the model to a real time SageMaker endpoind.\n", "\n", "**PREREQUISITES:** Be sure to run this notebook on an instance or machine that supports Docker, with AWS IAM permissions to Amazon S3 and Amazon SageMaker (full access to both is fine for learning purposes). In the notebook below we'll also add access to Amazon ECR.\n", "\n", "**DO NOT SELECT \"RUN ALL\" CELLS**\n", "\n", "\n", "## Setup\n", "\n", "We'll begin with updating version of Sagemaker Python SDK and some imports that will be useful throughout the notebook, and set up some objects and variables we'll need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install --upgrade sagemaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**, that you may need to restart kernel to have changes applied." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import boto3\n", "import sys\n", "import sagemaker\n", "import numpy as np\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()\n", "region = boto3.session.Session().region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "session = sagemaker.Session()\n", "s3_output = session.default_bucket()\n", "s3_prefix = 'lightGBM-BYO'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we move on, let's add access via AWS IAM to Amazon ECR, a fully-managed Docker container registry that makes it easy to store, manage, and deploy Docker container images. 
Later in this notebook we'll create Docker repositories in ECR and push Docker images to them.\n", "\n", "To do this, perform the following steps:\n", "\n", "- In a separate browser tab, open the IAM console: https://console.aws.amazon.com/iam\n", "- In the left panel/tray, click **Roles**.\n", "- Click on the name of your role as printed above (it will be the characters after the right-most \"/\" character).\n", "- Click on the **Attach policies** button.\n", "- In the search box, type **ec2containerregistry**; you should now see a list of policies that include the substring \"AmazonEC2ContainerRegistry\".\n", "- Click the box next to **AmazonEC2ContainerRegistryFullAccess**.\n", "- Click **Attach policy**.\n", "\n", "Continue with the rest of the notebook; your role will be updated almost instantaneously. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtain dataset\n", "\n", "Next we'll download the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p raw\n", "!wget -P ./raw https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the data briefly now, just to confirm it was properly downloaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('./raw/online_shoppers_intention.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target we'd like to predict is the Revenue column, which is `True` if an online purchase transaction was completed. As you might expect, a relatively small number of transactions are actually completed, resulting in a class imbalance we can handle in various ways with lightGBM." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "import matplotlib\n", "from matplotlib import pyplot as plt\n", "\n", "sns.countplot(df['Revenue'])\n", "plt.ylim(0,12000)\n", "plt.xlabel('Transactions Completed', fontsize=14)\n", "plt.ylabel('Count', fontsize=14)\n", "plt.text(x=-.175, y=11000, s='10,422', fontsize=16)\n", "plt.text(x=.875, y=2500, s='1908', fontsize=16)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since Exploratory Data Analysis (EDA) is not the focus of this example, we'll now move on to uploading the raw data to S3 so it can be accessed by SageMaker." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rawdata_s3_prefix = '{}/raw'.format(s3_prefix)\n", "raw_s3 = session.upload_data(path='./raw/', key_prefix=rawdata_s3_prefix)\n", "print(raw_s3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Experiments setup\n", "\n", "SageMaker Experiments allows us to keep track of data preprocessing and model training; organize related models together; and log model configuration, parameters, and metrics so we can reproduce and iterate on previous models and compare models. We'll create a single experiment to keep track of the different approaches we'll use to train the model.\n", "\n", "Each approach or block of preprocessing or training code that we run can be an experiment trial. Later, we'll be able to compare different trials. To start, we'll install the SageMaker Experiments SDK."
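, "\n", "\n", "As a preview of that comparison step: once the trials created later in this notebook have run, their parameters and metrics can be pulled into a single pandas DataFrame with the SageMaker SDK's `ExperimentAnalytics` class, roughly as sketched below (the experiment name is a placeholder, and the sort expression assumes the `validation:loss` metric defined later):\n", "\n", "```python\n", "from sagemaker.analytics import ExperimentAnalytics\n", "\n", "analytics = ExperimentAnalytics(\n", "    experiment_name='lightgbm-1234567890',  # placeholder; use lightgbm_experiment.experiment_name\n", "    sort_by='metrics.validation:loss.min',\n", "    sort_order='Ascending')\n", "analytics.dataframe()\n", "```\n"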
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!{sys.executable} -m pip install sagemaker-experiments requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only a few parameters are required to create the SageMaker Experiments object itself." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.analytics import ExperimentAnalytics\n", "from smexperiments.experiment import Experiment\n", "from smexperiments.trial import Trial\n", "from smexperiments.trial_component import TrialComponent\n", "from smexperiments.tracker import Tracker\n", "import time\n", "\n", "lightgbm_experiment = Experiment.create(\n", " experiment_name=f\"lightgbm-{int(time.time())}\", \n", " description=\"Purchase intent prediction with lightGBM\", \n", " sagemaker_boto_client=boto3.client('sagemaker'))\n", "print(lightgbm_experiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Docker image for data preprocessing and model evaluation\n", "\n", "Before any further steps can be completed for SageMake Processing with lightGBM, we need to build a Docker image. We'll build one image first, and use that same image for multiple purposes:\n", "\n", "- Preprocessing data; and\n", "- Evaluating the model (batch scoring).\n", "\n", "A separate, but very similar, Docker image will be used for training below. \n", "\n", "To begin, we'll create a new directory for Docker-related files and write a Dockerfile." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p docker-proc-evaluate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple Dockerfile can be used to build the container. Of particular note are the following statements in the Dockerfile:\n", "- FROM statement: this sets the parent image. There are many choices, considerations include size (smaller may be better), \"up-to-dateness\", stability, and security. The chosen image is based on a slim version of Debian 10 (\"Buster\"). \n", "- RUN statements: used here primarily to install dependencies. Only a few are required. Note that libgomp1 is a library used by lightgbm, but is not included in this version of Debian.\n", "- ENTRYPOINT statement: specifies the command used to run the scripts that will be included in the container by SageMaker. In our case, they are ordinary Python 3 scripts so the command is simply `python3`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile docker-proc-evaluate/Dockerfile\n", "\n", "FROM python:3.7-slim-buster\n", "RUN apt -y update && apt install -y --no-install-recommends \\\n", " libgomp1 \\\n", " && apt clean \n", "RUN pip3 install lightgbm numpy pandas scikit-learn \n", "ENV PYTHONUNBUFFERED=TRUE\n", "ENTRYPOINT [\"python3\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This block of code builds the image using various Docker commands, creates an Amazon ECR repository, and pushes the image to Amazon ECR." 
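, "\n", "\n", "After the next two cells have run, you can optionally confirm that the push succeeded with a quick boto3 check (a sketch; the repository name matches the one created below):\n", "\n", "```python\n", "import boto3\n", "\n", "ecr = boto3.client('ecr')\n", "images = ecr.describe_images(repositoryName='lightgbm-byo-proc-eval')\n", "print([image.get('imageTags') for image in images['imageDetails']])\n", "```\n"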
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ecr_repository = 'lightgbm-byo-proc-eval'\n", "tag = ':latest'\n", "uri_suffix = 'amazonaws.com'\n", "processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create ECR repository and push docker image\n", "!docker build -t $ecr_repository docker-proc-evaluate\n", "!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)\n", "!aws ecr create-repository --repository-name $ecr_repository\n", "!docker tag {ecr_repository + tag} $processing_repository_uri\n", "!docker push $processing_repository_uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your Docker image has all required dependencies, and enables you to run your own preprocessing, feature engineering, and model evaluation scripts all within the same container in a robust and repeatable way. \n", "\n", "To integrate the image with SageMaker, simply reference it in the SageMaker Python SDK's `ScriptProcessor` class, which lets you execute a command to run your own script inside a container based on this image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ScriptProcessor\n", "\n", "script_processor = ScriptProcessor(command=['python3'],\n", " image_uri=processing_repository_uri,\n", " role=role,\n", " instance_count=1,\n", " instance_type='ml.c5.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocess data with SageMaker Processing\n", "\n", "Some preprocessing should be performed on this dataset before training. For example, the data must be normalized, and split into train and test sets. Below is a preprocessing script. It is an ordinary Python script with very little specific to SageMaker. To comply with SageMaker, the script must read the input data from a specified directory, and save the preprocessed data to certain directories so it can be automatically uploaded to S3 by SageMaker at the end of the job. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile preprocessing.py\n", "\n", "import glob\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "\n", "\n", "if __name__=='__main__':\n", " \n", " input_file = glob.glob('{}/*.csv'.format('/opt/ml/processing/input'))\n", " print('\\nINPUT FILE: \\n{}\\n'.format(input_file)) \n", " df = pd.read_csv(input_file[0])\n", " \n", " # minor preprocessing (drop some uninformative columns etc.)\n", " print('Preprocessing the dataset . . . .') \n", " df_clean = df.drop(['Month','Browser','OperatingSystems','Region','TrafficType','Weekend'], axis=1)\n", " visitor_encoded = pd.get_dummies(df_clean['VisitorType'], prefix='Visitor_Type', drop_first = True)\n", " df_clean_merged = pd.concat([df_clean, visitor_encoded], axis=1).drop(['VisitorType'], axis=1)\n", " X = df_clean_merged.drop('Revenue', axis=1)\n", " y = df_clean_merged['Revenue']\n", " \n", " # split the preprocessed data with stratified sampling for class imbalance\n", " X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2, test_size=.2)\n", "\n", " # save to container directory for uploading to S3\n", " print('Saving the preprocessed dataset . . . 
.') \n", " train_data_output_path = os.path.join('/opt/ml/processing/train', 'x_train.npy')\n", " np.save(train_data_output_path, X_train.to_numpy())\n", " train_labels_output_path = os.path.join('/opt/ml/processing/train', 'y_train.npy')\n", " np.save(train_labels_output_path, y_train.to_numpy()) \n", " test_data_output_path = os.path.join('/opt/ml/processing/test', 'x_test.npy')\n", " np.save(test_data_output_path, X_test.to_numpy())\n", " test_labels_output_path = os.path.join('/opt/ml/processing/test', 'y_test.npy')\n", " np.save(test_labels_output_path, y_test.to_numpy()) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the `ScriptProcessor` object created above can be used to run this `preprocessing.py` script. As mentioned above, the primary requirements are specifying input and output directories. Here, there are two outputs because the transformed train and test data are sent to different folders in S3. We also include an `experiment_config` parameter so this data preprocessing step can be tracked as part of a SageMaker Experiment and added to model lineage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "from time import gmtime, strftime \n", "\n", "processing_job_name = \"lightgbm-byo-process-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", "output_destination = 's3://{}/{}/data'.format(s3_output, s3_prefix)\n", "\n", "script_processor.run(code='preprocessing.py',\n", " job_name=processing_job_name,\n", " inputs=[ProcessingInput(\n", " source=raw_s3,\n", " destination='/opt/ml/processing/input')],\n", " outputs=[ProcessingOutput(output_name='train',\n", " destination='{}/train'.format(output_destination),\n", " source='/opt/ml/processing/train'),\n", " ProcessingOutput(output_name='test',\n", " destination='{}/test'.format(output_destination),\n", " source='/opt/ml/processing/test')],\n", " experiment_config={\n", " \"ExperimentName\": lightgbm_experiment.experiment_name,\n", " \"TrialComponentDisplayName\": \"Processing\",\n", " }\n", " )\n", "\n", "preprocessing_job_description = script_processor.jobs[-1].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the job is complete, it is easy to look up the location of the output in S3. The code below retrieves the S3 URLs of the locations of the transformed train and test data. These will be used as inputs to futher jobs below. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "output_config = preprocessing_job_description['ProcessingOutputConfig']\n", "for output in output_config['Outputs']:\n", " if output['OutputName'] == 'train':\n", " preprocessed_training_data = output['S3Output']['S3Uri']\n", " print(preprocessed_training_data)\n", " if output['OutputName'] == 'test':\n", " preprocessed_test_data = output['S3Output']['S3Uri']\n", " print(preprocessed_test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also can download the preprocessed test data for later use. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session.download_data(path='.', bucket=s3_output, key_prefix=s3_prefix+'/data/test/x_test.npy')\n", "session.download_data(path='.', bucket=s3_output, key_prefix=s3_prefix+'/data/test/y_test.npy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During the SageMaker Processing job, a SageMaker Experiments trial component was associated with it so we can include it when tracking model lineage. We can inspect all of the information automatically logged by the Experiment Tracker during the job:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for trial in lightgbm_experiment.list_trials():\n", " proc_job = trial\n", " break\n", " \n", "lightgbm_tracker = Tracker.load(proc_job.trial_name)\n", "preprocessing_trial_component = lightgbm_tracker.trial_component\n", "print(preprocessing_trial_component)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train a model with lightGBM\n", "\n", "There are multiple different ways to train a model in SageMaker. One of the simplest ways to do so is to reuse the same container from above within SageMaker Processing itself to do the training. This is possible due to the fact that the `ScriptProcessor` object we instantiated above can ingest an arbitrary Python script as long as we specify the input and output locations in S3. \n", "\n", "An alternative is to use SageMaker hosted training. Like SageMaker Processing, SageMaker Training spins up a right-sized, transient cluster for your job and then shuts it down when the job is done. This enables you to do most of your work in lower-cost notebooks while reserving full scale training and related costs for only when you need it. Using SageMaker hosted training offers several advantages over SageMaker Processing for training. These include easy integrations with: SageMaker Debugger, SageMaker Experiments, SageMaker Search, Managed Spot Training, Automatic Model Tuning, options for multiple file sources/channels with automated data shuffling and sharding, and more. \n", "\n", "To use SageMaker hosted training, we'll create another simple Docker image. We'll create another directory first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p docker-train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Dockerfile for training is similar to the first one, with a few key differences: \n", "- The parent image is from another ML framework's Docker image that bundles a bunch of necessary low-level build tools for the sagemaker-containers package (see next bullet point). Another parent with those tools could be substituted.\n", "- There is one additional Python package: sagemaker-containers, which integrates the container with SageMaker hosted training.\n", "- An environment variable indicating which Python module is the entry point for training.\n", "\n", "Note that you do NOT need to include the training script in the Docker image. The sagemaker-containers package allows you to pass in a training script from an Amazon S3 location dynamically each time you start a training job, so you can reuse the same Docker image without rebuilding it for code changes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile docker-train/Dockerfile\n", "\n", "FROM python:3.7-slim-buster\n", "RUN apt -y update && apt install -y --no-install-recommends \\\n", " libgomp1 build-essential \\\n", " && apt clean \n", "RUN pip install lightgbm numpy pandas scikit-learn sagemaker-training\n", "ENV SAGEMAKER_PROGRAM train.py\n", "ENV PYTHONUNBUFFERED=TRUE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll create a separate ECR repository for the training images, build the new training image, and push it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ecr_repository_train = 'lightgbm-byo-train'\n", "uri_suffix = 'amazonaws.com'\n", "train_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository_train + tag)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create ECR repository and push docker image\n", "!docker build -t $ecr_repository_train docker-train\n", "!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)\n", "!aws ecr create-repository --repository-name $ecr_repository_train\n", "!docker tag {ecr_repository_train + tag} $train_repository_uri\n", "!docker push $train_repository_uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is the training script. Again, it is very similar to a Python script you would use outside SageMaker, and the main SageMaker-specific requirements are that you must specify several arguments from which you will extract hyperparameters such as the learning rate. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile docker-train/train.py\n", "\n", "import argparse\n", "import glob\n", "import lightgbm as lgb\n", "import numpy as np\n", "import os\n", "\n", "\n", "if __name__=='__main__':\n", " \n", " # extract training data S3 location and hyperparameter values\n", " parser = argparse.ArgumentParser()\n", " parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])\n", " parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])\n", " parser.add_argument('--num_leaves', type=int, default=28)\n", " parser.add_argument('--max_depth', type=int, default=5)\n", " parser.add_argument('--learning_rate', type=float, default=0.1)\n", " args = parser.parse_args()\n", " \n", " print('Loading training data from {}\\n'.format(args.train))\n", " input_files = glob.glob('{}/*.npy'.format(args.train))\n", " print('\\nTRAINING INPUT FILE LIST: \\n{}\\n'.format(input_files)) \n", " for file in input_files:\n", " if 'x_' in file:\n", " x_train = np.load(file)\n", " else:\n", " y_train = np.load(file) \n", " print('\\nx_train shape: \\n{}\\n'.format(x_train.shape))\n", " print('\\ny_train shape: \\n{}\\n'.format(y_train.shape))\n", " train_data = lgb.Dataset(x_train, label=y_train)\n", " \n", " print('Loading validation data from {}\\n'.format(args.validation))\n", " eval_input_files = glob.glob('{}/*.npy'.format(args.validation))\n", " print('\\nVALIDATION INPUT FILE LIST: \\n{}\\n'.format(eval_input_files)) \n", " for file in eval_input_files:\n", " if 'x_' in file:\n", " x_val = np.load(file)\n", " else:\n", " y_val = np.load(file) \n", " print('\\nx_val shape: \\n{}\\n'.format(x_val.shape))\n", " print('\\ny_val shape: \\n{}\\n'.format(y_val.shape))\n", " eval_data = lgb.Dataset(x_val, 
label=y_val)\n", " \n", " print('Training model with hyperparameters:\\n\\t num_leaves: {}\\n\\t max_depth: {}\\n\\t learning_rate: {}\\n'\n", " .format(args.num_leaves, args.max_depth, args.learning_rate))\n", " parameters = {\n", " 'objective': 'binary',\n", " 'metric': 'binary_logloss',\n", " 'is_unbalance': 'true',\n", " 'boosting': 'gbdt',\n", " 'num_leaves': args.num_leaves,\n", " 'max_depth': args.max_depth,\n", " 'learning_rate': args.learning_rate,\n", " 'verbose': 1\n", " }\n", " num_round = 10\n", " bst = lgb.train(parameters, train_data, num_round, eval_data, verbose_eval=1)\n", " \n", " print('Saving model . . . .')\n", " bst.save_model('/opt/ml/model/online_shoppers_model.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training script must be packaged as a .tar.gz file and uploaded to S3 for access by SageMaker. This step must be repeated every time the script is modified, but avoids having to rebuild the Docker image for code changes: you can just reuse the same Docker image with any lightGBM training script." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "import os\n", "\n", "def create_tar_file(source_files, target=None):\n", " if target:\n", " filename = target\n", " else:\n", " _, filename = tempfile.mkstemp()\n", "\n", " with tarfile.open(filename, mode=\"w:gz\") as t:\n", " for sf in source_files:\n", " t.add(sf, arcname=os.path.basename(sf))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_tar_file([\"docker-train/train.py\"], \"sourcedir.tar.gz\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sources = session.upload_data('sourcedir.tar.gz', s3_output, s3_prefix + '/code')\n", "print(sources)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With our training script in Amazon S3, we can now set up an Amazon SageMaker Estimator object to represent the actual training job. Similarly to the ScriptProcessor object, the Estimator takes in as parameters the Docker image, and instance type and amount. Additionally, it takes in an encoded dictionary of hyperparameters for training. For `train_instance_type` we specify `local`: this allows us during the prototyping phase of a project to test SageMaker training code locally on the instance running this code. Later in this example we will switch to a SageMaker instance type when we start multiple training jobs in parallel to find the best model.\n", "\n", "### Train model locally on Notebook Instance\n", "\n", "We will use Sagemaker Estimator class to instantiate training job which allows to define various parameters, such as algorithm hyperparameters, number and type of training instances, data input and output configuration.\n", "\n", "First, we'll try to train our model on local notebook instance. For this, we define `instance_type` as `local`." 
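, "\n", "\n", "One detail worth noting before the next cell: SageMaker passes every hyperparameter value to a training job as a string, which is why the values are JSON-encoded first; the sagemaker-training toolkit then decodes them inside the container and hands them to `train.py` as command-line arguments (which is why the script parses them with `argparse`). A small sketch of the encoding step:\n", "\n", "```python\n", "import json\n", "\n", "# Non-string values are JSON-encoded so they survive the string-only hyperparameter API.\n", "print(json.dumps(0.08))        # -> 0.08\n", "print(json.dumps('train.py'))  # -> \"train.py\"\n", "```\n"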
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.estimator import Estimator\n", "import json\n", "\n", "def json_encode_hyperparameters(hyperparameters):\n", " return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}\n", "\n", "hyperparameters = json_encode_hyperparameters({\n", " \"sagemaker_program\": \"train.py\",\n", " \"sagemaker_submit_directory\": sources,\n", " 'num_leaves': 32,\n", " 'max_depth': 3,\n", " 'learning_rate': 0.08})\n", "\n", "estimator = Estimator(image_uri=train_repository_uri,\n", " role=role,\n", " instance_count=1,\n", " instance_type='local', # training job will run locally\n", " hyperparameters=hyperparameters,\n", " base_job_name='lightgbm-byo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fit` method invocation starts the training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.fit({'train': preprocessed_training_data, 'validation': preprocessed_test_data})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily download the trained model, whether for further use inside of Amazon SageMaker or anywhere else." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 cp {estimator.model_data} ./model/model.tar.gz\n", "!tar -xvzf ./model/model.tar.gz -C ./model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll upload the unzipped version of the model back to Amazon S3 for use by SageMaker Processing in model evaluation / batch scoring. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_model = session.upload_data('./model/online_shoppers_model.txt', s3_output, s3_prefix + '/model')\n", "print(s3_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evalutate the model / batch scoring\n", "\n", "Next we can reuse the Docker image from data preprocessing for model evaluation, or batch scoring. Below is the evaluation script. This time the main SageMaker-specific requirement is specifying an input directory. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile evaluation.py\n", "\n", "import glob\n", "import lightgbm as lgb\n", "import numpy as np\n", "from sklearn.metrics import accuracy_score, roc_auc_score\n", "\n", "\n", "if __name__=='__main__':\n", " \n", " print('Loading data . . . .')\n", " input_files = glob.glob('{}/*.npy'.format('/opt/ml/processing/input'))\n", " print('\\nINPUT FILE LIST: \\n{}\\n'.format(input_files)) \n", " for file in input_files:\n", " if 'x_' in file:\n", " x_test = np.load(file)\n", " else:\n", " y_test = np.load(file)\n", " \n", " print('\\nx_test shape: \\n{}\\n'.format(x_test.shape))\n", " print('\\ny_test shape: \\n{}\\n'.format(y_test.shape))\n", " \n", " print('Loading model . . . .\\n') \n", " model_path = '/opt/ml/processing/model/'\n", " bst_loaded = lgb.Booster(model_file=model_path+'online_shoppers_model.txt')\n", " y_pred = bst_loaded.predict(x_test)\n", " \n", " print('Evaluating model . . . 
.\\n') \n", " acc = accuracy_score(y_test.astype(int), y_pred.round(0).astype(int))\n", " auc = roc_auc_score(y_test, y_pred)\n", " print('Accuracy: {:.2f}'.format(acc))\n", " print('AUC Score: {:.2f}'.format(auc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll also reuse the `ScriptProcessor` object we instantiated above, this time for the evaluation script. Instead of having two outputs, as in the preprocessing job, there are two inputs: one for the input data, and another for the model artifact to be used in the evaluation. At the end of the job, we'll log the accuracy and AUC score metrics. We could also have stored evaluation results in a file, or even saved visualization graphics, and asked SageMaker Processing to upload those to S3 at the end of the job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processing_job_name = \"lightgbm-byo-eval-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", "output_destination = 's3://{}/{}/eval'.format(s3_output, s3_prefix)\n", "\n", "script_processor.run(code='evaluation.py',\n", " job_name=processing_job_name,\n", " inputs=[ProcessingInput(\n", " source=preprocessed_test_data,\n", " destination='/opt/ml/processing/input'),\n", " ProcessingInput(\n", " source=s3_model,\n", " destination='/opt/ml/processing/model')],\n", " outputs=[ProcessingOutput(output_name='eval',\n", " destination=output_destination,\n", " source='/opt/ml/processing/eval')]\n", " )\n", "\n", "eval_job_description = script_processor.jobs[-1].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parallel training jobs: Experiment with maximum tree depth\n", "\n", "Using the SageMaker Experiment we created earlier to track results, we will now experiment with the maximum tree depth hyperparameter of lightGBM. To do this, we will start multiple SageMaker hosted training jobs in parallel with different values of the `max_depth` lightGBM hyperparameter. Note that SageMaker also has an Automatic Model Tuning feature that enables you to do an automated, informed search over multiple hyperparameters using strategies such as Bayesian Optimization (which is the default). \n", "\n", "In the code below, note that we are again attaching an `experiment_config` to each job to automatically track results, and that we have defined an objective metric (validation loss) to track in the `metric_definitions` parameter of each training job that is launched. \n", "\n", "\n", "### Training the model on SageMaker\n", "\n", "Note that, unlike previously, we will train our models on a remote SageMaker Training cluster, so 
`instance_type` this time is `ml.c5.xlarge`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trial_name_map = {}\n", "\n", "for i, max_depth in enumerate([3, 6, 9, 12]):\n", " # create trial\n", " trial_name = f\"lightgbm-training-depth-{max_depth}-{int(time.time())}\"\n", " trial = Trial.create(\n", " trial_name=trial_name, \n", " experiment_name=lightgbm_experiment.experiment_name,\n", " sagemaker_boto_client=boto3.client('sagemaker'),\n", " )\n", " trial_name_map[max_depth] = trial_name\n", " # associate the preprocessing trial component with the current trial\n", " trial.add_trial_component(preprocessing_trial_component)\n", " \n", " hyperparameters = json_encode_hyperparameters({ \"sagemaker_program\": \"train.py\",\n", " \"sagemaker_submit_directory\": sources,\n", " 'num_leaves': 32,\n", " 'max_depth': max_depth,\n", " 'learning_rate': 0.08 })\n", "\n", " estimator = Estimator(image_uri=train_repository_uri,\n", " role=role,\n", " instance_count=1,\n", " instance_type='ml.c5.xlarge',\n", " hyperparameters=hyperparameters,\n", " enable_sagemaker_metrics=True,\n", " metric_definitions=[\n", " {'Name':'validation:loss', 'Regex':'.*loss: ([0-9\\\\.]+)'}\n", " ]\n", " )\n", " \n", " training_job_name = f\"lightgbm-training-depth-{max_depth}-{int(time.time())}\"\n", " # Now associate the estimator with the Experiment and Trial\n", " estimator.fit(\n", " inputs={'train': preprocessed_training_data, 'validation': preprocessed_test_data}, \n", " job_name=training_job_name,\n", " experiment_config={\n", " \"TrialName\": trial.trial_name,\n", " \"TrialComponentDisplayName\": \"Training\",\n", " },\n", " wait=False,\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**BEFORE CONTINUING WITH THE REST OF THIS NOTEBOOK DO THE FOLLOWING:**\n", "\n", "Go to the SageMaker console, and in the left panel click **Training jobs**. You should see multiple training jobs with names of the form `lightgbm-training-depth--