{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Profiler with XGBoost, Linear-Learner and Job Scheduler\n", "Now that we can easily schedule notebooks on the fly, let's see how to set up our own model profiler. In this case, we'll be using the direct marketing dataset used throughout the SageMaker examples to train multiple binary classifiers; XGBoost and Linear Learner are great places to start. Next, we'll spin up endpoints for these models to get some prediction results on our test data set. Finally, we'll consolidate all of the results below into a multi-model ROC curve. Notice that you can run this entire notebook on an ephemeral SageMaker Job, and still get the updated chart below! \n", "\n", "*Important* - To run this notebook, you'll need to download and prepare a train, validation, and test data set. We used the one available below. You're welcome to use a different set if you prefer!\n", "- https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_direct_marketing/xgboost_direct_marketing_sagemaker.ipynb \n", "\n", "You can run this entire notebook step-by-step, or use the notebook scheduler to run it on its own ephemeral instance. To try it out, make sure you have valid parameters specified below for the S3 paths to your train, test, and validation data sets.\n", "\n", "Paste in the base ECR image, which you can find in the ECS/ECR console after you've fully installed the toolkit. Paste in your execution role ARN, and type in the instance type you prefer. This was developed on an ml.m4.xlarge. Then click \"Run Now,\" and it's running on a dedicated host! 
You can also click " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2020-06-11T17:17:06.522655Z", "iopub.status.busy": "2020-06-11T17:17:06.521973Z", "iopub.status.idle": "2020-06-11T17:17:10.768087Z", "shell.execute_reply": "2020-06-11T17:17:10.767465Z" }, "papermill": { "duration": 4.266229, "end_time": "2020-06-11T17:17:10.768225", "exception": false, "start_time": "2020-06-11T17:17:06.501996", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# you won't need to install the sagemaker sdk on Studio, but since we're running this on a base image, we'll need to do that here\n", "!pip install sagemaker\n", "!pip install s3fs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's get our imports and configs set up. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2020-06-11T17:17:10.817664Z", "iopub.status.busy": "2020-06-11T17:17:10.817041Z", "iopub.status.idle": "2020-06-11T17:17:11.676602Z", "shell.execute_reply": "2020-06-11T17:17:11.675937Z" }, "isConfigCell": true, "nbpresent": { "id": "6427e831-8f89-45c0-b150-0b134397d79a" }, "papermill": { "duration": 0.886175, "end_time": "2020-06-11T17:17:11.676733", "exception": false, "start_time": "2020-06-11T17:17:10.790558", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "import os\n", "import time\n", "import boto3\n", "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.session import Session\n", "\n", "role = sagemaker.get_execution_role()\n", "sess = sagemaker.Session()\n", "bucket=sess.default_bucket()\n", "\n", "prefix = 'sagemaker/model-profiler' " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# remember, Papermill will add a new cell just 
below this one, because it is tagged \"parameters\". \n", "# typically you'll overwrite the default variable values here, using the key-value pairs you pass in \n", "s3_prefix = 'sagemaker/model-profiler'\n", "test_relative = 'test/test_data.csv'\n", "train_relative = 'train/train.csv'\n", "validation_relative = 'validation/validation.csv'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# build the full S3 paths here, after the injected parameters cell, so that overridden values take effect\n", "s3_absolute = 's3://{}/{}'.format(bucket, s3_prefix)\n", "s3_test_data = '{}/{}'.format(s3_absolute, test_relative)\n", "s3_train_data = '{}/{}'.format(s3_absolute, train_relative)\n", "s3_validation_data = '{}/{}'.format(s3_absolute, validation_relative)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('pointing to data paths')\n", "print(s3_test_data)\n", "print(s3_train_data)\n", "print(s3_validation_data)" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "71cbcebd-a2a5-419e-8e50-b2bc0909f564" }, "papermill": { "duration": 0.023946, "end_time": "2020-06-11T17:17:14.908820", "exception": false, "start_time": "2020-06-11T17:17:14.884874", "status": "completed" }, "tags": [] }, "source": [ "---\n", "\n", "## Training job and model creation\n", "Assuming you have a valid dataset stored in S3, let's continue. We're going to loop through classifiers, here both XGBoost and Linear Learner. We'll train them both simultaneously, because each is running on its own instance." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "\n", "s3_input_train = sagemaker.s3_input(s3_data=s3_train_data, content_type='text/csv')\n", "s3_input_validation = sagemaker.s3_input(s3_data=s3_validation_data, content_type='text/csv')\n", "\n", "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", "\n", "data_channels = {'train': s3_input_train, 'validation': s3_input_validation}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, here's the fun part. We need a base estimator function that can take the name of a model as a parameter. Then, we're going to loop through our two classifiers and run a training job for both of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2020-06-11T17:17:15.021411Z", "iopub.status.busy": "2020-06-11T17:17:15.016426Z", "iopub.status.idle": "2020-06-11T17:20:26.539017Z", "shell.execute_reply": "2020-06-11T17:20:26.538368Z" }, "nbpresent": { "id": "f3b125ad-a2d5-464c-8cfa-bd203034eee4" }, "papermill": { "duration": 191.559442, "end_time": "2020-06-11T17:20:26.539148", "exception": false, "start_time": "2020-06-11T17:17:14.979706", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "def get_base_estimator(model):\n", "    '''\n", "    This function takes the name of a model, grabs the image from ECR, and builds a base SageMaker estimator\n", "    '''\n", "    image = get_image_uri(boto3.Session().region_name, model)\n", "\n", "    est = sagemaker.estimator.Estimator(image,\n", "                                        role,\n", "                                        train_instance_count=1,\n", "                                        train_instance_type='ml.m5.4xlarge',\n", "                                        train_volume_size=50,\n", "                                        input_mode='File',\n", "                                        output_path=output_location,\n", "                                        sagemaker_session=sess)\n", "    return est\n", "\n", "def run_training_job(clf, is_last):\n", "    '''\n", "    This function takes the name of the classifier you want to train, plus a boolean flag that marks the last model\n", "    We block the Jupyter execution on the last model only, so the earlier jobs are assumed to finish by the time it does\n", "    '''\n", "    est = get_base_estimator(clf)\n", "\n", "    if clf == 'xgboost':\n", "        est.set_hyperparameters(objective=\"binary:logistic\",\n", "                                max_depth=5,\n", "                                eta=0.2,\n", "                                gamma=4,\n", "                                min_child_weight=6,\n", "                                subsample=0.8,\n", "                                silent=0,\n", "                                num_round=100)\n", "\n", "    elif clf == 'linear-learner':\n", "        est.set_hyperparameters(predictor_type='binary_classifier',\n", "                                num_classes=2,\n", "                                mini_batch_size=50)\n", "\n", "    # the wait parameter will halt the local Jupyter env if it's set to True\n", "    # we want this to happen for the last job, but not the previous ones\n", "    est.fit(inputs=data_channels, logs=False, wait=is_last)\n", "\n", "    return est\n", "\n", "def get_estimators(models_to_run):\n", "    '''\n", "    Takes a list of classifiers you want to run\n", "    Runs a training job for each one and returns the list of estimators\n", "    Waits until the last estimator has finished to give you back the Jupyter kernel\n", "    '''\n", "    rt = []\n", "\n", "    for i, model in enumerate(models_to_run):\n", "        est = run_training_job(model, i == len(models_to_run) - 1)\n", "        rt.append(est)\n", "\n", "    return rt\n", "\n", "models_to_run = ['xgboost', 'linear-learner']\n", "estimators = get_estimators(models_to_run)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty fun, right? After that, we'll spin up an endpoint for each model. We'll make sure to tear these down afterward, so the costs stay low." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.predictor import csv_serializer, json_deserializer\n", "\n", "\n", "def get_endpoints(estimators):\n", "    '''\n", "    Takes a list of estimators and turns each into an endpoint.\n", "    Waits until the last one is complete.\n", "    '''\n", "\n", "    rt = []\n", "\n", "    for i, est in enumerate(estimators):\n", "\n", "        # only block on the last deployment, so the endpoints are created in parallel\n", "        endpoint = est.deploy(1, 'ml.m4.xlarge', wait=(i == len(estimators) - 1))\n", "        endpoint.content_type = 'text/csv'\n", "        endpoint.serializer = csv_serializer\n", "        endpoint.deserializer = json_deserializer\n", "\n", "        rt.append(endpoint)\n", "\n", "    return rt\n", "\n", "endpoints = get_endpoints(estimators)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the endpoints online, we'll take our test data and package it as a request up to each endpoint. Then, with the model responses, we can build our multi-model ROC curve." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import s3fs\n", "\n", "\n", "def get_pred_array(array):\n", "    '''\n", "    Takes a list of floats, or features, and converts these into a single string separated by commas\n", "    '''\n", "    return ', '.join(str(x) for x in array)\n", "\n", "\n", "def get_y_pred_matrix(endpoints, test_data):\n", "    '''\n", "    Takes a list of endpoints and a pandas data frame of test data whose last column is the label.\n", "    Loops through the test data, builds prediction arrays,\n", "    hits both endpoints, and consolidates the results.\n", "    Returns a pandas data frame.\n", "    '''\n", "\n", "    rt = []\n", "\n", "    for idx, row in test_data.iterrows():\n", "\n", "        # every column except the last one is a feature\n", "        pred_array = get_pred_array(row[:-1].to_numpy())\n", "\n", "        xgb_prediction = endpoints[0].predict(pred_array)\n", "\n", "        ll_prediction = endpoints[1].predict(pred_array)['predictions'][0]['score']\n", "\n", "        y_true = row['y_yes']\n", "\n", "        new_row = [xgb_prediction, ll_prediction, y_true]\n", "        rt.append(new_row)\n", "\n", "    return pd.DataFrame(rt, columns=['xgboost', 'linear-learner', 'y_true'])\n", "\n", "test_data = pd.read_csv(s3_test_data)\n", "td = test_data.drop(['y_no'], axis=1)\n", "\n", "predictions_df = get_y_pred_matrix(endpoints, td)" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.040481, "end_time": "2020-06-11T17:24:39.046651", "exception": false, "start_time": "2020-06-11T17:24:39.006170", "status": "completed" }, "tags": [] }, "source": [ "Now, let's take that matrix of prediction responses and get our ROC curve." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_curve\n", "\n", "fpr_rt_xgb, tpr_rt_xgb, _ = roc_curve(predictions_df['y_true'], predictions_df['xgboost'])\n", "fpr_rt_ll, tpr_rt_ll, _ = roc_curve(predictions_df['y_true'], predictions_df['linear-learner'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(1)\n", "plt.plot([0, 1], [0, 1], 'k--')\n", "plt.plot(fpr_rt_xgb, tpr_rt_xgb, label='XGBoost')\n", "plt.plot(fpr_rt_ll, tpr_rt_ll, label='Linear Learner')\n", "plt.xlabel('False positive rate')\n", "plt.ylabel('True positive rate')\n", "plt.title('ROC curve')\n", "plt.legend(loc='best')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make sure to run this step and delete your endpoints!\n", "for e in endpoints:\n", "    e.delete_endpoint()" ] } ], "metadata": { "instance_type": "ml.t3.medium", 
"kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the License). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the license file accompanying this file. This file is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.", "papermill": { "duration": 898.758934, "end_time": "2020-06-11T17:32:04.399007", "environment_variables": {}, "exception": null, "input_path": "notebook-2020-06-11-17-12-59.ipynb", "output_path": "/opt/ml/processing/output/Batch Transform - breast cancer prediction with high level SDK-2020-06-11-17-13-02.ipynb", "parameters": {}, "start_time": "2020-06-11T17:17:05.640073", "version": "2.1.1" }, "sagemaker_run_notebook": { "saved_parameters": [ { "name": "s3_prefix", "value": "sagemaker/model-profiler" }, { "name": "test_relative", "value": "test/test_data.csv" }, { "name": "train_relative", "value": "train/train.csv" }, { "name": "validation_relative", "value": "validation/validation.csv" } ] } }, "nbformat": 4, "nbformat_minor": 4 }