{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Targeting Direct Marketing with Features Store and Amazon SageMaker XGBoost\n", "_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_\n", "\n", "---\n", "\n", "---\n", "\n", "## Contents\n", "\n", "1. [Background](#Background)\n", "1. [Prepration](#Preparation)\n", "1. [Data](#Data)\n", " 1. [Exploration](#Exploration)\n", " 1. [Transformation](#Transformation)\n", "1. [Training](#Training)\n", "1. [Hosting](#Hosting)\n", "1. [Evaluation](#Evaluation)\n", "1. [Extensions](#Extensions)\n", "\n", "---\n", "\n", "## Background\n", "Direct marketing, either through mail, email, phone, etc., is a common tactic to acquire customers. Because resources and a customer's attention is limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer. Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.\n", "\n", "This notebook presents an example problem to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls. The steps include:\n", "\n", "* Preparing your Amazon SageMaker notebook\n", "* Downloading data from the internet into Amazon SageMaker\n", "* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms\n", "* Prepare the data to train an XGBoost model.\n", "* Estimating a model using the Gradient Boosting algorithm\n", "* Evaluating the effectiveness of the model\n", "* Setting the model up to make on-going predictions\n", "\n", "---\n", "\n", "## Preparation\n", "\n", "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true }, "outputs": [], "source": [ "# cell 01\n", "import sagemaker\n", "bucket = sagemaker.Session().default_bucket()\n", "prefix = 'sagemaker/DEMO-xgboost-dm'\n", "\n", "# Define IAM role\n", "import boto3\n", "import re\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's bring in the Python libraries that we'll use throughout the analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 02\n", "import numpy as np # For matrix operations and numerical processing\n", "import pandas as pd # For munging tabular data\n", "import matplotlib.pyplot as plt # For charts and visualizations\n", "from IPython.display import Image # For displaying images in the notebook\n", "from IPython.display import display # For displaying outputs in the notebook\n", "from time import gmtime, strftime # For labeling SageMaker models, endpoints, etc.\n", "import sys # For writing outputs to notebook\n", "import math # For ceiling function\n", "import json # For parsing hosting outputs\n", "import os # For manipulating filepath names\n", "import sagemaker # Amazon SageMaker's Python SDK provides many helper functions\n", "import zipfile # For reading Zip archives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Data\n", "\n", "\\\[Moro et al., 2014\\\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transformation\n", "\n", "The transformation steps were performed in SageMaker Data Wrangler, and the resulting features were stored in SageMaker Feature Store." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Get the data from Feature Store**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 03\n", "from sagemaker.session import Session\n", "from sagemaker.feature_store.feature_group import FeatureGroup\n", "\n", "region = boto3.Session().region_name\n", "boto_session = boto3.Session(region_name=region)\n", "\n", "sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)\n", "featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)\n", "\n", "feature_store_session = Session(\n", " boto_session=boto_session,\n", " sagemaker_client=sagemaker_client,\n", " sagemaker_featurestore_runtime_client=featurestore_runtime\n", ")\n", "\n", "# Replace this with the name of the feature group you created earlier\n", "feature_group_name = \"YOUR FEATURE GROUP NAME\"\n", "feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=feature_store_session)" ] }
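, { "cell_type": "markdown", "metadata": {}, "source": [ "Before querying the offline store in bulk, it can help to confirm that a single record is retrievable from the online store. The cell below is a minimal sketch: it assumes the feature group has an online store enabled and (as the metadata columns dropped later suggest) a record identifier feature named `fs_id`; the identifier value `'1'` is a hypothetical example, so adjust it to match your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 03b (sketch): look up one record from the online store\n", "# Assumes an online store is enabled and the record identifier is 'fs_id';\n", "# the identifier value '1' is a hypothetical example\n", "record = featurestore_runtime.get_record(\n", " FeatureGroupName=feature_group_name,\n", " RecordIdentifierValueAsString='1'\n", ")\n", "print(record.get('Record', 'Record not found'))" ] }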
, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 04\n", "# Build a SQL query against the feature group's offline store table\n", "fs_query = feature_group.athena_query()\n", "fs_table = fs_query.table_name\n", "query_string = 'SELECT * FROM \"'+fs_table+'\"'\n", "print('Running ' + query_string)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 05\n", "# Run the Athena query; the output is loaded into a pandas DataFrame\n", "fs_query.run(query_string=query_string, output_location='s3://'+bucket+'/'+prefix+'/fs_query_results/')\n", "fs_query.wait()\n", "model_data = fs_query.as_dataframe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 06\n", "# Drop the Feature Store bookkeeping columns, which are not model features\n", "model_data = model_data.drop(['fs_id', 'fs_time', 'write_time', 'api_invocation_time', 'is_deleted'], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 07\n", "model_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When building a model whose primary goal is to predict a target value on new data, it is important to understand overfitting. Supervised learning models are designed to minimize the error between their predictions and the actual target values in the data they are given. This last part is key: in their quest for greater accuracy, machine learning models frequently bias themselves toward minor idiosyncrasies in the data they are shown. Because those idiosyncrasies don't repeat in subsequent data, the model's predictions on new data can become less accurate, at the expense of more accurate predictions in the training phase.\n", "\n", "The most common way of preventing this is to build models with the concept that a model shouldn't only be judged on its fit to the data it was trained on, but also on \"new\" data. There are several ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, and so on. For our purposes, we'll simply randomly split the data into three uneven groups. The model will be trained on 70% of the data, evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on \"new\" data, and the remaining 10% will be held back as a final testing dataset to be used later on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 08\n", "train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))]) # Randomly shuffle the data, then split out the first 70%, second 20%, and last 10%" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazon SageMaker's XGBoost container expects data in libSVM or CSV format. For this example, we'll stick to CSV. Note that the first column must be the target variable and the CSV should not include headers. Also, notice that although it is repetitive, it's easiest to do this after the train|validation|test split rather than before, which avoids any misalignment issues due to random reordering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 09\n", "pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)\n", "pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)" ] }
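, { "cell_type": "markdown", "metadata": {}, "source": [ "Before uploading, a quick sanity check (a minimal sketch) confirms the files have the expected layout: the 0/1 target in the first column, no header row, and, as this notebook's subtitle warns, a heavily unbalanced class distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 09b (sketch): verify the CSV layout and inspect the class balance\n", "check = pd.read_csv('train.csv', header=None)\n", "print(check.shape) # rows x columns; the target is column 0\n", "print(check[0].value_counts(normalize=True)) # expect far more 0s than 1s" ] }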
, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 10\n", "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')\n", "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! You have successfully prepared the data to train an XGBoost model.\n", "\n", "---\n", "\n", "## End of Lab 2\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Training\n", "\n", "To train a model in SageMaker, you create a training job. The training job includes the following information:\n", "\n", "* The Amazon Elastic Container Registry path where the training code is stored.\n", "\n", "* The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training data.\n", "\n", "* The compute resources that you want SageMaker to use for model training. Compute resources are ML compute instances that are managed by SageMaker.\n", "\n", "* The URL of the S3 bucket where you want to store the output of the job.\n", "\n", "SageMaker built-in algorithms require the least effort and scale well when the dataset is large and significant resources are needed to train and deploy the model. For this use case, we will use the built-in XGBoost algorithm in SageMaker.\n", "\n", "`xgboost` is an extremely popular, open-source package for gradient boosted trees. It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions. Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 11\n", "# Note: consider pinning a specific container version (e.g. '1.5-1') instead of 'latest' for reproducibility\n", "container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify the locations of the training and validation data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 12\n", "s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')\n", "s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An estimator is a high-level interface for SageMaker training. We will create an estimator object by supplying the required parameters, such as the IAM role, compute instance count and type, and the S3 output path.\n", "\n", "We also supply hyperparameters for the algorithm and then call its `fit()` method to start training the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 13\n", "sess = sagemaker.Session()\n", "\n", "xgb = sagemaker.estimator.Estimator(container,\n", " role, \n", " instance_count=1, \n", " instance_type='ml.m4.xlarge',\n", " output_path='s3://{}/{}/output'.format(bucket, prefix),\n", " sagemaker_session=sess)\n", "xgb.set_hyperparameters(max_depth=5,\n", " eta=0.2,\n", " gamma=4,\n", " min_child_weight=6,\n", " subsample=0.8,\n", " silent=0,\n", " objective='binary:logistic',\n", " num_round=100)\n", "\n", "xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})" ] }
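, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the job completes, a quick check (a sketch using the SageMaker Python SDK's `model_data` attribute) confirms that the model artifact landed under the S3 output path configured above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 13b (sketch): confirm where SageMaker wrote the trained model artifact\n", "print(xgb.model_data) # s3://<bucket>/<prefix>/output/<job-name>/output/model.tar.gz" ] }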
, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Hosting\n", "Hosting the trained model allows us to make inferences against it. The code below deploys our trained model to a real-time endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 14\n", "xgb_predictor = xgb.deploy(initial_instance_count=1,\n", " instance_type='ml.m4.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Evaluation\n", "Let's evaluate our model against the test dataset.\n", "\n", "Our data is currently stored as NumPy arrays in the memory of our notebook instance. To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV output.\n", "\n", "*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 15\n", "xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The helper method below allows us to pass in our test data and make predictions against it. The following steps are performed in this helper method:\n", "1. Loop over our test dataset\n", "1. Split it into mini-batches of rows \n", "1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)\n", "1. Retrieve mini-batch predictions by invoking the XGBoost endpoint\n", "1. Collect predictions and convert from the CSV output our model provides into a NumPy array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 16\n", "def predict(data, predictor, rows=500):\n", "    # Split the dataset into mini-batches of at most `rows` rows each\n", "    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n", "    predictions = ''\n", "    for array in split_array:\n", "        # Invoke the endpoint and append the comma-separated scores\n", "        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])\n", "\n", "    # Drop the leading comma and parse the scores into a NumPy array\n", "    # (np.fromstring with a separator is deprecated)\n", "    return np.array(predictions[1:].split(','), dtype=float)\n", "\n", "predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A confusion matrix is a table that is often used to describe the performance of a classification model. Below we will check our confusion matrix to see how well we predicted versus actuals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 17\n", "pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our model predicted that 144 of the nearly 4,000 customers would subscribe, and 92 of them actually did. We also had 344 customers who subscribed that we did not predict would. This is less than desirable, but the model can (and should) be tuned to improve this; one option is sketched below.\n", "\n", "_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._" ] }
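, { "cell_type": "markdown", "metadata": {}, "source": [ "One common way to improve on these numbers is SageMaker automatic model tuning. The cell below is a minimal sketch, not a tuned recommendation: it assumes we optimize validation AUC (which requires setting `eval_metric='auc'` on the estimator), and the hyperparameter ranges are arbitrary starting points. Uncommenting the final line launches (and bills for) up to `max_jobs` training jobs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 17b (sketch): hyperparameter tuning with SageMaker automatic model tuning\n", "# The metric and ranges below are illustrative assumptions, not tuned recommendations\n", "from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter\n", "\n", "xgb.set_hyperparameters(eval_metric='auc') # emit validation:auc for the tuner to optimize\n", "\n", "tuner = HyperparameterTuner(xgb,\n", " objective_metric_name='validation:auc',\n", " objective_type='Maximize',\n", " hyperparameter_ranges={'eta': ContinuousParameter(0.05, 0.5),\n", " 'max_depth': IntegerParameter(3, 9),\n", " 'min_child_weight': ContinuousParameter(1, 10),\n", " 'subsample': ContinuousParameter(0.5, 1.0)},\n", " max_jobs=9, # each job is billed; keep this small for a demo\n", " max_parallel_jobs=3)\n", "\n", "# tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})" ] }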
, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) Clean-up\n", "\n", "If you are done with this notebook, please run the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cell 18\n", "xgb_predictor.delete_endpoint(delete_endpoint_config=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }