{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Targeting Direct Marketing with Amazon SageMaker XGBoost, Inference with AWS Lambda \n", "_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_\n", "\n", "---\n", "\n", "---\n", "\n", "## Contents\n", "\n", "1. [Background](#Background)\n", "1. [Preparation](#Preparation)\n", "1. [Data](#Data)\n", "    1. [Exploration](#Exploration)\n", "    1. [Transformation](#Transformation)\n", "1. [Training](#Training)\n", "1. [Deploying](#Deploying)\n", "\n", "---\n", "\n", "## Background\n", "Direct marketing, whether through mail, email, phone, etc., is a common tactic to acquire customers. Because resources and a customer's attention are limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer. Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.\n", "\n", "This notebook presents an example problem to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls. The steps include:\n", "\n", "* Preparing your Amazon SageMaker notebook\n", "* Downloading data from the internet into Amazon SageMaker\n", "* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms\n", "* Estimating a model using the Gradient Boosting algorithm\n", "* Getting the model and test data locations in S3 needed to deploy the AWS Lambda function for inference\n", "\n", "---\n", "\n", "## Preparation\n", "\n", "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. 
This should be within the same region as the Notebook Instance, training, and hosting.\n", "- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true, "tags": [ "parameters" ] }, "outputs": [], "source": [ "import boto3\n", "import pandas\n", "import re\n", "import sagemaker # Amazon SageMaker's Python SDK provides many helper functions\n", "from sagemaker import get_execution_role\n", "\n", "\n", "# Define IAM role\n", "role = get_execution_role()\n", "\n", "sess = sagemaker.Session()\n", "\n", "prefix = 'sagemaker/DEMO-xgboost-dm-lambda-inference'\n", "\n", "# S3 bucket which will contain the train and test datasets\n", "bucket = sess.default_bucket()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's bring in the Python libraries that we'll use throughout the analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np # For matrix operations and numerical processing\n", "import pandas as pd # For munging tabular data\n", "import matplotlib.pyplot as plt # For charts and visualizations\n", "from IPython.display import Image # For displaying images in the notebook\n", "from IPython.display import display # For displaying outputs in the notebook\n", "from time import gmtime, strftime # For labeling SageMaker models, endpoints, etc.\n", "import sys # For writing outputs to notebook\n", "import math # For ceiling function\n", "import json # For parsing hosting outputs\n", "import os # For manipulating filepath names\n", "import zipfile # For unzipping the dataset\n", "from sagemaker import image_uris\n", "from sagemaker.inputs import TrainingInput\n", "from 
sagemaker.estimator import Estimator\n", "from sagemaker.serializers import CSVSerializer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Data\n", "Let's start by downloading the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket. \n", "\n", "\\[Moro et al., 2014\\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip\n", "\n", "with zipfile.ZipFile('bank-additional.zip', 'r') as zip_ref:\n", "    zip_ref.extractall('.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's read this into a Pandas data frame and take a look." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('./bank-additional/bank-additional-full.csv')\n", "pd.set_option('display.max_columns', 500) # Make sure we can see all of the columns\n", "pd.set_option('display.max_rows', 20) # Keep the output on one page\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's talk about the data. 
At a high level, we can see:\n", "\n", "* We have a little over 40K customer records, and 20 features for each customer\n", "* The features are mixed; some numeric, some categorical\n", "* The data appears to be sorted, at least by `time` and `contact`, maybe more\n", "\n", "_**Specifics on each of the features:**_\n", "\n", "*Demographics:*\n", "* `age`: Customer's age (numeric)\n", "* `job`: Type of job (categorical: 'admin.', 'services', ...)\n", "* `marital`: Marital status (categorical: 'married', 'single', ...)\n", "* `education`: Level of education (categorical: 'basic.4y', 'high.school', ...)\n", "\n", "*Past customer events:*\n", "* `default`: Has credit in default? (categorical: 'no', 'unknown', ...)\n", "* `housing`: Has housing loan? (categorical: 'no', 'yes', ...)\n", "* `loan`: Has personal loan? (categorical: 'no', 'yes', ...)\n", "\n", "*Past direct marketing contacts:*\n", "* `contact`: Contact communication type (categorical: 'cellular', 'telephone', ...)\n", "* `month`: Last contact month of year (categorical: 'may', 'nov', ...)\n", "* `day_of_week`: Last contact day of the week (categorical: 'mon', 'fri', ...)\n", "* `duration`: Last contact duration, in seconds (numeric). 
Important note: If duration = 0 then `y` = 'no'.\n", " \n", "*Campaign information:*\n", "* `campaign`: Number of contacts performed during this campaign and for this client (numeric, includes last contact)\n", "* `pdays`: Number of days that passed by after the client was last contacted from a previous campaign (numeric)\n", "* `previous`: Number of contacts performed before this campaign and for this client (numeric)\n", "* `poutcome`: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)\n", "\n", "*External environment factors:*\n", "* `emp.var.rate`: Employment variation rate - quarterly indicator (numeric)\n", "* `cons.price.idx`: Consumer price index - monthly indicator (numeric)\n", "* `cons.conf.idx`: Consumer confidence index - monthly indicator (numeric)\n", "* `euribor3m`: Euribor 3 month rate - daily indicator (numeric)\n", "* `nr.employed`: Number of employees - quarterly indicator (numeric)\n", "\n", "*Target variable:*\n", "* `y`: Has the client subscribed to a term deposit? (binary: 'yes','no')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploration\n", "Let's start exploring the data. First, let's understand how the features are distributed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Frequency tables for each categorical feature\n", "for column in data.select_dtypes(include=['object']).columns:\n", "    display(pd.crosstab(index=data[column], columns='% observations', normalize='columns'))\n", "\n", "# Summary statistics and histograms for each numeric feature\n", "display(data.describe())\n", "%matplotlib inline\n", "hist = data.hist(bins=30, sharey=True, figsize=(10, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that:\n", "\n", "* Almost 90% of the values for our target variable `y` are \"no\", so most customers did not subscribe to a term deposit.\n", "* Many of the predictive features take on values of \"unknown\". 
Some are more common than others. We should think carefully as to what causes a value of \"unknown\" (are these customers non-representative in some way?) and how that should be handled.\n", "    * Even if \"unknown\" is included as its own distinct category, what does it mean given that, in reality, those observations likely fall within one of the other categories of that feature?\n", "* Many of the predictive features have categories with very few observations in them. If we find a small category to be highly predictive of our target outcome, do we have enough evidence to make a generalization about that?\n", "* Contact timing is particularly skewed. Almost a third of contacts occurred in May and less than 1% in December. What does this mean for predicting our target variable next December?\n", "* There are no missing values in our numeric features, or missing values have already been imputed.\n", "    * `pdays` takes a value of 999 for almost all customers, likely a placeholder value signifying no previous contact.\n", "* Several numeric features have a very long tail. Do we need to handle these few observations with extremely large values differently?\n", "* Several numeric features (particularly the macroeconomic ones) occur in distinct buckets. Should these be treated as categorical?\n", "\n", "Next, let's look at how our features relate to the target that we are attempting to predict." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Frequency tables of each categorical feature, split by the target\n", "for column in data.select_dtypes(include=['object']).columns:\n", "    if column != 'y':\n", "        display(pd.crosstab(index=data[column], columns=data['y'], normalize='columns'))\n", "\n", "# Histograms of each numeric feature, split by the target\n", "for column in data.select_dtypes(exclude=['object']).columns:\n", "    print(column)\n", "    hist = data[[column, 'y']].hist(by='y', bins=30)\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that:\n", "\n", "* Customers who are \"blue-collar\", \"married\", of \"unknown\" default status, contacted by \"telephone\", and/or contacted in \"may\" make up a substantially lower share of \"yes\" than of \"no\" outcomes for subscribing.\n", "* Distributions for numeric variables are different across \"yes\" and \"no\" subscribing groups, but the relationships may not be straightforward or obvious.\n", "\n", "Now let's look at how our features relate to one another." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Restrict to numeric columns; newer pandas versions raise an error on non-numeric columns\n", "display(data.select_dtypes(include='number').corr())\n", "pd.plotting.scatter_matrix(data, figsize=(12, 12))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that:\n", "* Features vary widely in their relationship with one another. Some have highly negative correlation, others highly positive correlation.\n", "* Relationships between features are non-linear and discrete in many cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transformation\n", "\n", "Cleaning up data is part of nearly every machine learning project. It arguably presents the biggest risk if done incorrectly and is one of the more subjective aspects in the process. Several common techniques include:\n", "\n", "* Handling missing values: Some machine learning algorithms are capable of handling missing values, but most would rather not. 
Options include:\n", "    * Removing observations with missing values: This works well if only a very small fraction of observations have incomplete information.\n", "    * Removing features with missing values: This works well if there are a small number of features which have a large number of missing values.\n", "    * Imputing missing values: Entire [books](https://www.amazon.com/Flexible-Imputation-Missing-Interdisciplinary-Statistics/dp/1439868247) have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.\n", "* Converting categorical to numeric: The most common method is one-hot encoding, which maps every distinct value of a categorical feature to its own indicator feature that takes a value of 1 when the categorical feature is equal to that value, and 0 otherwise.\n", "* Oddly distributed data: Although this has very limited implications for non-linear models like Gradient Boosted Trees, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data. In some cases, simply taking the natural log of the features is sufficient to produce more normally distributed data. In others, bucketing values into discrete ranges is helpful. These buckets can then be treated as categorical variables and included in the model when one-hot encoded.\n", "* Handling more complicated data types: Manipulating images, text, or data at varying grains is left for other notebook templates.\n", "\n", "Luckily, some of these aspects have already been handled for us, and the algorithm we are showcasing tends to do well at handling sparse or oddly distributed data. Therefore, let's keep pre-processing simple." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0) # Indicator variable to capture when pdays takes a value of 999\n", "data['not_working'] = np.where(data['job'].isin(['student', 'retired', 'unemployed']), 1, 0) # Indicator for individuals not actively employed\n", "model_data = pd.get_dummies(data, dtype=int) # Convert categorical variables to sets of 0/1 indicators (dtype=int avoids True/False in the CSV output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another question to ask yourself before building a model is whether certain features will add value in your final use case. For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction? Knowing it's raining is highly predictive for umbrella sales, but forecasting weather far enough out to plan inventory on umbrellas is probably just as difficult as forecasting umbrella sales without knowledge of the weather. So, including this in your model may give you a false sense of precision.\n", "\n", "Following this logic, let's remove the economic features and `duration` from our data as they would need to be forecasted with high precision to use as inputs in future predictions.\n", "\n", "Even if we were to use values of the economic indicators from the previous quarter, this value is likely not as relevant for prospects contacted early in the next quarter as those contacted later on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When building a model whose primary goal is to predict a target value on new data, it is important to understand overfitting. 
Supervised learning models are designed to minimize error between their predictions of the target value and actuals, in the data they are given. This last part is key, as frequently in their quest for greater accuracy, machine learning models bias themselves toward picking up on minor idiosyncrasies within the data they are shown. These idiosyncrasies don't repeat themselves in subsequent data, so predictions on new data can actually become less accurate even as predictions in the training phase become more accurate.\n", "\n", "The most common way of preventing this is to build models with the concept that a model shouldn't only be judged on its fit to the data it was trained on, but also on \"new\" data. There are several different ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, etc. For our purposes, we'll simply randomly split the data into 3 uneven groups. The model will be trained on 70% of the data, then evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on \"new\" data, and 10% will be held back as a final testing dataset which will be used later on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))]) # Randomly sort the data then split out first 70%, second 20%, and last 10%" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format. For this example, we'll stick to CSV. Note that the first column must be the target variable and the CSV should not include headers. Also, notice that, although repetitive, it's easiest to do this after the train|validation|test split rather than before. This avoids any misalignment issues due to random reordering." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)\n", "pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sess.upload_data(path='train.csv', bucket=bucket, key_prefix=prefix+'/train')\n", "sess.upload_data(path='validation.csv', bucket=bucket, key_prefix=prefix+'/validation')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll copy the test data file to S3 so that the inference Lambda function can use it later to make predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data.to_csv('test.csv', index=False, header=True)\n", "sess.upload_data(path='test.csv', bucket=bucket, key_prefix=prefix+'/test')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Training\n", "Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable. Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted. Taken together, these aspects make gradient boosted trees a good candidate algorithm.\n", "\n", "There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees works by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models. 
By doing this, the collection of simple models can actually outperform large, complex models. Other Amazon SageMaker notebooks elaborate on gradient boosted trees further and how they differ from similar algorithms.\n", "\n", "`xgboost` is an extremely popular, open-source package for gradient boosted trees. It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions. Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.\n", "\n", "First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container = image_uris.retrieve('xgboost', boto3.Session().region_name, '1.2-1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3, and which also specify that the content type is CSV." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_input_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')\n", "s3_input_validation = TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll need to specify training parameters to the estimator. This includes:\n", "1. The `xgboost` algorithm container\n", "1. The IAM role to use\n", "1. Training instance type and count\n", "1. S3 location for output data\n", "1. Algorithm hyperparameters\n", "\n", "And then a `.fit()` function which specifies:\n", "1. S3 locations for the input data. In this case we have both a training and validation set which are passed in." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb = Estimator(container,\n", "                role,\n", "                instance_count=1,\n", "                instance_type='ml.m4.xlarge',\n", "                output_path='s3://{}/{}/output'.format(bucket, prefix))\n", "\n", "xgb.set_hyperparameters(max_depth=5,\n", "                        eta=0.2,\n", "                        gamma=4,\n", "                        min_child_weight=6,\n", "                        subsample=0.8,\n", "                        objective='binary:logistic',\n", "                        num_round=100)\n", "\n", "xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Deploying\n", "\n", "The next step is to deploy the model to AWS Lambda, for serverless inference, and prepare a test event.\n", "\n", "To do that, you'll need to copy the model file to your computer, and update the test event `event.json` in the `events` folder." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Copy model location in S3\n", "\n", "Please copy the model location in S3. You'll need it later, to build and deploy the inference Lambda function." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb.model_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Copy test data location in S3\n", "\n", "To prepare the test event for the inference Lambda, you'll need to copy the following \n", "- `bucket`\n", "- `prefix` " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('bucket: {}'.format(bucket))\n", "print('prefix: {}'.format(prefix+'/test/'))" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }