{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Targeting Direct Marketing with Amazon SageMaker XGBoost\n", "_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_\n", "\n", "\n", "## Background\n", "Direct marketing, whether through mail, email, phone, or another channel, is a common tactic to acquire customers. Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer. Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.\n", "\n", "This notebook presents an example problem: predicting whether a customer will enroll in a term deposit at a bank after one or more phone calls. The steps include:\n", "\n", "* Preparing your Amazon SageMaker notebook\n", "* Downloading data from the internet into Amazon SageMaker\n", "* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms\n", "* Estimating a model using the Gradient Boosting algorithm\n", "* Evaluating the effectiveness of the model\n", "* Setting the model up to make ongoing predictions\n", "\n", "---\n", "\n", "## Preparation\n", "\n", "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.\n", "- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s)."
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "isConfigCell": true, "tags": [ "parameters" ] }, "outputs": [], "source": [ "# cell 01\n", "# Define IAM role\n", "import boto3\n", "import sagemaker\n", "import re\n", "from sagemaker import get_execution_role\n", "\n", "region = boto3.Session().region_name\n", "session = sagemaker.Session()\n", "bucket = session.default_bucket()\n", "prefix = 'sagemaker/DEMO-xgboost-dm'\n", "role = get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's bring in the Python libraries that we'll use throughout the analysis." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: smdebug in /opt/conda/lib/python3.7/site-packages (1.0.12)\n", "Requirement already satisfied: boto3>=1.10.32 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.24.12)\n", "Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from smdebug) (20.1)\n", "Requirement already satisfied: protobuf>=3.6.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (3.20.1)\n", "Requirement already satisfied: numpy>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.21.6)\n", "Requirement already satisfied: pyinstrument==3.4.2 in /opt/conda/lib/python3.7/site-packages (from smdebug) (3.4.2)\n", "Requirement already satisfied: pyinstrument-cext>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from pyinstrument==3.4.2->smdebug) (0.2.4)\n", "Requirement already satisfied: botocore<1.28.0,>=1.27.12 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.27.12)\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.0.1)\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) 
(0.6.0)\n", "Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (2.4.6)\n", "Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (1.14.0)\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.28.0,>=1.27.12->boto3>=1.10.32->smdebug) (2.8.1)\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.28.0,>=1.27.12->boto3>=1.10.32->smdebug) (1.26.9)\n" ] }, { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# cell 02\n", "import numpy as np # For matrix operations and numerical processing\n", "import pandas as pd # For munging tabular data\n", "import matplotlib.pyplot as plt # For charts and visualizations\n", "from IPython.display import Image # For displaying images in the notebook\n", "from IPython.display import display # For displaying outputs in the notebook\n", "from time import gmtime, strftime # For labeling SageMaker models, endpoints, etc.\n", "import sys # For writing outputs to notebook\n", "import math # For ceiling function\n", "import json # For parsing hosting outputs\n", "import os # For manipulating filepath names\n", "import sagemaker # Amazon SageMaker's Python SDK provides many helper functions\n", "from sagemaker.predictor import csv_serializer # Converts strings for HTTP POST requests on inference\n", "from plotly.offline import init_notebook_mode, iplot # For rendering plots\n", "! 
python -m pip install smdebug \n", "\n", "init_notebook_mode(connected=True)\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Data\n", "Let's start by downloading the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket.\n", "\n", "\\[Moro et al., 2014\\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-07-29 17:24:08-- https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip\n", "Resolving sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)... 52.218.250.33\n", "Connecting to sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)|52.218.250.33|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 432828 (423K) [application/zip]\n", "Saving to: ‘bank-additional.zip.8’\n", "\n", "bank-additional.zip 100%[===================>] 422.68K 1.67MB/s in 0.2s \n", "\n", "2022-07-29 17:24:09 (1.67 MB/s) - ‘bank-additional.zip.8’ saved [432828/432828]\n", "\n", "Collecting package metadata (current_repodata.json): done\n", "Solving environment: done\n", "\n", "# All requested packages already installed.\n", "\n", "Archive: bank-additional.zip\n", " inflating: bank-additional/bank-additional-names.txt \n", " inflating: bank-additional/bank-additional.csv \n", " inflating: bank-additional/bank-additional-full.csv \n" ] } ], "source": [ "# cell 03\n", "!wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip\n", "!conda install -y -c conda-forge unzip\n", "!unzip -o bank-additional.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's read this into a pandas DataFrame and take a look." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 41183 | 73 | retired | married | professional.course | no | yes | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | yes |\n",
"| 41184 | 46 | blue-collar | married | professional.course | no | no | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |\n",
"| 41185 | 56 | retired | married | university.degree | no | yes | no | cellular | nov | fri | ... | 2 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |\n",
"| 41186 | 44 | technician | married | professional.course | no | no | no | cellular | nov | fri | ... | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | yes |\n",
"| 41187 | 74 | retired | married | professional.course | no | yes | no | cellular | nov | fri | ... | 3 | 999 | 1 | failure | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |\n",
"\n",
"41188 rows × 21 columns\n", "