{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Direct Marketing with Amazon SageMaker Autopilot\n", "---\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Introduction](#Introduction)\n", "1. [Prerequisites](#Prerequisites)\n", "1. [Downloading the dataset](#Downloading)\n", "1. [Upload the dataset to Amazon S3](#Uploading)\n", "1. [Setting up the SageMaker Autopilot Job](#Settingup)\n", "1. [Launching the SageMaker Autopilot Job](#Launching)\n", "1. [Tracking SageMaker Autopilot Job Progress](#Tracking)\n", "1. [Results](#Results)\n", "1. [Cleanup](#Cleanup)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.\n", "\n", "A typical introductory task in machine learning (the \"Hello World\" equivalent) is one that uses a dataset to predict whether a customer will enroll for a term deposit at a bank, after one or more phone calls. For more information about the task and the dataset used, see [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing).\n", "\n", "Direct marketing, through mail, email, phone, etc., is a common tactic to acquire customers. Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer. Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem. 
You can imagine that this task would readily translate to marketing lead prioritization in your own organization.\n", "\n", "This notebook demonstrates how you can use Autopilot on this dataset to get the most accurate ML pipeline by exploring a number of potential options, or \"candidates\". Each candidate generated by Autopilot consists of two steps. The first step performs automated feature engineering on the dataset and the second step trains and tunes an algorithm to produce a model. When you deploy this model, it follows similar steps: feature engineering followed by inference, to decide whether the lead is worth pursuing or not. The notebook contains instructions on how to train the model as well as to deploy the model to perform batch predictions on a set of leads. Where possible, use the Amazon SageMaker Python SDK, a high-level SDK, to simplify the way you interact with Amazon SageMaker.\n", "\n", "Other examples demonstrate how to customize models in various ways. For instance, models deployed to devices typically have memory constraints that must be satisfied in addition to accuracy requirements. Other use cases have real-time deployment requirements and latency constraints. For now, keep it simple." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "Before you start the tasks in this tutorial, make sure you have the following:\n", "\n", "- The Amazon Simple Storage Service (Amazon S3) bucket and prefix that you want to use for training and model data. This should be within the same Region as Amazon SageMaker training. The code below will create the default bucket or, if it already exists, use it.\n", "- The IAM role to give Autopilot access to your data. 
See the Amazon SageMaker documentation for more information on IAM roles: https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam.html" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# cell 01\n", "import sagemaker\n", "import boto3\n", "from sagemaker import get_execution_role\n", "\n", "region = boto3.Session().region_name\n", "\n", "session = sagemaker.Session()\n", "bucket = session.default_bucket()\n", "prefix = 'sagemaker/autopilot-dm'\n", "\n", "role = get_execution_role()\n", "\n", "sm = boto3.Session().client(service_name='sagemaker',region_name=region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading the dataset\n", "Download the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket.\n", "\n", "\\[Moro et al., 2014\\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-02-02 06:21:51-- https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip\n", "Resolving sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)... 52.218.241.41\n", "Connecting to sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)|52.218.241.41|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 432828 (423K) [application/zip]\n", "Saving to: ‘bank-additional.zip’\n", "\n", "bank-additional.zip 100%[===================>] 422.68K 881KB/s in 0.5s \n", "\n", "2021-02-02 06:21:52 (881 KB/s) - ‘bank-additional.zip’ saved [432828/432828]\n", "\n", "Collecting package metadata (current_repodata.json): done\n", "Solving environment: done\n", "\n", "\n", "==> WARNING: A newer version of conda exists. <==\n", " current version: 4.8.2\n", " latest version: 4.9.2\n", "\n", "Please update conda by running\n", "\n", " $ conda update -n base -c defaults conda\n", "\n", "\n", "\n", "## Package Plan ##\n", "\n", " environment location: /opt/conda\n", "\n", " added / updated specs:\n", " - unzip\n", "\n", "\n", "The following packages will be downloaded:\n", "\n", " package | build\n", " ---------------------------|-----------------\n", " conda-4.9.2 | py37h89c1867_0 3.0 MB conda-forge\n", " python_abi-3.7 | 1_cp37m 4 KB conda-forge\n", " unzip-6.0 | h516909a_2 141 KB conda-forge\n", " ------------------------------------------------------------\n", " Total: 3.2 MB\n", "\n", "The following NEW packages will be INSTALLED:\n", "\n", " python_abi conda-forge/linux-64::python_abi-3.7-1_cp37m\n", " unzip conda-forge/linux-64::unzip-6.0-h516909a_2\n", "\n", "The following packages will be UPDATED:\n", "\n", " conda pkgs/main::conda-4.8.2-py37_0 --> conda-forge::conda-4.9.2-py37h89c1867_0\n", "\n", "\n", "\n", "Downloading and Extracting Packages\n", "python_abi-3.7 | 4 KB | ##################################### | 100% \n", "conda-4.9.2 | 3.0 MB | ##################################### | 100% \n", "unzip-6.0 | 141 KB | ##################################### | 100% \n", "Preparing transaction: done\n", "Verifying transaction: done\n", "Executing transaction: done\n", "Archive: bank-additional.zip\n", " creating: bank-additional/\n", " inflating: bank-additional/bank-additional-names.txt \n", " inflating: bank-additional/bank-additional.csv \n", " inflating: 
bank-additional/bank-additional-full.csv \n" ] } ], "source": [ "# cell 02\n", "!wget -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip\n", "!conda install -y -c conda-forge unzip\n", "!unzip -o bank-additional.zip\n", "\n", "local_data_path = './bank-additional/bank-additional-full.csv'\n" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true }, "source": [ "## Upload the dataset to Amazon S3\n", "\n", "Before you run Autopilot on the dataset, first perform a check of the dataset to make sure that it has no obvious errors. The Autopilot process can take a long time, and it's generally a good practice to inspect the dataset before you start a job. This particular dataset is small, so you can inspect it in the notebook instance itself. If you have a larger dataset that will not fit in the notebook instance's memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. Autopilot is capable of handling datasets up to 5 GB.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data into a Pandas data frame and take a look." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | 261 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | 149 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | 226 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | 151 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | 307 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 41183 | 73 | retired | married | professional.course | no | yes | no | cellular | nov | fri | 334 | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | yes |\n",
"| 41184 | 46 | blue-collar | married | professional.course | no | no | no | cellular | nov | fri | 383 | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |\n",
"| 41185 | 56 | retired | married | university.degree | no | yes | no | cellular | nov | fri | 189 | 2 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |\n",
"| 41186 | 44 | technician | married | professional.course | no | no | no | cellular | nov | fri | 442 | 1 | 999 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | yes |\n",
"| 41187 | 74 | retired | married | professional.course | no | yes | no | cellular | nov | fri | 239 | 3 | 999 | 1 | failure | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | no |\n",
"\n",
"41188 rows × 21 columns
\n", "\n",
"|  | no |\n",
"|---|---|\n",
"| 0 | no |\n",
"| 1 | no |\n",
"| 2 | no |\n",
"| 3 | no |\n",
"| 4 | no |\n",
"| ... | ... |\n",
"| 8232 | yes |\n",
"| 8233 | yes |\n",
"| 8234 | no |\n",
"| 8235 | yes |\n",
"| 8236 | yes |\n",
"\n",
"8237 rows × 1 columns
\n", "