# Direct Marketing with Amazon SageMaker Autopilot

This notebook works well with the `Python 3 (Data Science 2.0)` kernel on SageMaker Studio.

---

## Contents

1. [Introduction](#Introduction)
1. [Prerequisites](#Prerequisites)
1. [Downloading the dataset](#Downloading)
1. [Upload the dataset to Amazon S3](#Uploading)
1. [Setting up the SageMaker Autopilot Job](#Settingup)
1. [Launching the SageMaker Autopilot Job](#Launching)
1. [Tracking SageMaker Autopilot Job Progress](#Tracking)
1. [Results](#Results)
1. [Cleanup](#Cleanup)

## Introduction

[Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/) is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways:

- On autopilot (hence the name) or with human guidance
- Without code (through the SageMaker Studio UI) or using the AWS SDKs

This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.

A typical introductory task in machine learning (the "Hello World" equivalent) is one that uses a dataset to predict whether a customer will enroll for a term deposit at a bank, after one or more phone calls. For more information about the task and the dataset used, see [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing).

Direct marketing, through mail, email, phone, etc., is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem. You can imagine that this task would readily translate to marketing lead prioritization in your own organization.

This notebook demonstrates how you can use Autopilot on this dataset to get the most accurate ML pipeline by exploring a number of potential options, or "candidates". Each candidate generated by Autopilot consists of two steps. The first step performs automated feature engineering on the dataset and the second step trains and tunes an algorithm to produce a model. When you deploy this model, it follows similar steps: feature engineering followed by inference, to decide whether the lead is worth pursuing or not. The notebook contains instructions on how to train the model as well as how to deploy it to perform batch predictions on a set of leads. Where possible, it uses the Amazon SageMaker Python SDK, a high-level SDK, to simplify the way you interact with Amazon SageMaker.

Other examples demonstrate how to customize models in various ways. For instance, models deployed to devices typically have memory constraints that need to be satisfied as well as accuracy. Other use cases have real-time deployment requirements and latency constraints. For now, keep it simple.

## Prerequisites

Before getting started, we'll check & upgrade a few installed library versions to avoid some past incompatibilities (see [numpy#18355](https://github.com/numpy/numpy/issues/18355) and [aiobotocore#905](https://github.com/aio-libs/aiobotocore/issues/905)).

> ⚠️ **Note:** If you have any other notebooks running in the same Studio "app" (same kernel and instance type) that have already `import`ed these libraries into memory, you might see unexpected errors and need to restart those notebook kernels after this install.

In [None]:
!pip install "pandas>=1.0.5" "s3fs>=2022.01.0"

With dependencies installed, this next code cell will:

- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to SageMaker with the open-source [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), `sagemaker`

Note that while [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the general-purpose AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.

In [None]:
import sagemaker

session = sagemaker.Session()

bucket = session.default_bucket()
prefix = "sm101/autopilot-dm"

sm = session.sagemaker_client

role = sagemaker.get_execution_role()

## Downloading the dataset<a name="Downloading"></a>

In this example we'll use the **direct marketing dataset** as per:

> *\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014*

Here, we'll download [the dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the SageMaker sample data S3 bucket.

In [None]:
!wget -P data/ -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip

import zipfile

with zipfile.ZipFile("data/bank-additional.zip", "r") as zip_ref:
    print("Unzipping...")
    zip_ref.extractall("data")
print("Done")

local_data_path = "./data/bank-additional/bank-additional-full.csv"

## Upload the dataset to Amazon S3<a name="Uploading"></a>

Before you run Autopilot on the dataset, first check the dataset to make sure that it has no obvious errors. The Autopilot process can take a long time, and it's generally good practice to inspect the dataset before you start a job. This particular dataset is small, so you can inspect it in the notebook instance itself. If you have a larger dataset that will not fit in a notebook instance's memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. Autopilot is capable of handling datasets up to 5 GB.

Read the data into a Pandas data frame and take a look.

In [None]:
import pandas as pd

data = pd.read_csv(local_data_path)
with pd.option_context("display.max_columns", 500):
    # Make sure we can see all of the columns
    display(data)

Note that there are 20 features to help predict the target column 'y'.
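
Before launching a long-running job, a couple of quick checks on the target column can catch obvious problems early. The sketch below is a minimal, standard-library illustration on a tiny made-up CSV sample (the rows and the `summarize_target` helper are hypothetical, not part of the dataset or of any SDK). On the real data frame, `data["y"].value_counts()` and `data.isna().sum()` give the same kind of information.

```python
import csv
import io
from collections import Counter

# Tiny made-up sample in the spirit of the bank marketing data (not real rows).
SAMPLE_CSV = """age,job,y
34,admin.,no
51,technician,yes
29,,no
45,services,no
"""


def summarize_target(csv_text, target="y"):
    """Return (class counts for the target column, empty-cell counts per column)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    class_counts = Counter(row[target] for row in rows)
    empty_cells = Counter(
        column for row in rows for column, value in row.items() if value == ""
    )
    return class_counts, empty_cells


class_counts, empty_cells = summarize_target(SAMPLE_CSV)
print("Target distribution:", dict(class_counts))
print("Empty cells per column:", dict(empty_cells))
```

A heavily skewed target distribution is worth knowing about up front, because it affects which objective metric is meaningful for the job.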

Amazon SageMaker Autopilot takes care of preprocessing your data for you. You do not need to perform conventional data preprocessing techniques such as handling missing values, converting categorical features to numeric features, scaling data, and handling more complicated data types.

Moreover, splitting the dataset into training and validation splits is not necessary. Autopilot takes care of this for you. You may, however, want to split out a test set. The next step does that, although in this notebook the held-out split is used for batch inference at the end rather than for formal model testing.


### Reserve some data for calling batch inference on the model

Divide the data into training and testing splits. The training split is used by SageMaker Autopilot. The testing split is reserved to perform inference using the suggested model.


In [None]:
train_data = data.sample(frac=0.8, random_state=200)

test_data = data.drop(train_data.index)

test_data_no_target = test_data.drop(columns=["y"])
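
The split above relies on two properties: the sampled training rows and the remaining test rows never overlap, and together they cover the whole dataset. Below is a minimal standard-library sketch of the same idea, using row indices instead of a data frame. The `split_indices` helper is hypothetical, and its random draw will not match what pandas' `sample` selects for the same seed.

```python
import random


def split_indices(n_rows, train_frac=0.8, seed=200):
    """Randomly assign row indices to disjoint train and test sets."""
    rng = random.Random(seed)
    n_train = round(n_rows * train_frac)
    train = set(rng.sample(range(n_rows), n_train))
    test = set(range(n_rows)) - train  # everything not sampled for training
    return train, test


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```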

### Upload the dataset to Amazon S3
Copy the file to Amazon Simple Storage Service (Amazon S3) in a .csv format for Amazon SageMaker training to use.

In [None]:
train_file = "data/train_data.csv"
train_data.to_csv(train_file, index=False, header=True)
train_data_s3_path = session.upload_data(path=train_file, key_prefix=prefix + "/train")
print("Train data uploaded to: " + train_data_s3_path)

test_file = "data/test_data.csv"
test_data_no_target.to_csv(test_file, index=False, header=False)
test_data_s3_path = session.upload_data(path=test_file, key_prefix=prefix + "/test")
print("Test data uploaded to: " + test_data_s3_path)

## Setting up the SageMaker Autopilot Job<a name="Settingup"></a>

After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset. 

The required inputs for invoking an Autopilot job are:
* Amazon S3 location for input dataset and for all output artifacts
* Name of the column of the dataset you want to predict (`y` in this case) 
* An IAM role

Currently Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order, is expected to have a header row.
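
To see which file would need the header when a dataset is split across several CSV files, you can sort the object keys as plain strings. This is a small illustrative sketch: the keys are made up, `first_file_in_lexical_order` is a hypothetical helper rather than an SDK function, and it assumes the ordering is ordinary string comparison. Note that lexical order differs from numeric order, so `part-10.csv` sorts before `part-2.csv`.

```python
def first_file_in_lexical_order(keys):
    """Return the key that sorts first in plain string (lexical) order."""
    return sorted(keys)[0]


# Hypothetical S3 keys for a dataset split across three files.
keys = ["train/part-2.csv", "train/part-10.csv", "train/part-1.csv"]
print(first_file_in_lexical_order(keys))
```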

In [None]:
from datetime import datetime
from time import sleep

from sagemaker.automl import automl

timestamp_suffix = f"{datetime.now():%Y-%m-%d-%H-%M-%S}"
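# Note: this name is informational only. `fit()` is called later without `job_name`,
# so the SDK generates the actual Autopilot job name.
auto_ml_job_name = f"automl-banking-{timestamp_suffix}"
print(f"AutoMLJobName: {auto_ml_job_name}")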

automl_job = automl.AutoML(
    role=role,
    target_attribute_name="y",
    output_path=f"s3://{bucket}/{prefix}/automl-output",
    problem_type="BinaryClassification",
    max_candidates=10,  # (We've set this low to prioritize demo speed over accuracy)
    job_objective={"MetricName": "F1"},
)

Specifying the type of problem you want to solve with your dataset (`Regression`, `MulticlassClassification`, or `BinaryClassification`) is **optional**. If you are not sure, SageMaker Autopilot will infer the problem type based on statistics of the target column (the column you want to predict).

You have the option to limit the running time of a SageMaker Autopilot job by providing either the maximum number of pipeline evaluations or candidates (one pipeline evaluation is called a `Candidate` because it generates a candidate model) or providing the total time allocated for the overall Autopilot job. Under default settings, this job takes about four hours to run. This varies between runs because of the nature of the exploratory process Autopilot uses to find optimal training parameters.
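
In the low-level `CreateAutoMLJob` API, these limits are expressed through the `CompletionCriteria` structure of `AutoMLJobConfig`. The sketch below only builds that dictionary locally so the two kinds of limits are easy to compare. `completion_criteria` is a hypothetical helper and the numeric values are arbitrary examples, not recommendations.

```python
def completion_criteria(
    max_candidates=None,
    max_job_runtime_seconds=None,
    max_runtime_per_training_job_seconds=None,
):
    """Build a CompletionCriteria dict, omitting any limit that is not set."""
    criteria = {
        "MaxCandidates": max_candidates,
        "MaxAutoMLJobRuntimeInSeconds": max_job_runtime_seconds,
        "MaxRuntimePerTrainingJobInSeconds": max_runtime_per_training_job_seconds,
    }
    return {key: value for key, value in criteria.items() if value is not None}


# Limit by number of candidates (as this notebook does) ...
print(completion_criteria(max_candidates=10))
# ... or by total wall-clock time for the whole job (example value: one hour).
print(completion_criteria(max_job_runtime_seconds=3600))
```

With the SageMaker Python SDK, the corresponding `AutoML` constructor arguments are `max_candidates`, `total_job_runtime_in_seconds`, and `max_runtime_per_training_job_in_seconds`.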

## Launching the SageMaker Autopilot Job<a name="Launching"></a>

You can now launch the Autopilot job by calling the `fit()` method as described in the [SageMaker Python SDK AutoML doc](https://sagemaker.readthedocs.io/en/stable/api/training/automl.html#sagemaker.automl.automl.AutoML.fit).

In [None]:
automl_job.fit(inputs=train_data_s3_path, wait=False, logs=False)

In [None]:
# Or if you need (e.g. after notebook restart), you can instead 'attach' to previous jobs by name:
# automl_job = automl.AutoML.attach("automl-2022-02-15-13-51-51-239")

## Tracking SageMaker Autopilot job progress<a name="Tracking"></a>
A SageMaker Autopilot job consists of the following high-level steps:
* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). 

In [None]:
print("JobStatus - Secondary Status\n----------------------------")

while True:
    sleep(60)
    describe_response = automl_job.describe_auto_ml_job()
    print("{AutoMLJobStatus} - {AutoMLJobSecondaryStatus}".format(**describe_response))
    if describe_response["AutoMLJobStatus"] in ("Failed", "Completed", "Stopped"):
        break
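
The loop above runs until the job reaches a terminal state. If you would rather give up after a fixed time, one option is a small polling helper with a timeout. This is a sketch under assumptions: `wait_for_terminal_status` is a hypothetical helper, and the `describe` callable stands in for `automl_job.describe_auto_ml_job`. A fake is used here so the example runs without AWS.

```python
import time

TERMINAL_STATES = ("Failed", "Completed", "Stopped")


def wait_for_terminal_status(
    describe,
    poll_seconds=60,
    timeout_seconds=4 * 3600,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll describe() until AutoMLJobStatus is terminal or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = describe()["AutoMLJobStatus"]
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Fake describe call that reports InProgress twice, then Completed.
responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe():
    return {"AutoMLJobStatus": next(responses)}


print(wait_for_terminal_status(fake_describe, poll_seconds=0))
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without actually waiting.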

## Results

The Autopilot job is complete, and we now have a set of candidate models with their associated performance metrics.
Let's list up to 10 of them, sorted by the final objective metric value.

In [None]:
candidates_list = automl_job.list_candidates(max_results=10, sort_by="FinalObjectiveMetricValue")

models = pd.json_normalize(candidates_list)[
    [
        "CandidateName",
        "FinalAutoMLJobObjectiveMetric.Value",
        "FinalAutoMLJobObjectiveMetric.MetricName",
    ]
].rename(
    columns={
        "FinalAutoMLJobObjectiveMetric.Value": "metric_value",
        "FinalAutoMLJobObjectiveMetric.MetricName": "metric_name",
        "CandidateName": "candidate_name",
    }
)

models

In [None]:
automl_job.best_candidate()

### Perform batch inference using the best candidate

Now that you have successfully completed the SageMaker Autopilot job on the dataset, create a model from any of the candidates by using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). 

For classification problem types, the inference containers generated by SageMaker Autopilot allow you to select the response content for predictions. The valid response content keys for binary and multiclass classification are:

- `predicted_label` - predicted class
- `probability` - In binary classification, the probability that the result is predicted as the second or True class in the target column. In multiclass classification, the probability of the winning class.
- `labels` - list of all possible classes
- `probabilities` - list of all probabilities for all classes (order corresponds with `labels`)

By default, the inference containers are configured to generate the `predicted_label` only.

In this binary classification example we'll request both `predicted_label` and `probability` - demonstrating how this additional "confidence" output from the model can be used.

In [None]:
model_name = "automl-banking-model-" + timestamp_suffix
inference_response_keys = ["predicted_label", "probability"]
model = automl_job.create_model(
    name=model_name,
    candidate=automl_job.best_candidate(),
    inference_response_keys=inference_response_keys,
)

You can run batch inference with Amazon SageMaker batch transform. The same model can also be deployed for online inference using Amazon SageMaker hosting.

In [None]:
output_path = f"s3://{bucket}/{prefix}/inference-results/"

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    assemble_with="Line",
    output_path=output_path,
)

We can now start the transform job.

In [None]:
transformer_job = transformer.transform(
    data=test_data_s3_path,
    data_type="S3Prefix",
    content_type="text/csv",
    split_type="Line",
    wait=False,
)

Watch the transform job for completion.

In [None]:
print("JobStatus\n----------")

while True:
    sleep(30)
    describe_response = sm.describe_transform_job(
        TransformJobName=transformer._current_job_name,
    )
    job_run_status = describe_response["TransformJobStatus"]
    print(job_run_status)
    if job_run_status in ("Failed", "Completed", "Stopped"):
        break

Now let's view the results of the transform job:

In [None]:
test_data_preds = pd.read_csv(
    transformer.output_path + "test_data.csv.out",
    header=None,
    names=inference_response_keys,
)

test_data_preds
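
Because `probability` is the score for the positive class, you can apply your own decision threshold instead of relying on `predicted_label`, for example to contact fewer but more promising leads. Here is a minimal sketch with made-up scores. The `label_with_threshold` helper and the 0.7 threshold are illustrative assumptions, and the code assumes `yes` is the positive class.

```python
def label_with_threshold(probabilities, threshold=0.5, positive="yes", negative="no"):
    """Map positive-class probabilities to labels using a custom cut-off."""
    return [positive if p >= threshold else negative for p in probabilities]


# Made-up positive-class probabilities for five leads.
lead_scores = [0.05, 0.48, 0.62, 0.71, 0.93]

print(label_with_threshold(lead_scores))  # default 0.5 cut-off
print(label_with_threshold(lead_scores, threshold=0.7))  # stricter cut-off
```

Raising the threshold trades recall for precision: fewer leads are flagged, but a larger share of them should convert.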

### Additional metrics

We can use the result of the transform job to evaluate additional metrics on the test dataset, using the [scikit-learn](https://scikit-learn.org/stable/index.html) library.

Common metrics for classification problems are [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) and [AP](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html).

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    average_precision_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

labels = test_data["y"]
AUC = roc_auc_score(labels == "yes", test_data_preds.probability)
AP = average_precision_score(labels, test_data_preds.probability, pos_label="yes")

print(f"AUC: {AUC:.3f}\nAP {AP:.3f}")
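
AUC has a convenient interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following sketch computes it directly from that definition on made-up data. `pairwise_auc` is a hypothetical helper meant for intuition only, since it compares every positive/negative pair and therefore scales quadratically, unlike `roc_auc_score`.

```python
def pairwise_auc(is_positive, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: one of the four positive/negative pairs is mis-ordered.
truth = [True, False, True, False]
example_scores = [0.9, 0.3, 0.4, 0.6]
print(pairwise_auc(truth, example_scores))
```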

We can also generate a classification report and a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix):

In [None]:
print(classification_report(labels == "yes", test_data_preds.predicted_label == "yes"))
cm = confusion_matrix(labels, test_data_preds.predicted_label, labels=["yes", "no"])
# `include_values` is a boolean flag that controls whether counts are drawn in each cell
ConfusionMatrixDisplay(cm, display_labels=["yes", "no"]).plot(
    include_values=True, cmap=plt.cm.Blues, values_format="d"
);
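
The job objective chosen earlier, F1, can be read straight off the confusion matrix: it is the harmonic mean of precision and recall for the positive class. Below is a small worked sketch with made-up counts. `f1_from_counts` is a hypothetical helper, and the counts are not results from this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up counts: 60 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = f1_from_counts(tp=60, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```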

And present the model performance using [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and Precision-Recall curves.

In [None]:
f, [ax0, ax1] = plt.subplots(1, 2, figsize=(16, 9))

fpr, tpr, _ = roc_curve(labels == "yes", test_data_preds.probability)
ax0.step(fpr, tpr, where="post")
ax0.fill_between(fpr, tpr, step="post", alpha=0.2, color="b")
ax0.plot([0, 1], [0, 1], linestyle="--")
ax0.set_xlabel("False Positive Rate")
ax0.set_ylabel("True Positive Rate")
ax0.set_title("ROC (Receiver Operating Characteristic)")

precision, recall, _ = precision_recall_curve(labels == "yes", test_data_preds.probability)
ax1.step(recall, precision, where="post")
ax1.fill_between(recall, precision, step="post", alpha=0.2, color="b")
ax1.set_xlabel("Recall")
ax1.set_ylabel("Precision")
ax1.set_title("Precision-Recall Curve")

for ax in [ax0, ax1]:
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    ax.set_aspect(1)
    ax.grid()

### Exploration and Modelling Notebooks

As well as the results and candidate models themselves, Autopilot generates other artifacts including:

- A **Data Exploration Notebook**: produced during the analysis phase of the job, that helps you identify problems in your dataset.
- A **Candidate Definitions Notebook**: interactively stepping through the steps taken by Autopilot to define and train candidates, and select the best one.
- **Supporting Python code**: Including the actual code used for the different pre-processing steps.

To get a good overview of the available assets, we'll download not just the notebooks but the whole output folder:

In [None]:
automl_job_desc = automl_job.describe_auto_ml_job()

automl_output_s3uri = automl_job_desc["OutputDataConfig"]["S3OutputPath"]
print(f"Autopilot output:\n{automl_output_s3uri}")

print(f"Downloading to autopilot_output/...")
sagemaker.s3.S3Downloader.download(automl_output_s3uri, "autopilot_output/")
print("Done")

From this download we can view not just the notebooks, but also other assets like the generated Python code they link to, and pre-processed datasets. Explore the notebooks linked below, but also check out the other contents in the `autopilot_output` folder!

In [None]:
from IPython.display import Markdown

candidate_notebook_s3uri = automl_job_desc["AutoMLJobArtifacts"][
    "CandidateDefinitionNotebookLocation"
]
candidate_notebook_path = "autopilot_output" + candidate_notebook_s3uri[len(automl_output_s3uri) :]

print(f"Candidate definition notebook:\n{candidate_notebook_s3uri}")
print(f"\nDownloaded at:")
display(Markdown(f"[{candidate_notebook_path}]({candidate_notebook_path})"))

In [None]:
dataexp_notebook_s3uri = automl_job_desc["AutoMLJobArtifacts"]["DataExplorationNotebookLocation"]
dataexp_notebook_path = "autopilot_output" + dataexp_notebook_s3uri[len(automl_output_s3uri) :]

print(f"Data exploration notebook:\n{dataexp_notebook_s3uri}")
print(f"\nDownloaded at:")
display(Markdown(f"[{dataexp_notebook_path}]({dataexp_notebook_path})"))
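
Both cells above derive the local path by slicing the S3 URI at the length of the output prefix, which silently produces a wrong path if the URI does not actually start with that prefix. A slightly more defensive sketch follows. The `local_path_for` helper and the example URIs are hypothetical.

```python
import posixpath


def local_path_for(s3_uri, s3_root, local_root="autopilot_output"):
    """Map an S3 URI under s3_root to the matching path under local_root."""
    root = s3_root.rstrip("/") + "/"
    if not s3_uri.startswith(root):
        raise ValueError(f"{s3_uri} is not under {s3_root}")
    return posixpath.join(local_root, s3_uri[len(root):])


# Hypothetical URIs, shaped like the ones printed above.
example_root = "s3://example-bucket/sm101/autopilot-dm/automl-output"
example_notebook = example_root + "/my-job/notebooks/example.ipynb"
print(local_path_for(example_notebook, example_root))
```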

### Best Model Explainability Artifacts
SageMaker Autopilot uses SageMaker Clarify to generate an explainability report for the best candidate.

These Clarify artifacts are also available from S3, and were already included in our download above:

In [None]:
explainability_s3uri = automl_job.best_candidate()["CandidateProperties"][
    "CandidateArtifactLocations"
]["Explainability"]
explainability_path = "autopilot_output" + explainability_s3uri[len(automl_output_s3uri) :]

print(f"Explainability artifacts:\n{explainability_s3uri}")
print(f"\nDownloaded to folder:")
print(f"{explainability_path}")

## Cleanup

The Autopilot job creates many underlying artifacts, such as dataset splits, preprocessing scripts, and preprocessed data. The code below, when uncommented, deletes them, along with all of the generated models and auto-generated notebooks.

In [None]:
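# import boto3

# s3 = boto3.resource("s3")
# s3_bucket = s3.Bucket(bucket)  # separate name, so `bucket` keeps holding the bucket name

# # The job was configured with output_path=f"s3://{bucket}/{prefix}/automl-output";
# # this assumes Autopilot stores each job's artifacts under <output_path>/<job name>/
# job_name = automl_job.describe_auto_ml_job()["AutoMLJobName"]
# job_outputs_prefix = f"{prefix}/automl-output/{job_name}"
# s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()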