### Direct Marketing in Banking - Propensity Modelling with Tabular Data

# Part 2: AutoGluon-Tabular Ensemble Models

> *This notebook works well with the `Python 3 (Data Science 3.0)` kernel on SageMaker Studio*

This workshop explores a tabular, [binary classification](https://en.wikipedia.org/wiki/Binary_classification) use-case with significant **class imbalance**: predicting which of a bank's customers are likely to respond to a targeted marketing campaign.

In this optional second notebook, you'll explore [AutoGluon-Tabular](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html) - another advanced, built-in algorithm from SageMaker that automatically ensembles different model types together for high accuracy.

> ⚠️ **You must** have run [Notebook 1 Autopilot and XGBoost.ipynb](1%20Autopilot%20and%20XGBoost.ipynb) before this notebook (at least to the point of having queried a data snapshot from SageMaker Feature Store)


## Contents

> ℹ️ **Tip:** You can use the Table of Contents panel in the left sidebar on JupyterLab / SageMaker Studio, to view and navigate sections

1. **[Prepare our environment](#Prepare-our-environment)**
1. **[Algorithms, AutoML, and AutoGluon](#Algorithms,-AutoML,-and-AutoGluon)**
1. **[Understand the algorithm requirements](#Understand-the-algorithm-requirements)**
1. **[Prepare training and test data](#Prepare-training-and-test-data)**
1. **[Train a model](#Train-a-model)**
1. **[Deploy the model](#Deploy-the-model)**
1. **[Use the endpoint](#Use-the-endpoint)**
1. **[Conclusions](#Conclusions)**
1. **[Releasing cloud resources](#Releasing-cloud-resources)**

## Prepare our environment

As in the previous notebook, we'll start by importing libraries and configuring AWS/SageMaker service connections:

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import json
import time

# External Dependencies:
import boto3 # General-purpose AWS SDK for Python
import numpy as np # For matrix operations and numerical processing
import pandas as pd # Tabular data utilities
import sagemaker # High-level SDK specifically for Amazon SageMaker

# Local Helper Functions:
import util

# Setting up SageMaker parameters
sgmk_session = sagemaker.Session() # Connect to SageMaker APIs
bucket_name = sgmk_session.default_bucket() # Select an Amazon S3 bucket
bucket = boto3.resource("s3").Bucket(bucket_name)
bucket_prefix = "sm101/direct-marketing" # Location in the bucket to store our files
sgmk_role = sagemaker.get_execution_role() # IAM Execution Role to use for permissions

print(f"s3://{bucket_name}/{bucket_prefix}")
print(sgmk_role)

## Algorithms, AutoML, and AutoGluon

Another useful tool to build highly-accurate models quickly is the open-source [AutoGluon framework](https://auto.gluon.ai/stable/index.html) and the SageMaker built-in [AutoGluon-Tabular algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html).

As outlined in the [2020 paper by Erickson, Mueller et al](https://arxiv.org/abs/2003.06505), AutoGluon-Tabular is an advanced framework for model stacking and ensembling. In the paper's benchmarks on two popular Kaggle competitions, it beat 99% of participating data scientists after just 4 hours of model training.

In fact, at the time of writing, SageMaker Autopilot makes use of AutoGluon under the hood when running in ensembling mode. You can also use AutoGluon directly, as shown here, for more customized experiments.

## Understand the algorithm requirements

As described in the [how to use](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html#autogluon-tabular-modes) section of the AutoGluon-Tabular doc page, [SageMaker JumpStart-based](https://sagemaker.readthedocs.io/en/stable/overview.html#use-built-in-algorithms-with-pre-trained-models-in-sagemaker-python-sdk) algorithms like AutoGluon-Tabular need a `script_uri` and `model_uri` in addition to the container `image_uri` as we configured for XGBoost.

These resources are all pre-built, and we can look them up with the `retrieve()` functions in the SageMaker Python SDK as shown below.

The default [hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) for the algorithm can also be loaded through the SDK, and below we make some minor customizations ready for inference.

In [None]:
from sagemaker import image_uris, script_uris, model_uris
from sagemaker.hyperparameters import retrieve_default as retrieve_default_hyperparams

ag_model_id, ag_model_version, train_scope = (
 "autogluon-classification-ensemble",
 "*",
 "training",
)
training_instance_type = "ml.p3.2xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
 region=None,
 framework=None,
 model_id=ag_model_id,
 model_version=ag_model_version,
 image_scope=train_scope,
 instance_type=training_instance_type,
)
print(train_image_uri)
# Retrieve the training script
train_source_uri = script_uris.retrieve(
 model_id=ag_model_id, model_version=ag_model_version, script_scope=train_scope
)
print(train_source_uri)
# Retrieve the pre-trained model tarball. For tabular algorithms this tarball is only a
# placeholder, so "fine-tuning" here means training from scratch.
train_model_uri = model_uris.retrieve(
 model_id=ag_model_id, model_version=ag_model_version, model_scope=train_scope
)
print(train_model_uri)

# Retrieve the default hyper-parameters for training the model
hyperparameters = retrieve_default_hyperparams(
 model_id=ag_model_id, model_version=ag_model_version
)

# [Optional] Override default hyperparameters with custom values
hyperparameters["auto_stack"] = "True"
hyperparameters["save_space"] = "True"
print("\n", hyperparameters)

## Prepare training and test data

We'll **re-use the snapshot** queried from SageMaker Feature Store in the previous notebook, reading all CSVs under the S3 prefix into a combined dataframe.

▶️ **Check** the `data_extract_s3uri` here matches your `data_extract_s3uri` from notebook 1

In [None]:
data_extract_s3uri = f"s3://{bucket_name}/{bucket_prefix}/data-extract"
data_extract_prefix = data_extract_s3uri[len("s3://"):].partition("/")[2]

full_df = pd.concat(
 [
 pd.read_csv(f"s3://{s3obj.bucket_name}/{s3obj.key}")
 for s3obj in bucket.objects.filter(Prefix=data_extract_prefix)
 if s3obj.key.lower().endswith(".csv")
 ],
 axis=0,
 ignore_index=True, # Avoid duplicate row labels when several CSVs are combined
)
full_df
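
The string slicing above works because every S3 URI starts with the fixed `s3://` prefix. As a minimal alternative sketch, the standard library's `urllib.parse.urlparse` can split a URI into bucket and key prefix. The helper name `split_s3_uri` and the example bucket name are hypothetical; they are not part of the SageMaker SDK or this workshop's `util` module.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path component starts with '/', which is not part of the object key:
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, for illustration only:
example_bucket, example_prefix = split_s3_uri(
    "s3://example-bucket/sm101/direct-marketing/data-extract"
)
print(example_bucket, example_prefix)
```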

From the [Input and Output Interface section](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html#InputOutput-AutoGluon-Tabular) of the algorithm doc, we know that AutoGluon-Tabular expects **CSV data in a particular structure**: `train/` and `validation/` folders each containing a single `data.csv`, with **no headers**, and the **target column first** in the files.

Unlike XGBoost, string categorical fields can be left as-is. Below we'll split the raw data snapshot as done previously - and upload to S3 in the required format:

In [None]:
df_model_data = full_df.drop(
 columns=[
 # Drop Feature Store metadata fields that aren't relevant to the model:
 "customer_id", "event_time", "write_time", "api_invocation_time", "is_deleted", "row_number"
 ],
 errors="ignore", # Your DF may not have 'row_number' if you did a simple 'select * from' query
)
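# AutoGluon-Tabular expects the target column first, so make sure "y" leads the column order:
df_model_data = df_model_data[["y"] + [c for c in df_model_data.columns if c != "y"]]
df_model_data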

# Shuffle and split dataset
train_data, validation_data, test_data = np.split(
 df_model_data.sample(frac=1, random_state=1729),
 [int(0.7 * len(df_model_data)), int(0.9 * len(df_model_data))],
)

# Create CSV files for Train / Validation / Test
train_data.to_csv("data/train.csv", index=False, header=False)
validation_data.to_csv("data/validation.csv", index=False, header=False)
test_data.to_csv("data/test.csv", index=False, header=False)

df_model_data
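
The two cut points passed to `np.split` above give a 70% / 20% / 10% partition into train, validation and test sets. The index arithmetic can be sketched in plain Python as follows; the helper name `split_sizes` and the row count are made up for illustration.

```python
def split_sizes(n_rows, cut_fractions=(0.7, 0.9)):
    """Return (train, validation, test) row counts for two cumulative cut points.

    Mirrors the `int(0.7 * len(df))` and `int(0.9 * len(df))` indices used with np.split.
    """
    first_cut = int(cut_fractions[0] * n_rows)
    second_cut = int(cut_fractions[1] * n_rows)
    return first_cut, second_cut - first_cut, n_rows - second_cut


# Illustrative row count only; substitute len(df_model_data) in the notebook:
print(split_sizes(1234))
```

Because both cut points are truncated with `int()`, any rounding remainder ends up in the later splits, and the three sizes always sum to the original row count.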

In [None]:
model_data_s3uri = f"s3://{bucket_name}/{bucket_prefix}/model-data-ag"

# Upload data to Amazon S3:
train_data_s3uri = model_data_s3uri + "/train/data.csv"
train_data.to_csv(train_data_s3uri, index=False, header=False)
validation_data_s3uri = model_data_s3uri + "/validation/data.csv"
validation_data.to_csv(validation_data_s3uri, index=False, header=False)
test_data_s3uri = model_data_s3uri + "/test/data.csv"
test_data.to_csv(test_data_s3uri, index=False, header=False)
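
Before launching a training job, it can be worth sanity-checking that a file matches the layout described above (no header row, target in the first column). The following is a minimal sketch using only the standard library. The helper name `check_headerless_target_first` and the sample rows are illustrative assumptions, and it assumes the target is encoded as `0` / `1` as in this workshop's dataset.

```python
import csv
import io


def check_headerless_target_first(csv_text, allowed_labels=("0", "1")):
    """Return True if the text has rows and every row's first field is an allowed label.

    A header row (e.g. starting with "y"), or a file whose first column holds
    a non-label feature, fails this check.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    return bool(rows) and all(bool(row) and row[0] in allowed_labels for row in rows)


# Made-up sample rows, for illustration only:
good = "0,56,housemaid\n1,37,services\n"
with_header = "y,age,job\n0,56,housemaid\n"
print(check_headerless_target_first(good), check_headerless_target_first(with_header))
```

This is only a heuristic: a file whose first column is a feature that happens to contain just `0` and `1` would still pass. To check a local file, you could pass in the result of `open("data/train.csv").read()`.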

## Train a model

With the parameters collected and data prepared in a compatible format, we're ready to train an AutoGluon model.

As in the previous XGBoost example, this process uses the [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) SDK class to define and run the training job.

Unlike the XGBoost example:

- The AutoGluon algorithm uses a single `training` data channel, with subfolders in S3 defining the separate splits of data.
- Additional parameters are needed (`source_dir`, `model_uri`, `entry_point`) to reference the separate (but pre-built) input artifacts that need to be bundled into the job.

In [None]:
%%time

ag_estimator = sagemaker.estimator.Estimator(
 base_job_name="autogluon",
 role=sgmk_role, # IAM role for job permissions
 output_path=f"s3://{bucket_name}/{bucket_prefix}/train-output", # Optional artifact output loc

 image_uri=train_image_uri, # AutoGluon-Tabular algorithm container
 source_dir=train_source_uri, # AutoGluon-Tabular script bundle (pre-built)
 model_uri=train_model_uri, # AutoGluon-Tabular pre-trained artifacts
 entry_point="transfer_learning.py", # Training script in the source_dir

 hyperparameters=hyperparameters,

 instance_type=training_instance_type, # Type of compute instance
 instance_count=1,
 max_run=25 * 60, # Limit job to 25 minutes
)

# Launch a SageMaker Training job by passing the S3 path of the datasets:
ag_estimator.fit({"training": model_data_s3uri})

## Deploy the model

When the training job is completed successfully, your model is ready to use for inference either in batch or real-time.

For this particular algorithm, the **container URI and script are different** for inference than training, so we need to look up the inference artifacts similarly to training above:

In [None]:
inference_instance_type = "ml.m5.large"

inference_image_uri = image_uris.retrieve(
 region=None,
 framework=None,
 model_id=ag_model_id,
 model_version=ag_model_version,
 image_scope="inference",
 instance_type=inference_instance_type,
)
print(inference_image_uri)

inference_src_uri = script_uris.retrieve(
 model_id=ag_model_id, model_version=ag_model_version, script_scope="inference"
)
print(inference_src_uri)

Although you could deploy in **one line** with `ag_predictor = ag_estimator.deploy(...)`, this encapsulates [multiple steps](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-deployment.html) of generating a Model, Endpoint Configuration, and Endpoint.

We'll explicitly separate out the model step here, which will be helpful for storing model metadata later:

In [None]:
ag_model = ag_estimator.create_model(
 image_uri=inference_image_uri,
 source_dir=inference_src_uri,
 entry_point="inference.py",
)

Whether from a Model or direct from an Estimator, setting up a real-time endpoint for the trained model is just one `.deploy(...)` function call as shown below.

> ⏰ This deployment might take **around 5-10 minutes**, and by default the code will wait for the deployment to complete.

If you like, you can instead:

- Un-comment the `wait=False` parameter (or if you already ran the cell, press the ⏹ "stop" button in the toolbar above)
- Use the [Endpoints page of the SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/endpoints) to check the status of the deployment
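
If you deploy with `wait=False`, the status can also be polled from code rather than checked in the console. The sketch below is written against a generic `describe_fn` callable so that it can run without AWS access. In practice you would pass something like `boto3.client("sagemaker").describe_endpoint`, whose response includes an `EndpointStatus` field. The helper name `wait_for_endpoint`, its default timings, and the fake describe function are assumptions for illustration, not part of the SageMaker SDK.

```python
import time

# Statuses treated here as "still in progress" (an assumption of this sketch):
TRANSITIONAL_STATUSES = ("Creating", "Updating", "SystemUpdating", "RollingBack")


def wait_for_endpoint(describe_fn, endpoint_name, poll_seconds=30, timeout_seconds=1200):
    """Poll until the endpoint leaves its transitional states, then return the final status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe_fn(EndpointName=endpoint_name)["EndpointStatus"]
        if status not in TRANSITIONAL_STATUSES:
            return status  # e.g. "InService" or "Failed"
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)


# Stand-in for the real API call, so the sketch can run offline:
_fake_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_fake_statuses)}


print(wait_for_endpoint(fake_describe, "example-endpoint", poll_seconds=0))
```

Only call `predict()` once the returned status is `InService`; any other final status means the deployment needs investigating first.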

In [None]:
%%time
ag_predictor = ag_model.deploy(
 initial_instance_count=1,
 instance_type=inference_instance_type,
 
 # wait=False, # Remember, predictor.predict() won't work until deployment finishes!

 # We will also turn on data capture here, in case you want to experiment with monitoring later:
 data_capture_config=sagemaker.model_monitor.DataCaptureConfig(
 enable_capture=True,
 sampling_percentage=100,
 destination_s3_uri=f"s3://{bucket_name}/{bucket_prefix}/data-capture",
 ),
)

## Use the endpoint

As in the previous notebook, we can use the high-level SageMaker Python SDK [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) class to interact with our deployed model.

Again, when using a pre-built algorithm, refer to the [algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html#InputOutput-AutoGluon-Tabular) to see what input and output formats are supported at inference time.

When using `application/json` with the `verbose` flag, AutoGluon-Tabular can return **both** the predicted class labels and the class probabilities, which we'll use below:

In [None]:
ag_predictor.serializer = sagemaker.serializers.CSVSerializer()
ag_predictor.deserializer = sagemaker.deserializers.JSONDeserializer(
 accept="application/json;verbose"
)

X_test_numpy = test_data.drop(["y"], axis=1).values

model_response = ag_predictor.predict(X_test_numpy)

print("Response keys:", model_response.keys())

# probabilities is (N, 2) with probs for both classes, so convert to 1D probability of cls '1':
probabilities = np.array(model_response["probabilities"], dtype=float)[:, 1].squeeze()
probabilities

We can use both the probabilities and the assigned class labels in downstream processing, depending on whether we want to use the model's own inferred threshold or override it. As a reminder, for this bank marketing task the class labels are:

- 0: The person **will not** enroll
- 1: The person **will** enroll (making them a good candidate for direct marketing)
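
To override the model's own labels, you can apply your own cut-off to the class-1 probabilities. Below is a minimal sketch on a hand-written response with the same `probabilities` and `predicted_label` keys as the real one. The numbers, the 0.3 threshold, and the helper name `labels_at_threshold` are made up for illustration.

```python
# Hand-written stand-in for `model_response`; values are illustrative only.
example_response = {
    "probabilities": [[0.95, 0.05], [0.60, 0.40], [0.20, 0.80]],
    "predicted_label": [0, 0, 1],
}


def labels_at_threshold(response, threshold):
    """Label a record 1 when its class-1 probability reaches the threshold."""
    return [int(probs[1] >= threshold) for probs in response["probabilities"]]


custom_labels = labels_at_threshold(example_response, threshold=0.3)
print("Model labels: ", example_response["predicted_label"])
print("Custom labels:", custom_labels)
```

A lower threshold flags more customers as marketing candidates, which generally trades precision for recall.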

In [None]:
test_results = pd.concat(
 [
 pd.Series(probabilities, name="y_prob", index=test_data.index),
 pd.Series(model_response["predicted_label"], name="y_pred", index=test_data.index),
 test_data,
 ],
 axis=1,
)
test_results.head()

From this joined data we can calculate standard quality metrics to measure the performance of the classifier. Run the below to produce a model quality report similar to the previous notebook.

Note that here we're using the model's own labels, rather than automatically inferring the F1-score-maximising probability threshold:

In [None]:
report = util.reporting.generate_binary_classification_report(
 y_real=test_data["y"].values,
 y_predict_proba=probabilities,
 # Since this model already outputs both labels and probabilities, we can use both:
 y_predict_label=test_results["y_pred"].values,
 # No need for an arbitrary decision threshold:
 # decision_threshold=0.5,
 class_names_list=["Did not enroll", "Enrolled"],
 title="AutoGluon model",
)

# Store the model quality report locally and on Amazon S3:
with open("data/report-autogluon.json", "w") as f:
 json.dump(report, f, indent=2)
model_quality_s3uri = f"s3://{bucket_name}/{bucket_prefix}/{ag_model.name}/model-quality.json"
!aws s3 cp data/report-autogluon.json {model_quality_s3uri}

### Register and share the model

As with XGBoost, we can add this model candidate to SageMaker Model Registry - and from there compare its performance to other candidates.

In [None]:
ag_model.register(
 content_types=["text/csv"],
 response_types=["application/json", "application/json;verbose"],
 model_package_group_name="sm101-dm",
 description="AutoGluon-Tabular model",
 model_metrics=sagemaker.model_metrics.ModelMetrics(
 model_statistics=sagemaker.model_metrics.MetricsSource(
 content_type="application/json",
 s3_uri=model_quality_s3uri,
 ),
 ),
 domain="MACHINE_LEARNING",
 task="CLASSIFICATION",
 sample_payload_url=test_data_s3uri,
)

You can compare the model charts and statistics side-by-side in SageMaker Studio's Model Registry UI to assess their performance - as shown in the screenshot below:

![](img/model-registry-compare.png "Screenshot of side-by-side comparison in SageMaker Studio Model Registry UI")

Note that:

- F1-related comparisons may not be entirely fair: Our XGBoost models' metrics automatically inferred the F1-maximising threshold and used it to drive decisions, whereas AutoGluon-Tabular used its own threshold selection algorithm to assign labels.
- This model package group can contain versions with different I/O contracts (Our XGBoost models expect one-hot encoded inputs, and our AutoGluon model produces JSON output instead of CSV). You could consider also attaching [data quality reports](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-quality.html) to fully specify the expected distribution of model inputs and outputs from training, and attaching additional lineage metadata.
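
One way to make the F1 comparison more like-for-like would be to pick the F1-maximising threshold from the AutoGluon probabilities as well. The sketch below is a plain-Python illustration of that idea, not the implementation used by this workshop's `util.reporting` helper. The function names and toy data are assumptions.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_f1_threshold(y_true, y_prob):
    """Try each distinct predicted probability as a cut-off and keep the best-scoring one."""
    best_threshold, best_f1 = 0.5, -1.0
    for threshold in sorted(set(y_prob)):
        y_pred = [int(p >= threshold) for p in y_prob]
        score = f1_score(y_true, y_pred)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


# Toy data, for illustration only:
toy_labels = [0, 0, 1, 0, 1, 1]
toy_probs = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]
print(best_f1_threshold(toy_labels, toy_probs))
```

In the notebook you could call this with `test_data["y"]` and `probabilities`, although choosing the threshold on a validation split rather than the test set gives a less optimistic estimate.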

## Conclusions

In this example we used an alternative built-in algorithm for tabular data on SageMaker, the [AutoGluon-Tabular **built-in algorithm**](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html), which automatically ensembles different modelling approaches to deliver high accuracy. In fact, it uses similar techniques to [SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html) running in `Ensembling` mode. This algorithm (and the open-source [AutoGluon library](https://auto.gluon.ai/)) is well worth checking out if you're mainly using single-algorithm approaches like XGBoost or LightGBM today.

Some key things to remember:

- In the case of SageMaker Autopilot, you don't even need to write code to get started: Just upload your data to Amazon S3 and work through the `Create experiment` UI flow.
- When using built-in algorithms, **refer to the [algorithm's doc pages](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html)** for important usage info like data formats, and whether multi-instance training parallelism is supported.

Check out the other workshops in this repository to dive deeper on custom ML with bring-your-own-script training jobs.

## Releasing cloud resources

As mentioned in the previous notebook, you should shut down any created inference endpoints when finished experimenting. You may also choose to clear out your Amazon S3 storage, in which case do remember to delete your SageMaker Feature Store Feature Group and Model Registry Model Group first.

You can un-comment the below code to delete the inference endpoint created by this notebook:

In [None]:
# ag_predictor.delete_endpoint(delete_endpoint_config=True)