{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Direct Marketing with Amazon SageMaker Autopilot\n",
    "---\n",
    "\n",
    "---\n",
    "\n",
    "Kernel `Python 3 (Data Science)` works well with this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "1. [Introduction](#Introduction)\n",
    "1. [Prerequisites](#Prerequisites)\n",
    "1. [Downloading the dataset](#Downloading)\n",
    "1. [Upload the dataset to Amazon S3](#Uploading)\n",
    "1. [Setting up the SageMaker Autopilot Job](#Settingup)\n",
    "1. [Launching the SageMaker Autopilot Job](#Launching)\n",
    "1. [Tracking Sagemaker Autopilot Job Progress](#Tracking)\n",
    "1. [Results](#Results)\n",
    "1. [Cleanup](#Cleanup)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.\n",
    "\n",
    "A typical introductory task in machine learning (the \"Hello World\" equivalent) is one that uses a dataset to predict whether a customer will enroll for a term deposit at a bank, after one or more phone calls. For more information about the task and the dataset used, see [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing).\n",
    "\n",
    "Direct marketing, through mail, email, phone, etc., is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem. You can imagine that this task would readily translate to marketing lead prioritization in your own organization.\n",
    "\n",
    "This notebook demonstrates how you can use Autopilot on this dataset to get the most accurate ML pipeline through exploring a number of potential options, or \"candidates\". Each candidate generated by Autopilot consists of two steps. The first step performs automated feature engineering on the dataset and the second step trains and tunes an algorithm to produce a model. When you deploy this model, it follows similar steps. Feature engineering followed by inference, to decide whether the lead is worth pursuing or not. The notebook contains instructions on how to train the model as well as to deploy the model to perform batch predictions on a set of leads. Where it is possible, use the Amazon SageMaker Python SDK, a high level SDK, to simplify the way you interact with Amazon SageMaker.\n",
    "\n",
    "Other examples demonstrate how to customize models in various ways. For instance, models deployed to devices typically have memory constraints that need to be satisfied as well as accuracy. Other use cases have real-time deployment requirements and latency constraints. For now, keep it simple."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Before you start the tasks in this notebook, do the following:\n",
    "\n",
    "- Create a SageMaker service role using AWS Identity and Access Management (IAM). Follow instructions here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html The IAM role to give Autopilot access to your data. See the Amazon SageMaker documentation for more information on IAM roles: https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam.html Make sure add access permission to your S3 bucket from this role. \n",
    "- Provide your Amazon S3 bucket name, AWS account keys and SageMaker Service Role as environment variables within the Project settings.\n",
    "- Provide prefix for the folder and object that will be created in the process in Cell #2. The bucket should be within the same Region you specify. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import boto3\n",
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sts = boto3.client('sts')\n",
    "iam = boto3.client('iam')\n",
    "\n",
    "# Use the region where your S3 bucket is located\n",
    "region = 'us-east-1'\n",
    "\n",
    "session = sts.get_session_token()\n",
    "\n",
    "# Use the name of the bucket you have created in your AWS account\n",
    "bucket = os.getenv('S3_BUCKET')\n",
    "prefix = 'autopilot/autopilot-dm'\n",
    "\n",
    "# Ensure that the role name matches the one you created in your IAM.\n",
    "role = iam.get_role(RoleName=os.getenv('ROLE_NAME'))['Role']['Arn'] \n",
    "\n",
    "sagemaker = boto3.Session().client(service_name='sagemaker',region_name=region)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading the dataset<a name=\"Downloading\"></a>\n",
    "Download the [direct marketing dataset](!wget -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data s3 bucket. \n",
    "\n",
    "\\[Moro et al., 2014\\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)\n",
      "E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?\n",
      "--2021-03-18 20:29:43--  https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip\n",
      "Resolving sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)... 52.218.220.209\n",
      "Connecting to sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com (sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com)|52.218.220.209|:443... connected.\n",
      "HTTP request sent, awaiting response... 304 Not Modified\n",
      "File ‘bank-additional.zip’ not modified on server. Omitting download.\n",
      "\n",
      "Archive:  bank-additional.zip\n",
      "  inflating: bank-additional/bank-additional-names.txt  \n",
      "  inflating: bank-additional/bank-additional.csv  \n",
      "  inflating: bank-additional/bank-additional-full.csv  \n"
     ]
    }
   ],
   "source": [
    "!wget -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip\n",
    "!unzip -o bank-additional.zip\n",
    "\n",
    "local_data_path = './bank-additional/bank-additional-full.csv'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": true
   },
   "source": [
    "## Upload the dataset to Amazon S3<a name=\"Uploading\"></a>\n",
    "\n",
    "Before you run Autopilot on the dataset, first perform a check of the dataset to make sure that it has no obvious errors. The Autopilot process can take long time, and it's generally a good practice to inspect the dataset before you start a job. This particular dataset is small, so you can inspect it in the notebook instance itself. If you have a larger dataset that will not fit in a notebook instance memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. Autopilot is capable of handling datasets up to 5 GB.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the data into a Pandas data frame and take a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>job</th>\n",
       "      <th>marital</th>\n",
       "      <th>education</th>\n",
       "      <th>default</th>\n",
       "      <th>housing</th>\n",
       "      <th>loan</th>\n",
       "      <th>contact</th>\n",
       "      <th>month</th>\n",
       "      <th>day_of_week</th>\n",
       "      <th>duration</th>\n",
       "      <th>campaign</th>\n",
       "      <th>pdays</th>\n",
       "      <th>previous</th>\n",
       "      <th>poutcome</th>\n",
       "      <th>emp.var.rate</th>\n",
       "      <th>cons.price.idx</th>\n",
       "      <th>cons.conf.idx</th>\n",
       "      <th>euribor3m</th>\n",
       "      <th>nr.employed</th>\n",
       "      <th>y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>56</td>\n",
       "      <td>housemaid</td>\n",
       "      <td>married</td>\n",
       "      <td>basic.4y</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>261</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>57</td>\n",
       "      <td>services</td>\n",
       "      <td>married</td>\n",
       "      <td>high.school</td>\n",
       "      <td>unknown</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>149</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37</td>\n",
       "      <td>services</td>\n",
       "      <td>married</td>\n",
       "      <td>high.school</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>226</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40</td>\n",
       "      <td>admin.</td>\n",
       "      <td>married</td>\n",
       "      <td>basic.6y</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>151</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>56</td>\n",
       "      <td>services</td>\n",
       "      <td>married</td>\n",
       "      <td>high.school</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>307</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41183</th>\n",
       "      <td>73</td>\n",
       "      <td>retired</td>\n",
       "      <td>married</td>\n",
       "      <td>professional.course</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>nov</td>\n",
       "      <td>fri</td>\n",
       "      <td>334</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-1.1</td>\n",
       "      <td>94.767</td>\n",
       "      <td>-50.8</td>\n",
       "      <td>1.028</td>\n",
       "      <td>4963.6</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41184</th>\n",
       "      <td>46</td>\n",
       "      <td>blue-collar</td>\n",
       "      <td>married</td>\n",
       "      <td>professional.course</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>nov</td>\n",
       "      <td>fri</td>\n",
       "      <td>383</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-1.1</td>\n",
       "      <td>94.767</td>\n",
       "      <td>-50.8</td>\n",
       "      <td>1.028</td>\n",
       "      <td>4963.6</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41185</th>\n",
       "      <td>56</td>\n",
       "      <td>retired</td>\n",
       "      <td>married</td>\n",
       "      <td>university.degree</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>nov</td>\n",
       "      <td>fri</td>\n",
       "      <td>189</td>\n",
       "      <td>2</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-1.1</td>\n",
       "      <td>94.767</td>\n",
       "      <td>-50.8</td>\n",
       "      <td>1.028</td>\n",
       "      <td>4963.6</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41186</th>\n",
       "      <td>44</td>\n",
       "      <td>technician</td>\n",
       "      <td>married</td>\n",
       "      <td>professional.course</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>nov</td>\n",
       "      <td>fri</td>\n",
       "      <td>442</td>\n",
       "      <td>1</td>\n",
       "      <td>999</td>\n",
       "      <td>0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>-1.1</td>\n",
       "      <td>94.767</td>\n",
       "      <td>-50.8</td>\n",
       "      <td>1.028</td>\n",
       "      <td>4963.6</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41187</th>\n",
       "      <td>74</td>\n",
       "      <td>retired</td>\n",
       "      <td>married</td>\n",
       "      <td>professional.course</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>cellular</td>\n",
       "      <td>nov</td>\n",
       "      <td>fri</td>\n",
       "      <td>239</td>\n",
       "      <td>3</td>\n",
       "      <td>999</td>\n",
       "      <td>1</td>\n",
       "      <td>failure</td>\n",
       "      <td>-1.1</td>\n",
       "      <td>94.767</td>\n",
       "      <td>-50.8</td>\n",
       "      <td>1.028</td>\n",
       "      <td>4963.6</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>41188 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       age          job  marital            education  default housing loan  \\\n",
       "0       56    housemaid  married             basic.4y       no      no   no   \n",
       "1       57     services  married          high.school  unknown      no   no   \n",
       "2       37     services  married          high.school       no     yes   no   \n",
       "3       40       admin.  married             basic.6y       no      no   no   \n",
       "4       56     services  married          high.school       no      no  yes   \n",
       "...    ...          ...      ...                  ...      ...     ...  ...   \n",
       "41183   73      retired  married  professional.course       no     yes   no   \n",
       "41184   46  blue-collar  married  professional.course       no      no   no   \n",
       "41185   56      retired  married    university.degree       no     yes   no   \n",
       "41186   44   technician  married  professional.course       no      no   no   \n",
       "41187   74      retired  married  professional.course       no     yes   no   \n",
       "\n",
       "         contact month day_of_week  duration  campaign  pdays  previous  \\\n",
       "0      telephone   may         mon       261         1    999         0   \n",
       "1      telephone   may         mon       149         1    999         0   \n",
       "2      telephone   may         mon       226         1    999         0   \n",
       "3      telephone   may         mon       151         1    999         0   \n",
       "4      telephone   may         mon       307         1    999         0   \n",
       "...          ...   ...         ...       ...       ...    ...       ...   \n",
       "41183   cellular   nov         fri       334         1    999         0   \n",
       "41184   cellular   nov         fri       383         1    999         0   \n",
       "41185   cellular   nov         fri       189         2    999         0   \n",
       "41186   cellular   nov         fri       442         1    999         0   \n",
       "41187   cellular   nov         fri       239         3    999         1   \n",
       "\n",
       "          poutcome  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  \\\n",
       "0      nonexistent           1.1          93.994          -36.4      4.857   \n",
       "1      nonexistent           1.1          93.994          -36.4      4.857   \n",
       "2      nonexistent           1.1          93.994          -36.4      4.857   \n",
       "3      nonexistent           1.1          93.994          -36.4      4.857   \n",
       "4      nonexistent           1.1          93.994          -36.4      4.857   \n",
       "...            ...           ...             ...            ...        ...   \n",
       "41183  nonexistent          -1.1          94.767          -50.8      1.028   \n",
       "41184  nonexistent          -1.1          94.767          -50.8      1.028   \n",
       "41185  nonexistent          -1.1          94.767          -50.8      1.028   \n",
       "41186  nonexistent          -1.1          94.767          -50.8      1.028   \n",
       "41187      failure          -1.1          94.767          -50.8      1.028   \n",
       "\n",
       "       nr.employed    y  \n",
       "0           5191.0   no  \n",
       "1           5191.0   no  \n",
       "2           5191.0   no  \n",
       "3           5191.0   no  \n",
       "4           5191.0   no  \n",
       "...            ...  ...  \n",
       "41183       4963.6  yes  \n",
       "41184       4963.6   no  \n",
       "41185       4963.6   no  \n",
       "41186       4963.6  yes  \n",
       "41187       4963.6   no  \n",
       "\n",
       "[41188 rows x 21 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = pd.read_csv(local_data_path)\n",
    "pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns\n",
    "pd.set_option('display.max_rows', 10)         # Keep the output on one page\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that there are 20 features to help predict the target column 'y'.\n",
    "\n",
    "Amazon SageMaker Autopilot takes care of preprocessing your data for you. You do not need to perform conventional data preprocssing techniques such as handling missing values, converting categorical features to numeric features, scaling data, and handling more complicated data types.\n",
    "\n",
    "Moreover, splitting the dataset into training and validation splits is not necessary. Autopilot takes care of this for you. You may, however, want to split out a test set. That's next, although you use it for batch inference at the end instead of testing the model.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reserve some data for calling batch inference on the model\n",
    "\n",
    "Divide the data into training and testing splits. The training split is used by SageMaker Autopilot. The testing split is reserved to perform inference using the suggested model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = data.sample(frac=0.8,random_state=200)\n",
    "\n",
    "test_data = data.drop(train_data.index)\n",
    "\n",
    "test_data_no_target = test_data.drop(columns=['y'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Upload the dataset to Amazon S3\n",
    "Copy the file to Amazon Simple Storage Service (Amazon S3) in a .csv format for Amazon SageMaker training to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.resource('s3')\n",
    "\n",
    "train_file = 'train_data.csv';\n",
    "train_data.to_csv(train_file, index=False, header=True)\n",
    "s3.meta.client.upload_file(train_file, bucket, prefix + '/' + train_file)\n",
    "\n",
    "test_file = 'test_data.csv';\n",
    "test_data_no_target.to_csv(test_file, index=False, header=False)\n",
    "s3.meta.client.upload_file(test_file, bucket, prefix + '/' + test_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up the SageMaker Autopilot Job<a name=\"Settingup\"></a>\n",
    "\n",
    "After uploading the dataset to Amazon S3, you can invoke Autopilot to find the best ML pipeline to train a model on this dataset. \n",
    "\n",
    "The required inputs for invoking a Autopilot job are:\n",
    "* Amazon S3 location for input dataset and for all output artifacts\n",
    "* Name of the column of the dataset you want to predict (`y` in this case) \n",
    "* An IAM role\n",
    "\n",
    "Currently Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order, is expected to have a header row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_data_config = [{\n",
    "      'DataSource': {\n",
    "        'S3DataSource': {\n",
    "          'S3DataType': 'S3Prefix',\n",
    "          'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)\n",
    "        }\n",
    "      },\n",
    "      'TargetAttributeName': 'y'\n",
    "    }\n",
    "  ]\n",
    "\n",
    "output_data_config = {\n",
    "    'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix)\n",
    "  }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also specify the type of problem you want to solve with your dataset (`Regression, MulticlassClassification, BinaryClassification`). In case you are not sure, SageMaker Autopilot will infer the problem type based on statistics of the target column (the column you want to predict). \n",
    "\n",
    "You have the option to limit the running time of a SageMaker Autopilot job by providing either the maximum number of pipeline evaluations or candidates (one pipeline evaluation is called a `Candidate` because it generates a candidate model) or providing the total time allocated for the overall Autopilot job. Under default settings, this job takes about four hours to run. This varies between runs because of the nature of the exploratory process Autopilot uses to find optimal training parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launching the SageMaker Autopilot Job<a name=\"Launching\"></a>\n",
    "\n",
    "You can now launch the Autopilot job by calling the `create_auto_ml_job` API. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AutoMLJobName: automl-banking-18-20-30-33\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'AutoMLJobArn': 'arn:aws:sagemaker:us-east-1:661554271967:automl-job/automl-banking-18-20-30-33',\n",
       " 'ResponseMetadata': {'RequestId': '8b384469-ed2e-4f64-99ef-49c373e66e3b',\n",
       "  'HTTPStatusCode': 200,\n",
       "  'HTTPHeaders': {'x-amzn-requestid': '8b384469-ed2e-4f64-99ef-49c373e66e3b',\n",
       "   'content-type': 'application/x-amz-json-1.1',\n",
       "   'content-length': '97',\n",
       "   'date': 'Thu, 18 Mar 2021 20:30:34 GMT'},\n",
       "  'RetryAttempts': 0}}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from time import gmtime, strftime, sleep\n",
    "timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())\n",
    "\n",
    "auto_ml_job_name = 'automl-banking-' + timestamp_suffix\n",
    "print('AutoMLJobName: ' + auto_ml_job_name)\n",
    "\n",
    "sagemaker.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,\n",
    "                      InputDataConfig=input_data_config,\n",
    "                      OutputDataConfig=output_data_config,\n",
    "                      RoleArn=role)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tracking SageMaker Autopilot job progress<a name=\"Tracking\"></a>\n",
    "SageMaker Autopilot job consists of the following high-level steps : \n",
    "* Analyzing Data, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.\n",
    "* Feature Engineering, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.\n",
    "* Model Tuning, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "JobStatus - Secondary Status\n",
      "------------------------------\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - AnalyzingData\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - FeatureEngineering\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "InProgress - ModelTuning\n",
      "Completed - MaxCandidatesReached\n"
     ]
    }
   ],
   "source": [
    "print ('JobStatus - Secondary Status')\n",
    "print('------------------------------')\n",
    "\n",
    "\n",
    "describe_response = sagemaker.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n",
    "print (describe_response['AutoMLJobStatus'] + \" - \" + describe_response['AutoMLJobSecondaryStatus'])\n",
    "job_run_status = describe_response['AutoMLJobStatus']\n",
    "    \n",
    "while job_run_status not in ('Failed', 'Completed', 'Stopped'):\n",
    "    describe_response = sagemaker.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n",
    "    job_run_status = describe_response['AutoMLJobStatus']\n",
    "    \n",
    "    print (describe_response['AutoMLJobStatus'] + \" - \" + describe_response['AutoMLJobSecondaryStatus'])\n",
    "    sleep(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": true
   },
   "source": [
    "## Results\n",
    "\n",
    "Now use the describe_auto_ml_job API to look up the best candidate selected by the SageMaker Autopilot job. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'CandidateName': 'tuning-job-1-b6fce942e64f462b85-159-becf07eb', 'FinalAutoMLJobObjectiveMetric': {'MetricName': 'validation:f1', 'Value': 0.7820500135421753}, 'ObjectiveStatus': 'Succeeded', 'CandidateSteps': [{'CandidateStepType': 'AWS::SageMaker::ProcessingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:661554271967:processing-job/db-1-8b5eb2ca9d794116af8572604bf1c0aee38e9974a452406da6df7d54a7', 'CandidateStepName': 'db-1-8b5eb2ca9d794116af8572604bf1c0aee38e9974a452406da6df7d54a7'}, {'CandidateStepType': 'AWS::SageMaker::TrainingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:661554271967:training-job/automl-ban-dpp1-1-95b9889c0cc94f4288b8f32c3b213fca60f37549412e4', 'CandidateStepName': 'automl-ban-dpp1-1-95b9889c0cc94f4288b8f32c3b213fca60f37549412e4'}, {'CandidateStepType': 'AWS::SageMaker::TransformJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:661554271967:transform-job/automl-ban-dpp1-csv-1-a5c4209233014ea6b8be6631cff6354288ea59ac9', 'CandidateStepName': 'automl-ban-dpp1-csv-1-a5c4209233014ea6b8be6631cff6354288ea59ac9'}, {'CandidateStepType': 'AWS::SageMaker::TrainingJob', 'CandidateStepArn': 'arn:aws:sagemaker:us-east-1:661554271967:training-job/tuning-job-1-b6fce942e64f462b85-159-becf07eb', 'CandidateStepName': 'tuning-job-1-b6fce942e64f462b85-159-becf07eb'}], 'CandidateStatus': 'Completed', 'InferenceContainers': [{'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3', 'ModelDataUrl': 's3://dominos-aws-yojeo/autopilot/autopilot-dm/output/automl-banking-18-20-30-33/data-processor-models/automl-ban-dpp1-1-95b9889c0cc94f4288b8f32c3b213fca60f37549412e4/output/model.tar.gz', 'Environment': {'AUTOML_TRANSFORM_MODE': 'feature-transform', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'application/x-recordio-protobuf', 'SAGEMAKER_PROGRAM': 'sagemaker_serve', 'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/code'}}, {'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3', 'ModelDataUrl': 's3://dominos-aws-yojeo/autopilot/autopilot-dm/output/automl-banking-18-20-30-33/tuning/automl-ban-dpp1-xgb/tuning-job-1-b6fce942e64f462b85-159-becf07eb/output/model.tar.gz', 'Environment': {'MAX_CONTENT_LENGTH': '20971520', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'text/csv', 'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label', 'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,probabilities'}}, {'Image': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-sklearn-automl:0.2-1-cpu-py3', 'ModelDataUrl': 's3://dominos-aws-yojeo/autopilot/autopilot-dm/output/automl-banking-18-20-30-33/data-processor-models/automl-ban-dpp1-1-95b9889c0cc94f4288b8f32c3b213fca60f37549412e4/output/model.tar.gz', 'Environment': {'AUTOML_TRANSFORM_MODE': 'inverse-label-transform', 'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT': 'text/csv', 'SAGEMAKER_INFERENCE_INPUT': 'predicted_label', 'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label', 'SAGEMAKER_INFERENCE_SUPPORTED': 'predicted_label,probability,labels,probabilities', 'SAGEMAKER_PROGRAM': 'sagemaker_serve', 'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/model/code'}}], 'CreationTime': datetime.datetime(2021, 3, 18, 21, 32, 7, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2021, 3, 18, 21, 34, 17, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2021, 3, 18, 21, 57, 36, 243000, tzinfo=tzlocal())}\n",
      "\n",
      "\n",
      "CandidateName: tuning-job-1-b6fce942e64f462b85-159-becf07eb\n",
      "FinalAutoMLJobObjectiveMetricName: validation:f1\n",
      "FinalAutoMLJobObjectiveMetricValue: 0.7820500135421753\n"
     ]
    }
   ],
   "source": [
    "best_candidate = sagemaker.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']\n",
    "best_candidate_name = best_candidate['CandidateName']\n",
    "print(best_candidate)\n",
    "print('\\n')\n",
    "print(\"CandidateName: \" + best_candidate_name)\n",
    "print(\"FinalAutoMLJobObjectiveMetricName: \" + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])\n",
    "print(\"FinalAutoMLJobObjectiveMetricValue: \" + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc-hr-collapsed": false
   },
   "source": [
    "### Deploy the best candidate as a SageMaker model endpoint using the best candidate\n",
    "\n",
    "Now that you have successfully completed the SageMaker Autopilot job on the dataset, create a model from any of the candidates by using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model ARN corresponding to the best candidate is : arn:aws:sagemaker:us-east-1:661554271967:model/automl-banking-model-18-20-30-33\n",
      "Model Name corresponding to the best candidate is : automl-banking-model-18-20-30-33\n"
     ]
    }
   ],
   "source": [
    "model_name = 'automl-banking-model-' + timestamp_suffix\n",
    "\n",
    "model = sagemaker.create_model(Containers=best_candidate['InferenceContainers'],\n",
    "                                ModelName=model_name,\n",
    "                                ExecutionRoleArn=role)\n",
    "\n",
    "print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Model Endpoint Configuration\n",
    "\n",
    "Before deployng the chosen model, we need to create an endpoint configuration.\n",
    "\n",
    "We set an initial instance count to 1 and an initial variant weight to 1. The initial variant weight being 1 means all of the inference traffic will go to this variant. You could set up multiple variants to divert percentage of traffic to another variant. This allows A/B testing with different model versions. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = sagemaker.create_endpoint_config(\n",
    "    EndpointConfigName='automl-banking-model-endpoint-config',\n",
    "    ProductionVariants=[\n",
    "        {\n",
    "            'VariantName': 'variant1',\n",
    "            'ModelName': model_name,\n",
    "            'InitialInstanceCount': 1,\n",
    "            'InstanceType': 'ml.m4.xlarge',\n",
    "            'InitialVariantWeight': 1\n",
    "        }\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deploy SageMaker Model Endpoint\n",
    "\n",
    "Let's go ahead and deploy the model as a SageMaker endpoint:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "EndpointName=automl-banking-2021-03-18-23-35-05\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'EndpointArn': 'arn:aws:sagemaker:us-east-1:661554271967:endpoint/automl-banking-2021-03-18-23-35-05',\n",
       " 'ResponseMetadata': {'RequestId': '62a54fc7-9de1-4ab1-8ab5-285d0a337f12',\n",
       "  'HTTPStatusCode': 200,\n",
       "  'HTTPHeaders': {'x-amzn-requestid': '62a54fc7-9de1-4ab1-8ab5-285d0a337f12',\n",
       "   'content-type': 'application/x-amz-json-1.1',\n",
       "   'content-length': '102',\n",
       "   'date': 'Thu, 18 Mar 2021 23:35:06 GMT'},\n",
       "  'RetryAttempts': 0}}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datetime import datetime\n",
    "endpoint_name = f\"automl-banking-{datetime.now():%Y-%m-%d-%H-%M-%S}\"\n",
    "print(f\"EndpointName={endpoint_name}\")\n",
    "\n",
    "sagemaker.create_endpoint(\n",
    "        EndpointName=endpoint_name,\n",
    "        EndpointConfigName='automl-banking-model-endpoint-config'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Perform batch inference using the best candidate (optional)\n",
    "\n",
    "You can use batch inference by using Amazon SageMaker batch transform. The same model can also be deployed to perform online inference using Amazon SageMaker hosting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'TransformJobArn': 'arn:aws:sagemaker:us-east-1:661554271967:transform-job/automl-banking-transform-18-20-30-33',\n",
       " 'ResponseMetadata': {'RequestId': '588458a9-8056-4858-ad8d-1da60208809d',\n",
       "  'HTTPStatusCode': 200,\n",
       "  'HTTPHeaders': {'x-amzn-requestid': '588458a9-8056-4858-ad8d-1da60208809d',\n",
       "   'content-type': 'application/x-amz-json-1.1',\n",
       "   'content-length': '113',\n",
       "   'date': 'Fri, 19 Mar 2021 01:10:09 GMT'},\n",
       "  'RetryAttempts': 0}}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_data_s3_path = 's3://{}/{}/test_data.csv'.format(bucket,prefix)\n",
    "transform_job_name = 'automl-banking-transform-' + timestamp_suffix\n",
    "\n",
    "transform_input = {\n",
    "        'DataSource': {\n",
    "            'S3DataSource': {\n",
    "                'S3DataType': 'S3Prefix',\n",
    "                'S3Uri': test_data_s3_path\n",
    "            }\n",
    "        },\n",
    "        'ContentType': 'text/csv',\n",
    "        'CompressionType': 'None',\n",
    "        'SplitType': 'Line'\n",
    "    }\n",
    "\n",
    "transform_output = {\n",
    "        'S3OutputPath': 's3://{}/{}/inference-results'.format(bucket,prefix),\n",
    "    }\n",
    "\n",
    "transform_resources = {\n",
    "        'InstanceType': 'ml.m5.4xlarge',\n",
    "        'InstanceCount': 1\n",
    "    }\n",
    "\n",
    "sagemaker.create_transform_job(TransformJobName = transform_job_name,\n",
    "                                ModelName = model_name,\n",
    "                                TransformInput = transform_input,\n",
    "                                TransformOutput = transform_output,\n",
    "                                TransformResources = transform_resources\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Watch the transform job for completion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "JobStatus\n",
      "----------\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "InProgress\n",
      "Completed\n"
     ]
    }
   ],
   "source": [
    "print ('JobStatus')\n",
    "print('----------')\n",
    "\n",
    "\n",
    "describe_response = sagemaker.describe_transform_job(TransformJobName = transform_job_name)\n",
    "job_run_status = describe_response['TransformJobStatus']\n",
    "print (job_run_status)\n",
    "\n",
    "while job_run_status not in ('Failed', 'Completed', 'Stopped'):\n",
    "    describe_response = sagemaker.describe_transform_job(TransformJobName = transform_job_name)\n",
    "    job_run_status = describe_response['TransformJobStatus']\n",
    "    print (job_run_status)\n",
    "    sleep(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's view the results of the transform job:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>no</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8232</th>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8233</th>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8234</th>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8235</th>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8236</th>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8237 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       no\n",
       "0      no\n",
       "1      no\n",
       "2      no\n",
       "3      no\n",
       "4      no\n",
       "...   ...\n",
       "8232  yes\n",
       "8233  yes\n",
       "8234   no\n",
       "8235  yes\n",
       "8236  yes\n",
       "\n",
       "[8237 rows x 1 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s3_output_key = '{}/inference-results/test_data.csv.out'.format(prefix);\n",
    "local_inference_results_path = 'inference_results.csv'\n",
    "\n",
    "s3 = boto3.resource('s3')\n",
    "inference_results_bucket = s3.Bucket(bucket) #session.default_bucket())\n",
    "\n",
    "inference_results_bucket.download_file(s3_output_key, local_inference_results_path);\n",
    "\n",
    "data = pd.read_csv(local_inference_results_path, sep=';')\n",
    "pd.set_option('display.max_rows', 10)         # Keep the output on one page\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View other candidates explored by SageMaker Autopilot\n",
    "You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker Autopilot and sort them by their final performance metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1  tuning-job-1-b6fce942e64f462b85-159-becf07eb  0.7820500135421753\n",
      "2  tuning-job-1-b6fce942e64f462b85-151-8e1854bf  0.780269980430603\n",
      "3  tuning-job-1-b6fce942e64f462b85-240-c42cdb5d  0.780239999294281\n",
      "4  tuning-job-1-b6fce942e64f462b85-248-ff591660  0.7782999873161316\n",
      "5  tuning-job-1-b6fce942e64f462b85-155-7b329a73  0.7778099775314331\n",
      "6  tuning-job-1-b6fce942e64f462b85-213-0be20055  0.777649998664856\n",
      "7  tuning-job-1-b6fce942e64f462b85-065-1cb80127  0.7772899866104126\n",
      "8  tuning-job-1-b6fce942e64f462b85-249-c0c22465  0.776669979095459\n",
      "9  tuning-job-1-b6fce942e64f462b85-230-f5edd353  0.776419997215271\n",
      "10  tuning-job-1-b6fce942e64f462b85-175-490041ce  0.7763500213623047\n"
     ]
    }
   ],
   "source": [
    "candidates = sagemaker.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']\n",
    "index = 1\n",
    "for candidate in candidates:\n",
    "  print(str(index) + \"  \" + candidate['CandidateName'] + \"  \" + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))\n",
    "  index += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Candidate Generation Notebook\n",
    "    \n",
    "Sagemaker AutoPilot also auto-generates a Candidate Definitions notebook. This notebook can be used to interactively step through the various steps taken by the Sagemaker Autopilot to arrive at the best candidate. This notebook can also be used to override various runtime parameters like parallelism, hardware used, algorithms explored, feature extraction scripts and more.\n",
    "    \n",
    "The notebook can be downloaded from the following Amazon S3 location:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'s3://dominos-aws-yojeo/autopilot/autopilot-dm/output/automl-banking-18-20-30-33/sagemaker-automl-candidates/pr-1-7e7cd77737e8473190b89f3a728cfa9e68b88f72319e49b49122112da6/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sagemaker.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Exploration Notebook\n",
    "Sagemaker Autopilot also auto-generates a Data Exploration notebook, which can be downloaded from the following Amazon S3 location:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'s3://dominos-aws-yojeo/autopilot/autopilot-dm/output/automl-banking-18-20-30-33/sagemaker-automl-candidates/pr-1-7e7cd77737e8473190b89f3a728cfa9e68b88f72319e49b49122112da6/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sagemaker.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['DataExplorationNotebookLocation']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanup\n",
    "\n",
    "The Autopilot job creates many underlying artifacts such as dataset splits, preprocessing scripts, or preprocessed data, etc. This code, when un-commented, deletes them. This operation deletes all the generated models and the auto-generated notebooks as well. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#s3 = boto3.resource('s3')\n",
    "#bucket = s3.Bucket(bucket)\n",
    "\n",
    "#job_outputs_prefix = '{}/output/{}'.format(prefix,auto_ml_job_name)\n",
    "#bucket.objects.filter(Prefix=job_outputs_prefix).delete()"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}