{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ac5252bc-c2b2-4e66-9925-5ce14efb7c5b",
   "metadata": {},
   "source": [
    "# Train Autopilot models using a subset of data, and reuse its recipes on the whole dataset\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.\n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/ap-batch-transform.ipynb)\n",
    "\n",
    "---\n",
    "\n",
    "Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (without any human input) or with human guidance, without code through SageMaker Studio or scripted using the AWS SDKs. This notebook will use the AWS SDKs to create an Autopilot job on a subset of data, and apply the recipes on a whole dataset. Compared with running Autopilot job directly on the whole dataset, you save cost and build models more efficiently."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d5c582b-4744-4543-8f04-3b98c84cb327",
   "metadata": {},
   "source": [
    "## Setup Environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd7c6d92-6254-4c71-9e26-bfe919faffef",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "import botocore\n",
    "import time\n",
    "import json\n",
    "import sagemaker\n",
    "from sagemaker import AutoML, get_execution_role\n",
    "from datetime import datetime, timezone\n",
    "import pandas as pd\n",
    "\n",
    "sess = sagemaker.session.Session()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.session.Session().region_name\n",
    "\n",
    "bucket = sess.default_bucket()\n",
    "prefix = \"autopilot-subset-0708\"  # change this to your own prefix\n",
    "sagemaker_client = boto3.client(\"sagemaker\")\n",
    "session = sagemaker.Session()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5322de50-4d15-45db-adb7-a0bff3809f86",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from time import gmtime, strftime, sleep\n",
    "\n",
    "timestamp_suffix = strftime(\"%Y%m%d-%H-%M\", gmtime())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8e3dbb2-14e1-428d-ba1e-82a0c09dc802",
   "metadata": {},
   "source": [
    "## Prepare Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9daf23a1-0419-4c58-8b4d-e3e3ea03f1c3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# download data from s3\n",
    "s3 = boto3.client(\"s3\")\n",
    "s3.download_file(\n",
    "    \"sagemaker-example-files-prod-{}\".format(region),\n",
    "    \"datasets/tabular/uci_bank_marketing/bank-additional-full.csv\",\n",
    "    \"bank-data.csv\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f216a2d-51d7-4183-a2eb-5a5247436d23",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# read the dataset\n",
    "bank_df = pd.read_csv(\"bank-data.csv\", sep=\",\")\n",
    "pd.set_option(\"display.max_columns\", 500)\n",
    "bank_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc49d0ac-f2b6-4472-b62d-f196fcacf5c2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Split the whole dataset to training and testing (80/20 split)\n",
    "target = \"y\"\n",
    "train_data = bank_df.sample(frac=0.8, random_state=200)\n",
    "\n",
    "test_data = bank_df.drop(train_data.index)\n",
    "\n",
    "test_data_no_target = test_data.drop(columns=[target])\n",
    "\n",
    "# shift the target column from the last to the first\n",
    "cols = list(train_data.columns)\n",
    "cols = [cols[-1]] + cols[:-1]\n",
    "train_data = train_data[cols]\n",
    "\n",
    "### Upload the datasets to S3\n",
    "train_val_file = \"train_val_whole_data.csv\"\n",
    "train_data.to_csv(train_val_file, index=False, header=True)\n",
    "train_val_data_s3_path = sess.upload_data(path=train_val_file, key_prefix=prefix + \"/train\")\n",
    "print(\"Train-validation data uploaded to: \" + train_val_data_s3_path)\n",
    "\n",
    "test_file = \"test_data_no_target.csv\"\n",
    "test_data_no_target.to_csv(test_file, index=False, header=False)\n",
    "test_data_s3_path = sess.upload_data(path=test_file, key_prefix=prefix + \"/test\")\n",
    "print(\"Test data uploaded to: \" + test_data_s3_path)\n",
    "\n",
    "# further split the whole training dataset into training and validation datasets (75/25 split)\n",
    "# these will be used to re-fit model using automl's recipes\n",
    "val_data = train_data.sample(frac=0.25, random_state=200)\n",
    "train_data = train_data.drop(val_data.index)\n",
    "\n",
    "### Upload the datasets to S3\n",
    "train_file = \"train_whole_data.csv\"\n",
    "train_data.to_csv(train_file, index=False, header=False)\n",
    "train_data_s3_path = sess.upload_data(path=train_file, key_prefix=prefix + \"/train\")\n",
    "print(\"Train data uploaded to: \" + train_data_s3_path)\n",
    "\n",
    "val_file = \"val_whole_data.csv\"\n",
    "val_data.to_csv(val_file, index=False, header=False)\n",
    "val_data_s3_path = sess.upload_data(path=val_file, key_prefix=prefix + \"/val\")\n",
    "print(\"validation data uploaded to: \" + val_data_s3_path)\n",
    "\n",
    "# take a subset of the training data to run autopilot job, e.g. 20%\n",
    "# TODO: stratified sampling\n",
    "\n",
    "train_subset = train_data.sample(frac=0.2, random_state=100)\n",
    "\n",
    "# keep header for subset data, since autopilot need header information to pick target\n",
    "train_file = \"train_subset_data.csv\"\n",
    "train_subset.to_csv(train_file, index=False, header=True)\n",
    "train_subset_s3_path = sess.upload_data(path=train_file, key_prefix=prefix + \"/train\")\n",
    "print(\"Train subset uploaded to: \" + train_subset_s3_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aefbc26e-4ae9-4e9c-8cd6-47f85219702f",
   "metadata": {},
   "source": [
    "## Setup Autopilot Jobs\n",
    "\n",
    "Kick off an autopilot job using only a subset of the training data.\n",
    "\n",
    "For baselining purpose, we also kick off an autopilot job using the whole training data. Note that the solution itself only needs the first autopilot job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec182367-804c-4b73-8cac-5c8072758251",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def run_auto_ml_job(input_s3_path, target, auto_ml_job_name, job_config, role, bucket, prefix):\n",
    "    input_data_config = [\n",
    "        {\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": input_s3_path,\n",
    "                }\n",
    "            },\n",
    "            \"TargetAttributeName\": target,\n",
    "        }\n",
    "    ]\n",
    "\n",
    "    output_data_config = {\"S3OutputPath\": \"s3://{}/{}/full-output\".format(bucket, prefix)}\n",
    "\n",
    "    print(\"AutoMLJobName: \" + auto_ml_job_name)\n",
    "\n",
    "    sagemaker_client.create_auto_ml_job(\n",
    "        AutoMLJobName=auto_ml_job_name,\n",
    "        InputDataConfig=input_data_config,\n",
    "        OutputDataConfig=output_data_config,\n",
    "        AutoMLJobConfig=job_config,\n",
    "        RoleArn=role,\n",
    "    )\n",
    "\n",
    "    print(\"JobStatus - Secondary Status\")\n",
    "    print(\"------------------------------\")\n",
    "\n",
    "    describe_response = sagemaker_client.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n",
    "    print(\n",
    "        describe_response[\"AutoMLJobStatus\"] + \" - \" + describe_response[\"AutoMLJobSecondaryStatus\"]\n",
    "    )\n",
    "    job_run_status = describe_response[\"AutoMLJobStatus\"]\n",
    "\n",
    "    while job_run_status not in (\"Failed\", \"Completed\", \"Stopped\"):\n",
    "        describe_response = sagemaker_client.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n",
    "        job_run_status = describe_response[\"AutoMLJobStatus\"]\n",
    "\n",
    "        print(\n",
    "            describe_response[\"AutoMLJobStatus\"]\n",
    "            + \" - \"\n",
    "            + describe_response[\"AutoMLJobSecondaryStatus\"]\n",
    "        )\n",
    "        sleep(60)\n",
    "\n",
    "    return job_run_status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8af5eb0-5807-407e-ba38-b96459064e72",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "input_s3_path = train_subset_s3_path\n",
    "auto_ml_job_name = \"auto-bank-sub-\" + timestamp_suffix\n",
    "job_config = {\"CompletionCriteria\": {\"MaxCandidates\": 10}}\n",
    "\n",
    "response = run_auto_ml_job(\n",
    "    input_s3_path, target, auto_ml_job_name, job_config, role, bucket, prefix\n",
    ")\n",
    "response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42bf189f-09a1-450d-bf2e-2bc3e82b8098",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Autopilot job for baselining purpose\n",
    "input_s3_path = train_val_data_s3_path\n",
    "auto_ml_baseline_job_name = \"auto-bank-all-\" + timestamp_suffix\n",
    "\n",
    "response = run_auto_ml_job(\n",
    "    input_s3_path, target, auto_ml_baseline_job_name, job_config, role, bucket, prefix\n",
    ")\n",
    "response"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5edcde14-f008-41a4-986f-a9ebd0452252",
   "metadata": {},
   "source": [
    "## Retrieve best models from the Autopilot job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "004c08b1-f671-4e66-9f11-e61a430f8d0f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# get best candidate of the job\n",
    "\n",
    "# auto_ml_job_name = 'bank-subset-20230524-13-12'\n",
    "automl = AutoML.attach(auto_ml_job_name=auto_ml_job_name)\n",
    "best_candidate = automl.describe_auto_ml_job()[\"BestCandidate\"]\n",
    "best_candidate_name = best_candidate[\"CandidateName\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a98fc6b-e58f-4831-9547-39d3fa029a41",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "best_candidate[\"CandidateSteps\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8108042c-a7bc-4851-a4ce-ce9ec4e99161",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# you will need transform job for data processing/feature engineering\n",
    "# and tuning job for model tuning/fitting\n",
    "transform_job = \"\"\n",
    "tuning_job = \"\"\n",
    "for index in range(len(best_candidate[\"CandidateSteps\"])):\n",
    "    if (\n",
    "        best_candidate[\"CandidateSteps\"][index][\"CandidateStepType\"]\n",
    "        == \"AWS::SageMaker::TransformJob\"\n",
    "    ):\n",
    "        transform_job = best_candidate[\"CandidateSteps\"][index][\"CandidateStepName\"]\n",
    "    if (\n",
    "        best_candidate[\"CandidateSteps\"][index][\"CandidateStepType\"]\n",
    "        == \"AWS::SageMaker::TrainingJob\"\n",
    "    ):\n",
    "        tuning_job = best_candidate[\"CandidateSteps\"][index][\"CandidateStepName\"]\n",
    "\n",
    "print(f\"transform_job:{transform_job}, tuning_job:{tuning_job}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85e0afd4-11f1-4aee-92e0-2c2227c179c6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# If you describe the transform job, you will see the recipe of data processing\n",
    "response = sagemaker_client.describe_transform_job(TransformJobName=transform_job)\n",
    "response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "949a920a-cf5a-4987-962e-1d2d6bb831f0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# The recipe of data processing we will use in later steps\n",
    "transform_model = response[\"ModelName\"]\n",
    "input_content_type = response[\"TransformInput\"][\"ContentType\"]\n",
    "output_accept = response[\"TransformOutput\"][\"Accept\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38c5bc2a-7aee-4182-b94f-817e75fad3a8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# If you describe the training job, you will see the recipe of model tuning\n",
    "response = sagemaker_client.describe_training_job(TrainingJobName=tuning_job)\n",
    "# training_model = response['ModelName']\n",
    "response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35faf2be-e92e-4447-948d-c0c8bc9c13e4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# The recipe of model tuning we will use in later steps\n",
    "hyperparameters = response[\"HyperParameters\"]\n",
    "TrainingImage = response[\"AlgorithmSpecification\"][\"TrainingImage\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "969636a9-7493-4a68-af3a-8b4d027b3114",
   "metadata": {},
   "source": [
    "## Transform the whole training dataset with the best transform model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "348a9331-8477-4546-b9b8-634ad360d3b5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "timestamp_suffix = strftime(\"%Y%m%d-%H-%M\", gmtime())\n",
    "#\n",
    "train_transform_job = \"ap-transform-whole-train-\" + timestamp_suffix\n",
    "val_transform_job = \"ap-transform-whole-val-\" + timestamp_suffix\n",
    "#'s3://sagemaker-us-east-1-827930657850/autopilot-subset-recipe-0524/output/bank-subset-20230524-13-12/preprocessed-data/tuning_data/train/chunk_0.csv'\n",
    "train_transform_input = train_data_s3_path\n",
    "val_transform_input = val_data_s3_path\n",
    "\n",
    "\n",
    "train_transform_output = \"s3://{}/{}/output/whole/train\".format(bucket, prefix)\n",
    "val_transform_output = \"s3://{}/{}/output/whole/val\".format(bucket, prefix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "288929fe-f1f0-4519-a287-115775421e09",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# kick off a new batch transform job for training and validation dataset\n",
    "new_transform_job = train_transform_job\n",
    "new_transform_input = train_transform_input\n",
    "new_transform_output = train_transform_output\n",
    "\n",
    "sagemaker_client.create_transform_job(\n",
    "    TransformJobName=new_transform_job,\n",
    "    ModelName=transform_model,\n",
    "    MaxConcurrentTransforms=4,\n",
    "    BatchStrategy=\"MultiRecord\",\n",
    "    TransformInput={\n",
    "        \"DataSource\": {\"S3DataSource\": {\"S3DataType\": \"S3Prefix\", \"S3Uri\": new_transform_input}},\n",
    "        \"ContentType\": input_content_type,\n",
    "        \"SplitType\": \"Line\",\n",
    "    },\n",
    "    TransformResources={\"InstanceType\": \"ml.m5.4xlarge\", \"InstanceCount\": 1},\n",
    "    TransformOutput={\"S3OutputPath\": new_transform_output, \"Accept\": output_accept},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20456002-53eb-495b-94b4-1148db317348",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "new_transform_job = val_transform_job\n",
    "new_transform_input = val_transform_input\n",
    "new_transform_output = val_transform_output\n",
    "\n",
    "sagemaker_client.create_transform_job(\n",
    "    TransformJobName=new_transform_job,\n",
    "    ModelName=transform_model,\n",
    "    MaxConcurrentTransforms=4,\n",
    "    BatchStrategy=\"MultiRecord\",\n",
    "    TransformInput={\n",
    "        \"DataSource\": {\"S3DataSource\": {\"S3DataType\": \"S3Prefix\", \"S3Uri\": new_transform_input}},\n",
    "        \"ContentType\": input_content_type,\n",
    "        \"SplitType\": \"Line\",\n",
    "    },\n",
    "    TransformResources={\"InstanceType\": \"ml.m5.2xlarge\", \"InstanceCount\": 1},\n",
    "    TransformOutput={\"S3OutputPath\": new_transform_output, \"Accept\": output_accept},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7a3b233-4039-4057-8038-60be74967269",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# track batch transform progress\n",
    "print(\"JobStatus\")\n",
    "print(\"------------------------------\")\n",
    "\n",
    "info = sagemaker_client.describe_transform_job(TransformJobName=train_transform_job)\n",
    "job_run_status = info[\"TransformJobStatus\"]\n",
    "# print(info[\"TrainingJobStatus\"] + \" - \" + info[\"SecondaryStatus\"])\n",
    "\n",
    "while job_run_status not in (\"Failed\", \"Completed\", \"Stopped\"):\n",
    "    info = sagemaker_client.describe_transform_job(TransformJobName=train_transform_job)\n",
    "    job_run_status = info[\"TransformJobStatus\"]\n",
    "    print(info[\"TransformJobStatus\"])\n",
    "    sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd456342-a0e3-4c25-bb80-1f231eb2f3bc",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Create a new training job\n",
    "Using the hyperparameters from the best candidate and fit on a larger dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "247769c0-50a9-43e1-acaf-d6a81a929b8d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# train_data_s3_path = 's3://{}/{}/output/bank-subset-20230523-21-26/transformed-data/dpp5/rpb/train'.format(bucket, prefix)\n",
    "# val_data_s3_path = 's3://{}/{}/output/bank-subset-20230523-21-26/transformed-data/dpp5/rpb/validation'.format(bucket, prefix)\n",
    "train_data_path = train_transform_output\n",
    "val_data_path = train_transform_output\n",
    "\n",
    "model_s3_path = \"s3://{}/{}/output/training\".format(bucket, prefix)\n",
    "train_instance = \"ml.m5.4xlarge\"\n",
    "instance_count = 1\n",
    "training_job_name = \"ap-training-api-\" + timestamp_suffix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41930f30-cd09-4903-84b5-816f450f0571",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# kick off a new training job\n",
    "sagemaker_client.create_training_job(\n",
    "    TrainingJobName=training_job_name,\n",
    "    RoleArn=role,\n",
    "    AlgorithmSpecification={\n",
    "        \"TrainingImage\": response[\"AlgorithmSpecification\"][\"TrainingImage\"],\n",
    "        \"TrainingInputMode\": \"File\",\n",
    "    },\n",
    "    HyperParameters=response[\"HyperParameters\"],\n",
    "    InputDataConfig=[\n",
    "        {\n",
    "            \"ChannelName\": \"train\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": train_data_path,\n",
    "                    \"S3DataDistributionType\": \"ShardedByS3Key\",\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": output_accept,\n",
    "            \"CompressionType\": \"None\",\n",
    "            \"RecordWrapperType\": \"None\",\n",
    "            \"InputMode\": \"Pipe\",\n",
    "            # \"EnableFFM\": False,\n",
    "        },\n",
    "        {\n",
    "            \"ChannelName\": \"validation\",\n",
    "            \"DataSource\": {\n",
    "                \"S3DataSource\": {\n",
    "                    \"S3DataType\": \"S3Prefix\",\n",
    "                    \"S3Uri\": val_data_path,\n",
    "                    \"S3DataDistributionType\": \"FullyReplicated\",\n",
    "                }\n",
    "            },\n",
    "            \"ContentType\": output_accept,\n",
    "            \"CompressionType\": \"None\",\n",
    "            \"RecordWrapperType\": \"None\",\n",
    "            \"InputMode\": \"Pipe\",\n",
    "            # \"EnableFFM\": False,\n",
    "        },\n",
    "    ],\n",
    "    OutputDataConfig={\n",
    "        \"KmsKeyId\": \"\",\n",
    "        \"S3OutputPath\": model_s3_path,\n",
    "    },\n",
    "    ResourceConfig={\n",
    "        \"InstanceType\": train_instance,\n",
    "        \"InstanceCount\": instance_count,\n",
    "        \"VolumeSizeInGB\": 50,\n",
    "    },\n",
    "    StoppingCondition={\"MaxRuntimeInSeconds\": 86400},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a7036a3-f410-42c7-b0ca-0a9a6c4e2baf",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "model_name = training_job_name\n",
    "\n",
    "# track model training progress\n",
    "print(\"JobStatus - Secondary Status\")\n",
    "print(\"------------------------------\")\n",
    "\n",
    "\n",
    "info = sagemaker_client.describe_training_job(TrainingJobName=model_name)\n",
    "job_run_status = info[\"TrainingJobStatus\"]\n",
    "print(info[\"TrainingJobStatus\"] + \" - \" + info[\"SecondaryStatus\"])\n",
    "\n",
    "while job_run_status not in (\"Failed\", \"Completed\", \"Stopped\"):\n",
    "    info = sagemaker_client.describe_training_job(TrainingJobName=model_name)\n",
    "    job_run_status = info[\"TrainingJobStatus\"]\n",
    "    print(info[\"TrainingJobStatus\"] + \" - \" + info[\"SecondaryStatus\"])\n",
    "    sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9deabc34-d9f2-4984-8a06-841c062f1b99",
   "metadata": {},
   "source": [
    "## Create a model using SageMaker AutoML API & Batch Transform Predict test dataset\n",
    "We will replace the original tuning model in the autopilot job with the newly fitted model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56e23660-d40b-4f5a-aaea-6ab46064f557",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# original recipe\n",
    "print(best_candidate[\"CandidateSteps\"])\n",
    "# replace tuning candidate with re-fitted model\n",
    "best_candidate[\"CandidateSteps\"][3][\"CandidateStepName\"] = training_job_name\n",
    "best_candidate[\"CandidateSteps\"][3][\"CandidateStepArn\"] = info[\"TrainingJobArn\"]\n",
    "# validate\n",
    "print(best_candidate[\"CandidateSteps\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12945400-c7cd-45f2-9dab-625b8aa48d7f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# original recipe\n",
    "print(best_candidate[\"InferenceContainers\"][1])\n",
    "# replace tuning container with re-fitted model\n",
    "best_candidate[\"InferenceContainers\"][1][\"ModelDataUrl\"] = info[\"ModelArtifacts\"][\n",
    "    \"S3ModelArtifacts\"\n",
    "]\n",
    "# validate\n",
    "print(best_candidate[\"InferenceContainers\"][1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f38c7db-ade8-4ea6-bcbe-b157d4db143e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# creat model by replacing recipes of best candidate\n",
    "inference_response_keys = [\"predicted_label\", \"probability\"]\n",
    "\n",
    "hack_model = automl.create_model(\n",
    "    name=best_candidate_name + \"-hack\",\n",
    "    candidate=best_candidate,\n",
    "    inference_response_keys=inference_response_keys,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea3d06e4-85e9-44c2-89c1-76be4704def9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def batch_predict(model, job_name, output_path, instance_count, instance_type, content_type):\n",
    "    transformer = model.transformer(\n",
    "        instance_count=instance_count,\n",
    "        instance_type=instance_type,\n",
    "        assemble_with=\"Line\",\n",
    "        output_path=output_path,\n",
    "    )\n",
    "\n",
    "    response = transformer.transform(\n",
    "        data=test_data_s3_path,\n",
    "        split_type=\"Line\",\n",
    "        content_type=content_type,\n",
    "        job_name=job_name,\n",
    "        wait=False,\n",
    "    )\n",
    "    # track batch transform progress\n",
    "\n",
    "    info = sagemaker_client.describe_transform_job(TransformJobName=job_name)\n",
    "    job_run_status = info[\"TransformJobStatus\"]\n",
    "\n",
    "    while job_run_status not in (\"Failed\", \"Completed\", \"Stopped\"):\n",
    "        info = sagemaker_client.describe_transform_job(TransformJobName=train_transform_job)\n",
    "        job_run_status = info[\"TransformJobStatus\"]\n",
    "        sleep(60)\n",
    "    return job_run_status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f0b5158-6743-496d-9bf8-851d26de61c2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Now you can batch transform the test dataset\n",
    "batch_output = \"s3://{}/{}/output/whole/batchTransformHack\".format(bucket, prefix)\n",
    "job_name = \"hack-batch-predict\" + timestamp_suffix\n",
    "job_run_status = batch_predict(\n",
    "    hack_model,\n",
    "    \"test-batch-predict-with-status\",\n",
    "    batch_output,\n",
    "    1,\n",
    "    \"ml.m5.xlarge\",\n",
    "    \"text/csv\",\n",
    ")\n",
    "job_run_status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2773d6e7-8255-499e-baa1-0b209a04222a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# model_data = info[\"ModelArtifacts\"][\"S3ModelArtifacts\"]\n",
    "\n",
    "# primary_container = {\"Image\": TrainingImage, \"ModelDataUrl\": model_data}\n",
    "\n",
    "# create_model_response = sagemaker_client.create_model(\n",
    "#     ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container\n",
    "# )\n",
    "\n",
    "# print(create_model_response[\"ModelArn\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e59e051b-c9a4-419f-a92d-825bd2e8acb5",
   "metadata": {},
   "source": [
    "## Last Step! Evaluate the model and compare it with the baseline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d1d5c37-be95-4c5b-a705-c1ddf61dcec1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# creat model using the baseline autopilot job\n",
    "\n",
    "automl = AutoML.attach(auto_ml_job_name=auto_ml_baseline_job_name)\n",
    "best_candidate = automl.describe_auto_ml_job()[\"BestCandidate\"]\n",
    "best_candidate_name = best_candidate[\"CandidateName\"]\n",
    "\n",
    "baseline_model = automl.create_model(\n",
    "    name=best_candidate_name,\n",
    "    candidate=best_candidate,\n",
    "    inference_response_keys=inference_response_keys,\n",
    ")\n",
    "\n",
    "batch_baseline_output = \"s3://{}/{}/output/whole/batchTransform\".format(bucket, prefix)\n",
    "job_name = \"baseline-batch-predict\" + timestamp_suffix\n",
    "job_run_status = batch_predict(\n",
    "    baseline_model, job_name, batch_baseline_output, 1, \"ml.m5.xlarge\", \"text/csv\"\n",
    ")\n",
    "job_run_status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ef0d0a4-0774-43a4-96c7-0d8fa0a4ae42",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# create pandas dataframe from s3 files\n",
    "import re\n",
    "\n",
    "\n",
    "def get_csv_output_from_s3(s3uri, batch_file):\n",
    "    file_name = \"{}.out\".format(batch_file)\n",
    "    match = re.match(\"s3://([^/]+)/(.*)\", \"{}/{}\".format(s3uri, file_name))\n",
    "    output_bucket, output_prefix = match.group(1), match.group(2)\n",
    "    s3.download_file(output_bucket, output_prefix, file_name)\n",
    "    return pd.read_csv(file_name, sep=\",\", header=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2da766f8-d0a5-4721-b6bb-1c3cbff6712a",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "baseline_df = get_csv_output_from_s3(batch_baseline_output, test_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59bd2f88-bda8-4c6d-b6ac-9e55d967fffa",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "hackmodel_df = get_csv_output_from_s3(batch_output, test_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b9fb66e-dd5f-486e-ad09-a880836549ff",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# groundtruth\n",
    "test_groundtruth = test_data[\"y\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7442c6f1-9363-431b-afaa-c6afe989e3a9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# compare this evaluation result with the autopilot model trained on whole dataset\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "baseline_score = accuracy_score(test_groundtruth, baseline_df.iloc[:, 0])\n",
    "hackmodel_score = accuracy_score(test_groundtruth, hackmodel_df.iloc[:, 0])\n",
    "print(\n",
    "    \"Baseline model (autopilot on whole training data) score: {}, our model score: {}\".format(\n",
    "        baseline_score, hackmodel_score\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "227dd304-950e-48bf-9fef-1fff94723eac",
   "metadata": {},
   "source": [
    "## The end"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a2c3a3d-da2b-43c4-8925-f00e9dcb54ff",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/ap-batch-transform.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/ap-batch-transform.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/ap-batch-transform.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/ap-batch-transform.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/ap-batch-transform.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/ap-batch-transform.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/ap-batch-transform.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science 3.0)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}