{ "cells": [ { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "42b5e80b-ad1d-4335-a1f7-10a91127e3dc" } }, "source": [ "# Amazon SageMaker Batch Transform\n", "\n", "## Background\n", "This purpose of this notebook is to train a model using SageMaker's XGBoost and UCI's breast cancer diagnostic data set to illustrate at how to run batch inferences and how to use the Batch Transform I/O join feature. UCI's breast cancer diagnostic data set is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The purpose here is to use this data set to build a predictve model of whether a breast mass image indicates benign or malignant tumor. \n", "\n", "\n", "\n", "---\n", "\n", "## Setup\n", "\n", "Let's start by specifying:\n", "\n", "* The SageMaker role arn used to give training and batch transform access to your data. The snippet below will use the same role used by your SageMaker notebook instance. Otherwise, specify the full ARN of a role with the SageMakerFullAccess policy attached.\n", "* The S3 bucket that you want to use for training and storing model objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!python -m pip install --upgrade pip --quiet\n", "!pip install -U awscli --quiet\n", "!pip install sagemaker --upgrade" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true, "nbpresent": { "id": "6427e831-8f89-45c0-b150-0b134397d79a" } }, "outputs": [], "source": [ "import os\n", "import boto3\n", "import sagemaker\n", "from time import gmtime, strftime\n", "from datetime import datetime\n", "\n", "boto_session = boto3.session.Session()\n", "sm_session = sagemaker.session.Session()\n", "sm_role = sagemaker.get_execution_role()\n", "region = boto_session.region_name\n", "s3_bucket = sm_session.default_bucket()\n", "bucket_prefix = \"DEMO-breast-cancer-prediction-xgboost-highlevel\"\n", "resource_name = \"BatchInferenceDemo-{}-{}\"\n", "\n", "print(f\"Will use bucket '{s3_bucket}' for storing all resources related to this notebook\")\n", "print(f\"Using Role: {sm_role}\")" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "142777ae-c072-448e-b941-72bc75735d01" } }, "source": [ "---\n", "## Data sources\n", "\n", "> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.\n", "\n", "> Breast Cancer Wisconsin (Diagnostic) Data Set [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)].\n", "\n", "> _Also see:_ Breast Cancer Wisconsin (Diagnostic) Data Set [https://www.kaggle.com/uciml/breast-cancer-wisconsin-data].\n", "\n", "## Data preparation\n", "\n", "\n", "Let's download the data and save it in the local folder with the name data.csv and take a look at it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "f8976dad-6897-4c7e-8c95-ae2f53070ef5" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "s3 = boto3.client(\"s3\")\n", "\n", "filename = \"wdbc.csv\"\n", "s3.download_file(\"sagemaker-sample-files\", \"datasets/tabular/breast_cancer/wdbc.csv\", filename)\n", "data = pd.read_csv(filename, header=None)\n", "\n", "# specify columns extracted from wbdc.names\n", "data.columns = [\n", " \"id\",\n", " \"diagnosis\",\n", " \"radius_mean\",\n", " \"texture_mean\",\n", " \"perimeter_mean\",\n", " \"area_mean\",\n", " \"smoothness_mean\",\n", " \"compactness_mean\",\n", " \"concavity_mean\",\n", " \"concave points_mean\",\n", " \"symmetry_mean\",\n", " \"fractal_dimension_mean\",\n", " \"radius_se\",\n", " \"texture_se\",\n", " \"perimeter_se\",\n", " \"area_se\",\n", " \"smoothness_se\",\n", " \"compactness_se\",\n", " \"concavity_se\",\n", " \"concave points_se\",\n", " \"symmetry_se\",\n", " \"fractal_dimension_se\",\n", " \"radius_worst\",\n", " \"texture_worst\",\n", " \"perimeter_worst\",\n", " \"area_worst\",\n", " \"smoothness_worst\",\n", " \"compactness_worst\",\n", " \"concavity_worst\",\n", " \"concave points_worst\",\n", " \"symmetry_worst\",\n", " \"fractal_dimension_worst\",\n", "]\n", "\n", "# save the data\n", "data.to_csv(\"data.csv\", sep=\",\", index=False)\n", "\n", "data.sample(8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Key observations:\n", "* The data has 569 observations and 32 columns.\n", "* The first field is the 'id' attribute that we will want to drop before batch inference and add to the final inference output next to the probability of malignancy.\n", "* Second field, 'diagnosis', is an indicator of the actual diagnosis ('M' = Malignant; 'B' = Benign).\n", "* There are 30 other numeric features that we will use for training and inferencing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's replace the M/B diagnosis with a 1/0 boolean value. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"diagnosis\"] = data[\"diagnosis\"].apply(lambda x: ((x == \"M\")) + 0)\n", "data.sample(8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the data and set 10% aside for our batch inference job. In addition, let's drop the 'id' field on the training set and validation set as 'id' is not a training feature. For our batch set however, we keep the 'id' feature. We'll want to filter it out prior to running our inferences so that the input data features match the ones of training set and then ultimately, we'll want to join it with inference result. We are however dropping the diagnosis attribute for the batch set since this is what we'll try to predict." 
] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "ff9d10f9-b611-423b-80da-6dcdafd1c8b9" } }, "source": [ "Let's upload those data sets in S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbpresent": { "id": "cd8e3431-79d9-40b6-91d1-d67cd61894e7" } }, "outputs": [], "source": [ "rand_split = np.random.rand(len(data))\n", "batch_list = rand_split >= 0.9\n", "data_batch = data[batch_list].drop([\"diagnosis\"], axis=1)\n", "data_batch_noID = data_batch.drop([\"id\"], axis=1)\n", "\n", "batch_file = \"batch_data.csv\"\n", "data_batch.to_csv(batch_file, index=False, header=False)\n", "sm_session.upload_data(batch_file, key_prefix=\"{}/batch\".format(bucket_prefix))\n", "\n", "batch_file_noID = \"batch_data_noID.csv\"\n", "data_batch_noID.to_csv(batch_file_noID, index=False, header=False)\n", "sm_session.upload_data(batch_file_noID, key_prefix=\"{}/batch\".format(bucket_prefix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a SageMaker Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify the location of the pre-trained model stored in Amazon S3. This example uses a pre-trained XGBoost model name demo-xgboost-model.tar.gz. The full Amazon S3 URI is stored in a string variable model_url." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_s3_key = f\"{bucket_prefix}/model.tar.gz\"\n", "model_url = f\"s3://{s3_bucket}/{model_s3_key}\"\n", "print(f\"Uploading Model to {model_url}\")\n", "\n", "with open(\"model/model.tar.gz\", \"rb\") as model_file:\n", " boto_session.resource(\"s3\").Bucket(s3_bucket).Object(model_s3_key).upload_fileobj(model_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify a primary container. For the primary container, you specify the Docker image that contains inference code, artifacts (from prior training), and a custom environment map that the inference code uses when you deploy the model for predictions. In this example, we specify an XGBoost built-in algorithm container image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris\n", "\n", "# Specify an AWS container image and region as desired\n", "container = image_uris.retrieve(region=region, framework=\"xgboost\", version=\"0.90-1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a SageMaker Model by specifying the name, the role (the ARN of the IAM role that Amazon SageMaker can assume to access model artifacts/ docker images for deployment), and the image_uri of the XGBoost built-in algorithm container image." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "\n", "model_name = resource_name.format(\"Model\", datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\"))\n", "\n", "model_predictor = Model(\n", " name=model_name,\n", " image_uri=container,\n", " model_data=model_url,\n", " role=sm_role,\n", " predictor_cls=Predictor,\n", ")\n", "model_name" ] }, { "cell_type": "markdown", "metadata": { "nbpresent": { "id": "397fb60a-c48b-453f-88ea-4d832b70c919" } }, "source": [ "---\n", "\n", "## Batch Transform\n", "\n", "In SageMaker Batch Transform, we introduced 3 new attributes - __input_filter__, __join_source__ and __output_filter__. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. Create a transform job with the default configurations\n", "Let's first skip these 3 new attributes and inspect the inference results. We'll use this as a baseline to compare against the results obtained with data processing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.transformer import Transformer\n", "\n", "transformed_output_path = f\"s3://{s3_bucket}/{bucket_prefix}/output\"\n", "\n", "sm_transformer = model_predictor.transformer(\n", "    instance_count=1,\n", "    instance_type=\"ml.m4.xlarge\",\n", "    max_payload=1,\n", "    max_concurrent_transforms=2,\n", "    output_path=transformed_output_path,\n", ")\n", "\n", "# Start the Batch Transform job, using the input data without the ID column\n", "input_location = \"s3://{}/{}/batch/{}\".format(s3_bucket, bucket_prefix, batch_file_noID)\n", "\n", "sm_transformer.transform(input_location, content_type=\"text/csv\", split_type=\"Line\")\n", "sm_transformer.wait()  # wait for the job to finish before inspecting its output\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's inspect the output of the Batch Transform job in S3. It should show the list of probabilities of tumors being malignant. First, define a small helper function that downloads the job output from S3 and loads it into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "\n", "def get_csv_output_from_s3(s3uri, batch_file):\n", "    file_name = \"{}.out\".format(batch_file)\n", "    match = re.match(\"s3://([^/]+)/(.*)\", \"{}/{}\".format(s3uri, file_name))\n", "    output_bucket, output_prefix = match.group(1), match.group(2)\n", "    s3.download_file(output_bucket, output_prefix, file_name)\n", "    return pd.read_csv(file_name, sep=\",\", header=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we used the input file without the ID column and did not configure a join, each output row contains only the predicted probability of malignancy for the corresponding input row." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)\n", "output_df.head(8)" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the License). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the license file accompanying this file. This file is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }