{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 3: Model Training\n", "**This notebook uses the feature set extracted by `module-2` to create a XGBoost based machine learning model for binary classification**\n", "\n", "**Note:** Please set kernel to `Python 3 (Data Science)` and select instance to `ml.t3.medium`\n", "\n", "---\n", "\n", "## Contents\n", "\n", "1. [Background](#Background)\n", "1. [Setup](#Setup)\n", "1. [Load transformed feature set](#Load-transformed-feature-set)\n", "1. [Split data](#Split-data)\n", "1. [Train a model using SageMaker built-in XgBoost algorithm](#Train-a-model-using-SageMaker-built-in-XgBoost-algorithm)\n", "1. [Real time inference using the deployed endpoint](#Real-time-inference-using-the-deployed-endpoint)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Background\n", "\n", "In this notebook, we demonstrate how to use the feature set derived in `Module-2` and create a machine learning model for predicting whether a customer will reorder a product or not based on historical records. Given the problem type is supervised binary classification, we will use a SageMaker built-in algorithm XGBoost to design this classifier. Once the model is trained, we will also deploy the trained model as a SageMaker endpoint for real-time inference. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.predictor import Predictor\n", "from sagemaker import get_execution_role\n", "import pandas as pd\n", "import numpy as np\n", "import sagemaker\n", "import logging\n", "import boto3\n", "import json\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logger = logging.getLogger('__name__')\n", "logger.setLevel(logging.DEBUG)\n", "logger.addHandler(logging.StreamHandler())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Essentials" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_execution_role = get_execution_role()\n", "logger.info(f'Role = {sagemaker_execution_role}')\n", "session = boto3.Session()\n", "sagemaker_session = sagemaker.Session()\n", "default_bucket = sagemaker_session.default_bucket()\n", "prefix = 'sagemaker-featurestore-workshop'\n", "s3 = session.resource('s3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load transformed feature set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('.././data/train/transformed.csv')\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Move column `is_redordered` to be the first column since our training algorithm `XGBoost` expects the target column to be the first column." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_column = df.pop('is_reordered')\n", "df.insert(0, 'is_reordered', first_column)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Split data\n", "\n", "We will shuffle the whole dataset first (df.sample(frac=1, random_state=123)) and then split our data set into the following parts:\n", "\n", "* 70% - train set,\n", "* 20% - validation set,\n", "* 10% - test set\n", "\n", "**Note:** In the code below, the first element denotes size for train (0.7 = 70%), second element denotes size for test (1-0.9 = 0.1 = 10%) and difference between the two denotes size for validation(1 - [0.7+0.1] = 0.2 = 20%)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df, validation_df, test_df = np.split(df.sample(frac=1, random_state=123), [int(.7*len(df)), int(.9*len(df))])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validation_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save split datasets to local" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.to_csv('../data/train/train.csv', index=False)\n", "validation_df.to_csv('../data/validation/validation.csv', index=False)\n", "test_df.to_csv('../data/test/test.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy datasets to S3 from local" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.Bucket(default_bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('.././data/train/train.csv')\n", "s3.Bucket(default_bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('.././data/validation/validation.csv')\n", "s3.Bucket(default_bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('.././data/test/test.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create Pointers to the uploaded files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_set_location = 's3://{}/{}/train/'.format(default_bucket, prefix)\n", "validation_set_location = 's3://{}/{}/validation/'.format(default_bucket, prefix)\n", "test_set_location = 's3://{}/{}/test/'.format(default_bucket, prefix)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_set_pointer = TrainingInput(s3_data=train_set_location, content_type='csv')\n", "validation_set_pointer = TrainingInput(s3_data=validation_set_location, content_type='csv')\n", "test_set_pointer = TrainingInput(s3_data=test_set_location, content_type='csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(json.dumps(train_set_pointer.__dict__, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train a model using SageMaker built-in XgBoost algorithm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container_uri = sagemaker.image_uris.retrieve(region=session.region_name, \n", " framework='xgboost', \n", " version='1.0-1', \n", " 
image_scope='training')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb = sagemaker.estimator.Estimator(image_uri=container_uri,\n", " role=sagemaker_execution_role, \n", " instance_count=2, \n", " instance_type='ml.m5.xlarge',\n", " output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),\n", " sagemaker_session=sagemaker_session,\n", " base_job_name='reorder-classifier')\n", "\n", "xgb.set_hyperparameters(objective='binary:logistic',\n", " num_round=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb.fit({'train': train_set_pointer, 'validation': validation_set_pointer})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Saving Training Job Information" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Saving training job information to be used in the ML lineage module\n", "training_job_info = xgb.latest_training_job.describe()\n", "if training_job_info != None :\n", " training_jobName = training_job_info[\"TrainingJobName\"]\n", " %store training_jobName" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Host the trained XGBoost model as a SageMaker Endpoint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** The deployment usually takes ~10 mins - good time to take a coffee break :)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb_predictor = xgb.deploy(initial_instance_count=2,\n", " instance_type='ml.m5.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Real time inference using the deployed endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_serializer = CSVSerializer()\n", "endpoint_name = xgb_predictor.endpoint_name\n", "%store endpoint_name\n", "predictor = Predictor(endpoint_name=endpoint_name, \n", " serializer=csv_serializer)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df = pd.read_csv('.././data/test/test.csv')\n", "record = test_df.sample(1)\n", "record" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = record.values[0]\n", "payload = X[1:]\n", "payload" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "predicted_class_prob = predictor.predict(payload).decode('utf-8')\n", "if float(predicted_class_prob) < 0.5:\n", " logger.info('Prediction (y) = Will not reorder')\n", "else:\n", " logger.info('Prediction (y) = Will reorder')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }