{ "cells": [ { "cell_type": "markdown", "source": [ "# Amazon SageMaker Workshop\n", "## _** Batch Transform Deployment**_\n", "\n", "---\n", "\n", "In this part of the workshop we will deploy our model created in the previous lab in an batch endpoint for asynchronous inferences to Predict Mobile Customer Departure.\n", "\n", "Batch transform uses the same mechanics as real-time hosting to generate predictions. However, unlike real-time hosted endpoints which have persistent hardware (instances stay running until you shut them down), batch transform clusters are torn down when the job completes.\n", "\n", "---\n", "\n", "## Contents\n", "\n", "1. [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html)\n", " * Set up a asynchronous endpoint to get predictions from your model\n", " \n", "---\n", "\n", "## Background\n", "\n", "In the previous labs [Modeling](../../2-Modeling/modeling.ipynb) and [Evaluation](../../3-Evaluation/evaluation.ipynb) we trained multiple models with multiple SageMaker training jobs and evaluated them .\n", "\n", "Let's import the libraries for this lab:\n" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "#Supress default INFO loggingd\n", "import logging\n", "logger = logging.getLogger()\n", "logger.setLevel(logging.ERROR)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import os\n", "import time\n", "import json\n", "import tarfile\n", "from time import strftime, gmtime\n", "\n", "import boto3\n", "import pandas as pd\n", "import numpy as np\n", "import pickle\n", "import xgboost\n", "\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.predictor import csv_serializer\n", "from sagemaker.s3 import S3Uploader, S3Downloader\n", "\n", "from sklearn import metrics" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "sess = boto3.Session()\n", "sm = sess.client('sagemaker')\n", "role = sagemaker.get_execution_role()" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "%store -r bucket\n", "%store -r prefix\n", "%store -r region\n", "%store -r docker_image_name\n", "%store -r framework_version\n", "%store -r s3uri_test" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "bucket, prefix, region, docker_image_name, framework_version, s3uri_test" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "---\n", "### - if you _**skipped**_ the lab `2-Modeling/` follow instructions:\n", "\n", " - **run this:**" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# # Uncomment if you have not done Lab 2-Modeling\n", "\n", "#from config.solution_lab2 import get_estimator_from_lab2\n", "#xgb = get_estimator_from_lab2(docker_image_name, framework_version)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "---\n", "### - if you _**have done**_ the lab `2-Modeling/` follow instructions:\n", "\n", " - **run this:**" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# # Uncomment if you've done Lab 2-Modeling\n", "\n", "#%store -r training_job_name\n", "#xgb = sagemaker.estimator.Estimator.attach(training_job_name)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "---\n", "## Batch Prediction" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Batch Transform 
{ "cell_type": "markdown", "source": [ "#### Download Test Dataset and Model" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "S3Downloader.download(xgb.model_data, \".\")\n", "S3Downloader.download(s3uri_test, \".\")" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "#### Visualizing Test Data" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "test_path = \"test.csv\"\n", "df = pd.read_csv(test_path, header=None)\n", "df" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "* `batch_input`: the dataset used for prediction (the test dataset). It must not contain the target column and must be stored in S3.\n", "* `batch_output`: the S3 path where the batch job writes its output." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "test_true_y = df.iloc[:, 0]       # target column (ground truth)\n", "test_data_batch = df.iloc[:, 1:]  # features only -- drop the target column\n", "test_data_batch.to_csv('test_batch.csv', header=False, index=False)\n", "test_data_batch" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "#### Upload to S3" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "# upload to S3\n", "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'batch/test_batch.csv')).upload_file('test_batch.csv')" ], "outputs": [], "metadata": {} },
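{ "cell_type": "markdown", "source": [ "Alternatively, the `S3Uploader` helper imported at the top of this notebook does the same upload in one line (a sketch; uncomment to use it in place of the `boto3` call above):" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "# # Equivalent upload with the sagemaker.s3 helper imported earlier;\n", "# # it returns the S3 URI of the uploaded object.\n", "\n", "# S3Uploader.upload('test_batch.csv', 's3://{}/{}/batch'.format(bucket, prefix))" ], "outputs": [], "metadata": {} },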
{ "cell_type": "code", "execution_count": null, "source": [ "s3_batch_input = 's3://{}/{}/batch/test_batch.csv'.format(bucket, prefix)    # test data used for prediction\n", "s3_batch_output = 's3://{}/{}/batch/batch-inference'.format(bucket, prefix)  # location of the batch output" ], "outputs": [], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "s3_batch_input, s3_batch_output" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "#### Load the Model with Pickle" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "model_path = \"model.tar.gz\"\n", "with tarfile.open(model_path) as tar:\n", "    tar.extractall(path=\".\")\n", "\n", "print(\"Loading xgboost model.\")\n", "with open(\"xgboost-model\", \"rb\") as f:\n", "    model = pickle.load(f)\n", "model" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "#### Testing the model locally on a random sample" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "print(\"Some random test data\")\n", "x = test_data_batch.sample(1)\n", "print(x)\n", "\n", "print(\"Performing predictions against test data.\")\n", "X_test = xgboost.DMatrix(x.values)\n", "predictions_probs = model.predict(X_test)\n", "predictions = predictions_probs.round()\n", "print(predictions)" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "### Create a Batch Job and Make Batch Predictions\n", "\n", "As we saw in the **2-Modeling** lab, we added custom inference logic to our script (the *input_fn* and *predict_fn* functions). So, simply by reusing our previous estimator, we can create a transformer from it and run batch inferences:" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "# create a transformer object from the trained model\n", "transformer = xgb.transformer(\n", "    instance_count=1,\n", "    instance_type='ml.m5.large',\n", "    output_path=s3_batch_output)\n", "\n", "# call the object's transform method to create a transform job\n", "transformer.transform(data=s3_batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')\n", "\n", "transformer.wait()" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "### Track Results in SageMaker Experiments\n", "If you open *Experiments and trials* again and select \"Unassigned trial components\", you should see that your SageMaker transform job executed successfully:\n", "\n", "![batch_transform_result.png](./media/batch_transform_result.png)" ], "metadata": {} },
{ "cell_type": "markdown", "source": [ "#### Download the Batch Results from S3" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "batch_output = 's3://{}/{}/batch/batch-inference/test_batch.csv.out'.format(bucket, prefix)\n", "S3Downloader.download(batch_output, \".\")" ], "outputs": [], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "batch_predictions = pd.read_csv('test_batch.csv.out', header=None)\n", "pred_y = np.round(batch_predictions)\n", "pred_y" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "## Evaluating Results\n", "\n", "The following code evaluates the job's output against the true labels to check the accuracy of our batch predictions." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "def get_score(y_true, y_pred):\n", "    f1 = metrics.f1_score(y_true, y_pred)\n", "    precision = metrics.precision_score(y_true, y_pred)\n", "    recall = metrics.recall_score(y_true, y_pred)\n", "    accuracy = metrics.accuracy_score(y_true, y_pred)\n", "    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()\n", "    return precision, recall, f1, accuracy, tn, fp, fn, tp" ], "outputs": [], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "from sklearn.metrics import classification_report\n", "\n", "# get scores\n", "temp_precision, temp_recall, temp_f1, temp_accuracy, tn, fp, fn, tp = get_score(test_true_y, pred_y)\n", "output = [temp_precision, temp_recall, temp_f1, temp_accuracy, tp, fp, tn, fn]\n", "output = pd.Series(output, index=['precision', 'recall', 'f1', 'accuracy', 'tp', 'fp', 'tn', 'fn'])\n", "print(output[['accuracy', 'tp', 'fp', 'tn', 'fn']])\n", "\n", "print(classification_report(test_true_y, pred_y))" ], "outputs": [], "metadata": {} }
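,
{ "cell_type": "markdown", "source": [ "Since the raw `test_batch.csv.out` file holds the model's predicted probabilities (we rounded them above to obtain class labels), we can also compute a threshold-independent metric such as ROC AUC. A minimal sketch, reusing the `batch_predictions` dataframe loaded above:" ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "# The .out file holds raw probabilities; rounding turned them into class labels.\n", "# ROC AUC scores the probabilities directly, independent of the 0.5 threshold.\n", "probs = batch_predictions[0]\n", "print('ROC AUC: {:.4f}'.format(metrics.roc_auc_score(test_true_y, probs)))" ], "outputs": [], "metadata": {} }
], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }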