{
"cells": [
{
"cell_type": "markdown",
"id": "def71601",
"metadata": {},
"source": [
"# Train a XGBoost regression model on Amazon SageMaker, host inference on a serverless function in AWS Lambda and optionally expose as an API with Amazon API Gateway\n",
"\n",
"[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed end-to-end Machine Learning (ML) service. With SageMaker, you have the option of using the built-in algorithms or you can bring your own algorithms and frameworks to train your models. After training, you can deploy the models in [one of two ways](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html) for inference - persistent endpoint or batch transform.\n",
"\n",
"With a persistent inference endpoint, you get a fully-managed real-time HTTPS endpoint hosted on either CPU or GPU based EC2 instances. It supports features like auto scaling, data capture, model monitoring and also provides cost-effective GPU support using [Amazon Elastic Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html). It also supports hosting multiple models using multi-model endpoints that provide A/B testing capability. You can monitor the endpoint using [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/). In addition to all these, you can use [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/) which provides a purpose-built, easy-to-use Continuous Integration and Continuous Delivery (CI/CD) service for Machine Learning.\n",
"\n",
"There are use cases where you may want to host the ML model on a real-time inference endpoint that is cost-effective and do not require all the capabilities provided by the SageMaker persistent inference endpoint. These may involve,\n",
"* simple models\n",
"* models whose sizes are lesser than 200 MB\n",
"* models that are invoked sparsely and do not need inference instances running all the time\n",
"* models that do not need to be re-trained and re-deployed frequently\n",
"* models that do not need GPUs for inference\n",
"\n",
"In these cases, you can take the trained ML model and host it as a serverless function on [AWS Lambda](https://aws.amazon.com/lambda/) and optionally expose it as an API by front-ending it with a HTTP/REST API hosted on [Amazon API Gateway](https://aws.amazon.com/api-gateway/). This will be cost-effective as compared to having inference instances running all the time and still provide a fully-managed and scalable solution.\n",
"\n",
"This notebook demonstrates this solution by using SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to train a regression model on the [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html). It loads the trained model as a Python3 [pickle](https://docs.python.org/3/library/pickle.html) object in a container to be hosted on an [AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) function. Finally, it provides instructions for exposing it as an API by front-ending it with a HTTP/REST API hosted on [Amazon API Gateway](https://aws.amazon.com/api-gateway/).\n",
"\n",
"**Warning:** The Python3 [pickle](https://docs.python.org/3/library/pickle.html) module is not secure. Only unpickle data you trust. Keep this in mind if you decide to get the trained ML model file from somewhere instead of building your own model.\n",
"\n",
"**Note:**\n",
"\n",
"* This notebook should only be run from within a SageMaker notebook instance as it references SageMaker native APIs. The underlying OS of the notebook instance can either be Amazon Linux v1 or v2.\n",
"* At the time of writing this notebook, the most relevant latest version of the Jupyter notebook kernel for this notebook was `conda_python3` and this came built-in with SageMaker notebooks.\n",
"* This notebook uses CPU based instances for training.\n",
"* If you already have a trained model that can be loaded as a Python3 [pickle](https://docs.python.org/3/library/pickle.html) object, then you can skip the training step in this notebook and directly upload the model file to S3 and update the code in this notebook's cells accordingly.\n",
"* Although you can host a Python3 function directly on AWS Lambda, choosing the container option to package the code and dependencies is the best fit for this use case as the ML model file along with its dependencies will easily exceed the maximum deployment package size of 50 MB for zipped or 250 MB for unzipped files.\n",
"* In this notebook, the ML model generated in the training step has not been tuned as that is not the intent of this demo.\n",
"* This notebook will create resources in the same AWS account and in the same region where this notebook is running.\n",
"* Users of this notebook require `root` access to install/update required software. This is set by default when you create the notebook. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-root-access.html).\n",
"\n",
"**Table of Contents:**\n",
"\n",
"1. [Complete prerequisites](#Complete%20prerequisites)\n",
"\n",
" 1. [Check and configure access to the Internet](#Check%20and%20configure%20access%20to%20the%20Internet)\n",
"\n",
" 2. [Check and upgrade required software versions](#Check%20and%20upgrade%20required%20software%20versions)\n",
" \n",
" 3. [Check and configure security permissions](#Check%20and%20configure%20security%20permissions)\n",
"\n",
" 4. [Organize imports](#Organize%20imports)\n",
" \n",
" 5. [Create common objects](#Create%20common%20objects)\n",
"\n",
"2. [Prepare the data](#Prepare%20the%20data)\n",
"\n",
" 1. [Create the local directories](#Create%20the%20local%20directories)\n",
" \n",
" 2. [Load the dataset and view the details](#Load%20the%20dataset%20and%20view%20the%20details)\n",
" \n",
" 3. [(Optional) Visualize the dataset](#(Optional)%20Visualize%20the%20dataset)\n",
" \n",
" 4. [Split the dataset into train, validate and test sets](#Split%20the%20dataset%20into%20train,%20validate%20and%20test%20sets)\n",
" \n",
" 5. [Standardize the datasets](#Standardize%20the%20datasets)\n",
" \n",
" 6. [Save the prepared datasets locally](#Save%20the%20prepared%20datasets%20locally)\n",
" \n",
" 7. [Upload the prepared datasets to S3](#Upload%20the%20prepared%20datasets%20to%20S3)\n",
"\n",
"3. [Perform training](#Perform%20training)\n",
"\n",
" 1. [Set the training parameters](#Set%20the%20training%20parameters)\n",
" \n",
" 2. [(Optional) Delete previous checkpoints](#(Optional)%20Delete%20previous%20checkpoints)\n",
" \n",
" 3. [Run the training job](#Run%20the%20training%20job)\n",
"\n",
"4. [Create and push the Docker container to an Amazon ECR repository](#Create%20and%20push%20the%20Docker%20container%20to%20an%20Amazon%20ECR%20repository)\n",
"\n",
" 1. [Retrieve the model pickle file](#Retrieve%20the%20model%20pickle%20file)\n",
" \n",
" 2. [(Optional) Test the model pickle file](#(Optional)%20Test%20the%20model%20pickle%20file)\n",
" \n",
" 3. [View the inference script](#View%20the%20inference%20script)\n",
" \n",
" 4. [Create the Dockerfile](#Create%20the%20Dockerfile)\n",
" \n",
" 5. [Create the container](#Create%20the%20container)\n",
" \n",
" 6. [Create the private repository in ECR](#Create%20the%20private%20repository%20in%20ECR)\n",
" \n",
" 7. [Push the container to ECR](#Push%20the%20container%20to%20ECR)\n",
"\n",
"5. [Create and test the AWS Lambda function](#Create%20and%20test%20the%20AWS%20Lambda%20function)\n",
" \n",
" 1. [Create the Lambda function](#Create%20the%20Lambda%20function)\n",
" \n",
" 2. [Test the Lambda function](#Test%20the%20Lambda%20function)\n",
" \n",
"6. [(Optional) Front-end the Lambda function with Amazon API Gateway](#(Optional)%20Front-end%20the%20Lambda%20function%20with%20Amazon%20API%20Gateway)\n",
"\n",
"7. [Cleanup](#Cleanup)\n"
]
},
{
"cell_type": "markdown",
"id": "85783b52",
"metadata": {},
"source": [
"## 1. Complete prerequisites \n",
"\n",
"Check and complete the prerequisites."
]
},
{
"cell_type": "markdown",
"id": "09e0c57c",
"metadata": {},
"source": [
"### A. Check and configure access to the Internet \n",
"\n",
"This notebook requires outbound access to the Internet to download the required software updates. You can either provide direct Internet access (default) or provide Internet access through a VPC. For more information on this, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html)."
]
},
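{
"cell_type": "markdown",
"id": "a1b2c301",
"metadata": {},
"source": [
"(Optional) The following code cell is a minimal connectivity check and is an added sketch, not a required prerequisite: it attempts an HTTPS request to `https://aws.amazon.com` to confirm outbound Internet access. You can skip it if you have already verified access through your direct Internet or VPC configuration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c302",
"metadata": {},
"outputs": [],
"source": [
"# Optional connectivity check (a minimal sketch): try to reach a well-known HTTPS endpoint.\n",
"# If this fails, review the notebook instance's network configuration before proceeding.\n",
"import urllib.request\n",
"\n",
"try:\n",
"    with urllib.request.urlopen('https://aws.amazon.com', timeout=10) as response:\n",
"        print('Outbound Internet access looks OK (HTTP status {}).'.format(response.status))\n",
"except Exception as e:\n",
"    print('Could not reach the Internet :: {}'.format(e))"
]
},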
{
"cell_type": "markdown",
"id": "07d02b93",
"metadata": {},
"source": [
"### B. Check and upgrade required software versions \n",
"\n",
"This notebook requires:\n",
"* [SageMaker Python SDK version 2.x](https://sagemaker.readthedocs.io/en/stable/v2.html)\n",
"* [Python 3.6.x](https://www.python.org/downloads/release/python-360/)\n",
"* [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)\n",
"* [AWS Command Line Interface](https://aws.amazon.com/cli/)\n",
"* [Docker](https://www.docker.com/)\n",
"* [XGBoost Python module](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)"
]
},
{
"cell_type": "markdown",
"id": "e78177ac",
"metadata": {},
"source": [
"Capture the version of the OS on which this notebook is running."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "041b8d52",
"metadata": {},
"outputs": [],
"source": [
"import subprocess\n",
"from subprocess import Popen\n",
"\n",
"p = Popen(['cat','/etc/system-release'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)\n",
"os_cmd_output, os_cmd_error = p.communicate()\n",
"if len(os_cmd_error) > 0:\n",
" print('Notebook OS command returned error :: {}'.format(os_cmd_error))\n",
" os_version = ''\n",
"else:\n",
" if os_cmd_output.find('Amazon Linux release 2') >= 0:\n",
" os_version = 'ALv2'\n",
" elif os_cmd_output.find('Amazon Linux AMI release 2018.03') >= 0:\n",
" os_version = 'ALv1'\n",
" else:\n",
" os_version = ''\n",
"print('Notebook OS version : {}'.format(os_version))"
]
},
{
"cell_type": "markdown",
"id": "739469e9",
"metadata": {},
"source": [
"**Note:** When running the following cell, if you get 'module not found' errors, then uncomment the appropriate installation commands and install the modules. Also, uncomment and run the kernel shutdown command. When the kernel comes back, comment out the installation and kernel shutdown commands and run the following cell. Now, you should not see any errors."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bc9f642",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"\n",
"Last tested versions:\n",
"\n",
"\n",
"On Amazon Linux v1 (ALv1) notebook:\n",
"-----------------------------------\n",
"SageMaker Python SDK version : 2.54.0\n",
"Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) \n",
"[GCC 9.3.0]\n",
"Boto3 version : 1.18.27\n",
"XGBoost Python module version : 1.4.2\n",
"AWS CLI version : aws-cli/1.20.21 Python/3.6.13 Linux/4.14.238-125.422.amzn1.x86_64 botocore/1.21.27\n",
"Docker version : 19.03.13-ce, build 4484c46\n",
"\n",
"\n",
"On Amazon Linux v2 (ALv2) notebook:\n",
"-----------------------------------\n",
"SageMaker Python SDK version : 2.59.1\n",
"Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) \n",
"[GCC 9.3.0]\n",
"Boto3 version : 1.18.36\n",
"XGBoost Python module version : 1.4.2\n",
"AWS CLI version : aws-cli/1.20.24 Python/3.6.13 Linux/4.14.243-185.433.amzn2.x86_64 botocore/1.21.36\n",
"Docker version : 20.10.7, build f0df350\n",
"Amazon ECR Docker Credential Helper : 0.6.3\n",
"\n",
"\"\"\"\n",
"\n",
"import boto3\n",
"import IPython\n",
"import sagemaker\n",
"import sys\n",
"try:\n",
" import xgboost as xgb\n",
"except ModuleNotFoundError:\n",
" # Install XGBoost and restart kernel\n",
" print('Installing XGBoost module...')\n",
" !{sys.executable} -m pip install -U xgboost\n",
" IPython.Application.instance().kernel.do_shutdown(True)\n",
"\n",
"\n",
"# Install/upgrade the Sagemaker SDK, Boto3 and XGBoost and restart kernel\n",
"#!{sys.executable} -m pip install -U sagemaker boto3 xgboost\n",
"#IPython.Application.instance().kernel.do_shutdown(True)\n",
"\n",
"# Get the current installed version of Sagemaker SDK, Python, Boto3 and XGBoost\n",
"print('SageMaker Python SDK version : {}'.format(sagemaker.__version__))\n",
"print('Python version : {}'.format(sys.version))\n",
"print('Boto3 version : {}'.format(boto3.__version__))\n",
"print('XGBoost Python module version : {}'.format(xgb.__version__))\n",
"\n",
"# Get the AWS CLI version\n",
"print('AWS CLI version : ')\n",
"!aws --version"
]
},
{
"cell_type": "markdown",
"id": "10857e2a",
"metadata": {},
"source": [
"Docker should be pre-installed in the SageMaker notebook instance. Verify it by running the `docker --version` command. If Docker is not installed, you can install it by uncommenting the install command in the following cell. You will require `sudo` rights to install."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f715b13a",
"metadata": {},
"outputs": [],
"source": [
"# Verify if docker is installed\n",
"!docker --version\n",
"\n",
"# Install docker\n",
"#!sudo yum --assumeyes install docker"
]
},
{
"cell_type": "markdown",
"id": "f1fd8d86",
"metadata": {},
"source": [
"**Additional prerequisite (when notebook is running on Amazon Linux v2):**\n",
"\n",
"Install and configure the [Amazon ECR credential helper](https://github.com/awslabs/amazon-ecr-credential-helper). This makes it easier to store and use Docker credentials for use with Amazon ECR private registries."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f499be84",
"metadata": {},
"outputs": [],
"source": [
"if os_version == 'ALv2':\n",
" # Install\n",
" !sudo yum --assumeyes install amazon-ecr-credential-helper\n",
" # Verify installation\n",
" print('Amazon ECR Docker Credential Helper version : ')\n",
" !docker-credential-ecr-login version\n",
" # Create the .docker directory if it doesn't exist\n",
" !mkdir -p ~/.docker\n",
" # Configure\n",
" !printf \"{\\\\n\\\\t\\\"credsStore\\\": \\\"ecr-login\\\"\\\\n}\" > ~/.docker/config.json\n",
" # Verify configuration\n",
" !cat ~/.docker/config.json"
]
},
{
"cell_type": "markdown",
"id": "3ea2e09f",
"metadata": {},
"source": [
"### C. Check and configure security permissions \n",
"\n",
"Users of this notebook require `root` access to install/update required software. This is set by default when you create the notebook. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-root-access.html).\n",
"\n",
"This notebook uses the IAM role attached to the underlying notebook instance. This role should have the following permissions,\n",
"\n",
"1. Full access to the S3 bucket that will be used to store training and output data.\n",
"2. Full access to launch training instances.\n",
"3. Access to write to CloudWatch logs and metrics.\n",
"4. Access to create and write to Amazon ECR private registries.\n",
"5. Access to create and invoke AWS Lambda functions.\n",
"\n",
"To view the name of this role, run the following cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b58d52a",
"metadata": {},
"outputs": [],
"source": [
"print(sagemaker.get_execution_role())"
]
},
{
"cell_type": "markdown",
"id": "e4c4caf5",
"metadata": {},
"source": [
"This notebook creates an AWS Lambda function for hosting the ML model. This function requires an IAM role that it assumes when it is invoked. For more information on this, refer [here](https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html).\n",
"\n",
"For the function created in this notebook, at a minimum, this role should provide access to write to CloudWatch logs and metrics."
]
},
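{
"cell_type": "markdown",
"id": "b2c3d401",
"metadata": {},
"source": [
"If you do not already have such a role, the following cell is a minimal sketch of creating one using the AWS managed `AWSLambdaBasicExecutionRole` policy, which grants the CloudWatch Logs access mentioned above. The role name used here is an assumed placeholder and the API calls are commented out by default; the IAM role attached to this notebook would also need permissions to create IAM roles for them to succeed. Paste the resulting role ARN into the `lambda_iam_role` variable defined later in this notebook, or create the role in the IAM console instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d402",
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch (commented out by default) of creating a Lambda execution role\n",
"# with basic CloudWatch Logs permissions. The role name below is an assumed placeholder.\n",
"import json\n",
"import boto3\n",
"\n",
"iam_client = boto3.client('iam')\n",
"lambda_role_name = 'lambda-sm-xgboost-ca-housing-inference-role'\n",
"\n",
"# Trust policy that lets the AWS Lambda service assume the role\n",
"lambda_trust_policy = {\n",
"    'Version': '2012-10-17',\n",
"    'Statement': [{\n",
"        'Effect': 'Allow',\n",
"        'Principal': {'Service': 'lambda.amazonaws.com'},\n",
"        'Action': 'sts:AssumeRole'\n",
"    }]\n",
"}\n",
"\n",
"# Create the role and attach the AWS managed basic execution policy\n",
"#create_role_response = iam_client.create_role(\n",
"#    RoleName=lambda_role_name,\n",
"#    AssumeRolePolicyDocument=json.dumps(lambda_trust_policy))\n",
"#iam_client.attach_role_policy(\n",
"#    RoleName=lambda_role_name,\n",
"#    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole')\n",
"\n",
"# Use this ARN as the value of the `lambda_iam_role` variable defined later\n",
"#print(create_role_response['Role']['Arn'])"
]
},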
{
"cell_type": "markdown",
"id": "2040df05",
"metadata": {},
"source": [
"### D. Organize imports \n",
"\n",
"Organize all the library and module imports for later use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3895de10",
"metadata": {},
"outputs": [],
"source": [
"from io import StringIO\n",
"import json\n",
"import logging\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import os\n",
"import pickle\n",
"import pandas as pd\n",
"from sagemaker.inputs import TrainingInput\n",
"import seaborn as sns\n",
"import sklearn.model_selection\n",
"from sklearn.preprocessing import StandardScaler\n",
"import tarfile\n",
"import time"
]
},
{
"cell_type": "markdown",
"id": "98ff5baa",
"metadata": {},
"source": [
"### E. Create common objects \n",
"\n",
"Create common objects to be used in future steps in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40d38bbb",
"metadata": {},
"outputs": [],
"source": [
"# Specify the S3 bucket name\n",
"s3_bucket = ''\n",
"\n",
"# Create the S3 Boto3 resource\n",
"s3_resource = boto3.resource('s3')\n",
"s3_bucket_resource = s3_resource.Bucket(s3_bucket)\n",
"\n",
"# Create the SageMaker Boto3 client\n",
"sm_client = boto3.client('sagemaker')\n",
"\n",
"# Create the ECR client\n",
"ecr_client = boto3.client('ecr')\n",
"\n",
"# Create the AWS Lambda client\n",
"lambda_client = boto3.client('lambda')\n",
"\n",
"# Get the AWS region name\n",
"region_name = sagemaker.Session().boto_region_name\n",
"\n",
"# Base name to be used to create resources\n",
"nb_name = 'sm-xgboost-ca-housing-lambda-model-hosting'\n",
"\n",
"# Names of various resources\n",
"train_job_name = 'train-{}'.format(nb_name)\n",
"\n",
"# Names of local sub-directories in the notebook file system\n",
"data_dir = os.path.join(os.getcwd(), 'data/{}'.format(nb_name))\n",
"train_dir = os.path.join(os.getcwd(), 'data/{}/train'.format(nb_name))\n",
"val_dir = os.path.join(os.getcwd(), 'data/{}/validate'.format(nb_name))\n",
"test_dir = os.path.join(os.getcwd(), 'data/{}/test'.format(nb_name))\n",
"\n",
"# Location of the datasets file in the notebook file system\n",
"dataset_csv_file = os.path.join(os.getcwd(), 'datasets/california_housing.csv')\n",
"\n",
"# Container artifacts directory in the notebook file system\n",
"container_artifacts_dir = os.path.join(os.getcwd(), 'container-artifacts/{}'.format(nb_name))\n",
"\n",
"# Location of the AWS Lambda script (containing the inference code) in the notebook file system\n",
"lambda_script_file_name = 'lambda_sm_xgboost_ca_housing_inference.py'\n",
"lambda_script_file = os.path.join(os.getcwd(), 'scripts/{}'.format(lambda_script_file_name))\n",
"\n",
"# Sub-folder names in S3\n",
"train_dir_s3_prefix = '{}/data/train'.format(nb_name)\n",
"val_dir_s3_prefix = '{}/data/validate'.format(nb_name)\n",
"test_dir_s3_prefix = '{}/data/test'.format(nb_name)\n",
"\n",
"# Location in S3 where the model checkpoint will be stored\n",
"model_checkpoint_s3_path = 's3://{}/{}/checkpoint/'.format(s3_bucket, nb_name)\n",
"\n",
"# Location in S3 where the trained model will be stored\n",
"model_output_s3_path = 's3://{}/{}/output/'.format(s3_bucket, nb_name)\n",
"\n",
"# Names of the model tar file and extracted file - these are dependent on the\n",
"# framework and algorithm you used to train the model. This notebook uses\n",
"# SageMaker's built-in XGBoost algorithm and that will have the names as follows:\n",
"model_tar_file_name = 'model.tar.gz'\n",
"extracted_model_file_name = 'xgboost-model'\n",
"\n",
"# Container details\n",
"container_image_name = nb_name\n",
"container_registry_url_prefix = ''\n",
"\n",
"# Lambda function details\n",
"lambda_function_name = nb_name\n",
"lambda_iam_role = ''\n",
"lambda_timeout_in_seconds = 30\n",
"lambda_memory_size_in_mb = 1024"
]
},
{
"cell_type": "markdown",
"id": "24dda660",
"metadata": {},
"source": [
"## 2. Prepare the data \n",
"\n",
"The [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) consists of 20,640 observations on housing prices with 9 economic covariates. These covariates are,\n",
"\n",
"* MedianHouseValue\n",
"* MedianIncome\n",
"* HousingMedianAge\n",
"* TotalRooms\n",
"* TotalBedrooms\n",
"* Population\n",
"* Households\n",
"* Latitude\n",
"* Longitude\n",
"\n",
"This dataset has been downloaded to the local `datasets` directory and modified as a CSV file with the feature names in the first row. This will be used in this notebook.\n",
"\n",
"The following steps will help with preparing the datasets for training, validation and testing."
]
},
{
"cell_type": "markdown",
"id": "514dbdae",
"metadata": {},
"source": [
"### A) Create the local directories \n",
"\n",
"Create the directories in the local system where the dataset will be copied to and processed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd8aef9f",
"metadata": {},
"outputs": [],
"source": [
"# Create the local directories if they don't exist\n",
"os.makedirs(data_dir, exist_ok=True)\n",
"os.makedirs(train_dir, exist_ok=True)\n",
"os.makedirs(val_dir, exist_ok=True)\n",
"os.makedirs(test_dir, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"id": "e45766aa",
"metadata": {},
"source": [
"### B) Load the dataset and view the details \n",
"\n",
"Check if the CSV file exists in the `datasets` directory and load it into a Pandas DataFrame. Finally, print the details of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4fb60ebd",
"metadata": {},
"outputs": [],
"source": [
"# Check if the dataset file exists and proceed\n",
"if os.path.exists(dataset_csv_file):\n",
" print('Dataset CSV file \\'{}\\' exists.'.format(dataset_csv_file))\n",
" # Load the data into a Pandas DataFrame\n",
" pd_data_frame = pd.read_csv(dataset_csv_file)\n",
" # Print the first 5 records\n",
" #print(pd_data_frame.head(5))\n",
" # Describe the dataset\n",
" print(pd_data_frame.describe())\n",
"else:\n",
" print('Dataset CSV file \\'{}\\' does not exist.'.format(dataset_csv_file))"
]
},
{
"cell_type": "markdown",
"id": "7573baf3",
"metadata": {},
"source": [
"### C) (Optional) Visualize the dataset \n",
"\n",
"Display the distributions in the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88b67cc0",
"metadata": {},
"outputs": [],
"source": [
"# Print the correlation matrix\n",
"plt.figure(figsize=(11, 7))\n",
"sns.heatmap(cbar=False, annot=True, data=(pd_data_frame.corr() * 100), cmap='coolwarm')\n",
"plt.title('% Correlation Matrix')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "3f7ee243",
"metadata": {},
"source": [
"### D) Split the dataset into train, validate and test sets \n",
"\n",
"Split the dataset into train, validate and test sets after shuffling. Split further into x and y sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e592d882",
"metadata": {},
"outputs": [],
"source": [
"# Split into train and test datasets after shuffling\n",
"train, test = sklearn.model_selection.train_test_split(pd_data_frame, test_size=0.2,\n",
" random_state=35, shuffle=True)\n",
"# Split the train dataset further into train and validation datasets after shuffling\n",
"train, val = sklearn.model_selection.train_test_split(train, test_size=0.1,\n",
" random_state=25, shuffle=True)\n",
"\n",
"# Define functions to get x and y columns\n",
"def get_x(df):\n",
" return df[['median_income','housing_median_age','total_rooms','total_bedrooms',\n",
" 'population','households','latitude','longitude']]\n",
"def get_y(df):\n",
" return df[['median_house_value']]\n",
"\n",
"# Load the x and y columns for train, validation and test datasets\n",
"x_train = get_x(train)\n",
"y_train = get_y(train)\n",
"x_val = get_x(val)\n",
"y_val = get_y(val)\n",
"x_test = get_x(test)\n",
"y_test = get_y(test)\n",
"\n",
"# Summarize the datasets\n",
"print(\"x_train shape:\", x_train.shape)\n",
"print(\"y_train shape:\", y_train.shape)\n",
"print(\"x_val shape:\", x_val.shape)\n",
"print(\"y_val shape:\", y_val.shape)\n",
"print(\"x_test shape:\", x_test.shape)\n",
"print(\"y_test shape:\", y_test.shape)"
]
},
{
"cell_type": "markdown",
"id": "258e7269",
"metadata": {},
"source": [
"### E) Standardize the datasets \n",
"\n",
"* Standardize the x columns of the train dataset using the `fit_transform()` function of `StandardScaler`.\n",
"* Standardize the x columns of the validate and test datasets using the `transform()` function of `StandardScaler`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d73985c",
"metadata": {},
"outputs": [],
"source": [
"# Standardize the dataset\n",
"scaler = StandardScaler()\n",
"x_train = scaler.fit_transform(x_train)\n",
"x_val = scaler.transform(x_val)\n",
"x_test = scaler.transform(x_test)"
]
},
{
"cell_type": "markdown",
"id": "8990503c",
"metadata": {},
"source": [
"### F) Save the prepared datasets locally \n",
"\n",
"Save the prepared train, validate and test datasets to local directories. Prior to saving, concatenate x and y columns as needed. Create the directories if they don't exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8db4c579",
"metadata": {},
"outputs": [],
"source": [
"# Save the prepared dataset (in numpy format) to the local directories as csv files\n",
"\n",
"np.savetxt(os.path.join(train_dir, 'train.csv'),\n",
" np.concatenate((y_train.to_numpy(), x_train), axis=1), delimiter=',')\n",
"np.savetxt(os.path.join(train_dir, 'train_x.csv'), x_train)\n",
"np.savetxt(os.path.join(train_dir, 'train_y.csv'), y_train.to_numpy())\n",
"\n",
"np.savetxt(os.path.join(val_dir, 'validate.csv'),\n",
" np.concatenate((y_val.to_numpy(), x_val), axis=1), delimiter=',')\n",
"np.savetxt(os.path.join(val_dir, 'validate_x.csv'), x_val)\n",
"np.savetxt(os.path.join(val_dir, 'validate_y.csv'), y_val.to_numpy())\n",
"\n",
"np.savetxt(os.path.join(test_dir, 'test.csv'),\n",
" np.concatenate((y_test.to_numpy(), x_test), axis=1), delimiter=',')\n",
"np.savetxt(os.path.join(test_dir, 'test_x.csv'), x_test)\n",
"np.savetxt(os.path.join(test_dir, 'test_y.csv'), y_test.to_numpy())"
]
},
{
"cell_type": "markdown",
"id": "af049f67",
"metadata": {},
"source": [
"### G) Upload the prepared datasets to S3 \n",
"\n",
"Upload the datasets from the local directories to appropriate sub-directories in the specified S3 bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61ee8d07",
"metadata": {},
"outputs": [],
"source": [
"# Upload the data to S3\n",
"train_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/train/'.format(nb_name),\n",
" bucket=s3_bucket,\n",
" key_prefix=train_dir_s3_prefix)\n",
"val_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/validate/'.format(nb_name),\n",
" bucket=s3_bucket,\n",
" key_prefix=val_dir_s3_prefix)\n",
"test_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/test/'.format(nb_name),\n",
" bucket=s3_bucket,\n",
" key_prefix=test_dir_s3_prefix)\n",
"\n",
"# Capture the S3 locations of the uploaded datasets\n",
"train_s3_path = '{}/train.csv'.format(train_dir_s3_path)\n",
"train_x_s3_path = '{}/train_x.csv'.format(train_dir_s3_path)\n",
"train_y_s3_path = '{}/train_y.csv'.format(train_dir_s3_path)\n",
"val_s3_path = '{}/validate.csv'.format(val_dir_s3_path)\n",
"val_x_s3_path = '{}/validate_x.csv'.format(val_dir_s3_path)\n",
"val_y_s3_path = '{}/validate_y.csv'.format(val_dir_s3_path)\n",
"test_s3_path = '{}/test.csv'.format(test_dir_s3_path)\n",
"test_x_s3_path = '{}/test_x.csv'.format(test_dir_s3_path)\n",
"test_y_s3_path = '{}/test_y.csv'.format(test_dir_s3_path)"
]
},
{
"cell_type": "markdown",
"id": "6c9b2f18",
"metadata": {},
"source": [
"## 3. Perform training \n",
"\n",
"In this step, SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) is used to train a regression model on the [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).\n",
"\n",
"Note: This model has not been tuned as that is not the intent of this demo."
]
},
{
"cell_type": "markdown",
"id": "9933cd21",
"metadata": {},
"source": [
"### A) Set the training parameters \n",
"\n",
"1. Inputs - S3 location of the training and validation data.\n",
"2. Hyperparameters.\n",
"3. Training instance details:\n",
"\n",
" 1. Instance count\n",
" \n",
" 2. Instance type\n",
" \n",
" 3. The max run time of the training job\n",
" \n",
" 4. (Optional) Use Spot instances. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html).\n",
" \n",
" 5. (Optional) The max wait for Spot instances, if using Spot. This should be larger than the max run time.\n",
" \n",
"4. Base job name\n",
"5. Appropriate local and S3 directories that will be used by the training job."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "468f3997",
"metadata": {},
"outputs": [],
"source": [
"# Set the input data input along with their content types\n",
"train_input = TrainingInput(train_s3_path, content_type='text/csv')\n",
"val_input = TrainingInput(val_s3_path, content_type='text/csv')\n",
"inputs = {'train':train_input, 'validation':val_input}\n",
"\n",
"# Set the hyperparameters\n",
"hyperparameters = {\n",
" 'objective':'reg:squarederror',\n",
" 'max_depth':'7',\n",
" 'eta':'0.02',\n",
" 'alpha':'1.77',\n",
" 'colsample_bytree':'0.7',\n",
" 'num_round':'1864'}\n",
"\n",
"# Set the instance count, instance type, volume size, options to use Spot instances and other parameters\n",
"train_instance_count = 1\n",
"train_instance_type = 'ml.m5.xlarge'\n",
"train_instance_volume_size_in_gb = 5\n",
"#use_spot_instances = True\n",
"#spot_max_wait_time_in_seconds = 5400\n",
"use_spot_instances = False\n",
"spot_max_wait_time_in_seconds = None\n",
"max_run_time_in_seconds = 3600\n",
"algorithm_name = 'xgboost'\n",
"algorithm_version = '1.2-1'\n",
"py_version = 'py37'\n",
"# Get the container image URI for the specified parameters\n",
"container_image_uri = sagemaker.image_uris.retrieve(framework=algorithm_name,\n",
" region=region_name,\n",
" version=algorithm_version,\n",
" py_version=py_version,\n",
" instance_type=train_instance_type,\n",
" image_scope='training')\n",
"\n",
"# Set the training container related parameters\n",
"container_log_level = logging.INFO\n",
"\n",
"# Location where the model checkpoints will be stored locally in the container before being uploaded to S3\n",
"model_checkpoint_local_dir = '/opt/ml/checkpoints/'\n",
"\n",
"# Location where the trained model will be stored locally in the container before being uploaded to S3\n",
"model_local_dir = '/opt/ml/model'"
]
},
{
"cell_type": "markdown",
"id": "d337f98b",
"metadata": {},
"source": [
"### B) (Optional) Delete previous checkpoints \n",
"\n",
"If model checkpoints from previous trainings are found in the S3 checkpoint location specified in the previous step, then training will resume from those checkpoints. In order to start a fresh training, run the following code cell to delete all checkpoint objects from S3."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "25a7ed62",
"metadata": {},
"outputs": [],
"source": [
"# Delete the checkpoints if you want to train from the beginning; else ignore this code cell\n",
"for checkpoint_file in s3_bucket_resource.objects.filter(Prefix='{}/checkpoint/'.format(nb_name)):\n",
" checkpoint_file_key = checkpoint_file.key\n",
" print('Deleting {} ...'.format(checkpoint_file_key))\n",
" s3_resource.Object(s3_bucket_resource.name, checkpoint_file_key).delete()"
]
},
{
"cell_type": "markdown",
"id": "66fb1a2e",
"metadata": {},
"source": [
"### C) Run the training job \n",
"\n",
"Prepare the `estimator` and call the `fit()` method. This will pull the container containing the specified version of the algorithm in the AWS region and run the training job in the specified type of EC2 instance(s). The training data will be pulled from the specified location in S3 and training results and checkpoints will be written to the specified locations in S3.\n",
"\n",
"Note: SageMaker Debugger is disabled."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ca1310e",
"metadata": {},
"outputs": [],
"source": [
"# Create the estimator\n",
"estimator = sagemaker.estimator.Estimator(\n",
" image_uri=container_image_uri,\n",
" checkpoint_local_path=model_checkpoint_local_dir,\n",
" checkpoint_s3_uri=model_checkpoint_s3_path,\n",
" model_dir=model_local_dir,\n",
" output_path=model_output_s3_path,\n",
" instance_type=train_instance_type,\n",
" instance_count=train_instance_count,\n",
" use_spot_instances=use_spot_instances,\n",
" max_wait=spot_max_wait_time_in_seconds,\n",
" max_run=max_run_time_in_seconds,\n",
" hyperparameters=hyperparameters,\n",
" role=sagemaker.get_execution_role(),\n",
" base_job_name=train_job_name,\n",
" framework_version=algorithm_version,\n",
" py_version=py_version,\n",
" container_log_level=container_log_level,\n",
" script_mode=False,\n",
" debugger_hook_config=False,\n",
" disable_profiler=True)\n",
"\n",
"# Perform the training\n",
"estimator.fit(inputs, wait=True)"
]
},
{
"cell_type": "markdown",
"id": "704d2882",
"metadata": {},
"source": [
"## 4. Create and push the Docker container to an Amazon ECR repository \n",
"\n",
"In this step, we will create a Docker container containing the generated model along with its dependencies. If you bring a pre-trained model, you can upload it to S3 and use it to build the container. The following steps contains instructions for doing so."
]
},
{
"cell_type": "markdown",
"id": "af1a0c83",
"metadata": {},
"source": [
"### A) Retrieve the model pickle file \n",
"\n",
"* The model file generated using SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) will be a Python pickle file zipped up in a tar file named `model.tar.gz`. The S3 URI for this file will be available in the `model_data` attribute of the `estimator` object created in the training step.\n",
"\n",
"* If you bring your pre-trained model, you have to specify the S3 URI appropriately in the following cell.\n",
"\n",
"* The zip file needs to be downloaded from S3 and extracted.\n",
"\n",
"* The name of the extracted pickle file will depend on the framework and algorithm that was used to train the model. In this notebook example, we have used SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) and so the pickle file will be named `xgboost-model`. You will see this when the model tar file is extracted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f848081e",
"metadata": {},
"outputs": [],
"source": [
"# Create the container artifacts directory if it doesn't exist\n",
"os.makedirs(container_artifacts_dir, exist_ok=True)\n",
"\n",
"# Set the file paths\n",
"model_tar_file_s3_path_suffix = '{}/output/{}/output/{}'.format(nb_name,\n",
" estimator.latest_training_job.name,\n",
" model_tar_file_name)\n",
"model_tar_file_local_path = '{}/{}'.format(container_artifacts_dir, model_tar_file_name)\n",
"extracted_model_file_local_path = '{}/{}'.format(container_artifacts_dir, extracted_model_file_name)\n",
"\n",
"# Delete old model files if they exist\n",
"if os.path.exists(model_tar_file_local_path):\n",
" os.remove(model_tar_file_local_path)\n",
"if os.path.exists(extracted_model_file_local_path):\n",
" os.remove(extracted_model_file_local_path)\n",
"\n",
"# Download the model tar file from S3\n",
"s3_bucket_resource.download_file(model_tar_file_s3_path_suffix, model_tar_file_local_path)\n",
"\n",
"# Extract the model tar file and retrieve the model pickle file\n",
"with tarfile.open(model_tar_file_local_path, \"r:gz\") as tar:\n",
" tar.extractall(path=container_artifacts_dir)"
]
},
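{
"cell_type": "markdown",
"id": "c3d4e501",
"metadata": {},
"source": [
"(Optional) If you are bringing your own pre-trained model instead of the one trained above, the following cell is a minimal sketch of uploading a local model tar file to S3. The local path and S3 key below are hypothetical placeholders. After setting `model_tar_file_s3_path_suffix` to your own key, re-run the previous cell to download and extract your model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d4e502",
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch (commented out by default) for a bring-your-own model.\n",
"# The local file path and S3 key below are hypothetical placeholders - replace\n",
"# them with the location of your own model tar file before running.\n",
"\n",
"#byo_model_local_path = '/home/ec2-user/SageMaker/my-model/model.tar.gz'\n",
"#byo_model_s3_key = '{}/output/byo-model/output/{}'.format(nb_name, model_tar_file_name)\n",
"\n",
"# Upload the model tar file to S3 and point the previous cell at it\n",
"#s3_bucket_resource.upload_file(byo_model_local_path, byo_model_s3_key)\n",
"#model_tar_file_s3_path_suffix = byo_model_s3_key"
]
},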
{
"cell_type": "markdown",
"id": "dfd51b69",
"metadata": {},
"source": [
"### B) (Optional) Test the model pickle file \n",
"\n",
"The code in the following cell entirely depends on the framework and algorithm that was used to train the model. The extracted Python3 pickle file will contain the appropriate object name. If you are bringing your own model file, you have to change this cell appropriately."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8427f6b4",
"metadata": {},
"outputs": [],
"source": [
"# Load the model pickle file as a pickle object\n",
"pickle_file_path = extracted_model_file_local_path\n",
"with open(pickle_file_path, 'rb') as pkl_file:\n",
" model = pickle.load(pkl_file)\n",
"\n",
"# Run a prediction against the model loaded as a pickle object\n",
"# by sending the first record of the test dataset\n",
"test_pred_x_df = pd.read_csv(StringIO(','.join(map(str, x_test[0]))), sep=',', header=None)\n",
"test_pred_x = xgb.DMatrix(test_pred_x_df.values)\n",
"print('Input for prediction = {}'.format(test_pred_x_df.values))\n",
"print('Predicted value = {}'.format(model.predict(test_pred_x)[0]))\n",
"print('Actual value = {}'.format(y_test.values[0][0]))\n",
"print('Note: There may be a huge difference between the actual and predicted values as the model has not been tuned in the training step.')"
]
},
{
"cell_type": "markdown",
"id": "25dc92bf",
"metadata": {},
"source": [
"### C) View the inference script \n",
"\n",
"The inference script is a Python3 script that implements the `handler` function required by Lambda and contains the following logic:\n",
"* Load the ML model pickle object into memory.\n",
"* Parse the request sent to the Lambda function either from direct invocation or from a REST/HTTP API in Amazon API Gateway.\n",
"* Run the prediction.\n",
"* Format the response to match with the parameter specified in the request.\n",
"* Return the response.\n",
"\n",
"The request should be in the following format:\n",
"\n",
"`{\n",
" \"response_content_type\": \"\",\n",
" \"pred_x_csv\": \"\"\n",
"}`\n",
"\n",
"This script will be packaged into the container that will be built in the upcoming steps.\n",
"\n",
"You can view the script by running the following code cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "097c4d0f",
"metadata": {},
"outputs": [],
"source": [
"# View the inference script\n",
"!cat {lambda_script_file}"
]
},
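{
"cell_type": "markdown",
"id": "d4e5f601",
"metadata": {},
"source": [
"For reference, the following is a minimal, illustrative sketch of what a handler implementing the contract described above could look like. It is not the actual script (run the cell above to see that); error handling, base64-encoded API Gateway payloads and the exact response format are omitted or assumed here.\n",
"\n",
"```python\n",
"import json\n",
"import os\n",
"import pickle\n",
"from io import StringIO\n",
"\n",
"import pandas as pd\n",
"import xgboost as xgb\n",
"\n",
"# Load the model pickle object once, outside the handler, so it is reused across invocations\n",
"with open(os.environ['MODEL_PICKLE_FILE_PATH'], 'rb') as pkl_file:\n",
"    model = pickle.load(pkl_file)\n",
"\n",
"def handler(event, context):\n",
"    # API Gateway (REST or HTTP API) passes the payload as a JSON string in 'body';\n",
"    # on direct invocation, the event itself is the payload\n",
"    request = json.loads(event['body']) if 'body' in event else event\n",
"    response_content_type = request['response_content_type']\n",
"    # Parse the comma-separated feature values and run the prediction\n",
"    pred_x_df = pd.read_csv(StringIO(request['pred_x_csv']), sep=',', header=None)\n",
"    prediction = float(model.predict(xgb.DMatrix(pred_x_df.values))[0])\n",
"    # Format the response to match the requested content type\n",
"    if response_content_type == 'application/json':\n",
"        return {'statusCode': 200, 'body': json.dumps({'predicted_value': prediction})}\n",
"    return {'statusCode': 200, 'body': str(prediction)}\n",
"```"
]
},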
{
"cell_type": "markdown",
"id": "75de0e2b",
"metadata": {},
"source": [
"### D) Create the Dockerfile \n",
"\n",
"In this step, we will create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) which is required to build our [Docker](https://www.docker.com/) container containing the model pickle file, an inference script and its dependencies.\n",
"\n",
"In order to create the container, we will use the [AWS Lambda Python 3.9 container image](https://gallery.ecr.aws/lambda/python) available in the [Amazon ECR public registry](https://aws.amazon.com/ecr/) as the base image. As this is a public registry, you do not require any credentials or permissions to download it.\n",
"\n",
"Note: At the time of writing this notebook, this image was based on [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02715cc2",
"metadata": {},
"outputs": [],
"source": [
"# Copy the inference script to the container-artifacts directory\n",
"!cp -pr {lambda_script_file} {container_artifacts_dir}/app.py\n",
"\n",
"# Create the Dockerfile content\n",
"dockerfile_content_lines = []\n",
"dockerfile_content_lines.append('# syntax=docker/dockerfile:1\\n\\n')\n",
"dockerfile_content_lines.append('# Use AWS Lambda Python 3.9 as the base image\\n')\n",
"dockerfile_content_lines.append('FROM public.ecr.aws/lambda/python:3.9\\n\\n')\n",
"dockerfile_content_lines.append('# Install the Python packages required for the inference script\\n')\n",
"dockerfile_content_lines.append('RUN pip install --upgrade pip\\n')\n",
"dockerfile_content_lines.append('RUN pip install pandas\\n')\n",
"dockerfile_content_lines.append('RUN pip install xgboost\\n\\n')\n",
"dockerfile_content_lines.append('# Copy the extracted model file and the inference script\\n')\n",
"dockerfile_content_lines.append('COPY ')\n",
"dockerfile_content_lines.append(extracted_model_file_name)\n",
"dockerfile_content_lines.append(' ./\\n')\n",
"dockerfile_content_lines.append('COPY app.py ./\\n\\n')\n",
"dockerfile_content_lines.append('# Specify the path to the extracted model file as an ENV variable\\n')\n",
"dockerfile_content_lines.append('ENV MODEL_PICKLE_FILE_PATH=')\n",
"dockerfile_content_lines.append(extracted_model_file_name)\n",
"dockerfile_content_lines.append('\\n\\n')\n",
"dockerfile_content_lines.append('# Specify the default command to run\\n')\n",
"dockerfile_content_lines.append('CMD [\"app.handler\"]')\n",
"\n",
"# Create the Dockerfile\n",
"dockerfile_local_path = '{}/Dockerfile'.format(container_artifacts_dir)\n",
"with open(dockerfile_local_path, 'wt') as file:\n",
" file.write(''.join(dockerfile_content_lines))\n",
" \n",
"# Print the contents of the generated Dockerfile\n",
"!cat {dockerfile_local_path}"
]
},
{
"cell_type": "markdown",
"id": "22c5abdb",
"metadata": {},
"source": [
"### E) Create the container \n",
"\n",
"Create the Docker container using the `docker build` command. Specify the container image name and point to the container-artifacts directory that contains all the files to build the container.\n",
"\n",
"Note: You may see warning messages when the container is built with the Dockerfile that we created in the prior step. These warnings will be around installing the Python packages that are required by the inference script. You can choose to either ignore or fix them."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71231545",
"metadata": {},
"outputs": [],
"source": [
"# Create the Docker container\n",
"!docker build -t {container_image_name} {container_artifacts_dir}"
]
},
{
"cell_type": "markdown",
"id": "cfeb2a56",
"metadata": {},
"source": [
"### F) Create the private repository in ECR \n",
"\n",
"In order to create an AWS Lambda function using a container, the container image should exist in [Amazon ECR](https://aws.amazon.com/ecr/). We will create a private repository in Amazon ECR for this demo.\n",
"\n",
"In this step, we will check if the private repository in Amazon ECR that we intend to create already exists or not. If it does not exist, we will create it with the repository name the same as the container image name.\n",
"\n",
"Note: When creating the repository, setting the `scanOnPush` parameter to `True` will automatically initiate a vulnerability scan on the container image that is pushed to the repository. For more info on image scanning, refer [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2052a1f",
"metadata": {},
"outputs": [],
"source": [
"# Check if the ECR repository exists already; if not, then create it\n",
"try:\n",
" ecr_client.describe_repositories(repositoryNames=[container_image_name])\n",
" print('ECR repository {} already exists.'.format(container_image_name))\n",
"except ecr_client.exceptions.RepositoryNotFoundException:\n",
" print('ECR repository {} does not exist.'.format(container_image_name))\n",
" print('Creating ECR repository {}...'.format(container_image_name))\n",
" # Create the ECR repository - here we use the container image name for the repository name\n",
" ecr_client.create_repository(repositoryName=container_image_name,\n",
" imageScanningConfiguration={\n",
" 'scanOnPush': True\n",
" })\n",
" print('Completed creating ECR repository {}.'.format(container_image_name))"
]
},
{
"cell_type": "markdown",
"id": "e992fecb",
"metadata": {},
"source": [
"### G) Push the container to ECR \n",
"\n",
"In this step, we will push the container to a private registry that we created in Amazon ECR.\n",
"\n",
"When using an Amazon ECR private registry, you must authenticate your Docker client to your private registry so that you can use the `docker push` and `docker pull` commands to push and pull images to and from the repositories in that registry. For more information about this, refer [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html).\n",
"\n",
"1. If this notebook instance is running on Amazon Linux v1, the authentication happens through an authorization token generated by an AWS CLI command in the following code cell. This token will be automatically deleted when the code cell completes execution.\n",
"2. If this notebook instance is running on Amazon Linux v2, the authentication happens through temporary credentials generated based on the IAM role attached to this notebook. For this, you have to complete the prerequisite mentioned in the first step of this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48521e6c",
"metadata": {},
"outputs": [],
"source": [
"# Set the image names\n",
"source_image_name = '{}:latest'.format(container_image_name)\n",
"target_image_name = '{}/{}:latest'.format(container_registry_url_prefix, container_image_name)\n",
"\n",
"if os_version == 'ALv1':\n",
" # Get the private registry credentials using an authorization token\n",
" !aws ecr get-login-password --region {region_name} | docker login --username AWS --password-stdin {container_registry_url_prefix}\n",
"\n",
"# Tag the container\n",
"!docker tag {source_image_name} {target_image_name}\n",
"\n",
"# Push the container to the specified registry in Amazon ECR\n",
"!docker push {target_image_name}\n",
"\n",
"if os_version == 'ALv1':\n",
" # Delete the Docker credentials file\n",
" print('\\nDeleting the generated Docker credentials file...')\n",
" !rm /home/ec2-user/.docker/config.json\n",
" print('Completed deleting the generated Docker credentials file.')\n",
" # Verify the delete\n",
" print('Verifying the delete of the generated Docker credentials file...')\n",
" !cat /home/ec2-user/.docker/config.json\n",
" print('Completed verifying the delete of the generated Docker credentials file.')"
]
},
{
"cell_type": "markdown",
"id": "8f87199a",
"metadata": {},
"source": [
"## 5. Create and test the AWS Lambda function \n",
"\n",
"In this step, we will create and test the [AWS Lambda](https://aws.amazon.com/lambda/) function using the Docker container that was created in the previous step."
]
},
{
"cell_type": "markdown",
"id": "f1b9d407",
"metadata": {},
"source": [
"### A) Create the Lambda function \n",
"\n",
"In this step, we will check if the Lambda function that we intend to create already exists or not. If it does not exist, we will create it.\n",
"\n",
"Note: We have not configured this function to use an [Amazon VPC](https://aws.amazon.com/vpc) for networking. If you require it, refer to the instructions [here](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa915d88",
"metadata": {},
"outputs": [],
"source": [
"# Check if the AWS Lambda function exists already; if not, then create it\n",
"try:\n",
" lambda_client.get_function(FunctionName=lambda_function_name)\n",
" print('AWS Lambda function {} already exists.'.format(lambda_function_name))\n",
"except lambda_client.exceptions.ResourceNotFoundException:\n",
" print('AWS Lambda function {} does not exist.'.format(lambda_function_name))\n",
" print('Creating AWS Lambda function {}...'.format(lambda_function_name))\n",
" lambda_client.create_function(\n",
" FunctionName=lambda_function_name,\n",
" Role=lambda_iam_role,\n",
" Code={'ImageUri' : target_image_name},\n",
" Description='California Housing price prediction regression model built on the SageMaker built-in XGBoost algorithm and a Python3 based inference function hosted inside a Docker container.',\n",
" Timeout=lambda_timeout_in_seconds,\n",
" MemorySize=lambda_memory_size_in_mb,\n",
" Publish=True,\n",
" PackageType='Image'\n",
" )\n",
" print('Completed creating AWS Lambda function {}. The function will be in \\'Pending\\' state immediately after creation. Wait for it to be ready before invoking it. This should take a few seconds.'.format(lambda_function_name))\n",
" \n",
"# Sleep every 5 seconds and print the state of the Lambda function until it is not 'Pending'\n",
"while True:\n",
" get_function_response = lambda_client.get_function(FunctionName=lambda_function_name)\n",
" function_state = get_function_response['Configuration']['State']\n",
" print('Lambda function state = {}'.format(function_state))\n",
" if function_state not in {'Pending'}:\n",
" break\n",
" time.sleep(5)"
]
},
{
"cell_type": "markdown",
"id": "ea000644",
"metadata": {},
"source": [
"### B) Test the Lambda function \n",
"\n",
"In this step, we will test the Lambda function that we created in the previous step by invoking it synchronously. For this, we will send the first record of the test dataset as a CSV string.\n",
"\n",
"The request should be in the following format:\n",
"\n",
"`{\n",
" \"response_content_type\": \"\",\n",
" \"pred_x_csv\": \"\"\n",
"}`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e79492fd",
"metadata": {},
"outputs": [],
"source": [
"# Set the payload\n",
"x_test_lambda_payload_csv = ','.join(map(str, x_test[0]))\n",
"lambda_payload = json.dumps({ 'response_content_type': 'text/plain', 'pred_x_csv': x_test_lambda_payload_csv})\n",
"\n",
"# Invoke the Lambda function and test it\n",
"lambda_invoke_response = lambda_client.invoke(\n",
" FunctionName=lambda_function_name,\n",
" InvocationType='RequestResponse',\n",
" LogType='Tail',\n",
" Payload=lambda_payload\n",
")\n",
"\n",
"# Print the response\n",
"try:\n",
" lambda_function_error = lambda_invoke_response['FunctionError']\n",
" print('Function error :: {}'.format(lambda_function_error))\n",
"except KeyError:\n",
" print('No function errors.')\n",
"print('Response status code = {}'.format(lambda_invoke_response['StatusCode']))\n",
"print('Payload :: {}'.format(lambda_invoke_response['Payload'].read()))"
]
},
{
"cell_type": "markdown",
"id": "0a367e84",
"metadata": {},
"source": [
"## 6. (Optional) Front-end the Lambda function with Amazon API Gateway \n",
"\n",
"For some use cases, you may prefer to front-end the Lambda function hosting the model with [Amazon API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html). With this setup, you can serve the model inference as an API with a HTTPS endpoint.\n",
"\n",
"For the API, you have the following options to choose from:\n",
"* [HTTP API](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api.html)\n",
"* [REST API](https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-rest-api.html)\n",
"\n",
"For guidance on choosing the right API option, refere [here](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-vs-rest.html).\n",
"\n",
"For information on setting up an AWS Lambda function as the backend for Amazon API Gateway, refer [here](https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-integrations.html).\n",
"\n",
"Note: The Lambda function that we created in prior steps has the logic to handle both REST and HTTP API requests from the Amazon API Gateway assuming the gateway passes through the request payload as-is to the backend Lambda function."
]
},
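{
"cell_type": "markdown",
"id": "e5f6a701",
"metadata": {},
"source": [
"The following optional cell is a minimal sketch, under a few assumptions, of creating an [HTTP API](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api.html) with the Lambda function as its backend using API Gateway's quick-create option. The API name and permission statement id are made-up placeholders, the IAM role attached to this notebook must additionally have permissions to manage API Gateway resources and Lambda permissions, and the API created here is not removed by the Cleanup step, so delete it with `apigw_client.delete_api(ApiId=api_id)` when you are done."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f6a702",
"metadata": {},
"outputs": [],
"source": [
"# An optional, minimal sketch of creating an HTTP API in Amazon API Gateway with the\n",
"# Lambda function created earlier as its backend, using the 'quick create' option.\n",
"# The API name and statement id below are assumed placeholders.\n",
"apigw_client = boto3.client('apigatewayv2')\n",
"sts_client = boto3.client('sts')\n",
"\n",
"# Look up the ARN of the Lambda function created earlier\n",
"lambda_function_arn = lambda_client.get_function(\n",
"    FunctionName=lambda_function_name)['Configuration']['FunctionArn']\n",
"\n",
"# Create the HTTP API with a default route that proxies requests to the Lambda function\n",
"create_api_response = apigw_client.create_api(\n",
"    Name='{}-http-api'.format(lambda_function_name),\n",
"    ProtocolType='HTTP',\n",
"    Target=lambda_function_arn)\n",
"api_id = create_api_response['ApiId']\n",
"\n",
"# Allow API Gateway to invoke the Lambda function\n",
"account_id = sts_client.get_caller_identity()['Account']\n",
"lambda_client.add_permission(\n",
"    FunctionName=lambda_function_name,\n",
"    StatementId='{}-apigw-invoke'.format(lambda_function_name),\n",
"    Action='lambda:InvokeFunction',\n",
"    Principal='apigateway.amazonaws.com',\n",
"    SourceArn='arn:aws:execute-api:{}:{}:{}/*'.format(region_name, account_id, api_id))\n",
"\n",
"# The endpoint accepts the same JSON request format used in the Lambda test above\n",
"print('API endpoint : {}'.format(create_api_response['ApiEndpoint']))\n",
"\n",
"# When you are done, delete the API to avoid unnecessary costs\n",
"#apigw_client.delete_api(ApiId=api_id)"
]
},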
{
"cell_type": "markdown",
"id": "e5a366d8",
"metadata": {},
"source": [
"## 7. Cleanup \n",
"\n",
"As a best practice, you should delete resources and S3 objects when no longer required. This will help you avoid incurring unncessary costs.\n",
"\n",
"This step will cleanup the resources and S3 objects created by this notebook.\n",
"\n",
"Note: Apart from these resources, there will be Docker containers and related images created in the notebook instance that is running this Jupyter notebook. As they are already part of the notebook instance, you do not need to delete them. If you decide to delete them, then go to the Terminal of the Jupyter notebook and and run appropriate `docker` commands."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a491118",
"metadata": {},
"outputs": [],
"source": [
"# Delete the AWS Lambda function\n",
"try:\n",
" lambda_client.delete_function(FunctionName=lambda_function_name)\n",
" print('AWS Lambda function {} deleted.'.format(lambda_function_name))\n",
"except lambda_client.exceptions.ResourceNotFoundException:\n",
" print('AWS Lambda function {} does not exist.'.format(lambda_function_name))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "82f1fb2e",
"metadata": {},
"outputs": [],
"source": [
"# Delete the ECR private repository\n",
"try:\n",
" ecr_client.delete_repository(repositoryName=container_image_name, force=True)\n",
" print('ECR repository {} deleted.'.format(container_image_name))\n",
"except ecr_client.exceptions.RepositoryNotFoundException:\n",
" print('ECR repository {} does not exist.'.format(container_image_name))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6e63946",
"metadata": {},
"outputs": [],
"source": [
"# Delete data from S3 bucket\n",
"for file in s3_bucket_resource.objects.filter(Prefix='{}/'.format(nb_name)):\n",
" file_key = file.key\n",
" print('Deleting {} ...'.format(file_key))\n",
" s3_resource.Object(s3_bucket_resource.name, file_key).delete()"
]
},
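{
"cell_type": "markdown",
"id": "f6a7b801",
"metadata": {},
"source": [
"(Optional) The following cell sketches the `docker` commands referred to in the note above for removing the container images that were built locally on this notebook instance. The lines are commented out by default; uncomment them to run the cleanup from the notebook instead of the Terminal."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6a7b802",
"metadata": {},
"outputs": [],
"source": [
"# (Optional) Remove the local Docker images created by this notebook from the\n",
"# notebook instance. Uncomment the lines below to run them.\n",
"\n",
"# Remove the tagged and locally built images\n",
"#!docker rmi {target_image_name}\n",
"#!docker rmi {container_image_name}\n",
"\n",
"# Remove any dangling images left behind by the build\n",
"#!docker image prune --force"
]
},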
{
"cell_type": "code",
"execution_count": null,
"id": "2e45f00b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}