{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bring Your Own Model with Amazon Redshift ML\n", "_**On a Redshift Cluster, Deploy and Run Predictions on a Model Trained and Validated on Amazon SageMaker**_\n", "\n", "---\n", "\n", "---\n", "## Contents\n", "1. [Introduction](#Introduction)\n", "2. [Setup](#Setup)\n", "3. [Training the XGBOOST Model on Sagemaker](#Training-the-XGBOOST-Model-on-Sagemaker) \n", "4. [Create Model and run predictions on Redshift](#Create-Model-and-run-predictions-on-Redshift)\n", " 1. [Configure Redshift Data Api](#Configure-Redshift-Data-Api)\n", " 2. [Create BYOM Model](#Create-BYOM-Model)\n", " 3. [Data Preparation](#Data-preparation)\n", " 4. [Inference](#Inference)\n", " 5. [Evaluation](#Evaluation)\n", "5. [Cleanup](#Cleanup)\n", "\n", "---\n", "## Introduction\n", "\n", "This notebook is a working example of \"bring your own model\" (BYOM) with Amazon Redshift ML. BYOM uses a pre-trained model that was trained and validated on Amazon SageMaker. In this use case, the model that Redshift ML BYOM uses is built with the XGBoost algorithm for a regression problem type, which is used to predict a numerical outcome, such as the price of a house, how many people will use a city’s bike rental service, or the age of an abalone.\n", "\n", "This notebook demonstrates how to train and validate an XGBoost regression machine learning model on Amazon SageMaker, and then how to import the trained model into Redshift ML and run inference locally on the Redshift cluster. \n", "\n", "With the BYOM method, you can bring models of type XGBoost and MLP to Redshift ML. Once a pre-trained model is deployed to Redshift ML, inference can be run locally on Redshift without using a SageMaker endpoint or SageMaker Studio. This makes it easy for data analysts to run inference on new data against models created outside of Redshift, without worrying about access to SageMaker services or endpoints. 
This method helps data scientists quickly deliver machine learning models built outside of Redshift to the data team. Since Redshift ML is used through native Redshift SQL, the data team's user experience is consistent with the other data analysis work they do on the data warehouse. \n", "\n", "The content in this notebook falls into two sections:\n", "\n", " * Train and validate the XGBoost model on SageMaker\n", " \n", " * Create the BYOM model and run local inference on the Redshift cluster\n", "\n", "--------\n", "## Use Case\n", "We use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the UCI data repository [1]. More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names). In the libsvm converted [version](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), the nominal feature (Male/Female/Infant) has been converted into a real valued feature. The age of an abalone is to be predicted from eight physical measurements. The dataset has already been processed and stored on S3. The scripts used for processing the data can be found in the [Appendix](#Appendix). These include downloading the data, splitting it into train, validation, and test sets, and uploading the files to an S3 bucket. These steps do not need to be run again; they are provided for reference. \n", "\n", ">[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-------\n", "## Setup\n", "\n", "This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel. The Redshift cluster is a 2-node ra3.xlplus cluster. The connection to Redshift is made using the Redshift Data API.\n", "\n", "Let's start by specifying:\n", "\n", "1. 
S3_BUCKET: The S3 bucket that you want to use for saving the model and where the training data is located. This should be within the same region as the notebook instance, training, and hosting. This bucket also stores intermediate results generated by Redshift ML.\n", "\n", "1. REDSHIFT_IAM_ROLE: The ARN of the IAM role attached to the Redshift cluster.\n", "\n", "1. REDSHIFT_USER: The database user used to run SQL commands.\n", "\n", "1. REDSHIFT_ENDPOINT: The Redshift cluster endpoint.\n", "\n", "1. CUSTOMER_PRETRAINED_MODEL_DATA: If you plan to use your own pre-trained XGBoost model, provide its S3 location in this parameter.\n", "\n", "**NOTE**: The first four parameters are found under the Outputs tab of the CloudFormation stack. Copy and paste them into the cell below. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "REDSHIFT_ENDPOINT = 'redshift-cluster-1.cwt0y36legsj.us-east-1.redshift.amazonaws.com:5439/dev'\n", "REDSHIFT_USER=\"demo\"\n", "REDSHIFT_IAM_ROLE='arn:aws:iam::822469723147:role/byom-2-RedshiftMLIAMRole-822469723147'\n", "S3_BUCKET='byom-2-redshiftmlbucket-hwcr2e0x844p'\n", "CUSTOMER_PRETRAINED_MODEL_DATA = '' #<>. 
Use this if you are skipping part 1\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true, "tags": [ "parameters" ] }, "outputs": [], "source": [ "%%time\n", "\n", "import os\n", "import boto3\n", "import re\n", "import sagemaker\n", "import time\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "\n", "role = sagemaker.get_execution_role()\n", "region = boto3.Session().region_name\n", "\n", "# S3 bucket where the training data is located.\n", "# Feel free to specify a different bucket and prefix\n", "data_bucket = f\"jumpstart-cache-prod-{region}\"\n", "data_prefix = \"1p-notebooks-datasets/abalone/libsvm\"\n", "data_bucket_path = f\"s3://{data_bucket}\"\n", "\n", "# S3 bucket for saving code and model artifacts.\n", "# Feel free to specify a different bucket and prefix\n", "#output_bucket = sagemaker.Session().default_bucket()\n", "output_bucket=S3_BUCKET\n", "output_prefix = \"sagemaker/DEMO-xgboost-abalone-default\"\n", "output_bucket_path = f\"s3://{output_bucket}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Training the XGBOOST Model on Sagemaker\n", "\n", "After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 5 and 6 minutes." 
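] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Each entry in the \"InputDataConfig\" list used below shares the same shape. As a minimal sketch, the structure of one channel can be factored into a helper like this (the make_channel helper and the bucket path are hypothetical, shown for illustration only):\n",
"\n",
"```python\n",
"def make_channel(name, s3_uri, content_type=\"libsvm\"):\n",
"    # Build one InputDataConfig channel entry (hypothetical helper)\n",
"    return {\n",
"        \"ChannelName\": name,\n",
"        \"DataSource\": {\n",
"            \"S3DataSource\": {\n",
"                \"S3DataType\": \"S3Prefix\",\n",
"                \"S3Uri\": s3_uri,\n",
"                \"S3DataDistributionType\": \"FullyReplicated\",\n",
"            }\n",
"        },\n",
"        \"ContentType\": content_type,\n",
"        \"CompressionType\": \"None\",\n",
"    }\n",
"\n",
"print(make_channel(\"train\", \"s3://my-bucket/abalone/train\")[\"ChannelName\"])\n",
"```\n"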
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.amazon.amazon_estimator import image_uris\n", "container = image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='1.0-1')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import boto3\n", "from time import gmtime, strftime\n", "\n", "job_name = f\"DEMO-xgboost-regression-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n", "print(\"Training job\", job_name)\n", "\n", "# Ensure that the training and validation data folders generated above are reflected in the \"InputDataConfig\" parameter below.\n", "\n", "create_training_params = {\n", " \"AlgorithmSpecification\": {\"TrainingImage\": container, \"TrainingInputMode\": \"File\"},\n", " \"RoleArn\": role,\n", " \"OutputDataConfig\": {\"S3OutputPath\": f\"{output_bucket_path}/{output_prefix}/single-xgboost\"},\n", " \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m5.2xlarge\", \"VolumeSizeInGB\": 5},\n", " \"TrainingJobName\": job_name,\n", " \"HyperParameters\": {\n", " \"max_depth\": \"5\",\n", " \"eta\": \"0.2\",\n", " \"gamma\": \"4\",\n", " \"min_child_weight\": \"6\",\n", " \"subsample\": \"0.7\",\n", " \"silent\": \"0\",\n", " \"objective\": \"reg:linear\",\n", " \"num_round\": \"50\",\n", " },\n", " \"StoppingCondition\": {\"MaxRuntimeInSeconds\": 3600},\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"train\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": f\"{data_bucket_path}/{data_prefix}/train\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"libsvm\",\n", " \"CompressionType\": \"None\",\n", " },\n", " {\n", " \"ChannelName\": \"validation\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": 
f\"{data_bucket_path}/{data_prefix}/validation\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"libsvm\",\n", " \"CompressionType\": \"None\",\n", " },\n", " ],\n", "}\n", "\n", "\n", "client = boto3.client(\"sagemaker\", region_name=region)\n", "client.create_training_job(**create_training_params)\n", "\n", "import time\n", "\n", "status = client.describe_training_job(TrainingJobName=job_name)[\"TrainingJobStatus\"]\n", "print(status)\n", "while status != \"Completed\" and status != \"Failed\":\n", " time.sleep(60)\n", " status = client.describe_training_job(TrainingJobName=job_name)[\"TrainingJobStatus\"]\n", " print(status)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the \"validation\" channel has been initialized too. The SageMaker XGBoost algorithm calculates RMSE on the data passed to the \"validation\" channel and writes it to the CloudWatch logs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up hosting for the model\n", "In order to set up hosting, we have to import the model from training to hosting. \n", "\n", "### Import model into hosting\n", "\n", "Register the model with hosting. This allows the flexibility of importing models trained elsewhere."
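] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The ModelDataUrl registered here, which is later passed to Redshift ML's CREATE MODEL statement, is a plain s3://bucket/key URI. A minimal sketch of splitting such a URI into its bucket and key parts (split_s3_uri is a hypothetical helper, not part of any SDK):\n",
"\n",
"```python\n",
"def split_s3_uri(uri):\n",
"    # Split \"s3://bucket/key/parts\" into (bucket, key); hypothetical helper\n",
"    if not uri.startswith(\"s3://\"):\n",
"        raise ValueError(\"not an S3 URI: \" + uri)\n",
"    bucket, _, key = uri[len(\"s3://\"):].partition(\"/\")\n",
"    return bucket, key\n",
"\n",
"print(split_s3_uri(\"s3://my-bucket/sagemaker/model.tar.gz\"))\n",
"```\n"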
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import boto3\n", "from time import gmtime, strftime\n", "\n", "model_name = f\"{job_name}-model\"\n", "print(model_name)\n", "\n", "info = client.describe_training_job(TrainingJobName=job_name)\n", "model_data = info[\"ModelArtifacts\"][\"S3ModelArtifacts\"]\n", "print(model_data)\n", "\n", "primary_container = {\"Image\": container, \"ModelDataUrl\": model_data}\n", "\n", "create_model_response = client.create_model(\n", " ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container\n", ")\n", "\n", "print(create_model_response[\"ModelArn\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure the model artifact has been created. Print the model_data variable to find out where the model data is saved; this is the location that is later passed to the Redshift ML CREATE MODEL statement. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print the model artifact location, which is used later in the notebook\n", "print(model_data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create endpoint configuration\n", "\n", "SageMaker supports configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. In order to support this, customers create an endpoint configuration that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way. In addition, the endpoint configuration describes the instance type required for model deployment."
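] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Traffic is distributed across production variants in proportion to their InitialVariantWeight values. The arithmetic can be sketched on its own (this is not a SageMaker API call, and the variant names are made up):\n",
"\n",
"```python\n",
"def traffic_fractions(weights):\n",
"    # Relative variant weights -> fraction of traffic each variant serves\n",
"    total = sum(weights.values())\n",
"    return {name: w / total for name, w in weights.items()}\n",
"\n",
"# Two hypothetical variants weighted 1 and 3 serve 25% and 75% of traffic\n",
"print(traffic_fractions({\"VariantA\": 1, \"VariantB\": 3}))\n",
"```\n"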
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "\n", "endpoint_config_name = f\"DEMO-XGBoostEndpointConfig-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n", "print(endpoint_config_name)\n", "create_endpoint_config_response = client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.m5.xlarge\",\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(f\"Endpoint Config Arn: {create_endpoint_config_response['EndpointConfigArn']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create endpoint\n", "Lastly, the customer creates the endpoint that serves up the model, through specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete." 
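] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The cell below polls describe_endpoint once a minute until the status leaves \"Creating\". The same wait-until-ready pattern can be factored into a small generic helper (a sketch; wait_for_status is hypothetical, and boto3's SageMaker client also provides an \"endpoint_in_service\" waiter for this):\n",
"\n",
"```python\n",
"import time\n",
"\n",
"def wait_for_status(get_status, pending=\"Creating\", interval=60, max_checks=30):\n",
"    # Poll get_status() until it returns anything other than `pending`\n",
"    for _ in range(max_checks):\n",
"        status = get_status()\n",
"        if status != pending:\n",
"            return status\n",
"        time.sleep(interval)\n",
"    raise TimeoutError(\"status still pending after max_checks polls\")\n",
"```\n",
"\n",
"For example: wait_for_status(lambda: client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointStatus\"]).\n"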
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "import time\n", "\n", "endpoint_name = f'DEMO-XGBoostEndpoint-{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n", "print(endpoint_name)\n", "create_endpoint_response = client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "print(create_endpoint_response[\"EndpointArn\"])\n", "\n", "resp = client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "while status == \"Creating\":\n", " print(f\"Status: {status}\")\n", " time.sleep(60)\n", " resp = client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", "\n", "print(f\"Arn: {resp['EndpointArn']}\")\n", "print(f\"Status: {status}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validate the model for use\n", "The customer can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate predictions from the trained model using that endpoint.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "runtime_client = boto3.client(\"runtime.sagemaker\", region_name=region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download test data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "FILE_TEST = \"abalone.test\"\n", "s3 = boto3.client(\"s3\")\n", "s3.download_file(data_bucket, f\"{data_prefix}/test/{FILE_TEST}\", FILE_TEST)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start with a single prediction."
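] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The test file is in libsvm format, which is also what the \"text/x-libsvm\" request payload below contains: a label followed by index:value pairs. A self-contained sketch of parsing one record (the sample values are illustrative):\n",
"\n",
"```python\n",
"def parse_libsvm_line(line):\n",
"    # Split a libsvm record into (label, {feature_index: value})\n",
"    parts = line.strip().split()\n",
"    label = float(parts[0])\n",
"    features = {int(i): float(v) for i, v in (p.split(\":\") for p in parts[1:])}\n",
"    return label, features\n",
"\n",
"label, features = parse_libsvm_line(\"15 1:0.455 2:0.365 3:0.095\")\n",
"print(label, features)\n",
"```\n"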
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -1 abalone.test > abalone.single.test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import json\n", "from itertools import islice\n", "import math\n", "import struct\n", "\n", "file_name = \"abalone.single.test\" # customize to your test file\n", "with open(file_name, \"r\") as f:\n", " payload = f.read().strip()\n", "response = runtime_client.invoke_endpoint(\n", " EndpointName=endpoint_name, ContentType=\"text/x-libsvm\", Body=payload\n", ")\n", "result = response[\"Body\"].read()\n", "result = result.decode(\"utf-8\")\n", "result = result.split(\",\")\n", "result = [math.ceil(float(i)) for i in result]\n", "label = payload.strip(\" \").split()[0]\n", "print(f\"Label: {label}\\nPrediction: {result[0]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(endpoint_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, a single prediction works. Let's run a whole batch to see how accurate the predictions are."
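] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The batch results below are summarized with the Median Absolute Percent Error (MdAPE), i.e. the median of |label - prediction| / label. A self-contained sketch of the metric with made-up values:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def mdape(labels, preds):\n",
"    # Median Absolute Percent Error: median of |y - y_hat| / y\n",
"    labels = np.asarray(labels, dtype=float)\n",
"    preds = np.asarray(preds, dtype=float)\n",
"    return float(np.median(np.abs(labels - preds) / labels))\n",
"\n",
"# labels 10 and 20 vs predictions 9 and 22: both errors are 10%\n",
"print(mdape([10, 20], [9, 22]))\n",
"```\n"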
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import math\n", "\n", "\n", "def do_predict(data, endpoint_name, content_type):\n", " payload = \"\\n\".join(data)\n", " response = runtime_client.invoke_endpoint(\n", " EndpointName=endpoint_name, ContentType=content_type, Body=payload\n", " )\n", " result = response[\"Body\"].read()\n", " result = result.decode(\"utf-8\")\n", " result = result.split(\",\")\n", " preds = [float((num)) for num in result]\n", " preds = [math.ceil(num) for num in preds]\n", " return preds\n", "\n", "\n", "def batch_predict(data, batch_size, endpoint_name, content_type):\n", " items = len(data)\n", " arrs = []\n", "\n", " for offset in range(0, items, batch_size):\n", " if offset + batch_size < items:\n", " results = do_predict(data[offset : (offset + batch_size)], endpoint_name, content_type)\n", " arrs.extend(results)\n", " else:\n", " arrs.extend(do_predict(data[offset:items], endpoint_name, content_type))\n", " sys.stdout.write(\".\")\n", " return arrs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following helps us calculate the Median Absolute Percent Error (MdAPE) on the batch dataset. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "import json\n", "import numpy as np\n", "\n", "with open(FILE_TEST, \"r\") as f:\n", " payload = f.read().strip()\n", "\n", "labels = [int(line.split(\" \")[0]) for line in payload.split(\"\\n\")]\n", "test_data = [line for line in payload.split(\"\\n\")]\n", "preds = batch_predict(test_data, 100, endpoint_name, \"text/x-libsvm\")\n", "\n", "print(\n", " \"\\n Median Absolute Percent Error (MdAPE) = \",\n", " np.median(np.abs(np.array(labels) - np.array(preds)) / np.array(labels)),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have now completed Part 1: creating and validating an XGBoost machine learning model using SageMaker. In the second part, we show how to deploy the already trained model and run predictions on Redshift ML." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------\n", "# Create Model and run predictions on Redshift\n", "## Configure Redshift Data Api" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up a run_sql function that uses the Redshift Data API to get SQL query output directly into a pandas dataframe.\n", "\n", "In this step, we create the function run_sql, which we will use to get SQL query output directly into a pandas dataframe. 
We will also use this function to run DDL statements." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import time\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "\n", "\n", "def run_sql(sql_text):\n", " client = boto3.client(\"redshift-data\")\n", " res = client.execute_statement(Database=REDSHIFT_ENDPOINT.split('/')[1], DbUser=REDSHIFT_USER, Sql=sql_text,\n", " ClusterIdentifier=REDSHIFT_ENDPOINT.split('.')[0])\n", " query_id = res[\"Id\"]\n", " done = False\n", " while not done:\n", " time.sleep(1)\n", " status_description = client.describe_statement(Id=query_id)\n", " status = status_description[\"Status\"]\n", " if status == \"FAILED\":\n", " raise Exception('SQL query failed:' + query_id + \": \" + status_description[\"Error\"])\n", " elif status == \"FINISHED\":\n", " if status_description['ResultRows']>0:\n", " results = client.get_statement_result(Id=query_id)\n", " column_labels = []\n", " for i in range(len(results[\"ColumnMetadata\"])): column_labels.append(results[\"ColumnMetadata\"][i]['label'])\n", " records = []\n", " for record in results.get('Records'):\n", " records.append([list(rec.values())[0] for rec in record])\n", " df = pd.DataFrame(np.array(records), columns=column_labels)\n", " return df\n", " else:\n", " return query_id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create BYOM Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Model\n", " Run the CREATE MODEL command below to bring the model trained above into Redshift ML. Notice that the \"create model\" syntax has a \"FROM\" clause, where we pass the path of the model artifact created by SageMaker; this value comes in through the model_data parameter. The command below takes about 5 minutes to run. 
The BYOM CREATE MODEL command also expects the Redshift IAM role and an S3 bucket as parameter values; the S3 bucket is the Amazon S3 location that Redshift ML uses to store intermediate results.\n", "\n", "**NOTE** \n", "#### Use your own pre-trained model \n", "* Set the CUSTOMER_PRETRAINED_MODEL_DATA parameter in the first cell of this notebook.\n", "* Update the feature list and data types accordingly in the \"FUNCTION\" clause.\n", "* Update the table name and the columns passed to the predict function accordingly in the SELECT statement." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If the customer is using a pre-trained model, point MODEL_PATH at it instead of model_data\n", "if CUSTOMER_PRETRAINED_MODEL_DATA !='':\n", " MODEL_PATH = CUSTOMER_PRETRAINED_MODEL_DATA\n", "else:\n", " MODEL_PATH=model_data\n", "print(MODEL_PATH)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_text=(\"drop model if exists predict_abalone_age; \\\n", " CREATE MODEL predict_abalone_age \\\n", "FROM '{}' \\\n", "FUNCTION predict_abalone_age ( int, int, float, float,float,float,float,float,float) \\\n", "RETURNS int \\\n", "IAM_ROLE '{}' \\\n", "settings( S3_BUCKET '{}') \\\n", "\")\n", "df=run_sql(sql_text.format(MODEL_PATH, REDSHIFT_IAM_ROLE, S3_BUCKET))\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the CREATE MODEL statement has finished, you can use the \"SHOW MODEL\" command to see the model status. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=run_sql(\" show model predict_abalone_age;\" \n", " );\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data preparation\n", "Load the test data from the S3 bucket into a Redshift table. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_text=\"\"\"drop table if exists abalone_test; \n", " create table abalone_test \n", "(Rings int, sex int,Length_ float, Diameter float, Height float, \n", "WholeWeight float, ShuckedWeight float, VisceraWeight float, ShellWeight float ); \n", "copy abalone_test \n", "from 's3://jumpstart-cache-prod-us-east-1/1p-notebooks-datasets/abalone/text-csv/test/' \n", "IAM_ROLE '{}' \n", "csv ; \\\n", "\"\"\" \n", "\n", "df=run_sql(sql_text.format(REDSHIFT_IAM_ROLE))\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample the test table to make sure the data is loaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "sql_text=(\" select * from abalone_test limit 10;\" )\n", "\n", "df=run_sql(sql_text)\n", "print(df)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "Now call the prediction function that was created as part of the CREATE MODEL statement. 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_text=\"\"\"Select original_age, predicted_age, original_age-predicted_age as Error \n", "From( \n", "select predict_abalone_age(Rings,sex, \n", "Length_ , \n", "Diameter , \n", "Height , \n", "WholeWeight ,\n", "ShuckedWeight , \n", "VisceraWeight , \n", "ShellWeight ) predicted_age, rings as original_age \n", "from abalone_test ); \"\"\"\n", "\n", "df=run_sql(sql_text)\n", "print(df.head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation \n", "If you want to compute R2 or RMSE values for the predictions, you can calculate them using the code below.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df=run_sql(\"SELECT 1-(SUM(POWER(rings - (predict_abalone_age(Rings,sex, \\\n", "Length_ , \\\n", "Diameter , \\\n", "Height , \\\n", "WholeWeight , \\\n", "ShuckedWeight , \\\n", "VisceraWeight , \\\n", "ShellWeight )),2)))/(SUM(POWER(rings - (SELECT AVG(rings) FROM abalone_test),2))) R2_Value FROM abalone_test; \\\n", "\")\n", "print(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_text=\"\"\"SELECT SQRT(Avg( POWER(rings - (predict_abalone_age(Rings,sex, \n", "Length_ , \n", "Diameter , \n", "Height , \n", "WholeWeight , \n", "ShuckedWeight , \n", "VisceraWeight , \n", "ShellWeight )) , 2) ) ) as abalone_age_rmse FROM abalone_test;\"\"\"\n", "\n", "df=run_sql(sql_text)\n", "print(df.head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleanup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Drop Model and Table\n", "Clean up the objects created on your Redshift cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_text=\"\"\"drop table if exists abalone_test; \n", "drop model if exists predict_abalone_age;\n", "\"\"\" \n", "\n", 
"df=run_sql(sql_text)\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Delete Endpoint\n", "Once you are done using the endpoint, you can use the following to delete it. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.delete_endpoint(EndpointName=endpoint_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Appendix\n", "The following cells are hidden by default and are not used in the walkthrough above; they are provided for reference." ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "### Data split and upload\n", "\n", "The following methods split the data into train/validation/test datasets and upload the files to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "import io\n", "import boto3\n", "import random\n", "\n", "\n", "def data_split(\n", " FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST\n", "):\n", " data = [l for l in open(FILE_DATA, \"r\")]\n", " train_file = open(FILE_TRAIN, \"w\")\n", " valid_file = open(FILE_VALIDATION, \"w\")\n", " tests_file = open(FILE_TEST, \"w\")\n", "\n", " num_of_data = len(data)\n", " num_train = int((PERCENT_TRAIN / 100.0) * num_of_data)\n", " num_valid = int((PERCENT_VALIDATION / 100.0) * num_of_data)\n", " num_tests = int((PERCENT_TEST / 100.0) * num_of_data)\n", "\n", " data_fractions = [num_train, num_valid, num_tests]\n", " split_data = [[], [], []]\n", "\n", " rand_data_ind = 0\n", "\n", " for split_ind, fraction in enumerate(data_fractions):\n", " for i in range(fraction):\n", " rand_data_ind = random.randint(0, len(data) - 1)\n", " split_data[split_ind].append(data[rand_data_ind])\n", " data.pop(rand_data_ind)\n", "\n", " for l in split_data[0]:\n", " train_file.write(l)\n", "\n", " for l in split_data[1]:\n", " valid_file.write(l)\n", "\n", " for l in split_data[2]:\n", " tests_file.write(l)\n", "\n", " 
train_file.close()\n", " valid_file.close()\n", " tests_file.close()\n", "\n", "\n", "def write_to_s3(fobj, bucket, key):\n", " return (\n", " boto3.Session(region_name=region).resource(\"s3\").Bucket(bucket).Object(key).upload_fileobj(fobj)\n", " )\n", "\n", "\n", "def upload_to_s3(bucket, channel, filename):\n", " fobj = open(filename, \"rb\")\n", " key = f\"{prefix}/{channel}\"\n", " url = f\"s3://{bucket}/{key}/{filename}\"\n", " print(f\"Writing to {url}\")\n", " write_to_s3(fobj, bucket, key)" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [ "### Data ingestion\n", "\n", "Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets."
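] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true } }, "source": [
"data_split above sizes each split as int((percent / 100) * n), so integer truncation can leave a few records unassigned. A quick sketch of the resulting counts for the abalone dataset's 4,177 rows (illustrative only):\n",
"\n",
"```python\n",
"def split_counts(n, pct_train, pct_valid, pct_test):\n",
"    # Records per split, mirroring data_split's int() truncation\n",
"    return (int(pct_train / 100.0 * n),\n",
"            int(pct_valid / 100.0 * n),\n",
"            int(pct_test / 100.0 * n))\n",
"\n",
"counts = split_counts(4177, 70, 15, 15)\n",
"print(counts, \"leftover:\", 4177 - sum(counts))\n",
"```\n"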
] }, { "cell_type": "code", "execution_count": null, "metadata": { "jupyter": { "source_hidden": true } }, "outputs": [], "source": [ "%%time\n", "import urllib.request\n", "\n", "bucket = sagemaker.Session().default_bucket()\n", "prefix = \"sagemaker/DEMO-xgboost-abalone-default\"\n", "# Load the dataset\n", "FILE_DATA = \"abalone\"\n", "urllib.request.urlretrieve(\n", " \"https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone\", FILE_DATA\n", ")\n", "\n", "# split the downloaded data into train/test/validation files\n", "FILE_TRAIN = \"abalone.train\"\n", "FILE_VALIDATION = \"abalone.validation\"\n", "FILE_TEST = \"abalone.test\"\n", "PERCENT_TRAIN = 70\n", "PERCENT_VALIDATION = 15\n", "PERCENT_TEST = 15\n", "data_split(\n", " FILE_DATA, FILE_TRAIN, FILE_VALIDATION, FILE_TEST, PERCENT_TRAIN, PERCENT_VALIDATION, PERCENT_TEST\n", ")\n", "\n", "# upload the files to the S3 bucket\n", "upload_to_s3(bucket, \"train\", FILE_TRAIN)\n", "upload_to_s3(bucket, \"validation\", FILE_VALIDATION)\n", "upload_to_s3(bucket, \"test\", FILE_TEST)" ] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Tags", "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }