{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# An Introduction to Linear Learner with MNIST\n", "_**Making a Binary Prediction of Whether a Handwritten Digit is a 0**_\n", "\n", "1. [Introduction](#Introduction)\n", "2. [Prerequisites and Preprocessing](#Prequisites-and-Preprocessing)\n", " 1. [Permissions and environment variables](#Permissions-and-environment-variables)\n", " 2. [Data ingestion](#Data-ingestion)\n", " 3. [Data inspection](#Data-inspection)\n", " 4. [Data conversion](#Data-conversion)\n", "3. [Training the linear model](#Training-the-linear-model)\n", "4. [Set up hosting for the model](#Set-up-hosting-for-the-model)\n", "5. [Validate the model for use](#Validate-the-model-for-use)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Welcome to our example introducing Amazon SageMaker's Linear Learner Algorithm! Today, we're analyzing the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset which consists of images of handwritten digits, from zero to nine. We'll use the individual pixel values from each 28 x 28 grayscale image to predict a yes or no label of whether the digit is a 0 or some other digit (1, 2, 3, ... 9).\n", "\n", "The method that we'll use is a linear binary classifier. Linear models are supervised learning algorithms used for solving either classification or regression problems. As input, the model is given labeled examples ( **`x`**, `y`). **`x`** is a high dimensional vector and `y` is a numeric label. Since we are doing binary classification, the algorithm expects the label to be either 0 or 1 (but Amazon SageMaker Linear Learner also supports regression on continuous values of `y`). The algorithm learns a linear function, or linear threshold function for classification, mapping the vector **`x`** to an approximation of the label `y`.\n", "\n", "Amazon SageMaker's Linear Learner algorithm extends upon typical linear models by training many models in parallel, in a computationally efficient manner. Each model has a different set of hyperparameters, and then the algorithm finds the set that optimizes a specific criteria. This can provide substantially more accurate models than typical linear algorithms at the same, or lower, cost.\n", "\n", "To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites and Preprocessing\n", "\n", "Let's first set up our environment and get the stack outputs of our build pipeline." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true }, "outputs": [], "source": [ "import time\n", "import os\n", "import tarfile\n", "import boto3\n", "import json\n", "from botocore.exceptions import ClientError\n", "\n", "# Clients for the AWS services used by the build pipeline\n", "cf = boto3.client('cloudformation')\n", "s3 = boto3.client('s3')\n", "sns = boto3.client('sns')\n", "step = boto3.client('stepfunctions')\n", "sagemaker = boto3.client('sagemaker-runtime')\n", "ssm = boto3.client('ssm')\n", "\n", "with open('../config.json') as json_file:\n", "    config = json.load(json_file)\n", "StackName = config[\"StackName\"]\n", "\n", "# Collect the CloudFormation stack outputs into a dictionary\n", "result = cf.describe_stacks(StackName=StackName)\n", "outputs = {}\n", "for output in result['Stacks'][0]['Outputs']:\n", "    outputs[output['OutputKey']] = output['OutputValue']\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Data ingestion\n", "\n", "Next, we read the dataset from an online URL into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import pickle, gzip, numpy, urllib.request, json\n", "\n", "# Download the pickled MNIST dataset and load its train/validation/test splits\n", "urllib.request.urlretrieve(\"http://deeplearning.net/data/mnist/mnist.pkl.gz\", \"mnist.pkl.gz\")\n", "with gzip.open('mnist.pkl.gz', 'rb') as f:\n", "    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Data inspection\n", "\n", "Once the dataset is imported, it's typical as part of the machine learning process to inspect the data, understand the distributions, and determine what type(s) of preprocessing might be needed. You can perform those tasks right here in the notebook. As an example, let's go ahead and look at one of the digits that is part of the dataset." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.rcParams[\"figure.figsize\"] = (2, 10)\n", "\n", "\n", "def show_digit(img, caption='', subplot=None):\n", "    if subplot is None:\n", "        _, subplot = plt.subplots(1, 1)\n", "    imgr = img.reshape((28, 28))\n", "    subplot.axis('off')\n", "    subplot.imshow(imgr, cmap='gray')\n", "    plt.title(caption)\n", "\n", "\n", "show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))" ] },
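{ "cell_type": "markdown", "metadata": {}, "source": [ "It can also help to check the overall shape of the data and how balanced our binary label will be. The cell below is a small optional sketch; it assumes only the `train_set`, `valid_set`, and `test_set` tuples loaded above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Each split is a (features, labels) tuple of numpy arrays\n", "for name, dataset in [('train', train_set), ('valid', valid_set), ('test', test_set)]:\n", "    features, labels = dataset\n", "    zeros = (labels == 0).sum()\n", "    print('{}: {} examples of {} pixels, {} are 0s ({:.1%})'.format(\n", "        name, features.shape[0], features.shape[1], zeros, zeros / labels.shape[0]))" ] },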
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Data conversion\n", "\n", "Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the Amazon SageMaker implementation of Linear Learner takes recordIO-wrapped protobuf, whereas the data we have today is a pickled numpy array on disk.\n", "\n", "Most of the conversion effort is handled by the Amazon SageMaker Python SDK, via the `sagemaker.amazon.common` module imported below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import io\n", "import numpy as np\n", "import sagemaker.amazon.common as smac\n", "\n", "# Build float32 feature vectors and binary labels (1 if the digit is a 0, else 0)\n", "vectors = np.array([t.tolist() for t in train_set[0]]).astype('float32')\n", "labels = np.where(np.array([t.tolist() for t in train_set[1]]) == 0, 1, 0).astype('float32')\n", "\n", "# Write the data as recordIO-wrapped protobuf into an in-memory buffer\n", "buf = io.BytesIO()\n", "smac.write_numpy_to_dense_tensor(buf, vectors, labels)\n", "buf.seek(0)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Upload training data\n", "Now that we've created our recordIO-wrapped protobuf, we'll need to upload it to S3, so that Amazon SageMaker training can use it." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import boto3\n", "import os\n", "\n", "key = 'recordio-pb-data'\n", "prefix = 'train/mnist-linear'\n", "bucket = outputs[\"DataBucket\"]\n", "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, key)).upload_fileobj(buf)\n", "s3_train_data = 's3://{}/{}'.format(bucket, os.path.join(prefix, key))\n", "print('uploaded training data location: {}'.format(s3_train_data))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We need to make sure the SageBuild template is configured correctly for Amazon's built-in algorithms. The following code sets the stack configuration." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Switch the stack's ConfigFramework parameter to AMAZON, keeping all other parameters\n", "params = result[\"Stacks\"][0][\"Parameters\"]\n", "for n, i in enumerate(params):\n", "    if i[\"ParameterKey\"] == \"ConfigFramework\":\n", "        i[\"ParameterValue\"] = \"AMAZON\"\n", "\n", "try:\n", "    cf.update_stack(\n", "        StackName=StackName,\n", "        UsePreviousTemplate=True,\n", "        Parameters=params,\n", "        Capabilities=[\n", "            'CAPABILITY_NAMED_IAM',\n", "        ]\n", "    )\n", "    waiter = cf.get_waiter('stack_update_complete')\n", "    print(\"Waiting for stack update\")\n", "    waiter.wait(\n", "        StackName=StackName,\n", "        WaiterConfig={\n", "            'Delay': 1,\n", "            'MaxAttempts': 600\n", "        }\n", "    )\n", "except ClientError as e:\n", "    # An unchanged stack is fine; anything else is a real error\n", "    if e.response[\"Error\"][\"Message\"] == \"No updates are to be performed.\":\n", "        pass\n", "    else:\n", "        raise e\n", "print(\"stack ready!\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Training the linear model\n", "\n", "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the Linear Learner training algorithm, although we have tested it on multi-terabyte datasets.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll configure the training job through the pipeline's parameter store, making sure to pass in the necessary hyperparameters. Notice:\n", "- `feature_dim` is set to 784, which is the number of pixels in each 28 x 28 image.\n", "- `predictor_type` is set to 'binary_classifier' since we are trying to predict whether the image is or is not a 0.\n", "- `mini_batch_size` is set to 200. This value can be tuned for relatively minor improvements in fit and speed, but selecting a reasonable value relative to the dataset is appropriate in most cases." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the pipeline's training configuration from Parameter Store and update it\n", "store = outputs[\"ParameterStore\"]\n", "result = ssm.get_parameter(Name=store)\n", "\n", "params = json.loads(result[\"Parameter\"][\"Value\"])\n", "params[\"hyperparameters\"] = {\n", "    \"feature_dim\": \"784\",\n", "    \"predictor_type\": \"binary_classifier\",\n", "    \"mini_batch_size\": \"200\"\n", "}\n", "params[\"algorithm\"] = \"linear-learner\"\n", "params[\"traininstancecount\"] = 1\n", "params[\"traininstancetype\"] = \"ml.c4.xlarge\"\n", "params[\"channels\"] = {\n", "    \"train\": {\n", "        \"path\": \"train/mnist-linear\",\n", "        \"dist\": False\n", "    }\n", "}\n", "\n", "ssm.put_parameter(\n", "    Name=store,\n", "    Type=\"String\",\n", "    Overwrite=True,\n", "    Value=json.dumps(params)\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's start our training!" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# The message content is not important; publishing to the topic starts a build\n", "result = sns.publish(\n", "    TopicArn=outputs['LaunchTopic'],\n", "    Message=\"{}\"\n", ")\n", "print(result)\n", "time.sleep(5)\n", "# List all running executions of our StateMachine to find the one just started\n", "result = step.list_executions(\n", "    stateMachineArn=outputs['StateMachine'],\n", "    statusFilter=\"RUNNING\"\n", ")['executions']\n", "\n", "if len(result) > 0:\n", "    response = step.describe_execution(\n", "        executionArn=result[0]['executionArn']\n", "    )\n", "    status = response['status']\n", "    print(status, response['name'])\n", "    # Poll the status until the execution finishes\n", "    while status == \"RUNNING\":\n", "        print('.', end=\"\")\n", "        time.sleep(5)\n", "        status = step.describe_execution(executionArn=result[0]['executionArn'])['status']\n", "    print()\n", "    print(status)\n", "else:\n", "    print(\"no running tasks\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Validate the model for use\n", "Finally, we can now validate the model for use. We can pass HTTP POST requests to the endpoint to get back predictions. To make this easier, we'll again use the Amazon SageMaker Python SDK and specify how to serialize requests and deserialize responses that are specific to the algorithm." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.predictor import RealTimePredictor, csv_serializer, json_deserializer\n", "\n", "# Point a predictor at the endpoint created by the build pipeline\n", "predictor = RealTimePredictor(outputs[\"SageMakerEndpoint\"],\n", "                              serializer=csv_serializer,\n", "                              deserializer=json_deserializer)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try getting a prediction for a single record." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = predictor.predict(train_set[0][30:31])\n", "print(result)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "OK, a single prediction works. We see that for one record our endpoint returned some JSON which contains `predictions`, including the `score` and `predicted_label`. In this case, `score` is a continuous value in [0, 1] representing the probability that the digit is a 0. `predicted_label` will take a value of either `0` or `1`, where (somewhat counterintuitively) `1` denotes that we predict the image is a 0, while `0` denotes that we predict the image is not a 0.\n", "\n", "Let's do a whole batch of images and evaluate our predictive accuracy." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Split the test set into 100 batches, since the whole set is too large for a single request\n", "predictions = []\n", "for array in np.array_split(test_set[0], 100):\n", "    result = predictor.predict(array)\n", "    predictions += [r['predicted_label'] for r in result['predictions']]\n", "\n", "predictions = np.array(predictions)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.crosstab(np.where(test_set[1] == 0, 1, 0), predictions, rownames=['actuals'], colnames=['predictions'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As we can see from the confusion matrix above, we correctly classify 8972 images as not being 0s, while we predict 39 images as 0s that aren't, and miss 48 images that actually are 0s." ] },
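{ "cell_type": "markdown", "metadata": {}, "source": [ "To summarize the matrix as single numbers, we can compute overall accuracy along with precision and recall for the positive (\"is a 0\") class. This is a minimal sketch using only `numpy` and the `predictions` array from the cells above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Recode actual labels the same way as training: 1 if the digit is a 0, else 0\n", "actuals = np.where(test_set[1] == 0, 1, 0)\n", "\n", "accuracy = (predictions == actuals).mean()\n", "# Precision: of the images we predicted to be 0s, how many really are?\n", "precision = (actuals[predictions == 1] == 1).mean()\n", "# Recall: of the actual 0s, how many did we catch?\n", "recall = (predictions[actuals == 1] == 1).mean()\n", "print('accuracy: {:.4f}, precision: {:.4f}, recall: {:.4f}'.format(accuracy, precision, recall))" ] },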
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 2 }