{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Multiclass classification with Amazon SageMaker XGBoost algorithm\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "_**Single machine and distributed training for multiclass classification with Amazon SageMaker XGBoost algorithm**_\n", "\n", "---\n", "\n", "---\n", "## Contents\n", "\n", "1. [Introduction](#Introduction)\n", "2. [Prerequisites and Preprocessing](#Prequisites-and-Preprocessing)\n", " 1. [Permissions and environment variables](#Permissions-and-environment-variables)\n", " 2. [Data ingestion](#Data-ingestion)\n", " 3. [Data conversion](#Data-conversion)\n", "3. [Training the XGBoost model](#Training-the-XGBoost-model)\n", " 1. [Training on a single instance](#Training-on-a-single-instance)\n", " 2. [Training on multiple instances](#Training-on-multiple-instances)\n", " 3. [Training with Automatic Model Tuning (HPO)](#Training-with-automatic-model-tuning-HPO)\n", "4. [Set up hosting for the model](#Set-up-hosting-for-the-model)\n", " 1. [Import model into hosting](#Import-model-into-hosting)\n", " 2. [Create endpoint configuration](#Create-endpoint-configuration)\n", " 3. [Create endpoint](#Create-endpoint)\n", "5. [Validate the model for use](#Validate-the-model-for-use)\n", "\n", "---\n", "## Introduction\n", "\n", "\n", "This notebook demonstrates the use of Amazon SageMaker’s implementation of the XGBoost algorithm to train and host a multiclass classification model. The MNIST dataset is used for training. It has a training set of 60,000 examples and a test set of 10,000 examples. To illustrate the use of libsvm training data format, we download the dataset and convert it to the libsvm format before training.\n", "\n", "To get started, we need to set up the environment with a few prerequisites for permissions and configurations.\n", "\n", "---\n", "## Prequisites and Preprocessing\n", "This notebook was tested in Amazon SageMaker Studio on a ml.t3.medium instance with Python 3 (Data Science) kernel. \n", "\n", "### Permissions and environment variables\n", "\n", "Here we set up the linkage and authentication to AWS services.\n", "\n", "1. The roles used to give learning and hosting access to your data. See the documentation for how to specify these.\n", "2. The S3 buckets that you want to use for training and model data and where the downloaded data is located." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
pip install --upgrade sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true, "tags": [ "parameters" ] }, "outputs": [], "source": [ "%%time\n", "\n", "import os\n", "import boto3\n", "import re\n", "import copy\n", "import time\n", "from time import gmtime, strftime\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()\n", "\n", "region = boto3.Session().region_name\n", "\n", "sess = sagemaker.Session()\n", "\n", "# S3 bucket where the original mnist data is downloaded and stored.\n", "downloaded_data_bucket = f\"sagemaker-example-files-prod-{region}\"\n", "downloaded_data_prefix = \"datasets/image/MNIST\"\n", "\n", "# S3 bucket for saving code and model artifacts.\n", "# Feel free to specify a different bucket and prefix\n", "bucket = sess.default_bucket()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true }, "outputs": [], "source": [ "prefix = \"sagemaker/DEMO-xgboost-multiclass-classification\"\n", "# customize to your bucket where you have stored the data\n", "bucket_path = f\"s3://{bucket}\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Data ingestion\n", "\n", "Next, we read the MNIST dataset [1] from an existing repository into memory, for preprocessing prior to training. It was downloaded from this [link](http://deeplearning.net/data/mnist/mnist.pkl.gz) and stored in `downloaded_data_bucket`. Processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets.\n", "\n", "> [1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import pickle, gzip, numpy, json\n", "\n", "# Load the dataset\n", "s3 = boto3.client(\"s3\")\n", "s3.download_file(downloaded_data_bucket, f\"{downloaded_data_prefix}/mnist.pkl.gz\", \"mnist.pkl.gz\")\n", "with gzip.open(\"mnist.pkl.gz\", \"rb\") as f:\n", "    train_set, valid_set, test_set = pickle.load(f, encoding=\"latin1\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Data conversion\n", "\n", "Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the data is converted from pickled numpy arrays to the libsvm format before being uploaded to S3. The hosted implementation of XGBoost consumes the libsvm-formatted data from S3 for training. The cell below provides functions for converting the data, uploading files to S3, and downloading them from S3. 
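\n\nAs a quick, purely illustrative reminder of the target format, each libsvm record is a label followed by space-separated, 1-based `index:value` pairs. The hypothetical snippet below builds one record the same way the `to_libsvm` helper in the next cell does:\n\n```python\n# Illustrative values only -- not taken from the MNIST data.\nlabel, pixels = 7, [0.0, 0.32, 0.87]\nrecord = \"{} {}\".format(label, \" \".join(\"{}:{}\".format(i + 1, v) for i, v in enumerate(pixels)))\nprint(record)  # 7 1:0.0 2:0.32 3:0.87\n```\n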
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "import struct\n", "import io\n", "import boto3\n", "\n", "\n", "def to_libsvm(f, labels, values):\n", " f.write(\n", " bytes(\n", " \"\\n\".join(\n", " [\n", " \"{} {}\".format(\n", " label, \" \".join([\"{}:{}\".format(i + 1, el) for i, el in enumerate(vec)])\n", " )\n", " for label, vec in zip(labels, values)\n", " ]\n", " ),\n", " \"utf-8\",\n", " )\n", " )\n", " return f\n", "\n", "\n", "def write_to_s3(fobj, bucket, key):\n", " return (\n", " boto3.Session(region_name=region)\n", " .resource(\"s3\")\n", " .Bucket(bucket)\n", " .Object(key)\n", " .upload_fileobj(fobj)\n", " )\n", "\n", "\n", "def get_dataset():\n", " import pickle\n", " import gzip\n", "\n", " with gzip.open(\"mnist.pkl.gz\", \"rb\") as f:\n", " u = pickle._Unpickler(f)\n", " u.encoding = \"latin1\"\n", " return u.load()\n", "\n", "\n", "def upload_to_s3(partition_name, partition):\n", " labels = [t.tolist() for t in partition[1]]\n", " vectors = [t.tolist() for t in partition[0]]\n", " num_partition = 5 # partition file into 5 parts\n", " partition_bound = int(len(labels) / num_partition)\n", " for i in range(num_partition):\n", " f = io.BytesIO()\n", " to_libsvm(\n", " f,\n", " labels[i * partition_bound : (i + 1) * partition_bound],\n", " vectors[i * partition_bound : (i + 1) * partition_bound],\n", " )\n", " f.seek(0)\n", " key = f\"{prefix}/{partition_name}/examples{str(i)}\"\n", " url = f\"s3://{bucket}/{key}\"\n", " print(f\"Writing to {url}\")\n", " write_to_s3(f, bucket, key)\n", " print(f\"Done writing to {url}\")\n", "\n", "\n", "def download_from_s3(partition_name, number, filename):\n", " key = f\"{prefix}/{partition_name}/examples{number}\"\n", " url = f\"s3://{bucket}/{key}\"\n", " print(f\"Reading from {url}\")\n", " s3 = boto3.resource(\"s3\", region_name=region)\n", " s3.Bucket(bucket).download_file(key, filename)\n", " try:\n", " s3.Bucket(bucket).download_file(key, \"mnist.local.test\")\n", " except botocore.exceptions.ClientError as e:\n", " if e.response[\"Error\"][\"Code\"] == \"404\":\n", " print(f\"The object does not exist at {url}.\")\n", " else:\n", " raise\n", "\n", "\n", "def convert_data():\n", " train_set, valid_set, test_set = get_dataset()\n", " partitions = [(\"train\", train_set), (\"validation\", valid_set), (\"test\", test_set)]\n", " for partition_name, partition in partitions:\n", " print(f\"{partition_name}: {partition[0].shape} {partition[1].shape}\")\n", " upload_to_s3(partition_name, partition)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "convert_data()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Training the XGBoost model\n", "\n", "Now that we have our data in S3, we can begin training. We'll use Amazon SageMaker XGboost algorithm, and will actually fit two models in order to demonstrate the single machine and distributed training on SageMaker. In the first job, we'll use a single machine to train. In the second job, we'll use two machines and use the ShardedByS3Key mode for the train channel. Since we have 5 part file, one machine will train on three and the other on two part files. Note that the number of instances should not exceed the number of part files. \n", "\n", "First let's setup a list of training parameters which are common across the two jobs." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.image_uris import retrieve\n", "\n", "container = retrieve(framework=\"xgboost\", region=region, version=\"1.7-1\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Ensure that the train and validation data folders generated above are reflected in the \"InputDataConfig\" parameter below.\n", "common_training_params = {\n", " \"AlgorithmSpecification\": {\"TrainingImage\": container, \"TrainingInputMode\": \"File\"},\n", " \"RoleArn\": role,\n", " \"OutputDataConfig\": {\"S3OutputPath\": f\"{bucket_path}/{prefix}/xgboost\"},\n", " \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m4.10xlarge\", \"VolumeSizeInGB\": 5},\n", " \"HyperParameters\": {\n", " \"max_depth\": \"5\",\n", " \"eta\": \"0.2\",\n", " \"gamma\": \"4\",\n", " \"min_child_weight\": \"6\",\n", " \"verbosity\": \"0\",\n", " \"objective\": \"multi:softmax\",\n", " \"num_class\": \"10\",\n", " \"num_round\": \"10\",\n", " },\n", " \"StoppingCondition\": {\"MaxRuntimeInSeconds\": 86400},\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"train\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": f\"{bucket_path}/{prefix}/train/\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"libsvm\",\n", " \"CompressionType\": \"None\",\n", " },\n", " {\n", " \"ChannelName\": \"validation\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": f\"{bucket_path}/{prefix}/validation/\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"libsvm\",\n", " \"CompressionType\": \"None\",\n", " },\n", " ],\n", "}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll create two separate jobs, updating the parameters that are unique to each.\n", "\n", "### Training on a single instance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# single machine job params\n", "single_machine_job_name = f'DEMO-xgboost-classification{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n", "print(\"Job name is:\", single_machine_job_name)\n", "\n", "single_machine_job_params = copy.deepcopy(common_training_params)\n", "single_machine_job_params[\"TrainingJobName\"] = single_machine_job_name\n", "single_machine_job_params[\"OutputDataConfig\"][\n", " \"S3OutputPath\"\n", "] = f\"{bucket_path}/{prefix}/xgboost-single\"\n", "single_machine_job_params[\"ResourceConfig\"][\"InstanceCount\"] = 1" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Training on multiple instances\n", "\n", "You can also run the training job distributed over multiple instances. For larger datasets with multiple partitions, this can significantly boost the training speed. Here we'll still use the small/toy MNIST dataset to demo this feature." 
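, "\n", "\n", "With `ShardedByS3Key`, each training instance receives a disjoint subset of the objects under the S3 prefix instead of a full copy. The purely conceptual sketch below mimics that assignment for the 5 part files and 2 instances used here; the actual assignment is handled by SageMaker, so treat this only as an illustration of the roughly even 3/2 split:\n", "\n", "```python\n", "part_files = [f\"examples{i}\" for i in range(5)]  # the 5 part files written earlier\n", "instance_count = 2\n", "# Round-robin split, just to visualize the idea of sharding objects across instances.\n", "shards = [part_files[i::instance_count] for i in range(instance_count)]\n", "print(shards)  # [['examples0', 'examples2', 'examples4'], ['examples1', 'examples3']]\n", "```"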
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# distributed job params\n", "distributed_job_name = (\n", "    f'DEMO-xgboost-distrib-classification{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n", ")\n", "print(\"Job name is:\", distributed_job_name)\n", "\n", "distributed_job_params = copy.deepcopy(common_training_params)\n", "distributed_job_params[\"TrainingJobName\"] = distributed_job_name\n", "distributed_job_params[\"OutputDataConfig\"][\n", "    \"S3OutputPath\"\n", "] = f\"{bucket_path}/{prefix}/xgboost-distributed\"\n", "# number of instances used for training\n", "distributed_job_params[\"ResourceConfig\"][\n", "    \"InstanceCount\"\n", "] = 2  # no more than 5, since a total of 5 partition files were generated above\n", "\n", "# data distribution type for train channel\n", "distributed_job_params[\"InputDataConfig\"][0][\"DataSource\"][\"S3DataSource\"][\n", "    \"S3DataDistributionType\"\n", "] = \"ShardedByS3Key\"\n", "# data distribution type for validation channel\n", "distributed_job_params[\"InputDataConfig\"][1][\"DataSource\"][\"S3DataSource\"][\n", "    \"S3DataDistributionType\"\n", "] = \"ShardedByS3Key\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's submit these jobs, taking note that the first will be submitted to run in the background so that we can immediately run the second in parallel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "sm = boto3.Session(region_name=region).client(\"sagemaker\")\n", "\n", "sm.create_training_job(**single_machine_job_params)\n", "sm.create_training_job(**distributed_job_params)\n", "\n", "status = sm.describe_training_job(TrainingJobName=distributed_job_name)[\"TrainingJobStatus\"]\n", "print(status)\n", "sm.get_waiter(\"training_job_completed_or_stopped\").wait(TrainingJobName=distributed_job_name)\n", "status = sm.describe_training_job(TrainingJobName=distributed_job_name)[\"TrainingJobStatus\"]\n", "print(f\"Training job ended with status: {status}\")\n", "if status == \"Failed\":\n", "    message = sm.describe_training_job(TrainingJobName=distributed_job_name)[\"FailureReason\"]\n", "    print(f\"Training failed with the following error: {message}\")\n", "    raise Exception(\"Training job failed\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's confirm both jobs have finished." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", "    \"Single Machine:\",\n", "    sm.describe_training_job(TrainingJobName=single_machine_job_name)[\"TrainingJobStatus\"],\n", ")\n", "print(\n", "    \"Distributed:\",\n", "    sm.describe_training_job(TrainingJobName=distributed_job_name)[\"TrainingJobStatus\"],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Training with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) \n", "***\n", "Instead of manually configuring your hyperparameter values and training with SageMaker Training, you could also train with Amazon SageMaker Automatic Model Tuning. AMT, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. 
It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.\n", "\n", "To create a tuning job using the AWS SageMaker Automatic Model Tuning API, you need to define three attributes: \n", "\n", "1. the tuning job name (string)\n", "2. the tuning job config (to specify settings for the hyperparameter tuning job - JSON object)\n", "3. the training job definition (to configure the training jobs that the tuning job launches - JSON object).\n", "\n", "To learn more about that, refer to the [Configure and Launch a Hyperparameter Tuning Job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html) documentation.\n", "\n", "In this notebook, the model that the HPO job creates is the one that is eventually hosted. You can instead choose to deploy the model created by the standalone training job by changing the variable `deploy_amt_model` below to False.\n", "\n", "Note that the tuning job will take about 10-12 minutes to complete.\n", "***" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime, sleep\n", "\n", "deploy_amt_model = True\n", "\n", "tuning_job_name = f'DEMO-hpo-xgb-{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n", "print(\"Job name is:\", tuning_job_name)\n", "\n", "tuning_job_config = {\n", "    \"ParameterRanges\": {\n", "        \"CategoricalParameterRanges\": [],\n", "        \"ContinuousParameterRanges\": [\n", "            {\n", "                \"MaxValue\": \"0.5\",\n", "                \"MinValue\": \"0.1\",\n", "                \"Name\": \"eta\",\n", "            },\n", "            {\n", "                \"MaxValue\": \"5\",\n", "                \"MinValue\": \"0\",\n", "                \"Name\": \"gamma\",\n", "            },\n", "            {\n", "                \"MaxValue\": \"120\",\n", "                \"MinValue\": \"0\",\n", "                \"Name\": \"min_child_weight\",\n", "            },\n", "            {\n", "                \"MaxValue\": \"1\",\n", "                \"MinValue\": \"0.5\",\n", "                \"Name\": \"subsample\",\n", "            },\n", "            {\n", "                \"MaxValue\": \"2\",\n", "                \"MinValue\": \"0\",\n", "                \"Name\": \"alpha\",\n", "            },\n", "        ],\n", "        \"IntegerParameterRanges\": [\n", "            {\n", "                \"MaxValue\": \"10\",\n", "                \"MinValue\": \"0\",\n", "                \"Name\": \"max_depth\",\n", "            },\n", "            {\n", "                \"MaxValue\": \"50\",\n", "                \"MinValue\": \"1\",\n", "                \"Name\": \"num_round\",\n", "            },\n", "        ],\n", "    },\n", "    # SageMaker sets the following default limits for resources used by automatic model tuning:\n", "    # https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-limits.html\n", "    \"ResourceLimits\": {\n", "        # Increase the max number of training jobs for increased accuracy (and training time).\n", "        \"MaxNumberOfTrainingJobs\": 6,\n", "        # Change parallel training jobs run by AMT to reduce total training time. Constrained by your account limits.\n", "        # if max_jobs=max_parallel_jobs then Bayesian search turns to Random.\n", "        \"MaxParallelTrainingJobs\": 2,\n", "    },\n", "    \"Strategy\": \"Bayesian\",\n", "    \"HyperParameterTuningJobObjective\": {\"MetricName\": \"validation:merror\", \"Type\": \"Minimize\"},\n", "}\n", "\n", "\n", "training_job_definition = copy.deepcopy(common_training_params)\n", "del training_job_definition[\"HyperParameters\"]\n", "\n", "training_job_definition[\"OutputDataConfig\"][\n", "    \"S3OutputPath\"\n", "] = f\"{bucket_path}/{prefix}/hpo-xgboost-class\"\n", "training_job_definition[\"StaticHyperParameters\"] = {\n", "    \"objective\": \"multi:softmax\",\n", "    \"verbosity\": \"2\",\n", "    \"num_class\": \"10\",\n", "}\n", "\n", "\n", "print(\n", "    f\"Creating a tuning job with name: {tuning_job_name}. 
It should take between 10 and 21 minutes to complete.\"\n", ")\n", "sm.create_hyper_parameter_tuning_job(\n", "    HyperParameterTuningJobName=tuning_job_name,\n", "    HyperParameterTuningJobConfig=tuning_job_config,\n", "    TrainingJobDefinition=training_job_definition,\n", ")\n", "\n", "status = sm.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\n", "    \"HyperParameterTuningJobStatus\"\n", "]\n", "print(status)\n", "while status != \"Completed\" and status != \"Failed\":\n", "    time.sleep(60)\n", "    status = sm.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\n", "        \"HyperParameterTuningJobStatus\"\n", "    ]\n", "    print(status)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Set up hosting for the model\n", "In order to set up hosting, we have to import the model from training to hosting. The steps below demonstrate hosting the model generated by the standalone distributed training job run directly against SageMaker Training. The same steps can be followed to host the model obtained from the single-machine job, or the one from the tuning job. Note that for the tuning job, the best training job (a field accessible from the tuning job description) needs to be selected first (see the code in the cell below). \n", "\n", "### Import model into hosting\n", "Next, you register the model with hosting. This allows you the flexibility of importing models trained elsewhere." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import boto3\n", "from time import gmtime, strftime\n", "\n", "model_name = f\"{distributed_job_name}-mod\"\n", "print(model_name)\n", "\n", "\n", "if deploy_amt_model:\n", "    training_name_of_model = sm.describe_hyper_parameter_tuning_job(\n", "        HyperParameterTuningJobName=tuning_job_name\n", "    )[\"BestTrainingJob\"][\"TrainingJobName\"]\n", "else:\n", "    training_name_of_model = distributed_job_name\n", "\n", "info = sm.describe_training_job(TrainingJobName=training_name_of_model)\n", "\n", "model_data = info[\"ModelArtifacts\"][\"S3ModelArtifacts\"]\n", "print(model_data)\n", "\n", "primary_container = {\"Image\": container, \"ModelDataUrl\": model_data}\n", "\n", "create_model_response = sm.create_model(\n", "    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container\n", ")\n", "\n", "print(create_model_response[\"ModelArn\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Create endpoint configuration\n", "SageMaker supports configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. In order to support this, customers create an endpoint configuration that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way. In addition, the endpoint configuration describes the instance type required for model deployment." 
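, "\n", "\n", "For illustration only, a traffic split between two hypothetical models (the names `model-a` and `model-b` are placeholders; this notebook deploys a single variant in the next cell) could be expressed by passing a list like this as `ProductionVariants`:\n", "\n", "```python\n", "ab_test_variants = [\n", "    {\n", "        \"VariantName\": \"VariantA\",\n", "        \"ModelName\": \"model-a\",  # placeholder\n", "        \"InstanceType\": \"ml.m4.xlarge\",\n", "        \"InitialInstanceCount\": 1,\n", "        \"InitialVariantWeight\": 0.9,  # ~90% of requests\n", "    },\n", "    {\n", "        \"VariantName\": \"VariantB\",\n", "        \"ModelName\": \"model-b\",  # placeholder\n", "        \"InstanceType\": \"ml.m4.xlarge\",\n", "        \"InitialInstanceCount\": 1,\n", "        \"InitialVariantWeight\": 0.1,  # ~10% of requests\n", "    },\n", "]\n", "```"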
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "\n", "endpoint_config_name = f'DEMO-XGBoostEndpointConfig-{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n", "print(endpoint_config_name)\n", "create_endpoint_config_response = sm.create_endpoint_config(\n", "    EndpointConfigName=endpoint_config_name,\n", "    ProductionVariants=[\n", "        {\n", "            \"InstanceType\": \"ml.m4.xlarge\",\n", "            \"InitialVariantWeight\": 1,\n", "            \"InitialInstanceCount\": 1,\n", "            \"ModelName\": model_name,\n", "            \"VariantName\": \"AllTraffic\",\n", "        }\n", "    ],\n", ")\n", "\n", "print(f'Endpoint Config Arn: {create_endpoint_config_response[\"EndpointConfigArn\"]}')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Create endpoint\n", "Lastly, the customer creates the endpoint that serves up the model, by specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import time\n", "\n", "endpoint_name = f'DEMO-XGBoostEndpoint-{strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())}'\n", "print(endpoint_name)\n", "create_endpoint_response = sm.create_endpoint(\n", "    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "print(create_endpoint_response[\"EndpointArn\"])\n", "\n", "resp = sm.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(f\"Status: {status}\")\n", "\n", "while status == \"Creating\":\n", "    time.sleep(60)\n", "    resp = sm.describe_endpoint(EndpointName=endpoint_name)\n", "    status = resp[\"EndpointStatus\"]\n", "    print(f\"Status: {status}\")\n", "\n", "print(f'Arn: {resp[\"EndpointArn\"]}')\n", "print(f\"Status: {status}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Validate the model for use\n", "Finally, the customer can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate classifications from the trained model using that endpoint.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "runtime_client = boto3.client(\"runtime.sagemaker\", region_name=region)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In order to evaluate the model, we'll use the test dataset previously generated. Let us first download the data from S3 to the local host." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "download_from_s3(\"test\", 0, \"mnist.local.test\")  # reading the first part file within test" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Start with a single prediction. Let's use the first record from the test file." 
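, "\n", "\n", "If you want to sanity-check that record before invoking the endpoint, you can parse the libsvm line back into a 28x28 image. This is an optional, minimal sketch (it assumes `mnist.local.test` was downloaded by the cell above):\n", "\n", "```python\n", "import numpy\n", "import matplotlib.pyplot as plt\n", "\n", "with open(\"mnist.local.test\") as f:\n", "    first = f.readline().split()\n", "\n", "label = float(first[0])\n", "pixels = numpy.zeros(784)\n", "for feat in first[1:]:\n", "    idx, val = feat.split(\":\")\n", "    pixels[int(idx) - 1] = float(val)  # libsvm indices are 1-based\n", "\n", "plt.title(f\"label = {label:.0f}\")\n", "plt.imshow(pixels.reshape(28, 28), cmap=\"gray\")\n", "plt.show()\n", "```"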
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -1 mnist.local.test > mnist.single.test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import json\n", "\n", "file_name = (\n", "    \"mnist.single.test\"  # customize to your test file 'mnist.single.test' if you use the data above\n", ")\n", "\n", "with open(file_name, \"r\") as f:\n", "    payload = f.read()\n", "\n", "response = runtime_client.invoke_endpoint(\n", "    EndpointName=endpoint_name, ContentType=\"text/x-libsvm\", Body=payload\n", ")\n", "result = response[\"Body\"].read().decode(\"ascii\")\n", "print(f\"Predicted label is {result}.\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "OK, a single prediction works.\n", "Let's do a whole batch and see how good the prediction accuracy is." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "\n", "def do_predict(data, endpoint_name, content_type):\n", "    payload = \"\\n\".join(data)\n", "    response = runtime_client.invoke_endpoint(\n", "        EndpointName=endpoint_name, ContentType=content_type, Body=payload\n", "    )\n", "    result = response[\"Body\"].read().decode(\"ascii\")\n", "    preds = [float(num) for num in result.split(\"\\n\")[:-1]]\n", "    return preds\n", "\n", "\n", "def batch_predict(data, batch_size, endpoint_name, content_type):\n", "    items = len(data)\n", "    arrs = []\n", "    for offset in range(0, items, batch_size):\n", "        arrs.extend(\n", "            do_predict(data[offset : min(offset + batch_size, items)], endpoint_name, content_type)\n", "        )\n", "        sys.stdout.write(\".\")\n", "    return arrs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following code calculates the error rate on the batch test dataset. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "import json\n", "\n", "file_name = \"mnist.local.test\"\n", "with open(file_name, \"r\") as f:\n", "    payload = f.read().strip()\n", "\n", "labels = [float(line.split(\" \")[0]) for line in payload.split(\"\\n\")]\n", "test_data = payload.split(\"\\n\")\n", "preds = batch_predict(test_data, 100, endpoint_name, \"text/x-libsvm\")\n", "\n", "print(\n", "    \"\\nerror rate=%f\"\n", "    % (sum(1 for i in range(len(preds)) if preds[i] != labels[i]) / float(len(preds)))\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here are a few predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds[0:10]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "and the corresponding labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels[0:10]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following function helps us create the confusion matrix on the labeled batch test dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy\n", "\n", "\n", "def error_rate(predictions, labels):\n", "    \"\"\"Return the error rate and confusions.\"\"\"\n", "    correct = numpy.sum(predictions == labels)\n", "    total = predictions.shape[0]\n", "\n", "    error = 100.0 - (100 * float(correct) / float(total))\n", "\n", "    confusions = numpy.zeros([10, 10], numpy.int32)\n", "    bundled = zip(predictions, labels)\n", "    for predicted, actual in bundled:\n", "        confusions[int(predicted), int(actual)] += 1\n", "\n", "    return error, confusions" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following helps us visualize the errors that the XGBoost classifier is making. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "\n", "NUM_LABELS = 10  # change it according to num_class in your dataset\n", "test_error, confusions = error_rate(numpy.asarray(preds), numpy.asarray(labels))\n", "print(\"Test error: %.1f%%\" % test_error)\n", "\n", "plt.xlabel(\"Actual\")\n", "plt.ylabel(\"Predicted\")\n", "plt.grid(False)\n", "plt.xticks(numpy.arange(NUM_LABELS))\n", "plt.yticks(numpy.arange(NUM_LABELS))\n", "plt.imshow(confusions, cmap=plt.cm.jet, interpolation=\"nearest\")\n", "\n", "for i, cas in enumerate(confusions):\n", "    for j, count in enumerate(cas):\n", "        if count > 0:\n", "            xoff = 0.07 * len(str(count))\n", "            plt.text(j - xoff, i + 0.2, int(count), fontsize=9, color=\"white\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Delete Endpoint\n", "Once you are done using the endpoint, you can use the following to delete it. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm.delete_endpoint(EndpointName=endpoint_name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/introduction_to_amazon_algorithms|xgboost_mnist|xgboost_mnist.ipynb)\n" ] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Tags", "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }