{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Inference Pipeline with Custom Containers and xgBoost\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically a Machine Learning (ML) process consists of few steps: data gathering with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm. \n", "In many cases, when the trained model is used for processing real time or batch prediction requests, the model receives data in a format which needs to pre-processed (e.g. featurized) before it can be passed to the algorithm. In the following notebook, we will demonstrate how you can build your ML Pipeline leveraging the ability to create custom Sagemaker algorithms and the out of the box SageMaker xgBoost algorithm. After the model is trained we will deploy the ML Pipeline (data preprocessing, the xgBoost classifier, and data postprocessing) as an Inference Pipeline behind a single SageMaker Endpoint for real time inference. We will also use the preprocessor with batch transformation using Amazon SageMaker Batch Transform to prepare xgBoost training data.\n", "\n", "\n", "\n", "The toy problem that is being solved here is to match a set of keywords to a category of questions. From there we can match that category against a list of available agents who specialize in answering that category of question. The agents and their availability is stored externally in a DynamoDB database. The data transformations, matching against our model, and querying of the database are all done as part of the inference pipeline.\n", "\n", "The preprocessing step of the pipeline encodes a comma-separated list of words into a format that xgBoost understands using a CountVectorizer. It also trains a LabelEncoder, which is used to transform from the categories of questions to a set of integers - having the labels encoded as integers is also a requirement of the xgBoost multiclass classifer. \n", "\n", "The xgBoost model maps the encoded list of words to an integer, which represents the encoded class of question that best matches those words.\n", "\n", "Finally, the postprocessing step of the pipeline uses the LabelEncoding model trained in the preprocessing step to map the number representing the classification of the question back to the text. It then takes the category and queries dynamodb for available agents that matches that category." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first create our Sagemaker session and role, and create a S3 prefix to use for the notebook example." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "!mkdir -p returns_data\n", "!python3 generate-training-data.py --samples 100000 --filename returns_data/samples.csv" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "!python3 load-ddb-data.py PipelineLookupTable" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# S3 prefix\n", "\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sagemaker_session = sagemaker.Session()\n", "\n", "# Get a SageMaker-compatible role used by this Notebook Instance.\n", "role = get_execution_role()\n", "\n", "bucket = sagemaker_session.default_bucket()\n", "prefix = \"Custom-Pipeline-Inference-Example\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload the data for training \n", "\n", "When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "WORK_DIRECTORY = \"returns_data\"\n", "\n", "train_input = sagemaker_session.upload_data(\n", " path=\"{}/{}\".format(WORK_DIRECTORY, \"samples.csv\"),\n", " bucket=bucket,\n", " key_prefix=\"{}/{}\".format(prefix, \"train\"),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up a loader function\n", "\n", "The load_data function pulls in the CSV data into two columns: the first column of the CSV is mapped to the label, and every subsequent CSV column is loaded as a dictionary into the second Pandas column" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import csv\n", "\n", "\n", "def load_data(raw, columns, skip_first_row=True):\n", " recs = [(row[0], set(row[1:])) for row in csv.reader(raw)]\n", " if skip_first_row:\n", " return pd.DataFrame.from_records(recs[1:], columns=columns)\n", " else:\n", " return pd.DataFrame.from_records(recs, columns=columns)\n", "\n", "\n", "def load(files, columns, skip_first_row=True):\n", " raw_data = []\n", " for file in files:\n", " raw_data.append(load_data(open(file), columns, skip_first_row))\n", "\n", " return pd.concat(raw_data)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df = load([\"returns_data/samples.csv\"], [\"label\", \"words\"])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | label | \n", "words | \n", "
---|---|---|
0 | \n", "category_properties | \n", "{rental, properties, investment} | \n", "
1 | \n", "category_medical | \n", "{medical, covid} | \n", "
2 | \n", "category_itemization | \n", "{donation, itemization} | \n", "
3 | \n", "category_estate taxes | \n", "{medical, inheritance, estate} | \n", "
4 | \n", "category_estate taxes | \n", "{estate} | \n", "