{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Inference Pipeline with Custom Containers and xgBoost\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically a Machine Learning (ML) process consists of few steps: data gathering with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm. \n", "In many cases, when the trained model is used for processing real time or batch prediction requests, the model receives data in a format which needs to pre-processed (e.g. featurized) before it can be passed to the algorithm. In the following notebook, we will demonstrate how you can build your ML Pipeline leveraging the ability to create custom Sagemaker algorithms and the out of the box SageMaker xgBoost algorithm. After the model is trained we will deploy the ML Pipeline (data preprocessing, the xgBoost classifier, and data postprocessing) as an Inference Pipeline behind a single SageMaker Endpoint for real time inference. We will also use the preprocessor with batch transformation using Amazon SageMaker Batch Transform to prepare xgBoost training data.\n", "\n", "\n", "\n", "The toy problem that is being solved here is to match a set of keywords to a category of questions. From there we can match that category against a list of available agents who specialize in answering that category of question. The agents and their availability is stored externally in a DynamoDB database. The data transformations, matching against our model, and querying of the database are all done as part of the inference pipeline.\n", "\n", "The preprocessing step of the pipeline encodes a comma-separated list of words into a format that xgBoost understands using a CountVectorizer. It also trains a LabelEncoder, which is used to transform from the categories of questions to a set of integers - having the labels encoded as integers is also a requirement of the xgBoost multiclass classifer. \n", "\n", "The xgBoost model maps the encoded list of words to an integer, which represents the encoded class of question that best matches those words.\n", "\n", "Finally, the postprocessing step of the pipeline uses the LabelEncoding model trained in the preprocessing step to map the number representing the classification of the question back to the text. It then takes the category and queries dynamodb for available agents that matches that category." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first create our Sagemaker session and role, and create a S3 prefix to use for the notebook example." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "!mkdir -p returns_data\n", "!python3 generate-training-data.py --samples 100000 --filename returns_data/samples.csv" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "!python3 load-ddb-data.py PipelineLookupTable" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# S3 prefix\n", "\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sagemaker_session = sagemaker.Session()\n", "\n", "# Get a SageMaker-compatible role used by this Notebook Instance.\n", "role = get_execution_role()\n", "\n", "bucket = sagemaker_session.default_bucket()\n", "prefix = \"Custom-Pipeline-Inference-Example\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload the data for training \n", "\n", "When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "WORK_DIRECTORY = \"returns_data\"\n", "\n", "train_input = sagemaker_session.upload_data(\n", " path=\"{}/{}\".format(WORK_DIRECTORY, \"samples.csv\"),\n", " bucket=bucket,\n", " key_prefix=\"{}/{}\".format(prefix, \"train\"),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up a loader function\n", "\n", "The load_data function pulls in the CSV data into two columns: the first column of the CSV is mapped to the label, and every subsequent CSV column is loaded as a dictionary into the second Pandas column" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import csv\n", "\n", "\n", "def load_data(raw, columns, skip_first_row=True):\n", " recs = [(row[0], set(row[1:])) for row in csv.reader(raw)]\n", " if skip_first_row:\n", " return pd.DataFrame.from_records(recs[1:], columns=columns)\n", " else:\n", " return pd.DataFrame.from_records(recs, columns=columns)\n", "\n", "\n", "def load(files, columns, skip_first_row=True):\n", " raw_data = []\n", " for file in files:\n", " raw_data.append(load_data(open(file), columns, skip_first_row))\n", "\n", " return pd.concat(raw_data)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df = load([\"returns_data/samples.csv\"], [\"label\", \"words\"])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | label | \n", "words | \n", "
---|---|---|
0 | \n", "category_properties | \n", "{rental, properties, investment} | \n", "
1 | \n", "category_medical | \n", "{medical, covid} | \n", "
2 | \n", "category_itemization | \n", "{donation, itemization} | \n", "
3 | \n", "category_estate taxes | \n", "{medical, inheritance, estate} | \n", "
4 | \n", "category_estate taxes | \n", "{estate} | \n", "