{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 3: Model Training\n", "**This notebook uses the feature set extracted by `module-2` to create a XGBoost based machine learning model for binary classification**\n", "\n", "**Note:** Please set kernel to `Python 3 (Data Science)` and select instance to `ml.t3.medium`\n", "\n", "---\n", "\n", "## Contents\n", "\n", "1. [Background](#Background)\n", "1. [Setup](#Setup)\n", "1. [Load transformed feature set](#Load-transformed-feature-set)\n", "1. [Split data](#Split-data)\n", "1. [Train a model using SageMaker built-in XgBoost algorithm](#Train-a-model-using-SageMaker-built-in-XgBoost-algorithm)\n", "1. [Real time inference using the deployed endpoint](#Real-time-inference-using-the-deployed-endpoint)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Background\n", "\n", "In this notebook, we demonstrate how to use the feature set derived in `Module-2` and create a machine learning model for predicting whether a customer will reorder a product or not based on historical records. Given the problem type is supervised binary classification, we will use a SageMaker built-in algorithm XGBoost to design this classifier. Once the model is trained, we will also deploy the trained model as a SageMaker endpoint for real-time inference. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.predictor import Predictor\n", "from sagemaker import get_execution_role\n", "import pandas as pd\n", "import numpy as np\n", "import sagemaker\n", "import logging\n", "import boto3\n", "import json\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logger = logging.getLogger('__name__')\n", "logger.setLevel(logging.DEBUG)\n", "logger.addHandler(logging.StreamHandler())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Essentials" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_execution_role = get_execution_role()\n", "logger.info(f'Role = {sagemaker_execution_role}')\n", "session = boto3.Session()\n", "sagemaker_session = sagemaker.Session()\n", "default_bucket = sagemaker_session.default_bucket()\n", "prefix = 'sagemaker-featurestore-workshop'\n", "s3 = session.resource('s3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load transformed feature set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('.././data/train/transformed.csv')\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Move column `is_redordered` to be the first column since our training algorithm `XGBoost` expects the target column to be the first column." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_column = df.pop('is_reordered')\n", "df.insert(0, 'is_reordered', first_column)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Split data\n", "\n", "We will shuffle the whole dataset first (df.sample(frac=1, random_state=123)) and then split our data set into the following parts:\n", "\n", "* 70% - train set,\n", "* 20% - validation set,\n", "* 10% - test set\n", "\n", "**Note:** In the code below, the first element denotes size for train (0.7 = 70%), second element denotes size for test (1-0.9 = 0.1 = 10%) and difference between the two denotes size for validation(1 - [0.7+0.1] = 0.2 = 20%)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df, validation_df, test_df = np.split(df.sample(frac=1, random_state=123), [int(.7*len(df)), int(.9*len(df))])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validation_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save split datasets to local" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.to_csv('../data/train/train.csv', index=False)\n", "validation_df.to_csv('../data/validation/validation.csv', index=False)\n", "test_df.to_csv('../data/test/test.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy datasets to S3 from local" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.Bucket(default_bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('.././data/train/train.csv')\n", "s3.Bucket(default_bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('.././data/validation/validation.csv')\n", "s3.Bucket(default_bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('.././data/test/test.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create Pointers to the uploaded files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_set_location = 's3://{}/{}/train/'.format(default_bucket, prefix)\n", "validation_set_location = 's3://{}/{}/validation/'.format(default_bucket, prefix)\n", "test_set_location = 's3://{}/{}/test/'.format(default_bucket, prefix)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_set_pointer = TrainingInput(s3_data=train_set_location, content_type='csv')\n", "validation_set_pointer = TrainingInput(s3_data=validation_set_location, content_type='csv')\n", "test_set_pointer = TrainingInput(s3_data=test_set_location, content_type='csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(json.dumps(train_set_pointer.__dict__, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train a model using SageMaker built-in XgBoost algorithm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container_uri = sagemaker.image_uris.retrieve(region=session.region_name, \n", " framework='xgboost', \n", " version='1.0-1', \n", " 
image_scope='training')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb = sagemaker.estimator.Estimator(image_uri=container_uri,\n", " role=sagemaker_execution_role, \n", " instance_count=2, \n", " instance_type='ml.m5.xlarge',\n", " output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),\n", " sagemaker_session=sagemaker_session,\n", " base_job_name='reorder-classifier')\n", "\n", "xgb.set_hyperparameters(objective='binary:logistic',\n", " num_round=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb.fit({'train': train_set_pointer, 'validation': validation_set_pointer})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Saving Training Job Information" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Saving training job information to be used in the ML lineage module\n", "training_job_info = xgb.latest_training_job.describe()\n", "if training_job_info != None :\n", " training_jobName = training_job_info[\"TrainingJobName\"]\n", " %store training_jobName" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Host the trained XGBoost model as a SageMaker Endpoint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** The deployment usually takes ~10 mins - good time to take a coffee break :)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb_predictor = xgb.deploy(initial_instance_count=2,\n", " instance_type='ml.m5.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Real time inference using the deployed endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_serializer = CSVSerializer()\n", "endpoint_name = xgb_predictor.endpoint_name\n", "%store endpoint_name\n", "predictor = Predictor(endpoint_name=endpoint_name, \n", " serializer=csv_serializer)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df = pd.read_csv('.././data/test/test.csv')\n", "record = test_df.sample(1)\n", "record" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = record.values[0]\n", "payload = X[1:]\n", "payload" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "predicted_class_prob = predictor.predict(payload).decode('utf-8')\n", "if float(predicted_class_prob) < 0.5:\n", " logger.info('Prediction (y) = Will not reorder')\n", "else:\n", " logger.info('Prediction (y) = Will reorder')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }