{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Customer Churn Prediction with XGBoost\n",
"_**Using Gradient Boosted Trees to Predict Mobile Customer Departure**_\n",
"\n",
"---\n",
"\n",
"---\n",
"\n",
"## Contents\n",
"\n",
"1. [Background](#Background)\n",
"1. [Setup](#Setup)\n",
"1. [Data](#Data)\n",
"1. [Train](#Train)\n",
"1. [Inference Pipeline](#Inference)\n",
"\n",
"\n",
"---\n",
"\n",
"## Background\n",
"\n",
"_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_\n",
"\n",
"Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.\n",
"\n",
"We use an example of churn that is familiar to all of us–leaving a mobile phone operator. Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.\n",
"\n",
"---\n",
"\n",
"## Setup\n",
"\n",
"_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n",
"\n",
"Let's start by specifying:\n",
"\n",
"- The S3 bucket and prefix that you want to use for training and model data. You need to create an S3 bucket to store your datasets and the model you build. Follow this [instruction](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html) to create an S3 bucket. This should be within the same region as the Notebook Instance, training, and hosting.\n",
"- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"isConfigCell": true,
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"bucket = 'QS-SM-customer-churn'\n",
"prefix = 'sagemaker/QS-xgboost-churn'\n",
"\n",
"# Define IAM role\n",
"import boto3\n",
"import re\n",
"from sagemaker import get_execution_role\n",
"\n",
"role = get_execution_role()"
]
},
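{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (a minimal sketch, assuming the bucket named above already exists in your account), the next cell confirms the bucket is reachable and prints both the notebook session's region and the bucket's region so you can verify they match."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (assumes the bucket defined above already exists in your account).\n",
"# head_bucket raises a ClientError if the bucket is missing or inaccessible.\n",
"s3 = boto3.client('s3')\n",
"s3.head_bucket(Bucket=bucket)\n",
"\n",
"# get_bucket_location returns None for us-east-1, hence the fallback.\n",
"bucket_region = s3.get_bucket_location(Bucket=bucket)['LocationConstraint'] or 'us-east-1'\n",
"print('Notebook session region:', boto3.Session().region_name)\n",
"print('Bucket region:', bucket_region)"
]
},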
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll import the Python libraries we'll need for the remainder of the exercise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import io\n",
"import os\n",
"import sys\n",
"import time\n",
"import json\n",
"from IPython.display import display\n",
"from time import strftime, gmtime\n",
"import sagemaker\n",
"from sagemaker.predictor import csv_serializer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Data\n",
"\n",
"Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.\n",
"\n",
"The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. Let's download and read that dataset in now:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget http://dataminingconsultant.com/DKD2e_data_sets.zip\n",
"!unzip -o DKD2e_data_sets.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"churn = pd.read_csv('./Data sets/churn.txt')\n",
"pd.set_option('display.max_columns', 500)\n",
"churn.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By modern standards, it’s a relatively small dataset, with only 3,333 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:\n",
"\n",
"- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ\n",
"- `Account Length`: the number of days that this account has been active\n",
"- `Area Code`: the three-digit area code of the corresponding customer’s phone number\n",
"- `Phone`: the remaining seven-digit phone number\n",
"- `Int’l Plan`: whether the customer has an international calling plan: yes/no\n",
"- `VMail Plan`: whether the customer has a voice mail feature: yes/no\n",
"- `VMail Message`: presumably the average number of voice mail messages per month\n",
"- `Day Mins`: the total number of calling minutes used during the day\n",
"- `Day Calls`: the total number of calls placed during the day\n",
"- `Day Charge`: the billed cost of daytime calls\n",
"- `Eve Mins, Eve Calls, Eve Charge`: the billed cost for calls placed during the evening\n",
"- `Night Mins`, `Night Calls`, `Night Charge`: the billed cost for calls placed during nighttime\n",
"- `Intl Mins`, `Intl Calls`, `Intl Charge`: the billed cost for international calls\n",
"- `CustServ Calls`: the number of calls placed to Customer Service\n",
"- `Churn?`: whether the customer left the service: true/false\n",
"\n",
"The last attribute, `Churn?`, is known as the target attribute–the attribute that we want the ML model to predict. Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification."
]
},
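{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before splitting the data, it's worth a quick look at how the target is distributed. Churn datasets are typically imbalanced, which is useful to keep in mind when judging model accuracy later. The next cell is a lightweight check with pandas and is not part of the original pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: inspect the class balance of the target attribute.\n",
"display(churn['Churn?'].value_counts())\n",
"display(churn['Churn?'].value_counts(normalize=True))"
]
},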
{
"cell_type": "markdown",
"metadata": {},
"source": [
"let's split the data into training, validation, and test sets. This will help prevent us from overfitting the model, and allow us to test the models accuracy on data it hasn't already seen."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_data, validation_data, test_data = np.split(churn.sample(frac=1, random_state=1729), [int(0.7 * len(churn)), int(0.9 * len(churn))])\n",
"train_data.to_csv('train.csv', header=False, index=False)\n",
"validation_data.to_csv('validation.csv', header=False, index=False)\n",
"test_data.drop('Churn?', axis=1).to_csv('test.csv', header=True, index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll upload these files to S3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'rawtrain/train.csv')).upload_file('train.csv')\n",
"boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'rawvalidation/validation.csv')).upload_file('validation.csv')\n",
"boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'rawtest/test.csv')).upload_file('test.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/rawtrain/'.format(bucket, prefix), content_type='csv')\n",
"s3_input_validation = sagemaker.TrainingInput(s3_data='s3://{}/{}/rawvalidation/'.format(bucket, prefix), content_type='csv')\n",
"s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/rawtest/'.format(bucket, prefix), content_type='csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preprocessing data "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to do typical preprocessing tasks, including cleaning, feature transformation, feature selection on input data before train the prediction model. For example: \n",
"- `Phone` takes on too many unique values to be of any practical use. It's possible parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it.\n",
"- `Area Code` showing up as a feature we should convert to non-numeric.\n",
"- If we dig into features and run correlaiton analysis, we see several features that essentially have high correlation with one another. Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias. We should remove one feature from each of the highly correlated pairs.\n",
"\n",
"We will use Amazon SageMaker built-in Scikit-learn library for preprocessing (and also postprocessing), and then use the Amazon SageMaker built-in XGboost algorithm for predictions. We’ll deploy both the library and the algorithm on the same endpoint using the Amazon SageMaker Inference Pipelines feature so you can pass raw input data directly to Amazon SageMaker. We’ll also reuse the preprocessing code between training and inference to reduce development overhead and errors.\n",
"\n",
"To run Scikit-learn on Sagemaker `SKLearn` Estimator with a script as an entry point. The training script is very similar to a training script you might run outside of SageMaker. Also, as this data set is pretty small in term of size, we use the 'local' mode for preprocessing and upload the transformer and transformed data into S3."
]
},
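{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the next cell sketches the kinds of transformations described above in plain pandas. It is illustrative only; the preprocessing actually used for training and inference lives in `preprocessing.py` in the linked repository and may differ in its details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: a rough pandas sketch of the preprocessing described above.\n",
"# The pipeline itself uses preprocessing.py from the linked repository, which may differ.\n",
"features = churn.drop(['Churn?', 'Phone'], axis=1)             # Phone has too many unique values\n",
"features['Area Code'] = features['Area Code'].astype(object)   # treat area code as categorical\n",
"\n",
"# Each Charge column is roughly a linear function of the matching Mins column,\n",
"# so keep only one of each highly correlated pair.\n",
"features = features.drop(['Day Charge', 'Eve Charge', 'Night Charge', 'Intl Charge'], axis=1)\n",
"\n",
"features = pd.get_dummies(features)                            # one-hot encode the categorical columns\n",
"features.head()"
]
},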
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.sklearn.estimator import SKLearn\n",
"sagemaker_session = sagemaker.Session()\n",
"git_config = {'repo': 'https://github.com/aws-samples/quicksight-sagemaker-integration-blog.git','branch': 'master'}\n",
"\n",
"script_path = 'quicksight-sagemaker-integration/preprocessing.py'\n",
"\n",
"sklearn_preprocessor = SKLearn(\n",
" entry_point=script_path,\n",
" git_config=git_config,\n",
" role=role,\n",
" instance_type=\"local\",\n",
" framework_version=\"0.20.0\")\n",
"sklearn_preprocessor.fit({'train': s3_input_train})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preparing the training and validation dataset \n",
"Now that our proprocessor is properly fitted, let's go ahead and preprocess our training and validation data. Let's use batch transform to directly preprocess the raw data and store right back into s3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define a SKLearn Transformer from the trained SKLearn Estimator\n",
"transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-train-output')\n",
"\n",
"scikit_learn_inferencee_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'feature-transform'})\n",
"transformer_train = scikit_learn_inferencee_model.transformer(\n",
" instance_count=1, \n",
" instance_type='local',\n",
" assemble_with = 'Line',\n",
" output_path = transform_train_output_path,\n",
" accept = 'text/csv')\n",
"\n",
"\n",
"# Preprocess training input\n",
"transformer_train.transform(s3_input_train.config['DataSource']['S3DataSource']['S3Uri'], content_type='text/csv')\n",
"print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)\n",
"transformer_train.wait()\n",
"preprocessed_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name\n",
"print(preprocessed_train_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define a SKLearn Transformer from the trained SKLearn Estimator\n",
"transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-validation-output')\n",
"transformer_validation = scikit_learn_inferencee_model.transformer(\n",
" instance_count=1, \n",
" instance_type='local',\n",
" assemble_with = 'Line',\n",
" output_path = transform_validation_output_path,\n",
" accept = 'text/csv')\n",
"# Preprocess validation input\n",
"transformer_validation.transform(s3_input_validation.config['DataSource']['S3DataSource']['S3Uri'], content_type='text/csv')\n",
"print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)\n",
"transformer_validation.wait()\n",
"preprocessed_validation_path = transformer_validation.output_path+transformer_validation.latest_transform_job.job_name\n",
"print(preprocessed_validation_path)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Train\n",
"\n",
"Moving onto training, first we'll need to specify the locations of the XGBoost algorithm containers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.amazon.amazon_estimator import get_image_uri\n",
"container = sagemaker.image_uris.retrieve(\"xgboost\", boto3.Session().region_name, \"1.2-1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s3_input_train_processed = sagemaker.session.TrainingInput(\n",
" preprocessed_train_path, \n",
" distribution='FullyReplicated',\n",
" content_type='text/csv', \n",
" s3_data_type='S3Prefix')\n",
"print(s3_input_train_processed.config)\n",
"s3_input_validation_processed = sagemaker.session.TrainingInput(\n",
" preprocessed_validation_path, \n",
" distribution='FullyReplicated',\n",
" content_type='text/csv', \n",
" s3_data_type='S3Prefix')\n",
"print(s3_input_validation_processed.config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters. A few key hyperparameters are:\n",
"- `max_depth` controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.\n",
"- `subsample` controls sampling of the training data. This technique can help reduce overfitting, but setting it too low can also starve the model of data.\n",
"- `num_round` controls the number of boosting rounds. This is essentially the subsequent models that are trained using the residuals of previous iterations. Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.\n",
"- `eta` controls how aggressive each round of boosting is. Larger values lead to more conservative boosting.\n",
"- `gamma` controls how aggressively trees are grown. Larger values lead to more conservative models.\n",
"\n",
"More detail on XGBoost's hyperparmeters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sess = sagemaker.Session()\n",
"\n",
"xgb = sagemaker.estimator.Estimator(container,\n",
" role, \n",
" instance_count=1, \n",
" instance_type='ml.m4.xlarge',\n",
" output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
" sagemaker_session=sess)\n",
"xgb.set_hyperparameters(max_depth=5,\n",
" eta=0.2,\n",
" gamma=4,\n",
" min_child_weight=6,\n",
" subsample=0.8,\n",
" objective='binary:logistic',\n",
" num_round=100)\n",
"\n",
"xgb.fit({'train': s3_input_train_processed, 'validation': s3_input_validation_processed}) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Post-processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define a SKLearn Transformer from the trained SKLearn Estimator\n",
"transform_postprocessor_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-postprocessing-output')\n",
"scikit_learn_post_process_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'inverse-label-transform'})\n",
"transformer_post_processing = scikit_learn_post_process_model.transformer(\n",
" instance_count=1, \n",
" instance_type='local',\n",
" assemble_with = 'Line',\n",
" output_path = transform_postprocessor_path,\n",
" accept = 'text/csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inference Pipeline \n",
"Setting up a Machine Learning pipeline can be done with the create_model(). In this example, we configure our pipeline model with the fitted Scikit-learn inference model, the fitted Xgboost model and the psotprocessing model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
"model_name = 'QS-inference-pipeline-' + timestamp_prefix\n",
"client = boto3.client('sagemaker')\n",
"response = client.create_model(\n",
" ModelName=model_name,\n",
" Containers=[\n",
" {\n",
" 'Image': sklearn_preprocessor.image_uri,\n",
" 'ModelDataUrl': sklearn_preprocessor.model_data,\n",
" 'Environment': {\n",
" \"SAGEMAKER_SUBMIT_DIRECTORY\": sklearn_preprocessor.uploaded_code.s3_prefix,\n",
" \"TRANSFORM_MODE\": \"feature-transform\",\n",
" \"SAGEMAKER_CONTAINER_LOG_LEVEL\": str(sklearn_preprocessor.container_log_level),\n",
" \"SAGEMAKER_REGION\": sklearn_preprocessor.sagemaker_session.boto_region_name,\n",
" \"SAGEMAKER_PROGRAM\": sklearn_preprocessor.uploaded_code.script_name\n",
" }\n",
" },\n",
" {\n",
" 'Image': xgb.image_uri,\n",
" 'ModelDataUrl': xgb.model_data,\n",
" \"Environment\": {}\n",
" },\n",
" {\n",
" 'Image': scikit_learn_post_process_model.image_uri,\n",
" 'ModelDataUrl': scikit_learn_post_process_model.model_data,\n",
" 'Environment': {\n",
" \"SAGEMAKER_SUBMIT_DIRECTORY\": sklearn_preprocessor.uploaded_code.s3_prefix,\n",
" \"TRANSFORM_MODE\": \"inverse-label-transform\",\n",
" \"SAGEMAKER_CONTAINER_LOG_LEVEL\": str(sklearn_preprocessor.container_log_level),\n",
" \"SAGEMAKER_REGION\": sklearn_preprocessor.sagemaker_session.boto_region_name,\n",
" \"SAGEMAKER_PROGRAM\": sklearn_preprocessor.uploaded_code.script_name\n",
" }\n",
" },\n",
" ],\n",
" ExecutionRoleArn = role,\n",
")\n",
"model_name"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 4
}