{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Customer Churn Prediction with Amazon SageMaker Autopilot\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "_**Using AutoPilot to Predict Mobile Customer Departure**_\n", "\n", "---\n", "\n", "---\n", "\n", "Kernel `Python 3 (Data Science)` works well with this notebook.\n", "\n", "## Contents\n", "\n", "1. [Introduction](#Introduction)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Train](#Settingup)\n", "1. [Autopilot Results](#Results)\n", "1. [Host](#Host)\n", "1. [Cleanup](#Cleanup)\n", "\n", "\n", "---\n", "\n", "## Introduction\n", "\n", "Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.\n", "\n", "Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.\n", "\n", "We use an example of churn that is familiar to all of us–leaving a mobile phone operator. Seems like I can always find fault with my provider du jour! 
And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost-effective than losing and reacquiring a customer.\n", "\n", "---\n", "## Setup\n", "\n", "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", "- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "isConfigCell": true, "tags": [ "parameters" ] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "from sagemaker import get_execution_role\n", "\n", "region = boto3.Session().region_name\n", "\n", "session = sagemaker.Session()\n", "\n", "# You can modify the following to use a bucket of your choosing\n", "bucket = session.default_bucket()\n", "prefix = \"sagemaker/DEMO-autopilot-churn\"\n", "\n", "role = get_execution_role()\n", "\n", "# This is the client we will use to interact with SageMaker Autopilot\n", "sm = boto3.Session().client(service_name=\"sagemaker\", region_name=region)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll import the Python libraries we'll need for the remainder of the exercise." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import io\n", "import os\n", "import sys\n", "import time\n", "import json\n", "from IPython.display import display\n", "from time import strftime, gmtime\n", "import boto3\n", "import sagemaker\n", "from sagemaker.predictor import csv_serializer" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Data\n", "\n", "Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.\n", "\n", "The dataset we will use is synthetically generated, but indicative of the types of features you'd see in this use case." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client(\"s3\")\n", "s3.download_file(\n", " f\"sagemaker-example-files-prod-{region}\", \"datasets/tabular/synthetic/churn.txt\", \"churn.txt\"\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Upload the dataset to S3\n", "\n", "Before you run Autopilot on the dataset, first perform a check of the dataset to make sure that it has no obvious errors. The Autopilot process can take a long time, and it's generally a good practice to inspect the dataset before you start a job. 
This particular dataset is small, so you can inspect it in the notebook instance itself. If you have a larger dataset that will not fit in the notebook instance's memory, inspect the dataset offline using a big data analytics tool like Apache Spark. [Deequ](https://github.com/awslabs/deequ) is a library built on top of Apache Spark that can be helpful for performing checks on large datasets. Autopilot is capable of handling datasets up to 5 GB.\n", "\n", "Read the data into a Pandas data frame and take a look." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False. |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False. |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | AZ | 192 | 415 | 414-4276 | no | yes | 36 | 156.2 | 77 | 26.55 | 215.5 | 126 | 18.32 | 279.1 | 83 | 12.56 | 9.9 | 6 | 2.67 | 2 | False. |
| 3329 | WV | 68 | 415 | 370-3271 | no | no | 0 | 231.1 | 57 | 39.29 | 153.4 | 55 | 13.04 | 191.3 | 123 | 8.61 | 9.6 | 4 | 2.59 | 3 | False. |
| 3330 | RI | 28 | 510 | 328-8230 | no | no | 0 | 180.8 | 109 | 30.74 | 288.8 | 58 | 24.55 | 191.9 | 91 | 8.64 | 14.1 | 6 | 3.81 | 2 | False. |
| 3331 | CT | 184 | 510 | 364-6381 | yes | no | 0 | 213.8 | 105 | 36.35 | 159.6 | 84 | 13.57 | 139.2 | 137 | 6.26 | 5.0 | 10 | 1.35 | 2 | False. |
| 3332 | TN | 74 | 415 | 400-4344 | no | yes | 25 | 234.4 | 113 | 39.85 | 265.9 | 82 | 22.60 | 241.4 | 77 | 10.86 | 13.7 | 4 | 3.70 | 0 | False. |

3333 rows × 21 columns
\n", "