{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Customer Churn Prediction with XGBoost\n",
    "_**Using Gradient Boosted Trees to Predict Mobile Customer Departure**_\n",
    "\n",
    "---\n",
    "\n",
    "---\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [Background](#Background)\n",
    "1. [Setup](#Setup)\n",
    "1. [Data](#Data)\n",
    "1. [Train](#Train)\n",
    "1. [Compile](#Compile)\n",
    "1. [Host](#Host)\n",
    "  1. [Evaluate](#Evaluate)\n",
    "  1. [Relative cost of errors](#Relative-cost-of-errors)\n",
    "1. [Extensions](#Extensions)\n",
    "\n",
    "---\n",
    "\n",
    "## Background\n",
    "\n",
    "_이 노트북의 자세한 유즈 케이스 내용은 다음의 블로그에서 확인하실 수 있습니다. [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_\n",
    "\n",
    "고객을 잃는 것은 모든 기업에서 비용이 많이 듭니다. 행복하지 않은 고객을 조기에 발견하면 그들에게 머물 인센티브를 제공 할 수있는 기회를 제공합니다. 이 노트북은 고객 이탈 예측이라고도 하는 불만족스러운 고객을 자동으로 식별하기 위해 기계 학습 (ML) 을 사용하는 방법에 대해 설명합니다. ML 모델은 완벽한 예측을 거의 제공하지 않기 때문에 이 노트북은 ML 사용의 재무 결과를 결정할 때 예측 실수의 상대적 비용을 통합하는 방법에 대해서도 설명합니다.\n",
    "\n",
    "휴대폰 가입을 해지하는 우리 모두에게 익숙한 이탈의 예를 사용합니다. 고객이 떠날 생각이라고 알고 있다면 적시에 인센티브를 제공 할 수 있습니다. 항상 전화로 응대할 수 있도록 하거나 새로운 기능을 활성화 할 수 있습니다. 인센티브는 고객을 잃고 재확보하는 것보다 훨씬 더 비용 효율적입니다.\n",
    "\n",
    "---\n",
    "\n",
    "## Setup\n",
    "\n",
    "\n",
    "- 학습 및 모델 데이터에 사용할 S3 버킷과 prefix 입니다. 이 옵션은 노트북 인스턴스, 교육 및 호스팅과 동일한 지역 내에 있어야 합니다.\n",
    "- 데이터에 대한 교육 및 호스팅 액세스를 제공하는 데 사용되는 IAM 역할 arn입니다. 이러한 항목을 만드는 방법은 설명서를 참조하십시오. 노트북 인스턴스, 교육 및/또는 호스팅에 둘 이상의 역할이 필요한 경우 boto 정규 표현식을 적절한 전체 IAM 역할 arn 문자열로 교체하십시오."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "isConfigCell": true,
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "bucket = sess.default_bucket()\n",
    "prefix = \"sagemaker/DEMO-xgboost-churn\"\n",
    "\n",
    "# Define IAM role\n",
    "import boto3\n",
    "import re\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "role = get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "다음으로, 나머지 연습에 필요한 Python 라이브러리를 가져옵니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import io\n",
    "import os\n",
    "import sys\n",
    "import time\n",
    "import json\n",
    "from IPython.display import display\n",
    "from time import strftime, gmtime\n",
    "from sagemaker.inputs import TrainingInput\n",
    "from sagemaker.serializers import CSVSerializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Data\n",
    "\n",
    "이동 통신 사업자는 고객이 궁극적으로 이탈을 끝내고 서비스를 계속 사용하는 기록 기록을 가지고 있습니다. 우리는 학습이라는 프로세스를 사용하여 하나의 이동 통신 사업자의 이탈의 ML 모델을 구성하는데 이러한 정보를 사용할 수 있습니다. 모델을 학습한 후 임의의 고객의 프로필 정보 (모델을 교육하는 데 사용한 것과 동일한 프로필 정보) 를 모델에 전달하고 모델이 고객이 이탈할지 여부를 예측하도록 할 수 있습니다. 물론, 우리는 모델이 실수를 할 것으로 기대합니다. 결국 미래를 예측하는 것은 까다로운 사업입니다! 그러나 예측 오류를 처리하는 방법도 보여 드리겠습니다.\n",
    "\n",
    "The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.  Let's download and read that dataset in now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt ./"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "churn = pd.read_csv(\"./churn.txt\")\n",
    "pd.set_option(\"display.max_columns\", 500)\n",
    "churn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "현대 표준에 따르면, 이 데이터셋은 5,000개의 레코드만 있는 비교적 작은 데이터셋으로, 각 레코드는 미국 이동 통신 사업자의 고객 프로필을 설명하기 위해 21개의 속성을 사용합니다. \n",
    "속성은 다음과 같습니다:\n",
    "\n",
    "- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ\n",
    "- `Account Length`: 이 계정이 활성화된 기간 (일)\n",
    "- `Area Code`: the three-digit area code of the corresponding customer’s phone number\n",
    "- `Phone`: the remaining seven-digit phone number\n",
    "- `Int’l Plan`: whether the customer has an international calling plan: yes/no\n",
    "- `VMail Plan`: whether the customer has a voice mail feature: yes/no\n",
    "- `VMail Message`: 월별 평균 음성 메일 메시지 수\n",
    "- `Day Mins`: 하루 동안 사용된 총 통화 시간 (분)\n",
    "- `Day Calls`: 하루 동안 걸려온 총 통화 수\n",
    "- `Day Charge`: 주간 통화의 청구 비용\n",
    "- `Eve Mins, Eve Calls, Eve Charge`: 저녁 시간에 걸린 통화에 대한 청구 비용\n",
    "- `Night Mins`, `Night Calls`, `Night Charge`: 야간 통화에 대한 청구 비용\n",
    "- `Intl Mins`, `Intl Calls`, `Intl Charge`: 국제 통화에 대한 청구 비용\n",
    "- `CustServ Calls`: 고객 서비스에 걸려온 통화 수\n",
    "- `Churn?`: whether the customer left the service: true/false\n",
    "\n",
    "마지막, `Churn?`이 모델이 예측하기를 원하는 속성.  대상 속성이 이진이기 때문에, 우리의 모델은 이진 분류라고 이진 예측을 수행합니다.\n",
    "\n",
    "데이터 분석을 시작해 보죠."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Frequency tables for each categorical feature\n",
    "for column in churn.select_dtypes(include=[\"object\"]).columns:\n",
    "    display(pd.crosstab(index=churn[column], columns=\"% observations\", normalize=\"columns\"))\n",
    "\n",
    "# Histograms for each numeric features\n",
    "display(churn.describe())\n",
    "%matplotlib inline\n",
    "hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "다음을 확인할 수 있습니다:\n",
    "- `State` 매우 균등하게 분포 된 것처럼 보입니다.\n",
    "- `Phone` 실제 용도로 너무 많은 고유 값을 사용합니다.접두사를 구문 분석하는 것이 약간의 가치를 가질 수 있지만 할당되는 방법에 대한 더 많은 컨텍스트가 없으면 이 접두사를 사용하지 않아야 합니다.\n",
    "- 숫자 속성들은 `VMail Message`를 제외하고는 매우 잘 분포 되어 있고 정규분포 형태를 보입니다. `Area Code` 는 숫자 보다는 오브젝트 타입으로 변경해야 할 것 같습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "churn = churn.drop(\"Phone\", axis=1)\n",
    "churn[\"Area Code\"] = churn[\"Area Code\"].astype(object)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "각각의 피쳐와 타겟 사이의 관계를 한번 살펴봅니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for column in churn.select_dtypes(include=[\"object\"]).columns:\n",
    "    if column != \"Churn?\":\n",
    "        display(pd.crosstab(index=churn[column], columns=churn[\"Churn?\"], normalize=\"columns\"))\n",
    "\n",
    "for column in churn.select_dtypes(exclude=[\"object\"]).columns:\n",
    "    print(column)\n",
    "    hist = churn[[column, \"Churn?\"]].hist(by=\"Churn?\", bins=30)\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(churn.corr(numeric_only=True))\n",
    "pd.plotting.scatter_matrix(churn, figsize=(12, 12))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "다른 변수와 의존 관계가 높은 있는 몇몇 속성 들을 볼 수 있습니다. 이런 속성들을 모델 학습에 포함시키는 것은 단순히 불필요하거나, bias를 만들는 정도일 수 도 있지만 어떤 알고리즘에서는 치명적인 문제를 일으킬 수 있습니다. 이러한 속성을 제거 합시다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "churn = churn.drop([\"Day Charge\", \"Eve Charge\", \"Night Charge\", \"Intl Charge\"], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "이제 데이터 집합을 정리했으므로 사용할 알고리즘을 결정해 보겠습니다. 위에서 언급했듯이 높은 값과 낮은 (중간 값은 아님) 값이 이탈을 예측하는 일부 변수가 있는 것으로 보입니다. 선형 회귀 분석과 같은 알고리즘에서 이를 수용하려면 다항식 (또는 버킷) terms를 생성하게 됩니다. 대신 그라데이션 부스터 트리를 사용하여 이 문제를 모델링해 보겠습니다. Amazon SageMaker는 관리되는 분산 설정에서 학습한 다음 실시간 예측 엔드포인트로 호스팅하는 데 사용할 수 있는 XGBoost 컨테이너를 제공합니다. XGBoost는 피처와 대상 변수 간의 비선형 관계를 자연스럽게 고려하고 피처 간의 복잡한 상호 작용을 수용하는 그라디언트 부스트 트리를 사용합니다.\n",
    "\n",
    "Amazon SageMaker XGBoost를 위한 CSV 형식은:\n",
    "- 첫 번째 열에 예측 변수가 있어야 합니다.\n",
    "- 헤더 행이 없어야 합니다.\n",
    "\n",
    "하지만 먼저 범주 형 기능을 숫자 기능으로 변환해 보겠습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "churn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_data = pd.get_dummies(churn)\n",
    "model_data = pd.concat(\n",
    "    [model_data[\"Churn?_True.\"], model_data.drop([\"Churn?_False.\", \"Churn?_True.\"], axis=1)], axis=1\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "이제 데이터를 training, validation 및 test 세트로 분할해 보겠습니다. 이렇게하면 모델의 과적합을 방지하고 아직 보지 못한 데이터에 대한 모델 정확도를 테스트 할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data, validation_data, test_data = np.split(\n",
    "    model_data.sample(frac=1, random_state=1729),\n",
    "    [int(0.7 * len(model_data)), int(0.9 * len(model_data))],\n",
    ")\n",
    "train_data.to_csv(\"train.csv\", header=False, index=False)\n",
    "validation_data.to_csv(\"validation.csv\", header=False, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "train_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "validation_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "이제 이 파일들을 S3에 업로드 합시다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n",
    "    os.path.join(prefix, \"train/train.csv\")\n",
    ").upload_file(\"train.csv\")\n",
    "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n",
    "    os.path.join(prefix, \"validation/validation.csv\")\n",
    ").upload_file(\"validation.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "boto3.Session().resource(\"s3\").Bucket(bucket)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [x] Check S3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Train\n",
    "\n",
    "먼저 우리는 XGBoost algorithm containers를 지정합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "container = sagemaker.image_uris.retrieve(\"xgboost\", boto3.Session().region_name, \"1\")\n",
    "display(container)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "그다음, CSV 형식을 사용하므로 S3 파일을 지정할 `TrainingInput` 를 생성합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_input_train = TrainingInput(\n",
    "    s3_data=\"s3://{}/{}/train\".format(bucket, prefix), content_type=\"csv\"\n",
    ")\n",
    "s3_input_validation = TrainingInput(\n",
    "    s3_data=\"s3://{}/{}/validation/\".format(bucket, prefix), content_type=\"csv\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "어떤 인스턴스 타입을 선택할지 등의 변수와 XGBoost 의 하이퍼 파라미터를 정의합니다.  A few key hyperparameters are:\n",
    "- `max_depth` 는 알고리즘 내의 각 트리를 얼마나 깊게 만들 수 있는지를 제어합니다. 나무가 깊어지면 더 잘 맞을 수 있지만 계산 비용이 많이 들고 과적합으로 이어질 수 있습니다. 일반적으로 많은 수의 얕은 나무와 더 적은 수의 더 깊은 나무 사이에서 탐색해야하는 모델 성능에는 약간의 절충점이 있습니다.\n",
    "- `subsample` 은 학습 데이터의 샘플링을 제어합니다. 이 기술은 과적합을 줄이는 데 도움이 될 수 있지만 너무 낮게 설정하면 데이터 모델이 부족할 수도 있습니다.\n",
    "- `num_round` 는 부스팅 라운드 수를 제어합니다. 이것은 본질적으로 이전 반복의 잔차를 사용하여 학습되는 후속 모형입니다. 다시 말하지만, 라운드가 많으면 트레이닝 데이터에 더 잘 맞을 수 있지만 계산 비용이 많이 들거나 과적합으로 이어질 수 있습니다.\n",
    "- `eta` 는 부스팅의 각 라운드가 얼마나 공격적인지를 제어합니다. 값이 클수록 보수적 인 Boosting이 발생합니다.\n",
    "- `gamma` 는 나무가 얼마나 공격적으로 성장하는지 제어합니다. 값이 클수록 더 보수적인 모델이 됩니다.\n",
    "\n",
    "더우 자세한 XGBoost's hyperparmeter 내용은 여기서 확인 할 수 있습니다. [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess = sagemaker.Session()\n",
    "\n",
    "xgb = sagemaker.estimator.Estimator(\n",
    "    container,\n",
    "    role,\n",
    "    instance_count=1,\n",
    "    instance_type=\"ml.m4.xlarge\",\n",
    "    output_path=\"s3://{}/{}/output\".format(bucket, prefix),\n",
    "    sagemaker_session=sess,\n",
    ")\n",
    "xgb.set_hyperparameters(\n",
    "    max_depth=5,\n",
    "    eta=0.2,\n",
    "    gamma=4,\n",
    "    min_child_weight=6,\n",
    "    subsample=0.8,\n",
    "    silent=0,\n",
    "    objective=\"binary:logistic\",\n",
    "    num_round=100,\n",
    ")\n",
    "\n",
    "xgb.fit({\"train\": s3_input_train, \"validation\": s3_input_validation})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [x] check Sagemaker - Training jobs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Host\n",
    "\n",
    "이제 알고리즘을 학습했으므로 모델을 만들어 호스팅된 엔드포인트에 배포해 보겠습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "xgb_predictor = xgb.deploy(\n",
    "    initial_instance_count=1, instance_type=\"ml.m4.xlarge\", serializer=CSVSerializer()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "- [x] check Sagemaker - Inference - Endpoints"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Evaluate\n",
    "\n",
    "이제 호스팅 된 엔드 포인트가 실행되었으므로 http POST 요청만으로 모델에서 실시간 예측을 매우 쉽게 할 수 있습니다.하지만 먼저, 우리는 엔드 포인트 뒤에 모델에 우리의`test_data`NumPy 배열을 전달하기위한 설정 시리얼 라이저 및 디시리얼라이저 해야 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(data, rows=500):\n",
    "    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n",
    "    predictions = \"\"\n",
    "    for array in split_array:\n",
    "        predictions = \",\".join([predictions, xgb_predictor.predict(array).decode(\"utf-8\")])\n",
    "\n",
    "    return np.fromstring(predictions[1:], sep=\",\")\n",
    "\n",
    "\n",
    "predictions = predict(test_data.to_numpy()[:, 1:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "기계 학습 모델의 성능을 비교하는 방법에는 여러 가지가 있지만 실제 값과 예측 값을 비교하는 것부터 시작해 보겠습니다.이 경우, 우리는 단순히 고객이 이탈했는지 (`1`) 아닌지 (`0`) 예측하고 간단한 Confusion Matrix 를 생성합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "pd.crosstab(\n",
    "    index=test_data.iloc[:, 0],\n",
    "    columns=np.round(predictions),\n",
    "    rownames=[\"actual\"],\n",
    "    colnames=[\"predictions\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Clean up Endpoint\n",
    "\n",
    "이 노트북 사용이 끝났으면 아래 셀을 실행하세요. 이렇게 하면 생성한 호스팅된 엔드포인트가 제거되고 계속 켜져 있는 인스턴스로 인한 요금이 부과되는 것을 방지할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_predictor.delete_endpoint()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Appendices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Note, due to randomized elements of the algorithm, you results may differ slightly._\n",
    "\n",
    "Of the 48 churners, we've correctly predicted 39 of them (true positives). And, we incorrectly predicted 4 customers would churn who then ended up not doing so (false positives).  There are also 9 customers who ended up churning, that we predicted would not (false negatives).\n",
    "\n",
    "An important point here is that because of the `np.round()` function above we are using a simple threshold (or cutoff) of 0.5.  Our predictions from `xgboost` come out as continuous values between 0 and 1 and we force them into the binary classes that we began with.  However, because a customer that churns is expected to cost the company more than proactively trying to retain a customer who we think might churn, we should consider adjusting this cutoff.  That will almost certainly increase the number of false positives, but it can also be expected to increase the number of true positives and reduce the number of false negatives.\n",
    "\n",
    "To get a rough intuition here, let's look at the continuous values of our predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(predictions)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The continuous valued predictions coming from our model tend to skew toward 0 or 1, but there is sufficient mass between 0.1 and 0.9 that adjusting the cutoff should indeed shift a number of customers' predictions.  For example..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions > 0.3, 1, 0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that changing the cutoff from 0.5 to 0.3 results in 1 more true positives, 3 more false positives, and 1 fewer false negatives.  The numbers are small overall here, but that's 6-10% of customers overall that are shifting because of a change to the cutoff.  Was this the right decision?  We may end up retaining 3 extra customers, but we also unnecessarily incentivized 5 more customers who would have stayed.  Determining optimal cutoffs is a key step in properly applying machine learning in a real-world setting.  Let's discuss this more broadly and then apply a specific, hypothetical solution for our current problem.\n",
    "\n",
    "### Relative cost of errors\n",
    "\n",
    "Any practical binary classification problem is likely to produce a similarly sensitive cutoff. That by itself isn’t a problem. After all, if the scores for two classes are really easy to separate, the problem probably isn’t very hard to begin with and might even be solvable with simple rules instead of ML.\n",
    "\n",
    "More important, if I put an ML model into production, there are costs associated with the model erroneously assigning false positives and false negatives. I also need to look at similar costs associated with correct predictions of true positives and true negatives.  Because the choice of the cutoff affects all four of these statistics, I need to consider the relative costs to the business for each of these four outcomes for each prediction.\n",
    "\n",
    "#### Assigning costs\n",
    "\n",
    "What are the costs for our problem of mobile operator churn? The costs, of course, depend on the specific actions that the business takes. Let's make some assumptions here.\n",
    "\n",
    "First, assign the true negatives the cost of 0 USD. Our model essentially correctly identified a happy customer in this case, and we don’t need to do anything.\n",
    "\n",
    "False negatives are the most problematic, because they incorrectly predict that a churning customer will stay. We lose the customer and will have to pay all the costs of acquiring a replacement customer, including foregone revenue, advertising costs, administrative costs, point of sale costs, and likely a phone hardware subsidy. A quick search on the Internet reveals that such costs typically run in the hundreds of dollars so, for the purposes of this example, let's assume 500 USD. This is the cost of false negatives.\n",
    "\n",
    "Finally, for customers that our model identifies as churning, let's assume a retention incentive in the amount of 100 USD. If my provider offered me such a concession, I’d certainly think twice before leaving. This is the cost of both true positive and false positive outcomes. In the case of false positives (the customer is happy, but the model mistakenly predicted churn), we will “waste” the 100 USD concession. We probably could have spent that 100 USD more effectively, but it's possible we increased the loyalty of an already loyal customer, so that’s not so bad."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Finding the optimal cutoff\n",
    "\n",
    "It’s clear that false negatives are substantially more costly than false positives. Instead of optimizing for error based on the number of customers, we should be minimizing a cost function that looks like this:\n",
    "\n",
    "```txt\n",
    "$500 * FN(C) + $0 * TN(C) + $100 * FP(C) + $100 * TP(C)\n",
    "```\n",
    "\n",
    "FN(C) means that the false negative percentage is a function of the cutoff, C, and similar for TN, FP, and TP.  We need to find the cutoff, C, where the result of the expression is smallest.\n",
    "\n",
    "A straightforward way to do this, is to simply run a simulation over a large number of possible cutoffs.  We test 100 possible values in the for loop below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cutoffs = np.arange(0.01, 1, 0.01)\n",
    "costs = []\n",
    "for c in cutoffs:\n",
    "    costs.append(\n",
    "        np.sum(\n",
    "            np.sum(\n",
    "                np.array([[0, 100], [500, 100]])\n",
    "                * pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions > c, 1, 0))\n",
    "            )\n",
    "        )\n",
    "    )\n",
    "\n",
    "costs = np.array(costs)\n",
    "plt.plot(cutoffs, costs)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Cost is minimized near a cutoff of:\",\n",
    "    cutoffs[np.argmin(costs)],\n",
    "    \"for a cost of:\",\n",
    "    np.min(costs),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above chart shows how picking a threshold too low results in costs skyrocketing as all customers are given a retention incentive.  Meanwhile, setting the threshold too high results in too many lost customers, which ultimately grows to be nearly as costly.  The overall cost can be minimized at 8,400 USD by setting the cutoff to 0.46, which is substantially better than the 20k+ USD I would expect to lose by not taking any action."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Extensions\n",
    "\n",
    "This notebook showcased how to build a model that predicts whether a customer is likely to churn, and then how to optimally set a threshold that accounts for the cost of true positives, false positives, and false negatives.  There are several means of extending it including:\n",
    "- Some customers who receive retention incentives will still churn.  Including a probability of churning despite receiving an incentive in our cost function would provide a better ROI on our retention programs.\n",
    "- Customers who switch to a lower-priced plan or who deactivate a paid feature represent different kinds of churn that could be modeled separately.\n",
    "- Modeling the evolution of customer behavior. If usage is dropping and the number of calls placed to Customer Service is increasing, you are more likely to experience churn then if the trend is the opposite. A customer profile should incorporate behavior trends.\n",
    "- Actual training data and monetary cost assignments could be more complex.\n",
    "- Multiple models for each type of churn could be needed.\n",
    "\n",
    "Regardless of additional complexity, similar principles described in this notebook are likely apply."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-northeast-2:806072073708:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 4
}