{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a5903c8c-314f-4500-9864-c152f0b631b6",
   "metadata": {},
   "source": [
    "# AWS re:Invent Machine Learning Builders' Session\n",
    "\n",
    "#### WPS302: Identify improper payments with analytics and ML (Lab 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "---\n",
    "\n",
    "## Introduction\n",
    "#### Since 2003, the US federal government has made approximately &#36;1.7 trillion in improper payments, with an estimated &#36;206 billion made in FY 2020 alone. Improper payments are now anticipated to increase proportionally to new levels of federal spending, from the &#36;1 trillion infrastructure bill, to the anticipated &#36;3.5 trillion budget reconciliation plan.\n",
    "\n",
    "#### *How can we go beyond basic heuristic rulesets to help agencies identify improper payments at scale?*\n",
    "\n",
    "#### **In this lab we'll demonstrate how to train a classification model on an imbalanced dataset to predict fraudulent Medicare providers using the RandomForest algorithm. Additionally, we'll demonstrate using Script Mode and a Batch Transform.**\n",
    "\n",
    "</p>\n",
    "\n",
    "#### **Let's get started!**"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "---\n",
    "\n",
    "## 1. Setup\n",
    "<a id=section_1_0></a>"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "id": "388e8566-9cb1-4850-b7bf-cd02ae8f5518",
   "metadata": {},
   "source": [
    "### 1.1 Prerequisites\n",
    "<a id=section_1_1></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c099c6f-50f9-4fc9-9b74-c2707f109d17",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install imblearn\n",
    "# !pip install sagemaker\n",
    "!pip install matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fcece01-79bb-4d8d-8673-bf23279a8af0",
   "metadata": {},
   "source": [
    "### 1.2 Import packages and modules\n",
    "<a id=section_1_2></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e6c79b9-ae69-473b-bcc5-2610b6ca75f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import os\n",
    "\n",
    "# AWS SDK for Python and Sagemaker packages\n",
    "import boto3\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "from sagemaker.inputs import TrainingInput\n",
    "from sagemaker.serializers import CSVSerializer\n",
    "\n",
    "from math import sqrt\n",
    "\n",
    "# Imbalanced-learn packages\n",
    "from imblearn.pipeline import Pipeline\n",
    "from imblearn.over_sampling import SMOTE\n",
    "from imblearn.under_sampling import RandomUnderSampler\n",
    "\n",
    "# sklearn packages\n",
    "from sklearn.datasets import dump_svmlight_file\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import (\n",
    "    balanced_accuracy_score, \n",
    "    classification_report,\n",
    "    confusion_matrix,\n",
    "    ConfusionMatrixDisplay,    \n",
    "    plot_confusion_matrix,\n",
    "    roc_auc_score\n",
    ")\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4b6bd0b-688e-452b-a9c3-436b04058975",
   "metadata": {},
   "source": [
    "### 1.3 Global config settings\n",
    "<a id=section_1_3></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba228db3-5bb4-4cff-995d-5d4265ddcd21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set lab name\n",
    "lab_name = 'lab2'\n",
    "\n",
    "# Allow viewing of all columns and rows\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', None)\n",
    "\n",
    "# Create directory to store lab contents\n",
    "data_dir = './data/{}'.format(lab_name)\n",
    "if not os.path.exists(data_dir):\n",
    "    os.mkdir(data_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9400be1-e24d-4ed8-a72e-af718f8a3f3f",
   "metadata": {},
   "source": [
    "### 1.4 Global config variables\n",
    "<a id=section_1_4></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb847492-d1f4-44c7-88e9-581be7ca8512",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the IAM role and Sagemaker session\n",
    "try:\n",
    "    role = sagemaker.get_execution_role()\n",
    "except:\n",
    "    role = get_execution_role()\n",
    "\n",
    "# Get the SakeMaker session\n",
    "session = sagemaker.Session()\n",
    "\n",
    "print('Using IAM role arn: {}'.format(role))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7df71d9-c376-4e44-be3c-8c25cfec719d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup the S3 client\n",
    "s3_client = boto3.client('s3')\n",
    "\n",
    "# Set S3 settings\n",
    "bucket = session.default_bucket()\n",
    "prefix = 'fraud-detect-demo'\n",
    "\n",
    "print('Using S3 path: s3://{}/{}'.format(bucket, prefix))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9c564c2-7f27-437e-b87b-db75fb37ae5c",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 2. Exploratory Data Analysis\n",
    "<a id=section_2_0></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ace98ec0-7df0-4573-93d6-dead4faeed56",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2.1 Read the preprocessed medicare data\n",
    "<a id=section_2_1></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7149526b-f5fc-4f2f-8166-13da4a4f7d01",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('./data/processed_data_classification_v2.csv', delimiter=',')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d120236-9fbd-45a3-be66-08dab2758123",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2.2 View the dimensions of the dataset (#rows, #cols)\n",
    "<a id=section_2_2></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d54aeec4-fe71-40dd-a9e4-6c4c2f6eed57",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc0cf37e-661f-43ea-a64a-afb9c24cb584",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2.3 Visually inspect the first few rows in the dataset\n",
    "<a id=section_2_3></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e106477c-d6a3-47e9-a351-f33e220d3abb",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11893226-6fa3-4c9f-8cd6-ac32f1e43b19",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2.4 Check data for any nulls\n",
    "<a id=section_2_4></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f301d12-c286-473d-971d-db4384f82437",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.isnull().values.any()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d82e356-313c-40b0-96c0-9624c0f0f436",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2.5 Check for imbalance\n",
    "<a id=section_2_5></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86c06cb7-34c9-400d-b837-b0b30f3cfebd",
   "metadata": {},
   "source": [
    "Review the target (fraudulent_provider) value counts to check for imbalance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00af0530-3d4c-43ea-a1e6-66af0a4c3bd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "data['fraudulent_provider'].value_counts().plot(ax = ax, kind = 'bar', ylabel = 'frequency', xlabel = 'Transaction Type')\n",
    "plt.xticks(range(2), ['non-fraudulent', 'fraudulent'], rotation=0)\n",
    "plt.title(label=\"Number of distinct transaction types (non-fraud and fraud)\",\n",
    "          fontsize=12,\n",
    "          color=\"black\")\n",
    "plt.show()\n",
    "\n",
    "print('There are {} non-fradulent records and {} fradulent records'.format(\n",
    "    data['fraudulent_provider'].value_counts()[0], \n",
    "    data['fraudulent_provider'].value_counts()[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8c871f1-0787-4c5e-9d07-454f19620ad8",
   "metadata": {},
   "source": [
    "We see that the majority of data is **non-fraudulent** transactions, however, our goal is to train a model to identify **fradulent** transactions. Attempting to train a model on this dataset (as-is) may yield a high accuracy score because accuracy is calculated as:\n",
    "\n",
    "<pre>\n",
    "Total Number of Correct Predictions / Total Number of Predictions\n",
    "</pre>\n",
    "\n",
    "As a result, the model will perform well when predicting the majority class, but perform poorly at predicting the minority class due to lack of training examples.\n",
    "\n",
    "To address this challange, we will need to rebalance the dataset using sampling techniques that are designed to improve the performance of models that rely on imbalanced datasets. To help with this task, we'll use under and over sampling techniques from the [Imblearn package](https://imbalanced-learn.org/stable/user_guide.html#user-guide) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b859a624-4556-4671-85d5-e037f3a18799",
   "metadata": {
    "tags": [],
    "toc-hr-collapsed": true
   },
   "source": [
    "## 3. Preprocessing\n",
    "<a id=section_3_0></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c963071a-f926-4a91-9d5f-98d66472a814",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 3.1 Data Preparation\n",
    "<a id=section_3_1></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8f6be91-4e8e-409d-9138-559e0ed523fa",
   "metadata": {},
   "source": [
    "Remove the column headers from the dataset as SageMaker does not need headers for processing CSV files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52f03dde-c284-48e6-b137-587ea4fda7a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Removing column headers from CSV file\n",
    "feature_columns = data.columns[1:]\n",
    "label_column = data.columns[0]\n",
    "\n",
    "# Setting the datatype to float32\n",
    "features = data[feature_columns].values.astype('float32')\n",
    "labels = (data[label_column].values).astype('float32')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bb939be-52f5-4805-83d5-887ead5b8695",
   "metadata": {},
   "source": [
    "Split the dataset into train and test sets to evaluate the performance of our model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c61e4ac-ec3d-4a7c-bd5d-d85cbc134405",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train-test split\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    features, labels, test_size=0.5, stratify=labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "847d905a-28c5-47b7-9935-aec340c00507",
   "metadata": {},
   "source": [
    "Since the data is highly imbalanced, it is important to stratify across the data sets to ensure a near even distribution, so we set the test_size parameter to 0.5. The training dataset (X_train) will be used to fit the model and the testing dataset (X_test) will be used for predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84175027-6e1a-440a-a705-36645cf71f19",
   "metadata": {},
   "source": [
    "Display the size of the training and test datasets after the train-test split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "379dd46f-ba40-4d73-985e-61617cc77c0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The training dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(y_train), (y_train == 0).sum(), (y_train == 1).sum()))\n",
    "print('The test dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(y_test), (y_test == 0).sum(), (y_test == 1).sum()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2740242-f86e-4dcf-9198-6b6e3443b95e",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 3.2 Applying Synthetic Minority Over-sampling (SMOTE)\n",
    "<a id=section_3_2></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cedb293c-b8ed-431b-8aeb-2437b451bc33",
   "metadata": {},
   "source": [
    "The [sampling strategy](https://imbalanced-learn.org/stable/auto_examples/api/plot_sampling_strategy_usage.html#sphx-glr-auto-examples-api-plot-sampling-strategy-usage-py) for resampling an imbalanced dataset is very important for improving the performance of the model. \n",
    "\n",
    "For this lab, we'll set the [SMOTE sampling strategy](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to 0.95, which means SMOTE will create new samples until the minority class is equal to 95% of the majority class (2500 * 0.95 = 2375). Next, we'll set the [RandomUnderSampler sampling strategy](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html) to 1.0, which means we'll reduce the number of samples in the majority class (2500) to equal the new minority class size (2375).\n",
    "\n",
    "Feel free to expirement with different sampling strategy ratios to see the impact."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62c4ff8d-bbcb-4063-a8f9-a6a4bcc67e2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Oversample the minority class with SMOTE \n",
    "over = SMOTE(random_state=42, sampling_strategy=0.95)\n",
    "\n",
    "# Undersample the majority class to achieve about a 1:1 ratio.\n",
    "# The minority class will be the same amount (1 to 1) as the majority class\n",
    "under = RandomUnderSampler(random_state=42, sampling_strategy=1.0)\n",
    "\n",
    "# Add steps to parameter list\n",
    "steps = [('o', over), ('u', under)]\n",
    "\n",
    "# Create imblearn.pipeline and pass steps\n",
    "pipeline = Pipeline(steps=steps)\n",
    "\n",
    "# Fit and apply to the CMS dataset in a single transform\n",
    "X_smote, y_smote = pipeline.fit_resample(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a6709cb-ecd3-4831-8a92-d6e8640aa3c9",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 3.3 Check for imbalance\n",
    "<a id=section_3_3></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bf26af7-fb0d-4825-a07e-1d3ad275ada2",
   "metadata": {},
   "source": [
    "Review the target (fraudulent_provider) value counts to check for imbalance *after* applying data augmentation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e8f6939-dc3d-4f93-969f-c9d9cca3c59e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert to DataFrame for plotting\n",
    "df_y_smote = pd.DataFrame(y_smote.astype(int))\n",
    "\n",
    "# Plot\n",
    "fig, ax = plt.subplots()\n",
    "df_y_smote.value_counts().plot(ax = ax, kind = 'bar', ylabel = 'frequency', xlabel = 'Transaction Type')\n",
    "plt.xticks(range(2), ['non-fraudulent', 'fraudulent'], rotation=0)\n",
    "plt.title(label=\"Number of distinct transaction types (non-fraud and fraud)\",\n",
    "          fontsize=12,\n",
    "          color=\"black\")\n",
    "plt.show()\n",
    "\n",
    "print('There are {} non-fradulent records and {} fradulent records'.format(\n",
    "    df_y_smote.value_counts()[0], \n",
    "    df_y_smote.value_counts()[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4235d23d-5a76-4062-bcd6-13496a920743",
   "metadata": {},
   "outputs": [],
   "source": [
    "pct_chg = abs((len(df_y_smote) - len(data)) / len(data) * 100)\n",
    "\n",
    "print('Observe that by applying SMOTE and RandomUnderSampling our dataset has decreased in size by {:.2f}% as a result of downsampling of the majority class and upsampling of the minority class'.format(pct_chg))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b880394-8fc3-4745-b31c-35477c51f822",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 3.4 Test-train split for augmented dataset\n",
    "<a id=section_3_4></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4177659-2c40-4c58-9ec4-390a7653ef67",
   "metadata": {},
   "source": [
    "Split the resampled dataset - 80% will be used for training and 20% will be used for validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef776983-bd27-4d44-8c7e-4d99d4295815",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_smote_train, X_smote_validation, y_smote_train, y_smote_validation = train_test_split(\n",
    "    X_smote, y_smote, test_size=0.2, stratify=y_smote)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b672de0-53da-4ecf-913f-5710a971df02",
   "metadata": {},
   "source": [
    "Display the size of the training and test datasets after the train-test split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a5dc489-1fda-4c81-b457-35635001f02c",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The training dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(X_smote_train), (y_smote_train == 0).sum(), (y_smote_train == 1).sum()))\n",
    "print('The validation dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(X_smote_validation), (y_smote_validation == 0).sum(), (y_smote_validation == 1).sum()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a2494a9-54b6-4963-8af6-3a192fd6c0a4",
   "metadata": {},
   "source": [
    "### 3.5 Prepare datasets for training and evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd09b82d-7fee-4631-a93c-6f0b248f8afa",
   "metadata": {},
   "source": [
    "Ensure the first column in the dataset are the labels, then convert to DataFrames. We'll use the training and validation datasets (which we have over and undersampled) for training the model. We'll use the testing dataset for model evaluation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70bde8ca-c4c6-4053-ac11-ac4a251a7c48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rearrange the first column as target column\n",
    "trainX_concate = np.concatenate((y_smote_train.reshape(len(y_smote_train),1), X_smote_train), axis=1)\n",
    "trainX = pd.DataFrame(trainX_concate, index=None, columns=None)\n",
    "\n",
    "validationX_concate = np.concatenate((y_smote_validation.reshape(len(y_smote_validation),1), X_smote_validation), axis=1)\n",
    "validationX = pd.DataFrame(validationX_concate, index=None, columns=None)\n",
    "\n",
    "testX = pd.DataFrame(X_test, index=None, columns=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "211804aa-8176-4a40-bf01-3f136d4c5c0b",
   "metadata": {},
   "source": [
    "Save the files locally in CSV format. After this step, there should be three new CSV files visible in the folder: **./data/lab2**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1520b8d4-e50c-4b02-8ce0-2dc73a94331c",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainX.to_csv(\"{}/train.csv\".format(data_dir), header=False, index=False)\n",
    "validationX.to_csv(\"{}/validation.csv\".format(data_dir), header=False, index=False)\n",
    "testX.to_csv(\"{}/test.csv\".format(data_dir), header=False, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89f7b929-4ad4-4a76-b4b9-a2c93d9957c3",
   "metadata": {},
   "source": [
    "### 3.6 Upload the datasets to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "391d1d6c-d641-432d-a3dc-a92f52c1237d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the directory path in S3\n",
    "subdir = '{}/smote'.format(lab_name)\n",
    "\n",
    "train_path = session.upload_data(\n",
    "    path=\"{}/train.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/training'.format(prefix, subdir)\n",
    ")\n",
    "\n",
    "validation_path = session.upload_data(\n",
    "    path=\"{}/validation.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/validation'.format(prefix, subdir)\n",
    ")\n",
    "\n",
    "test_path = session.upload_data(\n",
    "    path=\"{}/test.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/testing'.format(prefix, subdir)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "982837f5-9103-407b-9731-a4a19800a5e9",
   "metadata": {},
   "source": [
    "Display the location of our datasets in S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75646e42-5e71-4921-9940-c0bddedaa3a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The S3 URI of the training dataset is: {}'.format(train_path))\n",
    "print('The S3 URI of the validation dataset is: {}'.format(validation_path))\n",
    "print('The S3 URI of the testing dataset is: {}'.format(test_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c404c762-0e5b-4d84-a5f6-676a70f2a46e",
   "metadata": {},
   "source": [
    "### 3.7 Set the output location for model artifacts in S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "849d4a83-a3fb-47d2-ab28-f55666f5a040",
   "metadata": {},
   "outputs": [],
   "source": [
    "output_location = 's3://{}/{}/{}/output/'.format(bucket, prefix, subdir)\n",
    "print('The S3 URI for model artifacts is: {}'.format(output_location))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01df6d50-b0f8-49ba-9e6c-929f6d1ba80d",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 4. Model Training\n",
    "<a id=section_4_0></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0741bb0d-9f4a-4b21-bf0c-126ef65f5463",
   "metadata": {},
   "source": [
    "### 4. Writing a *Script Mode* script\n",
    "<a id=section_4_1></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a982f291-4b5d-4970-9b9b-76948777d8b4",
   "metadata": {},
   "source": [
    "The below script contains both training and inference functionality and can run in SageMaker Training hardware. Detailed guidance here https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b8dbda3-d4b8-42c0-a82e-6fd972f3dfa2",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile script.py\n",
    "\n",
    "import argparse\n",
    "import joblib\n",
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "\n",
    "# inference function for model loading\n",
    "def model_fn(model_dir):\n",
    "    clf = joblib.load(os.path.join(model_dir, \"model.joblib\"))\n",
    "    return clf\n",
    "\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "\n",
    "    print(\"extracting arguments\")\n",
    "    parser = argparse.ArgumentParser()\n",
    "\n",
    "    # hyperparameters sent by the client are passed as command-line arguments to the script.\n",
    "    # to simplify the demo we don't use all sklearn RandomForest hyperparameters\n",
    "    parser.add_argument(\"--n-estimators\", type=int, default=10)\n",
    "    parser.add_argument(\"--min-samples-leaf\", type=int, default=3)\n",
    "\n",
    "    # Data, model, and output directories\n",
    "    parser.add_argument(\"--model-dir\", type=str, default=os.environ.get(\"SM_MODEL_DIR\"))\n",
    "    parser.add_argument(\"--train\", type=str, default=os.environ.get(\"SM_CHANNEL_TRAIN\"))\n",
    "    parser.add_argument(\"--test\", type=str, default=os.environ.get(\"SM_CHANNEL_TEST\"))\n",
    "    parser.add_argument(\"--train-file\", type=str, default=\"train.csv\")\n",
    "    parser.add_argument(\"--test-file\", type=str, default=\"validation.csv\")\n",
    "\n",
    "    args, _ = parser.parse_known_args()\n",
    "    \n",
    "    # read data from csv\n",
    "    print(\"reading data\")\n",
    "    train_df = pd.read_csv(os.path.join(args.train, args.train_file), header=None)\n",
    "    test_df = pd.read_csv(os.path.join(args.test, args.test_file), header=None)\n",
    "\n",
    "    # build training and testing dataset\n",
    "    print(\"building training and testing datasets\")\n",
    "    X_train = train_df[train_df.columns[1:]]\n",
    "    X_test = test_df[test_df.columns[1:]]\n",
    "    y_train = train_df[train_df.columns[0]]\n",
    "    y_test = test_df[test_df.columns[0]]\n",
    "\n",
    "    # train\n",
    "    print(\"training model\")\n",
    "    model = RandomForestClassifier(\n",
    "        n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1\n",
    "    )\n",
    "\n",
    "    model.fit(X_train, y_train)\n",
    "\n",
    "    # print accuracy\n",
    "    print(\"validating model\")\n",
    "    y_pred = model.predict(X_test)\n",
    "    acc = accuracy_score(y_test, y_pred)\n",
    "    auc = roc_auc_score(y_test, y_pred)\n",
    "    print(f\"Accuracy is: {acc}\")\n",
    "    print(f\"Area under the curve is: {auc}\")\n",
    "\n",
    "    # persist model\n",
    "    path = os.path.join(args.model_dir, \"model.joblib\")\n",
    "    joblib.dump(model, path)\n",
    "    print(\"model persisted at \" + path)\n",
    "    print(args.min_samples_leaf)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf4df646-ece7-4ef5-bd90-87e48fd17df3",
   "metadata": {},
   "source": [
    "### 4.2 SageMaker Training\n",
    "<a id=section_4_2></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfc0d235-9928-4fb2-aeda-68ac4a6d389f",
   "metadata": {},
   "source": [
    "Launching a training job with SageMaker Python SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dd44d10-9544-434a-a466-5920ace224ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We use the Estimator from the SageMaker Python SDK\n",
    "from sagemaker.sklearn.estimator import SKLearn\n",
    "\n",
    "FRAMEWORK_VERSION = \"0.23-1\"\n",
    "\n",
    "sklearn_estimator = SKLearn(\n",
    "    entry_point=\"script.py\",\n",
    "    role=role,\n",
    "    instance_count=1,\n",
    "    instance_type=\"ml.c5.xlarge\",\n",
    "    framework_version=FRAMEWORK_VERSION,\n",
    "    base_job_name=\"rf-scikit\",\n",
    "    metric_definitions=[{\"Name\": \"Accuracy\", \"Regex\": \"Accuracy is: ([0-9.]+).*$\"}],\n",
    "    output_path=output_location,\n",
    "    hyperparameters={\n",
    "        \"n-estimators\": 100,\n",
    "        \"min-samples-leaf\": 2\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ebbca96-0205-4b5c-914e-209632108bd2",
   "metadata": {},
   "source": [
    "**Note: This step will initiate a Sagemaker Training Job and will take approximately 2-3 minutes to complete. As the Sagemaker Training Job is running there will be a lot of logging data generated, this is normal. The job is successfully completed when you see output similar to the following:**\n",
    "<pre>\n",
    "...\n",
    "yyyy-mm-dd HH:mm:ss Completed - Training job completed\n",
    "...\n",
    "Training seconds: 77\n",
    "Billable seconds: 77\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e44b1ce-7a57-4c20-8e0a-d8a36fbc1479",
   "metadata": {},
   "outputs": [],
   "source": [
    "# kick off training job\n",
    "sklearn_estimator.fit({\"train\": train_path, \"test\": validation_path}, wait=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6f400dd-cfea-493a-8fbc-4bde42acdea3",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 5. Model Hosting\n",
    "<a id=section_5_0></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15e42149-49da-4f10-ac4d-ea00987008da",
   "metadata": {},
   "source": [
    "We can also use the trained model for asynchronous batch inference on S3 data using SageMaker Batch Transform."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a4d2fc8-2bbf-456d-b7ea-696d8cf22c74",
   "metadata": {},
   "source": [
    "### 5.1 Define an SKLearn Transformer from the trained SKLearn Estimator\n",
    "<a id=section_5_1></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76c58e95-1442-4a7d-b4be-d5476172ac9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "transformer = sklearn_estimator.transformer(instance_count=1, instance_type=\"ml.m5.xlarge\", strategy='MultiRecord', assemble_with=\"Line\", accept=\"text/csv\", \n",
    "                                            output_path=output_location)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc6118ff-996b-4252-8b8a-11320df75a0e",
   "metadata": {},
   "source": [
    "Now we deploy the estimator to an endpoint."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "211100a6-1d76-445f-9439-0e0f9ca91b96",
   "metadata": {},
   "source": [
    "### 5.2 Start a batch transform job and wait for it to finish\n",
    "<a id=section_5_2></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55dfb1a0-6fb6-4735-8d5b-dcc6c6b466ad",
   "metadata": {},
   "source": [
    "**Note: This step will create a batch transform job and will take approximately 3-4 minutes to complete. As the batch transform job is running there will be a lot of logging data generated, this is normal.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3a5b7fb-3bb8-403c-a6bc-d23f673e7b7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "transformer.transform(test_path, split_type=\"Line\", content_type=\"text/csv\")\n",
    "print(\"Waiting for transform job: \" + transformer.latest_transform_job.job_name)\n",
    "transformer.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "831e8235-098a-4894-9fd8-72fa4e1eaf47",
   "metadata": {},
   "source": [
    "After the transform job has completed, download the output data from S3. For each file \"f\" in the input data, we have a corresponding file \"f.out\" containing the predicted labels from each input row. We can compare the predicted labels to the true labels saved earlier."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e10e029-f62d-4d3d-b5da-fcfada520d8e",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 6. Model Evaluation\n",
    "<a id=section_6_0></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f7a14e2-3a6d-40d3-b730-d0e87c2738ac",
   "metadata": {},
   "source": [
    "### 6.1 Download the output data from S3 to local file system\n",
    "<a id=section_6_1></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7508d3e9-ecae-4c52-adf5-aeed4326bfd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_output = transformer.output_path\n",
    "output_file_name = \"test.csv.out\"\n",
    "\n",
    "!aws s3 cp {batch_output}{output_file_name} {data_dir}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d9d1b3b-73b8-40b2-a328-aef984d2de03",
   "metadata": {},
   "source": [
    "### 6.2 Get output predictions\n",
    "<a id=section_6_2></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43e71c4e-2937-49e6-b767-1f9a984fcbc9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import genfromtxt\n",
    "y_preds = genfromtxt(data_dir + '/' + output_file_name, delimiter=',') \n",
    "y_preds.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a85fc3a2-d805-45f8-ace1-05a7b6767090",
   "metadata": {},
   "source": [
    "### 6.3 Calculate balanced accuracy scores\n",
    "<a id=section_6_3></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4f1bc68-4658-4321-9214-2179df20de80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate balanced accuracy score\n",
    "print(\"Balanced accuracy = {}\".format(balanced_accuracy_score(y_test, y_preds)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf5f5bac-779c-4689-beb3-10cfad617987",
   "metadata": {},
   "source": [
    "### 6.4 Plot results in a confusion matrix\n",
    "<a id=section_6_4></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b57d0a8-037d-4e3b-a368-91c629ab5702",
   "metadata": {},
   "source": [
    "Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix, per-class precision, recall and f1-scores can also provide more information about the model's performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8362c80-d220-4a0f-907a-39332bb2d844",
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_confusion_matrix(y_true, y_predicted):\n",
    "\n",
    "    cm = confusion_matrix(y_true, y_predicted, labels=[0, 1])\n",
    "\n",
    "    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[\"non-fraud\", \"fraud\"])\n",
    "    disp.plot()\n",
    "\n",
    "    plt.title('Confusion Matrix')\n",
    "    plt.ylabel('Real Classes')\n",
    "    plt.xlabel('Predicted Classes')\n",
    "\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4adc4b6-d4c5-42ac-b7e4-970b4fbdf955",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the confusion matrix\n",
    "plot_confusion_matrix(y_test, y_preds)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "169ef179-97b1-4bde-bc01-d7f0a0ccbc4b",
   "metadata": {},
   "source": [
    "### 6.5 Display Classification Report\n",
    "<a id=section_6_5></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24b5fb71-6b95-472f-8562-107255c7a698",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(classification_report(\n",
    "    y_test, y_preds, target_names=['non-fraud', 'fraud']))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89424b62-41f1-4ab7-b110-550065854b49",
   "metadata": {},
   "source": [
    "### Clean up"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0777c760-fff0-4a90-87a4-004e24133cd6",
   "metadata": {},
   "source": [
    "### Data Acknowledgements\n",
    "\n",
    "The curated dataset used for this lab comes from the following Centers for Medicare & Medicaid Services dataset:\n",
    "https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4de55c1-943e-4ecc-9e26-5d19fdce7110",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}