{ "cells": [ { "cell_type": "markdown", "source": [ "# AWS re:Invent Machine Learning Builders' Session\n", "\n", "#### WPS302: Identify improper payments with analytics and ML (Lab 1)" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "---\n", "\n", "## Introduction\n", "#### Since 2003, the US federal government has made approximately $1.7 trillion in improper payments, with an estimated $206 billion made in FY 2020 alone. Improper payments are now anticipated to increase proportionally to new levels of federal spending, from the $1 trillion infrastructure bill, to the anticipated $3.5 trillion budget reconciliation plan.\n", "\n", "#### *How can we go beyond basic heuristic rulesets to help agencies identify improper payments at scale?*\n", "\n", "#### **In this lab we'll demonstrate how to train a classification model on an imbalanced dataset to predict fraudulent Medicare providers using the XGBoost algorithm. Additionally, we'll demonstrate using the Sagemaker Estimator and Real-time inference endpoints.**\n", "\n", "
\n", "\n", "#### **Let's get started!**" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "id": "1cfb339e-206d-4753-a7eb-3dd84e1f3339", "metadata": { "tags": [] }, "source": [ "---\n", "\n", "## 1. Setup\n", "" ] }, { "cell_type": "markdown", "id": "388e8566-9cb1-4850-b7bf-cd02ae8f5518", "metadata": {}, "source": [ "### 1.1 Prerequisites\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "7c099c6f-50f9-4fc9-9b74-c2707f109d17", "metadata": {}, "outputs": [], "source": [ "!pip install imblearn\n", "#!pip install sagemaker\n", "!pip install matplotlib" ] }, { "cell_type": "markdown", "id": "0fcece01-79bb-4d8d-8673-bf23279a8af0", "metadata": {}, "source": [ "### 1.2 Import packages and modules\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "2e6c79b9-ae69-473b-bcc5-2610b6ca75f6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import os\n", "\n", "# AWS SDK for Python and Sagemaker packages\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.serializers import CSVSerializer\n", "\n", "from math import sqrt\n", "\n", "# Imbalanced-learn packages\n", "from imblearn.pipeline import Pipeline\n", "from imblearn.over_sampling import SMOTE\n", "from imblearn.under_sampling import RandomUnderSampler\n", "\n", "# sklearn packages\n", "from sklearn.datasets import dump_svmlight_file\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import (\n", " balanced_accuracy_score, \n", " classification_report,\n", " confusion_matrix,\n", " ConfusionMatrixDisplay, \n", " plot_confusion_matrix,\n", " roc_auc_score\n", ")\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "b4b6bd0b-688e-452b-a9c3-436b04058975", "metadata": {}, "source": [ "### 1.3 Global config settings\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "ba228db3-5bb4-4cff-995d-5d4265ddcd21", "metadata": {}, "outputs": [], "source": [ "# Set lab name\n", "lab_name = 'lab1'\n", "\n", "# Allow viewing of all columns and rows\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', None)\n", "\n", "# Create directory to store lab contents\n", "data_dir = './data/{}'.format(lab_name)\n", "\n", "if not os.path.exists('./data'):\n", " os.mkdir('./data')\n", "\n", "if not os.path.exists(data_dir):\n", " os.mkdir(data_dir)" ] }, { "cell_type": "markdown", "id": "c9400be1-e24d-4ed8-a72e-af718f8a3f3f", "metadata": {}, "source": [ "### 1.4 Global config variables\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "cb847492-d1f4-44c7-88e9-581be7ca8512", "metadata": {}, "outputs": [], "source": [ "# Get the IAM role and Sagemaker session\n", "try:\n", " role = sagemaker.get_execution_role()\n", "except:\n", " role = get_execution_role()\n", "\n", "# Get the SakeMaker session\n", "session = sagemaker.Session()\n", "\n", "print('Using IAM role arn: {}'.format(role))" ] }, { "cell_type": "code", "execution_count": null, "id": "a7df71d9-c376-4e44-be3c-8c25cfec719d", "metadata": {}, "outputs": [], "source": [ "# Setup the S3 client\n", "s3_client = boto3.client('s3')\n", "\n", "# Set S3 settings\n", "bucket = session.default_bucket()\n", "prefix = 'fraud-detect-demo'\n", "\n", "print('Using S3 path: s3://{}/{}'.format(bucket, prefix))" ] }, { "cell_type": "markdown", "id": "a9c564c2-7f27-437e-b87b-db75fb37ae5c", "metadata": { 
"tags": [] }, "source": [ "## 2. Exploratory Data Analysis\n", "" ] }, { "cell_type": "markdown", "id": "ace98ec0-7df0-4573-93d6-dead4faeed56", "metadata": { "tags": [] }, "source": [ "### 2.1 Read the preprocessed medicare data\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "7149526b-f5fc-4f2f-8166-13da4a4f7d01", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('./data/processed_data_classification_v2.csv', delimiter=',')" ] }, { "cell_type": "markdown", "id": "8d120236-9fbd-45a3-be66-08dab2758123", "metadata": { "tags": [] }, "source": [ "### 2.2 View the dimensions of the dataset (#rows, #cols)\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "d54aeec4-fe71-40dd-a9e4-6c4c2f6eed57", "metadata": { "tags": [] }, "outputs": [], "source": [ "data.shape" ] }, { "cell_type": "markdown", "id": "cc0cf37e-661f-43ea-a64a-afb9c24cb584", "metadata": { "tags": [] }, "source": [ "### 2.3 Visually inspect the first few rows in the dataset\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "e106477c-d6a3-47e9-a351-f33e220d3abb", "metadata": {}, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "id": "11893226-6fa3-4c9f-8cd6-ac32f1e43b19", "metadata": { "tags": [] }, "source": [ "### 2.4 Check data for any nulls\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "8f301d12-c286-473d-971d-db4384f82437", "metadata": {}, "outputs": [], "source": [ "data.isnull().values.any()" ] }, { "cell_type": "markdown", "id": "0d82e356-313c-40b0-96c0-9624c0f0f436", "metadata": { "tags": [] }, "source": [ "### 2.5 Check for imbalance\n", "" ] }, { "cell_type": "markdown", "id": "86c06cb7-34c9-400d-b837-b0b30f3cfebd", "metadata": {}, "source": [ "Review the target (fraudulent_provider) value counts to check for imbalance" ] }, { "cell_type": "code", "execution_count": null, "id": "00af0530-3d4c-43ea-a1e6-66af0a4c3bd2", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "data['fraudulent_provider'].value_counts().plot(ax = ax, kind = 'bar', ylabel = 'frequency', xlabel = 'Transaction Type')\n", "plt.xticks(range(2), ['non-fraudulent', 'fraudulent'], rotation=0)\n", "plt.title(label=\"Number of distinct transaction types (non-fraud and fraud)\",\n", " fontsize=12,\n", " color=\"black\")\n", "plt.show()\n", "\n", "print('There are {} non-fradulent records and {} fradulent records'.format(\n", " data['fraudulent_provider'].value_counts()[0], \n", " data['fraudulent_provider'].value_counts()[1]))" ] }, { "cell_type": "markdown", "id": "e8c871f1-0787-4c5e-9d07-454f19620ad8", "metadata": {}, "source": [ "We see that the majority of data is **non-fraudulent** transactions, however, our goal is to train a model to identify **fradulent** transactions. Attempting to train a model on this dataset (as-is) may yield a high accuracy score because accuracy is calculated as:\n", "\n", "
\n",
    "Total Number of Correct Predictions / Total Number of Predictions\n",
    "
\n", "\n", "As a result, the model will perform well when predicting the majority class, but perform poorly at predicting the minority class due to lack of training examples.\n", "\n", "To address this challange, we will need to rebalance the dataset using sampling techniques that are designed to improve the performance of models that rely on imbalanced datasets. To help with this task, we'll use under and over sampling techniques from the [Imblearn package](https://imbalanced-learn.org/stable/user_guide.html#user-guide) " ] }, { "cell_type": "markdown", "id": "b859a624-4556-4671-85d5-e037f3a18799", "metadata": { "tags": [], "toc-hr-collapsed": true }, "source": [ "## 3. Preprocessing\n", "" ] }, { "cell_type": "markdown", "id": "c963071a-f926-4a91-9d5f-98d66472a814", "metadata": { "tags": [] }, "source": [ "### 3.1 Data Preparation\n", "" ] }, { "cell_type": "markdown", "id": "f8f6be91-4e8e-409d-9138-559e0ed523fa", "metadata": {}, "source": [ "Remove the column headers from the dataset as SageMaker does not need headers for processing CSV files" ] }, { "cell_type": "code", "execution_count": null, "id": "52f03dde-c284-48e6-b137-587ea4fda7a4", "metadata": {}, "outputs": [], "source": [ "# Removing column headers from CSV file\n", "feature_columns = data.columns[1:]\n", "label_column = data.columns[0]\n", "\n", "# Setting the datatype to float32\n", "features = data[feature_columns].values.astype('float32')\n", "labels = (data[label_column].values).astype('float32')" ] }, { "cell_type": "markdown", "id": "6bb939be-52f5-4805-83d5-887ead5b8695", "metadata": {}, "source": [ "Split the dataset into train and test sets to evaluate the performance of our model" ] }, { "cell_type": "code", "execution_count": null, "id": "1c61e4ac-ec3d-4a7c-bd5d-d85cbc134405", "metadata": {}, "outputs": [], "source": [ "# Train-test split\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " features, labels, test_size=0.5, stratify=labels)" ] }, { "cell_type": "markdown", "id": "847d905a-28c5-47b7-9935-aec340c00507", "metadata": {}, "source": [ "Since the data is highly imbalanced, it is important to stratify across the data sets to ensure a near even distribution, so we set the test_size parameter to 0.5. 
, { "cell_type": "markdown", "id": "84175027-6e1a-440a-a705-36645cf71f19", "metadata": {}, "source": [ "Display the size of the training and test datasets after the train-test split" ] }, { "cell_type": "code", "execution_count": null, "id": "379dd46f-ba40-4d73-985e-61617cc77c0b", "metadata": {}, "outputs": [], "source": [ "print('The training dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(y_train), (y_train == 0).sum(), (y_train == 1).sum()))\n", "print('The test dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(y_test), (y_test == 0).sum(), (y_test == 1).sum()))" ] }, { "cell_type": "markdown", "id": "c2740242-f86e-4dcf-9198-6b6e3443b95e", "metadata": { "tags": [] }, "source": [ "### 3.2 Applying Synthetic Minority Over-sampling (SMOTE)\n", "" ] }, { "cell_type": "markdown", "id": "cedb293c-b8ed-431b-8aeb-2437b451bc33", "metadata": {}, "source": [ "The [sampling strategy](https://imbalanced-learn.org/stable/auto_examples/api/plot_sampling_strategy_usage.html#sphx-glr-auto-examples-api-plot-sampling-strategy-usage-py) used to resample an imbalanced dataset is very important for improving the performance of the model. \n", "\n", "For this lab, we'll set the [SMOTE sampling strategy](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to 0.95, which means SMOTE will create new samples until the minority class is equal to 95% of the majority class (2500 * 0.95 = 2375). Next, we'll set the [RandomUnderSampler sampling strategy](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html) to 1.0, which means we'll reduce the number of samples in the majority class (2500) to equal the new minority class size (2375).\n", "\n", "Feel free to experiment with different sampling strategy ratios to see the impact."
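, "\n", "\n", "To make the arithmetic concrete, here is a self-contained sketch on synthetic data (not the CMS dataset) showing how the two sampling_strategy values interact:\n", "\n", "```python\n", "from collections import Counter\n", "\n", "import numpy as np\n", "from imblearn.over_sampling import SMOTE\n", "from imblearn.under_sampling import RandomUnderSampler\n", "\n", "# Synthetic imbalanced data: 2500 majority (0) and 250 minority (1) samples\n", "X_toy = np.random.rand(2750, 4).astype('float32')\n", "y_toy = np.array([0] * 2500 + [1] * 250)\n", "\n", "X_o, y_o = SMOTE(sampling_strategy=0.95).fit_resample(X_toy, y_toy)\n", "print(Counter(y_o))  # {0: 2500, 1: 2375} -- minority grown to 95% of majority\n", "\n", "X_u, y_u = RandomUnderSampler(sampling_strategy=1.0).fit_resample(X_o, y_o)\n", "print(Counter(y_u))  # {0: 2375, 1: 2375} -- majority trimmed to a 1:1 ratio\n", "```"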
] }, { "cell_type": "code", "execution_count": null, "id": "62c4ff8d-bbcb-4063-a8f9-a6a4bcc67e2b", "metadata": {}, "outputs": [], "source": [ "# Oversample the minority class with SMOTE \n", "over = SMOTE(random_state=42, sampling_strategy=0.95)\n", "\n", "# Undersample the majority class to achieve about a 1:1 ratio.\n", "# The minority class will be the same amount (1 to 1) as the majority class\n", "under = RandomUnderSampler(random_state=42, sampling_strategy=1.0)\n", "\n", "# Add steps to parameter list\n", "steps = [('o', over), ('u', under)]\n", "\n", "# Create imblearn.pipeline and pass steps\n", "pipeline = Pipeline(steps=steps)\n", "\n", "# Fit and apply to the CMS dataset in a single transform\n", "X_smote, y_smote = pipeline.fit_resample(X_train, y_train)" ] }, { "cell_type": "markdown", "id": "8a6709cb-ecd3-4831-8a92-d6e8640aa3c9", "metadata": { "tags": [] }, "source": [ "### 3.3 Check for imbalance\n", "" ] }, { "cell_type": "markdown", "id": "2bf26af7-fb0d-4825-a07e-1d3ad275ada2", "metadata": {}, "source": [ "Review the target (fraudulent_provider) value counts to check for imbalance *after* applying data augmentation" ] }, { "cell_type": "code", "execution_count": null, "id": "5e8f6939-dc3d-4f93-969f-c9d9cca3c59e", "metadata": {}, "outputs": [], "source": [ "# Convert to DataFrame for plotting\n", "df_y_smote = pd.DataFrame(y_smote.astype(int))\n", "\n", "# Plot\n", "fig, ax = plt.subplots()\n", "df_y_smote.value_counts().plot(ax = ax, kind = 'bar', ylabel = 'frequency', xlabel = 'Transaction Type')\n", "plt.xticks(range(2), ['non-fraudulent', 'fraudulent'], rotation=0)\n", "plt.title(label=\"Number of distinct transaction types (non-fraud and fraud)\",\n", " fontsize=12,\n", " color=\"black\")\n", "plt.show()\n", "\n", "print('There are {} non-fradulent records and {} fradulent records'.format(\n", " df_y_smote.value_counts()[0], \n", " df_y_smote.value_counts()[1]))" ] }, { "cell_type": "code", "execution_count": null, "id": "4235d23d-5a76-4062-bcd6-13496a920743", "metadata": {}, "outputs": [], "source": [ "pct_chg = abs((len(df_y_smote) - len(data)) / len(data) * 100)\n", "\n", "print('Observe that by applying SMOTE and RandomUnderSampling our dataset has decreased in size by {:.2f}% as a result of downsampling of the majority class and upsampling of the minority class'.format(pct_chg))" ] }, { "cell_type": "markdown", "id": "8b880394-8fc3-4745-b31c-35477c51f822", "metadata": { "tags": [] }, "source": [ "### 3.4 Test-train split for augmented dataset\n", "" ] }, { "cell_type": "markdown", "id": "f4177659-2c40-4c58-9ec4-390a7653ef67", "metadata": {}, "source": [ "Split the augmented dataset - 80% will be used for training and 20% will be used for validation" ] }, { "cell_type": "code", "execution_count": null, "id": "ef776983-bd27-4d44-8c7e-4d99d4295815", "metadata": {}, "outputs": [], "source": [ "X_smote_train, X_smote_validation, y_smote_train, y_smote_validation = train_test_split(\n", " X_smote, y_smote, test_size=0.2, stratify=y_smote)" ] }, { "cell_type": "markdown", "id": "2b672de0-53da-4ecf-913f-5710a971df02", "metadata": {}, "source": [ "Display the size of the training and test datasets after the train-test split" ] }, { "cell_type": "code", "execution_count": null, "id": "0a5dc489-1fda-4c81-b457-35635001f02c", "metadata": {}, "outputs": [], "source": [ "print('The training dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(X_smote_train), (y_smote_train == 0).sum(), (y_smote_train == 1).sum()))\n", 
"print('The validation dataset contains {} total rows ({} non-fraudulent transactions, {} fraudulent transactions)'.format(len(X_smote_validation), (y_smote_validation == 0).sum(), (y_smote_validation == 1).sum()))" ] }, { "cell_type": "markdown", "id": "5a2494a9-54b6-4963-8af6-3a192fd6c0a4", "metadata": {}, "source": [ "### 3.5 Prepare datasets for training and evaluation" ] }, { "cell_type": "markdown", "id": "bd09b82d-7fee-4631-a93c-6f0b248f8afa", "metadata": {}, "source": [ "Ensure the first column in the dataset are the labels, then convert to DataFrames. We'll use the training and validation datasets (which we have over and undersampled) for training the model. We'll use the testing dataset for model evaluation. " ] }, { "cell_type": "code", "execution_count": null, "id": "70bde8ca-c4c6-4053-ac11-ac4a251a7c48", "metadata": {}, "outputs": [], "source": [ "# Rearrange the first column as target column\n", "trainX_concate = np.concatenate((y_smote_train.reshape(len(y_smote_train),1), X_smote_train), axis=1)\n", "trainX = pd.DataFrame(trainX_concate, index=None, columns=None)\n", "\n", "validationX_concate = np.concatenate((y_smote_validation.reshape(len(y_smote_validation),1), X_smote_validation), axis=1)\n", "validationX = pd.DataFrame(validationX_concate, index=None, columns=None)\n", "\n", "testX = pd.DataFrame(X_test, index=None, columns=None)" ] }, { "cell_type": "markdown", "id": "211804aa-8176-4a40-bf01-3f136d4c5c0b", "metadata": {}, "source": [ "Save the files locally in CSV format. After this step, there should be three new CSV files visible in the folder: **./data/lab1**" ] }, { "cell_type": "code", "execution_count": null, "id": "1520b8d4-e50c-4b02-8ce0-2dc73a94331c", "metadata": {}, "outputs": [], "source": [ "trainX.to_csv(\"{}/train.csv\".format(data_dir), header=False, index=False)\n", "validationX.to_csv(\"{}/validation.csv\".format(data_dir), header=False, index=False)\n", "testX.to_csv(\"{}/test.csv\".format(data_dir), header=False, index=False)" ] }, { "cell_type": "markdown", "id": "89f7b929-4ad4-4a76-b4b9-a2c93d9957c3", "metadata": {}, "source": [ "### 3.6 Upload the datasets to S3" ] }, { "cell_type": "code", "execution_count": null, "id": "391d1d6c-d641-432d-a3dc-a92f52c1237d", "metadata": {}, "outputs": [], "source": [ "# Set the directory path in S3\n", "subdir = '{}/smote'.format(lab_name)\n", "\n", "train_path = session.upload_data(\n", " path=\"{}/train.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/training'.format(prefix, subdir)\n", ")\n", "\n", "validation_path = session.upload_data(\n", " path=\"{}/validation.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/validation'.format(prefix, subdir)\n", ")\n", "\n", "test_path = session.upload_data(\n", " path=\"{}/test.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/testing'.format(prefix, subdir)\n", ")" ] }, { "cell_type": "markdown", "id": "982837f5-9103-407b-9731-a4a19800a5e9", "metadata": {}, "source": [ "Display the location of our datasets in S3" ] }, { "cell_type": "code", "execution_count": null, "id": "75646e42-5e71-4921-9940-c0bddedaa3a3", "metadata": {}, "outputs": [], "source": [ "print('The S3 URI of the training dataset is: {}'.format(train_path))\n", "print('The S3 URI of the validation dataset is: {}'.format(validation_path))\n", "print('The S3 URI of the testing dataset is: {}'.format(test_path))" ] }, { "cell_type": "markdown", "id": "c404c762-0e5b-4d84-a5f6-676a70f2a46e", "metadata": {}, "source": [ "### 3.7 Set the output location for model artifacts in S3" ] }, { 
"cell_type": "code", "execution_count": null, "id": "849d4a83-a3fb-47d2-ab28-f55666f5a040", "metadata": {}, "outputs": [], "source": [ "output_location = 's3://{}/{}/{}/output/'.format(bucket, prefix, subdir)\n", "print('The S3 URI for model artifacts is: {}'.format(output_location))" ] }, { "cell_type": "markdown", "id": "01df6d50-b0f8-49ba-9e6c-929f6d1ba80d", "metadata": { "tags": [] }, "source": [ "## 4. Model Training\n", "" ] }, { "cell_type": "markdown", "id": "0741bb0d-9f4a-4b21-bf0c-126ef65f5463", "metadata": {}, "source": [ "### 4.1 Get the container URI for running XGBoost\n", "" ] }, { "cell_type": "markdown", "id": "a982f291-4b5d-4970-9b9b-76948777d8b4", "metadata": {}, "source": [ "We will use the Amazon XGBoost supervised learning algorithm for classification" ] }, { "cell_type": "code", "execution_count": null, "id": "9b8dbda3-d4b8-42c0-a82e-6fd972f3dfa2", "metadata": {}, "outputs": [], "source": [ "# Retrieves the ECR URI for the pre-built SageMaker XGBoost Docker image\n", "container = sagemaker.image_uris.retrieve(\"xgboost\", boto3.Session().region_name, \"latest\")\n", "\n", "print('Using container URI: {}'.format(container))" ] }, { "cell_type": "markdown", "id": "cf4df646-ece7-4ef5-bd90-87e48fd17df3", "metadata": {}, "source": [ "### 4.2 Train the model\n", "" ] }, { "cell_type": "markdown", "id": "dfc0d235-9928-4fb2-aeda-68ac4a6d389f", "metadata": {}, "source": [ "In this step, we train our XGBoost model via the SakeMaker estimator object. Estimators are a high level interface for SageMaker training for handling end-to-end Amazon SageMaker training and deployment tasks.\n", "\n", "**Note: This step will initiate a Sagemaker Training Job and will take approximately 3-4 minutes to complete. As the Sagemaker Training Job is running there will be a lot of logging data generated, this is normal. The job is successfully completed when you see output similar to the following:**\n", "
\n",
    "...\n",
    "yyyy-mm-dd HH:mm:ss Completed - Training job completed\n",
    "...\n",
    "Training seconds: 163\n",
    "Billable seconds: 163\n",
    "
" ] }, { "cell_type": "code", "execution_count": null, "id": "0dd44d10-9544-434a-a466-5920ace224ab", "metadata": {}, "outputs": [], "source": [ "# Set basic hyperparams\n", "hyperparams = {\n", " 'objective':'binary:logistic',\n", " 'eval_metric':'auc',\n", " 'num_round':200, \n", " 'silent':0,\n", " 'seed':42 \n", "}\n", "\n", "# Create an Estimator object\n", "clf = sagemaker.estimator.Estimator(container,\n", " role,\n", " hyperparameters=hyperparams,\n", " instance_count=1, \n", " instance_type='ml.c5.xlarge',\n", " output_path=output_location,\n", " sagemaker_session=session)\n", "\n", "\n", "# Define the data type and paths to the training and validation datasets\n", "content_type = \"text/csv\" \n", "train_input = TrainingInput(train_path, content_type=content_type)\n", "validation_input = TrainingInput(validation_path, content_type=content_type)\n", "\n", "# Fit the model\n", "clf.fit({'train': train_input, 'validation': validation_input})" ] }, { "cell_type": "markdown", "id": "b6f400dd-cfea-493a-8fbc-4bde42acdea3", "metadata": { "tags": [] }, "source": [ "## 5. Model Hosting\n", "" ] }, { "cell_type": "markdown", "id": "5a4d2fc8-2bbf-456d-b7ea-696d8cf22c74", "metadata": {}, "source": [ "### 5.1 Create a real-time inference endpoint\n", "" ] }, { "cell_type": "markdown", "id": "fc6118ff-996b-4252-8b8a-11320df75a0e", "metadata": {}, "source": [ "Now we deploy the estimator to an endpoint.\n", "\n", "**Note: This step will create an inference endpoint and will take approximately 3-4 minutes to complete. The task is successfully completed when you see output similar to the following:**\n", "\n", "--------!" ] }, { "cell_type": "code", "execution_count": null, "id": "c8c8dd01-f375-4fb8-a3fc-5f063dd3e5d9", "metadata": {}, "outputs": [], "source": [ "# Serialize data to a CSV-formatted string\n", "csv_serializer = CSVSerializer()\n", "\n", "# Create a real-time inference endpoint that hosts our trained model \n", "xgb_predictor = clf.deploy(initial_instance_count=1,\n", " instance_type='ml.m5.xlarge', \n", " serializer=csv_serializer)" ] }, { "cell_type": "markdown", "id": "2e10e029-f62d-4d3d-b5da-fcfada520d8e", "metadata": { "tags": [] }, "source": [ "## 6. Model Evaluation\n", "" ] }, { "cell_type": "markdown", "id": "831e8235-098a-4894-9fd8-72fa4e1eaf47", "metadata": {}, "source": [ "Once we have trained the model we can use it to make predictions for the test set." ] }, { "cell_type": "markdown", "id": "2f7a14e2-3a6d-40d3-b730-d0e87c2738ac", "metadata": {}, "source": [ "### 6.1 Create a wrapper function for model testing\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "7508d3e9-ecae-4c52-adf5-aeed4326bfd7", "metadata": {}, "outputs": [], "source": [ "# Because we have a large test set, we call predict on smaller batches\n", "def predict(current_predictor, df, rows=500):\n", " \"\"\"\n", " A wrapper function to invoke the Estimator's predict function using\n", " a for loop. 
\n", " \n", " Parameters:\n", " current_predictor: The sagemaker.estimator.Estimator object\n", " df: a DataFrame object containing observations without the target feature\n", " rows: number of observations passed to the predict function per batch\n", " \n", " Returns:\n", " predictions: An array of predictions (of dtype float64)\n", " \"\"\"\n", " \n", " # Split an array into multiple sub-arrays by dividing num of observations by rows parameter\n", " split_array = np.array_split(df, int(df.shape[0] / float(rows) + 1))\n", " \n", " # Initialize variable to store prediction results\n", " predictions = ''\n", " \n", " # Call the Estimator's predict function\n", " for array in split_array:\n", " predictions = ','.join([predictions, current_predictor.predict(array).decode('utf-8')])\n", "\n", " # Return\n", " return np.fromstring(predictions[1:], sep=',')" ] }, { "cell_type": "markdown", "id": "6d9d1b3b-73b8-40b2-a328-aef984d2de03", "metadata": {}, "source": [ "### 6.2 Test the model\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "43e71c4e-2937-49e6-b767-1f9a984fcbc9", "metadata": {}, "outputs": [], "source": [ "# Test the model by invoking the real-time inference endpoint with observations from the test dataset\n", "raw_preds = predict(xgb_predictor, X_test)" ] }, { "cell_type": "code", "execution_count": null, "id": "f026726c-d7bd-4d50-9e27-13b5c02b02c7", "metadata": {}, "outputs": [], "source": [ "# Let's have look a quick look at the first ten predictions (values between 0 and 1) returned by our trained model\n", "print('The first ten raw predictions: {}'.format(np.round(raw_preds[:10], 3)))" ] }, { "cell_type": "markdown", "id": "d1eddb41-a33a-4ce8-83b6-246142291690", "metadata": {}, "source": [ "Our next step is to determine the threshold for classifying each prediction as non-fradulent (0) or fradulent (1). For relatively balanaced datasets, the default binary classification threshold is typically **0.50**, meaning that a prediction below 0.50 is classified as non-fradulent (0), and a value above 0.50 is classified as fradulent (1). However, for highly imbalanced datasets, we need to adjust the classification threshold, a technique known as **threshold moving.**" ] }, { "cell_type": "markdown", "id": "a85fc3a2-d805-45f8-ace1-05a7b6767090", "metadata": {}, "source": [ "### 6.3 Calculate balanced accuracy scores\n", "" ] }, { "cell_type": "markdown", "id": "4489d683-0406-4d9c-b84f-673dd5267ea5", "metadata": {}, "source": [ "We will use a few measures from the scikit-learn package to evaluate the performance of our model. When dealing with an imbalanced dataset, we need to choose metrics that take into account the frequency of each class in the data.\n", "\n", "We will use [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score)\n", "\n", "We can bring a balance between the metrics again by adjusting our classification threshold (threshold between labeling a point as fraud or not). We can try different thresholds to see if they affect the result of the classification." 
] }, { "cell_type": "code", "execution_count": null, "id": "37245e63-617e-48ce-9469-8bb1ab905bcf", "metadata": {}, "outputs": [], "source": [ "# Calculate balanced accuracy scores for different threshold values\n", "proposed_treshold = 0.0\n", "proposed_score = 0.0\n", "\n", "# Iterate over 1000 values between 0.1 and 0.99 to find the maximum balanced accuracy score and classification threshold\n", "for thres in np.linspace(0.1, 0.99, num=1000):\n", " smote_thres_preds = np.where(raw_preds > thres, 1, 0)\n", " score = balanced_accuracy_score(y_test, smote_thres_preds)\n", " \n", " # Set the max score \n", " if proposed_score <= score:\n", " proposed_score = score\n", " proposed_treshold = thres" ] }, { "cell_type": "code", "execution_count": null, "id": "20d0b45a-5b8a-4f22-9c47-5e7865cd69ae", "metadata": {}, "outputs": [], "source": [ "print('We will use the binary classification threshold of {:.5f} as it yields the highest balanced accuracy score: {:.5f}'.format(proposed_treshold, proposed_score))" ] }, { "cell_type": "code", "execution_count": null, "id": "edc4692e-5545-49a0-9bb1-6554bfa75f31", "metadata": {}, "outputs": [], "source": [ "# Use the proposed threshold to convert the predictions returned by the model to 0 or 1.\n", "y_preds = np.where(raw_preds >= proposed_treshold, 1, 0)" ] }, { "cell_type": "code", "execution_count": null, "id": "f243b651-d7b1-4265-b0f2-19798a47eb13", "metadata": {}, "outputs": [], "source": [ "# Let's have a quick look at the first ten predictions after applying the proposed binary classification threshold\n", "print('The first ten binary class predictions: {}'.format(np.round(y_preds[:10], 3)))" ] }, { "cell_type": "markdown", "id": "bf5f5bac-779c-4689-beb3-10cfad617987", "metadata": {}, "source": [ "### 6.4 Plot results in a confusion matrix\n", "" ] }, { "cell_type": "markdown", "id": "5b57d0a8-037d-4e3b-a368-91c629ab5702", "metadata": {}, "source": [ "Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix, per-class precision, recall and f1-scores can also provide more information about the model's performance." 
] }, { "cell_type": "code", "execution_count": null, "id": "b8362c80-d220-4a0f-907a-39332bb2d844", "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(y_true, y_predicted):\n", " \n", " disp = ConfusionMatrixDisplay.from_predictions(\n", " y_true,\n", " y_predicted,\n", " display_labels=['Non Fraud', 'Fraud'],\n", " cmap='cividis')" ] }, { "cell_type": "code", "execution_count": null, "id": "e4adc4b6-d4c5-42ac-b7e4-970b4fbdf955", "metadata": {}, "outputs": [], "source": [ "# Display the confusion matrix\n", "plot_confusion_matrix(y_test, y_preds)" ] }, { "cell_type": "markdown", "id": "24e7a964-85fb-4b98-a600-6dbb32a5e1e0", "metadata": {}, "source": [ "Let's interpret the results from the confusion matrix:\n", "\n", "The *positive class* (1) is **fraud** and the *negative class* (0) is **non-fraud**.\n", "\n", "- True Negatives (top left): 2220 transactions were labeled as *non-fraud* and the model correctly classified them as *non-fraud*.\n", "- True Positives (bottom right): 164 transactions were labeled as *fraud* and the model correctly classified them as *fraud*\n", "- False Positives (top right): 280 transactions were labled as *non-fraud* and the model incorrectly classified them as *fraud*.\n", "- False Negatives (bottom left): 86 transactions were labled as *fraud* and the model incorrectly classified them as *non-fraud*.\n", "\n", "Note: due to randomization during model training, your results may vary slightly for these metrics. " ] }, { "cell_type": "markdown", "id": "169ef179-97b1-4bde-bc01-d7f0a0ccbc4b", "metadata": {}, "source": [ "### 6.5 Display Classification Report\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "24b5fb71-6b95-472f-8562-107255c7a698", "metadata": {}, "outputs": [], "source": [ "# Create a report showing the main classification metrics \n", "smote_cls_rpt = classification_report(\n", " y_test, y_preds, target_names=['non-fraud', 'fraud'], digits=4, output_dict=True)\n", "\n", "# Render as DataFrame\n", "df_smote_cls_rpt = pd.DataFrame(smote_cls_rpt)\n", "df_smote_cls_rpt.T" ] }, { "cell_type": "markdown", "id": "c9824e07-825a-4832-ab3b-931352f7bc90", "metadata": { "tags": [] }, "source": [ "### 6.6 Training without SMOTE (Optional)\n", "" ] }, { "cell_type": "markdown", "id": "43be3fdb-d3e6-40a1-aa94-5833f5f5bbdc", "metadata": {}, "source": [ "In this section we'll perform the same training, hosting and model evaluation steps on the original dataset, then compare the classification reports between the original dataset and the SMOTE augmented dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "43e8b004-8d5d-4bad-b145-2abd7e2a5d15", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Load dataset to dataframe\n", "data = pd.read_csv('./data/processed_data_classification_v2.csv', delimiter=',')\n", "\n", "# Removing column headers from CSV file\n", "feature_columns = data.columns[1:]\n", "label_column = data.columns[0]\n", "\n", "# Setting the datatype to float32\n", "features = data[feature_columns].values.astype('float32')\n", "labels = (data[label_column].values).astype('float32')\n", "\n", "# Initial split for training and test\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " features, labels, train_size=0.5, stratify=labels)\n", "\n", "# Second split for training and validation\n", "X_train, X_validation, y_train, y_validation = train_test_split(\n", " X_train, y_train, test_size=0.2)\n", "\n", "# Convert training dataset to CSV\n", "X_train_baseline_concate = 
 np.concatenate((y_train.reshape(len(y_train),1), X_train), axis=1)\n", "X_train_baseline = pd.DataFrame(X_train_baseline_concate, index=None, columns=None)\n", "\n", "# Prepend the labels and convert the validation dataset to a DataFrame\n", "X_validation_baseline_concate = np.concatenate((y_validation.reshape(len(y_validation),1), X_validation), axis=1)\n", "X_validation_baseline = pd.DataFrame(X_validation_baseline_concate, index=None, columns=None)\n", "\n", "# Prepend the labels and convert the test dataset to a DataFrame\n", "X_test_baseline_concate = np.concatenate((y_test.reshape(len(y_test),1), X_test), axis=1)\n", "X_test_baseline = pd.DataFrame(X_test_baseline_concate, index=None, columns=None)\n", "\n", "# Write the files to the local data directory in CSV format\n", "X_train_baseline.to_csv(\"{}/train_baseline.csv\".format(data_dir), header=False, index=False)\n", "X_validation_baseline.to_csv(\"{}/validation_baseline.csv\".format(data_dir), header=False, index=False)\n", "X_test_baseline.to_csv(\"{}/test_baseline.csv\".format(data_dir), header=False, index=False)\n", "\n", "# Set the directory path in S3\n", "subdir = '{}/nosmote'.format(lab_name)\n", "\n", "# Upload to S3\n", "X_train_baseline_path = session.upload_data(\n", "    path=\"{}/train_baseline.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/training'.format(prefix, subdir)\n", ")\n", "\n", "X_validation_baseline_path = session.upload_data(\n", "    path=\"{}/validation_baseline.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/validation'.format(prefix, subdir)\n", ")\n", "\n", "X_test_baseline_path = session.upload_data(\n", "    path=\"{}/test_baseline.csv\".format(data_dir), bucket=bucket, key_prefix='{}/{}/testing'.format(prefix, subdir)\n", ")\n", "\n", "# Set the output location\n", "output_location = 's3://{}/{}/{}/output/'.format(bucket, prefix, subdir)\n", "\n", "# Optionally, set the scale_pos_weight parameter to counter the class imbalance\n", "# scale_pos_weight = sqrt(np.count_nonzero(y_train==0)/np.count_nonzero(y_train))\n", "\n", "# Set basic hyperparameters\n", "hyperparams = {\n", "    'objective':'binary:logistic',\n", "    'eval_metric':'auc',\n", "    'num_round':200, \n", "    'silent':0,\n", "    'seed':42\n", "}\n", "# ,'scale_pos_weight': scale_pos_weight\n", "\n", "# Create a new Estimator\n", "clf_baseline = sagemaker.estimator.Estimator(container,\n", "    role,\n", "    hyperparameters=hyperparams,\n", "    instance_count=1, \n", "    instance_type='ml.c5.xlarge',\n", "    output_path=output_location,\n", "    sagemaker_session=session)\n", "\n", "# Define the data type and paths to the training and validation datasets\n", "content_type = \"text/csv\" \n", "train_input = TrainingInput(X_train_baseline_path, content_type=content_type)\n", "validation_input = TrainingInput(X_validation_baseline_path, content_type=content_type)\n", "\n", "# Fit the model\n", "clf_baseline.fit({'train': train_input, 'validation': validation_input })" ] }, { "cell_type": "code", "execution_count": null, "id": "dead17e7-db5f-418c-a5ea-ebae87f95c19", "metadata": {}, "outputs": [], "source": [ "# Serialize data to a CSV-formatted string\n", "csv_serializer = CSVSerializer()\n", "\n", "# Create a real-time inference endpoint that hosts our trained model \n", "xgb_predictor_baseline = clf_baseline.deploy(initial_instance_count=1,\n", "    instance_type='ml.m5.xlarge', \n", "    serializer=csv_serializer)" ] }, { "cell_type": "code", "execution_count": null, "id": "3e2297dd-d4df-4ea2-b0ea-e424fec21c1b", "metadata": {}, "outputs": [], "source": [ "# Test the model by invoking the real-time inference
 endpoint with observations from the test dataset\n", "y_preds_orig = predict(xgb_predictor_baseline, X_test)" ] }, { "cell_type": "code", "execution_count": null, "id": "1cc18bdd-0236-4781-9b18-1850a4609a6d", "metadata": {}, "outputs": [], "source": [ "# Note: .round() applies the default 0.5 classification threshold here\n", "plot_confusion_matrix(y_test, y_preds_orig.round())" ] }, { "cell_type": "code", "execution_count": null, "id": "9e6b7640-74d1-45e2-9cdb-7854c58df8df", "metadata": {}, "outputs": [], "source": [ "# Create a report showing the main classification metrics \n", "no_smote_cls_rpt = classification_report(\n", "    y_test, y_preds_orig.round(), target_names=['non-fraud', 'fraud'], digits=4, output_dict=True)\n", "\n", "# Render as DataFrame\n", "df_no_smote_cls_rpt = pd.DataFrame(no_smote_cls_rpt)\n", "df_no_smote_cls_rpt.T" ] }, { "cell_type": "markdown", "id": "c5747a45-81cf-4263-a8c4-c6c6639fab11", "metadata": {}, "source": [ "### Wrap Up" ] }, { "cell_type": "code", "execution_count": null, "id": "141acbc2-1e5f-49f0-bae7-83a030700008", "metadata": {}, "outputs": [], "source": [ "# Compare the original (no SMOTE) and SMOTE classification reports\n", "df_no_smote_cls_rpt.T.compare(df_smote_cls_rpt.T, \n", "    align_axis=0, \n", "    keep_shape=True, \n", "    keep_equal=True).rename(index={'self': 'No SMOTE', 'other': 'SMOTE'})" ] }, { "cell_type": "markdown", "id": "bd6d7e25-300e-4992-8ab4-dc7e9f290d91", "metadata": {}, "source": [ "Given that the CMS dataset is highly imbalanced and our objective is to identify fraudulent transactions (the minority class), we will focus on evaluating the **recall** and **f1-score** metrics. \n", "\n", "By comparing the results of the baseline and resampled classification reports, we observe that:\n", "1. the recall metric for the minority class increased by 83% (from .31 to .57) for the model trained on the resampled dataset\n", "2. the f1-score increased by 6.7% (from .418 to .446) for the model trained on the resampled dataset\n", "\n", "Since our goal is to prevent fraudulent transactions, improving the recall metric places a high importance on reducing the number of false negatives (transactions that were labeled as *fraud* but that the model incorrectly classified as *non-fraud*). Additionally, improving the f1-score helps because it is the harmonic mean of precision and recall:"
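, "\n", "\n", "$$F_1 = 2 \\cdot \\frac{\\text{precision} \\cdot \\text{recall}}{\\text{precision} + \\text{recall}}$$"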
" ] }, { "cell_type": "markdown", "id": "89424b62-41f1-4ab7-b110-550065854b49", "metadata": {}, "source": [ "### Clean up" ] }, { "cell_type": "code", "execution_count": null, "id": "01f121ab-ef0e-4372-83dc-9f26081c391f", "metadata": {}, "outputs": [], "source": [ "# Uncomment to clean up endpoints\n", "xgb_predictor.delete_endpoint()\n", "xgb_predictor_baseline.delete_endpoint()" ] }, { "cell_type": "markdown", "id": "0777c760-fff0-4a90-87a4-004e24133cd6", "metadata": {}, "source": [ "### Data Acknowledgements\n", "\n", "The curated dataset used for this lab comes from the following Centers for Medicare & Medicaid Services dataset:\n", "https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service" ] }, { "cell_type": "code", "execution_count": null, "id": "a4de55c1-943e-4ecc-9e26-5d19fdce7110", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.1" } }, "nbformat": 4, "nbformat_minor": 5 }