{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# In this notebook, we use supervised machine learning (a Random Forest classifier from the scikit-learn package) to identify fraudulent Medicare providers, using data from CMS that has been preprocessed with Data Wrangler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the required libraries (install imblearn with pip if it is not already present)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install imblearn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import boto3\n", "import os\n", "import sagemaker\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import io\n", "import sklearn\n", "from math import sqrt\n", "from sagemaker import get_execution_role\n", "from sagemaker.deserializers import JSONDeserializer\n", "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "from sklearn.datasets import dump_svmlight_file\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score\n", "from sklearn.metrics import classification_report\n", "from imblearn.over_sampling import SMOTE\n", "from imblearn.under_sampling import RandomUnderSampler\n", "from imblearn.pipeline import Pipeline\n", "from collections import Counter\n", "from sagemaker.s3 import S3Downloader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Raise the pandas display limits so that up to 200 columns and rows are shown when inspecting large data frames" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"pd.set_option('display.max_columns', 200)\n", "pd.set_option('display.max_rows', 200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = sagemaker.Session()\n", "bucket = session.default_bucket()\n", "prefix = 'fraud-detect-demo/sklearn'\n", "role = get_execution_role()\n", "s3_client = boto3.client(\"s3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by reading in the entire preprocessed Medicare dataset prepared for classification" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gzip -dk ../Classification\\ with\\ XGBoost/processed_data_classification.csv.gz\n", "!mv ../Classification\\ with\\ XGBoost/processed_data_classification.csv ./" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "data = pd.read_csv('processed_data_classification.csv', delimiter=',')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Investigate and process the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the data for any nulls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.isnull().values.any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for class imbalance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['fraudulent_provider'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the majority of the data is non-fraudulent. 
We will need to rebalance the data using sampling techniques that are designed specifically for imbalanced problems to improve the performance of the model. We use the random undersampling and SMOTE oversampling techniques from imblearn to do this (http://glemaitre.github.io/imbalanced-learn/api.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, separate the label (the first column) from the features and convert both to float32 arrays. This also drops the column headers, which SageMaker does not need for processing CSV files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_columns = data.columns[1:]\n", "label_column = data.columns[0]\n", "\n", "features = data[feature_columns].values.astype('float32')\n", "labels = (data[label_column].values).astype('float32')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will split our dataset into a train set and a test set to evaluate the performance of our models. Since the data is highly imbalanced, it is important to stratify the split so that both sets keep the same class distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " features, labels, test_size=0.1, stratify=labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply SMOTE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ratio in oversampling and the sampling strategy for undersampling are very important in improving the performance of the models. We have selected ratios based on research from https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0225-0 for this dataset. 
However, try experimenting with different ratios to see the impact." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# oversample the minority class to 25% of the majority class size\n", "over = SMOTE(sampling_strategy=0.25)\n", "# then undersample the majority class down to a 1:1 ratio\n", "under = RandomUnderSampler(sampling_strategy=1)\n", "steps = [('o', over), ('u', under)]\n", "pipeline = Pipeline(steps=steps)\n", "# resample the training set only; the test set keeps its original distribution\n", "X_smote, y_smote = pipeline.fit_resample(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(sorted(Counter(y_smote).items()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_smote_train, X_smote_validation, y_smote_train, y_smote_validation = train_test_split(\n", " X_smote, y_smote, test_size=0.1, stratify=y_smote)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training and Prediction - Supervised learning (Random Forest classification)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a supervised learning algorithm for classification: the scikit-learn Random Forest " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare Data and Upload to S3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first save the data in CSV format and then upload the data to S3 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# first column is target column\n", "trainX_concate = np.concatenate((y_smote_train.reshape(len(y_smote_train),1), X_smote_train), axis=1)\n", "trainX = pd.DataFrame(trainX_concate, index=None, columns=None)\n", "\n", "validationX_concate = np.concatenate((y_smote_validation.reshape(len(y_smote_validation),1), X_smote_validation), axis=1)\n", "validationX = pd.DataFrame(validationX_concate, index=None, columns=None)\n", "\n", "# the test file holds features only; the true labels stay in y_test\n", "testX = pd.DataFrame(X_test, index=None, columns=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [ "trainX.to_csv(\"cms_payment_train.csv\", header=False, index=False)\n", "validationX.to_csv(\"cms_payment_validation.csv\", header=False, index=False)\n", "testX.to_csv(\"cms_payment_test.csv\", header=False, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we upload the data to S3. SageMaker will read the training data from S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_path = session.upload_data(\n", " path=\"cms_payment_train.csv\", bucket=bucket, key_prefix=prefix\n", ")\n", "\n", "validation_path = session.upload_data(\n", " path=\"cms_payment_validation.csv\", bucket=bucket, key_prefix=prefix\n", ")\n", "\n", "test_path = session.upload_data(\n", " path=\"cms_payment_test.csv\", bucket=bucket, key_prefix=prefix\n", ")\n", "\n", "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", "print('Training artifacts will be uploaded to: {}'.format(output_location))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing a *Script Mode* script\n", "The script below contains both training and inference functionality and can run on SageMaker training hardware. 
Detailed guidance here https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile script.py\n", "\n", "import argparse\n", "import joblib\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.metrics import roc_auc_score\n", "\n", "\n", "# inference functions ---------------\n", "def model_fn(model_dir):\n", " clf = joblib.load(os.path.join(model_dir, \"model.joblib\"))\n", " return clf\n", "\n", "\n", "if __name__ == \"__main__\":\n", "\n", " print(\"extracting arguments\")\n", " parser = argparse.ArgumentParser()\n", "\n", " # hyperparameters sent by the client are passed as command-line arguments to the script.\n", " # to simplify the demo we don't use all sklearn RandomForest hyperparameters\n", " parser.add_argument(\"--n-estimators\", type=int, default=10)\n", " parser.add_argument(\"--min-samples-leaf\", type=int, default=3)\n", "\n", " # Data, model, and output directories\n", " parser.add_argument(\"--model-dir\", type=str, default=os.environ.get(\"SM_MODEL_DIR\"))\n", " parser.add_argument(\"--train\", type=str, default=os.environ.get(\"SM_CHANNEL_TRAIN\"))\n", " parser.add_argument(\"--test\", type=str, default=os.environ.get(\"SM_CHANNEL_TEST\"))\n", " parser.add_argument(\"--train-file\", type=str, default=\"cms_payment_train.csv\")\n", " parser.add_argument(\"--test-file\", type=str, default=\"cms_payment_validation.csv\")\n", "\n", " args, _ = parser.parse_known_args()\n", "\n", " print(\"reading data\")\n", " train_df = pd.read_csv(os.path.join(args.train, args.train_file), header=None)\n", " test_df = pd.read_csv(os.path.join(args.test, args.test_file), header=None)\n", "\n", " print(\"building training and testing datasets\")\n", " X_train = 
train_df[train_df.columns[1:]]\n", " X_test = test_df[test_df.columns[1:]]\n", " y_train = train_df[train_df.columns[0]]\n", " y_test = test_df[test_df.columns[0]]\n", "\n", " # train\n", " print(\"training model\")\n", " model = RandomForestClassifier(\n", " n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1\n", " )\n", "\n", " model.fit(X_train, y_train)\n", "\n", " # evaluate on the data passed in the test channel\n", " print(\"validating model\")\n", " y_pred = model.predict(X_test)\n", " acc = accuracy_score(y_test, y_pred)\n", " # AUC is computed from the predicted probability of the positive class, not from hard labels\n", " y_score = model.predict_proba(X_test)[:, 1]\n", " auc = roc_auc_score(y_test, y_score)\n", " print(f\"Accuracy is: {acc}\")\n", " print(f\"Area under the curve is: {auc}\")\n", "\n", " # persist model\n", " path = os.path.join(args.model_dir, \"model.joblib\")\n", " joblib.dump(model, path)\n", " print(\"model persisted at \" + path)\n", " print(args.min_samples_leaf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Training\n", "\n", "### Launching a training job with SageMaker Python SDK" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We use the Estimator from the SageMaker Python SDK\n", "from sagemaker.sklearn.estimator import SKLearn\n", "\n", "FRAMEWORK_VERSION = \"0.23-1\"\n", "\n", "sklearn_estimator = SKLearn(\n", " entry_point=\"script.py\",\n", " role=get_execution_role(),\n", " instance_count=1,\n", " instance_type=\"ml.c5.xlarge\",\n", " framework_version=FRAMEWORK_VERSION,\n", " base_job_name=\"rf-scikit\",\n", " metric_definitions=[{\"Name\": \"Accuracy\", \"Regex\": \"Accuracy is: ([0-9.]+).*$\"}],\n", " hyperparameters={\n", " \"n-estimators\": 100,\n", " \"min-samples-leaf\": 2\n", " },\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# launch the training job and block until it completes (wait=True)\n", "sklearn_estimator.fit({\"train\": train_path, \"test\": validation_path}, wait=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch 
Transform \n", "We can also use the trained model for asynchronous batch inference on S3 data using SageMaker Batch Transform." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define an SKLearn Transformer from the trained SKLearn Estimator\n", "transformer = sklearn_estimator.transformer(instance_count=1, instance_type=\"ml.m5.xlarge\", strategy='MultiRecord', assemble_with=\"Line\", accept=\"text/csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start a transform job and wait for it to finish\n", "transformer.transform(test_path, split_type=\"Line\", content_type=\"text/csv\")\n", "print(\"Waiting for transform job: \" + transformer.latest_transform_job.job_name)\n", "transformer.wait()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check Output Data \n", "After the transform job has completed, download the output data from S3. For each file \"f\" in the input data, there is a corresponding file \"f.out\" containing the predicted label for each input row. We can compare the predicted labels to the true labels (`y_test`) saved earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the output data from S3 to local file system\n", "batch_output = transformer.output_path\n", "output_file_name = \"cms_payment_test.csv.out\"\n", "\n", "!aws s3 cp {batch_output}/{output_file_name} ./" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from numpy import genfromtxt\n", "y_preds = genfromtxt(output_file_name, delimiter=',') \n", "y_preds.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use a few measures from the scikit-learn package to evaluate the performance of our model. 
When dealing with an imbalanced dataset, we need to choose metrics that take into account the frequency of each class in the data.\n", "\n", "We will use the [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score), which is the average of the recall obtained on each class.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also trade off precision against recall by adjusting the classification threshold (the cut-off between labeling a point as fraud or not). Try different thresholds to see how they affect the classification results. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate balanced accuracy score\n", "print(\"Balanced accuracy = {}\".format(balanced_accuracy_score(y_test, y_preds)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix and the per-class precision, recall, and F1 score provide more information about the model's performance." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(y_true, y_predicted):\n", "\n", " cm = confusion_matrix(y_true, y_predicted)\n", " # Get the per-class normalized value for each cell\n", " cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " \n", " # We color each cell according to its normalized value, annotate with exact counts.\n", " ax = sns.heatmap(cm_norm, annot=cm, fmt=\"d\")\n", " ax.set(xticklabels=[\"non-fraud\", \"fraud\"], yticklabels=[\"non-fraud\", \"fraud\"])\n", " ax.set_ylim([0,2])\n", " plt.title('Confusion Matrix')\n", " plt.ylabel('Real Classes')\n", " plt.xlabel('Predicted Classes')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_confusion_matrix(y_test, y_preds)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(classification_report(\n", " y_test, y_preds, target_names=['non-fraud', 'fraud']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Data Acknowledgements\n", "\n", "The dataset used to demonstrate the fraud detection solution was collected from CMS and then analysed \n", "\n", "https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service\n", "\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, 
"version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }