{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Detecting fraudulent Medicare providers with unsupervised anomaly detection\n", "\n", "In this notebook we use unsupervised machine learning (anomaly detection) to identify potentially fraudulent Medicare providers, using CMS data that has been preprocessed with Data Wrangler." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the required libraries (install imblearn with pip if it is not present)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install imblearn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import boto3\n", "import os\n", "import sagemaker\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import io\n", "import sklearn\n", "from math import sqrt\n", "from sagemaker import get_execution_role\n", "from sagemaker import RandomCutForest\n", "from sagemaker.deserializers import JSONDeserializer\n", "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "from sklearn.datasets import dump_svmlight_file\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score\n", "from sklearn.metrics import classification_report\n", "from imblearn.over_sampling import SMOTE\n", "from imblearn.under_sampling import RandomUnderSampler\n", "from imblearn.pipeline import Pipeline\n", "from collections import Counter\n", "from sagemaker.s3 import S3Downloader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Allow pandas to display all columns and rows when inspecting large data frames" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = sagemaker.Session()\n", "bucket = session.default_bucket()\n", "prefix = 'fraud-detect-demo/randomforest'\n", "role = get_execution_role()\n", "s3_client = boto3.client(\"s3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by reading in the entire preprocessed Medicare data set prepared for anomaly detection. 
This dataset has many more data elements than the dataset prepared for classification." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gzip -d processed_data_anomaly_detection1.csv.gz\n", "!gzip -d processed_data_anomaly_detection2.csv.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data1 = pd.read_csv('processed_data_anomaly_detection1.csv', delimiter=',')\n", "data2 = pd.read_csv('processed_data_anomaly_detection2.csv', delimiter=',')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# concatenate the two partitions into a single DataFrame\n", "data = pd.concat([data1, data2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Investigate and process the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the data for any nulls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.isnull().values.any()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['fraudulent_provider'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 80% for the training set and 20% for the testing set\n", "RANDOM_SEED = 314  # used to make the split reproducible\n", "TEST_PCT = 0.2  # 20% of the data\n", "\n", "train, test = train_test_split(data, test_size=TEST_PCT, stratify=data['fraudulent_provider'], random_state=RANDOM_SEED)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training and Prediction - Unsupervised Learning (Anomaly Detection)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use anomaly detection, an unsupervised learning technique, to identify potential fraud" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In anomaly detection, you ideally train on the \"normal\" data only (in this case, non-fraud) and then test on a mix of fraud and non-fraud data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_x = train.loc[train['fraudulent_provider'] == 0]\n", "train_x = train_x.drop(['fraudulent_provider'], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = train_x.values.astype('float32')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# specify general training job information\n", "rcf = RandomCutForest(role=get_execution_role(),\n", "                      instance_count=2,\n", "                      instance_type='ml.c5.4xlarge',\n", "                      data_location='s3://{}/{}/'.format(bucket, prefix),\n", "                      output_path='s3://{}/{}/output'.format(bucket, prefix),\n", "                      num_samples_per_tree=2048,\n", "                      num_trees=1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rcf.fit(rcf.record_set(X_train, channel='train'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Host Random Cut Forest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have a trained model, we can deploy it and get predictions for our test set. 
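" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before deploying, a quick sanity check on the hyperparameters chosen above may be useful. The SageMaker RCF examples suggest choosing `num_samples_per_tree` so that `1/num_samples_per_tree` roughly approximates the expected fraction of anomalies in the data. The cell below is only an illustrative check (the variable names are ours) that compares that ratio with the observed fraud rate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sanity check: compare 1/num_samples_per_tree with the observed fraud rate\n", "observed_fraud_rate = data['fraudulent_provider'].mean()\n", "implied_anomaly_rate = 1.0 / 2048  # num_samples_per_tree used above\n", "print('observed fraud rate: {:.5f}'.format(observed_fraud_rate))\n", "print('1 / num_samples_per_tree: {:.5f}'.format(implied_anomaly_rate))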
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rcf_predictor = rcf.deploy(\n", " initial_instance_count=1,\n", " instance_type='ml.c4.xlarge',\n", " serializer=CSVSerializer(),\n", " deserializer=JSONDeserializer()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the model deployed, let's see how it performs in terms of separating fraudulent from legitimate transactions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def predict_rcf(current_predictor, d, rows=500):\n", " split_array = np.array_split(d, int(d.shape[0] / float(rows) + 1))\n", " predictions = []\n", " for array in split_array:\n", " array_preds = [s['score'] for s in current_predictor.predict(array)['scores']]\n", " predictions.append(array_preds)\n", "\n", " return np.concatenate([np.array(batch) for batch in predictions])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frauds = test.loc[test['fraudulent_provider'] == 1]\n", "nonfrauds = test.loc[test['fraudulent_provider'] == 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "frauds = frauds.drop(['fraudulent_provider'], axis=1)\n", "nonfrauds = nonfrauds.drop(['fraudulent_provider'], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "positives= frauds.values.astype('float32')\n", "negatives= nonfrauds.values.astype('float32')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "positives_scores = predict_rcf(rcf_predictor, positives)\n", "negatives_scores = predict_rcf(rcf_predictor, negatives)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.set(color_codes=True)\n", "sns.set(rc={'figure.figsize':(11.7,8.27)})\n", "sns.distplot(positives_scores, label='fraud', bins=20)\n", "sns.distplot(negatives_scores, label='not-fraud', bins=20)\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check the data for Bias" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# converting the facet value that we wil check for bias ('female') from float to int for setting up Sagemaker clarify bias processing\n", "test['referring_provider_gender_f'] = test['referring_provider_gender_f'].astype(int)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.s3 import S3Uploader\n", "\n", "test.to_csv(\"train.csv\", index=False, header=False)\n", "\n", "train_uri = S3Uploader.upload(\"train.csv\", \"s3://{}/{}\".format(bucket, prefix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IMPORTANT: Since we are running RCF on a large number of columns, it is recommended to use an instance with high memory and increase the number of instances to the extent possible" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import clarify\n", "\n", "clarify_processor = clarify.SageMakerClarifyProcessor(\n", " role=role, instance_count=3, instance_type=\"ml.r5.24xlarge\", sagemaker_session=session\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_report_output_path = 
\"s3://{}/{}/clarify-bias\".format(bucket, prefix)\n", "bias_data_config = clarify.DataConfig(\n", "    s3_data_input_path=train_uri,\n", "    s3_output_path=bias_report_output_path,\n", "    label=\"fraudulent_provider\",\n", "    headers=train.columns.to_list(),\n", "    dataset_type=\"text/csv\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Update `model_name` below with the model name of the SageMaker endpoint you deployed. It is available in the parameters returned by the \n", "`describe_endpoint_config` call on the SageMaker client `smclient`. To get the endpoint config name, run `describe_endpoint` against the endpoint you deployed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient = boto3.client(service_name='sagemaker')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient.describe_endpoint_config(EndpointConfigName='get name from Console')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IMPORTANT: Use the `ModelName` returned above as the `model_name` below. Also, since we are running RCF on a large number of columns, it is recommended to use an instance type with high memory and to increase the instance count to the extent possible" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_config = clarify.ModelConfig(\n", "    model_name=\"replace with model name\",\n", "    instance_type=\"ml.r5.24xlarge\",\n", "    instance_count=3,\n", "    accept_type=\"text/csv\",\n", "    content_type=\"text/csv\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions_config = clarify.ModelPredictedLabelConfig()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_config = clarify.BiasConfig(\n", "    label_values_or_threshold=[1.25], facet_name=\"referring_provider_gender_f\", facet_values_or_threshold=[1]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clarify_processor.run_post_training_bias(\n", "    data_config=bias_data_config,\n", "    data_bias_config=bias_config,\n", "    model_config=model_config,\n", "    model_predicted_label_config=predictions_config,\n", "    methods=[\"DI\", \"RD\"],\n", "    logs=False\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_report_output_path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "S3Downloader.download(\"{}/report.pdf\".format(bias_report_output_path), \"../Fraud Detection/Bias/AnomalyDetection\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view the bias metrics, open the downloaded report.pdf; alternatively, you can view the results in Studio under the Experiments tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate which features contribute to the model predictions (Explainability)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of samples below is critical for explainability. Ideally, you need at least 5 times the number of columns in the dataset to allow enough permutations and combinations. However, it is important to keep in mind that this dramatically increases the memory required, and RCF already requires a significant amount of memory. 
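" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough, illustrative check of that guideline (Clarify does not require this, and the helper names below are ours), the cell below computes what 5 times the number of feature columns would be for this dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rough illustration of the '5 x number of columns' guideline described above\n", "n_features = len(train.columns) - 1  # exclude the fraudulent_provider label\n", "print('feature columns:', n_features)\n", "print('suggested minimum num_samples:', 5 * n_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 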
So we keep the number of samples at 200." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shap_config = clarify.SHAPConfig(\n", "    num_samples=200,\n", "    agg_method=\"mean_abs\",\n", "    save_local_shap_values=True,\n", ")\n", "\n", "explainability_output_path = \"s3://{}/{}/clarify-explainability\".format(bucket, prefix)\n", "explainability_data_config = clarify.DataConfig(\n", "    s3_data_input_path=train_uri,\n", "    s3_output_path=explainability_output_path,\n", "    label=\"fraudulent_provider\",\n", "    headers=train.columns.to_list(),\n", "    dataset_type=\"text/csv\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clarify_processor.run_explainability(\n", "    data_config=explainability_data_config,\n", "    model_config=model_config,\n", "    explainability_config=shap_config,\n", "    logs=False\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "explainability_output_path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "S3Downloader.download(\"{}/report.pdf\".format(explainability_output_path), \"../Fraud Detection/Exp/AnomalyDetection\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean Up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment to clean up the endpoint\n", "# rcf_predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Data Acknowledgements\n", "\n", "The dataset used to demonstrate the fraud detection solution was collected from CMS:\n", "\n", "https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }