# In this notebook, we use supervised machine learning (a Random Forest classifier from the scikit-learn package) to identify fraudulent Medicare providers, using CMS data that has been preprocessed with Data Wrangler

## Setup

Import the required libraries (install imblearn with pip if it is not already present).

In [None]:
!pip install imblearn

In [None]:
import numpy as np 
import pandas as pd
import boto3
import os
import sagemaker
import seaborn as sns
import matplotlib.pyplot as plt
import io
import sklearn
from math import sqrt
from sagemaker import get_execution_role
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer
from sagemaker.amazon.amazon_estimator import get_image_uri
from sklearn.datasets import dump_svmlight_file 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from collections import Counter
from sagemaker.s3 import S3Downloader

Raise the pandas display limits so that wide or long DataFrames are shown in full (up to 200 columns and 200 rows).

In [None]:
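# Use the fully qualified option names; the short forms can be ambiguous in recent pandas versions
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)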

In [None]:
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = 'fraud-detect-demo/sklearn'
role = get_execution_role()
s3_client = boto3.client("s3")

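Let's start by reading in the entire preprocessed Medicare dataset prepared for classification. The compressed file lives in the sibling `Classification with XGBoost` folder, so we first decompress it and move the CSV into the current directory.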

In [None]:
!gzip -dk ../Classification\ with\ XGBoost/processed_data_classification.csv.gz
!mv ../Classification\ with\ XGBoost/processed_data_classification.csv ./

In [None]:
data = pd.read_csv('processed_data_classification.csv', delimiter=',')
data.head()

## Investigate and process the data

Check data for any nulls

In [None]:
data.isnull().values.any()

Check for imbalance

In [None]:
data['fraudulent_provider'].value_counts()

We see that the majority of the data is non-fraudulent. To improve the performance of the model, we need to rebalance the data using sampling techniques designed specifically for imbalanced problems. We use the random under-sampling and over-sampling techniques from imblearn to do this (http://glemaitre.github.io/imbalanced-learn/api.html).

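First, separate the label (the first column) from the feature columns and convert both to plain `float32` arrays. This also drops the column headers, which SageMaker does not need for processing CSV files.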

In [None]:
feature_columns = data.columns[1:]
label_column = data.columns[0]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

We will split our dataset into a training set and a test set to evaluate the performance of our models. Since the data is highly imbalanced, it is important to stratify the split so that both sets keep the same proportion of fraudulent providers.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
 features, labels, test_size=0.1, stratify=labels)
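
As a rough illustration of what `stratify` does, the sketch below runs the same kind of split on synthetic data. The arrays `X_demo` and `y_demo`, the 2% positive rate, and the fixed `random_state` are assumptions made for this example and are not taken from the CMS data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1,000 rows with a 2% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000)
y_demo[:20] = 1.0

# Same call shape as in the notebook, with a fixed seed for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# With stratification, each split keeps the positive rate of the full dataset.
print("positive rate, full :", y_demo.mean())
print("positive rate, train:", y_tr.mean())
print("positive rate, test :", y_te.mean())
```

Without `stratify`, a small test set drawn from heavily imbalanced data could end up with very few fraudulent rows, or none, which would make the evaluation metrics unreliable.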

## Apply SMOTE

The ratio used for oversampling and the sampling strategy used for undersampling are very important in improving the performance of the models. We have selected ratios based on research from https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0225-0 for this dataset. However, try experimenting with different ratios to see their impact.

In [None]:
over = SMOTE(sampling_strategy=0.25)
under = RandomUnderSampler(sampling_strategy=1)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X_smote, y_smote = pipeline.fit_resample(X_train, y_train)

In [None]:
print(sorted(Counter(y_smote).items()))
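
The counts printed above can be estimated by hand. The helper below is a minimal sketch, assuming the binary-class meaning of a float `sampling_strategy` in imblearn (the desired minority-to-majority ratio after resampling). The function name `resampled_counts` and the example counts are hypothetical and only serve the illustration.

```python
def resampled_counts(n_majority, n_minority, over_ratio=0.25, under_ratio=1.0):
    """Estimate class counts after over-sampling followed by under-sampling.

    Assumption: for two classes, a float sampling_strategy is the target
    ratio minority / majority after the resampling step.
    """
    # Over-sampling step: grow the minority class up to over_ratio * majority.
    minority_after_over = max(n_minority, int(over_ratio * n_majority))
    # Under-sampling step: shrink the majority class to minority / under_ratio.
    majority_after_under = min(n_majority, int(minority_after_over / under_ratio))
    return majority_after_under, minority_after_over


# Hypothetical counts, not taken from the CMS data.
print(resampled_counts(n_majority=100_000, n_minority=500))
```

Note that this is only arithmetic: imblearn itself raises an error when the requested ratio would require removing samples during over-sampling, whereas the sketch simply leaves the counts unchanged in that case.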

In [None]:
X_smote_train, X_smote_validation, y_smote_train, y_smote_validation = train_test_split(
 X_smote, y_smote, test_size=0.1, stratify=y_smote)

## Training and Prediction - Supervised learning (Random Forest classification)

We use a supervised learning algorithm for classification: the scikit-learn Random Forest classifier.

### Prepare Data and Upload to S3

We first save the data in csv format and then upload the data to S3 

In [None]:
# first column is target column
trainX_concate = np.concatenate((y_smote_train.reshape(len(y_smote_train),1), X_smote_train), axis=1)
trainX = pd.DataFrame(trainX_concate, index=None, columns=None)

validationX_concate = np.concatenate((y_smote_validation.reshape(len(y_smote_validation),1), X_smote_validation), axis=1)
validationX = pd.DataFrame(validationX_concate, index=None, columns=None)

testX = pd.DataFrame(X_test, index=None, columns=None)

In [None]:
trainX.to_csv("cms_payment_train.csv", header=False, index=False)
validationX.to_csv("cms_payment_validation.csv", header=False, index=False)
testX.to_csv("cms_payment_test.csv", header=False, index=False)
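
To make the file layout concrete, here is a small sketch that writes a toy table in the same shape (label in the first column, no header, no index) and reads it back the way a training script would. The toy arrays and the in-memory `buffer` are assumptions for illustration; the notebook itself writes real files to disk.

```python
import io

import numpy as np
import pandas as pd

# Toy arrays standing in for y_smote_train / X_smote_train.
y_toy = np.array([0.0, 1.0, 0.0], dtype="float32")
X_toy = np.array([[1.5, 2.0], [3.0, 4.5], [5.0, 6.0]], dtype="float32")

# Label goes in column 0, features follow; no header row and no index column.
table = pd.DataFrame(np.concatenate((y_toy.reshape(-1, 1), X_toy), axis=1))
buffer = io.StringIO()
table.to_csv(buffer, header=False, index=False)
print(buffer.getvalue())

# Read it back with header=None and split off the first column as the label.
buffer.seek(0)
loaded = pd.read_csv(buffer, header=None)
y_back = loaded[loaded.columns[0]]
X_back = loaded[loaded.columns[1:]]
```

The test file differs from the other two: it contains only the feature columns, because the true labels stay in `y_test` for the evaluation later on.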

Now we upload the data to S3. SageMaker reads the training data from S3.

In [None]:
train_path = session.upload_data(
 path="cms_payment_train.csv", bucket=bucket, key_prefix=prefix
)

validation_path = session.upload_data(
 path="cms_payment_validation.csv", bucket=bucket, key_prefix=prefix
)

test_path = session.upload_data(
 path="cms_payment_test.csv", bucket=bucket, key_prefix=prefix
)

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

## Writing a *Script Mode* script
The script below contains both training and inference functionality and can run on SageMaker training instances. Detailed guidance is available at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [None]:
%%writefile script.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score


# inference functions ---------------
def model_fn(model_dir):
 clf = joblib.load(os.path.join(model_dir, "model.joblib"))
 return clf


if __name__ == "__main__":

 print("extracting arguments")
 parser = argparse.ArgumentParser()

 # hyperparameters sent by the client are passed as command-line arguments to the script.
 # to simplify the demo we don't use all sklearn RandomForest hyperparameters
 parser.add_argument("--n-estimators", type=int, default=10)
 parser.add_argument("--min-samples-leaf", type=int, default=3)

 # Data, model, and output directories
 parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
 parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
 parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
 parser.add_argument("--train-file", type=str, default="cms_payment_train.csv")
 parser.add_argument("--test-file", type=str, default="cms_payment_validation.csv")

 args, _ = parser.parse_known_args()

 print("reading data")
 train_df = pd.read_csv(os.path.join(args.train, args.train_file), header=None)
 test_df = pd.read_csv(os.path.join(args.test, args.test_file), header=None)

 print("building training and testing datasets")
 X_train = train_df[train_df.columns[1:]]
 X_test = test_df[test_df.columns[1:]]
 y_train = train_df[train_df.columns[0]]
 y_test = test_df[test_df.columns[0]]

 # train
 print("training model")
 model = RandomForestClassifier(
 n_estimators=args.n_estimators, min_samples_leaf=args.min_samples_leaf, n_jobs=-1
 )

 model.fit(X_train, y_train)

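 # print accuracy and AUC
 print("validating model")
 y_pred = model.predict(X_test)
 # AUC needs a continuous score, so use the predicted probability of the positive class
 y_score = model.predict_proba(X_test)[:, 1]
 acc = accuracy_score(y_test, y_pred)
 auc = roc_auc_score(y_test, y_score)
 print(f"Accuracy is: {acc}")
 print(f"Area under the curve is: {auc}")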

 # persist model
 path = os.path.join(args.model_dir, "model.joblib")
 joblib.dump(model, path)
 print("model persisted at " + path)
 print(args.min_samples_leaf)
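
The script above defines only `model_fn`, so inference relies on the container's default prediction step, which to our understanding returns hard class labels. If fraud probabilities were wanted instead (for example to tune a classification threshold), an optional `predict_fn` hook could be added to `script.py`. The following is a minimal sketch under that assumption; the toy data and `toy_model` exist only to exercise the function locally and are not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def predict_fn(input_data, model):
    """Optional serving hook (sketch): return P(fraud) instead of a hard label.

    Assumes class 1 is the fraud label, as in the training CSV.
    """
    proba = model.predict_proba(input_data)
    # Column order follows model.classes_; pick the column for label 1.
    fraud_col = list(model.classes_).index(1)
    return proba[:, fraud_col]


# Local smoke test on toy data (not the CMS data).
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)
scores = predict_fn(X_toy, toy_model)
print(scores)
```

With such a hook in place, the batch transform output would contain probabilities rather than labels, so any later evaluation step would first have to apply a threshold.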

## SageMaker Training

### Launching a training job with SageMaker Python SDK

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
 entry_point="script.py",
 role=get_execution_role(),
 instance_count=1,
 instance_type="ml.c5.xlarge",
 framework_version=FRAMEWORK_VERSION,
 base_job_name="rf-scikit",
 metric_definitions=[{"Name": "Accuracy", "Regex": "Accuracy is: ([0-9.]+).*$"}],
 hyperparameters={
 "n-estimators": 100,
 "min-samples-leaf": 2
 },
)

In [None]:
# launch the training job and block until it finishes (wait=True makes the call synchronous)
sklearn_estimator.fit({"train": train_path, "test": validation_path}, wait=True)

## Batch Transform 
We can also use the trained model for asynchronous batch inference on S3 data using SageMaker Batch Transform.

In [None]:
# Define an SKLearn Transformer from the trained SKLearn Estimator
transformer = sklearn_estimator.transformer(instance_count=1, instance_type="ml.m5.xlarge", strategy='MultiRecord', assemble_with="Line", accept="text/csv")

In [None]:
# Start a transform job and wait for it to finish
transformer.transform(test_path, split_type="Line", content_type="text/csv")
print("Waiting for transform job: " + transformer.latest_transform_job.job_name)
transformer.wait()

### Check Output Data 
After the transform job has completed, download the output data from S3. For each file "f" in the input data, there is a corresponding file "f.out" containing the predicted label for each input row. We can compare the predicted labels to the true labels saved earlier in `y_test`.

In [None]:
# Download the output data from S3 to local file system
batch_output = transformer.output_path
output_file_name = "cms_payment_test.csv.out"

!aws s3 cp {batch_output}/{output_file_name} ./

In [None]:
from numpy import genfromtxt
y_preds = genfromtxt(output_file_name, delimiter=',') 
y_preds.shape

### Evaluation

We will use a few measures from the scikit-learn package to evaluate the performance of our model. When dealing with an imbalanced dataset, we need to choose metrics that take into account the frequency of each class in the data.

We will use the [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score), which is the average of the recall obtained on each class.

We can also trade off the per-class metrics against each other by adjusting the classification threshold (the predicted-probability cutoff above which a point is labeled as fraud). Trying different thresholds shows how they affect the result of the classification.

In [None]:
# Calculate balanced accuracy score
print("Balanced accuracy = {}".format(balanced_accuracy_score(y_test, y_preds)))
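
To illustrate how balanced accuracy is computed and how it reacts to the classification threshold, the sketch below evaluates a handful of made-up fraud scores at three cutoffs. The helper `balanced_accuracy`, the labels in `y_true_demo`, and the scores in `scores_demo` are hypothetical; the batch transform above returns hard labels, so real scores would have to come from predicted probabilities.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recall_pos = np.mean(y_pred[y_true == 1] == 1)
    recall_neg = np.mean(y_pred[y_true == 0] == 0)
    return (recall_pos + recall_neg) / 2


# Hypothetical labels and fraud scores; not output from the model above.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.5, 0.9])

for threshold in (0.3, 0.5, 0.8):
    # A point is labeled fraud when its score reaches the threshold.
    y_pred_demo = (scores_demo >= threshold).astype(int)
    print(threshold, balanced_accuracy(y_true_demo, y_pred_demo))
```

A low threshold catches every fraudulent row but mislabels many legitimate ones, while a high threshold does the opposite; balanced accuracy weighs both classes equally regardless of how rare fraud is.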

Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix, and per-class precision, recall and f1-score can also provide more information about the model's performance.

In [None]:
def plot_confusion_matrix(y_true, y_predicted):

 cm = confusion_matrix(y_true, y_predicted)
 # Get the per-class normalized value for each cell
 cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
 
 # We color each cell according to its normalized value, annotate with exact counts.
 ax = sns.heatmap(cm_norm, annot=cm, fmt="d")
 ax.set(xticklabels=["non-fraud", "fraud"], yticklabels=["non-fraud", "fraud"])
 ax.set_ylim([0,2])
 plt.title('Confusion Matrix')
 plt.ylabel('Real Classes')
 plt.xlabel('Predicted Classes')
 plt.show()

In [None]:
plot_confusion_matrix(y_test, y_preds)

In [None]:
print(classification_report(
 y_test, y_preds, target_names=['non-fraud', 'fraud']))
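
The per-class numbers in the report can be derived directly from the confusion matrix. The following sketch uses a hypothetical 2x2 matrix (`cm_demo`, not a result from this notebook) to show the row normalization used for the heatmap colors and the per-class precision, recall and F1 calculations.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
# Row 0 / column 0 is non-fraud, row 1 / column 1 is fraud.
cm_demo = np.array([[90, 10],
                    [2, 8]])

# Row-normalize, as plot_confusion_matrix does for the cell colors.
cm_norm_demo = cm_demo / cm_demo.sum(axis=1, keepdims=True)

true_pos = np.diag(cm_demo)
recall = true_pos / cm_demo.sum(axis=1)     # share of each true class found
precision = true_pos / cm_demo.sum(axis=0)  # share of each predicted class that is correct
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["non-fraud", "fraud"], precision, recall, f1):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The diagonal of the row-normalized matrix equals the per-class recall, which is why the heatmap colors give a quick visual read of how well each class is recovered.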


## Data Acknowledgements

The dataset used to demonstrate the fraud detection solution was collected from CMS and analysed:

https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service

