# Credit card fraud detector using Amazon SageMaker

## Investigate and process the data

Let's start by downloading the dataset from https://www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv and uploading it to the notebook file system as creditcard.csv.

In [None]:
import numpy as np 
import pandas as pd

data = pd.read_csv('creditcard.csv', delimiter=',')

Let's take a peek at our data (we only show a subset of the columns in the table):

In [None]:
print(data.columns)
data[['Time', 'V1', 'V2', 'V27', 'V28', 'Amount', 'Class']].describe()

The `Class` column indicates whether a transaction is fraudulent. The vast majority of the data is non-fraudulent: only $492$ examples ($0.173\%$) are fraudulent.

In [None]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Percentage of fraudulent data:', 100.*frauds/(frauds + nonfrauds))

This dataset has 28 columns of anonymized features, $V_i$ for $i=1..28$, along with columns for time, amount, and class. The columns $V_i$ are the result of a PCA, so they are already centered at $0$ mean. Note that their standard deviations are not all equal to one, as the `describe()` output above shows.

In [None]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

Next, we will prepare our data for loading and training.

## Training

We will split our dataset into a train set and a test set so we can evaluate the performance of our models. It's important to do so _before_ applying any technique meant to alleviate the class imbalance. This ensures that we don't leak information from the test set into the train set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
 features, labels, test_size=0.1, random_state=42)
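
With so few positives, a purely random split can leave the test set with noticeably more or fewer fraud cases than the overall ratio suggests. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below uses a small synthetic array (the names `X_demo` and `y_demo` are illustrative only) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 20 of them positive (2%)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 5)).astype('float32')
y_demo = np.zeros(1000, dtype='float32')
y_demo[:20] = 1

# stratify=y_demo keeps the class ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo)

print('Positives in train:', int(y_tr.sum()))
print('Positives in test: ', int(y_te.sum()))
```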

## Unsupervised Learning

In a fraud detection scenario, commonly we will have very few labeled examples, and it's possible that labeling fraud takes a very long time. We would like then to extract information from the unlabeled data we have at hand as well. _Anomaly detection_ is a form of unsupervised learning where we try to identify anomalous examples based solely on their feature characteristics. Random Cut Forest is a state-of-the-art anomaly detection algorithm that is both accurate and scalable. We will train such a model on our training data and evaluate its performance on our test set.

In [None]:
import boto3
import os
import sagemaker
from sagemaker import get_execution_role

session = sagemaker.Session()
bucket = "renbrave-fraud-detection-model"
prefix = 'fraud-classifier'

In [None]:
from sagemaker import RandomCutForest

# specify general training job information
rcf = RandomCutForest(role=get_execution_role(),
 train_instance_count=1,
 train_instance_type='ml.c4.xlarge',
 data_location='s3://{}/{}/'.format(bucket, prefix),
 output_path='s3://{}/{}/output'.format(bucket, prefix),
 num_samples_per_tree=512,
 num_trees=50)

In [None]:
rcf.fit(rcf.record_set(X_train))

### Host Random Cut Forest

Once we have a trained model we can deploy it and get some predictions for our test set.

In [None]:
rcf_predictor = rcf.deploy(
 endpoint_name='random-cut-forest-endpoint',
 initial_instance_count=1,
 instance_type='ml.c4.xlarge')

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

rcf_predictor.content_type = 'text/csv'
rcf_predictor.serializer = csv_serializer
rcf_predictor.accept = 'application/json'
rcf_predictor.deserializer = json_deserializer

### Test Random Cut Forest

With the model deployed, let's see how it performs in terms of separating fraudulent from legitimate transactions.

In [None]:
def predict_rcf(current_predictor, data, rows=500):
    # Split the data into batches of at most `rows` rows to keep each request small
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        array_preds = [s['score'] for s in current_predictor.predict(array)['scores']]
        predictions.append(array_preds)

    return np.concatenate([np.array(batch) for batch in predictions])
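
To see how the batching in `predict_rcf` behaves without calling an endpoint, the following sketch applies the same `np.array_split` expression to a hypothetical array of 1200 rows (the array `demo` is an assumption made for illustration).

```python
import numpy as np

rows = 500
demo = np.zeros((1200, 3))  # hypothetical array standing in for the test features

# Same expression as in predict_rcf: one more batch than the integer quotient
n_batches = int(demo.shape[0] / float(rows) + 1)
batches = np.array_split(demo, n_batches)

print('Number of batches:', n_batches)
print('Rows per batch:   ', [b.shape[0] for b in batches])
```

Because the number of batches is always larger than the number of rows divided by `rows`, no batch exceeds `rows` rows.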

In [None]:
positives = X_test[y_test == 1]
positives_scores = predict_rcf(rcf_predictor, positives)

negatives = X_test[y_test == 0]
negatives_scores = predict_rcf(rcf_predictor, negatives)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)

In [None]:
sns.distplot(positives_scores, label='fraud', bins=20)
sns.distplot(negatives_scores, label='not-fraud', bins=20)
plt.legend()

The unsupervised model can already achieve some separation between the classes, with higher anomaly scores being correlated with fraud.
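
The same idea can be tried locally without a SageMaker endpoint. The sketch below does not use Random Cut Forest: it swaps in scikit-learn's `IsolationForest`, a different tree-based anomaly detector, and runs it on synthetic data. It is only meant to illustrate that points far from the bulk of the data receive higher anomaly scores.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
inliers = rng.normal(size=(500, 4))   # synthetic "normal" points
outliers = np.full((5, 4), 6.0)       # synthetic points far from the bulk

# IsolationForest is a stand-in here, not the SageMaker Random Cut Forest algorithm
forest = IsolationForest(n_estimators=50, max_samples=256, random_state=0)
forest.fit(inliers)

# score_samples returns lower values for more abnormal points, so we negate it
# to obtain a score where higher means more anomalous
inlier_scores = -forest.score_samples(inliers)
outlier_scores = -forest.score_samples(outliers)

print('Mean inlier score: ', inlier_scores.mean())
print('Mean outlier score:', outlier_scores.mean())
```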

## Supervised Learning

Once we have gathered an adequate amount of labeled training data, we can use a supervised learning algorithm that discovers relationships between the features and the dependent class.

We will use Gradient Boosted Trees as our model, as they have a proven track record, are highly scalable and can deal with missing data, reducing the need to pre-process datasets.

### Prepare Data and Upload to S3

First we copy the data to an in-memory buffer.

In [1]:
import io
import sklearn
from sklearn.datasets import dump_svmlight_file 

buf = io.BytesIO()

sklearn.datasets.dump_svmlight_file(X_train, y_train, buf)
buf.seek(0);
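
For readers unfamiliar with the format written by `dump_svmlight_file`, here is a minimal sketch on a tiny hypothetical array (`X_small` and `y_small` are illustrative names). It prints the text that ends up in the buffer and then reads it back with `load_svmlight_file` to check that nothing was lost.

```python
import io
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X_small = np.array([[0.5, 0.0, 2.0],
                    [0.0, 1.5, 0.0]], dtype='float32')
y_small = np.array([1, 0], dtype='float32')

demo_buf = io.BytesIO()
dump_svmlight_file(X_small, y_small, demo_buf)

# Each row becomes one text line of the form "label index:value ..."
demo_buf.seek(0)
print(demo_buf.read().decode('utf-8'))

# Reading the buffer back recovers the same features (as a sparse matrix) and labels
demo_buf.seek(0)
X_back, y_back = load_svmlight_file(demo_buf, n_features=3, zero_based=True)
print(X_back.toarray())
print(y_back)
```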

Now we upload the data to S3 using boto3.

In [None]:
key = 'fraud-dataset'
subdir = 'base'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', subdir, key)).upload_fileobj(buf)

s3_train_data = 's3://{}/{}/train/{}/{}'.format(bucket, prefix, subdir, key)
print('Uploaded training data location: {}'.format(s3_train_data))

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

We can now train using SageMaker's built-in XGBoost algorithm. To specify the XGBoost algorithm, we use a utility function to obtain its URI. A complete list of built-in algorithms is found here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='0.90-2')

SageMaker abstracts training via Estimators. We can pass the classifier and parameters along with hyperparameters to the estimator, and fit the estimator to the data in S3. An important parameter here is `scale_pos_weight` which scales the weights of the positive vs. negative class examples. This is crucial to do in an imbalanced dataset like the one we are using here, otherwise the majority class would dominate the learning.

In [None]:
from math import sqrt
from sagemaker import get_execution_role

# Because the data set is so highly skewed, we set the scale position weight conservatively,
# as sqrt(num_nonfraud/num_fraud).
# Other recommendations for the scale_pos_weight are setting it to (num_nonfraud/num_fraud).
scale_pos_weight = sqrt(np.count_nonzero(y_train==0)/np.count_nonzero(y_train))
hyperparams = {
 "max_depth":5,
 "subsample":0.8,
 "num_round":100,
 "eta":0.2,
 "gamma":4,
 "min_child_weight":6,
 "silent":0,
 "objective":'binary:logistic',
 "eval_metric":'auc',
 "scale_pos_weight": scale_pos_weight
}

clf = sagemaker.estimator.Estimator(container,
 get_execution_role(),
 hyperparameters=hyperparams,
 train_instance_count=1, 
 train_instance_type='ml.m4.xlarge',
 output_path=output_location,
 sagemaker_session=session)
clf.fit({'train': s3_train_data})
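
To get a feel for the two `scale_pos_weight` options mentioned in the comments above, the sketch below plugs in the class counts of the full dataset (492 frauds out of 284,807 transactions). This is an illustration only: the notebook computes the weight on `y_train`, so the value it uses will differ slightly.

```python
from math import sqrt

# Class counts of the full dataset, used here only for illustration
num_fraud = 492
num_nonfraud = 284807 - num_fraud

full_ratio = num_nonfraud / num_fraud   # the more aggressive recommendation
conservative = sqrt(full_ratio)         # the conservative choice used in this notebook

print('num_nonfraud / num_fraud       = {:.1f}'.format(full_ratio))
print('sqrt(num_nonfraud / num_fraud) = {:.1f}'.format(conservative))
```

The square root gives a weight that is more than an order of magnitude smaller than the plain ratio, which is why the comments call it the conservative setting.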

### Host Classifier

Now we deploy the estimator to an endpoint.

In [None]:
from sagemaker.predictor import csv_serializer

predictor = clf.deploy(initial_instance_count=1,
 endpoint_name="fraud-detection-endpoint",
 instance_type='ml.m4.xlarge', 
 serializer=csv_serializer,
 deserializer=None,
 content_type='text/csv')

## Evaluation

Once we have trained the model we can use it to make predictions for the test set.

In [None]:
# Because we have a large test set, we call predict on smaller batches
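def predict(current_predictor, data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a comma-separated string of scores; append it to the running string
        predictions = ','.join([predictions, current_predictor.predict(array).decode('utf-8')])

    # Skip the leading comma and parse all scores into a float array
    return np.fromstring(predictions[1:], sep=',')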

In [None]:
raw_preds = predict(predictor, X_test)

We will use a few measures from the scikit-learn package to evaluate the performance of our model. When dealing with an imbalanced dataset, we need to choose metrics that take into account the frequency of each class in the data.

Two such metrics are the [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score), and [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-s-kappa).

In [None]:
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

# scikit-learn expects 0/1 predictions, so we threshold our raw predictions
y_preds = np.where(raw_preds > 0.5, 1, 0)
print("Balanced accuracy = {}".format(balanced_accuracy_score(y_test, y_preds)))
print("Cohen's Kappa = {}".format(cohen_kappa_score(y_test, y_preds)))

We can already see that our model performs very well in terms of both metrics. Cohen's Kappa scores above 0.8 are generally considered very favorable.
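
To make the two metrics concrete, the sketch below computes them by hand on a small made-up example (the arrays `y_true` and `y_pred` are hypothetical) and compares the results with scikit-learn. Balanced accuracy is the mean of the per-class recalls, and Cohen's Kappa corrects the observed agreement for the agreement expected by chance.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

# Hypothetical labels: 8 negatives, 2 positives; one positive is missed
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

# Balanced accuracy: mean of per-class recall
recall_neg = np.mean(y_pred[y_true == 0] == 0)
recall_pos = np.mean(y_pred[y_true == 1] == 1)
balanced_acc = (recall_neg + recall_pos) / 2

# Cohen's Kappa: observed agreement corrected for chance agreement
p_observed = np.mean(y_true == y_pred)
p_chance = (np.mean(y_true == 0) * np.mean(y_pred == 0)
            + np.mean(y_true == 1) * np.mean(y_pred == 1))
kappa = (p_observed - p_chance) / (1 - p_chance)

print('Plain accuracy    = {:.3f}'.format(p_observed))
print('Balanced accuracy = {:.3f} (sklearn: {:.3f})'.format(
    balanced_acc, balanced_accuracy_score(y_true, y_pred)))
print("Cohen's Kappa     = {:.3f} (sklearn: {:.3f})".format(
    kappa, cohen_kappa_score(y_true, y_pred)))
```

In this toy case plain accuracy looks high even though half of the positives are missed, while both imbalance-aware metrics are visibly lower.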

Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix, and per-class precision, recall and f1-score can also provide more information about the model's performance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_predicted):

 cm = confusion_matrix(y_true, y_predicted)
 # Get the per-class normalized value for each cell
 cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
 
 # We color each cell according to its normalized value, annotate with exact counts.
 ax = sns.heatmap(cm_norm, annot=cm, fmt="d")
 ax.set(xticklabels=["non-fraud", "fraud"], yticklabels=["non-fraud", "fraud"])
 ax.set_ylim([0,2])
 plt.title('Confusion Matrix')
 plt.ylabel('Real Classes')
 plt.xlabel('Predicted Classes')
 plt.show()

In [None]:
plot_confusion_matrix(y_test, y_preds)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(
 y_test, y_preds, target_names=['non-fraud', 'fraud']))
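
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below shows this for the positive class on a small hypothetical example (the arrays are made up and unrelated to the model's actual predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were detected
f1 = 2 * precision * recall / (precision + recall)

print('TN={}, FP={}, FN={}, TP={}'.format(tn, fp, fn, tp))
print('precision = {:.2f}, recall = {:.2f}, f1 = {:.2f}'.format(precision, recall, f1))
```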

### Keep sending test traffic to the endpoint via Lambda

We can now show how we could use both of these models in a production system, using HTTP requests to an AWS Lambda function that invokes both the unsupervised and the supervised SageMaker endpoints.

We create a background thread that will constantly create HTTP requests to invoke the Lambda, using our test data as input. See the included `generate_endpoint_traffic.py` file to see how that is done. The output will be logged to Kinesis, and you can also observe it in the Lambda function's CloudWatch logs.

In [None]:
from threading import Thread
from generate_endpoint_traffic import generate_traffic

thread = Thread(target = generate_traffic, args=[np.copy(X_test)])
thread.start()

### SMOTE

Now that we have a baseline model using XGBoost, we can try to see if sampling techniques that are designed specifically for imbalanced problems can improve the performance of the model.

For that purpose we will be using the [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/index.html) package that works well with scikit-learn. We have pre-installed the package for this kernel, but if you need it for a different Jupyter kernel you can install it by running `pip install --upgrade imbalanced-learn` within the conda environment you need.

We will be using the [Synthetic Minority Over-sampling Technique](https://arxiv.org/abs/1106.1813) (SMOTE), which oversamples the minority class by interpolating new data points between existing ones.

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

We can see that SMOTE has now balanced the two classes:

In [None]:
from collections import Counter
print(sorted(Counter(y_smote).items()))
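
The interpolation step behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: it always interpolates towards the single nearest neighbor, whereas SMOTE picks one of several nearest neighbors at random. The points in `minority` and the helper `interpolate_sample` are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0],
                     [2.0, 1.5],
                     [1.5, 3.0]])

def interpolate_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbor."""
    i = rng.randint(len(points))
    base = points[i]
    # Distances from the base point to all points; exclude the base point itself
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf
    neighbor = points[np.argmin(dists)]
    # Pick a random position on the line segment between base and neighbor
    gap = rng.uniform(0, 1)
    return base + gap * (neighbor - base)

synthetic = np.array([interpolate_sample(minority, rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay within the region already covered by the minority class.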

We note that this is a case of extreme oversampling of the minority class: we went from ~0.17% to 50%. An alternative would be to use a smaller resampling ratio, such as having one minority class sample for every `sqrt(non_fraud/fraud)` majority samples, or to use more advanced resampling techniques. See the [comparison](https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py) provided by imbalanced-learn for more over-sampling options.
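
As a rough illustration of the smaller resampling ratio, the sketch below computes the minority-to-majority ratio implied by one minority sample per `sqrt(non_fraud/fraud)` majority samples. The class counts are those of the full dataset and are used only as an example; in practice the ratio would be computed on the training split. The `SMOTE` call in the final comment is an untested suggestion based on imbalanced-learn's `sampling_strategy` parameter, which accepts a float giving the desired minority-to-majority ratio after resampling.

```python
from math import sqrt

# Class counts of the full dataset, used here only for illustration
num_fraud = 492
num_nonfraud = 284315

# One minority sample for every sqrt(non_fraud/fraud) majority samples
target_ratio = 1 / sqrt(num_nonfraud / num_fraud)
target_fraud = int(round(target_ratio * num_nonfraud))

print('Target minority/majority ratio: {:.4f}'.format(target_ratio))
print('Fraud samples after resampling: {}'.format(target_fraud))

# With imbalanced-learn this ratio could be passed as, for example:
# SMOTE(sampling_strategy=target_ratio, random_state=42)
```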

In our case we'll use the SMOTE dataset we just created and upload it to S3 for training.

In [None]:
smote_buf = io.BytesIO()

# Dump the SMOTE data into a buffer
sklearn.datasets.dump_svmlight_file(X_smote, y_smote, smote_buf)
smote_buf.seek(0);

# Upload from the buffer to S3
key = 'fraud-dataset-smote'
subdir = 'smote'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', subdir, key)).upload_fileobj(smote_buf)

s3_smote_train_data = 's3://{}/{}/train/{}/{}'.format(bucket, prefix, subdir, key)
print('Uploaded training data location: {}'.format(s3_smote_train_data))

smote_output_location = 's3://{}/{}/smote-output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(smote_output_location))

In [None]:
try:
 hyperparams.pop("scale_pos_weight") # No need to scale weights after SMOTE resampling, so we remove that parameter
except KeyError:
 # In case the parameter was already removed, do nothing.
 pass
smote_xgb = sagemaker.estimator.Estimator(container,
 get_execution_role(),
 hyperparameters=hyperparams,
 train_instance_count=1, 
 train_instance_type='ml.m4.xlarge',
 output_path=smote_output_location,
 sagemaker_session=session)

In [None]:
smote_xgb.fit({'train': s3_smote_train_data})

After fitting the model we can check its performance to compare it against the base XGBoost model.

In [None]:
smote_predictor = smote_xgb.deploy(initial_instance_count=1,
 endpoint_name="fraud-detection-smote-endpoint",
 instance_type='ml.m4.xlarge')

# Specify input and output formats.
smote_predictor.content_type = 'text/csv'
smote_predictor.serializer = csv_serializer
smote_predictor.deserializer = None

In [None]:
smote_raw_preds = predict(smote_predictor, X_test)
smote_preds = np.where(smote_raw_preds > 0.5, 1, 0)

In [None]:
print("Balanced accuracy = {}".format(balanced_accuracy_score(y_test, smote_preds)))
print("Cohen's Kappa = {}".format(cohen_kappa_score(y_test, smote_preds)))

In [None]:
plot_confusion_matrix(y_test, smote_preds)

In [None]:
print(classification_report(
 y_test, smote_preds, target_names=['non-fraud', 'fraud']))

Due to the randomness of XGBoost your results may vary, but overall you should see a large increase in non-fraud cases being classified as fraud (false positives). This happens because SMOTE has oversampled the fraud class so much that its overlap in feature space with the non-fraud cases has increased.
Since Cohen's Kappa gives more weight to false positives than balanced accuracy does, the metric drops significantly, as do the precision and F1 score for fraud cases. However, we can restore the balance between the metrics by adjusting our classification threshold.

So far we've been using 0.5 as the threshold between labeling a point as fraud or not. We can try different thresholds to see if they affect the result of the classification. To evaluate we'll use the balanced accuracy and Cohen's Kappa metrics.

In [None]:
for thres in np.linspace(0.1, 0.9, num=9):
 smote_thres_preds = np.where(smote_raw_preds > thres, 1, 0)
 print("Threshold: {:.1f}".format(thres))
 print("Balanced accuracy = {:.3f}".format(balanced_accuracy_score(y_test, smote_thres_preds)))
 print("Cohen's Kappa = {:.3f}\n".format(cohen_kappa_score(y_test, smote_thres_preds)))

We see that Cohen's Kappa keeps increasing along with the threshold, without a significant loss in balanced accuracy. This adds a useful knob to our model: We can keep a low threshold if we care more about not missing any fraudulent cases, or we can increase the threshold to try to minimize the number of false positives.
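
The trade-off described above can be reproduced on a handful of made-up scores. In the sketch below, `y_true` and `scores` are hypothetical and unrelated to the model's actual output; the point is only to show how raising the threshold trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical model scores: 6 legitimate and 4 fraudulent transactions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.30, 0.45, 0.60, 0.70, 0.40, 0.65, 0.80, 0.95])

results = {}
for thres in [0.25, 0.5, 0.75]:
    preds = np.where(scores > thres, 1, 0)
    results[thres] = (precision_score(y_true, preds), recall_score(y_true, preds))
    print('Threshold {:.2f}: precision = {:.2f}, recall = {:.2f}'.format(
        thres, *results[thres]))
```

At the lowest threshold every fraud case is caught at the cost of many false positives, while at the highest threshold there are no false positives but half of the fraud cases are missed.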

## Clean up

We will leave the unsupervised and base XGBoost endpoints running at the end of this notebook so we can handle incoming event streams. However, don't forget to delete the prediction endpoints when you're done. You can do that on the Endpoints page of the Amazon SageMaker console, or by running `predictor_name.delete_endpoint()`.

In [None]:
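# Uncomment the next two lines to also delete the unsupervised and base XGBoost endpoints
# rcf_predictor.delete_endpoint()
# predictor.delete_endpoint()
# The SMOTE endpoint was only needed for the comparison above, so we delete it now
smote_predictor.delete_endpoint()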


## Data Acknowledgements

The dataset used to demonstrate the fraud detection solution has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the [DefeatFraud](https://mlg.ulb.ac.be/wordpress/portfolio_page/defeatfraud-assessment-and-validation-of-deep-feature-engineering-and-learning-solutions-for-fraud-detection/) project.

We cite the following works:

* Dal Pozzolo, Andrea; Caelen, Olivier; Johnson, Reid A.; Bontempi, Gianluca. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
* Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon.
* Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE.
* Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).
* Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier.
* Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing.