Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: Apache-2.0


# Notebook for Financial Fraud data exploration 
***Please download the [BankSim data](https://www.kaggle.com/datasets/ealaxi/banksim1) from Kaggle*** before running any of the financial fraud notebooks. Follow the instructions in the Readme (or *notebooks/download_data.ipynb*).

The BankSim dataset is a simulated 6-month dataset with ~587K clean transactions and 7,200 fraud transactions. The first CSV contains the raw transaction data. The second CSV organizes the transactions as a graph, with customers and merchants as the nodes and each transaction as an edge.

## Table of Contents
1. Load raw transaction data
 * Make observations on (customer, merchant) transactions
2. Load raw graph network data
 * Compare with raw data 

In [None]:
import pandas as pd
import numpy as np

## Load in raw transaction data

In [None]:
raw_data_path = '../../data/01_raw/financial_fraud/bs140513_032310.csv'

raw_trans_data = pd.read_csv(raw_data_path)

raw_trans_data.shape

In [None]:
print(raw_trans_data.columns)

### Observation: the raw transaction data has more categorical variables for the customer (age, gender, zipcode) and the merchant (zip) than the network data

In [None]:
raw_trans_data.describe()

In [None]:
raw_trans_data_sorted = raw_trans_data.sort_values(by=['customer', 'step']).reset_index(drop=True)

In [None]:
raw_trans_data_sorted.head()
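The filters in this notebook compare against strings such as `"'C1093826151'"`, which suggests that the ID values are stored with literal single quotes around them. The sketch below shows one way to strip those quotes, assuming a small hypothetical frame (`sample`) in place of the real file. The `amount` column name and all values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature of the raw data: IDs wrapped in literal single quotes.
sample = pd.DataFrame({
    "customer": ["'C1093826151'", "'C1000148617'"],
    "merchant": ["'M348934600'", "'M980657600'"],
    "amount": [4.55, 39.68],  # assumed column name, illustrative values
})

# Work on a copy so the original frame keeps its quoted values.
cleaned = sample.copy()

# Remove the leading and trailing single quote from the listed ID columns.
for col in ["customer", "merchant"]:
    cleaned[col] = cleaned[col].str.strip("'")

print(cleaned)
```

If the quotes were stripped this way, every filter would need unquoted literals (for example `"C1093826151"`). This notebook keeps the quoted form throughout.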

## Dive deeper into the transactions

### Observation: one customer can make multiple transactions at one merchant 

In [None]:
raw_trans_data_sorted.loc[
 (raw_trans_data_sorted.customer=="'C1093826151'")&(raw_trans_data_sorted.merchant=="'M348934600'")
].head()
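To check this observation across all pairs rather than for one hand-picked example, the transactions can be counted per (customer, merchant) pair. This is a minimal sketch on a hypothetical toy frame. The `customer` and `merchant` column names match the notebook, while the IDs and amounts are made up.

```python
import pandas as pd

# Hypothetical toy transactions: C1 buys twice at M1.
toy = pd.DataFrame({
    "customer": ["C1", "C1", "C1", "C2"],
    "merchant": ["M1", "M1", "M2", "M1"],
    "amount": [10.0, 12.5, 7.0, 30.0],  # assumed column name
})

# Count rows per (customer, merchant) pair.
pair_counts = (
    toy.groupby(["customer", "merchant"])
    .size()
    .reset_index(name="n_transactions")
    .sort_values("n_transactions", ascending=False)
)

# Keep only pairs that occur more than once.
repeat_pairs = pair_counts[pair_counts["n_transactions"] > 1]
print(repeat_pairs)
```

On the real data, the same pattern would be applied to `raw_trans_data_sorted` in place of `toy`.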

In [None]:
known_fraud = raw_trans_data_sorted.loc[raw_trans_data_sorted.fraud==1]
known_fraud.head()

In [None]:
known_fraud.shape
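The row count of `known_fraud` is easier to interpret as a share of all transactions. With roughly 587K clean and 7,200 fraud transactions, as stated in the introduction, that share is about 1.2%. The sketch below shows how the share could be computed, using a hypothetical toy frame with a 0/1 `fraud` column like the one filtered above.

```python
import pandas as pd

# Hypothetical toy data: one fraud row out of four.
toy = pd.DataFrame({"fraud": [0, 0, 1, 0]})

# The mean of a 0/1 flag is the fraction of rows flagged as fraud.
fraud_rate = toy["fraud"].mean()

# Per-class shares, which makes the class imbalance explicit.
class_shares = toy["fraud"].value_counts(normalize=True)

print(f"fraud rate: {fraud_rate:.2%}")
print(class_shares)
```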

### Observation: the same (customer, merchant) pair can be flagged as fraud multiple times

In [None]:
known_fraud[known_fraud.duplicated(subset=['customer', 'merchant'])].head()

In [None]:
known_fraud.loc[(known_fraud.customer=="'C1001065306'")&(known_fraud.merchant=="'M980657600'")]
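Note that `duplicated` with its default `keep='first'` marks only the second and later occurrences of a pair, so the first fraud row of each repeated pair is left out of the result. Passing `keep=False` returns every row of a repeated pair. This is a minimal sketch on a hypothetical toy frame; the IDs and `step` values are made up.

```python
import pandas as pd

# Hypothetical fraud rows: the pair (C1, M1) appears three times.
fraud_toy = pd.DataFrame({
    "customer": ["C1", "C1", "C1", "C2"],
    "merchant": ["M1", "M1", "M1", "M2"],
    "step": [3, 8, 9, 5],
})

pair = ["customer", "merchant"]

# Default keep='first': only the 2nd and 3rd (C1, M1) rows are marked.
later_only = fraud_toy[fraud_toy.duplicated(subset=pair)]

# keep=False: all three (C1, M1) rows are marked.
all_repeats = fraud_toy[fraud_toy.duplicated(subset=pair, keep=False)]

print(len(later_only), len(all_repeats))
```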

### Observation: for the same customer and the same purchase category, the fraud flag can differ between transactions

In [None]:
raw_trans_data_sorted.loc[(raw_trans_data_sorted.customer=="'C1000148617'")&(raw_trans_data_sorted.category=="'es_health'")]
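One way to find every (customer, category) group with mixed fraud flags, rather than inspecting a single customer, is to count the distinct `fraud` values per group. This is a minimal sketch on a hypothetical toy frame. The column names follow the notebook and the rows are made up.

```python
import pandas as pd

# Hypothetical toy data: C1 has both a clean and a fraud row in es_health.
toy = pd.DataFrame({
    "customer": ["C1", "C1", "C1", "C2", "C2"],
    "category": ["es_health", "es_health", "es_food", "es_health", "es_health"],
    "fraud": [0, 1, 0, 0, 0],
})

# Number of distinct fraud values per (customer, category) group.
flag_variety = toy.groupby(["customer", "category"])["fraud"].nunique()

# Groups with more than one distinct value contain both 0 and 1.
mixed = flag_variety[flag_variety > 1].reset_index()[["customer", "category"]]
print(mixed)
```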

## Load in the raw network data 

In [None]:
raw_net_data_path = '../../data/01_raw/financial_fraud/bsNET140513_032310.csv'

In [None]:
raw_net_data = pd.read_csv(raw_net_data_path)

In [None]:
raw_net_data.shape

In [None]:
raw_net_data.columns

### Observation: Source is the customer id, Target is the merchant id and Weight is the transaction amount

Most of the remaining features are available only in the raw transaction data.

In [None]:
raw_net_data.loc[(raw_net_data.Source=="'C1093826151'")&(raw_net_data.Target=="'M348934600'")].head()

In [None]:
raw_trans_data_sorted.loc[
 (raw_trans_data_sorted.customer=="'C1093826151'")
 &(raw_trans_data_sorted.merchant=="'M348934600'")
].head()
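The two cells above compare a single pair by eye. A broader check could aggregate both files per (customer, merchant) pair and compare transaction counts and total amounts. The sketch below uses two hypothetical toy frames. `summarize` is an illustrative helper, and the `amount` column name is an assumption; `Source`, `Target`, and `Weight` follow the observation above.

```python
import pandas as pd

# Hypothetical toy versions of the raw and network data.
raw_toy = pd.DataFrame({
    "customer": ["C1", "C1", "C2"],
    "merchant": ["M1", "M1", "M2"],
    "amount": [10.0, 12.5, 30.0],  # assumed column name
})
net_toy = pd.DataFrame({
    "Source": ["C1", "C1", "C2"],
    "Target": ["M1", "M1", "M2"],
    "Weight": [10.0, 12.5, 30.0],
})


def summarize(df, src, dst, value):
    """Illustrative helper: count and total of `value` per (src, dst) pair."""
    return (
        df.groupby([src, dst])[value]
        .agg(n="count", total="sum")
        .reset_index()
        .rename(columns={src: "customer", dst: "merchant"})
    )


# An outer merge keeps pairs that appear in only one of the two sources.
comparison = summarize(raw_toy, "customer", "merchant", "amount").merge(
    summarize(net_toy, "Source", "Target", "Weight"),
    on=["customer", "merchant"],
    how="outer",
    suffixes=("_raw", "_net"),
    indicator=True,
)

# Flag pairs missing from one side or with differing counts or totals.
mismatched = comparison[
    (comparison["_merge"] != "both")
    | (comparison["n_raw"] != comparison["n_net"])
    | ((comparison["total_raw"] - comparison["total_net"]).abs() > 1e-9)
]
print(mismatched)
```

An empty `mismatched` frame would indicate that both sources agree at the pair level. It does not prove that individual rows match one to one.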

# References

Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BANKSIM: A BANK PAYMENTS SIMULATOR FOR FRAUD DETECTION RESEARCH.