{ "cells": [ { "cell_type": "markdown", "id": "3538d800", "metadata": {}, "source": [ "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "\n", "SPDX-License-Identifier: Apache-2.0\n" ] }, { "cell_type": "markdown", "id": "c38ab94f", "metadata": {}, "source": [ "# Notebook for Financial Fraud data exploration \n", "***Please download the [banksim data](https://www.kaggle.com/datasets/ealaxi/banksim1) from Kaggle*** for financial fraud use case following the instructions in Readme (or *notebooks/download_data.ipynb*) first in order to run all the notebooks related to the financial fraud use case\n", "\n", "The BankSim dataset is a simulated 6-month dataset with ~587K clean transactions and 7200 fraud transactions. The first CSV is the raw transaction data, and the second CSV is the transactions organized as a graph which the customer and merchants being the nodes, and the transaction as the edge.\n", "\n", "## Table of Contents\n", "1. Load raw transaction data\n", " * Make observations on (customer, merchant) transactions\n", "2. Load raw graph network data\n", " * Compare with raw data " ] }, { "cell_type": "code", "execution_count": null, "id": "6fe75714", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "a1ef6022", "metadata": {}, "source": [ "## Load in raw transaction data" ] }, { "cell_type": "code", "execution_count": null, "id": "442a4084", "metadata": {}, "outputs": [], "source": [ "raw_data_path = '../../data/01_raw/financial_fraud/bs140513_032310.csv'\n", "\n", "raw_trans_data = pd.read_csv(raw_data_path)\n", "\n", "raw_trans_data.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "69b0cfaa", "metadata": {}, "outputs": [], "source": [ "print(raw_trans_data.columns)" ] }, { "cell_type": "markdown", "id": "f1052dde", "metadata": {}, "source": [ "### Observation: raw trans data has more categorical variables(age, gender. zipcode) for customer and merchant(zip) than the network data" ] }, { "cell_type": "code", "execution_count": null, "id": "5e7b3438", "metadata": {}, "outputs": [], "source": [ "raw_trans_data.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "aba29654", "metadata": {}, "outputs": [], "source": [ "raw_trans_data_sorted = raw_trans_data.sort_values(by=['customer', 'step']).reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "eb4d9a7a", "metadata": {}, "outputs": [], "source": [ "raw_trans_data_sorted.head()" ] }, { "cell_type": "markdown", "id": "5060a1ac", "metadata": {}, "source": [ "## Dive deeper into the transactions" ] }, { "cell_type": "markdown", "id": "aa9c8da9", "metadata": {}, "source": [ "### Observation: one customer can make multiple transactions at one merchant " ] }, { "cell_type": "code", "execution_count": null, "id": "9dc3394e", "metadata": {}, "outputs": [], "source": [ "raw_trans_data_sorted.loc[\n", " (raw_trans_data_sorted.customer==\"'C1093826151'\")&(raw_trans_data_sorted.merchant==\"'M348934600'\")\n", "].head()" ] }, { "cell_type": "code", "execution_count": null, "id": "af2d812f", "metadata": {}, "outputs": [], "source": [ "known_fraud = raw_trans_data_sorted.loc[raw_trans_data_sorted.fraud==1]\n", "known_fraud.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "6c967e7b", "metadata": {}, "outputs": [], "source": [ "known_fraud.shape" ] }, { "cell_type": "markdown", "id": "4712d6f3", "metadata": {}, "source": [ "### Observation: same (customer, merchant) pair can be flagged as fraud multiple times " ] }, { "cell_type": "code", "execution_count": null, "id": "707ee956", "metadata": {}, "outputs": [], "source": [ "known_fraud[known_fraud.duplicated(subset=['customer', 'merchant'])].head()" ] }, { "cell_type": "code", "execution_count": null, "id": "35713a0b", "metadata": {}, "outputs": [], "source": [ "known_fraud.loc[(known_fraud.customer==\"'C1001065306'\")&(known_fraud.merchant==\"'M980657600'\")]" ] }, { "cell_type": "markdown", "id": "fa392449", "metadata": {}, "source": [ "### Observation: for same customer on same category purchase, the fraud flag can be different " ] }, { "cell_type": "code", "execution_count": null, "id": "3b4834db", "metadata": {}, "outputs": [], "source": [ "raw_trans_data_sorted.loc[(raw_trans_data_sorted.customer==\"'C1000148617'\")&(raw_trans_data_sorted.category==\"'es_health'\")]" ] }, { "cell_type": "markdown", "id": "fe78ad6a", "metadata": {}, "source": [ "## Load in the raw network data " ] }, { "cell_type": "code", "execution_count": null, "id": "e422df0b", "metadata": {}, "outputs": [], "source": [ "raw_net_data_path = '../../data/01_raw/financial_fraud/bsNET140513_032310.csv'" ] }, { "cell_type": "code", "execution_count": null, "id": "466c6cf3", "metadata": {}, "outputs": [], "source": [ "raw_net_data = pd.read_csv(raw_net_data_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "cbe331f4", "metadata": {}, "outputs": [], "source": [ "raw_net_data.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "c10d1de0", "metadata": {}, "outputs": [], "source": [ "raw_net_data.columns" ] }, { "cell_type": "markdown", "id": "71905f7f", "metadata": {}, "source": [ "### Observation: Source is the customer id, Target is the merchant id and Weight is the transaction amount\n", "\n", "Most of the features are available in in the raw transaction data instead" ] }, { "cell_type": "code", "execution_count": null, "id": "18460fe7", "metadata": {}, "outputs": [], "source": [ "raw_net_data.loc[(raw_net_data.Source==\"'C1093826151'\")&(raw_net_data.Target==\"'M348934600'\")].head()" ] }, { "cell_type": "code", "execution_count": null, "id": "b94bc8c4", "metadata": {}, "outputs": [], "source": [ "raw_trans_data_sorted.loc[\n", " (raw_trans_data_sorted.customer==\"'C1093826151'\")\n", " &(raw_trans_data_sorted.merchant==\"'M348934600'\")\n", "].head()" ] }, { "cell_type": "markdown", "id": "a92d5224", "metadata": {}, "source": [ "# References\n", "\n", "Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BANKSIM: A BANK PAYMENTS SIMULATOR FOR FRAUD DETECTION RESEARCH." ] }, { "cell_type": "code", "execution_count": null, "id": "c04f5295", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kedro-taddy-venv", "language": "python", "name": "kedro-taddy-venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }