{ "cells": [ { "cell_type": "markdown", "id": "3ecbc89f", "metadata": {}, "source": [ "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "\n", "SPDX-License-Identifier: Apache-2.0\n" ] }, { "cell_type": "markdown", "id": "e2ebcf6a", "metadata": {}, "source": [ "# Prepare Financial Fraud dataset for dynamic graph model (TADDY)\n", "\n", "TADDY is an anomaly detection model that detects anomalous edges in dynamic (time-varying) graphs. It learns edge embeddings that combine spatial information (neighboring nodes and edges) with temporal information. A fully connected layer then classifies each embedding as anomalous or not.\n", "\n", "The model expects graph snapshots with labeled edges, so this notebook prepares the BankSim dataset for the TADDY modeling framework.\n", "\n", "## Table of Contents\n", "1. Process raw transaction data\n", " * Get the edge and node lists needed to build the graph from the raw transaction data. Each transaction is represented as an edge from the customer node to the merchant node. \n", " * Dedupe the data. We keep only the most recent transaction for each (customer, merchant) pair, so each pair is classified once, based on its most recent interaction. \n", " * Create and save the mapping from raw node names/ids (str) to node indexes. These indexes are used to build the graphs, represented as sparse adjacency matrices, during training; that is, each index determines a node's position in the adjacency matrix. For this reason the notebook checks several times that the indexes are correctly aligned. \n", " * Save the labels for each edge in the correct order. \n", " * Split the graph snapshots into train and test sets. Earlier snapshots are used for training and later snapshots for testing. \n", " \n", "2. 
Save all the processed data \n", " * Source nodes of edges are stored as row indexes \n", " * Target nodes of edges are stored as col indexes\n", " * Node indexes of all edges are stored in a sparse matrix (list of list as headtail)" ] }, { "cell_type": "code", "execution_count": null, "id": "47a5a004", "metadata": {}, "outputs": [], "source": [ "import sys \n", "import os" ] }, { "cell_type": "code", "execution_count": null, "id": "6e2cc490", "metadata": {}, "outputs": [], "source": [ "sys.path.append('../../src/')" ] }, { "cell_type": "code", "execution_count": null, "id": "541fac43", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import pickle\n", "\n", "from anomaly_detection_spatial_temporal_data.utils import ensure_directory" ] }, { "cell_type": "markdown", "id": "0d4c6d5d", "metadata": {}, "source": [ "# Load raw data" ] }, { "cell_type": "code", "execution_count": null, "id": "a0a945c9", "metadata": {}, "outputs": [], "source": [ "raw_data_path = '../../data/01_raw/financial_fraud/bs140513_032310.csv'\n", "\n", "raw_trans_data = pd.read_csv(raw_data_path)\n", "\n", "raw_trans_data.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "c813ead7", "metadata": {}, "outputs": [], "source": [ "raw_net_data_path = '../../data/01_raw/financial_fraud/bsNET140513_032310.csv'\n", "\n", "raw_net_trans_data = pd.read_csv(raw_net_data_path)\n", "\n", "raw_net_trans_data.shape" ] }, { "cell_type": "markdown", "id": "12431415", "metadata": {}, "source": [ "# Process edge data for dynamic graph model \n", "## Customer can be treated as source node and merchant can be treated as target node " ] }, { "cell_type": "code", "execution_count": null, "id": "2b5ffe86", "metadata": {}, "outputs": [], "source": [ "edges = raw_trans_data[['step','customer','merchant','category','amount','fraud']]" ] }, { "cell_type": "code", "execution_count": null, "id": "198edfde", "metadata": {}, "outputs": [], "source": [ "# remove self 
loops where customer bought from self\n", "edges = edges.loc[edges.customer!=edges.merchant]\n", "\n", "edges.shape" ] }, { "cell_type": "markdown", "id": "29770eca", "metadata": {}, "source": [ "### Check for duplicated (customer, merchant) pairs " ] }, { "cell_type": "code", "execution_count": null, "id": "d46573d5", "metadata": {}, "outputs": [], "source": [ "customer_merchant_trans_count = edges.groupby(\n", " by=['customer','merchant']\n", ").agg({'step':'count'}) # there are 47132 unique pairs " ] }, { "cell_type": "code", "execution_count": null, "id": "2461c7be", "metadata": {}, "outputs": [], "source": [ "customer_merchant_trans_fraud = edges.groupby(by=['customer','merchant']).agg({'fraud':'sum'})" ] }, { "cell_type": "code", "execution_count": null, "id": "551e7eef", "metadata": {}, "outputs": [], "source": [ "customer_merchant_trans_fraud.columns" ] }, { "cell_type": "markdown", "id": "80f7e360", "metadata": {}, "source": [ "### Observation: 1065 (customer, merchant) pairs were flagged as fraud more than once" ] }, { "cell_type": "code", "execution_count": null, "id": "7875839d", "metadata": {}, "outputs": [], "source": [ "customer_merchant_trans_fraud.loc[customer_merchant_trans_fraud.fraud>1]" ] }, { "cell_type": "markdown", "id": "92a69c5b", "metadata": {}, "source": [ "### Observation: 1108 (customer, merchant) pairs have inconsistent labels (both fraud and non-fraud transactions)" ] }, { "cell_type": "code", "execution_count": null, "id": "b4aabe66", "metadata": {}, "outputs": [], "source": [ "customer_merchant_trans_fraud_consistency = edges.groupby(by=['customer','merchant']).agg({'fraud':'mean'})\n", "customer_merchant_trans_fraud_consistency" ] }, { "cell_type": "code", "execution_count": null, "id": "aa5b04ec", "metadata": {}, "outputs": [], "source": [ "customer_merchant_trans_fraud_consistency.loc[\n", " (customer_merchant_trans_fraud_consistency.fraud!=1) & (customer_merchant_trans_fraud_consistency.fraud!=0) \n", "]" ] }, { "cell_type": "markdown", "id": "4a626f24", "metadata": 
{}, "source": [ "# Dedupe (customer, merchant) pair, only keep the last transaction (the latest)" ] }, { "cell_type": "code", "execution_count": null, "id": "6a78fc34", "metadata": {}, "outputs": [], "source": [ "edges_deduped = edges.drop_duplicates(subset=['customer','merchant'], keep='last', )" ] }, { "cell_type": "code", "execution_count": null, "id": "e7662a21", "metadata": {}, "outputs": [], "source": [ "edges_deduped.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "934feb84", "metadata": {}, "outputs": [], "source": [ "edges_array = np.array(edges_deduped[['customer','merchant']])" ] }, { "cell_type": "markdown", "id": "1d61d443", "metadata": {}, "source": [ "### convert str ids to int indexes " ] }, { "cell_type": "code", "execution_count": null, "id": "fd1567e8", "metadata": {}, "outputs": [], "source": [ "vertexs, edges_1d = np.unique(edges_array, return_inverse=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "2dee5cd7", "metadata": {}, "outputs": [], "source": [ "# vertexs, len(vertexs)" ] }, { "cell_type": "markdown", "id": "8ed25f4b", "metadata": {}, "source": [ "### save str ids to int indexes mapping" ] }, { "cell_type": "code", "execution_count": null, "id": "f862f87a", "metadata": {}, "outputs": [], "source": [ "vertex_to_id = {}\n", "for i,vertex in enumerate(vertexs):\n", " vertex_to_id.setdefault(vertex,i)" ] }, { "cell_type": "code", "execution_count": null, "id": "31507b0d", "metadata": {}, "outputs": [], "source": [ "vertex_to_id_df = pd.DataFrame.from_dict(\n", " vertex_to_id, \n", " orient='index', \n", " columns=['idx']\n", ").reset_index().rename(columns={\"index\": \"name\"})" ] }, { "cell_type": "markdown", "id": "5ef01e18", "metadata": {}, "source": [ "#### save id to index mapping\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7791eadf", "metadata": {}, "outputs": [], "source": [ "vertex_to_id_file_path = \"../../data/02_intermediate/financial_fraud/node_id.csv\"\n", "\n", 
"ensure_directory(vertex_to_id_file_path)\n", "\n", "vertex_to_id_df.to_csv(vertex_to_id_file_path, index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "d8bddf9f", "metadata": {}, "outputs": [], "source": [ "edges_idx = np.reshape(edges_1d, [-1, 2])" ] }, { "cell_type": "code", "execution_count": null, "id": "20c28e50", "metadata": {}, "outputs": [], "source": [ "edges_idx, len(edges_idx)" ] }, { "cell_type": "markdown", "id": "58acb9d6", "metadata": {}, "source": [ "### Check whether the node indexes for the top 3 edge list records are correct \n", "It is critical that the indexes stay aligned with both the raw data and the indexes used in the graph (represented as a sparse matrix)." ] }, { "cell_type": "code", "execution_count": null, "id": "8f0416a4", "metadata": {}, "outputs": [], "source": [ "### manually check the node ids for the node indexes\n", "# (vertexs[3317], vertexs[4148]), (vertexs[2363], vertexs[4154]),(vertexs[3396], vertexs[4127]), (vertexs[3304], vertexs[4130])" ] }, { "cell_type": "code", "execution_count": null, "id": "75a5ee7b", "metadata": {}, "outputs": [], "source": [ "### consistent with the raw data \n", "# edges_deduped.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "d8439b71", "metadata": {}, "outputs": [], "source": [ "# print('vertex:', len(vertexs), 'edge:', len(edges_idx))" ] }, { "cell_type": "markdown", "id": "2d6a560b", "metadata": {}, "source": [ "# Find labels for the edges" ] }, { "cell_type": "code", "execution_count": null, "id": "ee98b54a", "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": null, "id": "4c506eb8", "metadata": {}, "outputs": [], "source": [ "edge_label_arr = np.zeros([edges_deduped.shape[0], 3], dtype=np.int32)\n", "for idx, row in tqdm(edges_deduped.reset_index().iterrows(), total=edges_deduped.shape[0]): # using deduped transactions \n", " edge_label_arr[idx][0] = 
vertex_to_id[row['customer']]\n", " edge_label_arr[idx][1] = vertex_to_id[row['merchant']]\n", " edge_label_arr[idx][2] = row['fraud']" ] }, { "cell_type": "code", "execution_count": null, "id": "c9832d12", "metadata": {}, "outputs": [], "source": [ "edge_label_arr.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "f7d7fbb5", "metadata": {}, "outputs": [], "source": [ "edge_label_postprocessed_df = pd.DataFrame(edge_label_arr, columns=['source','target','label'])" ] }, { "cell_type": "code", "execution_count": null, "id": "e73bd80b", "metadata": {}, "outputs": [], "source": [ "edge_label_postprocessed_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "4ff75ddc", "metadata": {}, "outputs": [], "source": [ "edge_label_df_file_path = \"../../data/02_intermediate/financial_fraud/edge_label.csv\"\n", "edge_list_arr_file_path = \"../../data/02_intermediate/financial_fraud/edge_list.npz\"\n", "\n", "ensure_directory(edge_label_df_file_path)\n", "ensure_directory(edge_list_arr_file_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "1761a812", "metadata": {}, "outputs": [], "source": [ "with open(edge_list_arr_file_path, mode=\"wb\") as f:\n", " np.savez(f,data=edge_label_arr)" ] }, { "cell_type": "markdown", "id": "519adc4d", "metadata": {}, "source": [ "### Check again that the processed data are consistent with the raw data " ] }, { "cell_type": "code", "execution_count": null, "id": "dfe90243", "metadata": {}, "outputs": [], "source": [ "# (vertexs[edge_label_arr[0][0]], vertexs[edge_label_arr[0][1]])" ] }, { "cell_type": "code", "execution_count": null, "id": "b38c5624", "metadata": {}, "outputs": [], "source": [ "# edges_deduped.loc[(edges_deduped.customer ==vertexs[edge_label_arr[0][0]] )& (edges_deduped.merchant ==vertexs[edge_label_arr[0][1]])]" ] }, { "cell_type": "code", "execution_count": null, "id": "030cddfd", "metadata": {}, "outputs": [], "source": [ "#check fraud ratio\n", 
"edge_label_postprocessed_df['label'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "id": "744cbd3e", "metadata": {}, "source": [ "# Split train/test data and generate data for graph dataloader " ] }, { "cell_type": "code", "execution_count": null, "id": "232db566", "metadata": {}, "outputs": [], "source": [ "edges_deduped.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "07fd8665", "metadata": {}, "outputs": [], "source": [ "edges_deduped.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "aa6d44ca", "metadata": {}, "outputs": [], "source": [ "m = len(edge_label_arr) # number of edges \n", "n = len(vertex_to_id_df) # number of nodes \n", "\n", "print(f\"Number of edges: {m}, Number of nodes: {n}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b9add3e5", "metadata": {}, "outputs": [], "source": [ "train_per = 0.5 # split in half \n", "\n", "train_num = int(np.floor(train_per * m))\n", "\n", "train = edge_label_arr[0:train_num, :] # first half: training samples\n", "test = edge_label_arr[train_num:, :] # second half: test samples " ] }, { "cell_type": "code", "execution_count": null, "id": "ddcdb9cd", "metadata": {}, "outputs": [], "source": [ "train.shape, test.shape" ] }, { "cell_type": "markdown", "id": "e07c0ea6", "metadata": {}, "source": [ "# Build graph in the format of a sparse matrix with edge list \n", "Again, it is critical that the indexes stay aligned with both the raw data and the indexes used in the graph (represented as a sparse matrix)." ] }, { "cell_type": "code", "execution_count": null, "id": "515a1343", "metadata": {}, "outputs": [], "source": [ "from scipy.sparse import csr_matrix,coo_matrix,eye" ] }, { "cell_type": "code", "execution_count": null, "id": "78dcae10", "metadata": {}, "outputs": [], "source": [ "train_mat = csr_matrix(\n", " (np.ones([np.size(train, 0)], dtype=np.int32), \n", " (train[:, 0], train[:, 1])),\n", " shape=(n, n))" ] }, { "cell_type": "code", "execution_count": null, 
"id": "b8faa366", "metadata": {}, "outputs": [], "source": [ "train_mat.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "5a307687", "metadata": {}, "outputs": [], "source": [ "train_mat = train_mat + train_mat.transpose() #enforce symmetry " ] }, { "cell_type": "markdown", "id": "246fa3e7", "metadata": {}, "source": [ "#### check edgelist id with the sparse matrix idx" ] }, { "cell_type": "code", "execution_count": null, "id": "d03c6158", "metadata": {}, "outputs": [], "source": [ "# train_mat[3317,4148], train_mat[4148,3317]" ] }, { "cell_type": "code", "execution_count": null, "id": "f14dad4e", "metadata": {}, "outputs": [], "source": [ "# train_mat[86,4145], train_mat[4145,86] #being 0 because this edge is in the test set " ] }, { "cell_type": "code", "execution_count": null, "id": "61b59e6a", "metadata": {}, "outputs": [], "source": [ "train_mat = (train_mat + train_mat.transpose() + eye(n)).tolil() #Convert to List of Lists format" ] }, { "cell_type": "code", "execution_count": null, "id": "36e6c887", "metadata": {}, "outputs": [], "source": [ "headtail = train_mat.rows #store the indexes of edges" ] }, { "cell_type": "code", "execution_count": null, "id": "d1b9db90", "metadata": {}, "outputs": [], "source": [ "# headtail" ] }, { "cell_type": "code", "execution_count": null, "id": "8ce58a03", "metadata": {}, "outputs": [], "source": [ "#check degrees of each source node \n", "degrees = np.array([len(x) for x in headtail])" ] }, { "cell_type": "markdown", "id": "b20a0404", "metadata": {}, "source": [ "# Creating snapshots of graphs for the dataloader of TADDY model" ] }, { "cell_type": "code", "execution_count": null, "id": "e70312c2", "metadata": {}, "outputs": [], "source": [ "snap_size=5000" ] }, { "cell_type": "code", "execution_count": null, "id": "da3ead3e", "metadata": {}, "outputs": [], "source": [ "train_size = int(len(train) / snap_size + 0.5) #making slices of snapshots\n", "test_size = int(len(test) / snap_size + 0.5)" ] }, { 
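"cell_type": "markdown", "id": "c4d1e7a2", "metadata": {}, "source": [ "#### Note on the snapshot count\n", "The snapshot counts above round `len(...) / snap_size` to the nearest integer, so a trailing partial snapshot is kept only when it holds at least half of `snap_size` edges; a shorter tail is left out of the snapshots. The next cell is a minimal sketch, not part of the original pipeline, that illustrates this rule on toy numbers. The helper name `count_snapshots` and the toy sizes are hypothetical." ] }, { "cell_type": "code", "execution_count": null, "id": "9b3f6e58", "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only: hypothetical helper and toy sizes, not used by the pipeline\n", "def count_snapshots(num_edges, size):\n", "    # same round-to-nearest rule as train_size / test_size above\n", "    num_snaps = int(num_edges / size + 0.5)\n", "    # edges that fall outside every snapshot slice\n", "    covered = min(num_edges, num_snaps * size)\n", "    return num_snaps, num_edges - covered\n", "\n", "# tail of 2000 edges is below size/2, so it is not covered by any snapshot\n", "print(count_snapshots(12000, 5000))\n", "# tail of 3000 edges is at least size/2, so it becomes a shorter final snapshot\n", "print(count_snapshots(13000, 5000))" ] }, { 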
"cell_type": "code", "execution_count": null, "id": "64043f58", "metadata": {}, "outputs": [], "source": [ "train_size, test_size" ] }, { "cell_type": "code", "execution_count": null, "id": "83b2fdf3", "metadata": {}, "outputs": [], "source": [ "rows = []\n", "cols = []\n", "weis = []\n", "labs = []\n", "for ii in range(train_size):\n", " start_loc = ii * snap_size\n", " end_loc = (ii + 1) * snap_size\n", "\n", " row = np.array(train[start_loc:end_loc, 0], dtype=np.int32) #source nodes of edges stored as row indexes \n", " col = np.array(train[start_loc:end_loc, 1], dtype=np.int32) #target nodes of edges stored as col indexes \n", " lab = np.array(train[start_loc:end_loc, 2], dtype=np.int32) #labels\n", " wei = np.ones_like(row, dtype=np.int32) #weights of edge (all set to be 1 in this experiment)\n", "\n", " rows.append(row)\n", " cols.append(col)\n", " weis.append(wei) #weights\n", " labs.append(lab) #label" ] }, { "cell_type": "code", "execution_count": null, "id": "fe5c7fd1", "metadata": {}, "outputs": [], "source": [ "for i in range(test_size):\n", " start_loc = i * snap_size\n", " end_loc = (i + 1) * snap_size\n", "\n", " row = np.array(test[start_loc:end_loc, 0], dtype=np.int32)\n", " col = np.array(test[start_loc:end_loc, 1], dtype=np.int32)\n", " lab = np.array(test[start_loc:end_loc, 2], dtype=np.int32)\n", " wei = np.ones_like(row, dtype=np.int32)\n", "\n", " rows.append(row)\n", " cols.append(col)\n", " weis.append(wei)\n", " labs.append(lab)" ] }, { "cell_type": "code", "execution_count": null, "id": "2e38880d", "metadata": {}, "outputs": [], "source": [ "# len(rows), rows[0].shape" ] }, { "cell_type": "code", "execution_count": null, "id": "ad307ef3", "metadata": {}, "outputs": [], "source": [ "# rows[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "1872977c", "metadata": {}, "outputs": [], "source": [ "# len(cols), cols[0].shape" ] }, { "cell_type": "code", "execution_count": null, "id": "20aa6f10", "metadata": {}, "outputs": [], 
"source": [ "# cols[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "28c6fd7e", "metadata": {}, "outputs": [], "source": [ "# len(labs), labs[0].shape" ] }, { "cell_type": "code", "execution_count": null, "id": "24da4ed8", "metadata": {}, "outputs": [], "source": [ "# labs[0]" ] }, { "cell_type": "markdown", "id": "1b298e5c", "metadata": {}, "source": [ "### Save all intermediate graph data" ] }, { "cell_type": "code", "execution_count": null, "id": "c2e29dc3", "metadata": {}, "outputs": [], "source": [ "train_test_data_file_path = '../../data/03_primary/financial_fraud/training_data.pkl'\n", "ensure_directory(train_test_data_file_path)\n", "\n", "train_test_data = (rows,cols,labs,weis,headtail,train_size,test_size,n,m)\n", "\n", "with open(train_test_data_file_path, 'wb') as f:\n", " pickle.dump(train_test_data, f)" ] }, { "cell_type": "markdown", "id": "f09fab72", "metadata": {}, "source": [ "# References\n", "\n", "Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BankSim: A Bank Payments Simulator for Fraud Detection Research.\n", "\n", "Yixin Liu, Shirui Pan, Yu Guang Wang, Fei Xiong, Liang Wang, Qingfeng Chen, and Vincent CS Lee. 2021. Anomaly Detection in Dynamic Graphs via Transformer." ] }, { "cell_type": "code", "execution_count": null, "id": "215bb74b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kedro-taddy-venv", "language": "python", "name": "kedro-taddy-venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }