{ "cells": [ { "cell_type": "markdown", "id": "4347b118", "metadata": {}, "source": [ "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "\n", "SPDX-License-Identifier: Apache-2.0" ] }, { "cell_type": "markdown", "id": "673e1413", "metadata": {}, "source": [ "# Set up Financial Fraud dataloader, model training and inference \n", "\n", "## Table of Contents\n", "1. Load processed graph data (in notebook 1.1) into a data dict for the data loader for model training \n", "* The data dictionary is defined in the referenced TADDY modeling framework to easily fetch relevant data during training \n", "2. Load the model training and data sampling configurations \n", "* Use Eigenvalue decomposition based on the adjacency matrix for substructure sampling. Nodes are sampled across multiple snapshots for the edge of interest based on a defined time window. \n", "3. Pass the data dict in step (1) to the model \n", "4. Train the model \n", "5. Apply model inference on the specific snapshot" ] }, { "cell_type": "code", "execution_count": null, "id": "50793cb0", "metadata": {}, "outputs": [], "source": [ "import sys \n", "import os" ] }, { "cell_type": "code", "execution_count": null, "id": "c0a839b2", "metadata": {}, "outputs": [], "source": [ "sys.path.append('../../')\n", "sys.path.append('../../src/')" ] }, { "cell_type": "code", "execution_count": null, "id": "7ceca9e9", "metadata": {}, "outputs": [], "source": [ "import pickle\n", "import numpy as np\n", "import torch\n", "import yaml\n", "from anomaly_detection_spatial_temporal_data.model.model_config import TaddyConfig\n", "from anomaly_detection_spatial_temporal_data.utils import ensure_directory\n", "from anomaly_detection_spatial_temporal_data.model.dynamic_graph import Taddy" ] }, { "cell_type": "markdown", "id": "25ce8f88", "metadata": {}, "source": [ "# Load processed graph data in notebook 1.1" ] }, { "cell_type": "code", "execution_count": null, "id": "13602e3c", "metadata": {}, "outputs": [], "source": [ "with open(\"../../data/03_primary/financial_fraud/training_data.pkl\", 'rb') as file:\n", " data = pickle.load(file)" ] }, { "cell_type": "code", "execution_count": null, "id": "3c50f4e1", "metadata": {}, "outputs": [], "source": [ "rows, cols, labels, weights, headtail, train_size, test_size, nb_nodes, nb_edges = data" ] }, { "cell_type": "code", "execution_count": null, "id": "df5a9b9b", "metadata": {}, "outputs": [], "source": [ "degrees = np.array([len(x) for x in headtail])\n", "num_snap = test_size + train_size\n", "labels = [torch.LongTensor(label) for label in labels]\n", "\n", "snap_train = list(range(num_snap))[:train_size]\n", "snap_test = list(range(num_snap))[train_size:]" ] }, { "cell_type": "code", "execution_count": null, "id": "e4c78b05", "metadata": {}, "outputs": [], "source": [ "idx = list(range(nb_nodes))\n", "index_id_map = {i:i for i in idx}\n", "idx = np.array(idx)" ] }, { "cell_type": "markdown", "id": "269092b2", "metadata": {}, "source": [ "# Set model training and data sampling configuration and create data dictionary" ] }, { "cell_type": "markdown", "id": "54b380bc", "metadata": {}, "source": [ "### load the model training hyperparameters" ] }, { "cell_type": "code", "execution_count": null, "id": "8f57fb6a", "metadata": {}, "outputs": [], "source": [ "train_config_file = '../../conf/base/parameters/taddy.yml'\n", "\n", "with open(train_config_file, \"r\") as stream:\n", " try:\n", " train_config=yaml.safe_load(stream)\n", " print(train_config)\n", " except yaml.YAMLError as exc:\n", " 
print(exc)" ] }, { "cell_type": "markdown", "id": "1d2469f4", "metadata": {}, "source": [ "### load the data sampling parameters\n", " * window size is the number of snapshots looked back during node sampling \n", " * neighbor number is the number of neighbors to sample close to the source and target nodes of the edge of interest " ] }, { "cell_type": "code", "execution_count": null, "id": "14036469", "metadata": {}, "outputs": [], "source": [ "eigen_file_name = \"../../data/05_model_input/financial_fraud/eigen_tmp.pkl\"\n", "data_loader_config = train_config['data_load_options']\n", "\n", "data_loader_config" ] }, { "cell_type": "markdown", "id": "7cea84d3", "metadata": {}, "source": [ "### Define relevant functions to node sampling, the purpose of eah function is explained in brief docstring. " ] }, { "cell_type": "code", "execution_count": null, "id": "7969f94d", "metadata": {}, "outputs": [], "source": [ "import scipy.sparse as sp\n", "from numpy.linalg import inv\n", "\n", "def normalize(mx):\n", " \"\"\"Row-normalize sparse matrix\"\"\"\n", " rowsum = np.array(mx.sum(1))\n", " r_inv = np.power(rowsum, -1).flatten()\n", " r_inv[np.isinf(r_inv)] = 0.\n", " r_mat_inv = sp.diags(r_inv)\n", " mx = r_mat_inv.dot(mx)\n", " return mx\n", "\n", "def normalize_adj(adj):\n", " \"\"\"Symmetrically normalize adjacency matrix. (0226)\"\"\"\n", " adj = sp.coo_matrix(adj)\n", " rowsum = np.array(adj.sum(1))\n", " d_inv_sqrt = np.power(rowsum, -0.5).flatten()\n", " d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.\n", " d_mat_inv_sqrt = sp.diags(d_inv_sqrt)\n", " return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()\n", "\n", "def adj_normalize(mx):\n", " \"\"\"Row-normalize sparse matrix\"\"\"\n", " rowsum = np.array(mx.sum(1))\n", " r_inv = np.power(rowsum, -0.5).flatten()\n", " r_inv[np.isinf(r_inv)] = 0.\n", " r_mat_inv = sp.diags(r_inv)\n", " mx = r_mat_inv.dot(mx).dot(r_mat_inv)\n", " return mx\n", "\n", "def sparse_mx_to_torch_sparse_tensor(sparse_mx):\n", " \"\"\"Convert a scipy sparse matrix to a torch sparse tensor.\"\"\"\n", " sparse_mx = sparse_mx.tocoo().astype(np.float32)\n", " indices = torch.from_numpy(\n", " np.vstack((sparse_mx.row, sparse_mx.col)).astype(np.int64))\n", " values = torch.from_numpy(sparse_mx.data)\n", " shape = torch.Size(sparse_mx.shape)\n", " return torch.sparse.FloatTensor(indices, values, shape)\n", "\n", "def preprocess_adj(adj):\n", " \"\"\"Preprocessing of adjacency matrix for simple GCN model and conversion to tuple representation. 
(0226)\"\"\"\n", " adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)\n", " # adj_np = np.array(adj.todense())\n", " adj_normalized = normalize_adj(adj + sp.eye(adj.shape[0]))\n", " adj_normalized = sparse_mx_to_torch_sparse_tensor(adj_normalized)\n", " return adj_normalized\n", "\n", "def get_adjs(rows, cols, weights, nb_nodes, eigen_file_name, data_loader_config):\n", " \"\"\"Generate adjacency matrix and conduct eigenvalue decomposition for node sampling\"\"\"\n", " if not os.path.exists(eigen_file_name):\n", " generate_eigen = True\n", " print('Generating eigen as: ' + eigen_file_name)\n", " else:\n", " generate_eigen = False\n", " print('Loading eigen from: ' + eigen_file_name)\n", " with open(eigen_file_name, 'rb') as f:\n", " eigen_adjs_sparse = pickle.load(f)\n", " eigen_adjs = []\n", " for eigen_adj_sparse in eigen_adjs_sparse:\n", " eigen_adjs.append(np.array(eigen_adj_sparse.todense()))\n", "\n", " adjs = []\n", " if generate_eigen:\n", " eigen_adjs = []\n", " eigen_adjs_sparse = []\n", "\n", " for i in range(len(rows)):\n", " adj = sp.csr_matrix((weights[i], (rows[i], cols[i])), shape=(nb_nodes, nb_nodes), dtype=np.float32)\n", " adjs.append(preprocess_adj(adj))\n", " if data_loader_config['compute_s']:\n", " if generate_eigen:\n", " eigen_adj = data_loader_config['c'] * inv((sp.eye(adj.shape[0]) - (1 - data_loader_config['c']) * adj_normalize(adj)).toarray())\n", " for p in range(adj.shape[0]):\n", " eigen_adj[p,p] = 0.\n", " eigen_adj = normalize(eigen_adj)\n", " eigen_adjs.append(eigen_adj)\n", " eigen_adjs_sparse.append(sp.csr_matrix(eigen_adj))\n", "\n", " else:\n", " eigen_adjs.append(None)\n", "\n", " if generate_eigen:\n", " with open(eigen_file_name, 'wb') as f:\n", " pickle.dump(eigen_adjs_sparse, f, pickle.HIGHEST_PROTOCOL)\n", "\n", " return adjs, eigen_adjs" ] }, { "cell_type": "code", "execution_count": null, "id": "6eca68b4", "metadata": {}, "outputs": [], "source": [ "ensure_directory(eigen_file_name)\n", "edges = [np.vstack((rows[i], cols[i])).T for i in range(num_snap)]\n", "adjs, eigen_adjs = get_adjs(rows, cols, weights, nb_nodes, eigen_file_name, data_loader_config)" ] }, { "cell_type": "markdown", "id": "1f71e030", "metadata": {}, "source": [ "### The data dictionary defined in TADDY modeling framework \n", " * X is the node feature matrix (We did not generate node feature for this use case. Hence we are aiming to learn from the graph structural information and its evloving pattern) \n", " * A is the adjacency matrix (a popular way to represent a graph)\n", " * S is the eigen decomposition result of A \n", " * degrees stores all the node degrees\n", " * other keys are self-explanatory: edges store edge list, y is the edge label, snap_train are snapshots for training and snap_test are snapshots for testing. num_snap is the total number of snapshots. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "3891b2b6", "metadata": {}, "outputs": [], "source": [ "data_dict = {\n", " 'X': None, \n", " 'A': adjs, \n", " 'S': eigen_adjs, \n", " 'index_id_map': index_id_map, \n", " 'edges': edges,\n", " 'y': labels, \n", " 'idx': idx, \n", " 'snap_train': snap_train, \n", " 'degrees': degrees,\n", " 'snap_test': snap_test, \n", " 'num_snap': num_snap}" ] }, { "cell_type": "code", "execution_count": null, "id": "8d44731b", "metadata": {}, "outputs": [], "source": [ "train_config" ] }, { "cell_type": "code", "execution_count": null, "id": "ac606e28", "metadata": {}, "outputs": [], "source": [ "#change save path for notebook\n", "train_config['model_options']['save_directory'] = '../../data/07_model_output/financial_fraud' \n", "\n", "if not os.path.exists(train_config['model_options']['save_directory']):\n", " os.makedirs(train_config['model_options']['save_directory'])" ] }, { "cell_type": "code", "execution_count": null, "id": "d08e4727", "metadata": {}, "outputs": [], "source": [ "model_config = TaddyConfig(config=train_config['model_options'])\n", "model_obj = Taddy(data_dict, model_config)\n", "\n", "model_config.save_directory" ] }, { "cell_type": "markdown", "id": "d2dd97e3", "metadata": {}, "source": [ "# Train model" ] }, { "cell_type": "code", "execution_count": null, "id": "ab926544", "metadata": {}, "outputs": [], "source": [ "learned_result,save_model_path = model_obj.run()" ] }, { "cell_type": "markdown", "id": "38f9513d", "metadata": {}, "source": [ "# Model training result " ] }, { "cell_type": "code", "execution_count": null, "id": "ccee993b", "metadata": {}, "outputs": [], "source": [ "learned_result" ] }, { "cell_type": "code", "execution_count": null, "id": "1c1b6e29", "metadata": {}, "outputs": [], "source": [ "save_model_path" ] }, { "cell_type": "markdown", "id": "a43b81c3", "metadata": {}, "source": [ "# Run inference on the specific snapshot " ] }, { "cell_type": "markdown", "id": "1c62a79e", "metadata": {}, "source": [ "### load trained model " ] }, { "cell_type": "code", "execution_count": null, "id": "bafa56f6", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "9b9f22dd", "metadata": {}, "outputs": [], "source": [ "model = torch.load(save_model_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "0d641f4c", "metadata": {}, "outputs": [], "source": [ "snap_num = 9" ] }, { "cell_type": "code", "execution_count": null, "id": "2f7132b6", "metadata": {}, "outputs": [], "source": [ "pred = model.predict(snap_num)" ] }, { "cell_type": "code", "execution_count": null, "id": "5d58104e", "metadata": {}, "outputs": [], "source": [ "from sklearn import metrics\n", "\n", "auc = metrics.roc_auc_score(labels[snap_num],pred)\n", "\n", "auc" ] }, { "cell_type": "markdown", "id": "3647bdba", "metadata": {}, "source": [ "# References\n", "\n", "Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BANKSIM: A BANK PAYMENTS SIMULATOR FOR FRAUD DETECTION RESEARCH.\n", "\n", "Yixin Liu, Shirui Pan, Yu Guang Wang, Fei Xiong, Liang Wang, Qingfeng Chen, and Vincent CS Lee. 2015. Anomaly Detection in Dynamic Graphs via Transformer." 
{ "cell_type": "markdown", "id": "3647bdba", "metadata": {}, "source": [ "# References\n", "\n", "Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BankSim: A Bank Payments Simulator for Fraud Detection Research.\n", "\n", "Yixin Liu, Shirui Pan, Yu Guang Wang, Fei Xiong, Liang Wang, Qingfeng Chen, and Vincent CS Lee. 2021. Anomaly Detection in Dynamic Graphs via Transformer." ] },
{ "cell_type": "code", "execution_count": null, "id": "0acbaf09", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kedro-taddy-venv", "language": "python", "name": "kedro-taddy-venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }