{ "cells": [ { "cell_type": "markdown", "id": "fd46e112", "metadata": {}, "source": [ "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "\n", "SPDX-License-Identifier: Apache-2.0\n" ] }, { "cell_type": "markdown", "id": "4738f3e2", "metadata": {}, "source": [ "# Prepare the Financial Fraud dataset for the Numenta Anomaly Benchmark (NAB)\n", "\n", "The [Numenta Anomaly Benchmark (NAB)](https://github.com/numenta/NAB) includes multiple anomaly detection algorithms for time series. The algorithms identify anomalies contextually, that is, relative to the previous values of the same time series. The NAB repository therefore expects each time series in its own CSV file, with `timestamp` and `value` as the columns.\n", "\n", "## This notebook consists of the following steps\n", "1. Load raw data\n", "2. Process raw data into independent time series for NAB\n", "3. Save a JSON file specifying anomalies" ] }, { "cell_type": "code", "execution_count": null, "id": "cabcf7f0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "id": "7ff96ba1", "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append('../../src/')\n", "\n", "from anomaly_detection_spatial_temporal_data.utils import ensure_directory" ] }, { "cell_type": "markdown", "id": "a6ef58a4", "metadata": {}, "source": [ "## Load raw data" ] }, { "cell_type": "code", "execution_count": null, "id": "6e050f45", "metadata": {}, "outputs": [], "source": [ "raw_data_path = '../../data/01_raw/financial_fraud/bs140513_032310.csv'\n", "\n", "raw_trans_data = pd.read_csv(raw_data_path)" ] }, { "cell_type": "markdown", "id": "05cb5a62", "metadata": {}, "source": [ "## Construct the purchase time series for a single (customer, category) pair\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4c567fef", "metadata": {}, "outputs": [], "source": [ "# Values in the CSV are wrapped in single quotes\n", "example_c = \"\"\"'C1001065306'\"\"\"\n", "example_m = 
\"\"\"'es_health'\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "c5522fc8", "metadata": {}, "outputs": [], "source": [ "# Copy the selection so the in-place rename below does not act on a slice of raw_trans_data\n", "c_m_p_data_example = raw_trans_data.loc[\n", "    (raw_trans_data.customer == example_c)\n", "    & (raw_trans_data.category == example_m)\n", "][['step','amount','fraud']].copy()" ] }, { "cell_type": "code", "execution_count": null, "id": "988f9adf", "metadata": {}, "outputs": [], "source": [ "c_m_p_data_example" ] }, { "cell_type": "code", "execution_count": null, "id": "ccd5a913", "metadata": {}, "outputs": [], "source": [ "example_ts_file_path = f\"\"\"../../data/02_intermediate/financial_fraud/ts_data/{example_c}_{example_m}_transaction_data.csv\"\"\"\n", "print(example_ts_file_path)\n", "\n", "ensure_directory(example_ts_file_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "1d170cc6", "metadata": {}, "outputs": [], "source": [ "# NAB expects the columns to be named timestamp and value\n", "c_m_p_data_example.rename(columns={'step':'timestamp','amount':'value','fraud':'label'}, inplace=True)\n", "c_m_p_data_example[['timestamp','value']].to_csv(example_ts_file_path, index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "ead29d7f", "metadata": {}, "outputs": [], "source": [ "example_ts_label_file_path = f\"\"\"../../data/02_intermediate/financial_fraud/ts_label/{example_c}_{example_m}_transaction_label.csv\"\"\"\n", "print(example_ts_label_file_path)\n", "\n", "ensure_directory(example_ts_label_file_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "95650a3c", "metadata": {}, "outputs": [], "source": [ "c_m_p_data_example[['label']].to_csv(example_ts_label_file_path, index=False)" ] }, { "cell_type": "markdown", "id": "d4264831", "metadata": {}, "source": [ "## Generate the label dict needed for the NAB model" ] }, { "cell_type": "code", "execution_count": null, "id": "df5ea147", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import json\n", "\n", "def generate_dummy_labels(data_dir: str, label_dir: str) -> str:\n", "    \"\"\"Generate a dummy label JSON file and return its path.\"\"\"\n", "    data_dir_path = Path(data_dir)\n", "    dummy_labels = dict()\n", "    # Map each CSV (path relative to data_dir) to an empty list of anomalies\n", "    for file_path in data_dir_path.rglob(\"*.csv\"):\n", "        file_path_relative = file_path.relative_to(data_dir_path)\n", "        dummy_labels[str(file_path_relative)] = []\n", "    dummy_label_path = Path(label_dir) / \"labels-combined.json\"\n", "    with dummy_label_path.open(\"w\") as file:\n", "        json.dump(dummy_labels, file, indent=4)\n", "    return str(dummy_label_path.resolve())" ] }, { "cell_type": "code", "execution_count": null, "id": "9f7e7ada", "metadata": {}, "outputs": [], "source": [ "label_dict_filepath = generate_dummy_labels(\n", "    \"../../data/02_intermediate/financial_fraud/ts_data/\",\n", "    \"../../data/02_intermediate/financial_fraud/ts_label/\"\n", ")" ] }, { "cell_type": "markdown", "id": "df25941e", "metadata": {}, "source": [ "# References\n", "\n", "Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BankSim: A Bank Payments Simulator for Fraud Detection Research.\n", "\n", "Alexander Lavin and Subutai Ahmad. 2015. Evaluating Real-Time Anomaly Detection Algorithms – The Numenta Anomaly Benchmark." ] }, { "cell_type": "code", "execution_count": null, "id": "ca9de484", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "kedro-nab-venv", "language": "python", "name": "kedro-nab-venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }