{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9371c102",
   "metadata": {},
   "source": [
    "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n",
    "\n",
    "SPDX-License-Identifier: Apache-2.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5656a903",
   "metadata": {},
   "source": [
    "# Prepare BATADAL dataset for Numenta Benchmark (NAB)\n",
    "\n",
    "The [Numenta Benchmark](https://github.com/numenta/NAB) consists of multiple anomaly detection algorithms for time-series. The algorithms identify anomalies contextually - based on previous values for that time series. Therefore the NAB repository expects each time series to be in its own CSV with timestamp and value as the columns.\n",
    "\n",
    "This notebook converts the BATADAL CSV into multiple CSVs, one for each system variable. The three BATADAL datasets are not contiguous, and since NAB is contextual, we only use the test BATADAL dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c52c987d",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "\n",
    "1. Load BATADAL data \n",
    "2. Process Test with anomalies\n",
    "  * Append anomaly timestamp labels\n",
    "  * Split each sensor column into its own CSV\n",
    "  * Save each CSV\n",
    "  * Save JSON specifying anomalies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86a7bdfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from collections import defaultdict\n",
    "from typing import List, Tuple, Dict\n",
    "from pathlib import Path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7625f68",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e14e5023",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_no_anom = pd.read_csv(\"../../data/01_raw/iot/BATADAL_dataset03_train_no_anomaly.csv\")\n",
    "train_some_anom = pd.read_csv(\"../../data/01_raw/iot/BATADAL_dataset04_train_some_anomaly.csv\")\n",
    "test_with_anom = pd.read_csv(\"../../data/01_raw/iot/BATADAL_test_dataset_some_anomaly.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f172f187",
   "metadata": {},
   "outputs": [],
   "source": [
    "# has leading white space\n",
    "train_some_anom.columns = train_some_anom.columns.str.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8a1b199",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_no_anom.shape, train_some_anom.shape, test_with_anom.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0460d940",
   "metadata": {},
   "source": [
    "# NAB\n",
    "NAB expects a directly containing CSVs with `timestamp, value` as the columns. Therefore we will split each CSV into multiple CSVs as required"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f38d7f6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sensor_cols = [c for c in test_with_anom.columns if c not in [\"DATETIME\", \"ATT_FLAG\", \"timestamp\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "022aec7a",
   "metadata": {},
   "source": [
    "## Split Test Set\n",
    "\n",
    "As NAB finds anomalies at individual time series level by using previous time steps to predict future time steps, we only need the BATADAL test set as it is 3-months of continuous data with anomalies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f10ac082",
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_str_datetime(anomalies: List[Tuple]) -> List[Tuple]:\n",
    "    fmt =\"%d/%m/%Y %H\"\n",
    "    anomalies_dt = [\n",
    "        (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in anomalies\n",
    "    ]\n",
    "    \n",
    "    return anomalies_dt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e32b27c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def append_anomaly_column(anomalies: List[Tuple], df: pd.DataFrame) -> pd.DataFrame:        \n",
    "    df[\"timestamp\"] = pd.to_datetime(df[\"DATETIME\"], format=\"%d/%m/%y %H\")\n",
    "    df = df.set_index([\"timestamp\"])\n",
    "\n",
    "    df[\"attack\"] = 0\n",
    "    for start, end in anomalies:\n",
    "        df.loc[start:end, \"attack\"] = 1\n",
    "        \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "280af11e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_csv_and_save(\n",
    "    test: pd.DataFrame,\n",
    "    test_anomalies: List[Tuple],\n",
    "    sensor_columns: List[str], \n",
    "    parameters: Dict\n",
    ") -> None:\n",
    "    csv_save_dir = Path(parameters[\"ts_data_dir\"])\n",
    "    label_save_dir = Path(parameters[\"ts_label_dir\"])\n",
    "    csv_save_dir.mkdir(parents=True, exist_ok=True)\n",
    "    label_save_dir.mkdir(parents=True, exist_ok=True)\n",
    "    \n",
    "    anomaly_dict = defaultdict(list)\n",
    "    for c in sensor_columns:\n",
    "        test.reset_index()[[\"timestamp\", c]].rename(\n",
    "            columns={c:\"value\"}\n",
    "        ).to_csv(f\"{csv_save_dir}/{c}.csv\", index=False)\n",
    "\n",
    "        for s_anom, e_anom in test_anomalies:\n",
    "            anomaly_dict[f\"{c}.csv\"].append([\n",
    "                s_anom.strftime('%Y-%m-%d %H:%M:%S.%f'), e_anom.strftime('%Y-%m-%d %H:%M:%S.%f')\n",
    "            ])\n",
    "            \n",
    "    with open(f\"{label_save_dir}/labels-test.json\", \"w\") as fp:\n",
    "        json.dump(anomaly_dict, fp, indent=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cc6419e",
   "metadata": {},
   "source": [
    "### Append anomaly column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ce30e7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# http://www.batadal.net/images/Attacks_TestDataset.png\n",
    "test_anomalies = [\n",
    "    (\"16/01/2017 09\", \"19/01/2017 06\"),\n",
    "    (\"30/01/2017 08\", \"02/02/2017 00\"),\n",
    "    (\"09/02/2017 03\", \"10/02/2017 09\"),\n",
    "    (\"12/02/2017 01\", \"13/02/2017 07\"),\n",
    "    (\"24/02/2017 05\", \"28/02/2017 08\"),\n",
    "    (\"10/03/2017 14\", \"13/03/2017 21\"),\n",
    "    (\"25/03/2017 20\", \"27/03/2017 01\")\n",
    "]\n",
    "\n",
    "test_anomalies_dt = convert_str_datetime(test_anomalies)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f09073b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df = append_anomaly_column(test_anomalies_dt, test_with_anom)\n",
    "test_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bec50550",
   "metadata": {},
   "source": [
    "### Save splitted CSVs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7937d239",
   "metadata": {},
   "outputs": [],
   "source": [
    "parameters = {\n",
    "    \"ts_data_dir\": \"../../data/02_intermediate/iot/ts_data\",\n",
    "    \"ts_label_dir\": \"../../data/02_intermediate/iot/ts_label\"\n",
    "    \n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d134abb",
   "metadata": {},
   "outputs": [],
   "source": [
    "split_csv_and_save(test_df, test_anomalies_dt, sensor_cols, parameters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31e50738",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls \"../../data/02_intermediate/iot/ts_data\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "894ebfc2",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls \"../../data/02_intermediate/iot/ts_label\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34208904",
   "metadata": {},
   "source": [
    "# References\n",
    "Riccardo Taormina and Stefano Galelli and Nils Ole Tippenhauer and Elad Salomons and Avi Ostfeld and Demetrios G. Eliades and Mohsen Aghashahi and Raanju Sundararajan and Mohsen Pourahmadi and M. Katherine Banks and B. M. Brentan and Enrique Campbell and G. Lima and D. Manzi and D. Ayala-Cabrera and M. Herrera and I. Montalvo and J. Izquierdo and E. Luvizotto and Sarin E. Chandy and Amin Rasekh and Zachary A. Barker and Bruce Campbell and M. Ehsan Shafiee and Marcio Giacomoni and Nikolaos Gatsis and Ahmad Taha and Ahmed A. Abokifa and Kelsey Haddad and Cynthia S. Lo and Pratim Biswas and M. Fayzul K. Pasha and Bijay Kc and Saravanakumar Lakshmanan Somasundaram and Mashor Housh and Ziv Ohar; \"The Battle Of The Attack Detection Algorithms: Disclosing Cyber Attacks On Water Distribution Networks.\" Journal of Water Resources Planning and Management, 144 (8), August 2018\n",
    "\n",
    "Alexander Lavin and Subutai Ahmad. 2015. Evaluating Real-Time Anomaly Detection Algorithms – The Numenta Anomaly Benchmark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9805b64e",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}