{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e913ef0e",
   "metadata": {},
   "source": [
    "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n",
    "\n",
    "SPDX-License-Identifier: Apache-2.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be15214f",
   "metadata": {},
   "source": [
    "# Prepare BATADAL dataset for GDN\n",
    "\n",
    "GDN is an unsupervised anomaly detection algorithm that identifies anomalies at a timestep level for an entire system. This system consists of nodes that generate time-series (such as sensors in a water treatment plant), and GDN learns the relationship between the nodes during non-anomalous operation. These learned relationships can then be used at inference time to identify if the system is operating with anomalies.\n",
    "\n",
    "To use GDN, we need a text file containing the list of nodes (will be referred to as 'sensors'), a train CSV containing timesteps and time series with for each sensor, and a test CSV containing timesteps and time series for each sensor, and labels for each time step.\n",
    "\n",
    "This notebook prepares the text file, train and test CSVs from the original data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "420e79ce",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "\n",
    "1. Load BATADAL data\n",
    "2. Process Train with no anomalies\n",
    "  * Scale each sensor column with MinMax scaling\n",
    "  * Save sensor list to text file\n",
    "  * Save as CSV\n",
    "3. Process Train with anomalies\n",
    "  * Append anomaly timestamp labels\n",
    "  * Scale each sensor column with MinMax scaling\n",
    "  * Save as CSV\n",
    "4. Process Test with anomalies\n",
    "  * Append anomaly timestamp labels\n",
    "  * Scale each sensor column with MinMax scaling\n",
    "  * Save as CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39ffff48",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import pathlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "262956ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "source_dir = \"../../data/01_raw/iot\"\n",
    "destination_dir = \"../../data/03_primary/iot\"\n",
    "\n",
    "pathlib.Path(destination_dir).mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f4f1fc9",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_no_anom = pd.read_csv(f\"{source_dir}/BATADAL_dataset03_train_no_anomaly.csv\")\n",
    "train_some_anom = pd.read_csv(f\"{source_dir}/BATADAL_dataset04_train_some_anomaly.csv\")\n",
    "test_with_anom = pd.read_csv(f\"{source_dir}/BATADAL_test_dataset_some_anomaly.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d7930f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# has leading white space\n",
    "train_some_anom.columns = train_some_anom.columns.str.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f708ffa8",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_no_anom.shape, train_some_anom.shape, test_with_anom.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef5b8732",
   "metadata": {},
   "source": [
    "# Train no anomaly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7993dec4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MinMaxScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fc2d9b7",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "SENSOR_COLS = [c for c in train_no_anom.columns if c not in [\"DATETIME\", \"ATT_FLAG\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3bfca3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "normalizer = MinMaxScaler(feature_range=(0, 1)).fit(train_no_anom[SENSOR_COLS])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4019346d",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_no_anom[SENSOR_COLS] = normalizer.transform(train_no_anom[SENSOR_COLS])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c787f6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(f\"{destination_dir}/iot_sensor_list_batadal.txt\", \"w\") as f:\n",
    "    f.writelines(\"\\n\".join(SENSOR_COLS))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cedaa82",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_no_anom.reset_index().rename(columns={\"index\": \"timestamp\"})[[\"timestamp\" ]+ SENSOR_COLS].to_csv(\n",
    "    f\"{destination_dir}/iot_gdn_train.csv\",\n",
    "    index=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09d07001",
   "metadata": {},
   "source": [
    "# Train with anomaly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "916df750",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Tuple, List\n",
    "def append_anomaly_column(anomalies: List[Tuple], df: pd.DataFrame) -> pd.DataFrame:    \n",
    "    fmt =\"%d/%m/%Y %H\"\n",
    "    anomalies_dt = [\n",
    "        (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in anomalies\n",
    "    ]\n",
    "    \n",
    "    df = df.reset_index().rename(columns={\"index\": \"timestamp\"})\n",
    "    df[\"pdDateTime\"] = pd.to_datetime(df[\"DATETIME\"], format=\"%d/%m/%y %H\")\n",
    "    df = df.set_index([\"pdDateTime\"])\n",
    "\n",
    "    df[\"attack\"] = 0\n",
    "    for start, end in anomalies_dt:\n",
    "        df.loc[start:end, \"attack\"] = 1\n",
    "        \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24804237",
   "metadata": {},
   "outputs": [],
   "source": [
    "# from http://www.batadal.net/images/Attacks_TrainingDataset2.png\n",
    "fmt =\"%d/%m/%Y %H\"\n",
    "train_anomalies = [\n",
    "    (\"13/09/2016 23\", \"16/09/2016 00\"),\n",
    "    (\"26/09/2016 11\", \"27/09/2016 10\"),\n",
    "    (\"09/10/2016 09\", \"11/10/2016 20\"),\n",
    "    (\"29/10/2016 19\", \"02/11/2016 16\"),\n",
    "    (\"26/11/2016 17\", \"29/11/2016 04\"),\n",
    "    (\"06/12/2016 07\", \"10/12/2016 04\"),\n",
    "    (\"14/12/2016 15\", \"19/12/2016 04\")\n",
    "]\n",
    "\n",
    "train_anomalies_dt = [\n",
    "    (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in train_anomalies\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "696d5489",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_some_anom = append_anomaly_column(train_anomalies_dt, train_some_anom)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4efbb77c",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_some_anom[\"attack\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f523a9d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_some_anom[SENSOR_COLS] = normalizer.transform(train_some_anom[SENSOR_COLS])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5420404a",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_some_anom[[\"timestamp\"] + SENSOR_COLS + [\"attack\"]].to_csv(\n",
    "    f\"{destination_dir}/iot_gdn_train_anom.csv\", \n",
    "    index=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9349995",
   "metadata": {},
   "source": [
    "# Test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c502fb77",
   "metadata": {},
   "outputs": [],
   "source": [
    "# http://www.batadal.net/images/Attacks_TestDataset.png\n",
    "test_anomalies = [\n",
    "    (\"16/01/2017 09\", \"19/01/2017 06\"),\n",
    "    (\"30/01/2017 08\", \"02/02/2017 00\"),\n",
    "    (\"09/02/2017 03\", \"10/02/2017 09\"),\n",
    "    (\"12/02/2017 01\", \"13/02/2017 07\"),\n",
    "    (\"24/02/2017 05\", \"28/02/2017 08\"),\n",
    "    (\"10/03/2017 14\", \"13/03/2017 21\"),\n",
    "    (\"25/03/2017 20\", \"27/03/2017 01\")\n",
    "]\n",
    "\n",
    "test_anomalies_dt = [\n",
    "    (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in test_anomalies\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41a72c5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_with_anom = append_anomaly_column(test_anomalies_dt, test_with_anom)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d500b964",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_with_anom[\"attack\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da28b24e",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_with_anom[SENSOR_COLS] = normalizer.transform(test_with_anom[SENSOR_COLS])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbcc7786",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_with_anom[[\"timestamp\"] + SENSOR_COLS + [\"attack\"]].to_csv(\n",
    "    f\"{destination_dir}/iot_gdn_test.csv\", \n",
    "    index=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f38a18ed",
   "metadata": {},
   "source": [
    "# References\n",
    "Riccardo Taormina and Stefano Galelli and Nils Ole Tippenhauer and Elad Salomons and Avi Ostfeld and Demetrios G. Eliades and Mohsen Aghashahi and Raanju Sundararajan and Mohsen Pourahmadi and M. Katherine Banks and B. M. Brentan and Enrique Campbell and G. Lima and D. Manzi and D. Ayala-Cabrera and M. Herrera and I. Montalvo and J. Izquierdo and E. Luvizotto and Sarin E. Chandy and Amin Rasekh and Zachary A. Barker and Bruce Campbell and M. Ehsan Shafiee and Marcio Giacomoni and Nikolaos Gatsis and Ahmad Taha and Ahmed A. Abokifa and Kelsey Haddad and Cynthia S. Lo and Pratim Biswas and M. Fayzul K. Pasha and Bijay Kc and Saravanakumar Lakshmanan Somasundaram and Mashor Housh and Ziv Ohar; \"The Battle Of The Attack Detection Algorithms: Disclosing Cyber Attacks On Water Distribution Networks.\" Journal of Water Resources Planning and Management, 144 (8), August 2018\n",
    "\n",
    "Ailin Deng and Bryan Hooi. 2021. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. CoRR abs/2106.06947, (2021). Retrieved from https://arxiv.org/abs/2106.06947 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "daff4d82",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}