Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: Apache-2.0

# Prepare BATADAL dataset for the Numenta Anomaly Benchmark (NAB)

The [Numenta Anomaly Benchmark (NAB)](https://github.com/numenta/NAB) includes multiple anomaly detection algorithms for time series. The algorithms identify anomalies contextually, that is, based on previous values of the same time series. The NAB repository therefore expects each time series to be in its own CSV file with `timestamp` and `value` as the columns.

This notebook converts the BATADAL CSV into multiple CSVs, one for each system variable. The three BATADAL datasets are not contiguous in time, and because NAB detectors rely on the preceding values of a series, we use only the BATADAL test dataset.

## Table of Contents

1. Load BATADAL data 
2. Process Test with anomalies
 * Append anomaly timestamp labels
 * Split each sensor column into its own CSV
 * Save each CSV
 * Save JSON specifying anomalies

In [None]:
import pandas as pd
from collections import defaultdict
from typing import List, Tuple, Dict
from pathlib import Path

In [None]:
import json

In [None]:
train_no_anom = pd.read_csv("../../data/01_raw/iot/BATADAL_dataset03_train_no_anomaly.csv")
train_some_anom = pd.read_csv("../../data/01_raw/iot/BATADAL_dataset04_train_some_anomaly.csv")
test_with_anom = pd.read_csv("../../data/01_raw/iot/BATADAL_test_dataset_some_anomaly.csv")

In [None]:
# Column names in this file have leading whitespace; strip it
train_some_anom.columns = train_some_anom.columns.str.strip()

In [None]:
train_no_anom.shape, train_some_anom.shape, test_with_anom.shape

# NAB
NAB expects a directory containing CSVs with `timestamp, value` as the columns. Therefore we split the BATADAL CSV into multiple CSVs, one per sensor column.
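
As a minimal sketch of that target layout, the toy example below splits a small wide frame into one two-column frame per sensor. The sensor names `S1` and `S2` and the helper `to_nab_frames` are hypothetical and used only for illustration; they are not BATADAL columns or part of NAB.

```python
import pandas as pd

# Toy wide frame; S1 and S2 are hypothetical sensor names.
wide = pd.DataFrame(
    {
        "timestamp": pd.date_range(
            "2017-01-04", periods=3, freq=pd.Timedelta(hours=1)
        ),
        "S1": [1.0, 2.0, 3.0],
        "S2": [10.0, 20.0, 30.0],
    }
)


def to_nab_frames(df, value_cols):
    """Map '<column>.csv' to a frame with columns timestamp, value."""
    return {
        f"{col}.csv": df[["timestamp", col]].rename(columns={col: "value"})
        for col in value_cols
    }


frames = to_nab_frames(wide, ["S1", "S2"])
# Each frame can be written with .to_csv(path, index=False)
print(frames["S1.csv"].to_csv(index=False))
```

Each resulting frame has exactly the two columns described above, so it can be saved as its own CSV file.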

In [None]:
sensor_cols = [c for c in test_with_anom.columns if c not in ["DATETIME", "ATT_FLAG", "timestamp"]]

## Split Test Set

Because NAB detects anomalies in each time series individually, using previous time steps to predict future ones, we only need the BATADAL test set: it provides three months of continuous data with anomalies.
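
One way to check that a series is continuous is to look for jumps between consecutive timestamps. The sketch below assumes hourly readings and uses toy data; `find_gaps` is a hypothetical helper, not part of pandas or NAB.

```python
import pandas as pd


def find_gaps(timestamps, step):
    """Return (previous, current) pairs whose spacing exceeds `step`."""
    ts = pd.Series(pd.to_datetime(timestamps)).sort_values().reset_index(drop=True)
    mask = ts.diff() > step  # the first diff is NaT, which compares as False
    return pd.DataFrame({"previous": ts.shift(1)[mask], "current": ts[mask]})


one_hour = pd.Timedelta(hours=1)
hourly = pd.Series(pd.date_range("2017-01-04", periods=5, freq=one_hour))
# Drop two readings in the middle to simulate a gap.
with_gap = pd.concat([hourly.iloc[:2], hourly.iloc[4:]])

print(find_gaps(hourly, one_hour))    # no rows: the series is continuous
print(find_gaps(with_gap, one_hour))  # one row: the jump from 01:00 to 04:00
```

Applied to the parsed `DATETIME` column of a dataset, an empty result would indicate that no readings are missing at the assumed step.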

In [None]:
def convert_str_datetime(anomalies: List[Tuple]) -> List[Tuple]:
    # Attack windows are written as day/month/four-digit year, then the hour
    fmt = "%d/%m/%Y %H"
    anomalies_dt = [
        (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in anomalies
    ]

    return anomalies_dt
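
The two format strings used in this notebook differ only in the year directive: `%Y` expects a four-digit year (used for the attack windows) and `%y` a two-digit year (used when parsing the `DATETIME` column). A small sketch with illustrative strings shows that both resolve to the same timestamp:

```python
import pandas as pd

# Four-digit year, as in the attack window strings
start = pd.to_datetime("16/01/2017 09", format="%d/%m/%Y %H")
# Two-digit year, as in the format used for the DATETIME column
short_year = pd.to_datetime("16/01/17 09", format="%d/%m/%y %H")

print(start)                # 2017-01-16 09:00:00
print(start == short_year)  # True
```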

In [None]:
def append_anomaly_column(anomalies: List[Tuple], df: pd.DataFrame) -> pd.DataFrame:
    # Note: this adds a "timestamp" column to the frame passed in
    df["timestamp"] = pd.to_datetime(df["DATETIME"], format="%d/%m/%y %H")
    df = df.set_index(["timestamp"])

    df["attack"] = 0
    for start, end in anomalies:
        # Label-based slicing on a DatetimeIndex includes both endpoints
        df.loc[start:end, "attack"] = 1

    return df
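
The labelling step relies on label-based slicing of a `DatetimeIndex`, which includes both endpoints of the window. A minimal sketch on toy hourly data (the values are illustrative, not BATADAL readings):

```python
import pandas as pd

idx = pd.date_range("2017-01-16 07:00", periods=5, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"value": range(5)}, index=idx)

toy["attack"] = 0
window_start = pd.Timestamp("2017-01-16 08:00")
window_end = pd.Timestamp("2017-01-16 10:00")
# Rows at 08:00, 09:00 and 10:00 are all marked, including the end hour
toy.loc[window_start:window_end, "attack"] = 1

print(toy["attack"].tolist())  # [0, 1, 1, 1, 0]
```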

In [None]:
def split_csv_and_save(
    test: pd.DataFrame,
    test_anomalies: List[Tuple],
    sensor_columns: List[str],
    parameters: Dict
) -> None:
    csv_save_dir = Path(parameters["ts_data_dir"])
    label_save_dir = Path(parameters["ts_label_dir"])
    csv_save_dir.mkdir(parents=True, exist_ok=True)
    label_save_dir.mkdir(parents=True, exist_ok=True)

    anomaly_dict = defaultdict(list)
    for c in sensor_columns:
        # One CSV per sensor with the columns timestamp, value
        test.reset_index()[["timestamp", c]].rename(
            columns={c: "value"}
        ).to_csv(f"{csv_save_dir}/{c}.csv", index=False)

        # Every sensor file gets the same attack windows
        for s_anom, e_anom in test_anomalies:
            anomaly_dict[f"{c}.csv"].append([
                s_anom.strftime('%Y-%m-%d %H:%M:%S.%f'), e_anom.strftime('%Y-%m-%d %H:%M:%S.%f')
            ])

    # Write all windows to a single JSON file keyed by CSV file name
    with open(f"{label_save_dir}/labels-test.json", "w") as fp:
        json.dump(anomaly_dict, fp, indent=4)
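
To make the structure of the labels file concrete, the sketch below builds the same kind of mapping for two hypothetical file names (`S1.csv`, `S2.csv`) and one attack window, then round-trips it through JSON. It illustrates what the function above writes and is not a statement of NAB's own label tooling.

```python
import json
from collections import defaultdict

import pandas as pd

# One attack window; the file names below are hypothetical.
windows = [(pd.Timestamp("2017-01-16 09:00"), pd.Timestamp("2017-01-19 06:00"))]
fmt = "%Y-%m-%d %H:%M:%S.%f"

labels = defaultdict(list)
for name in ["S1.csv", "S2.csv"]:
    for start, end in windows:
        labels[name].append([start.strftime(fmt), end.strftime(fmt)])

text = json.dumps(labels, indent=4)
print(text)

# Reading it back gives {file name: [[window start, window end], ...]}
loaded = json.loads(text)
```

Each key is a CSV file name and each value is a list of `[start, end]` timestamp strings, so every sensor series carries the same attack windows.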

### Append anomaly column

In [None]:
# http://www.batadal.net/images/Attacks_TestDataset.png
test_anomalies = [
    ("16/01/2017 09", "19/01/2017 06"),
    ("30/01/2017 08", "02/02/2017 00"),
    ("09/02/2017 03", "10/02/2017 09"),
    ("12/02/2017 01", "13/02/2017 07"),
    ("24/02/2017 05", "28/02/2017 08"),
    ("10/03/2017 14", "13/03/2017 21"),
    ("25/03/2017 20", "27/03/2017 01")
]

test_anomalies_dt = convert_str_datetime(test_anomalies)

In [None]:
test_df = append_anomaly_column(test_anomalies_dt, test_with_anom)
test_df.shape

### Save split CSVs

In [None]:
parameters = {
    "ts_data_dir": "../../data/02_intermediate/iot/ts_data",
    "ts_label_dir": "../../data/02_intermediate/iot/ts_label"
}

In [None]:
split_csv_and_save(test_df, test_anomalies_dt, sensor_cols, parameters)

In [None]:
!ls "../../data/02_intermediate/iot/ts_data"

In [None]:
!ls "../../data/02_intermediate/iot/ts_label"

# References
Riccardo Taormina, Stefano Galelli, Nils Ole Tippenhauer, Elad Salomons, Avi Ostfeld, Demetrios G. Eliades, Mohsen Aghashahi, Raanju Sundararajan, Mohsen Pourahmadi, M. Katherine Banks, B. M. Brentan, Enrique Campbell, G. Lima, D. Manzi, D. Ayala-Cabrera, M. Herrera, I. Montalvo, J. Izquierdo, E. Luvizotto, Sarin E. Chandy, Amin Rasekh, Zachary A. Barker, Bruce Campbell, M. Ehsan Shafiee, Marcio Giacomoni, Nikolaos Gatsis, Ahmad Taha, Ahmed A. Abokifa, Kelsey Haddad, Cynthia S. Lo, Pratim Biswas, M. Fayzul K. Pasha, Bijay Kc, Saravanakumar Lakshmanan Somasundaram, Mashor Housh, and Ziv Ohar. 2018. The Battle Of The Attack Detection Algorithms: Disclosing Cyber Attacks On Water Distribution Networks. Journal of Water Resources Planning and Management, 144 (8), August 2018.

Alexander Lavin and Subutai Ahmad. 2015. Evaluating Real-Time Anomaly Detection Algorithms – The Numenta Anomaly Benchmark.