Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: Apache-2.0

# Prepare BATADAL dataset for NCAD

Neural Contextual Anomaly Detection (NCAD) is an approach to anomaly detection in time series. It splits each time series into overlapping windows, and each window is further split into a context window and a suspect window. A temporal convolutional network learns embeddings for these windows. At inference time, if the embedding of the context window and the embedding of the full window (context + suspect) are far apart, the algorithm predicts an anomaly in the suspect window.
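The windowing step described above can be illustrated with a minimal sketch. This is not NCAD's own implementation: the function name `split_windows` and the parameters `window_length`, `suspect_length`, and `stride` are hypothetical, chosen only to show how overlapping windows and their context/suspect parts relate.

```python
import numpy as np

def split_windows(series, window_length, suspect_length, stride):
    """Cut a (time, channels) array into overlapping windows, then split
    each window into a leading context part and a trailing suspect part.
    Hypothetical helper for illustration only."""
    starts = range(0, len(series) - window_length + 1, stride)
    windows = np.stack([series[s:s + window_length] for s in starts])
    context = windows[:, : window_length - suspect_length]
    suspect = windows[:, window_length - suspect_length:]
    return windows, context, suspect

# Toy input: 10 time steps, 2 sensors.
series = np.arange(20, dtype=float).reshape(10, 2)
windows, context, suspect = split_windows(
    series, window_length=6, suspect_length=2, stride=2
)
print(windows.shape, context.shape, suspect.shape)
```

With a stride smaller than the window length, consecutive windows overlap, so each time step can appear in several windows.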

To use NCAD, we need to prepare train and test CSVs containing the timestamp, time series for each sensor, and label.

This notebook builds three CSVs from the original BATADAL data: a training set with no anomalies, a training set with anomalies, and a test set.

## Table of Contents

1. Load BATADAL data
2. Process Train with no anomalies
   * Save sensor list to text file
   * Save as CSV
3. Process Train with anomalies
   * Append anomaly timestamp labels
   * Save as CSV
4. Process Test with anomalies
   * Append anomaly timestamp labels
   * Save as CSV


In [None]:
import pandas as pd
import pathlib

In [None]:
source_dir = "../../data/01_raw/iot"
destination_dir = "../../data/03_primary/iot"

pathlib.Path(destination_dir).mkdir(parents=True, exist_ok=True)

In [None]:
train_no_anom = pd.read_csv(f"{source_dir}/BATADAL_dataset03_train_no_anomaly.csv")
train_some_anom = pd.read_csv(f"{source_dir}/BATADAL_dataset04_train_some_anomaly.csv")
test_with_anom = pd.read_csv(f"{source_dir}/BATADAL_test_dataset_some_anomaly.csv")

In [None]:
# Column names in this file have leading whitespace; strip it so they match the other files
train_some_anom.columns = train_some_anom.columns.str.strip()
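Why the strip is needed can be seen on a small in-memory CSV. The data below is a toy example, not taken from the BATADAL files: when a header is written with a space after each comma, `pd.read_csv` keeps that space as part of the column name.

```python
import io
import pandas as pd

# Toy CSV whose header has a space after each comma (illustrative data only).
raw = "DATETIME, L_T1, ATT_FLAG\n04/07/16 00, 2.44, 0\n"
toy = pd.read_csv(io.StringIO(raw))
print(list(toy.columns))  # names after the first keep their leading space

toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

An alternative is passing `skipinitialspace=True` to `pd.read_csv`, which skips spaces after the delimiter at read time.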

In [None]:
train_no_anom.shape, train_some_anom.shape, test_with_anom.shape

# Train with no anomalies

In [None]:
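from typing import List, Tuple

def append_anomaly_column(anomalies: List[Tuple], df: pd.DataFrame) -> pd.DataFrame:
    """Index df by its parsed DATETIME column and add a 0/1 `label` column
    that is 1 inside each (start, end) anomaly interval."""
    fmt = "%d/%m/%Y %H"
    # Accepts "dd/mm/YYYY HH" strings or values that are already timestamps.
    anomalies_dt = [
        (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in anomalies
    ]

    df = df.reset_index().rename(columns={"index": "timestamp"})
    # The DATETIME column is parsed with a two-digit year (%y).
    df["pdDateTime"] = pd.to_datetime(df["DATETIME"], format="%d/%m/%y %H")
    df = df.set_index(["pdDateTime"])

    df["label"] = 0
    for start, end in anomalies_dt:
        # Label-based slicing on a DatetimeIndex includes both endpoints.
        df.loc[start:end, "label"] = 1

    return df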
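The labeling step relies on label-based slicing with `.loc`, which includes both the start and the end timestamp. A minimal sketch on toy data (the four rows below are made up for illustration) shows the effect:

```python
import pandas as pd

# Four consecutive hourly rows in the same two-digit-year format as DATETIME.
toy = pd.DataFrame(
    {"DATETIME": ["13/09/16 22", "13/09/16 23", "14/09/16 00", "14/09/16 01"]}
)
toy["pdDateTime"] = pd.to_datetime(toy["DATETIME"], format="%d/%m/%y %H")
toy = toy.set_index("pdDateTime")
toy["label"] = 0

start = pd.to_datetime("13/09/2016 23", format="%d/%m/%Y %H")
end = pd.to_datetime("14/09/2016 00", format="%d/%m/%Y %H")
toy.loc[start:end, "label"] = 1  # both endpoints are labeled

print(toy["label"].tolist())  # [0, 1, 1, 0]
```

Because the end of each interval is inclusive, an interval from hour 23 to hour 00 labels two rows, not one.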

In [None]:
SENSOR_COLS = [c for c in train_no_anom.columns if c not in ["DATETIME", "ATT_FLAG"]]

In [None]:
with open(f"{destination_dir}/iot_sensor_list_batadal.txt", "w") as f:
    # One sensor name per line
    f.write("\n".join(SENSOR_COLS))

In [None]:
train_no_anom = append_anomaly_column([], train_no_anom)

In [None]:
train_no_anom[SENSOR_COLS + ["label"]].to_csv(
 f"{destination_dir}/iot_ncad_train.csv",
 index=False
)
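Note that `index=False` drops the datetime index, so the saved file holds only the sensor columns and `label`. A small sketch with a made-up frame (the column name `L_T1` and values are illustrative) shows what ends up in the CSV:

```python
import io
import pandas as pd

# Toy frame with a datetime index, one sensor column, and a label column.
toy = pd.DataFrame(
    {"L_T1": [2.4, 2.5], "label": [0, 1]},
    index=pd.to_datetime(["2016-09-13 22:00", "2016-09-13 23:00"]),
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # the datetime index is not written
print(buffer.getvalue())
```

If the downstream loader needs an explicit timestamp column, it would have to be included in the selected columns before saving; that is an assumption about the consumer, not something this notebook does.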

# Train with some anomalies

In [None]:
# from http://www.batadal.net/images/Attacks_TrainingDataset2.png
fmt = "%d/%m/%Y %H"
train_anomalies = [
 ("13/09/2016 23", "16/09/2016 00"),
 ("26/09/2016 11", "27/09/2016 10"),
 ("09/10/2016 09", "11/10/2016 20"),
 ("29/10/2016 19", "02/11/2016 16"),
 ("26/11/2016 17", "29/11/2016 04"),
 ("06/12/2016 07", "10/12/2016 04"),
 ("14/12/2016 15", "19/12/2016 04")
]

train_anomalies_dt = [
 (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in train_anomalies
]

In [None]:
train_some_anom = append_anomaly_column(train_anomalies_dt, train_some_anom)

In [None]:
train_some_anom["label"].value_counts()
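One way to sanity-check the counts is to compute how many rows the intervals should label. The sketch below assumes hourly sampling with no missing or duplicated timestamps and non-overlapping intervals; `expected_labeled_hours` is a hypothetical helper, and only the first two training intervals are used to keep the example short.

```python
import pandas as pd

fmt = "%d/%m/%Y %H"
intervals = [
    ("13/09/2016 23", "16/09/2016 00"),
    ("26/09/2016 11", "27/09/2016 10"),
]

def expected_labeled_hours(intervals, fmt):
    """Count hourly rows covered by inclusive (start, end) intervals.
    Assumes one row per hour and no overlap between intervals."""
    total = 0
    for s, e in intervals:
        start = pd.to_datetime(s, format=fmt)
        end = pd.to_datetime(e, format=fmt)
        # +1 because both endpoints are labeled
        total += int((end - start) / pd.Timedelta(hours=1)) + 1
    return total

print(expected_labeled_hours(intervals, fmt))  # 74
```

If the number of rows with `label == 1` differs from this figure, the data likely has gaps or repeated timestamps inside the attack windows.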

In [None]:
train_some_anom[SENSOR_COLS + ["label"]].to_csv(
 f"{destination_dir}/iot_ncad_train_anom.csv", 
 index=False
)

# Test

In [None]:
# http://www.batadal.net/images/Attacks_TestDataset.png
test_anomalies = [
 ("16/01/2017 09", "19/01/2017 06"),
 ("30/01/2017 08", "02/02/2017 00"),
 ("09/02/2017 03", "10/02/2017 09"),
 ("12/02/2017 01", "13/02/2017 07"),
 ("24/02/2017 05", "28/02/2017 08"),
 ("10/03/2017 14", "13/03/2017 21"),
 ("25/03/2017 20", "27/03/2017 01")
]

test_anomalies_dt = [
 (pd.to_datetime(s, format=fmt), pd.to_datetime(e, format=fmt)) for s, e in test_anomalies
]

In [None]:
test_with_anom = append_anomaly_column(test_anomalies_dt, test_with_anom)

In [None]:
test_with_anom["label"].value_counts()

In [None]:
test_with_anom[SENSOR_COLS + ["label"]].to_csv(
 f"{destination_dir}/iot_ncad_test.csv", 
 index=False
)

# References
Riccardo Taormina, Stefano Galelli, Nils Ole Tippenhauer, Elad Salomons, Avi Ostfeld, Demetrios G. Eliades, Mohsen Aghashahi, Raanju Sundararajan, Mohsen Pourahmadi, M. Katherine Banks, B. M. Brentan, Enrique Campbell, G. Lima, D. Manzi, D. Ayala-Cabrera, M. Herrera, I. Montalvo, J. Izquierdo, E. Luvizotto, Sarin E. Chandy, Amin Rasekh, Zachary A. Barker, Bruce Campbell, M. Ehsan Shafiee, Marcio Giacomoni, Nikolaos Gatsis, Ahmad Taha, Ahmed A. Abokifa, Kelsey Haddad, Cynthia S. Lo, Pratim Biswas, M. Fayzul K. Pasha, Bijay Kc, Saravanakumar Lakshmanan Somasundaram, Mashor Housh, and Ziv Ohar. 2018. The Battle of the Attack Detection Algorithms: Disclosing Cyber Attacks on Water Distribution Networks. Journal of Water Resources Planning and Management, 144 (8), August 2018.

Chris U. Carmona, François-Xavier Aubet, Valentin Flunkert, and Jan Gasthaus. 2021. Neural Contextual Anomaly Detection for Time Series.