{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "778b2194",
   "metadata": {},
   "source": [
    "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n",
    "\n",
    "SPDX-License-Identifier: Apache-2.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa1c8cd4",
   "metadata": {},
   "source": [
    "# Prepare WiFi Data for GDN\n",
    "GDN is an unsupervised anomaly detection algorithm that identifies anomalies at a timestep level for an entire system. This system consists of nodes that generate time-series (such as sensors in a water treatment plant), and GDN learns the relationship between the nodes during non-anomalous operation. These learned relationships can then be used at inference time to identify if the system is operating with anomalies.\n",
    "\n",
    "To use GDN, we need a text file containing the list of nodes (will be referred to as 'sensors'), a train CSV containing timesteps and time series with for each sensor, and a test CSV containing timesteps and time series for each sensor, and labels for each time step.\n",
    "\n",
    "\n",
    "This notebook prepares the wifi data and constructs the necessary train, test, and list.txt files for GDN"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa220fb3",
   "metadata": {},
   "source": [
    "## Table of Contents  \n",
    "\n",
    "[Imports](#imports)   \n",
    "[Understanding the Data](#understand-data)   \n",
    "[Setup](#setup)   \n",
    "[Data Cleaning](#data-cleaning)   \n",
    "[Exploratory Data Analysis of Cleaned Data](#EDA)   \n",
    "[Assigning \"Attack\" Labels](#labels)   \n",
    "[Normalizing Data](#normalize-data)\n",
    "[Saving Cleaned Data for GDN](#saving-data)   \n",
    "[References](#references)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9138cf7",
   "metadata": {},
   "source": [
    "## Imports <a name=\"imports\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d8262b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import re\n",
    "import matplotlib.pyplot as plt\n",
    "import random\n",
    "import pathlib\n",
    "from sklearn.preprocessing import MinMaxScaler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc46394f",
   "metadata": {},
   "source": [
    "## Understanding the Data <a name=\"understand-data\"></a>\n",
    " \n",
    "Despite the growing popularity of 802.11 wireless networks, users often suffer from connectivity problems and performance issues due to unstable radio conditions and dynamic user behavior among other reasons. Anomaly detection and distinction are in the thick of major challenges that network managers encounter. This dataset exploits simulation as an effective tool to setup a computationally tractable network. \n",
    "\n",
    "The data features are categorized in two main classes: Density Attributes and Usage Attributes.   \n",
    "1. Density Attributes demonstrate how crowded is the place in terms of active attendant users, characterizing the association population and durability,\n",
    "2. Usage Attributes disclose the volume of the sent and received traffics by the present users, revealing the total bandwidth throughput regardless of how populous is the place and it is more relevant to the applications utilized by the current mobile users.\n",
    "\n",
    "We take a look at the usage attributes for subpopulation Net 1 and Net2. Net1 represents a subpopulation with normal behavior, and Net2 contains anomalies. We try to use graph based anomaly detection approaches to accurately identify the anomalies in Net2. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe7f1c2a",
   "metadata": {},
   "source": [
    "## Setup  <a name=\"setup\"></a>\n",
    "\n",
    "GDN requires the following files:     \n",
    "            1. **list.txt:** the feature names, one feature per line  \n",
    "            2. **train.csv:** training data modeling normal behavior, no anomalies were present according to the paper  \n",
    "            3. **test.csv:** test data.test.csv should have a column named \"attack\" which contains ground truth label (0/1) of being attacked or not (0: normal, 1:\n",
    "            attacked)  \n",
    "\n",
    "This notebook creates the 3 files mentioned above and will save them in `../../data/03_primary/wifi`"
   ]
  },
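  {
   "cell_type": "markdown",
   "id": "a1f0c3e1",
   "metadata": {},
   "source": [
    "To make the expected formats concrete, below is a minimal sketch of the three files for two hypothetical sensors (`host_1`, `host_2`); the real files are generated at the end of this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1f0c3e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustration only: toy versions of the three GDN input files.\n",
    "# The host names and values here are hypothetical.\n",
    "import pandas as pd\n",
    "\n",
    "# list.txt: one sensor name per line\n",
    "print(\"\\n\".join([\"host_1\", \"host_2\"]))\n",
    "\n",
    "# train.csv: one row per timestep, one column per sensor, no anomalies\n",
    "toy_train = pd.DataFrame({\"host_1\": [0.1, 0.2], \"host_2\": [0.3, 0.4]})\n",
    "\n",
    "# test.csv: the same sensor columns plus a 0/1 \"attack\" ground-truth column\n",
    "toy_test = toy_train.assign(attack=[0, 1])\n",
    "toy_test"
   ]
  },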
  {
   "cell_type": "markdown",
   "id": "80e78a7c",
   "metadata": {},
   "source": [
    "## Data Cleaning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ef6b814",
   "metadata": {},
   "source": [
    "Currently, in each Net, each subpopulation contains a UDP directory, containing a sent and received csv for each of the 100 users. Currently, the raw data has a time column and a column formatted `Net2.cliHostx[y].udp.sentPk:vector(packetBytes)` where `y` represents the host. We want to clean the column names to only include the host name and round the time values to the nearest second. This is an example of one of the tables: \n",
    "\n",
    "\n",
    "| time      | Net2.cliHostx[y].udp.sentPk:vector(packetBytes) |\n",
    "| ----------- | ----------- |\n",
    "| 1.15      | 100       |\n",
    "| 63.2   | 231        |\n",
    "\n",
    "\n",
    "In order to run any graph based anomaly detection technique, we need a complete time series dataset. the UDP data is incomplete. Each file represents a user, who is either sending or receiving packets. Per subpopulation, there are 100 users, both sending and receiving packets -- 200 csv files. Within each of these files, we have a time column and a value columm, representing number of packets (in bytes) sent/received.\n",
    "\n",
    "We to fill in the missing time values with 0's since no bytes were either sent or received at that time. This will give us a more complete time series dataset. We want to sum the values for each host. For example, if host 8 received 10 bytes from user 3 and 12 bytes from user 4 at time t, then the value for time t should be 22. Finally, we want to concatenate the datasets to have one large dataset, with the time as index, and the hosts as column names or sensors. We need both a test and a train dataset for GDN. The train data will be from Net1 (normal behavior) and the test data will come from Net 2 (containing anomalies)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72016643",
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_host_data(udp_subpopulation_data_path, sent=False):\n",
    "    all_nums = []\n",
    "    host_user_map = {}\n",
    "    maximum_time = get_maximum_time(udp_subpopulation_data_path)\n",
    "    all_data = []\n",
    "    filename_prefix_check = 'r'\n",
    "    if sent:\n",
    "        filename_prefix_check = 's'\n",
    "        \n",
    "    for csv in os.listdir(net2_sub_0_path):\n",
    "        if csv[0]==filename_prefix_check:\n",
    "            try: \n",
    "                df = pd.read_csv(udp_subpopulation_data_path + csv)\n",
    "                host_column = df.columns[1]\n",
    "                host = int(host_column.split(\"[\")[1].split(\"]\")[0])\n",
    "                df.set_axis(['time', str(host)], axis=1, inplace=True)\n",
    "\n",
    "                if host not in host_user_map:\n",
    "                    host_user_map[host] = []\n",
    "                host_user_map[host].append(csv)\n",
    "                time_value_map = contruct_time_value_map(df)\n",
    "                column_values = fill_time_gaps(df, time_value_map, maximum_time)\n",
    "                df = impute_data(df, maximum_time, column_values)\n",
    "                all_data.append(df)\n",
    "            except:\n",
    "                print(\"cannot load file: \", csv)\n",
    "                pass\n",
    "    return all_data\n",
    "        \n",
    "def get_maximum_time(udp_subpopulation_data_path):\n",
    "    maximum_time = 0\n",
    "    for csv in os.listdir(udp_subpopulation_data_path):\n",
    "        try:\n",
    "            df = pd.read_csv(udp_subpopulation_data_path + csv)\n",
    "            df = df.round({\"time\":0})\n",
    "            maximum_time =max(max(df[\"time\"]), maximum_time)\n",
    "        except:\n",
    "            print(\"cannot load file:   \", csv)    \n",
    "    return maximum_time\n",
    "\n",
    "def contruct_time_value_map(df):\n",
    "    host = df.columns[1]\n",
    "    time_value_map = {}\n",
    "    for index, row in df.iterrows():\n",
    "        time_value_map[int(row[\"time\"])] = row[host]\n",
    "    return time_value_map\n",
    "\n",
    "def fill_time_gaps(df, time_value_map, maximum_time):\n",
    "    column_values = []\n",
    "    for t in range(int(maximum_time)):\n",
    "        if t+1 in time_value_map:\n",
    "            column_values.append(time_value_map[t+1])\n",
    "        else:\n",
    "            column_values.append(0)\n",
    "    return column_values\n",
    "\n",
    "def impute_data(df, maximum_time, column_values):\n",
    "    time = [i+1for i in range(int(maximum_time))]\n",
    "    host = df.columns[1]\n",
    "    df = pd.DataFrame(data={\"time\": [i+1for i in range(int(maximum_time))], host: column_values})\n",
    "    return df\n",
    "\n",
    "def reconstruct_dataset(maximum_time, udp_subpopulation_data_path, output_path=\"\"):\n",
    "    print(\"here\")\n",
    "    for csv in os.listdir(udp_subpopulation_data_path):\n",
    "        try:\n",
    "            df = pd.read_csv(udp_subpopulation_data_path + csv)\n",
    "            time_value_map = contruct_time_value_map(df)\n",
    "            column_values = fill_time_gaps(df, time_value_map, maximum_time)\n",
    "            df = impute_data(df, maximum_time, column_values)\n",
    "            df.to_csv(output_path + csv, index=False)\n",
    "        except:\n",
    "            print(\"cannot load file:   \", csv)\n",
    "            \n",
    "\n",
    "def construct_dataset_aggregate_hosts(all_host_data):\n",
    "    data = {}\n",
    "    for df in all_host_data:\n",
    "        host = df.columns[1]\n",
    "        if host in data:\n",
    "            aggregate_packets = data[host] + df[host]\n",
    "            data[host] = aggregate_packets\n",
    "        else:\n",
    "            data[host] = df[host]\n",
    "\n",
    "    return pd.DataFrame(data=data)\n",
    "            "
   ]
  },
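  {
   "cell_type": "markdown",
   "id": "b2d4e5f1",
   "metadata": {},
   "source": [
    "As a quick sanity check on the helpers above, here is a toy example (the host name and values are made up): a host with readings only at t=1 and t=4 gets zeros filled in for the missing timesteps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2d4e5f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy demonstration of the gap-filling helpers (made-up host and values)\n",
    "toy_df = pd.DataFrame({\"time\": [1.0, 4.0], \"7\": [100, 231]})\n",
    "toy_map = construct_time_value_map(toy_df)       # {1: 100, 4: 231}\n",
    "toy_values = fill_time_gaps(toy_df, toy_map, 5)  # zeros at t = 2, 3, 5\n",
    "impute_data(toy_df, 5, toy_values)"
   ]
  },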
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66a0192b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Net2 sent and received dataframes\n",
    "net2_sub_0_path = \"../../data/01_raw/wifi/wifi_data/Net2/0/UDP/\"\n",
    "net2_cleaned_sent_dfs = clean_host_data(net2_sub_0_path, sent=True)\n",
    "net2_cleaned_rcvd_dfs = clean_host_data(net2_sub_0_path, sent=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "710785a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_cleaned_sent_dfs[0].columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20d875ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Net1 sent and received dataframes\n",
    "net1_sub_0_path =\"../../data/01_raw/wifi/wifi_data/Net1/0/UDP/\"\n",
    "net1_cleaned_sent_dfs = clean_host_data(net1_sub_0_path, sent=True)\n",
    "net1_cleaned_rcvd_dfs = clean_host_data(net1_sub_0_path, sent=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e6eace5",
   "metadata": {},
   "outputs": [],
   "source": [
    "net1_cleaned_sent_dfs[0].columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28099c01",
   "metadata": {},
   "source": [
    "We summed packets from the same hosts. We must do this for both sent and received files. First we aggregate the received (rcvd) files then the sent files. We will then get the sent - received values for each host."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16f718db",
   "metadata": {},
   "outputs": [],
   "source": [
    "net1_sent_aggregated_df = construct_dataset_aggregate_hosts(net1_cleaned_sent_dfs)\n",
    "net1_sent_aggregated_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43635ed7",
   "metadata": {},
   "outputs": [],
   "source": [
    "net1_rcvd_aggregated_df = construct_dataset_aggregate_hosts(net1_cleaned_rcvd_dfs)\n",
    "net1_rcvd_aggregated_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d64bed98",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_sent_aggregated_df = construct_dataset_aggregate_hosts(net2_cleaned_sent_dfs)\n",
    "net2_rcvd_aggregated_df = construct_dataset_aggregate_hosts(net2_cleaned_rcvd_dfs)\n",
    "\n",
    "net2_sent_aggregated_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fc1f1dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_rcvd_aggregated_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9442d89d",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis of Cleaned Data <a name=\"EDA\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8179331e",
   "metadata": {},
   "outputs": [],
   "source": [
    "net1_sent_df_hosts = set(net1_sent_aggregated_df.columns)\n",
    "net1_rcvd_df_hosts  = set(net1_rcvd_aggregated_df.columns)\n",
    "net1_rcvd_df_hosts == net1_sent_df_hosts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd5e780c",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_sent_df_hosts = set(net2_sent_aggregated_df.columns)\n",
    "net2_rcvd_df_hosts  = set(net2_rcvd_aggregated_df.columns)\n",
    "net2_rcvd_df_hosts == net2_sent_df_hosts\n",
    "\n",
    "#verifying the hosts are the same\n",
    "net2_rcvd_df_hosts == net2_sent_df_hosts"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82635ac9",
   "metadata": {},
   "source": [
    "We subtract the received bytes from the sent bytes. Ideally, the same number of packets sent, should also be received. Understanding the number of leftover packets will help us understand the data and aggregate it into one dataset in a meaningful way. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be75ab04",
   "metadata": {},
   "outputs": [],
   "source": [
    "def subtract_received_from_sent(sent_df, rcvd_df):\n",
    "    sent_minus_received = {}\n",
    "    for host in sent_df:\n",
    "        sent_minus_received[host]=sent_df[host] - rcvd_df[host]\n",
    "    df = pd.DataFrame(data=sent_minus_received)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56611aca",
   "metadata": {},
   "outputs": [],
   "source": [
    "net1_subtracted_data = subtract_received_from_sent(net1_sent_aggregated_df, net1_rcvd_aggregated_df)\n",
    "net2_subtracted_data = subtract_received_from_sent(net2_sent_aggregated_df, net2_rcvd_aggregated_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b4197b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_subtracted_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af5f4d5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "net1_subtracted_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c8e4ac8",
   "metadata": {},
   "outputs": [],
   "source": [
    "column_names = net1_subtracted_data.columns\n",
    "column_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1385cb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keeping only the first 1200 rows, 1200 seconds so the normal behavior and anomaly data are the same length\n",
    "net1_subtracted_data=net1_subtracted_data.head(1200)\n",
    "len(net1_subtracted_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed2575e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "len(net2_subtracted_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8438d920",
   "metadata": {},
   "source": [
    "Plotting number of packets per host, overlaying the anomaly data and the normal data to see if we can visualize anomalies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ae29e40",
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_num_packets(net1, net2, host):\n",
    "    fig = plt.figure(figsize=(15, 10))\n",
    "    time = [i for i in range(1200)]\n",
    "    plt.scatter(time, net2[host], alpha = 0.3, label=\"Anomaly Net2\")\n",
    "    plt.scatter(time, net1[host], alpha=0.3, color = \"orange\", label= \"Normal Behavior Net1\")\n",
    "    title = \"Number of Packets Host: \" + host\n",
    "    plt.title(title)\n",
    "    plt.xlabel(\"Time\")\n",
    "    plt.ylabel(\"Number of Packets Sent- Received \")\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8909080a",
   "metadata": {},
   "outputs": [],
   "source": [
    "for host in column_names[:4]:\n",
    "    plot_num_packets(net1_subtracted_data, net2_subtracted_data, host)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62a5c1a4",
   "metadata": {},
   "source": [
    "## Assigning \"Attack\" Labels <a name=\"labels\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46eaae21",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_subtracted_data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "279f16d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_subtracted_data.head(n=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1fd96f3",
   "metadata": {},
   "source": [
    "To assign attack labels, we will subtract the packets for the normal and anomaly data.  \n",
    "0: [2*Lower Quartile, 2*Upper Quartile]  \n",
    "1: otherwise  "
   ]
  },
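  {
   "cell_type": "markdown",
   "id": "c3a7b8d1",
   "metadata": {},
   "source": [
    "As a small illustration of this rule with made-up numbers: for one host with absolute differences [0, 1, 2, 3, 50], the lower quartile is 1 and the upper quartile is 3, so the accepted interval is [-2, 6] and only the timestep with value 50 is labeled an attack."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3a7b8d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration of the quartile-based labeling rule (made-up values)\n",
    "toy_diff = pd.Series([0, 1, 2, 3, 50])  # |normal - anomaly| for one host\n",
    "lo = -2 * toy_diff.quantile(0.25)\n",
    "hi = 2 * toy_diff.quantile(0.75)\n",
    "# only 50 falls outside [lo, hi], so only its timestep gets label 1\n",
    "[(v, int((v < lo or v > hi) and int(v) != 0)) for v in toy_diff]"
   ]
  },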
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66289bf5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def assign_attack_labels(normal_data, \n",
    "                         anomaly_data):\n",
    "    all_data = {}\n",
    "    for host in normal_data.columns:\n",
    "        all_data[host] =  abs(normal_data[host]-anomaly_data[host])\n",
    "    all_data = pd.DataFrame(all_data)\n",
    "    \n",
    "    \n",
    "    all_host_stats = {}\n",
    "    for host in all_data.columns:\n",
    "        host_data = dict(all_data[host].describe())\n",
    "        all_host_stats[host]=host_data\n",
    "        \n",
    "    all_host_stats\n",
    "    anomaly_times = set()\n",
    "    for host in all_data.columns:\n",
    "        min_accepted_value = -2*all_host_stats[host][\"25%\"]\n",
    "        max_accepted_value =  2*all_host_stats[host][\"75%\"]\n",
    "        for index, row in all_data.iterrows():\n",
    "            value = row[host]\n",
    "            if (value < min_accepted_value or value > max_accepted_value) and (int(value) != 0):\n",
    "                anomaly_times.add(index)\n",
    "                \n",
    "    anomaly_data[\"attack\"] = [0 for i in range(len(anomaly_data))]\n",
    "    for index, row in anomaly_data.iterrows():\n",
    "        if index in anomaly_times:\n",
    "            anomaly_data[\"attack\"][index]= 1\n",
    "    return anomaly_data, anomaly_times\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "134d8fe3",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_with_labels, anomaly_times = assign_attack_labels(net1_subtracted_data, net2_subtracted_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6899c427",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_with_labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "489431fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Notice how there are no zeros in the anomaly times\n",
    "print(sorted(list(anomaly_times)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80afc50a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fraction of data classified as anomalies\n",
    "len(anomaly_times)/1200"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe2b11df",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verifying the attack column has been populated properly. Recall, there were 400 anomalies identified from our method above\n",
    "sum(net2_with_labels[\"attack\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b503a04f",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_with_labels.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67045f7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_with_labels.head(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b24a866",
   "metadata": {},
   "source": [
    "Plotting the number of packets sent - received. This way we can see where the anomalies actually lie. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "460246a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_num_packets_overlay_attack(normal, anomaly, host):\n",
    "    fig = plt.figure(figsize=(15, 10))\n",
    "    plt.scatter(anomaly.index, anomaly[host], alpha = 0.3, label=\"Anomaly\")\n",
    "    plt.scatter(normal.index, normal[host], alpha=0.3, color = \"orange\", label= \"Normal Behavior\")\n",
    "    plt.scatter(normal.index, anomaly[\"attack\"], alpha=0.3, color = \"red\", label= \"Classified Anomaly\")\n",
    "    title = \"Number of Packets with Highlighted Anomalies Host:  \" + host\n",
    "    plt.title(title)\n",
    "    plt.xlabel(\"Time\")\n",
    "    plt.ylabel(\"Number of Packets Sent-Received\")\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7cf993ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "column_names = net2_with_labels.columns[:-1]\n",
    "column_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9591ae5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "for host in column_names[:4]:\n",
    "    plot_num_packets_overlay_attack(net1_subtracted_data, net2_with_labels, host)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2412d0a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "net1_subtracted_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6cb267ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "net2_with_labels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78484d5b",
   "metadata": {},
   "source": [
    "## Normalizing Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11143bae",
   "metadata": {},
   "outputs": [],
   "source": [
    "# max min(0-1)\n",
    "def norm(train, test):\n",
    "    normalizer = MinMaxScaler(feature_range=(0, 1)).fit(train) # scale training data to [0,1] range\n",
    "    train_ret = normalizer.transform(train)\n",
    "    test_ret = normalizer.transform(test)\n",
    "\n",
    "    return train_ret, test_ret"
   ]
  },
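  {
   "cell_type": "markdown",
   "id": "d4c9e0a1",
   "metadata": {},
   "source": [
    "Since the scaler is fit on the training data only, test values outside the training range map outside [0, 1]. A quick toy check with made-up numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4c9e0a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy check of norm(): fit on train, transform both (made-up numbers)\n",
    "toy_train = np.array([[0.0], [10.0]])\n",
    "toy_test = np.array([[5.0], [20.0]])\n",
    "norm(toy_train, toy_test)  # train -> [[0.], [1.]], test -> [[0.5], [2.]]"
   ]
  },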
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee2f5276",
   "metadata": {},
   "outputs": [],
   "source": [
    "test = net2_with_labels\n",
    "train = net1_subtracted_data\n",
    "attack_column = test[\"attack\"]\n",
    "\n",
    "test = test.iloc[:, 1:]\n",
    "train = train.iloc[:, 1:]\n",
    "\n",
    "train = train.fillna(train.mean())\n",
    "test = test.fillna(test.mean())\n",
    "train = train.fillna(0)\n",
    "test = test.fillna(0)\n",
    "\n",
    "train_columns = train.columns\n",
    "test_columns = test.columns\n",
    "\n",
    "# trim column names\n",
    "train = train.rename(columns=lambda x: x.strip())\n",
    "test = test.rename(columns=lambda x: x.strip())\n",
    "\n",
    "print(len(test.columns),test.columns)\n",
    "print(len(train.columns),train.columns)\n",
    "\n",
    "\n",
    "# train_labels = train.attack\n",
    "test_labels = test.attack\n",
    "\n",
    "# train = train.drop(columns=['attack'])\n",
    "test = test.drop(columns=['attack'])\n",
    "\n",
    "\n",
    "x_train, x_test = norm(train.values, test.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "182b018f",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.DataFrame(x_train, columns = train_columns)\n",
    "test_df =pd.DataFrame(x_test, columns = test_columns[:-1])\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb67e4f0",
   "metadata": {},
   "source": [
    "## Saving Cleaned Data for GDN   <a name=\"saving-data\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04878d57",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df[\"attack\"] = attack_column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1bcd01ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "def descriptive_column_names(df):\n",
    "    new_column_names = {}\n",
    "    for column in df.columns:\n",
    "        if column != \"attack\":\n",
    "            new_column_names[column] = \"host_\"+column\n",
    "    return new_column_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23939d3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.rename(columns=descriptive_column_names(train_df), inplace=True)\n",
    "test_df.rename(columns=descriptive_column_names(test_df), inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "311eacd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3717123a",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf2c878b",
   "metadata": {},
   "outputs": [],
   "source": [
    "pathlib.Path(\"../../data/03_primary/wifi/gdn\").mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7df3f921",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.to_csv(\"../../data/03_primary/wifi/gdn/wifi_gdn_train.csv\", index=False)\n",
    "test_df.to_csv(\"../../data/03_primary/wifi/gdn/wifi_gdn_test.csv\", index=False)\n",
    "\n",
    "with open(\"../../data/03_primary/wifi/gdn/wifi_sensor_list.txt\", \"w\") as f:\n",
    "    f.writelines(\"\\n\".join(train_df.columns))"
   ]
  },
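  {
   "cell_type": "markdown",
   "id": "e5f1a2b3",
   "metadata": {},
   "source": [
    "As an optional sanity check, we can read the files back and confirm that the sensor list matches the train columns and that the test set carries the extra `attack` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5f1a2b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: reload the saved files and compare shapes\n",
    "saved_train = pd.read_csv(\"../../data/03_primary/wifi/gdn/wifi_gdn_train.csv\")\n",
    "saved_test = pd.read_csv(\"../../data/03_primary/wifi/gdn/wifi_gdn_test.csv\")\n",
    "with open(\"../../data/03_primary/wifi/gdn/wifi_sensor_list.txt\") as f:\n",
    "    sensors = f.read().splitlines()\n",
    "assert list(saved_train.columns) == sensors\n",
    "print(saved_train.shape, saved_test.shape, len(sensors))"
   ]
  },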
  {
   "cell_type": "markdown",
   "id": "041baf16",
   "metadata": {},
   "source": [
    "# References <a name=\"references\"></a>\n",
    "\n",
    "Anisa Allahdadi and Ricardo Morla. 2017. 802.11 Wireless Access Point Usage Simulation and Anomaly Detection. CoRR abs/1707.02933, (2017). Retrieved from http://arxiv.org/abs/1707.02933 \n",
    "\n",
    "Ailin Deng and Bryan Hooi. 2021. Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. CoRR abs/2106.06947, (2021). Retrieved from https://arxiv.org/abs/2106.06947 "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "kedro-gdn-venv",
   "language": "python",
   "name": "kedro-gdn-venv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}