{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Amazon Lookout for Equipment** - Demonstration on an anonymized compressor dataset\n",
    "*Part 2: Dataset creation*\n",
    "\n",
    "**Note:** If you haven't created an IAM role for Amazon Lookout for Equipment, first please follow these [**set of instructions to create an IAM role**](https://github.com/dast1/l4e_iam_role_configuration/blob/main/configure_IAM_role.md)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialization\n",
    "---\n",
    "Following the data preparation notebook, this repository should now be structured as follow:\n",
    "```\n",
    "/lookout-equipment-demo/getting_started/\n",
    "|\n",
    "├── data/\n",
    "|   |\n",
    "|   ├── labelled-data/\n",
    "|   |   └── labels.csv\n",
    "|   |\n",
    "|   └── training-data/\n",
    "|       └── expander/\n",
    "|           ├── subsystem-01\n",
    "|           |   └── subsystem-01.csv\n",
    "|           |\n",
    "|           ├── subsystem-02\n",
    "|           |   └── subsystem-02.csv\n",
    "|           |\n",
    "|           ├── ...\n",
    "|           |\n",
    "|           └── subsystem-24\n",
    "|               └── subsystem-24.csv\n",
    "|\n",
    "├── dataset/                                <<< Original dataset <<<\n",
    "|   ├── labels.csv\n",
    "|   ├── tags_description.csv\n",
    "|   ├── timeranges.txt\n",
    "|   └── timeseries.zip\n",
    "|\n",
    "├── notebooks/\n",
    "|   ├── 1_data_preparation.ipynb\n",
    "|   ├── 2_dataset_creation.ipynb            <<< This notebook <<<\n",
    "|   ├── 3_model_training.ipynb\n",
    "|   ├── 4_model_evaluation.ipynb\n",
    "|   ├── 5_inference_scheduling.ipynb\n",
    "|   └── config.py\n",
    "|\n",
    "└── utils/\n",
    "    ├── aws_matplotlib_light.py\n",
    "    └── lookout_equipment_utils.py\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Notebook configuration update"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --quiet --upgrade tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports\n",
    "<span style=\"color: white; background-color: OrangeRed; padding: 0px 15px 0px 15px; border-radius: 20px;\">**Note:** Update the content of the **config.py** file **before** running the following cell</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import config\n",
    "import os\n",
    "import pandas as pd\n",
    "import pprint\n",
    "import sagemaker\n",
    "import sys\n",
    "import time\n",
    "\n",
    "from datetime import datetime\n",
    "\n",
    "# Helper functions for managing Lookout for Equipment API calls:\n",
    "sys.path.append('../utils')\n",
    "import lookout_equipment_utils as lookout"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# warnings.filterwarnings('ignore')\n",
    "DATA       = os.path.join('..', 'data')\n",
    "LABEL_DATA = os.path.join(DATA, 'labelled-data')\n",
    "TRAIN_DATA = os.path.join(DATA, 'training-data', 'expander')\n",
    "\n",
    "ROLE_ARN        = sagemaker.get_execution_role()\n",
    "REGION_NAME     = boto3.session.Session().region_name\n",
    "DATASET_NAME    = config.DATASET_NAME\n",
    "BUCKET          = config.BUCKET\n",
    "PREFIX_TRAINING = config.PREFIX_TRAINING\n",
    "PREFIX_LABEL    = config.PREFIX_LABEL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List of the directories from the training data \n",
    "# directory: each directory corresponds to a subsystem:\n",
    "components = []\n",
    "for root, dirs, files in os.walk(f'{TRAIN_DATA}'):\n",
    "    for subsystem in dirs:\n",
    "        components.append(subsystem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a dataset\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create data schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we need to setup the schema of your dataset. In the below cell, please define `DATASET_COMPONENT_FIELDS_MAP`. `DATASET_COMPONENT_FIELDS_MAP` is a Python dictonary (hashmap). The key of each entry in the dictionary is the `Component` name, and the value of each entry is a list of column names. The column names must exactly match the header in your csv files. The order of the column names also need to exactly match. As an example, if we want to create the data schema for the example we are using here, the dictionary will look like this:\n",
    "\n",
    "```json\n",
    "DATASET_COMPONENT_FIELDS_MAP = {\n",
    "    \"Component1\": ['Timestamp', 'Tag1', 'Tag2',...],\n",
    "    \"Component2\": ['Timestamp', 'Tag1', 'Tag2',...]\n",
    "    ...\n",
    "    \"ComponentN\": ['Timestamp', 'Tag1', 'Tag2',...]\n",
    "}\n",
    "```\n",
    "\n",
    "Make sure the component name **matches exactly** the name of the folder in S3 (everything is **case sensitive**):\n",
    "```json\n",
    "DATASET_COMPONENT_FIELDS_MAP = {\n",
    "    \"subsystem-01\": ['Timestamp', 'signal-026', 'signal-027',... , 'signal-092'],\n",
    "    \"subsystem-02\": ['Timestamp', 'signal-022', 'signal-023',... , 'signal-096'],\n",
    "    ...\n",
    "    \"subsystem-24\": ['Timestamp', 'signal-083'],\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATASET_COMPONENT_FIELDS_MAP = dict()\n",
    "for subsystem in components:\n",
    "    subsystem_tags = ['Timestamp']\n",
    "    for root, _, files in os.walk(f'{TRAIN_DATA}/{subsystem}'):\n",
    "        for file in files:\n",
    "            fname = os.path.join(root, file)\n",
    "            current_subsystem_df = pd.read_csv(fname, nrows=1)\n",
    "            subsystem_tags = subsystem_tags + current_subsystem_df.columns.tolist()[1:]\n",
    "\n",
    "        DATASET_COMPONENT_FIELDS_MAP.update({subsystem: subsystem_tags})\n",
    "        \n",
    "        \n",
    "lookout_dataset = lookout.LookoutEquipmentDataset(\n",
    "    dataset_name=DATASET_NAME,\n",
    "    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,\n",
    "    region_name=REGION_NAME,\n",
    "    access_role_arn=ROLE_ARN\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to use the console, the following string can be used to configure the **dataset schema**:\n",
    "\n",
    "![dataset_schema](../assets/dataset-schema.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(lookout_dataset.dataset_schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the following cell to print a pretty version of this string in this notebook: use the previous string if you want to paste this schema in the console though, as JSON format requires double quotes for strings (and not simple quotes as Python dictionnaries are displayed from a Jupyter Notebook):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pprint\n",
    "pp = pprint.PrettyPrinter(depth=5)\n",
    "pp.pprint(eval(lookout_dataset.dataset_schema))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lookout_dataset.create()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is now created, but it is empty and ready to receive some timeseries data that we will ingest from the S3 location prepared in the previous notebook:\n",
    "\n",
    "![dataset_schema](../assets/dataset-created.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ingest data into a dataset\n",
    "---\n",
    "Let's double check the values of all the parameters that will be used to ingest some data into an existing Lookout for Equipment dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ROLE_ARN, BUCKET, PREFIX_TRAINING, DATASET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Launch the ingestion job in the Lookout for Equipment dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = lookout_dataset.ingest_data(BUCKET, PREFIX_TRAINING)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ingestion is launched. With this amount of data (around 1.5 GB), it should take between 5-10 minutes:\n",
    "\n",
    "![dataset_schema](../assets/dataset-ingestion-in-progress.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "describe_data_ingestion_job_response = lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)\n",
    "describe_data_ingestion_job_response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the ingestion job ID and status:\n",
    "data_ingestion_job_id = response['JobId']\n",
    "data_ingestion_status = response['Status']\n",
    "\n",
    "# Wait until ingestion completes:\n",
    "print(\"=====Polling Data Ingestion Status=====\\n\")\n",
    "lookout_client = lookout.get_client(region_name=REGION_NAME)\n",
    "print(str(pd.to_datetime(datetime.now()))[:19], \"| \", data_ingestion_status)\n",
    "\n",
    "while data_ingestion_status == 'IN_PROGRESS':\n",
    "    time.sleep(60)\n",
    "    describe_data_ingestion_job_response = lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)\n",
    "    data_ingestion_status = describe_data_ingestion_job_response['Status']\n",
    "    print(str(pd.to_datetime(datetime.now()))[:19], \"| \", data_ingestion_status)\n",
    "    \n",
    "print(\"\\n=====End of Polling Data Ingestion Status=====\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ingestion should now be complete as can be seen in the console:\n",
    "\n",
    "![dataset_schema](../assets/dataset-ingestion-done.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we created a **Lookout for Equipment dataset** and ingested the S3 data previously uploaded into this dataset. **Move now to the next notebook to train a model based on these data.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}