# **Amazon Lookout for Equipment** - Demonstration on an anonymized compressor dataset
*Part 2: Dataset creation*

**Note:** If you haven't created an IAM role for Amazon Lookout for Equipment yet, first follow this [**set of instructions to create an IAM role**](https://github.com/dast1/l4e_iam_role_configuration/blob/main/configure_IAM_role.md).

## Initialization
---
Following the data preparation notebook, this repository should now be structured as follows:
```
/lookout-equipment-demo/getting_started/
|
├── data/
|   |
|   ├── labelled-data/
|   |   └── labels.csv
|   |
|   └── training-data/
|       └── expander/
|           ├── subsystem-01
|           |   └── subsystem-01.csv
|           |
|           ├── subsystem-02
|           |   └── subsystem-02.csv
|           |
|           ├── ...
|           |
|           └── subsystem-24
|               └── subsystem-24.csv
|
├── dataset/                                <<< Original dataset <<<
|   ├── labels.csv
|   ├── tags_description.csv
|   ├── timeranges.txt
|   └── timeseries.zip
|
├── notebooks/
|   ├── 1_data_preparation.ipynb
|   ├── 2_dataset_creation.ipynb            <<< This notebook <<<
|   ├── 3_model_training.ipynb
|   ├── 4_model_evaluation.ipynb
|   ├── 5_inference_scheduling.ipynb
|   └── config.py
|
└── utils/
    ├── aws_matplotlib_light.py
    └── lookout_equipment_utils.py
```

### Notebook configuration update

In [None]:
!pip install --quiet --upgrade tqdm

### Imports
**Note:** Update the content of the **config.py** file **before** running the following cell

In [None]:
import boto3
import config
import os
import pandas as pd
import pprint
import sagemaker
import sys
import time

from datetime import datetime

# Helper functions for managing Lookout for Equipment API calls:
sys.path.append('../utils')
import lookout_equipment_utils as lookout

### Parameters

In [None]:
# warnings.filterwarnings('ignore')
DATA = os.path.join('..', 'data')
LABEL_DATA = os.path.join(DATA, 'labelled-data')
TRAIN_DATA = os.path.join(DATA, 'training-data', 'expander')

ROLE_ARN = sagemaker.get_execution_role()
REGION_NAME = boto3.session.Session().region_name
DATASET_NAME = config.DATASET_NAME
BUCKET = config.BUCKET
PREFIX_TRAINING = config.PREFIX_TRAINING
PREFIX_LABEL = config.PREFIX_LABEL

In [None]:
# List of the directories from the training data
# directory: each directory corresponds to a subsystem:
components = []
for root, dirs, files in os.walk(TRAIN_DATA):
    for subsystem in dirs:
        components.append(subsystem)
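
Because `os.walk` descends recursively, the loop above would also collect any directory nested inside a subsystem folder. The following is a minimal sketch of an alternative, assuming each component is an immediate subdirectory of the training data folder. `list_components` is a hypothetical helper, not part of the notebook utilities, and the demonstration runs on a temporary directory rather than on the real data:

```python
import os
import tempfile


def list_components(train_dir):
    """Return the sorted names of the immediate subdirectories of train_dir."""
    with os.scandir(train_dir) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


# Self-contained demonstration on a temporary directory tree:
with tempfile.TemporaryDirectory() as tmp:
    for name in ('subsystem-02', 'subsystem-01'):
        # The 'nested' folder must not be reported as a component
        os.makedirs(os.path.join(tmp, name, 'nested'))
    demo_components = list_components(tmp)

print(demo_components)
```

Sorting the names also makes the component order reproducible from one run to the next.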

## Create a dataset
---

### Create data schema

First, we need to set up the schema of the dataset. In the cell below, define `DATASET_COMPONENT_FIELDS_MAP`, a Python dictionary (hashmap). The key of each entry is the `Component` name, and the value is a list of column names. The column names must exactly match the header of your CSV files, and they must appear in the same order. As an example, the dictionary takes the following form:

```python
DATASET_COMPONENT_FIELDS_MAP = {
    "Component1": ['Timestamp', 'Tag1', 'Tag2',...],
    "Component2": ['Timestamp', 'Tag1', 'Tag2',...],
    ...
    "ComponentN": ['Timestamp', 'Tag1', 'Tag2',...]
}
```

Make sure the component name **matches exactly** the name of the folder in S3 (everything is **case sensitive**):
```python
DATASET_COMPONENT_FIELDS_MAP = {
    "subsystem-01": ['Timestamp', 'signal-026', 'signal-027',... , 'signal-092'],
    "subsystem-02": ['Timestamp', 'signal-022', 'signal-023',... , 'signal-096'],
    ...
    "subsystem-24": ['Timestamp', 'signal-083'],
}
```
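
Since both the column names and their order must match the CSV header, it can help to check a file before building the schema. The sketch below is one way to do this with the standard library. `header_matches` is a hypothetical helper, and the sample row is made up for illustration:

```python
import csv
import io


def header_matches(csv_file, expected_columns):
    """Return True when the first CSV row equals expected_columns, in order."""
    header = next(csv.reader(csv_file), [])
    return header == list(expected_columns)


# Made-up sample standing in for the first lines of a subsystem CSV file:
sample = io.StringIO(
    "Timestamp,signal-026,signal-027\n"
    "2015-01-01 00:00:00,1.0,2.0\n"
)

same_order = header_matches(sample, ['Timestamp', 'signal-026', 'signal-027'])

sample.seek(0)  # rewind before reading the header a second time
swapped_order = header_matches(sample, ['Timestamp', 'signal-027', 'signal-026'])

print(same_order, swapped_order)
```

With a real file, the same function can be called on an object returned by `open(path, newline='')`.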

In [None]:
DATASET_COMPONENT_FIELDS_MAP = dict()
for subsystem in components:
    # Every component starts with the timestamp column:
    subsystem_tags = ['Timestamp']
    for root, _, files in os.walk(f'{TRAIN_DATA}/{subsystem}'):
        for file in files:
            # Only the header is needed, so read a single row:
            fname = os.path.join(root, file)
            current_subsystem_df = pd.read_csv(fname, nrows=1)
            subsystem_tags = subsystem_tags + current_subsystem_df.columns.tolist()[1:]

    DATASET_COMPONENT_FIELDS_MAP.update({subsystem: subsystem_tags})


lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)

If you want to use the console, the following string can be used to configure the **dataset schema**:

![dataset_schema](../assets/dataset-schema.png)

In [None]:
print(lookout_dataset.dataset_schema)

Use the following cell to pretty-print this string in the notebook. If you want to paste the schema into the console, use the previous string instead: JSON requires double quotes for strings, whereas Jupyter displays Python dictionaries with single quotes.

In [None]:
import json

# Parse the JSON schema string instead of calling eval() on it:
pp = pprint.PrettyPrinter(depth=5)
pp.pprint(json.loads(lookout_dataset.dataset_schema))
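
To make the link between the component map and the schema string more concrete, here is a minimal sketch of how such a string could be assembled. It assumes the inline schema layout documented for the service (a `Components` list where each entry has a `ComponentName` and typed `Columns`). `build_schema` is a hypothetical function, and the helper class used in this notebook may build its schema differently:

```python
import json


def build_schema(component_fields_map, timestamp_column='Timestamp'):
    """Serialize a {component: [columns]} map into a JSON schema string.

    Assumption: the timestamp column is typed DATETIME and every other
    column is typed DOUBLE.
    """
    components = []
    for component_name, columns in component_fields_map.items():
        components.append({
            'ComponentName': component_name,
            'Columns': [
                {
                    'Name': column,
                    'Type': 'DATETIME' if column == timestamp_column else 'DOUBLE',
                }
                for column in columns
            ],
        })
    return json.dumps({'Components': components})


schema = build_schema({'subsystem-24': ['Timestamp', 'signal-083']})
print(schema)
```

Because the result comes from `json.dumps`, strings use double quotes and the output can be pasted as JSON.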

### Create the dataset

In [None]:
lookout_dataset.create()

The dataset is now created but still empty, ready to receive the time series data that we will ingest from the S3 location prepared in the previous notebook:

![dataset_schema](../assets/dataset-created.png)

## Ingest data into a dataset
---
Let's double check the values of all the parameters that will be used to ingest some data into an existing Lookout for Equipment dataset:

In [None]:
ROLE_ARN, BUCKET, PREFIX_TRAINING, DATASET_NAME

Launch the ingestion job in the Lookout for Equipment dataset:

In [None]:
response = lookout_dataset.ingest_data(BUCKET, PREFIX_TRAINING)
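
For readers who prefer calling the service directly, the sketch below builds the parameters of a boto3 `start_data_ingestion_job` request. It is an assumption that the `ingest_data` helper wraps a call of this kind. `build_ingestion_request` is a hypothetical function and all values passed to it are placeholders:

```python
import uuid


def build_ingestion_request(dataset_name, bucket, prefix, role_arn):
    """Return keyword arguments for a start_data_ingestion_job call."""
    return {
        'DatasetName': dataset_name,
        'IngestionInputConfiguration': {
            'S3InputConfiguration': {'Bucket': bucket, 'Prefix': prefix}
        },
        'RoleArn': role_arn,
        # Unique token so that a retried request is not submitted twice
        'ClientToken': uuid.uuid4().hex,
    }


# Placeholder values, to be replaced with your own configuration:
request = build_ingestion_request(
    dataset_name='my-dataset',
    bucket='my-bucket',
    prefix='training-data/',
    role_arn='arn:aws:iam::123456789012:role/MyRole',
)
print(request['IngestionInputConfiguration'])

# With AWS credentials configured, the request could be submitted with:
# import boto3
# boto3.client('lookoutequipment').start_data_ingestion_job(**request)
```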

The ingestion is launched. With this amount of data (around 1.5 GB), it should take between 5 and 10 minutes:

![dataset_schema](../assets/dataset-ingestion-in-progress.png)

In [None]:
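# Describe the ingestion job once. The client and job ID are defined here
# because the polling cell that also sets them comes later:
lookout_client = lookout.get_client(region_name=REGION_NAME)
data_ingestion_job_id = response['JobId']
lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)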

In [None]:
# Get the ingestion job ID and status:
data_ingestion_job_id = response['JobId']
data_ingestion_status = response['Status']

# Wait until ingestion completes:
print("=====Polling Data Ingestion Status=====\n")
lookout_client = lookout.get_client(region_name=REGION_NAME)
print(str(pd.to_datetime(datetime.now()))[:19], "| ", data_ingestion_status)

while data_ingestion_status == 'IN_PROGRESS':
    time.sleep(60)
    describe_data_ingestion_job_response = lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)
    data_ingestion_status = describe_data_ingestion_job_response['Status']
    print(str(pd.to_datetime(datetime.now()))[:19], "| ", data_ingestion_status)
 
print("\n=====End of Polling Data Ingestion Status=====")
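
The loop above waits for as long as the job reports `IN_PROGRESS`. A minimal sketch of a variant with an upper bound on the waiting time is shown below. `wait_for_job` and its parameters are hypothetical, and the demonstration uses a fake status source instead of the real service:

```python
import time


def wait_for_job(get_status, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until it stops returning 'IN_PROGRESS'.

    Raises TimeoutError when the job is still in progress after
    timeout_seconds of cumulated waiting.
    """
    waited = 0
    status = get_status()
    while status == 'IN_PROGRESS':
        if waited >= timeout_seconds:
            raise TimeoutError(f'Job still in progress after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Demonstration with a fake status source and no real sleeping:
fake_statuses = iter(['IN_PROGRESS', 'IN_PROGRESS', 'SUCCESS'])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)
```

In this notebook, `get_status` could be `lambda: lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)['Status']`. The returned status should still be checked, since the loop also ends when the job fails.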

The ingestion should now be complete as can be seen in the console:

![dataset_schema](../assets/dataset-ingestion-done.png)

## Conclusion
---

In this notebook, we created a **Lookout for Equipment dataset** and ingested the S3 data previously uploaded into this dataset. **Move now to the next notebook to train a model based on these data.**