# **Amazon Lookout for Equipment** - Demonstration on an anonymized expander dataset
*Part 2: Dataset creation*

**Change bucket name here:** the role attached to this notebook must give the Lookout for Equipment service access to your S3 bucket. At this stage, the service needs to read from this S3 location to ingest the data. It will also need to write to this bucket during the inference scheduling phase.

**Note:** If you haven't created an IAM role for Amazon Lookout for Equipment yet, first follow this [**set of instructions to create an IAM role**](https://github.com/dast1/l4e_iam_role_configuration/blob/main/configure_IAM_role.md).

In [1]:
BUCKET = ''
PREFIX = 'data/training-data/expander/'
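Before going further, it can help to catch an unset bucket name early. The following is a minimal sketch, not part of the original notebook: `check_s3_location` is a hypothetical helper, and the rule that the prefix should end with `/` is an assumption based on the value of `PREFIX` used above.

```python
def check_s3_location(bucket, prefix):
    """Return a list of problems found in the S3 settings (empty if none)."""
    problems = []
    if not bucket:
        problems.append('BUCKET is empty')
    if bucket.startswith('s3://'):
        # boto3 calls expect the bare bucket name, not an S3 URI
        problems.append("BUCKET should be a bare name, without 's3://'")
    if not prefix.endswith('/'):
        # Assumption: the prefix designates a folder, as in this notebook
        problems.append("PREFIX should end with '/'")
    return problems


# Example with placeholder values:
print(check_s3_location('', 'data/training-data/expander/'))
```

With the empty `BUCKET` shown above, the list contains one entry, which is a reminder to fill in the bucket name before running the ingestion cells.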

## Initialization
---
Following the data preparation notebook, this repository should now be structured as follows:
```
/lookout-equipment-demo
|
+-- data/
|   |
|   +-- labelled-data/
|   |   \-- labels.csv
|   |
|   \-- training-data/
|       \-- expander/
|           |-- subsystem-01
|           |   \-- subsystem-01.csv
|           |
|           |-- subsystem-02
|           |   \-- subsystem-02.csv
|           |
|           |-- ...
|           |
|           \-- subsystem-24
|               \-- subsystem-24.csv
|
+-- dataset/
|   |-- labels.csv
|   |-- tags_description.csv
|   |-- timeranges.txt
|   \-- timeseries.zip
|
+-- notebooks/
|   |-- 1_data_preparation.ipynb
|   |-- 2_dataset_creation.ipynb           <<< This notebook <<<
|   |-- 3_model_training.ipynb
|   |-- 4_model_evaluation.ipynb
|   \-- 5_inference_scheduling.ipynb
|
+-- utils/
    |-- lookout_equipment_utils.py
    \-- lookoutequipment.json
```

### Notebook configuration update
Amazon Lookout for Equipment is a very recent service, so we need to make sure that we have access to the latest version of the AWS Python packages. If you see a `pip` dependency error, check the `boto3` version: if it is at least 1.17.48 (the first version that includes the `lookoutequipment` API), you can ignore this error and move on to the next cell:

In [2]:
!pip install --quiet --upgrade boto3 tqdm sagemaker

import boto3
print(f'boto3 version: {boto3.__version__} (should be >= 1.17.48 to include Lookout for Equipment API)')

# Restart the current notebook to ensure we take into account the previous updates:
from IPython.core.display import HTML
HTML("")

boto3 version: 1.17.53 (should be >= 1.17.48 to include Lookout for Equipment API)
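The version check above is done by eye. As a hedged sketch, the comparison can also be automated. The helper names `version_tuple` and `is_at_least` are hypothetical, and the parsing only looks at the first three numeric fields. Comparing tuples of integers avoids the pitfall of comparing version strings, where `'1.17.9'` would sort after `'1.17.48'`.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.17.53' into (1, 17, 53)."""
    return tuple(int(n) for n in re.findall(r'\d+', version)[:3])


def is_at_least(installed, minimum):
    """True if the installed version is greater than or equal to the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# The version printed by the cell above satisfies the requirement:
print(is_at_least('1.17.53', '1.17.48'))
```

In the notebook, `is_at_least(boto3.__version__, '1.17.48')` could replace the manual check.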


### Imports

In [3]:
import boto3
import os
import pandas as pd
import pprint
import sagemaker
import sys
import time
import warnings

from datetime import datetime

# Helper functions for managing Lookout for Equipment API calls:
sys.path.append('../utils')
import lookout_equipment_utils as lookout

### Parameters

In [4]:
warnings.filterwarnings('ignore')

DATA = os.path.join('..', 'data')
LABEL_DATA = os.path.join(DATA, 'labelled-data')
TRAIN_DATA = os.path.join(DATA, 'training-data', 'expander')

ROLE_ARN = sagemaker.get_execution_role()
REGION_NAME = boto3.session.Session().region_name
DATASET_NAME = 'lookout-demo-training-dataset'

In [5]:
# List of the directories from the training data 
# directory: each directory corresponds to a subsystem:
components = []
for root, dirs, files in os.walk(TRAIN_DATA):
    for subsystem in dirs:
        components.append(subsystem)
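Because `os.walk` recurses, the loop above would also pick up any folder nested inside a subsystem directory. The tree shown earlier has no such folders, so this makes no difference here. As an alternative sketch, `list_components` (a hypothetical name) lists only the immediate subdirectories and sorts them so that the order is reproducible:

```python
import os


def list_components(train_dir):
    """Return the sorted names of the immediate subdirectories of train_dir."""
    return sorted(
        entry.name for entry in os.scandir(train_dir) if entry.is_dir()
    )
```

In this notebook, `components = list_components(TRAIN_DATA)` would be the equivalent call.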

## Create a dataset
---

### Create data schema

First, we need to set up the schema of the dataset. In the cell below, we define `DATASET_COMPONENT_FIELDS_MAP`, a Python dictionary (hashmap). The key of each entry is the `Component` name, and the value is a list of column names. The column names must exactly match the header of your CSV files, and they must appear in the same order. As an example, the dictionary will look like this:

```python
DATASET_COMPONENT_FIELDS_MAP = {
 "Component1": ['Timestamp', 'Tag1', 'Tag2',...],
 "Component2": ['Timestamp', 'Tag1', 'Tag2',...]
 ...
 "ComponentN": ['Timestamp', 'Tag1', 'Tag2',...]
}
```

Make sure the component name **matches exactly** the name of the folder in S3 (everything is **case sensitive**):
```python
DATASET_COMPONENT_FIELDS_MAP = {
 "subsystem-01": ['Timestamp', 'signal-026', 'signal-027',... , 'signal-092'],
 "subsystem-02": ['Timestamp', 'signal-022', 'signal-023',... , 'signal-096'],
 ...
 "subsystem-24": ['Timestamp', 'signal-083'],
}
```
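To make the link between this dictionary and the dataset schema concrete, here is a minimal sketch of how such a map could be turned into a schema document. It is not the code of the `lookout_equipment_utils` helper: `build_schema` is a hypothetical function, and the assumption that the `Timestamp` column is typed `DATETIME` and every other column `DOUBLE` is taken from the schema printed later in this notebook.

```python
import json


def build_schema(component_fields_map, timestamp_column='Timestamp'):
    """Build a JSON schema string from a {component: [columns]} dictionary."""
    components = []
    for name, columns in component_fields_map.items():
        components.append({
            'ComponentName': name,
            'Columns': [
                {
                    'Name': column,
                    # Assumption: timestamps are DATETIME, signals are DOUBLE
                    'Type': 'DATETIME' if column == timestamp_column else 'DOUBLE',
                }
                for column in columns
            ],
        })
    return json.dumps({'Components': components})


# Example with a single-signal component:
print(build_schema({'subsystem-19': ['Timestamp', 'signal-067']}))
```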

In [6]:
DATASET_COMPONENT_FIELDS_MAP = dict()
for subsystem in components:
    # The first column of every file is the timestamp:
    subsystem_tags = ['Timestamp']
    for root, _, files in os.walk(f'{TRAIN_DATA}/{subsystem}'):
        for file in files:
            fname = os.path.join(root, file)
            # Only the header is needed, so we read a single row:
            current_subsystem_df = pd.read_csv(fname, nrows=1)
            subsystem_tags = subsystem_tags + current_subsystem_df.columns.tolist()[1:]

    DATASET_COMPONENT_FIELDS_MAP.update({subsystem: subsystem_tags})


lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_fields_map=DATASET_COMPONENT_FIELDS_MAP,
    region_name=REGION_NAME,
    access_role_arn=ROLE_ARN
)
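Note that the cell above always names the first column `Timestamp` and takes the remaining names from the CSV header. A file whose first column has a different name would therefore produce a schema that does not match the file. The following sketch checks a file against its expected columns, using a hypothetical `header_matches` helper that is not part of the original utilities:

```python
import csv


def header_matches(csv_path, expected_columns):
    """True if the CSV header equals expected_columns, in the same order."""
    with open(csv_path, newline='') as f:
        # An empty file yields an empty header instead of raising an error
        header = next(csv.reader(f), [])
    return header == list(expected_columns)
```

For instance, `header_matches(fname, DATASET_COMPONENT_FIELDS_MAP[subsystem])` could be asserted for each file before creating the dataset.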

If you want to use the console, the following string can be used to configure the **dataset schema**:

![dataset_schema](../assets/dataset-schema.png)

In [7]:
import ast

# The schema is stored as a string: parse it safely (without eval) before printing it
pp = pprint.PrettyPrinter(depth=5)
pp.pprint(ast.literal_eval(lookout_dataset.dataset_schema))

{'Components': [{'Columns': [{'Name': 'Timestamp', 'Type': 'DATETIME'},
 {'Name': 'signal-067', 'Type': 'DOUBLE'}],
 'ComponentName': 'subsystem-19'},
 {'Columns': [{'Name': 'Timestamp', 'Type': 'DATETIME'},
 {'Name': 'signal-099', 'Type': 'DOUBLE'}],
 'ComponentName': 'subsystem-18'},
 {'Columns': [{'Name': 'Timestamp', 'Type': 'DATETIME'},
 {'Name': 'signal-016', 'Type': 'DOUBLE'},
 {'Name': 'signal-031', 'Type': 'DOUBLE'},
 {'Name': 'signal-032', 'Type': 'DOUBLE'},
 {'Name': 'signal-033', 'Type': 'DOUBLE'},
 {'Name': 'signal-044', 'Type': 'DOUBLE'},
 {'Name': 'signal-045', 'Type': 'DOUBLE'},
 {'Name': 'signal-103', 'Type': 'DOUBLE'},
 {'Name': 'signal-104', 'Type': 'DOUBLE'},
 {'Name': 'signal-105', 'Type': 'DOUBLE'}],
 'ComponentName': 'subsystem-09'},
 {'Columns': [{'Name': 'Timestamp', 'Type': 'DATETIME'},
 {'Name': 'signal-094', 'Type': 'DOUBLE'}],
 'ComponentName': 'subsystem-14'},
 {'Columns': [{'Name': 'Timestamp', 'Type': 'DATETIME'},
 {'Name': 'signal-083', 'Type': 'DOUBLE'

### Create the dataset

In [8]:
lookout_dataset.create()

Dataset "lookout-demo-training-dataset-v4" does not exist, creating it...



{'DatasetName': 'lookout-demo-training-dataset-v4',
 'DatasetArn': 'arn:aws:lookoutequipment:eu-west-1:123031033346:dataset/lookout-demo-training-dataset-v4/29dbcdb2-2d9f-4dcc-8922-a06fda7e4b63',
 'Status': 'CREATED',
 'ResponseMetadata': {'RequestId': '10458a57-87c7-400a-873e-02fefe116b12',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'x-amzn-requestid': '10458a57-87c7-400a-873e-02fefe116b12',
 'content-type': 'application/x-amz-json-1.0',
 'content-length': '210',
 'date': 'Fri, 16 Apr 2021 20:03:23 GMT'},
 'RetryAttempts': 0}}

The dataset is now created, but it is empty and ready to receive some timeseries data that we will ingest from the S3 location prepared in the previous notebook:

![dataset_schema](../assets/dataset-created.png)
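The `create()` helper hides the underlying API call. As a rough sketch, assuming the helper wraps the `create_dataset` operation of the `boto3` `lookoutequipment` client (the helper's code is not shown in this notebook, so this is an assumption), a direct call could look like the following. `create_empty_dataset` is a hypothetical name, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import uuid


def create_empty_dataset(client, dataset_name, schema_json):
    """Create an empty dataset from a schema given as a JSON string."""
    return client.create_dataset(
        DatasetName=dataset_name,
        DatasetSchema={'InlineDataSchema': schema_json},
        # Unique token so that a retried request is not processed twice
        ClientToken=uuid.uuid4().hex,
    )


# Usage (requires AWS credentials, hence commented out):
# import boto3
# client = boto3.client('lookoutequipment', region_name=REGION_NAME)
# create_empty_dataset(client, DATASET_NAME, lookout_dataset.dataset_schema)
```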

## Ingest data into a dataset
---
Let's double check the values of all the parameters that will be used to ingest some data into an existing Lookout for Equipment dataset:

In [9]:
ROLE_ARN, BUCKET, PREFIX, DATASET_NAME

('arn:aws:iam::123031033346:role/service-role/AmazonSageMaker-ExecutionRole-20210128T070865',
 'sagemaker-lookout-equipment-demo',
 'data4/training-data/expander/',
 'lookout-demo-training-dataset-v4')

Launch the ingestion job in the Lookout for Equipment dataset:

In [10]:
response = lookout_dataset.ingest_data(BUCKET, PREFIX)

The ingestion is launched. With this amount of data (around 1.5 GB), it should take between 5 and 10 minutes:

![dataset_schema](../assets/dataset-ingestion-in-progress.png)
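For reference, here is a hedged sketch of what launching the ingestion directly through `boto3` could look like. It assumes that `ingest_data` wraps the `start_data_ingestion_job` operation of the `lookoutequipment` client, which is not confirmed by this notebook. `start_ingestion` is a hypothetical name, and the client is injected so the function can be tried with a stand-in object.

```python
import uuid


def start_ingestion(client, dataset_name, bucket, prefix, role_arn):
    """Start ingesting the CSV files found under s3://bucket/prefix."""
    return client.start_data_ingestion_job(
        DatasetName=dataset_name,
        IngestionInputConfiguration={
            'S3InputConfiguration': {'Bucket': bucket, 'Prefix': prefix}
        },
        # Role that lets the service read from the bucket
        RoleArn=role_arn,
        ClientToken=uuid.uuid4().hex,
    )


# Usage (requires AWS credentials, hence commented out):
# import boto3
# client = boto3.client('lookoutequipment', region_name=REGION_NAME)
# response = start_ingestion(client, DATASET_NAME, BUCKET, PREFIX, ROLE_ARN)
```

The response of this operation carries the `JobId` and `Status` fields used by the polling cell.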

In [11]:
# Get the ingestion job ID and status:
data_ingestion_job_id = response['JobId']
data_ingestion_status = response['Status']

# Wait until ingestion completes:
print("=====Polling Data Ingestion Status=====\n")
lookout_client = lookout.get_client(region_name=REGION_NAME)
print(str(pd.to_datetime(datetime.now()))[:19], "| ", data_ingestion_status)

while data_ingestion_status == 'IN_PROGRESS':
    time.sleep(60)
    describe_data_ingestion_job_response = lookout_client.describe_data_ingestion_job(JobId=data_ingestion_job_id)
    data_ingestion_status = describe_data_ingestion_job_response['Status']
    print(str(pd.to_datetime(datetime.now()))[:19], "| ", data_ingestion_status)
 
print("\n=====End of Polling Data Ingestion Status=====")

=====Polling Data Ingestion Status=====

2021-04-16 20:03:50 | IN_PROGRESS
2021-04-16 20:04:50 | IN_PROGRESS
2021-04-16 20:05:50 | IN_PROGRESS
2021-04-16 20:06:50 | IN_PROGRESS
2021-04-16 20:07:50 | IN_PROGRESS
2021-04-16 20:08:50 | IN_PROGRESS
2021-04-16 20:09:50 | IN_PROGRESS
2021-04-16 20:10:50 | IN_PROGRESS
2021-04-16 20:11:50 | SUCCESS

=====End of Polling Data Ingestion Status=====
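The polling loop above runs for as long as the job reports `IN_PROGRESS`. As a variation, the sketch below adds an upper bound on the waiting time. `wait_for_ingestion` is a hypothetical helper, and the default timeout of one hour is an arbitrary assumption. The status lookup is passed in as a function, which keeps the loop independent of the AWS client.

```python
import time


def wait_for_ingestion(get_status, poll_seconds=60, timeout_seconds=3600,
                       sleep=time.sleep):
    """Poll get_status() until the job leaves IN_PROGRESS or the timeout expires.

    Returns the last status seen, which is still 'IN_PROGRESS' on timeout.
    """
    waited = 0
    status = get_status()
    while status == 'IN_PROGRESS' and waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Usage in this notebook (requires the client and job ID defined above):
# final_status = wait_for_ingestion(
#     lambda: lookout_client.describe_data_ingestion_job(
#         JobId=data_ingestion_job_id
#     )['Status']
# )
```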


The ingestion should now be complete, as can be seen in the console:

![dataset_schema](../assets/dataset-ingestion-done.png)

## Conclusion
---

In this notebook, we created a **Lookout for Equipment dataset** and ingested the S3 data previously uploaded into this dataset. **Now move on to the next notebook to train a model based on this data.**

In [12]:
# We'll just persist this dataset name to collect it from the next notebook in this series:
dataset_fname = os.path.join(DATA, 'dataset_name.txt')
with open(dataset_fname, 'w') as f:
    f.write(DATASET_NAME)
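The next notebook can then read this file back. A minimal sketch of the reading side follows. `load_dataset_name` is a hypothetical helper, and the file name matches the one written above.

```python
import os


def load_dataset_name(data_dir):
    """Read the dataset name persisted by the dataset creation notebook."""
    with open(os.path.join(data_dir, 'dataset_name.txt')) as f:
        # strip() guards against a trailing newline added by an editor
        return f.read().strip()
```

In the model training notebook, `DATASET_NAME = load_dataset_name(DATA)` would recover the same value.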