# **Amazon Lookout for Equipment** - Getting started
*Part 2 - Dataset creation*

## Initialization
---
This repository is structured as follows:

```sh
. lookout-equipment-demo
|
├── data/
|   ├── interim                          # Temporary intermediate data
|   ├── processed                        # Finalized datasets
|   └── raw                              # Immutable original data
|
├── getting_started/
|   ├── 1_data_preparation.ipynb
|   ├── 2_dataset_creation.ipynb               <<< THIS NOTEBOOK <<<
|   ├── 3_model_training.ipynb
|   ├── 4_model_evaluation.ipynb
|   ├── 5_inference_scheduling.ipynb
|   ├── 6_visualization_with_quicksight.ipynb
|   └── 7_cleanup.ipynb
|
└── utils/
    └── lookout_equipment_utils.py
```

### Notebook configuration update

In [1]:
!pip install --quiet --upgrade pip
!pip install --quiet --upgrade sagemaker lookoutequipment

### Imports

In [2]:
import boto3
import config
import os
import pandas as pd
import pprint
import sagemaker
import sys
import time

from datetime import datetime

# SDK / toolbox for managing Lookout for Equipment API calls:
import lookoutequipment as lookout

In [3]:
PROCESSED_DATA = os.path.join('..', 'data', 'processed', 'getting-started')
TRAIN_DATA     = os.path.join(PROCESSED_DATA, 'training-data')

ROLE_ARN       = sagemaker.get_execution_role()
REGION_NAME    = boto3.session.Session().region_name
DATASET_NAME   = config.DATASET_NAME
BUCKET         = config.BUCKET
PREFIX         = config.PREFIX_TRAINING

## Create a dataset
---

### Create data schema

In [4]:
lookout_dataset = lookout.LookoutEquipmentDataset(
    dataset_name=DATASET_NAME,
    component_root_dir=f's3://{BUCKET}/{PREFIX}',
    access_role_arn=ROLE_ARN
)

The following method encapsulates the [**CreateDataset**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_CreateDataset.html) API:

```python
lookout_client.create_dataset(
    DatasetName=self.dataset_name,
    
    # Optional
    DatasetSchema={
        'InlineDataSchema': "schema"
    }
)
```
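The `InlineDataSchema` value is a JSON string describing each component and its columns. The following is a minimal sketch, assuming the layout described in the service documentation (a `Components` list whose entries carry a `ComponentName` and typed `Columns`). The helper name `build_inline_schema` is hypothetical, and the `lookoutequipment` toolbox may build its schema differently:

```python
import json

def build_inline_schema(components):
    """Build a JSON schema string from {component_name: [sensor_names]}.

    Assumption: each component file starts with a `Timestamp` column,
    followed by one numeric column per sensor.
    """
    schema = {'Components': []}
    for component_name, sensor_names in components.items():
        columns = [{'Name': 'Timestamp', 'Type': 'DATETIME'}]
        columns += [{'Name': name, 'Type': 'DOUBLE'} for name in sensor_names]
        schema['Components'].append({
            'ComponentName': component_name,
            'Columns': columns
        })
    return json.dumps(schema)

# Two sensors only, to keep the example short:
schema_str = build_inline_schema({'centrifugal-pump': ['Sensor0', 'Sensor1']})
```

The resulting string could then be passed as the `InlineDataSchema` value in the call shown above.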

In [5]:
lookout_dataset.create()

Dataset "getting-started-pump" does not exist, creating it...



{'DatasetName': 'getting-started-pump',
 'DatasetArn': 'arn:aws:lookoutequipment:eu-west-1:038552646228:dataset/getting-started-pump/9f3b8a45-fa09-4e23-971d-29e0b9e30498',
 'Status': 'CREATED',
 'ResponseMetadata': {'RequestId': 'b8c933f1-1e0d-43f6-94d8-855c8645a350',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b8c933f1-1e0d-43f6-94d8-855c8645a350',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '186',
   'date': 'Fri, 13 May 2022 09:01:10 GMT'},
  'RetryAttempts': 0}}

The dataset is now created but still empty, ready to receive the time series data that we will ingest from the S3 location prepared in the previous notebook:

![Dataset created](assets/dataset-created.png)

## Ingest data into a dataset
---
Let's double check the values of all the parameters that will be used to ingest some data into an existing Lookout for Equipment dataset:

In [6]:
ROLE_ARN, BUCKET, PREFIX, DATASET_NAME

('arn:aws:iam::038552646228:role/AmazonSageMaker-LookoutEquipmentEnv',
 'lookout-equipment-poc',
 'getting_started/training-data/',
 'getting-started-pump')

Launch the ingestion job in the Lookout for Equipment dataset: the following method encapsulates the [**StartDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_StartDataIngestionJob.html) API:

```python
lookout_client.start_data_ingestion_job(
    DatasetName=DATASET_NAME,
    RoleArn=ROLE_ARN, 
    IngestionInputConfiguration={ 
        'S3InputConfiguration': { 
            'Bucket': BUCKET,
            'Prefix': PREFIX,
            'KeyPattern': "string"
        }
    }
)
```
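If you prefer to call the API directly rather than through the toolbox, the request can be assembled as a plain dictionary first. This is a minimal sketch: the helper name `build_ingestion_request` and the role ARN are hypothetical, the optional `KeyPattern` is omitted, and a `ClientToken` is added because the API reference lists it as a unique identifier for the request:

```python
import uuid

def build_ingestion_request(dataset_name, role_arn, bucket, prefix):
    """Assemble keyword arguments for start_data_ingestion_job()."""
    return {
        'DatasetName': dataset_name,
        'RoleArn': role_arn,
        'IngestionInputConfiguration': {
            'S3InputConfiguration': {'Bucket': bucket, 'Prefix': prefix}
        },
        # Unique identifier for this request (random 32-character hex string)
        'ClientToken': uuid.uuid4().hex,
    }

request = build_ingestion_request(
    'getting-started-pump',
    'arn:aws:iam::123456789012:role/ExampleRole',  # hypothetical role ARN
    'lookout-equipment-poc',
    'getting_started/training-data/'
)

# With AWS credentials configured, the request could then be sent with:
# boto3.client('lookoutequipment').start_data_ingestion_job(**request)
```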

In [7]:
response = lookout_dataset.ingest_data(BUCKET, PREFIX)

The ingestion is launched. With this amount of data (around 50 MB), it should take less than 5 minutes:

![dataset_schema](assets/dataset-ingestion-in-progress.png)

The following cell monitors the ingestion process by calling a method that encapsulates the [**DescribeDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_DescribeDataIngestionJob.html) API and polls it every 60 seconds:

In [8]:
lookout_dataset.poll_data_ingestion(sleep_time=60)

2022-05-13 09:04:11 | Data ingestion: IN_PROGRESS
2022-05-13 09:05:11 | Data ingestion: IN_PROGRESS
2022-05-13 09:06:11 | Data ingestion: IN_PROGRESS
2022-05-13 09:07:11 | Data ingestion: SUCCESS
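A polling loop of this kind can be sketched as follows. This is not the toolbox's actual implementation: the function name `poll_status` and the `max_polls` safeguard are assumptions, and a stand-in replaces the real API call so that the sketch runs without AWS access:

```python
import time
from datetime import datetime

def poll_status(describe_fn, sleep_time=60, max_polls=120):
    """Call describe_fn() until its 'Status' is no longer IN_PROGRESS."""
    for _ in range(max_polls):
        status = describe_fn()['Status']
        print(f'{datetime.now():%Y-%m-%d %H:%M:%S} | Data ingestion: {status}')
        if status != 'IN_PROGRESS':
            return status
        time.sleep(sleep_time)
    raise TimeoutError('Ingestion still in progress after max_polls checks')

# Stand-in for the real API call (two canned responses):
fake_responses = iter([{'Status': 'IN_PROGRESS'}, {'Status': 'SUCCESS'}])
final_status = poll_status(lambda: next(fake_responses), sleep_time=0)
```

With a real boto3 client, `describe_fn` could be `lambda: client.describe_data_ingestion_job(JobId=job_id)`.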


If any issue arises, you can inspect the API response, available as a JSON document:

In [9]:
lookout_dataset.ingestion_job_response

{'JobId': 'af28c7e6ea53ad88e43457d8ced8ded4',
 'DatasetArn': 'arn:aws:lookoutequipment:eu-west-1:038552646228:dataset/getting-started-pump/9f3b8a45-fa09-4e23-971d-29e0b9e30498',
 'IngestionInputConfiguration': {'S3InputConfiguration': {'Bucket': 'lookout-equipment-poc',
   'Prefix': 'getting_started/training-data/'}},
 'RoleArn': 'arn:aws:iam::038552646228:role/AmazonSageMaker-LookoutEquipmentEnv',
 'CreatedAt': datetime.datetime(2022, 5, 13, 9, 3, 8, 123000, tzinfo=tzlocal()),
 'Status': 'SUCCESS',
 'DataQualitySummary': {'InsufficientSensorData': {'MissingCompleteSensorData': {'AffectedSensorCount': 0},
   'SensorsWithShortDateRange': {'AffectedSensorCount': 0}},
  'MissingSensorData': {'AffectedSensorCount': 0,
   'TotalNumberOfMissingValues': 0},
  'InvalidSensorData': {'AffectedSensorCount': 0,
   'TotalNumberOfInvalidValues': 0},
  'UnsupportedTimestamps': {'TotalNumberOfUnsupportedTimestamps': 0},
  'DuplicateTimestamps': {'TotalNumberOfDuplicateTimestamps': 0}},
 'IngestedFiles
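The `DataQualitySummary` part of this response can be condensed into one count per check. The sketch below is only an illustration: the helper name `count_quality_issues` is hypothetical, and the sample dictionary is copied from the output above:

```python
def count_quality_issues(summary):
    """Return {check_name: count} for each data quality check."""
    insufficient = summary['InsufficientSensorData']
    return {
        'MissingCompleteSensorData': insufficient['MissingCompleteSensorData']['AffectedSensorCount'],
        'SensorsWithShortDateRange': insufficient['SensorsWithShortDateRange']['AffectedSensorCount'],
        'MissingSensorData': summary['MissingSensorData']['AffectedSensorCount'],
        'InvalidSensorData': summary['InvalidSensorData']['AffectedSensorCount'],
        'UnsupportedTimestamps': summary['UnsupportedTimestamps']['TotalNumberOfUnsupportedTimestamps'],
        'DuplicateTimestamps': summary['DuplicateTimestamps']['TotalNumberOfDuplicateTimestamps'],
    }

# Sample copied from the ingestion job response shown above:
sample_summary = {
    'InsufficientSensorData': {
        'MissingCompleteSensorData': {'AffectedSensorCount': 0},
        'SensorsWithShortDateRange': {'AffectedSensorCount': 0}},
    'MissingSensorData': {'AffectedSensorCount': 0, 'TotalNumberOfMissingValues': 0},
    'InvalidSensorData': {'AffectedSensorCount': 0, 'TotalNumberOfInvalidValues': 0},
    'UnsupportedTimestamps': {'TotalNumberOfUnsupportedTimestamps': 0},
    'DuplicateTimestamps': {'TotalNumberOfDuplicateTimestamps': 0},
}
issue_counts = count_quality_issues(sample_summary)
has_issues = any(count > 0 for count in issue_counts.values())
```

In the notebook, the same function could be applied to `lookout_dataset.ingestion_job_response['DataQualitySummary']`.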

The ingestion should now be complete as can be seen in the console:

![Ingestion done](assets/dataset-ingestion-done.png)

## Inspecting sensor data quality
---

You can now inspect the data quality of your dataset by clicking on `View dataset`. In this new screen, you will be able to visualize:
* Your dataset details with a summary of the sensor grades. In our case, 22 sensors are marked as **High quality** while 8 sensors are marked as **Medium quality**
* The total number of sensors ingested
* The overall date range
* The location of the data source on S3

Below this summary, a table lists one row per sensor with its overall date range, the number of days of available data, and the sensor grade. Hovering your mouse over a sensor grade displays the reasons behind this grading. In the example below, you can see that Sensor0 was graded as Medium because multiple operating modes were detected. You can still use every sensor ingested, but the Lookout for Equipment console gives you some advice and warns about situations where poor model performance may arise further down the road. To read about all the sensor grades the service checks, [follow this link](https://docs.aws.amazon.com//lookout-for-equipment/latest/ug/reading-details-by-sensor.html):

![Ingestion done](assets/dataset-inspection.png)

You can obtain this detailed information by querying the [ListSensorStatistics](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/lookoutequipment.html#LookoutEquipment.Client.list_sensor_statistics) API:

In [10]:
job_id = lookout_dataset.ingestion_job_response['JobId']

response = lookout_dataset.client.list_sensor_statistics(DatasetName=DATASET_NAME, IngestionJobId=job_id)
results = response['SensorStatisticsSummaries']
while 'NextToken' in response:
    response = lookout_dataset.client.list_sensor_statistics(DatasetName=DATASET_NAME, IngestionJobId=job_id, NextToken=response['NextToken'])
    results.extend(response['SensorStatisticsSummaries'])
    
stats_df = pd.json_normalize(results, max_level=1)
print(stats_df.shape)
stats_df.head()

(30, 17)


| | ComponentName | SensorName | DataExists | DataStartTime | DataEndTime | MissingValues.Count | MissingValues.Percentage | InvalidValues.Count | InvalidValues.Percentage | InvalidDateEntries.Count | InvalidDateEntries.Percentage | DuplicateTimestamps.Count | DuplicateTimestamps.Percentage | CategoricalValues.Status | MultipleOperatingModes.Status | LargeTimestampGaps.Status | MonotonicValues.Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | centrifugal-pump | Sensor0 | True | 2019-01-01 00:00:00+00:00 | 2019-10-27 23:55:00+00:00 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | NO_ISSUE_DETECTED | POTENTIAL_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED |
| 1 | centrifugal-pump | Sensor1 | True | 2019-01-01 00:00:00+00:00 | 2019-10-27 23:55:00+00:00 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED |
| 2 | centrifugal-pump | Sensor10 | True | 2019-01-01 00:00:00+00:00 | 2019-10-27 23:55:00+00:00 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED |
| 3 | centrifugal-pump | Sensor11 | True | 2019-01-01 00:00:00+00:00 | 2019-10-27 23:55:00+00:00 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED |
| 4 | centrifugal-pump | Sensor12 | True | 2019-01-01 00:00:00+00:00 | 2019-10-27 23:55:00+00:00 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED | NO_ISSUE_DETECTED |


Here are all the characteristics you can get for each sensor:

In [11]:
stats_df.iloc[0]

ComponentName                              centrifugal-pump
SensorName                                          Sensor0
DataExists                                             True
DataStartTime                     2019-01-01 00:00:00+00:00
DataEndTime                       2019-10-27 23:55:00+00:00
MissingValues.Count                                       0
MissingValues.Percentage                                0.0
InvalidValues.Count                                       0
InvalidValues.Percentage                                0.0
InvalidDateEntries.Count                                  0
InvalidDateEntries.Percentage                           0.0
DuplicateTimestamps.Count                                 0
DuplicateTimestamps.Percentage                          0.0
CategoricalValues.Status                  NO_ISSUE_DETECTED
MultipleOperatingModes.Status      POTENTIAL_ISSUE_DETECTED
LargeTimestampGaps.Status                 NO_ISSUE_DETECTED
MonotonicValues.Status                    NO_ISSUE_DETECTED
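To spot the sensors that deserve a closer look, you can scan the `*.Status` fields for `POTENTIAL_ISSUE_DETECTED`. The following is a minimal sketch working on flattened records such as those returned by `stats_df.to_dict('records')`. The helper name `flag_sensors` is hypothetical and the sample records only keep a few of the fields shown above:

```python
def flag_sensors(records, flag='POTENTIAL_ISSUE_DETECTED'):
    """Map each sensor name to the list of checks that raised a flag."""
    suffix = '.Status'
    flagged = {}
    for record in records:
        issues = [
            key[:-len(suffix)]
            for key, value in record.items()
            if key.endswith(suffix) and value == flag
        ]
        if issues:
            flagged[record['SensorName']] = issues
    return flagged

# Reduced sample based on the first two rows of the table above:
sample_records = [
    {'SensorName': 'Sensor0',
     'CategoricalValues.Status': 'NO_ISSUE_DETECTED',
     'MultipleOperatingModes.Status': 'POTENTIAL_ISSUE_DETECTED'},
    {'SensorName': 'Sensor1',
     'CategoricalValues.Status': 'NO_ISSUE_DETECTED',
     'MultipleOperatingModes.Status': 'NO_ISSUE_DETECTED'},
]
flagged = flag_sensors(sample_records)
```

Here only Sensor0 is reported, with `MultipleOperatingModes` as the flagged check, which matches the Medium grade explained in the console.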

## Conclusion
---

In this notebook, we created a **Lookout for Equipment dataset** and ingested the S3 data previously uploaded into this dataset. **Move now to the next notebook to train a model based on these data.**