{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Amazon Lookout for Equipment** - Getting started\n", "*Part 2 - Dataset creation*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialization\n", "---\n", "This repository is structured as follows:\n", "\n", "```sh\n", ". lookout-equipment-demo\n", "|\n", "├── data/\n", "| ├── interim # Temporary intermediate data\n", "| ├── processed # Finalized datasets\n", "| └── raw # Immutable original data\n", "|\n", "├── getting_started/\n", "| ├── 1_data_preparation.ipynb\n", "| ├── 2_dataset_creation.ipynb <<< THIS NOTEBOOK <<<\n", "| ├── 3_model_training.ipynb\n", "| ├── 4_model_evaluation.ipynb\n", "| ├── 5_inference_scheduling.ipynb\n", "| ├── 6_visualization_with_quicksight.ipynb\n", "| └── 7_cleanup.ipynb\n", "|\n", "└── utils/\n", " └── lookout_equipment_utils.py\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notebook configuration update" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "!pip install --quiet --upgrade pip\n", "!pip install --quiet --upgrade sagemaker lookoutequipment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import config\n", "import os\n", "import pandas as pd\n", "import pprint\n", "import sagemaker\n", "import sys\n", "import time\n", "\n", "from datetime import datetime\n", "\n", "# SDK / toolbox for managing Lookout for Equipment API calls:\n", "import lookoutequipment as lookout" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "PROCESSED_DATA = os.path.join('..', 'data', 'processed', 'getting-started')\n", "TRAIN_DATA = os.path.join(PROCESSED_DATA, 'training-data')\n", "\n", "ROLE_ARN = sagemaker.get_execution_role()\n", "REGION_NAME = boto3.session.Session().region_name\n", "DATASET_NAME = config.DATASET_NAME\n", "BUCKET = config.BUCKET\n", "PREFIX = config.PREFIX_TRAINING" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a dataset\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create data schema" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "lookout_dataset = lookout.LookoutEquipmentDataset(\n", " dataset_name=DATASET_NAME,\n", " component_root_dir=f's3://{BUCKET}/{PREFIX}',\n", " access_role_arn=ROLE_ARN\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following method encapsulates the [**CreateDataset**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_CreateDataset.html) API:\n", "\n", "```python\n", "lookout_client.create_dataset(\n", " DatasetName=self.dataset_name,\n", " \n", " # Optional\n", " DatasetSchema={\n", " 'InlineDataSchema': \"schema\"\n", " 
}\n", ")\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset \"getting-started-pump\" does not exist, creating it...\n", "\n" ] }, { "data": { "text/plain": [ "{'DatasetName': 'getting-started-pump',\n", " 'DatasetArn': 'arn:aws:lookoutequipment:eu-west-1:038552646228:dataset/getting-started-pump/9f3b8a45-fa09-4e23-971d-29e0b9e30498',\n", " 'Status': 'CREATED',\n", " 'ResponseMetadata': {'RequestId': 'b8c933f1-1e0d-43f6-94d8-855c8645a350',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amzn-requestid': 'b8c933f1-1e0d-43f6-94d8-855c8645a350',\n", " 'content-type': 'application/x-amz-json-1.0',\n", " 'content-length': '186',\n", " 'date': 'Fri, 13 May 2022 09:01:10 GMT'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lookout_dataset.create()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset now exists but is still empty. It is ready to receive the time series data that we will ingest from the S3 location prepared in the previous notebook:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ingest data into a dataset\n", "---\n", "Let's double-check the values of all the parameters that will be used to ingest data into an existing Lookout for Equipment dataset:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('arn:aws:iam::038552646228:role/AmazonSageMaker-LookoutEquipmentEnv',\n", " 'lookout-equipment-poc',\n", " 'getting_started/training-data/',\n", " 'getting-started-pump')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ROLE_ARN, BUCKET, PREFIX, DATASET_NAME" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Launch the ingestion job in the Lookout for Equipment 
dataset: the following method encapsulates the [**StartDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_StartDataIngestionJob.html) API:\n", "\n", "```python\n", "lookout_client.start_data_ingestion_job(\n", " DatasetName=DATASET_NAME,\n", " RoleArn=ROLE_ARN, \n", " IngestionInputConfiguration={ \n", " 'S3InputConfiguration': { \n", " 'Bucket': BUCKET,\n", " 'Prefix': PREFIX,\n", " 'KeyPattern': \"string\"\n", " }\n", " }\n", ")\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "response = lookout_dataset.ingest_data(BUCKET, PREFIX)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ingestion is launched. With this amount of data (around 50 MB), it should take less than 5 minutes:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell monitors the ingestion process: it calls a method that encapsulates the [**DescribeDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_DescribeDataIngestionJob.html) API and polls it every 60 seconds:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-05-13 09:04:11 | Data ingestion: IN_PROGRESS\n", "2022-05-13 09:05:11 | Data ingestion: IN_PROGRESS\n", "2022-05-13 09:06:11 | Data ingestion: IN_PROGRESS\n", "2022-05-13 09:07:11 | Data ingestion: SUCCESS\n" ] } ], "source": [ "lookout_dataset.poll_data_ingestion(sleep_time=60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If any issue arises, you can inspect the API response, which is available as a JSON document:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'JobId': 'af28c7e6ea53ad88e43457d8ced8ded4',\n", " 'DatasetArn': 
'arn:aws:lookoutequipment:eu-west-1:038552646228:dataset/getting-started-pump/9f3b8a45-fa09-4e23-971d-29e0b9e30498',\n", " 'IngestionInputConfiguration': {'S3InputConfiguration': {'Bucket': 'lookout-equipment-poc',\n", " 'Prefix': 'getting_started/training-data/'}},\n", " 'RoleArn': 'arn:aws:iam::038552646228:role/AmazonSageMaker-LookoutEquipmentEnv',\n", " 'CreatedAt': datetime.datetime(2022, 5, 13, 9, 3, 8, 123000, tzinfo=tzlocal()),\n", " 'Status': 'SUCCESS',\n", " 'DataQualitySummary': {'InsufficientSensorData': {'MissingCompleteSensorData': {'AffectedSensorCount': 0},\n", " 'SensorsWithShortDateRange': {'AffectedSensorCount': 0}},\n", " 'MissingSensorData': {'AffectedSensorCount': 0,\n", " 'TotalNumberOfMissingValues': 0},\n", " 'InvalidSensorData': {'AffectedSensorCount': 0,\n", " 'TotalNumberOfInvalidValues': 0},\n", " 'UnsupportedTimestamps': {'TotalNumberOfUnsupportedTimestamps': 0},\n", " 'DuplicateTimestamps': {'TotalNumberOfDuplicateTimestamps': 0}},\n", " 'IngestedFilesSummary': {'TotalNumberOfFiles': 1,\n", " 'IngestedNumberOfFiles': 1,\n", " 'DiscardedFiles': []},\n", " 'IngestedDataSize': 51535331,\n", " 'DataStartTime': datetime.datetime(2019, 1, 1, 0, 0, tzinfo=tzlocal()),\n", " 'DataEndTime': datetime.datetime(2019, 10, 27, 23, 55, tzinfo=tzlocal()),\n", " 'ResponseMetadata': {'RequestId': 'a292c2c7-ddbf-48b2-9e7f-3e331fa6fd77',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amzn-requestid': 'a292c2c7-ddbf-48b2-9e7f-3e331fa6fd77',\n", " 'content-type': 'application/x-amz-json-1.0',\n", " 'content-length': '1046',\n", " 'date': 'Fri, 13 May 2022 09:07:11 GMT'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lookout_dataset.ingestion_job_response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ingestion should now be complete as can be seen in the console:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting sensor 
data quality\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now inspect the data quality of your dataset by clicking on `View dataset`. In this new screen, you will be able to visualize:\n", "* Your dataset details with a summary of the sensor grades. In our case, 22 sensors are marked as **High quality** while 8 sensors are marked as **Medium quality**\n", "* The total number of sensors ingested\n", "* The overall date range\n", "* The location of the data source on S3\n", "\n", "Below these details, a table lists one row per sensor, showing its overall date range, the number of days of available data and the sensor grade. Hovering your mouse over a given sensor grade displays the reasons behind this grading. In the example below, you can see that Sensor0 was graded as Medium because multiple operating modes are detected. You can use every sensor ingested, but the Lookout for Equipment console gives you some advice and warns about situations that may degrade model performance further down the road. To read about all the sensor quality checks the service performs, [follow this link](https://docs.aws.amazon.com//lookout-for-equipment/latest/ug/reading-details-by-sensor.html):\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can obtain this detailed information by querying the [ListSensorStatistics](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/lookoutequipment.html#LookoutEquipment.Client.list_sensor_statistics) API:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(30, 17)\n" ] }, { "data": { "text/html": [ "
\n", " | ComponentName | \n", "SensorName | \n", "DataExists | \n", "DataStartTime | \n", "DataEndTime | \n", "MissingValues.Count | \n", "MissingValues.Percentage | \n", "InvalidValues.Count | \n", "InvalidValues.Percentage | \n", "InvalidDateEntries.Count | \n", "InvalidDateEntries.Percentage | \n", "DuplicateTimestamps.Count | \n", "DuplicateTimestamps.Percentage | \n", "CategoricalValues.Status | \n", "MultipleOperatingModes.Status | \n", "LargeTimestampGaps.Status | \n", "MonotonicValues.Status | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "centrifugal-pump | \n", "Sensor0 | \n", "True | \n", "2019-01-01 00:00:00+00:00 | \n", "2019-10-27 23:55:00+00:00 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "NO_ISSUE_DETECTED | \n", "POTENTIAL_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "
1 | \n", "centrifugal-pump | \n", "Sensor1 | \n", "True | \n", "2019-01-01 00:00:00+00:00 | \n", "2019-10-27 23:55:00+00:00 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "
2 | \n", "centrifugal-pump | \n", "Sensor10 | \n", "True | \n", "2019-01-01 00:00:00+00:00 | \n", "2019-10-27 23:55:00+00:00 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "
3 | \n", "centrifugal-pump | \n", "Sensor11 | \n", "True | \n", "2019-01-01 00:00:00+00:00 | \n", "2019-10-27 23:55:00+00:00 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "
4 | \n", "centrifugal-pump | \n", "Sensor12 | \n", "True | \n", "2019-01-01 00:00:00+00:00 | \n", "2019-10-27 23:55:00+00:00 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "0 | \n", "0.0 | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "NO_ISSUE_DETECTED | \n", "