{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Amazon Lookout for Equipment** - Getting started\n", "*Part 2 - Dataset creation*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialization\n", "---\n", "This repository is structured as follow:\n", "\n", "```sh\n", ". lookout-equipment-demo\n", "|\n", "├── data/\n", "| ├── interim # Temporary intermediate data\n", "| ├── processed # Finalized datasets\n", "| └── raw # Immutable original data\n", "|\n", "├── getting_started/\n", "| ├── 1_data_preparation.ipynb\n", "| ├── 2_dataset_creation.ipynb <<< THIS NOTEBOOK <<<\n", "| ├── 3_model_training.ipynb\n", "| ├── 4_model_evaluation.ipynb\n", "| ├── 5_inference_scheduling.ipynb\n", "| ├── 6_visualization_with_quicksight.ipynb\n", "| └── 7_cleanup.ipynb\n", "|\n", "└── utils/\n", " └── lookout_equipment_utils.py\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notebook configuration update" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "!pip install --quiet --upgrade pip\n", "!pip install --quiet --upgrade sagemaker lookoutequipment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import config\n", "import os\n", "import pandas as pd\n", "import pprint\n", "import sagemaker\n", "import sys\n", "import time\n", "\n", "from datetime import datetime\n", "\n", "# SDK / toolbox for managing Lookout for Equipment API calls:\n", "import lookoutequipment as lookout" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "PROCESSED_DATA = os.path.join('..', 'data', 'processed', 'getting-started')\n", "TRAIN_DATA = os.path.join(PROCESSED_DATA, 'training-data')\n", "\n", "ROLE_ARN = sagemaker.get_execution_role()\n", "REGION_NAME = boto3.session.Session().region_name\n", "DATASET_NAME = config.DATASET_NAME\n", "BUCKET = config.BUCKET\n", "PREFIX = config.PREFIX_TRAINING" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a dataset\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create data schema" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "lookout_dataset = lookout.LookoutEquipmentDataset(\n", " dataset_name=DATASET_NAME,\n", " component_root_dir=f's3://{BUCKET}/{PREFIX}',\n", " access_role_arn=ROLE_ARN\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following method encapsulate the [**CreateDataset**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_CreateDataset.html) API:\n", "\n", "```python\n", "lookout_client.create_dataset(\n", " DatasetName=self.dataset_name,\n", " \n", " # Optional\n", " DatasetSchema={\n", " 'InlineDataSchema': \"schema\"\n", " }\n", ")\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset \"getting-started-pump\" does not exist, creating it...\n", "\n" ] }, { "data": { "text/plain": [ "{'DatasetName': 'getting-started-pump',\n", " 'DatasetArn': 'arn:aws:lookoutequipment:eu-west-1:038552646228:dataset/getting-started-pump/9f3b8a45-fa09-4e23-971d-29e0b9e30498',\n", " 'Status': 'CREATED',\n", " 'ResponseMetadata': {'RequestId': 'b8c933f1-1e0d-43f6-94d8-855c8645a350',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amzn-requestid': 'b8c933f1-1e0d-43f6-94d8-855c8645a350',\n", " 'content-type': 'application/x-amz-json-1.0',\n", " 'content-length': '186',\n", " 'date': 'Fri, 13 May 2022 09:01:10 GMT'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lookout_dataset.create()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is now created, but it is empty and ready to receive some timeseries data that we will ingest from the S3 location prepared in the previous notebook:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Dataset created](assets/dataset-created.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ingest data into a dataset\n", "---\n", "Let's double check the values of all the parameters that will be used to ingest some data into an existing Lookout for Equipment dataset:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('arn:aws:iam::038552646228:role/AmazonSageMaker-LookoutEquipmentEnv',\n", " 'lookout-equipment-poc',\n", " 'getting_started/training-data/',\n", " 'getting-started-pump')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ROLE_ARN, BUCKET, PREFIX, DATASET_NAME" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Launch the ingestion job in the Lookout for Equipment dataset: the following method encapsulates the [**StartDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_StartDataIngestionJob.html) API:\n", "\n", "```python\n", "lookout_client.start_data_ingestion_job(\n", " DatasetName=DATASET_NAME,\n", " RoleArn=ROLE_ARN, \n", " IngestionInputConfiguration={ \n", " 'S3InputConfiguration': { \n", " 'Bucket': BUCKET,\n", " 'Prefix': PREFIX,\n", " 'KeyPattern': \"string\"\n", " }\n", " }\n", ")\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "response = lookout_dataset.ingest_data(BUCKET, PREFIX)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ingestion is launched. With this amount of data (around 50 MB), it should take between less than 5 minutes:\n", "\n", "![dataset_schema](assets/dataset-ingestion-in-progress.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the following cell to monitor the ingestion process by calling the following method, which encapsulates the [**DescribeDataIngestionJob**](https://docs.aws.amazon.com/lookout-for-equipment/latest/ug/API_DescribeDataIngestionJob.html) API and runs it every 60 seconds:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-05-13 09:04:11 | Data ingestion: IN_PROGRESS\n", "2022-05-13 09:05:11 | Data ingestion: IN_PROGRESS\n", "2022-05-13 09:06:11 | Data ingestion: IN_PROGRESS\n", "2022-05-13 09:07:11 | Data ingestion: SUCCESS\n" ] } ], "source": [ "lookout_dataset.poll_data_ingestion(sleep_time=60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case any issue arise, you can inspect the API response available as a JSON document:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'JobId': 'af28c7e6ea53ad88e43457d8ced8ded4',\n", " 'DatasetArn': 'arn:aws:lookoutequipment:eu-west-1:038552646228:dataset/getting-started-pump/9f3b8a45-fa09-4e23-971d-29e0b9e30498',\n", " 'IngestionInputConfiguration': {'S3InputConfiguration': {'Bucket': 'lookout-equipment-poc',\n", " 'Prefix': 'getting_started/training-data/'}},\n", " 'RoleArn': 'arn:aws:iam::038552646228:role/AmazonSageMaker-LookoutEquipmentEnv',\n", " 'CreatedAt': datetime.datetime(2022, 5, 13, 9, 3, 8, 123000, tzinfo=tzlocal()),\n", " 'Status': 'SUCCESS',\n", " 'DataQualitySummary': {'InsufficientSensorData': {'MissingCompleteSensorData': {'AffectedSensorCount': 0},\n", " 'SensorsWithShortDateRange': {'AffectedSensorCount': 0}},\n", " 'MissingSensorData': {'AffectedSensorCount': 0,\n", " 'TotalNumberOfMissingValues': 0},\n", " 'InvalidSensorData': {'AffectedSensorCount': 0,\n", " 'TotalNumberOfInvalidValues': 0},\n", " 'UnsupportedTimestamps': {'TotalNumberOfUnsupportedTimestamps': 0},\n", " 'DuplicateTimestamps': {'TotalNumberOfDuplicateTimestamps': 0}},\n", " 'IngestedFilesSummary': {'TotalNumberOfFiles': 1,\n", " 'IngestedNumberOfFiles': 1,\n", " 'DiscardedFiles': []},\n", " 'IngestedDataSize': 51535331,\n", " 'DataStartTime': datetime.datetime(2019, 1, 1, 0, 0, tzinfo=tzlocal()),\n", " 'DataEndTime': datetime.datetime(2019, 10, 27, 23, 55, tzinfo=tzlocal()),\n", " 'ResponseMetadata': {'RequestId': 'a292c2c7-ddbf-48b2-9e7f-3e331fa6fd77',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amzn-requestid': 'a292c2c7-ddbf-48b2-9e7f-3e331fa6fd77',\n", " 'content-type': 'application/x-amz-json-1.0',\n", " 'content-length': '1046',\n", " 'date': 'Fri, 13 May 2022 09:07:11 GMT'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lookout_dataset.ingestion_job_response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ingestion should now be complete as can be seen in the console:\n", "\n", "![Ingestion done](assets/dataset-ingestion-done.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting sensor data quality\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now inspect the data quality of your dataset by clicking on `View dataset`. In this new screen, you will be able to visualize:\n", "* Your dataset details with a summary of their grade. In our case, 22 sensors are marked as **High quality* while 8 sensors are marked as **Medium quality**\n", "* The total number of sensors ingested\n", "* The overall date range\n", "* The location of the data source on S3\n", "\n", "You then have a table with a row for each sensor where you can see the overall date range, the number of days of available data and the sensor grade. Hovering your mouse over a given sensor grade will give you the explanations linked to this grading. In the example below, you can see that Sensor0 was graded as Medium because multiple operating modes are detected. You will be able to use every sensors ingested, but the Lookout for Equipment console gives you some pieces of advice and warns about situations where bad performance may arise further down the road. To read about all the sensor grades the service checks out, [follow this link](https://docs.aws.amazon.com//lookout-for-equipment/latest/ug/reading-details-by-sensor.html):\n", "\n", "![Ingestion done](assets/dataset-inspection.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can obtain these detailed information by querying the [ListSensorStatistics](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/lookoutequipment.html#LookoutEquipment.Client.list_sensor_statistics) API:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(30, 17)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ComponentNameSensorNameDataExistsDataStartTimeDataEndTimeMissingValues.CountMissingValues.PercentageInvalidValues.CountInvalidValues.PercentageInvalidDateEntries.CountInvalidDateEntries.PercentageDuplicateTimestamps.CountDuplicateTimestamps.PercentageCategoricalValues.StatusMultipleOperatingModes.StatusLargeTimestampGaps.StatusMonotonicValues.Status
0centrifugal-pumpSensor0True2019-01-01 00:00:00+00:002019-10-27 23:55:00+00:0000.000.000.000.0NO_ISSUE_DETECTEDPOTENTIAL_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTED
1centrifugal-pumpSensor1True2019-01-01 00:00:00+00:002019-10-27 23:55:00+00:0000.000.000.000.0NO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTED
2centrifugal-pumpSensor10True2019-01-01 00:00:00+00:002019-10-27 23:55:00+00:0000.000.000.000.0NO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTED
3centrifugal-pumpSensor11True2019-01-01 00:00:00+00:002019-10-27 23:55:00+00:0000.000.000.000.0NO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTED
4centrifugal-pumpSensor12True2019-01-01 00:00:00+00:002019-10-27 23:55:00+00:0000.000.000.000.0NO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTEDNO_ISSUE_DETECTED
\n", "
" ], "text/plain": [ " ComponentName SensorName DataExists DataStartTime \\\n", "0 centrifugal-pump Sensor0 True 2019-01-01 00:00:00+00:00 \n", "1 centrifugal-pump Sensor1 True 2019-01-01 00:00:00+00:00 \n", "2 centrifugal-pump Sensor10 True 2019-01-01 00:00:00+00:00 \n", "3 centrifugal-pump Sensor11 True 2019-01-01 00:00:00+00:00 \n", "4 centrifugal-pump Sensor12 True 2019-01-01 00:00:00+00:00 \n", "\n", " DataEndTime MissingValues.Count MissingValues.Percentage \\\n", "0 2019-10-27 23:55:00+00:00 0 0.0 \n", "1 2019-10-27 23:55:00+00:00 0 0.0 \n", "2 2019-10-27 23:55:00+00:00 0 0.0 \n", "3 2019-10-27 23:55:00+00:00 0 0.0 \n", "4 2019-10-27 23:55:00+00:00 0 0.0 \n", "\n", " InvalidValues.Count InvalidValues.Percentage InvalidDateEntries.Count \\\n", "0 0 0.0 0 \n", "1 0 0.0 0 \n", "2 0 0.0 0 \n", "3 0 0.0 0 \n", "4 0 0.0 0 \n", "\n", " InvalidDateEntries.Percentage DuplicateTimestamps.Count \\\n", "0 0.0 0 \n", "1 0.0 0 \n", "2 0.0 0 \n", "3 0.0 0 \n", "4 0.0 0 \n", "\n", " DuplicateTimestamps.Percentage CategoricalValues.Status \\\n", "0 0.0 NO_ISSUE_DETECTED \n", "1 0.0 NO_ISSUE_DETECTED \n", "2 0.0 NO_ISSUE_DETECTED \n", "3 0.0 NO_ISSUE_DETECTED \n", "4 0.0 NO_ISSUE_DETECTED \n", "\n", " MultipleOperatingModes.Status LargeTimestampGaps.Status \\\n", "0 POTENTIAL_ISSUE_DETECTED NO_ISSUE_DETECTED \n", "1 NO_ISSUE_DETECTED NO_ISSUE_DETECTED \n", "2 NO_ISSUE_DETECTED NO_ISSUE_DETECTED \n", "3 NO_ISSUE_DETECTED NO_ISSUE_DETECTED \n", "4 NO_ISSUE_DETECTED NO_ISSUE_DETECTED \n", "\n", " MonotonicValues.Status \n", "0 NO_ISSUE_DETECTED \n", "1 NO_ISSUE_DETECTED \n", "2 NO_ISSUE_DETECTED \n", "3 NO_ISSUE_DETECTED \n", "4 NO_ISSUE_DETECTED " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "job_id = lookout_dataset.ingestion_job_response['JobId']\n", "\n", "response = lookout_dataset.client.list_sensor_statistics(DatasetName=DATASET_NAME, IngestionJobId=job_id)\n", "results = response['SensorStatisticsSummaries']\n", "while 'NextToken' in response:\n", " response = l4e_client.list_sensor_statistics(DatasetName=DATASET_NAME, IngestionJobId=job_id, NextToken=response['NextToken'])\n", " results.extend(response['SensorStatisticsSummaries'])\n", " \n", "stats_df = pd.json_normalize(results, max_level=1)\n", "print(stats_df.shape)\n", "stats_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are all the characteristics you can get for each sensor:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ComponentName centrifugal-pump\n", "SensorName Sensor0\n", "DataExists True\n", "DataStartTime 2019-01-01 00:00:00+00:00\n", "DataEndTime 2019-10-27 23:55:00+00:00\n", "MissingValues.Count 0\n", "MissingValues.Percentage 0.0\n", "InvalidValues.Count 0\n", "InvalidValues.Percentage 0.0\n", "InvalidDateEntries.Count 0\n", "InvalidDateEntries.Percentage 0.0\n", "DuplicateTimestamps.Count 0\n", "DuplicateTimestamps.Percentage 0.0\n", "CategoricalValues.Status NO_ISSUE_DETECTED\n", "MultipleOperatingModes.Status POTENTIAL_ISSUE_DETECTED\n", "LargeTimestampGaps.Status NO_ISSUE_DETECTED\n", "MonotonicValues.Status NO_ISSUE_DETECTED\n", "Name: 0, dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats_df.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we created a **Lookout for Equipment dataset** and ingested the S3 data previously uploaded into this dataset. **Move now to the next notebook to train a model based on these data.**" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }