{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Amazon Lookout for Equipment** - Demonstration on an anonymized compressor dataset\n", "*Part 1: Data preparation*\n", "\n", "**Change your bucket name here:**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Replace the placeholder below with the name of your own S3 bucket:\n", "BUCKET = '<your-bucket-name>'\n", "PREFIX = 'data'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialization\n", "---\n", "This repository is initially structured as follows:\n", "```\n", "/lookout-equipment-demo\n", "|\n", "+-- dataset/                                <<< Original dataset <<<\n", "|   |-- labels.csv\n", "|   |-- tags_description.csv\n", "|   |-- timeranges.txt\n", "|   \\-- timeseries.zip\n", "|\n", "+-- notebooks/\n", "|   |-- 1_data_preparation.ipynb            <<< This notebook <<<\n", "|   |-- 2_dataset_creation.ipynb\n", "|   |-- 3_model_training.ipynb\n", "|   |-- 4_model_evaluation.ipynb\n", "|   \\-- 5_inference_scheduling.ipynb\n", "|\n", "+-- utils/\n", "    \\-- lookout_equipment_utils.py\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Notebook configuration update\n", "Because Amazon Lookout for Equipment is a very recent service, we need to make sure that we have access to the latest version of the AWS Python packages. 
If you see a `pip` dependency error, check the `boto3` version: if it is 1.17.48 or later (1.17.48 is the first version that includes the `lookoutequipment` API), you can ignore this error and move on to the next cell:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "boto3 version: 1.17.53 (should be >= 1.17.48 to include Lookout for Equipment API)\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "!pip install --quiet --upgrade boto3 tqdm tsia\n", "\n", "import boto3\n", "print(f'boto3 version: {boto3.__version__} (should be >= 1.17.48 to include Lookout for Equipment API)')\n", "\n", "# Restart the current notebook to ensure we take into account the previous updates:\n", "from IPython.display import HTML\n", "HTML(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import pyarrow as pa\n", "import pyarrow.parquet as pq\n", "import shutil\n", "import sys\n", "import tsia\n", "import warnings\n", "\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sys.path.append('../utils')\n", "import lookout_equipment_utils as lookout" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "RAW_DATA = os.path.join('..', 'dataset')\n", "DATA = os.path.join('..', 'data')\n", "LABEL_DATA = os.path.join(DATA, 'labelled-data')\n", "TRAIN_DATA = os.path.join(DATA, 'training-data', 'expander')\n", "\n", "os.makedirs(DATA, exist_ok=True)\n", "os.makedirs(LABEL_DATA, 
exist_ok=True)\n", "os.makedirs(TRAIN_DATA, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "plt.style.use('Solarize_Light2')\n", "plt.rcParams['lines.linewidth'] = 0.5\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading datasets of interest\n", "---\n", "### Analysis time ranges\n", "The dataset provided with this repository spans almost one year (January to November 2015), with some known anomaly periods appearing both at the beginning and at the end of the year. Using the following training / evaluation split will allow Lookout for Equipment to have labelled periods on both sides of the split date:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training period: from 2015-01-01 00:00:00 to 2015-08-31 23:59:00\n", "Evaluation period: from 2015-09-01 00:00:00 to 2015-11-30 23:59:00\n" ] } ], "source": [ "timeranges_fname = os.path.join(DATA, 'timeranges.txt')\n", "shutil.copyfile(os.path.join(RAW_DATA, 'timeranges.txt'), timeranges_fname)\n", "with open(timeranges_fname, 'r') as f:\n", "    timeranges = f.readlines()\n", "\n", "# Each line holds one timestamp; strip the trailing newline before parsing:\n", "training_start = pd.to_datetime(timeranges[0].strip())\n", "training_end = pd.to_datetime(timeranges[1].strip())\n", "evaluation_start = pd.to_datetime(timeranges[2].strip())\n", "evaluation_end = pd.to_datetime(timeranges[3].strip())\n", "\n", "print(f'Training period: from {training_start} to {training_end}')\n", "print(f'Evaluation period: from {evaluation_start} to {evaluation_end}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Labels\n", "Historical maintenance record time ranges are recorded in a CSV file with two columns containing the *start time* and *end time* of each range:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>start</th><th>end</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2015-01-10 10:00:00</td><td>2015-01-10 20:00:00</td></tr>\n",
"    <tr><th>1</th><td>2015-01-15 13:00:00</td><td>2015-01-16 15:00:00</td></tr>\n",
"    <tr><th>2</th><td>2015-03-05 10:00:00</td><td>2015-03-27 00:00:00</td></tr>\n",
"    <tr><th>3</th><td>2015-03-29 10:00:00</td><td>2015-03-29 19:00:00</td></tr>\n",
"    <tr><th>4</th><td>2015-11-11 18:00:00</td><td>2015-11-13 12:00:00</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " start end\n", "0 2015-01-10 10:00:00 2015-01-10 20:00:00\n", "1 2015-01-15 13:00:00 2015-01-16 15:00:00\n", "2 2015-03-05 10:00:00 2015-03-27 00:00:00\n", "3 2015-03-29 10:00:00 2015-03-29 19:00:00\n", "4 2015-11-11 18:00:00 2015-11-13 12:00:00" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels_fname = os.path.join(LABEL_DATA, 'labels.csv')\n", "shutil.copyfile(os.path.join(RAW_DATA, 'labels.csv'), labels_fname)\n", "labels_df = pd.read_csv(labels_fname, header=None)\n", "labels_df[0] = pd.to_datetime(labels_df[0])\n", "labels_df[1] = pd.to_datetime(labels_df[1])\n", "labels_df.columns = ['start', 'end']\n", "labels_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time series\n", "The raw timeseries is a zipped Parquet file, let's unzip it:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archive: ../dataset/timeseries.zip\n", " inflating: ../data/training-data/expander.parquet \n" ] } ], "source": [ "timeseries_fname = os.path.join(RAW_DATA, 'timeseries.zip')\n", "!unzip -o $timeseries_fname -d $DATA/training-data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataframe stored there has 122 tags and 480,886 rows, ranging from *January 1* to *November 30, 2015*:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(480886, 122)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>signal-001</th><th>signal-002</th><th>signal-003</th><th>signal-004</th><th>signal-005</th><th>signal-006</th><th>signal-007</th><th>signal-008</th><th>signal-009</th><th>signal-010</th><th>...</th><th>signal-113</th><th>signal-114</th><th>signal-115</th><th>signal-116</th><th>signal-117</th><th>signal-118</th><th>signal-119</th><th>signal-120</th><th>signal-121</th><th>signal-122</th></tr>\n",
"    <tr><th>Timestamp</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2015-01-01 00:00:00</th><td>0.392371</td><td>0.545005</td><td>0.296774</td><td>0.413289</td><td>0.170744</td><td>0.482980</td><td>0.222063</td><td>0.268691</td><td>0.749860</td><td>0.475116</td><td>...</td><td>0.939024</td><td>0.830769</td><td>0.811321</td><td>0.653465</td><td>0.789474</td><td>0.810345</td><td>0.803571</td><td>0.787879</td><td>0.764706</td><td>0.810345</td></tr>\n",
"    <tr><th>2015-01-01 00:01:00</th><td>0.389415</td><td>0.569155</td><td>0.290645</td><td>0.415646</td><td>0.142368</td><td>0.532297</td><td>0.222063</td><td>0.290804</td><td>0.776781</td><td>0.486884</td><td>...</td><td>0.939024</td><td>0.830769</td><td>0.811321</td><td>0.653465</td><td>0.789474</td><td>0.810345</td><td>0.803571</td><td>0.787879</td><td>0.779412</td><td>0.810345</td></tr>\n",
"    <tr><th>2015-01-01 00:02:00</th><td>0.378179</td><td>0.547750</td><td>0.290645</td><td>0.406456</td><td>0.160959</td><td>0.470115</td><td>0.235673</td><td>0.277115</td><td>0.782389</td><td>0.472665</td><td>...</td><td>0.939024</td><td>0.830769</td><td>0.811321</td><td>0.653465</td><td>0.789474</td><td>0.810345</td><td>0.803571</td><td>0.787879</td><td>0.764706</td><td>0.810345</td></tr>\n",
"    <tr><th>2015-01-01 00:03:00</th><td>0.381135</td><td>0.547750</td><td>0.284516</td><td>0.401744</td><td>0.170744</td><td>0.498794</td><td>0.249284</td><td>0.270446</td><td>0.771733</td><td>0.484432</td><td>...</td><td>0.939024</td><td>0.830769</td><td>0.811321</td><td>0.653465</td><td>0.789474</td><td>0.810345</td><td>0.803571</td><td>0.787879</td><td>0.764706</td><td>0.810345</td></tr>\n",
"    <tr><th>2015-01-01 00:04:00</th><td>0.381135</td><td>0.553238</td><td>0.284516</td><td>0.406456</td><td>0.142368</td><td>0.493433</td><td>0.194842</td><td>0.272025</td><td>0.749860</td><td>0.484432</td><td>...</td><td>0.939024</td><td>0.830769</td><td>0.811321</td><td>0.653465</td><td>0.789474</td><td>0.810345</td><td>0.803571</td><td>0.787879</td><td>0.764706</td><td>0.810345</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 122 columns</p>\n",
"</div>
" ], "text/plain": [ " signal-001 signal-002 signal-003 signal-004 \\\n", "Timestamp \n", "2015-01-01 00:00:00 0.392371 0.545005 0.296774 0.413289 \n", "2015-01-01 00:01:00 0.389415 0.569155 0.290645 0.415646 \n", "2015-01-01 00:02:00 0.378179 0.547750 0.290645 0.406456 \n", "2015-01-01 00:03:00 0.381135 0.547750 0.284516 0.401744 \n", "2015-01-01 00:04:00 0.381135 0.553238 0.284516 0.406456 \n", "\n", " signal-005 signal-006 signal-007 signal-008 \\\n", "Timestamp \n", "2015-01-01 00:00:00 0.170744 0.482980 0.222063 0.268691 \n", "2015-01-01 00:01:00 0.142368 0.532297 0.222063 0.290804 \n", "2015-01-01 00:02:00 0.160959 0.470115 0.235673 0.277115 \n", "2015-01-01 00:03:00 0.170744 0.498794 0.249284 0.270446 \n", "2015-01-01 00:04:00 0.142368 0.493433 0.194842 0.272025 \n", "\n", " signal-009 signal-010 ... signal-113 signal-114 \\\n", "Timestamp ... \n", "2015-01-01 00:00:00 0.749860 0.475116 ... 0.939024 0.830769 \n", "2015-01-01 00:01:00 0.776781 0.486884 ... 0.939024 0.830769 \n", "2015-01-01 00:02:00 0.782389 0.472665 ... 0.939024 0.830769 \n", "2015-01-01 00:03:00 0.771733 0.484432 ... 0.939024 0.830769 \n", "2015-01-01 00:04:00 0.749860 0.484432 ... 
0.939024 0.830769 \n", "\n", " signal-115 signal-116 signal-117 signal-118 \\\n", "Timestamp \n", "2015-01-01 00:00:00 0.811321 0.653465 0.789474 0.810345 \n", "2015-01-01 00:01:00 0.811321 0.653465 0.789474 0.810345 \n", "2015-01-01 00:02:00 0.811321 0.653465 0.789474 0.810345 \n", "2015-01-01 00:03:00 0.811321 0.653465 0.789474 0.810345 \n", "2015-01-01 00:04:00 0.811321 0.653465 0.789474 0.810345 \n", "\n", " signal-119 signal-120 signal-121 signal-122 \n", "Timestamp \n", "2015-01-01 00:00:00 0.803571 0.787879 0.764706 0.810345 \n", "2015-01-01 00:01:00 0.803571 0.787879 0.779412 0.810345 \n", "2015-01-01 00:02:00 0.803571 0.787879 0.764706 0.810345 \n", "2015-01-01 00:03:00 0.803571 0.787879 0.764706 0.810345 \n", "2015-01-01 00:04:00 0.803571 0.787879 0.764706 0.810345 \n", "\n", "[5 rows x 122 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_tags_fname = os.path.join(DATA, 'training-data', 'expander.parquet')\n", "table = pq.read_table(all_tags_fname)\n", "all_tags_df = table.to_pandas()\n", "del table\n", "\n", "print(all_tags_df.shape)\n", "all_tags_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tags description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset comes with a tag description file that includes:\n", "\n", "* `Tag`: the tag name as it is recorded by the customer in their historian system (for instance the [Honeywell process history database](https://www.honeywellprocess.com/en-US/explore/products/advanced-applications/uniformance/Pages/uniformance-phd.aspx))\n", "* `UOM`: the unit of measure for the recorded signal\n", "* `Subsystem`: an ID linked to the part of the asset this sensor is attached to" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Tag</th><th>UOM</th><th>Subsystem</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>signal-001</td><td>micra pp</td><td>subsystem-05</td></tr>\n",
"    <tr><th>1</th><td>signal-002</td><td>micra pp</td><td>subsystem-05</td></tr>\n",
"    <tr><th>2</th><td>signal-003</td><td>micra pp</td><td>subsystem-05</td></tr>\n",
"    <tr><th>3</th><td>signal-004</td><td>micra pp</td><td>subsystem-05</td></tr>\n",
"    <tr><th>4</th><td>signal-005</td><td>micra pp</td><td>subsystem-08</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Tag UOM Subsystem\n", "0 signal-001 micra pp subsystem-05\n", "1 signal-002 micra pp subsystem-05\n", "2 signal-003 micra pp subsystem-05\n", "3 signal-004 micra pp subsystem-05\n", "4 signal-005 micra pp subsystem-08" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tags_description_fname = os.path.join(RAW_DATA, 'tags_description.csv')\n", "tags_description_df = pd.read_csv(tags_description_fname)\n", "tags_description_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's extract a list of features from this table: we sort them by unit of measure so that signals sharing the same unit are listed together:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "features = list(tags_description_df.sort_values(by='UOM')['Tag'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset overview\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a list of dataframes, one per feature (this will be useful for visualization purposes). An early event in the year skews the data: we remove that part for visualization purposes only (hence the `start` and `end` range defined below), but we will keep the period starting January 1st as part of the training period later on." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "start = pd.to_datetime('2015-04-05 00:00:00')\n", "end = evaluation_end\n", "\n", "df_list = []\n", "feature_groups = dict()\n", "for f in features:\n", "    # Get the unit of measure for the current feature:\n", "    uom = str(tags_description_df.loc[tags_description_df['Tag'] == f, 'UOM'].iloc[0])\n", "\n", "    # Add the feature to its unit-of-measure group, creating the group if needed:\n", "    feature_groups.setdefault(uom, []).append(f)\n", "\n", "    # Add the dataframe to the list (missing values are replaced by 0.0):\n", "    current_df = all_tags_df.loc[start:end, [f]]\n", "    current_df = current_df.fillna(0.0)\n", "    df_list.append(current_df)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tag = 'signal-028'\n", "tag_df = all_tags_df.loc[start:end, [tag]]\n", "tag_df.columns = ['Value']\n", "\n", "fig, axes = lookout.plot_timeseries(\n", "    tag_df, \n", "    tag, \n", "    fig_width=20, \n", "    tag_split=evaluation_start, \n", "    labels_df=labels_df\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the following two cells **only on instances with high memory** (at least `ml.m5.xlarge`): on smaller instances, the images generated by `matplotlib` are too large to be displayed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = tsia.plot.plot_multivariate_timeseries(\n", "    timeseries_list=df_list,\n", "    tags_list=features,\n", "    split_date=evaluation_start,\n", "    tags_description_df=tags_description_df,\n", "    tags_grouping_key='UOM',\n", "    num_cols=4,\n", "    col_size=5\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Discretize each signal in 3 bins:\n", "array = tsia.markov.discretize_multivariate(df_list)\n", "\n", "# Group the signals based on their unit of measure (UOM), with a row\n", "# of NaN values used as a visual separator between two groups:\n", "num_timesteps = array.shape[1]\n", "separator = np.full(shape=(1, num_timesteps), fill_value=np.nan)\n", "grouped_array = []\n", "signal_list = []\n", "current_row = 0\n", "for uom in feature_groups.keys():\n", "    num_features = len(feature_groups[uom])\n", "    signal_list = signal_list + features[current_row:current_row + num_features + 1]\n", "    signal_list.append(uom)\n", "    grouped_array.append(array[current_row:current_row + num_features + 1])\n", "    grouped_array.append(separator)\n", "    current_row += num_features\n", "grouped_array = np.concatenate(grouped_array)\n", "\n", "# Plot the strip chart:\n", "tsia.plot.plot_timeseries_strip_chart(\n", "    grouped_array, \n", "    signal_list=signal_list,\n", "    fig_width=20,\n", "    
dates=df_list[0].index.to_pydatetime(),\n", "    day_interval=2\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building and uploading the dataset\n", "---\n", "We will structure our S3 bucket as follows (these folders are created under the `PREFIX` defined at the top of this notebook):\n", "```\n", "s3://sagemaker-lookout-equipment-demo/\n", "|\n", "+-- training-data/\n", "|   |\n", "|   +-- expander/\n", "|       |\n", "|       +-- subsystem-01\n", "|       |   \\-- subsystem-01.csv\n", "|       |\n", "|       +-- subsystem-02\n", "|       |   \\-- subsystem-02.csv\n", "|       |\n", "|       +-- ...\n", "|       |\n", "|       \\-- subsystem-24\n", "|           \\-- subsystem-24.csv\n", "|\n", "+-- labelled-data/\n", "    |\n", "    \\-- labels.csv\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Component subsystem-17: 100%|██████████| 24/24 [00:00<00:00, 398.16it/s]\n" ] } ], "source": [ "# Process each subsystem one by one:\n", "components = list(tags_description_df['Subsystem'].unique())\n", "progress_bar = tqdm(components)\n", "for component in progress_bar:\n", "    progress_bar.set_description(f'Component {component}')\n", "    progress_bar.refresh()\n", "\n", "    # Check if the CSV file already exists and do not overwrite it:\n", "    component_tags_fname = os.path.join(TRAIN_DATA, component, f'{component}.csv')\n", "    if not os.path.exists(component_tags_fname):\n", "        # Build the dataframe with all the signal timeseries for the current subsystem:\n", "        component_tags_list = list(tags_description_df[tags_description_df['Subsystem'] == component]['Tag'])\n", "        component_tags_df = all_tags_df[component_tags_list]\n", "        component_tags_df = component_tags_df.reset_index()\n", "        component_tags_df['Timestamp'] = component_tags_df['Timestamp'].dt.strftime('%Y-%m-%dT%H:%M:%S.%f')\n", "\n", "        # Save to disk:\n", "        os.makedirs(os.path.join(TRAIN_DATA, component), exist_ok=True)\n", "        component_tags_df.to_csv(component_tags_fname, index=False)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "upload: ../data/training-data/expander/subsystem-03/subsystem-03.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-03/subsystem-03.csv\n", "upload: ../data/training-data/expander/subsystem-06/subsystem-06.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-06/subsystem-06.csv\n", "upload: ../data/training-data/expander/subsystem-01/subsystem-01.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-01/subsystem-01.csv\n", "upload: ../data/training-data/expander/subsystem-02/subsystem-02.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-02/subsystem-02.csv\n", "upload: ../data/training-data/expander/subsystem-04/subsystem-04.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-04/subsystem-04.csv\n", "upload: ../data/training-data/expander/subsystem-10/subsystem-10.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-10/subsystem-10.csv\n", "upload: ../data/training-data/expander/subsystem-11/subsystem-11.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-11/subsystem-11.csv\n", "upload: ../data/training-data/expander/subsystem-12/subsystem-12.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-12/subsystem-12.csv\n", "upload: ../data/training-data/expander/subsystem-13/subsystem-13.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-13/subsystem-13.csv\n", "upload: ../data/training-data/expander/subsystem-14/subsystem-14.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-14/subsystem-14.csv\n", "upload: ../data/training-data/expander/subsystem-15/subsystem-15.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-15/subsystem-15.csv\n", "upload: 
../data/training-data/expander/subsystem-16/subsystem-16.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-16/subsystem-16.csv\n", "upload: ../data/training-data/expander/subsystem-17/subsystem-17.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-17/subsystem-17.csv\n", "upload: ../data/training-data/expander/subsystem-05/subsystem-05.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-05/subsystem-05.csv\n", "upload: ../data/training-data/expander/subsystem-08/subsystem-08.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-08/subsystem-08.csv\n", "upload: ../data/training-data/expander/subsystem-09/subsystem-09.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-09/subsystem-09.csv\n", "upload: ../data/training-data/expander/subsystem-07/subsystem-07.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-07/subsystem-07.csv\n", "upload: ../data/training-data/expander/subsystem-18/subsystem-18.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-18/subsystem-18.csv\n", "upload: ../data/training-data/expander/subsystem-19/subsystem-19.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-19/subsystem-19.csv\n", "upload: ../data/training-data/expander/subsystem-20/subsystem-20.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-20/subsystem-20.csv\n", "upload: ../data/training-data/expander/subsystem-21/subsystem-21.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-21/subsystem-21.csv\n", "upload: ../data/training-data/expander/subsystem-22/subsystem-22.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-22/subsystem-22.csv\n", "upload: ../data/training-data/expander/subsystem-23/subsystem-23.csv to 
s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-23/subsystem-23.csv\n", "upload: ../data/training-data/expander/subsystem-24/subsystem-24.csv to s3://sagemaker-lookout-equipment-demo/data4/training-data/expander/subsystem-24/subsystem-24.csv\n" ] } ], "source": [ "# Uploading training dataset to S3:\n", "training_src_dir = TRAIN_DATA\n", "training_s3_dest_path = f's3://{BUCKET}/{PREFIX}/training-data/expander'\n", "!aws s3 cp --recursive $training_src_dir $training_s3_dest_path" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "upload: ../data/labelled-data/labels.csv to s3://sagemaker-lookout-equipment-demo/data4/labelled-data/labels.csv\n" ] } ], "source": [ "# Uploading label dataset to S3:\n", "label_src_fname = os.path.join(LABEL_DATA, 'labels.csv')\n", "label_s3_dest_path = f's3://{BUCKET}/{PREFIX}/labelled-data/labels.csv'\n", "!aws s3 cp $label_src_fname $label_s3_dest_path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "---\n", "At this stage, we have built:\n", "* A single Parquet dataset that contains all the historical data for all tags provided by the customer: this is **58,668,092 data points** (480,886 rows for each of the **122 tags**) at a **1-minute** sampling rate.\n", "* **24 individual CSV files** (one per subsystem, each of which can contain several timeseries) stored in their respective subsystem directories.\n", "\n", "Looking at the plot for **signal-028** above, we are going to try to predict the event that happens in **November 2015**: to achieve this, we will use a training set spanning from **January 2015** to **August 2015** and we will test on **September 2015** to **November 2015**." 
] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }