{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "A multivariate LSTM neural network for predicting future pollution levels to efficiently operate air filters
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# About" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting failure, remaining useful life or operating conditions is a classic request of IoT systems. By predicting which devices will fail, proactive maintenance can be scheduled to increase device uptime, optimize asset utilization, avoid costly catastrophic device failure and optimize field service efficiency. In this Notebook template, we will show how to implement a multivariate LSTM algorithm to predict pollution levels using the [Beijing PM2.5 data set](https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This Jupyter notebook is part of a larger solution intended to be deployed to your AWS account. Run through this notebook after air pollution data has been published into your account and stored in IoT Analytics. After finishing all steps in this notebook, you will move on to the next stage of the solution which is deploying your LSTM model to an edge solution with IoT Greengrass." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Architecture diagram
\n", "This notebook requires a few basic Python libraries including `pandas`, `numpy`, `keras`, `tensorflow`, `scikit-learn` and `matplotlib`.
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "from pandas import read_csv\n", "from datetime import datetime\n", "from math import sqrt\n", "from numpy import concatenate\n", "from matplotlib import pyplot\n", "from pandas import to_datetime\n", "from pandas import DataFrame\n", "from pandas import concat\n", "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.metrics import mean_squared_error\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.layers import LSTM\n", "from joblib import dump, load\n", "import numpy\n", "import sagemaker\n", "import pickle\n", "import warnings\n", "import tensorflow\n", "from sagemaker.tensorflow import TensorFlow\n", "import json\n", "from sagemaker import get_execution_role\n", "from urllib.parse import urlparse\n", "import os\n", "import boto3\n", "import tarfile\n", "from keras.models import model_from_json\n", "from IPython.display import display, Markdown\n", "import traceback\n", " \n", "###Turn off warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "if type(tensorflow.contrib) != type(tensorflow): tensorflow.contrib._warning = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Background: problem description and approach" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to support business decision making, we must be able to take action on predictive analytics. For predictive operational analysis, we want to know what would be the operating condition of equipment to optimize the parameters for increasing the life of that equipment. 
To solve this for real-world equipment that has multiple operation modes and reports time-series measurements from multiple sensors, our primary approach is a Long Short-Term Memory (LSTM) neural network in TensorFlow. The LSTM performs multivariate time-series forecasting to predict future pollution levels, which could be used to automatically adjust the parameters of the air filter to improve efficiency and increase the remaining useful life (RUL) of air filters.\n", "\n", "In the [CRISP-DM lifecycle](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining), we have already completed the Business Understanding phase: our objective is to increase the RUL of our air filtration equipment, and we are going to try using pollution forecasting to drive improvements. The next phase, which starts in this notebook, is Data Understanding. We need to evaluate the available data, look for trends, and assess whether this data will be useful to our objective.\n", "\n", "BEST PRACTICES NOTE: Different modeling approaches provide different business trade-offs, and different teams may opt for different approaches. For example, although the field service team is interested in the most precise prediction for any given filter in order to eliminate false positives and unnecessary truck rolls, a supervisor is interested in all possible devices that aren't operating at full efficiency in order to increase overall equipment effectiveness (OEE)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data Set Attribution:
Song Xi Chen, Guanghua School of Management, Center for Statistical Science, Peking University\n", "\n", "The data can be downloaded from this page: https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data has been split into training, validation, and test data sets. The first two years of data are used for training, and the next two years are used for validation to check the training accuracy. The final year of data will later be used for device simulation testing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To load data already ingested from your device to the cloud, we need to set up an IoT Analytics SDK client to access your data set. This code initializes the client, fetches content from your data set, sorts by time, and previews the first five rows." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [
| date | pollution | dew | temp | press | wnd_dir | wnd_spd | snow | rain |
|---|---|---|---|---|---|---|---|---|
| 2010-01-02 00:00:00 | 129.0 | -16 | -4.0 | 1020.0 | 2 | 1.79 | 0 | 0 |
| 2010-01-02 01:00:00 | 148.0 | -15 | -4.0 | 1020.0 | 2 | 2.68 | 0 | 0 |
| 2010-01-02 02:00:00 | 159.0 | -11 | -5.0 | 1021.0 | 2 | 3.57 | 0 | 0 |
| 2010-01-02 03:00:00 | 181.0 | -7 | -5.0 | 1022.0 | 2 | 5.36 | 1 | 0 |
| 2010-01-02 04:00:00 | 138.0 | -7 | -5.0 | 1022.0 | 2 | 6.25 | 2 | 0 |
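The chronological split described earlier (two years for training, two for validation, the final year for simulation) can be sketched as a date-based slice. This is a minimal sketch, not the notebook's own code: the function name `split_by_date`, the boundary dates, and the synthetic hourly frame are all assumptions made for illustration.

```python
import pandas as pd

def split_by_date(df, train_end, val_end):
    """Split a time-indexed frame chronologically, without shuffling.

    train: rows before train_end
    val:   rows from train_end up to (not including) val_end
    test:  rows from val_end onward
    """
    df = df.sort_index()
    train_end = pd.Timestamp(train_end)
    val_end = pd.Timestamp(val_end)
    train = df[df.index < train_end]
    val = df[(df.index >= train_end) & (df.index < val_end)]
    test = df[df.index >= val_end]
    return train, val, test

# Synthetic hourly data standing in for the real data set (assumption:
# the date range and the boundaries below are illustrative only).
index = pd.date_range("2010-01-02", "2014-12-31 23:00", freq=pd.Timedelta(hours=1))
frame = pd.DataFrame({"pollution": range(len(index))}, index=index)

train, val, test = split_by_date(frame, "2012-01-01", "2014-01-01")
print(len(train), len(val), len(test))
```

Splitting by timestamp rather than by random sampling keeps every validation and test row strictly later than the training rows, which avoids leaking future readings into training.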
| | pollution(t-1) | dew(t-1) | temp(t-1) | press(t-1) | wnd_dir(t-1) | wnd_spd(t-1) | snow(t-1) | rain(t-1) | pollution(t) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.129779 | 0.352941 | 0.245902 | 0.527273 | 0.666667 | 0.002290 | 0.000000 | 0.0 | 0.148893 |
| 2 | 0.148893 | 0.367647 | 0.245902 | 0.527273 | 0.666667 | 0.003811 | 0.000000 | 0.0 | 0.159960 |
| 3 | 0.159960 | 0.426471 | 0.229508 | 0.545454 | 0.666667 | 0.005332 | 0.000000 | 0.0 | 0.182093 |
| 4 | 0.182093 | 0.485294 | 0.229508 | 0.563637 | 0.666667 | 0.008391 | 0.037037 | 0.0 | 0.138833 |
| 5 | 0.138833 | 0.485294 | 0.229508 | 0.563637 | 0.666667 | 0.009912 | 0.074074 | 0.0 | 0.109658 |
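The table above pairs every feature at hour t-1 with the pollution value at hour t, with all columns scaled to the range 0 to 1. One way to produce a frame of that shape is sketched below. It is an illustration under assumptions, not the notebook's own code: the helper name `frame_one_step` and the sample values are hypothetical, and the min-max scaling is written out in pandas (it computes the same quantity as scikit-learn's `MinMaxScaler` with its default range).

```python
import pandas as pd

def frame_one_step(df, target="pollution"):
    """Scale each column to [0, 1], then pair features at t-1 with the target at t.

    Assumes no column is constant; a constant column would divide by zero
    and its rows would be dropped as NaN.
    """
    values = df.astype("float64")
    scaled = (values - values.min()) / (values.max() - values.min())
    # Shift every feature down one row so row t holds the readings from t-1.
    framed = scaled.shift(1).add_suffix("(t-1)")
    # The prediction target is the unshifted pollution value at time t.
    framed[f"{target}(t)"] = scaled[target]
    # The first row has no predecessor, so it is dropped.
    return framed.dropna()

# Made-up sample rows for illustration only.
sample = pd.DataFrame(
    {
        "pollution": [129.0, 148.0, 159.0, 0.0, 994.0],
        "dew": [-16, -15, -11, -40, 28],
    }
)
framed = frame_one_step(sample)
print(framed)
```

Because the first row has nothing to look back on, the framed index starts at 1, which matches the row labels in the table.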
| date | pollution(t-1) | predicted_pollution(t) |
|---|---|---|
| 2010-01-02 00:00:00 | 129.0 | 129.0 |
| 2010-01-02 01:00:00 | 148.0 | 148.0 |
| 2010-01-02 02:00:00 | 159.0 | 159.0 |
| 2010-01-02 03:00:00 | 181.0 | 181.0 |
| 2010-01-02 04:00:00 | 138.0 | 138.0 |