# Data exploration/visualization

**SageMaker Studio Kernel**: Data Science

The challenge we're trying to address here is to detect anomalies in the components of a wind turbine. Each wind turbine has many sensors that read data like:
 - Internal & external temperature
 - Wind speed
 - Rotor speed
 - Air pressure
 - Voltage (or current) in the generator
 - Vibration in the GearBox (using an IMU -> Accelerometer + Gyroscope)

So, depending on the types of the anomalies we want to detect, we need to select one or more features and then prepare a dataset that 'explains' the anomalies. We are interested in three types of anomalies:
 - Rotor speed (when the rotor is not running at the expected speed)
 - Produced voltage (when the generator is not producing the expected voltage)
 - Gearbox vibration (when the vibration of the gearbox is far from the expected level)
 
All three anomalies (or violations) depend on many variables while the turbine is working. To address that, let's use an ML model called [Autoencoder](https://en.wikipedia.org/wiki/Autoencoder), with correlated features. This model is unsupervised: it learns a latent representation of the dataset and tries to reconstruct (as a regression) the same tensor it receives as input. The strategy then is to train it on a dataset collected from a normal turbine (without anomalies), so the model learns **'what is a normal turbine'**. When the sensor readings of a malfunctioning turbine are used as input, the model will not be able to rebuild them: the reconstruction has a high error, and that high error is what flags the reading as an anomaly.
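
The reconstruction-error idea can be illustrated without a neural network. The sketch below is not the model trained in this workshop: it swaps in PCA (a linear stand-in for an autoencoder) fitted on synthetic data, and the feature values, the 99th-percentile threshold and all names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: two correlated features (stand-ins for rps and voltage).
rps = rng.normal(size=1000)
normal = np.column_stack([rps, 2.0 * rps + 0.05 * rng.normal(size=1000)])

# Fit a 1-component linear "autoencoder" (PCA): encode = project, decode = back-project.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:1]  # shape (1, 2): the single latent direction


def reconstruction_error(x):
    """Per-sample mean squared error between x and its reconstruction."""
    latent = (x - mean) @ components.T    # encode
    rebuilt = latent @ components + mean  # decode
    return np.mean((x - rebuilt) ** 2, axis=1)


# Threshold taken from the errors on normal data (the percentile is an arbitrary choice).
threshold = np.percentile(reconstruction_error(normal), 99)

# A reading that breaks the learned relationship: rotor spinning, no voltage.
broken = np.array([[2.0, 0.0]])
is_anomaly = reconstruction_error(broken) > threshold
print(is_anomaly)
```

The model only ever sees normal data, so any input that violates the learned relationship between features reconstructs poorly and ends up above the threshold.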

The sequence of sensor readings can be seen as a time-series dataset, so we observe a high correlation between neighboring samples. We can exploit this by reformatting the data as a multidimensional tensor. We'll create a temporal encoding of six features in 10x10 steps of 250ms each. Each 250ms step is computed from 5 samples (the time interval between samples is ~50ms), so the 100 steps cover about 25 seconds of readings. It means that we will create a tensor with a shape of 6x10x10.

![Tensor](imgs/tensor.png)

In the tensor above, each color is a different feature, encoded in 100 (10x10) timesteps (from the current reading to the past in a sliding window).
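
A minimal sketch of this encoding with NumPy is shown below. It is not the code from the `preprocessing` script: averaging the 5 samples of each step and ordering the steps from oldest to newest are assumptions, and `encode_window` is a hypothetical helper name.

```python
import numpy as np

N_FEATURES = 6        # roll, pitch, yaw, wind_speed_rps, rps, voltage
SAMPLES_PER_STEP = 5  # 5 samples x ~50 ms = one 250 ms step
STEPS = 10 * 10       # 100 steps per feature


def encode_window(readings):
    """Turn the latest 500 readings (rows=time, cols=features) into a 6x10x10 tensor."""
    needed = STEPS * SAMPLES_PER_STEP
    if readings.shape[0] < needed:
        raise ValueError(f"need at least {needed} readings")
    window = readings[-needed:]  # the most recent 500 samples
    # Group every 5 consecutive samples and average them (assumed aggregation).
    steps = window.reshape(STEPS, SAMPLES_PER_STEP, N_FEATURES).mean(axis=1)
    # Put features first, then fold the 100 steps into a 10x10 grid.
    return steps.T.reshape(N_FEATURES, 10, 10)


# Fake readings: 600 time samples of 6 features.
readings = np.arange(600 * N_FEATURES, dtype=float).reshape(600, N_FEATURES)
tensor = encode_window(readings)
print(tensor.shape)  # (6, 10, 10)
```

Sliding the window forward by one or more samples and calling the helper again yields the next tensor in the sequence.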

Let's start preparing our dataset, then.

### Install these libraries to improve data visualization

In [None]:
!pip install -U matplotlib==3.4.1 seaborn==0.11.1

### And download the sensors (raw) data
This dataset was created by mini wind turbines, 3D printed and assembled for experimenting with ML@Edge. If you're interested in building your own 3D printed mini wind turbines, please check the link to this project on the home page of this workshop.

In [None]:
!mkdir -p data
!curl https://aws-ml-blog.s3.amazonaws.com/artifacts/monitor-manage-anomaly-detection-model-wind-turbine-fleet-sagemaker-neo/dataset_wind_turbine.csv.gz -o data/dataset_wind.csv.gz

## Let's take a look at the data
Loading the dataset using Pandas...

In [None]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
## preprocessing is the data preparation script we'll use in our automated ML Pipeline
## here, it will be just a loaded library
import preprocessing as dataprep

In [None]:
parser = lambda date: datetime.strptime(date, '%Y-%m-%dT%H:%M:%S.%f+00:00')
df = pd.read_csv('data/dataset_wind.csv.gz', compression="gzip", sep=',', low_memory=False, parse_dates=[ 'eventTime'], date_parser=parser)

df.head()

Features:
 - **nanoId**: id of the edge device that collected the data
 - **turbineId**: id of the turbine that produced this data
 - **arduino_timestamp**: timestamp of the arduino that was operating this turbine
 - **nanoFreemem**: amount of free memory in bytes
 - **eventTime**: timestamp of the row
 - **rps**: rotation of the rotor in Rotations Per Second
 - **voltage**: voltage produced by the generator in millivolts
 - **qw, qx, qy, qz**: quaternion angular acceleration
 - **gx, gy, gz**: gravity acceleration
 - **ax, ay, az**: linear acceleration
 - **gearboxtemp**: internal temperature
 - **ambtemp**: external temperature
 - **humidity**: air humidity
 - **pressure**: air pressure
 - **gas**: air quality
 - **wind_speed_rps**: wind speed in Rotations Per Second

In [None]:
## we will select the following features to prepare our dataset
## with these features we have parameters for vibration, rotation and voltage
quat=['qx', 'qy', 'qz', 'qw']
rot=['wind_speed_rps', 'rps']
volt=['voltage']
features = quat + rot + volt

### Plotting the vibration data, just to get an idea

In [None]:
df[quat[:3]].iloc[1910:2000].plot(figsize=(20,10))

### Now, plot the rotation of the turbine and the wind speed in RPS

In [None]:
df[rot].iloc[1910:2000].plot(figsize=(20,10))

### Finally, plot the voltage readings

In [None]:
df[volt].iloc[1910:2000].plot(figsize=(20,10))

## Data preparation
The raw data for rotation is formatted as angular acceleration using a Quaternion representation. We can convert it to Euler angles (roll, pitch and yaw) to make it easier to understand.

In [None]:
print('now converting quat to euler...')
roll, pitch, yaw = [], [], []
# iterate over the four quaternion columns directly (much faster than df.iterrows())
for qx, qy, qz, qw in zip(df['qx'], df['qy'], df['qz'], df['qw']):
    r, p, y = dataprep.euler_from_quaternion(qx, qy, qz, qw)
    roll.append(r)
    pitch.append(p)
    yaw.append(y)
df['roll'] = roll
df['pitch'] = pitch
df['yaw'] = yaw
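
The body of `dataprep.euler_from_quaternion` is not shown in this notebook. As a rough idea of what such a conversion involves, here is a sketch of the standard formula for a unit quaternion, assuming the ZYX (yaw-pitch-roll) convention and angles in radians. The name `quaternion_to_euler` is hypothetical, and the real helper may use a different convention or unit.

```python
import math


def quaternion_to_euler(x, y, z, w):
    """Convert a unit quaternion to (roll, pitch, yaw) in radians (ZYX convention)."""
    # Roll: rotation around the x-axis.
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Pitch: rotation around the y-axis; clamp to avoid domain errors from rounding.
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    # Yaw: rotation around the z-axis.
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw


# A 90-degree rotation around the z-axis should only change yaw.
half = math.sqrt(0.5)
roll, pitch, yaw = quaternion_to_euler(0.0, 0.0, half, half)
print(roll, pitch, math.degrees(yaw))
```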

## Then, we can denoise and normalize the data to complete the process

In [None]:
df_train = df.copy()

# select the features
features = ['roll', 'pitch', 'yaw', 'wind_speed_rps', 'rps', 'voltage']

# get the std for denoising
raw_std = df_train[features].std()
for f in features:
 df_train[f] = dataprep.wavelet_denoise(df_train[f].values, 'db6', raw_std[f])#[:-1]

# normalize
training_std = df_train[features].std()
training_mean = df_train[features].mean()
df_train = (df_train[features] - training_mean) / training_std

df_train.head()
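
To give an intuition for the two steps above, here is a simplified sketch. It does not reproduce `dataprep.wavelet_denoise`, whose implementation is not shown here: it uses a single-level Haar wavelet instead of `'db6'`, a hand-picked threshold, and hypothetical helper names (`haar_denoise`, `zscore`).

```python
import numpy as np


def haar_denoise(signal, threshold):
    """One-level Haar wavelet denoising with soft thresholding (even-length input)."""
    signal = np.asarray(signal, dtype=float)
    if signal.size % 2:
        raise ValueError("this sketch needs an even number of samples")
    even, odd = signal[0::2], signal[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-frequency part (local averages)
    detail = (even - odd) / np.sqrt(2.0)  # high-frequency part (local differences)
    # Soft threshold: shrink detail coefficients toward zero; small ones vanish.
    detail = np.sign(detail) * np.maximum(np.abs(detail) - threshold, 0.0)
    out = np.empty_like(signal)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out


def zscore(values, mean, std):
    """Standardize with a given mean and std (e.g. the training statistics)."""
    return (values - mean) / std


noisy = np.array([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1])
smooth = haar_denoise(noisy, threshold=0.5)
normalized = zscore(smooth, smooth.mean(), smooth.std(ddof=1))
print(smooth)
```

Small sample-to-sample jitter is removed while the large jump between the two levels survives. Passing the mean and standard deviation explicitly makes it easy to apply the same statistics (such as `training_mean` and `training_std`) to new readings later.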

### Alright, this is our dataset. Let's just plot the original vs the prepared data
**Original Data**

In [None]:
df[features][:2000].plot(figsize=(20,10))

**Denoised & Normalized Data**

In [None]:
df_train[:2000].plot(figsize=(20,10))

 > There is too much noise in the raw data, especially in the accelerometer + gyroscope readings.
 > This process is important to remove that noise and make the model more efficient.

In [None]:
import seaborn as sns
corr = df[features].corr()

fig, ax = plt.subplots(figsize=(15, 8))

sns.heatmap(corr, annot=True, fmt="f",
 xticklabels=corr.columns.values,
 yticklabels=corr.columns.values,
 ax=ax)

As expected, the linear correlation between **rps (rotation speed)** and **voltage** is high. We need to keep both, because the model needs to learn what a normal relationship between these two and the other features looks like.

Alright! Now you can start exercise #2: create an ML pipeline to train your model and then deploy it to the edge devices.

 > [Exercise 02](02%20-%20Training%20with%20Pytorch.ipynb)