# Pre-Process AI4I 2020 Predictive Maintenance dataset

#### In this notebook we download, explore and pre-process the dataset from the UCI Machine Learning Repository so that it can later be used for training an ML model.

## Dataset Description:

The dataset we use here for predictive maintenance comes from the UCI Machine Learning Repository. It is a synthetic dataset that records machine failures together with features such as air temperature, process temperature, rotational speed, torque and tool wear. Additional details about this dataset can be found here:

https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset

The machine failure consists of five independent failure modes:

Tool wear failure (TWF): the tool will be replaced or fail at a randomly selected tool wear time between 200 and 240 mins (120 times in our dataset). At this point in time, the tool is replaced 69 times, and fails 51 times (randomly assigned).<br>
Heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 datapoints.<br>
Power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.<br>
Overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 for M, 13,000 for H), the process fails due to overstrain. This is true for 98 datapoints.<br>
Random failures (RNF): each process has a chance of 0.1 % to fail regardless of its process parameters. This is the case for only 5 datapoints, fewer than could be expected for the 10,000 datapoints in our dataset.<br>

If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1.

The 'machine failure' label therefore indicates whether the machine has failed at a particular datapoint because at least one of these failure modes is true.
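
The deterministic rules above (HDF, PWF and OSF) can be expressed directly on the sensor readings. The following is a minimal sketch under stated assumptions: the rows are hypothetical values rather than records from the dataset, the column names mirror the ones used in this notebook, and rotational speed is converted from rpm to rad/s with a factor of 2π/60. TWF and RNF are left out because they involve random assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows (NOT taken from the dataset), for illustration only.
sample = pd.DataFrame({
    "type": ["L", "M", "H"],
    "air_temperature": [302.0, 298.0, 300.0],      # K
    "process_temperature": [309.5, 309.0, 310.5],  # K
    "rotational_speed": [1300, 1500, 2800],        # rpm
    "torque": [60.0, 40.0, 10.0],                  # Nm
    "tool_wear": [200, 100, 50],                   # min
})

# HDF: temperature difference below 8.6 K AND rotational speed below 1380 rpm.
temp_diff = sample["process_temperature"] - sample["air_temperature"]
hdf = (temp_diff < 8.6) & (sample["rotational_speed"] < 1380)

# PWF: power = torque * angular speed (rad/s); fails outside 3500..9000 W.
power = sample["torque"] * sample["rotational_speed"] * 2 * np.pi / 60
pwf = (power < 3500) | (power > 9000)

# OSF: tool wear * torque exceeds a threshold that depends on the variant.
osf_limit = sample["type"].map({"L": 11000, "M": 12000, "H": 13000})
osf = sample["tool_wear"] * sample["torque"] > osf_limit

rule_flags = pd.DataFrame({"HDF": hdf, "PWF": pwf, "OSF": osf}).astype(int)
# A row counts as a failure if at least one mode is set.
rule_flags["any_failure"] = rule_flags.max(axis=1)
print(rule_flags)
```

Because the real labels also include the randomly assigned TWF and RNF modes, flags computed this way would not be expected to reproduce the 'machine failure' column exactly.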

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import os
import seaborn as sb
import matplotlib.pyplot as mplt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline

### Set up Paths and Directories

In [None]:
# Create the output directories for the training and test splits.
# exist_ok=True makes the call a no-op if the directory already exists,
# while other errors (e.g. missing permissions) are still raised.
os.makedirs('training_data', exist_ok=True)
os.makedirs('test_data', exist_ok=True)


### Download the dataset

In [None]:
! curl --insecure https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv  --output ai4i2020.csv

In [None]:
cols = ["UDI", "Product_ID", "type", "air_temperature", "process_temperature", "rotational_speed", "torque", "tool_wear", "machine_failure", "TWF", "HDF", "PWF", "OSF", "RNF"]


In [None]:
origdf = pd.read_csv('ai4i2020.csv', sep=',', encoding = 'utf-8', skiprows=1, names = cols)

In [None]:
origdf.head(10)
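
The `skiprows=1` and `names=cols` arguments used above work together: the file's own header line is discarded and the shorter column names are applied instead. A minimal sketch of that behaviour on an in-memory CSV follows. The header text, the rows and the reduced column list `short_cols` are assumptions made for this illustration, not a copy of the downloaded file.

```python
import io
import pandas as pd

# Illustrative CSV content (hypothetical header and values).
raw_csv = io.StringIO(
    "UDI,Product ID,Type,Air temperature [K],Torque [Nm]\n"
    "1,M00001,M,298.1,42.8\n"
    "2,L00002,L,298.2,46.3\n"
)

# Hypothetical shortened column names, in the same order as the file's columns.
short_cols = ["UDI", "Product_ID", "type", "air_temperature", "torque"]

# skiprows=1 drops the original header line; names= supplies the replacement.
demo = pd.read_csv(raw_csv, skiprows=1, names=short_cols)
print(demo.columns.tolist())
print(demo.shape)
```

Without `skiprows=1`, the original header line would be read as a data row, which would turn the numeric columns into strings.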

Note the following: 

1. Our goal is to predict the label "machine_failure" based on features such as sensor readings (temperature, speed, etc.) and other contextual information (for example, type).
2. Machine failure is indicated by '1', and '0' indicates no failure.

The feature 'type' gives the product quality variant as L, M, or H for low (50% of all products), medium (30%) and high (20%). The variant-specific serial number is part of Product_ID, not of 'type'.
However, since 'type' is categorical, we need to encode it for use in our model. We will use scikit-learn's LabelEncoder to achieve this.

In [None]:
type_encoder = LabelEncoder()
type_encoder.fit(origdf['type'])
type_values = type_encoder.transform(origdf['type'])
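
To see which integer each variant receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a hypothetical list of labels in place of the `type` column. LabelEncoder sorts the classes, so the codes follow alphabetical order rather than the quality order low, medium, high.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the 'type' column.
encoder = LabelEncoder()
encoder.fit(["L", "M", "H", "L"])

# classes_ holds the sorted unique labels; the position of a label is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
original = encoder.inverse_transform([0, 1, 2]).tolist()
print(original)
```

If the ordering matters to the downstream model, an explicit mapping or one-hot encoding may be a better fit. That is a design choice outside what this notebook does.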

Insert the encoded feature into the dataframe and drop the original 'type' feature. We also drop UDI and Product_ID, since they are just identifiers and provide little value as features.

In [None]:
origdf.insert(2, "type_enc", type_values, True)
origdf = origdf.drop(columns = ['UDI', 'type', 'Product_ID'])

In [None]:
origdf.head(10)

### Data Exploration

#### Class Imbalance

Let's start by looking at how often each class of our label "machine_failure" appears in the dataset. This is important for understanding whether there is any class imbalance in the dataset.

In [None]:
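# Pass the column explicitly as x; relying on positional arguments is deprecated in newer seaborn versions.
sb.countplot(x=origdf['machine_failure'])
print("Percent of Failure samples  = {:.1f}".format(len(origdf[origdf['machine_failure']==1])/len(origdf)*100))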

Class imbalance occurs when one class is much less prevalent than the others within the label. Even though this is a synthetic dataset, this is a common occurrence in predictive maintenance use cases, as machine failures are far less common than non-failures.

Usually this is a challenge for ML models, as the model may not see enough samples from the less common class to learn its patterns and generate accurate predictions.

Depending on the use case, a number of techniques such as up-sampling, down-sampling and SMOTE are available to deal with class imbalance. The finer details and implementation of these are beyond the scope of this notebook, but in real-world use cases class imbalance should be addressed to build useful models.
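
Although a full treatment is out of scope, the simplest of these techniques, random up-sampling, can be sketched in a few lines. The example below is an illustration under stated assumptions, not part of this notebook's pipeline: the `toy` frame and its values are hypothetical, and minority rows are simply duplicated by sampling with replacement.

```python
import pandas as pd

# Hypothetical toy data: 8 non-failures and 2 failures.
toy = pd.DataFrame({
    "torque": [40.0, 42.5, 38.1, 45.0, 39.9, 41.2, 43.3, 37.8, 65.0, 68.2],
    "machine_failure": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = toy[toy["machine_failure"] == 0]
minority = toy[toy["machine_failure"] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=1234)

# Recombine the two classes and shuffle the rows.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=1234)
    .reset_index(drop=True)
)
class_counts = balanced["machine_failure"].value_counts().to_dict()
print(class_counts)
```

Resampling of this kind is normally applied to the training split only, so that duplicated rows do not leak into the test set.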

#### Feature Correlations

Now let's explore correlations between our features and the label within the dataset. This helps in identifying features that are important for generating predictions and therefore should be retained in the training dataset.

We start by plotting a heatmap.

In [None]:
corr = origdf.corr()

mplt.figure(figsize= (15, 15))
sb.heatmap(np.abs(corr))
mplt.show()

There does seem to be a certain degree of correlation between the failure modes TWF, HDF, PWF, OSF and system readings such as temperature, torque and rotational speed. This is expected, as the modes are derived from the underlying readings.

Next let's try to identify what features correlate strongly with the failure label.

In [None]:
index_list = list(corr['machine_failure'].dropna().index)
val_list = np.argsort(np.abs(corr['machine_failure'].dropna().values))[::-1]

We get an ordered list of the features most correlated with the failure label and remove the label itself.

In [None]:
# Skip position 0: that is the label itself, whose correlation with itself is 1.0.
top_corrs = [index_list[x] for x in val_list[1:15]]
reduceddat = origdf[top_corrs]
reduceddat.head()
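
The same ranking can also be written with pandas alone. The sketch below is an alternative formulation on hypothetical toy data, not the notebook's own code: it takes the label's column of the correlation matrix, drops the label's entry, and sorts by absolute value.

```python
import pandas as pd

# Hypothetical toy data; values are made up to give a clear ordering.
toy = pd.DataFrame({
    "machine_failure":  [0, 0, 0, 1, 1, 1],
    "torque":           [1, 2, 3, 4, 5, 6],
    "air_temperature":  [3, 1, 2, 3, 1, 2],
    "rotational_speed": [5, 5, 5, 1, 1, 2],
})

corr_to_label = toy.corr()["machine_failure"]

# Drop the label's self-correlation, then rank by absolute correlation.
ranked = corr_to_label.drop("machine_failure").abs().sort_values(ascending=False)
print(ranked)
```

Taking the absolute value keeps strongly negative correlations near the top of the list, which matches the use of `np.abs` above.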

The failure mode features TWF, HDF, PWF and OSF have a stronger correlation with the failure label. However, for our use case, since the system readings will be streamed, we assume that TWF, HDF, PWF and OSF are not available during inference. Additionally, we want the model to be able to predict failures from the underlying system readings alone. Therefore we remove these features from the dataset. We also remove RNF for similar reasons.

In [None]:
print("Dropping columns...")
processdf = origdf.drop(columns = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF'])
print("replace na with 0")
processdf=processdf.replace('na',0)
print("Changing data types to numeric...")
processdf = processdf.apply(pd.to_numeric)  
print("Shape of the processed dataset ={}".format(processdf.shape))
    


In [None]:
processdf.head(5)

Rearranging the columns so that the label is the first column.

In [None]:
finaldf = processdf[['machine_failure','air_temperature', 'process_temperature', 'rotational_speed', 'torque', 'tool_wear','type_enc']]

In [None]:
finaldf.head(5)

### Data Split & Export

Splitting the dataset into train and test sets and exporting locally to training and test folders.

In [None]:
X_train, X_test = train_test_split(finaldf, test_size=0.2, random_state=1234)
# header=False writes the rows without a column-name line.
X_train.to_csv('training_data/train.csv', index=False, header=False)
X_test.to_csv('test_data/test.csv', index=False, header=False)
finaldf.to_csv('fulldataset.csv', index=False, header=False)
print("Shape of Training data = {}".format(X_train.shape))
print("Shape of Test data = {}".format(X_test.shape))
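
Given the class imbalance noted earlier, one option worth considering is a stratified split, which keeps the failure rate the same in both parts. This is a sketch of that alternative on hypothetical toy data, not what the cell above does. It uses the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 failures among 100 rows.
toy = pd.DataFrame({
    "machine_failure": [1] * 10 + [0] * 90,
    "torque": [float(i) for i in range(100)],
})

# stratify= preserves the label proportions in the train and test parts.
train_part, test_part = train_test_split(
    toy, test_size=0.2, random_state=1234, stratify=toy["machine_failure"]
)

train_failures = int(train_part["machine_failure"].sum())
test_failures = int(test_part["machine_failure"].sum())
print(train_failures, test_failures)
```

With a purely random split, a small test set can end up with very few failure rows, which makes evaluation metrics for the minority class unstable.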