# Sound anomaly detection
*Context*

## Introduction
---
Industrial companies have been collecting massive amounts of time series data about their operating processes, manufacturing production lines and industrial equipment. They sometimes store years of data in historian systems or in their factory information system at large. Whether they are looking to prevent equipment breakdowns that would stop a production line, avoid catastrophic failures in a power generation facility or improve their end product quality by adjusting their process parameters, processing time series data is a challenge that modern cloud technologies are up to. However, the cloud is not the whole story: your factory edge capability must allow you to stream the appropriate data to the cloud (bandwidth, connectivity, protocol compatibility, putting data in context...).

What if you had a frugal way to qualify your equipment health with little data? This would definitely help you build robust and easier-to-maintain edge-to-cloud blueprints. In this post, we focus on a tactical approach industrial companies can use to reduce the impact of machine breakdowns by making them less unpredictable.

Most of the time, machine failures are tackled either by reactive action (stop the line and repair...) or by costly preventive maintenance, where you have to build the proper replacement parts inventory and schedule regular maintenance activities. Skilled machine operators are the most valuable assets in such settings: years of experience allow them to develop a fine knowledge of how the machinery should operate. They become expert listeners and can detect unusual behavior and sounds in rotating and moving machines. However, production lines are becoming more and more automated, and augmenting these machine operators with AI-generated insights is a way to maintain and develop the fine expertise needed to avoid reactive-only postures when dealing with machine breakdowns.

In this post we compare and contrast two different approaches to identify a malfunctioning machine, provided we have sound recordings of its operation: we start by building a neural network based on an autoencoder architecture, and we then use an image-based approach where we feed images of sound (namely spectrograms) to an image-based automated ML classification feature.

## Solution overview
---
In this example, we are going to use sounds recorded in an industrial environment to perform anomaly detection on industrial equipment.

To achieve this, we are going to explore and leverage the MIMII dataset for anomaly detection purposes: this is a sound dataset for **M**alfunctioning **I**ndustrial **M**achine **I**nvestigation and **I**nspection (MIMII). You can download it from **https://zenodo.org/record/3384388**: it contains sounds from several types of industrial machines (valves, pumps, fans and slide rails). In this example, we are going to focus on the **fans**. **[This paper](https://arxiv.org/abs/1909.09347)** describes the sound capture procedure.

We walk you through the following steps using Jupyter notebooks provided with this blog post:

1. The first one will focus on *data exploration* to get familiar with sound data: these data are particular time series data and exploring them requires specific approaches.
2. We will then use Amazon SageMaker to *build an* *autoencoder* that will be used as a classifier able to discriminate between normal and abnormal sounds.
3. Last, we are going to take on a more novel approach in the last part of this work: we are going to *transform the sound files into spectrogram images* and feed them directly to an *image classifier*. We will use Amazon Rekognition Custom Labels to perform this classification task and leverage Amazon SageMaker for the data preprocessing and to drive the Custom Labels training and evaluation process.

Both approaches require an equal amount of effort to complete: although the models obtained in the end are not comparable, this will give you an idea of how much of a kick start you may get when using an applied AI service.

## Introducing the machine sound dataset
---
You can follow this data exploration work with the first companion notebook from **[this repository](https://github.com/michaelhoarau/sound-anomaly-detection)**. Each recording contains 8 channels, one for each microphone that was used to record a given machine sound. In this experiment, we will only focus on the recordings of the first microphone. The first thing we can do is to plot the waveforms of a normal and an abnormal signal next to each other:

![Waveforms](pictures/waveforms.png)

Each signal is 10 seconds long. Apart from the larger amplitude of the abnormal signal and some patterns that are more irregular, it’s difficult to distinguish between these two signals. In the companion notebook, you will also be able to listen to some of the sounds: most of the time, the differences are small, especially if you put them in the context of a very noisy environment.

A first approach could be to leverage the **[Fourier transform](https://en.wikipedia.org/wiki/Fourier_transform)**, which is a mathematical operator that decomposes a function of time (or a signal) into its underlying frequencies. The Fourier transform is a function of frequency, and its amplitude represents how much of a given frequency is present in the original signal. However, sound signals are highly non-stationary (i.e. their statistics change over time): the frequency decomposition of a given time period will differ from that of another time period. As a consequence, it would be rather meaningless to compute a single Fourier transform over an entire signal (however short it is in our case). We need to call the short-time Fourier transform (STFT) for help: the STFT is obtained by computing the Fourier transform of successive frames of a signal.
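
To make the frame-by-frame idea concrete, here is a minimal NumPy sketch that computes the amplitude spectrum of a single frame. The `frame_spectrum` helper, the Hann window and the synthetic 1 kHz tone are our own illustrative choices (they do not come from the companion notebooks), and the 16 kHz sampling rate is an assumption: at that rate, a frame of 1024 samples lasts 64 ms.

```python
import numpy as np

def frame_spectrum(signal, sr, n_fft=1024, start=0):
    """Amplitude spectrum of a single frame of n_fft samples (hypothetical helper)."""
    frame = signal[start:start + n_fft]
    # A Hann window limits the spectral leakage caused by cutting the signal abruptly:
    window = np.hanning(len(frame))
    amplitudes = np.abs(np.fft.rfft(frame * window))
    frequencies = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return frequencies, amplitudes

# Synthetic signal for illustration only: one second of a 1 kHz tone
sr = 16000  # assumed sampling rate
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

frequencies, amplitudes = frame_spectrum(tone, sr)
peak_frequency = frequencies[np.argmax(amplitudes)]
print(f'Dominant frequency in the first frame: {peak_frequency:.1f} Hz')
```

Repeating this computation on successive (usually overlapping) frames gives the STFT.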

If we plot the amplitude of each frequency present in the first 64 ms of the first signal of both the normal and abnormal dataset, we obtain the following plot:

![Short Fourier Transform](pictures/stft.png)

We now have a tool to discretize our time signals into the frequency domain, which brings us one step closer to being able to visualize them in this domain. For each signal we will now:

1. Slice the signal into successive time frames
2. Compute the Fourier transform of each time frame (together, these transforms form the STFT)
3. Extract the amplitude of each frequency as a function of time
4. Most sounds we can hear as humans are concentrated in a very small range (**both** in frequency and amplitude). The next step is then to take a log scale for both the frequency and the amplitude: for the amplitude, we obtain this by converting the color axis to decibels (which is the equivalent of applying a log scale to the sound amplitudes)
5. Plot the result on a spectrogram. A spectrogram has three dimensions: we keep time on the horizontal axis, put frequency on the vertical axis and map the amplitude to a color axis (in dB).
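
The first four steps above can be sketched with NumPy alone. This is a simplified illustration rather than the notebook code: the `log_spectrogram` helper, its default parameters and the synthetic test tone are assumptions, the frequency axis is kept linear (a log or Mel frequency scale would be applied on top of this) and the plotting step is left out.

```python
import numpy as np

def log_spectrogram(signal, n_fft=1024, hop_length=512, amin=1e-10):
    """Amplitude spectrogram in dB, relative to its maximum (hypothetical helper)."""
    # 1. Slice the signal into successive, overlapping frames:
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length:i * hop_length + n_fft] for i in range(n_frames)
    ])

    # 2. Fourier transform of each (windowed) frame:
    spectrum = np.fft.rfft(frames * np.hanning(n_fft), axis=1)

    # 3. Amplitude of each frequency as a function of time,
    #    with shape (frequency bins, time frames):
    amplitude = np.abs(spectrum).T

    # 4. Log scale for the amplitude: convert to decibels, 0 dB being the maximum
    reference = max(amplitude.max(), amin)
    return 20 * np.log10(np.maximum(amplitude, amin) / reference)

# Synthetic signal for illustration only: one second of a 440 Hz tone
sr = 16000  # assumed sampling rate
t = np.arange(sr) / sr
db_spectrogram = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(db_spectrogram.shape)
```

Each column of the resulting array is one time frame and each row one frequency bin, which is the layout a spectrogram plot expects.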

The picture below shows the frequency representation of the signals plotted earlier:

![Spectrograms](pictures/spectrograms.png)

We can now see that these images have interesting features that we can easily uncover with our naked eyes: this is exactly the kind of features that a neural network can try to uncover and structure. We will now build two types of feature extractors based on this analysis and feed them to different types of architectures.

## Building a custom autoencoder architecture
---
The **[autoencoder architecture](https://en.wikipedia.org/wiki/Autoencoder)** is a neural network with the same number of neurons in the input and the output layers. This kind of architecture learns to generate the “identity” transformation between inputs and outputs. The second notebook of our series will go through these different steps:

1. Build the dataset: to feed the spectrogram to an autoencoder, we will build a tabular dataset and upload it to Amazon S3.
2. Create a TensorFlow autoencoder model, train it in script mode by using the TensorFlow / Keras existing container
3. Evaluate the model to obtain a confusion matrix highlighting the classification performance between normal and abnormal sounds.

### Build a dataset
We are using the **[librosa library](https://librosa.org/doc/latest/index.html)**, which is a Python package for audio analysis. A feature extraction function, based on the steps described earlier to generate the spectrogram, is central to the dataset generation process.

```python
import librosa
import numpy as np

def extract_signal_features(signal, sr, n_mels=64, frames=5, n_fft=1024, hop_length=512):
    # Compute a spectrogram (using Mel scale):
    mel_spectrogram = librosa.feature.melspectrogram(
        y=signal,
        sr=sr,
        n_fft=n_fft,
        hop_length=hop_length,
        n_mels=n_mels
    )

    # Convert to decibel (log scale for amplitude):
    log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)

    # Number of feature vectors we can build for the current signal:
    features_vector_size = log_mel_spectrogram.shape[1] - frames + 1

    # Each feature vector concatenates `frames` consecutive spectrogram columns:
    dims = frames * n_mels

    # Build N sliding windows (=frames) and concatenate
    # them to build a feature vector:
    features = np.zeros((features_vector_size, dims), np.float32)
    for t in range(frames):
        features[:, n_mels*t:n_mels*(t+1)] = log_mel_spectrogram[:, t:t+features_vector_size].T

    return features
```
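
To see what the sliding-window step produces, the sketch below applies the same concatenation logic to a tiny made-up "spectrogram" with 3 frequency bands and 5 time steps. The toy values and sizes are purely illustrative; only the windowing logic mirrors the function above.

```python
import numpy as np

# Toy spectrogram for illustration: 3 frequency bands (rows) x 5 time steps (columns)
n_mels, frames, n_time_steps = 3, 2, 5
toy_spectrogram = np.arange(n_mels * n_time_steps, dtype=np.float32).reshape(n_mels, n_time_steps)

# Each feature vector concatenates `frames` consecutive columns:
dims = n_mels * frames
n_vectors = n_time_steps - frames + 1

features = np.zeros((n_vectors, dims), np.float32)
for t in range(frames):
    features[:, n_mels * t:n_mels * (t + 1)] = toy_spectrogram[:, t:t + n_vectors].T

print(features.shape)  # one row per window, n_mels * frames columns
print(features[0])     # first window: column 0 followed by column 1
```

With the default parameters of the function (64 Mel bands and 5 frames), each feature vector therefore has 320 values.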

Note that we will train our autoencoder only on the normal signals: our model will learn how to reconstruct these signals (“learning the identity transformation”). The main idea is to leverage this for classification later; when we feed this trained model with abnormal sounds, the reconstruction error will be a lot higher than when trying to reconstruct normal sounds. Using an error threshold, we will then be able to discriminate abnormal and normal sounds.
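
The thresholding idea fits in a few lines. This is a minimal sketch: the `classify_sounds` helper, the reconstruction errors and the threshold value are all made up for illustration.

```python
def classify_sounds(reconstruction_errors, threshold):
    """Flag a sound as abnormal when its reconstruction error exceeds the threshold."""
    return ['abnormal' if error > threshold else 'normal' for error in reconstruction_errors]

# Made-up reconstruction errors and threshold, for illustration only:
example_errors = [4.2, 5.1, 9.7, 12.3]
example_labels = classify_sounds(example_errors, threshold=7.0)
print(example_labels)
```

Choosing the threshold is a trade-off between false positives and false negatives, so it has to be tuned on labeled test data.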

### Create the autoencoder
To build our autoencoder, we use Keras and assemble a simple autoencoder architecture with 5 hidden layers (two encoding layers of 64 units, an 8-unit bottleneck and two decoding layers of 64 units):

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense

def autoencoder_model(input_dims):
    inputLayer = Input(shape=(input_dims,))
    h = Dense(64, activation="relu")(inputLayer)
    h = Dense(64, activation="relu")(h)
    h = Dense(8, activation="relu")(h)
    h = Dense(64, activation="relu")(h)
    h = Dense(64, activation="relu")(h)
    h = Dense(input_dims, activation=None)(h)

    return Model(inputs=inputLayer, outputs=h)
```
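
To get a feel for how small this network is, we can count its trainable parameters by hand: a dense layer has one weight per (input, unit) pair plus one bias per unit. The `count_dense_parameters` helper below is hypothetical, and the input size assumes the default feature extraction settings used earlier (64 Mel bands times 5 frames).

```python
def count_dense_parameters(input_dims, layer_sizes):
    """Number of weights and biases in a stack of fully connected layers."""
    total = 0
    previous_units = input_dims
    for units in layer_sizes:
        total += previous_units * units + units  # weights + biases
        previous_units = units
    return total

# Assumption: features built with n_mels=64 and frames=5
input_dims = 64 * 5

# Layer sizes of the autoencoder defined above (the last layer rebuilds the input):
layer_sizes = [64, 64, 8, 64, 64, input_dims]
n_parameters = count_dense_parameters(input_dims, layer_sizes)
print(f'{n_parameters} trainable parameters')
```

A model of this size trains quickly, which is consistent with the short training time mentioned in this post.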

We put this in a training script (model.py) and use the SageMaker TensorFlow estimator to configure our training job and launch the training:

```python
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name='sound-anomaly',
    entry_point='model.py',
    source_dir='./autoencoder/',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.2',
    py_version='py37',
    hyperparameters={
        'epochs': 30,
        'batch-size': 512,
        'learning-rate': 1e-3,
        'n_mels': n_mels,
        'frame': frames
    },
    debugger_hook_config=False
)

tf_estimator.fit({'training': training_input_path})
```

Training over 30 epochs will take a few minutes on a p3.2xlarge instance: at this stage, this will cost you a few cents. If you plan to use a similar approach on the whole MIMII dataset or use hyperparameter tuning, you can further reduce this training cost by using Managed Spot Training (check out **[this sample](https://github.com/aws-samples/amazon-sagemaker-managed-spot-training)** to see how you can leverage Managed Spot Training and get a 70% discount in the process).
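
As a rough sketch of what enabling Spot capacity involves, the SageMaker estimator accepts a few extra keyword arguments. The durations below are illustrative values we picked, not settings taken from the companion notebooks.

```python
# Extra keyword arguments for the SageMaker estimator (illustrative values):
spot_training_options = {
    'use_spot_instances': True,  # request Spot capacity for the training job
    'max_run': 3600,             # maximum training time, in seconds
    'max_wait': 7200,            # maximum total time (waiting for capacity + training), in seconds
}

# max_wait has to be at least as long as max_run:
print(spot_training_options['max_wait'] >= spot_training_options['max_run'])
```

These options can be unpacked into the estimator constructor, for example `TensorFlow(..., **spot_training_options)`. Since Spot instances can be interrupted, longer jobs usually also save checkpoints so that training can resume.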

### Evaluate the model
Let’s now deploy the autoencoder behind a SageMaker endpoint: this operation creates an endpoint that will continue to cost you as long as you leave it running. Do not forget to shut it down at the end of this experiment!

```python
import time

tf_endpoint_name = 'sound-anomaly-'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
tf_predictor = tf_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.large',
    endpoint_name=tf_endpoint_name
)
print(f'Endpoint name: {tf_predictor.endpoint_name}')
```

Our test dataset has an equal share of normal and abnormal sounds. We will loop through this dataset and send each test file to this endpoint. As our model is an autoencoder, we will evaluate how good the model is at reconstructing the input. The higher the reconstruction error, the greater the chance that we have identified an anomaly:

```python
y_true = test_labels
reconstruction_errors = []

for index, eval_filename in tqdm(enumerate(test_files), total=len(test_files)):
    # Load signal
    signal, sr = sound_tools.load_sound_file(eval_filename)

    # Extract features from this signal:
    eval_features = sound_tools.extract_signal_features(
        signal,
        sr,
        n_mels=n_mels,
        frames=frames,
        n_fft=n_fft,
        hop_length=hop_length
    )

    # Get predictions from our autoencoder:
    prediction = tf_predictor.predict(eval_features)['predictions']

    # Estimate the reconstruction error:
    mse = np.mean(np.mean(np.square(eval_features - prediction), axis=1))
    reconstruction_errors.append(mse)
```

In the plot below, we can see that the distribution of reconstruction error for normal and abnormal signals differs significantly. The overlap between these histograms means we have to compromise:

![Reconstruction Error Histograms](pictures/reconstruction_error_histograms.png)

Let's explore the recall-precision tradeoff for a reconstruction error threshold varying between 5.0 and 10.0 (this encompasses most of the overlap we can see above). First, let's visualize how this threshold range separates our signals on a scatter plot of all the testing samples:

![threshold_range_exploration](pictures/threshold_range_exploration.png)

If we plot the number of samples flagged as false positives and false negatives, we can see that the best compromise is a reconstruction error threshold set around 6.3 (assuming we are not looking to minimize either the false positives or the false negatives specifically):

![reconstruction_error_threshold](pictures/reconstruction_error_threshold.png)
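
A threshold exploration of this kind can be sketched as follows. The `count_errors` helper is hypothetical, and the reconstruction errors and labels are made-up values chosen only to keep the example small (0 stands for normal, 1 for abnormal).

```python
import numpy as np

def count_errors(reconstruction_errors, y_true, thresholds):
    """Count false positives and false negatives for each candidate threshold."""
    errors = np.asarray(reconstruction_errors)
    y_true = np.asarray(y_true)
    results = []
    for threshold in thresholds:
        # A sample is predicted abnormal (1) when its error exceeds the threshold:
        y_pred = (errors > threshold).astype(int)
        false_positives = int(np.sum((y_pred == 1) & (y_true == 0)))
        false_negatives = int(np.sum((y_pred == 0) & (y_true == 1)))
        results.append((float(threshold), false_positives, false_negatives))
    return results

# Made-up reconstruction errors: four normal sounds, then four abnormal ones
example_errors = [4.0, 5.5, 6.0, 7.5, 6.5, 8.0, 9.0, 5.8]
example_labels = [0, 0, 0, 0, 1, 1, 1, 1]

results = count_errors(example_errors, example_labels, thresholds=[5.0, 6.3, 10.0])

# Best compromise: the threshold with the fewest misclassified samples overall
best = min(results, key=lambda result: result[1] + result[2])
print(results)
print(f'Best threshold: {best[0]}')
```

If one kind of error is more costly than the other in your context, the selection criterion on the last lines can weight false positives and false negatives differently.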

For this threshold (6.3), we obtain the confusion matrix below:

![confusion_matrix](pictures/confusion_matrix_autoencoder.png)

The metrics associated with this matrix are the following:

* Precision: 92.1%
* Recall: 92.1%
* Accuracy: 88.5%
* F1 Score: 92.1%
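
For reference, here is how these four metrics derive from the cells of a confusion matrix, with abnormal sounds taken as the positive class. The counts used below are hypothetical round numbers, not the ones from the matrix above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1 score from confusion matrix counts."""
    precision = tp / (tp + fp)  # share of flagged sounds that are truly abnormal
    recall = tp / (tp + fn)     # share of abnormal sounds that were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1_score = 2 * precision * recall / (precision + recall)
    return {
        'precision': precision,
        'recall': recall,
        'accuracy': accuracy,
        'f1_score': f1_score
    }

# Hypothetical counts, for illustration only:
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
for name, value in metrics.items():
    print(f'{name}: {value:.1%}')
```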

### Cleanup
Let’s not forget to delete our endpoint with the **delete_endpoint()** API, so that it stops incurring costs.
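
A small wrapper can make this cleanup step safe to re-run, for instance when the endpoint has already been deleted. The `delete_endpoint_safely` helper is a hypothetical convenience function: it only assumes that the object it receives exposes a `delete_endpoint()` method, as SageMaker predictors do.

```python
def delete_endpoint_safely(predictor):
    """Delete the endpoint behind a predictor and report whether it worked."""
    try:
        predictor.delete_endpoint()
        return True
    except Exception as error:
        # For instance, the endpoint may already have been deleted:
        print(f'Could not delete the endpoint: {error}')
        return False
```

In this walkthrough, it would be called as `delete_endpoint_safely(tf_predictor)`.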

### Autoencoder improvement and further exploration

The spectrogram approach requires defining the spectrogram dimensions (e.g. the number of Mel cells defined in the data exploration notebook), which is a heuristic. In contrast, deep learning networks with a CNN encoder can learn the best representation to perform the task at hand (anomaly detection). Further steps to improve on this first result could be:

* Experimenting with more or less complex autoencoder architectures, training for a longer time, performing hyperparameter tuning with different optimizers, tuning the data preparation sequence (e.g. sound discretization parameters), etc.
* Leveraging high resolution spectrograms and feeding them to a CNN encoder to uncover the most appropriate representation of the sound.
* Using end-to-end model architectures with an encoder-decoder, which have been known to give good results on waveform datasets.
* Using deep learning models with multi-context temporal and channel (8 microphones) attention weights.
* Experimenting with time distributed 2D convolution layers to encode features across the 8 channels: these encoded features could then be fed as sequences across time steps to an LSTM or GRU layer. From there, multiplicative sequence attention weights can be learnt on the output sequence from the RNN layer.
* Exploring the appropriate image representation for multivariate time series signals that are not waveforms: replacing spectrograms with Markov transition fields, recurrence plots or network graphs could achieve the same goals for non-sound time-based signals.

## Using Amazon Rekognition Custom Labels
---
### Build a dataset
Previously, we had to train our autoencoder on normal signals only. In this case, we will build a more traditional split into training and testing datasets. Based on the fans sound database, this will yield:

* **4440 signals** for the training dataset, including:
  * 3260 normal signals
  * 1180 abnormal signals

* **1110 signals** for the testing dataset, including:
  * 815 normal signals
  * 295 abnormal signals

We will generate and store the spectrogram of each signal and upload them in either a train or test bucket.
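
The counts above are consistent with an 80/20 split applied separately to the normal and abnormal signals. The sketch below reproduces them under that assumption: the `split_train_test` helper, the random seed and the file names are hypothetical, and the totals per class are simply the sums of the figures listed above.

```python
import random

def split_train_test(files, test_ratio=0.2, seed=42):
    """Shuffle a list of files and split it into training and testing subsets."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical file names: 4075 normal and 1475 abnormal spectrograms
normal_files = [f'normal/{index:08d}.png' for index in range(4075)]
abnormal_files = [f'abnormal/{index:08d}.png' for index in range(1475)]

# Splitting each class separately keeps the same class balance in both datasets:
train_normal, test_normal = split_train_test(normal_files)
train_abnormal, test_abnormal = split_train_test(abnormal_files)

print(f'Training: {len(train_normal)} normal, {len(train_abnormal)} abnormal')
print(f'Testing: {len(test_normal)} normal, {len(test_abnormal)} abnormal')
```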

### Create a Rekognition Custom Labels project

The first step is to create a Custom Labels project:

```python
# Initialization, get a Rekognition client:
PROJECT_NAME = 'sound-anomaly-detection'
reko = boto3.client("rekognition")

# Let's try to create a Rekognition project:
try:
    project_arn = reko.create_project(ProjectName=PROJECT_NAME)['ProjectArn']

# If the project already exists, we get its ARN:
except reko.exceptions.ResourceInUseException:
    # List all the existing projects:
    print('Project already exists, collecting the ARN.')
    reko_project_list = reko.describe_projects()

    # Loop through all the Rekognition projects:
    for project in reko_project_list['ProjectDescriptions']:
        # Get the project name (the string after the first delimiter in the ARN)
        project_name = project['ProjectArn'].split('/')[1]

        # Once we find it, we store the ARN and break out of the loop:
        if project_name == PROJECT_NAME:
            project_arn = project['ProjectArn']
            break

print(project_arn)
```

We need to tell Amazon Rekognition where to find the training data, testing data and where to output its results:

```python
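# BUCKET and PREFIX are the S3 bucket name and key prefix where
# the spectrograms and their manifests were uploaded:
TrainingData = {
    'Assets': [{
        'GroundTruthManifest': {
            'S3Object': {
                'Bucket': BUCKET,
                'Name': f'{PREFIX}/manifests/train.manifest'
            }
        }
    }]
}

# Let Rekognition split the training data to build a test dataset:
TestingData = {
    'AutoCreate': True
}

OutputConfig = {
    'S3Bucket': BUCKET,
    'S3KeyPrefix': f'{PREFIX}/output'
}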
```
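
The training manifest referenced above is a JSON Lines file with one labeled image per line. The sketch below shows how such a line could be built for an image-level label. The field layout follows the SageMaker Ground Truth image classification format as we understand it, so double-check it against the Rekognition Custom Labels documentation before relying on it. The `manifest_line` helper, the label attribute naming, the placeholder creation date and the bucket and key used in the example are all hypothetical.

```python
import json

def manifest_line(bucket, key, class_name, dataset_name='sound-anomaly',
                  creation_date='2020-01-01T00:00:00.000'):
    """Build one JSON line describing an image and its image-level label."""
    label_attribute = f'{dataset_name}_{class_name}'  # hypothetical naming scheme
    return json.dumps({
        'source-ref': f's3://{bucket}/{key}',
        label_attribute: 1,
        f'{label_attribute}-metadata': {
            'class-name': class_name,
            'confidence': 1,
            'type': 'groundtruth/image-classification',
            'job-name': f'labeling-job/{label_attribute}',
            'human-annotated': 'yes',
            'creation-date': creation_date  # placeholder value
        }
    })

# Hypothetical bucket and key, for illustration only:
print(manifest_line('my-bucket', 'spectrograms/train/normal/00000000.png', 'normal'))
```

Writing one such line per spectrogram to a text file gives the manifest to upload next to the images.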

Now we can create a project version: this builds and trains a model within this Rekognition project with the data previously configured. Project version creation can fail if the bucket you selected cannot be accessed by Rekognition: make sure the right bucket policy is applied to your bucket (check the notebooks to see the recommended policy).

Creating the project version launches a new model training, and you then have to wait for the model to be trained. This should take around 1 hour (less than $1 from a cost perspective):

```python
version = 'experiment-1'
VERSION_NAME = f'{PROJECT_NAME}.{version}'

# Let's try to create a new project version in the current project:
try:
    project_version_arn = reko.create_project_version(
        ProjectArn=project_arn,      # Project ARN
        VersionName=VERSION_NAME,    # Name of this version
        OutputConfig=OutputConfig,   # S3 location for the output artefact
        TrainingData=TrainingData,   # S3 location of the manifest describing the training data
        TestingData=TestingData      # Test dataset configuration (auto-created from the training data)
    )['ProjectVersionArn']

# If a project version with this name already exists, we get its ARN:
except reko.exceptions.ResourceInUseException:
    # List all the project versions (=models) for this project:
    print('Project version already exists, collecting the ARN:', end=' ')
    reko_project_versions_list = reko.describe_project_versions(ProjectArn=project_arn)

    # Loop through them:
    for project_version in reko_project_versions_list['ProjectVersionDescriptions']:
        # Get the project version name (the string after the third delimiter in the ARN)
        project_version_name = project_version['ProjectVersionArn'].split('/')[3]

        # Once we find it, we store the ARN and break out of the loop:
        if project_version_name == VERSION_NAME:
            project_version_arn = project_version['ProjectVersionArn']
            break

print(project_version_arn)

# Get the current status of this project version (e.g. training in progress):
status = reko.describe_project_versions(
    ProjectArn=project_arn,
    VersionNames=[project_version_arn.split('/')[3]]
)['ProjectVersionDescriptions'][0]['Status']
```
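
Since training takes a while, a small polling loop can wait until the status leaves the `TRAINING_IN_PROGRESS` state. The `wait_for_training` helper below is a hypothetical sketch: it receives a function returning the current status, which keeps it independent from the Rekognition client and lets us simulate a training run here with a made-up sequence of statuses.

```python
import time

def wait_for_training(get_status, poll_seconds=60, sleep=time.sleep):
    """Poll a status function until the training is no longer in progress."""
    status = get_status()
    while status == 'TRAINING_IN_PROGRESS':
        sleep(poll_seconds)
        status = get_status()
    return status

# Simulated run with made-up statuses (no call to AWS, no actual waiting):
simulated_statuses = iter(['TRAINING_IN_PROGRESS', 'TRAINING_IN_PROGRESS', 'TRAINING_COMPLETED'])
final_status = wait_for_training(
    lambda: next(simulated_statuses),
    poll_seconds=0,
    sleep=lambda seconds: None
)
print(final_status)
```

In practice, `get_status` would wrap the `describe_project_versions` call shown above and return the `Status` field of the first project version description.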

### Evaluate the model

First, we will deploy our model by using the ARN collected before: again, this will deploy an endpoint that will cost you around $4 per hour. Don’t forget to decommission it once you’re done!

```python
# Reuse the Rekognition client, model ARN and version name from the previous steps:
client = reko
model_arn = project_version_arn
version_name = VERSION_NAME
min_inference_units = 1  # Assumption: a single inference unit is enough for this experiment

# Start the model
print('Starting model: ' + model_arn)
response = client.start_project_version(ProjectVersionArn=model_arn, MinInferenceUnits=min_inference_units)

# Wait for the model to be in the running state:
project_version_running_waiter = client.get_waiter('project_version_running')
project_version_running_waiter.wait(ProjectArn=project_arn, VersionNames=[version_name])

# Get the running status
describe_response = client.describe_project_versions(ProjectArn=project_arn, VersionNames=[version_name])
for model in describe_response['ProjectVersionDescriptions']:
    print("Status: " + model['Status'])
    print("Message: " + model['StatusMessage'])
```

Once the model is running, you can start querying it for predictions: in the notebook, you will find a function *get_results()* that queries a given model with a list of pictures sitting in a given path. Running all the test samples takes a few minutes and costs less than $1 (for the 1,110 test samples):

```python
import boto3
import pandas as pd
import s3fs

def show_custom_labels(model, bucket, image, min_confidence):
    # Call DetectCustomLabels from the Rekognition API: this will give us the list
    # of labels detected for this picture and their associated confidence level:
    reko = boto3.client('rekognition')
    try:
        response = reko.detect_custom_labels(
            Image={'S3Object': {'Bucket': bucket, 'Name': image}},
            MinConfidence=min_confidence,
            ProjectVersionArn=model
        )

    except Exception as e:
        print(f'Exception encountered when processing {image}')
        print(e)
        # There is no response to read from: return an empty list of labels
        return []

    # Returns the list of custom labels for the image passed as an argument:
    return response['CustomLabels']

def get_results(project_version_arn, bucket, s3_path, label=None, verbose=True):
    """
    Sends a list of pictures located in an S3 path to
    the endpoint to get the associated predictions.
    """
    fs = s3fs.S3FileSystem()
    rows = []

    for file in fs.ls(path=s3_path, detail=True, refresh=True):
        if file['Size'] > 0:
            # Remove the bucket name to get the object key:
            image = '/'.join(file['Key'].split('/')[1:])
            if verbose:
                print('.', end='')

            # Confidence of each label for this image (missing if the call failed):
            labels = show_custom_labels(project_version_arn, bucket, image, 0.0)
            data = {L['Name']: L['Confidence'] for L in labels}

            rows.append({
                'image': file['Key'].split('/')[-1],
                'abnormal': data.get('abnormal'),
                'normal': data.get('normal'),
                'ground truth': label
            })

    # DataFrame.append was removed in pandas 2.0: build the DataFrame in one go
    return pd.DataFrame(rows, columns=['image', 'normal', 'abnormal', 'ground truth'])

# The original call goes through an `rt` module; the helpers are defined above, so we call them directly:
predictions_ok = get_results(project_version_arn, BUCKET, s3_path=f'{BUCKET}/{PREFIX}/test/normal', label='normal', verbose=True)
predictions_ko = get_results(project_version_arn, BUCKET, s3_path=f'{BUCKET}/{PREFIX}/test/abnormal', label='abnormal', verbose=True)
```

Let’s plot the confusion matrix associated with this test set:

![confusion_matrix_rekognition](pictures/confusion_matrix_rekognition.png)
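
The cells of such a confusion matrix can be derived from the confidence levels returned for each image: the predicted label is the one with the highest confidence. The sketch below illustrates this with plain Python. The `confusion_counts` helper is hypothetical and the three rows are made-up values that only mimic the columns of the predictions table.

```python
from collections import Counter

def confusion_counts(rows):
    """Count (ground truth, predicted label) pairs from per-image confidences."""
    counts = Counter()
    for row in rows:
        # The predicted label is the one with the highest confidence:
        predicted = 'abnormal' if row['abnormal'] > row['normal'] else 'normal'
        counts[(row['ground truth'], predicted)] += 1
    return counts

# Made-up predictions, for illustration only (confidences are percentages):
example_rows = [
    {'image': 'a.png', 'normal': 98.0, 'abnormal': 2.0, 'ground truth': 'normal'},
    {'image': 'b.png', 'normal': 10.0, 'abnormal': 90.0, 'ground truth': 'abnormal'},
    {'image': 'c.png', 'normal': 60.0, 'abnormal': 40.0, 'ground truth': 'abnormal'},
]

counts = confusion_counts(example_rows)
for (truth, predicted), count in sorted(counts.items()):
    print(f'ground truth={truth}, predicted={predicted}: {count}')
```

With a pandas DataFrame of predictions, `df.to_dict('records')` yields rows in this shape.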

The metrics associated with this matrix are the following:

* Precision: 100.0%
* Recall: 99.8%
* Accuracy: 99.8%
* F1 Score: 99.9%

With very little effort (and no ML knowledge!), we get impressive results. With so few false positives and false negatives, we can leverage such a model in even the most challenging industrial contexts.

### Cleanup

We need to stop the running model as we will continue to incur costs while the endpoint is live:

```python
print('Stopping model: ' + model_arn)

# Stop the model:
try:
    reko = boto3.client('rekognition')
    response = reko.stop_project_version(ProjectVersionArn=model_arn)
    status = response['Status']
    print('Status: ' + status)

except Exception as e:
    print(e)

print('Done.')
```