# Overview of the audio event detection STM32 model zoo

The STM32 model zoo includes several TensorFlow models for the audio event detection use case, pre-trained on custom and public datasets.
Under each model directory, you can find the `ST_pretrainedmodel_public_dataset` directory, which contains audio event detection models trained on various public datasets following the [training section](../scripts/training/README.md) of the STM32 model zoo.


<a name="ic_models"></a>
## Audio event detection (AED) Models

The table below summarizes the performance of the models, as well as their memory footprints generated using STM32Cube.AI (v7.3.0) for deployment purposes.


A note on clip-level accuracy: In a traditional AED data processing pipeline, audio is converted to a spectral representation (in this model zoo, log-mel spectrograms), which is then cut into patches. Each patch is fed to the inference network, which outputs a label vector for that patch. The label vectors of all patches belonging to the same clip are then aggregated into a single label vector for that clip. Accuracy is then computed on these aggregate label vectors.

This metric is used instead of patch-level accuracy for two reasons: patch-level accuracy varies widely depending on how the spectrogram is cut into patches, and clip-level accuracy is the metric most often reported in research papers.
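The pipeline described in this note can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the model zoo's implementation: the function names `cut_into_patches` and `clip_level_accuracy`, the `overlap` parameter, and the choice of averaging patch outputs before taking the argmax are all assumptions made for the example. The note above does not specify the aggregation rule, and other rules (such as majority voting) are possible.

```python
import numpy as np


def cut_into_patches(spectrogram, patch_length, overlap=0.0):
    """Cut a (n_mels, n_frames) spectrogram into (n_mels, patch_length) patches.

    `overlap` is the assumed fraction of frames shared by consecutive patches.
    Trailing frames that do not fill a whole patch are dropped.
    """
    n_frames = spectrogram.shape[1]
    if n_frames < patch_length:
        raise ValueError("spectrogram is shorter than one patch")
    # Hop between consecutive patch starts, at least one frame.
    hop = max(1, int(round(patch_length * (1.0 - overlap))))
    starts = range(0, n_frames - patch_length + 1, hop)
    return np.stack([spectrogram[:, s:s + patch_length] for s in starts])


def clip_level_accuracy(patch_outputs, patch_clip_ids, clip_labels):
    """Aggregate patch outputs per clip and compute accuracy over clips.

    patch_outputs:  (n_patches, n_classes) label vectors output by the network
    patch_clip_ids: (n_patches,) id of the clip each patch was cut from
    clip_labels:    dict mapping clip id -> true class index
    """
    patch_outputs = np.asarray(patch_outputs, dtype=float)
    patch_clip_ids = np.asarray(patch_clip_ids)
    correct = 0
    for clip_id, true_class in clip_labels.items():
        # Assumed aggregation rule: average the label vectors of the clip.
        clip_vector = patch_outputs[patch_clip_ids == clip_id].mean(axis=0)
        correct += int(np.argmax(clip_vector) == true_class)
    return correct / len(clip_labels)


# Toy example: two clips, five patches, two classes.
patch_outputs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
patch_clip_ids = [0, 0, 0, 1, 1]
clip_labels = {0: 0, 1: 0}
clip_accuracy = clip_level_accuracy(patch_outputs, patch_clip_ids, clip_labels)
```

In this toy example, only 2 of the 5 patches are classified correctly on their own, while 1 of the 2 clips is classified correctly after aggregation. Changing `overlap` in `cut_into_patches` changes how many patches each clip produces, which is one reason patch-level accuracy depends on the cutting scheme.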

By default, the results are provided for quantized Int8 models.


| Models | Input shape | Implementation | Dataset | Clip-level Accuracy (%) | MACCs (M) | Activation RAM (KiB) | Weights Flash (KiB) | Source |
|--------|-------------|----------------|---------|-------------------------|-----------|----------------------|---------------------|--------|
| Miniresnet 1 stack | 64x50 | TensorFlow | ESC-10 | 91.1 | 14.5 | 59.6 | 127.8 | [link](miniresnet/ST_pretrainedmodel_public_dataset/esc10/miniresnet_1stacks_64x50/miniresnet_1stacks_64x50_int8.tflite) |
| Miniresnet 2 stacks 64x50 | 64x50x1 | TensorFlow | ESC-10 | 93.6 | 26 | 59.6 | 451.8 | [link](miniresnet/ST_pretrainedmodel_public_dataset/esc10/miniresnet_2stacks_64x50/miniresnet_2stacks_64x50_int8.tflite) |
| Miniresnetv2 1 stack 64x50 | 64x50x1 | TensorFlow | ESC-10 | 91.1 | 15 | 59.6 | 124.0 | [link](miniresnetv2/ST_pretrainedmodel_public_dataset/esc10/miniresnetv2_1stacks_64x50/miniresnetv2_1stacks_64x50_int8.tflite) |
| Miniresnetv2 2 stacks 64x50 | 64x50x1 | TensorFlow | ESC-10 | 93.6 | 27 | 59.6 | 431.9 | [link](miniresnetv2/ST_pretrainedmodel_public_dataset/esc10/miniresnetv2_2stacks_64x50/miniresnetv2_2stacks_64x50_int8.tflite) |
| Yamnet 256 | 64x96x1 | TensorFlow | ESC-10 | 94.6 | 24 | 109.6 | 135.9 | [link](yamnet/ST_pretrainedmodel_public_dataset/esc_10/yamnet_256_64x96/yamnet_256_64x96_int8.tflite) |