Multiple Class Sound Event Detection
Reynaldo Vazquez
June, 2019
GitHub Repo
This notebook builds an algorithm that automatically detects the occurrence of two classes of sound events and their onsets. A deep learning network that mixes Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers, and that uses transformations of the mel spectrogram as input, achieves this with an event-based error rate of 0.1 and an F1-score of 0.94.
Some sound events, such as glass breaks or gunshots, may signal an emergency or a security threat. Thus, monitoring for these sound events can be good practice from a safety or security perspective, and automated monitoring may be the most efficient way to do it.
This project is inspired by Task 2: Detection of rare sound events of the DCASE17 challenge. What differentiates this project from the DCASE17 challenge is the detection of more than one class of events within a single model. The classes of sound events to be detected are occurrences of glass breaks and gunshots.
Artificial sound mixes were created using background and event audio recordings from Tampere University's Detection and Classification of Acoustic Scenes and Events (DCASE) Community.
The created artificial audio dataset and meta data can be found here.
The source data can be found here.
The created artificial sound mixes are 10 seconds long and contain a background recording overlaid with 0 to 4 sound event occurrences (glass breaks and/or gunshots).
Following this procedure, two datasets were created separately: a training dataset and an evaluation dataset. These were created separately so that the backgrounds and event recordings used for the evaluation dataset were not used in any of the training audio mixes, thus keeping them unseen by the system during development. Meta-data was maintained documenting the source background, the overlaid source events, and the time onset and offset of each sound event.
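Conceptually, each mix overlays one or more event recordings on a background at randomly chosen onsets and records the insertion times in the metadata. Below is a minimal sketch of that idea, assuming the waveforms are already loaded as NumPy arrays at the same sample rate; the function and variable names are illustrative and not the actual generation code.

import numpy as np

def mix_track(background, events, sr=44100, track_dur=10.0, rng=np.random):
    """Overlay event waveforms on a background at random onsets (illustrative sketch)."""
    n_samples = int(sr * track_dur)
    mix = background[:n_samples].astype(float).copy()
    meta = []
    for label, event in events:                             # e.g. [('glassbreak', ev1), ('gunshot', ev2)]
        onset = rng.randint(0, n_samples - len(event))       # random insertion point, in samples
        mix[onset:onset + len(event)] += event               # simple additive mixing
        meta.append({'event_label': label,
                     'event_start_ms': 1000.0 * onset / sr,
                     'event_end_ms': 1000.0 * (onset + len(event)) / sr})
    return mix, meta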
Required libraries
import os, sys, io, h5py, IPython, dcase_util, sed_eval
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import librosa, librosa.display
from scipy.io import wavfile
from tabulate import tabulate
from io import BytesIO
from zipfile import ZipFile
from keras.models import Model
from keras.layers import Dense, Activation, Dropout, Input, TimeDistributed
from keras.layers import LSTM, Conv1D, Bidirectional, BatchNormalization, GRU
from keras.optimizers import Adam
from keras.models import load_model
%matplotlib inline
Parameter definition
track_dur = 10.0                 # track duration in seconds
track_dur_ms = track_dur*1000    # track duration in milliseconds
sample_rate = 44100              # audio sample rate in Hz
mel_power = 0.5                  # used for the transformation of the mel spectrogram features
lag = 0                          # lag used when creating the labels
Ty = 212
print('Ty =', str(Ty)+',', 'is the size of the output of the last LSTM model layer')
scene_label = 'background' # required by the event based metrics function
threshold = 0.3 # chosen for the binary classification
Required utilities
from sed2_utils import *
from local_paths import *
zip_train = ZipFile(path_to_mixes_train, 'r')
prefix_train = zip_train.namelist()[0]
meta_train_track = pd.read_csv(path_to_meta_track_info_train)
meta_train_segments = pd.read_csv(path_to_meta_clip_insertions_train)
zip_eval = ZipFile(path_to_mixes_eval, 'r')
prefix_eval = zip_eval.namelist()[0]
meta_eval_track = pd.read_csv(path_to_meta_track_info_eval)
meta_eval_segments = pd.read_csv(path_to_meta_clip_insertions_eval)
meta_train_segments.head(3)
The meta_[dataset]_segments tables contain the information about where in time each event occurs within a track. Event start and end times within the track are in milliseconds.
show_audio_info(zip_train, 10039, meta_train_segments, prefix_train)
show_audio_info(zip_train, 10000, meta_train_segments, prefix_train)
show_audio_info(zip_train, 10004, meta_train_segments, prefix_train)
show_audio_info(zip_train, 10022, meta_train_segments, prefix_train)
%%capture
contents = {}
contents['train'] = tracks_content_info(meta_train_track, meta_train_segments, 'training', 10, 'glassbreak')
contents['eval'] = tracks_content_info(meta_eval_track, meta_eval_segments, 'evaluation', 10, 'glassbreak')
x_names = ('background only', 'glassbreak', 'gunshot')
plt.figure(figsize=(16, 5))
plt.suptitle('\nAudio Content by Type')
distribution_plot(contents, x_names, 130, .16, 3, 'total time (minutes)')
These plots show the total amount of audio content of each type in the datasets, in minutes. Audio content is heavily unbalanced towards background only.
To detect the onset of an event, the system has to make predictions on segments of each audio track. The segment length chosen for the network is 47 milliseconds (i.e., 212 segments per track). With this in mind, and because of the unbalanced nature of the dataset, the event-based error rate, as defined in Mesaros, Heittola, and Virtanen (2016), is used to evaluate the performance of the system. The formula to calculate it is:

$ER = \dfrac{S + D + I}{N}$

where:

Substitutions (S) are events in the system output with a correct temporal position but an incorrect class label
Insertions (I) are events in the system output that are not in the ground truth and are not accounted for as substitutions
Deletions (D) are events in the ground truth that are not in the system output and are not accounted for as substitutions
N is the total number of events in the ground truth

As can be seen, with this metric true negative predictions have no use or meaning, i.e. the model is not 'rewarded' for correctly predicting the absence of an event.
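As a quick worked example of the formula, the rates reported in the conclusion of this notebook (about 1 percent substitutions, 6 percent deletions, and 3 percent insertions, all relative to the number of ground-truth events) give an error rate consistent with the 0.1 quoted above:

def event_based_error_rate(S, D, I, N):
    """Event-based error rate as in Mesaros et al. (2016): ER = (S + D + I) / N."""
    return (S + D + I) / N

# rates expressed as fractions of the N reference events
print(event_based_error_rate(S=0.01, D=0.06, I=0.03, N=1.0))   # 0.10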
Transform the audio files into the log of the square root of the mel spectrogram
path_in_zip = zip_train.namelist()[1]
rate, data = get_wav_info_from_zip(path_in_zip, zip_train, sample_rate)
S = log_mel_features(data, rate, mel_power)
Tx = S.shape[1]
n_freq = S.shape[0]
print('Tx =', Tx, 'number of timesteps input to the model', '\nn_freq =', n_freq)
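The feature extraction is done by the helper log_mel_features from sed2_utils. Below is a minimal sketch of the transformation named in the heading (mel spectrogram, square root via mel_power = 0.5, then log), assuming librosa defaults; the actual helper may differ in hop length, number of mel bands, or the epsilon used.

import librosa
import numpy as np

def log_mel_sketch(data, rate, mel_power=0.5, eps=1e-10):
    """Log of the mel spectrogram raised to mel_power (0.5 = square root)."""
    S = librosa.feature.melspectrogram(y=data.astype(float), sr=rate)
    return np.log(S ** mel_power + eps)   # shape: (n_mels, n_frames)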
X_train_mel, X_eval_mel = create_or_load_mel_features(zip_train, meta_train_track,
zip_eval, meta_eval_track, Tx, n_freq, sample_rate,
mel_power, prefix_train, prefix_eval)
print('shape of training features set:', X_train_mel.shape)
print('shape of evaluation features set:', X_eval_mel.shape)
Next, create the labels array. Each track is assigned a pair of vectors of length Ty = 212. Each element of each pair of vectors corresponds to a label $\in$ {0,1} for a ~47 millisecond window. A 1 in the first vector corresponds to the occurrence of a glass break, whereas a 1 in the second vector corresponds to an occurrence of a gunshot.
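The labels are produced by the helper create_labels. The core idea, sketched below under the assumption that it simply maps each event's start and end times in milliseconds to the corresponding output timesteps (the real helper also handles the lag argument and the metadata layout):

import numpy as np

def label_vector_sketch(event_times_ms, Ty=212, track_dur_ms=10000.0):
    """Return a (Ty, 1) binary vector with 1s over the timesteps covered by each event."""
    y = np.zeros((Ty, 1))
    for start_ms, end_ms in event_times_ms:                      # e.g. [(2500.0, 3100.0)]
        start = int(start_ms / track_dur_ms * Ty)                 # first output timestep of the event
        end = min(int(np.ceil(end_ms / track_dur_ms * Ty)), Ty)   # exclusive end timestep
        y[start:end] = 1
    return y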
train_labels_glassbreak = create_labels(meta_train_track, meta_train_segments, Ty, track_dur_ms,
1, lag, 'glassbreak')
train_labels_gunshot = create_labels(meta_train_track, meta_train_segments, Ty, track_dur_ms,
1, lag, 'gunshot')
eval_labels_glassbreak = create_labels(meta_eval_track, meta_eval_segments, Ty, track_dur_ms, 1,
lag, 'glassbreak')
eval_labels_gunshot = create_labels(meta_eval_track, meta_eval_segments, Ty, track_dur_ms, 1,
lag, 'gunshot')
Y_train = np.concatenate((train_labels_glassbreak, train_labels_gunshot), axis=2)
Y_train.shape
Y_eval = np.concatenate((eval_labels_glassbreak, eval_labels_gunshot), axis=2)
Y_eval.shape
The LSTM-based model consists of 4 layers: a 1-D temporal convolution, a bidirectional LSTM layer, a bidirectional GRU layer, and a time-distributed dense output layer, with batch normalization and dropout at a rate of 0.5 applied in all but the output layer.
drop_rate = 0.5
def model(input_shape):
    """
    Creates the model's graph in Keras.

    Argument:
    input_shape -- shape of the model's input data

    Returns:
    model -- Keras model instance
    """
    X_input = Input(shape = input_shape)

    X = Conv1D(256, kernel_size = 16, strides=4)(X_input)
    X = BatchNormalization()(X)
    X = Activation('relu')(X)
    X = Dropout(drop_rate)(X)

    X = Bidirectional(LSTM(units = 256, return_sequences = True))(X)
    X = Dropout(drop_rate)(X)
    X = BatchNormalization()(X)

    X = Bidirectional(GRU(units = 256, return_sequences = True))(X)
    X = Dropout(drop_rate)(X)
    X = BatchNormalization()(X)

    X = TimeDistributed(Dense(2, activation = "sigmoid"))(X)

    model = Model(inputs = X_input, outputs = X)
    return model
model = model(input_shape = (Tx, n_freq))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()
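A note on the output length: assuming librosa's default hop length of 512 samples, a 10-second track at 44100 Hz gives Tx = 862 spectrogram frames, and the Conv1D with kernel size 16 and stride 4 reduces this to the Ty = 212 output timesteps used for the labels.

# Output length of a 'valid' 1-D convolution: floor((Tx - kernel_size) / stride) + 1
Tx_expected = 1 + int(track_dur * sample_rate) // 512   # 862 frames, assuming hop length 512 with centered frames
Ty_expected = (Tx_expected - 16) // 4 + 1                # = 212
print(Tx_expected, Ty_expected)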
model.fit(X_train_mel, Y_train, batch_size = 64, epochs=50, verbose = 0)
raw_preds_eval_mel = model.predict(X_eval_mel)
preds_eval_mel = np.copy(raw_preds_eval_mel)
preds_eval_mel = np.where(preds_eval_mel >= threshold, 1, 0)
gb_pred = preds_eval_mel[:,:,0].reshape(500, 212, 1)
gs_pred = preds_eval_mel[:,:,1].reshape(500, 212, 1)
segments_dicts_ref_eval_glassbreak = get_segments_ref(meta_eval_track, meta_eval_segments, 'glassbreak')
segments_dicts_ref_eval_gunshot = get_segments_ref(meta_eval_track, meta_eval_segments, 'gunshot')
segments_dicts_est_eval_glassbreak = segments_dicts_est(gb_pred, meta_eval_track, event_label = 'glassbreak')
segments_dicts_est_eval_gunshot = segments_dicts_est(gs_pred, meta_eval_track, event_label = 'gunshot')
segments_dicts_ref_eval = segments_dicts_ref_eval_glassbreak + segments_dicts_ref_eval_gunshot
segments_dicts_est_eval = segments_dicts_est_eval_glassbreak + segments_dicts_est_eval_gunshot
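The helper segments_dicts_est converts the frame-wise binary predictions into event-level dictionaries (onset, offset, label, file) of the kind expected by sed_eval. Below is a minimal sketch of that conversion for a single track, assuming contiguous runs of 1s are merged into one event and timesteps are mapped back to seconds; the function name and dictionary keys are illustrative and may differ from the helper's exact output.

import numpy as np

def frames_to_events(pred, event_label, filename, Ty=212, track_dur=10.0):
    """Group consecutive positive frames into (onset, offset) event dictionaries."""
    events = []
    padded = np.concatenate(([0], pred.astype(int).ravel(), [0]))
    starts = np.where(np.diff(padded) == 1)[0]    # frame index where a run of 1s begins
    ends = np.where(np.diff(padded) == -1)[0]     # exclusive frame index where the run ends
    for s, e in zip(starts, ends):
        events.append({'event_label': event_label,
                       'event_onset': s * track_dur / Ty,
                       'event_offset': e * track_dur / Ty,
                       'file': filename})
    return events

# e.g. frames_to_events(gb_pred[0], 'glassbreak', 'mix_10000.wav')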
ebm_eval, f1_eval, er_eval = ebm_tables(segments_dicts_ref_eval,
segments_dicts_est_eval, t_col = 0.5, pct_len = 0.5, eval_offset = False)
1 - Glassbreak
ebm_eval, f1_eval, er_eval = ebm_tables(segments_dicts_ref_eval_glassbreak,
segments_dicts_est_eval_glassbreak, t_col = 0.5, pct_len = 0.5, eval_offset = False)
2 - Gunshot
ebm_eval, f1_eval, er_eval = ebm_tables(segments_dicts_ref_eval_gunshot,
segments_dicts_est_eval_gunshot, t_col = 0.5, pct_len = 0.5, eval_offset = False)
The two preceding reports indicate that the model is better able to detect the occurrences of glass breaks than gunshots. I believe this may be due, at least in part, to the difficulty of correctly labeling the onset and offset of gunshots in the dataset. Hence, I suspect the model is actually performing better at detecting gunshots than those metrics suggest.
This can be seen below in the prediction demonstrations, in which the predictions for gunshots match the actual audio more closely than the imputed 'ground truth', which at times can be off by as much as a second.
plot_pred_vs_true(meta_eval_track, Y_eval, raw_preds_eval_mel, 'evaluation',
zip_eval, prefix_eval, track_index = 2)
plot_pred_vs_true(meta_eval_track, Y_eval, raw_preds_eval_mel, 'evaluation',
zip_eval, prefix_eval, track_index = 3)
plot_pred_vs_true(meta_eval_track, Y_eval, raw_preds_eval_mel, 'evaluation',
zip_eval, prefix_eval, track_index = 6)
This project built and tested a model for the detection of two classes of sound events that may alert to the presence of an emergency. For this purpose, an artificial dataset of audio mixes containing occurrences of glass breaks and gunshots was created using source recordings from Tampere's DCASE community. The dataset consists of audio clips that contain 0 to 4 mixed event occurrences.
The model presented is a deep learning network with one temporal convolution, a bidirectional LSTM layer, a bidirectional GRU layer, and one time-distributed dense layer as output. All but the output layer apply batch normalization and dropout with a rate of 0.5.
The network correctly detects 93 percent of event occurrences and their onset, missing 6 percent, miscategorizing 1 percent, and falsely detecting an event at a 3 percent rate.
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, “Metrics for polyphonic sound event detection”, Applied Sciences, 6(6):162, 2016