Hierarchical Detection of Sound Events and their Localization Using Convolutional Neural Networks with Adaptive Thresholds

Chytas, Sotirios Panagiotis; Potamianos, Gerasimos

doi:https://doi.org/10.33682/c6q0-wv87

Title:	Hierarchical Detection of Sound Events and their Localization Using Convolutional Neural Networks with Adaptive Thresholds
Authors:	Chytas, Sotirios Panagiotis; Potamianos, Gerasimos
Date Issued:	Oct-2019
Citation:	S. Chytas & G. Potamianos, "Hierarchical Detection of Sound Events and their Localization Using Convolutional Neural Networks with Adaptive Thresholds", Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), pages 50–54, New York University, NY, USA, Oct. 2019
Abstract:	This paper details our approach to Task 3 of the DCASE’19 Challenge, namely sound event localization and detection (SELD). Our system is based on multi-channel convolutional neural networks (CNNs), combined with data augmentation and ensembling. Specifically, it follows a hierarchical approach that first determines adaptive thresholds for the multi-label sound event detection (SED) problem, based on a CNN operating on spectrograms over long-duration windows. It then exploits the derived thresholds in an ensemble of CNNs operating on raw waveforms over shorter-duration sliding windows to provide event segmentation and labeling. Finally, it employs event localization CNNs to yield direction-of-arrival (DOA) source estimates of the detected sound events. The system is developed and evaluated on the microphone-array set of Task 3. Compared to the baseline of the Challenge organizers, on the development set it achieves relative improvements of 12% in SED error, 2% in F-score, 36% in DOA error, and 3% in the combined SELD metric, but trails significantly in frame-recall, whereas on the evaluation set it achieves relative improvements of 3% in SED, 51% in DOA, and 4% in SELD errors. Overall, though, the system lags significantly behind the best Task 3 submission, achieving a combined SELD error of 0.2033 against 0.044 of the latter.
First Page:	50
Last Page:	54
DOI:	https://doi.org/10.33682/c6q0-wv87
Type:	Article
Appears in Collections:	Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
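
Note on the metrics quoted in the abstract: the sketch below shows how a combined SELD error and a relative improvement might be computed. It assumes the DCASE 2019 Task 3 convention, in which the combined score averages the SED error rate, one minus the F-score, the DOA error divided by 180 degrees, and one minus the frame recall. The function names and the input numbers are hypothetical and are not taken from the paper.

```python
def seld_score(er, f_score, doa_error_deg, frame_recall):
    """Combined SELD error (assumed DCASE 2019 Task 3 convention).

    Averages four terms, each scaled so that 0 is best and 1 is worst:
    SED error rate, 1 - F-score, DOA error / 180 degrees, 1 - frame recall.
    """
    return (er + (1.0 - f_score) + doa_error_deg / 180.0 + (1.0 - frame_recall)) / 4.0


def relative_improvement(baseline, system):
    """Relative reduction of an error-type metric with respect to a baseline."""
    return (baseline - system) / baseline


# Hypothetical values, for illustration only (not results from the paper).
example_score = seld_score(er=0.30, f_score=0.80, doa_error_deg=18.0, frame_recall=0.90)
example_gain = relative_improvement(baseline=0.25, system=0.20)
print(f"SELD score: {example_score:.3f}, relative improvement: {example_gain:.0%}")
```

For metrics where higher is better, such as the F-score or frame recall, the sign of the relative improvement would need to be flipped.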

Files in This Item:

File	Size	Format
DCASE2019Workshop_Chytas_24.pdf	779.62 kB	Adobe PDF
