Classifying Non-speech Vocals: Deep vs Signal Processing Representations

Pishdadian, Fatemeh; Seetharaman, Prem; Kim, Bongjun; Pardo, Bryan

doi:https://doi.org/10.33682/bczf-wv12

Title:	Classifying Non-speech Vocals: Deep vs Signal Processing Representations
Authors:	Pishdadian, Fatemeh; Seetharaman, Prem; Kim, Bongjun; Pardo, Bryan
Date Issued:	Oct-2019
Citation:	F. Pishdadian, P. Seetharaman, B. Kim & B. Pardo, "Classifying Non-speech Vocals: Deep vs Signal Processing Representations", Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), pages 194–198, New York University, NY, USA, Oct. 2019
Abstract:	Deep-learning-based audio processing algorithms have become very popular over the past decade. Due to promising results reported for deep-learning-based methods on many tasks, some now argue that signal processing audio representations (e.g. magnitude spectrograms) should be entirely discarded, in favor of learning representations from data using deep networks. In this paper, we compare the effectiveness of representations output by state-of-the-art deep nets trained for a task-specific problem, to off-the-shelf signal processing encoding. We address two tasks: query by vocal imitation and singing technique classification. For query by vocal imitation, experimental results showed deep representations were dominated by signal-processing representations. For singing technique classification, neither approach was clearly dominant. These results indicate it would be premature to abandon traditional signal processing in favor of exclusively using deep networks.
First Page:	194
Last Page:	198
DOI:	https://doi.org/10.33682/bczf-wv12
Type:	Article
Appears in Collections:	Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)

Files in This Item:

File	Size	Format
DCASE2019Workshop_Pishdadian_51.pdf	573.25 kB	Adobe PDF
