Classifying Non-speech Vocals: Deep vs Signal Processing Representations

Authors: Pishdadian, Fatemeh
Seetharaman, Prem
Kim, Bongjun
Pardo, Bryan
Date Issued: Oct-2019
Citation: F. Pishdadian, P. Seetharaman, B. Kim & B. Pardo, "Classifying Non-speech Vocals: Deep vs Signal Processing Representations", Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), pages 194–198, New York University, NY, USA, Oct. 2019
Abstract: Deep-learning-based audio processing algorithms have become very popular over the past decade. Due to promising results reported for deep-learning-based methods on many tasks, some now argue that signal processing audio representations (e.g. magnitude spectrograms) should be entirely discarded, in favor of learning representations from data using deep networks. In this paper, we compare the effectiveness of representations output by state-of-the-art deep nets trained for a task-specific problem, to off-the-shelf signal processing encoding. We address two tasks: query by vocal imitation and singing technique classification. For query by vocal imitation, experimental results showed deep representations were dominated by signal-processing representations. For singing technique classification, neither approach was clearly dominant. These results indicate it would be premature to abandon traditional signal processing in favor of exclusively using deep networks.
First Page: 194
Last Page: 198
Type: Article
Appears in Collections: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)

Files in This Item:
File: DCASE2019Workshop_Pishdadian_51.pdf
Size: 573.25 kB
Format: Adobe PDF