Classifying Non-speech Vocals: Deep vs Signal Processing Representations
|Citation:||F. Pishdadian, P. Seetharaman, B. Kim & B. Pardo, "Classifying Non-speech Vocals: Deep vs Signal Processing Representations", Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), pages 194–198, New York University, NY, USA, Oct. 2019|
|Abstract:||Deep-learning-based audio processing algorithms have become very popular over the past decade. Due to promising results reported for deep-learning-based methods on many tasks, some now argue that signal processing audio representations (e.g. magnitude spectrograms) should be entirely discarded in favor of learning representations from data using deep networks. In this paper, we compare the effectiveness of representations output by state-of-the-art deep nets trained for a task-specific problem to off-the-shelf signal processing encodings. We address two tasks: query by vocal imitation and singing technique classification. For query by vocal imitation, experimental results showed deep representations were dominated by signal-processing representations. For singing technique classification, neither approach was clearly dominant. These results indicate it would be premature to abandon traditional signal processing in favor of exclusively using deep networks.|
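The abstract cites the magnitude spectrogram as a typical signal-processing representation. As a minimal sketch of what such an off-the-shelf encoding looks like (not the authors' exact pipeline; frame length, hop size, and window choice here are illustrative assumptions), a Hann-windowed short-time Fourier transform can be computed with NumPy alone:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=256):
    """Magnitude spectrogram via a Hann-windowed STFT (illustrative parameters)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping windowed frames
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 non-negative frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (bins, frames)

# Example: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (513, 59)
```

The energy concentrates in the bin nearest 440 Hz (bin 440 * 1024 / 16000 ≈ 28), which is the kind of hand-designed, interpretable structure the paper weighs against learned deep representations.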
|Appears in Collections:||Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)|
Files in This Item:
|DCASE2019Workshop_Pishdadian_51.pdf||573.25 kB||Adobe PDF|