Title: | Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model |
Authors: | Ikawa, Shota; Kashino, Kunio |
Date Issued: | Oct-2019 |
Citation: | S. Ikawa and K. Kashino, "Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, Oct. 2019, pp. 99–103. |
Abstract: | We propose an audio captioning system that describes non-speech audio signals in the form of natural language. Unlike existing systems, this system can generate a sentence describing sounds, rather than an object label or onomatopoeia. This allows the description to include more information, such as how the sound is heard and how its tone or volume changes over time, and it can accommodate unknown sounds. A major problem in realizing this capability is that the validity of the description depends not only on the sound itself but also on the situation or context. To address this problem, a conditional sequence-to-sequence model is proposed. In this model, a parameter called "specificity" is introduced as a condition to control the amount of information contained in the output text and generate an appropriate description. Experiments show that the proposed model works effectively. |
First Page: | 99 |
Last Page: | 103 |
DOI: | https://doi.org/10.33682/7bay-bj41 |
Type: | Article |
Appears in Collections: | Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019) |
Files in This Item:
File | Size | Format
---|---|---
DCASE2019Workshop_Ikawa_82.pdf | 769.12 kB | Adobe PDF
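
The abstract above describes conditioning a sequence-to-sequence captioner on a "specificity" scalar that controls how much information the generated sentence carries. Below is a minimal PyTorch sketch of that idea; the layer sizes, the BLSTM encoder, and the choice to append the scalar to every decoder input step are illustrative assumptions and may differ from the paper's exact architecture.

```python
# Minimal sketch of a conditional sequence-to-sequence audio captioner.
# The specificity scalar is appended to each decoder input so the decoder
# can modulate the detail level of the caption. Hyperparameters and the
# conditioning point are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class ConditionalCaptioner(nn.Module):
    def __init__(self, n_mels=64, vocab_size=1000, hidden=256, emb=128):
        super().__init__()
        # Encoder: BLSTM over a log-mel spectrogram sequence.
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True,
                               bidirectional=True)
        self.embed = nn.Embedding(vocab_size, emb)
        # +1 input dim: the specificity scalar joins each word embedding.
        self.decoder = nn.LSTM(emb + 1, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, audio, tokens, specificity):
        # audio: (B, T, n_mels), tokens: (B, L), specificity: (B,)
        B = audio.size(0)
        _, (h, c) = self.encoder(audio)          # h, c: (2, B, hidden)
        # Concatenate forward/backward final states to seed the decoder.
        h0 = h.permute(1, 0, 2).reshape(B, -1).unsqueeze(0)
        c0 = c.permute(1, 0, 2).reshape(B, -1).unsqueeze(0)
        cond = specificity.view(B, 1, 1).expand(-1, tokens.size(1), -1)
        dec_in = torch.cat([self.embed(tokens), cond], dim=-1)
        y, _ = self.decoder(dec_in, (h0, c0))
        return self.out(y)                       # (B, L, vocab_size)

# Usage: the same audio with a low vs. high specificity condition should
# steer the decoder toward a terse vs. detailed caption.
model = ConditionalCaptioner()
logits = model(torch.randn(2, 100, 64),          # 100 frames of log-mels
               torch.randint(0, 1000, (2, 12)),  # 12 caption tokens
               torch.tensor([0.2, 0.9]))         # low vs. high specificity
```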