Title: | Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model |
Authors: | Ikawa, Shota; Kashino, Kunio |
Date Issued: | Oct-2019 |
Citation: | S. Ikawa and K. Kashino, "Neural Audio Captioning Based on Conditional Sequence-to-Sequence Model," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, Oct. 2019, pp. 99–103. |
Abstract: | We propose an audio captioning system that describes non-speech audio signals in the form of natural language. Unlike existing systems, this system can generate a sentence describing sounds, rather than an object label or onomatopoeia. This allows the description to include more information, such as how the sound is heard and how its tone or volume changes over time, and it can accommodate unknown sounds. A major problem in realizing this capability is that the validity of the description depends not only on the sound itself but also on the situation or context. To address this problem, a conditional sequence-to-sequence model is proposed. In this model, a parameter called "specificity" is introduced as a condition to control the amount of information contained in the output text and generate an appropriate description. Experiments show that the proposed model works effectively. |
First Page: | 99 |
Last Page: | 103 |
DOI: | https://doi.org/10.33682/7bay-bj41 |
Type: | Article |
Appears in Collections: | Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019) |
Files in This Item:
File | Size | Format
---|---|---
DCASE2019Workshop_Ikawa_82.pdf | 769.12 kB | Adobe PDF
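
The abstract above describes conditioning a sequence-to-sequence captioner on a "specificity" scalar that controls how much information the generated sentence carries. Below is a minimal PyTorch sketch of that idea; the layer sizes, the BLSTM encoder, and the choice to append the scalar to every decoder input step are illustrative assumptions and may differ from the paper's exact architecture.

```python
# Minimal sketch of a conditional sequence-to-sequence audio captioner.
# The specificity scalar is appended to each decoder input so the decoder
# can modulate the detail level of the caption. Hyperparameters and the
# conditioning point are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class ConditionalCaptioner(nn.Module):
    def __init__(self, n_mels=64, vocab_size=1000, hidden=256, emb=128):
        super().__init__()
        # Encoder: BLSTM over a log-mel spectrogram sequence.
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True,
                               bidirectional=True)
        self.embed = nn.Embedding(vocab_size, emb)
        # +1 input dim: the specificity scalar joins each word embedding.
        self.decoder = nn.LSTM(emb + 1, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, audio, tokens, specificity):
        # audio: (B, T, n_mels), tokens: (B, L), specificity: (B,)
        B = audio.size(0)
        _, (h, c) = self.encoder(audio)          # h, c: (2, B, hidden)
        # Concatenate forward/backward final states to seed the decoder.
        h0 = h.permute(1, 0, 2).reshape(B, -1).unsqueeze(0)
        c0 = c.permute(1, 0, 2).reshape(B, -1).unsqueeze(0)
        cond = specificity.view(B, 1, 1).expand(-1, tokens.size(1), -1)
        dec_in = torch.cat([self.embed(tokens), cond], dim=-1)
        y, _ = self.decoder(dec_in, (h0, c0))
        return self.out(y)                       # (B, L, vocab_size)

# Usage: the same audio with a low vs. high specificity condition should
# steer the decoder toward a terse vs. detailed caption.
model = ConditionalCaptioner()
logits = model(torch.randn(2, 100, 64),          # 100 frames of log-mels
               torch.randint(0, 1000, (2, 12)),  # 12 caption tokens
               torch.tensor([0.2, 0.9]))         # low vs. high specificity
```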