Crowdsourcing a Dataset of Audio Captions

Lipping, Samuel; Drossos, Konstantinos; Virtanen, Tuomas

doi:https://doi.org/10.33682/sezz-vd31

Full metadata record

DC Field	Value	Language
dc.contributor.author	Lipping, Samuel
dc.contributor.author	Drossos, Konstantinos
dc.contributor.author	Virtanen, Tuomas
dc.date.accessioned	2019-10-24T01:50:18Z	-
dc.date.available	2019-10-24T01:50:18Z	-
dc.date.issued	2019-10
dc.identifier.citation	S. Lipping, K. Drossos & T. Virtanen, "Crowdsourcing a Dataset of Audio Captions", Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), pages 139–143, New York University, NY, USA, Oct. 2019	en
dc.identifier.uri	http://hdl.handle.net/2451/60745	-
dc.description.abstract	Audio captioning is a novel field of multi-modal translation: the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). Creating a dataset for this task requires a considerable amount of work, which makes crowdsourcing a very attractive option. In this paper we present a three-step framework for crowdsourcing an audio captioning dataset, based on concepts and practices followed in the creation of widely used image captioning and machine translation datasets. In the first step, initial captions are gathered. In the second step, a grammatically corrected and/or rephrased version of each initial caption is obtained. Finally, the initial and edited captions are rated, and the top-rated ones are kept for the produced dataset. We objectively evaluate the impact of our framework during the creation of an audio captioning dataset, in terms of the diversity of the obtained captions and the number of typographical errors in them. The results show that the resulting dataset has fewer typographical errors than the initial captions, and that on average each sound in the produced dataset has captions with a Jaccard similarity of 0.24, roughly equivalent to two ten-word captions having four words with the same root in common. This indicates that the captions are dissimilar while still containing some of the same information.	en
dc.rights	Distributed under the terms of the Creative Commons Attribution 4.0 International (CC-BY) license.	en
dc.title	Crowdsourcing a Dataset of Audio Captions	en
dc.type	Article	en
dc.identifier.DOI	https://doi.org/10.33682/sezz-vd31
dc.description.firstPage	139
dc.description.lastPage	143
Appears in Collections:	Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
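
The Jaccard figure quoted in the abstract can be sanity-checked with a short sketch. The code below is illustrative only and is not the authors' evaluation code: the function name `jaccard` and both example captions are hypothetical, and it compares exact lowercase tokens, whereas the abstract refers to words sharing the same root (which would need a stemming step). For two ten-word captions with four words in common, the set union has 10 + 10 - 4 = 16 words, so the similarity is 4/16 = 0.25, close to the reported 0.24.

```python
def jaccard(caption_a: str, caption_b: str) -> float:
    """Jaccard similarity between the word sets of two captions.

    Assumption: words are matched as exact lowercase tokens; no stemming
    is applied, unlike the root-based matching described in the abstract.
    """
    words_a = set(caption_a.lower().split())
    words_b = set(caption_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty captions: define the similarity as 0.0 to avoid 0/0.
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical ten-word captions that share exactly four words
# ("people", "talking", "a", "room").
caption_a = "people talking loudly inside a big room with heavy echo"
caption_b = "several people talking in a large room while music plays"

similarity = jaccard(caption_a, caption_b)
print(f"Jaccard similarity: {similarity:.2f}")
```

A lower value would mean the captions for a sound share fewer words, so the reported average suggests captions that differ in wording while overlapping on a few key terms.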

Files in This Item:

File	Size	Format
DCASE2019Workshop_Lipping_31.pdf	583.67 kB	Adobe PDF
