Repeated Labeling Using Multiple Noisy Labelers

Ipeirotis, Panagiotis G.; Provost, Foster; Sheng, Victor; Wang, Jing

Full metadata record

DC Field	Value	Language
dc.contributor.author	Ipeirotis, Panagiotis G.	-
dc.contributor.author	Provost, Foster	-
dc.contributor.author	Sheng, Victor	-
dc.contributor.author	Wang, Jing	-
dc.date.accessioned	2010-09-10T00:18:00Z	-
dc.date.available	2010-09-10T00:18:00Z	-
dc.date.issued	2010-09-10T00:18:00Z	-
dc.identifier.uri	http://hdl.handle.net/2451/29799	-
dc.description.abstract	This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give a considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire. For certain label-quality/cost regimes, the benefit is substantial.	en
dc.description.sponsorship	This work was supported by the National Science Foundation under Grant No. IIS-0643846, by an NSERC Postdoctoral Fellowship, and by an NEC Faculty Fellowship.	en
dc.language.iso	en_US	en
dc.relation.ispartofseries	CeDER-10-03	-
dc.subject	active learning	en
dc.subject	data selection	en
dc.subject	data preprocessing	en
dc.subject	classification	en
dc.subject	crowdsourcing	en
dc.subject	mechanical turk	en
dc.subject	noisy data	en
dc.title	Repeated Labeling Using Multiple Noisy Labelers	en
dc.type	Working Paper	en
dc.authorid-ssrn	586795	-
dc.authorid-ssrn	691208	-
dc.authorid-ssrn	1131964	-
Appears in Collections:	CeDER Working Papers

Files in This Item:

File	Description	Size	Format
CeDER-10-03.pdf		633.44 kB	Adobe PDF