Faculty Digital Archive

Archive@NYU >
Stern School of Business >
CeDER Published Papers >

Please use this identifier to cite or link to this item: http://hdl.handle.net/2451/27818

Title: Get Another Label? Improving Data Quality and Data Mining
Authors: Sheng, Victor
Provost, Foster
Ipeirotis, Panagiotis
Keywords: algorithms
design
experimentation
management
measurement
performance
Issue Date: 2008
Citation: Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008)
Series/Report no.: CeDER-PP-2008-02
Abstract: This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity and show several main results: (i) Repeated-labeling can improve label and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
URI: http://hdl.handle.net/2451/27818
Appears in Collections:CeDER Published Papers

Files in This Item:

File Description SizeFormat
CPP-02-08.pdf541.16 kBAdobe PDFView/Open

Items in Faculty Digital Archive are protected by copyright, with all rights reserved, unless otherwise indicated.

 

The contents of the FDA may be subject to copyright, be offered under a Creative Commons license, or be in the public domain.
Please check items for rights statements. For information about NYU’s copyright policy, see http://www.nyu.edu/footer/copyright-and-fair-use.html 
Valid XHTML 1.0 | CSS