Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/25882
| Title: | Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers |
| Authors: | Sheng, Victor; Provost, Foster; Ipeirotis, Panagiotis G. |
| Issue Date: | 6-Mar-2008 |
| Series/Report no.: | CeDER-08-01 |
| Abstract: | This paper addresses the repeated acquisition of labels for
data items when the labeling is imperfect. We
examine the improvement (or lack thereof) in data quality via repeated
labeling, and focus especially on the improvement of training labels
for supervised induction.
With the outsourcing of small tasks becoming
easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it
often is possible to obtain less-than-expert labeling at low cost.
With low-cost labeling, preparing the unlabeled part of the data can
become considerably more expensive than labeling. We present
repeated-labeling strategies of increasing complexity, and show
several main results. (i) Repeated labeling can improve label quality
and model quality, but not always. (ii) When labels are noisy,
repeated labeling can be preferable to single labeling even in the
traditional setting where labels are not particularly cheap. (iii) As
soon as processing the unlabeled data is not free, even
the simple strategy of labeling everything multiple times can give
considerable advantage. (iv) Repeatedly labeling a carefully chosen
set of points is generally preferable, and we present a robust
technique that combines different notions of uncertainty to select
data points for which quality should be improved. The bottom line: the
results show clearly that when labeling is not perfect, selective
acquisition of multiple labels is a strategy that data miners should
have in their repertoire; for certain label-quality/cost regimes, the
benefit is substantial. |
| URI: | http://hdl.handle.net/2451/25882 |
| Appears in Collections: | IOMS: Information Systems Working Papers; CeDER Working Papers |
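
Result (i) in the abstract, that repeated labeling can improve label quality but not always, can be illustrated with a small calculation. The sketch below is not taken from the paper. It assumes binary labels, independent labelers who are each correct with the same probability `p`, and integration of the repeated labels by simple majority vote; the function name `majority_vote_quality` is hypothetical.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that the majority of n labels is correct.

    Assumptions (for illustration only): binary labels, n odd, and each
    label independently correct with probability p.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number to avoid ties")
    # Sum the binomial probabilities of having more than n/2 correct labels.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Compare labelers below, at, and above chance for several label counts.
for p in (0.4, 0.5, 0.7):
    qualities = [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)]
    print(f"p={p}: {qualities}")
```

Under these assumptions, extra labels help only when individual labelers are better than chance (`p > 0.5`). At `p = 0.5` the majority label stays at chance, and below it, more labels make the integrated label worse.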
All items in Faculty Digital Archive are protected by copyright, with all rights reserved.