Handling Missing Values when Applying Classification Models

Authors: Saar-Tsechansky, Maytal; Provost, Foster
Keywords: missing data; classification; classification trees; imputation
Issue Date: Jul-2007
Publisher: Journal of Machine Learning Research
Citation: Journal of Machine Learning Research 8(July):1625-1657
Series/Report no.: CeDER-PP-2007-06
Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods (predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models) for applying classification trees to instances with missing values, and also shows evidence that the results generalize to bagged trees and to logistic regression. The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.
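
Illustration (not from the paper): the reduced-models idea in the abstract can be sketched as follows. For each test instance, a model is used that was trained only on the features actually observed in that instance, and models are cached per missing-value pattern, which is where the storage and computation cost comes from. This is a minimal sketch under stated assumptions: the paper works with classification trees, while the nearest-centroid classifier here is a stand-in chosen to keep the example dependency-free, and the names `train_centroids` and `predict_reduced` as well as the toy data are hypothetical.

```python
from statistics import mean


def train_centroids(X, y, features):
    """Fit a nearest-centroid classifier using only the given feature indices.

    Stand-in learner (assumption): the paper itself uses classification trees.
    """
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lbl in zip(X, y) if lbl == label]
        centroids[label] = [mean(r[f] for r in rows) for f in features]
    return centroids


def predict_reduced(X, y, instance, cache):
    """Classify `instance`, where None marks a missing value, with a reduced model."""
    # The pattern of observed features decides which reduced model is needed.
    observed = tuple(i for i, v in enumerate(instance) if v is not None)
    if observed not in cache:
        # Train one model per missing-value pattern and keep it for reuse.
        cache[observed] = train_centroids(X, y, observed)
    centroids = cache[observed]
    point = [instance[i] for i in observed]

    def sq_dist(centroid):
        return sum((p - q) ** 2 for p, q in zip(point, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy training data (hypothetical): two features, two classes.
X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
y = ["a", "a", "b", "b"]

cache = {}
# Feature 0 is missing, so the model is trained on feature 1 alone.
pred = predict_reduced(X, y, [None, 4.9], cache)
```

No value is imputed at any point: the missing feature is simply excluded from the model applied to that instance. The cache grows with the number of distinct missing-value patterns seen at prediction time, which is the cost that the hybrid approaches described in the abstract aim to balance.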
Appears in Collections: CeDER Published Papers

Files in This Item:
File: CPP-06-07.pdf
Size: 380.51 kB
Format: Adobe PDF

Items in FDA are protected by copyright, with all rights reserved, unless otherwise indicated.