Handling Missing Values when Applying Classification Models

Saar-Tsechansky, Maytal; Provost, Foster

Full metadata record

DC Field	Value	Language
dc.contributor.author	Saar-Tsechansky, Maytal	-
dc.contributor.author	Provost, Foster	-
dc.date.accessioned	2008-12-03T17:56:26Z	-
dc.date.available	2008-12-03T17:56:26Z	-
dc.date.issued	2007-07	-
dc.identifier.citation	Journal of Machine Learning Research 8(July):1625-1657	en
dc.identifier.uri	http://hdl.handle.net/2451/27813	-
dc.description.abstract	Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods (predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models) for applying classification trees to instances with missing values, and also shows evidence that the results generalize to bagged trees and to logistic regression. The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.	en
dc.description.sponsorship	NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research	en
dc.format.extent	389647 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	en_US	en
dc.publisher	Journal of Machine Learning Research	en
dc.relation.ispartofseries	CeDER-PP-2007-06	en
dc.subject	missing data	en
dc.subject	classification	en
dc.subject	classification trees	en
dc.subject	imputation	en
dc.title	Handling Missing Values when Applying Classification Models	en
dc.type	Article	en
Appears in Collections:	CeDER Published Papers
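
Illustrative note: the abstract contrasts imputation with reduced models at prediction time. The sketch below is only a minimal illustration of that contrast and is not the authors' implementation. It assumes a toy nearest-centroid classifier in place of the classification trees studied in the paper, and plain mean imputation as a simple stand-in for predictive value imputation. All function names and the data are hypothetical.

```python
from statistics import mean


def fit_centroids(rows, labels, features):
    """Fit a toy nearest-centroid classifier using only the given feature indices."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(r[f] for r in members) for f in features]
    return centroids


def predict(centroids, values):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((v - c) ** 2 for v, c in zip(values, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


def predict_reduced(rows, labels, instance):
    """Reduced-model idea: use a model trained only on the features
    that are actually observed in the instance (None marks a missing value)."""
    observed = [i for i, v in enumerate(instance) if v is not None]
    model = fit_centroids(rows, labels, observed)
    return predict(model, [instance[i] for i in observed])


def predict_mean_imputed(rows, labels, instance):
    """Imputation idea (simplified): fill each missing value with the
    training mean of that feature, then apply the full model."""
    filled = [
        mean(r[i] for r in rows) if v is None else v
        for i, v in enumerate(instance)
    ]
    model = fit_centroids(rows, labels, list(range(len(instance))))
    return predict(model, filled)


# Hypothetical training data: two numeric features, two classes.
rows = [[1.0, 10.0], [2.0, 12.0], [8.0, 11.0], [9.0, 13.0]]
labels = ["a", "a", "b", "b"]

# Test instance whose second feature is missing at prediction time.
instance = [2.5, None]

reduced_label = predict_reduced(rows, labels, instance)
imputed_label = predict_mean_imputed(rows, labels, instance)
print(reduced_label, imputed_label)
```

In this sketch the reduced model is refit for each pattern of missing features, which hints at the computation and storage cost the abstract mentions; a practical variant might cache one model per pattern.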

Files in This Item:

File	Description	Size	Format
CPP-06-07.pdf		380.51 kB	Adobe PDF
