Classification in Networked Data: A Toolkit and a Univariate Case Study

Macskassy, Sofus; Provost, Foster

Full metadata record

DC Field	Value	Language
dc.contributor.author	Macskassy, Sofus	-
dc.contributor.author	Provost, Foster	-
dc.date.accessioned	2005-11-03T14:34:55Z	-
dc.date.available	2005-11-03T14:34:55Z	-
dc.date.issued	2004	-
dc.identifier.uri	http://hdl.handle.net/2451/14122	-
dc.description.abstract	This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research. Networked data are relational data where entities are interconnected, and this paper considers the common case where entities whose labels are to be estimated are linked to entities for which the label is known. NetKit is based on a three-component framework, comprising a local classifier, a relational classifier, and a collective inference procedure. Various existing relational learning algorithms can be instantiated with appropriate choices for these three components and new relational learning algorithms can be composed by new combinations of components. The case study demonstrates how the toolkit facilitates comparison of different learning methods (which so far has been lacking in machine learning research). It also shows how the modular framework allows analysis of subcomponents, to assess which, whether, and when particular components contribute to superior performance. The case study focuses on the simple but important special case of univariate network classification, for which the only information available is the structure of class linkage in the network (i.e., only links and some class labels are available). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate clearly that simple network-classification models perform remarkably well, well enough that they should be used regularly as baseline classifiers for studies of relational learning for networked data. The results also show that there are a small number of component combinations that excel, and that different components are preferable in different situations, for example when few versus many labels are known.	en
dc.format.extent	503812 bytes	-
dc.format.mimetype	application/pdf	-
dc.language	English	EN
dc.language.iso	en_US	-
dc.publisher	Stern School of Business, New York University	en
dc.relation.ispartofseries	CeDER-04-08	-
dc.subject	relational learning	en
dc.subject	network learning	en
dc.subject	collective inference	en
dc.subject	collective classification	en
dc.subject	networked data	en
dc.title	Classification in Networked Data: A Toolkit and a Univariate Case Study	en
dc.type	Working Paper	en
dc.description.series	Information Systems Working Papers Series	EN
Appears in Collections:	CeDER Working Papers; IOMS: Information Systems Working Papers

Files in This Item:

File	Description	Size	Format
CeDER-04-08-2.pdf		492 kB	Adobe PDF
