Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/14122
| Title: | Classification in Networked Data: A Toolkit and a Univariate Case Study |
| Authors: | Macskassy, Sofus; Provost, Foster |
| Keywords: | relational learning; network learning; collective inference; collective classification; networked data |
| Issue Date: | 2004 |
| Publisher: | Stern School of Business, New York University |
| Series/Report no.: | CeDER-04-08 |
| Abstract: | This paper presents NetKit, a modular toolkit for classification in
networked data, and a case study of its application to a collection of
networked data sets used in prior machine learning research. Networked
data are relational data where entities are interconnected, and this
paper considers the common case where entities whose labels are to be
estimated are linked to entities for which the label is known. NetKit is
based on a three-component framework, comprising a local classifier, a
relational classifier, and a collective inference procedure. Various
existing relational learning algorithms can be instantiated with
appropriate choices for these three components and new relational
learning algorithms can be composed by new combinations of components.
The case study demonstrates how the toolkit facilitates comparison of
different learning methods (which so far has been lacking in machine
learning research). It also shows how the modular framework allows
analysis of subcomponents, to assess which, whether, and when particular
components contribute to superior performance. The case study focuses on
the simple but important special case of univariate network
classification, for which the only information available is the
structure of class linkage in the network (i.e., only links and some
class labels are available). To our knowledge, no work previously has
evaluated systematically the power of class-linkage alone for
classification in machine learning benchmark data sets. The results
demonstrate clearly that simple network-classification models perform
remarkably well, well enough that they should be used regularly
as baseline classifiers for studies of relational learning for networked
data. The results also show that there are a small number of component
combinations that excel, and that different components are preferable in
different situations, for example when few versus many labels are known. |
| URI: | http://hdl.handle.net/2451/14122 |
| Appears in Collections: | CeDER Working Papers; IOMS: Information Systems Working Papers |
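
The three-component framework described in the abstract (a local classifier, a relational classifier, and a collective inference procedure) can be illustrated for the univariate case, where only links and some class labels are available. The sketch below is a minimal illustration under stated assumptions, not NetKit's code or API: every function name is hypothetical, the local classifier is assumed to be the class prior among known labels, the relational classifier is assumed to be a weighted vote over neighbors, and the collective inference step is a plain synchronous iterative update. The toy graph and its edge weights are invented for the example.

```python
from collections import defaultdict


def class_prior(known):
    """Local classifier (assumed): class frequencies among the known labels."""
    counts = defaultdict(float)
    for label in known.values():
        counts[label] += 1.0
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def weighted_vote(node, edges, beliefs, prior):
    """Relational classifier (assumed): weighted average of the neighbors'
    current class distributions. Falls back to the prior for isolated nodes."""
    scores = {c: 0.0 for c in prior}
    for nbr, weight in edges.get(node, {}).items():
        for c, p in beliefs[nbr].items():
            scores[c] += weight * p
    total = sum(scores.values())
    if total == 0.0:
        return dict(prior)
    return {c: s / total for c, s in scores.items()}


def collective_inference(edges, known, iterations=20):
    """Collective inference (assumed): repeatedly re-estimate every unknown
    node from its neighbors' estimates of the previous round."""
    prior = class_prior(known)
    beliefs = {}
    for node in edges:
        if node in known:
            # Known labels are fixed as point distributions.
            beliefs[node] = {c: 1.0 if c == known[node] else 0.0 for c in prior}
        else:
            # Unknown nodes start from the local classifier's estimate.
            beliefs[node] = dict(prior)
    for _ in range(iterations):
        updated = dict(beliefs)
        for node in edges:
            if node not in known:
                updated[node] = weighted_vote(node, edges, beliefs, prior)
        beliefs = updated
    return beliefs


def predict(beliefs):
    """Pick the most probable class for each node."""
    return {node: max(dist, key=dist.get) for node, dist in beliefs.items()}


# Hypothetical toy network: two triangles joined by one weak link (c-d).
# Adjacency is symmetric and every node appears as a key.
edges = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.1},
    "d": {"c": 0.1, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
known = {"a": "x", "f": "y"}  # only two labels are known

beliefs = collective_inference(edges, known)
labels = predict(beliefs)
```

Because each component is a separate function, swapping in a different relational classifier or inference procedure leaves the rest of the sketch untouched, which is the kind of component-wise comparison the abstract describes.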
All items in Faculty Digital Archive are protected by copyright, with all rights reserved.