Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/27810
| Title: | Distribution-based aggregation for relational learning with identifier attributes |
| Authors: | Perlich, Claudia; Provost, Foster |
| Keywords: | identifiers; relational learning; aggregation; networks |
| Issue Date: | 27-Jan-2006 |
| Publisher: | Machine Learning |
| Citation: | Machine Learning 62 (1/2) 65-105, 2006 |
| Series/Report no.: | CeDER-PP-2006-08 |
| Abstract: | Identifier attributes—very high-dimensional categorical attributes
such as particular product ids or people’s names—rarely are
incorporated in statistical modeling. However, they can play an
important role in relational modeling: it may be informative to have
communicated with a particular set of people or to have purchased a
particular set of products. A key limitation of existing relational
modeling techniques is how they aggregate bags (multisets) of values
from related entities. The aggregations used by existing methods are
simple summaries of the distributions of features of related entities:
e.g., MEAN, MODE, SUM, or COUNT. This paper’s main contribution is
the introduction of aggregation operators that capture more information
about the value distributions, by storing meta-data about value
distributions and referencing this meta-data when aggregating—for
example by computing class-conditional distributional distances. Such
aggregations are particularly important for aggregating values from
high-dimensional categorical attributes, for which the simple aggregates
provide little information. In the first half of the paper we provide
general guidelines for designing aggregation operators, introduce the
new aggregators in the context of the relational learning system ACORA
(Automated Construction of Relational Attributes), and provide
theoretical justification. We also conjecture special properties of
identifier attributes, e.g., they proxy for unobserved attributes and
for information deeper in the relationship network. In the second half
of the paper we provide extensive empirical evidence that the
distribution-based aggregators indeed do facilitate modeling with
high-dimensional categorical attributes, and in support of the
aforementioned conjectures. |
| URI: | http://hdl.handle.net/2451/27810 |
| Appears in Collections: | CeDER Published Papers
|
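The abstract's idea of storing meta-data about value distributions and computing class-conditional distributional distances can be illustrated with a minimal sketch. This is not ACORA's implementation: the function names (`reference_distributions`, `aggregate`), the toy data, and the choice of cosine distance are all assumptions made for illustration. Each target entity has a bag of identifier values; the stored meta-data is one pooled reference distribution per class, and each bag is summarized by its distance to every reference.

```python
from collections import Counter
import math


def normalize(counts):
    """Turn raw counts into a probability distribution (empty if no counts)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def reference_distributions(bags, labels):
    """Pool identifier counts per class label (the stored 'meta-data')."""
    pooled = {}
    for bag, label in zip(bags, labels):
        pooled.setdefault(label, Counter()).update(bag)
    return {label: normalize(counts) for label, counts in pooled.items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse distributions."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0  # assumption: an empty bag is maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def aggregate(bag, refs):
    """Summarize one bag as its distance to each class-conditional reference."""
    p = normalize(Counter(bag))
    return {label: cosine_distance(p, ref) for label, ref in refs.items()}


# Hypothetical training data: bags of product ids with binary class labels.
train_bags = [["a", "b"], ["a", "a", "c"], ["d", "e"], ["e", "e"]]
train_labels = [1, 1, 0, 0]

refs = reference_distributions(train_bags, train_labels)
features = aggregate(["a", "c"], refs)  # one numeric feature per class
```

The resulting `features` dictionary holds one number per class, which a conventional propositional learner could consume in place of MODE or COUNT. In this toy data, a bag that shares identifiers only with positive examples ends up closer to the class-1 reference than to the class-0 reference.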