|
Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/27629
|
| Title: | Understanding, Estimating, and Incorporating Output Quality Into Join
Algorithms For Information Extraction |
| Authors: | Jain, Alpa; Ipeirotis, Panagiotis G.; Gravano, Luis; Doan, Anhai |
| Issue Date: | 27-Jun-2008 |
| Series/Report no.: | CeDER-08-04 |
| Abstract: | Information extraction (IE) systems are trained to extract specific
relations from text databases. Real-world applications often require
that the output of multiple IE systems be joined to produce the data of
interest. To optimize the execution of a join of multiple extracted
relations, it is not sufficient to consider only execution time. In
fact, the quality of the join output is of critical importance: unlike
in the relational world, different join execution plans can produce join
results of widely different quality whenever IE systems are involved. In
this paper, we develop a principled approach to understand, estimate,
and incorporate output quality into the join optimization process over
extracted relations. We argue that the output quality is affected by (a)
the configuration of the IE systems used to process the documents, (b)
the document retrieval strategies used to retrieve documents, and (c)
the actual join algorithm used. Our analysis considers a variety of join
algorithms from relational query optimization, and predicts the output
quality (and, of course, the execution time) of the
alternate execution plans. We establish the accuracy of our analytical
models, as well as study the effectiveness of a quality-aware join
optimizer, with a large-scale experimental evaluation over real-world
text collections and state-of-the-art IE systems. |
| URI: | http://hdl.handle.net/2451/27629 |
| Appears in Collections: | CeDER Working Papers
|
All items in Faculty Digital Archive are protected by copyright, with all rights reserved.
|