Faculty Digital Archive

Please use this identifier to cite or link to this item: http://hdl.handle.net/2451/27629

Title: Understanding, Estimating, and Incorporating Output Quality Into Join Algorithms For Information Extraction
Authors: Jain, Alpa; Ipeirotis, Panagiotis; Gravano, Luis; Doan, Anhai
Issue Date: 27-Jun-2008
Series/Report no.: CeDER-08-04
Abstract: Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process the documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers a variety of join algorithms from relational query optimization, and predicts the output quality (and, of course, the execution time) of the alternative execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
URI: http://hdl.handle.net/2451/27629
Appears in Collections: CeDER Working Papers

Files in This Item:

File: CeDER-08-04.pdf
Size: 968.19 kB
Format: Adobe PDF

All items in Faculty Digital Archive are protected by copyright, with all rights reserved.

The contents of this archive are either in the public domain or subject to copyright. Please consult NYU's "Handbook for Use of Copyrighted Materials" (http://library.nyu.edu/copyright/copyright.html) for information on using material within the Faculty Digital Archive.