Faculty Digital Archive

Please use this identifier to cite or link to this item: http://hdl.handle.net/2451/25886

Title: A Quality-Aware Optimizer for Information Extraction
Authors: Jain, Alpa; Ipeirotis, Panagiotis G.
Issue Date: 8-Mar-2008
Series/Report no.: CeDER-08-02
Abstract: Large amounts of structured information are buried in unstructured text. Information extraction systems can extract structured relations from documents and enable sophisticated, SQL-like queries over unstructured text. These systems are not perfect: their output has imperfect precision and recall (i.e., it contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as "knobs" to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this paper, we show how to use receiver operating characteristic (ROC) curves to estimate the extraction quality in a statistically robust way, and how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the run time and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach accurately predicts the output quality and selects the fastest execution plan that satisfies the output quality restrictions.
URI: http://hdl.handle.net/2451/25886
Appears in Collections: CeDER Working Papers; IOMS: Information Systems Working Papers
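
Illustrative note: the abstract describes using ROC analysis to choose the setting of an extraction "knob". The sketch below is not the paper's method. It is a minimal illustration that assumes each extracted tuple carries a confidence score and a known good/bad label, and that the knob is a score threshold. The names roc_points, pick_threshold, and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical labeled sample: (confidence score, tuple is good?)
ScoredTuple = Tuple[float, bool]


def roc_points(scored: List[ScoredTuple],
               thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, false-positive rate, true-positive rate) per knob setting."""
    good = sum(1 for _, is_good in scored if is_good)
    bad = len(scored) - good
    points = []
    for t in thresholds:
        # A tuple is emitted when its score reaches the threshold.
        tp = sum(1 for s, is_good in scored if is_good and s >= t)
        fp = sum(1 for s, is_good in scored if not is_good and s >= t)
        tpr = tp / good if good else 0.0
        fpr = fp / bad if bad else 0.0
        points.append((t, fpr, tpr))
    return points


def pick_threshold(points: List[Tuple[float, float, float]],
                   max_fpr: float) -> Optional[float]:
    """Pick the setting with the highest TPR whose FPR stays within max_fpr."""
    feasible = [p for p in points if p[1] <= max_fpr]
    if not feasible:
        return None
    # Ties on TPR are broken in favor of the stricter (higher) threshold.
    return max(feasible, key=lambda p: (p[2], p[0]))[0]


sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
          (0.4, False), (0.3, True), (0.2, False), (0.1, False)]
points = roc_points(sample, [0.25, 0.5, 0.75])
chosen = pick_threshold(points, max_fpr=0.25)
```

With this sample, lowering the threshold admits more good tuples (higher true-positive rate) at the cost of more spurious ones (higher false-positive rate), which is the precision/recall trade-off the abstract refers to.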

Files in This Item:

File: CeDER-08-02.pdf
Size: 586.76 kB
Format: Adobe PDF

Items in Faculty Digital Archive are protected by copyright, with all rights reserved, unless otherwise indicated.