Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/25886
|
| Title: | A Quality-Aware Optimizer for Information Extraction |
| Authors: | Jain, Alpa; Ipeirotis, Panagiotis G. |
| Issue Date: | 8-Mar-2008 |
| Series/Report no.: | CeDER-08-02 |
| Abstract: | Large amounts of structured information are buried in unstructured text.
Information extraction systems can extract structured relations from the
documents and enable sophisticated, SQL-like queries over unstructured
text. Information extraction systems are not perfect and their output
has imperfect precision and recall (i.e., contains spurious tuples and
misses good tuples). Typically, an extraction system has a set of
parameters that can be used as "knobs" to tune the system to be
either precision- or recall-oriented. Furthermore, the choice of
documents processed by the extraction system also affects the quality of
the extracted relation. So far, estimating the output quality of an
information extraction task has been an ad hoc procedure, based mainly on
heuristics. In this paper, we show how to use receiver operating
characteristic (ROC) curves to estimate the extraction quality in a
statistically robust way and show how to use ROC analysis to select the
extraction parameters in a principled manner. Furthermore, we present
analytic models that reveal how different document retrieval strategies
affect the quality of the extracted relation. Finally, we present our
maximum likelihood approach for estimating, on the fly, the parameters
required by our analytic models to predict the run time and the output
quality of each execution plan. Our experimental evaluation demonstrates
that our optimization approach accurately predicts the output quality
and selects the fastest execution plan that satisfies the output quality restrictions. |
| URI: | http://hdl.handle.net/2451/25886 |
| Appears in Collections: | CeDER Working Papers; IOMS: Information Systems Working Papers |