|
Archive@NYU >
Stern School of Business >
CeDER Published Papers >
Please use this identifier to cite or link to this item:
http://hdl.handle.net/2451/27821
|
| Title: | Towards a Query Optimizer for Text-Centric Tasks |
| Authors: | Ipeirotis, Panagiotis Agichtein, Eugene Jain, Pranay Gravano, Luis |
| Keywords: | metasearching text database selection distributed information retrieval information extraction focused crawling |
| Issue Date: | Nov-2007 |
| Publisher: | ACM Transactions on Database Systems |
| Citation: | ACM Transactions on Database Systems (TODS), vol. 32, no. 4, article 21,
November 2007 |
| Series/Report no.: | CeDER-PP-2007-13 |
| Abstract: | Text is ubiquitous and, not surprisingly, many important applications
rely on textual data for a variety of tasks. As a notable example,
information extraction applications derive structured relations from
unstructured text; as another example, focused crawlers explore the Web
to locate pages about specific topics. Execution plans for text-centric
tasks follow two general paradigms for processing a text database:
either we can scan, or “crawl,” the text database or,
alternatively, we can exploit search engine indexes and retrieve the
documents of interest via carefully crafted queries constructed in
task-specific ways. The choice between crawl- and query-based execution
plans can have a substantial impact on both execution time and output
“completeness” (e.g., in terms of recall). Nevertheless,
this choice is typically ad hoc and based on heuristics or plain
intuition. In this article, we present fundamental building blocks to
make the choice of execution plans for text-centric tasks in an
informed, cost-based way. Towards this goal, we show how to analyze
query- and crawl-based plans in terms of both execution time and output
completeness. We adapt results from random-graph theory and statistics
to develop a rigorous cost model for the execution plans. Our cost model
reflects the fact that the performance of the plans depends on
fundamental task-specific properties of the underlying text databases.
We identify these properties and present efficient techniques for
estimating the associated parameters of the cost model.We also present
two optimization approaches for text-centric tasks that rely on the
cost-model parameters and select efficient execution plans. Overall, our
optimization approaches help build efficient execution plans for a task,
resulting in significant efficiency and output completeness benefits. We
complement our results with a large-scale experimental evaluation for
three important text-centric tasks and over multiple real-life data sets. |
| URI: | http://hdl.handle.net/2451/27821 |
| Appears in Collections: | CeDER Published Papers
|
All items in Faculty Digital Archive are protected by copyright, with all rights reserved.
|