Towards a Query Optimizer for Text-Centric Tasks

Ipeirotis, Panagiotis; Agichtein, Eugene; Jain, Pranay; Gravano, Luis

Full metadata record

DC Field	Value	Language
dc.contributor.author	Ipeirotis, Panagiotis	-
dc.contributor.author	Agichtein, Eugene	-
dc.contributor.author	Jain, Pranay	-
dc.contributor.author	Gravano, Luis	-
dc.date.accessioned	2008-12-09T15:53:37Z	-
dc.date.available	2008-12-09T15:53:37Z	-
dc.date.issued	2007-11	-
dc.identifier.citation	ACM Transactions on Database Systems (TODS), vol. 32, no. 4, article 21, November 2007	en
dc.identifier.uri	http://hdl.handle.net/2451/27821	-
dc.description.abstract	Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model.We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.	en
dc.description.sponsorship	NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research	en
dc.format.extent	1095804 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	en_US	en
dc.publisher	ACM Transactions on Database Systems	en
dc.relation.ispartofseries	CeDER-PP-2007-13	en
dc.subject	metasearching	en
dc.subject	text database selection	en
dc.subject	distributed information retrieval	en
dc.subject	information extraction	en
dc.subject	focused crawling	en
dc.title	Towards a Query Optimizer for Text-Centric Tasks	en
dc.type	Article	en
Appears in Collections:	CeDER Published Papers

Files in This Item:

File	Description	Size	Format
CeDER-PP-2007-13.pdf		1.07 MB	Adobe PDF	View/Open

Show simple item record