6 Classification-Aware Hidden-Web Text Database Selection PANAGIOTIS G. IPEIROTIS New York University and LUIS GRAVANO Columbia University Many valuable text databases on the web have noncrawlable contents that are ?hidden? behind search interfaces. Metasearchers are helpful tools for searching over multiple such ?hidden-web? text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. The state-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous re- search has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying. We present a novel ?focused-probing? sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first to construct content summaries that include the frequencies of the words in the database. Unfortunately, Zipf?s law practically guarantees that for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incom- plete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To enhance the sparse document samples and improve the data- base selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects the best categories for a query, and then sends the query to the appropriate databases in the chosen categories. The second algorithm uses This material is based upon work supported by the National Science Foundation under Grants No. IIS-97-33880, IIS-98-17434, and IIS-0643846. The work of P. G. Ipeirotis is also supported by a Microsoft Live Labs Search Award and a Microsoft Virtual Earth Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or of the Microsoft Corporation. Authors? addresses: P. G. Ipeirotis, Department of Information, Operations, and Management Sci- ences, New York University, 44 West Fourth Street, Suite 8-84, New York, NY 10012-1126; email: panos@stern.nyu.edu; L. Gravano, Computer Science Department, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027-7003; email: gravano@cs.columbia.edu. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, fax +1 (212) 869-0481, or permissions@acm.org. C 2008 ACM 1046-8188/2008/03-ART6 $5.00 DOI 10.1145/1344411.1344412 http://doi.acm.org/ 10.1145/1344411.1344412 ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:2 ? P. G. Ipeirotis and L. Gravano ?shrinkage,? a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. We describe how to mod- ify existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web da- tabases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and that the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms. Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Anal- ysis and Indexing?Abstracting methods, indexing methods; H.3.3 [Information Storage and Re- trieval]: Information Search and Retrieval?Search process, selection process; H.3.4 [Information Storage and Retrieval]: Systems and Software?Information networks, performance evaluation (efficiency and effectiveness); H.3.5 [Information Storage and Retrieval]: Online Information Services?Web-based services; H.3.6 [Information Storage and Retrieval]: Library Automa- tion?Large text archives; H.3.7 [Information Storage and Retrieval]: Digital Libraries; H.2.4 [Database Management]: Systems?Textual databases, distributed databases; H.2.5 [Database Management]: Heterogeneous Databases General Terms: Algorithms, Experimentation, Measurement, Performance Additional Key Words and Phrases: Distributed information retrieval, web search, database selec- tion ACM Reference Format: Ipeirotis, P. G. and Gravano, L. 2008. Classification-Aware hidden-web text database selection. ACM Trans. Inform. Syst. 26, 2, Article 6 (March 2008), 66 pages. DOI = 10.1145/1344411.1344412 http://doi.acm.org/10.1145/1344411.1344412 1. INTRODUCTION The World-Wide Web continues to grow rapidly, which makes exploiting all useful information that is available a standing challenge. Although general web search engines crawl and index a large amount of information, typically they ignore valuable data in text databases that is ?hidden? behind search interfaces and whose contents are not directly available for crawling through hyperlinks. Example 1.1. Consider the U.S. Patent and Trademark (USPTO) database, which contains1 the full text of all patents awarded in the US since 1976.2 If we query3 USPTO for patents with the keywords ?wireless? and ?network?, USPTO returns 62,231 matches as of June 6th, 2007, corresponding to distinct patents that contain these keywords. In contrast, a query4 on Google?s main index that finds those pages in the USPTO database with the keywords ?wire- less? and ?network? returns two matches as of June 6th, 2007. This illustrates that valuable content available through the USPTO database is ignored by this search engine.5 One way to provide one-stop access to the information in text databases is through metasearchers, which can be used to query multiple databases 1The full text of the patents is stored at the USPTO site. 2The query interface is available at http://patft.uspto.gov/netahtml/PTO/search-adv.htm 3The query is [wireless AND network]. 4The query is [wireless network site:patft.uspto.gov]. 5Google has a dedicated patent-search service that specifically hosts and enables searches over the USPTO contents; see http://www.google.com/patents ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:3 simultaneously. A metasearcher performs three main tasks. After receiving a query, it finds the best databases to evaluate it (database selection), translates the query in a suitable form for each database (query translation), and finally retrieves and merges the results from different databases (result merging) and returns them to the user. The database selection component of a metasearcher is of crucial importance in terms of both query processing efficiency and effec- tiveness. Database selection algorithms are often based on statistics that character- ize each database?s contents [Yuwono and Lee 1997; Xu and Callan 1998; Meng et al. 1998; Gravano et al. 1999]. These statistics, to which we will refer as content summaries, usually include the document frequencies of the words that appear in the database, plus perhaps other simple statistics.6 These summaries provide sufficient information to the database selection component of a meta- searcher to decide which databases are the most promising to evaluate a given query. Constructing the content summary of a text database is a simple task if the fullcontentsofthedatabaseareavailable(e.g.,viacrawling).However,thistask is challenging for so-called hidden-web text databases, whose contents are only available via querying. In this case, a metasearcher could rely on the databases to supply the summaries (e.g., by following a protocol like STARTS [Gravano et al. 1997], or possibly by using semantic web [Berners-Lee et al. 2001] tags in the future). Unfortunately, many web-accessible text databases are com- pletely autonomous and do not report any detailed metadata about their con- tents to facilitate metasearching. To handle such databases, a metasearcher could rely on manually generated descriptions of the database contents. Such an approach would not scale to the thousands of text databases available on the web [Bergman 2001], and would likely not produce the good-quality, fine- grained content summaries required by database selection algorithms. In this article, we first present a technique to automate the extraction of high-quality content summaries from hidden-web text databases. Our tech- nique constructs these summaries from a biased sample of the documents in a database, extracted by adaptively probing the database using the topically focused queries sent to the database during a topic classification step. Our al- gorithm selects what queries to issue based in part on the results of earlier queries, thus focusing on those topics that are most representative of the da- tabase in question. Our technique resembles biased sampling over numeric databases, which focuses the sampling effort on the ?densest? areas. We show that this principle is also beneficial for the text-database world. Interestingly, our technique moves beyond the document sample and attempts to include in the content summary of a database accurate estimates of the actual document frequency of words in the database. For this, our technique exploits well-studied statistical properties of text collections. 6Other database selection algorithms (e.g., Si and Callan [2005, 2004a, 2003], Hawking and Thomas [2005], Shokouhi [2007]) also use document samples from the databases to make selection decisions. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:4 ? P. G. Ipeirotis and L. Gravano Unfortunately, all efficient techniques for building content summaries via document sampling suffer from a sparse-data problem: Many words in any text database tend to occur in relatively few documents, so any document sample of reasonably small size will necessarily miss many words that occur in the associated database only a small number of times. To alleviate this sparse-data problem, we exploit the observation (which we validate experimentally) that incomplete content summaries of topically related databases can be used to complement each other. Based on this observation, we explore two alternative algorithms that make database selection more resilient to incomplete content summaries. Our first algorithm selects databases hierarchically, based on their categorization. The algorithm first chooses the categories to explore for a query and then picks the best databases in the most appropriate categories. Our sec- ond algorithm is a ?flat? selection strategy that exploits the database catego- rization implicitly by using ?shrinkage,? a statistical technique for improving parameter estimation in the face of sparse data. Our shrinkage-based algo- rithm enhances the database content summaries with category-specific words. As we will see, shrinkage-enhanced summaries often characterize the database contents better than their ?unshrunk? counterparts do. Then, during database selection, our algorithm decides in an adaptive and query-specific way whether an application of shrinkage would be beneficial. We evaluate the performance of our content summary construction algo- rithms using a variety of databases, including 315 real web databases. We also evaluate our database selection strategies with extensive experiments that involve text databases and queries from the TREC testbed, together with rele- vance judgments associated with queries and database documents. We compare our methods with a variety of database selection algorithms. As we will see, our techniques result in a significant improvement in database selection quality over existing techniques, achieved efficiently just by exploiting the database classification information and without increasing the document-sample size. In brief, the main contributions presented in this article are as follows: ?a technique to sample text databases that results in higher-quality database content summaries than those produced by state-of-the-art alternatives; ?a technique to estimate the absolute document frequencies of the words in content summaries; ?a technique to improve the quality of sample-based content summaries using shrinkage; ?a hierarchical database selection algorithm that works over a topical classi- fication scheme; ?an adaptive database selection algorithm that decides in an adaptive and query-specific way whether to use the shrinkage-based content summaries; and ?a thorough, extensive experimental evaluation of the presented algorithms using a variety of datasets, including TREC data and 315 real web databases. The rest of the article is organized as follows. Section 2 gives the neces- sary background. Section 3 outlines our new technique for producing content ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:5 Table I. A Fragment of the Content Summaries of Two Databases CANCERLIT 3,801,351 documents Word df breast 181,102 cancer 1,893,838 ... ... CNN Money 13,313 documents Word df breast 65 cancer 255 ... ... summaries of text databases and presents our frequency estimation algorithm. Section 4 describes our hierarchical and shrinkage-based database selection al- gorithms, which build on our observation that topically similar databases have similar content summaries. Section 5 describes the settings for the experimen- tal evaluation of Sections 6 and 7. Finally, Section 8 describes related work and Section 9 concludes the article. 2. BACKGROUND In this section, we provide the required background and describe related ef- forts.Section2.1brieflysummarizeshowexistingdatabaseselectionalgorithms work, stressing their reliance on database ?content summaries.? Then, Sec- tion 2.2 describes the use of ?uniform? query probing for extraction of content summaries from text databases, and identifies the limitations of this technique. Finally, Section 2.3 discusses how focused query probing has been used in the past for the classification of text databases. 2.1 Database Selection Algorithms Database selection is an important task in the metasearching process, since it has a critical impact on the efficiency and effectiveness of query processing over multiple text databases. We now briefly outline how typical database selection algorithms work and how they depend on database content summaries to make decisions. A database selection algorithm attempts to find the best text databases to evaluate a given query, based on information about the database contents. Usu- ally, this information includes the number of different documents that contain each word, which we refer to as the document frequency of the word, plus per- haps some other simple related statistics [Gravano et al. 1997; Meng et al. 1998; Xu and Callan 1998], such as the number of documents stored in the database. Definition 2.1. The content summary S(D) of a database D consists of: ?the actual number of documents in D, |D|, and ?for each word w, the number df(w) of documents in D that include w. For notational convenience, we also use p(w|D) = df(w)|D| to denote the fraction of D documents that include w. Table I shows a small fraction of what the content summaries for two real text databases might look like. For example, the content summary for the CNN ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:6 ? P. G. Ipeirotis and L. Gravano Money database, a database with articles about finance, indicates that 255 out of the 13,313 documents in this database contain the word ?cancer,? while there are 1,893,838 documents with the word ?cancer? in CANCERLIT, a database with research articles about cancer. Given these summaries, a database selec- tion algorithm estimates the relevance of each database for a given query (e.g., in terms of the number of matches that each database is expected to produce for the query). Example 2.2. bGlOSS [Gravano et al. 1999] is a simple database selec- tion algorithm that assumes query words to be independently distributed over database documents to estimate the number of documents that match a given query. So, bGlOSS estimates that query [breast cancer] will match |D|? df(breast)|D| ? df(cancer)|D| similar 90, 225 documents in database CANCERLIT, where |D| is the number of documents in the CANCERLIT database and df(w)isthe number of documents that contain the word w. Similarly, bGlOSS estimates that roughly only one document will match the given query in the other data- base, CNN Money, of Table I. bGlOSS is a simple example from a large family of database selection algo- rithms that rely on content summaries such as those in Table I. Furthermore, database selection algorithms expect content summaries to be accurate and up- to-date. The most desirable scenario is when each database exports its content summary directly and reliably (e.g., via a protocol such as STARTS [Gravano et al. 1997]). Unfortunately, no protocol is widely adopted for web-accessible da- tabases, and there is little hope that such a protocol will emerge soon. Hence, we need other solutions to automate the construction of content summaries from databases that cannot or are not willing to export such information. We review one such approach next. 2.2 Uniform Probing for Content Summary Construction As discussed before, we cannot extract perfect content summaries for hidden- web text databases whose contents are not crawlable. When we do not have access to the complete content summary S(D) of a database D, we can only hope to generate a good approximation to use for database selection purposes. Definition 2.3. The approximate content summary ?(D) of a database D consists of: ?an estimate hatwidest| of the number of documents in D, and ?for each word w, an estimate hatwider(w)ofdf(w). Using the values hatwidest| and hatwider(w), we can define an approximation ?(w|D)of p(w|D)as?(w|D) = hatwiderdf (w)hatwidest|D| . Callanetal.[1999]andCallanandConnell[2001]presentedpioneeringwork on automatic extraction of approximate content summaries from ?uncoopera- tive? text databases that do not export such metadata. Their algorithm extracts a document sample via querying from a given database D, and approximates ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:7 df(w) using the frequency of each observed word w in the sample, sf(w) (i.e., hatwiderdf (w) = sf(w)). In detail, the algorithm proceeds as follows. Algorithm. (1) Start with an empty content summary where sf(w) = 0 for each word w, and a general (i.e., not specific to D), comprehensive word dictionary. (2) Pick a word (see the next paragraph) and send it as a query to database D. (3) Retrieve the top-k documents returned for the query. (4) If the number of retrieved documents exceeds a prespecified threshold, stop. Other- wise continue the sampling process by returning to step 2. Callan et al. suggested using k = 4 for step 3 and that 300 documents are sufficient (step 4) to create a representative content summary of a database. Also they describe two main versions of this algorithm that differ in how step 2 is executed. The algorithm QueryBasedSampling-OtherResource (QBS-Ord for short) picks a random word from the dictionary for step 2. In contrast, the algorithm QueryBasedSampling-LearnedResource (QBS-Lrd for short) selects the next query from among the words that have been already discovered dur- ing sampling. QBS-Ord constructs better profiles, but is more expensive than QBS-Lrd [Callan and Connell 2001]. Other variations of this algorithm per- form worse than QBS-Ord and QBS-Lrd, or have only marginal improvement in effectiveness at the expense of probing cost. Unfortunately, both QBS-Lrd and QBS-Ord have a few shortcomings. Since these algorithms set hatwider(w) = sf(w), the approximate frequencies hatwider(w) range between zero and the number of retrieved documents in the sample. In other words, the actual document frequency df(w) for each word w in the database is not revealed by this process. Hence, two databases with the same focus (e.g., two medical databases) but differing significantly in size might be assigned similar content summaries. Also, QBS-Ord tends to produce inefficient executions in which it repeatedly issues queries to databases that produce no matches. Ac- cording to Zipf?s law [Zipf 1949], most of the words in a collection occur very few times. Hence, a word that is randomly picked from a dictionary (which hope- fully contains a superset of the words in the database), is not likely to occur in any document of an arbitrary database. Similarly, for QBS-Lrd, the queries are derived from the already acquired vocabulary, and many of these words appear only in one or two documents, so a large fraction of the QBS-Lrd queries return only documents that have been retrieved before. These queries increase the number of queries sent by QBS-Lrd, but do not retrieve any new documents. In Section 3, we present our algorithm for approximate content summary con- struction that overcomes these problems and, as we will see, produces content summaries of higher quality than those produced by QBS-Ord and QBS-Lrd. 2.3 Focused Probing for Database Classification Another way to characterize the contents of a text database is to classify it in a Yahoo!-like hierarchy of topics according to the type of the documents that it contains. For example, CANCERLIT can be classified under the category ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:8 ? P. G. Ipeirotis and L. Gravano Fig. 1. Algorithm for classifying a database D into the category subtree rooted at category C. ?Health,? since it contains mainly health-related documents. Gravano et al. [2003] presented a method to automate the classification of web-accessible text databases, based on focused probing. The rationale behind this method is that queries closely associated with a topical category retrieve mainly documents about that category. For example, a query [breast cancer] is likely to retrieve mainly documents that are related to the ?Health? category. Gravano et al. [2003] automatically construct these topic-specific queries using document classifiers, derived via supervised ma- chine learning. By observing the number of matches generated for each such query at a database, we can place the database in a classification scheme. For example, if one database generates a large number of matches for queries asso- ciatedwiththe?Health?categoryandonlyafewmatchesforallothercategories, we might conclude that this database should be under category ?Health.? If the database does not return the number of matches for a query or does so unreli- ably, we can still classify the database by retrieving and classifying a sample of documents from the database. Gravano et al. [2003] showed that sample-based classification has both lower accuracy and higher cost than an algorithm that relies on the number of matches; however, in the absence of reliable match- ing statistics, classifying the database based on a document sample is a viable alternative. To classify a database, the algorithm in Gravano et al. [2003] (see Figure 1) starts by first sending those query probes associated with subcategories of the top node C of the topic hierarchy, and extracting the number of matches for each probe, without retrieving any documents. Based on the number of matches for the probes for each subcategory Ci, the classification algorithm then calcu- lates two metrics, the Coverage(D, Ci) and Specificity(D, Ci) for the subcate- gory: Coverage(D, Ci) is the absolute number of documents in D that are es- timated to belong to Ci, while Specificity(D, Ci) is the fraction of documents in D that are estimated to belong to Ci. The algorithm decides to classify D into a category Ci if the values of Coverage(D, Ci) and Specificity(D, Ci) ex- ceed two prespecified thresholds tauec and taues, respectively. These thresholds are ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:9 determined by ?editorial? decisions on how ?coarse? a classification should be. For example, higher levels of the specificity threshold taues result in assignments of databases mostly to higher levels of the hierarchy, while lower values tend to assign the databases to nodes closer to the leaves.7 When the algorithm detects that a database satisfies the specificity and coverage requirement for a subcat- egory Ci, it proceeds recursively in the subtree rooted at Ci. By not exploring other subtrees that did not satisfy the coverage and specificity conditions, the algorithm avoids exploring portions of the topic space that are not relevant to the database. Next, we introduce a novel technique for constructing content summaries that are highly accurate and efficient to build. Our new technique builds on the document sampling approach used by the QBS algorithms [Callan and Connell 2001] and on the text-database classification algorithm from Gravano et al. [2003]. Just like QBS, which we summarized in Section 2.2, our new technique probes the databases and retrieves a small document sample to construct the approximate content summaries. The classification algorithm, which we sum- marized in this section, provides a way to focus on those topics that are most representative of a given database?s contents, resulting in accurate and effi- ciently extracted content summaries. 3. CONSTRUCTING APPROXIMATE CONTENT SUMMARIES We now describe our algorithm for constructing content summaries for a text database. Our algorithm exploits a topic hierarchy to adaptively send focused probes to the database (Section 3.1). Our technique retrieves a ?biased? sam- ple containing documents that are representative of the database contents. Furthermore, our algorithm exploits the number of matches reported for each query to estimate the absolute document frequencies of words in the database (Section 3.2). 3.1 Classification-Based Document Sampling Our algorithm for approximate content summary construction exploits a topic hierarchy to adaptively send focused probes to a database. These queries tend to efficiently produce a document sample that is representative of the database contents, which leads to highly accurate content summaries. Furthermore, our algorithm classifies the databases along the way. In Section 4, we will show that we can exploit categorization to improve further the quality of both the generated content summaries and the database selection decisions. Our content summary construction algorithm is based on the classification algorithm from Gravano et al. [2003], an outline of which we presented in Sec- tion2.3(seeFigure1).Ourcontentsummaryconstructionalgorithmisshownin Figure2.Themaindifferencewiththeclassificationalgorithmisthatweexploit the focused probing to retrieve a document sample. We have enclosed in boxes those portions directly relevant to content summary extraction. Specifically, for 7Gravano et al. [2003] suggest that tauec approxequal 10 and taues approxequal 0.3 - 0.4 work well for the task of database classification. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:10 ? P. G. Ipeirotis and L. Gravano Fig. 2. Generalizing the classification algorithm from Figure 1 to generate a content summary for a database using focused query probing. each query probe, we retrieve k documents from the database in addition to the number of matches that the probe generates (box beta in Figure 2). Also, we record two sets of word frequencies based on the probe results and extracted documents (boxes beta and gamma ). These two sets are described next. (1) df(w) is the actual number of documents in the database that contain word w. The algorithm knows this number only if [w] is a single-word query probe that was issued to the database.8 (2) sf(w) is the number of documents in the extracted sample that contain word w. The basic structure of the probing algorithm is as follows. We explore (and send query probes for) only those categories with sufficient specificity and cover- age, as determined by the taues and tauec thresholds (for details, see Section 2.3). As a result, this algorithm categorizes the databases into the classification scheme during probing. We will exploit this categorization to improve the quality of the generated content summaries in Section 4.2. Figure 3 illustrates how our algorithm works for the CNN Sports Illus- trated database, a database with articles about sports, and for a toy hierar- chical scheme with four categories under the root node: ?Sports,? ?Health,? 8The number of matches reported by a database for a single-word query [w] might differ slightly from df(w), for example, if the database applies stemming [Salton and McGill 1983] to query words so that a query [computers] also matches documents with word ?computer.? ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:11 Health Science metallurgy (0) dna (30) Computers Sports soccer (7,530) cancer (780) (24,520) keyboard (32) ram (140) aids (80) Probing Process - Phase 1 Parent Node: Root Basketball Baseball Soccer Hockey jordan (1,230) liverpool(150) lakers (7,700) yankees (4,345) fifa (2,340) Probing Process - Phase 2 Parent Node: Sports nhl (4,245) canucks (234) The number of matches returned for each query is indicated in parentheses next to the query Fig. 3. Querying the CNN Sports Illustrated database with focused probes. ?Computers,? and ?Science.? We pick specificity and coverage thresholds taues = 0.4 and tauec = 10, respectively, which work well for the task of database clas- sification [Gravano et al. 2003]. The algorithm starts by issuing query probes associated with each of the four categories. The ?Sports? probes generate many matches (e.g., query [baseball] matches 24,520 documents). In contrast, probes for the other sibling categories (e.g., [metallurgy] for category ?Science?) gener- ate just a few or no matches. The Coverage of category ?Sports? is the sum of the number of matches for its probes, or 32,050. The Specificity of category ?Sports? is the fraction of matches that correspond to ?Sports? probes, or 0.967. Hence, ?Sports? satisfies the Specificity and Coverage criteria (recall that taues = 0.4 and tauec = 10) and is further explored in the next level of the hierarchy. In contrast, ?Health,? ?Computers,? and ?Science? are not considered further. By pruning the probe space, we improve the efficiency of the probing process by giving at- tention to the topical focus (or foci) of the database. (Out-of-focus probes would tend to return few or no matches.) During probing, our algorithm retrieves the top-k documents returned by each query (box beta in Figure 2). For each word w in a retrieved document, the al- gorithm computes sf(w) by measuring the number of documents in the sample, extracted in a probing round, that contain w. If a word w appears in document samples retrieved during later phases of the algorithm for deeper levels of the hierarchy, then all sf(w) values are added together (the merge step in box gamma ). ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:12 ? P. G. Ipeirotis and L. Gravano f = P B ? ? ? Known df ? Unknown df sf ... ... h ...... hepatitis... ... ... 200,000 matches 1,400,000 matches 600,000 matches f (frequency) r (rank) Fig. 4. Estimating unknown df values. Similarly, during probing, the algorithm keeps track of the number of matches produced by each single-word query [w]. As discussed, the number of matches for such a query is (an approximation of) the df(w) frequency (i.e., the number of documents in the database with word w). These df(?) frequencies are crucial to estimate the absolute document frequencies of all words that appear in the document sample extracted, as discussed next. 3.2 Estimating Absolute Document Frequencies The QBS-Ord and QBS-Lrd techniques return the frequency of words in the document sample (i.e., the sf(?) frequencies), with no absolute frequency in- formation. We now show how we can exploit the df(?) and sf(?) document fre- quencies that we extract from a database to build a content summary for the database with accurate absolute document frequencies. Before turning to the details of the algorithm, we describe a (simplified) ex- ample in Figure 4 to introduce the basic intuition behind our approach.9 After probing the CANCERLIT database using the algorithm in Figure 2, we rank all words in the extracted documents according to their sf(?) frequency. For exam- ple, ?cancer? has the highest sf(?) value and ?hepatitis? the lowest such value in Figure 4. The sf(?) value of each word is denoted by an associated vertical bar. Also, the figure shows the df(?) frequency of each word that appeared as a single-word query. For example, df(hepatitis) = 200, 000, because query probe [hepatitis] returned 200,000 matches. Note that the df value of some words (e.g., ?stomach?) is unknown. These words are in documents retrieved during probing, but did not appear as single-word probes. Finally, note from the figure 9The figures in this example are coarse approximations of the real ones, and we use them just to illustrate our approach. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:13 that sf(hepatitis) approxequal sf(stomach), and so we might want to estimate df(stomach) to be close to the (known) value of df(hepatitis). To specify how to ?propagate? the known df frequencies to ?nearby? words with similar sf frequencies, we exploit well-known laws on the distribution of words over text documents. Zipf [1949] was the first to observe that word- frequency distributions follow a power law, an observation later refined by Man- delbrot [1988]. Mandelbrot identified a relationship between the rank r and the frequency f of a word in a text database, f = P(r + p)B, where P, B, and p are database-specific parameters (P > 0, B < 0, p greaterequal 0). This formula indicates that the most frequent word in a collection (i.e., the word with rank r = 1) will tend to appear in about P(1 + p)B documents, while, say, the tenth most frequent word will appear in just about P(10+ p)B documents. Therefore, given Mandelbrot?s formula for the database and the word ranking, we can estimate the frequency of each word. Our technique relies on Mandelbrot?s formula to define the content summary of a database and consists of two steps, detailed next. (1) During probing, exploit the sf(?) frequencies derived during sampling to estimate the rank-frequency distribution of words over the entire database (Section 3.2.1). (2) After probing, exploit the df(?) frequencies obtained from one-word query probes to estimate the rank of these words in the actual database; then, estimate the document frequencies of all words by ?propagating? the known rank and document frequencies to ?nearby? words w for which we only know sf(w) and not df(w) (Section 3.2.2). 3.2.1 Estimating the Word Rank-Frequency Distribution. The first part of our technique estimates the parameters P and B (of a slightly simplified version10) of Mandelbrot?s formula for a given database. To do this, we examine how the parameters of Mandelbrot?s formula change for different sample sizes. We observed that in all the databases that we examined for our experiments, log(P) and B tend to increase logarithmically with the sample size |S|. (This is actually an effect of sampling from a power-law distribution [Baayen 2006].) Specifically, log(P) = P1 log(|S|) + P2 (1a) B = B1 log(|S|) + B2 (1b) and P1, P2, B1, and B2 are database-specific constants, independent of sample size. Based on the preceding empirical observations, we proceed as follows for a database D. At different points during the document sampling process, we calculate P and B. After sampling, we use regression to estimate the values of P1, P2, B1, and B2. We also estimate the size of database D using the sample- resample method [Si and Callan 2003] with five resampling queries. Finally, we 10For numerical stability, we define f = PrB, which allows us to use linear regression in the log-log space to estimate parameters P and B. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:14 ? P. G. Ipeirotis and L. Gravano compute the values of P and B for the database by substituting the estimated |D| for |S| in Eqs. (1a) and (1b). At this point, we have a description of the frequency-rank distribution for the actual database. 3.2.2 Estimating Document Frequencies. Given the parameters of Man- delbrot?s formula, the actual document frequency df(w) of each word w can be derived from its rank in the database. For high-frequency words, the rank in the sample is usually a good approximation of the rank in the database. Unfor- tunately, this is rarely the case for low-frequency words, for which we rely on the observation that the df(?) frequencies derived from one-word query probes can help estimate the rank and df(?) frequency of all words in the database. Our rank and frequency estimation algorithm works as follows. Algorithm. (1) Sort words in descending order of their sf(?) frequencies to determine the sample rank sr(wi) of each word wi; do not break ties for words with equal sf(?) frequency and assign the same sample rank sr(?) to these words. (2) For each word w in a one-word query probe (df(w) is known), use Mandelbrot?s formula and compute the database rank ar(w) = (df(w)P ) 1B. (3) For each word w not in a one-word query probe (df(w) is unknown), do the following. (a) Find two words w1 and w2 with known df and consider their ranks in the sample (i.e., sr(w1), sr(w2)) and in the database (i.e., ar(w1), ar(w2)).11 (b) Use interpolation in the log-log space to compute the database rank ar(w).12 (c) Use Mandelbrot?s formula to compute hatwider(w) = P ? ar(w)B, where ar(w)isthe rank of word w as computed in the previous step. Using the aforesaid procedure, we can estimate the df frequency of each word that appears in the sample. Example 3.1. Consider the medical database CANCERLIT and Figure 4. We know that df(liver) = 1, 400, 000 and df(hepatitis) = 200, 000, since the re- spective one-word queries reported as many matches. Furthermore, the ranks of the two words in the sample are sr(liver) = 4 and sr(hepatitis) = 10, re- spectively. While we know that the rank of the word ?kidneys? in the sample is sr(kidneys) = 8, we do not know df(kidneys) because [kidneys] was not a query probe. However, the known values of df(hepatitis) and df(liver) can help us estimate the rank of ?kidneys? in the database and, in turn, the df(kidneys) frequency. For the CANCERLIT database, we estimate that P = 6 ? 106 and B =-1.15. Thus, we estimate that ?liver? is the fourth most frequent word in the database (i.e., ar(liver) = 4), while ?hepatitis? is ranked number 20 (i.e., ar(hepatitis) = 20). Therefore, 15 words in the database are ranked be- tween ?liver? and ?hepatitis?, while in the sample there are only 5 such words. By exploiting this observation and by interpolation, we estimate that ?kid- neys? (with rank 8 in the sample) is the 14th most frequent word in the data- base. Then, using the rank information with Mandelbrot?s formula, we compute hatwiderdf (kidneys) = 6 ? 106 ? 14-1.15 similar= 288, 472. 11It is preferable, but not essential, to pick w1 and w2 such that sr(w1) < sr(w) < sr(w2). 12The exact formula is ar(w) = exp(ln(ar(w2))?ln(sr(w)/sr(w1))+ln(ar(w1))?ln(sr(w2)/sr(w)) ln(sr(w2)/sr(w1)) ). ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:15 During sampling, we also send to the database query probes that consist of more than one word. (Recall that our query probes are derived from an under- lying automatically learned document classifier.) We do not exploit multiword queries for determining the df frequencies of their words, since the number of matches returned by a Boolean-AND multiword query is only a lower bound on the df frequency of each intervening word. However, the average length of the query probes that we generate is small (less than 1.5 words in our experiments), and their median length is 1. Hence, the majority of the query probes provide us with df frequencies that we can exploit. Finally, a potential problem with the current algorithm is that it relies on the database reporting a value for the number of matches for a one-word query [w] that is equal (or at least close) to the value of df(w). Sometimes, however, these two values might differ (e.g., if a database applies stemming to query words). In this case, frequency estimates might not be reliable. However, it is rather easy to detect such configurations [Meng et al. 1999] and adapt the frequency estimation algorithm properly. For example, if we detect that a database uses stemming, we might decide to compute the frequency and rank of each word in the sample after the application of stemming and then adjust the algorithms accordingly. In summary, we have presented a novel technique for estimating the absolute document frequency of the words in a database. As we will see, this technique produces relatively accurate frequency estimates for the words in a document sample of the database. However, database words that are not in the sample documents in the first place are ignored and not made part of the resulting content summary. Unfortunately, any document sample of moderate size will necessarily miss many words that occur only a small number of times in the as- sociated database. The absence of these words from the content summaries can negatively affect the performance of database selection algorithms for queries that mention such words. To alleviate this sparse-data problem, we exploit the observation that incomplete content summaries of topically related databases can be used to complement each other, as discussed next. 4. DATABASE SELECTION WITH SPARSE CONTENT SUMMARIES So far, we have discussed how to efficiently construct approximate content summaries using document sampling. However, any efficient algorithm for constructing content summaries through query probes is likely to produce in- complete content summaries, which can adversely affect the effectiveness of the database selection process. To alleviate this sparse-data problem, we exploit the observation that incomplete content summaries of topically related databases can be used to complement each other. In this section, we present two alterna- tive algorithms that exploit this observation and make database selection more resilient to incomplete content summaries. Our first algorithm (Section 4.1) se- lects databases hierarchically, based on categorization of the databases. Our second algorithm (Section 4.2) is a flat selection strategy that exploits the data- base categorization implicitly by using shrinkage, and enhances the database content summaries with category-specific words that appear in topically similar databases. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:16 ? P. G. Ipeirotis and L. Gravano 4.1 Hierarchical Database Selection We now introduce a hierarchical database selection algorithm that exploits the database categorization and content summaries to alleviate the negative effect of incomplete content summaries. This algorithm consists of two basic steps, given next. Algorithm. (1) ?Propagate? the database content summaries to the categories of the hierarchical classification scheme and create the associated category content summaries using Definition 4.1. (2) Use the content summaries of categories and databases to perform database selec- tion hierarchically by zooming in on the most relevant portions of the topic hierarchy. The intuition behind our approach is that databases classified under similar topics tend to have similar vocabularies. (We present supporting experimental evidence for this statement in Section 6.2.) Hence, we can view the (potentially incomplete) content summaries of all databases in a category as complemen- tary, and exploit this for better database selection. For example, consider the CANCER.gov database and its associated content summary in Figure 5. As we can see, CANCER.gov was correctly classified under ?Cancer? by the algorithm of Section 3.1. Unfortunately, the word ?metastasis? did not appear in any of the documents extracted from CANCER.gov during probing, so this word is miss- ing from the content summary. However, we see that CancerBACUP13, another database classified under ?Cancer?, has hatwider(metastasis) = 3, 569, a relatively high value. Hence, we might conjecture that the word ?metastasis? is an impor- tant word for all databases in the ?Cancer? category and that this word did not appear in CANCER.gov because it was not discovered during sampling, and not because it does not occur in the database. Therefore, we can create a content summary with category ?Cancer? in such a way that the word ?metastasis? ap- pears with relatively high frequency. This summary is obtained by merging the summaries of all databases under the category. In general, we define the content summary of a category as follows. Definition 4.1. Consider a category C and the set db(C) ={D1, ..., Dn} of databases classified (not necessarily immediately) under C.14 The approximate content summary ?(C) of category C contains, for each word w, an estimate ?(w|C)ofp(w|C), where p(w|C) is the probability that a randomly selected document from a database in db(C) contains the word w. The ?(w|C) estimates in ?(C) are derived from the approximate content summaries of the databases 13http://www.cancerbacup.org.uk 14If a database Di is classified under multiple categories, we can treat Di as multiple disjoint subdatabases, with each subdatabase being associated with one of the Di categories and containing only the documents in the respective category. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:17 CANCER.gov 60,574 documents Word df ?... breast 13,379 ?... cancer 58,491 ?... diabetes 11,344 ?? metastasis CancerBACUP 17,328 documents Word df ?... breast 2,546 ?... cancer 16,735 ?... diabetes ?? metastasis 3,569 Category: Cancer |db(Cancer)| =2 77,902 documents Word df ?... breast 15,925 ?... cancer 75,226 ?... diabetes 11,344 ?? metastasis 3,569 WebMD 3,346,639 documents Word df ?... ?... ?... Category: Health |db(Health)| = 5 3,747,366 documents Word df ?... ?... ?... ? Fig. 5. Associating content summaries with categories. in db(C)as15 ?(w|C) = summationtext Delementdb(C) ?p(w|D) ? hatwidest|D|summationtext Delementdb(C) hatwidest|D| , (2) where hatwidest| is an estimate of the number of documents in D (see Definition 2.3).16 The approximate content summary ?(C) also includes: ?the number of databases |db(C)| under C (n in this definition); ?an estimate hatwider|=summationtextDelementdb(C) hatwidest| of the number of documents in all databases under C; and 15An alternative is to define ?p(w|C) = summationtext Delementdb(C) ?p(w|D) |db(C)| , which ?weights? each database equally, regardless of its size. We implemented this alternative and obtained results virtually identical to those for Eq. (2). 16We estimate the number of documents in the database as described in Section 3.2.1. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:18 ? P. G. Ipeirotis and L. Gravano Fig. 6. Selecting the K most specific databases for a query hierarchically. ?for each word w, an estimate hatwiderC(w) of the total number of documents under C that contain the word w: hatwiderC(w) = ?(w|C) ? hatwider|. By having content sumaries associated with categories in the topic hierar- chy, we can select databases for a query by proceeding hierarchically from the root category. At each level, we use existing flat database algorithms such as CORI [Callan et al. 1995] or bGlOSS [Gravano et al. 1999]. These algorithms assign a score to each database (or category, in our case) that specifies how promising the database (or category) is for the query, as indicated by the content summaries (see Example 2.2). Given the scores for categories at one level of the hierarchy, the selection process continues recursively down the most promising subcategories. As further motivation for our approach, earlier research has in- dicated that distributed information retrieval systems tend to produce better results when documents are organized in topically cohesive clusters [Xu and Croft 1999; Larkey et al. 2000]. Figure 6 specifies our hierarchical database selection algorithm in detail. The algorithm receives as input a query and the target number of databases K that we are willing to search for the query. Also, the algorithm receives the top categoryC asinput,andstartsbyinvokingaflatdatabaseselectionalgorithmto score all subcategories of C for the query (step 1), using the content summaries associated with the subcategories. We assume in our discussion that the scores produced by the database selection algorithms are greater than or equal to zero, with a zero score indicating that a database or category should be ignored for the query. If at least one promising subcategory has a nonzero score (step 2), then the algorithm picks the best such subcategory Cj (step 3). If Cj has K or more databases under it (step 4), the algorithm proceeds recursively under thatbranchonly(step5).Thisstrategyprivileges?topic-specific?databasesover those with broader scope. On the other hand, if Cj does not have sufficiently many (i.e., K or more) databases (step 6), then intuitively the algorithm has gone as deep in the hierarchy as possible (exploring only category Cj would result in fewer than K databases being returned). Then, the algorithm returns all |db(Cj )| databases under Cj , plus the best K -|db(Cj )| databases under C but not in Cj , according to the flat database selection algorithm of choice (step 7). If no subcategory of C has a nonzero score (step 8), then again this indicates that the execution has gone as deep in the hierarchy as possible. Therefore, we ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:19 Fig. 7. Exploiting a topic hierarchy for database selection. return the best K databases under C, according to the flat database selection algorithm (step 9). Figure 7 shows an example of an execution of this algorithm for query [babe ruth] and for a target of K = 3 databases. The top-level categories are evaluated by a flat database selection algorithm for the query, and the ?Sports? category is deemed best, with a score of 0.93. Since the ?Sports? category has more than three databases, the query is ?pushed? into this category. The algorithm pro- ceeds recursively by pushing the query into the ?Baseball? category. If we had initially picked K = 10 instead, the algorithm would have still picked ?Sports? as the first category to explore. However, ?Baseball? has only seven databases, so the algorithm picks them all, and chooses the best three databases under ?Sports? to reach the target of ten databases for the query. In summary, our hierarchical database selection algorithm attempts to choose the most specific databases for a query. By exploiting the database cate- gorization, this hierarchical algorithm manages to compensate for the necessar- ily incomplete database content summaries produced by query probing. How- ever, by first selecting the most appropriate categories, this algorithm might miss some relevant databases that are not under the selected categories. One solution would be to try different hierarchy-traversal strategies that could lead to the selection of databases from multiple branches of the hierarchy. Instead of following this direction of finding the appropriate traversal strategy, we opt for an alternative, flat selection scheme: We use the classification hierarchy only for improving the extracted content summaries, and we allow the database se- lection algorithm to choose among all available databases. Next, we describe this approach in detail. 4.2 Shrinkage-Based Database Selection As argued previously, content summaries built from relatively small document samples are inherently incomplete, which might affect the performance of da- tabase selection algorithms that rely on such summaries. Now, we show how we can exploit database category information to improve the quality of the ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:20 ? P. G. Ipeirotis and L. Gravano database summaries, and subsequently the quality of database selection deci- sions. Specifically, Section 4.2.1 presents an overview of our general approach, which builds on the shrinkage ideas from document classification [McCallum et al. 1998], while Section 4.2.2 explains in detail how we use shrinkage to con- struct content summaries. Finally, Section 4.2.3 presents a database selection algorithm that uses the shrinkage-based content summaries in an adaptive and query-specific way. 4.2.1 Overview of our Approach. In Sections 2.2 and 3.1, we discussed sampling-based techniques for building content summaries from hidden-web text databases, and argued that low-frequency words tend to be absent from these summaries. Additionally, other words might be disproportionately rep- resented in the document samples. One way to alleviate these problems is to increase the document sample size. Unfortunately, this solution might be im- practical, since it would involve extensive querying of (remote) databases. Even more importantly, increases in document sample size do not tend to result in comparable improvements in content summary quality [Callan and Connell 2001]. An interesting challenge is thus to improve the quality of approximate content summaries, without necessarily increasing the document sample size. This challenge has a counterpart in the problem of hierarchical document classification.Documentclassifiersrelyontrainingdatatoassociatewordswith categories. Often, only limited training data is available, which might lead to poor classifiers. Classifier quality can be increased with more training data, but creating large numbers of training examples might be prohibitively expensive. As a less expensive alternative, McCallum et al. [1998] suggested sharing train- ing data across related topic categories. Specifically, their shrinkage approach compensates for sparse training data for a category by using training exam- ples for more general categories. For example, the training documents for the ?Heart? category can be augmented with those from the more general ?Health? category. The intuition behind this approach is that the word distribution in ?Health? documents is hopefully related to that in the ?Heart? documents. We can apply the same shrinkage principle to our problem, which requires that databases be categorized into a topic hierarchy. This categorization might be an existing one (e.g., if the databases are classified under Open Directory17). Alternatively, databases can be classified automatically using the classifica- tion algorithm briefly reviewed in Section 2.3. Regardless of how databases are categorized, we can exploit this categorization to improve content summary coverage. The key intuition behind the use of shrinkage in this context is that databases under similar topics tend to have related content summaries. Hence, we can use the approximate content summaries for similarly classified data- bases to complement each other, as illustrated in the following example. Example 4.2. Figure 8 shows a fraction of a classification scheme with two text databases D1 and D2 classified under ?Heart,? and one text database D3 classified under the (higher-level) category ?Health.? Assume that the ap- proximate content summary of D1 does not contain the word ?hypertension,? 17http://www.dmoz.org ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:21 Fig. 8. A fraction of a classification hierarchy and content summary statistics for the word ?hypertension.? but that this word appears in many documents in D1. (?Hypertension? might not have appeared in any of the documents sampled to build ?(D1).) In con- trast, ?hypertension? appears in a relatively large fraction of D2 documents as reported in the content summary of D2, which is also classified under the ?Heart? category. Then, by ?shrinking? ?(hypertension|D1) towards the value of ?(hypertension|D2), we can capture more closely the actual (and unknown) value of p(hypertension|D1). The new, ?shrunk? value is, in effect, exploiting documents sampled from both D1 and D2. We expect databases under the same category to have similar content sum- maries. Furthermore, even databases classified under relatively general cate- goriescanhelpimprovetheapproximatecontentsummaryofamorespecificda- tabase. Consider database D3, classified under ?Health? in Figure 8. Here ?(D3) can help complement the content summary approximation of databases D1 and D2, which are classified under a subcategory of ?Health,? namely ?Heart.? Da- tabase D3, however, is a more general database that contains documents in topics other than heart-related. Hence, the influence of ?(D3)on ?(D1) should perhaps be less than that of, say, ?(D2). In general, and just as for document classification [McCallum et al. 1998], each category level might be assigned a different ?weight? during shrinkage. We discuss this and other specific aspects of our technique next. 4.2.2 Using Shrinkage over a Topic Hierarchy. We now define more for- mally how we can use shrinkage for content summary construction. For this, we use the notion of content summaries for the categories of a classification scheme (Definition 4.1) from Section 4.1. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:22 ? P. G. Ipeirotis and L. Gravano Creating shrunk content summaries. Section 4.2.1 argued that mixing in- formation from content summaries of topically related databases may lead to more complete approximate content summaries. We now formally describe how to use shrinkage for this purpose. In essence, we create a new content sum- mary for each database D by shrinking the approximate content summary of D, ?(D), so that it is ?closer? to the content summaries S(Ci) of each category Ci under which D is classified. Definition 4.3. Consider a database D classified under categories C1, ..., Cm of a hierarchical classification scheme, with Ci = Parent(Ci+1) for i = 1, ..., m - 1. Let C0 be a dummy category whose content summary ?(C0) contains the same estimate ?(w|C0) for every word w. Then, the shrunk content summary hatwide(D) of database D consists of: ?an estimate hatwidest| of the number of documents in D; and ?for each word w, a shrinkage-based estimate ?R(w|D)ofp(w|D), defined as ?R(w|D) = lambdam+1 ? ?(w|D) + msummationdisplay i=0 lambdai ? ?(w|Ci) (3) for a choice of lambdai values such that summationtextm+1i=0 lambdai = 1 (see next). As described so far, the ?(w|Ci) values in the ?(Ci) content summaries are not independent of each other: Since Ci = Parent(Ci+1), all the databases under Ci+1 are also used to compute ?(Ci), by Definition 4.1. To avoid this overlap, before estimating hatwide(D), we subtract from ?(Ci) all the data used to construct ?S(Ci+1). Also note that a simple version of Eq. (3) is used for database selection based on language models [Si et al. 2002]. Language model database selection ?smoothes? the ?(w|D) probabilities with the probability ?(w|G) for a ?global? category G. Our technique extends this principle and does multilevel smoothing of ?(w|D), using the hierarchical classification of D. We now describe how to compute the lambdai weights used in Eq. (3). Calculating category mixture weights. We define the lambdai mixture weights from Eq. (3), so as to make the shrunk content summaries hatwide(D) for each da- tabase D as similar as possible to both the starting summary ?(D) and the summary ?(Ci) of each category Ci under which D is classified. Specifically, we use expectation maximization (EM) [McCallum et al. 1998] to calculate the lambdai weights, using the algorithm in Figure 9. (This is a simple version of the EM algorithm from Dempster et al. [1977].) The Expectation step calculates the likelihood that content summary hatwide(D) corresponds to each category. The Maximization step weights the lambdai?s to maxi- mize the total likelihood across all categories. The result of the algorithm is the shrunk content summary hatwide(D), which incorporates information from multiple content summaries and is thus hopefully closer to the complete (and unknown) content summary S(D) of database D. For illustration purposes, Table II reports the computed mixture weights for two databases that we used in our experiments. As we can see, in both cases the original database content summary and that of the most specific category ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:23 Fig.9. Usingexpectationmaximizationtodeterminethelambdai mixtureweightsfortheshrunkcontent summary of a database D. for the database receive the highest weights (0.421 and 0.414, respectively, for the AIDS.org database, and 0.411 and 0.297, respectively, for the American Economics Association database). However, higher-level categories also receive nonnegligible weights. In general, the lambdam+1 weight associated with a database (as opposed to with the categories under which it is classified) is usually highest among the lambdai?s, and so the word-distribution statistics for the database are not eclipsed by the category statistics. (We verify this claim experimentally in Section 6.3.) Shrinkage might in some cases (incorrectly) reduce the estimated frequency of words that distinctly appear in a database. Fortunately, this reduction tends to be small because of the relatively high value of lambdam+1, and hence these dis- tinctive words remain with high frequency estimates. As an example, consider the AIDS.org database from Table II. The word chlamydia appears in 3.5% of those in the AIDS.org database. This word appears in 4% of the documents in the document sample from AIDS.org and in approximately 2% of those in the content summary for the AIDS category. After applying shrinkage, the esti- mated frequency of the word chlamydia is somewhat reduced, but still high. The shrinkage-based estimate is that chlamydia appears in 2.85% of the documents in AIDS.org, which is still close to the real frequency. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:24 ? P. G. Ipeirotis and L. Gravano Table II. Category Mixture Weights for Two Databases Database Category lambda Database Category lambda Uniform 0.075 Uniform 0.041 Root 0.026 American Root 0.041 AIDS.org Health 0.061 Economics Science 0.055 Diseases 0.003 Association Social Sciences 0.155 AIDS 0.414 Economics 0.297 AIDS.org 0.421 A.E.A. 0.411 Shrinkage might in some cases (incorrectly) cause inclusion of words in the content summary that do not appear in the corresponding database. For- tunately, such spurious words tend to be introduced in summaries with low weight. Using once again the AIDS.org database as an example, we observed that the word metastasis was (incorrectly) added by the shrinkage process to the summary: Metastasis does not appear in the database, but is included in documents in other databases under the Health category and hence is in the Health category content summary. The shrunk content summary for AIDS.org estimates that metastasis appears in just 0.03% of the database documents, so such a low estimate is unlikely to adversely affect database selection decisions. (We will evaluate the positive and negative effects of shrinkage experimentally later, in Sections 6 and 7.) Finally, note that the lambdai weights are computed offline for each database when the sampling-based database content summaries are created. This computation does not involve any overhead at query-processing time. 4.2.3 Improving Database Selection Using Shrinkage. So far, we intro- duced a shrinkage-based strategy to complement the incomplete content sum- mary of a database with the summaries of topically related databases. In princi- ple, existing database selection algorithms could proceed without modification and use the shrunk summaries to assign scores for all queries and databases. However, sometimes shrinkage might not be beneficial and should not be used. Intuitively, shrinkage should be used to determine the score s(q, D) for a query q and a database D only if the uncertainty associated with this score would otherwise be large. The uncertainty associated with an s(q, D) score depends on a number of sample-, database-, and query-related factors. An important factor is the size of the document sample relative to that of database D. If an approximate sum- mary ?(D) was derived from a sample that included most of the documents in D, then this summary is already ?sufficiently complete.? (For example, this situation might arise if D is a small database.) In this case, shrinkage is not necessary and might actually be undesirable, since it might introduce spurious words into the content summary from topically related (but not identical) da- tabases. Another factor is the frequency of query words in the sample used to determine ?(D). If, say, every word in a query appears in nearly all sample doc- uments and the sample is representative of the entire database contents, then there is little uncertainty on the distribution of the words over the database at large. Therefore, the uncertainty about the score assigned to the database ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:25 from the database selection algorithm is also low, and there is no need to apply shrinkage. Analogously, if every query word appears in only a small fraction of sample documents, then most probably the database selection algorithm would assign a low score to the database, since it is unlikely that the database is a good candidate for evaluating the query. Again, in this case shrinkage would provide limited benefit and should be avoided. However, consider the following scenario, involving bGlOSS and a multiword query for which most words ap- pear very frequently in the sample, but where one query word is missing from the document sample altogether. In this case, bGlOSS would assign a zero score to the database. The missing word, though, may have a nonzero frequency in the complete content summary, and the score assigned by bGlOSS to the da- tabase would have been significantly higher in the presence of this knowledge because of bGlOSS?s Boolean nature. So, the uncertainty about the database score that bGlOSS would assign if given the complete summary is high, and it is thus desirable to apply shrinkage. In general, for query-word distribution scenarios where the approximate content summary is not sufficient to reliably establish the query-specific score for a database, shrinkage should be used. More formally, consider a query q = [w1, ..., wn] with n words w1, ..., wn,a database D, and an approximate content summary for D, ?(D), derived from a randomsample S of D.Furthermore,supposethatwordwk appearsinexactlysk documents in the sample S. For every possible combination of values d1, ..., dn (see the following), we compute: ?the probability P that wk appears in exactly dk documents in D, for k = 1, ..., n,as P = nproductdisplay k=1 dgammak parenleftBig dk |D| parenrightBigsk parenleftBig 1 - dk|D| parenrightBig|S|-sk summationtext|D| i=0 igamma ? parenleftBig i |D| parenrightBigsk parenleftBig 1 - i|D| parenrightBig|S|-sk , (4) where gamma is a database-specific constant (for details, see Appendix A); and ?the score s(q, D) that the database selection algorithm of choice would assign to D if p(wk|D) = dk|D| , for k = 1, ..., n. So for each possible combination of values d1, ..., dn, we compute both the probability of the value combination and the score that the database selection algorithm would assign to D for this document frequency combination. Then, we can approximate the uncertainty behind the s(q, D) score by examining the mean and variance of database scores over the different d1, ..., dn values. This computation can be performed efficiently for a generic database selection algorithm: Given the sample frequencies s1, ..., sn, a large number of possible d1, ..., dn values have virtually zero probability of occurring, so we can ignore them. Additionally, mean and variance converge fast, even after examining only a small number of d1, ..., dn combinations. Specifically, we examine random d1, ..., dn combinations and periodically calculate the mean and variance of the score distribution. Usually, after examining just a few hundred random d1, ..., dn combinations, mean and variance converge to a stable value. The mean and variance computation typically requires less than 0.1 seconds for ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:26 ? P. G. Ipeirotis and L. Gravano Fig. 10. Using shrinkage adaptively for database selection. a single-word query, and approximately 4?5 seconds for a 16-word query.18 This computation can be even faster for a large class of database selection algorithmsthatassumeindependencebetweenquerywords(e.g.,Gravanoetal. [1999], Callan et al. [1995], and Xu and Croft [1999]). For these algorithms, we can calculate the mean and variance for each query word separately, and then combine them into the final mean score and variance, respectively (in Appendix B we provide more details). For algorithms that assume independence between query words, the computation time is typically less than 0.1 seconds. Figure 10 summarizes the preceding discussion and shows how we can adap- tively use shrinkage with an existing database selection algorithm. Specifically, the algorithm takes as input a query q and a set of databases D1, ..., Dm. The Content Summary Selection step decides whether to use shrinkage for each database Di, as discussed earlier. If the distribution of possible scores has high variance, then ?(Di) is considered unreliable and the shrunk content sum- mary hatwide(Di) is used instead. Otherwise, shrinkage is not applied. Then, the Scoring step computes the score s(q, Di) for each database Di, using the con- tent summary chosen for Di in the Content Summary Selection step. Finally, the Ranking step orders all databases by their final score for the query. The metasearcher then uses this rank to decide which databases to search for the query. In this section, we presented two database selection strategies that ex- ploit database classification to improve selection decisions in the presence of 18We measured the time on a PC with a dual AMD Athlon CPU, running at 1.8 GHz. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:27 incomplete content summaries. Next, we present the settings for the experi- mental evaluation of the content summary construction algorithm of Section 3 and of the database selection algorithms of Section 4. 5. EXPERIMENTAL SETTING In this section, we describe the data (Section 5.1), strategies for computing con- tent summaries (Section 5.2), and database selection algorithms (Section 5.3) that we use for the experiments reported in Sections 6 and 7. 5.1 Datasets The content summary construction techniques that we proposed before rely on a hierarchical categorization scheme. For our experiments, we use the classifi- cation scheme from Gravano et al. [2003], with 72 nodes organized in a 4-level hierarchy. To evaluate the algorithms described in this article, we use four datasets in conjunction with the hierarchical classification scheme. These are as follows. ?Controlled. This is a dataset that was also used for evaluating the task of database classification in Gravano et al. [2003]. To construct this dataset, we used postings from Usenet newsgroups where the signal-to-noise ratio was high and where the documents belonged (roughly) to one of the categories of our classification scheme. For example, the newsgroups comp.lang.c and comp.lang.c++ were considered relevant to category ?C/C++.? We collected 500,000 articles from April through May 2000. Out of these 500,000 articles, 81,000 were used to train and test the document classifiers that we used for the Focused Probing algorithm (see Section 5.2.1). We removed all headers from the newsgroup articles, with the exception of the ?Subject? line; we also removed the e-mail addresses contained in the articles. Except for these modifications, we made no changes to the collected documents. We used the remaining 419,000 articles to build the 500 databases in the Controlled dataset. The size of the 500 Controlled databases that we cre- ated ranges from 25 to 25,000 documents. Out of the 500 databases, 350 are homogeneous, with documents from a single category, while the remaining 150 are heterogeneous, with a variety of category mixes. We define a data- base as homogeneous when it has articles from only one node, regardless of whether this node is a leaf node. If it is not, then it has an equal number of articles from each leaf node in its subtree. Heterogeneous databases, on the other hand, have documents from different categories that reside in the same level in the hierarchy (not necessarily siblings), with different mixture percentages. We believe that these databases model real-world searchable web databases, with a variety of sizes and foci. ?TREC4. This is a set of 100 databases created using documents from TREC- 4 [Harman 1996] and separated into disjoint databases via clustering using the K-means algorithm as specified in Xu and Croft [1999]. By construction, the documents in each database are on roughly the same topic. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:28 ? P. G. Ipeirotis and L. Gravano Table III. Some of the Real Web Databases in the Web Dataset URL Documents Classification http://www.bartleby.com/ 375,734 Rootarrowright Artsarrowright Literaturearrowright Texts http://java.sun.com/ 78,870 Rootarrowright Computersarrowright Programmingarrowright Java http://mathforum.org/ 29,602 Rootarrowright Sciencearrowright Mathematics http://www.uefa.com/ 28,329 Rootarrowright Sportsarrowright Soccer ?TREC6. This is a set of 100 databases created using documents from TREC- 6 [Voorhees and Harman 1998] and separated into disjoint databases using the same methodology as for TREC4. ?Web. This set contains the top-5 real web databases from each of the 54 leaf categories of the hierarchy and from each of the 17 internal nodes of the hierarchy19 (except for the root), as ranked in the Google Directory,20 for a total of 315 databases.21 The size of these databases ranges from 100 to about 376,000 documents. Table III lists four example databases. We used the GNU Foundation?s wget crawler to download the HTML contents of each site, and kept only the text from each file by stripping the HTML tags using the lynx ?dump command. We use the Controlled dataset in Section 6 to extensively test the quality of the generated content summaries and to pick the variation of our probing strategy (from Section 3.1) that we will use for our subsequent experiments in Section 7. We also use the Web dataset in Section 6 to further validate results on the quality of the summaries. Finally, we use the TREC4 and TREC6 datasets, both for examining the quality of the content summaries and for testing the performance of the database selection algorithms in Section 7. (The TREC4 and TREC6 datasets are the only ones in our testbed that include queries and associated relevance judgments.) For indexing and searching the files in all datasets, we used Jakarta Lucene,22 an open-source full-text search engine. 5.2 Content Summary Construction Algorithms Our experiments evaluate a number of content summary construction tech- niques, which vary in their underlying document sampling algorithms (Sec- tion 5.2.1) and on whether they use shrinkage and absolute frequency estima- tion (Section 5.2.2). 5.2.1 Sampling Algorithms. We use different sampling algorithms for re- trieving the documents based on which we build the approximate content sum- maries ?(D) of each database D. We now describe the sampling algorithms in detail. ?Query-Based Sampling (QBS). We experimented with the two versions of QBS described in Section 2, namely QBS-Ord and QBS-Lrd. As the initial 19Instead of retrieving the top-5 databases from each category, a plausible alternative is to select a number of databases, from each hierarchy node, that is proportional to the size of the respective hierarchy subtree. In our work, we give equal weight to each category. 20http://directory.google.com/ 21We have fewer than 71 ? 5 = 355 databases because not all internal nodes included at least 5. 22http://lucene.apache.org/ ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:29 dictionary D for these two methods, we used all words in the Controlled databases.23 Eachqueryretrievesupto4previouslyunseendocuments.Sam- pling stops after retrieving 300 distinct documents. In our experiments, sam- pling also stops when 500 consecutive queries retrieve no new documents. To minimize the effect of randomness, we run each experiment over 5 QBS document samples for each database and report average results. ?Focused Probing (FPS). We evaluate our Focused Probing technique, which we introduced in Section 3.1, with a variety of underlying document classi- fiers. The document classifiers are used by Focused Probing to generate the queries sent to the databases. Specifically, we consider the following varia- tions of the Focused Probing technique: ?FP-RIPPER. Focused Probing using RIPPER [Cohen 1996] as the base document classifier. ?FP-C4.5. Focused Probing using C4.5RULES, which extracts classification rules from decision tree classifiers generated by C4.5 [Quinlan 1992]. ?FP-Bayes. Focused Probing using naive-Bayes classifiers [Duda et al. 2000] in conjunction with the technique to extract rules from numerically-based naive-Bayes classifiers from Gravano et al. [2003]. ?FP-SVM. Focused Probing using support vector machines with linear ker- nels [Joachims 1998] in conjunction with the same rule extraction tech- nique used for FP-Bayes. The query probes of these classifiers are typically short: The median query length is 1 word, average query length is 1.35 words, and maximum query length is 4 words. Further details about the characteristics of the classifiers are available in Gravano et al. [2003]. We also consider different values for the taues and tauec thresholds, which affect the granularity of sampling performed by the algorithm (see Section 3.1). All variations were tested with threshold taues ranging between 0 and 1. Low values of taues result in databases being pushed to more categories, which in turn results in larger document samples. To keep the number of experiments manageable, we fix the coverage threshold to tauec = 10, varying only the specificity threshold taues. 5.2.2 Shrinkage and Frequency Estimation. Our experiments also eval- uate the usefulness of our shrinkage (Section 4.2) and frequency estimation (Section 3.2) techniques. To evaluate the effect of shrinkage on content sum- mary quality, we create the shrunk content summary hatwide(D) for each database D and contrast its quality against that of the unshrunk content summary ?(D). Similarly, to evaluate the effect of our frequency estimation technique on con- tent summary quality, we consider the QBS and FPS summaries, both with and without this frequency estimation. We report results on the quality of content summaries before and after the application of our shrinkage algorithm. 23Note that this slightly favors QBS in the experiments over the Controlled databases: The initial dictionary contains a superset of words that appear in each database in the Controlled dataset. Experiments that use the Web, TREC4, and TREC6 datasets are not affected by this bias. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:30 ? P. G. Ipeirotis and L. Gravano To apply shrinkage, we need to classify each database into the 72-node topic hierarchy. Unfortunately, such classification is not available for TREC data, so for the TREC4 and TREC6 datasets we resort to our classification technique from Gravano et al. [2003], which we reviewed briefly in Section 2.3.24 A manual inspection of the classification results confirmed that they are generally accu- rate. For example, the TREC4 database all-83, with articles about AIDS, was correctly classified under the ?Rootarrowright Healtharrowright Diseasesarrowright AIDS? category. In- terestingly, in the case in which databases were not classified correctly, similar databases were still classified into the same (incorrect) category. For example, all-14, all-21, and all-44 are about middle-eastern politics and were classified under the ?Rootarrowright Sciencearrowright Social Sciencesarrowright History? category. Unlike TREC4 and TREC6, for which no ?external? classification of the da- tabases is available, for the Web databases we do not have to rely on query probing for classification; instead we can use the categories assigned to data- bases in the Google Directory. For QBS, the classification of each database in our dataset was indeed derived from the Google Directory. For FPS, we can either use the (correct) Google Directory database classification, as for QBS, or rely on the automatically computed database classification that this technique derives during document sampling. We tried both choices and found only small differ- ences in the experimental results. Therefore, for conciseness, we only report the FPS results for the automatically derived database classification. Finally, for the Controlled dataset, we use the automatically derived classification with taues = 0.25 and tauec = 10. 5.3 Database Selection Algorithms The algorithms presented in this article (Sections 4.1 and 4.2.3) are built on top of underlying ?base? database selection algorithms. We consider three well- known such algorithms from the literature. ?bGlOSS, as described in Gravano et al. [1999]. Databases are ranked for a query q by decreasing score s(q, D) =|D|?producttextwelementq ?(w|D). ?CORI, as described in French et al. [1999]. Databases are ranked for a query q by decreasing score s(q, D) = summationtextwelementq 0.4+0.6?T?I|q| , where T = (?(w|D) ? |D|)/(?(w|D) ?|D|+50 + 150 ? cw(D)mcw ), I = log(m+0.5cf(w) )/ log(m + 1.0), cf(w)is the number of databases containing w, m is the number of databases being ranked, cw(D) is the number of words in D, and mcw is the mean cw among the databases being ranked. One potential problem with the use of CORI in conjunction with shrinkage is that virtually every word has cf(w) equal to the number of databases in the dataset: Every word appears with nonzero probability in every shrunk content summary. Therefore, when we calculate cf(w) for a word w in our CORI experiments, we consider w as present in a database D only when round(|D|? ?R(w|D)) greaterequal 1. ?Language Modeling (LM), as described in Si et al. [2002]. Databases are ranked for a query q by decreasing score s(q, D) = producttextwelementq(lambda ? ?(w|D)+ (1- lambda)? 24We adapted the technique slightly so that each database is classified under exactly one category. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:31 ?(w|G)). The LM algorithm is equivalent to the KL-based database selection method described in Xu and Croft [1999]. For LM, p(w|D) is defined differ- ently than in Definition 2.1. Specifically, p(w|D) = tf(w,D)summationtext i tf(wi,D) , where tf(w, D) is the total number of occurrences of w in D. The algorithms described in Section 4.2 can be easily adapted to reflect this difference, by substituting this definition of p(w|D) for that in Definition 2.1. LM smoothes the ?(w|D) probabilitywiththeprobability ?(w|G)fora?global?category G.Inourexper- iments, we derive the probabilities ?(w|G) from the ?Root? category summary and we use lambda = 0.5, as suggested in Si et al. [2002]. We experimentally evaluate the aforesaid three database selection algo- rithms with three variations: ?Plain. Using unshrunk (incomplete) database content summaries extracted via QBS or FPS. ?Shrinkage.Usingshrinkagewhenappropriate(asdiscussedinSection4.2.3), again over database content summaries extracted via QBS or FPS. ?Hierarchical. Using unshrunk database content summaries (extracted via QBS or FPS) in conjunction with the hierarchical database selection algo- rithm from Section 4.1. Finally,toevaluatetheeffectofourfrequencyestimationtechnique(Section3.2) on database selection accuracy, we consider the QBS and FPS summaries both with and without this frequency estimation. Also, since stemming can help alleviate the data sparseness problem, we consider content summaries both with and without stemming. 6. EXPERIMENTAL RESULTS FOR CONTENT SUMMARY QUALITY In this section, we evaluate alternative content summary construction tech- niques. We first focus on the impact of the choice of sampling algorithm on content summary quality in Section 6.1. Then, in Section 6.2 we show that databases classified under similar categories tend to have similar content sum- maries. Finally, in Section 6.3 we show that shrinkage-based content sum- maries are of higher quality than their unshrunk counterparts. 6.1 Effect of Sampling Algorithm Consider a database D and a content summary A(D) computed using an ar- bitrary sampling technique. We now evaluate the quality of A(D) in terms of how well it approximates the ?perfect? content summary S(D), determined by examining every document in D. In the following definitions, WA is the set of words that appear in A(D), while WS is the (complete) set of words that appear in S(D). Our experiments are over the Controlled dataset. Recall. An important property of content summaries is their coverage of the actual database vocabulary. The weighted recall (wr) of A(D) with respect to S(D) is defined as wr = summationtext welementWAintersectionWS df(w)summationtext welementWS df(w) , which corresponds to the ctf ratio in Callan and Connell [2001]. This metric gives higher weight to more fre- quent words, but is calculated after stopwords (e.g., ?a?, ?the?) are removed, ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:32 ? P. G. Ipeirotis and L. Gravano Fig. 11(a). Weighted recall as a function of the specificity threshold taues and for the Controlled dataset. Fig. 11(b). Unweighted recall as a function of the specificity threshold taues and for the Controlled dataset. so this ratio is not artificially inflated by the discovery of common words. We report the weighted recall for the different content summary construction al- gorithms in Figure 11(a). Variants of the Focused Probing technique achieve substantially higher wr values than do QBS-Ord and QBS-Lrd. Early during probing, Focused Probing retrieves documents covering different topics, and then sends queries of increasing specificity, retrieving documents with more specialized words. As expected, the coverage of Focused Probing summaries increases for lower values of the specificity threshold taues, since the number of ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:33 Fig. 11(c). Spearman rank correlation coefficient as a function of the specificity threshold taues and for the Controlled dataset. Fig. 11(d). Relative error of the df estimations, for words with df > 3, as a function of the specificity threshold taues and for the Controlled dataset. documents retrieved for lower thresholds is larger (e.g., 493 documents for FP- SVM with taues = 0.25 versus 300 documents for QBS-Lrd): A sample of larger size,everythingelsebeingthesame,isbetterforcontentsummaryconstruction. In general, the difference in weighted recall between QBS-Lrd and QBS-Ord is small, but QBS-Lrd has slightly lower wr values due to the bias induced from querying only using previously discovered words. To understand whether low-frequency words are present in the approximate summaries, we resort to ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:34 ? P. G. Ipeirotis and L. Gravano Fig. 11(e). Number of interactions per database as a function of the specificity threshold taues and for the Controlled dataset. the unweighted recall (ur) metric, defined as ur = |WAintersectionWS||W S| . The ur metric is the fraction of words in a database that are present in a content summary. Fig- ure 11(b) shows trends similar to those for weighted recall, but the numbers are smaller, showing that lower-frequency words are not well represented in the approximate summaries. Correlation of word rankings. The recall metric can be helpful to compare the quality of different content summaries. However, this metric alone is not enough, since it does not capture the relative ranks of words in the content summary by their observed frequency. To measure how well a content sum- mary orders words by frequency with respect to the actual word frequency order in the database, we use the Spearman rank correlation coefficient (SRCC for short), which is also used in [Callan and Connell 2001] to evaluate the quality of the content summaries. (We use the version of SRCC that accounts for ties, as suggested by Callan and Connell [2001].) When two rankings are identical, then SRCC = 1; when they are uncorrelated, SRCC=0; and when they are in reverse order, SRCC = ?1. The results for the different algorithms are shown in Figure 11(c). Again, the content summaries produced by Focused Probing techniques have higher SRCC values than those for QBS-Lrd and QBS-Ord, hinting that Focused Probing retrieves a more representative sample of docu- ments from the database. Accuracy of frequency estimations. In Section 3.2, we introduced a technique to estimate the actual absolute frequencies of words in a database. To eval- uate the accuracy of our predictions, we computed the average relative error |df(w)- ? (w)|/df(w) for every word w with actual frequency df(w) > 3 (includ- ing the large tail of less-frequent words would highly distort the relative-error ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:35 computation, even for small estimation errors). Figure 11(d) reports the aver- age relative-error estimates for our algorithms. We also applied our absolute frequency estimation algorithm of Section 3.2 to QBS-Ord and QBS-Lrd, even though this estimation is not part of the original algorithms in Callan and Con- nell [2001]. As a general conclusion, our technique provides a good ballpark estimate of the absolute frequency of the words. Efficiency. To measure the efficiency of the probing methods, we report the sum of the number of queries sent to a database and the number of documents retrieved(?numberofinteractions?)inFigure11(e);seeGravanoetal.[2003]for a justification of this metric. Focused Probing techniques retrieve, on average, one document per query, while QBS-Lrd retrieves about one document for every two queries.25 QBS-Ord unnecessarily issues many queries that produce no document matches. The efficiency of the other techniques is correlated with their effectiveness. More expensive techniques tend to give better results. The exception is FP-SVM, which for taues > 0 has the lowest cost (or cost close to the lowest one) and gives results of comparable quality with respect to the more expensive methods. As discussed earlier, the Focused Probing probes were generally short, with a maximum of four words and a median of one word per query. Recall, rank correlation, and efficiency for identical sample size. We have seen that Focused Probing algorithms achieve better wr and SRCC values than do the QBS-Lrd and QBS-Ord algorithms. However, the Focused Probing al- gorithms generally retrieve a (moderately) larger number of documents than do QBS-Ord and QBS-Lrd, and the number of documents retrieved depends on how deeply into the categorization scheme the databases are classified. To test whether the improved performance of Focused Probing is just a result of larger sample size, we increased the sample size for QBS-Lrd to retrieve the same number of documents as each Focused Probing variant.26 We refer to the versions of QBS-Lrd that retrieve the same number of documents as FP-Bayes, FP-C4.5, FP-RIPPER, and FP-SVM as QBS-Bayes, QBS-C4.5, QBS-RIPPER, and QBS-SVM, respectively. The wr, ur, and SRCC values for the alternative versions of QBS-Lrd are shown in Figures 12(a), 12(b), and 12(c), respectively. We observe that the wr, ur, and SRCC values of the QBS methods improve with a larger document sam- ple, but are still lower than their Focused Probing counterparts; the difference is statistically significant at the 1% level according to a paired t-test. In gen- eral, the results show that Focused Probing methods are more effective than their QBS counterparts: The Focused Probing queries are generated by docu- ment classifiers and tend to ?cover? distinct parts of the document space. In con- trast, QBS methods query the database with words that appear in the retrieved documents, and these documents tend to contain words already present in the 25The average is computed over databases in the Controlled dataset, after the different content summary construction algorithms run to completion. 26We pick QBS-Lrd over QBS-Ord because the latter requires a much larger number of queries to extract its document sample: Most of its queries return no results (see Figure 11(e)), making it the most expensive method. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:36 ? P. G. Ipeirotis and L. Gravano Fig. 12(a). Weighted recall as a function of the specificity threshold taues, for the Controlled dataset and for the case where the FP and QBS methods retrieve the same number of documents; (b) unweighted recall as a function of the specificity threshold taues, for the Controlled dataset and for the case where the FP and QBS methods retrieve the same number of documents. sample. This difference is more pronounced in earlier stages of sampling, where Focused Probing sends more general queries. When Focused Probing starts sending queries for lower levels of the classification hierarchy, both Focused Probing and QBS demonstrate similar rates ofvocabulary growth. The exact ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:37 Fig. 12(c). Spearman rank correlation coefficient as a function of the specificity threshold taues, for the Controlled dataset and for the case where the FP and QBS methods retrieve the same number of documents. point where the two techniques start performing similarly depends on the size of the database. For large databases, Focused Probing dominates QBS even at deep levels of the hierarchy, while for smaller databases the benefits of Focused Probing are only visible during the first and second levels of sampling. Finally, we measured the number of interactions performed by the Focused Probing and QBS methods when they retrieve the same number of documents. The sum of the number of queries sent to a database and the number of docu- ments retrieved (?number of interactions?) is shown in Figure 12(d). The aver- age number of queries sent to each database is larger for the QBS methods than for their Focused Probing counterparts when they retrieve the same number of documents: QBS queries are derived from the already acquired vocabulary, and many of these words appear only in one or two documents, so a large fraction of the QBS queries return only documents that have been retrieved before. These queries increase the number of interactions for QBS, but do not retrieve any new documents. For completeness, we ran the same set of experiments for the Web, TREC4, and TREC6 datasets. We use content summaries extracted from FP-SVM with specificity threshold taues = 0.25 and coverage threshold tauec = 10: We note that FP-SVM exhibits the best accuracy-efficiency tradeoff, while taues = 0.25 leads to good database classification decisions as well (see Gravano et al. [2003]). We also use the respective QBS-SVM version of QBS. The results that we obtained (Table IV) were in general similar to those for the Controlled dataset. The main difference with the results obtained for the Controlled dataset is that the number of interactions is substantially lower for FP-SVM (and hence for ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:38 ? P. G. Ipeirotis and L. Gravano Fig. 12(d). Number of interactions per database as a function of the specificity threshold taues, for the Controlled dataset and for the case where the FP and QBS methods retrieve the same number of documents. Table IV. Weighted Recall, Unweighted Recall, Spearman Rank Correlation Coefficient, and Number of Interactions per Database Metric Dataset Method wr ur SRCC Interactions Web FP-SVM 0.887 0.520 0.813 623 Web QBS-SVM 0.879 0.456 0.810 650 TREC4 FP-SVM 0.972 0.599 0.884 650 TREC4 QBS-SVM 0.943 0.428 0.850 702 TREC6 FP-SVM 0.975 0.662 0.905 684 TREC6 QBS-SVM 0.952 0.545 0.883 694 These results are for the Web, TREC4, and TREC6 datasets and for the case where FP-SVM and QBS-SVM retrieve the same number of documents. QBS-SVM): Databases in the Controlled dataset are typically classified under multiple categories; in contrast, the databases in Web, TREC4, and TREC6 are generally classified under only one or two categories, and hence require much fewer queries for content summary construction than do Controlled databases. Evaluation conclusions. Overall, Focused Probing techniques produce sum- maries of better quality than do QBS-Ord and QBS-Lrd, both in terms of vo- cabulary coverage and word-ranking preservation. The cost of Focused Probing in terms of number of interactions with the databases is comparable to that for QBS-Lrd (for taues > 0), and significantly lower than that for QBS-Ord. Finally, the absolute frequency estimation technique of Section 3.2 gives good ballpark approximations of the actual frequencies. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:39 Fig. 13(a). Weighted recall for pairs of database content summaries, for the Controlled dataset as a function of the number of common categories in the database pairs. 6.2 Relationship Between Content Summaries and Categories A key conjecture behind our database selection algorithms is that databases under the same category tend to have closely related content summaries. Thus, we can use the content summary of a database to complement the (incomplete) content summary of another database in the same category (Section 4.2). We now explore this conjecture experimentally using the Controlled, Web, TREC4, and TREC6 datasets. Each database in the Controlled set is classified using taus = 0.25 and tauc = 10, and following definition in Gravano et al. [2003]. By construction, we know the contents of all databases in the Controlled set, as well as their correct classification. For the Web dataset, we use the database classification as given by Open Directory. Finally, we classify the databases in the TREC4 and TREC6 datasets using the classification algorithm from Gravano et al. [2003], with taues = 0.25 and tauec = 10. Then, for each pair of databases Di and Dj we measure ?the number of categories that they share, numCategories, where numCategories =|Path(Ideal(Di)) intersection Path(Ideal(Dj ))|, and where Ideal(D) is the correct classification of D, Path(Ideal(D)) = {category c |c element Ideal(D), or where c is an ancestor of some n element Ideal(D)}, for taues = 0.25 and tauec = 10; ?the wr, ur, and SRCC values of their correct content summaries. Figures 13(a), 13(b), and 13(c) report the wr, ur, and SRCC metrics, respec- tively, over all pairs of databases in the Controlled set and discriminated by numCategories. The larger the number of common categories between a pair of databases, the more similar their corresponding content summaries tend to be, according to the wr, ur, and SRCC metrics. Tables V(a), V(b), and V(c) report the wr, ur, and SRCC metrics, respectively, over all pairs of databases in the Web, TREC4, and TREC6 datasets, confirming the idea that databases ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:40 ? P. G. Ipeirotis and L. Gravano Fig. 13(b). Unweighted recall for pairs of database content summaries, for the Controlled dataset as a function of the number of common categories in the database pairs. Fig. 13(c). Spearman rank correlation coefficient for pairs of database content summaries, for the Controlled dataset as a function of the number of common categories in the database pairs. classified under similar categories have more similar content summaries than those under different topics. 6.3 Effect of Shrinkage We now report experimental results on the quality of the content summaries generated by the shrinkage technique from Section 4.2. To keep our exper- iments manageable, we use content summaries extracted from FP-SVM with specificity threshold taues = 0.25 and coverage threshold tauec = 10, which give good classification decisions. We also pick QBS-Lrd over QBS-Ord, since the former method demonstrates similar performance at substantially smaller cost than the latter. (See Section 6.1 for a justification of this choice.) For conciseness, we ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:41 Table V(a). Weighted Recall numCategories Dataset 1 2 3 4 Web 0.83 0.89 0.90 0.91 TREC4 0.85 0.89 0.92 0.95 TREC6 0.86 0.88 0.89 0.92 The measure is for pairs of database content summaries as a function of the number of common categories in the database pairs and for the Web, TREC4, and TREC6 datasets. Table V(b). Unweighted Recall numCategories Dataset 1 2 3 4 Web 0.46 0.51 0.53 0.55 TREC4 0.52 0.57 0.59 0.60 TREC6 0.53 0.57 0.58 0.61 The measure is for pairs of database content summaries, as a function of the number of common categories in the database pairs and for the Web, TREC4, and TREC6 datasets. Table V(c). Spearman Rank Correlation Coefficient numCategories Dataset 1 2 3 4 Web 0.50 0.59 0.67 0.69 TREC4 0.52 0.55 0.60 0.70 TREC6 0.52 0.55 0.57 0.62 The measure is for pairs of database content summaries, as a function of the number of common categories in the database pairs and for the Web, TREC4, and TREC6 datasets. now refer to FP-SVM as FPS and to QBS-Lrd as QBS. We evaluate the content summaries using the Controlled, Web, TREC4, and TREC6 datasets. Recall. We used the weighted and unweighted recall metrics to measure vocabulary coverage of shrunk content summaries. The shrunk content sum- maries include (with nonzero probability) every word in any content summary. Most words in any given content summary, however, tend to exhibit a very low probability. Therefore, not to inflate artificially the recall results (and con- versely, not to hurt artificially the precision results), we drop from the shrunk content summaries every word w with round(|D|? ?R(w|D)) < 1 during eval- uation. Intuitively, we drop from the content summary all words that are esti- mated to appear in less than one document. Table VI shows the weighted recall for different content summary construc- tion techniques. Most methods exhibit high weighted recall, which shows that document sampling techniques identify the most frequent words in the data- base. Not surprisingly, shrinkage increases the (already high) wr values and all shrinkage-based methods have close-to-perfect wr. This improvement is statis- tically significant in all cases: A paired t-test [Marques De S? 2003] showed significance at the 0.01% level. The improvement for the Web set is higher compared to that for the TREC4 and TREC6 datasets: The Web set contains ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:42 ? P. G. Ipeirotis and L. Gravano Table VI(a). Weighted Recall wr Sampl. Freq. Shrinkage Dataset Method Est. Yes No QBS No 0.903 0.745 Controlled QBS Yes 0.917 0.745 FPS No 0.912 0.827 FPS Yes 0.928 0.827 QBS No 0.962 0.875 Web QBS Yes 0.976 0.875 FPS No 0.989 0.887 FPS Yes 0.993 0.887 QBS No 0.937 0.918 TREC4 QBS Yes 0.959 0.918 FPS No 0.980 0.972 FPS Yes 0.983 0.972 QBS No 0.959 0.937 TREC6 QBS Yes 0.985 0.937 FPS No 0.979 0.975 FPS Yes 0.982 0.975 Table VI(b). Unweighted Recall ur Sampl. Freq. Shrinkage Dataset Method Est. Yes No QBS No 0.589 0.523 Controlled QBS Yes 0.601 0.523 FPS No 0.623 0.584 FPS Yes 0.638 0.584 QBS No 0.438 0.424 Web QBS Yes 0.489 0.424 FPS No 0.681 0.520 FPS Yes 0.711 0.520 QBS No 0.402 0.347 TREC4 QBS Yes 0.542 0.347 FPS No 0.678 0.599 FPS Yes 0.714 0.599 QBS No 0.549 0.475 TREC6 QBS Yes 0.708 0.475 FPS No 0.731 0.662 FPS Yes 0.784 0.662 larger databases, and the approximate content summaries are less complete than the respective approximate content summaries of TREC4 and TREC6. Our shrinkage technique becomes increasingly useful for larger databases. To understand whether low-frequency words are present in the approximate and shrunk content summaries, we use the unweighted recall metric. Table VI(b) shows that the shrunk content summaries have higher unweighted recall as well. Finally, recall is higher when shrinkage is used in conjunction with the fre- quency estimation technique. This behavior is to be expected: When frequency estimation is enabled, the words introduced by shrinkage are close to their real frequencies, and are used in precision and recall calculations. When frequency ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:43 Table VII(a). Weighted Precision wp Sampl. Freq. Shrinkage Dataset Method Est. Yes No QBS No 0.989 1.000 Controlled QBS Yes 0.979 1.000 FPS No 0.948 1.000 FPS Yes 0.940 1.000 QBS No 0.981 1.000 Web QBS Yes 0.973 1.000 FPS No 0.987 1.000 FPS Yes 0.947 1.000 QBS No 0.992 1.000 TREC4 QBS Yes 0.978 1.000 FPS No 0.987 1.000 FPS Yes 0.984 1.000 QBS No 0.978 1.000 TREC6 QBS Yes 0.943 1.000 FPS No 0.976 1.000 FPS Yes 0.958 1.000 Table VII(b). Unweighted Precision up Sampl. Freq. Shrinkage Dataset Method Est. Yes No QBS No 0.932 1.000 Controlled QBS Yes 0.921 1.000 FPS No 0.895 1.000 FPS Yes 0.885 1.000 QBS No 0.954 1.000 Web QBS Yes 0.942 1.000 FPS No 0.923 1.000 FPS Yes 0.909 1.000 QBS No 0.965 1.000 TREC4 QBS Yes 0.955 1.000 FPS No 0.901 1.000 FPS Yes 0.856 1.000 QBS No 0.936 1.000 TREC6 QBS Yes 0.847 1.000 FPS No 0.894 1.000 FPS Yes 0.850 1.000 estimation is not used, the estimated frequencies of the same words are often below 0.5, and therefore not used in precision and recall calculations. Precision. A database content summary constructed using a document sam- ple contains only words that appear in the database. In contrast, the shrunk content summaries may include words not in the corresponding databases. To measure the extent to which ?spurious? words are added (with high weight) by shrinkage in the content summary, we use the weighted precision (wp)ofA(D) with respect to S(D), wp = summationtext welementWAintersectionWS ?df (w)summationtext welementWA ?df (w) . Table VII(a) shows that shrinkage decreases weighted precision by just 0.8% to 6%. We also report the unweighted precision (up) metric, defined as up = |WAintersectionWS||W A| . This metric reveals how many words introduced in a content summary do not ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:44 ? P. G. Ipeirotis and L. Gravano Table VIII. Spearman Correlation Coefficient Sampl. Freq. Shrinkage Dataset Method Est. Yes No QBS No 0.723 0.628 Controlled QBS Yes 0.723 0.628 FPS No 0.765 0.665 FPS Yes 0.765 0.665 QBS No 0.904 0.812 Web QBS Yes 0.904 0.812 FPS No 0.917 0.813 FPS Yes 0.917 0.813 QBS No 0.981 0.833 TREC4 QBS Yes 0.981 0.833 FPS No 0.943 0.884 FPS Yes 0.943 0.884 QBS No 0.961 0.865 TREC6 QBS Yes 0.961 0.865 FPS No 0.937 0.905 FPS Yes 0.937 0.905 appear in the complete content summary (or, equivalently, in the underlying database). Table VII(b) reports the results for the up metric, which show that shrinkage-based techniques have unweighted precision usually above 90% and always above 84%. Word-ranking correlation. Table VIII shows that SRCC is higher for the shrunk content summaries. In general, SRCC is better for shrunk than for unshrunk content summaries (p < 0.001, according to a paired t-test): Not only do the shrunk content summaries have better vocabulary coverage (as the recall figures show), but also the newly added words tend to be ranked properly. Word-frequency accuracy. Our shrinkage-based algorithm modifies the prob- ability estimates ?(w|D) in the approximate summaries A(D), in order to gen- erate a summary whose probability distribution is closer to that of the original S(D). The KL-divergence compares the similarity of the A(D) estimates against the real values in S(D): KL = summationtextwelementWAintersectionWS p(w|D) ? log p(w|D)?p(w|D), where p(w|D) is de- fined as p(w|D) = tf(w,D)summationtext i tf(wi,D) and tf(w, D) is the total number of occurrences of w in D. The KL metric takes values from 0 to infinity, with 0 indicating that the two content summaries being compared are equal. Table IX shows that shrinkage helps decrease large KL values. (Recall that lower KL values indicate higher-quality summaries.) This is a characteristic of shrinkage [Hastie et al. 2001]: All summaries are shrunk towards some ?com- mon? content summary that has an average distance from all the summaries. This effectively reduces the variance of the estimations and leads to reduced estimation risk. However, shrinkage (moderately) hurts content-summary ac- curacy in terms of the KL metric in cases where KL is already low for the un- shrunk summaries. We use this observation in our shrinkage-based database selection algorithm in Section 4.2.3, where our algorithm attempts to identify those cases where shrinkage is likely to help general database selection accu- racy and avoids applying shrinkage in other cases. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:45 Table IX. KL-Divergence Sampl. Freq. Shrinkage Dataset Method Est. Yes No QBS No 0.364 0.732 Controlled QBS Yes 0.389 0.645 FPS No 0.483 0.542 FPS Yes 0.378 0.503 QBS No 0.361 0.531 Web QBS Yes 0.382 0.472 FPS No 0.298 0.254 FPS Yes 0.281 0.224 QBS No 0.296 0.300 TREC4 QBS Yes 0.175 0.180 FPS No 0.253 0.203 FPS Yes 0.193 0.118 QBS No 0.305 0.352 TREC6 QBS Yes 0.287 0.354 FPS No 0.223 0.193 FPS Yes 0.301 0.126 Evaluation conclusions. The general conclusion from our experiments on con- tent summary quality is that shrinkage drastically improves content summary recall at the expense of precision. The high weighted precision of shrinkage- based summaries suggests that the spurious words introduced by shrinkage appear with low weight in the summaries, which should reduce any potential negative impact on database selection. Next, we present experimental evidence that the loss in precision ultimately does not hurt, since shrinkage improves overall database selection accuracy. 7. EXPERIMENTAL RESULTS FOR DATABASE SELECTION ACCURACY In this section, we evaluate the accuracy of the database selection algorithms that we presented in this article. We first describe our evaluation metric, and then we study the performance of the proposed database selection algorithms under a variety of settings. Just as in Section 6.3, we use FP-SVM (for concise- ness, FPS) with specificity threshold taues = 0.25 and coverage threshold tauec = 10, and QBS-Lrd (for conciseness, QBS) as underlying content summary construc- tion algorithms. Consider a ranking of the databases vector = D1, ..., Dm according to the scores produced by a database selection algorithm for some query q. To measure the ?goodness? or general quality of such a rank, we follow an evaluation method- ology that is prevalent in the information retrieval community, and consider the number of documents in each database that are relevant to q, as deter- mined by a human judge [Salton and McGill 1983]. Intuitively, a good rank for a query includes (at the top) those databases with the largest number of relevant documents for the query. If r(q, Di) denotes the number of Di docu- ments that are relevant to query q, then A(q, vector , k) = summationtextki=1 r(q, Di) measures the total number of relevant documents among the top-k databases in vector .To normalize this measure, we consider a hypothetical, ?perfect? database rank vectorDH = Dh 1 , ..., Dhm in which databases are sorted by their r(q, Dhi ) value. (This ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:46 ? P. G. Ipeirotis and L. Gravano is, of course, unknown to the database selection algorithm.) Then, we define the Rk metric for a query and database rank vector as Rk = A(q, vectorD,k)A(q, vectorD H,k) [Gravano et al. 1999]. A ?perfect? ordering of k databases for a query yields Rk = 1, while a (poor) choice of k databases with no relevant content results in Rk = 0. We note that when a database receives the default score from a database selection al- gorithm (i.e., when the score assigned to a database for a query is equal to that assigned to an empty query), we consider that the database is not selected for searching. This sometimes results in a database selection algorithm selecting fewer than k databases for a query. The Rk metric relies on human-generated relevance judgments for the queries and documents. For our experiments on database selection accuracy, we focus on the TREC4 and TREC6 datasets, which include queries and asso- ciated relevance judgments.27 We use queries 201?250 from TREC-4 with the TREC4 dataset and queries 301?350 from TREC-6 with the TREC6 dataset. The TREC-4 queries are long, with 8?34 words and an average of 16.75 words per query. The TREC-6 queries are shorter, with 2?5 words and an average of 2.75 words per query. We considered eliminating stopwords (e.g., ?the?) from the queries, as well as applying stemming to the query and document words (e.g., so that a query [computers] matches documents with the word ?computing?). While the results improve with stopword elimination, a paired t-test showed that the difference in performance is not statistically significant; therefore, we only report results with stopword elimination. Stemming tends to improve performance for small values of k; the results are mixed when k > 10. Figures 14(a)?14(d) show results for the CORI database selection algo- rithm. We used both the TREC4 and TREC6 datasets and queries, as well as the QBS and FPS content summary construction strategies (Section 5.2). We consider applying CORI over ?unshrunk? content summaries (QBS-Plain and FPS-Plain), using the adaptive shrinkage-based strategy (QBS-Shrinkage and FPS-Shrinkage), and using the hierarchical algorithm (QBS-Hierarchical and FPS-Hierarchical). Figures 15(a)?15(d) show the results for the bGlOSS data- base selection algorithm, while Figures 16(a)?16(d) show the results for the LM database selection algorithm. Overall, a paired t-test shows that QBS-Shrinkage improves the database se- lection performance over QBS-Plain, and this improvement is statistically sig- nificant (p < 0.05). Moreover, FPS-Shrinkage improves the database selection performance relative to FPS-Plain, but this improvement is statistically signif- icant only when k < 10. We now describe the details of our findings. Shrinkage versus plain. The first conclusion from our experiments is that QBS-Shrinkage and FPS-Shrinkage improve performance compared to 27We do not consider the Web and Controlled datasets of Section 5.1 for these experiments because of the lack of relevance judgments for them. In Ipeirotis and Gravano [2002] we presented a pre- liminary evaluation of the hierarchical database selection algorithm of Section 4.1 over a subset of the Web dataset, for a relatively low-scale evaluation. (This evaluation used relevance judgments provided by volunteer colleagues.) The results in Ipeirotis and Gravano [2002] are consistent with those we present here over the TREC data. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:47 Fig. 14(a). The Rk ratio for CORI with stemming over the TREC4 dataset. Fig. 14(b). The Rk ratio for CORI without stemming over the TREC4 dataset. Fig. 14(c). The Rk ratio for CORI with stemming over the TREC6 dataset. Fig. 14(d). The Rk ratio for CORI without stemming over the TREC6 dataset. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:48 ? P. G. Ipeirotis and L. Gravano Fig. 15(a). The Rk ratio for bGlOSS with stemming over the TREC4 dataset. Fig. 15(b). The Rk ratio for bGlOSS without stemming over the TREC4 dataset. Fig. 15(c). The Rk ratio for bGlOSS with stemming over the TREC6 dataset. Fig. 15(d). The Rk ratio for bGlOSS without stemming over the TREC6 dataset. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:49 Fig. 16(a). The Rk ratio for LM with stemming over the TREC4 dataset. Fig. 16(b). The Rk ratio for LM without stemming over the TREC4 dataset. Fig. 16(c). The Rk ratio for LM with stemming over the TREC6 dataset. Fig. 16(d). The Rk ratio for LM without stemming over the TREC6 dataset. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:50 ? P. G. Ipeirotis and L. Gravano Table X. Percentage of Query-Database Pairs for which Shrinkage was Applied Sampl. Database Shrinkage Sampl. Database Shrinkage Dataset Method Selection Application Dataset Method Selection Application bGlOSS 35.42% bGlOSS 33.43% FPS CORI 17.32% FPS CORI 13.12% TREC4 LM 15.40% TREC6 LM 12.78% bGlOSS 78.12% bGlOSS 58.94% QBS CORI 15.68% QBS CORI 14.32% LM 17.32% LM 11.73% QBS-Plain and FPS-Plain, respectively. Shrinkage helps because new words are added in the content summaries in a database- and category-specific man- ner. In Table X, we report the number of times shrinkage was applied for each database-query pair and for each database selection algorithm. Since the queries for TREC6 are shorter, shrinkage was applied comparatively fewer times for TREC6 than for TREC4. Also, shrinkage was applied more frequently for bGlOSS than for LM and CORI. Specifically, bGlOSS does not have any form of smoothing and assigns zero scores to databases whose content summaries do not contain a query word. This results in high variance for the bGlOSS scores, which in turn triggers the application of shrinkage. Interestingly, Table X shows that shrinkage is applied relatively few times overall, yet its impact on database selection accuracy is large, as we have seen. To understand why, note that the Table X figures refer to database-query pairs. We have observed that the application of shrinkage to even a few critical da- tabases for a given query can sometimes dramatically improve the quality of the database rank that is produced for the query. As a real example of this phe- nomenon, consider the TREC-6 query [unexplained highway accidents] and database all-2, which contains 92.5% of all relevant documents for the query. Using the LM algorithm (for both FPS and QBS), database all-2 is ranked 16th, resulting in low Rk values for any k < 16. Our adaptive shrinkage algorithm decides to use shrinkage for this specific database-query pair, and database all-2 is ranked 3rd after application of shrinkage. This results in substantially larger Rk values for the shrinkage-based algorithms for 3 <=< k <=< 15. While our adaptive database selection algorithm applied shrinkage to just 5% of the databases for this query (i.e., for just 5 databases out of 100), the resulting da- tabase rank for the query is significantly better than the rank produced with no shrinkage. In general, even limited applications of shrinkage tend to have significant effect on the Rk ranking: The distribution of relevant documents across databases is typically skewed,28 and only a small number of databases contain the majority of relevant documents. Therefore, by ranking the impor- tant databases accurately, we can substantially improve the database selection performance. Shrinkage versus hierarchical. QBS-Hierarchical and FPS-Hierarchical generally outperform their ?plain? counterparts. This confirms our observation that categorization information helps compensate for incomplete summaries. Exploiting this categorization via shrinkage results in even higher accuracy: 28This effect is not only limited to TREC datasets, but is true for the web at large. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:51 QBS-Shrinkage and FPS-Shrinkage significantly outperform QBS- Hierarchical and FPS-Hierarchical. This improvement is due to the flat nature of our shrinkage method: While QBS-Shrinkage and FPS-Shrinkage can rank the databases globally, QBS-Hierarchical and FPS-Hierarchical have to make irreversible choices at each category level of the hierarchy. Even when a chosen category contains only a small number of databases with relevant doc- uments, the hierarchical algorithm continues to select (irrelevant) databases from the (relevant) category. When a query cuts across multiple categories, the hierarchical algorithm might fail to select the appropriate databases. In contrast, our shrinkage-based approach can potentially select databases from multiple categories and hence manages to identify the appropriate databases for a query, regardless of whether they are similarly classified or not. Adaptive versus universal application of shrinkage. The strategy in Sec- tion 4.2.3 dynamically decides when to apply shrinkage for database selec- tion. To understand whether this decision step is necessary, we evaluated the performance of the algorithms when we always decide to use shrinkage (i.e., when the hatwide(Di) content summary is always chosen in Figure 10). Figures 17(a)? (b) and 18(a)?(b) show the TREC4 results for CORI and bGlOSS, with QBS- Universal and FPS-Universal denoting universal application of shrinkage.29 The only case where QBS-Universal and FPS-Universal are better than QBS- Plain and FPS-Plain, respectively, is for bGlOSS (Figures 18(a) and (b)): Unlike CORI and LM, bGlOSS does not have any form of smoothing already built-in, so if a query word is not present in a content summary, bGlOSS assigns a zero score to the database. Unlike bGlOSS, CORI and LM perform worse when we apply shrinkage universally than when we do so adaptively. The only excep- tion is for content summaries created without the use of stemming and only for smallvaluesofk,buteveninthiscasethesmallimprovementisnotstatistically significant. This result indicates that CORI and LM handle incomplete content summaries in a more graceful way than does bGlOSS, since both CORI and LM have a form of smoothing already embedded. Frequency estimation. We also examined the effect of frequency estimation (Section 3.2) on database selection. Figures 19(a)?(d) show the results for CORI over TREC4 and TREC6. In general, frequency estimation affected only the per- formance of the CORI database selection algorithm, and had only a small effect on the performance of bGlOSS and LM, so we do not show plots for these two techniques.ThereasonisthatbGlOSSandLMrelyonprobabilitiesthatremain virtually unaffected after the frequency estimation step. In contrast, CORI re- lies on document frequencies. Figures 19(a)?(d) show that when shrinkage is used, frequency estimation improves the performance of CORI, by 10%?30% for small values of k, with respect to the case where raw word-frequencies for the document sample are used. Interestingly, frequency estimation alone, that is, without shrinkage, does not improve database selection as much, hinting that more accurate frequency estimates only improve database selection ac- curacy substantially when the underlying content summaries are sufficiently complete. 29The results for TREC6 are similar, while the results for LM are analogous to those for CORI. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:52 ? P. G. Ipeirotis and L. Gravano Fig. 17(a). The Rk ratio for CORI with stemming over the TREC4 dataset, with and without universal application of shrinkage. Fig. 17(b). The Rk ratio for CORI without stemming over the TREC4 dataset, with and without universal application of shrinkage. Evaluation conclusions. A general conclusion from the experiments is that the adaptive application of shrinkage significantly improves database selection when selection decisions are based on sparse content summaries. An interest- ing observation is that the universal application of shrinkage is not always beneficial, indicating that for cases where selection decisions are already accu- rate, shrinkage negatively affects the selection process. Another conclusion is that stemming-based summaries are typically better than their nonstemmed counterparts, since stemming reduces data sparseness. The difference is signif- icant for small numbers of selected databases, which indicates that stemming results in better database rankings. 8. RELATED WORK This section addresses literature relevant to the topics covered in this article. Portions of this article appeared in Ipeirotis and Gravano [2002, 2004]. First, Section 8.1 reviews work on database selection. Then, Section 8.2 discusses re- lated work for content summary construction. Finally, Section 8.3 summarizes various applications of query probing, a technique that we used extensively in this article. 8.1 Database Selection A large body of work has been devoted to distributed information retrieval, or metasearching, over text databases. As we discussed, a crucial task for a ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:53 Fig. 18(a). The Rk ratio for bGlOSS with stemming over the TREC4 dataset, with and without universal application of shrinkage. Fig. 18(b). The Rk ratio for bGlOSS without stemming over the TREC4 dataset, with and without universal application of shrinkage. metasearcher is database selection, which requires that the metasearcher have summaries of the database contents. Early database selection techniques relied on human-generated database descriptions. WAIS [Kahle et al. 1993] uses such descriptions and ranks data- bases according to their similarity to the queries. In Search Broker [Manber and Bigot 1997], each database is manually tagged with two or three category index descriptors. At query time, users specify the query category and then Search Broker selects the appropriate databases. Chakravarthy and Haase, Jr. [1995] use Wordnet [Fellbaum 1998] to complement manually assigned keywords that are used to describe each database for database selection. More robust database selection approaches rely on statistical metadata about the contents of the databases, generally following the type of content summary used in this article. CORI [Callan et al. 1995; Xu and Callan 1998] uses inference networks together with this kind of content summary to select the best databases for a query. (We used CORI for our experiments in Section 7.) GlOSS [Gravano et al. 1999] uses content summaries and selects databases for a query according to some notion of goodness for a query. GlOSS can choose among a variety of definitions of goodness, some of which depend on the re- trieval model supported by the databases. (We used bGlOSS, a variant of GlOSS originally introduced for Boolean databases, for our experiments in Section 7.) Yuwono and Lee [1997] use content summaries and rank databases according ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:54 ? P. G. Ipeirotis and L. Gravano Fig. 19(a). The Rk ratio for CORI with stemming over the TREC4 dataset, for summaries gener- ated with (?-FreqEst?) and without (?-NoFreqEst?) the use of frequency estimation. Fig. 19(b). The Rk ratio for CORI without stemming over the TREC4 dataset, for summaries generated with (?-FreqEst?) and without (?-NoFreqEst?) the use of frequency estimation. Fig. 19(c). The Rk ratio for CORI with stemming over the TREC6 dataset, for summaries gener- ated with (?-FreqEst?) and without (?-NoFreqEst?) the use of frequency estimation. Fig. 19(d). The Rk ratio for CORI without stemming over the TREC6 dataset, for summaries generated with (?-FreqEst?) and without (?-NoFreqEst?) the use of frequency estimation. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:55 to the cue validity of the query words: A query word w has high cue validity for a database D if the probability of observing w in D is comparatively higher than in other databases. Meng et al. [1999, 1998] also rely on content sum- maries to identify databases that contain the highest number of documents similar to a query, and similarity is computed using the cosine similarity met- ric. Meng et al. use a variety of methods to estimate the weight of words in the database, and propose to keep significant covariance statistics for word pairs that appear often together. The storage requirements for the content summaries in Meng et al. [1998] are much higher compared to other meth- ods that ignore the covariance statistics, such as Callan et al. [1995], Xu and Callan [1998], Gravano et al. [1999], and Yuwono and Lee [1997], which we de- scribed before. In a similar approach, Yu et al. [2001] rank databases for a query according to the highest similarity of any document in each database to the query. Baumgarten [1999, 1997] proposes a probabilistic framework for data- base selection and uses content summaries to derive the ?(w|D) probability es- timates that are used during querying. Most approaches that use content sum- maries rely either on access to all documents or on metadata directly exported by the databases, using, for example, a protocol like STARTS [Gravano et al. 1997]. French et al. [1999, 1998], Powell et al. [2000], and Powell and French [2003] present experimental evaluations of database selection algorithms. Their main conclusion is that CORI is robust and performs better than other database se- lection algorithms for a variety of datasets. Results by Xu and Croft [1999] and Si et al. [2002] indicate that a language modeling (LM) approach for database selection works better than CORI for topically focused databases. (We used the LM algorithm for our experiments in Section 7.) Xu and Croft [1999] and Larkey et al. [2000] show that organizing documents by topic helps improve database selection accuracy. Our results in Section 7 are consistent with these findings, since they show that classification-aware database selection algorithms per- form better than algorithms that ignore classification information. Our database selection techniques in Section 4 are built on top of an ar- bitrary ?base? database selection algorithm. We reported experiments using CORI, bGlOSS, and LM. Our experimental results show that our techniques improve database selection?in the face of sparse data?when used in conjunc- tion with a variety of existing flat algorithms. In the future, our techniques can continue to leverage new database selection algorithms that rely on content summaries to make the selection decisions. Another promising direction for future research is to extend our smoothing models for database selection algo- rithms that not only keep a content summary for each database but also exploit the actual documents retrieved during document sampling (e.g., Si and Callan [2005, 2004a, 2003], Hawking and Thomas [2005], and Shokouhi [2007]). Other database selection algorithms rely on hierarchical classification schemes (mostly for ?efficiency?) to direct queries to appropriate categories of the hierarchy [Dolin 1998; Sheldon 1995; Gravano et al. 1999; Choi and Yoo 2001; Yu et al. 1999]. The hierarchical database selection algorithm in Sheldon [1995] uses intentionally small content summaries that contain only the high-frequency terms that characterize each category. The hGlOSS ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:56 ? P. G. Ipeirotis and L. Gravano system [Gravano et al. 1999] focuses on the efficiency of selection and does not exploit any topic similarities of the databases. Similarly, the hierarchical orga- nization in Dolin [1998] focuses on efficiency and does not exploit the clustering of similar databases under the same categories. Fuhr [1999] briefly discusses the hierarchical database selection problem, but no special clustering of sim- ilar databases is considered to improve the hierarchical selection task. The aforementioned hierarchical algorithms also need access to all documents or to metadata directly exported by the databases. Our hierarchical database selec- tion algorithm in Section 4.1 first appeared in Ipeirotis and Gravano [2002]; this algorithm uses a topic hierarchy not only for efficiency, but also for improv- ing the quality of database selection decisions in the presence of sparse content summaries. Other approaches rely on users providing relevance judgments to create a profile of each database. Voorhees et al. [1995] use a set of training queries to learn the usefulness of each database and to decide how many docu- ments to retrieve from each. ProFusion [Gauch et al. 1996] and SavvySearch [Dreilinger and Howe 1997] also exploit historic data to learn the performance of each database for various types of queries. Then, databases that exhibit higher performance for a query are preferred over others. Fuhr [1999] uses a decision-theoretic model to decide whether to use a database and to determine how many documents to retrieve from a selected database. The method in Fuhr [1999] tries to minimize the cost of retrieval and assumes that the precision- recall curves of the underlying retrieval system either are known or can be estimated. 8.2 Constructing Database Content Summaries Unfortunately, hidden-web text databases do not usually export any metadata about their contents nor offer immediate access to them. Callan et al. [2001, 1999] probe databases with semi-random queries to extract content summaries from autonomous databases. (See Section 2 for a detailed discussion of this technique.) We used Callan et al.?s algorithm extensively in our experiments of Sections 6 and 7. Monroe et al. [2002] present and evaluate small variations of the algorithm from Callan et al. [1999] and Callan and Connell [2001]. Craswell et al. [2000] compared database selection algorithms in the presence of incom- plete content summaries extracted using document sampling, and observed that algorithm performance deteriorates with respect to its behavior over com- plete summaries. Sugiura and Etzioni [2000] proposed the Q-Pilot technique, which uses query expansion to route web queries to the appropriate search en- gines. It also characterizes databases using words that appear in the webpages that host the search interfaces, as well as words that appear in other web- pages that link to the databases. We used an adaptation of Q-Pilot for content summary generation in a preliminary experimental evaluation [Ipeirotis and Gravano 2002] of our hierarchical database selection algorithm of Section 4.1. Our experiments showed that the Q-Pilot content summaries are not sufficient for accurate database selection. Hawking and Thistlewaite [1999] used query probing to perform database selection by ranking databases by similarity to a ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:57 given query. Their algorithm assumed that the query interface to the database can handle normal queries and query probes differently, and that the cost to handle query probes is smaller than that for normal queries. A preliminary version of the algorithm of Section 3 for constructing content summaries appeared in Ipeirotis and Gravano [2002]. The frequency estima- tion algorithm in Ipeirotis and Gravano [2002] managed to produce relatively accurate frequency estimates for database words that appear in sample-based content summaries. However, a problem with this algorithm is the assump- tion that the rank of a word in a database coincides with the word?s rank in a document sample extracted from the database. Unfortunately, this assump- tion does not hold in general, and is largely false for words that appear in a database only a relatively small number of times. In Section 3.2, we presented an improved frequency estimation algorithm that addresses this problem and produces significantly more accurate word-frequency estimates for database words that appear only a relatively small number of times. Along a related research direction, Si and Callan [2003] show that database selection performance can be improved by considering database size estimates within their ReDDE database selection algorithm. ReDDE retains the docu- ments retrieved during content summary construction and uses this document sample to estimate the distribution of relevant documents across databases. Further studies by Si and Callan [2004b] show that CORI and LM are only marginally affected when used in conjunction with the database size estima- tion method from Si and Callan [2003]. This result is consistent with the behav- ior that we observed for CORI (without use of shrinkage) with our frequency estimation method. Our content summary construction technique in Section 4.2 appeared in Ipeirotis and Gravano [2004] and is based on the work by McCallum et al. [1998], who introduced a shrinkage-based approach for hierarchical document classification in the face of sparse data. Shrinkage is a form of smoothing, and smoothing has been used extensively in the area of speech recognition [Jelinek 1999] to improve probability estimates in language models. Language mod- eling has also been used for information retrieval [Croft and Lafferty 2003]. Notably, smoothing is present in recent language modeling approaches to infor- mation retrieval [Zhai and Lafferty 2004, 2002, 2001]. An interesting direction for future work is to examine the performance of smoothing models other than shrinkage-based for database selection, especially in the presence of database classification information. Liu et al. [2004] estimate the potential inaccuracy of the database rank pro- duced for a query by a database selection algorithm. If this inaccuracy is unac- ceptably large, then the query is dynamically evaluated on a few carefully cho- sendatabasestoreducetheuncertaintyassociatedwiththedatabaserank.This work does not take content summary accuracy into consideration. In contrast, in Section 4.2.3, we addressed the scenario where summaries are derived from document samples (and are hence incomplete) and decide dynamically whether shrinkage should be applied, without actually querying databases during da- tabase selection. The bulk of Section 4.2.3 appeared originally in Ipeirotis and Gravano [2004], where we described a generic method for computing the mean ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:58 ? P. G. Ipeirotis and L. Gravano and variance of a database score distribution when using sample-based content summaries.ThemethodinIpeirotisandGravano[2004]didnotusethefactthat most database selection algorithms assume independence of the query words. We will exploit this property in Appendix B to substantially improve both the accuracy and runtime performance of the mean-variance computation, which in turn improves substantially the runtime performance of our database selection algorithm of Section 4.2.3. 8.3 Miscellaneous Applications of Query Probing In this article, we used query probing for the extraction of content summaries from text databases. Query probing has helped in other related tasks. Gravano et al. [2003] use a small number of query probes derived using machine learn- ing techniques to categorize a text database in a topic hierarchy. (We briefly reviewed this algorithm in Section 2.3 and used it extensively in subsequent sections.) Perkowitz et al. [1997] use it to automatically understand query forms and to extract information from web databases to build a comparative shopping agent. New forms of crawlers [Raghavan and Garc? 2001] use query probing to automatically interact with web forms and crawl the contents of hidden-web databases. Cohen and Singer [1996] use RIPPER to learn queries that mainly retrieve documents in a specific category. The queries are used at a later time to retrieve new documents in this category. Flake et al. [2002] extract rules from nonlinear SVMs that identify documents with a common characteristic (e.g., ?calls for papers?). The generated rules are used to modify queries sent to a search engine, so that the queries retrieve mostly documents of the desired kind. Grefenstette and Nioche [2000] use query probing to de- termine the use of different languages on the web. The query probes are words that appear only in one language. The number of matches generated for each probe is subsequently used to estimate the number of webpages written in each language. Ghani et al. [2001] automatically generate queries to retrieve doc- uments written in a specific language. Meng et al. [1999] used guided query probing to determine sources of heterogeneity in the algorithms used to index and search locally at each text database. Bergholz and Chidlovskii [2004] probe a database with a carefully selected set of queries to identify the characteris- tics of the query language. Finally, the QXtract system [Agichtein and Gravano 2003] automatically generates queries to improve the efficiency of a given in- formation extraction system, such as Snowball [Agichtein and Gravano 2000] or Proteus [Yangarber and Grishman 1998], over large text databases. Specifi- cally, QXtract learns queries that tend to match those database documents that are useful for the extraction task at hand. The information extraction system can then focus only on these documents, which results in large performance improvements. 9. CONCLUSION Database selection is critical to building efficient metasearchers that interact with potentially large numbers of databases. Exhaustively searching all avail- able databases to answer each query is impractical (or even not possible) in ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:59 increasingly common scenarios. Current database selection algorithms rely on statistical summaries about the contents of the databases, in order to select the best databases for a given query; unfortunately, databases accessible through the web do not generally export these statistics. In this article, we presented ef- ficient algorithms for constructing content summaries for such databases. Our algorithms create content summaries of higher quality than alternative ap- proaches, and additionally categorize databases in a classification scheme. We also presented a shrinkage-based technique that further improves the quality of the generated content summaries. Shrinkage-based content summaries are more complete than their unshrunk counterparts. Our shrinkage-based tech- nique achieves this performance gain efficiently, without requiring any increase in size of the document samples. We also presented techniques for improving the performance of database selection algorithms in the case where database content summaries are de- rived from relatively small document samples. Such summaries are typically incomplete, and this can hurt the performance of database selection algorithms. We showed that classification-aware database selection algorithms can signif- icantly improve the accuracy of selection decisions in the face of incomplete content summaries. Both the hierarchical database selection algorithm of Sec- tion 4.1 and the adaptive, shrinkage-based algorithm of Section 4.2 perform better than their counterparts that do not exploit database classification. Fur- thermore,weshowedthattheshrinkage-basedstrategyoutperformsthehierar- chical database selection algorithm: The hierarchical algorithm initially selects databases under a single subtree of the classification hierarchy, thus failing to select appropriate databases for queries that cut across multiple categories. Shrinkage, on the other hand, embeds the category information in the content summaries. Therefore, a flat database selection algorithm can exploit the clas- sification information without being constrained by the classification hierarchy. APPENDIXES A. ESTIMATING SCORE DISTRIBUTIONS Section 4.2.3 discussed how to estimate the uncertainty associated with a da- tabase score for a query. Specifically, this estimate relies on the probability P of the different possible query keyword frequencies. To compute P, we assume independence of the words in the sample, as P = nproductdisplay k=1 p(dk|sk), where p(dk|sk) is the probability that wk occurs in dk documents in database D, given that it occurs in sk documents in sample S. Using the Bayes rule, we have p(dk|sk) = p(sk|dk)p(dk)summationtext|D| i=0 p(i)p(sk|i) . To compute p(sk|dk), we assume that the presence of each word wk follows a binomial distribution over the S documents, with |S| trials and probability of ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:60 ? P. G. Ipeirotis and L. Gravano success dk|D| for every trial. Then, p(sk|dk) = parenleftbigg|S| sk parenrightbiggparenleftbiggd k |D| parenrightbiggsk parenleftbigg 1 - dk|D| parenrightbigg|S|-sk p(dk|sk) = p(dk) parenleftBig dk |D| parenrightBigsk parenleftBig 1 - dk|D| parenrightBig|S|-sk summationtext|D| i=0 parenleftbigg p(i) parenleftBig i |D| parenrightBigsk parenleftBig 1 - i|D| parenrightBig|S|-sk parenrightbigg. Finally, to compute p(dk) we use the well-known fact that the distribution of words in text databases tends to follow a power law [Mandelbrot 1988]: Ap- proximately cfgamma words in a database have frequency f , where c and gamma are database-specific constants (c > 0, gamma<0). Then, p(dk) = cd gamma ksummationtext |D| i=1 cigamma = d gamma ksummationtext |D| i=1 igamma . Interestingly, gamma = 1B - 1, where B is a parameter of the frequency-rank dis- tribution of the database [Adamic 2002] and can be computed as described in Section 3.2. B. ESTIMATING SCORE VARIANCE The adaptive algorithm in Figure 10 computes the mean and variance of the query-score distribution for a database to decide whether to use shrinkage for the database content summary. In Section 4.2.3, we outlined a method for com- puting the mean and variance relatively efficiently for any arbitrary database selection algorithm. This computation can be made even faster for the large class of database selection algorithms that assume independence of the query words. For example, bGlOSS, CORI, and LM, the database selection algorithms that we used in our experiments, belong to this class. For these algorithms, we can calculate the mean and variance of the subscore associated with each query word separately, and then combine these word-level mean and variance values to compute the final-score mean and variance for the query. We show the deriva- tion of variance30 for bGlOSS and CORI. The computation of variance for LM is similar to the one for bGlOSS.31 Estimating Score Variance for bGlOSS bGlOSS defines the score s(q, D) of a database D for a query q as s(q, D) =|D|? productdisplay welementq ?(w|D). 30Computation of the mean score is simpler and the derivation is analogous to the variance com- putation presented here. 31In the computation of mean and variance for LM, we treat the values of ?p(w|G) as constants, since the variance of the random variable ?(w|G) is negligible compared to that of ?(w|D). ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:61 By definition of variance, we have Var(s(q, D)) = E[s(q, D)2] - (E[s(q, D)])2 = E bracketlefttp bracketleftbt parenleftBigg |D|? productdisplay welementq p(w|D) parenrightBigg2bracketrighttp bracketrightbt - parenleftBigg E bracketleftBigg |D|? productdisplay welementq p(w|D) bracketrightBiggparenrightBigg2 =|D|2 ? E bracketlefttp bracketleftbt parenleftBiggproductdisplay welementq p(w|D) parenrightBigg2bracketrighttp bracketrightbt -|D|2 ? parenleftBigg E bracketleftBiggproductdisplay welementq p(w|D) bracketrightBiggparenrightBigg2 . Since the p(w|D)?s are assumed to be independent, we have E bracketlefttp bracketleftbt parenleftBiggproductdisplay welementq p(w|D) parenrightBigg2bracketrighttp bracketrightbt = productdisplay welementq E bracketleftBig p(w|D)2 bracketrightBig parenleftBigg E bracketleftBiggproductdisplay welementq p(w|D) bracketrightBiggparenrightBigg2 = parenleftBiggproductdisplay welementq E bracketleftbigp(w|D)bracketrightbig parenrightBigg2 . Therefore, Var parenleftbigs(q, D)parenrightbig =|D|2 ? parenlefttp parenleftbtproductdisplay welementq E bracketleftBig p(w|D)2 bracketrightBig - parenleftBiggproductdisplay welementq E bracketleftbigp(w|D)bracketrightbig parenrightBigg2parenrighttp parenrightbt. The mean values of the distributions of p(w|D) and p(w|D)2 can be computed using the results from Appendix A. Since p(w|D) depends only on the frequency sw of the word w in ?(D), we have E bracketleftbigp(w|D)bracketrightbig = |D|summationdisplay i=1 i |D| p(i|sw) E bracketleftBig p(w|D)2 bracketrightBig = |D|summationdisplay i=1 parenleftbigg i |D| parenrightbigg2 p(i|sw) We can see that the variance Varparenleftbigs(q, D)parenrightbig can be computed efficiently because thereisnoneedtoconsiderfrequencycombinations,unlikethecaseforageneric database selection algorithm (see Section 4.2.3). Estimating Score Variance for CORI CORI defines the score s(q, D) of a database D for a query q as s(q, D) = summationdisplay welementq 0.4 + 0.6 ? Tw ? Iw |q| = 0.4 + 0.6 ? summationdisplay welementq Tw ? Iw |q| Tw = p(w|D) ?|D|p(w|D) ?|D|+50 + 150 ? cw(D) mcw , Iw = log parenleftbiggm + 0.5 cf(w) parenrightbigg / log(m + 1.0) where cf(w) is the number of databases containing w, m is the number of data- bases being ranked, cw(D) is the number of words in D, and mcw is the mean ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:62 ? P. G. Ipeirotis and L. Gravano cw among the databases being ranked. For simplicity in our following calcu- lations, we assume that cf(w) and cw(D) are constants, since the variance of their values is small compared to other components of the CORI formula. In this case, Iw is also constant. Then, by definition of variance we have Var(s(q, D)) = E[(s(q, D))2] - (E[s(q, D)])2 = E bracketlefttp bracketleftbt parenleftBigg 0.4 + 0.6 ? summationdisplay welementq Tw ? Iw |q| parenrightBigg2bracketrighttp bracketrightbt - parenleftBigg E bracketleftBigg 0.4 + 0.6 ? summationdisplay welementq Tw ? Iw |q| bracketrightBiggparenrightBigg2 = 0.36|q|2 ? E bracketlefttp bracketleftbt parenleftBiggsummationdisplay welementq Tw ? Iw parenrightBigg2bracketrighttp bracketrightbt - 0.36 |q|2 ? parenleftBigg E bracketleftBiggsummationdisplay welementq Tw ? Iw bracketrightBiggparenrightBigg2 = 0.36|q|2 parenleftBiggsummationdisplay welementq E bracketleftBig T2w ? I2w bracketrightBig + summationdisplay wi,wj elementq,inegationslash= j E bracketleftbigTwi ? Iwi ? Twj ? Iwj bracketrightbig - summationdisplay welementq parenleftBig E [Tw ? Iw] parenrightBig2 - summationdisplay wi,wj elementq,inegationslash= j E bracketleftbigTwi ? Iwi bracketrightbig ? E bracketleftbigTwj ? Iwj bracketrightbig parenrightBigg . By assuming independence of the words w in the query, the variables Twi and Twj are independent if i negationslash= j, and we have summationdisplay wi,wj elementq,inegationslash= j E bracketleftbigTwi ? Iwi ? Twj ? Iwj bracketrightbig = summationdisplay wi,wj elementq,inegationslash= j E bracketleftbigTwi ? Iwi bracketrightbig ? E bracketleftbigTwj ? Iwj bracketrightbig Therefore, Var(s(q, D)) = 0.36|q|2 parenleftBiggsummationdisplay welementq I2w ? parenleftbigEbracketleftbigT2wbracketrightbig - (E[Tw])2parenrightbig parenrightBigg . Again, the distribution of the random variables Tw and T2w can be computed using the results from Appendix A. The mean values of the distributions can be computed efficiently, since there is no need to consider frequency combinations, unlike the case for a generic database selection algorithm (see Section 4.2.3). REFERENCES ADAMIC, L. A. 2002. Zipf, power-laws, and Pareto?A ranking tutorial. http://ginger.hpl.hp. com/shl/papers/ranking/ranking.html. AGICHTEIN,E.AND GRAVANO, L. 2000. Snowball: Extracting relations from large plain-text collec- tions. In Proceedings of the 5th ACM Conference on Digital Libraries (DL). AGICHTEIN,E.AND GRAVANO, L. 2003. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE). BAAYEN, R. H. 2006. Word Frequency Distributions. Springer. BAUMGARTEN, C. 1999. A probabilistic solution to the selection and fusion problem in distributed information retrieval. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 246?253. BAUMGARTEN, C. 1997. A probabilistic model for distributed information retrieval. In Proceed- ings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 258?266. BERGHOLZ,A.AND CHIDLOVSKII, B. 2004. Using query probing to identify query language features on the Web. In Proceedings of the Distributed Multimedia Information Retrieval, SIGIR 2003 ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:63 Workshop on Distributed Information Retrieval, Revised Selected and Invited Papers. Lecture Notes in Computer Science, vol. 2924, Springer, 21?30. BERGMAN, M. K. 2001. The deep Web: Surfacing hidden value. J. Electron. Publ. 7, 1 (Aug.). BERNERS-LEE, T., HENDLER,J.,AND LASSILA, O. 2001. The semantic Web. Sci. Amer. 284,5(May), 34?43. CALLAN,J.P.AND CONNELL, M. 2001. Query-Based sampling of text databases. ACM Trans. Inf. Syst. 19, 2, 97?130. CALLAN,J.P.,CONNELL, M., AND DU, A. 1999. Automatic discovery of language models for text databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 479?490. CALLAN,J.P.,LU, Z., AND CROFT, W. B. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 21?28. CHAKRAVARTHY,A.S.AND HAASE,JR., K. W. 1995. Netserf: Using semantic knowledge to find Internet information archives. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR),4? 11. CHOI,Y.S.AND YOO, S. I. 2001. Text database discovery on the Web: Neural net based approach. J. Intell. Inf. Syst. 16, 1 (Jan.), 5?20. COHEN, W. W. 1996. Learning trees and rules with set-valued features. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI), 8th Conference on Innovative Applications of Artificial Intelligence (IAAI), 709?716. COHEN,W.W.AND SINGER, Y. 1996. Learning to query the Web. In Proceedings of the AAAI Work- shop on Internet-Based Information Systems, 16?25. CRASWELL, N., BAILEY,P.,AND HAWKING, D. 2000. Server selection on the World Wide Web. In Proceedings of the 5th ACM Conference on Digital Libraries (DL). 37?46. CROFT,W.B.AND LAFFERTY, J. 2003. Language Modeling for Information Retrieval. Kluwer Aca- demic. DEMPSTER,A.P.,LAIRD, N. M., AND RUBIN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statis. Soc. B, 39, 1?38. DOLIN, R. A. 1998. Pharos: A scalable distributed architecture for locating heterogeneous infor- mation sources. Ph.D. thesis, University of California, Santa Barbara. DREILINGER,D.AND HOWE, A. E. 1997. Experiences with selecting search engines using metasearch. ACM Trans. Inf. Syst. 15, 3, 195?222. DUDA,R.O.,HART,P.E.,AND STORK, D. G. 2000. Pattern Classification, 2nd ed. Wiley. FELLBAUM, C. 1998. WordNet: An Electronic Lexical Database. MIT Press. FLAKE, G., GLOVER, E., LAWRENCE,S.,AND GILES, C. L. 2002. Extracting query modifications from nonlinear SVMs. In Proceedings of the 11th International World Wide Web Conference (WWW). FRENCH, J. C., POWELL, A. L., CALLAN, J. P., VILES, C. L., EMMITT, T., PREY,K.J.,AND MOU, Y. 1999. Comparing the performance of database selection algorithms. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 238?245. FRENCH,J.C.,POWELL, A. L., VILES, C. L., EMMITT,T.,AND PREY, K. J. 1998. Evaluating database selection techniques: A testbed and experiment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 121? 129. FUHR, N. 1999. A decision-theoretic approach to database selection in networked IR. ACM Trans. Inf. Syst. 17, 3 (May), 229?249. GAUCH, S., WANG,G.,AND GOMEZ, M. 1996. ProFusion*: Intelligent fusion from multiple, dis- tributed search engines. J. Universal Comput. Sci. 2, 9 (Sept.), 637?649. GHANI, R., JONES, R., AND MLADENIC, D. 2001. Using the Web to create minority language corpora. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 279?286. GRAVANO, L., IPEIROTIS,P.G.,AND SAHAMI, M. 2003. QProber: A system for automatic classification of hidden-Web databases. ACM Trans. Inf. Syst. 21, 1 (Jan.), 1?41. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:64 ? P. G. Ipeirotis and L. Gravano GRAVANO, L., GARC?A-MOLINA, H., AND TOMASIC, A. 1999. GlOSS: Text-Source discovery over the Internet. ACM Trans. Database Syst. 24, 2 (Jun.), 229?264. GRAVANO, L., CHANG, K. C.-C., GARC?A-MOLINA, H., AND PAEPCKE, A. 1997. STARTS: Stanford pro- posal for Internet meta-searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 207?218. GREFENSTETTE,G.AND NIOCHE, J. 2000. Estimation of English and non-English language use on the WWW. In Recherche d?Information Assist? par Ordinateur (RIAO). HARMAN, D. 1996. Overview of the Fourth Text REtrieval Conference (TREC-4). In NIST Special Publication 500-236: The 4th Text REtrieval Conference (TREC-4), 1?24. HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. H. 2001. The Elements of Statistical Learning. Springer. HAWKING,D.AND THOMAS, P. 2005. Server selection methods in hybrid portal search. In Proceed- ings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 75?82. HAWKING,D.AND THISTLEWAITE, P. B. 1999. Methods for information server selection. ACM Trans. Inf. Syst. 17, 1 (Jan.), 40?76. IPEIROTIS,P.G.AND GRAVANO, L. 2004. When one sample is not enough: Improving text database selection using shrinkage. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 767?778. IPEIROTIS,P.G.AND GRAVANO, L. 2002. Distributed search over the hidden Web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB), 394?405. JELINEK, F. 1999. Statistical Methods for Speech Recognition. MIT Press. JOACHIMS, T. 1998. Text categorization with support vector machines: Learning with many rel- evant features. In Proceedings of the 10th European Conference on Machine Learning (ECML), 137?142. KAHLE, B., MORRIS, H., GOLDMAN, J., ERICKSON,T.,AND CURRAN, J. 1993. Interfaces for distributed systems of information servers. J. Amer. Soc. Inf. Sci. 44, 8 (Sept.), 453?467. LARKEY, L. S., CONNELL,M.E.,AND CALLAN, J. P. 2000. Collection selection and results merging with topically organized U.S. patents and TREC data. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 282?289. LIU, Z., LUO, C., CHO,J.,AND CHU, W. 2004. A probabilistic approach to metasearching with adap- tive probing. In Proceedings of the 20th IEEE International Conference on Data Engineering (ICDE), 547?559. MANBER,U.AND BIGOT, P. A. 1997. The search broker. In 1st USENIX Symposium on Internet Technologies and Systems (USITS). MANDELBROT, B. B. 1988. Fractal Geometry of Nature. W. H. Freeman. MARQUES DE S?, J. P. 2003. Applied Statistics. Springer. MCCALLUM, A., ROSENFELD, R., MITCHELL,T.M.,AND NG, A. Y. 1998. Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of the 15th International Conference on Machine Learning (ICML), 359?367. MENG, W., YU,C.T.,AND LIU, K.-L. 1999. Detection of heterogeneities in a multiple text database environment. In Proceedings of the 4th IFCIS International Conference on Cooperative Informa- tion Systems (CoopIS), 22?33. MENG,W.,LIU,K.-L.,YU,C.T.,WANG,X.,CHANG,Y.,AND RISHE,N. 1998. Determiningtextdatabases to search in the Internet. In Proceedings of the 24th International Conference on Very Large Databases (VLDB), 14?25. MONROE, G. A., FRENCH,J.C.,AND POWELL, A. L. 2002. Obtaining language models of Web collec- tions using query-based sampling techniques. In Proceedings of the 35th Annual Hawaii Inter- national Conference on System Sciences (HICSS), 67?73. PERKOWITZ, M., DOORENBOS, R. B., ETZIONI,O.,AND WELD, D. S. 1997. Learning to understand information on the Internet: An example-based approach. J. Intell. Inf. Syst. 8, 2 (Mar.), 133?153. POWELL,A.L.AND FRENCH, J. C. 2003. Comparing the performance of collection selection algo- rithms. ACM Trans. Inf. Syst. 21, 4 (Oct.), 412?456. POWELL, A. L., FRENCH,J.C.,CALLAN,J.P.,CONNELL, M., AND VILES, C. L. 2000. The impact of database selection on distributed searching. In Proceedings of the 23rd Annual International ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. Classification-Aware Hidden-Web Text Database Selection ? 6:65 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 232? 239. QUINLAN, J. R. 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann. RAGHAVAN,S.AND GARC?A-MOLINA, H. 2001. Crawling the hidden web. In Proceedings of the 27th International Conference on Very Large Databases (VLDB), 129?138. SALTON,G.A.AND MCGILL, M. J. 1983. Introduction to Modern Information Retrieval. McGraw- Hill. SHELDON, M. A. 1995. Content routing: A scalable architecture for network-based information discovery. Ph.D. thesis, Massachusetts Institute of Technology. SHOKOUHI, M. 2007. Central-Rank-Based collection selection in uncooperative distributed infor- mation retrieval. In Proceedings of the 29th European Conference on IR Research (ECIR). SI,L.AND CALLAN, J. 2005. Modeling search engine effectiveness for federated search. In Proceed- ings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 83?90. SI,L.AND CALLAN, J. 2004a. Unified utility maximization framework for resource selection. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 32? 41. SI,L.AND CALLAN, J. P. 2004b. The effect of database size distribution on resource selection algorithms. In Proceedings of the Distributed Multimedia Information Retrieval, SIGIR 2003 Workshop on Distributed Information Retrieval, Revised Selected and Invited Papers. Lecture Notes in Computer Science, vol. 2924, Springer, 31?42. SI,L.AND CALLAN, J. P. 2003. Relevant document distribution estimation method for resource selection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 298?305. SI, L., JIN, R., CALLAN,J.P.,AND OGILVIE, P. 2002. A language modeling framework for resource selection and results merging. In Proceedings of the ACM Conference on Information and Knowl- edge Management (CIKM), 391?397. SUGIURA,A.AND ETZIONI, O. 2000. Query routing for web search engines: Architecture and exper- iments. In Proceedings of the 9th International World Wide Web Conference (WWW). VOORHEES, E. M., GUPTA, N. K., AND JOHNSON-LAIRD, B. 1995. Learning collection fusion strate- gies. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 172?179. VOORHEES,E.M.AND HARMAN, D. 1998. Overview of the Sixth Text REtrieval Conference (TREC- 6). In NIST Special Publication 500-240: The 6th Text REtrieval Conference (TREC-6), 1?24. XU,J.AND CALLAN, J. P. 1998. Effective retrieval with distributed collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 112?120. XU,J.AND CROFT, W. B. 1999. Cluster-Based language models for distributed retrieval. In Pro- ceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 254?261. YANGARBER,R.AND GRISHMAN, R. 1998. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC). YU,C.T.,MENG, W., WU,W.,AND LIU, K.-L. 2001. Efficient and effective metasearch for text databases incorporating linkages among documents. In Proceedings of the ACM SIGMOD Inter- national Conference on Management of Data (SIGMOD). YU,C.T.,MENG, W., LIU, K.-L., WU,W.,AND RISHE, N. 1999. Efficient and effective metasearch for a large number of text databases. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 217?224. YUWONO,B.AND LEE, D. L. 1997. Server ranking for distributed text retrieval systems on the Internet. In Proceedings of the 5th International Conference on Database Systems for Advanced Applications (DASFAA), 41?50. ZHAI,C.AND LAFFERTY, J. D. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22, 2 (Apr.), 179?214. ZHAI,C.AND LAFFERTY, J. D. 2002. Two-Stage language models for information retrieval. In Pro- ceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 49?56. ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008. 6:66 ? P. G. Ipeirotis and L. Gravano ZHAI,C.AND LAFFERTY, J. D. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 334?342. ZIPF, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley. Received October 2005; revised December 2006, June 2007; accepted June 2007 ACM Transactions on Information Systems, Vol. 26, No. 2, Article 6, Publication Date: March 2008.