Understanding, Estimating, and Incorporating Output Quality Into Join
Algorithms For Information Extraction
Alpa Jain1, Panagiotis Ipeirotis2, AnHai Doan3, Luis Gravano1
1Columbia University, 2New York University, 3University of Wisconsin-Madison
NYU Stern Research Working Paper
CeDER-08-01
Abstract
Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world
applications often require that the output of multiple IE systems be joined to produce the data of interest. To
optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time.
In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution
plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop
a principled approach to understand, estimate, and incorporate output quality into the join optimization process
over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems
used to process the documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual
join algorithm used. Our analysis considers a variety of join algorithms from relational query optimization, and
predicts the output quality ?and, of course, the execution time? of the alternate execution plans. We establish
the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a
large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
1 Introduction
Many unstructured text documents contain valuable data that can be represented in structured form. Information
extraction (IE) systems automatically extract and build structured relations from text documents, enabling the efficient
use of such data in relational database systems. Real-world IE systems and architectures, such as DBLife1, Avatar2,
and UIMA [13], view IE systems as blackboxes and ?stitch? together the output from multiple such blackboxes to
produce the data of interest. A common operation that lies at the heart of these multi-blackbox systems is thus joining
the output from the IE systems. Accordingly, recent work [13, 20, 24] has started to study this important problem of
processing joins over multiple IE systems.
Just as in traditional relational join optimization, efficiency is, of course, important when joining the output of
multiple IE systems. Existing work [13, 24] has thus focused on this aspect of the problem, which is critical because
IE can be time-consuming (e.g., it often involves expensive text processing operations such as part-of-speech and
named-entity tagging). Unlike in the relational world, however, the join output quality is of critical importance, because
different join execution plans might differ drastically in the quality of their output. Several factors influence the output
quality, as we discuss below. The following example highlights one such factor, namely, how errors by individual IE
systems impact the join output quality.
Example 1.1 Consider two text databases, the blog entries from SeekingAlpha (SA), a highly visible blog that discusses
financial topics, and the archive of The Wall Street Journal (WSJ) newspaper. These databases embed information that
can be used to answer a financial analyst?s query (e.g., expressed in SQL) asking for all companies that recently merged,
including information regarding their CEOs (see Figure 1). To answer such a query, we can use IE systems to extract a
MergersangbracketleftCompany, MergedWithangbracketright relation from SA and an ExecutivesangbracketleftCompany, CEOangbracketright relation from WSJ. For Mergers,
we extract tuples such as angbracketleftMicrosoft, Softricityangbracketright, indicating that Microsoft merged with Softricity; for Executives, we
extract tuples such as angbracketleftMicrosoft, Steve Ballmerangbracketright, indicating that Steve Ballmer has been a CEO of Microsoft. After
joining all the extracted tuples, we can construct the information sought by the analyst. Unfortunately, the join result
is far from perfect. As shown in Figure 1, the IE system for Mergers incorrectly extracted tuple angbracketleftMicrosoft, Symantecangbracketright,
1http://dblife.cs.wisc.edu/
2http://www.almaden.ibm.com/cs/projects/avatar/
1
Figure 1: Joining information derived from multiple extraction systems.
and failed to extract tuple angbracketleftMicrosoft, aQuantiveangbracketright. Missing or erroneous tuples, in turn, hurt the quality of join results.
For example, the erroneous tuple angbracketleftMicrosoft, Symantecangbracketright is joined with the correct tuple angbracketleftMicrosoft, Steve Ballmerangbracketright
from the Executives relation to produce an erroneous join tuple angbracketleftMicrosoft, Symantec, Steve Ballmerangbracketright. square
A key observation that we make in this paper is that different join execution plans for extracted relations can differ
vastly in their output quality. Therefore, considering the expected output quality of each candidate plan is of critical
importance, and is at the core of this paper. As we will see, the output quality of a join execution plan depends on (a)
the configuration and characteristics of the IE systems used by the plan to process the text documents, and (b) the
document retrieval strategies used by the plan to retrieve the documents for extraction. Previous work has recognized
the importance of output quality for single-relation extraction [17, 18, 15]. Recent work [20] has also considered these
two factors for choosing a join execution plan over multiple extracted relations.
Unfortunately, the earlier work has failed to consider a third, important factor, namely, (c) the choice of join
algorithm. Our analysis reveals that the choice of join algorithm plays a crucial role in determining the overall output
quality of a join execution plan over extracted relations, just as this choice crucially affects execution time in a
relational model setting. In fact, different join algorithms for extracted relations might sometimes produce join results
of drastically different quality.
To the best of our knowledge, this paper presents the first holistic, in-depth study?incorporating all the above
factors, including the choice of join algorithms?of how to understand, estimate, and incorporate output quality into
the processing of joins for information extraction. Our analysis reveals that even a simple two-way join has a vast
execution plan space, where each execution plan might exhibit unique output quality?and, of course, execution
efficiency?characteristics. This highlights the need for estimation techniques for the output quality of each candidate
join execution plan, to guide the join optimization process in a quality-aware manner.
Our goal in this paper is to develop a principled approach to understand, estimate, and incorporate output quality
into the join optimization process. We examine several important questions: How should we configure the underlying IE
systems? What is the correct balance between precision and recall for the IE systems? Should we retrieve and process
all the database documents, or should we selectively retrieve and process only a small subset? What join execution
algorithm should we use, and what is the impact of this choice on the output quality?
A substantial challenge that we address is defining and extracting appropriate, comprehensive database statistics to
guide the join optimization process in a quality-aware manner. As a key contribution of this paper, we show how to
build rigorous statistical inference techniques to estimate the parameters necessary for our analytical models of output
quality; furthermore, our parameter estimation happens efficiently, on-the-fly during the join execution. As another key
contribution, we develop an end-to-end, quality-aware join optimizer that adaptively changes join execution strategies
if the available statistics suggest that a change is desirable.
In summary, the main contributions of this paper are:
? We introduce a principled approach to estimate the output quality and incorporate it into the join optimization
process over multiple extracted relations.
? We present an end-to-end, quality-aware join optimization approach based on our analytical models, as well as
effective methods to estimate all necessary model parameters.
2
? We establish the accuracy of our output quality models and the effectiveness of our join optimization approach
through an extensive experimental evaluation over real-life text collections and state-of-the-art IE systems.
The rest of this paper is organized as follows. In Section 2, we discuss background on joining extracted relations,
to understand the various factors that impact join quality. Then, in Section 3, we discuss three join algorithms for
extracted relations. In Section 4, we present our core results and introduce analytical models for the output quality of
these join algorithms. In Section 6, we introduce an approach for incorporating output quality into the join optimization
process. We then present our experimental results in Sections 7 and 8. Finally, we review related work in Section 9 and
conclude our paper in Section 10.
2 Understanding Join Quality
In this section, we provide background on the problem of joining relations extracted from text databases via IE systems.
We discuss important aspects of the problem that affect the overall quality of the join results. We define the family of
join execution plans that we consider, as well as user-specified quality preferences.
2.1 Tuning Extraction Systems: Impact on Extraction Quality
Extraction is a noisy process, and the extracted relations may contain erroneous tuples or miss valid tuples. An
extracted relation can be regarded as consisting of good tuples, which are the correctly extracted tuples, and bad
tuples, which are the erroneous tuples. For instance, in Figure 1, Mergers consists of one good tuple, angbracketleftMicrosoft,
Softricityangbracketright, and one bad tuple, angbracketleftMicrosoft, Symantecangbracketright. To control the quality of such extracted relations, IE systems
often expose multiple tunable ?knobs? that affect the proportion of good and bad tuples in the IE output. These knobs
may be decision thresholds, such as the minimum confidence required before generating a tuple from text. We denote a
particular configuration of such IE-specific knobs by theta [21].
Given a knob configuration theta for an IE system, we can robustly characterize the effect of such knob settings for the
IE system over a database D using two values [21]: (a) the true positive rate tp(theta) is the fraction of good tuples that
appear in the IE output over all the good tuples that could be extracted from database D with the IE system across all
possible knob configurations, while (b) the false positive rate fp(theta) is the fraction of bad tuples that appear in the IE
output over all the bad tuples that could be extracted from database D with the IE system across all possible knob
configurations. In practice, we estimate the tp(theta) and fp(theta) values using a ?development set? of documents and a set of
?ground truth? tuples [21].
After computing the angbracketlefttp(theta),fp(theta)angbracketright values for an IE system and for all possible configurations theta, we can keep only
the Pareto optimal configurations, that is, each configuration theta that has a angbracketlefttp(theta),fp(theta)angbracketright value that is not dominated by
other configurations. The resulting configurations correspond to the optimal quality tradeoffs that we can make when
materializing the underlying relation with the IE system in question: we can either pick high values of tp(theta) and high
values of fp(theta), which result in a high-recall, low-precision relation, or vice versa, or anything in between.
2.2 Choosing Document Retrieval Strategies: Impact on Extraction Quality
Analogous to the above classification of the tuples extracted by an IE system from a database of E, we can classify
each document in database D with respect to IE system E as a good document, if E can extract at least one good
tuple from the document, as a bad document, if E can extract only bad tuples from the document, or as an empty
document, if E cannot extract any tuples?good or bad?from the document. Ideally, when processing a text database
with an IE system, we should focus on good documents and process as few empty documents as possible, for efficiency
reasons; we should also process as few bad documents as possible, for both efficiency and output quality reasons. To
this end, several document retrieval strategies have been introduced [17, 18, 19], including the following ones, which we
consider in this paper.
Scan (SC) sequentially retrieves and processes each document in a database. While this strategy is guaranteed to
process all good documents, it also processes all the empty and bad ones, and may then introduce many bad tuples.
Filtered Scan (FS) is another scan-based strategy; instead of processing all available documents, FS uses a document
classifier to decide whether a document is good or not. By avoiding bad documents, Filtered Scan is more efficient than
Scan, and tends to have fewer bad tuples in the output. However, since the classifier may also erroneously reject good
documents, Filtered Scan also suffers from false negatives, and might not include all the good tuples in the output.
3
Automatic Query Generation (AQG) is a query-based strategy that attempts to retrieve good documents from the
database. Automatic Query Generation sends (automatically generated [2]) queries to a database; these queries are
expected to retrieve good documents. This strategy tends to retrieve and process only a small subset of the database
documents, and might have a relatively large number of false negatives.
We now show that we can leverage these document retrieval strategies, which focus on a single relation, and develop
join execution plans that involve multiple extracted relations and, in turn, multiple IE systems.
2.3 Choosing Join Execution Plans: User Preferences and Impact on Extraction Qual-
ity
In this paper, we focus on binary natural joins, involving two extracted relations; we leave higher order joins as future
work. As discussed above, and unlike in the relational world, different join execution plans in our text-based scenario
can differ radically in the quality of the join results that they produce. The output quality is affected by (a) the
configuration of the IE systems used to process the database documents, as argued in Section 2.1, and (b) the document
retrieval strategies used to select the documents for processing, as argued in Section 2.2. Interestingly, (c) the choice of
join algorithms also has an impact on output quality, as we will see. We thus define a join execution plan as follows:
Definition 2.1 [Join Execution Plan] Consider two databases D1 and D2, as well as two IE systems E1 and E2.
Assume Ei is trained to extract relation Ri from Di (i = 1,2). A join execution plan for computing R1 multicloseright R2 is a tuple
angbracketleftE1 angbracketlefttheta1 angbracketright,E2 angbracketlefttheta2 angbracketright,X1, X2,JNangbracketright, where thetai specifies the knob configuration of Ei (see Section 2.1) and Xi specifies the
document retrieval strategy for Ei over Di (see Section 2.2), for i = 1,2, while JN is the choice of join algorithm for
the execution, as we will discuss below.square
Given a join execution plan S over databases D1 and D2, the execution time Time(S,D1,D2) is the total time
required for S to generate the join results from D1 and D2. We identify the important components of execution time
for different join execution plans in Section 4, where we will also analyze the output quality associated with each plan.
We now introduce some additional notation that will be needed in our output-quality analysis in the remainder of
the paper. Recall from Section 2.1 that the tuples that an IE system extracts for a relation can be classified as good
tuples or bad tuples. Analogously, we can also classify the attribute value occurrences in an extracted relation according
to the tuples where these values occur. Specifically, consider an attribute value a and a tuple t in which a appears.
We say that the occurrence of a in t is a good attribute value occurrence if t is a good tuple; otherwise, this is a bad
attribute value occurrence. Note that an attribute value might have both good and bad occurrences. For instance, in
Figure 1 the value ?Microsoft? has both a good occurrence in (good) tuple angbracketleftMicrosoft, Softricityangbracketright and a bad occurrence
in (bad) tuple angbracketleftMicrosoft, Symantecangbracketright. We denote the set of good attribute value occurrences for an extracted relation
Ri by Agi and the set of bad attribute value occurrences by Abi.
Consider now a join R1 multicloseright R2 over two extracted relations R1 and R2. Just as in the single-relation case, the
join TR1multicloserightmulticloseleftR2 contains good and bad tuples, denoted as Tgoodmulticloserightmulticloseleft and Tbadmulticloserightmulticloseleft, respectively. The tuples in Tgoodmulticloserightmulticloseleft are
the result of joining only good tuples from the base relations; all other combinations result in bad tuples. Figure 2
illustrates this point using example relations R1 and R2, with 2 good and 3 bad tuples each. In this figure, we have
Ag1 = {a,c} and Ab1 = {b,d,e} for relation R1, and Ag2 = {a,b} and Ab2 = {x,c,e} for relation R2. The composition
of the join tuples yields |Tgoodmulticloserightmulticloseleft| = 1 and |Tbadmulticloserightmulticloseleft| = 3.
User Preferences: In our extraction-based scenario, there is a natural trade-off between output quality and
execution efficiency. Some join execution plans might produce ?quick-and-dirty? results, while other plans might result
in high-quality results that require a long time to produce. Ultimately, the right balance between quality and efficiency
is user-specific, so our query model includes such user preferences as an important feature. One approach for handling
these user preferences, which we follow in this paper, is for users to specify the desired output quality in terms of the
minimum number taug of good tuples that they request, together with the maximum number taub of bad tuples that they
can tolerate, so that |Tgoodmulticloserightmulticloseleft| greaterequal taug and |Tbadmulticloserightmulticloseleft| <=< taub. Other cost functions can be designed on top of this ?lower
level? model: examples include minimum ?precision? at top-k results, minimum ?recall? at the end of execution, or
even a goal to maximize a weighted combination of precision and recall within a pre-specified execution time budget.
Fundamentally, such alternate quality constraints can be mapped to the (somewhat lower level) model that we study in
this paper. Therefore, for conciseness and clarity, in our work the user quality requirements are expressed in terms of
taug and taub, as discussed above.
4
Figure 2: Composition of R1 multicloseright R2 from extracted relations R1 and R2.
3 Join Algorithms for Extracted Relations
As argued above, the choice of join algorithm is one of the key factors affecting the result quality. (Other factors,
which we have already discussed, are the tuning of the IE systems and the choice of document retrieval strategies;
see Sections 2.1 and 2.2.) We now briefly discuss three alternate join algorithms, which we later analyze in terms
of their output quality and execution efficiency. Following Section 2.3, each join algorithm will attempt to meet the
user-specified quality requirements as efficiently as possible. This goal is then related to that of ripple joins [16] for
online aggregation, which minimize the time to reach user-specified performance requirements. Our discussion of the
alternate join algorithms builds on the ripple join principles.
As we will see, the join algorithms base their stopping conditions on the user-specified quality requirements, given as
taug and taub bounds on the number of good and bad tuples in the join output. Needless to say, the join algorithms do not
have any a-priori knowledge of the correctness of the extracted tuples, so the algorithms will rely on estimates to decide
when the quality requirements have been met (see Section 4). On a related note, the join algorithms might estimate
that the quality requirements cannot be met, in which case the join optimizer may build on the current execution with
a different join execution plan or, alternatively, discard any produced results and start a new execution plan from
scratch (see Section 6).
3.1 Independent Join
The Independent Join algorithm (IDJN) [19] computes a two-way join by extracting the two relations independently
and then joining them to produce the final output. To extract each relation, IDJN retrieves the database documents
by choosing from the document retrieval strategies in Section 2.1, namely, Scan, Filtered Scan, and Automatic Query
Generation.
Figure 3 describes the IDJN algorithm for the settings of Definition 2.1 and for the case where the document retrieval
strategy is Scan. IDJN receives as input the user-specified output quality requirements taug and taub (see Section 2),
and the relevant extraction systems E1 angbracketlefttheta1 angbracketright and E2 angbracketlefttheta2 angbracketright. IDJN sequentially retrieves documents for both relations,
runs the extraction systems over them, and joins the newly extracted tuples with all the tuples from previously seen
documents. Conceptually, the join algorithm can be viewed as ?traversing? the Cartesian product D1 ? D2 of the
database documents, as illustrated in Figure 4: the horizontal axis represents the documents in D1, the vertical axis
represents the documents in D2, and each element in the grid represents a document pair in D1 ? D2, with a dark
circle indicating an already visited element. (The documents are displayed in the order of retrieval.) Figure 4 shows a
?square? version of IDJN, which retrieves documents from D1 and D2 at the same rate; we can generalize this algorithm
to a ?rectangle? version that retrieves documents from the databases at different rates.
The number of documents to explore will depend on the user-specified values for taug and taub, and also on the choice
of the retrieval strategies for each relation. When using Filtered Scan, we may filter out a retrieved document from a
database and, as a result, some portion of D1 ?D2 will remain unexplored. Similarly, when using Automatic Query
Generation, the maximum number of documents retrieved from a database may be limited, resulting in a similar effect.
5
Input: number of good tuples taug, number of bad tuples taub, E1angbracketlefttheta1angbracketright, E2angbracketlefttheta2angbracketright
Output: R1 multicloseright R2
Rj = emptyset /* tuples produced for R1 multicloseright R2 */
Tr1 = emptyset, Tr2 = emptyset /* tuples extracted for R1 and R2 */
Dr1 = emptyset,Dr2 = emptyset /* set of documents retrieved from D1 and D2 */
while {estimated # good tuples in Rj < taug} && {estimated # bad tuples in Rj <=< taub} do
if |Dr1| < |D1| then
Retrieve an unseen document d from D1 and add to Dr1
Process d using E1 at theta1 to generate tuples t1
end
if |Dr2| < |D2| then
Retrieve an unseen document d from D2 and add to Dr2
Process d using E2 at theta2 to generate tuples t2
end
Tjoin = (t1 multicloseright t2) union (Tr1 multicloseright t2) union (t1 multicloseright Tr2)
Tr1 = Tr1 union t1
Tr2 = Tr2 union t2
Rj = Rj union Tjoin
if {|Dr1| = |D1|} && { |Dr2| = |D2|} then
return Rj
end
end
return Rj
Figure 3: The IDJN algorithm using Scan.
Figure 4: Exploring D1 ?D2 with IDJN using Scan.
3.2 Outer/Inner Join
The IDJN algorithm does not effectively utilize any existing keyword-based indexes on the text databases. Existing
indexes can be used to guide the join execution towards processing documents likely to contain a joining tuple. The
next join algorithm, Outer/Inner Join (OIJN), shown in Figure 5, corresponds to a nested-loops join algorithm in the
relational world. OIJN thus picks one of the relations as the ?outer? relation and the other as the ?inner? relation.
(Our analysis in Section 4 can be used to identify which relation should serve as the outer relation in a join execution.)
OIJN retrieves documents for the outer relation using one of the document retrieval strategies (i.e., Scan, Filtered Scan,
or Automatic Query Generation) and processes them using an appropriate extraction system. OIJN then generates
keyword queries using the values for the joining attributes in the extracted outer relation. Using these queries, OIJN
retrieves and processes documents for the inner relation that are likely to contain the ?counterpart? for the already
seen outer-relation tuples.
Figure 6(a) illustrates the traversal through D1 ?D2 for the OIJN algorithm: each querying step corresponds to a
complete probe of the inner relation?s database, which sweeps out an entire row of D1 ?D2. Thus, OIJN effectively
traverses the space, biasing towards documents likely to contain joining tuples from the inner relation, which may result
in an efficient refinement over IDJN. However, the maximum number of documents that can be retrieved via a query
may be limited by the search interface, which, in turn, limits the maximum number of tuples retrieved using OIJN.
The impact of this limit on the number of matching documents is denoted in Figure 6(a) as gray circles that depict
some unexplored area in the D1 ?D2 space.
3.3 Zig-Zag Join
The Zig-Zag Join (ZGJN) algorithm generalizes the idea of using keyword queries from OIJN, so that we can now query
for both relations and interleave the extraction of the two relations; see Figure 7. Starting with one tuple extracted for
one relation, ZGJN retrieves documents?via keyword querying on the join attribute values?for extracting the second
6
Input: number of good tuples taug, number of bad tuples taub, E1angbracketlefttheta1angbracketright, E2angbracketlefttheta2angbracketright
Output: R1 multicloseright R2
Rj = emptyset /* tuples produced for R1 multicloseright R2 */
/* R1 is the outer relation and R2 is the inner relation */
Tr1 = emptyset, Tr2 = emptyset /* tuples extracted for R1 and R2 */
Dr1 = emptyset, Dr2 = emptyset /* sets of documents retrieved from D1 and D2 */
while {estimated # good tuples in Rj < taug} && {estimated # bad tuples in Rj <=< taub} do
Qs = emptyset /* queries for the inner relation */
Retrieve an unseen document d from D1 and add to Dr1
Process d using E1 at theta1 to generate tuples t1
Generate keyword queries from t1 and add to Qs
Tr1 = Tr1 union t1
Rj = Rj union (t1 multicloseright Tr2)
foreach query q element Qs do
Issue q to D2 to retrieve unseen matching documents and add them to Dr2
Process all unprocessed documents in Dr2 using E2 at theta2 to generate tuples t2
Tr2 = Tr2 union t2
Rj = Rj union (Tr1 multicloseright t2)
end
if |Dr1| = |D1| then
return Rj
end
end
return Rj
Figure 5: The OIJN algorithm using Scan for the outer relation.
(a) (b)
Figure 6: Exploring D1 ?D2 with (a) OIJN and (b) ZGJN.
relation. In turn, the tuples from the second relation are used to build keyword queries to retrieve documents for the
first relation, and the process iterates, effectively alternating the role of the ?outer? relation of a nested loops execution
over the two relations. Conceptually, each step corresponds to a sweep of an entire row or column of D1 ?D2, as shown
in Figure 6(b). Similarly to OIJN, ZGJN can efficiently traverse the D1 ?D2 space; however, just as for OIJN, the
space covered by ZGJN is limited by the maximum number of documents returned by the underlying search interface.
4 Estimating Join Quality
Each join execution plan (Definition 2.1) produces join results whose quality depends on the choice of IE system?and
their tuning parameters (see Section 2.1)?, document retrieval strategies (see Section 2.2), and join algorithms (see
Section 3). We now turn to the core of this paper, where we present analytical models for the output quality of the
join execution plans. Specifically, for each execution plan we derive formulas for the number |Tgoodmulticloserightmulticloseleft| of good tuples
and the number |Tbadmulticloserightmulticloseleft| of bad tuples in R1 multicloseright R2 that the plan produces, as a function of the number of documents
retrieved and processed by the IE systems.
4.1 Notation
In the rest of the discussion, we consider two text databases, D2 and D2, with two IE systems E1 and E2, where
extraction system Ei extracts relation Ri from Di (i = 1,2). Table 1 summarizes our notation for the good, bad, and
empty database documents, for the good and bad tuples and attribute values, as well as for their frequency in the
7
Input: number of good tuples taug, number of bad tuples taub, E1angbracketlefttheta1angbracketright, E2angbracketlefttheta2angbracketright, Qseed
Output: R1 multicloseright R2
Rj = emptyset /* tuples produced for R1 multicloseright R2 */
Tr1 = emptyset, Tr2 = emptyset /* tuples extracted from R1 and R2 */
Dr1 = emptyset, Dr2 = emptyset /* sets of documents retrieved from D1 and D2*/
Q1 = Qseed, Q2 = emptyset /* set of queries issued to D1 and D2 */
foreach query q1 element Q1 do
Issue q1 to D1 to retrieve unseen matching documents and add them to Dr1
Process all unprocessed documents in Dr1 using E1 at theta1 to generate tuples t1
Generate keyword queries from t1 and append to Q2
Tr1 = Tr1 union t1
Rj = Rj union (t1 multicloseright Tr2)
foreach query q2 element Q2 do
Issue q2 to D2 to retrieve unseen matching documents and add them to Dr2
Process all unprocessed documents in Dr2 using E2 at theta2 to generate tuples t2
Generate keyword queries from t2 and append to Q1
Tr2 = Tr2 union t2
Rj = Rj union (Tr1 multicloseright t2)
if {estimated # good tuples in Rj greaterequal taug} && {estimated # bad tuples in Rj <=< taub} then
return Rj
end
end
end
Figure 7: The ZGJN algorithm.
extracted relations (see Section 2.3). Additionally, for a natural join attribute A3, we denote the attribute values
common to both extracted relations R1 and R2 as follows: Agg = Ag1 intersection Ag2, Agb = Ag1 intersection Ab2, Abg = Ab1 intersection Ag2, and
Abb = Ab1 intersection Ab2. In Figure 2, for example, these sets are Agg = {a}, Agb = {c}, Abg = {b}, and Abb = {e}.
Symbol Description
Ri Extracted relations (i = 1, 2)
Ei Extraction system for Ri
Xi Document retrieval strategy for Ei
Di Database for extracting Ri
Dgi Set of good documents in Di
Dbi Set of bad documents in Di
Dei Set of empty documents in Di
Dri Set of documents retrieved from Di
Agi Good attribute values in Ri
Abi Bad attribute values in Ri
gi(a) Frequency of a in Dgi
bi(a) Frequency of a in Dbi
Oi(a) Frequency of a in Dri
Table 1: Notation summary.
4.2 Analyzing Join Execution Plans: General Scheme
We begin our analysis by sketching a general scheme to study the output of an execution plan in terms of its number
of good tuples |Tgoodmulticloserightmulticloseleft| and the number of bad tuples |Tbadmulticloserightmulticloseleft|. In later sections, we will instantiate this general
scheme for the various join execution plans that we study.
Consider relations R1 and R2, to be extracted and joined over a single common attribute A, and let a be a value of
join attribute A with g1(a) good occurrences in D14 and g2(a) good occurrences in D2. Suppose that a join execution
retrieves documents Dr1 from D1 and documents Dr2 from D2, and we observe gr1(a) good occurrences of a in Dr1
and gr2(a) good occurrences of a in Dr2. Then, the number of good join tuples with A = a is gr1(a) ? gr2(a) (see
Section 2.3). Generalizing this analysis to all good attribute occurrences common to both relations (i.e., to the values
in Agg) the total number |Tgoodmulticloserightmulticloseleft| of good tuples extracted for R1 multicloseright R2 is:
|Tgoodmulticloserightmulticloseleft| =
summationdisplay
aelementAgg
gr1(a) ? gr2(a) (1)
3Without loss of generality, we assume that we have a single join attribute A. When we have multiple join attributes, we treat their
union as a ?conceptual? single attribute.
4Conceptually, g(a) can be defined in terms of the number of good tuples that contain attribute value a. For simplicity, we assume
that each attribute value appears only once in each document. This simplification significantly reduces the complexity of the statistical
model, without losing much of its accuracy, since most of the attributes indeed appear only once in each document. (We verified the latter
experimentally.)
8
where Agg is the set of join attribute values with good occurrences in both relations (see Section 4.1). Among other
factors, the values of gr1(a) and gr2(a) depend on the frequencies g1(a) and g2(a) of a in the complete databases D1
and D2. As we will see in the next sections, we can estimate the conditional expected frequency E[gri(a)|gi(a)] for
each attribute value given the configuration of the IE systems, the choice of document retrieval strategy, and the choice
of join algorithm. Assuming, for now, that we know the conditional expectations, we have:
E[|Tgoodmulticloserightmulticloseleft|] =
summationdisplay
aelementAgg
E[gr1(a)|g1(a)] ? E[gr2(a)|g2(a)]
In practice, we do not know the exact frequencies g1(a) and g2(a) for each attribute value. However, we can typically
estimate the probability Pr{gi} that an attribute value occurs gi times in an extracted relation, using parametric or
nonparametric approaches (e.g., often the frequency distribution of attribute values follows a power-law [3, 17]). So,
E[|Tgoodmulticloserightmulticloseleft|] = |Agg| ?
|Dg1|summationdisplay
g1=1
|Dg2|summationdisplay
g2=1
E[gr1|g1] ? E[gr2|g2] ? Pr{g1,g2}
The factor Pr{g1,g2} is the probability that an attribute value has g1 good occurrences in D1 and g2 good occurrences
in D2. We make a simplifying independence assumption so that Pr{g1,g2} = Pr{g1} ? Pr{g2}. Alternatively, we can
use the fact that frequent attribute values in one relation are commonly frequent in the other relation as well, and vice
versa. In this scenario, we would have: Pr{g1} approxequal Pr{g2} and Pr{g1,g2} approxequal Pr{g1} approxequal Pr{g2}.
To compute the number |Tbadmulticloserightmulticloseleft| of bad tuples in R1 multicloseright R2 we proceed in the same fashion, with two main
differences: we need to consider three different classes of attributes, namely, Agb, Abg, and Abb, and we need to compute
the expected number of bad attribute occurrences in an extracted relation. Specifically,
|Tbadmulticloserightmulticloseleft| = Jgb + Jbg + Jbb
where Jgb = |Agb| ? summationtext|Dg1|g1=1 summationtext|Db2|b2=1 E[gr1vextendsinglevextendsingleg1] ? E[br2vextendsinglevextendsingleb2] ? Pr{g1,b2}. We can compute values for Jbg and Jbb using |Abg|
and |Abb|, respectively, along with appropriate tuple cardinality values.
Using this generic analysis along with the expected frequencies for the attribute occurrences, we can derive the
exact composition of the join R1 multicloseright R2. We now complete and instantiate this analysis to the alternate join algorithms
of Section 3.
4.3 Independent Join
Our goal is to derive the expected frequency E[gri] for good attribute occurrences and E[bri] for bad attribute
occurrences in Ri after we have retrieved Dri documents from Di, given the frequencies of occurrence in Di (i = 1,2).
As IDJN independently generates each base relation, the analysis for E[gr1] is the same as that for E[gr2], and depends
on the choice of document retrieval strategy and extraction system configuration; similarly, the analysis for E[br1] is
the same as that for E[br2]. Hence, we ignore the relation subindex in the discussion.
We start by computing E[gr] for an attribute value a with g(a) = g good occurrences. We focus only on the good
documents Dg in D, as good occurrences only appear in the good documents. We model a document retrieval strategy
as sampling processes over the good documents Dg [17, 18]. After retrieving Dgr good documents, the probability of
observing k times the good attribute occurrence a in the retrieved documents follows a hypergeometric distribution,
Hyper(|Dg|,|Dgr|,g,k), where Hyper(D,S,g,k) = parenleftbiggkparenrightbig ? parenleftbigD-gS-kparenrightbig/parenleftbigDSparenrightbig.
We process the retrieved documents Dgr using an extraction system E. As E is not perfect, even if we retrieve k
documents that contain the good attribute occurrences, E examines each of the k documents independently and with
probability tp(theta) for each document, outputs the occurrence. So, we will see good occurrences in the extracted tuples
only l <=< k times, and l is a random variable following the binomial distribution. Thus, the expected frequency of a
good attribute occurrence in an extracted relation, after processing j good documents, is:
E[grvextendsinglevextendsingle|Dgr| = j]=
gsummationdisplay
k=0
Hyper(|Dg|,j,g,k) ?
ksummationdisplay
l=0
l ? Bnm(k,l,tp(theta))
where Bnm(n,k,p) = parenleftbignkparenrightbig? pk ? (1-p)n-k is the binomial distribution and g is the frequency of good attribute occurrences
in D. The derivation for bad attribute occurrences E[br] is analogous to that for E[gr], but now a bad attribute
occurrence can be extracted from both good and bad documents in the database.
9
So far, the analysis implicitly assumed that we know the exact proportion of the good and bad documents retrieved,
i.e., |Dgr| and |Dbr|. In reality, though, this proportion depends on the choice of document retrieval strategy. The
effect of a document retrieval strategy was analyzed in [21]. We now discuss this analysis in the context of our setting.
Scan sequentially retrieves documents for E from database D, in no specific order. Therefore, when Scan retrieves
|Dr| documents, E processes |Dgr| good documents, where |Dgr| is a random variable that follows the hypergeometric
distribution. Specifically, the probability of processing exactly j good documents is:
Pr(|Dgr| = j) = Hyper(|D|,|Dr|,|Dg|,j)
where Hyper(D,S,g,k) = parenleftbiggkparenrightbig ? parenleftbigD-gS-kparenrightbig/parenleftbigDSparenrightbig is the hypergeometric distribution. We compute the probability of processing
j bad documents analogously.
Filtered Scan is similar to Scan, except that a document classifier filters out documents that are not good candidates
for containing good tuples. Thus, only documents that survive the classification step will be processed. Document
classifiers are not perfect either, and they are usually characterized by their true positive rate Ctp and false positive
rate Cfp. Intuitively, for a classifier C, the true positive rate Ctp is the fraction of documents in Dg classified as good,
and the false positive rate Cfp is the fraction of documents in Db incorrectly classified as good. Therefore, the main
difference from Scan is that now the probability of processing j good documents after retrieving |Dr| documents from
the database is:
Pr(|Dgr| = j) =
|Dr|summationdisplay
n=0
Hyper(|D|,|Dr|,|Dg|,n) ? Bnm(n,j,Ctp)
We compute the probability of processing j bad documents in a similar way using Cfp.
Automatic Query Generation retrieves documents from D by issuing queries designed to retrieve mainly good
documents. Consider the case where AQG has sent Q queries to the database. If a query q retrieves g(q) documents
and has precision P(q), where P(q) is the fraction of documents retrieved by q that are good, then the probability
that a good document is retrieved by q is P(q)?g(q)|Dg| . Assuming that the queries sent by AQG are only biased towards
documents in Dg, the queries are conditionally independent within Dg. In this case, the probability that a good
document d is retrieved by at least one of the Q queries is:
Prg(d) = 1 -
Qproductdisplay
i=1
parenleftbigg
1 - p(qi) ? g(qi)|Dg|
parenrightbigg
(2)
Since each document is retrieved independently of each other, the number of good documents retrieved (and processed)
follows a binomial distribution, with |Dg| trials and Prg(d) probability of success in each trial.
Pr(|Dgr| = j) = Bnm(|Dg|,j,Prg(d))
The analysis is analogous for the bad documents.
The execution time for an IDJN execution strategy follows from the general discussion above. Consider the case
when we retrieve |Dr1| documents from D1 and |Dr2| documents from D2 using Scan for both relations. In this case
IDJN does not filter or query for any documents, so the execution time is:
Time(S,D1,D2) =
2summationdisplay
i=1
|Dri| ? parenleftbigtiR + tiEparenrightbig
where tiR is the time to retrieve a document from Di and tiE is the time required to process the document using the Ei
IE system. For execution strategies that use Filtered Scan or Automatic Query Generation, we compute the execution
time by considering the time tiF to filter a document, or the time tiQ (together with the number of queries issued |Qsi|)
to send and retrieve the results of a query5.
4.4 Outer/Inner Join
For OIJN, the analysis for the outer relation is the same as that for an individual relation in IDJN: the expected
frequency of (good or bad) occurrences of an attribute value depends solely on the document retrieval strategy and the
extraction system for the outer relation. On the other hand, the expected frequency of the attribute occurrences for
5We make a number of simplifying assumptions here, assuming that the time to retrieve a document, to process it using an extraction
system, and so on is constant across documents. Even though this is rarely true in practice, the approximation is good enough for our
purposes, given the difficulty of knowing such values for each document.
10
the inner relation depends on the number of queries issued using attribute values from the outer relation, as well as on
the extraction system used to process the matching documents. Our analysis focuses on the inner relation; again, we
omit the relation subindex, for simplicity.
Consider again an attribute value a with g(a) good occurrences in the database, and a query q generated from a
which has H(q) document matches and precision P(q), where P(q) is the fraction of documents matching q that are
good. Thus, the set of good documents that can match q is Hg(q) with |Hg(q)| = |H(q)| ? P(q). When we issue q,
the subset of Hg(q) documents returned is limited by the search interface. Specifically, if the search interface returns
only the top-k documents for a query, for a fixed value of k, we expect to see k ? P(q) good documents. An important
observation is that the documents that match q but are not returned in the top-k results can also be retrieved by
queries other than q. Thus, when we issue Qs queries and retrieve Dgr good documents, we can observe a from two
disjoint sets: k ? P(q), the good documents in the top-k answers for q, and the rest, Dgrrest = Dgr -k ? P(q).
If Prq{grq|g(a),q} is the probability that we will observe attribute value a a total of grq times in the top results for
q, and Prr{grrest|g(a),Dgrrest} is the probability that we will observe attribute value a a total of grrest times in the
remainder documents, we have:
E[gr(a)bracketrightbig =
g(a)summationdisplay
l=0
l ? (Prq{l|g(a),q} + Prr{l|g(a),Dgrrest})
For Prq{grq|g(a),q}, we model querying as sampling over Hg(q) while drawing k ? P(q) samples, and derive:
Prq{grq|g(a),q} =
g(a)summationdisplay
i=0
Hyperparenleftbig|Hg(q)|,k ? P(q),g(a),iparenrightbig ? Bg(i,grq)
where Bg(k,l) = Bnm(k,l,tp(theta)).
For Prr{grrest|g(a),Dgrrest}, we observe that the total number g(a) of documents in Dgrrest is the number of
documents containing a minus the good documents that matched the query. Specifically, among documents not retrieved
via the query q, an attribute value can occur gprime(a) times where gprime(a) = g(a) ? |Hg(q)|-k?P(q)|Hg(q)| . We model the process of
retrieving documents for a, using queries other than q, as sampling over Dg while drawing samples Dgrrest, and derive:
Prr{l|g(a),|Dgrrest| = j} =
g(a)summationdisplay
i=0
Hg (gprime(a),i) ? Bg(i,l)
where Hg(k,i) = Hyper(|Dg|,j,k,i). For E[br(a)], i.e., a bad attribute value, we proceed similarly.
Regarding execution time, if we retrieve Dro documents using Scan for the outer relation and send Qs queries for
the inner relation, in turn retrieving Dri documents, the execution time is:
Time(S,D1,D2)=|Dro| ? (toR + toE) + |Dri| ? parenleftbigtiR + tiEparenrightbig + |Qs| ? tiQ
where toR and toE are the times to retrieve and process, respectively, a document for the outer relation, tiR and tiE are
the times to retrieve and process, respectively, a document for the inner relation, and tiQ is the time to issue a query to
the inner relation?s database. The value for |Dro| is determined so that the resulting join execution meets the user
requirements.
4.5 Zig-Zag Join
To analyze the ZGJN algorithm, we define a zig-zag graph consisting of two classes of nodes: attribute nodes (?a?nodes)
and document nodes (?d? nodes), and two classes of edges: hit edges and generates edges. A hit edge A arrowrightD connects
an a node to a d node, and denotes that a generated a hit on d, that is, d matches the query generated using a. A
generates edge d arrowrighta connects a d node to an a node and denotes that processing d generated a.
As an example, consider the zig-zag graph in Figure 8 for joining Mergers and Executives from Example 1.1 on
the Company attribute. We begin with a seed query [?Microsoft?] for Mergers and issue it to the D2 database. This
query hits a document d21. Processing d21 generates tuples for Executives, which contain values Microsoft and AOL
for Company. At this stage, the total number of attributes generated for Executives is determined by the number of
documents that matched the query [?Microsoft?]. Next, we issue the query [?AOL?] to D1, which retrieves documents
d11 and d12. The total number of documents retrieved from D1 is determined by the number of attribute values
generated for Executives in the previous step. Processing d11 for Mergers generates a new attribute value, AOL, which
is used to generate new queries for D2, and the process continues.
The above example shows that the characteristics of a ZGJN execution are determined by the total number of
attribute values and documents that could be reached following the edges on the zig-zag graph. Thus, the structure of
11
Figure 8: Sample zig-zag graph for Mergers multicloseright Executives.
the graph defines the execution time and the output quality for ZGJN. We study the interesting properties of a zig-zag
graph using the theory of random graphs [23]. Specifically, we extend the approach in [17, 18] and use generating
functions to describe the properties of a zig-zag graph. We begin by defining two generating functions, h0(x), which
describes the number of hits for a randomly chosen attribute value, and ga0(x), which describes the number of attributes
generated from a randomly chosen document.
h0(x) =
summationdisplay
k
pak ? xk, ga0(x) =
summationdisplay
k
pdk ? xk
where pak is the probability that a randomly chosen attribute a matches k documents, and pdk is the probability that
a randomly chosen document generates k attributes. To keep the model parameters manageable, we approximate the
distribution for pak with the attribute frequency distribution used by our general analysis (Section 4.2), as the two
distributions tend to be similar. Specifically, we derive the probability that an attribute a matches k documents using
the probability that a is extracted from k documents.
Our goal, however, is to study the frequency distribution of an attribute or a document chosen by following a
random edge. For this, we use the method in [23, 17] and define functions H(x) and Ga(x) that, respectively, describe
the attribute and the document frequency chosen by following a random edge on the zig-zag graph.
H(x) = xh0
prime(x)
h0prime(1), Ga(x) = x
ga0prime(x)
ga0prime(1)
where h0prime(x) is the first derivative of h0(x) and ga0prime(x) is the first derivative of ga0(x). To distinguish between the
relations, we denote the functions using subindices: Hi(x) and Gai(x), respectively, describe the attribute and the
document frequency distributions for Ri (i = 1,2).
We will now derive equations for the number of documents E[|Dr1|] and E[|Dr2|] retrieved from D1 and D2,
respectively, and the number of attribute values E[|Ar1|] and E[|Ar2|] generated for relation R1 and R2, respectively.
For our analysis, we exploit three useful properties, Moments, Power, and Composition of generating functions
(see [23, 17]). The distribution of the total number |Dr2| of documents retrieved from D2 using attributes from R1 can
be described by the function:
Dr2(x) = H1(x)
Further, the distribution of the attribute values generated from a D2 document picked by following a random edge is
given by Ga2(x). Using the Composition property, the distribution of the total number of attribute values generated
from |Dr2| is given by the function:
Ar2(x) = H1(Ga2(x))
The total number |Ar2| of R2 attribute values that will be used to derive the D2 documents is a random variable with
its distribution described by Ar2(x). Furthermore, the distribution of the documents retrieved by an R2 attribute value
picked by following a random edge is described by H2(x). Once again, using the Composition property, we describe the
distribution of the total number of D2 documents retrieved using Ar2 attribute values using the generating function:
Dr1(x) = Ar2(H2(x)) = H1(Ga2(H2(x)))
12
To describe the total number |Ar1| of R1 attribute values derived by processing a Dr1 document, we compose Dr1(x)
and Ga1(x), and define:
Ar1(x) = Dr1(Ga1(x)) = H1(Ga2(H2(Ga1(x))))
Next, we generalize the above functions for Q1 queries sent from R1 attribute values and using the Power property:
Dr2(x) = [H1(x)]|Q1| , Ar2(x) = [H1(Ga2(x))]|Q1|
Finally, we compute the expected values E[|Dr2|] after we have issued Q1 queries using R1 attribute values. For this,
we resort to the Moments property.
E[|Dr2|] =
bracketleftbigg d
dx [H1(x)]
|Q1|
bracketrightbigg
x=1
E[|Ar2|] =
bracketleftbigg d
dx [H1(Ga2(x))]
|Q1|
bracketrightbigg
x=1
We derive values for E[|Dr1|] and E[|Ar1|] in a similar manner.
We derived the total number of attributes E[|Ar1|] and E[|Ar2|] for the individual relations, but we are interested
in the total number of good and bad attribute occurrences generated for each relation. For this, we split the number of
attributes in a relation, using the fraction of good or bad attribute occurrences in a relation. For instance,
E[|gr1|] = E[|Ar1|] ? |Ag1||Ag
1| + |Ab1|
Given the analysis above, we compute the execution time of a zig-zag join that satisfies the user-specified quality
requirements: if we issue |Qsi| queries and retrieve |Dri| documents for relation Ri, i = 1,2, the execution time is:
Time(S,D1,D2) =
2summationdisplay
i=1
|Dri| ? parenleftbigtiR + ?tiEparenrightbig + |Qsi| ? tiQ
The values for |Qs1| and |Qs2| are the minimum values required for the output quality to meet the user-specified quality
requirements.
To summarize, in this section we rigorously analyzed each join algorithm for the various choices of document retrieval
strategies and extraction system configurations. Our analysis resulted in formulas for the join quality composition in
terms of the number of documents retrieved for each relation or the number of keyword queries issued to a database.
Conversely, this analysis can be used to determine these input values for a given output quality.
5 Estimating Model Parameters
To estimate the output quality, our analysis in Section 4 relies on three classes of parameters, namely, the retrieval-
specific parameters, the single-relation-specific parameters, and the join-specific parameters. The retrieval-specific
parameters involve parameters such as E[pg(q)],E[pb(q)], and E[g(q)] for the AQG queries, or the classifier properties
Ctp and Cfp for FS, or E[|H(q)|] and E[P(q)] for OIJN. The single-relation-specific parameters on which our analysis
relies are, |Dgi|, |Dbi|, |Dei|, for each extracted relation Ri, as well as the frequency distribution of the good and bad
attributes in the database and the document degree distribution (see Section 4.5). Finally, the join-specific parameters
are |Agg|,|Agb|,|Abg|, and |Abb|. Of these three classes of parameters, , the retrieval-specific parameters can be easily
estimated in a pre-execution, offline step [17, 21]. On the other hand, estimating the other family of parameters is a
more challenging task, which we discuss next.
Our parameter estimation method builds on the MLE-based estimation methods presented in [21]. Specifically, we
begin with estimating the parameters for each individual relation, namely, |Dg|, |Db|, |De|, along with the attribute
frequency distribution and document degree distribution. For this, we follow the PT-Iter-MLE estimation method
from [21]; the following discussion can be easily extended for other estimation methods from [21].
One of the fundamental parameters required by our analysis is the frequency of each attribute occurrence in the
database (e.g., g(a) and b(a) for an attribute a) and the document degree for each document in the database (i.e.,
number of attributes generated, denoted as a(d) from a document d). Following our estimation approach for a single
relation, we rely on the fact that the attribute frequencies for both category of attributes (good and bad) as well as the
document degree tend to follow a power law distribution, as we will see later in Section 7. Therefore, for the random
13
variable g(a), which represents the good attribute occurrence frequency, and the random variable g(a) + b(a), which
represents the bad attribute occurrence frequency, we have:
Pr{g(a) = i} = ibetagazeta(betaga) (3)
Pr{g(a) + b(a) = i} = ibetabazeta(beta
ba)
(4)
where betaga and betaba are the exponents of the power law distributions for the frequencies of good and bad attribute
occurrences, respectively. Similarly, for the random variable a(d) which represents the attributes generated from a
document, the probability mass function is given as:
Pr{a(d) = i} = i
betad
zeta(betad) (5)
where betad is the exponent of the power law for the document degree distribution. Knowing the distribution of the
frequencies of good and bad attribute occurrences can greatly facilitate the estimation task: our goal now is to derive
values for betaga, betaba, and betad.
Given a relation that can be extracted from database D, we begin with retrieving and processing documents from D.
After processing the retrieved documents Dr, we observe some tuples and their frequencies in the retrieved documents.
Following the PT-Iter-MLE method, we partition these observed tuples [21] into good and bad tuples, and based on
this classification we numerically derive the values for |Dg|, |Db|, and |De|.
Estimating betaga,betaba, and betad: To estimate values for betaga, we focus on the good attribute occurrences derived from
processing Dr, i.e., on the attribute values that are associated with the good tuples observed in Dr. To this end, we use
a maximum-likelihood-based estimation approach similar to PT-Iter-MLE [21], but with attribute occurrences instead
of tuples. Consider a good attribute occurrence a, and assume that the number of documents in Dr that generate a is
gs(a). We need to identify for a, its frequency in the entire database (i.e., g(a)), since there may be other database
documents not yet processed that can generate a. For this, we rely on the analysis from Section 4. Specifically, given a
good attribute occurrence a, the most likely frequency g(a) for a in the entire database is the one that maximizes the
following probability:
Pr{g(a)vextendsinglevextendsinglegs(a)} = Pr{gs(a)
vextendsinglevextendsingleg(a)} ? Pr{g(a)}
Pr{gs(a)} (6)
Since Pr{gs(a)} is constant across all values for g(a) in the above equation, we can ignore it for the maximization task.
To derive the factor, Pr{gs(a)vextendsinglevextendsingleg(a)}, which is the probability that we observe a good attribute occurrence gs(a) times
when it occurs g(a) times in the database, we use Equation 4.3 from our analysis. Finally, we estimate Pr{g(a)} using
Equation 3. Applying these factors to Equation 6, we iterate over the two steps of the PT-Iter-MLE approach until we
converge on a final value for betaga [21].
For bad attribute occurrences, we derive betaba in a similar fashion. Given a bad attribute occurrence a that was
observed bs(a) times we estimate the most likely frequency that maximizes the probability of observing bs(a), and
iteratively identify betaba.
The task of estimating the distribution parameter betad for the document degree is relatively simple. Following the
estimation approach discussed [21], as we retrieve database documents and process them using the maximum-sensitivity
setting of the associated extraction system. Thus, for each document in Dr, we know exactly the total number of
attributes in that document and, thus, we can directly fit a power law to the observed document degrees. Specifically,
given documents d1, d2, ? ? ? , d|Dr| in Dr, from which the number of attributes we derived are a(d1), a(d2),? ? ? ,a(d|Dr|),
respectively, we identify the value of a power law distribution parameter betad that maximizes the probability of observing
the number of attributes per document, a(di), using the likelihood function:
l{betadvextendsinglevextendsinglea(di)} =
|Dr|productdisplay
i=1
a(di)-betad
zeta(betad)
We numerically derive betad using the derivative of the log-likelihood function [21].
Estimating |Agg|,|Agb|,|Agb|, and |Abb|: The final step in our parameter estimation process is to derive the join-
specific parameters, namely |Agg|,|Agb|,|Abg|, and |Abb|. For this, we numerically solve the equations from Section 4
for deriving the expected number of good or bad tuples after processing documents Dr. Specifically, in Equation 4.2,
we showed how we can estimate E[|Tgoodmulticloserightmulticloseleft|] given |Agg| and other parameters discussed above. Conversely, for the
estimation task, we observe the value for E[|Tgoodmulticloserightmulticloseleft|] and numerically solve Equation 4.2 to derive |Agg|. To derive
the other values, namely, |Agb|,|Agb|, and |Abb|, we proceed analogously.
14
Figure 9: Sample reachability graph for Executives based on the zig-zag graph in Figure 8
6 Incorporating Output Quality into Join Optimization
In Section 3, we discussed three join algorithms and rigorously analyzed each algorithm in Section 4. Our analysis
predicts the execution time and the output quality of each join execution strategy and thus, allows us to generate
quality curves [21] for join execution strategies. Then, in Section 5, we showed how we can estimate the necessary
parameters for our analysis.
Using our analytical models along with the estimation techniques, we can build a quality-aware optimizer to process
join queries for a given user-specified quality requirement. Specifically, our optimization approach takes as input the
user-provided minimum number of good tuples taug and the maximum number of bad tuples taub, and selects an execution
strategy to efficiently meet the desired quality level. The optimizer begins with an initial choice of execution strategy.
As the initial strategy progresses, the optimizer derives the necessary parameters and determines a desirable execution
strategy for taug and taub, while checking for robustness of its decision using cross-validation.
A fundamental task in the optimization process is to identify the Cartesian space to explore for a given quality
requirement. Exhaustively ?plugging in,? for each database, D in our output quality model of Section 4 all possible
values for |Dr| (0,...,|D|) or |Qs| (0,...,|Ag| +|Ab|) is inefficient, so instead we resort to a simple heuristic to minimize
the sum of documents retrieved and processed (and hence the total execution time), conditioned on the product of the
number of occurrences of good attribute values in each relation. Specifically, we aim to reduce the difference between
the number of documents retrieved for each relation, since intuitively we are minimizing the sum of two numbers,
conditioned on their product. Thus, we select the number of documents for each database to be as close as possible.
Conceptually, this heuristic follows a ?square? traversal of the Cartesian space D1 ?D2 (see Section 3).
6.1 Reachability of ZGJN algorithm
A peculiarity of ZGJN is that it solely depends on documents retrieved and processed for a relation to ?generate? new
queries to retrieve documents for the other relation. These retrieved documents, in turn, govern the extent to which
we discover new attributes for the first relation. Essentially, the success of the ZGJN algorithm depends on whether
this query-based join execution ?reaches? all (or most of) the documents (and thus attributes) in each database. In
this section, we will examine the reachability of ZGJN in order to understand whether our proposed algorithm can
successfully ?reach? all the documents in each database. Our study builds upon the query-based reachability model
proposed by Agichtein et al. [3] for the case of single relations.
In Section 4.5, we introduced the zig-zag graph which describes the process of querying and retrieving documents
for each relation. However, when studying the reachability of a database, our goal is to examine the extent to which
new attributes from each individual relations are reached. For this, we define a reachability graph for each database
involved in the join. Interestingly, the reachability graph can be derived by ?folding? the zig-zag graph.
Definition 6.1 [Reachability Graph] We define the reachability graph RG(T, E) of a database with respect to the
ZGJN as a graph consisting of attributes from a single relation as nodes with edges such that a directed edge ai arrowrighttj
means that the attribute value aj occurs in a document that can be retrieved using the attribute value ai. square
Figure 9 shows the reachability graph for the database associated with the Mergers relation using the zig-zag graph
in Figure 8. As shown in Figure 9, RG contains five nodes, one for each attribute value, with an edge originating
from the node associated with the attribute value, Microsoft, to the node associated with the attribute value, AOL.
15
This edge was placed because using Microsoft as a query on the database for Executives relation, we could retrieve
documents that, in turn, generated queries for the Merges relation that retrieved documents containing the attribute
AOL. Similarly, there is a directed edge between the nodes associated with AOL and Merck. The figure also shows two
nodes with no incoming or outgoing edges as the documents associated with these attributes cannot be reached via
querying.
Earlier, Agichtein et al. [3] formalized the components of a reachability graph using the ?bowtie? structure in [5].
Specifically, the directed reachability graph consists of (a) the strongly connected component, Core, (b) the nodes not
in Core from which we can reach the nodes in Core via a directed path, In, and (c) the nodes not in Core or In which
can be reached from the Core via a directed path, Out. Given this general shape of the reachability graph, we are
interested in studying the size of the connected components in the graph. For this, we rely on the knowledge that
the reachability graph of a text database belongs to the general family of power law graphs. For power law graphs,
the biggest strongly connected component is denoted as the giant component, and it is well-known that if the giant
component emerges then the remaining connected components are expected to be small with the distribution of their
sizes following a power law. (We will show this later in our experiments.) Knowing the distribution of the vertex
degrees of the reachability graph, allows us to directly apply existing results. Specifically, Aiello et al. [4] showed how
we can predict the size of the giant component using only the average degree of the vertices in a graph, for a class of
power law graphs. The size of the giant component is an interesting property as it gives us an idea on the attributes
that can be reached by a ZGJN execution: intuitively, an execution that involves at least one attribute that belongs to
the Core, can reach all the other attributes in the Core and the Out components. We quantify the reachability for a
text database using the reachability metric introduced by Agichtein et al. [3]. Specifically, we define reachability as the
fraction of the nodes A that belong to the Core and Out components of the giant component CRG of a graph RG:
reachability = |Core(CRG)| + Out(CRG)|A| (7)
As discussed above, given a power law graph if a giant component emerges the rest of the components are expected
to be small. Therefore, we use a reachability metric based on the giant component, and now we are interested in
estimating the size of the giant component of a reachability graph. Chung and Lu [7] showed that for the class of power
law graphs with their exponent beta as beta < 3.475, we can derive the size of the giant component using the average degree
d of the vertices in a graph G. Following [3], we compute the relative size of the giant component as:
|CG|
|A| greaterequal
bracelefttp
braceexbraceexbraceleftmid
braceexbraceexbraceleftbt
1/d ?
parenleftBig
1 - 2radicalde
parenrightBig
if d greaterequal e
1/d ?
parenleftBig
1 - 1+logdd
parenrightBig
if 1 < d < e
0 if 0 < d <=< 1
(8)
As noted in [3], these equations can be applied for the case of directed graphs where d is the average outdegree of
RG. Furthermore, the value derived using the above equations may overestimate the reachability as it focuses on the
complete giant component as opposed to only focusing on the size of the Core and the Out components as defined in
Equation 7.
Next, our experimental evaluation show that the analytical models build for the various components of a join
execution strategies in this chapter accurately predict the output quality of each execution strategy, which, in turn,
allows us to build an effective quality-aware join optimization approach.
7 Experimental Settings
We now describe the settings for the experiments in Section 8, focusing on the IE systems, text collections, and the
retrieval strategies used.
IE Systems: We trained Snowball [1] for three relations: ExecutivesangbracketleftCompany,CEOangbracketright, Headquarters angbracketleftCompany,Locationangbracketright,
and Mergers(Company, MergedWith). We refer to these relations as EX, HQ, and MG, respectively. For theta (Section 2.1),
we picked minSim, a tuning parameter exposed by Snowball, which is the similarity threshold for extraction patterns
and the terms in the context of a candidate tuple.
Data Set: We used a collection of newspaper articles from The New York Times from 1995 (NYT95) and 1996
(NYT96), and from The Wall Street Journal (WSJ). The NYT96 database contains 135,438 documents and we used it
to train the extraction systems and the retrieval strategies. To evaluate the effectiveness of our approach, we used a
subset of 49,527 documents from NYT96, 50,269 documents from NYT95, and 98,732 documents from WSJ. For each
16
(a) (b)
Figure 10: Good (a) and bad (b) attribute frequency distribution for HQ.
(a) (b)
Figure 11: Good (a) and bad (b) attribute frequency distribution for EX.
relation and data set, we verified that the attribute and document frequency distributions tend to be power law for
both good and bad tuples. Figures 10 shows the tuple frequency distributions of both good (Figure 10(a)) and bad
tuples (Figure 10(b)) for HQ. Similarly, Figure 11 shows the attribute frequency distribution for EX.
Retrieval Strategies: For FS, we used a rule-based classifier created using Ripper [9]. For AQG, we used QXtract [2],
which relies on machine learning techniques to automatically learn queries that match documents with at least one
tuple. In our case, we train QXtract to only match good documents, avoiding the bad and empty ones.
Tuple Verification: To verify whether a tuple is good or bad, we follow the template-based approach described in [20].
Additionally, we also use a web-based ?gold? set from www.thompson.com.
Join Task: We defined a variety of join tasks involving combinations of the three relations and the three databases.
For our discussion, we will focus on the task of computing the join HQ multicloseright EX, with NYT96 and NYT95 as the hosting
databases for HQ and EX, respectively.
Join Execution Strategies: To generate the join execution strategies for a task, we explore various candidates for
individual relations and combine them using the three join algorithms of Section 3. For each relation, we generate
single-relation strategies by using two values for minSim (i.e., 0.4 and 0.8) and combining each such configuration with
the three document retrieval strategies.
Metrics: To compare the execution time of an execution plan chosen by our optimizer against a candidate plan, we
measure the relative difference in time by normalizing the execution time of the candidate plan by that for the chosen
17
(a) (b)
Figure 12: Distribution of the number of times an attribute value is generated from a document for (a) good and (b)
bad attributes.
plan. Specifically, we note the relative difference as tcto , where tc is the execution time for a candidate plan and to is the
execution time for the plan picked by the optimizer.
8 Experimental Results
We now discuss our experimental results. We begin with establishing the accuracy of our analytical models for estimating
the output quality for join execution plans. We then evaluate the effectiveness of our optimization approach.
Accuracy of the Analytical Models:
In our analysis, we define the frequency of an attribute occurrence (e.g., g(a) for a good attribute occurrence a) in
terms of the number of good tuples that contain an attribute value. For this, we assumed that each attribute occurs
only once in a document, which significantly reduced the complexity of the statistical models. First, we verify this
assumption for our relations: Figure 12 shows the distribution of the number of times that each good attribute value
is extracted from each document (Figure 12(a)). This figure validates our simplifying assumption: most attribute
occurrences are extracted just once from each document. (Figure 12(b)) shows the corresponding figure for bad attribute
occurrences.)
We now turn to verifying the accuracy of our Section 4 analysis. For this, we assumed perfect knowledge of the
various database-specific parameters: we used the actual frequency distributions for each attribute along with the
values for |Dg|, |Db|, and |De| for each database. Given a join execution strategy, we first estimate the output quality
of the join, i.e., E[|Tgoodmulticloserightmulticloseleft|] and E[|Tbadmulticloserightmulticloseleft|], using the appropriate analysis from Section 4, while varying values
for the number of retrieved documents from the database, i.e., |Dr1| and |Dr2|. For each |Dr1| and |Dr2| value, we
measure the actual output quality for an execution strategy. Figure 13 shows the actual and the estimated values for
the good (Figure 13(a)) and the bad (Figure 13(b)) join tuples generated using IDJN, Scan for both relations, and
minSim = 0.4. Similarly, Figure 14 shows the same results for OIJN when using Scan as the retrieval strategy for the
outer relation and minSim = 0.4 for both relations. Then, Figure 15 compares the estimated and the actual values
for ZGJN, for minSim = 0.4. We performed similar experiments for all other execution strategies. Additionally, we
also examined the accuracy of the estimated number of documents for query-based join algorithms, i.e., for OIJN and
ZGJN. Figure 16 shows the expected and the actual number of documents retrieved for varying number of queries
issued to each database, for ZGJN.
Overall, our estimates are either close to the actual values or follow similar trends as the actual values, thus confirming
the accuracy of our analysis. Of these observations, we discuss the case for bad tuples for OIJN (Figure 14(b)) and
ZGJN (Figure 15(b)), where our model overestimates the number of bad tuples. This overestimation can be traced
to a few outlier cases. To gain insight into this, we compared the expected and the actual number of bad attribute
occurrences. We observed four main cases where our estimated values were more than two orders of magnitude greater
than the actual values. These attribute values frequently appeared in the database but were not extracted by the
extraction system at the minSim setting used in our experiments. For instance, one such bad attribute occurrence,
18
(a) (b)
Figure 13: Estimated and actual number of (a) good tuples and (b) bad tuples for HQ multicloseright EX, using IDJN with Scan
and minSim = 0.4.
?CNN Center,? appears 895 and 2765 times in HQ and EX, respectively. When using OIJN and processing 50% of the
database documents for the outer relation, our estimated frequencies of the bad occurrences of this value was 28.3 and
29.7 times, respectively; in reality, this attribute value was not extracted, thus resulting in an overestimate of 812 join
tuples. This difference is further expanded for ZGJN due to a modeling choice: we assume that all queries used in
ZGJN will match some documents and the execution will not stall. We can account for stalling by incorporating the
reachability of a ZGJN execution based on the single-relation analysis in [17, 18].
We further break down our evaluation and study the estimated number of join tuples that we will observe for
each attribute occurrence after processing |Dr| documents. Specifically, given |Dr|, we use our analysis from Section 4
to estimate, for each attribute occurrence, the expected number of join tuples that we will observe in the output
after processing |Dr| documents. For each attribute occurrence, we also derive the actual number of join tuples that
we observe in the output. Given the estimated and the actual number of join tuples that we observe, we studied
the distribution of the estimation error computed as the number of actual observations minus the estimated number
of observations, across all attribute occurrences. Figure 17 shows this distribution for the case of good attribute
occurrences for IDJN (Figure 17(a)), OIJN (Figure 17(b)), and ZGJN (Figure 17(c)); Figure 18 shows the corresponding
graphs for bad attribute occurrences and IDJN (Figure 18(a)), OIJN (Figure 18(b)), and ZGJN (Figure 18(c)).
In general, we observe that the estimation error is centered around zero and, more importantly, for a significant
fraction of attribute occurrences the estimated error is 0 (note that the y-axis is on a log-scale). For good attribute
occurrences, often the estimated value is an overestimate. On the other hand, for bad attribute occurrences there were
relatively more numerous cases of overestimation due to the reasons discussed above. Thus, the observations from this
analysis are largely in line with our previous observations on the accuracy of the overall estimation model.
Reachability of Zig-Zag Join: Next, we study the reachability of the ZGJN algorithm (Section 6.1) using the
analysis in [3]. For this, we construct a reachability graph [3] for EX. The analysis in [3] relies on the knowledge that
reachability graphs of a text database belong to the general family of power law graphs. As a first step, we verify
whether the reachability graph is a power law graph. Figure 19 reports the outdegree distribution of the nodes in the
reachability graph for different number of matching documents retrieved for each query issued during the execution. We
observed that a power law distribution indeed best fits the data. Similarly, we also examine the size of the connected
components in the reachability graph [3]. Figure 20 reports the distribution of the size of the connected components for
different number of matching documents retrieved for each query issued during the execution.
As the next step, we study the estimated and the actual reachability of the graph using the reachability metric
defined in [3]. Specifically, we estimate this metric using the analysis presented in [3] and also derive the actual
reachability. Figure 21 reports the estimated and actual values for the reachability for different number of matching
documents retrieved for each query issued during the execution. As seen in the figure, in general, the ZGJN execution
can successfully reach a significant fraction of the attributes in the database, as denoted by a value of above 70% when
more than 5 documents are retrieved per query. Furthermore, the estimated value, as noted in [3], is an overestimate
of the actual value. (Refer to [3] for a detailed discussion on the reasons for this overestimation.) In summary, the
19
(a) (b)
Figure 14: Estimated and actual number of (a) good tuples and (b) bad tuples for HQ multicloseright EX, using OIJN with Scan
and minSim = 0.4.
reachability analysis in [3] allows us to examine whether a ZGJN-based execution for the given databases is desirable.
Effectiveness of the Optimization Approach: After verifying our modeling, we studied the effectiveness of our
optimization approach, which uses our models along with the parameter estimation process outlined in Section 6.
Specifically, we examine whether the optimizer picks the fastest execution strategy for a given output quality requirement.
For this, we provided the optimizer with two thresholds, taug and taub, which specify the minimum number of good tuples
requested and the maximum allowable bad tuples for the resulting join.
To evaluate the choice of execution strategy for a specified taug and taub pair, we compare the execution time for the
chosen plan S against that of the alternate executions plans that also meet the taug and taub requirements. Table 2 shows
the results for HQ multicloseright EX, for varying taug and taub. For each taug and taub pair, we show the number of candidate plans that
meet the taug and taub requirement. Furthermore, we show the number of candidate plans that result in faster executions
than the plan chosen by our optimizer and the number of candidate plans that result in slower executions than the
chosen plan. Finally, to highlight the difference between the associated execution times, we show the range of relative
difference in time for both faster and slower execution plans.
As shown in the table, our optimizer selects OIJN for low values of taug and taub, and progresses towards selecting
IDJN coupled with AQG or FS, eventually picking IDJN coupled with SC for high values of taug and taub. For most cases,
our optimizer selects an execution strategy that is the fastest strategy or close to the fastest strategy, as indicated by
having either no candidates with faster executions than the chosen plan or a small number of such executions. For
cases where the chosen plan is not the fastest option, the execution time of the faster candidates is close to the one
of the chosen plan, as indicated by the relative difference values (e.g., a value of 1 indicates the execution times for
both the candidate and the chosen plans were identical). An important observation is that the plans eliminated by the
optimizer were an order of magnitude (10 to 35 times) slower than the chosen plans.
An intriguing outcome of our experiments is that the choices for execution strategies do not involve ZGJN.
Interestingly, for our test data set, ZGJN is not a superior choice of execution algorithm as compared to other
algorithms. Intuitively, ZGJN does not specifically focus on filtering out any bad documents; therefore, ZGJN does
not meet the quality requirements as closely as other query-based strategies that use IDJN or OIJN along with AQG
or FS. Furthermore, the maximum number of tuples that can be extracted using ZGJN is limited, which makes it a
poor choice for higher values of taug and taub. ZGJN would be a competing choice for scenarios involving databases that
only provide query-based access (e.g., search engines or hidden-Web databases) and also for cases where the generated
queries match a relatively large number of good documents. Extending ZGJN to derive queries that focus on good
documents remains interesting future work.
20
(a) (b)
Figure 15: Estimated and actual number of (a) good tuples and (b) bad tuples for ZGJN.
(a) (b)
Figure 16: Estimated and actual number of documents retrieved by (a) HQ and (b) EX for ZGJN.
9 Related Work
Information extraction from text has received much attention in the database, AI, Web, and KDD communities (see
[8, 12] for tutorials). The majority of the efforts have considered the construction of a single extraction system that is as
accurate as possible (e.g., using HMMs and CRFs [8, 22]). Approaches to improve the efficiency of the IE process have
developed specialized document retrieval strategies; one such approach is the QXtract system [2], which uses machine
learning to derive keyword queries that identify documents rich in target information. We use QXtract in our work.
Related efforts to this paper are [17, 18] and [21], but this earlier work studied the task of extracting just one
relation, not our join problem. Specifically, in [17, 18] we studied the document retrieval strategies considered in this
paper for the goal of efficiently achieving a desired recall for a single-relation extraction task. The analysis in [17, 18]
assumes a perfect extraction system (i.e., all generated tuples are good). On the other hand, in [21] we studied document
retrieval strategies for single-relation extraction while accounting for extraction imprecision. Our current work builds
upon the statistical models presented in [17, 18, 21], extending them for multiple extraction systems.
Real-world applications often require multiple extraction systems [12, 24]. Hence, the problem of developing and
optimizing IE programs that consist of multiple extraction systems has received growing attention. Some of the existing
solutions write such programs as declaratively as possible (e.g., UIMA [13], GATE [10], Xlog [24]), while considering
only the execution time in their analysis.
21
(a) (b) (c)
Figure 17: Distribution of the estimation error, on a log-scale, for a good attribute occurrence using (a) IDJN, (b)
OIJN, and (c) ZGJN.
(a) (b) (c)
Figure 18: Distribution of the estimation error, on a log-scale, for a bad attribute occurrence using (a) IDJN, (b) OIJN,
and (c) ZGJN.
(a) (b) (c)
Figure 19: The outdegree distribution of the reachability graph for Executives when the maximum number of documents
retrieved for each query is (a) 5, (b) 10, and (c) 50.
22
(a) (b) (c)
Figure 20: The component size distribution of the reachability graph for Executives when the maximum number of
documents retrieved for each query is (a) 5, (b) 10, and (c) 50.
Figure 21: Estimated and actual reachability for Executives for different number of matching documents retrieved per
query issued during an execution.
23
Criteria Candidate Chosen plan # Faster # Slower Relative time range Relative time range
plans plans plans for faster plans for slower plans
taug taub JN theta1 theta2 X1 X2 min max min max
1 20 46 OIJN 0.4 0.4 FS (OIJN) 5 36 0.68 0.80 1.20 27.48
2 30 46 OIJN 0.8 0.4 AQG (OIJN) 10 32 0.19 0.75 1.78 11.91
2 50 47 OIJN 0.8 0.4 AQG (OIJN) 11 33 0.19 0.75 1.78 11.91
4 20 39 OIJN 0.4 0.4 FS (OIJN) 3 29 0.34 0.34 1.59 35.76
4 40 42 OIJN 0.4 0.4 FS (OIJN) 3 37 0.34 0.34 1.59 35.76
8 40 40 OIJN 0.8 0.4 AQG (OIJN) 3 33 0.19 0.19 1.15 22.20
8 80 44 OIJN 0.8 0.4 AQG (OIJN) 4 38 0.19 0.19 1.15 22.20
16 50 26 IDJN 0.4 0.4 FS AQG - 21 - - 1.22 11.62
16 80 36 IDJN 0.4 0.4 FS AQG 3 30 0.66 0.94 1.10 11.62
16 160 39 IDJN 0.4 0.4 FS AQG 3 34 0.66 0.94 1.10 11.62
32 84 26 IDJN 0.4 0.4 FS AQG - 22 - - 1.26 13.30
32 160 36 OIJN 0.8 0.4 AQG (OIJN) - 35 - - 1.55 20.62
32 320 40 OIJN 0.8 0.4 AQG (OIJN) - 39 - - 1.55 20.62
64 320 35 IDJN 0.8 0.4 AQG AQG - 34 - - 1.50 27.16
64 640 41 IDJN 0.8 0.4 AQG AQG - 40 - - 1.50 27.16
128 640 21 IDJN 0.4 0.4 FS AQG - 20 - - 1.19 9.41
128 1280 26 IDJN 0.4 0.4 FS AQG - 25 - - 1.19 9.41
256 1280 14 IDJN 0.4 0.4 SC AQG - 13 - - 1.18 2.89
256 2560 18 IDJN 0.4 0.4 SC AQG - 17 - - 1.01 2.89
512 1024 1 IDJN 0.8 0.8 SC SC - - - - - -
512 2560 3 IDJN 0.8 0.4 SC SC - 2 - - 1.02 1.15
512 5120 4 IDJN 0.4 0.4 FS SC - 3 - - 1.46 1.69
1024 5120 2 IDJN 0.8 0.4 SC SC 1 - 0.99 0.99 - -
1024 10240 2 IDJN 0.8 0.4 SC SC 1 - 0.99 0.99 - -
Table 2: Choice of execution strategies for varying output quality requirements expressed as taug and taub thresholds
(Section 2.3), and comparing the execution time of the chosen strategy against that of alternative execution strategies
that also meet the taug and taub requirements, for HQ multicloseright EX.
In prior work [20], we presented a query optimization approach for simple SQL queries involving joins while
accounting for both execution time and output quality. Our earlier paper considered only one simple heuristic to
estimate the quality of one simple join algorithm, namely, the IDJN algorithm, discussed and analyzed in this paper
(Section 3). Our current work substantially expands on [20] by modeling an extended family of join algorithms and
showing how to pick the best option dynamically. To the best of our knowledge, our current work is the first to carry
out an in-depth output quality analysis of a variety of join execution plans over multiple extraction systems.
Our work is also related to research on the ?loose integration? of relational DBMSs and text databases (e.g., [6, 14, 11]),
generally achieved via careful querying of the text databases. Our focus is different, on the explicit extraction of
structured data from plain-text documents, which requires that we understand and model the output quality that
results from the extraction and join processing plans, in addition to their extraction efficiency.
10 Conclusions
In this paper, we addressed the important problem of optimizing the execution of joins of relations extracted from
natural language text. As a key contribution of our paper, we developed rigorous analytical models to analyze the
output quality of a variety of join execution strategies. We also showed how to use our models to build a join optimizer
that attempts to minimize the time to execute a join while reaching user-specified result quality requirements. We
demonstrated the effectiveness of our optimizer for this task with an extensive experimental evaluation over real-world
data sets. We also established that the analytical models presented in this paper demonstrate a promising direction
towards building fundamental blocks for processing joins involving information extraction systems.
References
[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In DL, 2000.
24
[2] E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, 2003.
[3] E. Agichtein, P. G. Ipeirotis, and L. Gravano. Modeling query-based access to text databases. In WebDB, 2003.
[4] W. Aiello, F. Chung, and L. Lu. A random graph models for massive graphs. 2000.
[5] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In WWW,
pages 309?320, 2000.
[6] S. Chaudhuri, U. Dayal, and T. W. Yan. Join queries with external text sources: execution and optimization techniques. In SIGMOD, 1995.
[7] F. Chung and L. Lu. Connected components in random graphs with given degree sequences. Annals of Combinatorics, 6:125?145, 2002.
[8] W. Cohen and A. McCallum. Information extraction from the World Wide Web (tutorial). In KDD, 2003.
[9] W. W. Cohen. Learning trees and rules with set-valued features. In IAAI, 1996.
[10] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: An architecture for development of robust HLT applications. In ACL,
2002.
[11] S. Dessloch and N. Mattos. Integrating SQL databases with content-specific search engines. In VLDB, 1997.
[12] A. Doan, R. Ramakrishnan, and S. Vaithyanathan. Managing information extraction (tutorial). In SIGMOD, 2003.
[13] D. Ferrucci and A. Lally. UIMA: An architectural approach to unstructured information processing in the corporate research environment. In
Nat. Lang. Eng., 2004.
[14] R. Goldman and J. Widom. WSQ/DSQ: A practical approach for combined querying of databases and the web. In SIGMOD, 2000.
[15] R. Gupta and S. Sarawagi. Curating probabilistic databases from information extraction models. In VLDB, 2006.
[16] P. Haas and J. Hellerstein. Ripple joins for online aggregation. In SIGMOD, 1999.
[17] P. G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In SIGMOD,
2006.
[18] P. G. Ipeirotis, E. Agichtein, P. Jain, and L. Gravano. Towards a query optimizer for text-centric tasks. ACM Transactions on Database
Systems, 32(4), Dec. 2007.
[19] A. Jain, A. Doan, and L. Gravano. SQL queries over unstructured text databases. In ICDE, 2007.
[20] A. Jain, A. Doan, and L. Gravano. Optimizing SQL queries over text databases. In ICDE, 2008. To appear.
[21] A. Jain and P. G. Ipeirotis. A quality-aware optimizer for information extraction. Technical Report CeDER-08-02, New York University, 2007.
[22] I. Mansuri and S. Sarawagi. A system for integrating unstructured data into relational databases. In ICDE, 2006.
[23] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Physical Review
E, 64(2), Aug. 2001.
[24] W. Shen, A. Doan, J. Naughton, and R. Ramakrishnan. Declarative information extraction using Datalog with embedded extraction predicates.
In VLDB, 2007.
25