+=8E4AB8CH>5&4==BH;E0=80+=8E4AB8CH>5&4==BH;E0=80
)27>;0A;H><<>=B)27>;0A;H><<>=B
0C010B4(4B40A27A>D?) 4?0AC<4=C>5><?DC4A=5>A<0C8>=)284=24
D=4
=30?C8E4'D4AHG42DC8>=)HBC4<5>A0C0=C46A0C8>==30?C8E4'D4AHG42DC8>=)HBC4<5>A0C0=C46A0C8>=
.0270AHE4B
+=8E4AB8CH>5&4==BH;E0=80
I8E4B28BD?4==43D
0=84;0;>A4B2D
$(
#0A2A843<0=
+=8E4AB8CH>5-0B78=6C>=
;>="4EH
+=8E4AB8CH>5-0B78=6C>=
0=84;)-4;3
+=8E4AB8CH>5-0B78=6C>=
>;;>FC78B0=30338C8>=0;F>A:B0C7CC?BA4?>B8C>AHD?4==43D31/A4B40A27
E4B.0270AH;>A4B2D0=84;0A843<0=#0A2"4EH;>=0=3-4;30=84;)=30?C8E4'D4AH
G42DC8>=)HBC4<5>A0C0=C46A0C8>=
0C010B4(4B40A27A>D?)
7CC?BA4?>B8C>AHD?4==43D31/A4B40A27
&>BC?A8=CE4AB8>=>?HA867C#*78B8BC740DC7>ABE4AB8>=>5C74F>A:C8B?>BC4374A41H?4A<8BB8>=>5
#5>AH>DA?4AB>=0;DB4$>C5>AA438BCA81DC8>=*7434J=8C8E4E4AB8>=F0B?D1;8B7438=
)#%
,>;D<4
BBD4 D=4?064B
&D1;8B74A+("7CC??>AC0;02<>A6
28C0C8>=25<832>;;?>AC0;3;#*%!$
$%*CC74C8<4>5?D1;820C8>=C740DC7>A.0270AHE4BF0B0K;80C43F8C7C74+=8E4AB8CH>5-0B78=6C>=
DAA4=C;H D=4748B0502D;CH<4<14A>5C744?0AC<4=C>5><?DC4A0=3=5>A<0C8>=)284=24B0CC74
+=8E4AB8CH>5&4==BH;E0=80
*78B?0?4A8B?>BC430C)27>;0A;H><<>=B7CC?BA4?>B8C>AHD?4==43D31/A4B40A27
>A<>A48=5>A<0C8>=?;40B42>=C02CA4?>B8C>AH?>1>GD?4==43D
=30?C8E4'D4AHG42DC8>=)HBC4<5>A0C0=C46A0C8>==30?C8E4'D4AHG42DC8>=)HBC4<5>A0C0=C46A0C8>=
1BCA02C1BCA02C
'D4AH?A>24BB8=68=30C08=C46A0C8>=>22DAB>E4A=4CF>A:1>D=30DC>=><>DB30C0B>DA24B*78BA4@D8A4B
4GC4=B8>=BC>CA038C8>=0;>?C8<8I0C8>=0=34G42DC8>=C427=8@D4B5>AC7A44A40B>=BC74A48B0=01B4=24>5
@D0;8CHBC0C8BC82B01>DCC7430C030C0CA0=B54AA0C4B0A4D=?A4382C01;40=31DABCH0=3B;>F>AD=0E08;01;4
30C0B>DA24B20=>5C4=14A4?;02431H>E4A;0??8=6>A<8AA>A43B>DA24B*78B?0?4A?A4B4=CBC74*D:F8;0
30C08=C46A0C8>=BHBC4<34B86=43C>BD??>AC030?C8E8CH0C8CB2>A4DB8=60CF>?A>=6430??A>027
=C4A;40E43?;0==8=60=34G42DC8>=F8C7?0AC80;>?C8<8I0C8>=0;;>FB*D:F8;0C>@D82:;HA42>E4A5A><
3428B8>=B10B43>=8=022DA0C44BC8<0C4BDA8=64G42DC8>=*D:F8;0DB4B030?C8E4@D4AH>?4A0C>ABBD27
0BC743>D1;4?8?4;8=4370B79>8=F7827?A>3D24B0=BF4AB@D82:;H0=3C743H=0<822>;;42C>AF7827
A>1DBC;H0=34K284=C;H2><?DC4BD=8>=B02A>BB>E4A;0??8=630C0B>DA24B-434<>=BCA0C4C70CC74
*D:F8;00A278C42CDA44GC4=3B?A4E8>DB8==>E0C8>=B8=030?C8E44G42DC8>=BD270B@D4AHB2A0<1;8=6<83
4G42DC8>=A4>?C8<8I0C8>=0=327>>B4=>34B0=3F4?A4B4=C4G?4A8<4=C0;4E834=24C70C>DAC427=8@D4B
A4BD;C8=1470E8>A34B8A01;45>A030C08=C46A0C8>=BHBC4<
><<4=CB><<4=CB
&>BC?A8=CE4AB8>=>?HA867C#*78B8BC740DC7>ABE4AB8>=>5C74F>A:C8B?>BC4374A41H
?4A<8BB8>=>5#5>AH>DA?4AB>=0;DB4$>C5>AA438BCA81DC8>=*7434J=8C8E4E4AB8>=F0B?D1;8B7438=
)#%
,>;D<4BBD4 D=4?064B
&D1;8B74A+("7CC??>AC0;02<>A6
28C0C8>=25<832>;;?>AC0;3;#*%!$
$%*CC74C8<4>5?D1;820C8>=C740DC7>A.0270AHE4BF0B0K;80C43F8C7C74+=8E4AB8CH>5
-0B78=6C>=DAA4=C;H D=4748B0502D;CH<4<14A>5C744?0AC<4=C>5><?DC4A0=3=5>A<0C8>=
)284=24B0CC74+=8E4AB8CH>5&4==BH;E0=80
*78B2>=54A4=24?0?4A8B0E08;01;40C)27>;0A;H><<>=B7CC?BA4?>B8C>AHD?4==43D31/A4B40A27
An Adaptive Query Execution System for Data Integration
Zachary G. Ives
University of Washington
zives@cs.washington.edu
Daniela Florescu
INRIA Roquencourt
Daniela.Florescu@inria.fr
Marc Friedman
University of Washington
friedman@cs.washington.edu
Alon Levy
University of Washington
alon@cs.washington.edu
Daniel S. Weld
University of Washington
weld@cs.washington.edu
Abstract
Query processing in data integration occurs over network-bound,
autonomous data sources. This requires extensions to traditional
optimization and execution techniques for three reasons: there is an
absence of quality statistics about the data, data transfer rates are
unpredictable and bursty, and slow or unavailable data sources can
often be replaced by overlapping or mirrored sources. This paper
presents the Tukwila data integration system, designed to support
adaptivity at its core using a two-pronged approach. Interleaved
planning and execution with partial optimization allows Tukwila
to quickly recover from decisions based on inaccurate estimates.
During execution, Tukwila uses adaptive query operators such as
the double pipelined hash join, which produces answers quickly,
and the dynamic collector, which robustly and efficiently computes
unions across overlapping data sources. We demonstrate that the
Tukwila architecture extends previous innovations in adaptive ex-
ecution (such as query scrambling, mid-execution re-optimization,
and choose nodes), and we present experimental evidence that our
techniques result in behavior desirable for a data integration sys-
tem.
1 Introduction
The goal of a data integration system is to provide a uni-
form query interface to a multitude of data sources. The data
integration problem primarily arises in two contexts: organi-
zations trying to provide access to a collection of internal au-
tonomous sources, and systems that present a uniform inter-
face to a multitude of sources available on the World-Wide
Web (WWW). The key advantage of a data integration sys-
tem is that it frees users fromhavingto locate the sources rel-
evant to their query, interact with each source independently,
and manually combine the data from the different sources.
This research was funded in part by ARPA / Rome Labs grant F30602-
95-1-0024, Office of Naval Research Grant N00014-98-1-0147, by National
Science Foundation Grants IRI-9303461, IIS-9872128, and 9874759.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advan-
tage and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on serverse or to re-
distribute to lists, requires prior specific permission and/or a fee.
SIGMOD ’99 Philadelphia, PA
Copyright 1999 ACM 1-58113-084-8/99/05...$5.00
The problem of data integration has received significant at-
tention in the research community as evidenced by numer-
ous research projects (e.g., [10, 20, 25, 17, 9, 3, 6, 1, 25, 21,
4, 13]) and the emergence of several commercial products
(e.g., DataJoiner [23] and jango.excite.com).
Three main challenges distinguish the design of a data
integration system from that of a traditional database sys-
tem: query reformulation, the construction of wrapper pro-
grams, and the design of new query processing techniques
for this more unpredictable environment. While the prob-
lems of reformulation and rapid wrapper development have
been the focus of previous work (e.g., [10, 17, 1, 9, 16, 3]),
relatively little attention has been given to the development
of query optimization algorithms and efficient query execu-
tion engines for data integration systems. These components
are now the critical bottleneck to making such systems de-
ployable in practice.
1.1 The Need for Adaptivity
To date, most data integration research has focused on the
problem of integratinginformation from web-based data sources,
where the amount of data returned by each source is gener-
ally small. The greater problem — that of querying over
multiple autonomous data sources of moderate size, across
intranets as well as the Internet — requires us to integrate
novel query execution techniques.
Several characteristics of the data integration problem
render existing database optimizers and execution engines
(or simple extensions thereof) inappropriate in the context
of data integration:
Absence of statistics: statistics about the data (e.g., cardi-
nalities, histograms) are central to a query optimizer’s cost
estimates for query execution plans. Since data integra-
tion systems manipulate data from autonomous external
sources, the system has relatively few and often unreliable
statistics about the data.
Unpredictabledata arrival characteristics: unlike tradi-
tional systems, data integration systems have little knowl-
edge about the rate of data arrival from the sources. Two
phenomenathat occur frequently in practice are significant
initial delays before data starts arriving, and bursty arrivals
of data thereafter. Hence, even if the query optimizer is
able to determine the best plan based on total work, the
data arrival characteristics may cause it to be inefficient in
practice [22].
Overlap and redundancy among sources: as a result of
the heterogeneity of the data sources, there is often signifi-
cant overlap or redundancy among them. Hence, the query
processor needs to be able to efficiently collect related data
from multiple sources, minimize the access to redundant
sources, and respond flexibly when some sources are un-
available.
Since data integration systems are designed for online
querying of data on the network, they have two other im-
portant characteristics. First, it is important to optimize the
time to the initial answers to the query, rather than to mini-
mize the total work of the system. Also, network bandwidth
generally constrains the data sources to be somewhat smaller
than in traditional database applications.
For all of these reasons, a data integration query proces-
sor should be adaptive. This is particularlytrue since a query
optimizer is unlikely to produce good plans from bad meta-
data, and even a plan that may be good on average should
be abandoned if unexpected situations arise. While runtime
adaptivity has been shown to speed up performance even in
traditional systems [15, 12], it becomes critical to perfor-
mance in the data integration context (e.g., [22]).
1.2 Adaptive Features of Tukwila
This paper describes the Tukwila
1
data integration system,
designed with adaptivity built into its core. There are two
levels at which Tukwila exhibits adaptive behavior: between
the optimizer and execution engine, through a process of in-
terleaved planning and execution, and within the execution
engine, with operators designed for dynamic execution.
Interleavingplanning and execution: when Tukwila pro-
cesses a query it does not necessarily create a complete
query executionplan beforebeginningto evaluate the query.
If the optimizer concludes that it does not have enough
metadata with which to reliably compare candidate query
execution plans, it may choose to send only a partial plan
to the execution engine, and decide how to proceed after
the partial plan has been completed, as in [7]. Alterna-
tively, the optimizer may send a complete plan, but the
execution engine may check for conditions that require in-
cremental re-optimization.
Adaptive operators: Tukwila incorporates operators that
are especially well suited for adaptive execution and for
minimizing the time required to obtain the first answers to
a query. Specifically, it employs an enhanced version of
the double pipelined hash join [24] (a join implementation
which executes in a symmetric, data-driven manner) and
techniques for adapting its execution when there is insuf-
ficient memory. In addition, the Tukwila execution engine
includes a collector operator whose task is to efficiently
union data from a large set of possibly overlapping or re-
dundant sources. Finally, Tukwila query execution plans
1
Tukwila is a scenic city near Seattle in the Northwest United States.
can contain conditional nodes in the spirit of [12] in order
to adapt to conditions that can be anticipated at optimiza-
tion time.
Adaptive behavior in Tukwila is coordinated in a uni-
form fashion by a set of event-condition-action rules. An
event may be raised by the execution of operators (e.g., out
of memory, data source not responding) or at materialization
points in the plan. The possible actions include modifying
operator execution, reordering of operators, or re-optimization.
1.3 Example
A simple example demonstrates the breadth of Tukwila’s
adaptive behavior. Suppose that the same query (Figure 1a)
is issued to the system under three extreme conditions: when
the source tables are of unknownsize, are small, or are large.
Each time, assume that the relative statistics are such that a
traditional optimizer would construct the join tree in Fig-
ure 1b. In a traditional query engine, the join implementa-
tions, memory allocations, and materialization points will be
fixed at compile time, and the tree will be executed in a pre-
determined order. Tukwila implements mechanisms needed
to behave more adaptively. Consider its response to the three
cases:
No size information: With no informationthere is no point
in traditional optimization. Instead, the optimizer may de-
cide to compute a partial result that it chooses heuristically,
such as the join AB, and decide afterwards what do next.
Small tables: Tukwila chooses the double pipelined join
implementation for joins of small cardinality, and pipelines
the entire query. When source latencies are high, this type
of join has a large advantage over traditional joins, but it de-
mands considerably more memory. To handle the “unlucky”
case that memory is exceeded, the join operator has an over-
flow resolution mechanism.
Large tables: If the tables are sufficiently large, Tukwila’s
optimizer chooses standard hash joins, and breaks the pipeline,
perhaps after join AB in Figure 1b. Now, depending on the
rules in force, one of two things may happen during execu-
tion:
Rescheduling: If all sources respond, and table AB has
a cardinality sufficiently close to the optimizer’s estimate,
execution continues normally. Should some sources re-
spond slowly, however, Tukwila can reschedule as with
query scrambling [22]. If the connection to data source A
times out, join DE will be executed preemptively. Should
that time out as well, the optimizer is called with that infor-
mation to produce a plan reordered to use the non-blocked
sources first.
Re-optimization: After the AB join completes and ma-
terializes, Tukwila compares the actual cardinality with
the optimizer’s estimate. As in [15], if this value signif-
icantly differs from the optimizer’s estimate, the optimizer
is awakened to find a cheaper plan (perhaps the one in Fig-
ure 1c) given more accurate information.
Select * from A,B,C,D,E
where A.ssn =B.ssn
and B.ssn=C.ssn
and C.ssn=D.ssn
and D.ssn=E.ssn
ABCDE
ABC DE
A
AB
B
C D E
(a) (b) (c)
ABCDE
CDE AB
D
DE
E
C A B
Figure 1: Sample query, initial join tree, and join tree produced by re-optimization.
The paper is organized as follows. Section 2 provides an
overview of the architecture of Tukwila. Section 3 describes
the mechanisms for interleaving of planning and execution.
Section 4 describes the new query operator algorithms used
in Tukwila. Section 5 discusses the implemented system.
Section 6 describes our experimental results. Section 7 dis-
cusses related work, and Section 8 discusses several addi-
tional issues and concluding remarks.
2 Tukwila Architecture
This section provides an overview of the Tukwila architec-
ture as illustrated in Figure 2.
Queries: A Tukwila user poses queries in terms of a medi-
ated relational schema. The relations in the mediatedschema
are virtual in the sense that their extensions are not stored
anywhere. The goal of the mediated schema is to abstract
the details of the data sources’ schemata from the user. In
this paper we limit our discussion to select-project-join (con-
junctive) queries over this mediated schema.
Data source catalog: The catalog contains several types
of metadata about each data source. The first of these is
a semantic description of the contents of the data sources.
Second is overlap information about pairs of data sources
(that is, the probability that a data value
appears in source
if
is known to appear in source
) for use by collector
operators, as in [8]. In the extreme case, overlap information
can indicate that two sites are mirrors of each other. Finally,
the catalog may contain key statistics about the data, such as
the cost of accessing each source, the sizes of the relations
in the sources, and selectivity information.
Queryreformulation: The queryoverthe mediated schema
is fed intothe Tukwila query reformulationcomponent,which
is based on an enhanced version of the algorithm described
in [17]. In general, a query reformulator converts the user’s
query into a union of conjunctivequeries referringto the data
source schemata. This paper focuses on a limited form in
which we have a single query that may include disjunction
at the leaves. This limited disjunction, which is handled by
our dynamic collector operator, is useful in handling multi-
ple overlapping or mirrored data sources with the same at-
tributes, e.g. in a query over bibliographical databases.
Queryoptimizer: The query optimizertransforms therewrit-
ten query into a query execution plan for the execution en-
gine. The optimizer has the ability to create partial plans
if essential statistics are missing or uncertain, and also pro-
duces rules to define adaptive behavior during runtime.
Query executionengine: The query execution engine pro-
cesses query plans produced by the optimizer. The execution
engine emphasizes time-to-first result and includes operators
designed to facilitate this. It includes an event handler for
dynamically interpreting rules and supports incremental re-
optimization.
Wrappers: the queryexecutionengine communicateswith
the data sources through a set of wrapper programs. Wrap-
pers handle the communication with the data sources and,
when necessary, translate the data from the formats used in
the sources to those used in Tukwila. We assume a location-
independent wrapper model, where wrappers can be placed
either at the data source or at the execution system.
3 Interleaving Planning and
Execution
The query optimizer takes a query from the reformulatorand
uses information from the source catalog to produce query
execution plans for the execution engine via a System-R
style dynamic programming algorithm. The non-traditional
aspects of the Tukwila optimizer include the following:
The optimizer does not always create a complete execu-
tion plan for the query. If essential statistics are missing or
uncertain, the optimizer may generate a partial plan with
only the first steps specified, deferring subsequent plan-
ning until sources have been contacted and critical meta-
data obtained.
In addition to producing the annotated operator tree, the
optimizer also generates the appropriate event-condition-
action rules. These rules specify (1) when and how to
modify the implementation of certain operators at runtime
if needed, and (2) conditions to check at materialization
points in order to detect opportunities for re-optimization.
The query optimizer conserves the state of its search space
when it calls the execution engine. The optimizer is able
to efficiently resume optimization in incremental fashion
if needed.
3.1 Query Plans
Operators in Tukwila are organizedinto pipelinedunits called
fragments. At the end of a fragment, pipelines terminate,