University of Pennsylvania
ScholarlyCommons
Database Research Group (CIS), Department of Computer & Information Science
June 1999

An Adaptive Query Execution System for Data Integration

Zachary Ives, University of Pennsylvania (zives@cis.upenn.edu)
Daniela Florescu (INRIA)
Marc Friedman, University of Washington
Alon Levy, University of Washington
Daniel S. Weld, University of Washington

Follow this and additional works at: https://repository.upenn.edu/db_research

Postprint version. Copyright ACM. This is the author's version of the work; it is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in SIGMOD, June 1999. Publisher URL: http://portal.acm.org

NOTE: At the time of publication, the author Zachary Ives was affiliated with the University of Washington. Currently he is a faculty member of the Department of Computer and Information Science at the University of Pennsylvania.

This paper is posted at ScholarlyCommons: https://repository.upenn.edu/db_research
For more information, please contact repository@pobox.upenn.edu.
An Adaptive Query Execution System for Data Integration
Zachary G. Ives
University of Washington
zives@cs.washington.edu
Daniela Florescu
INRIA Roquencourt
Daniela.Florescu@inria.fr
Marc Friedman
University of Washington
friedman@cs.washington.edu
Alon Levy
University of Washington
alon@cs.washington.edu
Daniel S. Weld
University of Washington
weld@cs.washington.edu
Abstract
Query processing in data integration occurs over network-bound,
autonomous data sources. This requires extensions to traditional
optimization and execution techniques for three reasons: there is an
absence of quality statistics about the data, data transfer rates are
unpredictable and bursty, and slow or unavailable data sources can
often be replaced by overlapping or mirrored sources. This paper
presents the Tukwila data integration system, designed to support
adaptivity at its core using a two-pronged approach. Interleaved
planning and execution with partial optimization allows Tukwila
to quickly recover from decisions based on inaccurate estimates.
During execution, Tukwila uses adaptive query operators such as
the double pipelined hash join, which produces answers quickly,
and the dynamic collector, which robustly and efficiently computes
unions across overlapping data sources. We demonstrate that the
Tukwila architecture extends previous innovations in adaptive ex-
ecution (such as query scrambling, mid-execution re-optimization,
and choose nodes), and we present experimental evidence that our
techniques result in behavior desirable for a data integration sys-
tem.
1 Introduction
The goal of a data integration system is to provide a uni-
form query interface to a multitude of data sources. The data
integration problem primarily arises in two contexts: organi-
zations trying to provide access to a collection of internal au-
tonomous sources, and systems that present a uniform inter-
face to a multitude of sources available on the World-Wide
Web (WWW). The key advantage of a data integration sys-
tem is that it frees users from having to locate the sources rel-
evant to their query, interact with each source independently,
and manually combine the data from the different sources.
This research was funded in part by ARPA / Rome Labs grant F30602-
95-1-0024, Office of Naval Research Grant N00014-98-1-0147, by National
Science Foundation Grants IRI-9303461, IIS-9872128, and 9874759.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advan-
tage and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on servers or to re-
distribute to lists, requires prior specific permission and/or a fee.
SIGMOD ’99 Philadelphia, PA
Copyright 1999 ACM 1-58113-084-8/99/05...$5.00
The problem of data integration has received significant at-
tention in the research community as evidenced by numer-
ous research projects (e.g., [10, 20, 25, 17, 9, 3, 6, 1, 25, 21,
4, 13]) and the emergence of several commercial products
(e.g., DataJoiner [23] and jango.excite.com).
Three main challenges distinguish the design of a data
integration system from that of a traditional database sys-
tem: query reformulation, the construction of wrapper pro-
grams, and the design of new query processing techniques
for this more unpredictable environment. While the prob-
lems of reformulation and rapid wrapper development have
been the focus of previous work (e.g., [10, 17, 1, 9, 16, 3]),
relatively little attention has been given to the development
of query optimization algorithms and efficient query execu-
tion engines for data integration systems. These components
are now the critical bottleneck to making such systems de-
ployable in practice.
1.1 The Need for Adaptivity
To date, most data integration research has focused on the
problem of integrating information from web-based data sources,
where the amount of data returned by each source is gener-
ally small. The greater problem, that of querying over
multiple autonomous data sources of moderate size across
intranets as well as the Internet, requires us to integrate
novel query execution techniques.
Several characteristics of the data integration problem
render existing database optimizers and execution engines
(or simple extensions thereof) inappropriate in the context
of data integration:
Absence of statistics: statistics about the data (e.g., cardi-
nalities, histograms) are central to a query optimizer’s cost
estimates for query execution plans. Since data integra-
tion systems manipulate data from autonomous external
sources, the system has relatively few and often unreliable
statistics about the data.
Unpredictable data arrival characteristics: unlike tradi-
tional systems, data integration systems have little knowl-
edge about the rate of data arrival from the sources. Two
phenomena that occur frequently in practice are significant
initial delays before data starts arriving, and bursty arrivals
of data thereafter. Hence, even if the query optimizer is

able to determine the best plan based on total work, the
data arrival characteristics may cause it to be inefficient in
practice [22].
Overlap and redundancy among sources: as a result of
the heterogeneity of the data sources, there is often signifi-
cant overlap or redundancy among them. Hence, the query
processor needs to be able to efficiently collect related data
from multiple sources, minimize the access to redundant
sources, and respond flexibly when some sources are un-
available.
Since data integration systems are designed for online
querying of data on the network, they have two other im-
portant characteristics. First, it is important to optimize the
time to the initial answers to the query, rather than to mini-
mize the total work of the system. Also, network bandwidth
generally constrains the data sources to be somewhat smaller
than in traditional database applications.
For all of these reasons, a data integration query proces-
sor should be adaptive. This is particularly true since a query
optimizer is unlikely to produce good plans from bad meta-
data, and even a plan that may be good on average should
be abandoned if unexpected situations arise. While runtime
adaptivity has been shown to speed up performance even in
traditional systems [15, 12], it becomes critical to perfor-
mance in the data integration context (e.g., [22]).
1.2 Adaptive Features of Tukwila
This paper describes the Tukwila¹ data integration system,
designed with adaptivity built into its core. There are two
levels at which Tukwila exhibits adaptive behavior: between
the optimizer and execution engine, through a process of in-
terleaved planning and execution, and within the execution
engine, with operators designed for dynamic execution.
Interleaving planning and execution: when Tukwila pro-
cesses a query it does not necessarily create a complete
query execution plan before beginning to evaluate the query.
If the optimizer concludes that it does not have enough
metadata with which to reliably compare candidate query
execution plans, it may choose to send only a partial plan
to the execution engine, and decide how to proceed after
the partial plan has been completed, as in [7]. Alterna-
tively, the optimizer may send a complete plan, but the
execution engine may check for conditions that require in-
cremental re-optimization.
Adaptive operators: Tukwila incorporates operators that
are especially well suited for adaptive execution and for
minimizing the time required to obtain the first answers to
a query. Specifically, it employs an enhanced version of
the double pipelined hash join [24] (a join implementation
which executes in a symmetric, data-driven manner) and
techniques for adapting its execution when there is insuf-
ficient memory. In addition, the Tukwila execution engine
includes a collector operator whose task is to efficiently
union data from a large set of possibly overlapping or re-
dundant sources. Finally, Tukwila query execution plans
¹ Tukwila is a scenic city near Seattle in the Northwest United States.
can contain conditional nodes in the spirit of [12] in order
to adapt to conditions that can be anticipated at optimiza-
tion time.
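The probe-then-insert behavior of the double pipelined hash join can be sketched as follows. This is a minimal illustration, not Tukwila's implementation; the interleaved arrival order and dictionary-based tuples are assumptions made only for the example:

```python
# Minimal sketch of a symmetric ("double pipelined") hash join;
# names and structure are illustrative, not Tukwila's actual code.
from collections import defaultdict
from itertools import zip_longest

def double_pipelined_join(left, right, key):
    left_table = defaultdict(list)    # tuples seen so far from the left input
    right_table = defaultdict(list)   # tuples seen so far from the right input
    # zip_longest interleaves arrivals to simulate the data-driven,
    # symmetric execution: neither input has to finish first.
    for l, r in zip_longest(left, right):
        if l is not None:
            for match in right_table[l[key]]:   # probe the opposite table...
                yield (l, match)
            left_table[l[key]].append(l)        # ...then insert into our own
        if r is not None:
            for match in left_table[r[key]]:
                yield (match, r)
            right_table[r[key]].append(r)

A = [{"ssn": 1, "a": "x"}, {"ssn": 2, "a": "y"}]
B = [{"ssn": 2, "b": "u"}, {"ssn": 1, "b": "v"}]
results = list(double_pipelined_join(A, B, "ssn"))
# Answers stream out as tuples arrive, instead of waiting for one
# side to be fully read and hashed as in a classic hash join.
```

Because each arriving tuple probes before inserting, every matching pair is emitted exactly once, and the first answers appear before either source is exhausted; the price, as noted above, is that both hash tables must be memory-resident.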
Adaptive behavior in Tukwila is coordinated in a uni-
form fashion by a set of event-condition-action rules. An
event may be raised by the execution of operators (e.g., out
of memory, data source not responding) or at materialization
points in the plan. The possible actions include modifying
operator execution, reordering of operators, or re-optimization.
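As a rough illustration, such rules might be represented as event/condition/action triples. All names below (event names, context fields, actions, the 2x cardinality threshold) are hypothetical, not Tukwila's actual rule language:

```python
# Hypothetical event-condition-action rules; the event names,
# context fields, thresholds, and actions are illustrative only.
rules = [
    {   # raised by a join operator that exhausts its memory budget
        "event": "out_of_memory",
        "condition": lambda ctx: ctx["operator"] == "double_pipelined_join",
        "action": "activate_overflow_resolution",
    },
    {   # checked at a materialization point between plan fragments
        "event": "fragment_materialized",
        "condition": lambda ctx: abs(ctx["actual_card"] - ctx["est_card"]) > 2 * ctx["est_card"],
        "action": "reoptimize_remaining_plan",
    },
    {   # raised when a wrapper's connection to a source times out
        "event": "source_timeout",
        "condition": lambda ctx: True,
        "action": "reschedule_other_subtree",
    },
]

def handle(event, ctx):
    """Fire the first rule whose event and condition match."""
    for rule in rules:
        if rule["event"] == event and rule["condition"](ctx):
            return rule["action"]
    return "continue_execution"

action = handle("fragment_materialized", {"actual_card": 50_000, "est_card": 1_000})
# The actual cardinality is far beyond the estimate, so the rule
# asks for the remaining plan to be re-optimized.
```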
1.3 Example
A simple example demonstrates the breadth of Tukwila's
adaptive behavior. Suppose that the same query (Figure 1a)
is issued to the system under three extreme conditions: when
the source tables are of unknown size, are small, or are large.
Each time, assume that the relative statistics are such that a
traditional optimizer would construct the join tree in Fig-
ure 1b. In a traditional query engine, the join implementa-
tions, memory allocations, and materialization points will be
fixed at compile time, and the tree will be executed in a pre-
determined order. Tukwila implements mechanisms needed
to behave more adaptively. Consider its response to the three
cases:
No size information: With no information, there is no point
in traditional optimization. Instead, the optimizer may de-
cide to compute a partial result that it chooses heuristically,
such as the join AB, and decide afterwards what to do next.
Small tables: Tukwila chooses the double pipelined join
implementation for joins of small cardinality, and pipelines
the entire query. When source latencies are high, this type
of join has a large advantage over traditional joins, but it de-
mands considerably more memory. To handle the “unlucky”
case that memory is exceeded, the join operator has an over-
flow resolution mechanism.
Large tables: If the tables are sufficiently large, Tukwila's
optimizer chooses standard hash joins, and breaks the pipeline,
perhaps after join AB in Figure 1b. Now, depending on the
rules in force, one of two things may happen during execu-
tion:
Rescheduling: If all sources respond, and table AB has
a cardinality sufficiently close to the optimizer’s estimate,
execution continues normally. Should some sources re-
spond slowly, however, Tukwila can reschedule as with
query scrambling [22]. If the connection to data source A
times out, join DE will be executed preemptively. Should
that time out as well, the optimizer is called with that infor-
mation to produce a plan reordered to use the non-blocked
sources first.
Re-optimization: After the AB join completes and ma-
terializes, Tukwila compares the actual cardinality with
the optimizer’s estimate. As in [15], if this value signif-
icantly differs from the optimizer’s estimate, the optimizer
is awakened to find a cheaper plan (perhaps the one in Fig-
ure 1c) given more accurate information.

Select * from A,B,C,D,E
where A.ssn = B.ssn
  and B.ssn = C.ssn
  and C.ssn = D.ssn
  and D.ssn = E.ssn

Figure 1: (a) Sample query; (b) initial join tree, ((A join B) join C) join (D join E); (c) join tree produced by re-optimization, ((D join E) join C) join (A join B).
The paper is organized as follows. Section 2 provides an
overview of the architecture of Tukwila. Section 3 describes
the mechanisms for interleaving of planning and execution.
Section 4 describes the new query operator algorithms used
in Tukwila. Section 5 discusses the implemented system.
Section 6 describes our experimental results. Section 7 dis-
cusses related work, and Section 8 discusses several addi-
tional issues and concluding remarks.
2 Tukwila Architecture
This section provides an overview of the Tukwila architec-
ture as illustrated in Figure 2.
Queries: A Tukwila user poses queries in terms of a medi-
ated relational schema. The relations in the mediated schema
are virtual in the sense that their extensions are not stored
anywhere. The goal of the mediated schema is to abstract
the details of the data sources’ schemata from the user. In
this paper we limit our discussion to select-project-join (con-
junctive) queries over this mediated schema.
Data source catalog: The catalog contains several types
of metadata about each data source. The first of these is
a semantic description of the contents of the data sources.
Second is overlap information about pairs of data sources
(that is, the probability that a data value appears in one source
if it is known to appear in another) for use by collector
operators, as in [8]. In the extreme case, overlap information
can indicate that two sites are mirrors of each other. Finally,
the catalog may contain key statistics about the data, such as
the cost of accessing each source, the sizes of the relations
in the sources, and selectivity information.
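One way this catalog metadata could drive a dynamic collector is sketched below. The source names, the data, and the skip-a-perfect-mirror policy are assumptions for illustration, not Tukwila's actual collector policy:

```python
# Hypothetical sketch of a dynamic collector: union overlapping
# sources, eliminate duplicates, and skip a source entirely when
# the catalog says it mirrors one that has already answered.
def collect(sources, overlap):
    """sources: {name: zero-arg callable returning rows, or None if down}
       overlap: {(a, b): probability that a value in b also appears in a}"""
    seen, answered = set(), []
    for name, fetch in sources.items():
        # Skip mirrors of sources that already responded.
        if any(overlap.get((done, name), 0.0) >= 1.0 for done in answered):
            continue
        rows = fetch() if fetch else None
        if rows is None:            # unavailable source: stay robust, move on
            continue
        for row in rows:
            if row not in seen:     # duplicate elimination across sources
                seen.add(row)
                yield row
        answered.append(name)

sources = {
    "bib1": lambda: [("Ives", 1999), ("Graefe", 1993)],
    "bib2": lambda: [("Graefe", 1993), ("Levy", 1996)],   # overlaps bib1
    "bib3": lambda: [("Ives", 1999), ("Graefe", 1993)],   # mirror of bib1
}
overlap = {("bib1", "bib3"): 1.0}
rows = list(collect(sources, overlap))
# bib3 is never contacted, and the duplicate from bib2 is dropped.
```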
Query reformulation: The query over the mediated schema
is fed into the Tukwila query reformulation component, which
is based on an enhanced version of the algorithm described
in [17]. In general, a query reformulator converts the user’s
query into a union of conjunctive queries referring to the data
source schemata. This paper focuses on a limited form in
which we have a single query that may include disjunction
at the leaves. This limited disjunction, which is handled by
our dynamic collector operator, is useful in handling multi-
ple overlapping or mirrored data sources with the same at-
tributes, e.g. in a query over bibliographical databases.
Query optimizer: The query optimizer transforms the rewrit-
ten query into a query execution plan for the execution en-
gine. The optimizer has the ability to create partial plans
if essential statistics are missing or uncertain, and also pro-
duces rules to define adaptive behavior during runtime.
Query execution engine: The query execution engine pro-
cesses query plans produced by the optimizer. The execution
engine emphasizes time-to-first result and includes operators
designed to facilitate this. It includes an event handler for
dynamically interpreting rules and supports incremental re-
optimization.
Wrappers: the query execution engine communicates with
the data sources through a set of wrapper programs. Wrap-
pers handle the communication with the data sources and,
when necessary, translate the data from the formats used in
the sources to those used in Tukwila. We assume a location-
independent wrapper model, where wrappers can be placed
either at the data source or at the execution system.
3 Interleaving Planning and
Execution
The query optimizer takes a query from the reformulator and
uses information from the source catalog to produce query
execution plans for the execution engine via a System-R
style dynamic programming algorithm. The non-traditional
aspects of the Tukwila optimizer include the following:
The optimizer does not always create a complete execu-
tion plan for the query. If essential statistics are missing or
uncertain, the optimizer may generate a partial plan with
only the first steps specified, deferring subsequent plan-
ning until sources have been contacted and critical meta-
data obtained.
In addition to producing the annotated operator tree, the
optimizer also generates the appropriate event-condition-
action rules. These rules specify (1) when and how to
modify the implementation of certain operators at runtime
if needed, and (2) conditions to check at materialization
points in order to detect opportunities for re-optimization.
The query optimizer conserves the state of its search space
when it calls the execution engine. The optimizer is able
to efficiently resume optimization in incremental fashion
if needed.
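The first of these points, partial planning, can be sketched as follows. The structure (plan only what current statistics support, and mark where the optimizer should resume) is from the text; every name and the two-relation fragment size are illustrative assumptions:

```python
# Illustrative sketch of partial planning: if statistics for some
# relations are missing, plan only a first fragment and defer the rest.
def plan(query_relations, stats):
    known = [r for r in query_relations if r in stats]
    unknown = [r for r in query_relations if r not in stats]
    if unknown:
        # Not enough metadata to compare complete plans reliably:
        # emit a partial plan over relations we can estimate, and a
        # marker telling the engine to call the optimizer back once
        # the fragment materializes and real cardinalities are known.
        return {"fragment": known[:2],
                "deferred": known[2:] + unknown,
                "resume_optimizer": True}
    return {"fragment": known, "deferred": [], "resume_optimizer": False}

partial = plan(["A", "B", "C", "D", "E"], {"A": 100, "B": 500})
# Only the A-B fragment is planned now; C, D, and E wait until
# their statistics are learned at runtime.
```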
3.1 Query Plans
Operators in Tukwila are organized into pipelined units called
fragments. At the end of a fragment, pipelines terminate,

References

M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall.
G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 1993.
A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. VLDB, 1996.
N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proc. IJCAI, 1997.
S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. 1994.