University of Pennsylvania
ScholarlyCommons
Database Research Group (CIS), Department of Computer & Information Science
June 1999

An Adaptive Query Execution System for Data Integration

Zachary Ives, University of Pennsylvania (zives@cis.upenn.edu)
Daniela Florescu (INRIA)
Marc Friedman, University of Washington
Alon Levy, University of Washington
Daniel S. Weld, University of Washington

Follow this and additional works at: https://repository.upenn.edu/db_research

Postprint version. Copyright ACM. This is the author's version of the work; it is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in SIGMOD, June 1999. Publisher URL: http://portal.acm.org

NOTE: At the time of publication, the author Zachary Ives was affiliated with the University of Washington. Currently he is a faculty member of the Department of Computer and Information Science at the University of Pennsylvania.

This paper is posted at ScholarlyCommons: https://repository.upenn.edu/db_research
For more information, please contact repository@pobox.upenn.edu.
An Adaptive Query Execution System for Data Integration
Zachary G. Ives
University of Washington
zives@cs.washington.edu
Daniela Florescu
INRIA Roquencourt
Daniela.Florescu@inria.fr
Marc Friedman
University of Washington
friedman@cs.washington.edu
Alon Levy
University of Washington
alon@cs.washington.edu
Daniel S. Weld
University of Washington
weld@cs.washington.edu
Abstract
Query processing in data integration occurs over network-bound,
autonomous data sources. This requires extensions to traditional
optimization and execution techniques for three reasons: there is an
absence of quality statistics about the data, data transfer rates are
unpredictable and bursty, and slow or unavailable data sources can
often be replaced by overlapping or mirrored sources. This paper
presents the Tukwila data integration system, designed to support
adaptivity at its core using a two-pronged approach. Interleaved
planning and execution with partial optimization allows Tukwila
to quickly recover from decisions based on inaccurate estimates.
During execution, Tukwila uses adaptive query operators such as
the double pipelined hash join, which produces answers quickly,
and the dynamic collector, which robustly and efficiently computes
unions across overlapping data sources. We demonstrate that the
Tukwila architecture extends previous innovations in adaptive ex-
ecution (such as query scrambling, mid-execution re-optimization,
and choose nodes), and we present experimental evidence that our
techniques result in behavior desirable for a data integration sys-
tem.
1 Introduction
The goal of a data integration system is to provide a uni-
form query interface to a multitude of data sources. The data
integration problem primarily arises in two contexts: organi-
zations trying to provide access to a collection of internal au-
tonomous sources, and systems that present a uniform inter-
face to a multitude of sources available on the World-Wide
Web (WWW). The key advantage of a data integration sys-
tem is that it frees users from having to locate the sources rel-
evant to their query, interact with each source independently,
and manually combine the data from the different sources.
This research was funded in part by ARPA / Rome Labs grant F30602-
95-1-0024, Office of Naval Research Grant N00014-98-1-0147, by National
Science Foundation Grants IRI-9303461, IIS-9872128, and 9874759.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advan-
tage and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on servers or to re-
distribute to lists, requires prior specific permission and/or a fee.
SIGMOD ’99 Philadelphia, PA
Copyright 1999 ACM 1-58113-084-8/99/05...$5.00
The problem of data integration has received significant at-
tention in the research community as evidenced by numer-
ous research projects (e.g., [10, 20, 25, 17, 9, 3, 6, 1, 25, 21,
4, 13]) and the emergence of several commercial products
(e.g., DataJoiner [23] and jango.excite.com).
Three main challenges distinguish the design of a data
integration system from that of a traditional database sys-
tem: query reformulation, the construction of wrapper pro-
grams, and the design of new query processing techniques
for this more unpredictable environment. While the prob-
lems of reformulation and rapid wrapper development have
been the focus of previous work (e.g., [10, 17, 1, 9, 16, 3]),
relatively little attention has been given to the development
of query optimization algorithms and efficient query execu-
tion engines for data integration systems. These components
are now the critical bottleneck to making such systems de-
ployable in practice.
1.1 The Need for Adaptivity
To date, most data integration research has focused on the
problem of integrating information from web-based data sources,
where the amount of data returned by each source is gener-
ally small. The greater problem, that of querying over
multiple autonomous data sources of moderate size across
intranets as well as the Internet, requires us to integrate
novel query execution techniques.
Several characteristics of the data integration problem
render existing database optimizers and execution engines
(or simple extensions thereof) inappropriate in the context
of data integration:
Absence of statistics: statistics about the data (e.g., cardi-
nalities, histograms) are central to a query optimizer’s cost
estimates for query execution plans. Since data integra-
tion systems manipulate data from autonomous external
sources, the system has relatively few and often unreliable
statistics about the data.
Unpredictable data arrival characteristics: unlike tradi-
tional systems, data integration systems have little knowl-
edge about the rate of data arrival from the sources. Two
phenomena that occur frequently in practice are significant
initial delays before data starts arriving, and bursty arrivals
of data thereafter. Hence, even if the query optimizer is

able to determine the best plan based on total work, the
data arrival characteristics may cause it to be inefficient in
practice [22].
Overlap and redundancy among sources: as a result of
the heterogeneity of the data sources, there is often signifi-
cant overlap or redundancy among them. Hence, the query
processor needs to be able to efficiently collect related data
from multiple sources, minimize the access to redundant
sources, and respond flexibly when some sources are un-
available.
Since data integration systems are designed for online
querying of data on the network, they have two other im-
portant characteristics. First, it is important to optimize the
time to the initial answers to the query, rather than to mini-
mize the total work of the system. Also, network bandwidth
generally constrains the data sources to be somewhat smaller
than in traditional database applications.
For all of these reasons, a data integration query proces-
sor should be adaptive. This is particularly true since a query
optimizer is unlikely to produce good plans from bad meta-
data, and even a plan that may be good on average should
be abandoned if unexpected situations arise. While runtime
adaptivity has been shown to speed up performance even in
traditional systems [15, 12], it becomes critical to perfor-
mance in the data integration context (e.g., [22]).
1.2 Adaptive Features of Tukwila
This paper describes the Tukwila¹ data integration system,
designed with adaptivity built into its core. There are two
levels at which Tukwila exhibits adaptive behavior: between
the optimizer and execution engine, through a process of in-
terleaved planning and execution, and within the execution
engine, with operators designed for dynamic execution.
Interleaving planning and execution: when Tukwila pro-
cesses a query it does not necessarily create a complete
query execution plan before beginning to evaluate the query.
If the optimizer concludes that it does not have enough
metadata with which to reliably compare candidate query
execution plans, it may choose to send only a partial plan
to the execution engine, and decide how to proceed after
the partial plan has been completed, as in [7]. Alterna-
tively, the optimizer may send a complete plan, but the
execution engine may check for conditions that require in-
cremental re-optimization.
Adaptive operators: Tukwila incorporates operators that
are especially well suited for adaptive execution and for
minimizing the time required to obtain the first answers to
a query. Specifically, it employs an enhanced version of
the double pipelined hash join [24] (a join implementation
which executes in a symmetric, data-driven manner) and
techniques for adapting its execution when there is insuf-
ficient memory. In addition, the Tukwila execution engine
includes a collector operator whose task is to efficiently
union data from a large set of possibly overlapping or re-
dundant sources. Finally, Tukwila query execution plans
¹ Tukwila is a scenic city near Seattle in the Northwest United States.
can contain conditional nodes in the spirit of [12] in order
to adapt to conditions that can be anticipated at optimiza-
tion time.
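The probe-then-insert behavior of the double pipelined hash join can be sketched as follows. This is a minimal illustration, not Tukwila's implementation; the interleaved arrival order and dictionary-based tuples are assumptions made only for the example:

```python
# Minimal sketch of a symmetric ("double pipelined") hash join;
# names and structure are illustrative, not Tukwila's actual code.
from collections import defaultdict
from itertools import zip_longest

def double_pipelined_join(left, right, key):
    left_table = defaultdict(list)    # tuples seen so far from the left input
    right_table = defaultdict(list)   # tuples seen so far from the right input
    # zip_longest interleaves arrivals to simulate the data-driven,
    # symmetric execution: neither input has to finish first.
    for l, r in zip_longest(left, right):
        if l is not None:
            for match in right_table[l[key]]:   # probe the opposite table...
                yield (l, match)
            left_table[l[key]].append(l)        # ...then insert into our own
        if r is not None:
            for match in left_table[r[key]]:
                yield (match, r)
            right_table[r[key]].append(r)

A = [{"ssn": 1, "a": "x"}, {"ssn": 2, "a": "y"}]
B = [{"ssn": 2, "b": "u"}, {"ssn": 1, "b": "v"}]
results = list(double_pipelined_join(A, B, "ssn"))
# Answers stream out as tuples arrive, instead of waiting for one
# side to be fully read and hashed as in a classic hash join.
```

Because each arriving tuple probes before inserting, every matching pair is emitted exactly once, and the first answers appear before either source is exhausted; the price, as noted above, is that both hash tables must be memory-resident.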
Adaptive behavior in Tukwila is coordinated in a uni-
form fashion by a set of event-condition-action rules. An
event may be raised by the execution of operators (e.g., out
of memory, data source not responding) or at materialization
points in the plan. The possible actions include modifying
operator execution, reordering of operators, or re-optimization.
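As a rough illustration, such rules might be represented as event/condition/action triples. All names below (event names, context fields, actions, the 2x cardinality threshold) are hypothetical, not Tukwila's actual rule language:

```python
# Hypothetical event-condition-action rules; the event names,
# context fields, thresholds, and actions are illustrative only.
rules = [
    {   # raised by a join operator that exhausts its memory budget
        "event": "out_of_memory",
        "condition": lambda ctx: ctx["operator"] == "double_pipelined_join",
        "action": "activate_overflow_resolution",
    },
    {   # checked at a materialization point between plan fragments
        "event": "fragment_materialized",
        "condition": lambda ctx: abs(ctx["actual_card"] - ctx["est_card"]) > 2 * ctx["est_card"],
        "action": "reoptimize_remaining_plan",
    },
    {   # raised when a wrapper's connection to a source times out
        "event": "source_timeout",
        "condition": lambda ctx: True,
        "action": "reschedule_other_subtree",
    },
]

def handle(event, ctx):
    """Fire the first rule whose event and condition match."""
    for rule in rules:
        if rule["event"] == event and rule["condition"](ctx):
            return rule["action"]
    return "continue_execution"

action = handle("fragment_materialized", {"actual_card": 50_000, "est_card": 1_000})
# The actual cardinality is far beyond the estimate, so the rule
# asks for the remaining plan to be re-optimized.
```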
1.3 Example
A simple example demonstrates the breadth of Tukwila's
adaptive behavior. Suppose that the same query (Figure 1a)
is issued to the system under three extreme conditions: when
the source tables are of unknown size, are small, or are large.
Each time, assume that the relative statistics are such that a
traditional optimizer would construct the join tree in Fig-
ure 1b. In a traditional query engine, the join implementa-
tions, memory allocations, and materialization points will be
fixed at compile time, and the tree will be executed in a pre-
determined order. Tukwila implements mechanisms needed
to behave more adaptively. Consider its response to the three
cases:
No size information: With no information, there is no point
in traditional optimization. Instead, the optimizer may de-
cide to compute a partial result that it chooses heuristically,
such as the join AB, and decide afterwards what to do next.
Small tables: Tukwila chooses the double pipelined join
implementation for joins of small cardinality, and pipelines
the entire query. When source latencies are high, this type
of join has a large advantage over traditional joins, but it de-
mands considerably more memory. To handle the “unlucky”
case that memory is exceeded, the join operator has an over-
flow resolution mechanism.
Large tables: If the tables are sufficiently large, Tukwila's
optimizer chooses standard hash joins, and breaks the pipeline,
perhaps after join AB in Figure 1b. Now, depending on the
rules in force, one of two things may happen during execu-
tion:
Rescheduling: If all sources respond, and table AB has
a cardinality sufficiently close to the optimizer’s estimate,
execution continues normally. Should some sources re-
spond slowly, however, Tukwila can reschedule as with
query scrambling [22]. If the connection to data source A
times out, join DE will be executed preemptively. Should
that time out as well, the optimizer is called with that infor-
mation to produce a plan reordered to use the non-blocked
sources first.
Re-optimization: After the AB join completes and ma-
terializes, Tukwila compares the actual cardinality with
the optimizer’s estimate. As in [15], if this value signif-
icantly differs from the optimizer’s estimate, the optimizer
is awakened to find a cheaper plan (perhaps the one in Fig-
ure 1c) given more accurate information.

Select * from A,B,C,D,E
where A.ssn = B.ssn
  and B.ssn = C.ssn
  and C.ssn = D.ssn
  and D.ssn = E.ssn

Figure 1: (a) Sample query; (b) initial join tree, ((A join B) join C) join (D join E); (c) join tree produced by re-optimization, ((D join E) join C) join (A join B).
The paper is organized as follows. Section 2 provides an
overview of the architecture of Tukwila. Section 3 describes
the mechanisms for interleaving of planning and execution.
Section 4 describes the new query operator algorithms used
in Tukwila. Section 5 discusses the implemented system.
Section 6 describes our experimental results. Section 7 dis-
cusses related work, and Section 8 discusses several addi-
tional issues and concluding remarks.
2 Tukwila Architecture
This section provides an overview of the Tukwila architec-
ture as illustrated in Figure 2.
Queries: A Tukwila user poses queries in terms of a medi-
ated relational schema. The relations in the mediated schema
are virtual in the sense that their extensions are not stored
anywhere. The goal of the mediated schema is to abstract
the details of the data sources’ schemata from the user. In
this paper we limit our discussion to select-project-join (con-
junctive) queries over this mediated schema.
Data source catalog: The catalog contains several types
of metadata about each data source. The first of these is
a semantic description of the contents of the data sources.
Second is overlap information about pairs of data sources
(that is, the probability that a data value appears in one source
if it is known to appear in another) for use by collector
operators, as in [8]. In the extreme case, overlap information
can indicate that two sites are mirrors of each other. Finally,
the catalog may contain key statistics about the data, such as
the cost of accessing each source, the sizes of the relations
in the sources, and selectivity information.
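One way this catalog metadata could drive a dynamic collector is sketched below. The source names, the data, and the skip-a-perfect-mirror policy are assumptions for illustration, not Tukwila's actual collector policy:

```python
# Hypothetical sketch of a dynamic collector: union overlapping
# sources, eliminate duplicates, and skip a source entirely when
# the catalog says it mirrors one that has already answered.
def collect(sources, overlap):
    """sources: {name: zero-arg callable returning rows, or None if down}
       overlap: {(a, b): probability that a value in b also appears in a}"""
    seen, answered = set(), []
    for name, fetch in sources.items():
        # Skip mirrors of sources that already responded.
        if any(overlap.get((done, name), 0.0) >= 1.0 for done in answered):
            continue
        rows = fetch() if fetch else None
        if rows is None:            # unavailable source: stay robust, move on
            continue
        for row in rows:
            if row not in seen:     # duplicate elimination across sources
                seen.add(row)
                yield row
        answered.append(name)

sources = {
    "bib1": lambda: [("Ives", 1999), ("Graefe", 1993)],
    "bib2": lambda: [("Graefe", 1993), ("Levy", 1996)],   # overlaps bib1
    "bib3": lambda: [("Ives", 1999), ("Graefe", 1993)],   # mirror of bib1
}
overlap = {("bib1", "bib3"): 1.0}
rows = list(collect(sources, overlap))
# bib3 is never contacted, and the duplicate from bib2 is dropped.
```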
Query reformulation: The query over the mediated schema
is fed into the Tukwila query reformulation component, which
is based on an enhanced version of the algorithm described
in [17]. In general, a query reformulator converts the user’s
query into a union of conjunctive queries referring to the data
source schemata. This paper focuses on a limited form in
which we have a single query that may include disjunction
at the leaves. This limited disjunction, which is handled by
our dynamic collector operator, is useful in handling multi-
ple overlapping or mirrored data sources with the same at-
tributes, e.g. in a query over bibliographical databases.
Query optimizer: The query optimizer transforms the rewrit-
ten query into a query execution plan for the execution en-
gine. The optimizer has the ability to create partial plans
if essential statistics are missing or uncertain, and also pro-
duces rules to define adaptive behavior during runtime.
Query execution engine: The query execution engine pro-
cesses query plans produced by the optimizer. The execution
engine emphasizes time-to-first result and includes operators
designed to facilitate this. It includes an event handler for
dynamically interpreting rules and supports incremental re-
optimization.
Wrappers: the query execution engine communicates with
the data sources through a set of wrapper programs. Wrap-
pers handle the communication with the data sources and,
when necessary, translate the data from the formats used in
the sources to those used in Tukwila. We assume a location-
independent wrapper model, where wrappers can be placed
either at the data source or at the execution system.
3 Interleaving Planning and
Execution
The query optimizer takes a query from the reformulator and
uses information from the source catalog to produce query
execution plans for the execution engine via a System-R
style dynamic programming algorithm. The non-traditional
aspects of the Tukwila optimizer include the following:
The optimizer does not always create a complete execu-
tion plan for the query. If essential statistics are missing or
uncertain, the optimizer may generate a partial plan with
only the first steps specified, deferring subsequent plan-
ning until sources have been contacted and critical meta-
data obtained.
In addition to producing the annotated operator tree, the
optimizer also generates the appropriate event-condition-
action rules. These rules specify (1) when and how to
modify the implementation of certain operators at runtime
if needed, and (2) conditions to check at materialization
points in order to detect opportunities for re-optimization.
The query optimizer conserves the state of its search space
when it calls the execution engine. The optimizer is able
to efficiently resume optimization in incremental fashion
if needed.
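The first of these points, partial planning, can be sketched as follows. The structure (plan only what current statistics support, and mark where the optimizer should resume) is from the text; every name and the two-relation fragment size are illustrative assumptions:

```python
# Illustrative sketch of partial planning: if statistics for some
# relations are missing, plan only a first fragment and defer the rest.
def plan(query_relations, stats):
    known = [r for r in query_relations if r in stats]
    unknown = [r for r in query_relations if r not in stats]
    if unknown:
        # Not enough metadata to compare complete plans reliably:
        # emit a partial plan over relations we can estimate, and a
        # marker telling the engine to call the optimizer back once
        # the fragment materializes and real cardinalities are known.
        return {"fragment": known[:2],
                "deferred": known[2:] + unknown,
                "resume_optimizer": True}
    return {"fragment": known, "deferred": [], "resume_optimizer": False}

partial = plan(["A", "B", "C", "D", "E"], {"A": 100, "B": 500})
# Only the A-B fragment is planned now; C, D, and E wait until
# their statistics are learned at runtime.
```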
3.1 Query Plans
Operators in Tukwila are organized into pipelined units called
fragments. At the end of a fragment, pipelines terminate,

References

M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall.
G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 1993.
A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. VLDB, 1996.
N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proc. IJCAI, 1997.
S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. 1994.