Efficient Multi-way Theta-Join Processing Using MapReduce

Xiaofei Zhang, HKUST, Hong Kong (zhangxf@cse.ust.hk)
Lei Chen, HKUST, Hong Kong (leichen@cse.ust.hk)
Min Wang, HP Labs China, Beijing, China (min.wang6@hp.com)
ABSTRACT
Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely employed in real practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there has been some work using the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well-scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvement in join processing efficiency.
1. INTRODUCTION
Data analytical queries in real practice commonly involve multi-way join operations. The operators involved in a multi-way join query are more than just Equi-join. Instead, the join condition can be defined as a binary function θ that belongs to {<, ≤, =, ≥, >, <>}, also known as Theta-join. Compared with Equi-join, it is more general and expressive in describing relations and surprisingly handy in data analytic queries. Thus, efficient processing of multi-way Theta-join queries plays a critical role in system performance. In fact, evaluating multi-way Theta-joins has always been a challenging problem throughout the development of database technology. Early works, e.g., [8][26][22], have elaborated the complexity of the problem and presented their evaluation strategies. However, their solutions do not scale to process multi-way Theta-joins over data of tremendous volume. For instance, as reported by Facebook [5] and Google [11], the underlying data volume is of hundreds of terabytes or even petabytes. In such scenarios, solutions from traditional distributed or parallel databases are infeasible due to unsatisfactory scalability and poor fault tolerance.

On the contrary, the (key, value)-based MapReduce programming model substantially guarantees great scalability and a strong fault tolerance property. It has emerged as the most popular processing paradigm in a shared-nothing computing environment. Recently, research efforts towards efficient and effective analytic processing over immense data have been made within the MapReduce framework. Currently, the database community mainly focuses on two issues. First, the transformation from a certain relational algebra operator, like similarity join, to its (key, value)-based parallel implementation. Second, the tuning or re-design of the transformation function such that the MapReduce job is executed more efficiently in terms of less time cost or computing resource consumption. Although various relational operators, like pair-wise Theta-join, fuzzy join, aggregation operators, etc., have been evaluated and implemented using MapReduce, there is little effort exploring the efficient processing of multi-way join queries, especially the more general computation, namely Theta-join, using MapReduce. The reason is that the problem involves more than just a relational operator to (key, value) pair transformation and its tuning; there are other critical issues that need to be addressed: 1) How many MapReduce jobs should we employ to evaluate the query? 2) What is each MapReduce job responsible for? 3) How should multiple MapReduce jobs be scheduled?
To address the problem, there are two challenging issues that need to be resolved. Firstly, the number of available computing units is in fact limited, which is often neglected when mapping a task to a set of MapReduce jobs. Although the pay-as-you-go policy of Cloud computing platforms could promise as many computing resources as required, once a computing environment is established, the allowed maximum number of concurrent Map and Reduce tasks is fixed according to the system configuration. Even taking the auto-scaling feature of the Amazon EC2 platform [18] into consideration, the maximum number of involved computing units is pre-determined by user-defined profiles.
Therefore, with the user-specified Reduce task number, a multi-way Theta-join query is processed with only a limited number of available computing units.
The second challenge is that the decomposition of a multi-way Theta-join query into a number of MapReduce tasks is non-trivial. Work [28] targets multi-way Equi-join processing. It decomposes a query into several MapReduce jobs and schedules the execution based on a specific cost model. However, it only considers the pair-wise join as the basic scheduling unit. In other words, it follows the traditional multi-way join processing methodology, which evaluates the query with a sequence of pair-wise joins. This methodology excludes the possible optimization opportunity of evaluating a multi-way join in one MapReduce job. Our observation is that, under certain conditions, evaluating a multi-way join with one MapReduce job is much more efficient than with a sequence of MapReduce jobs conducting pair-wise joins. Work [23] reports the same observation. One dominating reason is that the I/O costs of intermediate results generated by multiple MapReduce jobs may become unacceptable overheads. Work [2] presents a solution for evaluating a multi-way join in one MapReduce job, which only works for the Equi-join case. Since a Theta-join cannot be answered by simply making the join attribute the partition key, the solution proposed in [2] cannot be extended to the case of multi-way Theta-joins. Work [25] demonstrates effective pair-wise Theta-join processing using MapReduce by partitioning a two-dimensional result space formed by the cross-product of two relations. For the case of multi-way join, the result space is a hyper-cube, whose dimensionality is the number of relations involved in the query. Unfortunately, work [25] does not explore how to extend their solution to handle the partitioning in high dimensions. Moreover, the question of whether we should evaluate a complex query with a single MapReduce job or several MapReduce jobs is not clear yet. Therefore, there is no straightforward way to combine the techniques in the existing literature to evaluate a multi-way Theta-join query.
Meanwhile, assume a set of MapReduce jobs is generated for the query evaluation. Then, given a limited number of processing units, it remains a challenge to schedule the execution of the MapReduce jobs such that the query can be answered with the minimum time span. These jobs may have dependency relationships and compete with each other for resources during concurrent execution. Currently, the MapReduce framework requires the number of Reduce tasks as a user-specified input. Thus, after decomposing a multi-way Theta-join query into a number of MapReduce jobs, one challenging issue is how to specify for each job a proper Reduce task number, such that the overall scheduling achieves the minimum execution time span.
Specifically, the problem that we are working on is: given a number of processing units (that can run Map or Reduce tasks), map a multi-way Theta-join to a number of MapReduce jobs and have them executed in a well-scheduled order, such that the total processing time span is minimized. Our solution to this challenging problem includes two core techniques. The first one is, given a multi-way Theta-join query, we examine all the possible decomposition plans and estimate the minimum execution time cost for each plan. In particular, we figure out the rules to properly decompose the original multi-way Theta-join query and study the most efficient solution to evaluate multiple join condition functions using one MapReduce job. The second technique is that, given a limited number of computing units and a pool of possible MapReduce jobs to evaluate the query, we design a novel solution to select jobs to effectively evaluate the query as fast as possible. To evaluate the cost, we develop an I/O- and network-aware cost model to describe the behavior of a MapReduce job.

To the best of our knowledge, this is the first work exploring multi-way Theta-join evaluation using MapReduce. Our main contributions are listed as follows:
- We establish the rules to decompose a multi-way join query. Under our proposed cost model, we can figure out whether a multi-way join query should be evaluated with multiple MapReduce jobs or a single MapReduce job.
- We develop a resource-aware (key, value) pair distribution method to evaluate the chain-typed multi-way Theta-join query with one MapReduce job, which guarantees a minimized volume of data copied over the network, as well as an evenly distributed workload among Reduce tasks.
- We validate our cost model and our solution for multi-way Theta-join queries with extensive experiments.
The rest of the paper is organized as follows. In Section 2, we briefly review the MapReduce computing paradigm and elaborate the application scenario for multi-way Theta-joins. We formally define our problem in Section 3 and present our cost model in Section 4. In Section 5 we explain our query evaluation strategies in detail. We validate our solution in Section 6 with extensive experiments on both real and synthetic data sets. We summarize and compare the most recent closely related work in Section 7 and conclude our work in Section 8.
2. PRELIMINARIES
In this section we briefly present the MapReduce programming model and how it has been applied to evaluate join queries. More importantly, we elaborate the difficulties and limitations of current solutions for multi-way Theta-joins with a concrete example.
2.1 MapReduce & Join Processing
MapReduce provides a simple parallel programming model for data-intensive applications in a shared-nothing environment [12]. It was originally developed for indexing crawled websites and OLAP applications. Generally, a Master node invokes Map tasks on computing nodes that possess the input data, which guarantees the locality of computation. Map tasks transform an input (key, value) pair (k_1, v_1) into n new pairs: (k^2_1, v^2_1), (k^2_2, v^2_2), ..., (k^2_n, v^2_n). The output of Map tasks is then partitioned by the default hashing to different Reduce tasks according to k^2_i. Once the Reduce tasks receive the (key, value) pairs grouped by k^2_i, they perform the user-specified computation on all the values of each key, and write the results back to storage.
Obviously, this (key, value)-based programming model implies a natural implementation of Equi-join. By making the join attribute the key, records that can be joined together are sent to the same Reduce task. Even in the similarity join case [27], as long as the similarity metric is defined, each data record is assigned a key set K = {k_i, ..., k_j}, and the intersection of similar data records' key sets is never empty. Thus, through such a mapping, it is guaranteed that similar data records will be sent to at least one common Reduce task.
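To make the (key, value) mechanics concrete, the following is a minimal sketch of a reduce-side Equi-join in this style. It is our illustration rather than code from the paper; the relation names R and S, the join attribute a, and the driver that simulates the shuffle are all assumptions.

```python
# Sketch of a reduce-side Equi-join R(a, b) |><| S(a, c) on attribute a.
# Records arrive tagged with their source relation; the framework groups
# map outputs by key, so each reduce call sees all records sharing one 'a'.
from collections import defaultdict

def map_record(relation, record):
    # Emit the join attribute as the key; keep the source tag in the value.
    yield record["a"], (relation, record)

def reduce_join(key, tagged_values):
    # Separate the co-grouped records by source relation, then cross them.
    buckets = defaultdict(list)
    for relation, record in tagged_values:
        buckets[relation].append(record)
    for r in buckets["R"]:
        for s in buckets["S"]:
            yield {**r, **s}  # one joined tuple per (r, s) pair

# Simulate the framework: map, shuffle (group by key), reduce.
R = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
S = [{"a": 1, "c": "u"}, {"a": 1, "c": "v"}]
shuffle = defaultdict(list)
for rel, table in (("R", R), ("S", S)):
    for rec in table:
        for k, v in map_record(rel, rec):
            shuffle[k].append(v)
for k, vals in shuffle.items():
    for joined in reduce_join(k, vals):
        print(k, joined)
```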
In fact, this key set method can be applied to any type of join operator. However, to ensure that joinable data records are always assigned overlapping key sets, the cardinality of a data record's K can be very large; in the worst case, it is the total number of Reduce tasks. Since the cardinality of a record's K implies the number of times this record is duplicated among Reduce tasks, the larger the value is, the more computing overhead in terms of I/O and CPU consumption will be introduced. Therefore, the essential optimization goal is to find "the optimal" assignment of K to each data record, such that the join query can be evaluated with minimized data transmission over the network.
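The sketch below illustrates one such key-set assignment for a pair-wise Theta-join, loosely in the spirit of the result-space partitioning of [25]; the grid shape and all names are our assumptions, not the paper's method. Each R record is replicated to one row of grid regions and each S record to one column, so every (r, s) pair meets in exactly one region regardless of the θ condition.

```python
# Key-set assignment for a pair-wise Theta-join by tiling the result space.
# The |R| x |S| cross-product matrix is cut into rows x cols regions; each
# region id is a reduce key. Every (r, s) pair falls in exactly one region.
import random

def key_set_R(rows, cols):
    # An R record is duplicated to all regions in one randomly chosen row.
    i = random.randrange(rows)
    return [i * cols + j for j in range(cols)]

def key_set_S(rows, cols):
    # An S record is duplicated to all regions in one randomly chosen column.
    j = random.randrange(cols)
    return [i * cols + j for i in range(rows)]

# With rows = cols = 3 (9 Reduce tasks), each record is copied 3 times,
# and any R record's key set intersects any S record's key set exactly once.
rows, cols = 3, 3
kr, ks = key_set_R(rows, cols), key_set_S(rows, cols)
assert len(set(kr) & set(ks)) == 1
```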
Another common concern about the MapReduce programming model is its poor immunity to key skew. If (key, value) pairs are highly unevenly distributed among Reduce tasks, the system throughput can degrade significantly. Unfortunately, this can be a common scenario in join operations. If there exist "popular" join attribute values, or the join condition is an inequality, some data records can be joined with a huge number of data records from other relations, which implies significant key skew among the Reduce tasks. Moreover, the fault tolerance property of the MapReduce programming model is guaranteed at the cost of saving all the intermediate results. Thus, the overhead of disk I/O dominates the time efficiency of iterative MapReduce jobs. The same observation has been made in [28].

In summary, efficiently processing join operations using MapReduce is non-trivial. Especially when it comes to multi-way join processing, selecting proper MapReduce jobs and deciding a proper K for each data record make the problem even more challenging.
2.2 Multi-way Theta-Join
Theta-join is the join operation that takes inequality conditions on join attribute values into consideration, namely the join condition function θ ∈ {<, >, =, <>, ≤, ≥}. Multi-way Theta-join is a powerful analytic tool to elaborate complex data correlations. Consider the following application scenario:

Assume we have n cities, {c_1, c_2, ..., c_n}, and all the flight information FI_{i,j} between any two cities c_i and c_j. Given a sequence of cities <c_s, ..., c_t>, and the stay-over time length which must fall in the interval L_i = [l_1, l_2] at each city c_i, find all the possible travel plans.
This is a practical query that could help travelers plan their trips. For illustration purposes, we simply assume FI_{i,j} is a table containing flight No., departure time (dt) and arrival time (at). Then the above request can be easily answered with a multi-way Theta-join operation over FI_{s,s+1}, ..., FI_{t-1,t}, by specifying that the time interval between two successive flights falls into the particular city's stay-over interval requirement. For example, the θ function between FI_{s,s+1} and FI_{s+1,s+2} is FI_{s,s+1}.at + L_{s+1}.l_1 < FI_{s+1,s+2}.dt < FI_{s,s+1}.at + L_{s+1}.l_2.
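As a sanity check of the semantics (not the paper's MapReduce algorithm), the chain Theta-join above can be stated as a naive nested-loop program; the table layout and all names below are our assumptions.

```python
# Naive reference evaluation of the chain Theta-join for the travel-plan
# query (a sketch for clarity, not the paper's MapReduce algorithm).
# FI[i] holds flights from city c_i to c_{i+1} as (flight_no, dt, at);
# L[i] = (l1, l2) is the allowed stay-over interval at city c_i.

def travel_plans(FI, L):
    # Start from every flight on the first leg, then extend leg by leg,
    # keeping only itineraries whose connection times satisfy the theta
    # condition: at + l1 < next.dt < at + l2.
    plans = [[f] for f in FI[0]]
    for leg in range(1, len(FI)):
        l1, l2 = L[leg]
        plans = [p + [f]
                 for p in plans
                 for f in FI[leg]
                 if p[-1][2] + l1 < f[1] < p[-1][2] + l2]
    return plans

# Two legs, times in hours; stay-over at the intermediate city in [1, 3].
FI = [[("F100", 8, 10), ("F101", 9, 11)],
      [("F200", 11.5, 14), ("F201", 15, 17)]]
L = {1: (1, 3)}
print(travel_plans(FI, L))  # F100 connects to F200 (10+1 < 11.5 < 10+3)
```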
To evaluate such queries, a straightforward method is to iteratively conduct pair-wise Theta-joins. However, this evaluation strategy might exclude some more efficient evaluation plans. For instance, instead of using pair-wise joins, we can evaluate multiple join conditions in one task. Therefore, fewer MapReduce jobs are needed, which implies less computation overhead in terms of the disk I/O of intermediate results.
3. PROBLEM DEFINITION
In this work, we mainly focus on the efficient processing of multi-way Theta-joins using MapReduce. Our solution targets MapReduce job identification and scheduling. In other words, we work on the rules to properly decompose the query processing into several MapReduce jobs and have them executed in a well-scheduled fashion, such that the minimum evaluation time span is achieved. In this section, we shall first present the terminology that we use in this paper, and then give the formal definition of the problem. We show that the problem of finding the optimal query evaluation plan is NP-hard.
3.1 Terminology and Statement
For ease of presentation, in the rest of the paper we use the notation "N-join" query to denote a multi-way Theta-join query, and MRJ to denote a MapReduce job. Consider an N-join query Q defined over m relations R_1, ..., R_m and n specified join conditions θ_1, ..., θ_n. As adopted in many other works, like [28], we can present Q as a graph, namely a join graph. For completeness, we define a join graph G_J as follows:

Definition 1. A join graph G_J = <V, E, L> is a connected graph with edge labels, where V = {v | v ∈ {R_1, ..., R_m}}, E = {e | e = (v_i, v_j), ∃θ: R_i θ R_j ∈ Q}, and L = {l | l(e_i) = θ_i}.
Intuitively, G_J is generated by making every relation in Q a vertex and connecting two vertices if there is a join operator between them. The edge is labeled with the corresponding join function θ. To evaluate Q, every θ function, i.e., every edge of G_J, needs to be evaluated. However, to evaluate all the edges in G_J, there is an exponential number of plans, since any arbitrary number of connecting edges can be evaluated in one MRJ. We propose a join-path graph to cover all the possibilities. For the purpose of clear illustration, we first define a no-edge-repeating path between two vertices of G_J.
Definition 2. A no-edge-repeating path p between two vertices v_i and v_j in G_J is a traversing sequence of connecting edges <e_i, ..., e_j> between v_i and v_j in G_J, in which no edge appears more than once.
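For intuition, such paths can be enumerated by a depth-first search that never reuses an edge. This is our own illustration (worst-case exponential, consistent with the hardness analysis in Section 3.2), not the construction used in the paper.

```python
# Enumerate all no-edge-repeating paths between two vertices of a join
# graph, as a plain DFS that may revisit vertices but never reuses an edge.
def no_edge_repeating_paths(edges, src, dst):
    # edges: list of (u, v, theta_label); the graph is undirected.
    adj = {}
    for idx, (u, v, _) in enumerate(edges):
        adj.setdefault(u, []).append((idx, v))
        adj.setdefault(v, []).append((idx, u))
    paths = []

    def dfs(node, used, labels):
        if node == dst and labels:
            paths.append(list(labels))  # record, but keep extending too
        for idx, nxt in adj.get(node, []):
            if idx not in used:
                dfs(nxt, used | {idx}, labels + [edges[idx][2]])

    dfs(src, frozenset(), [])
    return paths

# Tiny example: a triangle R1-R2-R3 with conditions t1, t2, t3.
edges = [("R1", "R2", "t1"), ("R2", "R3", "t2"), ("R3", "R1", "t3")]
for p in no_edge_repeating_paths(edges, "R1", "R2"):
    print(p)  # ['t1'] and ['t3', 't2']: every edge-distinct route
```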
Definition 3. A join-path graph G_JP = <V, E', L', W, S> is a complete weighted graph with edge labels, where each edge is associated with a weight and scheduling information. Specifically, V = {v | v ∈ {R_1, ..., R_m}}, E' = {e' | e' = (v_i, v_j) represents a unique no-edge-repeating path p between v_i and v_j in G_J}, L' = {l' | l'(e') = l'(v_i, v_j) = ∪ l(e), e ∈ p between v_i and v_j}, W = {w | w(e') is the minimal cost to evaluate e'}, and S = {s | s(e') is the scheduling to evaluate e' at the cost of w(e')}.
In the definition, the scheduling information on an edge refers to the user-specified parameters for running an MRJ, such that the job is expected to be accomplished as fast as possible. In this work, we consider the number of Reduce tasks assigned to an MRJ as the scheduling parameter, denoted as RN(MRJ), as it is the only parameter that users need to specify in their programs. The reason we take this parameter into consideration is based on two observations from extensive experiments: 1) it is not guaranteed that the more computing units are involved in Reduce tasks, the sooner an MRJ is accomplished; 2) given limited computing units, there is resource competition among multiple MRJs.

Intuitively, we enumerate all the possible join combinations in G_JP. Note that in the context of join processing, R_i ⋈ R_k ⋈ R_j is the same as R_j ⋈ R_k ⋈ R_i; therefore, G_JP is an undirected graph. We elaborate Definition 3 with the following example. Given a join graph G_J, shown on the left in Fig. 1, a corresponding join-path graph G_JP is generated, which is presented in an adjacency matrix format on the right. The numbers enclosed in braces are the involved θ functions on a path. For instance, in the cell corresponding to R_1 and R_2, {3, 4, 6, 5, 2} indicates a no-edge-repeating path {θ_3, θ_4, θ_6, θ_5, θ_2} between R_1 and R_2. For this particular example, notice that for every node there exists a closed traversing path (or circuit) which covers all the edges exactly once, namely an "Eulerian Circuit". We use E(G_JP) to denote an Eulerian Circuit of G_JP in the figure. Since we only care which edges are involved in a path, any E(G_JP) would be sufficient. Notice that in the figure, edge weights and scheduling information are not presented. As a matter of fact, this information is incrementally computed during the generation of G_JP, which will be illustrated in a later section.
[Figure 1: Example join graph G_J and its corresponding join-path graph G_JP, presented as an adjacency matrix. G_J (left) has vertices R1..R5 and edges labeled θ_1..θ_6; each matrix cell lists the θ-function sets of the no-edge-repeating paths between the corresponding pair of relations, e.g., the cell for (R1, R2) contains {1}, {3, 2}, {1, 2, 3}, {3, 4, 6, 5, 2}, and so on.]
According to the definition of G_JP, any edge e' in G_JP is a collection of connecting edges in G_J. Thus, e' in fact implies a subgraph of G_J. As we use one MRJ to evaluate e', denoted as MRJ(e'), G_JP's edge set represents all the possible MRJs that can be employed to evaluate the original query Q. Let T denote a set of MRJs that are selected from G_JP's edge set. Intuitively, if the MRJs in T cover all the join conditions of the original query, we can answer the query by executing all these MRJs. Formally, we define that T is "sufficient" as follows:

Definition 4. T, a collection of MRJs, is sufficient to evaluate Q iff ∪ e'_i = G_J.E, where MRJ(e'_i) ∈ T.
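Definition 4 translates directly into a coverage check. The sketch below is illustrative; representing each MRJ candidate as the set of G_J edge ids it covers is our assumption about data layout.

```python
# Definition 4 as a direct check: a set of MRJ candidates T is sufficient
# iff the union of the G_J edges its members cover equals the full edge
# set of the join graph.
def is_sufficient(T, gj_edges):
    # T: list of MRJ candidates, each a frozenset of G_J edge ids it covers.
    # gj_edges: set of all edge ids (join conditions) in G_J.
    covered = set().union(*T) if T else set()
    return covered == set(gj_edges)

# The join graph of Fig. 1 has conditions {1..6}; two candidate jobs that
# together cover every edge form a sufficient T.
gj_edges = {1, 2, 3, 4, 5, 6}
T = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
print(is_sufficient(T, gj_edges))  # True
```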
Since it is trivial to check whether T is sufficient, for the rest of this work, we only consider the case where T is sufficient. Thus, given T, we define its execution plan P as a specific execution sequence of MRJs which minimizes the time span of using T to evaluate the original query Q. Formally, we can define our problem as follows:

Problem Definition: Given an N-join query Q and k_P processing units, a join-path graph G_JP according to Q's join graph G_J is built. We want to select a collection of edges from G_JP that correspondingly form a set of MRJs, denoted as T_opt, such that there exists an execution plan P of T_opt which minimizes the query evaluation time.

Obviously, there are many different choices of T to evaluate Q. Moreover, given T and limited processing units, different execution plans yield different evaluation time spans. In fact, the determination of P is non-trivial; we give a detailed analysis of the hardness of our problem in the next subsection. As we shall elaborate later, given T and k_P available processing units, we adopt an approximation method to determine P in linear time.
3.2 Problem Hardness
According to the problem definition, we need two steps to find T_opt: 1) generate G_JP from G_J; 2) select MRJs for T_opt. Neither of these two steps is easy to solve.

For the first step, to construct G_JP, we need to enumerate all the no-edge-repeating paths between any pair of vertices in G_J. Assume G_J has an "Eulerian trail" [16], which is a way to traverse the graph with every edge visited exactly once; then for any pair of vertices v_i and v_j, any no-edge-repeating path between them is a "sub-path" of an Eulerian trail. If we know all the no-edge-repeating paths between any pair of vertices, we can enumerate all the Eulerian trails in polynomial time. Therefore, constructing G_JP is at least as hard as enumerating all the Eulerian trails of a given graph, which is known to be #P-complete [6]. Moreover, we find that even if G_J does not have an Eulerian trail, the problem complexity is not reduced at all, as we elaborate in the proof of the following theorem.

Theorem 1. Generating G_JP from a given G_J is a #P-complete problem.
Proof. If G_J has the Eulerian trail, constructing G_JP is #P-complete (see the discussion above).

On the contrary, if G_J does not have the Eulerian trail, it implies that there are r vertices having odd degrees, where r > 2. Now consider that we add one virtual vertex and connect it with r-1 of the vertices of odd degree. The new graph must have an Eulerian trail. If we can easily construct the join-path graph of the new graph, the original graph's G_JP can be computed in polynomial time. We elaborate with the following example, as shown in Fig. 2. Assume v_s is added to the original G_J; then, by computing the join-path graph of the new graph, we know all the no-edge-repeating paths between v_i and v_j. However, a no-edge-repeating path between v_i and v_j in the original graph cannot have v_s involved. By simply removing all the enumerated paths that go through v_s, we can obtain the G_JP of the original G_J. Thus, the dominating cost of constructing G_JP is still the enumeration of all Eulerian trails. Therefore, this problem is #P-complete.
[Figure 2: Adding virtual vertex v_s to G_J (vertices v_i, v_p, v_q, v_j shown).]
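The degree argument in the proof is easy to check mechanically. The helper below is our illustration of that argument (not code from the paper): it counts odd-degree vertices and applies the virtual-vertex fix-up.

```python
# Check Eulerian-trail existence by the degree condition and apply the
# proof's fix-up: connect a virtual vertex v_s to r-1 odd-degree vertices
# so that the augmented graph has an Eulerian trail.
from collections import Counter

def odd_vertices(edges):
    # edges: list of (u, v, theta_label) in an undirected graph.
    deg = Counter()
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    return [v for v, d in deg.items() if d % 2 == 1]

def add_virtual_vertex(edges):
    odd = odd_vertices(edges)
    if len(odd) <= 2:
        return edges  # an Eulerian trail already exists (if connected)
    # Connect v_s to r-1 of the r odd-degree vertices; the augmented graph
    # then has exactly two odd vertices (v_s and the remaining one).
    return edges + [("v_s", u, None) for u in odd[:-1]]

# K4: every vertex has degree 3, so r = 4 and no Eulerian trail exists.
edges = [("a", "b", 1), ("a", "c", 2), ("a", "d", 3),
         ("b", "c", 4), ("b", "d", 5), ("c", "d", 6)]
print(odd_vertices(edges))                      # ['a', 'b', 'c', 'd']
print(odd_vertices(add_virtual_vertex(edges)))  # ['d', 'v_s']: trail exists
```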
Although it is difficult to compute the exact G_JP, we find that a subgraph of G_JP which contains all the vertices, denoted as G'_JP, can be sufficient to guarantee optimal query evaluation efficiency. We take the following principle into consideration: given the same number of processing units, if it takes longer to evaluate R_i ⋈ R_j ⋈ R_k with one MRJ than the total time cost of evaluating R_i ⋈ R_j and R_j ⋈ R_k separately and merging the results, we do not take R_i ⋈ R_j ⋈ R_k ⋈ R_s into consideration. By following this principle, we can avoid enumerating all the possible no-edge-repeating paths between any pair of vertices. As a matter of fact, we can obtain such a sufficient G'_JP in polynomial time.
The second step of our solution is to select T_opt. Assume the G'_JP computed in the first step provides a collection of edges; accordingly, we have a collection of MRJ candidates to evaluate the query. Although each edge in G'_JP is associated with a weight denoting the minimum time cost to evaluate all the join conditions contained in this edge, it is just an estimated time span under the condition that there are enough processing units. However, when a T is chosen and the number of processing units is limited, the time cost of using T to answer Q needs to be re-estimated. Assume we can find the time cost estimation of T, denoted as C(T); then the problem is to find the optimal T_opt among all possible Ts, which has the minimum time cost. Apparently, this is a variant of the classic set cover problem, which is known to be NP-hard [10]. Therefore, many heuristics and approximation algorithms can be adopted to solve the selection problem.
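As one example of such a heuristic (our illustration, not the paper's selection algorithm), the classic greedy rule for weighted set cover repeatedly picks the candidate MRJ with the lowest estimated cost per newly covered join condition:

```python
# Greedy weighted set cover over MRJ candidates (an illustrative heuristic,
# not the paper's algorithm). Each candidate covers a set of G_J edges
# (join conditions) and carries an estimated time cost.
def greedy_select(candidates, gj_edges):
    # candidates: list of (covered_edge_ids, estimated_cost) tuples.
    uncovered, T = set(gj_edges), []
    while uncovered:
        # Pick the candidate with the best cost per newly covered edge.
        best = min(
            (c for c in candidates if c[0] & uncovered),
            key=lambda c: c[1] / len(c[0] & uncovered),
            default=None)
        if best is None:
            return None  # no sufficient T exists among the candidates
        T.append(best)
        uncovered -= best[0]
    return T

candidates = [({1, 2, 3}, 40.0), ({3, 4}, 15.0), ({4, 5, 6}, 30.0),
              ({1, 2}, 35.0)]
print(greedy_select(candidates, {1, 2, 3, 4, 5, 6}))
```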
As clearly indicated in the problem definition, the solution lies in constructing G_JP and smartly selecting T based on the cost estimation of a group of MRJs. Therefore, in the rest of the paper, we shall first elaborate our cost models for a single MRJ and a group of MRJs, and then present our detailed solution for N-join query evaluation.
4. COST MODEL
To highlight our observations on how much the overlapping of computation and network cost affects the execution of an MRJ, in this section we present a generalized analytical study of the execution time of both a single MRJ and a group of MRJs. In the context of G_JP construction and T selection, we study the estimation of w(e'), where e' ∈ G_JP.E', and of C(T), which is the time cost to evaluate T.
4.1 Estimating w(e'): Model for a Single MRJ
Since our target is to find an optimal join plan, we only consider the processing cost of join operations with MRJs. Generally, most of the CPU time for join processing is spent on simple comparison and counting; thus, system I/O cost dominates the total execution time. For MapReduce jobs, the heavy cost of large-scale sequential disk scans and the frequent I/O of intermediate results dominate the execution time. Therefore, we shall build a model for an MRJ's execution time based on an analysis of I/O and network cost.

The general MapReduce computing framework involves three phases of data processing: Map, Reduce, and the data copying from Map tasks to Reduce tasks, as shown in Fig. 3. In the figure, each "M" stands for a Map task, each "CP" stands for one phase of Map output copying over the network, and each "R" stands for a Reduce task. Since each Map task is based on a data block, we assume that the unit processing cost for each Map task is t_M. Moreover, since the entire input data may not be loaded into the system memory within one round [12][3], we assume these Map tasks are performed round by round (we have the same observation in practice). However, the size of a Reduce task is subject to the (key, value) distribution. As shown in Fig. 3, the makespan of an MRJ is dominated by the most time-consuming Reduce task. Therefore, we only consider the Reduce task with the largest volume of input in the following analysis. Assume the total input size of an MRJ is S_I, the total intermediate data copied from Map to Reduce is of size S_CP, and the numbers of Map tasks and Reduce tasks are m and n, respectively. In addition, as a general assumption, S_I is considered to be evenly partitioned among the m Map tasks [24]. Let J_M, J_R and J_CP denote the total time cost of the three phases respectively, and let T be the total execution time of an MRJ. Then T ≤ J_M + J_CP + J_R holds due to the overlapping between J_M and J_CP.
[Figure 3: MapReduce workflow. Two cases are shown; each M is a Map task, each CP a copy phase, and each R a Reduce task, with per-task costs t_M and t_CP and phase totals J_M, J_CP and J_R.]
Each Map task performs disk I/O and data processing. Since disk I/O is the dominant cost, we can estimate the time cost of a single Map task based on disk I/O. Disk I/O contains two parts: one is sequential reading, the other is data spilling. Then the time cost for a single Map task, t_M, is

    t_M = (C_1 + p × α) × S_I / m    (1)

where C_1 is a constant factor regarding disk I/O capability, and p is a random variable denoting the cost of spilling intermediate data. For a given system configuration, p is subject to the intermediate data size; it increases as the spilled data size grows. α denotes the output ratio of a Map task, which is query specific and can be computed with selectivity estimation. Assume m' is the current number of Map tasks running in parallel in the system; then J_M can be computed as follows:

    J_M = t_M × m / m'    (2)

For J_CP, let t_CP be the time cost of copying the output of a single Map task to n Reduce tasks; it includes the cost of data copying over the network as well as the overhead of serving network protocols. t_CP is calculated with the following formula:

    t_CP = C_2 × (α × S_I) / (n × m) + q × n    (3)

where C_2 is a constant denoting the efficiency of data copying over the network, and q is a random variable representing the cost of a Map task serving n connections from n Reduce tasks. Intuitively, there is a rapid growth of q as n gets larger. Thus, J_CP can be computed as follows:

    J_CP = (m / m') × t_CP    (4)

For J_R, intuitively it is dominated by the Reduce task which has the biggest input size. We assume that the key distribution in the input file is random; thus, let S_r^i denote the input size of Reduce task i. Then, according to the Central Limit Theorem [20], we can assume that for i = 1, ..., n, S_r^i follows a normal distribution N(μ, σ), where μ is determined by α × S_I and σ is subject to data set properties, which can be learned from historical query logs. Thus, by employing the "rule of three sigmas" [20], we take S_r = α × S_I × n^{-1} + 3σ as the biggest input size to a Reduce task; then

    J_R = (p + β × C_1) × S_r    (5)
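Putting equations (1) through (5) together, the per-job time estimate can be computed directly. The sketch below is a plain transcription under the stated model; the constants C_1, C_2, p, q, β and σ are placeholders that would have to be fitted to a concrete system, and the numbers in the driver are made up, not measured values.

```python
# Equations (1)-(5) as a direct calculation. All constants are placeholder
# values for illustration; m_par stands for m' (Map tasks run in parallel).
def mrj_time(S_I, m, n, m_par, alpha,
             C1=1.0, C2=1.0, p=0.2, q=0.01, beta=1.0, sigma=0.0):
    t_M = (C1 + p * alpha) * S_I / m                 # Eq. (1)
    J_M = t_M * (m / m_par)                          # Eq. (2)
    t_CP = C2 * (alpha * S_I) / (n * m) + q * n      # Eq. (3)
    J_CP = (m / m_par) * t_CP                        # Eq. (4)
    S_r = alpha * S_I / n + 3 * sigma                # largest Reduce input
    J_R = (p + beta * C1) * S_r                      # Eq. (5)
    return J_M + J_CP + J_R   # upper bound on T (Map and copy overlap)

# 64 GB input, 512 Map tasks (64 in parallel), varying Reduce task numbers;
# the q * n term in Eq. (3) is why more Reduce tasks is not always faster.
for n in (4, 16, 64):
    print(n, round(mrj_time(S_I=64_000, m=512, n=n, m_par=64, alpha=0.5), 2))
```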