Efficient Multi-way Theta-Join Processing Using MapReduce

Xiaofei Zhang, HKUST, Hong Kong (zhangxf@cse.ust.hk)
Lei Chen, HKUST, Hong Kong (leichen@cse.ust.hk)
Min Wang, HP Labs China, Beijing, China (min.wang6@hp.com)
ABSTRACT
Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely employed in real practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there has been some work using the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well-scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvement in join processing efficiency.
1. INTRODUCTION
Data analytical queries in real practice commonly involve multi-way join operations. The operators involved in a multi-way join query are more than just Equi-join. Instead, the join condition can be defined as a binary function θ that belongs to {<, ≤, =, ≥, >, <>}, also known as Theta-join. Compared with Equi-join, it is more general and expressive in describing relations and surprisingly handy in data analytic queries. Thus, efficient processing of multi-way Theta-join queries plays a critical role in system performance. In fact, evaluating multi-way Theta-joins has always been a challenging problem throughout the development of database technology. Early works, e.g., [8][26][22], have elaborated the complexity of the problem and presented their evaluation strategies. However, their solutions do not scale to process multi-way Theta-joins over data of tremendous volume. For instance, as reported by Facebook [5] and Google [11], the underlying data volume is of hundreds of terabytes or even petabytes. In such scenarios, solutions from traditional distributed or parallel databases are infeasible due to unsatisfactory scalability and poor fault tolerance.

On the contrary, the (key, value)-based MapReduce programming model substantially guarantees great scalability and a strong fault tolerance property. It has emerged as the most popular processing paradigm in a shared-nothing computing environment. Recently, research efforts towards efficient and effective analytic processing over immense data have been made within the MapReduce framework. Currently, the database community mainly focuses on two issues. First, the transformation from a certain relational algebra operator, like similarity join, to its (key, value)-based parallel implementation. Second, the tuning or re-design of the transformation function such that the MapReduce job is executed more efficiently in terms of less time cost or computing resource consumption. Although various relational operators, like pair-wise Theta-join, fuzzy join, aggregation operators, etc., have been evaluated and implemented using MapReduce, there is little effort exploring the efficient processing of multi-way join queries, especially the more general computation, namely Theta-join, using MapReduce. The reason is that the problem involves more than just a relational operator to (key, value) pair transformation and its tuning; there are other critical issues that need to be addressed: 1) How many MapReduce jobs should we employ to evaluate the query? 2) What is each MapReduce job responsible for? 3) How should multiple MapReduce jobs be scheduled?
To address the problem, there are two challenging issues that need to be resolved. Firstly, the number of available computing units is in fact limited, which is often neglected when mapping a task to a set of MapReduce jobs. Although the pay-as-you-go policy of Cloud computing platforms could promise as many computing resources as required, once a computing environment is established, the allowed maximum number of concurrent Map and Reduce tasks is fixed according to the system configuration. Even taking the auto-scaling feature of the Amazon EC2 platform [18] into consideration, the maximum number of involved computing units is pre-determined by user-defined profiles.
Therefore, with the user-specified Reduce task number, a multi-way Theta-join query is processed with only a limited number of available computing units.
The second challenge is that the decomposition of a multi-way Theta-join query into a number of MapReduce tasks is non-trivial. Work [28] targets multi-way Equi-join processing. It decomposes a query into several MapReduce jobs and schedules the execution based on a specific cost model. However, it only considers the pair-wise join as the basic scheduling unit. In other words, it follows the traditional multi-way join processing methodology, which evaluates the query with a sequence of pair-wise joins. This methodology excludes the possible optimization opportunity of evaluating a multi-way join in one MapReduce job. Our observation is that, under certain conditions, evaluating a multi-way join with one MapReduce job is much more efficient than with a sequence of MapReduce jobs conducting pair-wise joins. Work [23] reports the same observation. One dominating reason is that the I/O costs of intermediate results generated by multiple MapReduce jobs may become unacceptable overheads. Work [2] presents a solution for evaluating a multi-way join in one MapReduce job, which only works for the Equi-join case. Since a Theta-join cannot be answered by simply making the join attribute the partition key, the solution proposed in [2] cannot be extended to the case of multi-way Theta-joins. Work [25] demonstrates effective pair-wise Theta-join processing using MapReduce by partitioning a two-dimensional result space formed by the cross-product of two relations. For the case of multi-way join, the result space is a hyper-cube, whose dimensionality is the number of relations involved in the query. Unfortunately, work [25] does not explore how to extend their solution to handle the partitioning in high dimensions. Moreover, the question of whether we should evaluate a complex query with a single MapReduce job or several MapReduce jobs is not clear yet. Therefore, there is no straightforward way to combine the techniques in the existing literature to evaluate a multi-way Theta-join query.
Meanwhile, assume a set of MapReduce jobs is generated for the query evaluation. Then, given a limited number of processing units, it remains a challenge to schedule the execution of the MapReduce jobs such that the query can be answered with the minimum time span. These jobs may have dependency relationships and compete with each other for resources during concurrent execution. Currently, the MapReduce framework requires the number of Reduce tasks as a user-specified input. Thus, after decomposing a multi-way Theta-join query into a number of MapReduce jobs, one challenging issue is how to specify for each job a proper Reduce task number, such that the overall scheduling achieves the minimum execution time span.
Specifically, the problem that we are working on is: given a number of processing units (that can run Map or Reduce tasks), map a multi-way Theta-join to a number of MapReduce jobs and have them executed in a well-scheduled order, such that the total processing time span is minimized. Our solution to this challenging problem includes two core techniques. The first one is, given a multi-way Theta-join query, we examine all the possible decomposition plans and estimate the minimum execution time cost for each plan. In particular, we figure out the rules to properly decompose the original multi-way Theta-join query and study the most efficient solution to evaluate multiple join condition functions using one MapReduce job. The second technique is that, given a limited number of computing units and a pool of possible MapReduce jobs to evaluate the query, we design a novel solution to select jobs to effectively evaluate the query as fast as possible. To evaluate the cost, we develop an I/O- and network-aware cost model to describe the behavior of a MapReduce job.

To the best of our knowledge, this is the first work exploring multi-way Theta-join evaluation using MapReduce. Our main contributions are listed as follows:
- We establish the rules to decompose a multi-way join query. Under our proposed cost model, we can figure out whether a multi-way join query should be evaluated with multiple MapReduce jobs or a single MapReduce job.
- We develop a resource-aware (key, value) pair distribution method to evaluate the chain-typed multi-way Theta-join query with one MapReduce job, which guarantees a minimized volume of data copied over the network, as well as an evenly distributed workload among Reduce tasks.
- We validate our cost model and our solution for multi-way Theta-join queries with extensive experiments.
The rest of the paper is organized as follows. In Section 2, we briefly review the MapReduce computing paradigm and elaborate the application scenario for multi-way Theta-joins. We formally define our problem in Section 3 and present our cost model in Section 4. In Section 5 we explain our query evaluation strategies in detail. We validate our solution in Section 6 with extensive experiments on both real and synthetic data sets. We summarize and compare the most recent closely related work in Section 7 and conclude our work in Section 8.
2. PRELIMINARIES
In this section we briefly present the MapReduce programming model and how it has been applied to evaluate join queries. More importantly, we elaborate the difficulties and limitations of current solutions for multi-way Theta-joins with a concrete example.
2.1 MapReduce & Join Processing
MapReduce provides a simple parallel programming model for data-intensive applications in a shared-nothing environment [12]. It was originally developed for indexing crawled websites and OLAP applications. Generally, a Master node invokes Map tasks on computing nodes that possess the input data, which guarantees the locality of computation. Map tasks transform an input (key, value) pair (k_1, v_1) into n new pairs: (k^2_1, v^2_1), (k^2_2, v^2_2), ..., (k^2_n, v^2_n). The output of Map tasks is then partitioned by the default hashing to different Reduce tasks according to k^2_i. Once the Reduce tasks receive the (key, value) pairs grouped by k^2_i, they perform the user-specified computation on all the values of each key, and write the results back to storage.
Obviously, this (key, value)-based programming model implies a natural implementation of Equi-join. By making the join attribute the key, records that can be joined together are sent to the same Reduce task. Even in the similarity join case [27], as long as the similarity metric is defined, each data record is assigned a key set K = {k_i, ..., k_j}, and the intersection of similar data records' key sets is never empty. Thus, through such a mapping, it is guaranteed that similar data records will be sent to at least one common Reduce task.
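To make the (key, value) mechanics concrete, the following is a minimal sketch of a reduce-side Equi-join in this style. It is our illustration rather than code from the paper; the relation names R and S, the join attribute a, and the driver that simulates the shuffle are all assumptions.

```python
# Sketch of a reduce-side Equi-join R(a, b) |><| S(a, c) on attribute a.
# Records arrive tagged with their source relation; the framework groups
# map outputs by key, so each reduce call sees all records sharing one 'a'.
from collections import defaultdict

def map_record(relation, record):
    # Emit the join attribute as the key; keep the source tag in the value.
    yield record["a"], (relation, record)

def reduce_join(key, tagged_values):
    # Separate the co-grouped records by source relation, then cross them.
    buckets = defaultdict(list)
    for relation, record in tagged_values:
        buckets[relation].append(record)
    for r in buckets["R"]:
        for s in buckets["S"]:
            yield {**r, **s}  # one joined tuple per (r, s) pair

# Simulate the framework: map, shuffle (group by key), reduce.
R = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
S = [{"a": 1, "c": "u"}, {"a": 1, "c": "v"}]
shuffle = defaultdict(list)
for rel, table in (("R", R), ("S", S)):
    for rec in table:
        for k, v in map_record(rel, rec):
            shuffle[k].append(v)
for k, vals in shuffle.items():
    for joined in reduce_join(k, vals):
        print(k, joined)
```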
In fact, this key set method can be applied to any type of join operator. However, to ensure that joinable data records are always assigned overlapping key sets, the cardinality of a data record's K can be very large; in the worst case, it is the total number of Reduce tasks. Since the cardinality of a record's K implies the number of times this record is duplicated among Reduce tasks, the larger the value is, the more computing overhead in terms of I/O and CPU consumption will be introduced. Therefore, the essential optimization goal is to find "the optimal" assignment of K to each data record, such that the join query can be evaluated with minimized data transmission over the network.
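The sketch below illustrates one such key-set assignment for a pair-wise Theta-join, loosely in the spirit of the result-space partitioning of [25]; the grid shape and all names are our assumptions, not the paper's method. Each R record is replicated to one row of grid regions and each S record to one column, so every (r, s) pair meets in exactly one region regardless of the θ condition.

```python
# Key-set assignment for a pair-wise Theta-join by tiling the result space.
# The |R| x |S| cross-product matrix is cut into rows x cols regions; each
# region id is a reduce key. Every (r, s) pair falls in exactly one region.
import random

def key_set_R(rows, cols):
    # An R record is duplicated to all regions in one randomly chosen row.
    i = random.randrange(rows)
    return [i * cols + j for j in range(cols)]

def key_set_S(rows, cols):
    # An S record is duplicated to all regions in one randomly chosen column.
    j = random.randrange(cols)
    return [i * cols + j for i in range(rows)]

# With rows = cols = 3 (9 Reduce tasks), each record is copied 3 times,
# and any R record's key set intersects any S record's key set exactly once.
rows, cols = 3, 3
kr, ks = key_set_R(rows, cols), key_set_S(rows, cols)
assert len(set(kr) & set(ks)) == 1
```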
Another common concern about the MapReduce programming model is its poor immunity to key skew. If (key, value) pairs are highly unevenly distributed among Reduce tasks, the system throughput can degrade significantly. Unfortunately, this can be a common scenario in join operations. If there exist "popular" join attribute values, or the join condition is an inequality, some data records can be joined with a huge number of data records from other relations, which implies significant key skew among the Reduce tasks. Moreover, the fault tolerance property of the MapReduce programming model is guaranteed at the cost of saving all the intermediate results. Thus, the overhead of disk I/O dominates the time efficiency of iterative MapReduce jobs. The same observation has been made in [28].

In summary, efficiently processing join operations using MapReduce is non-trivial. Especially when it comes to multi-way join processing, selecting proper MapReduce jobs and deciding a proper K for each data record make the problem even more challenging.
2.2 Multi-way Theta-Join
Theta-join is the join operation that takes inequality conditions on join attribute values into consideration, namely the join condition function θ ∈ {<, >, =, <>, ≤, ≥}. Multi-way Theta-join is a powerful analytic tool to elaborate complex data correlations. Consider the following application scenario:

Assume we have n cities, {c_1, c_2, ..., c_n}, and all the flight information FI_{i,j} between any two cities c_i and c_j. Given a sequence of cities <c_s, ..., c_t>, and the stay-over time length which must fall in the interval L_i = [l_1, l_2] at each city c_i, find all the possible travel plans.
This is a practical query that could help travelers plan their trips. For illustration purposes, we simply assume FI_{i,j} is a table containing flight No., departure time (dt) and arrival time (at). Then the above request can be easily answered with a multi-way Theta-join operation over FI_{s,s+1}, ..., FI_{t-1,t}, by specifying that the time interval between two successive flights falls into the particular city's stay-over interval requirement. For example, the θ function between FI_{s,s+1} and FI_{s+1,s+2} is FI_{s,s+1}.at + L_{s+1}.l_1 < FI_{s+1,s+2}.dt < FI_{s,s+1}.at + L_{s+1}.l_2.
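As a sanity check of the semantics (not the paper's MapReduce algorithm), the chain Theta-join above can be stated as a naive nested-loop program; the table layout and all names below are our assumptions.

```python
# Naive reference evaluation of the chain Theta-join for the travel-plan
# query (a sketch for clarity, not the paper's MapReduce algorithm).
# FI[i] holds flights from city c_i to c_{i+1} as (flight_no, dt, at);
# L[i] = (l1, l2) is the allowed stay-over interval at city c_i.

def travel_plans(FI, L):
    # Start from every flight on the first leg, then extend leg by leg,
    # keeping only itineraries whose connection times satisfy the theta
    # condition: at + l1 < next.dt < at + l2.
    plans = [[f] for f in FI[0]]
    for leg in range(1, len(FI)):
        l1, l2 = L[leg]
        plans = [p + [f]
                 for p in plans
                 for f in FI[leg]
                 if p[-1][2] + l1 < f[1] < p[-1][2] + l2]
    return plans

# Two legs, times in hours; stay-over at the intermediate city in [1, 3].
FI = [[("F100", 8, 10), ("F101", 9, 11)],
      [("F200", 11.5, 14), ("F201", 15, 17)]]
L = {1: (1, 3)}
print(travel_plans(FI, L))  # F100 connects to F200 (10+1 < 11.5 < 10+3)
```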
To evaluate such queries, a straightforward method is to iteratively conduct pair-wise Theta-joins. However, this evaluation strategy might exclude some more efficient evaluation plans. For instance, instead of using pair-wise joins, we can evaluate multiple join conditions in one task. Therefore, fewer MapReduce jobs are needed, which implies less computation overhead in terms of the disk I/O of intermediate results.
3. PROBLEM DEFINITION
In this work, we mainly focus on the efficient processing of multi-way Theta-joins using MapReduce. Our solution targets MapReduce job identification and scheduling. In other words, we work on the rules to properly decompose the query processing into several MapReduce jobs and have them executed in a well-scheduled fashion, such that the minimum evaluation time span is achieved. In this section, we shall first present the terminology that we use in this paper, and then give the formal definition of the problem. We show that the problem of finding the optimal query evaluation plan is NP-hard.
3.1 Terminology and Statement
For ease of presentation, in the rest of the paper we use the notation "N-join" query to denote a multi-way Theta-join query, and MRJ to denote a MapReduce job. Consider an N-join query Q defined over m relations R_1, ..., R_m and n specified join conditions θ_1, ..., θ_n. As adopted in many other works, like [28], we can present Q as a graph, namely a join graph. For completeness, we define a join graph G_J as follows:

Definition 1. A join graph G_J = <V, E, L> is a connected graph with edge labels, where V = {v | v ∈ {R_1, ..., R_m}}, E = {e | e = (v_i, v_j), ∃θ: R_i θ R_j ∈ Q}, and L = {l | l(e_i) = θ_i}.
Intuitively, G_J is generated by making every relation in Q a vertex and connecting two vertices if there is a join operator between them. The edge is labeled with the corresponding join function θ. To evaluate Q, every θ function, i.e., every edge of G_J, needs to be evaluated. However, to evaluate all the edges in G_J, there is an exponential number of plans, since any arbitrary number of connecting edges can be evaluated in one MRJ. We propose a join-path graph to cover all the possibilities. For the purpose of clear illustration, we first define a no-edge-repeating path between two vertices of G_J.
Definition 2. A no-edge-repeating path p between two vertices v_i and v_j in G_J is a traversing sequence of connecting edges <e_i, ..., e_j> between v_i and v_j in G_J, in which no edge appears more than once.
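For intuition, such paths can be enumerated by a depth-first search that never reuses an edge. This is our own illustration (worst-case exponential, consistent with the hardness analysis in Section 3.2), not the construction used in the paper.

```python
# Enumerate all no-edge-repeating paths between two vertices of a join
# graph, as a plain DFS that may revisit vertices but never reuses an edge.
def no_edge_repeating_paths(edges, src, dst):
    # edges: list of (u, v, theta_label); the graph is undirected.
    adj = {}
    for idx, (u, v, _) in enumerate(edges):
        adj.setdefault(u, []).append((idx, v))
        adj.setdefault(v, []).append((idx, u))
    paths = []

    def dfs(node, used, labels):
        if node == dst and labels:
            paths.append(list(labels))  # record, but keep extending too
        for idx, nxt in adj.get(node, []):
            if idx not in used:
                dfs(nxt, used | {idx}, labels + [edges[idx][2]])

    dfs(src, frozenset(), [])
    return paths

# Tiny example: a triangle R1-R2-R3 with conditions t1, t2, t3.
edges = [("R1", "R2", "t1"), ("R2", "R3", "t2"), ("R3", "R1", "t3")]
for p in no_edge_repeating_paths(edges, "R1", "R2"):
    print(p)  # ['t1'] and ['t3', 't2']: every edge-distinct route
```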
Definition 3. A join-path graph G_JP = <V, E', L', W, S> is a complete weighted graph with edge labels, where each edge is associated with a weight and scheduling information. Specifically, V = {v | v ∈ {R_1, ..., R_m}}, E' = {e' | e' = (v_i, v_j) represents a unique no-edge-repeating path p between v_i and v_j in G_J}, L' = {l' | l'(e') = l'(v_i, v_j) = ∪ l(e), e ∈ p between v_i and v_j}, W = {w | w(e') is the minimal cost to evaluate e'}, and S = {s | s(e') is the scheduling to evaluate e' at the cost of w(e')}.
In the definition, the scheduling information on an edge refers to the user-specified parameters for running an MRJ, such that the job is expected to be accomplished as fast as possible. In this work, we consider the number of Reduce tasks assigned to an MRJ as the scheduling parameter, denoted as RN(MRJ), as it is the only parameter that users need to specify in their programs. The reason we take this parameter into consideration is based on two observations from extensive experiments: 1) it is not guaranteed that the more computing units are involved in Reduce tasks, the sooner an MRJ is accomplished; 2) given limited computing units, there is resource competition among multiple MRJs.

Intuitively, we enumerate all the possible join combinations in G_JP. Note that in the context of join processing, R_i ⋈ R_k ⋈ R_j is the same as R_j ⋈ R_k ⋈ R_i; therefore, G_JP is an undirected graph. We elaborate Definition 3 with the following example. Given a join graph G_J, shown on the left in Fig. 1, a corresponding join-path graph G_JP is generated, which is presented in an adjacency matrix format on the right. The numbers enclosed in braces are the involved θ functions on a path. For instance, in the cell corresponding to R_1 and R_2, {3, 4, 6, 5, 2} indicates a no-edge-repeating path {θ_3, θ_4, θ_6, θ_5, θ_2} between R_1 and R_2. For this particular example, notice that for every node there exists a closed traversing path (or circuit) which covers all the edges exactly once, namely an "Eulerian Circuit". We use E(G_JP) to denote an Eulerian Circuit of G_JP in the figure. Since we only care which edges are involved in a path, any E(G_JP) would be sufficient. Notice that in the figure, edge weights and scheduling information are not presented. As a matter of fact, this information is incrementally computed during the generation of G_JP, which will be illustrated in a later section.
[Figure 1: Example join graph G_J and its corresponding join-path graph G_JP, presented as an adjacency matrix. G_J (left) has vertices R1..R5 and edges labeled θ_1..θ_6; each matrix cell lists the θ-function sets of the no-edge-repeating paths between the corresponding pair of relations, e.g., the cell for (R1, R2) contains {1}, {3, 2}, {1, 2, 3}, {3, 4, 6, 5, 2}, and so on.]
According to the definition of G_JP, any edge e' in G_JP is a collection of connecting edges in G_J. Thus, e' in fact implies a subgraph of G_J. As we use one MRJ to evaluate e', denoted as MRJ(e'), G_JP's edge set represents all the possible MRJs that can be employed to evaluate the original query Q. Let T denote a set of MRJs that are selected from G_JP's edge set. Intuitively, if the MRJs in T cover all the join conditions of the original query, we can answer the query by executing all these MRJs. Formally, we define that T is "sufficient" as follows:

Definition 4. T, a collection of MRJs, is sufficient to evaluate Q iff ∪ e'_i = G_J.E, where MRJ(e'_i) ∈ T.
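Definition 4 translates directly into a coverage check. The sketch below is illustrative; representing each MRJ candidate as the set of G_J edge ids it covers is our assumption about data layout.

```python
# Definition 4 as a direct check: a set of MRJ candidates T is sufficient
# iff the union of the G_J edges its members cover equals the full edge
# set of the join graph.
def is_sufficient(T, gj_edges):
    # T: list of MRJ candidates, each a frozenset of G_J edge ids it covers.
    # gj_edges: set of all edge ids (join conditions) in G_J.
    covered = set().union(*T) if T else set()
    return covered == set(gj_edges)

# The join graph of Fig. 1 has conditions {1..6}; two candidate jobs that
# together cover every edge form a sufficient T.
gj_edges = {1, 2, 3, 4, 5, 6}
T = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]
print(is_sufficient(T, gj_edges))  # True
```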
Since it is trivial to check whether T is sufficient, for the rest of this work, we only consider the case where T is sufficient. Thus, given T, we define its execution plan P as a specific execution sequence of MRJs which minimizes the time span of using T to evaluate the original query Q. Formally, we can define our problem as follows:

Problem Definition: Given an N-join query Q and k_P processing units, a join-path graph G_JP according to Q's join graph G_J is built. We want to select a collection of edges from G_JP that correspondingly form a set of MRJs, denoted as T_opt, such that there exists an execution plan P of T_opt which minimizes the query evaluation time.

Obviously, there are many different choices of T to evaluate Q. Moreover, given T and limited processing units, different execution plans yield different evaluation time spans. In fact, the determination of P is non-trivial; we give a detailed analysis of the hardness of our problem in the next subsection. As we shall elaborate later, given T and k_P available processing units, we adopt an approximation method to determine P in linear time.
3.2 Problem Hardness
According to the problem definition, we need two steps to find T_opt: 1) generate G_JP from G_J; 2) select MRJs for T_opt. Neither of these two steps is easy to solve.

For the first step, to construct G_JP, we need to enumerate all the no-edge-repeating paths between any pair of vertices in G_J. Assume G_J has an "Eulerian trail" [16], which is a way to traverse the graph with every edge visited exactly once; then for any pair of vertices v_i and v_j, any no-edge-repeating path between them is a "sub-path" of an Eulerian trail. If we know all the no-edge-repeating paths between any pair of vertices, we can enumerate all the Eulerian trails in polynomial time. Therefore, constructing G_JP is at least as hard as enumerating all the Eulerian trails of a given graph, which is known to be #P-complete [6]. Moreover, we find that even if G_J does not have an Eulerian trail, the problem complexity is not reduced at all, as we elaborate in the proof of the following theorem.

Theorem 1. Generating G_JP from a given G_J is a #P-complete problem.
Proof. If G_J has the Eulerian trail, constructing G_JP is #P-complete (see the discussion above).

On the contrary, if G_J does not have the Eulerian trail, it implies that there are r vertices having odd degrees, where r > 2. Now consider that we add one virtual vertex and connect it with r-1 of the vertices of odd degree. The new graph must have an Eulerian trail. If we can easily construct the join-path graph of the new graph, the original graph's G_JP can be computed in polynomial time. We elaborate with the following example, as shown in Fig. 2. Assume v_s is added to the original G_J; then, by computing the join-path graph of the new graph, we know all the no-edge-repeating paths between v_i and v_j. However, a no-edge-repeating path between v_i and v_j in the original graph cannot have v_s involved. By simply removing all the enumerated paths that go through v_s, we can obtain the G_JP of the original G_J. Thus, the dominating cost of constructing G_JP is still the enumeration of all Eulerian trails. Therefore, this problem is #P-complete.
[Figure 2: Adding virtual vertex v_s to G_J (vertices v_i, v_p, v_q, v_j shown).]
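The degree argument in the proof is easy to check mechanically. The helper below is our illustration of that argument (not code from the paper): it counts odd-degree vertices and applies the virtual-vertex fix-up.

```python
# Check Eulerian-trail existence by the degree condition and apply the
# proof's fix-up: connect a virtual vertex v_s to r-1 odd-degree vertices
# so that the augmented graph has an Eulerian trail.
from collections import Counter

def odd_vertices(edges):
    # edges: list of (u, v, theta_label) in an undirected graph.
    deg = Counter()
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    return [v for v, d in deg.items() if d % 2 == 1]

def add_virtual_vertex(edges):
    odd = odd_vertices(edges)
    if len(odd) <= 2:
        return edges  # an Eulerian trail already exists (if connected)
    # Connect v_s to r-1 of the r odd-degree vertices; the augmented graph
    # then has exactly two odd vertices (v_s and the remaining one).
    return edges + [("v_s", u, None) for u in odd[:-1]]

# K4: every vertex has degree 3, so r = 4 and no Eulerian trail exists.
edges = [("a", "b", 1), ("a", "c", 2), ("a", "d", 3),
         ("b", "c", 4), ("b", "d", 5), ("c", "d", 6)]
print(odd_vertices(edges))                      # ['a', 'b', 'c', 'd']
print(odd_vertices(add_virtual_vertex(edges)))  # ['d', 'v_s']: trail exists
```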
Although it is difficult to compute the exact G_JP, we find that a subgraph of G_JP which contains all the vertices, denoted as G'_JP, can be sufficient to guarantee optimal query evaluation efficiency. We take the following principle into consideration: given the same number of processing units, if it takes longer to evaluate R_i ⋈ R_j ⋈ R_k with one MRJ than the total time cost of evaluating R_i ⋈ R_j and R_j ⋈ R_k separately and merging the results, we do not take R_i ⋈ R_j ⋈ R_k ⋈ R_s into consideration. By following this principle, we can avoid enumerating all the possible no-edge-repeating paths between any pair of vertices. As a matter of fact, we can obtain such a sufficient G'_JP in polynomial time.
The second step of our solution is to select T_opt. Assume the G'_JP computed in the first step provides a collection of edges; accordingly, we have a collection of MRJ candidates to evaluate the query. Although each edge in G'_JP is associated with a weight denoting the minimum time cost to evaluate all the join conditions contained in this edge, it is just an estimated time span under the condition that there are enough processing units. However, when a T is chosen and the number of processing units is limited, the time cost of using T to answer Q needs to be re-estimated. Assume we can find the time cost estimation of T, denoted as C(T); then the problem is to find the optimal T_opt among all possible Ts, which has the minimum time cost. Apparently, this is a variant of the classic set cover problem, which is known to be NP-hard [10]. Therefore, many heuristics and approximation algorithms can be adopted to solve the selection problem.
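As one example of such a heuristic (our illustration, not the paper's selection algorithm), the classic greedy rule for weighted set cover repeatedly picks the candidate MRJ with the lowest estimated cost per newly covered join condition:

```python
# Greedy weighted set cover over MRJ candidates (an illustrative heuristic,
# not the paper's algorithm). Each candidate covers a set of G_J edges
# (join conditions) and carries an estimated time cost.
def greedy_select(candidates, gj_edges):
    # candidates: list of (covered_edge_ids, estimated_cost) tuples.
    uncovered, T = set(gj_edges), []
    while uncovered:
        # Pick the candidate with the best cost per newly covered edge.
        best = min(
            (c for c in candidates if c[0] & uncovered),
            key=lambda c: c[1] / len(c[0] & uncovered),
            default=None)
        if best is None:
            return None  # no sufficient T exists among the candidates
        T.append(best)
        uncovered -= best[0]
    return T

candidates = [({1, 2, 3}, 40.0), ({3, 4}, 15.0), ({4, 5, 6}, 30.0),
              ({1, 2}, 35.0)]
print(greedy_select(candidates, {1, 2, 3, 4, 5, 6}))
```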
As clearly indicated in the problem definition, the solution lies in constructing G_JP and smartly selecting T based on the cost estimation of a group of MRJs. Therefore, in the rest of the paper, we shall first elaborate our cost models for a single MRJ and a group of MRJs, and then present our detailed solution for N-join query evaluation.
4. COST MODEL
To highlight our observations on how much the overlapping of computation and network cost affects the execution of an MRJ, in this section we present a generalized analytical study of the execution time of both a single MRJ and a group of MRJs. In the context of G_JP construction and T selection, we study the estimation of w(e'), where e' ∈ G_JP.E', and of C(T), which is the time cost to evaluate T.
4.1 Estimating w(e'): Model for a Single MRJ
Since our target is to find an optimal join plan, we only consider the processing cost of join operations with MRJs. Generally, most of the CPU time for join processing is spent on simple comparison and counting; thus, system I/O cost dominates the total execution time. For MapReduce jobs, the heavy cost of large-scale sequential disk scans and the frequent I/O of intermediate results dominate the execution time. Therefore, we shall build a model for an MRJ's execution time based on an analysis of I/O and network cost.

The general MapReduce computing framework involves three phases of data processing: Map, Reduce, and the data copying from Map tasks to Reduce tasks, as shown in Fig. 3. In the figure, each "M" stands for a Map task, each "CP" stands for one phase of Map output copying over the network, and each "R" stands for a Reduce task. Since each Map task is based on a data block, we assume that the unit processing cost for each Map task is t_M. Moreover, since the entire input data may not be loaded into the system memory within one round [12][3], we assume these Map tasks are performed round by round (we have the same observation in practice). However, the size of a Reduce task is subject to the (key, value) distribution. As shown in Fig. 3, the makespan of an MRJ is dominated by the most time-consuming Reduce task. Therefore, we only consider the Reduce task with the largest volume of input in the following analysis. Assume the total input size of an MRJ is S_I, the total intermediate data copied from Map to Reduce is of size S_CP, and the numbers of Map tasks and Reduce tasks are m and n, respectively. In addition, as a general assumption, S_I is considered to be evenly partitioned among the m Map tasks [24]. Let J_M, J_R and J_CP denote the total time cost of the three phases respectively, and let T be the total execution time of an MRJ. Then T ≤ J_M + J_CP + J_R holds due to the overlapping between J_M and J_CP.
[Figure 3: MapReduce workflow. Two cases are shown; each M is a Map task, each CP a copy phase, and each R a Reduce task, with per-task costs t_M and t_CP and phase totals J_M, J_CP and J_R.]
Each Map task performs disk I/O and data processing. Since disk I/O is the dominant cost, we can estimate the time cost of a single Map task based on disk I/O. Disk I/O contains two parts: one is sequential reading, the other is data spilling. Then the time cost for a single Map task, t_M, is

    t_M = (C_1 + p × α) × S_I / m    (1)

where C_1 is a constant factor regarding disk I/O capability, and p is a random variable denoting the cost of spilling intermediate data. For a given system configuration, p is subject to the intermediate data size; it increases as the spilled data size grows. α denotes the output ratio of a Map task, which is query specific and can be computed with selectivity estimation. Assume m' is the current number of Map tasks running in parallel in the system; then J_M can be computed as follows:

    J_M = t_M × m / m'    (2)

For J_CP, let t_CP be the time cost of copying the output of a single Map task to n Reduce tasks; it includes the cost of data copying over the network as well as the overhead of serving network protocols. t_CP is calculated with the following formula:

    t_CP = C_2 × (α × S_I) / (n × m) + q × n    (3)

where C_2 is a constant denoting the efficiency of data copying over the network, and q is a random variable representing the cost of a Map task serving n connections from n Reduce tasks. Intuitively, there is a rapid growth of q as n gets larger. Thus, J_CP can be computed as follows:

    J_CP = (m / m') × t_CP    (4)

For J_R, intuitively it is dominated by the Reduce task which has the biggest input size. We assume that the key distribution in the input file is random; thus, let S_r^i denote the input size of Reduce task i. Then, according to the Central Limit Theorem [20], we can assume that for i = 1, ..., n, S_r^i follows a normal distribution N(μ, σ), where μ is determined by α × S_I and σ is subject to data set properties, which can be learned from historical query logs. Thus, by employing the "rule of three sigmas" [20], we take S_r = α × S_I × n^{-1} + 3σ as the biggest input size to a Reduce task; then

    J_R = (p + β × C_1) × S_r    (5)
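Putting equations (1) through (5) together, the per-job time estimate can be computed directly. The sketch below is a plain transcription under the stated model; the constants C_1, C_2, p, q, β and σ are placeholders that would have to be fitted to a concrete system, and the numbers in the driver are made up, not measured values.

```python
# Equations (1)-(5) as a direct calculation. All constants are placeholder
# values for illustration; m_par stands for m' (Map tasks run in parallel).
def mrj_time(S_I, m, n, m_par, alpha,
             C1=1.0, C2=1.0, p=0.2, q=0.01, beta=1.0, sigma=0.0):
    t_M = (C1 + p * alpha) * S_I / m                 # Eq. (1)
    J_M = t_M * (m / m_par)                          # Eq. (2)
    t_CP = C2 * (alpha * S_I) / (n * m) + q * n      # Eq. (3)
    J_CP = (m / m_par) * t_CP                        # Eq. (4)
    S_r = alpha * S_I / n + 3 * sigma                # largest Reduce input
    J_R = (p + beta * C1) * S_r                      # Eq. (5)
    return J_M + J_CP + J_R   # upper bound on T (Map and copy overlap)

# 64 GB input, 512 Map tasks (64 in parallel), varying Reduce task numbers;
# the q * n term in Eq. (3) is why more Reduce tasks is not always faster.
for n in (4, 16, 64):
    print(n, round(mrj_time(S_I=64_000, m=512, n=n, m_par=64, alpha=0.5), 2))
```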