Scalable Multi-Query Optimization for SPARQL

Wangchao Le¹, Anastasios Kementsietsidis², Songyun Duan², Feifei Li¹
¹School of Computing, University of Utah, Salt Lake City, UT, USA
²IBM T.J. Watson Research Center, Hawthorne, NY, USA
¹{lew,lifeifei}@cs.utah.edu, ²{akement, sduan}@us.ibm.com
Abstract—This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to extend to the RDF data model and to graph query patterns expressed in SPARQL. In light of the NP-hardness of multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization incorporates an efficient algorithm to discover the common sub-structures of multiple SPARQL queries and an effective cost model to compare candidate execution plans. Since our optimization techniques make no assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. Extensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient, and scalable.
I. INTRODUCTION
With the proliferation of RDF data, a lot of effort has been devoted over the years to building RDF stores that aim to efficiently answer graph pattern queries expressed in SPARQL. There are generally two routes to building RDF stores: (i) migrating the schema-relaxed RDF data to relational data, e.g., Virtuoso, Jena SDB, Sesame, 3store; and (ii) building generic RDF stores from scratch, e.g., Jena TDB, RDF-3X, 4store, Sesame Native. As RDF data are schema-relaxed [26] and graph pattern queries in SPARQL typically involve many joins [1], [19], a full spectrum of techniques has been proposed to address the new challenges. For instance, vertical partitioning was proposed for relational backends [1]; sideways information passing was applied for scalable join processing [19]; and various compression and indexing techniques were designed for a small memory footprint [3], [18]. With the infrastructure in place, the community is turning to more advanced applications, e.g., integrating and harvesting knowledge on the Web [24], and rewriting queries for fine-grained access control [17] and inference [13]. In such applications, a SPARQL query over views is often rewritten into an equivalent batch of SPARQL queries for evaluation over the base data. As the semantics of the rewritten queries in the same batch commonly overlap [13], [17], there is much room for sharing computation when executing these rewritten queries. This observation motivates us to revisit the classical problem of multi-query optimization (MQO) in the context of RDF and SPARQL.
Not surprisingly, MQO for SPARQL queries is NP-hard, considering that MQO for relational queries is NP-hard [30] and that SPARQL and relational algebra are equivalent in expressive power [2], [23]. It is tempting to apply the MQO techniques developed for relational systems to the MQO problem in SPARQL. For instance, the work by P. Roy et al. [27] represented query plans as AND-OR DAGs and used heuristics to partially materialize intermediate results that could improve query throughput. Similar themes can be seen in a variety of contexts, including relational queries [30], [31], XQueries [6], aggregation queries [36], and, more recently, full-reducer tree queries [15]. These off-the-shelf solutions, however, are hard to engineer into RDF query engines in practice. The first source of complexity in using the relational techniques and the like stems from the physical design of RDF data itself. While indexing and storing relational data commonly conform to a carefully calibrated relational schema, many variants exist for RDF data, e.g., the giant triple table adopted in 3store and RDF-3X, the property table in Jena, and, more recently, the use of vertical partitioning to store RDF data. These, together with the disparate indexing techniques, make the cost estimation for an individual query operator (the cornerstone of any MQO technique) highly error-prone and store-dependent. Moreover, as observed in previous works [1], [19], SPARQL queries feature more joins than typical SQL queries, a fact that is also evident when comparing TPC benchmarks [34] with the benchmarks for RDF stores [5], [9], [11], [28]. While existing techniques commonly rely on greedily searching for the best plan, comparing the cost of alternative plans becomes impractical in the context of SPARQL, as the error of selectivity estimation inevitably grows as the number of joins increases [18], [33]. Finally, in the W3C's vision [26], RDF is a very general data model; knowledge and facts can therefore be seamlessly harvested and integrated from various SPARQL endpoints on the Web [38] (powered by different RDF stores). While a specialized MQO solution may serve inside the optimizer of a particular RDF store, it is more appealing to have a generic MQO framework that could smoothly fit into any SPARQL endpoint, which is coherent with the design principle of the RDF data model.
With the above challenges in mind, in this paper we study MQO of SPARQL queries over RDF data, with the objective of minimizing total query evaluation time. Specifically, we employ query rewriting techniques to achieve desirable and consistent performance for MQO across different RDF stores, with the guarantee of soundness and completeness. While previous works consider alignments of the common substructures in acyclic query plans [15], [27], we set forth to identify common subqueries (cyclic query graphs included) and rewrite them in SPARQL in a meaningful way. Unlike [27], which requires explicitly materializing and indexing the common intermediate results, our approach works on top of any RDF engine and ensures that the underlying RDF stores can automatically cache and reuse such results. In addition, the full range of optimization techniques in different RDF stores and SPARQL query optimizers can seamlessly support our MQO technique. Our contributions can be summarized as follows.
• We present a generic technique for MQO in SPARQL. Unlike previous works that focus on synthesizing query plans, our technique summarizes similarity in the structure of SPARQL queries and takes into account the unique properties (e.g., cyclic query patterns) of SPARQL.
• Our MQO approach relies on query rewriting, which is built on algorithms for finding common substructures. In addition, we tailor efficient and effective optimizations for finding common subqueries in a batch of SPARQL queries.
• We propose a practical cost model. Our choice of cost model is determined both by the idiosyncrasies of the SPARQL language and by our empirical study of how SPARQL queries are executed in existing RDF data management systems.
• Extensive experiments with large RDF data (close to 10 million triples), performed on three different RDF stores, consistently demonstrate the efficiency and effectiveness of our approach over the baseline methods.
II. PRELIMINARIES
A. SPARQL
SPARQL, a W3C recommendation, is a pattern-matching query language. We focus on two types of SPARQL queries:

Type 1: Q := SELECT RD WHERE GP
Type 2: Q_OPT := SELECT RD WHERE GP (OPTIONAL GP_OPT)+

where GP is a set of triple patterns, i.e., triples involving both variables and constants, and RD is the result description. Given an RDF data graph D, the pattern GP searches D for a set of subgraphs of D, each of which matches the graph pattern in GP (by binding pattern variables to values in the subgraph). The result description RD for both query types contains a subset of the variables in the graph patterns, similar to a projection in SQL. The difference between the two types lies in the OPTIONAL clause. Unlike query Q, in the Q_OPT query a subgraph of D might match not only the pattern in GP but also the combination of GP and GP_OPT. While more than one OPTIONAL clause is allowed, subgraph matching with D independently considers the combination of pattern GP with each of the OPTIONAL clauses. Therefore, with n OPTIONAL clauses in query Q_OPT, the query returns as results the subgraphs that match any of the n (GP + GP_OPT) pattern combinations, plus the results that match just the GP pattern.
Consider the data and SPARQL query in Figure 1(a) and (b).
The query looks for triples whose subjects (each corresponding
(a) Input data D:

subj  pred  obj
p1    name  "Alice"
p1    zip   10001
p1    mbox  alice@home
p1    mbox  alice@work
p1    www   http://home/alice
p2    name  "Bob"
p2    zip   "10001"
p3    name  "Ella"
p3    zip   "10001"
p3    www   http://work/ella
p4    name  "Tim"
p4    zip   "11234"

(b) Example query Q_OPT:

SELECT ?name, ?mail, ?hpage
WHERE { ?x name ?name, ?x zip 10001,
  OPTIONAL {?x mbox ?mail }
  OPTIONAL {?x www ?hpage }}

(c) Output Q_OPT(D):

name     mail        hpage
"Alice"  alice@home
"Alice"  alice@work
"Alice"              http://home/alice
"Bob"
"Ella"               http://work/ella

Fig. 1. An example
to a person) have the predicates name and zip, with the latter having the value 10001 as object. For these triples, it returns the object of the name predicate. Due to the first OPTIONAL clause, the query also returns the object of predicate mbox, if the predicate exists. Due to the second OPTIONAL clause, the query also independently returns the object of predicate www, if the predicate exists. Evaluating the query over the input data D (which can be viewed as a graph) results in output Q_OPT(D), as shown in Figure 1(c).
[Fig. 2. A query graph: the pattern of the query in Figure 1(b), with vertices v1 (?x), v2 (?n), v3 (10001), v4 (?m), v5 (?p), solid edges e1 (name) and e2 (zip), and dashed edges e3 (mbox) and e4 (www).]
We represent queries graphically, and associate with each query Q (Q_OPT) a query graph pattern corresponding to its pattern GP (resp., GP (OPTIONAL GP_OPT)+). Formally, a query graph pattern is a 4-tuple (V, E, ν, µ), where V and E stand for vertices and edges, and ν and µ are two functions that assign labels (i.e., constants and variables) to the vertices and edges of GP, respectively. Vertices represent the subjects and objects of a triple; gray vertices represent constants, and white vertices represent variables. Edges represent predicates; dashed edges represent predicates in the optional patterns GP_OPT, and solid edges represent predicates in the required pattern GP. Figure 2 shows a pictorial example for the query in Figure 1(b). Its query graph patterns GP and GP_OPTs are defined separately. GP is defined as (V, E, ν, µ), where V = {v1, v2, v3}, E = {e1, e2}, and the two naming functions are ν = {ν1: v1 → ?x, ν2: v2 → ?n, ν3: v3 → 10001} and µ = {µ1: e1 → name, µ2: e2 → zip}. The two OPTIONALs are defined as GP_OPT1 = (V′, E′, ν′, µ′), where V′ = {v1, v4}, E′ = {e3}, ν′ = {ν′1: v1 → ?x, ν′2: v4 → ?m}, and µ′ = {µ′1: e3 → mbox}; likewise, GP_OPT2 = (V″, E″, ν″, µ″), where V″ = {v1, v5}, E″ = {e4}, ν″ = {ν″1: v1 → ?x, ν″2: v5 → ?p}, and µ″ = {µ″1: e4 → www}.
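To make the representation concrete, the following minimal Python sketch (our own illustration, not the authors' code) encodes the 4-tuple query graph pattern of Figure 1(b):

    from dataclasses import dataclass

    @dataclass
    class QueryGraphPattern:
        # The 4-tuple (V, E, nu, mu): vertex ids, directed edges, and the two
        # labeling functions for vertices (constants/variables) and edges (predicates).
        V: set
        E: dict   # edge id -> (source vertex, target vertex)
        nu: dict  # vertex id -> constant or variable label
        mu: dict  # edge id -> predicate label

    # Required pattern GP of the query in Figure 1(b)
    GP = QueryGraphPattern(
        V={"v1", "v2", "v3"},
        E={"e1": ("v1", "v2"), "e2": ("v1", "v3")},
        nu={"v1": "?x", "v2": "?n", "v3": "10001"},
        mu={"e1": "name", "e2": "zip"},
    )

    # First OPTIONAL pattern GP_OPT1 (shares vertex v1/?x with GP)
    GP_OPT1 = QueryGraphPattern(
        V={"v1", "v4"}, E={"e3": ("v1", "v4")},
        nu={"v1": "?x", "v4": "?m"}, mu={"e3": "mbox"},
    )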
B. Problem statement
Formally, the problem of MQO in SPARQL, from the query-rewriting perspective, is defined as follows: given a data graph G and a set Q of Type 1 queries, compute a new set Q_OPT of Type 1 and Type 2 queries, evaluate Q_OPT over G, and distribute the results to the queries in Q. There are two requirements for the rewriting approach to MQO: (i) the query results of Q_OPT can be used to produce the same results as executing the original queries in Q, which ensures the soundness and completeness of the rewriting; and (ii) the evaluation time of Q_OPT,

[Fig. 3(a)-(d): graph representations of the four input queries Q1-Q4 of Type 1 (drawings omitted; each Qi consists of the patterns ?x P1 ?z and ?y P2 ?z plus the patterns of the i-th OPTIONAL clause of query (e) below).]

(e) Example query Q_OPT:

SELECT *
WHERE { ?x P1 ?z, ?y P2 ?z,
  OPTIONAL {?y P3 ?w, ?w P4 v1 }
  OPTIONAL {?t P3 ?x, ?t P5 v1, ?w P4 v1 }
  OPTIONAL {?x P3 ?y, v1 P5 ?y, ?w P4 v1 }
  OPTIONAL {?y P3 ?u, ?w P6 ?u, ?w P4 v1 }
}

(f) Structure and cost-based optimization:

SELECT *
WHERE { ?w P4 v1,
  OPTIONAL {?x1 P1 ?z1, ?y1 P2 ?z1, ?y1 P3 ?w }
  OPTIONAL {?x2 P1 ?z2, ?y2 P2 ?z2, ?t2 P3 ?x2, ?t2 P5 v1 }
  OPTIONAL {?x3 P1 ?z3, ?y3 P2 ?z3, ?x3 P3 ?y3, v1 P5 ?y3 }
  OPTIONAL {?x4 P1 ?z4, ?y4 P2 ?z4, ?y4 P3 ?u4, ?w P6 ?u4 }
}

pattern p   α(p)
?x P1 ?z    15%
?y P2 ?z     9%
?y P3 ?w    18%
?w P4 v1     4%
?t P5 v1     2%
v1 P5 ?t     7%
?w P6 ?u    13%

Fig. 3. Multi-query optimization example
including query rewriting, execution, and result distribution, should be less than the baseline of executing the queries in Q sequentially. To ease presentation, we assume that the input queries in Q are of Type 1, while the output (optimized) queries are either of Type 1 or Type 2. Our optimization techniques can easily handle more general scenarios where both query types are given as input (Section IV).

We use a simple example to illustrate the envisioned MQO and some challenges for the rewriting approach. Figures 3(a)-(d) show the graph representations of four queries of Type 1. Figure 3(e) shows a Type 2 query Q_OPT that rewrites all four input queries into one. To generate query Q_OPT, we identify the (largest) common subquery of all four queries: the subquery involving the triple patterns ?x P1 ?z and ?y P2 ?z (the second-largest common subquery involves only one predicate, P3 or P4). This common subquery constitutes the graph pattern GP of Q_OPT. The remaining subquery of each individual query generates an OPTIONAL clause in Q_OPT. Note that by generating a query like Q_OPT, the triple patterns in the GP of Q_OPT are evaluated only once, instead of multiple times as when the input queries are executed independently. Intuitively, this is where the savings of MQO come from. As mentioned earlier, MQO must consider generic directed graphs, possibly with cyclic patterns, which makes it hard to adapt existing techniques for this optimization. The proposed optimization also has the unique characteristic of leveraging SPARQL-specific features, such as the OPTIONAL clause, for query rewriting.

Note that the above rewriting only considers query structures, without considering query selectivity. Suppose we know the selectivity α(p) of each pattern p in the queries, as shown in Figure 3(f). Let us assume a simple cost model in which the cost of each query Q or Q_OPT equals the minimum selectivity among the patterns in its GP, multiplied by the dataset size; we ignore for now the cost of OPTIONAL patterns, which is motivated by how real SPARQL engines evaluate queries (the actual cost model used in this paper is discussed in Section III-D). So, on a dataset of size 100, the costs of the four queries Q1 to Q4 are 4, 2, 4, and 4, respectively.
Input: Set Q = {Q1, . . ., Qn}                                // J: Jaccard
Output: Set Q_OPT of optimized queries
// Step 1: Bootstrapping the query optimizer
1   Run k-means on Q to generate a set M = {M1, . . ., Mk} of k query
    groups based on query similarity in terms of their predicate sets;
// Step 2: Refining query clusters
2   for each query group M ∈ M do
3     Initialize a set C = {C1, . . ., C|M|} of |M| clusters;
4     for each query Qi ∈ M, 1 ≤ i ≤ |M| do Ci = Qi;
5     while ∃ an untested pair (Ci, Cj) with maximal Jmax(Ci, Cj) do
6       Let Q∪ = {Q1, . . ., Qm} be the queries of Ci ∪ Cj;
7       Let S be the top-s most selective triple patterns in Q∪;
        // Step 2.1: Building compact linegraphs
8       Let µ∩ = µ1 ∩ µ2 ∩ . . . ∩ µm and τ = {∅};
9       for each query Qj ∈ Q∪ do
10        Build linegraph L(Qj) with only the edges in µ∩;
11        Keep the indegree matrix m−j and outdegree matrix m+j of L(Qj);
12      for each vertex e defined in µ∩ with µ∩(e) ≠ ∅ do
13        Let I = m−1[e] ∩ . . . ∩ m−m[e] and O = m+1[e] ∩ . . . ∩ m+m[e];
14        if I = O = ∅ then set µ∩(e) = ∅ and τ = τ ∪ {triple pattern on e};
15      for each L(GPj), 1 ≤ j ≤ m do
16        Prune the L(GPj) vertices not in µ∩ and their incident edges;
        // Step 2.2: Building product graphs
17      Build L(GPp) = L(GP1) ⊗ L(GP2) ⊗ . . . ⊗ L(GPm);
        // Step 2.3: Finding cliques in product graphs
18      {K1, . . ., Kr} = AllMaximalCliques(L(GPp));
19      if r = 0 then goto line 22;
20      for each Ki, i = 1, 2, . . ., r do
21        Find all K′i ⊆ Ki having the maximal strong covering tree in Ki;
22      Sort SubQ = {K′1, . . ., K′t} ∪ τ in descending order of size;
23      Initialize K = ∅;
24      for each qi ∈ SubQ, i = 1, 2, . . ., t + |τ| do
25        if S ∩ qi ≠ ∅ then set K = qi and break;
26      if K ≠ ∅ then
27        Let Ctmp = Ci ∪ Cj and cost(Ctmp) = cost(sub-query for K);
28        if cost(Ctmp) ≤ cost(Ci) + cost(Cj) then
29          Put K with Ctmp;
30          Remove Ci, Cj from C and add Ctmp;
// Step 3: Generating optimized queries
31  for each cluster Ci in C do
32    if a clique K is associated with Ci then
33      Rewrite the queries in Ci using the triple patterns in K;
34    Output the query into set Q_OPT;
35  return Q_OPT.

Fig. 4. Multi-query optimization algorithm
Therefore, executing all queries individually (without optimization) costs 4 + 2 + 4 + 4 = 14. In comparison, the cost of the structure-only optimized query in Figure 3(e) is 9, a saving of approximately 30%. Now, consider the alternative rewriting in Figure 3(f), which results from optimizing along the second-largest common subquery, the one that contains just P4. The cost of this query is only 4, which leads to even larger savings, although the rewriting utilizes a smaller common subquery. As this simple example illustrates, it is critical for MQO to use a cost model that integrates query structure overlap with selectivity estimation.
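The arithmetic above can be replayed in a few lines of Python; this is only a sketch of the simplified cost model of this example (the ALPHA table is read off Figure 3(f)), not the model of Section III-D:

    # Simplified cost model of the running example:
    # cost(Q) = min selectivity over the patterns in GP x dataset size.
    ALPHA = {"?x P1 ?z": 0.15, "?y P2 ?z": 0.09, "?y P3 ?w": 0.18,
             "?w P4 v1": 0.04, "?t P5 v1": 0.02, "v1 P5 ?t": 0.07,
             "?w P6 ?u": 0.13}

    def cost(gp, size=100):
        # OPTIONAL clauses are ignored, mirroring the simplification above.
        return min(ALPHA[p] for p in gp) * size

    # The P3 patterns (18%) never attain the minimum, so they are omitted;
    # Q3's pattern v1 P5 ?y matches the table row v1 P5 ?t up to renaming.
    q1 = cost(["?x P1 ?z", "?y P2 ?z", "?w P4 v1"])              # 4.0
    q2 = cost(["?x P1 ?z", "?y P2 ?z", "?t P5 v1", "?w P4 v1"])  # 2.0
    q3 = cost(["?x P1 ?z", "?y P2 ?z", "v1 P5 ?t", "?w P4 v1"])  # 4.0
    q4 = cost(["?x P1 ?z", "?y P2 ?z", "?w P6 ?u", "?w P4 v1"])  # 4.0
    print(q1 + q2 + q3 + q4)               # 14.0: queries run individually
    print(cost(["?x P1 ?z", "?y P2 ?z"]))  # 9.0: rewriting of Fig. 3(e)
    print(cost(["?w P4 v1"]))              # 4.0: rewriting of Fig. 3(f)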
III. THE ALGORITHM
Our MQO algorithm, shown in Figure 4, accepts as input a set Q = {Q1, . . ., Qn} of n queries over a graph G. Without loss of generality, we assume the sets of variables used in different queries are distinct. The algorithm identifies whether there is a cost-effective way to share the evaluation of structurally-overlapping graph patterns among the queries in Q. At a high level, the algorithm works as follows: (1) it partitions the input queries into groups, where queries in the same group are more likely to share common sub-queries that can be optimized through query rewriting; (2) it rewrites a number of Type 1 queries in each group into corresponding cost-efficient Type 2 queries; and (3) it executes the rewritten queries and distributes the query results to the original input queries (along with a refinement). Several challenges arise during this process: (i) there is an exponential number of ways to partition the input queries, so we need a heuristic to prune the space of less promising query partitionings; (ii) we need an efficient algorithm to identify potential common sub-queries for a given query group; and (iii) since different common sub-queries result in different query rewritings, we need a robust cost model to compare candidate rewriting strategies. We describe how we tackle these challenges next.
A. Bootstrapping
Finding structural overlaps among a set of queries amounts to finding the isomorphic subgraphs among the corresponding query graphs. This process is computationally expensive (the problem is NP-hard [4] in general), so ideally we would like to find these overlaps only for groups of queries that will eventually be optimized (rewritten). That is, we want to minimize (or ideally eliminate) the computation spent on identifying common subgraphs for query groups that lead to less optimal MQO solutions. One heuristic we adopt is to quickly prune out subsets of queries that clearly share little in their query graphs, without going to the next, expensive step of computing their common subqueries; groups of queries that have few predicates in common are thus pruned from further consideration. We therefore define the similarity metric for two queries as the Jaccard similarity of their predicate sets. The rationale is that if the Jaccard similarity of two queries is small, their structural overlap in query graphs must also be small, so it is safe not to consider grouping such queries for MQO. We implement this heuristic as a bootstrap step in line 1, using k-means clustering (with Jaccard as the similarity metric) for an initial partitioning of the input queries into a set M of k query groups. Notice that the similarity metric identifies queries with substantial overlaps in their predicate sets, ignoring for now the common sub-structure and the selectivity of these predicates.
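As a small illustration (ours, with the predicate sets read off Figure 3), the bootstrap similarity is just Jaccard over predicate sets:

    def jaccard(a: set, b: set) -> float:
        # Jaccard similarity of two queries' predicate sets (line 1 of Fig. 4).
        return len(a & b) / len(a | b)

    # Predicate sets of the queries in Figure 3(a)-(d)
    Q1, Q2 = {"P1", "P2", "P3", "P4"}, {"P1", "P2", "P3", "P4", "P5"}
    Q3, Q4 = {"P1", "P2", "P3", "P4", "P5"}, {"P1", "P2", "P3", "P4", "P6"}
    print(jaccard(Q1, Q2))  # 0.8: strong candidates for the same group
    print(jaccard(Q2, Q4))  # 0.666...: still promising
    # Queries with near-zero Jaccard similarity are never grouped together,
    # so no expensive common-subgraph computation is spent on them.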
B. Refining query clusters
Starting with the k-means generated groups M, we further refine the partitioning of queries based on their structural similarity and the estimated cost. To this end, we consider each query group M ∈ M generated from the k-means clustering in isolation (since queries across groups are guaranteed to be sufficiently different) and perform the following steps. In lines 5-30, we (incrementally) merge structurally similar queries within M through hierarchical clustering [14], and generate query clusters such that each query cluster is optimized together (i.e., results in one Type 2 query). Initially, we create one singleton cluster Ci for each query Qi of M (line 4). Given two clusters Ci and Cj, we have to determine whether it is more cost-efficient to merge the two query clusters into a single cluster (i.e., a single Type 2 query) than to keep the two clusters separate (i.e., to execute the corresponding two queries independently). From the previous iteration, we already know the cost of the optimized queries for each of the Ci and Cj clusters. To determine the cost of the merged cluster, we have to compute the query that merges all the queries in Ci and Cj through rewriting, which requires us to compute the common substructure of all these queries and to estimate the cost of the rewritten query generated from the merged cluster. For the cost computation, we do some preliminary work here (line 7) by identifying the most selective triple patterns from the two clusters (selectivity is estimated as in [33]). Note that our refinement of M might lead to more than one query: one for each cluster of M, in the form of either Type 1 or Type 2.
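A compact skeleton of this refinement loop follows (our illustration; find_common and cost stand in for Steps 2.1-2.3 and the cost model of Section III-D, and pairs are tested exhaustively rather than in Jmax order):

    def refine(queries, find_common, cost):
        # One singleton cluster per query (line 4 of Fig. 4);
        # each entry is a (queries, cost) pair.
        clusters = [([q], cost([q], None)) for q in queries]
        while True:
            merged = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    (qi, ci), (qj, cj) = clusters[i], clusters[j]
                    sub = find_common(qi + qj)   # Steps 2.1-2.3
                    if sub is None:
                        continue
                    c_new = cost(qi + qj, sub)   # line 27
                    if c_new <= ci + cj:         # line 28: merging is cheaper
                        merged = (i, j, (qi + qj, c_new))
                        break
                if merged:
                    break
            if merged is None:                   # no pair left worth merging
                return clusters
            i, j, new_cluster = merged
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(new_cluster)         # line 30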
Finding common substructures: While finding the maximum common subgraph of two graphs is known to be NP-hard [4], the challenge here is asymptotically harder, as it requires finding the largest common substructures of multiple graphs. Existing solutions for finding common subgraphs also assume untyped edges and nodes in undirected graphs. In our case, however, the graphs represent queries, and different triple patterns might correspond to different semantics (i.e., typed and directed); thus, the predicates and the constants associated with nodes must be taken into consideration. This mix of typed, constant, and variable nodes/edges is not typical in classical graph algorithms, so existing solutions cannot be directly applied to query optimization. We therefore propose an efficient algorithm to address these challenges. In a nutshell, our algorithm follows the principle of finding the maximal common edge subgraphs (MCES) [25], [37]. Concisely, three major sub-steps are involved (Steps 2.1 to 2.3 in Figure 4): (a) transforming the input query graphs into their equivalent linegraph representations; (b) generating a product graph from the linegraphs; and (c) executing a tailored clique detection algorithm to find the maximal cliques in the product graph (a maximal clique corresponds to an MCES). We describe these sub-steps in detail next.
Step 2.1: Building compact linegraphs: The linegraph L(G) of a graph G is a directed graph built as follows. Each node in L(G) corresponds to an edge in G, and there is an edge between two nodes in L(G) if the corresponding edges in G share a common node. Although it is straightforward to transform a graph into its linegraph representation, the context of MQO raises new requirements for linegraph construction. We represent the linegraph of a query graph pattern as a 4-tuple, defined as L(G) = (V, E, π, ω). During linegraph construction, besides the inversion of nodes and edges of the query graph, our transformation also assigns to each edge in the linegraph one of four labels (0-3). Specifically, for two triple patterns, there are four possible joins between their subjects and objects (0 = subject-subject, 1 = subject-object, 2 = object-subject, 3 = object-object). The labels on linegraph edges capture these four join types (they are useful for pruning, as will become clear shortly). Figures 5(a)-(d) show the linegraphs for the queries in Figure 3(a)-(d).
The classical solution for finding common substructures of input graphs requires building Cartesian products of their linegraphs.

[Fig. 5. (a)-(d) the linegraphs L(Q1)-L(Q4) of the queries in Figure 3 (drawings omitted), with edges labeled by the join types 0-3; (e) their common substructures: the product linegraph L(GPp) contains P1 and P2 joined by object-object (label 3) edges, and τ = {P3, P4}.]
This raises scalability challenges when finding the maximum common substructure of multiple queries in one shot. To avoid the foreseeable explosion, we propose fine-grained optimizations (lines 8-16) to keep the linegraphs as small as possible, so that only the most promising substructures are transformed into linegraphs, with the rest temporarily masked from further processing.
To achieve the above, the queries in Q∪ pass through a two-stage optimization. In the first stage (lines 8-11), we identify (line 8) the common predicates in Q∪ by building the intersection µ∩ of all the labels defined in the µs (recall that the function µ assigns predicate names to graph edges). Predicates that are not common to all queries can be safely pruned, since by definition they are not part of any common substructure, e.g., P5 and P6 in Figure 3. While computing the intersection of predicates, the algorithm also checks the compatibility of the corresponding subjects and objects, so that same-label predicates with different subjects/objects are not added to µ∩. In addition, we maintain two adjacency matrices for a linegraph L(GP): the indegree matrix m−, storing all incoming edges, and the outdegree matrix m+, storing all outgoing edges of the L(GP) vertices. For a vertex v, we use m−[v] and m+[v], respectively, to denote the portion of the adjacency matrices storing the incoming and outgoing edges of v. For example, with columns indexed by P1, . . ., P6, the adjacency matrices for vertex P3 in linegraph L(Q1) of Figure 5 are m+1[P3] = [∅, 0, ∅, 2, ∅, ∅] and m−1[P3] = [∅, 0, ∅, 1, ∅, ∅], while for linegraph L(Q2) they are m+2[P3] = [2, ∅, ∅, ∅, 0, ∅] and m−2[P3] = [1, ∅, ∅, ∅, 0, ∅].
In the second stage (lines 12-16), to further reduce the size of the linegraphs, for each linegraph vertex e we compute the Boolean intersection of the m−[e]s and of the m+[e]s from all linegraphs, respectively (line 13). We prune e from µ∩ if both intersections equal ∅, and set aside the triple pattern associated with e in a set τ (line 14). Intuitively, this optimization acts as a look-ahead step in our algorithm, as it quickly detects the cases where a common sub-query involves only one triple pattern (those in τ). Moreover, it also improves the efficiency of the clique detection (Steps 2.2 and 2.3) due to the smaller sizes of the input linegraphs. Going back to our example, just by looking at m−1, m+1, m−2, and m+2, it is easy to see that the intersections m+i[P3] and m−i[P3] are ∅ across the linegraphs of Figure 5(a)-(d). Therefore, our optimization temporarily masks P3 (and likewise P4) from the expensive clique detection in the following two steps.
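The following Python sketch (ours; it assumes each predicate appears at most once per query, as in the running example) builds the labeled linegraph of Step 2.1:

    # Vertices of the linegraph are a query's triple patterns (keyed here by
    # predicate); a directed edge between two patterns is labeled with the
    # join type: 0 = subject-subject, 1 = subject-object, 2 = object-subject,
    # 3 = object-object.
    def linegraph(triples):
        edges = {}
        for s1, p1, o1 in triples:
            for s2, p2, o2 in triples:
                if p1 == p2:
                    continue
                for label, (x, y) in enumerate([(s1, s2), (s1, o2),
                                                (o1, s2), (o1, o2)]):
                    if x == y:              # shared node -> linegraph edge
                        edges[(p1, p2)] = label
        return edges

    # Query Q1 of Figure 3: ?x P1 ?z, ?y P2 ?z, ?y P3 ?w, ?w P4 v1
    Q1 = [("?x", "P1", "?z"), ("?y", "P2", "?z"),
          ("?y", "P3", "?w"), ("?w", "P4", "v1")]
    print(linegraph(Q1))
    # {('P1','P2'): 3, ('P2','P1'): 3, ('P2','P3'): 0, ('P3','P2'): 0,
    #  ('P3','P4'): 2, ('P4','P3'): 1}  -- matching L(Q1) in Fig. 5(a)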
Step 2.2: Building product graphs: The product graph L(GPp) := (Vp, Ep, πp, ωp) of two linegraphs L(GP1) := (V1, E1, π1, ω1) and L(GP2) := (V2, E2, π2, ω2) is denoted as L(GPp) := L(GP1) ⊗ L(GP2). The vertices Vp of L(GPp) are defined on the Cartesian product of V1 and V2. In order to use product graphs for MQO, we optimize the standard definition with the additional requirement that vertices paired together must have the same label (i.e., predicate). That is, Vp := {(v1, v2) | v1 ∈ V1 ∧ v2 ∈ V2 ∧ π1(v1) = π2(v2)}, with the labeling function defined as πp := {πp(v) | πp(v) = π1(v1), with v = (v1, v2) ∈ Vp}. For the product edges, we use the standard definition, which creates an edge in the product graph between two vertices (v1i, v2i) and (v1j, v2j) in Vp if either (i) identically-labeled edges (v1i, v1j) in E1 and (v2i, v2j) in E2 exist, or (ii) no edge connects v1i with v1j in E1 and none connects v2i with v2j in E2. The edges due to (i) are termed strong connections, and those due to (ii) weak connections [37]. Since the product graph of two linegraphs conforms to the definition of a linegraph, we can recursively build the product of multiple linegraphs (line 17). In theory, there is an exponential blowup in size when we construct the product of multiple linegraphs. In practice, thanks to our optimizations in Steps 2.1 and 2.2, our algorithm is able to accommodate tens to hundreds of queries and generates the product graph efficiently (as verified in Section V). Figure 5(e) shows the product linegraph L(GPp) for the running example.
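Continuing the sketch above (again our illustration, assuming one occurrence of each predicate per query, so a product vertex (v1, v2) with π1(v1) = π2(v2) can simply be named by the shared predicate):

    def product(lg1, lg2, vertices1, vertices2):
        # Label-constrained product of two linegraphs: vertices pair up only
        # when they carry the same predicate; an edge is strong when both
        # inputs join the two predicates with the same label, weak when
        # neither input has an edge between them.
        vp = vertices1 & vertices2
        strong, weak = set(), set()
        for a in vp:
            for b in vp:
                if a == b:
                    continue
                l1, l2 = lg1.get((a, b)), lg2.get((a, b))
                if l1 is not None and l1 == l2:
                    strong.add((a, b))
                elif l1 is None and l2 is None:
                    weak.add((a, b))
        return vp, strong, weak

    # After Step 2.1 masks P3 and P4, only P1 and P2 survive in the example:
    lg1 = {("P1", "P2"): 3, ("P2", "P1"): 3}   # pruned L(Q1)
    lg2 = {("P1", "P2"): 3, ("P2", "P1"): 3}   # pruned L(Q2)
    print(product(lg1, lg2, {"P1", "P2"}, {"P1", "P2"}))
    # (set ordering may vary)
    # ({'P1', 'P2'}, {('P1', 'P2'), ('P2', 'P1')}, set())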
Step 2.3: Finding cliques in product graphs: A (maximal) clique with a strong covering tree (a tree involving only strong connections) in the product graph corresponds to an MCES, i.e., in essence, a (maximal) common sub-query. In addition, we are interested in finding cost-effective common sub-queries: to verify that a discovered common sub-query is selective, it is checked against the set S (from line 7) of selective query patterns. In the algorithm, we proceed by finding all maximal cliques in the product graph (line 18), a process for which many efficient algorithms exist [16], [21], [35]. For each discovered clique, we identify its sub-cliques with maximal strong covering trees (line 21). For the L(GPp) in Figure 5(e), this results in one clique (itself): K1 = {P1, P2}. As the cost of sub-queries is another dimension of query optimization, we look for substructures that are both large in size (i.e., in the number of overlapping query graph patterns) and correspond to selective common sub-queries. Therefore, we first sort SubQ (contributed by the K′s and τ, line 22) by size in descending order, and then loop through the sorted list from the beginning, stopping at the first substructure that intersects S (lines 22-25), i.e., P4 in our example. We then merge (if it is cost-effective, line 28) the queries whose common sub-query is reflected in K, and also merge their corresponding clusters into a new cluster, while remembering the found common sub-query (lines 26-30). The algorithm repeats lines 5-30 until every possible pair of clusters has been tested and no new cluster can be generated.
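A sketch of the clique step using networkx (our illustration; for simplicity it keeps whole cliques whose strong edges alone connect all vertices, rather than extracting the maximal strongly-covered sub-cliques of line 21):

    import networkx as nx

    def common_subqueries(vp, strong, weak):
        g = nx.Graph()
        g.add_nodes_from(vp)
        g.add_edges_from(strong | weak)
        found = []
        for clique in nx.find_cliques(g):     # all maximal cliques (line 18)
            # A strong covering tree exists iff the strong edges restricted
            # to the clique keep it connected (line 21, simplified).
            s = nx.Graph()
            s.add_nodes_from(clique)
            s.add_edges_from((a, b) for (a, b) in strong
                             if a in clique and b in clique)
            if nx.is_connected(s):
                found.append(clique)
        return sorted(found, key=len, reverse=True)  # line 22: sort by size

    print(common_subqueries({"P1", "P2"},
                            {("P1", "P2"), ("P2", "P1")}, set()))
    # [['P1', 'P2']]  -> the common sub-query {?x P1 ?z, ?y P2 ?z}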
C. Generating optimized queries and distributing results
After the clusters are finalized, the algorithm rewrites each cluster of queries into one query, and thus generates a set of rewritings Q_OPT (lines 31-34). The result of evaluating Q_OPT over the data is a superset of the results of evaluating the input queries in Q.

References (partial list)

- F. Harary. Graph Theory.
- A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review.
- Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems.
- K. S. Candan, H. Liu, and R. Suvarna. Resource Description Framework: metadata and its applications.
- T. K. Sellis. Multiple-query optimization.