Scalable Multi-Query Optimization for SPARQL

Wangchao Le¹, Anastasios Kementsietsidis², Songyun Duan², Feifei Li¹

¹School of Computing, University of Utah, Salt Lake City, UT, USA
²IBM T.J. Watson Research Center, Hawthorne, NY, USA
¹{lew,lifeifei}@cs.utah.edu, ²{akement, sduan}@us.ibm.com

Abstract—This paper revisits the classical problem of multi-query optimization in the context of RDF/SPARQL. We show that the techniques developed for relational and semi-structured data/query languages are hard, if not impossible, to extend to the RDF data model and the graph query patterns expressed in SPARQL. In light of the NP-hardness of multi-query optimization for SPARQL, we propose heuristic algorithms that partition the input batch of queries into groups such that each group of queries can be optimized together. An essential component of the optimization is an efficient algorithm to discover the common sub-structures of multiple SPARQL queries, together with an effective cost model to compare candidate execution plans. Since our optimization techniques do not make any assumption about the underlying SPARQL query engine, they have the advantage of being portable across different RDF stores. Extensive experimental studies, performed on three popular RDF stores, show that the proposed techniques are effective, efficient and scalable.

I. INTRODUCTION

With the proliferation of RDF data, much effort has been devoted over the years to building RDF stores that aim to efficiently answer graph pattern queries expressed in SPARQL. There are generally two routes to building RDF stores: (i) migrating the schema-relaxed RDF data to relational data, e.g., Virtuoso, Jena SDB, Sesame, 3store; and (ii) building generic RDF stores from scratch, e.g., Jena TDB, RDF-3X, 4store, Sesame Native. As RDF data are schema-relaxed [26] and graph pattern queries in SPARQL typically involve many joins [1], [19], a full spectrum of techniques has been proposed to address the new challenges. For instance, vertical partitioning was proposed for relational backends [1]; the sideways-information-passing technique was applied for scalable join processing [19]; and various compression and indexing techniques were designed for a small memory footprint [3], [18].

With the infrastructure in place, the community is turning to more advanced applications, e.g., integrating and harvesting knowledge on the Web [24], and rewriting queries for fine-grained access control [17] and inference [13]. In such applications, a SPARQL query over views is often rewritten into an equivalent batch of SPARQL queries for evaluation over the base data. As the semantics of the rewritten queries in the same batch commonly overlap [13], [17], there is much room for sharing computation when executing them. This observation motivates us to revisit the classical problem of multi-query optimization (MQO) in the context of RDF and SPARQL.

Not surprisingly, MQO for SPARQL queries is NP-hard, considering that MQO for relational queries is NP-hard [30] and the established equivalence between SPARQL and relational algebra [2], [23]. It is tempting to apply the MQO techniques developed in relational systems to the MQO problem in SPARQL. For instance, the work by P. Roy et al. [27] represents query plans as AND-OR DAGs and uses heuristics to partially materialize intermediate results that could improve query throughput. Similar themes can be seen in a variety of contexts, including relational queries [30], [31], XQueries [6], aggregation queries [36], and, more recently, full-reducer tree queries [15]. These off-the-shelf solutions, however, are hard to engineer into RDF query engines in practice. The first source of complexity in using the relational techniques and the like stems from the physical design of RDF data itself. While indexing and storing relational data commonly conform to a carefully calibrated relational schema, many variants exist for RDF data, e.g., the giant triple table adopted in 3store and RDF-3X, the property table in Jena, and more recently the use of vertical partitioning to store RDF data. These, together with the disparate indexing techniques, make the cost estimation for an individual query operator (the cornerstone of any MQO technique) highly error-prone and store-dependent. Moreover, as observed in previous work [1], [19], SPARQL queries feature more joins than typical SQL queries – a fact that is also evident from comparing the TPC benchmarks [34] with the benchmarks for RDF stores [5], [9], [11], [28]. While existing techniques commonly rely on looking for the best plan in a greedy fashion, comparing the cost of alternative plans becomes impractical in the context of SPARQL, as the error of selectivity estimation inevitably increases with the number of joins [18], [33]. Finally, in the W3C's vision [26], RDF is a very general data model, so knowledge and facts can be seamlessly harvested and integrated from various SPARQL endpoints on the Web [38] (powered by different RDF stores). While a specialized MQO solution may serve inside the optimizer of a certain RDF store, it is more appealing to have a generic MQO framework that can smoothly fit into any SPARQL endpoint, which is coherent with the design principle of the RDF data model.

With the above challenges in mind, in this paper we study MQO of SPARQL queries over RDF data, with the objective of minimizing total query evaluation time. Specifically, we employ query rewriting techniques to achieve desirable and consistent performance for MQO across different RDF stores, with the guarantee of soundness and completeness. While previous works consider alignments of the common substructures in acyclic query plans [15], [27], we set out to identify common subqueries (cyclic query graphs included) and rewrite them in SPARQL in a meaningful way. Unlike [27], which requires explicitly materializing and indexing the common intermediate results, our approach works on top of any RDF engine and ensures that the underlying RDF store can automatically cache and reuse such results. In addition, the full range of optimization techniques in different RDF stores and SPARQL query optimizers can seamlessly support our MQO technique. Our contributions can be summarized as follows.

• We present a generic technique for MQO in SPARQL. Unlike previous works that focus on synthesizing query plans, our technique summarizes similarity in the structure of SPARQL queries and takes into account the unique properties (e.g., cyclic query patterns) of SPARQL.

• Our MQO approach relies on query rewriting, which is built on algorithms for finding common substructures. In addition, we tailor efficient and effective optimizations for finding common subqueries in a batch of SPARQL queries.

• We propose a practical cost model. Our choice of cost model is determined both by the idiosyncrasies of the SPARQL language and by our empirical understanding of how SPARQL queries are executed in existing RDF data management systems.

• Extensive experiments with large RDF data (close to 10 million triples), performed on three different RDF stores, consistently demonstrate the efficiency and effectiveness of our approach over the baseline methods.

II. PRELIMINARIES

A. SPARQL

SPARQL, a W3C recommendation, is a pattern-matching query language. We focus on two types of SPARQL queries:

Type 1: Q := SELECT RD WHERE GP
Type 2: Q_OPT := SELECT RD WHERE GP (OPTIONAL GP_OPT)+

where GP is a set of triple patterns, i.e., triples involving both variables and constants, and RD is the result description. Given an RDF data graph D, the pattern GP searches on D for a set of subgraphs of D, each of which matches the graph pattern in GP (by binding pattern variables to values in the subgraph). The result description RD for both query types contains a subset of the variables in the graph patterns, similar to a projection in SQL. The difference between the two types lies in the OPTIONAL clause. Unlike in query Q, in a Q_OPT query a subgraph of D might match not only the pattern GP but also the combination of GP and GP_OPT. While more than one OPTIONAL clause is allowed, subgraph matching with D independently considers the combination of pattern GP with each of the OPTIONAL clauses. Therefore, with n OPTIONAL clauses in a query Q_OPT, the query returns as results the subgraphs that match any of the n (GP + GP_OPT) pattern combinations, plus the results that match just the GP pattern.

subj  pred  obj
p1    name  "Alice"
p1    zip   10001
p1    mbox  alice@home
p1    mbox  alice@work
p1    www   http://home/alice
p2    name  "Bob"
p2    zip   10001
p3    name  "Ella"
p3    zip   10001
p3    www   http://work/ella
p4    name  "Tim"
p4    zip   11234
(a) Input data D

SELECT ?name, ?mail, ?hpage
WHERE { ?x name ?name, ?x zip 10001,
  OPTIONAL {?x mbox ?mail }
  OPTIONAL {?x www ?hpage }}
(b) Example query Q_OPT

name     mail        hpage
"Alice"  alice@home
"Alice"  alice@work
"Alice"              http://home/alice
"Bob"
"Ella"               http://work/ella
(c) Output Q_OPT(D)

Fig. 1. An example

Consider the data and SPARQL query in Figure 1(a) and (b). The query looks for triples whose subjects (each corresponding to a person) have the predicates name and zip, with the latter having the value 10001 as object. For these triples, it returns the object of the name predicate. Due to the first OPTIONAL clause, the query also returns the object of predicate mbox, if the predicate exists. Due to the second OPTIONAL clause, the query also independently returns the object of predicate www, if the predicate exists. Evaluating the query over the input data D (which can be viewed as a graph) results in the output Q_OPT(D) shown in Figure 1(c).
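The OPTIONAL semantics described above can be made concrete with a small Python sketch (our illustration, not code from the paper; the function names `match` and `eval_opt` are ours). `match` enumerates bindings for a set of triple patterns, and `eval_opt` extends each required-pattern match independently with every OPTIONAL clause, keeping the bare match only when no OPTIONAL applies — which reproduces the five rows of Figure 1(c).

```python
def match(patterns, triples, binding=None):
    """Enumerate all bindings of the variables (terms starting with '?')
    such that every triple pattern matches some data triple."""
    bindings = [dict(binding or {})]
    for pattern in patterns:
        new = []
        for b in bindings:
            for triple in triples:
                nb, ok = dict(b), True
                for term, value in zip(pattern, triple):
                    if term.startswith('?'):
                        ok = ok and nb.setdefault(term, value) == value
                    else:
                        ok = ok and term == value
                if ok:
                    new.append(nb)
        bindings = new
    return bindings

def eval_opt(gp, optionals, triples):
    """Type 2 query: required pattern GP plus independent OPTIONALs."""
    rows = []
    for mu in match(gp, triples):
        extended = []
        for opt in optionals:              # each OPTIONAL considered independently
            extended += match(opt, triples, mu)
        rows += extended or [mu]           # keep the bare GP match if nothing extends it
    return rows

# Data D of Figure 1(a).
D = [("p1", "name", "Alice"), ("p1", "zip", "10001"),
     ("p1", "mbox", "alice@home"), ("p1", "mbox", "alice@work"),
     ("p1", "www", "http://home/alice"),
     ("p2", "name", "Bob"), ("p2", "zip", "10001"),
     ("p3", "name", "Ella"), ("p3", "zip", "10001"),
     ("p3", "www", "http://work/ella"),
     ("p4", "name", "Tim"), ("p4", "zip", "11234")]

gp = [("?x", "name", "?name"), ("?x", "zip", "10001")]
opts = [[("?x", "mbox", "?mail")], [("?x", "www", "?hpage")]]
rows = eval_opt(gp, opts, D)               # 5 rows, as in Figure 1(c)
```

Note how Bob survives as a row with no optional bindings, while Alice contributes one row per matching OPTIONAL extension.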

Fig. 2. A query graph [figure omitted: the pattern of Figure 1(b), with vertices v1 = ?x, v2 = ?n, v3 = 10001, v4 = ?m, v5 = ?p, and edges e1 = name, e2 = zip, e3 = mbox (dashed), e4 = www (dashed)]

We represent queries graphically, and associate with each query Q (Q_OPT) a query graph pattern corresponding to its pattern GP (resp., GP (OPTIONAL GP_OPT)+). Formally, a query graph pattern is a 4-tuple (V, E, ν, µ), where V and E stand for vertices and edges, and ν and µ are two functions which assign labels (i.e., constants and variables) to the vertices and edges of GP, respectively. Vertices represent the subjects and objects of a triple; gray vertices represent constants, and white vertices represent variables. Edges represent predicates; dashed edges represent predicates in the optional patterns GP_OPT, and solid edges represent predicates in the required patterns GP. Figure 2 shows a pictorial example for the query in Figure 1(b).

Its query graph patterns GP and GP_OPTs are defined separately. GP is defined as (V, E, ν, µ), where V = {v1, v2, v3}, E = {e1, e2}, and the two naming functions are ν = {ν1: v1 → ?x, ν2: v2 → ?n, ν3: v3 → 10001} and µ = {µ1: e1 → name, µ2: e2 → zip}. The two OPTIONALs are defined as GP_OPT1 = (V′, E′, ν′, µ′), where V′ = {v1, v4}, E′ = {e3}, ν′ = {ν′1: v1 → ?x, ν′2: v4 → ?m}, and µ′ = {µ′1: e3 → mbox}; likewise, GP_OPT2 = (V′′, E′′, ν′′, µ′′), where V′′ = {v1, v5}, E′′ = {e4}, ν′′ = {ν′′1: v1 → ?x, ν′′2: v5 → ?p}, and µ′′ = {µ′′1: e4 → www}.

B. Problem statement

Formally, the problem of MQO in SPARQL, from a query rewriting perspective, is defined as follows: given a data graph G and a set Q of Type 1 queries, compute a new set Q_OPT of Type 1 and Type 2 queries, evaluate Q_OPT over G, and distribute the results to the queries in Q. There are two requirements for the rewriting approach to MQO: (i) the query results of Q_OPT can be used to produce the same results as executing the original queries in Q, which ensures the soundness and completeness of the rewriting; and (ii) the evaluation time of Q_OPT, including query rewriting, execution, and result distribution, should be less than the baseline of executing the queries in Q sequentially. To ease presentation, we assume that the input queries in Q are of Type 1, while the output (optimized) queries are either of Type 1 or Type 2. Our optimization techniques can easily handle more general scenarios where both query types are given as input (Section IV).

(a) Query Q1   (b) Query Q2   (c) Query Q3   (d) Query Q4   [query graphs omitted]

(e) Example query Q_OPT:
SELECT *
WHERE { ?x P1 ?z, ?y P2 ?z,
  OPTIONAL {?y P3 ?w, ?w P4 v1 }
  OPTIONAL {?t P3 ?x, ?t P5 v1, ?w P4 v1 }
  OPTIONAL {?x P3 ?y, v1 P5 ?y, ?w P4 v1 }
  OPTIONAL {?y P3 ?u, ?w P6 ?u, ?w P4 v1 }
}

(f) Structure- and cost-based optimization:
SELECT *
WHERE { ?w P4 v1,
  OPTIONAL {?x1 P1 ?z1, ?y1 P2 ?z1, ?y1 P3 ?w }
  OPTIONAL {?x2 P1 ?z2, ?y2 P2 ?z2, ?t2 P3 ?x2, ?t2 P5 v1 }
  OPTIONAL {?x3 P1 ?z3, ?y3 P2 ?z3, ?x3 P3 ?y3, v1 P5 ?y3 }
  OPTIONAL {?x4 P1 ?z4, ?y4 P2 ?z4, ?y4 P3 ?u4, ?w P6 ?u4 }
}

pattern p   α(p)
?x P1 ?z    15%
?y P2 ?z     9%
?y P3 ?w    18%
?w P4 v1     4%
?t P5 v1     2%
v1 P5 ?t     7%
?w P6 ?u    13%

Fig. 3. Multi-query optimization example

We use a simple example to illustrate the envisioned MQO and some challenges for the rewriting approach. Figures 3(a)-(d) show the graph representations of four queries of Type 1. Figure 3(e) shows a Type 2 query Q_OPT that rewrites all four input queries into one. To generate query Q_OPT, we identify the (largest) common subquery of all four queries: the subquery involving the triples ?x P1 ?z and ?y P2 ?z (the second largest common subquery involves only one predicate, P3 or P4). This common subquery constitutes the graph pattern GP of Q_OPT. The remaining subquery of each individual query generates an OPTIONAL clause in Q_OPT. Note that by generating a query like Q_OPT, the triple patterns in the GP of Q_OPT are evaluated only once, instead of being evaluated multiple times when the input queries are executed independently. Intuitively, this is where the savings of MQO come from. As mentioned earlier, MQO must consider generic directed graphs, possibly with cyclic patterns, which makes it hard to adapt existing techniques for this optimization. The proposed optimization also has the unique characteristic that it leverages SPARQL-specific features, such as the OPTIONAL clause, for query rewriting.

Note that the above rewriting considers only query structure, not query selectivity. Suppose we know the selectivity α(p) of each pattern p in the queries, as shown in Figure 3(f). Let us assume a simple cost model in which the cost of each query Q or Q_OPT equals the minimum selectivity of the patterns in its GP; we ignore for now the cost of OPTIONAL patterns, which is motivated by how real SPARQL engines evaluate queries (the actual cost model used in this paper is discussed in Section III-D). The cost of the four queries Q1 to Q4 is then 4, 2, 4 and 4, respectively (with queries executed on a dataset of size 100). Therefore, executing all queries individually (without optimization) costs 4 + 2 + 4 + 4 = 14. In comparison, the cost of the purely structure-based optimized query in Figure 3(e) is 9, a saving of approximately 30%. Now, consider the alternative rewriting in Figure 3(f), which results from optimizing along the second largest common subquery that contains just P4. The cost of this query is only 4, which leads to even larger savings, although the rewriting utilizes a smaller common subquery. As this simple example illustrates, it is critical for MQO to construct a cost model that integrates query structure overlap with selectivity estimation.

// J: Jaccard
Input: Set Q = {Q1, ..., Qn}
Output: Set Q_OPT of optimized queries
// Step 1: Bootstrapping the query optimizer
1   Run k-means on Q to generate a set M = {M1, ..., Mk} of k query
    groups based on query similarity in terms of their predicate sets;
// Step 2: Refining query clusters
2   for each query group M ∈ M do
3     Initialize a set C = {C1, ..., C|M|} of |M| clusters;
4     for each query Qi ∈ M, 1 ≤ i ≤ |M| do Ci = Qi;
5     while ∃ untested pair (Ci, Ci′) with Jmax(Ci, Ci′) do
6       Let Qii′ = {Qii′1, ..., Qii′m} be the queries of Ci ∪ Ci′;
7       Let S be the top-s most selective triple patterns in Qii′;
        // Step 2.1: Building compact linegraphs
8       Let µ∩ ← µ1 ∩ µ2 ... ∩ µm and τ = {∅};
9       for each query Qii′j ∈ Qii′ do
10        Build linegraph L(Qii′j) with only the edges in µ∩;
11        Keep indegree matrix m−j, outdegree matrix m+j for L(Qii′j);
12      for each vertex e defined in µ∩ with µ∩(e) ≠ ∅ do
13        Let I = m−1[e] ∩ ... ∩ m−m[e] and O = m+1[e] ∩ ... ∩ m+m[e];
14        if I = O = ∅ then µ∩(e) := ∅ and τ = τ ∪ {triple pattern on e};
15      for L(GPj), 1 ≤ j ≤ m do
16        Prune the L(GPj) vertices not in µ∩ and their incident edges;
        // Step 2.2: Building product graphs
17      Build L(GPp) = L(GP1) ⊗ L(GP2) ⊗ ... ⊗ L(GPm);
        // Step 2.3: Finding cliques in product graphs
18      {K1, ..., Kr} = AllMaximalClique(L(GPp));
19      if r = 0 then goto 22;
20      for each Ki, i = 1, 2, ..., r do
21        find all K′i ⊆ Ki having the maximal strong covering tree in Ki;
22      sort SubQ = {K′1, ..., K′t} ∪ τ in descending order by size;
23      Initialize K = ∅;
24      for each qi ∈ SubQ, i = 1, 2, ..., t + |τ| do
25        if S ∩ qi ≠ ∅ then set K = qi and break;
26      if K ≠ ∅ then
27        Let Ctmp = Ci ∪ Ci′ and cost(Ctmp) = cost(sub-query for K);
28        if cost(Ctmp) ≤ cost(Ci) + cost(Ci′) then
29          Associate K with Ctmp;
30          remove Ci, Ci′ from C and add Ctmp;
// Step 3: Generating optimized queries
31  for each cluster Ci in C do
32    if a clique K is associated with Ci then
33      Rewrite queries in Ci using triple patterns in K;
34    Output the query into set Q_OPT;
35  return Q_OPT;

Fig. 4. Multi-query optimization algorithm
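The arithmetic of this example can be checked with a few lines of Python (our illustration; the selectivities are read off Figure 3(f), and which pattern dominates each query's GP is our reading of Figures 3(a)-(d)):

```python
# Per-query pattern selectivities, in percent, from Figure 3(f).
Q1 = [15, 9, 18, 4]        # ?x P1 ?z, ?y P2 ?z, ?y P3 ?w, ?w P4 v1
Q2 = [15, 9, 18, 2, 4]     # ... plus ?t P5 v1 (2%)
Q3 = [15, 9, 18, 7, 4]     # ... plus v1 P5 ?y (7%)
Q4 = [15, 9, 18, 13, 4]    # ... plus ?w P6 ?u (13%)

def cost(selectivities, n=100):
    """Toy cost model of the example: the most selective pattern in GP,
    scaled to a dataset of size n (n = 100 here)."""
    return min(selectivities) * n // 100

baseline   = sum(cost(q) for q in (Q1, Q2, Q3, Q4))  # 4 + 2 + 4 + 4 = 14
structural = cost([15, 9])  # GP of Fig. 3(e): {?x P1 ?z, ?y P2 ?z} -> 9
cost_based = cost([4])      # GP of Fig. 3(f): {?w P4 v1}           -> 4
```

The smaller but more selective common subquery (cost 4) beats the larger structural overlap (cost 9), which is exactly the point of combining structure with selectivity.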

III. THE ALGORITHM

Our MQO algorithm, shown in Figure 4, accepts as input a set Q = {Q1, ..., Qn} of n queries over a graph G. Without loss of generality, we assume that the sets of variables used in different queries are distinct. The algorithm identifies whether there is a cost-effective way to share the evaluation of structurally overlapping graph patterns among the queries in Q. At a high level, the algorithm works as follows: (1) it partitions the input queries into groups, where queries in the same group are more likely to share common sub-queries that can be optimized through query rewriting; (2) it rewrites a number of Type 1 queries in each group into their corresponding cost-efficient Type 2 queries; and (3) it executes the rewritten queries and distributes the query results to the original input queries (along with a refinement). Several challenges arise during this process: (i) there is an exponential number of ways to partition the input queries, so we need a heuristic to prune the space of less promising query partitionings; (ii) we need an efficient algorithm to identify potential common sub-queries for a given query group; and (iii) since different common sub-queries result in different query rewritings, we need a robust cost model to compare candidate rewriting strategies. We describe how we tackle these challenges next.

A. Bootstrapping

Finding structural overlaps for a set of queries amounts to finding the isomorphic subgraphs among the corresponding query graphs. This process is computationally expensive (the problem is NP-hard [4] in general), so ideally we would like to find these overlaps only for groups of queries that will eventually be optimized (rewritten). That is, we want to minimize (or ideally eliminate) the computation spent on identifying common subgraphs for query groups that lead to less optimal MQO solutions. One heuristic we adopt is to quickly prune out subsets of queries that clearly share little in their query graphs, without going to the next, expensive step of computing their common subqueries; groups of queries that have few predicates in common are therefore pruned from further consideration. We thus define the similarity metric for two queries as the Jaccard similarity of their predicate sets. The rationale is that if the Jaccard similarity of two queries is small, their structural overlap must also be small, so it is safe not to consider grouping such queries for MQO. We implement this heuristic as a bootstrap step in line 1, using k-means clustering (with Jaccard as the similarity metric) for an initial partitioning of the input queries into a set M of k query groups. Notice that the similarity metric identifies queries with substantial overlaps in their predicate sets, ignoring for now the common sub-structure and the selectivity of these predicates.
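A minimal Python sketch of this bootstrap idea (the paper uses k-means with Jaccard as the similarity metric; for illustration we substitute a simple greedy threshold grouping, and the predicate sets are read off Figure 3 — the threshold value is our assumption):

```python
def jaccard(a, b):
    """Jaccard similarity of two predicate sets."""
    return len(a & b) / len(a | b)

# Predicate sets of the four queries in Figure 3(a)-(d).
preds = {"Q1": {"P1", "P2", "P3", "P4"},
         "Q2": {"P1", "P2", "P3", "P4", "P5"},
         "Q3": {"P1", "P2", "P3", "P4", "P5"},
         "Q4": {"P1", "P2", "P3", "P4", "P6"}}

def bootstrap(preds, threshold=0.5):
    """Greedy stand-in for the k-means step: put a query into the first
    group whose representative predicate set is similar enough,
    otherwise open a new group."""
    groups = []   # list of (representative predicate set, [query names])
    for name, ps in preds.items():
        for rep, members in groups:
            if jaccard(rep, ps) >= threshold:
                members.append(name)
                break
        else:
            groups.append((ps, [name]))
    return [members for _, members in groups]

groups = bootstrap(preds)   # all four example queries land in one group
```

Queries sharing four of at most six predicates exceed the threshold, so Q1–Q4 are grouped together and proceed to the structural refinement step.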

B. Reﬁning query clusters

Starting with the k-means generated groups M, we refine the partitioning of queries further based on their structural similarity and the estimated cost. To this end, we consider each query group M ∈ M generated from the k-means clustering in isolation (since queries across groups are guaranteed to be sufficiently different) and perform the following steps. In lines 5–30, we (incrementally) merge structurally similar queries within M through hierarchical clustering [14], and generate query clusters such that each query cluster is optimized together (i.e., results in one Type 2 query). Initially, we create one singleton cluster Ci for each query Qi of M (line 4). Given two clusters Ci and Ci′, we have to determine whether it is more cost-efficient to merge the two query clusters into a single cluster (i.e., a single Type 2 query) than to keep the two clusters separate (i.e., to execute the corresponding two queries independently). From the previous iteration, we already know the cost of the optimized queries for each of the clusters Ci and Ci′. To determine the cost of the merged cluster, we have to compute the query that merges all the queries in Ci and Ci′ through rewriting, which requires us to compute the common substructure of all these queries and to estimate the cost of the rewritten query generated from the merged cluster. For the cost computation, we do some preliminary work here (line 7) by identifying the most selective triple patterns from the two clusters (selectivity is estimated as in [33]). Note that our refinement of M might lead to more than one query: one for each cluster of M, in the form of either Type 1 or Type 2.

Finding common substructures: While finding the maximum common subgraph of two graphs is known to be NP-hard [4], the challenge here is asymptotically harder, as it requires finding the largest common substructures of multiple graphs. Existing solutions for finding common subgraphs also assume untyped edges and nodes in undirected graphs. In our case, however, the graphs represent queries, and different triple patterns might correspond to different semantics (i.e., typed and directed); thus, the predicates and the constants associated with nodes must be taken into consideration. This mix of typed, constant and variable nodes/edges is not typical in classical graph algorithms, so existing solutions cannot be directly applied to query optimization. We therefore propose an efficient algorithm to address these challenges. In a nutshell, our algorithm follows the principle of finding the maximal common edge subgraphs (MCES) [25], [37]. Concisely, three major sub-steps are involved (steps 2.1 to 2.3 in Figure 4): (a) transforming the input query graphs into their equivalent linegraph representations; (b) generating a product graph from the linegraphs; and (c) executing a tailored clique detection algorithm to find the maximal cliques in the product graph (a maximal clique corresponds to an MCES). We describe these sub-steps in detail next.

Step 2.1: Building compact linegraphs: The linegraph L(G) of a graph G is a directed graph built as follows: each node in L(G) corresponds to an edge in G, and there is an edge between two nodes in L(G) if the corresponding edges in G share a common node. Although it is straightforward to transform a graph into its linegraph representation, the context of MQO raises new requirements for the linegraph construction. We represent the linegraph of a query graph pattern as a 4-tuple L(G) = (V, E, π, ω). During linegraph construction, besides the inversion of nodes and edges of the query graph, our transformation also assigns to each edge in the linegraph one of four labels (ℓ0 to ℓ3). Specifically, for two triple patterns, there are four possible joins between their subjects and objects (ℓ0 = subject-subject, ℓ1 = subject-object, ℓ2 = object-subject, ℓ3 = object-object). The assignment of labels on linegraph edges captures these four join types (useful for pruning, as will become clear shortly). Figure 5(a)-(d) shows the linegraphs for the queries in Figure 3(a)-(d).
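The labeled-linegraph construction can be sketched in Python as follows (our simplification, not the paper's implementation: linegraph nodes are identified by their predicate, which is unambiguous in the running example where each predicate occurs once per query):

```python
# Join-type labels between two triple patterns that share a query node:
# l0 = subject-subject, l1 = subject-object, l2 = object-subject, l3 = object-object.
def linegraph(patterns):
    """Return the labeled linegraph edges of a query: a dict mapping a
    directed pair of pattern nodes (named by predicate) to its join labels."""
    edges = {}
    for (s1, p1, o1) in patterns:
        for (s2, p2, o2) in patterns:
            if p1 == p2:
                continue                        # a pattern does not join itself
            pairs = [(s1, s2), (s1, o2), (o1, s2), (o1, o2)]
            for label, (a, b) in enumerate(pairs):
                if a == b:                      # shared query node => join
                    edges.setdefault((p1, p2), set()).add("l%d" % label)
    return edges

# Q1 from Figure 3(a): ?x P1 ?z, ?y P2 ?z, ?y P3 ?w, ?w P4 v1
Q1 = [("?x", "P1", "?z"), ("?y", "P2", "?z"),
      ("?y", "P3", "?w"), ("?w", "P4", "v1")]
lg = linegraph(Q1)
```

This reproduces the edges of L(Q1) in Figure 5(a): P1 and P2 joined object-object on ?z (ℓ3), P2 and P3 subject-subject on ?y (ℓ0), and P3/P4 joined through ?w (ℓ2 one way, ℓ1 the other).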

The classical solution for finding common substructures of input graphs requires building Cartesian products of their linegraphs. This raises scalability challenges when finding the maximum common substructure of multiple queries in one shot. To avoid the foreseeable explosion, we propose fine-grained optimizations (lines 8–16) that keep the linegraphs as small as possible, so that only the most promising substructures are transformed into linegraphs, with the rest temporarily masked from further processing.

Fig. 5. (a)–(d) linegraphs, (e) their common substructures [figure omitted; in (e), L(GPp) contains P1 and P2 joined by ℓ3 edges, and τ = {P3, P4}]

To achieve the above, the queries in Qii′ pass through a two-stage optimization. In the first stage (lines 8–11), we identify (line 8) the common predicates in Qii′ by building the intersection µ∩ of all the labels defined in the µ's (recall that the function µ assigns predicate names to graph edges). Predicates that are not common to all queries can be safely pruned, since by definition they are not part of any common substructure, e.g., P5 and P6 in Figure 3. While computing the intersection of predicates, the algorithm also checks for compatibility between the corresponding subjects and objects, so that same-label predicates with different subjects/objects are not added into µ∩. In addition, we maintain two adjacency matrices for a linegraph L(GP): the indegree matrix m− storing all incoming edges, and the outdegree matrix m+ storing all outgoing edges of the L(GP) vertices. For a vertex v, we use m−[v] and m+[v], respectively, to denote the portion of the adjacency matrices storing the incoming and outgoing edges of v. For example, the adjacency matrices for vertex P3 in linegraph L(Q1) of Figure 5 are m+1[P3] = [∅, ℓ0, ∅, ℓ2, ∅, ∅] and m−1[P3] = [∅, ℓ0, ∅, ℓ1, ∅, ∅], while for linegraph L(Q2) they are m+2[P3] = [ℓ2, ∅, ∅, ∅, ℓ0, ∅] and m−2[P3] = [ℓ1, ∅, ∅, ∅, ℓ0, ∅].

In the second stage (lines 12–16), to further reduce the size of the linegraphs, for each linegraph vertex e we compute the Boolean intersection of the m−[e]'s and the m+[e]'s from all linegraphs, respectively (line 13). We prune e from µ∩ if both intersections equal ∅, and set aside the triple pattern associated with e in a set τ (line 14). Intuitively, this optimization acts as a look-ahead step in our algorithm, as it quickly detects the cases where the common sub-queries involve only one triple pattern (those in τ). Moreover, it also improves the efficiency of the clique detection (steps 2.2 and 2.3) due to the smaller sizes of the input linegraphs. Going back to our example, just by looking at m−1, m+1, m−2, m+2, it is easy to see that the intersections ∩m+i[P3] = ∩m−i[P3] = ∅ over all the linegraphs of Figure 5(a)-(d). Therefore, our optimization temporarily masks P3 (and likewise P4) from the expensive clique detection in the following two steps.

Step 2.2: Building product graphs: The product graph L(GPp) := (Vp, Ep, πp, ωp) of two linegraphs, L(GP1) := (V1, E1, π1, ω1) and L(GP2) := (V2, E2, π2, ω2), is denoted as L(GPp) := L(GP1) ⊗ L(GP2). The vertices Vp of L(GPp) are defined on the Cartesian product of V1 and V2. In order to use product graphs in MQO, we optimize the standard definition with the additional requirement that vertices paired together must have the same label (i.e., predicate). That is, Vp := {(v1, v2) | v1 ∈ V1 ∧ v2 ∈ V2 ∧ π1(v1) = π2(v2)}, with the labeling function defined as πp := {πp(v) | πp(v) = π1(v1), with v = (v1, v2) ∈ Vp}. For the product edges, we use the standard definition, which creates an edge in the product graph between two vertices (v1i, v2i) and (v1j, v2j) in Vp if either (i) the same edges (v1i, v1j) in E1 and (v2i, v2j) in E2 exist, or (ii) no edges connect v1i with v1j in E1 and v2i with v2j in E2. The edges due to (i) are termed strong connections, and those due to (ii) weak connections [37]. Since the product graph of two linegraphs conforms to the definition of a linegraph, we can recursively build the product of multiple linegraphs (line 17). In theory, there is an exponential blowup in size when we construct the product of multiple linegraphs. In practice, thanks to our optimizations in Steps 2.1 and 2.2, our algorithm is able to accommodate tens to hundreds of queries and generates the product graph efficiently (which will be verified in Section V). Figure 5(e) shows the product linegraph L(GPp) for the running example.
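A sketch of this label-restricted product in Python (our simplified representation, not the paper's: each linegraph is a dict with predicate-named vertices and labeled edges, so a vertex pairs only with its same-label counterpart; the example linegraphs approximate L(Q1) and L(Q2) after the µ∩ pruning):

```python
def product(lg1, lg2):
    """Modified product graph: keep only same-label vertex pairs; an edge is
    'strong' if both linegraphs connect the pair with a shared join label,
    and 'weak' if neither linegraph connects it."""
    Vp = sorted(set(lg1["V"]) & set(lg2["V"]))
    Ep = {}
    for a in Vp:
        for b in Vp:
            if a == b:
                continue
            e1, e2 = lg1["E"].get((a, b)), lg2["E"].get((a, b))
            if e1 and e2 and e1 & e2:
                Ep[(a, b)] = "strong"
            elif not e1 and not e2:
                Ep[(a, b)] = "weak"
    return Vp, Ep

# L(Q1) and L(Q2) restricted to the common predicates {P1, P2, P3, P4}.
lgQ1 = {"V": ["P1", "P2", "P3", "P4"],
        "E": {("P1", "P2"): {"l3"}, ("P2", "P1"): {"l3"},
              ("P2", "P3"): {"l0"}, ("P3", "P2"): {"l0"},
              ("P3", "P4"): {"l2"}, ("P4", "P3"): {"l1"}}}
lgQ2 = {"V": ["P1", "P2", "P3", "P4"],   # P5 already pruned via µ∩
        "E": {("P1", "P2"): {"l3"}, ("P2", "P1"): {"l3"},
              ("P1", "P3"): {"l1"}, ("P3", "P1"): {"l2"}}}
Vp, Ep = product(lgQ1, lgQ2)
```

Only P1–P2 survive as a strong connection (both queries join them object-object), matching the common sub-query of Figure 5(e).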

Step 2.3: Finding cliques in product graphs: A (maximal) clique with a strong covering tree (a tree involving only strong connections) in the product graph corresponds to an MCES – in essence, a (maximal) common sub-query. In addition, we are interested in finding cost-effective common sub-queries; to verify that a found common sub-query is selective, it is checked against the set S (from line 7) of selective query patterns. In the algorithm, we proceed by finding all maximal cliques in the product graph (line 18), a process for which many efficient algorithms exist [16], [21], [35]. For each discovered clique, we identify its sub-cliques with maximal strong covering trees (line 21). For the L(GPp) in Figure 5(e), this results in one clique (itself): K′1 = {P1, P2}. As the cost of sub-queries is another dimension of query optimization, we look for substructures that are both large in size (i.e., the number of query graph patterns in the overlap) and correspond to selective common sub-queries. Therefore, we first sort SubQ (contributed by the K′s and τ, line 22) by size in descending order, and then loop through the sorted list from the beginning, stopping at the first substructure that intersects S (lines 22–25), i.e., P4 in our example. We then merge (if it is cost-effective, line 28) the queries whose common sub-query is reflected in K and also merge their corresponding clusters into a new cluster, while remembering the found common sub-query (lines 26–30). The algorithm repeats lines 5–30 until every possible pair of clusters has been tested and no new cluster can be generated.
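The paper leaves the maximal-clique routine to existing algorithms [16], [21], [35]; one classical choice is Bron-Kerbosch, sketched here over a small adjacency structure shaped like the running example's product graph (the adjacency sets are our assumption for illustration, with strong and weak edges treated alike for clique membership):

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques; adj maps a vertex to
    the set of its neighbours (strong or weak product-graph edges alike)."""
    cliques = []
    def bk(r, p, x):
        if not p and not x:
            cliques.append(r)          # r cannot be extended: maximal clique
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}                # v fully explored
            x = x | {v}
    bk(set(), set(adj), set())
    return cliques

# Product graph of the running example: P1-P2 strongly connected,
# both weakly connected to P4; P3 has no surviving connections.
adj = {"P1": {"P2", "P4"}, "P2": {"P1", "P4"},
       "P3": set(), "P4": {"P1", "P2"}}
cliques = maximal_cliques(adj)   # finds {P1, P2, P4} and the isolated {P3}
```

On real product graphs a pivoting variant of Bron-Kerbosch is usually preferred, since it prunes the exponential branching that plain enumeration incurs on dense graphs.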

C. Generating optimized queries and distributing results

After the clusters are finalized, the algorithm rewrites each cluster of queries into one query, thus generating a set of rewritings Q_OPT (lines 31–34). The result of evaluating Q_OPT over the data is a superset of evaluating the input queries Q