Top-K Nearest Keyword Search on Large Graphs
Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, Wentao Tian
The Chinese University of Hong Kong, Hong Kong, China
{mqiao,lqin,hcheng,yu,wttian}@se.cuhk.edu.hk
ABSTRACT
It is quite common for networks emerging nowadays to have labels
or textual contents on the nodes. On such networks, we study the
problem of top-k nearest keyword (k-NK) search. In a network G
modeled as an undirected graph, each node is attached with zero or
more keywords, and each edge is assigned with a weight measuring
its length. Given a query node q in G and a keyword λ, a k-NK
query seeks k nodes which contain λ and are nearest to q. k-NK is
not only useful as a stand-alone query but also as a building block
for tackling complex graph pattern matching problems.
The key to an accurate k-NK result is a precise shortest distance
estimation in a graph. Based on the latest distance oracle technique,
we build a shortest path tree for a distance oracle and use the tree
distance as a more accurate estimation. With such representation,
the original k-NK query on a graph can be reduced to answering
the query on a set of trees and then assembling the results obtained
from the trees. We propose two efficient algorithms to report the
exact k-NK result on a tree. One is query time optimized for a
scenario when a small number of result nodes are of interest to
users. The other handles k-NK queries for an arbitrarily large k
efficiently. In obtaining a k-NK result on a graph from that on trees,
a global storage technique is proposed to further reduce the index
size and the query time. Extensive experimental results conform
with our theoretical findings, and demonstrate the effectiveness and
efficiency of our k-NK algorithms on large real graphs.
1. INTRODUCTION
Many real-world networks emerging nowadays have labels or
textual contents on the nodes. For example in a road network, a
location may have labels such as “McDonald’s”, “hospital”, and
“kindergarten”. In a social network, a person may have information
including name, interests and skills. In a bibliographic network, a
paper may have keywords and an abstract, and an author may have a
name, affiliation and email address. In this study, we consider the
problem of top-k nearest keyword (k-NK) search on large networks. In
a network G modeled as an undirected graph, each node is attached
with zero or more keywords, and each edge is assigned a weight
measuring its length. Given a query node q in G and a keyword λ, a
k-NK query in the form of Q = (q, λ, k) looks for k nodes which
contain λ and are nearest to q. Different from a large body of
research on k-nearest neighbor (k-NN) search on spatial networks
[15, 5, 6, 18, 19, 7], we define G as a general graph without
coordinates. Thus our solution can apply to a wide range of networks.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 10
Copyright 2013 VLDB Endowment 2150-8097/13/10... $10.00.
Motivation. k-NK is an important and useful query in graph search.
As a stand-alone query, it has a wide range of applications. Further-
more, it can serve as a building block for tackling complex graph
pattern matching problems which impose both structural and tex-
tual constraints. Here we list a few applications of k-NK queries.
Consider the social network Facebook as an example, in which
personalized search based on graph structure and textual contents
has become increasingly popular¹. A person looks for 20 friends or
potential friends who like hiking to participate in a hiking activity.
Intuitively, if two persons share some common friends, i.e., they are
two hops away, they are more likely to become friends. In contrast,
if they are far away from each other in the network, they are less
likely to establish a link. Thus the problem is to find 20 persons
who like hiking and are nearest to the person who serves as the
organizer. It can be answered by a k-NK query. More generally,
we also consider a query containing multiple keywords connected
by AND or OR operators to express more complex semantics, e.g.,
a person looks for k friends or potential friends who like hiking
AND (OR) photography and are nearest to him.
Take a road network with locations associated with keywords as
another example. For parents looking for k kindergartens nearest
to their home for their children, their requirements can be expressed
by a k-NK query where the query node is the home location, and
the keyword is “kindergarten”.
In the third example, we show how k-NK queries serve as a
building block for solving the graph pattern matching problem.
Consider a couple who wants to buy a house. They have some con-
straints like having a kindergarten and a hospital within 3 km, and
a supermarket within 1 km of their home. These constraints can be
expressed as a star pattern, and the pattern matching problem can
be decomposed into three k-NK queries with keywords “kinder-
garten”, “hospital” and “supermarket” respectively and k = 1 for
each potential house location to be considered.
Recently, Bahmani and Goel [1] have designed a Partitioned
Multi-Indexing (PMI) scheme to answer k-NK queries approximately.
PMI is an inverted index built on the distance oracle [20], which is
a distance estimation technique. Given a k-NK query Q = (q, λ, k), it
returns k nodes containing keyword λ in ascending order of their
approximate distance from the query node q. PMI inherits the
$2 \log_2 |V| - 1$ approximation factor for distance estimation from
the distance oracle [20], where V is the set of nodes in the graph.
The major drawback of PMI is that its distance estimation error could
be quite large in practice. This can greatly distort the ranking of
the candidate nodes carrying the query keywords, and thus lead to a
low result quality.
¹ https://www.facebook.com/about/graphsearch
In this work, we study how to answer k-NK queries accurately
and efficiently using a compact index. The key to an accurate k-NK
result is a precise shortest distance estimation in a graph. As we
use a general graph model, existing k-NN solutions on spatial net-
works [15, 5, 6, 18, 19, 7] cannot be applied, as they usually rely
on specialized structures that leverage properties of spatial data to
optimize their solutions. Instead we use distance oracle [20] as the
fundamental distance estimation framework. For each component
of a distance oracle, we will build a shortest path tree, based on
which we can estimate the shortest distance between two nodes by
their tree distance. The tree distance is more accurate than the
distance estimated by the distance oracle itself, which we call the
witness distance to distinguish the two. As we transform a distance oracle on a graph into a
set of shortest path trees, the original k-NK query on the graph can
be reduced to answering the k-NK query on a set of trees. Thus we
first focus on processing k-NK queries to find exact top-k answers
on a tree. Then we study how to assemble the results obtained from
the trees to form the approximate top-k answers on the graph.
Contributions. Our main contributions in this work are summa-
rized as follows.
(1) Given a tree, we first consider a common scenario when users
are interested in a small number of answer nodes bounded by a
small constant $\bar{k}$, i.e., $k \le \bar{k}$. We propose the first
algorithm tree-boundk with query time $O(k + \log |V_\lambda|)$, where
$|V_\lambda|$ is the number of nodes carrying the query keyword λ, and
index size $O(\bar{k} \cdot |doc(V)|)$, where $|doc(V)|$ is the total
number of keywords on all the nodes in the graph.
(2) Next we remove the $\bar{k}$ restriction and handle k-NK queries
for an arbitrary k on a tree. We propose the second algorithm
tree-pivot with query time $O(k \cdot \log |V|)$ and index size
$O(|doc(V)| \cdot \log |V|)$, which is independent of k and thus more
scalable.
(3) Based on our proposed tree algorithms, we present our algo-
rithm for approximate k-NK query on a graph. We propose a global
storage technique to further reduce the index size and the query
time. We also show how to extend our methods to handle a query
with multiple keywords.
(4) Our experimental evaluation demonstrates the effectiveness and
efficiency of our k-NK algorithms on large real-world networks.
We show the superiority of our methods in ranking top-k answer
nodes accurately, when compared with the state-of-the-art top-k
keyword search method PMI [1].
Roadmap. The rest of the paper is organized as follows. Sec-
tion 2 formally defines the problem. Section 3 discusses two ex-
isting related studies and their drawbacks. Section 4 presents our
framework. Sections 5 and 6 introduce two proposed algorithms to
answer k-NK queries on a tree for a small k and an arbitrary k re-
spectively using compact index structures. Section 7 elaborates on
the way to answer k-NK queries on a graph by approximating the
graph with a bounded number of trees. Section 8 presents exten-
sive experimental evaluation. Section 9 reviews the previous works
related to ours. Finally, Section 10 concludes the paper.
2. PROBLEM DEFINITION
We model a weighted undirected graph as G(V, E), where V(G)
represents the set of nodes and E(G) represents the set of edges in
G. We use V and E to denote V(G) and E(G) if the context is
obvious. Each edge $(u, v) \in E$ has a positive weight, denoted
as weight(u, v). A path $p = (v_1, v_2, \cdots, v_l)$ is a sequence of l
nodes in V such that for each $v_i$ ($1 \le i < l$), $(v_i, v_{i+1}) \in E$.
The weight of a path is the total weight of all edges on the path.
For any two nodes $u \in V$ and $v \in V$, the distance of u and v
on G, dist(u, v), is the minimum weight of all paths from u to v
in G. Each node $v \in V$ contains a set of zero or more keywords
which is denoted as doc(v). The union of keywords for all nodes
in G is denoted as doc(V). Note that doc(V) is a multiset and
$|doc(V)| = \sum_{v \in V} |doc(v)|$. We use $V_\lambda \subseteq V$ to denote the set of
nodes carrying keyword λ in V.
Figure 1: A Graph G with Keywords
DEFINITION 1. Given a graph G(V, E), a top-k nearest keyword
(k-NK) query is a triple Q = (q, λ, k), where $q \in V$ is a
query node in G, λ is a keyword, and k is a positive integer. Given
a query Q, a node $v \in V$ is a keyword node w.r.t. Q if v contains
keyword λ, i.e., $v \in V_\lambda$. The result is a set of k keyword nodes,
denoted as $R = \{v_1, v_2, \cdots, v_k\} \subseteq V_\lambda$, such that there does not exist
a node $u \in V_\lambda \setminus R$ with $dist(q, u) < \max_{v \in R} dist(q, v)$. To
further report the distance in the top-k result, we can use the form
$R = \{v_1 : dist(q, v_1), v_2 : dist(q, v_2), \cdots, v_k : dist(q, v_k)\}$.
In this paper, we aim at answering a k-NK query Q = (q, λ, k)
on a graph G. For simplicity, we assume that there is only one
keyword λ in the query. We will discuss how to answer a query
containing multiple keywords with AND and OR semantics.
Example 1: Fig. 1 shows a graph G. Assume that the weight of
each edge is 1. For a k-NK query Q = (f, λ, 3), the keyword node
set is $V_\lambda = \{b, c, k, n, t\}$. The result of Q is R = {b : 2, n : 4, k : 5}
since dist(f, b) = 2, dist(f, n) = 4, and dist(f, k) = 5. □
3. EXISTING SOLUTIONS
A straightforward approach to answering a k-NK query Q =
(q, λ, k) on G is to use Dijkstra’s algorithm to search from the
query node q and output k nearest keyword nodes in nondecreasing
order of their distances to q. The time complexity is O(|E| + |V | ·
log |V |). Obviously, Dijkstra’s algorithm is inefficient when the
size of the graph is large or the keyword nodes are far away from q.
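The baseline can be sketched as follows. This is a minimal illustration, assuming an adjacency-list graph and a per-node keyword set; the names `knk_dijkstra`, `adj`, `doc` and the toy graph are hypothetical and not taken from the paper.

```python
import heapq

def knk_dijkstra(adj, doc, q, keyword, k):
    """Return up to k (node, dist) pairs carrying `keyword`, nearest to q first."""
    dist = {q: 0}
    heap = [(0, q)]
    done = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                      # stale heap entry
        done.add(u)
        if keyword in doc.get(u, ()):     # nodes are settled in nondecreasing distance
            result.append((u, d))
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# Hypothetical toy graph: (neighbor, weight) lists and keyword sets.
adj = {
    "q": [("a", 1), ("b", 4)],
    "a": [("q", 1), ("b", 1), ("c", 5)],
    "b": [("q", 4), ("a", 1), ("c", 1)],
    "c": [("a", 5), ("b", 1)],
}
doc = {"a": {"cafe"}, "b": {"hospital"}, "c": {"hospital", "cafe"}}
print(knk_dijkstra(adj, doc, "q", "hospital", 2))  # [('b', 2), ('c', 3)]
```

The search stops as soon as k keyword nodes are settled, but in the worst case it still explores the whole graph, which is the inefficiency noted above.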
In the literature, [1] and [22] design different indexing schemes
to process (top-k) nearest keyword queries on a graph or a tree. We
introduce the two methods in the following two subsections.
3.1 Approximate k-NK on a Graph
Bahmani and Goel [1] find an approximate answer to a k-NK
query in a graph based on a distance oracle [20].
Distance Oracle: Distance oracle is a technique for estimating the
distance of two nodes in a graph [20]. Given a graph G, a distance
oracle is a Voronoi partition of V(G) determined by a set of
randomly selected center nodes. More specifically, given a number
$n_c$, we randomly select $n_c$ nodes from V(G) as the center nodes
to construct a distance oracle O. Then the partition is constructed
by assigning each node $v \in V(G)$ to its nearest center node,
denoted as $wit_O(v)$, which is called the witness node of v w.r.t. O. If
v is a center node, $wit_O(v) = v$. For each node $v \in V(G)$, the
shortest distance from v to its witness node, i.e., $dist(v, wit_O(v))$,
is precomputed. After constructing O, given two nodes u and v
in G, if u and v are in the same partition in O, i.e., $wit_O(u) = wit_O(v)$,
we compute the estimated distance, called witness distance, as
$\widetilde{dist}_O(u, v) = dist(u, wit_O(u)) + dist(v, wit_O(v))$. If u
and v are not in the same partition in O, $\widetilde{dist}_O(u, v) = +\infty$.
Figure 2: Two Distance Oracles $O_1$ and $O_2$
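A single distance oracle of this kind can be sketched with a multi-source Dijkstra search. The code below is an illustration only: `build_oracle`, `witness_distance` and the path graph are hypothetical names and data, and the centers are fixed by hand rather than sampled at random.

```python
import heapq

def build_oracle(adj, centers):
    """Multi-source Dijkstra from the centers.

    Returns {node: (witness, dist(node, witness))}.
    """
    heap = [(0, c, c) for c in centers]
    heapq.heapify(heap)
    oracle = {}
    while heap:
        d, u, wit = heapq.heappop(heap)
        if u in oracle:              # already settled by a nearer center
            continue
        oracle[u] = (wit, d)
        for v, w in adj[u]:
            if v not in oracle:
                heapq.heappush(heap, (d + w, v, wit))
    return oracle

def witness_distance(oracle, u, v):
    """dist(u, wit(u)) + dist(v, wit(v)) if same partition, else +inf."""
    wu, du = oracle[u]
    wv, dv = oracle[v]
    return du + dv if wu == wv else float("inf")

# Hypothetical path graph a-b-c-d-e-f with unit weights.
names = "abcdef"
adj = {n: [] for n in names}
for x, y in zip(names, names[1:]):
    adj[x].append((y, 1))
    adj[y].append((x, 1))

oracle = build_oracle(adj, centers=["a", "f"])
print(witness_distance(oracle, "b", "c"))  # 3, although dist(b, c) = 1
print(witness_distance(oracle, "c", "d"))  # inf: different partitions
```

Both weaknesses discussed next are visible here: the estimate for (b, c) overshoots, and (c, d) cannot be estimated at all.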
One distance oracle is usually not enough for distance estimation
in a graph G. It cannot estimate the distance of two nodes in
different partitions. Even for two nodes in the same partition, the
estimation may have a large error. Therefore, a set of $r = p \times \log |V|$
distance oracles $\{O_1, O_2, \cdots, O_r\}$ are constructed, where p can
be considered as a constant². The construction proceeds in $\log |V|$
phases. In phase i ($0 \le i < \log |V|$), p distance oracles are
constructed where each distance oracle contains $2^i$ randomly selected
center nodes. Given r distance oracles, the distance of two nodes
u and v in G can be estimated as an upper bound
$\widetilde{dist}(u, v) = \min_{1 \le i \le r} \widetilde{dist}_{O_i}(u, v)$.
The time complexity to compute the estimated distance $\widetilde{dist}(u, v)$
for any two nodes u and v in a graph G is $O(\log |V|)$. The distance
oracles consume $O(|V| \cdot \log |V|)$ space. Das Sarma et al. [20]
prove that when $p = \Theta(|V|^{1/\log |V|})$, the estimated distance can
be bounded by $dist(u, v) \le \widetilde{dist}(u, v) \le (2 \log_2 |V| - 1) \cdot dist(u, v)$
with a high probability.
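The phased construction and the min-over-oracles estimate can be sketched as below. This is a hedged illustration: `build_oracles`, `estimate`, the choice p = 2, the fixed random seed and the ring graph are all assumptions made for the example, not values from [20] or [1].

```python
import heapq
import math
import random

def build_oracle(adj, centers):
    """Multi-source Dijkstra: {node: (witness, dist to witness)}."""
    heap = [(0, c, c) for c in centers]
    heapq.heapify(heap)
    oracle = {}
    while heap:
        d, u, wit = heapq.heappop(heap)
        if u in oracle:
            continue
        oracle[u] = (wit, d)
        for v, w in adj[u]:
            if v not in oracle:
                heapq.heappush(heap, (d + w, v, wit))
    return oracle

def build_oracles(adj, p=2, seed=0):
    """p oracles per phase; phase i uses 2**i random centers."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    phases = max(1, int(math.log2(len(nodes))))
    oracles = []
    for i in range(phases):
        for _ in range(p):
            centers = rng.sample(nodes, min(2 ** i, len(nodes)))
            oracles.append(build_oracle(adj, centers))
    return oracles

def estimate(oracles, u, v):
    """Minimum witness distance over all oracles (an upper bound on dist)."""
    best = float("inf")
    for o in oracles:
        (wu, du), (wv, dv) = o[u], o[v]
        if wu == wv:
            best = min(best, du + dv)
    return best

# Hypothetical ring of 8 nodes with unit weights.
n = 8
adj = {i: [((i + 1) % n, 1), ((i - 1) % n, 1)] for i in range(n)}
oracles = build_oracles(adj)
print(len(oracles), estimate(oracles, 0, 3))
```

Because the phase-0 oracles have a single center, every pair of nodes in a connected graph shares a witness in at least one oracle, so the estimate is always finite there.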
Example 2: Fig. 2 shows two distance oracles $O_1$ and $O_2$ for the
graph shown in Fig. 1. There is one center node r in $O_1$, and four
center nodes r, n, o and t in $O_2$. The distance of nodes j and
s is estimated as
$\widetilde{dist}(j, s) = \min\{\widetilde{dist}_{O_1}(j, s), \widetilde{dist}_{O_2}(j, s)\} = \min\{dist(j, r) + dist(s, r), dist(j, n) + dist(s, n)\} = 5$. □
Answering k-NK with Distance Oracle: [1] designs a Partitioned
Multi-Indexing (PMI) scheme which uses a set of distance oracles
to answer a k-NK query in a graph. For each partition in a distance
oracle $O_i$, an inverted list is constructed for each keyword in the
partition. Specifically, for a partition with a center node c and a
keyword λ, the inverted list contains all nodes in the partition that
contain keyword λ ranked in nondecreasing order of their distances
to c. Given a k-NK query Q = (q, λ, k) and a distance oracle $O_i$,
the algorithm first finds the partition that q belongs to in $O_i$. The
result w.r.t. $O_i$ is the first k elements in the inverted list for λ in the
partition, denoted as
$R_{O_i} = \{u_1 : dist(c, u_1) + dist(c, q), u_2 : dist(c, u_2) + dist(c, q), \cdots, u_k : dist(c, u_k) + dist(c, q)\}$.
The final result R is computed by merging the nodes in each $R_{O_i}$ and
maintaining k nodes with the shortest distances to q. The query
time complexity is $O(k \cdot \log |V|)$. We illustrate the algorithm using
the following example.
Example 3: Consider the graph in Fig. 1 and two distance oracles
in Fig. 2. For keyword λ, the inverted list for the partition centered
at node r in $O_1$ has 5 elements {b : 1, n : 3, k : 4, c : 5, t : 6}.
The inverted list for the partition centered at node o in $O_2$ has 1
element {k : 2}. Given a k-NK query Q = (m, λ, 2), from $O_1$, we
can get a result $R_{O_1}$ = {b : 1 + dist(r, m), n : 3 + dist(r, m)} =
{b : 5, n : 7}, and from $O_2$, we can get a result $R_{O_2}$ =
{k : 2 + dist(o, m)} = {k : 3}. By merging $R_{O_1}$ and $R_{O_2}$, the final
answer is R = {k : 3, b : 5}. The exact answer is R = {c : 1, k : 1}
according to Fig. 1. □
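The PMI lookup in Example 3 can be reproduced with a short sketch. The data layout (`witness` and `inverted` dictionaries), the function name `pmi_query` and the keyword string `"lam"` standing in for λ are assumptions for illustration; only the numbers come from Example 3.

```python
import heapq

def pmi_query(oracles, q, keyword, k):
    """Merge the first k inverted-list entries of q's partition in each oracle."""
    best = {}
    for o in oracles:
        center, dq = o["witness"][q]          # partition of q and dist(c, q)
        for node, dc in o["inverted"].get((center, keyword), [])[:k]:
            est = dc + dq                     # witness distance via the center
            if est < best.get(node, float("inf")):
                best[node] = est
    return heapq.nsmallest(k, best.items(), key=lambda kv: (kv[1], kv[0]))

# Data of Example 3: dist(r, m) = 4 in O_1 and dist(o, m) = 1 in O_2.
O1 = {
    "witness": {"m": ("r", 4)},
    "inverted": {("r", "lam"): [("b", 1), ("n", 3), ("k", 4), ("c", 5), ("t", 6)]},
}
O2 = {
    "witness": {"m": ("o", 1)},
    "inverted": {("o", "lam"): [("k", 2)]},
}
print(pmi_query([O1, O2], "m", "lam", 2))  # [('k', 3), ('b', 5)]
```

The output matches the approximate answer {k : 3, b : 5} of Example 3, which differs from the exact answer {c : 1, k : 1}.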
Limitation: Although in theory the witness distance used by [1]
can be bounded by a factor of $2 \log_2 |V| - 1$ of the exact distance
with a high probability, in practice we find that the distance
estimation error can be quite large. For example, for the graph G in
Fig. 1 and two distance oracles $O_1$ and $O_2$ in Fig. 2, for two nodes
s and v, the witness distance in $O_1$ is
$\widetilde{dist}_{O_1}(s, v) = dist(s, r) + dist(v, r) = 10$, and that in $O_2$ is
$\widetilde{dist}_{O_2}(s, v) = dist(s, n) + dist(v, n) = 6$. However, the exact distance is dist(s, v) = 2 in
G, which is much smaller than both $\widetilde{dist}_{O_1}(s, v)$ and $\widetilde{dist}_{O_2}(s, v)$.
The inaccurate distance estimation can greatly distort the ranking
of the nodes carrying the query keyword, and thus lead to a low
result quality, as illustrated in Example 3.
² In [20], the set $\{O_1, O_2, \cdots, O_r\}$ is defined as a distance oracle.
Figure 3: A Tree T with Preorder and Interval on Each Node
Node labels (node, preorder, interval): r,1,[1,20]; g,2,[2,6]; a,3,[3,6]; j,4,[4,5]; k,5,[5,5]; n,6,[6,6]; f,7,[7,8]; i,8,[8,8]; d,9,[9,18]; h,10,[10,18]; e,11,[11,18]; p,12,[12,14]; v,13,[13,13]; s,14,[14,14]; m,15,[15,18]; c,16,[16,17]; t,17,[17,17]; o,18,[18,18]; b,19,[19,20]; u,20,[20,20]
Figure 4: CT(λ), ECT(λ) and TVP(λ) for Keyword λ
CT(λ) nodes: r, a, k, n, c, t, b
ECT(λ) nodes with preorder labels: r 1, a 3, j 4, k 5, n 6, e 11, c 16, t 17, b 19
TVP(λ):
Interval  [1,2]  3  [4,5]  6  [7,10]  [11,16]  17  18  [19,20]
Result    b      n  k      n  b       c        t   c   b
3.2 Exact 1-NK on a Tree
Tao et al. [22] compute the exact answer to a 1-NK query on a
tree T (V, E). Given a query Q = (q, λ, 1), the result is the nearest
node in T that contains keyword λ, denoted as NN(q, λ). The ba-
sic idea is as follows. We label a node v with the sequence number
of v in the preorder traversal of T . For a certain keyword λ, all
nodes with the preorder label in the interval [1, |V|] can be
partitioned into several disjoint intervals, such that all nodes v in
the same interval share an identical NN(v, λ). The partition is called
the tree Voronoi partition of λ, denoted as TVP(λ). By precomputing
TVP(λ) for all keywords λ on the tree, a query Q = (q, λ, 1) can
be answered in $O(\log |V_\lambda|)$ time using a binary search in TVP(λ).
In order to compute TVP(λ) for all keywords λ in T efficiently,
two new data structures, namely, Compact Tree CT(λ) and Ex-
tended Compact Tree ECT(λ), are proposed in [22].
DEFINITION 2. (Compact Tree and Extended Compact Tree)
For a tree T and a keyword λ, a compact tree CT(λ) is a tree that
keeps only two types of nodes in T : a keyword node that contains
keyword λ, and a node that has at least two direct subtrees
containing nodes carrying keyword λ. In the preorder traversal of T, for
two successive nodes u and v, if $NN(u, \lambda) \ne NN(v, \lambda)$, v is called
a change node. An extended compact tree ECT(λ) is a tree
constructed by adding all change nodes into the compact tree CT(λ).
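The compact-tree part of this definition can be sketched with one bottom-up pass. The child lists below are an assumption: they are reconstructed from the preorder labels and intervals of Fig. 3, and `compact_tree_nodes` is a hypothetical helper, not the construction used in [22].

```python
# Tree of Fig. 3 as child lists (reconstructed from preorder intervals; assumption).
CHILDREN = {
    "r": ["g", "f", "d", "b"], "g": ["a"], "a": ["j", "n"], "j": ["k"],
    "f": ["i"], "d": ["h"], "h": ["e"], "e": ["p", "m"], "p": ["v", "s"],
    "m": ["c", "o"], "c": ["t"], "b": ["u"],
}
KEYWORD_NODES = {"b", "c", "k", "n", "t"}   # nodes carrying the keyword

def compact_tree_nodes(children, root, keyword_nodes):
    """Keep keyword nodes and nodes with >= 2 child subtrees holding keyword nodes."""
    keep = set()

    def visit(v):
        # Number of direct subtrees of v that contain a keyword node.
        hits = sum(visit(c) for c in children.get(v, []))
        if v in keyword_nodes or hits >= 2:
            keep.add(v)
        return v in keyword_nodes or hits > 0

    visit(root)
    return keep

print(sorted(compact_tree_nodes(CHILDREN, "r", KEYWORD_NODES)))
# ['a', 'b', 'c', 'k', 'n', 'r', 't']
```

Under these assumptions the result agrees with the CT(λ) of Fig. 4: r and a are kept as branching nodes, while e is dropped because only its subtree rooted at m contains keyword nodes.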
Using ECT(λ), TVP(λ) can be constructed easily. In [22],
the authors prove that the total size of all compact trees and all
extended compact trees for all keywords in the tree T (V, E) is
bounded by O(|doc(V )|). The time to compute all compact trees
and all extended compact trees for all keywords in the tree T (V, E)
is bounded by O(|doc(V )| · log |V |).
Example 4: Fig. 3 shows a tree with the preorder label from 1 to 20
on its nodes. For keyword λ, there are 5 keyword nodes b, c, k, n, t.
For node s, NN(s, λ) = c. The compact tree of λ, CT(λ), is shown
on the left part of Fig. 4. Node r is in CT(λ) because r has three
direct subtrees with nodes carrying keyword λ. e is not in CT(λ)
because e is not a keyword node and e has only one direct subtree
rooted at m with nodes carrying keyword λ. The extended compact
tree of λ, ECT(λ), is shown in the middle part of Fig. 4 with the
preorder label marked beside each node. Node e is in ECT(λ),
because for its parent node h, $NN(h, \lambda) = b \ne NN(e, \lambda) = c$.
The tree Voronoi partition of λ, TVP(λ), is shown on the right part
of Fig. 4. For node s with preorder label 14, it is in the interval
[11, 16], thus NN(s, λ) = c as listed in TVP(λ). □
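The binary search over TVP(λ) can be sketched as follows. The interval starts and results are copied from the TVP(λ) table of Fig. 4; the names `TVP_STARTS`, `TVP_RESULT` and `nearest_keyword_node` are illustrative only.

```python
from bisect import bisect_right

# TVP for the keyword as listed in Fig. 4: each interval is stored by its
# first preorder label, paired with the nearest keyword node of that interval.
TVP_STARTS = [1, 3, 4, 6, 7, 11, 17, 18, 19]
TVP_RESULT = ["b", "n", "k", "n", "b", "c", "t", "c", "b"]

def nearest_keyword_node(preorder):
    """1-NK answer for the node with the given preorder label."""
    i = bisect_right(TVP_STARTS, preorder) - 1   # last interval starting <= label
    return TVP_RESULT[i]

print(nearest_keyword_node(14))  # node s -> 'c'
```

Storing only the interval starts is enough because the intervals are disjoint and cover [1, |V|].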
4. SOLUTION OVERVIEW
Answering k-NK on a Graph using Tree Distance: To address
the drawback of witness distance, in this paper, we propose to use
tree distance in processing a k-NK query. We observe that for a
partition of a distance oracle, we can construct a shortest path tree
rooted at the center node of the partition. Since a tree contains more
structural information than a star, using tree distance will be more
accurate than using witness distance for estimating the distance of
two nodes. For a distance oracle $O_i$, let the set of trees constructed
in $O_i$ be $T_i$. $T_i$ can be considered as a tree by adding a virtual
root and several virtual edges with weight $+\infty$ that connect the
new virtual root to every root node in $T_i$ respectively. Let the k-NK
result on tree T be $R_T$. Suppose we have an algorithm to compute
$R_T$ on a tree T; then we can solve the k-NK problem in a graph by
merging $R_{T_i}$ for each tree $T_i$, $1 \le i \le r$. Such a result is at
least as accurate as the result by [1], because the tree path between
two nodes in a partition is never longer than the path through the
center node. The following example
illustrates the k-NK query processing based on tree distance.
Example 5: For the distance oracles $O_1$ and $O_2$ shown in Fig. 2,
the corresponding shortest path trees $T_1$ and $T_2$ are shown in Fig. 5.
For $T_1$, there is only 1 tree rooted at r because there is only 1
partition in $O_1$. For $T_2$, there are 4 trees rooted at nodes n, o, r, t
respectively, because there are 4 partitions in $O_2$. In each tree, the
path from any node to the root node is a shortest path in the original
graph. For two nodes s and v, their tree distance is 2 in both $T_1$ and
$T_2$, the same as the exact distance dist(s, v) in G. For a k-NK query
Q = (m, λ, 2), we have $R_{T_1}$ = {c : 1, t : 2}, and $R_{T_2}$ = {k : 1}.
By merging $R_{T_1}$ and $R_{T_2}$, we get R = {c : 1, k : 1}. Such a result
is much better than the result in Example 3 computed using witness
distance for the same query. □
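The gap between tree distance and witness distance can be checked on a small sketch. It assumes that the tree of Fig. 3 coincides with the shortest path tree $T_1$ rooted at r and that all edges have weight 1 (as in Example 1); the parent map and function names are reconstructions for illustration.

```python
# Parent pointers of the tree in Fig. 3 (reconstructed; assumed to equal T_1).
PARENT = {
    "g": "r", "f": "r", "d": "r", "b": "r", "a": "g", "j": "a", "n": "a",
    "k": "j", "i": "f", "h": "d", "e": "h", "p": "e", "m": "e", "v": "p",
    "s": "p", "c": "m", "o": "m", "t": "c", "u": "b",
}

def path_to_root(v):
    """Nodes from v up to the root, inclusive."""
    path = [v]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def witness_distance(u, v):
    """Distance through the root; with unit weights, depth(u) + depth(v)."""
    return (len(path_to_root(u)) - 1) + (len(path_to_root(v)) - 1)

def tree_distance(u, v):
    """Distance through the lowest common ancestor (unit weights)."""
    pu, pv = path_to_root(u), path_to_root(v)
    while pu and pv and pu[-1] == pv[-1]:   # drop the shared ancestors
        pu.pop()
        pv.pop()
    return len(pu) + len(pv)

print(witness_distance("s", "v"), tree_distance("s", "v"))  # 10 2
```

The two numbers match the witness distance 10 in $O_1$ and the tree distance 2 quoted for nodes s and v.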
With the tree distance formulation, the key operation in answer-
ing a k-NK query on a graph is to answer the k-NK query on a tree.
Therefore, we start with processing a k-NK query on a tree.
Answering k-NK on a Tree: We show that it is nontrivial to answer
a k-NK query on a tree efficiently even if k is bounded. Our first
attempt is to extend the existing 1-NK solution on a tree T (V, E)
in [22]. Recall that in [22], for a certain keyword λ, the range
[1, |V |] is partitioned into several disjoint intervals, and nodes with
the preorder label in an identical interval share the same 1-NK
result. When $k \ge 2$, each interval needs to be further partitioned to
ensure that all nodes with the preorder label in the same interval
share an identical k-NK result. The number of intervals increases
exponentially w.r.t. the number of keyword nodes on the tree until
it reaches |V| for a keyword λ. Clearly, using such an approach,
the index size is too large in practice even for a small k. Our second
attempt is that, for each node v on the tree T(V, E) and each
keyword λ, we precompute its $\bar{k}$ nearest nodes that contain λ. When
processing a query Q = (q, λ, k) with $k \le \bar{k}$, we can simply
retrieve the precomputed result on node q and output the first k nodes
directly. Such an approach is impractical because for each keyword
λ, we need $O(\bar{k} \cdot |V|)$ space to store the precomputed results.
In the following, we first introduce two algorithms for answering
exact k-NK on a tree T(V, E). Our first algorithm tree-boundk can
only handle bounded k values with query processing time
$O(k + \log |V_\lambda|)$ and index size $O(\bar{k} \cdot |doc(V)|)$ for all keywords, where $\bar{k}$
is an upper bound value of k. Our second algorithm tree-pivot can
handle an arbitrary k with query processing time $O(k \cdot \log |V|)$
and index size $O(|doc(V)| \cdot \log |V|)$ for all keywords, which is
independent of k. We then show our algorithm for approximate
k-NK on a graph by merging results on a bounded number of trees.
We propose a global storage technique to further reduce the index
size and the query time on a graph. Finally we show how to extend
our method to handle a query with multiple keywords.
Figure 5: Shortest Path Trees $T_1$ and $T_2$
Algorithm 1: tree-boundk(Q, T)
Input: A k-NK query Q = (q, λ, k), and a tree T.
Output: Answer for Q on T.
1  $R \leftarrow \emptyset$;
2  $(u, u') \leftarrow$ the entry edge of q on CT(λ);
3  $R \leftarrow R \otimes_k (cand_\lambda(u) \oplus dist(q, u))$;
4  $R \leftarrow R \otimes_k (cand_\lambda(u') \oplus dist(q, u'))$;
5  return R;
5. K-NK ON A TREE FOR A SMALL K
In this section, we study how to answer a k-NK query Q =
(q, λ, k) on a tree T(V, E). We first consider a common scenario
when users are interested in a small number of answer nodes
bounded by a small constant $\bar{k}$, i.e., $k \le \bar{k}$. Recall that for a
keyword λ, its compact tree CT(λ) keeps all the structural information
of λ on the tree T. Our idea is to precompute the top-k results for
every keyword λ and every node on CT(λ). Since the total size
of all compact trees is bounded by $O(|doc(V)|)$, the total space to
store the top-$\bar{k}$ results of nodes on all compact trees is bounded by
$O(\bar{k} \cdot |doc(V)|)$. Given a query Q = (q, λ, k), if q is on CT(λ),
we can simply report the precomputed answer on CT(λ). If q is
not on CT(λ), we need to find a way to construct the answer using
the precomputed results as well as the structure of CT(λ) and T. In
the following, we first introduce how to answer a k-NK query using
CT(λ), followed by discussions on the construction of the index.
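The space bound can be motivated by a short counting argument. This is a sketch, not the proof from [22], and it assumes that each doc(v) is a set, so that $\sum_\lambda |V_\lambda| = |doc(V)|$. Every leaf of CT(λ) is a keyword node and every non-keyword node of CT(λ) has at least two children in CT(λ), so there are fewer non-keyword nodes than keyword nodes and $|CT(\lambda)| \le 2|V_\lambda| - 1$. Storing at most $\bar{k}$ candidates per compact-tree node then gives:

```latex
\sum_{\lambda} \bar{k} \cdot |CT(\lambda)|
  \;\le\; \bar{k} \sum_{\lambda} \bigl( 2|V_\lambda| - 1 \bigr)
  \;\le\; 2\bar{k} \cdot |doc(V)|
  \;=\; O\bigl(\bar{k} \cdot |doc(V)|\bigr)
```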
5.1 Query Processing
For a keyword λ, and each node v in the compact tree CT(λ),
we use a candidate list $cand_\lambda(v)$ to denote the precomputed k-NK
results for $k = \bar{k}$ on node v ranked in nondecreasing order of their
distances to v, in the form of
$cand_\lambda(v) = \{v_1 : dist(v, v_1), v_2 : dist(v, v_2), \cdots, v_{\bar{k}} : dist(v, v_{\bar{k}})\}$
where $dist(v, v_1) \le dist(v, v_2) \le \cdots \le dist(v, v_{\bar{k}})$.
Given a query Q = (q, λ, k) on a tree T(V, E)
where $k \le \bar{k}$, if q is in CT(λ), we can simply report the first k
elements in $cand_\lambda(q)$ as the answer. The difficult case is when q is not
in CT(λ). In order to answer such a query, we define an entry edge
to be the edge in CT(λ) that is nearest to q. Intuitively, the entry
edge plays a role of connecting the query node q to the compact
tree CT(λ). The formal definition of entry edge is as follows.
DEFINITION 3. (Entry Node and Entry Edge) Given a compact
tree CT(λ), for each edge $(u, u')$ on CT(λ) with $u'$ being a
child node of u, $(u, u')$ represents a unique path from u to $u'$ on
the original tree T. For any node v on T, we say v sticks to CT(λ),
denoted as $v \in_s CT(\lambda)$, if and only if there exists an edge $(u, u')$
on CT(λ) such that v is on the path from u to $u'$ on T; otherwise
v does not stick to CT(λ), denoted as $v \notin_s CT(\lambda)$. For a node q
on T, let v be the first node on the path from q to the root node of
T such that $v \in_s CT(\lambda)$. v is called the Entry Node of q w.r.t. λ,
denoted as $EN_\lambda(q)$. The corresponding edge $(u, u')$ on CT(λ) is
called the Entry Edge of q w.r.t. λ, denoted as $EE_\lambda(q)$.
Algorithm 2: operator $R \oplus \delta$
Input: Candidate list $R = \{u_1 : d_{u_1}, u_2 : d_{u_2}, \cdots\}$, distance δ.
Output: A candidate list by adding δ to all distances in R.
1  $R' \leftarrow \emptyset$;
2  for i = 1 to |R| do
3      $R' \leftarrow R' \cup \{u_i : d_{u_i} + \delta\}$;
4  return $R'$;
Note that for a node q and a keyword λ, $EE_\lambda(q)$ is an edge on
the compact tree CT(λ), and $EN_\lambda(q)$ is a node on the original tree
T. We use an example to illustrate the entry node and entry edge.
Example 6: For the tree T shown in Fig. 3 and keyword λ, the
compact tree CT(λ) is shown on the left part of Fig. 4. For ease of
illustration, we also mark the nodes in CT(λ) dark on the tree T in
Fig. 3. For edge (r, c) in CT(λ), $h \in_s CT(\lambda)$ because h is on the
path from r to c in T. $p \notin_s CT(\lambda)$ since p is not on the tree path
of any CT(λ) edge. For node v, its entry node is $EN_\lambda(v) = e$, as e
is the first node on the path (v, p, e, h, d, r) such that $e \in_s CT(\lambda)$.
The entry edge for v is $EE_\lambda(v) = (r, c)$ since the entry node e for
v is on the path from r to c in T. The entry nodes and entry edges
for some other nodes in T are listed in the following table. □
Node          g       j       d       e       p       u
$EN_\lambda$  g       j       d       e       e       b
$EE_\lambda$  (r, a)  (a, k)  (r, c)  (r, c)  (r, c)  (r, b)
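Definition 3 can be made concrete with a naive lookup that walks up the tree. This is only a sketch of the definition, not the interval-based index developed later in the paper; the parent map is reconstructed from Fig. 3, and `build_sticks` and `entry` are hypothetical names. It also assumes the root of T belongs to CT(λ), as it does here.

```python
# Parent pointers of the tree in Fig. 3 (reconstructed from preorder intervals).
PARENT = {
    "g": "r", "f": "r", "d": "r", "b": "r", "a": "g", "j": "a", "n": "a",
    "k": "j", "i": "f", "h": "d", "e": "h", "p": "e", "m": "e", "v": "p",
    "s": "p", "c": "m", "o": "m", "t": "c", "u": "b",
}
CT_NODES = {"r", "a", "k", "n", "c", "t", "b"}   # CT of the keyword (Fig. 4)

def build_sticks(parent, ct_nodes):
    """Map every node that sticks to CT to one CT edge (u, u') whose path holds it."""
    sticks = {}
    for child in sorted(ct_nodes):
        path, v = [child], parent.get(child)
        while v is not None and v not in ct_nodes:
            path.append(v)
            v = parent.get(v)
        if v is None:
            continue                       # child is the root of CT: no incoming edge
        for x in path:                     # nodes below u, down to u'
            sticks[x] = (v, child)
        sticks.setdefault(v, (v, child))   # u itself lies on the path as well
    return sticks

def entry(parent, sticks, q):
    """Entry node and entry edge of q: first ancestor-or-self that sticks to CT."""
    v = q
    while v not in sticks:
        v = parent[v]
    return v, sticks[v]

STICKS = build_sticks(PARENT, CT_NODES)
for q in "gjdepu":
    print(q, entry(PARENT, STICKS, q))
```

Under these assumptions the printed pairs agree with the table of Example 6.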
The Algorithm: Given a tree T(V, E), for keyword λ, all keyword
nodes are contained in CT(λ). For any node $q \in V$, the path from
q to any keyword node will go through the entry node $EN_\lambda(q)$.
Based on this property, the result of a query Q = (q, λ, k) is
identical to the result of the query $Q' = (EN_\lambda(q), \lambda, k)$. However,
$EN_\lambda(q)$ may not be on CT(λ), thus the result of $Q'$ is not
necessarily precomputed. Let $(u, u') = EE_\lambda(q)$. Since $EN_\lambda(q)$ is on the
path from u to $u'$ on the tree T, the path from $EN_\lambda(q)$ to any
keyword node in T will go through either u or $u'$. Thus, the answer for
$Q'$ can be constructed by merging the precomputed candidate lists
$cand_\lambda(u)$ and $cand_\lambda(u')$ on CT(λ).
Our algorithm for processing a query Q = (q, λ, k) on a tree T is
shown in Algorithm 1. We assume that the compact tree CT(λ) for
each keyword λ and the list $cand_\lambda(u)$ for every node u on CT(λ)
have been computed. After initializing the result R in line 1, we
find the entry edge $(u, u')$ for q on CT(λ) (line 2). We add a
distance dist(q, u) to every node in $cand_\lambda(u)$ using the $\oplus$ operator, to
reflect the distance from q to a keyword node via u. We then merge
the new result into R using the $\otimes_k$ operator (line 3). Similarly we
apply the two operators to $cand_\lambda(u')$ with the distance $dist(q, u')$
(line 4). We will describe the operators $\oplus$ and $\otimes_k$ later. We use the
following example to illustrate the algorithm.
Example 7: Given the tree $T$ shown in Fig. 3 and $CT(\lambda)$ on the left part of Fig. 4, for a query $Q = (o, \lambda, 2)$, the entry edge is $EE_\lambda(o) = (r, c)$. Suppose the lists $cand_\lambda(r) = \{b : 1, n : 3\}$ and $cand_\lambda(c) = \{c : 0, t : 1\}$ are precomputed. By adding $dist(o, r) = 5$ to $cand_\lambda(r)$, and adding $dist(o, c) = 2$ to $cand_\lambda(c)$, we get the new lists $\{b : 6, n : 8\}$ for $r$ and $\{c : 2, t : 3\}$ for $c$. We merge the two lists and get the final result $R = \{c : 2, t : 3\}$. □
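As an illustration only, the query step of Example 7 can be sketched in Python. The function names `shift`, `merge_top_k` and `knk_on_tree` are hypothetical and not taken from the paper; the sketch assumes that the entry edge, the two distances and the two candidate lists are already available, and it uses the standard-library `heapq.merge` as a shortcut for merging two sorted lists.

```python
from heapq import merge


def shift(cand, delta):
    """Add delta to every distance of a candidate list of (node, distance) pairs."""
    return [(node, d + delta) for node, d in cand]


def merge_top_k(r1, r2, k):
    """Merge two lists sorted by distance and keep the k nearest distinct nodes."""
    if k <= 0:
        return []
    out, seen = [], set()
    for node, d in merge(r1, r2, key=lambda entry: entry[1]):
        if node in seen:
            continue  # a node may be reachable via both ends of the entry edge
        seen.add(node)
        out.append((node, d))
        if len(out) == k:
            break
    return out


def knk_on_tree(cand_u, dist_q_u, cand_u2, dist_q_u2, k):
    """Combine the candidate lists of the two end nodes of the entry edge."""
    result = []
    result = merge_top_k(result, shift(cand_u, dist_q_u), k)
    result = merge_top_k(result, shift(cand_u2, dist_q_u2), k)
    return result


# Data of Example 7: entry edge (r, c), dist(o, r) = 5, dist(o, c) = 2.
cand_r = [("b", 1), ("n", 3)]
cand_c = [("c", 0), ("t", 1)]
answer = knk_on_tree(cand_r, 5, cand_c, 2, k=2)
print(answer)  # [('c', 2), ('t', 3)]
```

With k = 3 the same call would additionally report node b at distance 6, since it is the next nearest candidate after c and t.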
The efficiency of Algorithm 1 depends on three operations. The first operation is to find the entry edge for any node on $T$ (line 2). The second operation is to calculate the distance between any two nodes on $T$, e.g., $dist(q, u)$ and $dist(q, u')$ (lines 3-4). The third operation is to merge two sorted lists into a new one using the operators $\oplus$ and $\otimes_k$ (lines 3-4). Next, we discuss the three operations separately.
Algorithm 3: operator $R_1 \otimes_k R_2$
Input: Two sorted candidate lists $R_1 = \{u_1 : d_{u_1}, u_2 : d_{u_2}, \cdots\}$ and $R_2 = \{v_1 : d_{v_1}, v_2 : d_{v_2}, \cdots\}$, and result size $k$.
Output: The merged candidate list.
    $R \leftarrow \emptyset$; $i \leftarrow 1$; $j \leftarrow 1$;
    while ($i \le |R_1|$ or $j \le |R_2|$) and $|R| < k$ do
        if $i \le |R_1|$ and ($j > |R_2|$ or $d_{u_i} \le d_{v_j}$) then
            if $u_i \notin R$ then $R \leftarrow R \cup \{u_i : d_{u_i}\}$;
            $i \leftarrow i + 1$;
        else if $j \le |R_2|$ and ($i > |R_1|$ or $d_{v_j} \le d_{u_i}$) then
            if $v_j \notin R$ then $R \leftarrow R \cup \{v_j : d_{v_j}\}$;
            $j \leftarrow j + 1$;
    return $R$;
Finding the Entry Edge: Given a keyword $\lambda$, for any node $v$ on a tree $T(V, E)$, our idea for finding the entry edge $EE_\lambda(v)$ of $v$ is similar to the idea of finding the 1-NK answer using the tree Voronoi partition $TVP(\lambda)$ in [22]. We partition the range $[1, |V|]$ into several disjoint intervals, such that nodes whose preorder labels fall in the same interval share an identical entry edge. We call such a partition an entry edge partition for $\lambda$, denoted as $EEP(\lambda)$. Given $EEP(\lambda)$, $EE_\lambda(v)$ can be computed using a binary search in $EEP(\lambda)$ in $O(\log |V_\lambda|)$ time. In the next subsection, we show how to build $EEP(\lambda)$ for all keywords efficiently and prove that the total size of $EEP(\lambda)$ for all keywords in $T$ is bounded by $O(doc \cdot |V|)$.
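The lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `EntryEdgePartition`, the representation of each interval by its first preorder label, and the sample intervals are all hypothetical and are not the partition of the tree in Fig. 3.

```python
from bisect import bisect_right


class EntryEdgePartition:
    """Disjoint preorder-label intervals, each mapped to one entry edge.

    Assumption: every interval is given by its first label and extends up to
    the label just before the start of the next interval.
    """

    def __init__(self, intervals):
        # intervals: iterable of (start_label, entry_edge) pairs
        ordered = sorted(intervals, key=lambda item: item[0])
        self.starts = [start for start, _ in ordered]
        self.edges = [edge for _, edge in ordered]

    def entry_edge(self, preorder_label):
        """Binary search for the interval that contains preorder_label."""
        idx = bisect_right(self.starts, preorder_label) - 1
        if idx < 0:
            raise KeyError(f"label {preorder_label} precedes the first interval")
        return self.edges[idx]


# Hypothetical partition of the label range [1, 15] into five intervals.
eep = EntryEdgePartition([
    (1, ("r", "a")),
    (4, ("a", "k")),
    (6, ("r", "a")),
    (7, ("r", "c")),
    (12, ("r", "b")),
])
print(eep.entry_edge(5))   # ('a', 'k')
print(eep.entry_edge(15))  # ('r', 'b')
```

The binary search takes time logarithmic in the number of stored intervals, which is the quantity the bound above refers to.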
Computing Tree Distance: Given a tree $T(V, E)$ with root $r$, suppose the distance from $r$ to every node in $T$ has been precomputed. For any two nodes $u$ and $v$ on $T$, we denote $LCA(u, v)$ as their lowest common ancestor. The distance of $u$ and $v$ can be computed as $dist(u, v) = dist(r, u) + dist(r, v) - 2 \cdot dist(r, LCA(u, v))$. Using the techniques in [2], $LCA(u, v)$ can be found in $O(1)$ time using $O(|V|)$ index space. Thus $dist(u, v)$ for any two nodes $u$ and $v$ on $T$ can be computed in $O(1)$ time using $O(|V|)$ index space.
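The distance formula can be checked on a small example. The sketch below is an illustration only: it finds the lowest common ancestor by walking parent pointers, which takes time proportional to the tree depth, and does not implement the constant-time technique of [2]. The toy tree and all function names are hypothetical.

```python
def build_index(parent, weight, root):
    """Precompute the depth and the distance from the root for every node.

    parent: maps each non-root node to its parent.
    weight: maps each non-root node to the weight of the edge to its parent.
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    depth, root_dist = {root: 0}, {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            depth[child] = depth[node] + 1
            root_dist[child] = root_dist[node] + weight[child]
            stack.append(child)
    return depth, root_dist


def lca(u, v, parent, depth):
    """Lowest common ancestor by climbing parent pointers (not O(1))."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u


def tree_dist(u, v, parent, depth, root_dist):
    """dist(u, v) = dist(r, u) + dist(r, v) - 2 * dist(r, LCA(u, v))."""
    ancestor = lca(u, v, parent, depth)
    return root_dist[u] + root_dist[v] - 2 * root_dist[ancestor]


# Hypothetical toy tree: r-a (2), r-b (1), a-x (3), a-y (4).
parent = {"a": "r", "b": "r", "x": "a", "y": "a"}
weight = {"a": 2, "b": 1, "x": 3, "y": 4}
depth, root_dist = build_index(parent, weight, "r")
print(tree_dist("x", "y", parent, depth, root_dist))  # 7
print(tree_dist("x", "b", parent, depth, root_dist))  # 6
```

Here the path from x to y passes through a, so the distance is 3 + 4 = 7, which matches 5 + 6 - 2 · 2 from the formula.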
Merging Results: The results are merged using two operators, $\oplus$ and $\otimes_k$. Algorithm 2 shows the operator $\oplus$, which takes a candidate list $R$ and a distance $\delta$ as input, and outputs a candidate list by adding $\delta$ to all distances in $R$. The time complexity of the $\oplus$ operator is $O(|R|)$. Algorithm 3 shows the operator $\otimes_k$, which takes two candidate lists $R_1$ and $R_2$ sorted in nondecreasing order of the distances, and a value $k$ as input, and outputs the merged candidate list $R$. $R$ contains at most $k$ elements sorted in nondecreasing order of the distances. $R$ can be constructed by visiting each element in $R_1$ and $R_2$ at most once. The time complexity of the $\otimes_k$ operator is $O(\min\{|R_1| + |R_2|, k\})$. The $\otimes_k$ and $\oplus$ operators satisfy the commutative, associative and distributive laws as follows.
(Commutative Law) $R_1 \otimes_k R_2 = R_2 \otimes_k R_1$.
(Associative Law) $(R_1 \otimes_k R_2) \otimes_k R_3 = R_1 \otimes_k (R_2 \otimes_k R_3)$.
(Distributive Law) $(R_1 \otimes_k R_2) \oplus d = (R_1 \oplus d) \otimes_k (R_2 \oplus d)$.
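The three laws can be checked on small inputs with a two-pointer merge in the spirit of Algorithm 3. The sketch below is an illustration under assumptions: the function names `add_dist` and `merge_k` and the sample lists are hypothetical, candidate lists are modeled as Python lists of (node, distance) pairs, and the sample lists contain no two distinct nodes at the same distance. When such ties exist, this implementation satisfies the laws only up to the order in which tied nodes are reported.

```python
def add_dist(r, delta):
    """Distance-adding operator: add delta to every distance in r."""
    return [(node, d + delta) for node, d in r]


def merge_k(r1, r2, k):
    """Top-k merge of two lists sorted by nondecreasing distance."""
    out, seen = [], set()
    i = j = 0
    while (i < len(r1) or j < len(r2)) and len(out) < k:
        # Take from r1 if r2 is exhausted or r1's next distance is not larger.
        take_left = j >= len(r2) or (i < len(r1) and r1[i][1] <= r2[j][1])
        if take_left:
            node, d = r1[i]
            i += 1
        else:
            node, d = r2[j]
            j += 1
        if node not in seen:  # skip a node that was already reported
            seen.add(node)
            out.append((node, d))
    return out


r1 = [("b", 1), ("n", 3)]
r2 = [("c", 0), ("t", 2)]
r3 = [("n", 3), ("x", 5)]
k, d = 3, 4

commutative = merge_k(r1, r2, k) == merge_k(r2, r1, k)
associative = (merge_k(merge_k(r1, r2, k), r3, k)
               == merge_k(r1, merge_k(r2, r3, k), k))
distributive = (add_dist(merge_k(r1, r2, k), d)
                == merge_k(add_dist(r1, d), add_dist(r2, d), k))
print(commutative, associative, distributive)  # True True True
```

These laws are what allow the distance shift and the merge to be applied in either order when candidate lists are combined.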
THEOREM 1. Algorithm 1 computes the exact k-NK answer for a query $Q = (q, \lambda, k)$ on a tree $T(V, E)$ in $O(k + \log |V_\lambda|)$ time.
Algorithm 1 uses the novel idea of entry edge, and elegantly extends the 1-NK method [22] to handle k-NK (k > 1) with the same query time complexity, except for an extra linear cost $O(k)$ indispensable for reporting the results.
Given the tree $T$, for every keyword $\lambda$, besides the compact tree $CT(\lambda)$, two more indexes are needed. The first index, the entry edge partition $EEP(\lambda)$, is used to find the entry edge for any node on $T$. The second index is the candidate list $cand_\lambda(v)$ for every node $v$ on $CT(\lambda)$. Below we show how to construct the two indexes.
5.2 Construction of Entry Edge Partition
Given a tree $T(V, E)$, for each keyword $\lambda$, following an idea similar to the tree Voronoi partition $TVP(\lambda)$, we construct an entry