Top-K nearest keyword search on large graphs

01 Aug 2013-Vol. 6, Iss: 10, pp 901-912
TL;DR: A shortest path tree for a distance oracle technique is built and a global storage technique is proposed to further reduce the index size and the query time in obtaining a k-NK result on a graph from that on trees.
Abstract: It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned with a weight measuring its length. Given a query node q in G and a keyword λ, a k-NK query seeks k nodes which contain λ and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is query time optimized for a scenario when a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.

Summary (4 min read)

1. INTRODUCTION

  • Many real-world networks emerging nowadays have labels or textual contents on the nodes.
  • K-NK is an important and useful query in graph search.
  • Intuitively, if two persons share some common friends, i.e., they are two hops away, they are more likely to become friends.
  • Instead the authors use distance oracle [20] as the fundamental distance estimation framework.
  • The rest of the paper is organized as follows.

2. PROBLEM DEFINITION

  • The authors model a weighted undirected graph as G(V, E), where V (G) represents the set of nodes and E(G) represents the set of edges in G.
  • The weight of a path is the total weight of all edges on the path.
  • For simplicity, the authors assume that there is only one keyword λ in the query.
  • The authors will discuss how to answer a query containing multiple keywords with AND and OR semantics.

3. EXISTING SOLUTIONS

  • Obviously, Dijkstra’s algorithm is inefficient when the size of the graph is large or the keyword nodes are far away from q.
  • In the literature, [1] and [22] design different indexing schemes to process (top-k) nearest keyword queries on a graph or a tree.
  • The authors introduce the two methods in the following two subsections.

3.1 Approximate k-NK on a Graph

  • Bahmani and Goel [1] find an approximate answer to a k-NK query in a graph based on a distance oracle [20].
  • One distance oracle is usually not enough for distance estimation in a graph G. Even for two nodes in the same partition, the estimation may have a large error.
  • The authors illustrate the algorithm using the following example.

3.2 Exact 1-NK on a Tree

  • For a certain keyword λ, all nodes with the preorder label in the interval [1, |V |] can be partitioned into several disjointed intervals, such that any node v in the same interval shares an identical NN(v, λ).
  • The partition is called tree Voronoi partition of λ, denoted as TVP(λ).
  • An extended compact tree ECT(λ) is a tree constructed by adding all change nodes into the compact tree CT(λ).
  • Using ECT(λ), TVP(λ) can be constructed easily.
  • The time to compute all compact trees and all extended compact trees for all keywords in the tree T (V, E) is bounded by O(|doc(V )| · log |V |).

4. SOLUTION OVERVIEW

  • Answering k-NK on a graph using tree distance: to address the drawback of witness distance, the authors propose to use tree distance in processing a k-NK query.
  • For the distance oracles O1 and O2 shown in Fig. 2, the corresponding shortest path trees T1 and T2 are shown in Fig. 5 (Example 5).
  • Recall that in [22], for a certain keyword λ, the range [1, |V |] is partitioned into several disjoint intervals, and nodes with the preorder label in an identical interval share the same 1-NK result.
  • The authors' first algorithm, tree-boundk, handles only bounded k values, with query processing time O(k + log |Vλ|) and index size O(k̄ · |doc(V )|) over all keywords, where k̄ is an upper bound on k.

5. K-NK ON A TREE FOR A SMALL K

  • Recall that for a keyword λ, its compact tree CT(λ) keeps all the structural information of λ on the tree T .
  • The authors' idea is to precompute the top-k results for every keyword λ and every node on CT(λ).
  • Since the total size of all compact trees is bounded by O(|doc(V )|), the total space to store the top-k results of nodes on all compact trees is bounded by O(k · |doc(V )|).
  • In the following, the authors first introduce how to answer a k-NK query using CT(λ), followed by discussions on the construction of the index.

5.1 Query Processing

  • Intuitively, the entry edge plays a role of connecting the query node q to the compact tree CT(λ).
  • The authors assume that the compact tree CT(λ) for each keyword λ and the list candλ(u) for every node u on CT(λ) have been computed.
  • The efficiency of Algorithm 1 depends on three operations.
  • The results are merged using two operators ⊕ and ⊗k.
  • The first index, the entry edge partition EEP(λ), is to find the entry edge for any node on T .

5.2 Construction of Entry Edge Partition

  • Given a tree T (V, E), for each keyword λ, following an idea similar to the tree Voronoi partition TVP(λ), the authors construct an entry edge partition EEP(λ).
  • Based on such an observation, by excluding the intervals of all edges under the subtree rooted at u′ in CT(λ) from the interval of (u, u′), nodes with preorder in the remaining intervals will use (u, u′) as the entry edge.
  • Algorithm 4 shows the construction of the entry edge partition EEP(λ) on CT(λ) for a keyword λ.
  • After initializing EEP(λ) (line 2), the main operation is a recursive procedure partition (line 3), to partition the interval [1, |V |] to several disjoint intervals.

5.3 Construction of Candidate List

  • Given a compact tree CT(λ) for a tree T and a keyword λ, the authors need to compute the candidate list candλ(v) for every node v on CT(λ).
  • Since CT(λ) keeps the structural information of all keyword nodes in T , it is sufficient to search only on CT(λ) to calculate candλ(v).
  • Based on this observation, the authors can follow the path to propagate the candidate list on u to v.
  • Using this idea, the authors just need to traverse the tree CT(λ) twice to build the candidate lists for all nodes on CT(λ).
  • The second traversal on CT(λ) is a top-down one, such that the candidate list on each node is further propagated to all its descendants.

6.1 A Basic Pivot Approach

  • The authors' basic idea is to compute the first segment online and precompute the results regarding the second segment offline.
  • In the query processing phase, the authors do not search the whole tree to get the answer for a query, but instead, they just need to merge the precomputed candidates along the path from the query node to the root node of the tree T .
  • The authors use the following example to illustrate the pivot based approach.
  • For every node v, the authors create a candidate list candλ(v) that contains all keyword nodes in its subtree, sorted in nondecreasing distances to v.

6.2 Pivot Approach with Tree Balancing

  • The problem is not perfectly solved using the basic pivot approach above.
  • Thus the key to optimizing both index space and query time is to reduce the average depth of nodes on the tree.
  • Furthermore, the authors need to traverse n nodes to answer a query when the query node q is at one end of the chain, leading to O(n) query time.
  • Generally speaking, DT(T ) preserves all distance information for any node pair on T and the height of DT(T ) is at most log2 |V |.
  • The authors will also describe how to construct DT(T ) for a tree T and how to compute all candidate lists candλ(v) for all keywords λ and all nodes v on the tree DT(T ).

6.3 Index Construction

  • The first index is the distance preserving balanced tree DT(T ) for T and the second index is the candidate list candλ(v) for each keyword λ and each node v on DT(T ).
  • Such a property also holds for any subtree of T ′ because it is processed using steps (1) and (2) recursively.
  • The following lemma shows that the median node always exists on any tree T , and also gives a method to find the median node of T .
  • All other nodes in DT(T ) are constructed similarly.
  • After all candidate lists are created, the authors sort the elements in every candidate list in nondecreasing order of the distances.

7. APPROXIMATE K-NK ON A GRAPH

  • The authors introduce two algorithms graph-boundk and graph-pivot for a bounded k and an arbitrary k respectively.
  • This expression can be generalized to the case of merging the candidate lists of node v on more than two trees.
  • In the following, the authors will show that the global candidate list can be used to answer k-NK queries without sacrificing the result quality.
  • Therefore, the authors show that using global storage will not sacrifice the result quality.
  • When k is small, the index time and index space for boundk are smaller than pivot on both trees and graphs.

8. EXPERIMENTS

  • The authors report the performance of their methods boundk, pivot, and their global storage implementations boundk-gs and pivot-gs, with two baseline solutions BFS and PMI.
  • The authors obtained the keywords of nodes from the OpenStreetMap project with a bounding box.
  • Global storage helps reduce the query time of boundk by 20% and that of pivot by 15%.
  • The query time shows a sharper increasing trend on DBLP than FLARN, as the frequency difference between DBLP keywords is larger.
  • The index size of pivot is 2.5 times that of PMI on DBLP and 7.9 times on FLARN, due to the larger diameter of FLARN.

10. CONCLUSIONS

  • The authors study top-k nearest keyword (k-NK) search on large graphs.
  • The authors propose two exact k-NK algorithms on trees to handle a bounded k and an arbitrary k respectively.
  • The authors extend tree based algorithms to graphs and propose a global storage technique to further reduce the index size and query time.
  • The authors conducted extensive performance studies on real large graphs to demonstrate the effectiveness and efficiency of their algorithms.


Top-K Nearest Keyword Search on Large Graphs
Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, Wentao Tian
The Chinese University of Hong Kong, Hong Kong, China
{mqiao,lqin,hcheng,yu,wttian}@se.cuhk.edu.hk
ABSTRACT
It is quite common for networks emerging nowadays to have labels
or textual contents on the nodes. On such networks, we study the
problem of top-k nearest keyword (k-NK) search. In a network G
modeled as an undirected graph, each node is attached with zero or
more keywords, and each edge is assigned with a weight measuring
its length. Given a query node q in G and a keyword λ, a k-NK
query seeks k nodes which contain λ and are nearest to q. k-NK is
not only useful as a stand-alone query but also as a building block
for tackling complex graph pattern matching problems.
The key to an accurate k-NK result is a precise shortest distance
estimation in a graph. Based on the latest distance oracle technique,
we build a shortest path tree for a distance oracle and use the tree
distance as a more accurate estimation. With such representation,
the original k-NK query on a graph can be reduced to answering
the query on a set of trees and then assembling the results obtained
from the trees. We propose two efficient algorithms to report the
exact k-NK result on a tree. One is query time optimized for a
scenario when a small number of result nodes are of interest to
users. The other handles k-NK queries for an arbitrarily large k
efficiently. In obtaining a k-NK result on a graph from that on trees,
a global storage technique is proposed to further reduce the index
size and the query time. Extensive experimental results conform
with our theoretical findings, and demonstrate the effectiveness and
efficiency of our k-NK algorithms on large real graphs.
1. INTRODUCTION
Many real-world networks emerging nowadays have labels or
textual contents on the nodes. For example, in a road network, a
location may have labels such as “McDonald’s”, “hospital”, and
“kindergarten”. In a social network, a person may have information
including name, interests and skills, etc. In a bibliographic
network, a paper may have keywords and abstract, and an author
may have name, affiliation and email address. In this study, we
consider the problem of top-k nearest keyword (k-NK) search on
large networks. In a network G modeled as an undirected graph,
each node is attached with zero or more keywords, and each edge
is assigned with a weight measuring its length. Given a query node
q in G and a keyword λ, a k-NK query in the form of Q = (q, λ, k)
looks for k nodes which contain λ and are nearest to q. Different
from a large body of research on k-nearest neighbor (k-NN) search
on spatial networks [15, 5, 6, 18, 19, 7], we define G as a general
graph without coordinates. Thus our solution can apply to a wide
range of networks.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 10
Copyright 2013 VLDB Endowment 2150-8097/13/10... $10.00.
Motivation. k-NK is an important and useful query in graph search.
As a stand-alone query, it has a wide range of applications.
Furthermore, it can serve as a building block for tackling complex
graph pattern matching problems which impose both structural and
textual constraints. Here we list a few applications of k-NK queries.
Consider the social network Facebook as an example, in which
personalized search based on graph structure and textual contents
has become increasingly popular¹. A person looks for 20 friends or
potential friends who like hiking to participate in a hiking activity.
Intuitively, if two persons share some common friends, i.e., they are
two hops away, they are more likely to become friends. In contrast,
if they are far away from each other in the network, they are less
likely to establish a link. Thus the problem is to find 20 persons
who like hiking and are nearest to the person who serves as the
organizer. It can be answered by a k-NK query. More generally,
we also consider a query containing multiple keywords connected
by AND or OR operators to express more complex semantics, e.g.,
a person looks for k friends or potential friends who like hiking
AND (OR) photography and are nearest to him.
Take a road network with locations associated with keywords as
another example. For parents looking for k kindergartens nearest
to their home for their children, their requirements can be expressed
by a k-NK query where the query node is the home location, and
the keyword is “kindergarten”.
In the third example, we show how k-NK queries serve as a
building block for solving the graph pattern matching problem.
Consider a couple who wants to buy a house. They have some
constraints like having a kindergarten and a hospital within 3 km, and
a supermarket within 1 km of their home. These constraints can be
expressed as a star pattern, and the pattern matching problem can
be decomposed into three k-NK queries with keywords
“kindergarten”, “hospital” and “supermarket” respectively and k = 1 for
each potential house location to be considered.
Recently, Bahmani and Goel [1] have designed a Partitioned
Multi-Indexing (PMI) scheme to answer k-NK queries approximately.
PMI is an inverted index built based on distance oracle
[20] which is a distance estimation technique. Given a k-NK query
Q = (q, λ, k), it returns k nodes containing keyword λ in ascending
order of their approximate distance from the query node q. PMI
inherits the $2\log_2 |V| - 1$ approximation factor for distance
estimation from distance oracle [20], where V is the set of nodes in the
graph. The major drawback of PMI is that its distance estimation
error could be quite large in practice. This can greatly distort the
ranking of the candidate nodes carrying the query keywords, and
thus lead to a low result quality.
¹ https://www.facebook.com/about/graphsearch
In this work, we study how to answer k-NK queries accurately
and efficiently using a compact index. The key to an accurate k-NK
result is a precise shortest distance estimation in a graph. As we
use a general graph model, existing k-NN solutions on spatial
networks [15, 5, 6, 18, 19, 7] cannot be applied, as they usually rely
on specialized structures that leverage properties of spatial data to
optimize their solutions. Instead we use distance oracle [20] as the
fundamental distance estimation framework. For each component
of a distance oracle, we will build a shortest path tree, based on
which we can estimate the shortest distance between two nodes by
their tree distance. The tree distance is more accurate than the
distance estimated by distance oracle, which we call witness distance
to distinguish the two. As we transform a distance oracle on a graph
into a set of shortest path trees, the original k-NK query on the graph
can be reduced to answering the k-NK query on a set of trees. Thus we
first focus on processing k-NK queries to find exact top-k answers
on a tree. Then we study how to assemble the results obtained from
the trees to form the approximate top-k answers on the graph.
Contributions. Our main contributions in this work are summarized
as follows.
(1) Given a tree, we first consider a common scenario when users
are interested in a small number of answer nodes bounded by a
small constant $\bar{k}$, i.e., $k \le \bar{k}$. We propose the first
algorithm tree-boundk with query time $O(k + \log |V_\lambda|)$, where
$|V_\lambda|$ is the number of nodes carrying the query keyword λ, and
index size $O(\bar{k} \cdot |doc(V)|)$, where |doc(V)| is the total
number of keywords on all the nodes in the graph.
(2) Next we remove the $\bar{k}$ restriction and handle k-NK queries
for an arbitrary k on a tree. We propose the second algorithm
tree-pivot with query time $O(k \cdot \log |V|)$ and index size
$O(|doc(V)| \cdot \log |V|)$ which is independent of k, thus is more
scalable.
(3) Based on our proposed tree algorithms, we present our algorithm
for approximate k-NK query on a graph. We propose a global
storage technique to further reduce the index size and the query
time. We also show how to extend our methods to handle a query
with multiple keywords.
(4) Our experimental evaluation demonstrates the effectiveness and
efficiency of our k-NK algorithms on large real-world networks.
We show the superiority of our methods in ranking top-k answer
nodes accurately, when compared with the state-of-the-art top-k
keyword search method PMI [1].
Roadmap. The rest of the paper is organized as follows.
Section 2 formally defines the problem. Section 3 discusses two
existing related studies and their drawbacks. Section 4 presents our
framework. Sections 5 and 6 introduce two proposed algorithms to
answer k-NK queries on a tree for a small k and an arbitrary k
respectively using compact index structures. Section 7 elaborates on
the way to answer k-NK queries on a graph by approximating the
graph with a bounded number of trees. Section 8 presents extensive
experimental evaluation. Section 9 reviews the previous works
related to ours. Finally, Section 10 concludes the paper.
2. PROBLEM DEFINITION
We model a weighted undirected graph as G(V, E), where V (G)
represents the set of nodes and E(G) represents the set of edges in
G. We use V and E to denote V (G) and E(G) if the context is
obvious. Each edge $(u, v) \in E$ has a positive weight, denoted
as weight(u, v). A path $p = (v_1, v_2, \cdots, v_l)$ is a sequence of l
nodes in V such that for each $v_i$ ($1 \le i < l$), $(v_i, v_{i+1}) \in E$.
The weight of a path is the total weight of all edges on the path.
For any two nodes $u \in V$ and $v \in V$, the distance of u and v
on G, dist(u, v), is the minimum weight of all paths from u to v
in G. Each node $v \in V$ contains a set of zero or more keywords
which is denoted as doc(v). The union of keywords for all nodes
in G is denoted as doc(V ). Note that doc(V ) is a multiset and
$|doc(V)| = \sum_{v \in V} |doc(v)|$. We use $V_\lambda \subseteq V$ to
denote the set of nodes carrying keyword λ in V .
Figure 1: A Graph G with Keywords
DEFINITION 1. Given a graph G(V, E), a top-k nearest keyword
(k-NK) query is a triple Q = (q, λ, k), where $q \in V$ is a
query node in G, λ is a keyword, and k is a positive integer. Given
a query Q, a node $v \in V$ is a keyword node w.r.t. Q if v contains
keyword λ, i.e., $v \in V_\lambda$. The result is a set of k keyword nodes,
denoted as $R = \{v_1, v_2, \cdots, v_k\} \subseteq V_\lambda$, and there
does not exist a node $u \in V_\lambda \setminus R$ such that
$dist(q, u) < \max_{v \in R} dist(q, v)$. To further report the distance
in the top-k result, we can use the form
$R = \{v_1 : dist(q, v_1), v_2 : dist(q, v_2), \cdots, v_k : dist(q, v_k)\}$.
In this paper, we aim at answering a k-NK query Q = (q, λ, k)
on a graph G. For simplicity, we assume that there is only one
keyword λ in the query. We will discuss how to answer a query
containing multiple keywords with AND and OR semantics.
Example 1: Fig. 1 shows a graph G. Assume that the weight of
each edge is 1. For a k-NK query Q = (f, λ, 3), the keyword node
set is $V_\lambda$ = {b, c, k, n, t}. The result of Q is
R = {b : 2, n : 4, k : 5} since dist(f, b) = 2, dist(f, n) = 4, and
dist(f, k) = 5. □
3. EXISTING SOLUTIONS
A straightforward approach to answering a k-NK query Q =
(q, λ, k) on G is to use Dijkstra’s algorithm to search from the
query node q and output k nearest keyword nodes in nondecreasing
order of their distances to q. The time complexity is O(|E| + |V | ·
log |V |). Obviously, Dijkstra’s algorithm is inefficient when the
size of the graph is large or the keyword nodes are far away from q.
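The baseline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the adjacency-list layout, and the toy data are assumptions made for the example, and the toy graph is not the graph of Fig. 1. It uses Python's binary heap, so its running time is O((|E| + |V|) · log |V|) rather than the bound quoted above.

```python
import heapq


def build_graph(edges):
    """Build an undirected adjacency list from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    return graph


def knk_dijkstra(graph, doc, q, keyword, k):
    """Return up to k (node, distance) pairs carrying `keyword`, nearest to q first."""
    dist = {q: 0}
    heap = [(0, q)]
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        if keyword in doc.get(u, ()):
            result.append((u, d))  # nodes are settled in nondecreasing distance
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result


# Hypothetical toy data (not the graph of Fig. 1).
EDGES = [("q", "a", 1), ("a", "b", 2), ("b", "d", 1), ("q", "c", 5), ("c", "d", 1)]
DOC = {"a": {"x"}, "b": {"y"}, "c": {"x"}, "d": {"x", "y"}}
GRAPH = build_graph(EDGES)
print(knk_dijkstra(GRAPH, DOC, "q", "x", 2))  # [('a', 1), ('d', 4)]
```

The search stops as soon as k keyword nodes are settled, which is why it degrades when keyword nodes are far from q: every closer node must be settled first.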
In the literature, [1] and [22] design different indexing schemes
to process (top-k) nearest keyword queries on a graph or a tree. We
introduce the two methods in the following two subsections.
3.1 Approximate k-NK on a Graph
Bahmani and Goel [1] find an approximate answer to a k-NK
query in a graph based on a distance oracle [20].
Distance Oracle: Distance oracle is a technique for estimating the
distance of two nodes in a graph [20]. Given a graph G, a distance
oracle is a Voronoi partition of V (G) determined by a set of
randomly selected center nodes. More specifically, given a number
$n_c$, we randomly select $n_c$ nodes from V (G) as the center nodes
to construct a distance oracle O. Then the partition is constructed
by assigning each node $v \in V(G)$ to its nearest center node,
denoted as $wit_O(v)$, which is called the witness node of v w.r.t. O. If
v is a center node, $wit_O(v) = v$. For each node $v \in V(G)$, the
shortest distance from v to its witness node, i.e., $dist(v, wit_O(v))$,
is precomputed. After constructing O, given two nodes u and v
in G, if u and v are in the same partition in O, i.e.,
$wit_O(u) = wit_O(v)$, we compute the estimated distance, called
witness distance, as
$\widetilde{dist}_O(u, v) = dist(u, wit_O(u)) + dist(v, wit_O(v))$. If u
and v are not in the same partition in O,
$\widetilde{dist}_O(u, v) = +\infty$.
Figure 2: Two Distance Oracles $O_1$ and $O_2$
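A single distance oracle of this kind can be sketched as below. The multi-source Dijkstra used to assign witnesses is an implementation choice assumed here (the text only requires that each node be assigned to its nearest center with the distance precomputed), and the function names and the five-node path graph are hypothetical.

```python
import heapq
import math


def build_graph(edges):
    """Build an undirected adjacency list from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    return graph


def build_oracle(graph, centers):
    """Map each reachable node to (witness node, distance to witness)."""
    oracle = {}
    heap = [(0, c, c) for c in centers]  # (distance, node, center)
    heapq.heapify(heap)
    while heap:
        d, u, c = heapq.heappop(heap)
        if u in oracle:
            continue  # already claimed by a center that is at least as close
        oracle[u] = (c, d)
        for v, w in graph.get(u, ()):
            if v not in oracle:
                heapq.heappush(heap, (d + w, v, c))
    return oracle


def witness_distance(oracle, u, v):
    """Witness distance: sum of distances to a shared witness, else +infinity."""
    cu, du = oracle[u]
    cv, dv = oracle[v]
    return du + dv if cu == cv else math.inf


# Hypothetical path graph a - b - c - d - e with centers a and e.
GRAPH = build_graph([("a", "b", 1), ("b", "c", 1), ("c", "d", 2), ("d", "e", 1)])
ORACLE = build_oracle(GRAPH, ["a", "e"])
print(witness_distance(ORACLE, "b", "c"))  # 3, although dist(b, c) = 1
print(witness_distance(ORACLE, "c", "d"))  # inf, c and d have different witnesses
```

The two printed values illustrate the two weaknesses of a single oracle: an overestimate inside one partition, and no estimate at all across partitions.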
One distance oracle is usually not enough for distance estimation
in a graph G. It cannot estimate the distance of two nodes in
different partitions. Even for two nodes in the same partition, the
estimation may have a large error. Therefore, a set of
$r = p \times \log |V|$ distance oracles $\{O_1, O_2, \cdots, O_r\}$ are
constructed, where p can be considered as a constant². The algorithm
is processed in $\log |V|$ phases. In phase i ($0 \le i < \log |V|$), p
distance oracles are constructed where each distance oracle contains
$2^i$ randomly selected center nodes. Given r distance oracles, the
distance of two nodes u and v in G can be estimated as an upper bound
$\widetilde{dist}(u, v) = \min_{1 \le i \le r} \widetilde{dist}_{O_i}(u, v)$.
The time complexity to compute the estimated distance
$\widetilde{dist}(u, v)$ for any two nodes u and v in a graph G is
$O(\log |V|)$. The distance oracles consume $O(|V| \cdot \log |V|)$
space. Das Sarma et al. [20] prove that when
$p = \Theta(|V|^{1/\log |V|})$, the estimated distance can be bounded by
$dist(u, v) \le \widetilde{dist}(u, v) \le (2\log_2 |V| - 1) \cdot dist(u, v)$
with a high probability.
Example 2: Fig. 2 shows two distance oracles $O_1$ and $O_2$ for the
graph shown in Fig. 1. There is one center node r in $O_1$, and four
center nodes r, n, o and t in $O_2$. The distance of nodes j and
s is estimated as
$\widetilde{dist}(j, s) = \min\{\widetilde{dist}_{O_1}(j, s), \widetilde{dist}_{O_2}(j, s)\} = \min\{dist(j, r) + dist(s, r), dist(j, n) + dist(s, n)\} = 5$. □
Answering k-NK with Distance Oracle: [1] designs a Partitioned
Multi-Indexing (PMI) scheme which uses a set of distance oracles
to answer a k-NK query in a graph. For each partition in a distance
oracle $O_i$, an inverted list is constructed for each keyword in the
partition. Specifically, for a partition with a center node c and a
keyword λ, the inverted list contains all nodes in the partition that
contain keyword λ ranked in nondecreasing order of their distances
to c. Given a k-NK query Q = (q, λ, k) and a distance oracle $O_i$,
the algorithm first finds the partition that q belongs to in $O_i$. The
result w.r.t. $O_i$ is the first k elements in the inverted list for λ in
the partition, denoted as
$R_{O_i} = \{u_1 : dist(c, u_1) + dist(c, q), u_2 : dist(c, u_2) + dist(c, q), \cdots, u_k : dist(c, u_k) + dist(c, q)\}$.
The final result R is computed by merging the nodes in each $R_{O_i}$ and
maintaining k nodes with the shortest distances to q. The query
time complexity is $O(k \cdot \log |V|)$. We illustrate the algorithm using
the following example.
Example 3: Consider the graph in Fig. 1 and two distance oracles
in Fig. 2. For keyword λ, the inverted list for the partition centered
at node r in $O_1$ has 5 elements {b : 1, n : 3, k : 4, c : 5, t : 6}.
The inverted list for the partition centered at node o in $O_2$ has 1
element {k : 2}. Given a k-NK query Q = (m, λ, 2), from $O_1$, we
can get a result $R_{O_1}$ = {b : 1 + dist(r, m), n : 3 + dist(r, m)} =
{b : 5, n : 7}, and from $O_2$, we can get a result
$R_{O_2}$ = {k : 2 + dist(o, m)} = {k : 3}. By merging $R_{O_1}$ and
$R_{O_2}$, the final answer is R = {k : 3, b : 5}. The exact answer is
R = {c : 1, k : 1} according to Fig. 1. □
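The PMI query of Example 3 can be replayed with a short sketch. The dictionary layout (`witness`, `inverted`), the function name `pmi_query`, and the keyword string `"lambda"` are assumptions made for illustration, not the data structures of [1]. The values dist(r, m) = 4 and dist(o, m) = 1 are read off the arithmetic in Example 3 (5 = 1 + dist(r, m) and 3 = 2 + dist(o, m)).

```python
import heapq


def pmi_query(oracles, q, keyword, k):
    """Merge the first k inverted-list entries of q's partition in every oracle."""
    best = {}
    for oracle in oracles:
        center, d_cq = oracle["witness"][q]
        inverted = oracle["inverted"].get((center, keyword), [])
        for node, d_cu in inverted[:k]:
            estimate = d_cu + d_cq  # witness distance through the center
            if estimate < best.get(node, float("inf")):
                best[node] = estimate
    # Keep the k nodes with the smallest estimated distance to q.
    return heapq.nsmallest(k, best.items(), key=lambda item: item[1])


# Only the entries needed for the query node m are filled in.
O1 = {
    "witness": {"m": ("r", 4)},
    "inverted": {("r", "lambda"): [("b", 1), ("n", 3), ("k", 4), ("c", 5), ("t", 6)]},
}
O2 = {
    "witness": {"m": ("o", 1)},
    "inverted": {("o", "lambda"): [("k", 2)]},
}
print(pmi_query([O1, O2], "m", "lambda", 2))  # [('k', 3), ('b', 5)]
```

The output matches the approximate answer R = {k : 3, b : 5} of Example 3, which differs from the exact answer {c : 1, k : 1}.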
Limitation: Although in theory, the witness distance used by [1]
can be bounded by a factor of $2\log_2 |V| - 1$ of the exact distance
with a high probability, in practice, however, we find the distance
estimation error can be quite large. For example, for the graph G in
Fig. 1 and two distance oracles $O_1$ and $O_2$ in Fig. 2, for two nodes
s and v, the witness distance in $O_1$ is
$\widetilde{dist}_{O_1}(s, v) = dist(s, r) + dist(v, r) = 10$, and that in
$O_2$ is $\widetilde{dist}_{O_2}(s, v) = dist(s, n) + dist(v, n) = 6$.
However, the exact distance is dist(s, v) = 2 in G, which is much
smaller than both $\widetilde{dist}_{O_1}(s, v)$ and
$\widetilde{dist}_{O_2}(s, v)$.
The inaccurate distance estimation can greatly distort the ranking
of the nodes carrying the query keyword, and thus lead to a low
result quality, as illustrated in Example 3.
² In [20], the set $\{O_1, O_2, \cdots, O_r\}$ is defined as a distance oracle.
Figure 3: A Tree T with Preorder and Interval on Each Node
Node labels, listed as node, preorder, [interval] in preorder:
r,1,[1,20]; g,2,[2,6]; a,3,[3,6]; j,4,[4,5]; k,5,[5,5]; n,6,[6,6];
f,7,[7,8]; i,8,[8,8]; d,9,[9,18]; h,10,[10,18]; e,11,[11,18];
p,12,[12,14]; v,13,[13,13]; s,14,[14,14]; m,15,[15,18]; c,16,[16,17];
t,17,[17,17]; o,18,[18,18]; b,19,[19,20]; u,20,[20,20].
Five nodes are marked with keyword λ.
Figure 4: CT(λ), ECT(λ) and TVP(λ) for Keyword λ
CT nodes: r, a, k, n, c, t, b.
ECT nodes (preorder label): r (1), a (3), j (4), k (5), n (6), e (11),
c (16), t (17), b (19).
TVP:
Interval  [1,2]  3  [4,5]  6  [7,10]  [11,16]  17  18  [19,20]
Result    b      n  k      n  b       c        t   c   b
3.2 Exact 1-NK on a Tree
Tao et al. [22] compute the exact answer to a 1-NK query on a
tree T (V, E). Given a query Q = (q, λ, 1), the result is the nearest
node in T that contains keyword λ, denoted as NN(q, λ). The basic
idea is as follows. We label a node v with the sequence number
of v in the preorder traversal of T . For a certain keyword λ, all
nodes with the preorder label in the interval [1, |V |] can be
partitioned into several disjoint intervals, such that all nodes v in the
same interval share an identical NN(v, λ). The partition is called the
tree Voronoi partition of λ, denoted as TVP(λ). By precomputing
TVP(λ) for all keywords λ on the tree, a query Q = (q, λ, 1) can
be answered in $O(\log |V_\lambda|)$ time using a binary search in TVP(λ).
In order to compute TVP(λ) for all keywords λ in T efficiently,
two new data structures, namely, Compact Tree CT(λ) and
Extended Compact Tree ECT(λ), are proposed in [22].
DEFINITION 2. (Compact Tree and Extended Compact Tree)
For a tree T and a keyword λ, a compact tree CT(λ) is a tree that
keeps only two types of nodes in T : a keyword node that contains
keyword λ, and a node that has at least two direct subtrees
containing nodes carrying keyword λ. In the preorder traversal of T , for
two successive nodes u and v, if NN(u, λ) ≠ NN(v, λ), v is called
a change node. An extended compact tree ECT(λ) is a tree
constructed by adding all change nodes into the compact tree CT(λ).
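The node-selection rule for CT(λ) in Definition 2 can be illustrated with a small sketch. It only selects the nodes kept in the compact tree and does not build the tree edges or the extended compact tree; it is not the construction algorithm of [22]. The parent map is an assumption: it is inferred from the nested preorder intervals of Fig. 3, and the names `PARENT`, `KEYWORD_NODES` and `compact_tree_nodes` are hypothetical.

```python
# Child -> parent map for the tree T of Fig. 3 (inferred from the intervals).
PARENT = {
    "g": "r", "f": "r", "d": "r", "b": "r",
    "a": "g", "j": "a", "n": "a", "k": "j",
    "i": "f", "h": "d", "e": "h", "p": "e", "m": "e",
    "v": "p", "s": "p", "c": "m", "o": "m", "t": "c", "u": "b",
}
KEYWORD_NODES = {"b", "c", "k", "n", "t"}  # nodes carrying keyword lambda


def compact_tree_nodes(parent, keyword_nodes):
    """Return the nodes of T that are kept in the compact tree."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    root = next(p for p in parent.values() if p not in parent)
    selected = set()

    def subtree_has_keyword(u):
        # Count the direct subtrees of u that contain a keyword node.
        hits = sum(subtree_has_keyword(c) for c in children.get(u, []))
        if u in keyword_nodes or hits >= 2:
            selected.add(u)
        return u in keyword_nodes or hits > 0

    subtree_has_keyword(root)
    return selected


print(sorted(compact_tree_nodes(PARENT, KEYWORD_NODES)))
# ['a', 'b', 'c', 'k', 'n', 'r', 't']
```

The selected set agrees with the CT(λ) nodes of Fig. 4: the five keyword nodes plus the two branching nodes r and a.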
Using ECT(λ), TVP(λ) can be constructed easily. In [22],
the authors prove that the total size of all compact trees and all
extended compact trees for all keywords in the tree T (V, E) is
bounded by O(|doc(V )|). The time to compute all compact trees
and all extended compact trees for all keywords in the tree T (V, E)
is bounded by O(|doc(V )| · log |V |).
Example 4: Fig. 3 shows a tree with the preorder label from 1 to 20
on its nodes. For keyword λ, there are 5 keyword nodes b, c, k, n, t.
For node s, NN(s, λ) = c. The compact tree of λ, CT(λ), is shown
on the left part of Fig. 4. Node r is in CT(λ) because r has three
direct subtrees with nodes carrying keyword λ. e is not in CT(λ)
because e is not a keyword node and e has only one direct subtree
rooted at m with nodes carrying keyword λ. The extended compact
tree of λ, ECT(λ), is shown in the middle part of Fig. 4 with the
903

preorder label marked beside each node. Node e is in ECT(λ), because for its parent node h, $\mathrm{NN}(h, \lambda) = b \ne \mathrm{NN}(e, \lambda) = c$. The tree Voronoi partition of λ, TVP(λ), is shown on the right part of Fig. 4. For node s with preorder label 14, it is in the interval [11, 16], thus NN(s, λ) = c as listed in TVP(λ). □
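The interval lookup used at the end of Example 4 can be sketched as a binary search over interval start labels. This is a minimal sketch under stated assumptions: the class name `TreeVoronoiPartition` and its layout are hypothetical, and only the interval [11, 16] with nearest keyword node c comes from the example; the other two intervals are placeholders, not the content of Fig. 4.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (first preorder label, last preorder label, nearest keyword node)
Interval = Tuple[int, int, str]


class TreeVoronoiPartition:
    """Sorted, disjoint preorder intervals that share one nearest keyword node."""

    def __init__(self, intervals: List[Interval]) -> None:
        self.intervals = sorted(intervals)
        # Start labels are extracted once so that each lookup is a pure binary search.
        self.starts = [first for first, _, _ in self.intervals]

    def lookup(self, preorder: int) -> Optional[str]:
        # Find the last interval whose start label is <= preorder.
        idx = bisect_right(self.starts, preorder) - 1
        if idx < 0:
            return None
        first, last, node = self.intervals[idx]
        return node if first <= preorder <= last else None


# Only [11, 16] -> c is taken from Example 4; the rest are placeholders.
tvp = TreeVoronoiPartition([
    (1, 10, "placeholder_a"),
    (11, 16, "c"),
    (17, 20, "placeholder_b"),
])
print(tvp.lookup(14))  # c
```

Each lookup costs one binary search over the stored intervals, which is where the logarithmic query time quoted above comes from.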
4. SOLUTION OVERVIEW
Answering k-NK on a Graph using Tree Distance: To address
the drawback of witness distance, in this paper, we propose to use
tree distance in processing a k-NK query. We observe that for a
partition of a distance oracle, we can construct a shortest path tree
rooted at the center node of the partition. Since a tree contains more
structural information than a star, using tree distance will be more
accurate than using witness distance for estimating the distance of
two nodes. For a distance oracle $O_i$, let the set of trees constructed in $O_i$ be $T_i$. $T_i$ can be considered as a tree by adding a virtual root and several virtual edges with weight $+\infty$ that connect the new virtual root to every root node in $T_i$. Let the k-NK result on tree T be $R_T$. Given an algorithm that computes $R_T$ on a tree T, we can solve the k-NK problem on a graph by merging $R_{T_i}$ for each tree $T_i$, $1 \le i \le r$. Such a result will be more accurate than the result by [1]. The following example illustrates the k-NK query processing based on tree distance.
Example 5: For the distance oracles $O_1$ and $O_2$ shown in Fig. 2, the corresponding shortest path trees $T_1$ and $T_2$ are shown in Fig. 5. For $T_1$, there is only 1 tree rooted at r because there is only 1 partition in $O_1$. For $T_2$, there are 4 trees rooted at nodes n, o, r, t respectively, because there are 4 partitions in $O_2$. In each tree, the path from any node to the root node is a shortest path in the original graph. For two nodes s and v, their tree distance is 2 in both $T_1$ and $T_2$, the same as the exact distance dist(s, v) in G. For a k-NK query Q = (m, λ, 2), we have $R_{T_1} = \{c : 1, t : 2\}$, and $R_{T_2} = \{k : 1\}$. By merging $R_{T_1}$ and $R_{T_2}$, we get R = {c : 1, k : 1}. Such a result is much better than the result in Example 3 computed using witness distance for the same query. □
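The merge step of Example 5 can be sketched as follows. The function name `merge_tree_results` is hypothetical, and the sketch assumes that a node reported by several trees keeps its smallest tree distance and that ties are broken by node name; the text above does not specify either rule.

```python
from typing import Dict, Iterable, List, Tuple


def merge_tree_results(results: Iterable[Dict[str, float]],
                       k: int) -> List[Tuple[str, float]]:
    """Combine per-tree k-NK results into one top-k list."""
    best: Dict[str, float] = {}
    for result in results:
        for node, d in result.items():
            # Assumption: keep the smallest tree distance seen for each node.
            if node not in best or d < best[node]:
                best[node] = d
    # Sort by distance, then by node name for a deterministic order.
    ranked = sorted(best.items(), key=lambda item: (item[1], item[0]))
    return ranked[:k]


# Per-tree results for Q = (m, lambda, 2) as given in Example 5.
r_t1 = {"c": 1, "t": 2}
r_t2 = {"k": 1}
print(merge_tree_results([r_t1, r_t2], 2))  # [('c', 1), ('k', 1)]
```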
With the tree distance formulation, the key operation in answer-
ing a k-NK query on a graph is to answer the k-NK query on a tree.
Therefore, we start with processing a k-NK query on a tree.
Answering k-NK on a Tree: We show that it is nontrivial to answer
a k-NK query on a tree efficiently even if k is bounded. Our first
attempt is to extend the existing 1-NK solution on a tree T (V, E)
in [22]. Recall that in [22], for a certain keyword λ, the range
[1, |V |] is partitioned into several disjoint intervals, and nodes with
the preorder label in an identical interval share the same 1-NK result. When $k \ge 2$, each interval needs to be further partitioned to ensure that all nodes with the preorder label in the same interval share an identical k-NK result. The number of intervals increases exponentially w.r.t. the number of keyword nodes on the tree until it reaches |V| for a keyword λ. Clearly, using such an approach, the index size is too large in practice even for a small k. Our second attempt is that, for each node v on the tree T(V, E) and each keyword λ, we precompute its $\bar{k}$ nearest nodes that contain λ. When processing a query Q = (q, λ, k) with $k \le \bar{k}$, we can simply retrieve the precomputed result on node q and output the first k nodes directly. Such an approach is impractical because for each keyword λ, we need $O(\bar{k} \cdot |V|)$ space to store the precomputed results.
In the following, we first introduce two algorithms for answering exact k-NK on a tree T(V, E). Our first algorithm tree-boundk can only handle bounded k values with query processing time $O(k + \log |V_\lambda|)$ and index size $O(\bar{k} \cdot |\mathrm{doc}(V)|)$ for all keywords, where $\bar{k}$ is an upper bound of k. Our second algorithm tree-pivot can handle an arbitrary k with query processing time $O(k \cdot \log |V|)$ and index size $O(|\mathrm{doc}(V)| \cdot \log |V|)$ for all keywords, which is independent of k. We then show our algorithm for approximate k-NK on a graph by merging results on a bounded number of trees. We propose a global storage technique to further reduce the index size and the query time on a graph. Finally we show how to extend our method to handle a query with multiple keywords.

Figure 5: Shortest Path Trees $T_1$ and $T_2$

Algorithm 1: tree-boundk(Q, T)
Input: A k-NK query Q = (q, λ, k), and a tree T.
Output: Answer for Q on T.
1  $R \leftarrow \emptyset$;
2  $(u, u') \leftarrow$ the entry edge of q on CT(λ);
3  $R \leftarrow R \oplus_k (\mathrm{cand}_\lambda(u) \otimes \mathrm{dist}(q, u))$;
4  $R \leftarrow R \oplus_k (\mathrm{cand}_\lambda(u') \otimes \mathrm{dist}(q, u'))$;
5  return R;

5. K-NK ON A TREE FOR A SMALL K
In this section, we study how to answer a k-NK query Q = (q, λ, k) on a tree T(V, E). We first consider a common scenario when users are interested in a small number of answer nodes bounded by a small constant $\bar{k}$, i.e., $k \le \bar{k}$. Recall that for a keyword λ, its compact tree CT(λ) keeps all the structural information of λ on the tree T. Our idea is to precompute the top-$\bar{k}$ results for every keyword λ and every node on CT(λ). Since the total size of all compact trees is bounded by O(|doc(V)|), the total space to store the top-$\bar{k}$ results of nodes on all compact trees is bounded by $O(\bar{k} \cdot |\mathrm{doc}(V)|)$. Given a query Q = (q, λ, k), if q is on CT(λ),
we can simply report the precomputed answer on CT(λ). If q is
not on CT(λ), we need to find a way to construct the answer using
the precomputed results as well as the structure of CT(λ) and T . In
the following, we first introduce how to answer a k-NK query using
CT(λ), followed by discussions on the construction of the index.
5.1 Query Processing
For a keyword λ, and each node v in the compact tree CT(λ), we use a candidate list $\mathrm{cand}_\lambda(v)$ to denote the precomputed k-NK results for $k = \bar{k}$ on node v ranked in nondecreasing order of their distances to v, in the form of $\mathrm{cand}_\lambda(v) = \{v_1 : \mathrm{dist}(v, v_1), v_2 : \mathrm{dist}(v, v_2), \cdots, v_{\bar{k}} : \mathrm{dist}(v, v_{\bar{k}})\}$ where $\mathrm{dist}(v, v_1) \le \mathrm{dist}(v, v_2) \le \cdots \le \mathrm{dist}(v, v_{\bar{k}})$. Given a query Q = (q, λ, k) on a tree T(V, E) where $k \le \bar{k}$, if q is in CT(λ), we can simply report the first k elements in $\mathrm{cand}_\lambda(q)$ as the answer. The difficult case is when q is not in CT(λ). In order to answer such a query, we define an entry edge to be the edge in CT(λ) that is nearest to q. Intuitively, the entry edge plays a role of connecting the query node q to the compact tree CT(λ). The formal definition of entry edge is as follows.
DEFINITION 3. (Entry Node and Entry Edge) Given a compact tree CT(λ), for each edge $(u, u')$ on CT(λ) with $u'$ being a child node of u, $(u, u')$ represents a unique path from u to $u'$ on the original tree T. For any node v on T, we say v sticks to CT(λ), denoted as $v \in_s \mathrm{CT}(\lambda)$, if and only if there exists an edge $(u, u')$ on CT(λ) such that v is on the path from u to $u'$ on T; otherwise v does not stick to CT(λ), denoted as $v \notin_s \mathrm{CT}(\lambda)$. For a node q on T, let v be the first node on the path from q to the root node of T such that $v \in_s \mathrm{CT}(\lambda)$. v is called the Entry Node of q w.r.t. λ, denoted as $\mathrm{EN}_\lambda(q)$. The corresponding edge $(u, u')$ on CT(λ) is called the Entry Edge of q w.r.t. λ, denoted as $\mathrm{EE}_\lambda(q)$.

Algorithm 2: operator $R \otimes \delta$
Input: Candidate list $R = \{u_1 : d_{u_1}, u_2 : d_{u_2}, \cdots\}$, distance δ.
Output: A candidate list by adding δ to all distances in R.
1  $R' \leftarrow \emptyset$;
2  for i = 1 to |R| do
3      $R' \leftarrow R' \cup \{u_i : d_{u_i} + \delta\}$;
4  return $R'$;

Note that for a node q and a keyword λ, $\mathrm{EE}_\lambda(q)$ is an edge on the compact tree CT(λ), and $\mathrm{EN}_\lambda(q)$ is a node on the original tree T. We use an example to illustrate the entry node and entry edge.
Example 6: For the tree T shown in Fig. 3 and keyword λ, the compact tree CT(λ) is shown on the left part of Fig. 4. For ease of illustration, we also mark the nodes in CT(λ) dark on the tree T in Fig. 3. For edge (r, c) in CT(λ), $h \in_s \mathrm{CT}(\lambda)$ because h is on the path from r to c in T. $p \notin_s \mathrm{CT}(\lambda)$ since p is not on the tree path of any CT(λ) edge. For node v, its entry node is $\mathrm{EN}_\lambda(v) = e$, as e is the first node on the path (v, p, e, h, d, r) such that $e \in_s \mathrm{CT}(\lambda)$. The entry edge for v is $\mathrm{EE}_\lambda(v) = (r, c)$ since the entry node e for v is on the path from r to c in T. The entry nodes and entry edges for some other nodes in T are listed in the following table. □
Node                      g        j        d        e        p        u
$\mathrm{EN}_\lambda$     g        j        d        e        e        b
$\mathrm{EE}_\lambda$     (r, a)   (a, k)   (r, c)   (r, c)   (r, c)   (r, b)
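Definition 3 can be turned into a short procedure: mark every node that lies on the tree path of some CT(λ) edge, then walk from q toward the root until a marked node is reached. The sketch below is an illustration only. The helper names `sticking_nodes` and `entry_node_and_edge` are hypothetical, and the parent map is an assumed fragment of Fig. 3 chosen to agree with Example 6 (in particular, placing c directly below m is an assumption).

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (u, u') with u' below u in T


def sticking_nodes(parent: Dict[str, Optional[str]],
                   ct_edges: List[Edge]) -> Dict[str, Edge]:
    """Map each node that sticks to CT(lambda) to the CT edge whose path holds it."""
    owner: Dict[str, Edge] = {}
    for u, u_child in ct_edges:
        # Assumes u is a proper ancestor of u_child in T.
        node = u_child
        while node != u:
            owner[node] = (u, u_child)
            node = parent[node]
        # The upper endpoint keeps an earlier assignment if it has one.
        owner.setdefault(u, (u, u_child))
    return owner


def entry_node_and_edge(q: str, parent: Dict[str, Optional[str]],
                        owner: Dict[str, Edge]) -> Tuple[str, Edge]:
    """Walk from q toward the root until a node that sticks to CT(lambda)."""
    node: Optional[str] = q
    while node is not None and node not in owner:
        node = parent[node]
    if node is None:
        raise ValueError("no ancestor of q sticks to the compact tree")
    return node, owner[node]


# Assumed fragment of Fig. 3: the path (v, p, e, h, d, r) and the branch e -> m -> c.
parent = {"r": None, "d": "r", "h": "d", "e": "h",
          "p": "e", "v": "p", "m": "e", "c": "m"}
owner = sticking_nodes(parent, [("r", "c")])
print(entry_node_and_edge("v", parent, owner))  # ('e', ('r', 'c'))
```

On this fragment the results for v, p, d and e agree with Example 6 and the table. The walk is linear in the depth of q; the entry edge partition described later replaces it with a binary search.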
The Algorithm: Given a tree T(V, E), for keyword λ, all keyword nodes are contained in CT(λ). For any node $q \in V$, the path from q to any keyword node will go through the entry node $\mathrm{EN}_\lambda(q)$. Based on such property, the result of a query Q = (q, λ, k) is identical with the result of the query $Q' = (\mathrm{EN}_\lambda(q), \lambda, k)$. However, $\mathrm{EN}_\lambda(q)$ may not be on CT(λ), thus the result of $Q'$ is not necessarily precomputed. Let $(u, u') = \mathrm{EE}_\lambda(q)$. Since $\mathrm{EN}_\lambda(q)$ is on the path from u to $u'$ on the tree T, the path from $\mathrm{EN}_\lambda(q)$ to any keyword node in T will go through either u or $u'$. Thus, the answer for $Q'$ can be constructed by merging the precomputed candidate lists $\mathrm{cand}_\lambda(u)$ and $\mathrm{cand}_\lambda(u')$ on CT(λ).
Our algorithm for processing a query Q = (q, λ, k) on a tree T is shown in Algorithm 1. We assume that the compact tree CT(λ) for each keyword λ and the list $\mathrm{cand}_\lambda(u)$ for every node u on CT(λ) have been computed. After initializing the result R in line 1, we find the entry edge $(u, u')$ for q on CT(λ) (line 2). We add a distance dist(q, u) to every node in $\mathrm{cand}_\lambda(u)$ using the $\otimes$ operator, to reflect the distance from q to a keyword node via u. We then merge the new result into R using the $\oplus_k$ operator (line 3). Similarly we apply the two operators to $\mathrm{cand}_\lambda(u')$ with the distance $\mathrm{dist}(q, u')$ (line 4). We will describe the operators $\otimes$ and $\oplus_k$ later. We use the following example to illustrate the algorithm.
Example 7: Given the tree T shown in Fig. 3 and CT(λ) on the left part of Fig. 4, for a query Q = (o, λ, 2), the entry edge $\mathrm{EE}_\lambda(o) = (r, c)$. Suppose the lists $\mathrm{cand}_\lambda(r) = \{b : 1, n : 3\}$ and $\mathrm{cand}_\lambda(c) = \{c : 0, t : 1\}$ are precomputed. By adding dist(o, r) = 5 to $\mathrm{cand}_\lambda(r)$, and adding dist(o, c) = 2 to $\mathrm{cand}_\lambda(c)$, we get the new lists {b : 6, n : 8} for r and {c : 2, t : 3} for c. We merge the two lists and get the final result R = {c : 2, t : 3}. □
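Example 7 can be replayed with a small sketch of Algorithm 1. The function names `shift`, `merge_k` and `tree_boundk` are hypothetical stand-ins for the add-distance operator, the top-k merge operator and the query procedure; the entry edge and the two distances are passed in directly, since finding them is discussed separately.

```python
from typing import List, Tuple

Candidates = List[Tuple[str, float]]  # sorted by nondecreasing distance


def shift(cands: Candidates, delta: float) -> Candidates:
    """Add delta to every distance (the add-distance operator)."""
    return [(node, d + delta) for node, d in cands]


def merge_k(r1: Candidates, r2: Candidates, k: int) -> Candidates:
    """Merge two sorted lists, keeping at most k distinct nodes."""
    merged: Candidates = []
    seen = set()
    i = j = 0
    while (i < len(r1) or j < len(r2)) and len(merged) < k:
        # Take from r1 when r2 is exhausted or r1's head is not farther.
        take_left = j >= len(r2) or (i < len(r1) and r1[i][1] <= r2[j][1])
        if take_left:
            node, d = r1[i]
            i += 1
        else:
            node, d = r2[j]
            j += 1
        if node not in seen:
            seen.add(node)
            merged.append((node, d))
    return merged


def tree_boundk(cand_u: Candidates, dist_q_u: float,
                cand_u_child: Candidates, dist_q_u_child: float,
                k: int) -> Candidates:
    """Lines 1-5 of Algorithm 1, with the entry edge already resolved."""
    result: Candidates = []
    result = merge_k(result, shift(cand_u, dist_q_u), k)
    result = merge_k(result, shift(cand_u_child, dist_q_u_child), k)
    return result


# Data from Example 7: entry edge (r, c), dist(o, r) = 5, dist(o, c) = 2.
cand_r = [("b", 1), ("n", 3)]
cand_c = [("c", 0), ("t", 1)]
print(tree_boundk(cand_r, 5, cand_c, 2, 2))  # [('c', 2), ('t', 3)]
```

Truncating to k after the first merge is safe because a node outside the top k of a partial merge cannot enter the top k of the full merge.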
The efficiency of Algorithm 1 depends on three operations. The first operation is to find the entry edge for any node on T (line 2). The second operation is to calculate the distance of any two nodes on T, e.g., dist(q, u) and $\mathrm{dist}(q, u')$ (lines 3-4). The third operation is to merge two sorted lists into a new one using operators $\otimes$ and $\oplus_k$ (lines 3-4). Next, we discuss the three operations separately.
Algorithm 3: operator $R_1 \oplus_k R_2$
Input: Two sorted candidate lists $R_1 = \{u_1 : d_{u_1}, u_2 : d_{u_2}, \cdots\}$, $R_2 = \{v_1 : d_{v_1}, v_2 : d_{v_2}, \cdots\}$, and result size k.
Output: The merged candidate list.
1  $R \leftarrow \emptyset$; $i \leftarrow 1$; $j \leftarrow 1$;
2  while ($i \le |R_1|$ or $j \le |R_2|$) and $|R| < k$ do
3      if $i \le |R_1|$ and ($j > |R_2|$ or $d_{u_i} \le d_{v_j}$) then
4          if $u_i \notin R$ then $R \leftarrow R \cup \{u_i : d_{u_i}\}$;
5          $i \leftarrow i + 1$;
6      else if $j \le |R_2|$ and ($i > |R_1|$ or $d_{v_j} \le d_{u_i}$) then
7          if $v_j \notin R$ then $R \leftarrow R \cup \{v_j : d_{v_j}\}$;
8          $j \leftarrow j + 1$;
9  return R;
Finding the Entry Edge: Given a keyword λ, for any node v on a tree T(V, E), our idea of finding the entry edge $\mathrm{EE}_\lambda(v)$ of v is similar to the idea of finding the 1-NK answer using the tree Voronoi partition TVP(λ) in [22]. For the range [1, |V|], we partition it into several disjoint intervals, such that nodes with the preorder label in the same interval share an identical entry edge. We call such a partition an entry edge partition for λ, denoted as EEP(λ). Given EEP(λ), $\mathrm{EE}_\lambda(v)$ can be computed using a binary search in EEP(λ) in $O(\log |V_\lambda|)$ time. In the next subsection, we show how to build EEP(λ) for all keywords efficiently and prove that the total size of EEP(λ) for all keywords in T is bounded by O(|doc(V)|).
Computing Tree Distance: Given a tree T(V, E) with root r, suppose the distance from r to every node in T has been precomputed. For any two nodes u and v on T, we denote LCA(u, v) as their lowest common ancestor. The distance of u and v can be computed as $\mathrm{dist}(u, v) = \mathrm{dist}(r, u) + \mathrm{dist}(r, v) - 2 \cdot \mathrm{dist}(r, \mathrm{LCA}(u, v))$. Using the techniques in [2], LCA(u, v) can be found in O(1) time using O(|V|) index space. Thus dist(u, v) for any two nodes u and v on T can be computed in O(1) time using O(|V|) index space.
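The distance formula can be checked on a small tree. The sketch below is only an illustration: it finds the lowest common ancestor by walking parent pointers, which takes time proportional to the tree depth and is not the constant-time technique of [2]. The function names and the weighted toy tree are assumptions made for the example.

```python
from typing import Dict, Optional


def lca(u: str, v: str, parent: Dict[str, Optional[str]],
        depth: Dict[str, int]) -> str:
    """Lowest common ancestor by climbing parent pointers."""
    # Lift the deeper node until both are at the same depth.
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    # Climb together until the paths meet.
    while u != v:
        u, v = parent[u], parent[v]
    return u


def tree_distance(u: str, v: str, parent: Dict[str, Optional[str]],
                  depth: Dict[str, int], root_dist: Dict[str, float]) -> float:
    """dist(u, v) = dist(r, u) + dist(r, v) - 2 * dist(r, LCA(u, v))."""
    a = lca(u, v, parent, depth)
    return root_dist[u] + root_dist[v] - 2 * root_dist[a]


# Hypothetical weighted tree: r-a (2), r-b (3), a-x (1), a-y (4).
toy_parent = {"r": None, "a": "r", "b": "r", "x": "a", "y": "a"}
toy_depth = {"r": 0, "a": 1, "b": 1, "x": 2, "y": 2}
toy_root_dist = {"r": 0, "a": 2, "b": 3, "x": 3, "y": 6}
print(tree_distance("x", "y", toy_parent, toy_depth, toy_root_dist))  # 5
```

Here x and y meet at a, so the formula gives 3 + 6 - 2 * 2 = 5, which matches the path x-a-y of length 1 + 4.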
Merging Results: The results are merged using two operators $\otimes$ and $\oplus_k$. Algorithm 2 shows the operator $\otimes$, which takes a candidate list R and a distance δ as input, and outputs a candidate list by adding δ to all distances in R. The time complexity for the $\otimes$ operator is O(|R|). Algorithm 3 shows the operator $\oplus_k$, which takes two candidate lists $R_1$ and $R_2$ sorted in nondecreasing order of the distances, and a value k as input, and outputs the merged candidate list R. R contains at most k elements sorted in nondecreasing order of the distances. R can be constructed by visiting each element in $R_1$ and $R_2$ at most once. The time complexity for the $\oplus_k$ operator is $O(\min\{|R_1| + |R_2|, k\})$. The $\oplus_k$ and $\otimes$ operators satisfy the commutative, associative and distributive laws as follows.
(Commutative Law) $R_1 \oplus_k R_2 = R_2 \oplus_k R_1$.
(Associative Law) $(R_1 \oplus_k R_2) \oplus_k R_3 = R_1 \oplus_k (R_2 \oplus_k R_3)$.
(Distributive Law) $(R_1 \oplus_k R_2) \otimes d = (R_1 \otimes d) \oplus_k (R_2 \otimes d)$.
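As a quick sanity check of the distributive law, consider a worked instance. Here $\otimes$ stands for the add-distance operator of Algorithm 2 and $\oplus_k$ for the top-k merge of Algorithm 3 (the symbols are an assumed rendering), and the two lists are hypothetical values chosen for illustration with k = 2 and d = 5; they are not taken from Fig. 3. Both sides evaluate to the same list:

```latex
\begin{align*}
R_1 &= \{b : 1,\ n : 3\}, \qquad R_2 = \{c : 0,\ t : 2\}, \qquad k = 2,\ d = 5 \\
(R_1 \oplus_2 R_2) \otimes 5 &= \{c : 0,\ b : 1\} \otimes 5 = \{c : 5,\ b : 6\} \\
(R_1 \otimes 5) \oplus_2 (R_2 \otimes 5) &= \{b : 6,\ n : 8\} \oplus_2 \{c : 5,\ t : 7\} = \{c : 5,\ b : 6\}
\end{align*}
```

Adding the same d to every distance preserves the relative order of all entries, which is why shifting before or after the merge selects the same k nodes.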
THEOREM 1. Algorithm 1 computes the exact k-NK answer for a query Q = (q, λ, k) on a tree T(V, E) in $O(k + \log |V_\lambda|)$ time.
Algorithm 1 uses the novel idea of entry edge, and elegantly extends the 1-NK method [22] to handle k-NK (k > 1) with the same query time complexity, except for an extra linear cost O(k) indispensable for reporting the results.
Given the tree T, for every keyword λ, besides the compact tree CT(λ), two more indexes are needed. The first index, the entry edge partition EEP(λ), is to find the entry edge for any node on T. The second index is the candidate list $\mathrm{cand}_\lambda(v)$ for every node on CT(λ). Below we show how to construct the two indexes.
5.2 Construction of Entry Edge Partition
Given a tree T (V, E), for each keyword λ, sharing the similar
idea with the tree Voronoi partition TVP(λ), we construct an entry
References
More filters
Proceedings ArticleDOI
06 Jul 2001
TL;DR: The most impressive feature of the data structure is its constant query time, hence the name ``oracle', which provides faster constructions of sparse spanners of weighted graphs, and improved tree covers and distance labelings of weighted or unweighted graphs.
Abstract: Let G=(V,E) be an undirected weighted graph with |V|=n and |E|=m. Let k\ge 1 be an integer. We show that G=(V,E) can be preprocessed in O(kmn^{1/k}) expected time, constructing a data structure of size O(kn^{1+1/k}), such that any subsequent distance query can be answered, approximately, in O(k) time. The approximate distance returned is of stretch at most 2k-1, i.e., the quotient obtained by dividing the estimated distance by the actual distance lies between 1 and 2k-1. We show that a 1963 girth conjecture of Erd{\H{o}}s, implies that ω(n^{1+1/k}) space is needed in the worst case for any real stretch strictly smaller than 2k+1. The space requirement of our algorithm is, therefore, essentially optimal. The most impressive feature of our data structure is its constant query time, hence the name oracle. Previously, data structures that used only O(n^{1+1/k}) space had a query time of ω(n^{1/k}) and a slightly larger, non-optimal, stretch. Our algorithms are extremely simple and easy to implement efficiently. They also provide faster constructions of sparse spanners of weighted graphs, and improved tree covers and distance labelings of weighted or unweighted graphs.}

563 citations


"Top-K nearest keyword search on lar..." refers methods in this paper

  • ...[23] is a seminal work on distance oracle that estimates distance with 2k − 1 stretch using an O(|V | 1 k ) sized index....

    [...]

  • ...[11] adapt the distance oracle [23] to answer 1-NK queries with 4k−5 stretch in O(k) time using an O(k|V | 1 k ) sized index....

    [...]

Proceedings Article
30 Aug 2005
TL;DR: This paper proposes a new search algorithm, Bidirectional Search, which improves on Backward Expanding search by allowing forward search from potential roots towards leaves, and devise a novel search frontier prioritization technique based on spreading activation.
Abstract: Relational, XML and HTML data can be represented as graphs with entities as nodes and relationships as edges. Text is associated with nodes and possibly edges. Keyword search on such graphs has received much attention lately. A central problem in this scenario is to efficiently extract from the data graph a small number of the "best" answer trees. A Backward Expanding search, starting at nodes matching keywords and working up toward confluent roots, is commonly used for predominantly text-driven queries. But it can perform poorly if some keywords match many nodes, or some node has very large degree.In this paper we propose a new search algorithm, Bidirectional Search, which improves on Backward Expanding search by allowing forward search from potential roots towards leaves. To exploit this flexibility, we devise a novel search frontier prioritization technique based on spreading activation. We present a performance study on real data, establishing that Bidirectional Search significantly outperforms Backward Expanding search.

545 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]

  • ...v,13, [13, 13] g,2,[2,6] f,7,[7,8] r,1,[1,20]...

    [...]

Book ChapterDOI
31 Aug 2004
TL;DR: This paper proposes a novel approach to efficiently and accurately evaluate KNN queries in spatial network databases using first order Voronoi diagram, which outperforms approaches that are based on on-line distance computation by up to one order of magnitude, and provides a factor of four improvement in the selectivity of the filter step as compared to the index-based approaches.
Abstract: A frequent type of query in spatial networks (e.g., road networks) is to find the K nearest neighbors (KNN) of a given query object. With these networks, the distances between objects depend on their network connectivity and it is computationally expensive to compute the distances (e.g., shortest paths) between objects. In this paper, we propose a novel approach to efficiently and accurately evaluate KNN queries in spatial network databases using first order Voronoi diagram. This approach is based on partitioning a large network to small Voronoi regions, and then pre-computing distances both within and across the regions. By localizing the precomputation within the regions, we save on both storage and computation and by performing across-the-network computation for only the border points of the neighboring regions, we avoid global pre-computation between every node-pair. Our empirical experiments with several real-world data sets show that our proposed solution outperforms approaches that are based on on-line distance computation by up to one order of magnitude, and provides a factor of four improvement in the selectivity of the filter step as compared to the index-based approaches.

520 citations


"Top-K nearest keyword search on lar..." refers background or methods in this paper

  • ...As we use a general graph model, existing k-NN solutions on spatial networks [15, 5, 6, 18, 19, 7] cannot be applied, as they usually rely on specialized structures that leverage properties of spatial data to optimize their solutions....

    [...]

  • ...Different from a large body of research on k-nearest neighbor (k-NN) search on spatial networks [15, 5, 6, 18, 19, 7], we define G as a general graph without coordinates....

    [...]

  • ...[15] uses network Voronoi polygons to divide a graph into disjointed subsets for kNN search....

    [...]

  • ...K nearest neighbor (k-NN) search has been extensively studied in spatial networks [15, 5, 6, 18, 19, 7]....

    [...]

Proceedings ArticleDOI
09 Jun 2008
TL;DR: An extended inverted index is proposed to facilitate keyword-based search, and a novel ranking mechanism for enhancing search effectiveness is presented, which achieves both high search efficiency and high accuracy.
Abstract: Conventional keyword search engines are restricted to a given data model and cannot easily adapt to unstructured, semi-structured or structured data. In this paper, we propose an efficient and adaptive keyword search method, called EASE, for indexing and querying large collections of heterogenous data. To achieve high efficiency in processing keyword queries, we first model unstructured, semi-structured and structured data as graphs, and then summarize the graphs and construct graph indices instead of using traditional inverted indices. We propose an extended inverted index to facilitate keyword-based search, and present a novel ranking mechanism for enhancing search effectiveness. We have conducted an extensive experimental study using real datasets, and the results show that EASE achieves both high search efficiency and high accuracy, and outperforms the existing approaches significantly.

422 citations

Proceedings ArticleDOI
15 Apr 2007
TL;DR: This paper proposes a novel parameterized solution, with l as a parameter, to find the optimal GST-1 in time complexity $O(3^l n + 2^l((l + \log n)n + m))$, where n and m are the numbers of nodes and edges in graph G; the solution can handle graphs with a large number of nodes.
Abstract: It is widely realized that the integration of database and information retrieval techniques will provide users with a wide range of high quality services. In this paper, we study processing an l-keyword query, p1, p2, ..., pl, against a relational database which can be modeled as a weighted graph, G(V, E). Here V is a set of nodes (tuples) and E is a set of edges representing foreign key references between tuples. Let Vi ⊆ V be a set of nodes that contain the keyword pi. We study finding top-k minimum cost connected trees that contain at least one node in every subset Vi, and denote our problem as GST-k. When k = 1, it is known as a minimum cost group Steiner tree problem which is NP-complete. We observe that the number of keywords, l, is small, and propose a novel parameterized solution, with l as a parameter, to find the optimal GST-1, in time complexity $O(3^l n + 2^l((l + \log n)n + m))$, where n and m are the numbers of nodes and edges in graph G. Our solution can handle graphs with a large number of nodes. Our GST-1 solution can be easily extended to support GST-k, which outperforms the existing GST-k solutions over both weighted undirected/directed graphs. We conducted extensive experimental studies, and report our finding.

357 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]

  • ...Thus nodes with preorder in either of the two intervals [1, 1] and [7, 8] share the same entry edge (φ, r)....

    [...]

  • ...k Interval [1, 1] [2, 3] [4, 5] [6, 6] [7, 8]...

    [...]

  • ...By excluding the three intervals from [1, 20], two intervals [1, 1] and [7, 8] are left....

    [...]

  • ...v,13, [13, 13] g,2,[2,6] f,7,[7,8] r,1,[1,20]...

    [...]

Frequently Asked Questions (16)
Q1. What have the authors contributed in "Top-k nearest keyword search on large graphs" ?

On such networks, the authors study the problem of top-k nearest keyword (k-NK) search. The authors propose two efficient algorithms to report the exact k-NK result on a tree. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time.

With the tree distance formulation, the key operation in answering a k-NK query on a graph is to answer the k-NK query on a tree.

By keeping a global candidate list and removing duplicate index items, global storage reduces the index size of pivot by 61% on DBLP and 55% on FLARN. 

Given an algorithm that computes RT on a tree T, the authors can solve the k-NK problem in a graph by merging RTi for each tree Ti, 1 ≤ i ≤ r.
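The merge step can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes each per-tree result is a list of `(node, tree_distance)` pairs and that the graph-level estimate for a node is its smallest tree distance over all trees. The function name `merge_tree_results` and the tie-breaking by node name are hypothetical choices made for the example.

```python
import heapq

def merge_tree_results(per_tree_results, k):
    """Merge per-tree k-NK results into one top-k list.

    per_tree_results: iterable of lists of (node, distance) pairs,
    one list per tree Ti (assumed input format).
    """
    best = {}
    for result in per_tree_results:
        for node, dist in result:
            # Keep the smallest distance seen for each node across trees.
            if node not in best or dist < best[node]:
                best[node] = dist
    # Pick the k nodes with the smallest merged distance;
    # ties are broken by node name so the output is deterministic.
    return heapq.nsmallest(k, best.items(), key=lambda item: (item[1], item[0]))

per_tree = [[("a", 3.0), ("b", 5.0)], [("a", 2.0), ("c", 4.0)]]
top2 = merge_tree_results(per_tree, 2)
```

Because a node may appear in several trees, the dictionary removes duplicates before the top-k selection.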

As the authors transform a distance oracle on a graph into a set of shortest path trees, the original k-NK query on the graph can be reduced to answering the k-NK query on a set of trees. 

Since CT(λ) keeps the structural information of all keyword nodes in T, it is sufficient to search only on CT(λ) to calculate candλ(v).

THEOREM 5. Given a tree T(V, E), Algorithm 7 constructs a distance preserving balanced tree DT(T) for T using O(|V| · log |V|) time and O(|V|) space.

This is because the complexity of pivot grows linearly with the tree depth, and the larger diameter of FLARN leads to a larger tree depth.

In order to reduce the average depth of nodes to optimize both index space and query processing time, the authors introduce a new structure called distance preserving balanced tree for T (V, E), denoted as DT(T ). 

For each node v traversed, the authors merge candλ(v) into that of its parent node u by adding a distance dist(u, v) to the list candλ(v) (lines 3-5).

Their second attempt is that, for each node v on the tree T (V, E) and each keyword λ, the authors precompute its k nearest nodes that contain λ. 

Let (u, u′) = EEλ(q). Since ENλ(q) is on the path from u to u′ on the tree T, the path from ENλ(q) to any keyword node in T will go through either u or u′.

Given a compact tree CT(λ) for a tree T and a keyword λ, the authors need to compute the candidate list candλ(v) for every node v on CT(λ). 

For each pivot p of v as well as v itself, the authors calculate distT(p, v) on the original tree T, and add the element v : distT(p, v) to the candidate list candλ(p) (lines 4-5).

The first traversal on CT(λ) is a bottom-up one, such that the candidate list on each node is propagated to all its ancestors on CT(λ). 
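The bottom-up pass can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: the tree is given as a hypothetical `children` map plus a `weight` map keyed by `(parent, child)`, and each candidate list is truncated to the k nearest keyword nodes. It covers only keyword nodes in a node's own subtree; the later traversal that brings in keyword nodes outside the subtree is not shown.

```python
import heapq

def bottom_up_candidates(children, weight, keyword_nodes, root, k):
    """Return cand[v]: up to k (distance, node) pairs for keyword
    nodes located in the subtree rooted at v."""
    # Visit order in which every node appears after its parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    cand = {}
    for v in reversed(order):  # children are processed before parents
        items = [(0, v)] if v in keyword_nodes else []
        for c in children.get(v, []):
            w = weight[(v, c)]
            # Propagate the child's list upward, adding dist(v, c).
            items.extend((d + w, n) for d, n in cand[c])
        cand[v] = heapq.nsmallest(k, items)
    return cand

children = {"r": ["a", "b"], "a": ["c"]}
weight = {("r", "a"): 1, ("r", "b"): 4, ("a", "c"): 2}
cand = bottom_up_candidates(children, weight, {"b", "c"}, "r", k=2)
```

Truncating each list to k entries keeps the per-node state small, which is safe here because an ancestor never needs more than the k nearest entries from any single child.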

Since a tree contains more structural information than a star, using tree distance will be more accurate than using witness distance for estimating the distance of two nodes.
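A small sketch can make the comparison concrete. It assumes, for illustration only, that the witness is the root of the shortest path tree, so the witness distance of u and v is dist(u, root) + dist(root, v), while the tree distance is the length of the tree path through their lowest common ancestor. The names `parent`, `edge_len`, and `tree_and_witness_distance` are hypothetical.

```python
def tree_and_witness_distance(parent, edge_len, u, v):
    """Return (tree_distance, witness_distance) for nodes u and v.

    parent maps each node to its parent (None for the root);
    edge_len maps (parent, child) to a non-negative edge length.
    """
    def path_to_root(x):
        # List of (ancestor, distance from x), starting at x itself.
        dist, out = 0, []
        while x is not None:
            out.append((x, dist))
            p = parent[x]
            if p is not None:
                dist += edge_len[(p, x)]
            x = p
        return out

    up_u, up_v = path_to_root(u), path_to_root(v)
    # Star estimate: go all the way up to the root and back down.
    witness = up_u[-1][1] + up_v[-1][1]
    # Tree estimate: turn around at the lowest common ancestor.
    dist_from_v = dict(up_v)
    tree = next(d + dist_from_v[a] for a, d in up_u if a in dist_from_v)
    return tree, witness

parent = {"r": None, "a": "r", "u": "a", "v": "a"}
edge_len = {("r", "a"): 2, ("a", "u"): 1, ("a", "v"): 3}
tree_d, witness_d = tree_and_witness_distance(parent, edge_len, "u", "v")
```

With non-negative edge lengths the tree distance never exceeds the witness distance, since the two differ by twice the distance from the root to the lowest common ancestor.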