Top-K nearest keyword search on large graphs

01 Aug 2013-Vol. 6, Iss: 10, pp 901-912

Top-K Nearest Keyword Search on Large Graphs
Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, Wentao Tian
The Chinese University of Hong Kong, Hong Kong, China
{mqiao,lqin,hcheng,yu,wttian}@se.cuhk.edu.hk
ABSTRACT
It is quite common for networks emerging nowadays to have labels
or textual contents on the nodes. On such networks, we study the
problem of top-k nearest keyword (k-NK) search. In a network G
modeled as an undirected graph, each node is attached with zero or
more keywords, and each edge is assigned with a weight measuring
its length. Given a query node q in G and a keyword λ, a k-NK
query seeks k nodes which contain λ and are nearest to q. k-NK is
not only useful as a stand-alone query but also as a building block
for tackling complex graph pattern matching problems.
The key to an accurate k-NK result is a precise shortest distance
estimation in a graph. Based on the latest distance oracle technique,
we build a shortest path tree for a distance oracle and use the tree
distance as a more accurate estimation. With such representation,
the original k-NK query on a graph can be reduced to answering
the query on a set of trees and then assembling the results obtained
from the trees. We propose two efficient algorithms to report the
exact k-NK result on a tree. One is query time optimized for a
scenario when a small number of result nodes are of interest to
users. The other handles k-NK queries for an arbitrarily large k
efficiently. In obtaining a k-NK result on a graph from that on trees,
a global storage technique is proposed to further reduce the index
size and the query time. Extensive experimental results conform
with our theoretical findings, and demonstrate the effectiveness and
efficiency of our k-NK algorithms on large real graphs.
1. INTRODUCTION
Many real-world networks emerging nowadays have labels or
textual contents on the nodes. For example in a road network, a
location may have labels such as “McDonald’s”, “hospital”, and
“kindergarten”. In a social network, a person may have information
including name, interests and skills, etc. In a bibliographic
network, a paper may have keywords and abstract, and an author
may have name, affiliation and email address. In this study, we
consider the problem of top-k nearest keyword (k-NK) search on
large networks. In a network G modeled as an undirected graph,
each node is attached with zero or more keywords, and each edge
is assigned with a weight measuring its length. Given a query node
q in G and a keyword λ, a k-NK query in the form of Q = (q, λ, k)
looks for k nodes which contain λ and are nearest to q. Different
from a large body of research on k-nearest neighbor (k-NN) search
on spatial networks [15, 5, 6, 18, 19, 7], we define G as a general
graph without coordinates. Thus our solution can apply to a wide
range of networks.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 10
Copyright 2013 VLDB Endowment 2150-8097/13/10... $10.00.
Motivation. k-NK is an important and useful query in graph search.
As a stand-alone query, it has a wide range of applications. Further-
more, it can serve as a building block for tackling complex graph
pattern matching problems which impose both structural and tex-
tual constraints. Here we list a few applications of k-NK queries.
Consider the social network Facebook as an example, in which
personalized search based on graph structure and textual contents
has become increasingly popular¹. A person looks for 20 friends or
potential friends who like hiking to participate in a hiking activity.
Intuitively, if two persons share some common friends, i.e., they are
two hops away, they are more likely to become friends. In contrast,
if they are far away from each other in the network, they are less
likely to establish a link. Thus the problem is to find 20 persons
who like hiking and are nearest to the person who serves as the
organizer. It can be answered by a k-NK query. More generally,
we also consider a query containing multiple keywords connected
by AND or OR operators to express more complex semantics, e.g.,
a person looks for k friends or potential friends who like hiking
AND (OR) photography and are nearest to him.
Take a road network with locations associated with keywords as
another example. For parents looking for k kindergartens nearest
to their home for their children, their requirements can be expressed
by a k-NK query where the query node is the home location, and
the keyword is “kindergarten”.
In the third example, we show how k-NK queries serve as a
building block for solving the graph pattern matching problem.
Consider a couple who wants to buy a house. They have some con-
straints like having a kindergarten and a hospital within 3 km, and
a supermarket within 1 km of their home. These constraints can be
expressed as a star pattern, and the pattern matching problem can
be decomposed into three k-NK queries with keywords “kinder-
garten”, “hospital” and “supermarket” respectively and k = 1 for
each potential house location to be considered.
Recently, Bahmani and Goel [1] have designed a Partitioned
Multi-Indexing (PMI) scheme to answer k-NK queries approxi-
mately. PMI is an inverted index built based on distance oracle
[20] which is a distance estimation technique. Given a k-NK query
Q = (q, λ, k), it returns k nodes containing keyword λ in ascend-
ing order of their approximate distance from the query node q. PMI
inherits the $2\log_2 |V| - 1$ approximation factor for distance
estimation from distance oracle [20], where V is the set of nodes in the
graph. The major drawback of PMI is that its distance estimation
error could be quite large in practice. This can greatly distort the
ranking of the candidate nodes carrying the query keywords, and
thus lead to a low result quality.
¹ https://www.facebook.com/about/graphsearch
In this work, we study how to answer k-NK queries accurately
and efficiently using a compact index. The key to an accurate k-NK
result is a precise shortest distance estimation in a graph. As we
use a general graph model, existing k-NN solutions on spatial net-
works [15, 5, 6, 18, 19, 7] cannot be applied, as they usually rely
on specialized structures that leverage properties of spatial data to
optimize their solutions. Instead we use distance oracle [20] as the
fundamental distance estimation framework. For each component
of a distance oracle, we will build a shortest path tree, based on
which we can estimate the shortest distance between two nodes by
their tree distance. The tree distance is more accurate than the distance
estimated by distance oracle, which we call witness distance to tell the
two apart. As we transform a distance oracle on a graph into a
set of shortest path trees, the original k-NK query on the graph can
be reduced to answering the k-NK query on a set of trees. Thus we
first focus on processing k-NK queries to find exact top-k answers
on a tree. Then we study how to assemble the results obtained from
the trees to form the approximate top-k answers on the graph.
Contributions. Our main contributions in this work are summa-
rized as follows.
(1) Given a tree, we first consider a common scenario when users
are interested in a small number of answer nodes bounded by a
small constant $\bar{k}$, i.e., $k \le \bar{k}$. We propose the first algorithm
tree-boundk with query time $O(k + \log |V_\lambda|)$, where $|V_\lambda|$ is the
number of nodes carrying the query keyword λ, and index size
$O(\bar{k} \cdot |doc(V)|)$, where $|doc(V)|$ is the total number of keywords
on all the nodes in the graph.
(2) Next we remove the $\bar{k}$ restriction and handle k-NK queries
for an arbitrary k on a tree. We propose the second algorithm
tree-pivot with query time $O(k \cdot \log |V|)$ and index size
$O(|doc(V)| \cdot \log |V|)$, which is independent of k and thus more scalable.
(3) Based on our proposed tree algorithms, we present our algo-
rithm for approximate k-NK query on a graph. We propose a global
storage technique to further reduce the index size and the query
time. We also show how to extend our methods to handle a query
with multiple keywords.
(4) Our experimental evaluation demonstrates the effectiveness and
efficiency of our k-NK algorithms on large real-world networks.
We show the superiority of our methods in ranking top-k answer
nodes accurately, when compared with the state-of-the-art top-k
keyword search method PMI [1].
Roadmap. The rest of the paper is organized as follows. Sec-
tion 2 formally defines the problem. Section 3 discusses two ex-
isting related studies and their drawbacks. Section 4 presents our
framework. Sections 5 and 6 introduce two proposed algorithms to
answer k-NK queries on a tree for a small k and an arbitrary k re-
spectively using compact index structures. Section 7 elaborates on
the way to answer k-NK queries on a graph by approximating the
graph with a bounded number of trees. Section 8 presents exten-
sive experimental evaluation. Section 9 reviews the previous works
related to ours. Finally, Section 10 concludes the paper.
2. PROBLEM DEFINITION
We model a weighted undirected graph as G(V, E), where V (G)
represents the set of nodes and E(G) represents the set of edges in
G. We use V and E to denote V (G) and E(G) if the context is
obvious. Each edge $(u, v) \in E$ has a positive weight, denoted
as weight(u, v). A path $p = (v_1, v_2, \cdots, v_l)$ is a sequence of l
nodes in V such that for each $v_i$ ($1 \le i < l$), $(v_i, v_{i+1}) \in E$.
The weight of a path is the total weight of all edges on the path.
For any two nodes $u \in V$ and $v \in V$, the distance of u and v
on G, dist(u, v), is the minimum weight of all paths from u to v
in G. Each node $v \in V$ contains a set of zero or more keywords
which is denoted as doc(v). The union of keywords for all nodes
in G is denoted as doc(V). Note that doc(V) is a multiset and
$|doc(V)| = \sum_{v \in V} |doc(v)|$. We use $V_\lambda \subseteq V$ to denote the set of
nodes carrying keyword λ in V.
[Figure 1: A Graph G with Keywords]
DEFINITION 1. Given a graph G(V, E), a top-k nearest keyword
(k-NK) query is a triple Q = (q, λ, k), where $q \in V$ is a
query node in G, λ is a keyword, and k is a positive integer. Given
a query Q, a node $v \in V$ is a keyword node w.r.t. Q if v contains
keyword λ, i.e., $v \in V_\lambda$. The result is a set of k keyword nodes,
denoted as $R = \{v_1, v_2, \cdots, v_k\} \subseteq V_\lambda$, and there does not exist
a node $u \in V_\lambda \setminus R$ such that $dist(q, u) < \max_{v \in R} dist(q, v)$. To
further report the distance in the top-k result, we can use the form
$R = \{v_1 : dist(q, v_1), v_2 : dist(q, v_2), \cdots, v_k : dist(q, v_k)\}$.
In this paper, we aim at answering a k-NK query Q = (q, λ, k)
on a graph G. For simplicity, we assume that there is only one
keyword λ in the query. We will discuss how to answer a query
containing multiple keywords with AND and OR semantics.
Example 1: Fig. 1 shows a graph G. Assume that the weight of
each edge is 1. For a k-NK query Q = (f, λ, 3), the keyword node
set is $V_\lambda$ = {b, c, k, n, t}. The result of Q is R = {b : 2, n : 4, k : 5}
since dist(f, b) = 2, dist(f, n) = 4, and dist(f, k) = 5. □
3. EXISTING SOLUTIONS
A straightforward approach to answering a k-NK query Q =
(q, λ, k) on G is to use Dijkstra’s algorithm to search from the
query node q and output k nearest keyword nodes in nondecreasing
order of their distances to q. The time complexity is O(|E| + |V | ·
log |V |). Obviously, Dijkstra’s algorithm is inefficient when the
size of the graph is large or the keyword nodes are far away from q.
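The baseline above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the paper: the function name `knk_dijkstra`, the adjacency-list layout `adj`, the keyword map `doc`, and the toy graph are all hypothetical. The search stops as soon as k keyword nodes have been settled, which is why it degrades when the keyword nodes lie far from q.

```python
import heapq

def knk_dijkstra(adj, doc, q, keyword, k):
    """Exact k-NK by Dijkstra's search from q, stopping at the k-th keyword node."""
    best = {q: 0}        # best tentative distance found so far
    heap = [(0, q)]      # min-heap of (distance, node)
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue     # stale heap entry
        settled.add(u)
        if keyword in doc.get(u, ()):
            result.append((u, d))   # nodes are settled in nondecreasing distance
        for v, w in adj.get(u, ()):
            if d + w < best.get(v, float("inf")):
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return result

# Hypothetical toy graph (not the graph of Fig. 1): node -> [(neighbor, weight)].
adj = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 1), ("d", 5)],
    "c": [("a", 4), ("b", 1), ("d", 1)],
    "d": [("b", 5), ("c", 1)],
}
doc = {"b": {"cafe"}, "c": {"park"}, "d": {"cafe"}}
print(knk_dijkstra(adj, doc, "a", "cafe", 2))   # [('b', 1), ('d', 3)]
```

If fewer than k keyword nodes are reachable, the sketch simply returns all of them.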
In the literature, [1] and [22] design different indexing schemes
to process (top-k) nearest keyword queries on a graph or a tree. We
introduce the two methods in the following two subsections.
3.1 Approximate k-NK on a Graph
Bahmani and Goel [1] find an approximate answer to a k-NK
query in a graph based on a distance oracle [20].
Distance Oracle: Distance oracle is a technique for estimating the
distance of two nodes in a graph [20]. Given a graph G, a distance
oracle is a Voronoi partition of V(G) determined by a set of randomly
selected center nodes. More specifically, given a number
$n_c$, we randomly select $n_c$ nodes from V(G) as the center nodes
to construct a distance oracle O. Then the partition is constructed
by assigning each node $v \in V(G)$ to its nearest center node, denoted
as $wit_O(v)$, which is called the witness node of v w.r.t. O. If
v is a center node, $wit_O(v) = v$. For each node $v \in V(G)$, the
shortest distance from v to its witness node, i.e., $dist(v, wit_O(v))$,
is precomputed. After constructing O, given two nodes u and v
in G, if u and v are in the same partition in O, i.e., $wit_O(u) = wit_O(v)$,
we compute the estimated distance, called witness distance, as
$\widetilde{dist}_O(u, v) = dist(u, wit_O(u)) + dist(v, wit_O(v))$. If u
and v are not in the same partition in O, $\widetilde{dist}_O(u, v) = +\infty$.
[Figure 2: Two Distance Oracles $O_1$ and $O_2$]
One distance oracle is usually not enough for distance estimation
in a graph G. It cannot estimate the distance of two nodes in different
partitions. Even for two nodes in the same partition, the estimation
may have a large error. Therefore, a set of $r = p \times \log |V|$
distance oracles $\{O_1, O_2, \cdots, O_r\}$ are constructed, where p can
be considered as a constant². The algorithm is processed in $\log |V|$
phases. In phase i ($0 \le i < \log |V|$), p distance oracles are constructed
where each distance oracle contains $2^i$ randomly selected
center nodes. Given r distance oracles, the distance of two nodes
u and v in G can be estimated as an upper bound
$\widetilde{dist}(u, v) = \min_{1 \le i \le r} \widetilde{dist}_{O_i}(u, v)$.
The time complexity to compute the estimated distance $\widetilde{dist}(u, v)$
for any two nodes u and v in a graph G is $O(\log |V|)$. The distance
oracles consume $O(|V| \cdot \log |V|)$ space. Das Sarma et al. [20]
prove that when $p = \Theta(|V|^{1/\log |V|})$, the estimated distance can
be bounded by $dist(u, v) \le \widetilde{dist}(u, v) \le (2\log_2 |V| - 1) \cdot dist(u, v)$
with a high probability.
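The construction just described can be sketched as follows. This is a rough illustration under assumptions, not the implementation of [20]: it assumes a connected graph stored as adjacency lists, builds each Voronoi partition with a multi-source Dijkstra search, and uses hypothetical names (`build_oracle`, `witness_distance`, `estimate`, `build_oracles`). Ties between equally near centers are broken arbitrarily.

```python
import heapq
import random

def build_oracle(adj, centers):
    """Map every node to (witness center, distance to that center)."""
    info = {}
    heap = [(0, c, c) for c in centers]      # (distance, node, center)
    heapq.heapify(heap)
    while heap:
        d, u, c = heapq.heappop(heap)
        if u in info:
            continue                         # already claimed by a nearer center
        info[u] = (c, d)
        for v, w in adj.get(u, ()):
            if v not in info:
                heapq.heappush(heap, (d + w, v, c))
    return info

def witness_distance(oracle, u, v):
    """Witness distance in one oracle; infinite across partitions."""
    cu, du = oracle[u]
    cv, dv = oracle[v]
    return du + dv if cu == cv else float("inf")

def estimate(oracles, u, v):
    """Upper bound on dist(u, v): minimum witness distance over all oracles."""
    return min(witness_distance(o, u, v) for o in oracles)

def build_oracles(adj, p, seed=0):
    """p oracles per phase; phase i samples 2^i centers (about log2|V| phases)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    oracles, size = [], 1
    while size < len(nodes):
        for _ in range(p):
            oracles.append(build_oracle(adj, rng.sample(nodes, size)))
        size *= 2
    return oracles

# Hypothetical unit-weight path a - b - c - d - e.
path = {"a": [("b", 1)], "b": [("a", 1), ("c", 1)], "c": [("b", 1), ("d", 1)],
        "d": [("c", 1), ("e", 1)], "e": [("d", 1)]}
single = build_oracle(path, ["a"])
print(witness_distance(single, "b", "d"))    # 4, while dist(b, d) = 2
```

The last line already shows the weakness discussed later: with the single center a, the estimate for b and d is twice the exact distance.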
Example 2: Fig. 2 shows two distance oracles $O_1$ and $O_2$ for the
graph shown in Fig. 1. There is one center node r in $O_1$, and four
center nodes r, n, o and t in $O_2$. The distance of nodes j and
s is estimated as $\widetilde{dist}(j, s) = \min\{\widetilde{dist}_{O_1}(j, s), \widetilde{dist}_{O_2}(j, s)\} =
\min\{dist(j, r) + dist(s, r), dist(j, n) + dist(s, n)\} = 5$. □
Answering k-NK with Distance Oracle: [1] designs a Partitioned
Multi-Indexing (PMI) scheme which uses a set of distance oracles
to answer a k-NK query in a graph. For each partition in a distance
oracle $O_i$, an inverted list is constructed for each keyword in the
partition. Specifically, for a partition with a center node c and a
keyword λ, the inverted list contains all nodes in the partition that
contain keyword λ ranked in nondecreasing order of their distances
to c. Given a k-NK query Q = (q, λ, k) and a distance oracle $O_i$,
the algorithm first finds the partition that q belongs to in $O_i$. The
result w.r.t. $O_i$ is the first k elements in the inverted list for λ in the
partition, denoted as $R_{O_i} = \{u_1 : dist(c, u_1) + dist(c, q), u_2 :
dist(c, u_2) + dist(c, q), \cdots, u_k : dist(c, u_k) + dist(c, q)\}$. The
final result R is computed by merging the nodes in each $R_{O_i}$ and
maintaining k nodes with the shortest distances to q. The query
time complexity is $O(k \cdot \log |V|)$. We illustrate the algorithm using
the following example.
Example 3: Consider the graph in Fig. 1 and two distance oracles
in Fig. 2. For keyword λ, the inverted list for the partition centered
at node r in $O_1$ has 5 elements {b : 1, n : 3, k : 4, c : 5, t : 6}.
The inverted list for the partition centered at node o in $O_2$ has 1
element {k : 2}. Given a k-NK query Q = (m, λ, 2), from $O_1$, we
can get a result $R_{O_1}$ = {b : 1 + dist(r, m), n : 3 + dist(r, m)} =
{b : 5, n : 7}, and from $O_2$, we can get a result $R_{O_2}$ = {k :
2 + dist(o, m)} = {k : 3}. By merging $R_{O_1}$ and $R_{O_2}$, the final
answer is R = {k : 3, b : 5}. The exact answer is R = {c : 1, k : 1}
according to Fig. 1. □
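The merge step of Example 3 can be traced with a short sketch. It is an illustration only and not the PMI implementation of [1]: the function name `pmi_query` and the input layout (one pair per oracle, holding dist(c, q) and the inverted list of the partition that contains q) are assumptions made here. The numbers are the ones given in Example 3.

```python
import heapq

def pmi_query(partitions, k):
    """Merge per-oracle candidates and keep the k smallest witness distances.

    partitions: one (dist(c, q), inverted_list) pair per oracle, where
    inverted_list is [(node, dist(c, node)), ...] sorted by distance to c.
    """
    best = {}
    for d_cq, inverted_list in partitions:
        for node, d_cn in inverted_list[:k]:       # first k entries suffice
            estimate = d_cq + d_cn                 # witness distance via center c
            if estimate < best.get(node, float("inf")):
                best[node] = estimate
    return heapq.nsmallest(k, best.items(), key=lambda item: (item[1], item[0]))

# Query Q = (m, λ, 2) with the lists from Example 3.
from_o1 = (4, [("b", 1), ("n", 3), ("k", 4), ("c", 5), ("t", 6)])  # dist(r, m) = 4
from_o2 = (1, [("k", 2)])                                          # dist(o, m) = 1
print(pmi_query([from_o1, from_o2], 2))   # [('k', 3), ('b', 5)]
```

The output matches the approximate answer {k : 3, b : 5} of Example 3, which misses the exact answer {c : 1, k : 1}.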
Limitation: Although in theory the witness distance used by [1]
can be bounded by a factor of $2\log_2 |V| - 1$ of the exact distance
with a high probability, in practice we find that the distance
estimation error can be quite large. For example, for the graph G in
Fig. 1 and two distance oracles $O_1$ and $O_2$ in Fig. 2, for two nodes
s and v, the witness distance in $O_1$ is $\widetilde{dist}_{O_1}(s, v) = dist(s, r) +
dist(v, r) = 10$, and that in $O_2$ is $\widetilde{dist}_{O_2}(s, v) = dist(s, n) +
dist(v, n) = 6$. However, the exact distance is dist(s, v) = 2 in
G, which is much smaller than both $\widetilde{dist}_{O_1}(s, v)$ and $\widetilde{dist}_{O_2}(s, v)$.
The inaccurate distance estimation can greatly distort the ranking
of the nodes carrying the query keyword, and thus lead to a low
result quality, as illustrated in Example 3.
² In [20], the set $\{O_1, O_2, \cdots, O_r\}$ is defined as a distance oracle.
[Figure 3: A Tree T with Preorder and Interval on Each Node. Labels as (node, preorder, interval): r,1,[1,20]; g,2,[2,6]; a,3,[3,6]; j,4,[4,5]; k,5,[5,5]; n,6,[6,6]; f,7,[7,8]; i,8,[8,8]; d,9,[9,18]; h,10,[10,18]; e,11,[11,18]; p,12,[12,14]; v,13,[13,13]; s,14,[14,14]; m,15,[15,18]; c,16,[16,17]; t,17,[17,17]; o,18,[18,18]; b,19,[19,20]; u,20,[20,20]]
[Figure 4: CT(λ), ECT(λ) and TVP(λ) for Keyword λ. CT(λ) nodes: r, a, b, c, k, n, t. ECT(λ) nodes: r, a, j, k, n, e, c, t, b. TVP(λ):
Interval  [1,2]  3  [4,5]  6  [7,10]  [11,16]  17  18  [19,20]
Result    b      n  k      n  b       c        t   c   b]
3.2 Exact 1-NK on a Tree
Tao et al. [22] compute the exact answer to a 1-NK query on a
tree T(V, E). Given a query Q = (q, λ, 1), the result is the nearest
node in T that contains keyword λ, denoted as NN(q, λ). The basic
idea is as follows. We label a node v with the sequence number
of v in the preorder traversal of T. For a certain keyword λ, all
nodes with the preorder label in the interval [1, |V|] can be partitioned
into several disjoint intervals, such that all nodes v in the
same interval share an identical NN(v, λ). The partition is called
tree Voronoi partition of λ, denoted as TVP(λ). By precomputing
TVP(λ) for all keywords λ on the tree, a query Q = (q, λ, 1) can
be answered in $O(\log |V_\lambda|)$ time using a binary search in TVP(λ).
In order to compute TVP(λ) for all keywords λ in T efficiently,
two new data structures, namely, Compact Tree CT(λ) and Extended
Compact Tree ECT(λ), are proposed in [22].
DEFINITION 2. (Compact Tree and Extended Compact Tree)
For a tree T and a keyword λ, a compact tree CT(λ) is a tree that
keeps only two types of nodes in T: a keyword node that contains
keyword λ, and a node that has at least two direct subtrees containing
nodes carrying keyword λ. In the preorder traversal of T, for
two successive nodes u and v, if $NN(u, \lambda) \ne NN(v, \lambda)$, v is called
a change node. An extended compact tree ECT(λ) is a tree constructed
by adding all change nodes into the compact tree CT(λ).
Using ECT(λ), TVP(λ) can be constructed easily. In [22],
the authors prove that the total size of all compact trees and all
extended compact trees for all keywords in the tree T (V, E) is
bounded by O(|doc(V )|). The time to compute all compact trees
and all extended compact trees for all keywords in the tree T (V, E)
is bounded by O(|doc(V )| · log |V |).
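The node set of a compact tree follows directly from Definition 2. The sketch below is a minimal illustration and not the construction of [22]: it assumes the tree is given as a child-to-parent map (the hypothetical `PARENT`, reconstructed here from the preorder intervals of Fig. 3) and makes one post-order pass per keyword, so it does not achieve the construction bound quoted above.

```python
# Tree of Fig. 3 as child -> parent (reconstructed from the preorder intervals).
PARENT = {
    "g": "r", "a": "g", "j": "a", "k": "j", "n": "a", "f": "r", "i": "f",
    "d": "r", "h": "d", "e": "h", "p": "e", "v": "p", "s": "p", "m": "e",
    "c": "m", "t": "c", "o": "m", "b": "r", "u": "b",
}

def compact_tree_nodes(parent, keyword_nodes):
    """Return the nodes kept in CT(λ) for the given set of keyword nodes."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    root = next(p for p in parent.values() if p not in parent)
    kept = set()

    def visit(v):
        # Number of direct subtrees of v that contain a keyword node.
        branches = sum(visit(c) for c in children.get(v, ()))
        if v in keyword_nodes or branches >= 2:
            kept.add(v)
        # Report whether the subtree rooted at v contains a keyword node.
        return v in keyword_nodes or branches >= 1

    visit(root)
    return kept

print(sorted(compact_tree_nodes(PARENT, {"b", "c", "k", "n", "t"})))
# ['a', 'b', 'c', 'k', 'n', 'r', 't']
```

Nodes such as e and m are dropped because each has only one direct subtree that contains a keyword node.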
Example 4: Fig. 3 shows a tree with the preorder label from 1 to 20
on its nodes. For keyword λ, there are 5 keyword nodes b, c, k, n, t.
For node s, NN(s, λ) = c. The compact tree of λ, CT(λ), is shown
on the left part of Fig. 4. Node r is in CT(λ) because r has three
direct subtrees with nodes carrying keyword λ. e is not in CT(λ)
because e is not a keyword node and e has only one direct subtree
rooted at m with nodes carrying keyword λ. The extended compact
tree of λ, ECT(λ), is shown in the middle part of Fig. 4 with the
preorder label marked beside each node. Node e is in ECT(λ),
because for its parent node h, $NN(h, \lambda) = b \ne NN(e, \lambda) = c$.
The tree Voronoi partition of λ, TVP(λ), is shown on the right part
of Fig. 4. For node s with preorder label 14, it is in the interval
[11, 16], thus NN(s, λ) = c as listed in TVP(λ). □
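The lookup in Example 4 amounts to a binary search over interval start points. The sketch below hard-codes TVP(λ) as listed in Fig. 4; the names `TVP_STARTS`, `TVP_RESULT` and `nearest_keyword_node` are hypothetical, and storing only the left endpoint of each interval is a layout choice made here, not one taken from [22].

```python
from bisect import bisect_right

# TVP(λ) from Fig. 4: left endpoint of each interval and its 1-NK result.
TVP_STARTS = [1, 3, 4, 6, 7, 11, 17, 18, 19]
TVP_RESULT = ["b", "n", "k", "n", "b", "c", "t", "c", "b"]

def nearest_keyword_node(preorder):
    """Return NN(v, λ) for the node v with the given preorder label."""
    # The interval containing `preorder` is the last one starting at or before it.
    return TVP_RESULT[bisect_right(TVP_STARTS, preorder) - 1]

print(nearest_keyword_node(14))   # c  (node s lies in [11, 16])
```

Because the intervals are disjoint and cover [1, |V|], the right endpoints are implied by the next start point.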
4. SOLUTION OVERVIEW
Answering k-NK on a Graph using Tree Distance: To address
the drawback of witness distance, in this paper, we propose to use
tree distance in processing a k-NK query. We observe that for a
partition of a distance oracle, we can construct a shortest path tree
rooted at the center node of the partition. Since a tree contains more
structural information than a star, using tree distance will be more
accurate than using witness distance for estimating the distance of
two nodes. For a distance oracle $O_i$, let the set of trees constructed
in $O_i$ be $T_i$. $T_i$ can be considered as a tree by adding a virtual
root and several virtual edges with weight $+\infty$ that connect the
new virtual root to every root node in $T_i$ respectively. Let the k-NK
result on tree T be $R_T$. Given an algorithm to compute
$R_T$ on a tree T, we can solve the k-NK problem in a graph by
merging $R_{T_i}$ for each tree $T_i$, $1 \le i \le r$. Within a partition, the tree
distance never exceeds the witness distance, and both are upper bounds on
the exact distance, so such a result rests on tighter estimates than the
result by [1]. The following example
illustrates the k-NK query processing based on tree distance.
Example 5: For the distance oracles $O_1$ and $O_2$ shown in Fig. 2,
the corresponding shortest path trees $T_1$ and $T_2$ are shown in Fig. 5.
For $T_1$, there is only 1 tree rooted at r because there is only 1
partition in $O_1$. For $T_2$, there are 4 trees rooted at nodes n, o, r, t
respectively, because there are 4 partitions in $O_2$. In each tree, the
path from any node to the root node is a shortest path in the original
graph. For two nodes s and v, their tree distance is 2 in both $T_1$ and
$T_2$, the same as the exact distance dist(s, v) in G. For a k-NK query
Q = (m, λ, 2), we have $R_{T_1}$ = {c : 1, t : 2}, and $R_{T_2}$ = {k : 1}.
By merging $R_{T_1}$ and $R_{T_2}$, we get R = {c : 1, k : 1}. Such a result
is much better than the result in Example 3 computed using witness
distance for the same query. □
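The gap between the two estimates can be checked numerically. The sketch below is an illustration under assumptions: it uses the unit-weight tree reconstructed from the preorder intervals of Fig. 3 as a stand-in for a shortest path tree rooted at r, and the helper names are hypothetical. The tree distance stops climbing at the lowest common ancestor, whereas the witness distance always goes through the root.

```python
# Unit-weight tree rooted at r, as child -> parent (reconstructed from Fig. 3).
PARENT = {
    "g": "r", "a": "g", "j": "a", "k": "j", "n": "a", "f": "r", "i": "f",
    "d": "r", "h": "d", "e": "h", "p": "e", "v": "p", "s": "p", "m": "e",
    "c": "m", "t": "c", "o": "m", "b": "r", "u": "b",
}

def distances_to_ancestors(parent, weight, v):
    """Map v and each of its ancestors to its tree distance from v."""
    dist, d = {v: 0}, 0
    while v in parent:
        d += weight.get(v, 1)      # weight[x]: length of edge (x, parent[x]); default 1
        v = parent[v]
        dist[v] = d
    return dist

def tree_distance(parent, weight, u, v):
    """Distance along the tree path; both nodes must be in the same tree."""
    up_u = distances_to_ancestors(parent, weight, u)
    d = 0
    while v not in up_u:           # climb from v up to the lowest common ancestor
        d += weight.get(v, 1)
        v = parent[v]
    return d + up_u[v]

def witness_distance(parent, weight, root, u, v):
    """dist(u, root) + dist(v, root), the estimate used by a distance oracle."""
    return (distances_to_ancestors(parent, weight, u)[root]
            + distances_to_ancestors(parent, weight, v)[root])

print(witness_distance(PARENT, {}, "r", "s", "v"))   # 10
print(tree_distance(PARENT, {}, "s", "v"))           # 2
```

The two printed values match the witness distance of 10 and the exact distance of 2 quoted for nodes s and v in Section 3.1.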
With the tree distance formulation, the key operation in answer-
ing a k-NK query on a graph is to answer the k-NK query on a tree.
Therefore, we start with processing a k-NK query on a tree.
Answering k-NK on a Tree: We show that it is nontrivial to answer
a k-NK query on a tree efficiently even if k is bounded. Our first
attempt is to extend the existing 1-NK solution on a tree T(V, E)
in [22]. Recall that in [22], for a certain keyword λ, the range
[1, |V|] is partitioned into several disjoint intervals, and nodes with
the preorder label in an identical interval share the same 1-NK result.
When $k \ge 2$, each interval needs to be further partitioned to
ensure that all nodes with the preorder label in the same interval
share an identical k-NK result. The number of intervals increases
exponentially w.r.t. the number of keyword nodes on the tree until
it reaches |V| for a keyword λ. Clearly, using such an approach,
the index size is too large in practice even for a small k. Our second
attempt is that, for each node v on the tree T(V, E) and each keyword
λ, we precompute its $\bar{k}$ nearest nodes that contain λ. When
processing a query Q = (q, λ, k) with $k \le \bar{k}$, we can simply retrieve
the precomputed result on node q and output the first k nodes
directly. Such an approach is impractical because for each keyword
λ, we need $O(\bar{k} \cdot |V|)$ space to store the precomputed results.
In the following, we first introduce two algorithms for answering
exact k-NK on a tree T(V, E). Our first algorithm tree-boundk can
only handle bounded k values with query processing time
$O(k + \log |V_\lambda|)$ and index size $O(\bar{k} \cdot |doc(V)|)$ for all keywords where $\bar{k}$
is an upper bound value of k. Our second algorithm tree-pivot can
handle an arbitrary k with query processing time $O(k \cdot \log |V|)$
and index size $O(|doc(V)| \cdot \log |V|)$ for all keywords which is
independent of k. We then show our algorithm for approximate
k-NK on a graph by merging results on a bounded number of trees.
We propose a global storage technique to further reduce the index
size and the query time on a graph. Finally we show how to extend
our method to handle a query with multiple keywords.
[Figure 5: Shortest Path Trees $T_1$ and $T_2$]
Algorithm 1: tree-boundk(Q, T)
Input: A k-NK query Q = (q, λ, k), and a tree T.
Output: Answer for Q on T.
1: R ← ∅;
2: (u, u′) ← the entry edge of q on CT(λ);
3: R ← R ⊗_k (cand_λ(u) ⊕ dist(q, u));
4: R ← R ⊗_k (cand_λ(u′) ⊕ dist(q, u′));
5: return R;
5. K-NK ON A TREE FOR A SMALL K
In this section, we study how to answer a k-NK query Q =
(q, λ, k) on a tree T(V, E). We first consider a common scenario
when users are interested in a small number of answer nodes
bounded by a small constant $\bar{k}$, i.e., $k \le \bar{k}$. Recall that for a keyword
λ, its compact tree CT(λ) keeps all the structural information
of λ on the tree T. Our idea is to precompute the top-$\bar{k}$ results for
every keyword λ and every node on CT(λ). Since the total size
of all compact trees is bounded by $O(|doc(V)|)$, the total space to
store the top-$\bar{k}$ results of nodes on all compact trees is bounded by
$O(\bar{k} \cdot |doc(V)|)$. Given a query Q = (q, λ, k), if q is on CT(λ),
we can simply report the precomputed answer on CT(λ). If q is
not on CT(λ), we need to find a way to construct the answer using
the precomputed results as well as the structure of CT(λ) and T. In
the following, we first introduce how to answer a k-NK query using
CT(λ), followed by discussions on the construction of the index.
5.1 Query Processing
For a keyword λ, and each node v in the compact tree CT(λ),
we use a candidate list $cand_\lambda(v)$ to denote the precomputed k-NK
results for $k = \bar{k}$ on node v ranked in nondecreasing order of their
distances to v, in the form of $cand_\lambda(v) = \{v_1 : dist(v, v_1), v_2 :
dist(v, v_2), \cdots, v_{\bar{k}} : dist(v, v_{\bar{k}})\}$ where $dist(v, v_1) \le dist(v, v_2) \le
\cdots \le dist(v, v_{\bar{k}})$. Given a query Q = (q, λ, k) on a tree T(V, E)
where $k \le \bar{k}$, if q is in CT(λ), we can simply report the first k elements
in $cand_\lambda(q)$ as the answer. The difficult case is when q is not
in CT(λ). In order to answer such a query, we define an entry edge
to be the edge in CT(λ) that is nearest to q. Intuitively, the entry
edge plays a role of connecting the query node q to the compact
tree CT(λ). The formal definition of entry edge is as follows.
DEFINITION 3. (Entry Node and Entry Edge) Given a compact
tree CT(λ), for each edge (u, u′) on CT(λ) with u′ being a
child node of u, (u, u′) represents a unique path from u to u′ on
the original tree T. For any node v on T, we say v sticks to CT(λ),
denoted as $v \in_s CT(\lambda)$, if and only if there exists an edge (u, u′)
on CT(λ) such that v is on the path from u to u′ on T; otherwise
v does not stick to CT(λ), denoted as $v \notin_s CT(\lambda)$. For a node q
on T, let v be the first node on the path from q to the root node of
T such that $v \in_s CT(\lambda)$. v is called the Entry Node of q w.r.t. λ,
denoted as $EN_\lambda(q)$. The corresponding edge (u, u′) on CT(λ) is
called the Entry Edge of q w.r.t. λ, denoted as $EE_\lambda(q)$.
Algorithm 2: operator R ⊕ δ
Input: Candidate list $R = \{u_1 : d_{u_1}, u_2 : d_{u_2}, \cdots\}$, distance δ.
Output: A candidate list by adding δ to all distances in R.
1: R′ ← ∅;
2: for i = 1 to |R| do
3:   R′ ← R′ ∪ {u_i : d_{u_i} + δ};
4: return R′;
Note that for a node q and a keyword λ, $EE_\lambda(q)$ is an edge on
the compact tree CT(λ), and $EN_\lambda(q)$ is a node on the original tree
T. We use an example to illustrate the entry node and entry edge.
Example 6: For the tree T shown in Fig. 3 and keyword λ, the
compact tree CT(λ) is shown on the left part of Fig. 4. For ease of
illustration, we also mark the nodes in CT(λ) dark on the tree T in
Fig. 3. For edge (r, c) in CT(λ), $h \in_s CT(\lambda)$ because h is on the
path from r to c in T. $p \notin_s CT(\lambda)$ since p is not on the tree path
of any CT(λ) edge. For node v, its entry node is $EN_\lambda(v) = e$, as e
is the first node on the path (v, p, e, h, d, r) such that $e \in_s CT(\lambda)$.
The entry edge for v is $EE_\lambda(v) = (r, c)$ since the entry node e for
v is on the path from r to c in T. The entry nodes and entry edges
for some other nodes in T are listed in the following table. □
Node   g       j       d       e       p       u
EN_λ   g       j       d       e       e       b
EE_λ   (r, a)  (a, k)  (r, c)  (r, c)  (r, c)  (r, b)
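Definition 3 can be turned into a direct, unoptimized procedure that reproduces the table above. The sketch is an illustration under assumptions and not the indexed lookup the paper goes on to develop: it uses the tree reconstructed from the preorder intervals of Fig. 3, assumes the root of T belongs to CT(λ) as in the running example, and resolves a CT(λ) node to the edge leading to its CT(λ) parent. The names `edge_paths` and `entry_node_and_edge` are hypothetical.

```python
# Tree of Fig. 3 as child -> parent, and the node set of CT(λ) from Fig. 4.
PARENT = {
    "g": "r", "a": "g", "j": "a", "k": "j", "n": "a", "f": "r", "i": "f",
    "d": "r", "h": "d", "e": "h", "p": "e", "v": "p", "s": "p", "m": "e",
    "c": "m", "t": "c", "o": "m", "b": "r", "u": "b",
}
CT_NODES = {"r", "a", "b", "c", "k", "n", "t"}

def edge_paths(parent, ct_nodes):
    """Map each node on the T-path of a CT(λ) edge (u, u') to that edge."""
    edge_of = {}
    for low in ct_nodes:
        path, v = [low], low
        while v in parent:
            v = parent[v]
            if v in ct_nodes:              # reached the CT(λ) parent of `low`
                for x in path:
                    edge_of[x] = (v, low)
                break
            path.append(v)
    return edge_of

def entry_node_and_edge(parent, ct_nodes, q):
    """Return (EN(q), EE(q)); the edge is None when EN(q) is the CT(λ) root."""
    edge_of = edge_paths(parent, ct_nodes)
    v = q
    while v not in edge_of and v not in ct_nodes:
        v = parent[v]                      # climb until a node sticks to CT(λ)
    return v, edge_of.get(v)

for q in ["g", "j", "d", "e", "p", "u"]:
    print(q, entry_node_and_edge(PARENT, CT_NODES, q))
```

Recomputing `edge_paths` on every call keeps the example short; a real index would precompute it once per keyword.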
The Algorithm: Given a tree T(V, E), for keyword λ, all keyword
nodes are contained in CT(λ). For any node $q \in V$, the path from
q to any keyword node will go through the entry node $EN_\lambda(q)$.
Based on such property, the result of a query Q = (q, λ, k) is identical
with the result of the query Q′ = ($EN_\lambda(q)$, λ, k). However,
$EN_\lambda(q)$ may not be on CT(λ), thus the result of Q′ is not necessarily
precomputed. Let (u, u′) = $EE_\lambda(q)$, since $EN_\lambda(q)$ is on the
path from u to u′ on the tree T, the path from $EN_\lambda(q)$ to any keyword
node in T will go through either u or u′. Thus, the answer for
Q′ can be constructed by merging the precomputed candidate lists
$cand_\lambda(u)$ and $cand_\lambda(u')$ on CT(λ).
Our algorithm for processing a query Q = (q, λ, k) on a tree T is
shown in Algorithm 1. We assume that the compact tree CT(λ) for
each keyword λ and the list $cand_\lambda(u)$ for every node u on CT(λ)
have been computed. After initializing the result R in line 1, we
find the entry edge (u, u′) for q on CT(λ) (line 2). We add a distance
dist(q, u) to every node in $cand_\lambda(u)$ using the $\oplus$ operator, to
reflect the distance from q to a keyword node via u. We then merge
the new result into R using the $\otimes_k$ operator (line 3). Similarly we
apply the two operators to $cand_\lambda(u')$ with the distance dist(q, u′)
(line 4). We will describe the operators $\oplus$ and $\otimes_k$ later. We use the
following example to illustrate the algorithm.
Example 7: Given the tree T shown in Fig. 3 and CT(λ) on the left
part of Fig. 4, for a query Q = (o, λ, 2), the entry edge is
EE_λ(o) = (r, c). Suppose the lists cand_λ(r) = {b : 1, n : 3} and
cand_λ(c) = {c : 0, t : 1} are precomputed. By adding dist(o, r) = 5
to cand_λ(r), and adding dist(o, c) = 2 to cand_λ(c), we get the new
lists {b : 6, n : 8} for r and {c : 2, t : 3} for c. We merge the two
lists and get the final result R = {c : 2, t : 3}. □
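Example 7 can be replayed in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names `shift`, `merge_top_k` and `knk_on_tree` are hypothetical, the entry edge and the distance function are passed in as if already looked up, and `heapq.merge` from the standard library stands in for the merge step.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Cand = List[Tuple[str, int]]  # (keyword node, distance), sorted by distance


def shift(cand: Cand, delta: int) -> Cand:
    """Add delta to every distance in a candidate list."""
    return [(node, dist + delta) for node, dist in cand]


def merge_top_k(r1: Cand, r2: Cand, k: int) -> Cand:
    """Merge two distance-sorted lists, keeping the k nearest distinct nodes."""
    result: Cand = []
    seen = set()
    for node, dist in heapq.merge(r1, r2, key=lambda item: item[1]):
        if len(result) == k:
            break
        if node not in seen:
            seen.add(node)
            result.append((node, dist))
    return result


def knk_on_tree(q: str, k: int, entry_edge: Tuple[str, str],
                cand: Dict[str, Cand],
                dist: Callable[[str, str], int]) -> Cand:
    """Follow the four steps described for Algorithm 1."""
    result: Cand = []                                   # line 1
    u, u2 = entry_edge                                  # line 2 (given here)
    result = merge_top_k(result, shift(cand[u], dist(q, u)), k)    # line 3
    result = merge_top_k(result, shift(cand[u2], dist(q, u2)), k)  # line 4
    return result


# Data of Example 7: precomputed lists and the two distances from o.
CAND = {"r": [("b", 1), ("n", 3)], "c": [("c", 0), ("t", 1)]}
DIST_FROM_O = {"r": 5, "c": 2}

answer = knk_on_tree("o", 2, ("r", "c"), CAND, lambda q, x: DIST_FROM_O[x])
print(answer)  # [('c', 2), ('t', 3)]
```

The printed list agrees with the final result R = {c : 2, t : 3} of the example.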
The efficiency of Algorithm 1 depends on three operations. The
first operation is to find the entry edge for any node on T (line 2).
The second operation is to calculate the distance between any two
nodes on T, e.g., dist(q, u) and dist(q, u') (lines 3-4). The third
operation is to merge two sorted lists into a new one using the
operators ⊕ and ⊗_k (lines 3-4). Next, we discuss the three
operations separately.
Algorithm 3: operator R_1 ⊗_k R_2
Input: Two sorted candidate lists R_1 = {u_1 : d_{u_1}, u_2 : d_{u_2}, ···},
       R_2 = {v_1 : d_{v_1}, v_2 : d_{v_2}, ···}, and result size k.
Output: The merged candidate list.
R ← ∅; i ← 1; j ← 1;
while (i ≤ |R_1| or j ≤ |R_2|) and |R| < k do
    if i ≤ |R_1| and (j > |R_2| or d_{u_i} ≤ d_{v_j}) then
        if u_i ∉ R then R ← R ∪ {u_i : d_{u_i}};
        i ← i + 1;
    else if j ≤ |R_2| and (i > |R_1| or d_{v_j} ≤ d_{u_i}) then
        if v_j ∉ R then R ← R ∪ {v_j : d_{v_j}};
        j ← j + 1;
return R;
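The merge of Algorithm 3 can be transcribed into Python with two cursors. This is a hedged sketch, not the authors' code: it uses 0-based indices, represents a candidate list as a list of (node, distance) tuples, tracks already emitted nodes in a set, and the names `merge_k` and `Cand` are hypothetical.

```python
from typing import List, Tuple

Cand = List[Tuple[str, float]]  # (node, distance), nondecreasing in distance


def merge_k(r1: Cand, r2: Cand, k: int) -> Cand:
    """Two-pointer merge of sorted candidate lists, truncated to k entries."""
    merged: Cand = []
    seen = set()                      # nodes already placed in the result
    i = j = 0                         # 0-based cursors into r1 and r2
    while (i < len(r1) or j < len(r2)) and len(merged) < k:
        # Take from r1 if r2 is exhausted or r1's head is at least as near.
        take_left = i < len(r1) and (j >= len(r2) or r1[i][1] <= r2[j][1])
        if take_left:
            node, dist = r1[i]
            i += 1
        else:
            node, dist = r2[j]
            j += 1
        if node not in seen:          # a node may appear in both lists
            seen.add(node)
            merged.append((node, dist))
    return merged


print(merge_k([("b", 6), ("n", 8)], [("c", 2), ("t", 3)], 2))
```

Because both inputs are sorted, the first occurrence of a node is also its smallest distance, so skipping later duplicates keeps the nearer entry.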
Finding the Entry Edge: Given a keyword λ, for any node v on a
tree T(V, E), our idea of finding the entry edge EE_λ(v) of v is
similar to the idea of finding the 1-NK answer using the tree Voronoi
partition TVP(λ) in [22]. We partition the range [1, |V|] into
several disjoint intervals, such that nodes whose preorder labels
fall in the same interval share an identical entry edge. We call such
a partition an entry edge partition for λ, denoted as EEP(λ). Given
EEP(λ), EE_λ(v) can be computed by a binary search in EEP(λ) in
O(log |V_λ|) time. In the next subsection, we show how to build
EEP(λ) for all keywords efficiently and prove that the total size of
EEP(λ) for all keywords in T is bounded by O(doc · |V|).
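The binary search over an entry edge partition might look as follows. The partition below is invented for illustration (a tree with 20 nodes and made-up interval boundaries and edges), and the names `STARTS`, `EDGES` and `entry_edge` are hypothetical; only the idea of locating a preorder label among disjoint sorted intervals comes from the text.

```python
from bisect import bisect_right
from typing import List, Tuple

Edge = Tuple[str, str]

# Hypothetical entry edge partition for a tree with |V| = 20 nodes.
# Interval i covers preorder labels STARTS[i] .. STARTS[i + 1] - 1,
# and the last interval runs up to |V|.
STARTS: List[int] = [1, 2, 7, 12]
EDGES: List[Edge] = [("r", "b"), ("r", "a"), ("r", "c"), ("r", "b")]


def entry_edge(starts: List[int], edges: List[Edge], preorder: int) -> Edge:
    """Return the entry edge of the node with the given preorder label."""
    idx = bisect_right(starts, preorder) - 1  # last interval starting <= label
    if idx < 0:
        raise ValueError("preorder label lies before the first interval")
    return edges[idx]


print(entry_edge(STARTS, EDGES, 6))
```

Storing only the interval start labels is enough because the intervals are disjoint and cover the whole range, so each interval ends where the next one begins.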
Computing Tree Distance: Given a tree T(V, E) with root r, suppose
the distance from r to every node in T has been precomputed.
For any two nodes u and v on T, we denote by LCA(u, v) their lowest
common ancestor. The distance between u and v can be computed as
dist(u, v) = dist(r, u) + dist(r, v) − 2 · dist(r, LCA(u, v)). Using
the techniques in [2], LCA(u, v) can be found in O(1) time using
O(|V|) index space. Thus dist(u, v) for any two nodes u and v on
T can be computed in O(1) time using O(|V|) index space.
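The distance formula can be checked on a small tree. In this sketch the tree, its edge weights and the names `PARENT`, `DEPTH`, `ROOT_DIST`, `lca` and `tree_dist` are all assumptions made for illustration, and the lowest common ancestor is found by naively climbing parent pointers instead of the constant-time technique of [2].

```python
from typing import Dict

# Hypothetical weighted tree rooted at "r" (the root has no PARENT entry).
PARENT: Dict[str, str] = {"a": "r", "b": "r", "x": "a", "y": "a"}
DEPTH: Dict[str, int] = {"r": 0, "a": 1, "b": 1, "x": 2, "y": 2}
# Precomputed dist(r, v) for edge weights r-a: 2, r-b: 3, a-x: 1, a-y: 4.
ROOT_DIST: Dict[str, int] = {"r": 0, "a": 2, "b": 3, "x": 3, "y": 6}


def lca(u: str, v: str) -> str:
    """Naive lowest common ancestor by climbing parent pointers."""
    while DEPTH[u] > DEPTH[v]:
        u = PARENT[u]
    while DEPTH[v] > DEPTH[u]:
        v = PARENT[v]
    while u != v:
        u, v = PARENT[u], PARENT[v]
    return u


def tree_dist(u: str, v: str) -> int:
    """dist(u, v) = dist(r, u) + dist(r, v) - 2 * dist(r, LCA(u, v))."""
    return ROOT_DIST[u] + ROOT_DIST[v] - 2 * ROOT_DIST[lca(u, v)]


print(tree_dist("x", "y"))  # 5: the path x - a - y has weight 1 + 4
```

The subtraction removes the shared stretch from the root to the common ancestor, which both root distances count once each.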
Merging Results: The results are merged using two operators, ⊕
and ⊗_k. Algorithm 2 shows the operator ⊕, which takes a candidate
list R and a distance δ as input, and outputs a candidate list by
adding δ to all distances in R. The time complexity of the ⊕
operator is O(|R|). Algorithm 3 shows the operator ⊗_k, which takes
two candidate lists R_1 and R_2 sorted in nondecreasing order of the
distances, and a value k as input, and outputs the merged candidate
list R. R contains at most k elements sorted in nondecreasing order
of the distances. R can be constructed by visiting each element in
R_1 and R_2 at most once. The time complexity of the ⊗_k operator
is O(min{|R_1| + |R_2|, k}). The ⊗_k and ⊕ operators satisfy the
commutative, associative and distributive laws as follows.
(Commutative Law) R_1 ⊗_k R_2 = R_2 ⊗_k R_1.
(Associative Law) (R_1 ⊗_k R_2) ⊗_k R_3 = R_1 ⊗_k (R_2 ⊗_k R_3).
(Distributive Law) (R_1 ⊗_k R_2) ⊕ d = (R_1 ⊕ d) ⊗_k (R_2 ⊕ d).
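As a quick check of the distributive law, write ⊕ for the operator that adds a distance to a list and ⊗_k for the top-k merge, and take the two shifted lists from Example 7 with k = 2. The offset d = 1 is an arbitrary value chosen for this illustration.

```latex
\begin{aligned}
R_1 &= \{b:6,\ n:8\}, \qquad R_2 = \{c:2,\ t:3\}, \qquad k = 2, \qquad d = 1 \\
(R_1 \otimes_k R_2) \oplus d &= \{c:2,\ t:3\} \oplus 1 = \{c:3,\ t:4\} \\
(R_1 \oplus d) \otimes_k (R_2 \oplus d) &= \{b:7,\ n:9\} \otimes_k \{c:3,\ t:4\} = \{c:3,\ t:4\}
\end{aligned}
```

Both sides agree because adding the same offset to every distance does not change the relative order of the entries.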
THEOREM 1. Algorithm 1 computes the exact k-NK answer for
a query Q = (q, λ, k) on a tree T(V, E) in O(k + log |V_λ|) time.
Algorithm 1 uses the novel idea of the entry edge, and elegantly
extends the 1-NK method [22] to handle k-NK (k > 1) with the same
query time complexity, except for an extra linear cost O(k) that is
indispensable for reporting the results.
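One way to read the bound in Theorem 1 is to add up the three operations discussed above. This is a reading aid rather than the authors' proof, and it assumes that only the first k entries of each precomputed candidate list are touched, so that the shift and merge steps cost O(k) in total.

```latex
\underbrace{O(\log |V_\lambda|)}_{\text{find } EE_\lambda(q)}
\;+\; \underbrace{O(1)}_{\text{dist}(q,u),\ \text{dist}(q,u')}
\;+\; \underbrace{O(k)}_{\text{shift and merge}}
\;=\; O(k + \log |V_\lambda|)
```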
Given the tree T, for every keyword λ, besides the compact tree
CT(λ), two more indexes are needed. The first index, the entry
edge partition EEP(λ), is used to find the entry edge for any node
on T. The second index is the candidate list cand_λ(v) for every
node v on CT(λ). Below we show how to construct the two indexes.
5.2 Construction of Entry Edge Partition
Given a tree T(V, E), for each keyword λ, following an idea
similar to the tree Voronoi partition TVP(λ), we construct an entry

References

Book
05 Sep 2011-
TL;DR: The present article is a commencement at attempting to remedy this deficiency of scientific correlation, and the meaning and working of the various formulæ have been explained sufficiently, it is hoped, to render them readily usable even by those whose knowledge of mathematics is elementary.
Abstract: All knowledge—beyond that of bare isolated occurrence—deals with uniformities. Of the latter, some few have a claim to be considered absolute, such as mathematical implications and mechanical laws. But the vast majority are only partial; medicine does not teach that smallpox is inevitably escaped by vaccination, but that it is so generally; biology has not shown that all animals require organic food, but that nearly all do so; in daily life, a dark sky is no proof that it will rain, but merely a warning; even in morality, the sole categorical imperative alleged by Kant was the sinfulness of telling a lie, and few thinkers since have admitted so much as this to be valid universally. In psychology, more perhaps than in any other science, it is hard to find absolutely inflexible coincidences; occasionally, indeed, there appear uniformities sufficiently regular to be practically treated as laws, but infinitely the greater part of the observations hitherto recorded concern only more or less pronounced tendencies of one event or attribute to accompany another. Under these circumstances, one might well have expected that the evidential evaluation and precise mensuration of tendencies had long been the subject of exhaustive investigation and now formed one of the earliest sections in a beginner’s psychological course. Instead, we find only a general naı̈ve ignorance that there is anything about it requiring to be learnt. One after another, laborious series of experiments are executed and published with the purpose of demonstrating some connection between two events, wherein the otherwise learned psychologist reveals that his art of proving and measuring correspondence has not advanced beyond that of lay persons. The consequence has been that the significance of the experiments is not at all rightly understood, nor have any definite facts been elicited that may be either confirmed or refuted. 
The present article is a commencement at attempting to remedy this deficiency of scientific correlation. With this view, it will be strictly confined to the needs of practical workers, and all theoretical mathematical demonstrations will be omitted; it may, however, be said that the relations stated have already received a large amount of empirical verification. Great thanks are due from me to Professor Haussdorff and to Dr. G. Lipps, each of whom have supplied a useful theorem in polynomial probability; the former has also very kindly given valuable advice concerning the proof of the important formulæ for elimination of ‘‘systematic deviations.’’ At the same time, and for the same reason, the meaning and working of the various formulæ have been explained sufficiently, it is hoped, to render them readily usable even by those whose knowledge of mathematics is elementary. The fundamental procedure is accompanied by simple imaginary examples, while the more advanced parts are illustrated by cases that have actually occurred in my personal experience. For more abundant and positive exemplification, the reader is requested to refer to the under cited research, which is entirely built upon the principles and mathematical relations here laid down. In conclusion, the general value of the methodics recommended is emphasized by a brief criticism of the best correlational work hitherto made public, and also the important question is discussed as to the number of ‘‘cases’’ required for an experimental series.

3,267 citations


"Top-K nearest keyword search on lar..." refers methods in this paper

  • ...We use six metrics for evaluation: hit rate, Spearman’s rho [21], error, query time, index time, and index size....

    [...]


Proceedings ArticleDOI
G. Bhalotia1, Arvind Hulgeri, Charuta Nakhe1, Soumen Chakrabarti1  +1 moreInstitutions (1)
26 Feb 2002-
TL;DR: BANKS is described, a system which enables keyword-based search on relational databases, together with data and schema browsing, and presents an efficient heuristic algorithm for finding and ranking query results.
Abstract: With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.

944 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]

  • ...k Interval [1, 1] [2, 3] [4, 5] [6, 6] [7, 8]...

    [...]


Book ChapterDOI
20 Aug 2002-
TL;DR: It is proved that DISCOVER finds without redundancy all relevant candidate networks, whose size can be data bound, by exploiting the structure of the schema and the selection of the optimal execution plan (way to reuse common subexpressions) is NP-complete.
Abstract: DISCOVER operates on relational databases and facilitates information discovery on them by allowing its user to issue keyword queries without any knowledge of the database schema or of SQL. DISCOVER returns qualified joining networks of tuples, that is, sets of tuples that are associated because they join on their primary and foreign keys and collectively contain all the keywords of the query. DISCOVER proceeds in two steps. First the Candidate Network Generator generates all candidate networks of relations, that is, join expressions that generate the joining networks of tuples. Then the Plan Generator builds plans for the efficient evaluation of the set of candidate networks, exploiting the opportunities to reuse common subexpressions of the candidate networks. We prove that DISCOVER finds without redundancy all relevant candidate networks, whose size can be data bound, by exploiting the structure of the schema. We prove that the selection of the optimal execution plan (way to reuse common subexpressions) is NP-complete. We provide a greedy algorithm and we show that it provides near-optimal plan execution time cost. Our experimentation also provides hints on tuning the greedy algorithm.

875 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]


Book ChapterDOI
10 Apr 2000-
TL;DR: A very simple algorithm for the Least Common Ancestors problem is presented, dispelling the frequently held notion that optimal LCA computation is unwieldy and unimplementable.
Abstract: We present a very simple algorithm for the Least Common Ancestors problem. We thus dispel the frequently held notion that optimal LCA computation is unwieldy and unimplementable. Interestingly, this algorithm is a sequentialization of a previously known PRAM algorithm.

852 citations


"Top-K nearest keyword search on lar..." refers background or methods in this paper

  • ...[2, 6] is processed recursively by invoking partition(EEP(λ), [2, 6], (r, a),CT(λ)), and [7, 20] is processed by the other two child nodes c and b similarly....

    [...]

  • ...16 17 19 Interval [1,2] 3 [4,5] 6 [7,10]...

    [...]

  • ...We first process edge (r, a) with interval [2, 6], which divides the interval [1, 20] into three parts: [1, 1], [2, 6], and [7, 20]....

    [...]

  • ...k Interval [1, 1] [2, 3] [4, 5] [6, 6] [7, 8]...

    [...]

  • ...Using the techniques in [2], LCA(u, v) can be found in O(1) time using O(|V |) index space....

    [...]


Proceedings ArticleDOI
Hao He1, Haixun Wang2, Jun Yang1, Philip S. Yu2Institutions (2)
11 Jun 2007-
TL;DR: BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search, and offers orders-of-magnitude performance improvement over existing approaches.
Abstract: Query processing over graph-structured data is enjoying a growing number of applications. A top-k keyword search query on a graph finds the top k answers according to some ranking criteria, where each answer is a substructure of the graph containing all query keywords. Current techniques for supporting such queries on general graphs suffer from several drawbacks, e.g., poor worst-case performance, not taking full advantage of indexes, and high memory requirements. To address these problems, we propose BLINKS, a bi-level indexing and query processing scheme for top-k keyword search on graphs. BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search. To reduce the index space, BLINKS partitions a data graph into blocks: The bi-level index stores summary information at the block level to initiate and guide search among blocks, and more detailed information for each block to accelerate search within blocks. Our experiments show that BLINKS offers orders-of-magnitude performance improvement over existing approaches.

585 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]

  • ...For the node h, its interval is [10, 18] because the preorder of h on T is 10 and the maximum preorder for all nodes on the subtree rooted at h is 18....

    [...]


Performance Metrics
No. of citations received by the Paper in previous years

Year    Citations
2021    6
2020    8
2019    7
2018    3
2017    9
2016    3