What have the authors contributed in "Top-k nearest keyword search on large graphs" ?

On such networks, the authors study the problem of top-k nearest keyword ( k-NK ) search. The authors propose two efficient algorithms to report the exact k-NK result on a tree. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time.

How does global storage reduce the index size of pivot?

By keeping a global candidate list and removing duplicate index items, global storage reduces the index size of pivot by 61% on DBLP and 55% on FLARN.

How can the authors solve the k-NK problem in a graph?

Suppose the authors have an algorithm to compute RT on a tree T , the authors can solve the k-NK problem in a graph by merging RTi for each tree Ti, 1 ≤ i ≤ r.

What is the way to calculate cand(v)?

Since CT(λ) keeps the structural information of all keyword nodes in T , it is sufficient to search only on CT(λ) to calculate candλ(v).

How does the Algorithm 7 construct a distance preserving balanced tree?

THEOREM 5. Given a tree T(V, E), Algorithm 7 constructs a distance preserving balanced tree DT(T) for T using O(|V| · log |V|) time and O(|V|) space.

Why is the index size of pivot longer on FLARN than on DBLP?

This is because the complexity of pivot grows linearly with the tree depth, and the larger diameter of FLARN leads to a larger tree depth.

What is the definition of a distance preserving balanced tree?

In order to reduce the average depth of nodes to optimize both index space and query processing time, the authors introduce a new structure called distance preserving balanced tree for T (V, E), denoted as DT(T ).

How do the authors calculate cand(v) for a tree?

For each node v traversed, the authors merge candλ(v) into that of its parent node u by adding a distance dist(u, v) to the list candλ(v) (line 3-5).

What is the second attempt to precompute k nearest nodes?

Their second attempt is that, for each node v on the tree T (V, E) and each keyword λ, the authors precompute its k nearest nodes that contain λ.

What is the EEλ(q) operator for the path from u to u′?

Let (u, u′) = EEλ(q), since ENλ(q) is on the path from u to u′ on the tree T , the path from ENλ(q) to any keyword node in T will go through either u or u′.

How do the authors compute the candidate list for a tree?

Given a compact tree CT(λ) for a tree T and a keyword λ, the authors need to compute the candidate list candλ(v) for every node v on CT(λ).

How do the authors calculate candidate list cand(p)?

For each pivot p of v as well as v itself, the authors calculate distT (p, v) on the original tree T , and add the element v : distT (p, v) to the candidate list candλ(p) (line 4-5).

What is the first traversal on CT(λ)?

The first traversal on CT(λ) is a bottom-up one, such that the candidate list on each node is propagated to all its ancestors on CT(λ).

(Open Access) Top-K nearest keyword search on large graphs (2013) | Miao Qiao

Q: What is the key operation in answering a k-NK query on a graph?

With the tree distance formulation, the key operation in answering a k-NK query on a graph is to answer the k-NK query on a tree.

Q: How can the authors reduce a distance oracle to a set of trees?

As the authors transform a distance oracle on a graph into a set of shortest path trees, the original k-NK query on the graph can be reduced to answering the k-NK query on a set of trees.

Top-K Nearest Keyword Search on Large Graphs

Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, Wentao Tian

The Chinese University of Hong Kong, Hong Kong, China

{mqiao,lqin,hcheng,yu,wttian}@se.cuhk.edu.hk

ABSTRACT

It is quite common for networks emerging nowadays to have labels

or textual contents on the nodes. On such networks, we study the

problem of top-k nearest keyword (k-NK) search. In a network G

modeled as an undirected graph, each node is attached with zero or

more keywords, and each edge is assigned with a weight measuring

its length. Given a query node q in G and a keyword λ, a k-NK

query seeks k nodes which contain λ and are nearest to q. k-NK is

not only useful as a stand-alone query but also as a building block

for tackling complex graph pattern matching problems.

The key to an accurate k-NK result is a precise shortest distance

estimation in a graph. Based on the latest distance oracle technique,

we build a shortest path tree for a distance oracle and use the tree

distance as a more accurate estimation. With such representation,

the original k-NK query on a graph can be reduced to answering

the query on a set of trees and then assembling the results obtained

from the trees. We propose two efﬁcient algorithms to report the

exact k-NK result on a tree. One is query time optimized for a

scenario when a small number of result nodes are of interest to

users. The other handles k-NK queries for an arbitrarily large k

efﬁciently. In obtaining a k-NK result on a graph from that on trees,

a global storage technique is proposed to further reduce the index

size and the query time. Extensive experimental results conform

with our theoretical ﬁndings, and demonstrate the effectiveness and

efﬁciency of our k-NK algorithms on large real graphs.

1. INTRODUCTION

Many real-world networks emerging nowadays have labels or

textual contents on the nodes. For example in a road network, a

location may have labels such as “McDonald’s”, “hospital”, and

“kindergarten”. In a social network, a person may have information including name, interests and skills. In a bibliographic

network, a paper may have keywords and abstract, and an author

may have name, afﬁliation and email address. In this study, we

consider the problem of top-k nearest keyword (k-NK) search on

large networks. In a network G modeled as an undirected graph,

each node is attached with zero or more keywords, and each edge

is assigned with a weight measuring its length. Given a query node q in G and a keyword λ, a k-NK query in the form of Q = (q, λ, k) looks for k nodes which contain λ and are nearest to q. Different from a large body of research on k-nearest neighbor (k-NN) search on spatial networks [15, 5, 6, 18, 19, 7], we define G as a general graph without coordinates. Thus our solution can apply to a wide range of networks.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th - 30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 10

Motivation. k-NK is an important and useful query in graph search.

As a stand-alone query, it has a wide range of applications. Further-

more, it can serve as a building block for tackling complex graph

pattern matching problems which impose both structural and tex-

tual constraints. Here we list a few applications of k-NK queries.

Consider the social network Facebook as an example, in which

personalized search based on graph structure and textual contents

has become increasingly popular. A person looks for 20 friends or

potential friends who like hiking to participate in a hiking activity.

Intuitively, if two persons share some common friends, i.e., they are

two hops away, they are more likely to become friends. In contrast,

if they are far away from each other in the network, they are less

likely to establish a link. Thus the problem is to ﬁnd 20 persons

who like hiking and are nearest to the person who serves as the

organizer. It can be answered by a k-NK query. More generally,

we also consider a query containing multiple keywords connected

by AND or OR operators to express more complex semantics, e.g.,

a person looks for k friends or potential friends who like hiking

AND (OR) photography and are nearest to him.

Take a road network with locations associated with keywords as

another example. For parents looking for k kindergartens nearest

to their home for their children, their requirements can be expressed

by a k-NK query where the query node is the home location, and

the keyword is “kindergarten”.

In the third example, we show how k-NK queries serve as a

building block for solving the graph pattern matching problem.

Consider a couple who wants to buy a house. They have some con-

straints like having a kindergarten and a hospital within 3 km, and

a supermarket within 1 km of their home. These constraints can be

expressed as a star pattern, and the pattern matching problem can

be decomposed into three k-NK queries with keywords “kinder-

garten”, “hospital” and “supermarket” respectively and k = 1 for

each potential house location to be considered.

Recently, Bahmani and Goel [1] have designed a Partitioned Multi-Indexing (PMI) scheme to answer k-NK queries approximately. PMI is an inverted index built based on distance oracle [20], which is a distance estimation technique. Given a k-NK query Q = (q, λ, k), it returns k nodes containing keyword λ in ascending order of their approximate distance from the query node q. PMI inherits the 2 log |V| − 1 approximation factor for distance estimation from distance oracle [20], where V is the set of nodes in the graph. The major drawback of PMI is that its distance estimation error could be quite large in practice. This can greatly distort the ranking of the candidate nodes carrying the query keywords, and thus lead to a low result quality.

Footnote (Facebook graph search, cited in the Motivation paragraph): https://www.facebook.com/about/graphsearch

In this work, we study how to answer k-NK queries accurately

and efﬁciently using compact index. The key to an accurate k-NK

result is a precise shortest distance estimation in a graph. As we

use a general graph model, existing k-NN solutions on spatial net-

works [15, 5, 6, 18, 19, 7] cannot be applied, as they usually rely

on specialized structures that leverage properties of spatial data to

optimize their solutions. Instead we use distance oracle [20] as the

fundamental distance estimation framework. For each component

of a distance oracle, we will build a shortest path tree, based on

which we can estimate the shortest distance between two nodes by

their tree distance. The tree distance is more accurate than the dis-

tance estimated by distance oracle, which we call witness distance

to distinguish. As we transform a distance oracle on a graph into a

set of shortest path trees, the original k-NK query on the graph can

be reduced to answering the k-NK query on a set of trees. Thus we

ﬁrst focus on processing k-NK queries to ﬁnd exact top-k answers

on a tree. Then we study how to assemble the results obtained from

the trees to form the approximate top-k answers on the graph.

Contributions. Our main contributions in this work are summa-

rized as follows.

(1) Given a tree, we first consider a common scenario when users are interested in a small number of answer nodes bounded by a small constant k̄, i.e., k ≤ k̄. We propose the first algorithm tree-boundk with query time O(k + log |V_λ|), where |V_λ| is the number of nodes carrying the query keyword λ, and index size O(k̄ · |doc(V)|), where |doc(V)| is the total number of keywords on all the nodes in the graph.

(2) Next we remove the k̄ restriction and handle k-NK queries for an arbitrary k on a tree. We propose the second algorithm tree-pivot with query time O(k · log |V|) and index size O(|doc(V)| · log |V|), which is independent of k and thus more scalable.

(3) Based on our proposed tree algorithms, we present our algo-

rithm for approximate k-NK query on a graph. We propose a global

storage technique to further reduce the index size and the query

time. We also show how to extend our methods to handle a query

with multiple keywords.

(4) Our experimental evaluation demonstrates the effectiveness and

efﬁciency of our k-NK algorithms on large real-world networks.

We show the superiority of our methods in ranking top-k answer

nodes accurately, when compared with the state-of-the-art top-k

keyword search method PMI [1].

Roadmap. The rest of the paper is organized as follows. Sec-

tion 2 formally deﬁnes the problem. Section 3 discusses two ex-

isting related studies and their drawbacks. Section 4 presents our

framework. Sections 5 and 6 introduce two proposed algorithms to

answer k-NK queries on a tree for a small k and an arbitrary k re-

spectively using compact index structures. Section 7 elaborates on

the way to answer k-NK queries on a graph by approximating the

graph with a bounded number of trees. Section 8 presents exten-

sive experimental evaluation. Section 9 reviews the previous works

related to ours. Finally, Section 10 concludes the paper.

2. PROBLEM DEFINITION

We model a weighted undirected graph as G(V, E), where V (G)

represents the set of nodes and E(G) represents the set of edges in

G. We use V and E to denote V (G) and E(G) if the context is

obvious. Each edge (u, v) ∈ E has a positive weight, denoted as weight(u, v). A path p = (v_1, v_2, · · · , v_l) is a sequence of l nodes in V such that for each v_i (1 ≤ i < l), (v_i, v_{i+1}) ∈ E. The weight of a path is the total weight of all edges on the path. For any two nodes u ∈ V and v ∈ V, the distance of u and v on G, dist(u, v), is the minimum weight of all paths from u to v in G. Each node v ∈ V contains a set of zero or more keywords which is denoted as doc(v). The union of keywords for all nodes in G is denoted as doc(V). Note that doc(V) is a multiset and |doc(V)| = Σ_{v∈V} |doc(v)|. We use V_λ ⊆ V to denote the set of nodes carrying keyword λ in V.

Figure 1: A Graph G with Keywords

DEFINITION 1. Given a graph G(V, E), a top-k nearest keyword (k-NK) query is a triple Q = (q, λ, k), where q ∈ V is a query node in G, λ is a keyword, and k is a positive integer. Given a query Q, a node v ∈ V is a keyword node w.r.t. Q if v contains keyword λ, i.e., v ∈ V_λ. The result is a set of k keyword nodes, denoted as R = {v_1, v_2, · · · , v_k} ⊆ V_λ, and there does not exist a node u ∈ V_λ \ R such that dist(q, u) < max_{v∈R} dist(q, v). To further report the distance in the top-k result, we can use the form R = {v_1 : dist(q, v_1), v_2 : dist(q, v_2), · · · , v_k : dist(q, v_k)}.

In this paper, we aim at answering a k-NK query Q = (q, λ, k)

on a graph G. For simplicity, we assume that there is only one

keyword λ in the query. We will discuss how to answer a query

containing multiple keywords with AND and OR semantics.

Example 1: Fig. 1 shows a graph G. Assume that the weight of each edge is 1. For a k-NK query Q = (f, λ, 3), the keyword node set is V_λ = {b, c, k, n, t}. The result of Q is R = {b : 2, n : 4, k : 5} since dist(f, b) = 2, dist(f, n) = 4, and dist(f, k) = 5.

3. EXISTING SOLUTIONS

A straightforward approach to answering a k-NK query Q =

(q, λ, k) on G is to use Dijkstra’s algorithm to search from the

query node q and output k nearest keyword nodes in nondecreasing

order of their distances to q. The time complexity is O(|E| + |V | ·

log |V |). Obviously, Dijkstra’s algorithm is inefﬁcient when the

size of the graph is large or the keyword nodes are far away from q.
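The baseline just described can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `knk_dijkstra`, the adjacency-list layout `adj`, and the keyword map `doc` are assumptions made for the example. The search stops as soon as k keyword nodes have been settled, which is why it is slow when keyword nodes are far from q.

```python
import heapq

def knk_dijkstra(adj, doc, q, keyword, k):
    """Return up to k (node, distance) pairs: the keyword nodes nearest to q.

    adj: dict node -> list of (neighbor, weight), weights positive (assumed layout)
    doc: dict node -> set of keywords on that node (assumed layout)
    """
    dist = {q: 0}
    heap = [(0, q)]
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        # Nodes are settled in nondecreasing distance, so results come out sorted.
        if keyword in doc.get(u, ()):
            result.append((u, d))
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# Hypothetical toy graph (not the graph of Fig. 1).
adj = {
    "q": [("a", 1), ("x", 4)],
    "a": [("q", 1), ("b", 1)],
    "b": [("a", 1), ("x", 1)],
    "x": [("q", 4), ("b", 1)],
}
doc = {"a": {"hiking"}, "x": {"hiking"}, "b": {"photo"}}
print(knk_dijkstra(adj, doc, "q", "hiking", 2))
```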

In the literature, [1] and [22] design different indexing schemes

to process (top-k) nearest keyword queries on a graph or a tree. We

introduce the two methods in the following two subsections.

3.1 Approximate k-NK on a Graph

Bahmani and Goel [1] ﬁnd an approximate answer to a k-NK

query in a graph based on a distance oracle [20].

Distance Oracle: Distance oracle is a technique for estimating the distance of two nodes in a graph [20]. Given a graph G, a distance oracle is a Voronoi partition of V(G) determined by a set of randomly selected center nodes. More specifically, given a number of centers, we randomly select that many nodes from V(G) as the center nodes to construct a distance oracle O. Then the partition is constructed by assigning each node v ∈ V(G) to its nearest center node, denoted as wit_O(v), which is called the witness node of v w.r.t. O. If v is a center node, wit_O(v) = v. For each node v ∈ V(G), the shortest distance from v to its witness node, i.e., dist(v, wit_O(v)), is precomputed. After constructing O, given two nodes u and v in G, if u and v are in the same partition in O, i.e., wit_O(u) = wit_O(v), we compute the estimated distance, called witness distance, as d̃ist_O(u, v) = dist(u, wit_O(u)) + dist(v, wit_O(v)). If u and v are not in the same partition in O, d̃ist_O(u, v) = +∞.

Figure 2: Two Distance Oracles O_1 and O_2

One distance oracle is usually not enough for distance estimation in a graph G. It cannot estimate the distance of two nodes in different partitions. Even for two nodes in the same partition, the estimation may have a large error. Therefore, a set of r = p × log |V| distance oracles {O_1, O_2, · · · , O_r} are constructed, where p can be considered as a constant. The algorithm is processed in log |V| phases. In phase i (0 ≤ i < log |V|), p distance oracles are constructed where each distance oracle contains 2^i randomly selected center nodes. Given r distance oracles, the distance of two nodes u and v in G can be estimated as an upper bound d̃ist(u, v) = min_{1≤i≤r} d̃ist_{O_i}(u, v).

The time complexity to compute the estimated distance d̃ist(u, v) for any two nodes u and v in a graph G is O(log |V|). The distance oracles consume O(|V| · log |V|) space. Das Sarma et al. [20] prove that when p = Θ(|V|^{1/log |V|}), the estimated distance can be bounded by dist(u, v) ≤ d̃ist(u, v) ≤ (2 log |V| − 1) · dist(u, v) with a high probability.
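To make the witness distance concrete, the sketch below builds one oracle with a multi-source Dijkstra and takes the minimum over several oracles. It is an illustration under assumptions: the names `build_oracle`, `witness_distance` and `estimated_distance` are hypothetical, the graph is assumed connected, and the centers are passed in explicitly instead of being sampled at random.

```python
import heapq

def build_oracle(adj, centers):
    """Map every node to (witness center, exact distance to that center)."""
    wit = {}
    heap = [(0, c, c) for c in centers]
    heapq.heapify(heap)
    while heap:
        d, u, c = heapq.heappop(heap)
        if u in wit:
            continue  # already claimed by a nearer (or equally near) center
        wit[u] = (c, d)
        for v, w in adj.get(u, ()):
            if v not in wit:
                heapq.heappush(heap, (d + w, v, c))
    return wit

def witness_distance(oracle, u, v):
    """dist(u, wit(u)) + dist(v, wit(v)) if both share a witness, else +inf."""
    (cu, du), (cv, dv) = oracle[u], oracle[v]
    return du + dv if cu == cv else float("inf")

def estimated_distance(oracles, u, v):
    """Upper bound on dist(u, v): minimum witness distance over all oracles."""
    return min(witness_distance(o, u, v) for o in oracles)

# Hypothetical unit-weight path a - b - c - d - e.
adj = {
    "a": [("b", 1)],
    "b": [("a", 1), ("c", 1)],
    "c": [("b", 1), ("d", 1)],
    "d": [("c", 1), ("e", 1)],
    "e": [("d", 1)],
}
o1 = build_oracle(adj, ["a"])        # one partition
o2 = build_oracle(adj, ["a", "d"])   # two partitions
print(estimated_distance([o1, o2], "b", "e"))  # exact distance is 3
```

On this toy path the estimate for (b, e) comes only from the single-center oracle, since b and e fall into different partitions of the second one, which mirrors the overestimation discussed in the text.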

Example 2: Fig. 2 shows two distance oracles O_1 and O_2 for the graph shown in Fig. 1. There is one center node r in O_1, and four center nodes r, n, o and t in O_2. The distance of nodes j and s is estimated as d̃ist(j, s) = min{d̃ist_{O_1}(j, s), d̃ist_{O_2}(j, s)} = min{dist(j, r) + dist(s, r), dist(j, n) + dist(s, n)} = 5.

Answering k-NK with Distance Oracle: [1] designs a Partitioned Multi-Indexing (PMI) scheme which uses a set of distance oracles to answer a k-NK query in a graph. For each partition in a distance oracle O_i, an inverted list is constructed for each keyword in the partition. Specifically, for a partition with a center node c and a keyword λ, the inverted list contains all nodes in the partition that contain keyword λ ranked in nondecreasing order of their distances to c. Given a k-NK query Q = (q, λ, k) and a distance oracle O_i, the algorithm first finds the partition that q belongs to in O_i. The result w.r.t. O_i is the first k elements in the inverted list for λ in the partition, denoted as R_i = {u_1 : dist(c, u_1) + dist(c, q), u_2 : dist(c, u_2) + dist(c, q), · · · , u_k : dist(c, u_k) + dist(c, q)}. The final result R is computed by merging the nodes in each R_i and maintaining k nodes with the shortest distances to q. The query time complexity is O(k · log |V|). We illustrate the algorithm using the following example.

Example 3: Consider the graph in Fig. 1 and two distance oracles in Fig. 2. For keyword λ, the inverted list for the partition centered at node r in O_1 has 5 elements {b : 1, n : 3, k : 4, c : 5, t : 6}. The inverted list for the partition centered at node o in O_2 has 1 element {k : 2}. Given a k-NK query Q = (m, λ, 2), from O_1, we can get a result R_1 = {b : 1 + dist(r, m), n : 3 + dist(r, m)} = {b : 5, n : 7}, and from O_2, we can get a result R_2 = {k : 2 + dist(o, m)} = {k : 3}. By merging R_1 and R_2, the final answer is R = {k : 3, b : 5}. The exact answer is R = {c : 1, k : 1} according to Fig. 1.
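The PMI query step can be sketched in a few lines, replaying the numbers of Example 3. The function name `pmi_query` and the input layout (one pair of dist(c, q) and inverted list per oracle) are assumptions for illustration; the values dist(r, m) = 4 and dist(o, m) = 1 are the ones implied by the example's arithmetic.

```python
import heapq
from itertools import islice

def pmi_query(partitions, k):
    """Merge the first k inverted-list entries of q's partition in each oracle.

    partitions: list of (dist(c, q), inverted_list), where inverted_list holds
    (node, dist(c, node)) pairs sorted by distance to the center c.
    """
    best = {}
    for d_cq, inverted in partitions:
        for node, d_cu in islice(inverted, k):
            est = d_cu + d_cq  # witness distance through the center c
            if est < best.get(node, float("inf")):
                best[node] = est
    return heapq.nsmallest(k, best.items(), key=lambda kv: kv[1])

# Query node m from Example 3.
partitions = [
    (4, [("b", 1), ("n", 3), ("k", 4), ("c", 5), ("t", 6)]),  # O_1, center r
    (1, [("k", 2)]),                                          # O_2, center o
]
print(pmi_query(partitions, 2))
```

The merged answer matches the example's approximate result, not the exact one, which is the limitation discussed next.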

Limitation: Although in theory the witness distance used by [1] can be bounded by a factor of 2 log |V| − 1 of the exact distance with a high probability, in practice we find the distance estimation error can be quite large. For example, for the graph G in Fig. 1 and two distance oracles O_1 and O_2 in Fig. 2, for two nodes s and v, the witness distance in O_1 is d̃ist_{O_1}(s, v) = dist(s, r) + dist(v, r) = 10, and that in O_2 is d̃ist_{O_2}(s, v) = dist(s, n) + dist(v, n) = 6. However, the exact distance is dist(s, v) = 2 in G, which is much smaller than both d̃ist_{O_1}(s, v) and d̃ist_{O_2}(s, v). The inaccurate distance estimation can greatly distort the ranking of the nodes carrying the query keyword, and thus lead to a low result quality, as illustrated in Example 3.

Footnote: In [20], the set {O_1, O_2, · · · , O_r} is defined as a distance oracle.

Figure 3: A Tree T with Preorder and Interval on Each Node
Node labels (node, preorder, [interval]): r,1,[1,20]; g,2,[2,6]; a,3,[3,6]; j,4,[4,5]; k,5,[5,5]; n,6,[6,6]; f,7,[7,8]; i,8,[8,8]; d,9,[9,18]; h,10,[10,18]; e,11,[11,18]; p,12,[12,14]; v,13,[13,13]; s,14,[14,14]; m,15,[15,18]; c,16,[16,17]; t,17,[17,17]; o,18,[18,18]; b,19,[19,20]; u,20,[20,20]

Figure 4: CT(λ), ECT(λ) and TVP(λ) for Keyword λ
TVP(λ) (interval → result): [1,2] → b; 3 → n; [4,5] → k; 6 → n; [7,10] → b; [11,16] → c; 17 → t; 18 → c; [19,20] → b

3.2 Exact 1-NK on a Tree

Tao et al. [22] compute the exact answer to a 1-NK query on a tree T(V, E). Given a query Q = (q, λ, 1), the result is the nearest node in T that contains keyword λ, denoted as NN(q, λ). The basic idea is as follows. We label a node v with the sequence number of v in the preorder traversal of T. For a certain keyword λ, all nodes with the preorder label in the interval [1, |V|] can be partitioned into several disjoint intervals, such that all nodes v in the same interval share an identical NN(v, λ). The partition is called tree Voronoi partition of λ, denoted as TVP(λ). By precomputing TVP(λ) for all keywords λ on the tree, a query Q = (q, λ, 1) can be answered in O(log |V_λ|) time using a binary search in TVP(λ). In order to compute TVP(λ) for all keywords λ in T efficiently, two new data structures, namely, Compact Tree CT(λ) and Extended Compact Tree ECT(λ), are proposed in [22].

DEFINITION 2. (Compact Tree and Extended Compact Tree) For a tree T and a keyword λ, a compact tree CT(λ) is a tree that keeps only two types of nodes in T: a keyword node that contains keyword λ, and a node that has at least two direct subtrees containing nodes carrying keyword λ. In the preorder traversal of T, for two successive nodes u and v, if NN(u, λ) ≠ NN(v, λ), v is called a change node. An extended compact tree ECT(λ) is a tree constructed by adding all change nodes into the compact tree CT(λ).

Using ECT(λ), TVP(λ) can be constructed easily. In [22],

the authors prove that the total size of all compact trees and all

extended compact trees for all keywords in the tree T (V, E) is

bounded by O(|doc(V )|). The time to compute all compact trees

and all extended compact trees for all keywords in the tree T (V, E)

is bounded by O(|doc(V )| · log |V |).
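The node set of a compact tree follows directly from Definition 2, as the sketch below shows. It is a simple recursive illustration, not the O(|doc(V)| · log |V|) construction of [22]. The function name `compact_tree_nodes` is hypothetical, and the child lists are an assumed reading of the tree in Fig. 3, derived from the preorder intervals printed on its nodes.

```python
def compact_tree_nodes(children, root, keyword_nodes):
    """Return the nodes of CT(λ): keyword nodes, plus nodes with at least two
    direct subtrees that contain a keyword node."""
    ct = set()

    def visit(u):
        # Count child subtrees holding a keyword node; every child is visited.
        hits = sum(1 for c in children.get(u, ()) if visit(c))
        if u in keyword_nodes or hits >= 2:
            ct.add(u)
        return u in keyword_nodes or hits >= 1

    visit(root)
    return ct

# Tree T of Fig. 3, reconstructed from the preorder intervals (assumption).
children = {
    "r": ["g", "f", "d", "b"], "g": ["a"], "a": ["j", "n"], "j": ["k"],
    "f": ["i"], "d": ["h"], "h": ["e"], "e": ["p", "m"], "p": ["v", "s"],
    "m": ["c", "o"], "c": ["t"], "b": ["u"],
}
keyword_nodes = {"b", "c", "k", "n", "t"}
print(sorted(compact_tree_nodes(children, "r", keyword_nodes)))
```

With this reading of the figure, r is kept because three of its subtrees hold keyword nodes, while e and m are dropped because only one of their subtrees does.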

Example 4: Fig. 3 shows a tree with the preorder label from 1 to 20 on its nodes. For keyword λ, there are 5 keyword nodes b, c, k, n, t. For node s, NN(s, λ) = c. The compact tree of λ, CT(λ), is shown on the left part of Fig. 4. Node r is in CT(λ) because r has three direct subtrees with nodes carrying keyword λ. e is not in CT(λ) because e is not a keyword node and e has only one direct subtree rooted at m with nodes carrying keyword λ. The extended compact tree of λ, ECT(λ), is shown in the middle part of Fig. 4 with the preorder label marked beside each node. Node e is in ECT(λ), because for its parent node h, NN(h, λ) = b ≠ NN(e, λ) = c. The tree Voronoi partition of λ, TVP(λ), is shown on the right part of Fig. 4. For node s with preorder label 14, it is in the interval [11, 16], thus NN(s, λ) = c as listed in TVP(λ).
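The 1-NK lookup in TVP(λ) is a plain binary search over interval start points. The sketch below uses the intervals shown with Fig. 4. The names `TVP`, `STARTS` and `nn_lookup` are hypothetical, and the result for the last interval [19, 20] is an assumption (b is itself a keyword node and u is its only child), since the extracted figure does not show it.

```python
from bisect import bisect_right

# (first preorder label of the interval, nearest keyword node)
TVP = [
    (1, "b"), (3, "n"), (4, "k"), (6, "n"), (7, "b"),
    (11, "c"), (17, "t"), (18, "c"),
    (19, "b"),  # assumed: result for [19, 20] is missing from the figure text
]
STARTS = [start for start, _ in TVP]

def nn_lookup(preorder_label):
    """Return NN(v, λ) for the node v with the given preorder label."""
    i = bisect_right(STARTS, preorder_label) - 1  # last interval starting <= label
    return TVP[i][1]

print(nn_lookup(14))  # node s
```

Storing only the interval starts is enough because the intervals are disjoint and cover [1, |V|].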

4. SOLUTION OVERVIEW

Answering k-NK on a Graph using Tree Distance: To address the drawback of witness distance, in this paper, we propose to use tree distance in processing a k-NK query. We observe that for a partition of a distance oracle, we can construct a shortest path tree rooted at the center node of the partition. Since a tree contains more structural information than a star, using tree distance will be more accurate than using witness distance for estimating the distance of two nodes. For a distance oracle O_i, let the set of trees constructed in O_i be T_i. T_i can be considered as a tree by adding a virtual root and several virtual edges with weight +∞ that connect the new virtual root to every root node in T_i respectively. Let the k-NK result on tree T be R_T. Suppose we have an algorithm to compute R_T on a tree T; then we can solve the k-NK problem in a graph by merging R_{T_i} for each tree T_i, 1 ≤ i ≤ r. Obviously, such a result will be more accurate than the result by [1]. The following example illustrates the k-NK query processing based on tree distance.

Example 5: For the distance oracles O_1 and O_2 shown in Fig. 2, the corresponding shortest path trees T_1 and T_2 are shown in Fig. 5. For T_1, there is only 1 tree rooted at r because there is only 1 partition in O_1. For T_2, there are 4 trees rooted at nodes n, o, r, t respectively, because there are 4 partitions in O_2. In each tree, the path from any node to the root node is a shortest path in the original graph. For two nodes s and v, their tree distance is 2 in both T_1 and T_2, the same as the exact distance dist(s, v) in G. For a k-NK query Q = (m, λ, 2), we have R_{T_1} = {c : 1, t : 2}, and R_{T_2} = {k : 1}. By merging R_{T_1} and R_{T_2}, we get R = {c : 1, k : 1}. Such a result is much better than the result in Example 3 computed using witness distance for the same query.
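The merging step of Example 5 can be sketched as below. The function name `merge_tree_results` is hypothetical, and two details are assumptions of this sketch: a node reported by several trees keeps its smallest tree distance, and ties are broken by node name.

```python
import heapq

def merge_tree_results(results, k):
    """Merge per-tree k-NK results (dict node -> tree distance) into one top-k list."""
    best = {}
    for r in results:
        for node, d in r.items():
            if d < best.get(node, float("inf")):
                best[node] = d  # keep the tightest estimate per node
    return heapq.nsmallest(k, best.items(), key=lambda kv: (kv[1], kv[0]))

# Results for Q = (m, λ, 2) on T_1 and T_2, as given in Example 5.
r_t1 = {"c": 1, "t": 2}
r_t2 = {"k": 1}
print(merge_tree_results([r_t1, r_t2], 2))
```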

With the tree distance formulation, the key operation in answer-

ing a k-NK query on a graph is to answer the k-NK query on a tree.

Therefore, we start with processing a k-NK query on a tree.

Answering k-NK on a Tree: We show that it is nontrivial to answer a k-NK query on a tree efficiently even if k is bounded. Our first attempt is to extend the existing 1-NK solution on a tree T(V, E) in [22]. Recall that in [22], for a certain keyword λ, the range [1, |V|] is partitioned into several disjoint intervals, and nodes with the preorder label in an identical interval share the same 1-NK result. When k ≥ 2, each interval needs to be further partitioned to ensure that all nodes with the preorder label in the same interval share an identical k-NK result. The number of intervals increases exponentially w.r.t. the number of keyword nodes on the tree until it reaches |V| for a keyword λ. Clearly, using such an approach, the index size is too large in practice even for a small k. Our second attempt is that, for each node v on the tree T(V, E) and each keyword λ, we precompute its k̄ nearest nodes that contain λ. When processing a query Q = (q, λ, k) with k ≤ k̄, we can simply retrieve the precomputed result on node q and output the first k nodes directly. Such an approach is impractical because for each keyword λ, we need O(k̄ · |V|) space to store the precomputed results.

In the following, we first introduce two algorithms for answering exact k-NK on a tree T(V, E). Our first algorithm tree-boundk can only handle bounded k values with query processing time O(k + log |V_λ|) and index size O(k̄ · |doc(V)|) for all keywords, where k̄ is an upper bound value of k. Our second algorithm tree-pivot can handle an arbitrary k with query processing time O(k · log |V|) and index size O(|doc(V)| · log |V|) for all keywords, which is independent of k. We then show our algorithm for approximate k-NK on a graph by merging results on a bounded number of trees. We propose a global storage technique to further reduce the index size and the query time on a graph. Finally we show how to extend our method to handle a query with multiple keywords.

Figure 5: Shortest Path Trees T_1 and T_2

Algorithm 1: tree-boundk(Q, T)
Input: A k-NK query Q = (q, λ, k), and a tree T.
Output: Answer for Q on T.
1 R ← ∅;
2 (u, u′) ← the entry edge of q on CT(λ);
3 R ← R ⊗ (cand_λ(u) ⊕ dist(q, u));
4 R ← R ⊗ (cand_λ(u′) ⊕ dist(q, u′));
5 return R;

5. K-NK ON A TREE FOR A SMALL K

In this section, we study how to answer a k-NK query Q = (q, λ, k) on a tree T(V, E). We first consider a common scenario when users are interested in a small number of answer nodes bounded by a small constant k̄, i.e., k ≤ k̄. Recall that for a keyword λ, its compact tree CT(λ) keeps all the structural information of λ on the tree T. Our idea is to precompute the top-k̄ results for every keyword λ and every node on CT(λ). Since the total size of all compact trees is bounded by O(|doc(V)|), the total space to store the top-k̄ results of nodes on all compact trees is bounded by O(k̄ · |doc(V)|). Given a query Q = (q, λ, k), if q is on CT(λ), we can simply report the precomputed answer on CT(λ). If q is not on CT(λ), we need to find a way to construct the answer using the precomputed results as well as the structure of CT(λ) and T. In the following, we first introduce how to answer a k-NK query using CT(λ), followed by discussions on the construction of the index.

5.1 Query Processing

For a keyword λ, and each node v in the compact tree CT(λ), we use a candidate list cand_λ(v) to denote the precomputed k-NK results for k = k̄ on node v ranked in nondecreasing order of their distances to v, in the form of cand_λ(v) = {v_1 : dist(v, v_1), v_2 : dist(v, v_2), · · · , v_{k̄} : dist(v, v_{k̄})} where dist(v, v_1) ≤ dist(v, v_2) ≤ · · · ≤ dist(v, v_{k̄}). Given a query Q = (q, λ, k) on a tree T(V, E) where k ≤ k̄, if q is in CT(λ), we can simply report the first k elements in cand_λ(q) as the answer. The difficult case is when q is not in CT(λ). In order to answer such a query, we define an entry edge to be the edge in CT(λ) that is nearest to q. Intuitively, the entry edge plays a role of connecting the query node q to the compact tree CT(λ). The formal definition of entry edge is as follows.

DEFINITION 3. (Entry Node and Entry Edge) Given a compact tree CT(λ), for each edge (u, u′) on CT(λ) with u′ being a child node of u, (u, u′) represents a unique path from u to u′ on the original tree T. For any node v on T, we say v sticks to CT(λ), denoted as v ∈ CT(λ), if and only if there exists an edge (u, u′) on CT(λ) such that v is on the path from u to u′ on T, otherwise v does not stick to CT(λ), denoted as v ∉ CT(λ). For a node q on T, let v be the first node on the path from q to the root node of T such that v ∈ CT(λ). v is called the Entry Node of q w.r.t. λ, denoted as EN_λ(q). The corresponding edge (u, u′) on CT(λ) is called the Entry Edge of q w.r.t. λ, denoted as EE_λ(q).

Algorithm 2: operator R ⊕ δ
Input: Candidate list R = {u_1 : d_1, u_2 : d_2, · · · }, distance δ.
Output: A candidate list by adding δ to all distances in R.
1 R′ ← ∅;
2 for i = 1 to |R| do
3   R′ ← R′ ∪ {u_i : d_i + δ};
4 return R′;

Note that for a node q and a keyword λ, EE

(q) is an edge on

the compact tree CT(λ), and EN

(q) is a node on the original tree

T . We use an example to illustrate the entry node and entry edge.

Example 6: For the tree T shown in Fig. 3 and keyword λ, the compact tree CT(λ) is shown on the left part of Fig. 4. For ease of illustration, we also mark the nodes in CT(λ) dark on the tree T in Fig. 3. For edge (r, c) in CT(λ), h ∈ CT(λ) because h is on the path from r to c in T. p ∉ CT(λ) since p is not on the tree path of any CT(λ) edge. For node v, its entry node is EN_λ(v) = e, as e is the first node on the path (v, p, e, h, d, r) such that e ∈ CT(λ). The entry edge for v is EE_λ(v) = (r, c) since the entry node e for v is on the path from r to c in T. The entry nodes and entry edges for some other nodes in T are listed in the following table. □

Node    g       j       d       e       p       u
EN_λ    g       j       d       e       e       b
EE_λ    (r, a)  (a, k)  (r, c)  (r, c)  (r, c)  (r, b)
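
The entry node and entry edge of Definition 3 can be sketched by walking from q toward the root until a node that sticks to CT(λ) is reached. The sketch below is illustrative only: the toy tree, the names `build_owner` and `entry_node_and_edge`, and the rule that a compact-tree node is assigned to the edge leading to its parent (which matches the table above) are assumptions, and the linear walk is not the indexed lookup the paper uses.

```python
def build_owner(parent, compact_edges):
    """Map every node that sticks to CT(λ) to the compact-tree edge it lies on."""
    owner = {}
    # Pass 1: the lower endpoint and the interior nodes of each edge path.
    for u, child in compact_edges:
        node = child
        while node != u:
            owner[node] = (u, child)
            node = parent[node]
    # Pass 2: upper endpoints that are still unassigned (e.g. the root).
    for u, child in compact_edges:
        owner.setdefault(u, (u, child))
    return owner


def entry_node_and_edge(parent, owner, q):
    """Walk from q toward the root until a node sticking to CT(λ) is found."""
    node = q
    while node not in owner:
        node = parent[node]
    return node, owner[node]


# Hypothetical toy tree (not the tree of Fig. 3): r -> x -> y -> z, y -> w -> q, r -> s.
parent = {"r": None, "x": "r", "y": "x", "z": "y", "w": "y", "q": "w", "s": "r"}
# Hypothetical compact tree with keyword nodes z and s below the root r.
compact_edges = [("r", "z"), ("r", "s")]

owner = build_owner(parent, compact_edges)
print(entry_node_and_edge(parent, owner, "q"))  # ('y', ('r', 'z'))
```

Here q does not stick to the compact tree, and the first node on its path to the root that does is y, which lies on the path represented by the edge (r, z).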

The Algorithm: Given a tree T(V, E), for keyword λ, all keyword nodes are contained in CT(λ). For any node q ∈ V, the path from q to any keyword node will go through the entry node EN_λ(q). Based on this property, the result of a query Q = (q, λ, k) is identical to the result of the query Q′ = (EN_λ(q), λ, k). However, EN_λ(q) may not be on CT(λ), thus the result of Q′ is not necessarily precomputed. Let (u, u′) = EE_λ(q). Since EN_λ(q) is on the path from u to u′ on the tree T, the path from EN_λ(q) to any keyword node in T will go through either u or u′. Thus, the answer for Q′ can be constructed by merging the precomputed candidate lists cand_λ(u) and cand_λ(u′) on CT(λ).

Our algorithm for processing a query Q = (q, λ, k) on a tree T is shown in Algorithm 1. We assume that the compact tree CT(λ) for each keyword λ and the list cand_λ(u) for every node u on CT(λ) have been computed. After initializing the result R in line 1, we find the entry edge (u, u′) for q on CT(λ) (line 2). We add a distance dist(q, u) to every node in cand_λ(u) using the ⊕ operator, to reflect the distance from q to a keyword node via u. We then merge the new result into R using the ⊗_k operator (line 3). Similarly, we apply the two operators to cand_λ(u′) with the distance dist(q, u′) (line 4). We will describe the operators ⊕ and ⊗_k later. We use the following example to illustrate the algorithm.

Example 7: Given the tree T shown in Fig. 3 and CT(λ) on the left part of Fig. 4, for a query Q = (o, λ, 2), the entry edge EE_λ(o) = (r, c). Suppose the lists cand_λ(r) = {b : 1, n : 3} and cand_λ(c) = {c : 0, t : 1} are precomputed. By adding dist(o, r) = 5 to cand_λ(r), and adding dist(o, c) = 2 to cand_λ(c), we get the new lists {b : 6, n : 8} for r and {c : 2, t : 3} for c. We merge the two lists and get the final result R = {c : 2, t : 3}. □
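
The computation in Example 7 can be replayed with a minimal sketch of the query step. The function names `shift`, `merge_k` and `answer_query` are hypothetical, candidate lists are modeled as Python lists of (node, distance) pairs, and the entry edge and the two distances are passed in directly rather than looked up in an index.

```python
from heapq import merge


def shift(cand, delta):
    """R ⊕ δ: add delta to every distance of a candidate list."""
    return [(node, d + delta) for node, d in cand]


def merge_k(r1, r2, k):
    """Merge two distance-sorted lists, keeping the k nearest distinct nodes."""
    result, seen = [], set()
    for node, d in merge(r1, r2, key=lambda item: item[1]):
        if len(result) == k:
            break
        if node not in seen:
            seen.add(node)
            result.append((node, d))
    return result


def answer_query(dist_q_u, cand_u, dist_q_child, cand_child, k):
    """Combine the candidate lists of both endpoints of the entry edge (u, u')."""
    result = []
    result = merge_k(result, shift(cand_u, dist_q_u), k)
    result = merge_k(result, shift(cand_child, dist_q_child), k)
    return result


# Numbers taken from Example 7: query (o, λ, 2) with entry edge (r, c).
cand_r = [("b", 1), ("n", 3)]
cand_c = [("c", 0), ("t", 1)]
result = answer_query(5, cand_r, 2, cand_c, k=2)
print(result)  # [('c', 2), ('t', 3)]
```

Shifting first and merging second keeps both inputs sorted, so the merge can stop as soon as k distinct nodes have been collected.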

The efficiency of Algorithm 1 depends on three operations. The first operation is to find the entry edge for any node on T (line 2). The second operation is to calculate the distance of any two nodes on T, e.g., dist(q, u) and dist(q, u′) (line 3-4). The third operation is to merge two sorted lists into a new one using operators ⊕ and ⊗_k (line 3-4). Next, we discuss the three operations separately.

Algorithm 3: operator R_1 ⊗_k R_2
Input: Two sorted candidate lists R_1 = {u_1 : d_1, u_2 : d_2, ···}, R_2 = {v_1 : d′_1, v_2 : d′_2, ···}, and result size k.
Output: The merged candidate list.
    R ← ∅; i ← 1; j ← 1;
    while (i ≤ |R_1| or j ≤ |R_2|) and |R| < k do
        if i ≤ |R_1| and (j > |R_2| or d_i ≤ d′_j) then
            if u_i ∉ R then R ← R ∪ {u_i : d_i};
            i ← i + 1;
        else if j ≤ |R_2| and (i > |R_1| or d′_j ≤ d_i) then
            if v_j ∉ R then R ← R ∪ {v_j : d′_j};
            j ← j + 1;
    return R;

Finding the Entry Edge: Given a keyword λ, for any node v on a tree T(V, E), our idea of finding the entry edge EE_λ(v) of v is similar to the idea of finding the 1-NK answer using the tree Voronoi partition TVP(λ) in [22]. For the range [1, |V|], we partition it into several disjoint intervals, such that nodes with the preorder label in the same interval share an identical entry edge. We call such partition an entry edge partition for λ, denoted as EEP(λ). Given EEP(λ), EE_λ(v) can be computed easily using a binary search in EEP(λ) in O(log |V|) time. In the next subsection, we show how to build EEP(λ) for all keywords efficiently and prove that the total size of EEP(λ) for all keywords in T is bounded by O(doc|V|).
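
The binary search over an entry edge partition can be sketched as follows. The representation is an assumption: EEP(λ) is stored as a sorted list of interval start labels together with one entry edge per interval, and the labels and edge names below are invented for illustration rather than taken from Fig. 3.

```python
from bisect import bisect_right


def find_entry_edge(starts, edges, preorder_label):
    """Return the entry edge of the interval that contains preorder_label.

    starts[i] is the first preorder label of interval i; interval i ends
    just before starts[i + 1] (the last interval runs to |V|).
    """
    idx = bisect_right(starts, preorder_label) - 1
    return edges[idx]


# Hypothetical partition of the preorder range [1, 15] into four intervals.
starts = [1, 4, 9, 12]
edges = ["edge_A", "edge_B", "edge_C", "edge_D"]

print(find_entry_edge(starts, edges, 8))   # edge_B
print(find_entry_edge(starts, edges, 12))  # edge_D
```

Each lookup costs one binary search over the interval starts, which is consistent with the O(log |V|) bound stated above because the number of intervals is at most |V|.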

Computing Tree Distance: Given a tree T(V, E) with root r, suppose the distance from r to every node in T has been precomputed. For any two nodes u and v on T, we denote LCA(u, v) as their lowest common ancestor. The distance of u and v can be computed as dist(u, v) = dist(r, u) + dist(r, v) − 2 · dist(r, LCA(u, v)). Using the techniques in [2], LCA(u, v) can be found in O(1) time using O(|V|) index space. Thus dist(u, v) for any two nodes u and v on T can be computed in O(1) time using O(|V|) index space.
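
The distance formula can be checked on a small weighted tree. This is a sketch under stated assumptions: the tree and its weights are invented, and the LCA is found by a simple parent walk that takes time proportional to the depth, standing in for the O(1) technique of [2].

```python
def lca(parent, depth, u, v):
    """Lowest common ancestor by walking up (simple, not constant time)."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u


def tree_dist(parent, depth, root_dist, u, v):
    """dist(u, v) = dist(r, u) + dist(r, v) - 2 * dist(r, LCA(u, v))."""
    a = lca(parent, depth, u, v)
    return root_dist[u] + root_dist[v] - 2 * root_dist[a]


# Hypothetical tree: r-a (weight 2), r-b (weight 1), a-c (weight 3), a-d (weight 1).
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
depth = {"r": 0, "a": 1, "b": 1, "c": 2, "d": 2}
root_dist = {"r": 0, "a": 2, "b": 1, "c": 5, "d": 3}

print(tree_dist(parent, depth, root_dist, "c", "d"))  # 4
print(tree_dist(parent, depth, root_dist, "c", "b"))  # 6
```

For c and d the common ancestor is a, so the shared root-to-a segment of length 2 is subtracted twice, leaving 3 + 1 = 4.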

Merging Results: The results are merged using two operators ⊕ and ⊗_k. Algorithm 2 shows the operator ⊕, which takes a candidate list R and a distance δ as input, and outputs a candidate list by adding δ to all distances in R. The time complexity for the ⊕ operator is O(|R|). Algorithm 3 shows the operator ⊗_k, which takes two candidate lists R_1 and R_2 sorted in nondecreasing order of the distances, and a value k as input, and outputs the merged candidate list R. R contains at most k elements sorted in nondecreasing order of the distances. R can be constructed by visiting each element in R_1 and R_2 at most once. The time complexity for the ⊗_k operator is O(min{|R_1| + |R_2|, k}). The ⊗_k and ⊕ operators satisfy the commutative, associative and distributive laws as follows.

(Commutative Law) R_1 ⊗_k R_2 = R_2 ⊗_k R_1
(Associative Law) (R_1 ⊗_k R_2) ⊗_k R_3 = R_1 ⊗_k (R_2 ⊗_k R_3)
(Distributive Law) (R_1 ⊗_k R_2) ⊕ d = (R_1 ⊕ d) ⊗_k (R_2 ⊕ d).
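
The three laws can be spot-checked on small inputs with a two-pointer merge in the spirit of Algorithm 3. This is a sketch, not a proof: the names `shift` and `merge_k` and the sample lists are assumptions, indices are 0-based, and the sample distances are chosen without ties so that the output order is unambiguous.

```python
def shift(cand, delta):
    """R ⊕ δ: add delta to every distance of a candidate list."""
    return [(node, d + delta) for node, d in cand]


def merge_k(r1, r2, k):
    """Two-pointer merge of sorted lists, keeping k nearest distinct nodes."""
    result, seen = [], set()
    i = j = 0
    while (i < len(r1) or j < len(r2)) and len(result) < k:
        # Take from r1 when r2 is exhausted or r1's head is not farther.
        take_first = j >= len(r2) or (i < len(r1) and r1[i][1] <= r2[j][1])
        if take_first:
            node, d = r1[i]
            i += 1
        else:
            node, d = r2[j]
            j += 1
        if node not in seen:  # a node seen earlier already has a smaller distance
            seen.add(node)
            result.append((node, d))
    return result


# Hypothetical sorted candidate lists; node "a" appears in both r1 and r2.
r1 = [("a", 1), ("b", 4)]
r2 = [("c", 2), ("a", 5)]
r3 = [("d", 3), ("e", 6)]
k = 3

commutative = merge_k(r1, r2, k) == merge_k(r2, r1, k)
associative = merge_k(merge_k(r1, r2, k), r3, k) == merge_k(r1, merge_k(r2, r3, k), k)
distributive = shift(merge_k(r1, r2, k), 2) == merge_k(shift(r1, 2), shift(r2, 2), k)
print(commutative, associative, distributive)  # True True True
```

The distributive law is what allows Algorithm 1 to shift each precomputed list by a single distance and merge afterwards, instead of recomputing distances node by node.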

THEOREM 1. Algorithm 1 computes the exact k-NK answer for a query Q = (q, λ, k) on a tree T(V, E) in O(k + log |V|) time.

Algorithm 1 uses the novel idea of entry edge, and elegantly extends the 1-NK method [22] to handle k-NK (k > 1) with the same query time complexity, except for an extra linear cost O(k) indispensable for reporting the results.

Given the tree T, for every keyword λ, besides the compact tree CT(λ), two more indexes are needed. The first index, the entry edge partition EEP(λ), is to find the entry edge for any node on T. The second index is the candidate list cand_λ(v) for every node on CT(λ). Below we show how to construct the two indexes.

5.2 Construction of Entry Edge Partition

Given a tree T (V, E), for each keyword λ, sharing the similar

idea with the tree Voronoi partition TVP(λ), we construct an entry
