
Top-K nearest keyword search on large graphs

01 Aug 2013-Vol. 6, Iss: 10, pp 901-912
TL;DR: A shortest path tree is built for each component of a distance oracle, and a global storage technique is proposed to further reduce the index size and the query time when a k-NK result on a graph is assembled from results on trees.
Abstract: It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned with a weight measuring its length. Given a query node q in G and a keyword λ, a k-NK query seeks k nodes which contain λ and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems.The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is query time optimized for a scenario when a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.

Summary

1. INTRODUCTION

  • Many real-world networks emerging nowadays have labels or textual contents on the nodes.
  • K-NK is an important and useful query in graph search.
  • Intuitively, if two persons share some common friends, i.e., they are two hops away, they are more likely to become friends.
  • Instead the authors use distance oracle [20] as the fundamental distance estimation framework.
  • The rest of the paper is organized as follows.

2. PROBLEM DEFINITION

  • The authors model a weighted undirected graph as G(V, E), where V (G) represents the set of nodes and E(G) represents the set of edges in G.
  • The weight of a path is the total weight of all edges on the path.
  • For simplicity, the authors assume that there is only one keyword λ in the query.
  • The authors will discuss how to answer a query containing multiple keywords with AND and OR semantics.

3. EXISTING SOLUTIONS

  • Obviously, Dijkstra’s algorithm is inefficient when the size of the graph is large or the keyword nodes are far away from q.
  • In the literature, [1] and [22] design different indexing schemes to process (top-k) nearest keyword queries on a graph or a tree.
  • The authors introduce the two methods in the following two subsections.

3.1 Approximate k-NK on a Graph

  • Bahmani and Goel [1] find an approximate answer to a k-NK query in a graph based on a distance oracle [20].
  • One distance oracle is usually not enough for distance estimation in a graph G. Even for two nodes in the same partition, the estimation may have a large error.
  • The authors illustrate the algorithm using the following example.

3.2 Exact 1-NK on a Tree

  • For a certain keyword λ, all nodes with the preorder label in the interval [1, |V|] can be partitioned into several disjoint intervals, such that any node v in the same interval shares an identical NN(v, λ).
  • The partition is called tree Voronoi partition of λ, denoted as TVP(λ).
  • An extended compact tree ECT(λ) is a tree constructed by adding all change nodes into the compact tree CT(λ).
  • Using ECT(λ), TVP(λ) can be constructed easily.
  • The time to compute all compact trees and all extended compact trees for all keywords in the tree T (V, E) is bounded by O(|doc(V )| · log |V |).

4. SOLUTION OVERVIEW

  • Answering k-NK on a Graph using Tree Distance:.
  • To address the drawback of witness distance, in this paper, the authors propose to use tree distance in processing a k-NK query.
  • For the distance oracles O1 and O2 shown in Fig. 2, the corresponding shortest path trees T1 and T2 are shown in Fig. 5 (Example 5).
  • Recall that in [22], for a certain keyword λ, the range [1, |V |] is partitioned into several disjoint intervals, and nodes with the preorder label in an identical interval share the same 1-NK result.
  • The authors' first algorithm tree-boundk can only handle bounded k values, with query processing time O(k + log |Vλ|) and index size O(k̄ · |doc(V)|) for all keywords, where k̄ is an upper bound on k.

5. K-NK ON A TREE FOR A SMALL K

  • Recall that for a keyword λ, its compact tree CT(λ) keeps all the structural information of λ on the tree T .
  • The authors' idea is to precompute the top-k results for every keyword λ and every node on CT(λ).
  • Since the total size of all compact trees is bounded by O(|doc(V )|), the total space to store the top-k results of nodes on all compact trees is bounded by O(k · |doc(V )|).
  • In the following, the authors first introduce how to answer a k-NK query using CT(λ), followed by discussions on the construction of the index.

5.1 Query Processing

  • Intuitively, the entry edge plays a role of connecting the query node q to the compact tree CT(λ).
  • The authors assume that the compact tree CT(λ) for each keyword λ and the list candλ(u) for every node u on CT(λ) have been computed.
  • The efficiency of Algorithm 1 depends on three operations.
  • The results are merged using two operators ⊕ and ⊗k.
  • The first index, the entry edge partition EEP(λ), is to find the entry edge for any node on T .
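The top-k merge step can be sketched as follows; a minimal illustration in the spirit of the ⊗k operator, assuming candidate lists are (node, distance) pairs sorted by nondecreasing distance, and assuming (our reading, not stated in the summary) that a node occurring in both lists keeps its smaller distance:

```python
import heapq

def merge_top_k(list_a, list_b, k):
    """Sketch of a top-k merge in the spirit of the paper's operator
    written as (x)k: merge two distance-sorted candidate lists and
    keep the k nearest distinct nodes."""
    merged, seen = [], set()
    for node, dist in heapq.merge(list_a, list_b, key=lambda x: x[1]):
        if node in seen:      # node present in both lists:
            continue          # keep only its nearest occurrence
        seen.add(node)
        merged.append((node, dist))
        if len(merged) == k:
            break
    return merged

print(merge_top_k([("b", 2), ("n", 4)], [("k", 1), ("b", 3)], 2))
# [('k', 1), ('b', 2)]
```

Since heapq.merge consumes both sorted lists lazily, only about k elements are examined before the loop stops.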

5.2 Construction of Entry Edge Partition

  • Given a tree T(V, E), for each keyword λ, the authors construct an entry edge partition EEP(λ), sharing a similar idea with the tree Voronoi partition TVP(λ) (Algorithm 4: EEP-construct(T, CT(λ))).
  • Based on such an observation, by excluding the intervals of all edges under the subtree rooted at u′ in CT(λ) from the interval of (u, u′), nodes with preorder in the remaining intervals will use (u, u′) as the entry edge.
  • Algorithm 4 shows the construction of the entry edge partition EEP(λ) on CT(λ) for a keyword λ.
  • After initializing EEP(λ) (line 2), the main operation is a recursive procedure partition (line 3), to partition the interval [1, |V |] to several disjoint intervals.

5.3 Construction of Candidate List

  • Given a compact tree CT(λ) for a tree T and a keyword λ, the authors need to compute the candidate list candλ(v) for every node v on CT(λ).
  • Since CT(λ) keeps the structural information of all keyword nodes in T , it is sufficient to search only on CT(λ) to calculate candλ(v).
  • Based on this observation, the authors can follow the path to propagate the candidate list on u to v.
  • Using this idea, the authors just need to traverse the tree CT(λ) twice to build the candidate lists for all nodes on CT(λ).
  • The second traversal on CT(λ) is a top-down one, such that the candidate list on each node is further propagated to all its descendants.
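The two traversals can be sketched as follows. This is a hedged illustration rather than the paper's algorithm: `children`, `weight`, and `keyword_nodes` are hypothetical names, the rooted tree stands in for CT(λ), and every list is truncated to k entries in line with the O(k · |doc(V)|) space bound mentioned above.

```python
def build_candidate_lists(children, weight, root, keyword_nodes, k):
    """Two-pass sketch: a bottom-up pass collects the nearest keyword
    nodes inside each subtree; a top-down pass then propagates each
    node's list to its children so descendants also see candidates
    from outside their own subtree."""
    cand = {}

    def up(v):  # post-order: candidates within v's subtree
        lists = [[(v, 0)]] if v in keyword_nodes else [[]]
        for c in children.get(v, []):
            up(c)
            lists.append([(u, d + weight[(v, c)]) for u, d in cand[c]])
        cand[v] = sorted((p for l in lists for p in l),
                         key=lambda x: x[1])[:k]

    def down(v):  # pre-order: mix in candidates coming via the parent
        for c in children.get(v, []):
            via_parent = [(u, d + weight[(v, c)])
                          for u, d in cand[v] if u != c]
            seen, best = set(), []
            for u, d in sorted(cand[c] + via_parent, key=lambda x: x[1]):
                if u not in seen:   # keep the nearest copy of each node
                    seen.add(u)
                    best.append((u, d))
            cand[c] = best[:k]
            down(c)

    up(root)
    down(root)
    return cand
```

For a three-node tree with edges r–a (weight 1) and r–b (weight 2) and keyword node a, the list at b correctly becomes [('a', 3)] even though a lies outside b's subtree.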

6.1 A Basic Pivot Approach

  • The authors' basic idea is to compute the first segment online and precompute the results regarding the second segment offline.
  • In the query processing phase, the authors do not search the whole tree to get the answer for a query, but instead, they just need to merge the precomputed candidates along the path from the query node to the root node of the tree T .
  • The authors use the following example to illustrate the pivot based approach.
  • For every node v, the authors create a candidate list candλ(v) that contains all keyword nodes in its subtree, sorted in nondecreasing distances to v.

6.2 Pivot Approach with Tree Balancing

  • The problem is not perfectly solved using the basic pivot approach above.
  • Thus the key to optimizing both index space and query time is to reduce the average depth of nodes on the tree.
  • Furthermore, the authors need to traverse n nodes to answer a query when the query node q is at one end of the chain, leading to O(n) query time.
  • Generally speaking, DT(T) preserves all distance information for any node pair on T, and the height of DT(T) is at most log₂ |V| (Definition 5).
  • The authors will also describe how to construct DT(T ) for a tree T and how to compute all candidate lists candλ(v) for all keywords λ and all nodes v on the tree DT(T ).
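The balancing step is reminiscent of classic centroid decomposition: recursively re-root the tree at a median node whose removal leaves components of at most half the size, so the resulting tree has height at most log₂ |V|. The sketch below shows only this balancing idea, under that assumption; the paper's DT(T) additionally preserves pairwise tree distances, which this sketch omits.

```python
def centroid_decompose(adj, nodes=None):
    """Recursively root a tree at a centroid ("median") node: every
    component left after removing it has at most half of the nodes,
    so the recursion depth is at most log2 |V|. Returns a pair
    (centroid, [decompositions of the remaining components])."""
    if nodes is None:
        nodes = set(adj)
    if not nodes:
        return None

    def subtree_sizes(root):
        size, order, parent = {}, [], {root: None}
        stack = [root]
        while stack:                      # iterative DFS
            v = stack.pop()
            order.append(v)
            for u in adj[v]:
                if u in nodes and u != parent[v]:
                    parent[u] = v
                    stack.append(u)
        for v in reversed(order):         # accumulate sizes bottom-up
            size[v] = 1 + sum(size[u] for u in adj[v]
                              if u in nodes and parent.get(u) == v)
        return size, parent

    root = next(iter(nodes))
    size, parent = subtree_sizes(root)
    n, v = size[root], root
    while True:                           # walk into any too-heavy subtree
        heavy = [u for u in adj[v] if u in nodes
                 and parent.get(u) == v and size[u] > n // 2]
        if not heavy:
            break
        v = heavy[0]
    rest, comps = nodes - {v}, []
    while rest:                           # connected components around v
        comp, stack = set(), [next(iter(rest))]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(u for u in adj[x] if u in rest and u not in comp)
        rest -= comp
        comps.append(comp)
    return (v, [centroid_decompose(adj, c) for c in comps])
```

On a chain a–b–c–d–e the decomposition is rooted at the middle node c, in contrast to the O(n) depth of the original chain.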

6.3 Index Construction

  • The first index is the distance preserving balanced tree DT(T ) for T and the second index is the candidate list candλ(v) for each keyword λ and each node v on DT(T ).
  • Such a property also holds for any subtree of T ′ because it is processed using steps (1) and (2) recursively.
  • The following lemma shows that the median node always exists on any tree T , and also gives a method to find the median node of T .
  • All other nodes in DT(T ) are constructed similarly.
  • After all candidate lists are created, the authors sort the elements in every candidate list in nondecreasing order of the distances.

7. APPROXIMATE K-NK ON A GRAPH

  • The authors introduce two algorithms graph-boundk and graph-pivot for a bounded k and an arbitrary k respectively.
  • This expression can be generalized to the case of merging the candidate lists of node v on more than two trees.
  • In the following, the authors will show that the global candidate list can be used to answer k-NK queries without sacrificing the result quality.
  • Therefore, the authors show that using global storage will not sacrifice the result quality.
  • When k is small, the index time and index space for boundk are smaller than pivot on both trees and graphs.

8. EXPERIMENTS

  • The authors report the performance of their methods boundk, pivot, and their global storage implementations boundk-gs and pivot-gs, with two baseline solutions BFS and PMI.
  • The authors obtained the keywords of nodes from the OpenStreetMap project with a bounding box.
  • Global storage helps reduce the query time of boundk by 20% and that of pivot by 15%.
  • The query time shows a sharper increasing trend on DBLP than FLARN, as the frequency difference between DBLP keywords is larger.
  • The index size of pivot is 2.5 times that of PMI on DBLP and 7.9 times on FLARN, due to the larger diameter of FLARN.

10. CONCLUSIONS

  • The authors study top-k nearest keyword (k-NK) search on large graphs.
  • The authors propose two exact k-NK algorithms on trees to handle a bounded k and an arbitrary k respectively.
  • The authors extend tree based algorithms to graphs and propose a global storage technique to further reduce the index size and query time.
  • The authors conducted extensive performance studies on real large graphs to demonstrate the effectiveness and efficiency of their algorithms.


Top-K Nearest Keyword Search on Large Graphs
Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, Wentao Tian
The Chinese University of Hong Kong, Hong Kong, China
{mqiao,lqin,hcheng,yu,wttian}@se.cuhk.edu.hk
ABSTRACT
It is quite common for networks emerging nowadays to have labels
or textual contents on the nodes. On such networks, we study the
problem of top-k nearest keyword (k-NK) search. In a network G
modeled as an undirected graph, each node is attached with zero or
more keywords, and each edge is assigned with a weight measuring
its length. Given a query node q in G and a keyword λ, a k-NK
query seeks k nodes which contain λ and are nearest to q. k-NK is
not only useful as a stand-alone query but also as a building block
for tackling complex graph pattern matching problems.
The key to an accurate k-NK result is a precise shortest distance
estimation in a graph. Based on the latest distance oracle technique,
we build a shortest path tree for a distance oracle and use the tree
distance as a more accurate estimation. With such representation,
the original k-NK query on a graph can be reduced to answering
the query on a set of trees and then assembling the results obtained
from the trees. We propose two efficient algorithms to report the
exact k-NK result on a tree. One is query time optimized for a
scenario when a small number of result nodes are of interest to
users. The other handles k-NK queries for an arbitrarily large k
efficiently. In obtaining a k-NK result on a graph from that on trees,
a global storage technique is proposed to further reduce the index
size and the query time. Extensive experimental results conform
with our theoretical findings, and demonstrate the effectiveness and
efficiency of our k-NK algorithms on large real graphs.
1. INTRODUCTION
Many real-world networks emerging nowadays have labels or
textual contents on the nodes. For example in a road network, a
location may have labels such as “McDonald’s”, “hospital”, and
“kindergarten”. In a social network, a person may have informa-
tion including name, interests and skills, etc.. In a bibliographic
network, a paper may have keywords and abstract, and an author
may have name, affiliation and email address. In this study, we
consider the problem of top-k nearest keyword (k-NK) search on
large networks. In a network G modeled as an undirected graph,
each node is attached with zero or more keywords, and each edge
is assigned with a weight measuring its length. Given a query node
q in G and a keyword λ, a k-NK query in the form of Q = (q, λ, k)
looks for k nodes which contain λ and are nearest to q. Different
from a large body of research on k-nearest neighbor (k-NN) search
on spatial networks [15, 5, 6, 18, 19, 7], we define G as a general
graph without coordinates. Thus our solution can apply to a wide
range of networks.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 10
Copyright 2013 VLDB Endowment 2150-8097/13/10... $10.00.
Motivation. k-NK is an important and useful query in graph search.
As a stand-alone query, it has a wide range of applications. Further-
more, it can serve as a building block for tackling complex graph
pattern matching problems which impose both structural and tex-
tual constraints. Here we list a few applications of k-NK queries.
Consider the social network Facebook as an example, in which
personalized search based on graph structure and textual contents
has become increasingly popular¹. A person looks for 20 friends or
potential friends who like hiking to participate in a hiking activity.
Intuitively, if two persons share some common friends, i.e., they are
two hops away, they are more likely to become friends. In contrast,
if they are far away from each other in the network, they are less
likely to establish a link. Thus the problem is to find 20 persons
who like hiking and are nearest to the person who serves as the
organizer. It can be answered by a k-NK query. More generally,
we also consider a query containing multiple keywords connected
by AND or OR operators to express more complex semantics, e.g.,
a person looks for k friends or potential friends who like hiking
AND (OR) photography and are nearest to him.
Take a road network with locations associated with keywords as
another example. For parents looking for k kindergartens nearest
to their home for their children, their requirements can be expressed
by a k-NK query where the query node is the home location, and
the keyword is “kindergarten”.
In the third example, we show how k-NK queries serve as a
building block for solving the graph pattern matching problem.
Consider a couple who wants to buy a house. They have some con-
straints like having a kindergarten and a hospital within 3 km, and
a supermarket within 1 km of their home. These constraints can be
expressed as a star pattern, and the pattern matching problem can
be decomposed into three k-NK queries with keywords “kinder-
garten”, “hospital” and “supermarket” respectively and k = 1 for
each potential house location to be considered.
Recently, Bahmani and Goel [1] have designed a Partitioned
Multi-Indexing (PMI) scheme to answer k-NK queries approxi-
mately. PMI is an inverted index built based on distance oracle
[20] which is a distance estimation technique. Given a k-NK query
Q = (q, λ, k), it returns k nodes containing keyword λ in ascend-
ing order of their approximate distance from the query node q. PMI
inherits the 2 log₂ |V| − 1 approximation factor for distance
estimation from distance oracle [20], where V is the set of nodes in
the graph. The major drawback of PMI is that its distance estimation
error could be quite large in practice. This can greatly distort the
ranking of the candidate nodes carrying the query keywords, and
thus lead to a low result quality.

¹ https://www.facebook.com/about/graphsearch
In this work, we study how to answer k-NK queries accurately
and efficiently using a compact index. The key to an accurate k-NK
result is a precise shortest distance estimation in a graph. As we
use a general graph model, existing k-NN solutions on spatial net-
works [15, 5, 6, 18, 19, 7] cannot be applied, as they usually rely
on specialized structures that leverage properties of spatial data to
optimize their solutions. Instead we use distance oracle [20] as the
fundamental distance estimation framework. For each component
of a distance oracle, we will build a shortest path tree, based on
which we can estimate the shortest distance between two nodes by
their tree distance. The tree distance is more accurate than the dis-
tance estimated by distance oracle, which we call witness distance
to distinguish. As we transform a distance oracle on a graph into a
set of shortest path trees, the original k-NK query on the graph can
be reduced to answering the k-NK query on a set of trees. Thus we
first focus on processing k-NK queries to find exact top-k answers
on a tree. Then we study how to assemble the results obtained from
the trees to form the approximate top-k answers on the graph.
Contributions. Our main contributions in this work are summa-
rized as follows.
(1) Given a tree, we first consider a common scenario when users are
interested in a small number of answer nodes bounded by a small
constant k̄, i.e., k ≤ k̄. We propose the first algorithm tree-boundk
with query time O(k + log |Vλ|), where |Vλ| is the number of nodes
carrying the query keyword λ, and index size O(k̄ · |doc(V)|), where
|doc(V)| is the total number of keywords on all the nodes in the graph.
(2) Next we remove the k̄ restriction and handle k-NK queries for an
arbitrary k on a tree. We propose the second algorithm tree-pivot with
query time O(k · log |V|) and index size O(|doc(V)| · log |V|), which
is independent of k and thus more scalable.
(3) Based on our proposed tree algorithms, we present our algorithm
for approximate k-NK queries on a graph. We propose a global storage
technique to further reduce the index size and the query time. We also
show how to extend our methods to handle a query with multiple
keywords.
(4) Our experimental evaluation demonstrates the effectiveness and
efficiency of our k-NK algorithms on large real-world networks. We
show the superiority of our methods in ranking top-k answer nodes
accurately, compared with the state-of-the-art top-k keyword search
method PMI [1].
Roadmap. The rest of the paper is organized as follows. Sec-
tion 2 formally defines the problem. Section 3 discusses two ex-
isting related studies and their drawbacks. Section 4 presents our
framework. Sections 5 and 6 introduce two proposed algorithms to
answer k-NK queries on a tree for a small k and an arbitrary k re-
spectively using compact index structures. Section 7 elaborates on
the way to answer k-NK queries on a graph by approximating the
graph with a bounded number of trees. Section 8 presents exten-
sive experimental evaluation. Section 9 reviews the previous works
related to ours. Finally, Section 10 concludes the paper.
2. PROBLEM DEFINITION
We model a weighted undirected graph as G(V, E), where V (G)
represents the set of nodes and E(G) represents the set of edges in
G. We use V and E to denote V (G) and E(G) if the context is
obvious. Each edge (u, v) ∈ E has a positive weight, denoted as
weight(u, v).

[Figure 1: A Graph G with Keywords]

A path p = (v₁, v₂, · · · , vₗ) is a sequence of l nodes in V such that
for each vᵢ (1 ≤ i < l), (vᵢ, vᵢ₊₁) ∈ E. The weight of a path is the
total weight of all edges on the path. For any two nodes u ∈ V and
v ∈ V, the distance of u and v on G, dist(u, v), is the minimum
weight of all paths from u to v in G. Each node v ∈ V contains a
set of zero or more keywords which is denoted as doc(v). The union
of keywords for all nodes in G is denoted as doc(V). Note that
doc(V) is a multiset and |doc(V)| = Σ_{v∈V} |doc(v)|. We use
Vλ ⊆ V to denote the set of nodes carrying keyword λ in V.
DEFINITION 1. Given a graph G(V, E), a top-k nearest keyword
(k-NK) query is a triple Q = (q, λ, k), where q ∈ V is a query node
in G, λ is a keyword, and k is a positive integer. Given a query Q, a
node v ∈ V is a keyword node w.r.t. Q if v contains keyword λ, i.e.,
v ∈ Vλ. The result is a set of k keyword nodes, denoted as
R = {v₁, v₂, · · · , vₖ} ⊆ Vλ, such that there does not exist a node
u ∈ Vλ \ R with dist(q, u) < max_{v∈R} dist(q, v). To further
report the distance in the top-k result, we can use the form
R = {v₁ : dist(q, v₁), v₂ : dist(q, v₂), · · · , vₖ : dist(q, vₖ)}.
In this paper, we aim at answering a k-NK query Q = (q, λ, k)
on a graph G. For simplicity, we assume that there is only one
keyword λ in the query. We will discuss how to answer a query
containing multiple keywords with AND and OR semantics.
Example 1: Fig. 1 shows a graph G. Assume that the weight of each
edge is 1. For a k-NK query Q = (f, λ, 3), the keyword node set is
Vλ = {b, c, k, n, t}. The result of Q is R = {b : 2, n : 4, k : 5}
since dist(f, b) = 2, dist(f, n) = 4, and dist(f, k) = 5. □
3. EXISTING SOLUTIONS
A straightforward approach to answering a k-NK query Q =
(q, λ, k) on G is to use Dijkstra’s algorithm to search from the
query node q and output k nearest keyword nodes in nondecreasing
order of their distances to q. The time complexity is O(|E| + |V | ·
log |V |). Obviously, Dijkstra’s algorithm is inefficient when the
size of the graph is large or the keyword nodes are far away from q.
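The straightforward Dijkstra baseline can be sketched as follows (`adj` maps each node to its (neighbor, weight) pairs and `doc` maps each node to its keyword set; both names are illustrative; the stated O(|E| + |V| · log |V|) bound assumes a Fibonacci heap, while this binary-heap sketch is O(|E| log |V|)):

```python
import heapq

def knk_dijkstra(adj, doc, q, lam, k):
    """Baseline: expand nodes from q in nondecreasing distance order
    and report the first k nodes that carry the keyword lam."""
    dist, heap, result = {q: 0}, [(0, q)], []
    while heap and len(result) < k:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue                      # stale heap entry
        if lam in doc.get(v, ()):
            result.append((v, d))
        for u, w in adj.get(v, []):
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return result

# A toy graph (not Fig. 1, whose edge list is not reproduced here):
adj = {"q": [("a", 1), ("b", 2)], "a": [("q", 1), ("c", 2)],
       "b": [("q", 2)], "c": [("a", 2)]}
doc = {"b": {"x"}, "c": {"x"}}
print(knk_dijkstra(adj, doc, "q", "x", 2))  # [('b', 2), ('c', 3)]
```

The search stops as soon as k keyword nodes have been reported, but in the worst case it still settles most of the graph, which is exactly the inefficiency noted above.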
In the literature, [1] and [22] design different indexing schemes
to process (top-k) nearest keyword queries on a graph or a tree. We
introduce the two methods in the following two subsections.
3.1 Approximate k-NK on a Graph
Bahmani and Goel [1] find an approximate answer to a k-NK
query in a graph based on a distance oracle [20].
Distance Oracle: Distance oracle is a technique for estimating the
distance of two nodes in a graph [20]. Given a graph G, a distance
oracle is a Voronoi partition of V(G) determined by a set of randomly
selected center nodes. More specifically, given a number n_c, we
randomly select n_c nodes from V(G) as the center nodes to construct
a distance oracle O. Then the partition is constructed by assigning
each node v ∈ V(G) to its nearest center node, denoted as wit_O(v),
which is called the witness node of v w.r.t. O. If v is a center node,
wit_O(v) = v. For each node v ∈ V(G), the shortest distance from v
to its witness node, i.e., dist(v, wit_O(v)), is precomputed. After
constructing O, given two nodes u and v in G, if u and v are in the
same partition in O, i.e., wit_O(u) = wit_O(v), we compute the
estimated distance, called witness distance, as dist̄_O(u, v) =
dist(u, wit_O(u)) + dist(v, wit_O(v)). If u and v are not in the same
partition in O, dist̄_O(u, v) = +∞.

[Figure 2: Two Distance Oracles O₁ and O₂]
One distance oracle is usually not enough for distance estimation in
a graph G. It cannot estimate the distance of two nodes in different
partitions. Even for two nodes in the same partition, the estimation
may have a large error. Therefore, a set of r = p × log |V| distance
oracles {O₁, O₂, · · · , O_r} are constructed, where p can be
considered as a constant². The algorithm is processed in log |V|
phases. In phase i (0 ≤ i < log |V|), p distance oracles are
constructed where each distance oracle contains 2^i randomly
selected center nodes. Given r distance oracles, the distance of two
nodes u and v in G can be estimated as an upper bound dist̄(u, v) =
min_{1≤i≤r} dist̄_{O_i}(u, v).
The time complexity to compute the estimated distance dist̄(u, v)
for any two nodes u and v in a graph G is O(log |V|). The distance
oracles consume O(|V| · log |V|) space. Das Sarma et al. [20] prove
that when p = Θ(|V|^{1/log |V|}), the estimated distance can be
bounded by dist(u, v) ≤ dist̄(u, v) ≤ (2 log₂ |V| − 1) · dist(u, v)
with a high probability.
Example 2: Fig. 2 shows two distance oracles O₁ and O₂ for the
graph shown in Fig. 1. There is one center node r in O₁, and four
center nodes r, n, o and t in O₂. The distance of nodes j and s is
estimated as dist̄(j, s) = min{dist̄_{O₁}(j, s), dist̄_{O₂}(j, s)} =
min{dist(j, r) + dist(s, r), dist(j, n) + dist(s, n)} = 5. □
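The oracle construction and the combined estimate can be sketched as follows. This is a simplified illustration of the scheme in [20] with p = 1 and a rough log₂ |V| phase count; using multi-source Dijkstra to build each Voronoi partition is our implementation choice, not a detail taken from the paper.

```python
import heapq
import random

def build_oracle(adj, centers):
    """One distance oracle: multi-source Dijkstra from the centers
    assigns each node v its witness wit(v) (nearest center) and the
    distance dist(v, wit(v))."""
    wit, dist = {}, {}
    heap = [(0, c, c) for c in centers]
    while heap:
        d, v, c = heapq.heappop(heap)
        if v in wit:
            continue
        wit[v], dist[v] = c, d
        for u, w in adj[v]:
            if u not in wit:
                heapq.heappush(heap, (d + w, u, c))
    return wit, dist

def witness_distance(oracle, u, v):
    """dist(u, wit(u)) + dist(v, wit(v)) if u and v share a
    partition, +inf otherwise."""
    wit, dist = oracle
    if wit.get(u) != wit.get(v):
        return float("inf")
    return dist[u] + dist[v]

def build_oracles(adj, p=1, seed=0):
    """Phase i builds p oracles with 2^i random centers each."""
    rng = random.Random(seed)
    nodes, oracles = list(adj), []
    phases = max(1, len(nodes).bit_length() - 1)   # ~ log2 |V|
    for i in range(phases):
        for _ in range(p):
            centers = rng.sample(nodes, min(2 ** i, len(nodes)))
            oracles.append(build_oracle(adj, centers))
    return oracles

def estimate(oracles, u, v):
    """Upper-bound estimate: minimum witness distance over oracles."""
    return 0 if u == v else min(witness_distance(o, u, v) for o in oracles)
```

On the unit-weight path a–b–c, any single-center oracle yields estimate(a, c) = 2, which here happens to equal the exact distance; in general the estimate is only an upper bound.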
Answering k-NK with Distance Oracle: [1] designs a Partitioned
Multi-Indexing (PMI) scheme which uses a set of distance oracles to
answer a k-NK query in a graph. For each partition in a distance
oracle O_i, an inverted list is constructed for each keyword in the
partition. Specifically, for a partition with a center node c and a
keyword λ, the inverted list contains all nodes in the partition that
contain keyword λ, ranked in nondecreasing order of their distances
to c. Given a k-NK query Q = (q, λ, k) and a distance oracle O_i,
the algorithm first finds the partition that q belongs to in O_i. The
result w.r.t. O_i is the first k elements in the inverted list for λ in
the partition, denoted as R_{O_i} = {u₁ : dist(c, u₁) + dist(c, q),
u₂ : dist(c, u₂) + dist(c, q), · · · , uₖ : dist(c, uₖ) + dist(c, q)}.
The final result R is computed by merging the nodes in each R_{O_i}
and maintaining k nodes with the shortest distances to q. The query
time complexity is O(k · log |V|). We illustrate the algorithm using
the following example.
Example 3: Consider the graph in Fig. 1 and two distance oracles in
Fig. 2. For keyword λ, the inverted list for the partition centered at
node r in O₁ has 5 elements {b : 1, n : 3, k : 4, c : 5, t : 6}. The
inverted list for the partition centered at node o in O₂ has 1 element
{k : 2}. Given a k-NK query Q = (m, λ, 2), from O₁, we can get a
result R_{O₁} = {b : 1 + dist(r, m), n : 3 + dist(r, m)} =
{b : 5, n : 7}, and from O₂, we can get a result R_{O₂} =
{k : 2 + dist(o, m)} = {k : 3}. By merging R_{O₁} and R_{O₂},
the final answer is R = {k : 3, b : 5}. The exact answer is
R = {c : 1, k : 1} according to Fig. 1. □
Limitation: Although in theory, the witness distance used by [1] can
be bounded by a factor of 2 log₂ |V| − 1 of the exact distance with a
high probability, in practice, however, we find the distance

² In [20], the set {O₁, O₂, · · · , O_r} is defined as a distance oracle.
[Figure 3: A Tree T with Preorder and Interval on Each Node]

[Figure 4: CT(λ), ECT(λ) and TVP(λ) for Keyword λ; TVP(λ) maps the
intervals [1,2], 3, [4,5], 6, [7,10], [11,16], 17, 18, [19,20] to the
results b, n, k, n, b, c, t, c, b respectively]
estimation error can be quite large. For example, for the graph G in
Fig. 1 and two distance oracles O₁ and O₂ in Fig. 2, for two nodes s
and v, the witness distance in O₁ is dist̄_{O₁}(s, v) = dist(s, r) +
dist(v, r) = 10, and that in O₂ is dist̄_{O₂}(s, v) = dist(s, n) +
dist(v, n) = 6. However, the exact distance is dist(s, v) = 2 in G,
which is much smaller than both dist̄_{O₁}(s, v) and dist̄_{O₂}(s, v).
The inaccurate distance estimation can greatly distort the ranking of
the nodes carrying the query keyword, and thus lead to a low result
quality, as illustrated in Example 3.
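For concreteness, the PMI index and query can be sketched as follows, encoding an oracle as a pair of dicts (each node's witness center and witness distance; an illustrative representation, not the paper's data layout). The usage below plugs in the witness values of Example 3 and reproduces its approximate answer R = {k : 3, b : 5}:

```python
from collections import defaultdict

def build_pmi(oracle, doc):
    """For each (partition center, keyword) pair, an inverted list of
    the partition's keyword nodes sorted by distance to the center."""
    wit, dist = oracle
    lists = defaultdict(list)
    for v in wit:
        for lam in doc.get(v, ()):
            lists[(wit[v], lam)].append((dist[v], v))
    for key in lists:
        lists[key].sort()
    return lists

def pmi_query(oracles, pmis, q, lam, k):
    """Take the first k entries of the relevant inverted list per
    oracle, shift each by dist(q, center), and keep the k smallest."""
    candidates = []
    for (wit, dist), lists in zip(oracles, pmis):
        for d, v in lists.get((wit.get(q), lam), [])[:k]:
            candidates.append((d + dist[q], v))
    best = {}
    for d, v in sorted(candidates):
        best.setdefault(v, d)             # nearest copy of each node
    return sorted(best.items(), key=lambda x: x[1])[:k]

# Witness values read off Example 3 (O1 centered at r, O2 at o);
# "L" stands in for the keyword written as lambda in the paper:
o1 = ({v: "r" for v in "bnkctm"},
      {"b": 1, "n": 3, "k": 4, "c": 5, "t": 6, "m": 4})
o2 = ({"k": "o", "m": "o"}, {"k": 2, "m": 1})
doc = {v: {"L"} for v in "bcknt"}
pmis = [build_pmi(o, doc) for o in (o1, o2)]
print(pmi_query([o1, o2], pmis, "m", "L", 2))  # [('k', 3), ('b', 5)]
```

Note how the approximate answer differs from the exact R = {c : 1, k : 1}: the witness distances through r and o distort the ranking, which is precisely the limitation discussed above.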
3.2 Exact 1-NK on a Tree
Tao et al. [22] compute the exact answer to a 1-NK query on a
tree T (V, E). Given a query Q = (q, λ, 1), the result is the nearest
node in T that contains keyword λ, denoted as NN(q, λ). The ba-
sic idea is as follows. We label a node v with the sequence number
of v in the preorder traversal of T . For a certain keyword λ, all
nodes with the preorder label in the interval [1, |V |] can be parti-
tioned into several disjointed intervals, such that any node v in the
same interval shares an identical NN(v, λ). The partition is called
tree Voronoi partition of λ, denoted as TVP(λ). By precomputing
TVP(λ) for all keywords λ on the tree, a query Q = (q, λ, 1) can
be answered in O (log |V
λ
|) time using a binary search in TVP(λ).
In order to compute TVP(λ) for all keywords λ in T efficiently,
two new data structures, namely, Compact Tree CT(λ) and Ex-
tended Compact Tree ECT(λ), are proposed in [22].
DEFINITION 2. (Compact Tree and Extended Compact Tree)
For a tree T and a keyword λ, a compact tree CT(λ) is a tree that
keeps only two types of nodes in T : a keyword node that contains
keyword λ, and a node that has at least two direct subtrees contain-
ing nodes carrying keyword λ. In the preorder traversal of T , for
two successive nodes u and v, if NN(u, λ) 6= NN(v, λ), v is called
a change node. An extended compact tree ECT(λ) is a tree con-
structed by adding all change nodes into the compact tree CT(λ).
Using ECT(λ), TVP(λ) can be constructed easily. In [22],
the authors prove that the total size of all compact trees and all
extended compact trees for all keywords in the tree T (V, E) is
bounded by O(|doc(V )|). The time to compute all compact trees
and all extended compact trees for all keywords in the tree T (V, E)
is bounded by O(|doc(V )| · log |V |).
Example 4: Fig. 3 shows a tree with the preorder label from 1 to 20
on its nodes. For keyword λ, there are 5 keyword nodes b, c, k, n, t.
For node s, NN(s, λ) = c. The compact tree of λ, CT(λ), is shown
on the left part of Fig. 4. Node r is in CT(λ) because r has three
direct subtrees with nodes carrying keyword λ. e is not in CT(λ)
because e is not a keyword node and e has only one direct subtree
rooted at m with nodes carrying keyword λ. The extended compact
tree of λ, ECT(λ), is shown in the middle part of Fig. 4 with the
903

preorder label marked beside each node. Node e is in ECT(λ),
because for its parent node h, NN(h, λ) = b 6= NN(e, λ) = c.
The tree Voronoi partition of λ, TVP(λ), is shown on the right part
of Fig. 4. For node s with preorder label 14, it is in the interval
[11, 16], thus NN(s, λ) = c as listed in TVP(λ). □
4. SOLUTION OVERVIEW
Answering k-NK on a Graph using Tree Distance: To address
the drawback of witness distance, in this paper, we propose to use
tree distance in processing a k-NK query. We observe that for a
partition of a distance oracle, we can construct a shortest path tree
rooted at the center node of the partition. Since a tree contains more
structural information than a star, using tree distance will be more
accurate than using witness distance for estimating the distance of
two nodes. For a distance oracle O_i, let the set of trees constructed
in O_i be T_i. T_i can be considered as a tree by adding a virtual
root and several virtual edges with weight +∞ that connect the
new virtual root to every root node in T_i respectively. Let the k-NK
result on tree T be R_T. Suppose we have an algorithm to compute
R_T on a tree T, we can solve the k-NK problem in a graph by
merging R_Ti for each tree T_i, 1 ≤ i ≤ r. Obviously, such a result
will be more accurate than the result by [1]. The following example
illustrates the k-NK query processing based on tree distance.
Example 5: For the distance oracles O_1 and O_2 shown in Fig. 2,
the corresponding shortest path trees T_1 and T_2 are shown in Fig. 5.
For T_1, there is only 1 tree rooted at r because there is only 1
partition in O_1. For T_2, there are 4 trees rooted at nodes n, o, r, t
respectively, because there are 4 partitions in O_2. In each tree, the
path from any node to the root node is a shortest path in the original
graph. For two nodes s and v, their tree distance is 2 in both T_1 and
T_2, the same as the exact distance dist(s, v) in G. For a k-NK query
Q = (m, λ, 2), we have R_T1 = {c : 1, t : 2}, and R_T2 = {k : 1}.
By merging R_T1 and R_T2, we get R = {c : 1, k : 1}. Such a result
is much better than the result in Example 3 computed using witness
distance for the same query. □
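The assembly step above — merging per-tree answer lists into one top-k list — can be sketched as follows (a minimal Python illustration with names of our own choosing; the input lists follow Example 5):

```python
import heapq

def merge_tree_results(tree_results, k):
    """Merge k-NK result lists obtained on several shortest path trees.

    Each list holds (node, distance) pairs sorted by distance; a node may
    appear in several lists, in which case its smallest distance wins.
    """
    best = {}
    for result in tree_results:
        for node, dist in result:
            if node not in best or dist < best[node]:
                best[node] = dist
    # Keep the k nodes with the smallest estimated distances.
    return heapq.nsmallest(k, best.items(), key=lambda item: item[1])

# Example 5: R_T1 = {c:1, t:2}, R_T2 = {k:1}; merged top-2 is {c:1, k:1}.
R = merge_tree_results([[("c", 1), ("t", 2)], [("k", 1)]], 2)
print(R)  # [('c', 1), ('k', 1)]
```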
With the tree distance formulation, the key operation in answer-
ing a k-NK query on a graph is to answer the k-NK query on a tree.
Therefore, we start with processing a k-NK query on a tree.
Answering k-NK on a Tree: We show that it is nontrivial to answer
a k-NK query on a tree efficiently even if k is bounded. Our first
attempt is to extend the existing 1-NK solution on a tree T (V, E)
in [22]. Recall that in [22], for a certain keyword λ, the range
[1, |V |] is partitioned into several disjoint intervals, and nodes with
the preorder label in an identical interval share the same 1-NK result.
When k ≥ 2, each interval needs to be further partitioned to
ensure that all nodes with the preorder label in the same interval
share an identical k-NK result. The number of intervals increases
exponentially w.r.t. the number of keyword nodes on the tree until
it reaches |V | for a keyword λ. Clearly, using such an approach,
the index size is too large in practice even for a small k. Our second
attempt is that, for each node v on the tree T(V, E) and each keyword
λ, we precompute its k̄ nearest nodes that contain λ. When
processing a query Q = (q, λ, k) with k ≤ k̄, we can simply retrieve
the precomputed result on node q and output the first k nodes
directly. Such an approach is impractical because for each keyword
λ, we need O(k̄ · |V|) space to store the precomputed results.
In the following, we first introduce two algorithms for answering
exact k-NK on a tree T(V, E). Our first algorithm tree-boundk can
only handle bounded k values with query processing time O(k +
log |V_λ|) and index size O(k̄ · |doc(V)|) for all keywords, where k̄
is an upper bound value of k. Our second algorithm tree-pivot can
handle an arbitrary k with query processing time O(k · log |V|)
[Figure 5: Shortest Path Trees T_1 and T_2]
Algorithm 1: tree-boundk(Q, T)
Input: A k-NK query Q = (q, λ, k), and a tree T.
Output: Answer for Q on T.
1: R ← ∅;
2: (u, u′) ← the entry edge of q on CT(λ);
3: R ← R ⊕_k (cand_λ(u) ⊕ dist(q, u));
4: R ← R ⊕_k (cand_λ(u′) ⊕ dist(q, u′));
5: return R;
and index size O(|doc(V )| · log |V |) for all keywords which is
independent of k. We then show our algorithm for approximate
k-NK on a graph by merging results on a bounded number of trees.
We propose a global storage technique to further reduce the index
size and the query time on a graph. Finally we show how to extend
our method to handle a query with multiple keywords.
5. K-NK ON A TREE FOR A SMALL K
In this section, we study how to answer a k-NK query Q =
(q, λ, k) on a tree T(V, E). We first consider a common scenario
when users are interested in a small number of answer nodes
bounded by a small constant k̄, i.e., k ≤ k̄. Recall that for a keyword
λ, its compact tree CT(λ) keeps all the structural information
of λ on the tree T. Our idea is to precompute the top-k̄ results for
every keyword λ and every node on CT(λ). Since the total size
of all compact trees is bounded by O(|doc(V)|), the total space to
store the top-k̄ results of nodes on all compact trees is bounded by
O(k̄ · |doc(V)|). Given a query Q = (q, λ, k), if q is on CT(λ),
we can simply report the precomputed answer on CT(λ). If q is
not on CT(λ), we need to find a way to construct the answer using
the precomputed results as well as the structure of CT(λ) and T . In
the following, we first introduce how to answer a k-NK query using
CT(λ), followed by discussions on the construction of the index.
5.1 Query Processing
For a keyword λ, and each node v in the compact tree CT(λ),
we use a candidate list cand_λ(v) to denote the precomputed k-NK
results for k = k̄ on node v, ranked in nondecreasing order of their
distances to v, in the form of cand_λ(v) = {v_1 : dist(v, v_1), v_2 :
dist(v, v_2), · · · , v_k̄ : dist(v, v_k̄)} where dist(v, v_1) ≤ dist(v, v_2)
≤ · · · ≤ dist(v, v_k̄). Given a query Q = (q, λ, k) on a tree T(V, E)
where k ≤ k̄, if q is in CT(λ), we can simply report the first k elements
in cand_λ(q) as the answer. The difficult case is when q is not
in CT(λ). In order to answer such a query, we define an entry edge
to be the edge in CT(λ) that is nearest to q. Intuitively, the entry
edge plays a role of connecting the query node q to the compact
tree CT(λ). The formal definition of entry edge is as follows.
DEFINITION 3. (Entry Node and Entry Edge) Given a compact
tree CT(λ), for each edge (u, u′) on CT(λ) with u′ being a
child node of u, (u, u′) represents a unique path from u to u′ on
the original tree T. For any node v on T, we say v sticks to CT(λ),
denoted as v ∈_s CT(λ), if and only if there exists an edge (u, u′)
on CT(λ) such that v is on the path from u to u′ on T; otherwise
v does not stick to CT(λ), denoted as v ∉_s CT(λ). For a node q
on T, let v be the first node on the path from q to the root node of
T such that v ∈_s CT(λ). v is called the Entry Node of q w.r.t. λ,
Algorithm 2: operator R ⊕ δ
Input: Candidate list R = {u_1 : d_u1, u_2 : d_u2, · · · }, distance δ.
Output: A candidate list by adding δ to all distances in R.
1: R′ ← ∅;
2: for i = 1 to |R| do
3:   R′ ← R′ ∪ {u_i : d_ui + δ};
4: return R′;
denoted as EN_λ(q). The corresponding edge (u, u′) on CT(λ) is
called the Entry Edge of q w.r.t. λ, denoted as EE_λ(q).
Note that for a node q and a keyword λ, EE_λ(q) is an edge on
the compact tree CT(λ), and EN_λ(q) is a node on the original tree
T. We use an example to illustrate the entry node and entry edge.
Example 6: For the tree T shown in Fig. 3 and keyword λ, the
compact tree CT(λ) is shown on the left part of Fig. 4. For ease of
illustration, we also mark the nodes in CT(λ) dark on the tree T in
Fig. 3. For edge (r, c) in CT(λ), h ∈_s CT(λ) because h is on the
path from r to c in T. p ∉_s CT(λ) since p is not on the tree path
of any CT(λ) edge. For node v, its entry node is EN_λ(v) = e, as e
is the first node on the path (v, p, e, h, d, r) such that e ∈_s CT(λ).
The entry edge for v is EE_λ(v) = (r, c) since the entry node e for
v is on the path from r to c in T. The entry nodes and entry edges
for some other nodes in T are listed in the following table. □

Node   g       j       d       e       p       u
EN_λ   g       j       d       e       e       b
EE_λ   (r, a)  (a, k)  (r, c)  (r, c)  (r, c)  (r, b)
The Algorithm: Given a tree T(V, E), for keyword λ, all keyword
nodes are contained in CT(λ). For any node q ∈ V, the path from
q to any keyword node will go through the entry node EN_λ(q).
Based on such property, the result of a query Q = (q, λ, k) is identical
with the result of the query Q′ = (EN_λ(q), λ, k). However,
EN_λ(q) may not be on CT(λ), thus the result of Q′ is not necessarily
precomputed. Let (u, u′) = EE_λ(q); since EN_λ(q) is on the
path from u to u′ on the tree T, the path from EN_λ(q) to any keyword
node in T will go through either u or u′. Thus, the answer for
Q′ can be constructed by merging the precomputed candidate lists
cand_λ(u) and cand_λ(u′) on CT(λ).
Our algorithm for processing a query Q = (q, λ, k) on a tree T is
shown in Algorithm 1. We assume that the compact tree CT(λ) for
each keyword λ and the list cand_λ(u) for every node u on CT(λ)
have been computed. After initializing the result R in line 1, we
find the entry edge (u, u′) for q on CT(λ) (line 2). We add a distance
dist(q, u) to every node in cand_λ(u) using the ⊕ operator, to
reflect the distance from q to a keyword node via u. We then merge
the new result into R using the ⊕_k operator (line 3). Similarly we
apply the two operators to cand_λ(u′) with the distance dist(q, u′)
(line 4). We will describe the ⊕ and ⊕_k operators later. We use the
following example to illustrate the algorithm.
Example 7: Given the tree T shown in Fig. 3 and CT(λ) on the left
part of Fig. 4, for a query Q = (o, λ, 2), the entry edge EE_λ(o) =
(r, c). Suppose the lists cand_λ(r) = {b : 1, n : 3} and cand_λ(c) =
{c : 0, t : 1} are precomputed. By adding dist(o, r) = 5 to
cand_λ(r), and adding dist(o, c) = 2 to cand_λ(c), we get the new
lists {b : 6, n : 8} for r and {c : 2, t : 3} for c. We merge the two
lists and get the final result R = {c : 2, t : 3}. □
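Assuming the entry edge, the two candidate lists, and the tree distances are already available, the query step of Algorithm 1 can be sketched as below (a hedged illustration with names of our own choosing, not the paper's implementation; for brevity it merges the two shifted lists by sorting rather than with the two-pointer merge of Algorithm 3):

```python
def tree_boundk_query(cand_u, cand_u2, dist_q_u, dist_q_u2, k):
    """Answer a k-NK query from the entry edge (u, u') of q.

    cand_u / cand_u2 are precomputed candidate lists of (node, distance)
    pairs; dist_q_u / dist_q_u2 are tree distances from q to u and u'.
    """
    # Shift each candidate list by the distance from q to its endpoint
    # (the "add distance" operator), then combine the two lists.
    shifted = [(v, d + dist_q_u) for v, d in cand_u]
    shifted += [(v, d + dist_q_u2) for v, d in cand_u2]
    best = {}
    for v, d in shifted:  # keep the smaller distance per node
        if v not in best or d < best[v]:
            best[v] = d
    return sorted(best.items(), key=lambda item: item[1])[:k]

# Example 7: q = o, entry edge (r, c), dist(o, r) = 5, dist(o, c) = 2
R = tree_boundk_query([("b", 1), ("n", 3)], [("c", 0), ("t", 1)], 5, 2, 2)
print(R)  # [('c', 2), ('t', 3)]
```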
The efficiency of Algorithm 1 depends on three operations. The
first operation is to find the entry edge for any node on T (line 2).
The second operation is to calculate the distance of any two nodes
on T, e.g., dist(q, u) and dist(q, u′) (lines 3-4). The third operation
is to merge two sorted lists into a new one using operators ⊕ and
⊕_k (lines 3-4). Next, we discuss the three operations separately.
Algorithm 3: operator R_1 ⊕_k R_2
Input: Two sorted candidate lists R_1 = {u_1 : d_u1, u_2 : d_u2, · · · }
       and R_2 = {v_1 : d_v1, v_2 : d_v2, · · · }, and result size k.
Output: The merged candidate list.
1: R ← ∅; i ← 1; j ← 1;
2: while (i ≤ |R_1| or j ≤ |R_2|) and |R| < k do
3:   if i ≤ |R_1| and (d_ui ≤ d_vj or j > |R_2|) then
4:     if u_i ∉ R then R ← R ∪ {u_i : d_ui};
5:     i ← i + 1;
6:   else if j ≤ |R_2| and (d_vj ≤ d_ui or i > |R_1|) then
7:     if v_j ∉ R then R ← R ∪ {v_j : d_vj};
8:     j ← j + 1;
9: return R;
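Algorithm 3 translates almost line for line into executable form; the sketch below (with our own tuple-based representation of candidate lists) makes the duplicate check explicit with a set:

```python
def merge_k(r1, r2, k):
    """Two-pointer merge of two sorted candidate lists, keeping at most
    k distinct nodes with the smallest distances (Algorithm 3)."""
    out, seen = [], set()
    i, j = 0, 0
    while (i < len(r1) or j < len(r2)) and len(out) < k:
        # Take from r1 when it is non-exhausted and no larger than r2's head.
        if i < len(r1) and (j >= len(r2) or r1[i][1] <= r2[j][1]):
            node, dist = r1[i]
            i += 1
        else:
            node, dist = r2[j]
            j += 1
        if node not in seen:  # skip duplicate nodes, keep first (smallest)
            seen.add(node)
            out.append((node, dist))
    return out

print(merge_k([("b", 6), ("n", 8)], [("c", 2), ("t", 3)], 2))
# [('c', 2), ('t', 3)]
```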
Finding the Entry Edge: Given a keyword λ, for any node v on a
tree T(V, E), our idea of finding the entry edge EE_λ(v) of v is similar
to the idea of finding the 1-NK answer using the tree Voronoi
partition TVP(λ) in [22]. For the range [1, |V|], we partition it
into several disjoint intervals, such that nodes with the preorder label
in the same interval share an identical entry edge. We call such
a partition an entry edge partition for λ, denoted as EEP(λ). Given
EEP(λ), EE_λ(v) can be computed easily using a binary search in
EEP(λ) in O(log |V_λ|) time. In the next subsection, we show how
to build EEP(λ) for all keywords efficiently and prove that the total
size of EEP(λ) for all keywords in T is bounded by O(|doc(V)|).
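The lookup over EEP(λ) is a plain binary search over interval start points. A sketch, where the partition contents are hypothetical rather than taken from the running example:

```python
import bisect

def lookup_entry_edge(eep, preorder):
    """Find the entry edge for a node given its preorder label.

    eep is a sorted list of (start, entry_edge) pairs: nodes whose
    preorder label falls at or after `start` (and before the next start)
    share that entry edge.
    """
    starts = [start for start, _ in eep]
    idx = bisect.bisect_right(starts, preorder) - 1
    return eep[idx][1]

# Hypothetical EEP(λ): labels 1-6 map to edge (r, a), 7-20 to edge (r, c).
eep = [(1, ("r", "a")), (7, ("r", "c"))]
print(lookup_entry_edge(eep, 14))  # ('r', 'c')
```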
Computing Tree Distance: Given a tree T(V, E) with root r, suppose
the distance from r to every node in T has been precomputed.
For any two nodes u and v on T, we denote LCA(u, v) as their lowest
common ancestor. The distance of u and v can be computed as
dist(u, v) = dist(r, u) + dist(r, v) − 2 · dist(r, LCA(u, v)). Using
the techniques in [2], LCA(u, v) can be found in O(1) time using
O(|V|) index space. Thus dist(u, v) for any two nodes u and v on
T can be computed in O(1) time using O(|V|) index space.
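The distance formula is easy to exercise with a naive O(depth) LCA; the constant-time index of [2] would replace the `lca` helper below (the tiny example tree and all names are ours):

```python
def lca(parent, depth, u, v):
    # Naive lowest common ancestor: lift the deeper node first,
    # then lift both until they meet.
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u

def tree_dist(parent, depth, dist_r, u, v):
    # dist(u, v) = dist(r, u) + dist(r, v) - 2 * dist(r, LCA(u, v))
    w = lca(parent, depth, u, v)
    return dist_r[u] + dist_r[v] - 2 * dist_r[w]

# Tiny example tree: r - a (weight 1), a - b (weight 2), r - c (weight 4)
parent = {"r": None, "a": "r", "b": "a", "c": "r"}
depth = {"r": 0, "a": 1, "b": 2, "c": 1}
dist_r = {"r": 0, "a": 1, "b": 3, "c": 4}
print(tree_dist(parent, depth, dist_r, "b", "c"))  # 7
```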
Merging Results: The results are merged using two operators ⊕
and ⊕_k. Algorithm 2 shows the operator ⊕, which takes a candidate
list R and a distance δ as input, and outputs a candidate list by
adding δ to all distances in R. The time complexity for the ⊕ operator
is O(|R|). Algorithm 3 shows the operator ⊕_k, which takes
two candidate lists R_1 and R_2 sorted in nondecreasing order of the
distances, and a value k as input, and outputs the merged candidate
list R. R contains at most k elements sorted in nondecreasing order
of the distances. R can be constructed by visiting each element in
R_1 and R_2 at most once. The time complexity for the ⊕_k operator
is O(min{|R_1| + |R_2|, k}). The ⊕_k and ⊕ operators satisfy the
commutative, associative and distributive laws as follows.
(Commutative Law) R_1 ⊕_k R_2 = R_2 ⊕_k R_1.
(Associative Law) (R_1 ⊕_k R_2) ⊕_k R_3 = R_1 ⊕_k (R_2 ⊕_k R_3).
(Distributive Law) (R_1 ⊕_k R_2) ⊕ d = (R_1 ⊕ d) ⊕_k (R_2 ⊕ d).
THEOREM 1. Algorithm 1 computes the exact k-NK answer for
a query Q = (q, λ, k) on a tree T(V, E) in O(k + log |V_λ|) time.
Algorithm 1 uses the novel idea of entry edge, and elegantly
extends the 1-NK method [22] to handle k-NK (k > 1) with the same
query time complexity, except for an extra linear cost O(k) indispensable
for reporting the results.
Given the tree T , for every keyword λ, besides the compact tree
CT(λ), two more indexes are needed. The first index, the entry
edge partition EEP(λ), is to find the entry edge for any node on T .
The second index is the candidate list cand_λ(v) for every node on
CT(λ). Below we show how to construct the two indexes.
5.2 Construction of Entry Edge Partition
Given a tree T (V, E), for each keyword λ, sharing the similar
idea with the tree Voronoi partition TVP(λ), we construct an entry
Citations
More filters
Proceedings ArticleDOI
Lamia Falaki1
04 Jul 2022
TL;DR: In this article , the authors present an easy way to explore structured e-commerce data for business users that eliminate the dependency to predefined forms by using machine learning to rank more relevant answers ahead of less relevant ones.
Abstract: Web search engines such as Google and Bing provide an easy and convenient way to find web pages that contain input keywords. This provides a user-friendly interface for non-technical users to explore the Web and find relevant data among thousands of Web pages. While numerous advancement has been made to store e-commerce data in the cloud, we have not seen great advancement in terms of search over such data. E-commerce data is usually stored as structured data in relational and graph databases. Thus, an answer to a query keyword is composed of different pieces of data stitched together. As of now, the main method to find answers over this structured data is through predefined search forms. However, these search forms are limited, and developing a new search form is time consuming and expensive. In this work, we present an easy way to explore structured e-commerce data for business users that eliminate the dependency to predefined forms. The new search system is similar to Google, in which the interface is essentially a text box, and non-technical business users enter input keywords into the system. The output is a portion of the data, that covers the input keywords. We propose a new ranking strategy based on machine learning to rank more relevant answers ahead of less relevant ones. Our experiments show this ranking strategy is successful in returning relevant answers.
Book ChapterDOI
24 May 2018
TL;DR: This paper proposes a query that can process graph traversal and text search in combination in a graph database system and rank users measured as a combination of their social distance and the relevance of the text description to the query keyword.
Abstract: Graph database systems are increasingly being used to store and query large-scale property graphs with complex relationships. Graph data, particularly the ones generated from social networks generally has text associated to the graph. Although graph systems provide support for efficient graph-based queries, there have not been comprehensive studies on how other dimensions, such as text, stored within a graph can work well together with graph traversals. In this paper we focus on a query that can process graph traversal and text search in combination in a graph database system and rank users measured as a combination of their social distance and the relevance of the text description to the query keyword. Our proposed algorithm leverages graph partitioning techniques to speed-up query processing along both dimensions. We conduct experiments on real-world large graph datasets and show benefits of our algorithm compared to several other baseline schemes.
01 Dec 2019
TL;DR: The proposed algorithm in this paper so-called RSL-Cluster performs the clustering by hierarchically removing the edge between nodes which has a weight lower that the average similarity of nodes until reaching the user’s desired number of clusters.
Abstract: موجودیت‌ها در شبکه‌های اجتماعی علاوه بر داشتن ارتباط با یکدیگر، دارای محتوا نیز هستند. این مدل از شبکه‌ها می‌توانند بر روی گراف‌هایی که گره‌های آن شامل متن هستند، مدل شوند. خوشه‌بندی گراف ازجمله مهم‌ترین کارهای تحلیلی شبکه اجتماعی است. باوجوداین دو جنبه، اغلب روش‌های خوشه‌بندی تنها یکی از جنبه‌های ساختاری یا محتوایی گراف را در نظر می‌گیرند. الگوریتم‌های خوشه‌بندی ساختاری-محتوایی، گراف را از هر دو جنبه ساختار و محتوا به‌صورت هم‌زمان در نظر می‌گیرند. هدف این مقاله رسیدن به خوشه‌هایی با ساختار درونی منسجم (ساختاری) و مقادیر ویژگی (محتوایی) همگن در گراف است. الگوریتم ارائه شده در این مقاله RLS-Cluster نام داشته که به‌صورت سلسله مراتبی با حذف یال با کمترین میانگین شباهت میان گره‌های محله آن یال، عمل خوشه‌بندی را انجام می‌دهد. در این روش برای هر یال میانگین شباهت محله محاسبه شده و به‌عنوان وزن آن یال در نظر گرفته می‌شود. یال‌هایی که دارای کم‌ترین وزن هستند حذف می‌شوند. این مرحله تا زمانی که به تعداد خوشه موردنظر کاربر برسد، ادامه میابد. مقایسه الگوریتم مطرح‌شده با سه الگوریتم خوشه‌بندی ساختاری-محتوایی ارائه شده تاکنون، بر اساس معیارهای مختلف سنجش کیفیت خوشه، بیانگر عملکرد مناسب روش ارائه شده است. این معیارها شامل معیارهای ساختاری، محتوایی و ساختاری-محتوایی هستند.
Proceedings ArticleDOI
10 Dec 2020
TL;DR: In this article, a subgraph sketch signature based approach is proposed to prune the subgraphs that do not fit all the intended entities in a given knowledge graph, and a novel index is devised to finally perform efficient entity retrieval over the whole knowledge graph.
Abstract: Querying and extracting potentially a large number of entities that are the user’s intention is a challenging problem for knowledge graphs. The conventional query mechanism of subgraph pattern matching would not work well as the user in general does not know the graph pattern to search for. Moreover, there may not be a single subgraph pattern that fits all the intended entities. Using keywords also may not be a viable approach, as it is very difficult to come up with the right set of keywords, and the results are often very diverse and overwhelming. We propose a novel approach that does not require users to know much about the knowledge graph but only simple keywords about the desired entities. We retrieve a sample of matches and learn the entity context patterns as what we call the subgraph sketch signatures. We provide clustered patterns for the user to prune. Moreover, we devise a novel index to finally perform efficient entity retrieval over the whole knowledge graph.
Proceedings ArticleDOI
04 Jul 2022
TL;DR: This work presents an easy way to explore structured e-commerce data for business users that eliminate the dependency to predefined search forms and proposes a new ranking strategy based on machine learning to rank more relevant answers ahead of less relevant ones.
Abstract: Web search engines such as Google and Bing provide an easy and convenient way to find web pages that contain input keywords. This provides a user-friendly interface for non-technical users to explore the Web and find relevant data among thousands of Web pages. While numerous advancement has been made to store e-commerce data in the cloud, we have not seen great advancement in terms of search over such data. E-commerce data is usually stored as structured data in relational and graph databases. Thus, an answer to a query keyword is composed of different pieces of data stitched together. As of now, the main method to find answers over this structured data is through predefined search forms. However, these search forms are limited, and developing a new search form is time consuming and expensive. In this work, we present an easy way to explore structured e-commerce data for business users that eliminate the dependency to predefined forms. The new search system is similar to Google, in which the interface is essentially a text box, and non-technical business users enter input keywords into the system. The output is a portion of the data, that covers the input keywords. We propose a new ranking strategy based on machine learning to rank more relevant answers ahead of less relevant ones. Our experiments show this ranking strategy is successful in returning relevant answers.
References
More filters
Book
05 Sep 2011
TL;DR: The present article is a commencement at attempting to remedy this deficiency of scientific correlation, and the meaning and working of the various formulæ have been explained sufficiently, it is hoped, to render them readily usable even by those whose knowledge of mathematics is elementary.
Abstract: All knowledge—beyond that of bare isolated occurrence—deals with uniformities. Of the latter, some few have a claim to be considered absolute, such as mathematical implications and mechanical laws. But the vast majority are only partial; medicine does not teach that smallpox is inevitably escaped by vaccination, but that it is so generally; biology has not shown that all animals require organic food, but that nearly all do so; in daily life, a dark sky is no proof that it will rain, but merely a warning; even in morality, the sole categorical imperative alleged by Kant was the sinfulness of telling a lie, and few thinkers since have admitted so much as this to be valid universally. In psychology, more perhaps than in any other science, it is hard to find absolutely inflexible coincidences; occasionally, indeed, there appear uniformities sufficiently regular to be practically treated as laws, but infinitely the greater part of the observations hitherto recorded concern only more or less pronounced tendencies of one event or attribute to accompany another. Under these circumstances, one might well have expected that the evidential evaluation and precise mensuration of tendencies had long been the subject of exhaustive investigation and now formed one of the earliest sections in a beginner’s psychological course. Instead, we find only a general naı̈ve ignorance that there is anything about it requiring to be learnt. One after another, laborious series of experiments are executed and published with the purpose of demonstrating some connection between two events, wherein the otherwise learned psychologist reveals that his art of proving and measuring correspondence has not advanced beyond that of lay persons. The consequence has been that the significance of the experiments is not at all rightly understood, nor have any definite facts been elicited that may be either confirmed or refuted. 
The present article is a commencement at attempting to remedy this deficiency of scientific correlation. With this view, it will be strictly confined to the needs of practical workers, and all theoretical mathematical demonstrations will be omitted; it may, however, be said that the relations stated have already received a large amount of empirical verification. Great thanks are due from me to Professor Haussdorff and to Dr. G. Lipps, each of whom have supplied a useful theorem in polynomial probability; the former has also very kindly given valuable advice concerning the proof of the important formulæ for elimination of ‘‘systematic deviations.’’ At the same time, and for the same reason, the meaning and working of the various formulæ have been explained sufficiently, it is hoped, to render them readily usable even by those whose knowledge of mathematics is elementary. The fundamental procedure is accompanied by simple imaginary examples, while the more advanced parts are illustrated by cases that have actually occurred in my personal experience. For more abundant and positive exemplification, the reader is requested to refer to the under cited research, which is entirely built upon the principles and mathematical relations here laid down. In conclusion, the general value of the methodics recommended is emphasized by a brief criticism of the best correlational work hitherto made public, and also the important question is discussed as to the number of ‘‘cases’’ required for an experimental series.

3,687 citations


"Top-K nearest keyword search on lar..." refers methods in this paper

  • ...We use six metrics for evaluation: hit rate, Spearman’s rho [21], error, query time, index time, and index size....

    [...]

Proceedings ArticleDOI
26 Feb 2002
TL;DR: BANKS is described, a system which enables keyword-based search on relational databases, together with data and schema browsing, and presents an efficient heuristic algorithm for finding and ranking query results.
Abstract: With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.

970 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]

  • ...k Interval [1, 1] [2, 3] [4, 5] [6, 6] [7, 8]...

    [...]

Book ChapterDOI
10 Apr 2000
TL;DR: A very simple algorithm for the Least Common Ancestors problem is presented, dispelling the frequently held notion that optimal LCA computation is unwieldy and unimplementable.
Abstract: We present a very simple algorithm for the Least Common Ancestors problem. We thus dispel the frequently held notion that optimal LCA computation is unwieldy and unimplementable. Interestingly, this algorithm is a sequentialization of a previously known PRAM algorithm.

898 citations


"Top-K nearest keyword search on lar..." refers background or methods in this paper

  • ...[2, 6] is processed recursively by invoking partition(EEP(λ), [2, 6], (r, a),CT(λ)), and [7, 20] is processed by the other two child nodes c and b similarly....

    [...]

  • ...16 17 19 Interval [1,2] 3 [4,5] 6 [7,10]...

    [...]

  • ...We first process edge (r, a) with interval [2, 6], which divides the interval [1, 20] into three parts: [1, 1], [2, 6], and [7, 20]....

    [...]

  • ...k Interval [1, 1] [2, 3] [4, 5] [6, 6] [7, 8]...

    [...]

  • ...Using the techniques in [2], LCA(u, v) can be found in O(1) time using O(|V |) index space....

    [...]

Book ChapterDOI
20 Aug 2002
TL;DR: It is proved that DISCOVER finds without redundancy all relevant candidate networks, whose size can be data bound, by exploiting the structure of the schema and the selection of the optimal execution plan (way to reuse common subexpressions) is NP-complete.
Abstract: DISCOVER operates on relational databases and facilitates information discovery on them by allowing its user to issue keyword queries without any knowledge of the database schema or of SQL. DISCOVER returns qualified joining networks of tuples, that is, sets of tuples that are associated because they join on their primary and foreign keys and collectively contain all the keywords of the query. DISCOVER proceeds in two steps. First the Candidate Network Generator generates all candidate networks of relations, that is, join expressions that generate the joining networks of tuples. Then the Plan Generator builds plans for the efficient evaluation of the set of candidate networks, exploiting the opportunities to reuse common subexpressions of the candidate networks. We prove that DISCOVER finds without redundancy all relevant candidate networks, whose size can be data bound, by exploiting the structure of the schema. We prove that the selection of the optimal execution plan (way to reuse common subexpressions) is NP-complete. We provide a greedy algorithm and we show that it provides near-optimal plan execution time cost. Our experimentation also provides hints on tuning the greedy algorithm.

892 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]

Proceedings ArticleDOI
11 Jun 2007
TL;DR: BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search, and offers orders-of-magnitude performance improvement over existing approaches.
Abstract: Query processing over graph-structured data is enjoying a growing number of applications. A top-k keyword search query on a graph finds the top k answers according to some ranking criteria, where each answer is a substructure of the graph containing all query keywords. Current techniques for supporting such queries on general graphs suffer from several drawbacks, e.g., poor worst-case performance, not taking full advantage of indexes, and high memory requirements. To address these problems, we propose BLINKS, a bi-level indexing and query processing scheme for top-k keyword search on graphs. BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search. To reduce the index space, BLINKS partitions a data graph into blocks: The bi-level index stores summary information at the block level to initiate and guide search among blocks, and more detailed information for each block to accelerate search within blocks. Our experiments show that BLINKS offers orders-of-magnitude performance improvement over existing approaches.

601 citations


"Top-K nearest keyword search on lar..." refers background in this paper

  • ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

    [...]

  • ...For the node h, its interval is [10, 18] because the preorder of h on T is 10 and the maximum preorder for all nodes on the subtree rooted at h is 18....

    [...]

Frequently Asked Questions (16)
Q1. What have the authors contributed in "Top-k nearest keyword search on large graphs" ?

On such networks, the authors study the problem of top-k nearest keyword ( k-NK ) search. The authors propose two efficient algorithms to report the exact k-NK result on a tree. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. 

With the tree distance formulation, the key operation in answering a k-NK query on a graph is to answer the k-NK query on a tree. 

By keeping a global candidate list and removing duplicate index items, the global storage technique reduces the index size of pivot by 61% on DBLP and by 55% on FLARN. 

Suppose the authors have an algorithm to compute RT on a tree T; they can then solve the k-NK problem in a graph by merging RTi for each tree Ti, 1 ≤ i ≤ r. 
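The merge step above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: `merge_knk_results` and the sample node names are hypothetical, and each per-tree result is assumed to be a list of (node, distance) pairs; a node reached in several trees keeps its smallest estimated distance.

```python
import heapq

def merge_knk_results(per_tree_results, k):
    """Merge per-tree candidate lists R_Ti (lists of (node, dist) pairs)
    into a global top-k answer, keeping the smallest distance per node."""
    best = {}  # node -> smallest estimated distance over all trees
    for result in per_tree_results:
        for node, dist in result:
            if node not in best or dist < best[node]:
                best[node] = dist
    # k nodes with the smallest merged distance estimates
    return heapq.nsmallest(k, best.items(), key=lambda item: item[1])

# Hypothetical results from r = 2 shortest path trees
r1 = [("a", 1.0), ("b", 3.0), ("c", 4.0)]
r2 = [("b", 2.0), ("d", 2.5)]
print(merge_knk_results([r1, r2], 2))  # [('a', 1.0), ('b', 2.0)]
```

Node b illustrates the deduplication: it appears in both trees, and only its shorter estimate (2.0, from the second tree) survives the merge.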

As the authors transform a distance oracle on a graph into a set of shortest path trees, the original k-NK query on the graph can be reduced to answering the k-NK query on a set of trees. 

Since CT(λ) keeps the structural information of all keyword nodes in T , it is sufficient to search only on CT(λ) to calculate candλ(v). 

THEOREM 5. Given a tree T(V, E), Algorithm 7 constructs a distance preserving balanced tree DT(T) for T using O(|V| · log |V|) time and O(|V|) space. 

This is because the complexity of pivot grows linearly with the tree depth, and the larger diameter of FLARN leads to a larger tree depth. 

In order to reduce the average depth of nodes to optimize both index space and query processing time, the authors introduce a new structure called distance preserving balanced tree for T(V, E), denoted as DT(T). 

For each node v traversed, the authors merge candλ(v) into that of its parent node u by adding the distance dist(u, v) to the list candλ(v) (lines 3-5). 

Their second attempt is, for each node v on the tree T(V, E) and each keyword λ, to precompute the k nearest nodes of v that contain λ. 
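A naive version of this precomputation can be sketched as below. This is only an illustrative O(|V|²) sketch, not the paper's construction: `precompute_knn_lists`, the adjacency-dict representation, and the sample path graph are all assumptions, and each node's list is built by one full traversal of the tree.

```python
import heapq

def precompute_knn_lists(adj, keyword_nodes, k):
    """Naive sketch: for each node v, store its k nearest nodes that
    carry the keyword, found by one tree traversal per node.
    adj: node -> list of (neighbor, edge_weight); the graph is a tree."""
    lists = {}
    for v in adj:
        # distances from v to every node, by DFS over the tree
        dist = {v: 0.0}
        stack = [v]
        while stack:
            u = stack.pop()
            for w, wt in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + wt
                    stack.append(w)
        cands = [(dist[u], u) for u in keyword_nodes]
        lists[v] = heapq.nsmallest(k, cands)
    return lists

# Path a -1- b -2- c, with the keyword on a and c
adj = {"a": [("b", 1.0)], "b": [("a", 1.0), ("c", 2.0)], "c": [("b", 2.0)]}
print(precompute_knn_lists(adj, ["a", "c"], 1)["b"])  # [(1.0, 'a')]
```

The quadratic cost of this direct approach is exactly why storing only compact, propagated candidate lists matters on large trees.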

Let (u, u′) = EEλ(q). Since ENλ(q) is on the path from u to u′ on the tree T, the path from ENλ(q) to any keyword node in T will go through either u or u′. 
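The consequence of that observation can be sketched as follows: since every answer path passes through u or u′, the query can be answered by merging the two endpoints' candidate lists, shifted by the query's distance to each endpoint. The function name and the (node, dist) list representation are hypothetical illustrations, not the paper's exact procedure.

```python
import heapq

def answer_via_entry_edge(cand_u, cand_up, d_qu, d_qup, k):
    """Merge the candidate lists of the entry edge endpoints u and u',
    with each candidate's distance shifted by dist(q, u) or dist(q, u').
    cand_u / cand_up: lists of (keyword_node, dist_to_endpoint) pairs."""
    shifted = ([(d + d_qu, n) for n, d in cand_u] +
               [(d + d_qup, n) for n, d in cand_up])
    best = {}
    for d, n in shifted:                  # a node may be reachable via
        if n not in best or d < best[n]:  # both endpoints; keep the shorter
            best[n] = d
    return heapq.nsmallest(k, [(d, n) for n, d in best.items()])

# Hypothetical lists: x and y near u, y also near u'
cand_u = [("x", 0.0), ("y", 3.0)]
cand_up = [("y", 1.0)]
print(answer_via_entry_edge(cand_u, cand_up, 2.0, 1.0, 2))
```

Here y is reachable through both endpoints (at cost 5.0 via u, 2.0 via u′), and only the shorter route enters the final top-k list.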

Given a compact tree CT(λ) for a tree T and a keyword λ, the authors need to compute the candidate list candλ(v) for every node v on CT(λ). 

For each pivot p of v as well as v itself, the authors calculate distT(p, v) on the original tree T, and add the element v : distT(p, v) to the candidate list candλ(p) (lines 4-5). 

The first traversal on CT(λ) is a bottom-up one, such that the candidate list on each node is propagated to all its ancestors on CT(λ). 
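The bottom-up pass described above can be sketched as below. This is an illustrative fragment under assumed data structures, not the paper's Algorithm: the compact tree is a child-adjacency dict, each candidate list holds (dist, keyword_node) pairs, and lists are truncated to the k nearest entries as they are merged upward.

```python
import heapq

def bottom_up_propagate(tree, root, cand, k):
    """Bottom-up pass over a compact tree: each node's candidate list,
    with the connecting edge distance added, is merged into its parent's
    list, which is then truncated to its k nearest entries.
    tree: node -> list of (child, dist); cand: node -> [(dist, kw_node)]."""
    def visit(u):
        for v, d in tree.get(u, []):
            visit(v)
            # shift the child's candidates by dist(u, v), merge into u
            shifted = [(dv + d, n) for dv, n in cand.get(v, [])]
            cand[u] = heapq.nsmallest(k, cand.get(u, []) + shifted)
    visit(root)
    return cand

# Chain r -1- a -2- b; keyword nodes a and b carry themselves at dist 0
tree = {"r": [("a", 1.0)], "a": [("b", 2.0)]}
cand = {"a": [(0.0, "a")], "b": [(0.0, "b")]}
print(bottom_up_propagate(tree, "r", cand, 2)["r"])  # [(1.0, 'a'), (3.0, 'b')]
```

After this pass each node's list covers the keyword nodes in its subtree; a matching top-down pass would then propagate candidates from ancestors so that every list covers the whole tree.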

Since a tree contains more structural information than a star, using tree distance will be more accurate than using witness distance for estimating the distance of two nodes.
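The difference between the two estimates can be made concrete. In a minimal sketch (the function names and the toy tree are assumptions), a witness estimate always routes through the tree root w, while the tree distance subtracts the doubly counted path above the lowest common ancestor: distT(u, v) = dist(u, w) + dist(v, w) − 2 · dist(lca(u, v), w).

```python
def witness_distance(depth, u, v):
    """Star/witness estimate: route through the common witness w (the
    tree root). depth[x] is dist(x, w)."""
    return depth[u] + depth[v]

def tree_distance(depth, parent, u, v):
    """Distance on the shortest path tree rooted at w:
    dist(u, w) + dist(v, w) - 2 * dist(lca(u, v), w)."""
    anc = set()                 # all ancestors of u, up to the root
    x = u
    while x is not None:
        anc.add(x)
        x = parent.get(x)
    y = v
    while y not in anc:         # walk v upward until hitting u's chain
        y = parent[y]
    return depth[u] + depth[v] - 2 * depth[y]   # y is lca(u, v)

# Toy tree: w -> a -> {u, v}; unit depths are hypothetical edge sums
parent = {"a": "w", "u": "a", "v": "a"}
depth = {"w": 0.0, "a": 1.0, "u": 2.0, "v": 3.0}
print(witness_distance(depth, "u", "v"))        # 5.0 (overestimate)
print(tree_distance(depth, parent, "u", "v"))   # 3.0 (exact on the tree)
```

For u and v in the same subtree, the witness estimate of 5.0 double-counts the shared path w-a, whereas the tree distance of 3.0 is exact on the tree, which is precisely why the tree estimate is at least as tight.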