Journal Article

# Top-K nearest keyword search on large graphs

01 Aug 2013-Vol. 6, Iss: 10, pp 901-912

TL;DR: A shortest path tree for a distance oracle technique is built and a global storage technique is proposed to further reduce the index size and the query time in obtaining a k-NK result on a graph from that on trees.
Abstract: It is quite common for networks emerging nowadays to have labels or textual contents on the nodes. On such networks, we study the problem of top-k nearest keyword (k-NK) search. In a network G modeled as an undirected graph, each node is attached with zero or more keywords, and each edge is assigned with a weight measuring its length. Given a query node q in G and a keyword λ, a k-NK query seeks k nodes which contain λ and are nearest to q. k-NK is not only useful as a stand-alone query but also as a building block for tackling complex graph pattern matching problems. The key to an accurate k-NK result is a precise shortest distance estimation in a graph. Based on the latest distance oracle technique, we build a shortest path tree for a distance oracle and use the tree distance as a more accurate estimation. With such representation, the original k-NK query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We propose two efficient algorithms to report the exact k-NK result on a tree. One is query time optimized for a scenario when a small number of result nodes are of interest to users. The other handles k-NK queries for an arbitrarily large k efficiently. In obtaining a k-NK result on a graph from that on trees, a global storage technique is proposed to further reduce the index size and the query time. Extensive experimental results conform with our theoretical findings, and demonstrate the effectiveness and efficiency of our k-NK algorithms on large real graphs.
Topics: Query optimization (65%), Tree (graph theory) (64%), Block graph (62%), Shortest-path tree (62%), Trémaux tree (61%)
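
To make the k-NK query in the abstract concrete, the following is a minimal baseline sketch, not the paper's indexed method: it runs Dijkstra's algorithm from q and stops once k nodes carrying λ have been settled. The function name `knk_dijkstra` and the dictionary-based graph layout are assumptions made for illustration only.

```python
import heapq


def knk_dijkstra(adj, keywords, q, lam, k):
    """Baseline exact k-NK: up to k (distance, node) pairs nearest to q that carry lam.

    adj:      dict node -> list of (neighbor, weight); undirected, weights >= 0
    keywords: dict node -> set of keywords attached to that node
    (Both layouts are hypothetical; the paper does not prescribe them.)
    """
    dist = {q: 0.0}
    heap = [(0.0, q)]
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        # Nodes are settled in non-decreasing distance, so matches arrive in order.
        if lam in keywords.get(u, ()):
            result.append((d, u))
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result


# Tiny example graph (made up for this sketch).
adj = {
    "q": [("a", 1), ("b", 4)],
    "a": [("q", 1), ("b", 2), ("c", 5)],
    "b": [("q", 4), ("a", 2), ("c", 1)],
    "c": [("a", 5), ("b", 1)],
}
keywords = {"a": {"cafe"}, "b": {"hospital"}, "c": {"cafe", "hospital"}}
print(knk_dijkstra(adj, keywords, "q", "cafe", 2))
```

A search like this touches every node closer than the k-th match, which is the cost that a distance-oracle index of the kind described in the abstract is meant to avoid.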


Top-K Nearest Keyword Search on Large Graphs
Miao Qiao, Lu Qin, Hong Cheng, Jeffrey Xu Yu, Wentao Tian
The Chinese University of Hong Kong, Hong Kong, China
{mqiao,lqin,hcheng,yu,wttian}@se.cuhk.edu.hk
1. INTRODUCTION
Many real-world networks emerging nowadays have labels or
textual contents on the nodes. For example, in a road network, a
location may have labels such as “McDonald’s”, “hospital”, and
“kindergarten”. In a social network, a person may have information
including name, interests, and skills. In a bibliographic
network, a paper may have keywords and an abstract, and an author
may have a name, affiliation, and email address. In this study, we
consider the problem of top-k nearest keyword (k-NK) search on
large networks. In a network G modeled as an undirected graph,
each node is attached with zero or more keywords, and each edge
is assigned a weight measuring its length. Given a query node
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 10

40 citations

### Cites background or methods from "Top-K nearest keyword search on lar..."

• ...(2) Both methods [4, 26] assume that the index can reside in main memory....

[...]

• ...Given a graph G = (V, E) with vertex set V, and edge set E, the algorithm in [4] incurs a (2 log2 |V| − 1) approximation factor, which can be quite large given large values of |V|, and as shown in [26], the resulting error is significant in their empirical study in real graphs and good solutions can be missed....

[...]

• ...The authors of [26] point out that the error introduced by the star summary in [4] can be large....

[...]

• ...Both PMI and pivot-gs were implemented by the authors of [26]....

[...]

• ...As pointed out in [4] and [26], some keyword queries in a network are generated from a vertex inside the network with an interest of looking for vertices in a near-vicinity of the network....

[...]
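
To give a feel for the size of the approximation factor quoted in the excerpts above, here is a small arithmetic sketch. It assumes the expression is read as 2·log2(|V|) − 1 with a base-2 logarithm; the helper name `approx_factor` is hypothetical.

```python
import math


def approx_factor(num_vertices):
    """Worst-case factor 2*log2(|V|) - 1, as read from the quoted excerpt."""
    return 2 * math.log2(num_vertices) - 1


for n in (10**3, 10**6, 10**9):
    print(n, round(approx_factor(n), 2))
```

Under that reading, the factor is roughly 19 for a thousand vertices, 39 for a million, and 59 for a billion, which is consistent with the excerpt's remark that it can be quite large on big graphs.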

Proceedings Article
Yuchen Li, Zhifeng Bao, Guoliang Li, Kian-Lee Tan
13 Apr 2015
TL;DR: A novel 3D cube inverted index is designed, a cube based threshold algorithm is devised to retrieve the top-k results, and several pruning techniques are proposed to optimize the social distance computation, whose cost dominates the query processing.
Abstract: Internet users are shifting from searching on traditional media to social network platforms (SNPs) to retrieve up-to-date and valuable information. SNPs have two unique characteristics: frequent content update and small world phenomenon. However, existing works are not able to support these two features simultaneously. To address this problem, we develop a general framework to enable real time personalized top-k query. Our framework is based on a general ranking function that incorporates time freshness, social relevance and textual similarity. To ensure efficient update and query processing, there are two key challenges. The first is to design an index structure that is update-friendly while supporting instant query processing. The second is to efficiently compute the social relevance in a complex graph. To address these challenges, we first design a novel 3D cube inverted index to support efficient pruning on the three dimensions simultaneously. Then we devise a cube based threshold algorithm to retrieve the top-k results, and propose several pruning techniques to optimize the social distance computation, whose cost dominates the query processing. Furthermore, we optimize the 3D index via a hierarchical partition method to enhance our pruning on the social dimension. Extensive experimental results on two real world large datasets demonstrate the efficiency and the robustness of our proposed solution.

39 citations

### Cites background from "Top-K nearest keyword search on lar..."

• ...The social distance is usually modeled as the shortest distance on the social graph [9], [10], [6], [5], [7]....

[...]

• ...(1) Social Relevance: The social distance for two vertices v ↔ v′ is adopted as the shortest distance [9], [10], [6], [5]....

[...]

Journal Article
Ye Yuan, Xiang Lian, Lei Chen, Jeffery Xu Yu +2 more
TL;DR: A signature-based search algorithm is proposed that encodes the shortest-path distance from a vertex to any given keyword in the graph, and can find query answers by exploring fewer paths, so that the time and communication costs are low.
Abstract: Graph keyword search has drawn many research interests, since graph models can generally represent both structured and unstructured databases and keyword searches can extract valuable information for users without the knowledge of the underlying schema and query language. In practice, data graphs can be extremely large, e.g., a Web-scale graph containing billions of vertices. The state-of-the-art approaches employ centralized algorithms to process graph keyword searches, and thus they are infeasible for such large graphs, due to the limited computational power and storage space of a centralized server. To address this problem, we investigate keyword search for Web-scale graphs deployed in a distributed environment. We first give a naive search algorithm to answer the query efficiently. However, the naive search algorithm uses a flooding search strategy that incurs large time and network overhead. To remedy this shortcoming, we then propose a signature-based search algorithm. Specifically, we design a vertex signature that encodes the shortest-path distance from a vertex to any given keyword in the graph. As a result, we can find query answers by exploring fewer paths, so that the time and communication costs are low. Moreover, we reorganize the graph data in the cluster after its initial random partitioning so that the signature-based techniques are more effective. Finally, our experimental results demonstrate the feasibility of our proposed approach in performing keyword searches over Web-scale graph data.

20 citations

### Cites background from "Top-K nearest keyword search on lar..."

• ...[29] studied the top-k nearest keyword (k-NK) query over a graph....

[...]

• ...Therefore, the studied problems in [28], [29] are different from that in this paper, and their proposed techniques cannot be directed used for solving our problem in this paper....

[...]

• ...There are some works to study the variants of graph keyword search [28], [29]....

[...]

##### References

Book
05 Sep 2011
TL;DR: The present article is a commencement at attempting to remedy this deficiency of scientific correlation, and the meaning and working of the various formulæ have been explained sufficiently, it is hoped, to render them readily usable even by those whose knowledge of mathematics is elementary.

3,267 citations

### "Top-K nearest keyword search on lar..." refers methods in this paper

• ...We use six metrics for evaluation: hit rate, Spearman’s rho [21], error, query time, index time, and index size....

[...]
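
As an illustration of one of the metrics listed above, the sketch below computes Spearman's rho between two rankings with the standard tie-free formula rho = 1 − 6·Σd² / (n(n² − 1)). How the paper pairs the two rankings (for example, returned order versus exact order) is not stated in the excerpt, so that pairing is an assumption here, as is the function name `spearman_rho`.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of the same n items.

    rank_a[i] and rank_b[i] are the ranks (1..n) given to item i by each ranking.
    """
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rankings of equal length n >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]))  # identical rankings
print(spearman_rho([1, 2, 3, 4], [4, 3, 2, 1]))  # fully reversed rankings
```

A value of 1 means the two rankings agree exactly and −1 means one is the reverse of the other.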

Proceedings Article
G. Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti +1 more
26 Feb 2002
TL;DR: BANKS is described, a system which enables keyword-based search on relational databases, together with data and schema browsing, and presents an efficient heuristic algorithm for finding and ranking query results.
Abstract: With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.

944 citations

### "Top-K nearest keyword search on lar..." refers background in this paper

• ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

[...]

• ...k Interval [1, 1] [2, 3] [4, 5] [6, 6] [7, 8]...

[...]

Book Chapter
20 Aug 2002
TL;DR: It is proved that DISCOVER finds without redundancy all relevant candidate networks, whose size can be data bound, by exploiting the structure of the schema and the selection of the optimal execution plan (way to reuse common subexpressions) is NP-complete.
Abstract: DISCOVER operates on relational databases and facilitates information discovery on them by allowing its user to issue keyword queries without any knowledge of the database schema or of SQL. DISCOVER returns qualified joining networks of tuples, that is, sets of tuples that are associated because they join on their primary and foreign keys and collectively contain all the keywords of the query. DISCOVER proceeds in two steps. First the Candidate Network Generator generates all candidate networks of relations, that is, join expressions that generate the joining networks of tuples. Then the Plan Generator builds plans for the efficient evaluation of the set of candidate networks, exploiting the opportunities to reuse common subexpressions of the candidate networks. We prove that DISCOVER finds without redundancy all relevant candidate networks, whose size can be data bound, by exploiting the structure of the schema. We prove that the selection of the optimal execution plan (way to reuse common subexpressions) is NP-complete. We provide a greedy algorithm and we show that it provides near-optimal plan execution time cost. Our experimentation also provides hints on tuning the greedy algorithm.

875 citations

### "Top-K nearest keyword search on lar..." refers background in this paper

• ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

[...]

Book Chapter
10 Apr 2000
TL;DR: A very simple algorithm for the Least Common Ancestors problem is presented, dispelling the frequently held notion that optimal LCA computation is unwieldy and unimplementable.
Abstract: We present a very simple algorithm for the Least Common Ancestors problem. We thus dispel the frequently held notion that optimal LCA computation is unwieldy and unimplementable. Interestingly, this algorithm is a sequentialization of a previously known PRAM algorithm.

852 citations

### "Top-K nearest keyword search on lar..." refers background or methods in this paper

• ...[2, 6] is processed recursively by invoking partition(EEP(λ), [2, 6], (r, a), CT(λ)), and [7, 20] is processed by the other two child nodes c and b similarly....

[...]

• ...16 17 19 Interval [1,2] 3 [4,5] 6 [7,10]...

[...]

• ...We first process edge (r, a) with interval [2, 6], which divides the interval [1, 20] into three parts: [1, 1], [2, 6], and [7, 20]....

[...]

• ...k Interval [1, 1] [2, 3] [4, 5] [6, 6] [7, 8]...

[...]

• ...Using the techniques in [2], LCA(u, v) can be found in O(1) time using O(|V|) index space....

[...]
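
The last excerpt above relies on constant-time LCA queries, which is what makes tree distance cheap to evaluate: dist(u, v) = dist(root, u) + dist(root, v) − 2·dist(root, LCA(u, v)). The sketch below shows one simple way to get O(1) queries, using an Euler tour plus a sparse table. This variant needs O(|V| log |V|) space rather than the O(|V|) space of the technique cited as [2]; it is a stand-in for illustration, and all names (`build_lca`, `children`, `weight`) are hypothetical.

```python
def build_lca(children, weight, root):
    """Return (lca, tree_dist) query functions for a rooted, edge-weighted tree.

    children: dict node -> list of child nodes
    weight:   dict (parent, child) -> edge length
    """
    euler, depth_at, first, dist = [root], [0], {root: 0}, {root: 0.0}
    stack = [(root, 0, iter(children.get(root, ())))]
    while stack:  # iterative DFS that records the Euler tour
        u, d, it = stack[-1]
        c = next(it, None)
        if c is None:
            stack.pop()
            if stack:  # step back up to the parent
                euler.append(stack[-1][0])
                depth_at.append(stack[-1][1])
        else:
            dist[c] = dist[u] + weight[(u, c)]
            first[c] = len(euler)
            euler.append(c)
            depth_at.append(d + 1)
            stack.append((c, d + 1, iter(children.get(c, ()))))

    # table[j][i] = index of the shallowest tour entry in euler[i : i + 2**j]
    m = len(euler)
    table = [list(range(m))]
    j = 1
    while (1 << j) <= m:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(m - (1 << j) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if depth_at[a] <= depth_at[b] else b)
        table.append(row)
        j += 1

    def lca(u, v):
        lo, hi = sorted((first[u], first[v]))
        lvl = (hi - lo + 1).bit_length() - 1
        a, b = table[lvl][lo], table[lvl][hi - (1 << lvl) + 1]
        return euler[a if depth_at[a] <= depth_at[b] else b]

    def tree_dist(u, v):
        return dist[u] + dist[v] - 2 * dist[lca(u, v)]

    return lca, tree_dist


# Small made-up tree: r -> a, b and a -> c, d.
children = {"r": ["a", "b"], "a": ["c", "d"]}
weight = {("r", "a"): 2, ("r", "b"): 3, ("a", "c"): 1, ("a", "d"): 4}
lca, tree_dist = build_lca(children, weight, "r")
print(lca("c", "d"), tree_dist("c", "d"))
```

The sparse table trades extra space for a very short query routine; a linear-space structure would answer the same queries with more preprocessing logic.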

Proceedings Article
Hao He, Haixun Wang, Jun Yang, Philip S. Yu
11 Jun 2007
TL;DR: BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search, and offers orders-of-magnitude performance improvement over existing approaches.
Abstract: Query processing over graph-structured data is enjoying a growing number of applications. A top-k keyword search query on a graph finds the top k answers according to some ranking criteria, where each answer is a substructure of the graph containing all query keywords. Current techniques for supporting such queries on general graphs suffer from several drawbacks, e.g., poor worst-case performance, not taking full advantage of indexes, and high memory requirements. To address these problems, we propose BLINKS, a bi-level indexing and query processing scheme for top-k keyword search on graphs. BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search. To reduce the index space, BLINKS partitions a data graph into blocks: The bi-level index stores summary information at the block level to initiate and guide search among blocks, and more detailed information for each block to accelerate search within blocks. Our experiments show that BLINKS offers orders-of-magnitude performance improvement over existing approaches.

585 citations

### "Top-K nearest keyword search on lar..." refers background in this paper

• ...The answer substructure can be a tree [12, 3, 13, 8, 10, 9], a subgraph [16, 17] or a r-clique [14]....

[...]

• ...For the node h, its interval is [10, 18] because the preorder of h on T is 10 and the maximum preorder for all nodes on the subtree rooted at h is 18....

[...]
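
The interval described in the excerpt above (a node's own preorder number paired with the largest preorder number in its subtree) can be computed in a single traversal. The following is a minimal sketch that assumes preorder numbering starts at 1 and that children are visited in list order; the names `preorder_intervals` and `in_subtree` are hypothetical.

```python
def preorder_intervals(children, root):
    """Map each node to (preorder number, max preorder number in its subtree)."""
    pre, end = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        u, done = stack.pop()
        if done:
            # Every descendant of u has been numbered by now, and nothing else since.
            end[u] = counter
            continue
        counter += 1
        pre[u] = counter
        stack.append((u, True))  # revisit u after its whole subtree
        for c in reversed(children.get(u, ())):
            stack.append((c, False))
    return {u: (pre[u], end[u]) for u in pre}


def in_subtree(intervals, u, v):
    """True if v lies in the subtree rooted at u (u itself included)."""
    lo, hi = intervals[u]
    return lo <= intervals[v][0] <= hi


# Small made-up tree: r -> a, b and a -> c, d.
children = {"r": ["a", "b"], "a": ["c", "d"]}
intervals = preorder_intervals(children, "r")
print(intervals)
```

With these labels, a subtree membership test reduces to a range check on preorder numbers, which is why interval arithmetic such as splitting [1, 20] into sub-intervals can stand in for tree navigation.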


##### Performance
###### Metrics
No. of citations received by the paper in previous years

| Year | Citations |
|------|-----------|
| 2021 | 6 |
| 2020 | 8 |
| 2019 | 7 |
| 2018 | 3 |
| 2017 | 9 |
| 2016 | 3 |