Top-K nearest keyword search on large graphs
Summary (4 min read)
1. INTRODUCTION
- Many real-world networks emerging nowadays have labels or textual contents on the nodes.
- K-NK is an important and useful query in graph search.
- Intuitively, if two persons share some common friends, i.e., they are two hops away, they are more likely to become friends.
- Instead, the authors use a distance oracle [20] as the fundamental distance estimation framework.
- The rest of the paper is organized as follows.
2. PROBLEM DEFINITION
- The authors model a weighted undirected graph as G(V, E), where V(G) represents the set of nodes and E(G) represents the set of edges in G.
- The weight of a path is the total weight of all edges on the path.
- For simplicity, the authors assume that there is only one keyword λ in the query.
- The authors will discuss how to answer a query containing multiple keywords with AND and OR semantics.
3. EXISTING SOLUTIONS
- Obviously, Dijkstra’s algorithm is inefficient when the size of the graph is large or the keyword nodes are far away from q.
- In the literature, [1] and [22] design different indexing schemes to process (top-k) nearest keyword queries on a graph or a tree.
- The authors introduce the two methods in the following two subsections.
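As a point of reference for the baseline mentioned above, the following is a minimal sketch of answering a k-NK query by running Dijkstra's algorithm from the query node and stopping once k keyword nodes have been settled. The function name `knk_dijkstra`, the adjacency-list layout, and the toy graph are assumptions made for illustration; they are not taken from the paper.

```python
import heapq

def knk_dijkstra(adj, keywords, q, lam, k):
    """Return up to k (node, distance) pairs nearest to q that carry keyword lam."""
    dist = {q: 0}
    heap = [(0, q)]
    settled = set()
    result = []
    while heap and len(result) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        if lam in keywords.get(u, ()):
            result.append((u, d))  # nodes are settled in nondecreasing distance
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result

# Hypothetical toy graph: undirected, weighted adjacency lists.
adj = {
    "q": [("a", 1), ("b", 4)],
    "a": [("q", 1), ("b", 2), ("c", 5)],
    "b": [("q", 4), ("a", 2), ("c", 1)],
    "c": [("a", 5), ("b", 1)],
}
keywords = {"a": {"ml"}, "b": {"db"}, "c": {"db"}}
print(knk_dijkstra(adj, keywords, "q", "db", 2))  # [('b', 3), ('c', 4)]
```

The search cost grows with the number of nodes settled before the k-th keyword node is reached, which is why this baseline degrades when keyword nodes are far from q.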
3.1 Approximate k-NK on a Graph
- Bahmani and Goel [1] find an approximate answer to a k-NK query in a graph based on a distance oracle [20].
- One distance oracle is usually not enough for distance estimation in a graph G. Even for two nodes in the same partition, the estimation may have a large error.
- The authors illustrate the algorithm using the following example.
3.2 Exact 1-NK on a Tree
- For a certain keyword λ, all nodes with the preorder label in the interval [1, |V|] can be partitioned into several disjoint intervals, such that any node v in the same interval shares an identical NN(v, λ).
- The partition is called tree Voronoi partition of λ, denoted as TVP(λ).
- An extended compact tree ECT(λ) is a tree constructed by adding all change nodes into the compact tree CT(λ).
- Using ECT(λ), TVP(λ) can be constructed easily.
- The time to compute all compact trees and all extended compact trees for all keywords in the tree T (V, E) is bounded by O(|doc(V )| · log |V |).
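The interval idea can be illustrated with a brute-force sketch: label the nodes in preorder, find each node's nearest keyword node, and group consecutive preorder labels that share the same answer. This does not use the compact tree or extended compact tree described above; it computes the partition directly with a multi-source Dijkstra search, and all function names and the toy tree are hypothetical.

```python
import bisect
import heapq

def preorder_labels(children, root):
    """Assign preorder labels 1..|V| with an iterative DFS."""
    label, order, stack = {}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        label[u] = len(order)
        stack.extend(reversed(children.get(u, [])))
    return label, order

def nearest_keyword_node(adj, sources):
    """Multi-source Dijkstra: nearest source per node, ties broken by node name."""
    best = {s: (0, s) for s in sources}
    heap = [(0, s, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, s, u = heapq.heappop(heap)
        if best[u] != (d, s):
            continue  # stale entry
        for v, w in adj[u]:
            cand = (d + w, s)
            if v not in best or cand < best[v]:
                best[v] = cand
                heapq.heappush(heap, (d + w, s, v))
    return {u: s for u, (_, s) in best.items()}

def tree_voronoi_partition(order, nn):
    """Group consecutive preorder labels that share the same nearest keyword node."""
    intervals = []  # (first_label, last_label, nearest_node)
    for i, u in enumerate(order, start=1):
        if intervals and intervals[-1][2] == nn[u]:
            intervals[-1] = (intervals[-1][0], i, nn[u])
        else:
            intervals.append((i, i, nn[u]))
    return intervals

def query_1nk(intervals, label):
    """Binary search for the interval containing a preorder label."""
    starts = [s for s, _, _ in intervals]
    return intervals[bisect.bisect_right(starts, label) - 1][2]

# Hypothetical toy tree: (parent, child, weight); keyword nodes are c and b.
edges = [("r", "a", 1), ("r", "b", 1), ("a", "c", 1), ("a", "d", 1)]
children, adj = {}, {}
for p, c, w in edges:
    children.setdefault(p, []).append(c)
    adj.setdefault(p, []).append((c, w))
    adj.setdefault(c, []).append((p, w))

label, order = preorder_labels(children, "r")
nn = nearest_keyword_node(adj, ["c", "b"])
intervals = tree_voronoi_partition(order, nn)
print(intervals)  # [(1, 1, 'b'), (2, 4, 'c'), (5, 5, 'b')]
```

Note that the same keyword node can own more than one interval, as node b does here.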
4. SOLUTION OVERVIEW
- Answering k-NK on a graph using tree distance:
- To address the drawback of witness distance, in this paper, the authors propose to use tree distance in processing a k-NK query.
- For the distance oracles O1 and O2 shown in Fig. 2, the corresponding shortest path trees T1 and T2 are shown in the figure accompanying Example 5.
- Recall that in [22], for a certain keyword λ, the range [1, |V |] is partitioned into several disjoint intervals, and nodes with the preorder label in an identical interval share the same 1-NK result.
- The authors' first algorithm, tree-boundk, handles only bounded k values, with query processing time O(k + log |Vλ|) and index size O(k̄ · |doc(V)|) over all keywords, where k̄ is an upper bound on k.
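To make the contrast between witness distance and tree distance concrete, here is a small sketch under one stated assumption: the witness distance of two nodes is taken to be the length of the route through the root of a shortest path tree (a "star" view), while the tree distance is the length of the actual tree path through their lowest common ancestor. The helper names and the toy tree are hypothetical, and the naive LCA walk stands in for the constant-time LCA structure the paper cites.

```python
def build(children_w, root):
    """Compute parent, weighted depth, and hop count for every node of a rooted tree."""
    parent, depth, hops = {root: None}, {root: 0}, {root: 0}
    stack = [root]
    while stack:
        u = stack.pop()
        for v, w in children_w.get(u, []):
            parent[v], depth[v], hops[v] = u, depth[u] + w, hops[u] + 1
            stack.append(v)
    return parent, depth, hops

def lca(parent, hops, u, v):
    """Naive lowest common ancestor: lift the deeper node, then lift both."""
    while hops[u] > hops[v]:
        u = parent[u]
    while hops[v] > hops[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u

def witness_distance(depth, u, v):
    """Assumed definition: route from u to v through the root (the witness)."""
    return depth[u] + depth[v]

def tree_distance(parent, depth, hops, u, v):
    """Length of the tree path from u to v."""
    return depth[u] + depth[v] - 2 * depth[lca(parent, hops, u, v)]

# Hypothetical shortest path tree rooted at witness "w".
children_w = {"w": [("a", 2)], "a": [("u", 1), ("v", 1)]}
parent, depth, hops = build(children_w, "w")
print(witness_distance(depth, "u", "v"))            # 6
print(tree_distance(parent, depth, hops, "u", "v"))  # 2
```

With nonnegative edge weights the tree distance is never larger than the witness distance on the same tree, because the two differ by twice the depth of the lowest common ancestor.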
5. K-NK ON A TREE FOR A SMALL K
- Recall that for a keyword λ, its compact tree CT(λ) keeps all the structural information of λ on the tree T .
- The authors' idea is to precompute the top-k results for every keyword λ and every node on CT(λ).
- Since the total size of all compact trees is bounded by O(|doc(V )|), the total space to store the top-k results of nodes on all compact trees is bounded by O(k · |doc(V )|).
- In the following, the authors first introduce how to answer a k-NK query using CT(λ), followed by discussions on the construction of the index.
5.1 Query Processing
- Intuitively, the entry edge plays a role of connecting the query node q to the compact tree CT(λ).
- The authors assume that the compact tree CT(λ) for each keyword λ and the list candλ(u) for every node u on CT(λ) have been computed.
- The efficiency of Algorithm 1 depends on three operations.
- The results are merged using two operators ⊕ and ⊗k.
- The first index, the entry edge partition EEP(λ), is to find the entry edge for any node on T .
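The summary names the two merge operators without defining them. A plausible reading, stated here as an assumption, is that ⊕ adds a fixed distance to every entry of a candidate list, and ⊗k merges two sorted candidate lists while keeping only the k nearest distinct nodes. The function names `shift` and `merge_k` are hypothetical.

```python
import heapq

def shift(cand, d):
    """Assumed meaning of cand ⊕ d: add distance d to every (distance, node) entry."""
    return [(dist + d, node) for dist, node in cand]

def merge_k(a, b, k):
    """Assumed meaning of a ⊗k b: merge two sorted lists, keep the k nearest distinct nodes."""
    out, seen = [], set()
    for dist, node in heapq.merge(a, b):
        if node in seen:
            continue  # a nearer copy of this node was already kept
        seen.add(node)
        out.append((dist, node))
        if len(out) == k:
            break
    return out

left = [(1, "x"), (4, "y")]
right = shift([(0, "y"), (2, "z")], 3)   # [(3, 'y'), (5, 'z')]
print(merge_k(left, right, 2))           # [(1, 'x'), (3, 'y')]
```

Because both inputs are already sorted, the merge stops after at most 2k entries are examined, which is consistent with a query time that is linear in k.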
5.2 Construction of Entry Edge Partition
- Given a tree T(V, E), for each keyword λ, the authors construct an entry edge partition EEP(λ), following an idea similar to the tree Voronoi partition TVP(λ).
- Based on such an observation, by excluding the intervals of all edges under the subtree rooted at u′ in CT(λ) from the interval of (u, u′), nodes with preorder in the remaining intervals will use (u, u′) as the entry edge.
- Algorithm 4 shows the construction of the entry edge partition EEP(λ) on CT(λ) for a keyword λ.
- After initializing EEP(λ) (line 2), the main operation is a recursive procedure partition (line 3), to partition the interval [1, |V |] to several disjoint intervals.
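The core step of the partition, excluding the preorder intervals of deeper edges from the interval of the current edge, amounts to subtracting closed integer intervals. The sketch below shows only that step, assuming the inner intervals are disjoint and lie inside the outer one; the function name is hypothetical and this is not the paper's Algorithm 4.

```python
def subtract_intervals(outer, inner):
    """Remove disjoint closed integer intervals `inner` from the closed interval `outer`."""
    lo, hi = outer
    remaining = []
    for a, b in sorted(inner):
        if a > lo:
            remaining.append((lo, a - 1))  # gap before this inner interval
        lo = b + 1
    if lo <= hi:
        remaining.append((lo, hi))         # tail after the last inner interval
    return remaining

print(subtract_intervals((1, 20), [(2, 6)]))   # [(1, 1), (7, 20)]
print(subtract_intervals((2, 6), [(3, 4)]))    # [(2, 2), (5, 6)]
```

The intervals that remain are the preorder ranges whose nodes would use the current edge as their entry edge, while each removed interval is handled by the recursion on the corresponding child.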
5.3 Construction of Candidate List
- Given a compact tree CT(λ) for a tree T and a keyword λ, the authors need to compute the candidate list candλ(v) for every node v on CT(λ).
- Since CT(λ) keeps the structural information of all keyword nodes in T , it is sufficient to search only on CT(λ) to calculate candλ(v).
- Based on this observation, the authors can follow the path to propagate the candidate list on u to v.
- Using this idea, the authors just need to traverse the tree CT(λ) twice to build the candidate lists for all nodes on CT(λ).
- The second traversal on CT(λ) is a top-down one, such that the candidate list on each node is further propagated to all its descendants.
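The two traversals can be sketched as follows on a small rooted tree. The sketch runs on a plain weighted tree rather than on a compact tree CT(λ), keeps at most k entries per node, and resolves duplicates by keeping the nearer copy; the function names, the `(parent, child)` weight keys, and the example data are assumptions for illustration.

```python
import heapq

def merge_k(a, b, k):
    """Merge two sorted (distance, node) lists, keeping the k nearest distinct nodes."""
    out, seen = [], set()
    for dist, node in heapq.merge(a, b):
        if node not in seen:
            seen.add(node)
            out.append((dist, node))
            if len(out) == k:
                break
    return out

def build_candidate_lists(children, weight, root, keyword_nodes, k):
    # Visit order in which every node appears after its parent.
    order, parent, stack = [], {root: None}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for c in children.get(u, []):
            parent[c] = u
            stack.append(c)
    cand = {u: ([(0, u)] if u in keyword_nodes else []) for u in order}
    # Bottom-up pass: push each node's list into its parent.
    for v in reversed(order):
        u = parent[v]
        if u is not None:
            up = [(d + weight[(u, v)], x) for d, x in cand[v]]
            cand[u] = merge_k(cand[u], up, k)
    # Top-down pass: push each parent's final list into its children.
    for v in order:
        u = parent[v]
        if u is not None:
            down = [(d + weight[(u, v)], x) for d, x in cand[u]]
            cand[v] = merge_k(cand[v], down, k)
    return cand

children = {"r": ["a", "b"], "a": ["c", "d"]}
weight = {("r", "a"): 1, ("r", "b"): 3, ("a", "c"): 1, ("a", "d"): 5}
cand = build_candidate_lists(children, weight, "r", {"b", "c", "d"}, k=2)
print(cand["a"])  # [(1, 'c'), (4, 'b')]
print(cand["d"])  # [(0, 'd'), (6, 'c')]
```

In the top-down pass a parent's list may contain nodes from the child's own subtree at an inflated distance; with positive edge weights the nearer copy always sorts first, so the duplicate filter in `merge_k` discards the inflated one.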
6.1 A Basic Pivot Approach
- The authors' basic idea is to compute the first segment online and precompute the results regarding the second segment offline.
- In the query processing phase, the authors do not search the whole tree to get the answer for a query, but instead, they just need to merge the precomputed candidates along the path from the query node to the root node of the tree T .
- The authors use the following example to illustrate the pivot based approach.
- For every node v, the authors create a candidate list candλ(v) that contains all keyword nodes in its subtree, sorted in nondecreasing distances to v.
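A minimal sketch of the basic pivot idea, under stated assumptions: every node stores the keyword nodes of its subtree sorted by distance, and a query walks from the query node to the root, reading the first k entries of each list and adding the distance already travelled. Reading only the first k entries per ancestor and keeping the smallest estimate per node are simplifications chosen for this sketch, and the function names and example data are hypothetical.

```python
def build_subtree_lists(children, weight, root, keyword_nodes):
    """For every node, list all keyword nodes in its subtree, sorted by distance."""
    order, parent, stack = [], {root: None}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for c in children.get(u, []):
            parent[c] = u
            stack.append(c)
    cand = {u: ([(0, u)] if u in keyword_nodes else []) for u in order}
    for v in reversed(order):            # children are handled before parents
        u = parent[v]
        if u is not None:
            cand[u].extend((d + weight[(u, v)], x) for d, x in cand[v])
    for u in order:
        cand[u].sort()
    return parent, cand

def query_pivot(parent, weight, cand, q, k):
    """Merge precomputed lists along the path from q to the root."""
    best, node, travelled = {}, q, 0
    while node is not None:
        for d, x in cand[node][:k]:
            best[x] = min(best.get(x, float("inf")), travelled + d)
        p = parent[node]
        if p is not None:
            travelled += weight[(p, node)]
        node = p
    return sorted((d, x) for x, d in best.items())[:k]

children = {"r": ["a", "b"], "a": ["c", "d"]}
weight = {("r", "a"): 1, ("r", "b"): 3, ("a", "c"): 1, ("a", "d"): 5}
parent, cand = build_subtree_lists(children, weight, "r", {"b", "c", "d"})
print(query_pivot(parent, weight, cand, "d", 2))  # [(0, 'd'), (6, 'c')]
```

The number of lists touched by a query equals the depth of the query node, which is the cost that the balanced structure in the next subsection is designed to bound.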
6.2 Pivot Approach with Tree Balancing
- The problem is not perfectly solved using the basic pivot approach above.
- Thus the key to optimizing both index space and query time is to reduce the average depth of nodes on the tree.
- Furthermore, the authors need to traverse n nodes to answer a query when the query node q is at one end of the chain, leading to O(n) query time.
- Generally speaking, DT(T) preserves all distance information for any node pair on T, and the height of DT(T) is at most log₂ |V|.
- The authors will also describe how to construct DT(T ) for a tree T and how to compute all candidate lists candλ(v) for all keywords λ and all nodes v on the tree DT(T ).
6.3 Index Construction
- The first index is the distance preserving balanced tree DT(T ) for T and the second index is the candidate list candλ(v) for each keyword λ and each node v on DT(T ).
- Such a property also holds for any subtree of T ′ because it is processed using steps (1) and (2) recursively.
- The following lemma shows that the median node always exists on any tree T , and also gives a method to find the median node of T .
- All other nodes in DT(T ) are constructed similarly.
- After all candidate lists are created, the authors sort the elements in every candidate list in nondecreasing order of the distances.
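The construction can be sketched with a standard technique that appears to match the description: repeatedly pick a node whose removal leaves components of at most half the size (commonly called a centroid, assumed here to play the role of the paper's median node), make it the parent of the nodes picked in the remaining components, and store each keyword node in the candidate list of all of its pivots. This is not the paper's Algorithm 7; the all-pairs distance table is quadratic and used only to keep the example short, and all names are hypothetical.

```python
def build_dt(adj):
    """Return the parent map of a balanced tree built by recursive median removal."""
    removed, dt_parent = set(), {}

    def median(start):
        par, order, stack = {start: None}, [], [start]
        while stack:                       # DFS over the remaining component
            u = stack.pop()
            order.append(u)
            for v, _ in adj[u]:
                if v not in removed and v != par[u]:
                    par[v] = u
                    stack.append(v)
        size = {u: 1 for u in order}
        for u in reversed(order):
            if par[u] is not None:
                size[par[u]] += size[u]
        total = len(order)
        for u in order:
            heaviest = total - size[u]     # the part of the component above u
            for v, _ in adj[u]:
                if v not in removed and par.get(v) == u:
                    heaviest = max(heaviest, size[v])
            if 2 * heaviest <= total:
                return u

    work = [(next(iter(adj)), None)]
    while work:
        start, p = work.pop()
        m = median(start)
        dt_parent[m] = p
        removed.add(m)
        work.extend((v, m) for v, _ in adj[m] if v not in removed)
    return dt_parent

def dists_from(adj, s):
    dist, stack = {s: 0}, [s]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

def build_index(adj, keyword_nodes):
    dt_parent = build_dt(adj)
    dist = {p: dists_from(adj, p) for p in adj}   # illustration only: O(|V|^2)
    cand = {p: [] for p in adj}
    for v in keyword_nodes:
        p = v
        while p is not None:                      # v itself and all its pivots
            cand[p].append((dist[p][v], v))
            p = dt_parent[p]
    for p in cand:
        cand[p].sort()                            # nondecreasing distance
    return dt_parent, dist, cand

def query(dt_parent, dist, cand, q, k):
    best, p = {}, q
    while p is not None:
        for d, x in cand[p][:k]:
            best[x] = min(best.get(x, float("inf")), dist[p][q] + d)
        p = dt_parent[p]
    return sorted((d, x) for x, d in best.items())[:k]

# A 7-node chain with unit weights, the worst case for the unbalanced approach.
adj = {i: [] for i in range(7)}
for i in range(6):
    adj[i].append((i + 1, 1))
    adj[i + 1].append((i, 1))
keyword_nodes = [0, 3, 6]
dt_parent, dist, cand = build_index(adj, keyword_nodes)
print(dt_parent[3], dt_parent[1], dt_parent[0])  # None 3 1
print(query(dt_parent, dist, cand, 4, 2))        # [(1, 3), (2, 6)]
```

On the chain, the query for node 4 reads three lists (nodes 4, 5, and 3) instead of walking the whole chain.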
7. APPROXIMATE K-NK ON A GRAPH
- The authors introduce two algorithms graph-boundk and graph-pivot for a bounded k and an arbitrary k respectively.
- This expression can be generalized to the case of merging the candidate lists of node v on more than two trees.
- In the following, the authors will show that the global candidate list can be used to answer k-NK queries without sacrificing the result quality.
- Therefore, the authors show that using global storage will not sacrifice the result quality.
- When k is small, the index time and index space for boundk are smaller than pivot on both trees and graphs.
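Merging per-tree answers into a graph-level answer can be sketched as below. The sketch assumes that the estimated distance of a node is the smallest tree distance reported for it by any of the shortest path trees, and that each per-tree result is a list of (distance, node) pairs; the function name and the sample results are hypothetical.

```python
def merge_tree_results(per_tree_results, k):
    """Combine top-k results from several trees, keeping the best estimate per node."""
    best = {}
    for result in per_tree_results:
        for d, x in result:
            if d < best.get(x, float("inf")):
                best[x] = d
    return sorted((d, x) for x, d in best.items())[:k]

r1 = [(2, "a"), (5, "b")]   # hypothetical answer on tree T1
r2 = [(3, "b"), (4, "c")]   # hypothetical answer on tree T2
print(merge_tree_results([r1, r2], 2))  # [(2, 'a'), (3, 'b')]
```

Each tree distance is at least the true graph distance, so taking the minimum across trees can only tighten the estimate.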
8. EXPERIMENTS
- The authors report the performance of their methods boundk, pivot, and their global storage implementations boundk-gs and pivot-gs, with two baseline solutions BFS and PMI.
- The authors obtained the keywords of nodes from the OpenStreetMap project with a bounding box.
- Global storage helps reduce the query time of boundk by 20% and that of pivot by 15%.
- The query time shows a sharper increasing trend on DBLP than FLARN, as the frequency difference between DBLP keywords is larger.
- The index size of pivot is 2.5 times that of PMI on DBLP and 7.9 times on FLARN, due to the larger diameter of FLARN.
10. CONCLUSIONS
- The authors study top-k nearest keyword (k-NK) search on large graphs.
- The authors propose two exact k-NK algorithms on trees to handle a bounded k and an arbitrary k respectively.
- The authors extend tree based algorithms to graphs and propose a global storage technique to further reduce the index size and query time.
- The authors conducted extensive performance studies on real large graphs to demonstrate the effectiveness and efficiency of their algorithms.
Frequently Asked Questions (16)
Q2. What is the key operation in answering a k-NK query on a graph?
With the tree distance formulation, the key operation in answering a k-NK query on a graph is to answer the k-NK query on a tree.
Q3. How does global storage reduce the index size of pivot?
By keeping a global candidate list and removing duplicate index items, global storage reduces the index size of pivot by 61% on DBLP and 55% on FLARN.
Q4. How can the authors solve the k-NK problem in a graph?
Given an algorithm that computes the result R_T on a tree T, the authors can solve the k-NK problem in a graph by merging R_Ti for each tree Ti, 1 ≤ i ≤ r.
Q5. How can the authors reduce a distance oracle to a set of trees?
As the authors transform a distance oracle on a graph into a set of shortest path trees, the original k-NK query on the graph can be reduced to answering the k-NK query on a set of trees.
Q6. What is the way to calculate cand(v)?
Since CT(λ) keeps the structural information of all keyword nodes in T , it is sufficient to search only on CT(λ) to calculate candλ(v).
Q7. How does the Algorithm 7 construct a distance preserving balanced tree?
THEOREM 5. Given a tree T(V, E), Algorithm 7 constructs a distance preserving balanced tree DT(T) for T using O(|V| · log |V|) time and O(|V|) space.
Q8. Why is the index size of pivot larger on FLARN than on DBLP?
This is because the complexity of pivot grows linearly with the tree depth, and the larger diameter of FLARN leads to a larger tree depth.
Q9. What is the definition of a distance preserving balanced tree?
In order to reduce the average depth of nodes to optimize both index space and query processing time, the authors introduce a new structure called distance preserving balanced tree for T (V, E), denoted as DT(T ).
Q10. How do the authors calculate cand(v) for a tree?
For each node v traversed, the authors merge candλ(v) into that of its parent node u by adding a distance dist(u, v) to the list candλ(v) (lines 3-5).
Q11. What is the second attempt to precompute k nearest nodes?
Their second attempt is that, for each node v on the tree T (V, E) and each keyword λ, the authors precompute its k nearest nodes that contain λ.
Q12. What is the role of the entry edge EEλ(q) = (u, u′)?
Let (u, u′) = EEλ(q). Since ENλ(q) is on the path from u to u′ on the tree T, the path from ENλ(q) to any keyword node in T will go through either u or u′.
Q13. How do the authors compute the candidate list for a tree?
Given a compact tree CT(λ) for a tree T and a keyword λ, the authors need to compute the candidate list candλ(v) for every node v on CT(λ).
Q14. How do the authors calculate candidate list cand(p)?
For each pivot p of v as well as v itself, the authors calculate distT(p, v) on the original tree T, and add the element v : distT(p, v) to the candidate list candλ(p) (lines 4-5).
Q15. What is the first traversal on CT(λ)?
The first traversal on CT(λ) is a bottom-up one, such that the candidate list on each node is propagated to all its ancestors on CT(λ).
Q16. What is the difference between tree distance and witness distance?
Since a tree contains more structural information than a star, using tree distance will be more accurate than using witness distance for estimating the distance of two nodes.