Proceedings Article

FERRARI: Flexible and efficient reachability range assignment for graph indexing

TL;DR: A scalable and highly efficient index structure for the reachability problem over graphs that imposes an explicit bound on the size of the index and flexibly assigns approximate reachability ranges to nodes of the graph such that the number of index probes to answer a query is minimized.
Abstract: In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme where the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes to answer a query is minimized. The resulting tunable index structure generates a better range labeling if the space budget is increased, thus providing direct control over the trade-off between index size and query processing performance. By using a fast recursive querying method in conjunction with our index structure, we show that, in practice, reachability queries can be answered in the order of microseconds on an off-the-shelf computer, even for massive-scale real-world graphs. Our claims are supported by an extensive set of experimental results using a multitude of benchmark and real-world web-scale graph datasets.
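The interval labeling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the DFS post-order numbering, and the "merge the two ranges with the smallest gap" heuristic used to respect the budget are all assumptions made for illustration. Each vertex of a DAG stores at most `budget` identifier ranges; a range is either exact (every id in it is reachable) or approximate (a superset), and a hit on an approximate range is verified by recursing into the successors.

```python
from typing import Dict, List, Optional, Set, Tuple

Interval = Tuple[int, int, bool]  # (low id, high id, is_exact)
Dag = Dict[str, List[str]]


def assign_ids(dag: Dag) -> Dict[str, int]:
    """Number vertices in DFS post-order, so every successor gets a smaller id."""
    ids: Dict[str, int] = {}
    seen: Set[str] = set()

    def visit(v: str) -> None:
        seen.add(v)
        for w in dag.get(v, []):
            if w not in seen:
                visit(w)
        ids[v] = len(ids)

    for v in dag:
        if v not in seen:
            visit(v)
    return ids


def coalesce(ranges: List[Interval]) -> List[Interval]:
    """Union overlapping ranges; adjacent ranges are joined only if both are exact."""
    merged: List[Interval] = []
    for lo, hi, exact in sorted(ranges):
        if merged:
            plo, phi, pexact = merged[-1]
            if lo <= phi or (lo == phi + 1 and pexact and exact):
                merged[-1] = (plo, max(phi, hi), pexact and exact)
                continue
        merged.append((lo, hi, exact))
    return merged


def shrink(ranges: List[Interval], budget: int) -> List[Interval]:
    """Assumed heuristic: merge the neighbours with the smallest gap until the
    budget is met. The merged range is approximate (it may cover unreachable ids)."""
    ranges = list(ranges)
    while len(ranges) > budget:
        i = min(range(len(ranges) - 1),
                key=lambda j: ranges[j + 1][0] - ranges[j][1])
        ranges[i:i + 2] = [(ranges[i][0], ranges[i + 1][1], False)]
    return ranges


def build_index(dag: Dag, budget: int):
    """Label every vertex with at most `budget` ranges (budget >= 1)."""
    ids = assign_ids(dag)
    labels: Dict[str, List[Interval]] = {}
    for v in sorted(ids, key=ids.get):  # successors are labeled first
        ranges: List[Interval] = [(ids[v], ids[v], True)]
        for w in dag.get(v, []):
            ranges.extend(labels[w])
        labels[v] = shrink(coalesce(ranges), budget)
    return ids, labels


def reachable(dag: Dag, ids, labels, u: str, v: str,
              seen: Optional[Set[str]] = None) -> bool:
    """Probe u's ranges; recurse into successors only on an approximate hit."""
    if u == v:
        return True
    seen = set() if seen is None else seen
    seen.add(u)
    target = ids[v]
    for lo, hi, exact in labels[u]:
        if lo <= target <= hi:
            if exact:
                return True
            return any(w not in seen and reachable(dag, ids, labels, w, v, seen)
                       for w in dag.get(u, []))
    return False  # target id lies outside every range: definitely unreachable
```

A larger `budget` leaves more ranges exact, so fewer queries fall back to the recursive check; that is the size versus query-time trade-off the abstract describes.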
Citations
Proceedings Article
18 Jun 2014
TL;DR: This work investigates a new approach to the design of distributed, shared-nothing RDF engines that combines join-ahead pruning via a novel form of RDF graph summarization with a locality-based, horizontal partitioning of RDF triples into a grid-like, distributed index structure.
Abstract: We investigate a new approach to the design of distributed, shared-nothing RDF engines. Our engine, coined "TriAD", combines join-ahead pruning via a novel form of RDF graph summarization with a locality-based, horizontal partitioning of RDF triples into a grid-like, distributed index structure. The multi-threaded and distributed execution of joins in TriAD is facilitated by an asynchronous Message Passing protocol which allows us to run multiple join operators along a query plan in a fully parallel, asynchronous fashion. We believe that our architecture provides a so far unique approach to join-ahead pruning in a distributed environment, as the more classical form of sideways information passing would not permit for executing distributed joins in an asynchronous way. Our experiments over the LUBM, BTC and WSDTS benchmarks demonstrate that TriAD consistently outperforms centralized RDF engines by up to two orders of magnitude, while gaining a factor of more than three compared to the currently fastest, distributed engines. To our knowledge, we are thus able to report the so far fastest query response times for the above benchmarks using a mid-range server and regular Ethernet setup.

208 citations

Journal Article
01 Aug 2018
TL;DR: A new system GraphS is presented to efficiently detect constrained cycles in a dynamic graph, which is changing constantly, and return the satisfying cycles in real-time, to greatly speed-up query time and achieve high system throughput.
Abstract: As graph data is prevalent for an increasing number of Internet applications, continuously monitoring structural patterns in dynamic graphs in order to generate real-time alerts and trigger prompt actions becomes critical for many applications. In this paper, we present a new system GraphS to efficiently detect constrained cycles in a dynamic graph, which is changing constantly, and return the satisfying cycles in real-time. A hot point based index is built and efficiently maintained for each query so as to greatly speed up query time and achieve high system throughput. The GraphS system is developed at Alibaba to actively monitor various online fraudulent activities based on cycle detection. For a dynamic graph with hundreds of millions of edges and vertices, the system is capable of coping with a peak rate of tens of thousands of edge updates per second and finding all the cycles with predefined constraints with a 99.9% latency of 20 milliseconds.

108 citations


Cites methods from "FERRARI: Flexible and efficient rea..."

  • ...Representative works include GRAIL [26], FERRARI [21], and IP+ [25]....

    [...]

Proceedings Article
22 Jun 2013
TL;DR: TF-label is an efficient and scalable labeling scheme for processing reachability queries that is constructed based on a novel topological folding that recursively folds an input graph into half so as to reduce the label size, thus improving query efficiency.
Abstract: Reachability querying is a basic graph operation with numerous important applications in databases, network analysis, computational biology, software engineering, etc. Although many indexes have been proposed to answer reachability queries, most of them are only efficient for handling relatively small graphs. We propose TF-label, an efficient and scalable labeling scheme for processing reachability queries. TF-label is constructed based on a novel topological folding (TF) that recursively folds an input graph into half so as to reduce the label size, thus improving query efficiency. We show that TF-label is efficient to construct and propose efficient algorithms and optimization schemes. Our experiments verify that TF-label is significantly more scalable and efficient than the state-of-the-art methods in both index construction and query processing.

99 citations


Cites background from "FERRARI: Flexible and efficient rea..."

  • ...We are also aware of a recent work [22] that trades off query performance for reduced index size and indexing cost....

    [...]

Proceedings Article
18 Jun 2014
TL;DR: This paper introduces a general indexing framework that summarizes a family of reachability indices with the best performance among the existing techniques for static graphs, and proposes general and efficient algorithms for handling vertex insertions and deletions under this framework.
Abstract: Reachability queries are a fundamental type of queries on graphs that find important applications in numerous domains. Although a plethora of techniques have been proposed for reachability queries, most of them require that the input graph is static, i.e., they are inapplicable to the dynamic graphs (e.g., social networks and the Semantic Web) commonly encountered in practice. There exist a few techniques that can handle dynamic graphs, but none of them can scale to sizable graphs without significant loss of efficiency. To address this deficiency, this paper presents a novel study on reachability indices for large dynamic graphs. We first introduce a general indexing framework that summarizes a family of reachability indices with the best performance among the existing techniques for static graphs. Then, we propose general and efficient algorithms for handling vertex insertions and deletions under the proposed framework. In addition, we show that our update algorithms can be used to improve the existing reachability techniques on static graphs, and we also propose a new approach for constructing a reachability index from scratch under our framework. We experimentally evaluate our solution on a large set of benchmark datasets, and we demonstrate that our solution not only supports efficient updates on dynamic graphs, but also provides even better query performance than the state-of-the-art techniques for static graphs.

85 citations


Cites background or methods from "FERRARI: Flexible and efficient rea..."

  • ...and patents are from [8, 17], while GovWild, Yago2, Twitter, and Web-UK are used in [25]....

    [...]

  • ...Methods based on transitive closure retrieval [3, 7, 8, 10, 14, 19, 25, 28, 29] pre-compute and compress the transitive closures of each vertex in G....

    [...]

Journal Article
01 Aug 2014
TL;DR: This paper proposes a new randomized labeling approach to answer reachability queries, with the randomness obtained by independent permutation, and confirms the efficiency of the approach experimentally.
Abstract: Reachability query is a fundamental graph operation which answers whether a vertex can reach another vertex over a large directed graph G with n vertices and m edges, and has been extensively studied. In the literature, all the approaches compute a label for every vertex in a graph G by index construction offline. The query time for answering reachability queries online is affected by the quality of the labels computed in index construction. The three main costs are the index construction time, the index size, and the query time. Some of the up-to-date approaches can answer reachability queries efficiently, but spend non-linear time to construct an index. Some of the up-to-date approaches construct an index in linear time and space, but may need to depth-first search G at run-time in O(n + m). In this paper, as the first, we propose a new randomized labeling approach to answer reachability queries, and the randomness is by independent permutation. We conduct extensive experimental studies to compare with the up-to-date approaches using 19 large real datasets used in the existing work and synthetic datasets. We confirm the efficiency of our approach.

74 citations


Cites background or methods from "FERRARI: Flexible and efficient rea..."

  • ...propose Ferrari [24]....

    [...]

  • ...Such approaches include Tree+SSPI [7], GRIPP [26], GRAIL [29], and Ferrari [24]....

    [...]

  • ...The reachability query has been extensively studied over a decade [1, 16, 13, 23, 11, 28, 7, 26, 12, 8, 21, 20, 29, 27, 9, 10, 19, 24, 31], and the early work can be traced back to 1989 to compute transitive closure (TC) over a graph....

    [...]

  • ...We test Ferrari using the Ferrari-G index given in [24], because Ferrari-G is scalable to handle massive-scale graphs....

    [...]

  • ...GRAIL [29]: O(k) or O(n+m); O(kn); O(k(n+m)). Ferrari [24]: O(k) or O(n+m); O((k+s)n); O(k^2 m + S). IP+ (ours): O(k) or O(k n r^2); O((k+h)n); O((k+h)(m+n))...

    [...]

References
Book Chapter
01 Jan 2014
TL;DR: This chapter provides an overview of the fundamentals of algorithms and their links to self-organization, exploration, and exploitation.
Abstract: Algorithms are important tools for solving problems computationally. All computation involves algorithms, and the efficiency of an algorithm largely determines its usefulness. This chapter provides an overview of the fundamentals of algorithms and their links to self-organization, exploration, and exploitation. A brief history of recent nature-inspired algorithms for optimization is outlined in this chapter.

8,285 citations

Proceedings Article
16 May 2010
TL;DR: An in-depth comparison of three measures of influence, using a large amount of data collected from Twitter, is presented, suggesting that topological measures such as indegree alone reveal very little about the influence of a user.
Abstract: Directed links in social media could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user's influence on others, a concept that is crucial in sociology and viral marketing. In this paper, using a large amount of data collected from Twitter, we present an in-depth comparison of three measures of influence: indegree, retweets, and mentions. Based on these measures, we investigate the dynamics of user influence across topics and time. We make several interesting observations. First, popular users who have high indegree are not necessarily influential in terms of spawning retweets or mentions. Second, most influential users can hold significant influence over a variety of topics. Third, influence is not gained spontaneously or accidentally, but through concerted effort such as limiting tweets to a single topic. We believe that these findings provide new insights for viral marketing and suggest that topological measures such as indegree alone reveal very little about the influence of a user.

3,041 citations

Journal Article
TL;DR: YAGO2 is an extension of the YAGO knowledge base in which entities, facts, and events are anchored in both time and space; it contains 447 million facts about 9.8 million entities.

1,186 citations

Book
19 Feb 2010
TL;DR: This is the first comprehensive survey book in the emerging topic of graph data processing and contains extensive surveys on important graph topics such as graph languages, indexing, clustering, data generation, pattern mining, classification, keyword search, pattern matching, and privacy.
Abstract: Managing and Mining Graph Data is a comprehensive survey book in graph data analytics. It contains extensive surveys on important graph topics such as graph languages, indexing, clustering, data generation, pattern mining, classification, keyword search, pattern matching, and privacy. It also studies a number of domain-specific scenarios such as stream mining, web graphs, social networks, chemical and biological data. The chapters are written by leading researchers, and provide a broad perspective of the area. This is the first comprehensive survey book in the emerging topic of graph data processing. Managing and Mining Graph Data is designed for a varied audience composed of professors, researchers and practitioners in industry. This volume is also suitable as a reference book for advanced-level database students in computer science. About the Editors: Charu C. Aggarwal obtained his B.Tech in Computer Science from IIT Kanpur in 1993 and Ph.D. from MIT in 1996. He has worked as a researcher at IBM since then, and has published over 130 papers in major data mining conferences and journals. He has applied for or been granted over 70 US and International patents, and has thrice been designated a Master Inventor at IBM. He has received an IBM Corporate award for his work on data stream analytics, and an IBM Outstanding Innovation Award for his work on privacy technology. He has served on the executive committees of most major data mining conferences. He has served as an associate editor of the IEEE TKDE, as an associate editor of the ACM SIGKDD Explorations, and as an action editor of the DMKD Journal. He is a fellow of the IEEE, and a life-member of the ACM. Haixun Wang is currently a researcher at Microsoft Research Asia. He received the B.S. and the M.S. degree, both in computer science, from Shanghai Jiao Tong University in 1994 and 1996. He received the Ph.D. degree in computer science from the University of California, Los Angeles in 2000.
He subsequently worked as a researcher at IBM until 2009. His main research interests are database languages and systems, data mining, and information retrieval. He has published more than 100 research papers in refereed international journals and conference proceedings. He serves as an associate editor of the IEEE TKDE, and has served as a reviewer and program committee member of leading database conferences and journals.

531 citations

Proceedings Article
06 Jan 2002
TL;DR: In this paper, the authors propose a new data structure for representing all distances in a graph, which is distributed in the sense that it may be viewed as assigning labels to the vertices, such that a query involving vertices u and v may be answered using only the labels of u and v.
Abstract: Reachability and distance queries in graphs are fundamental to numerous applications, ranging from geographic navigation systems to Internet routing. Some of these applications involve huge graphs and yet require fast query answering. We propose a new data structure for representing all distances in a graph. The data structure is distributed in the sense that it may be viewed as assigning labels to the vertices, such that a query involving vertices u and v may be answered using only the labels of u and v. Our labels are based on 2-hop covers of the shortest paths, or of all paths, in a graph. For shortest paths, such a cover is a collection S of shortest paths such that for every two vertices u and v, there is a shortest path from u to v that is a concatenation of two paths from S. We describe an efficient algorithm for finding an almost optimal 2-hop cover of a given collection of paths. Our approach is general and can be applied to directed or undirected graphs, exact or approximate shortest paths, or to reachability queries. We study the proposed data structure using a combination of theoretical and experimental means. We implemented our algorithm and checked the size of the resulting data structure on several real-life networks from different application areas. Our experiments show that the total size of the labels is typically not much larger than the network itself, and is usually considerably smaller than an explicit representation of the transitive closure of the network.

512 citations
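
The 2-hop labeling idea in the abstract above, restricted to reachability, can be sketched in Python as follows. This is an illustration under stated assumptions, not the almost-optimal cover algorithm the abstract refers to: the construction below is a simple pruned breadth-first labeling with hubs processed in order of decreasing degree, and the names `build_two_hop`, `sweep`, and `reachable` are hypothetical. Each vertex u keeps a set `l_out[u]` of hubs it can reach and a set `l_in[u]` of hubs that can reach it; u reaches v exactly when the two sets share a hub, so a query reads only the labels of u and v.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]
Labels = Dict[str, Set[str]]


def build_two_hop(graph: Graph) -> Tuple[Labels, Labels]:
    """Build (l_out, l_in) hub labels with a pruned BFS from every vertex."""
    verts = sorted(set(graph) | {w for ws in graph.values() for w in ws})
    reverse: Graph = {v: [] for v in verts}
    for v, ws in graph.items():
        for w in ws:
            reverse[w].append(v)
    l_out: Labels = {v: set() for v in verts}
    l_in: Labels = {v: set() for v in verts}

    def sweep(hub: str, adj: Graph, forward: bool) -> None:
        queue, seen = deque([hub]), {hub}
        while queue:
            x = queue.popleft()
            if x != hub:
                src, dst = (hub, x) if forward else (x, hub)
                if not l_out[src].isdisjoint(l_in[dst]):
                    continue  # pair already covered by an earlier hub: prune
            # forward sweep: hub reaches x; backward sweep: x reaches hub
            (l_in if forward else l_out)[x].add(hub)
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    queue.append(y)

    # Assumed ordering heuristic: high-degree vertices become hubs first.
    by_degree = sorted(verts,
                       key=lambda v: -(len(graph.get(v, [])) + len(reverse[v])))
    for hub in by_degree:
        sweep(hub, graph, forward=True)
        sweep(hub, reverse, forward=False)
    return l_out, l_in


def reachable(l_out: Labels, l_in: Labels, u: str, v: str) -> bool:
    """u reaches v iff some hub appears in both l_out[u] and l_in[v]."""
    return not l_out[u].isdisjoint(l_in[v])
```

Every vertex ends up as a hub in its own two labels, so `reachable(l_out, l_in, u, u)` holds without a special case. The pruning step is what keeps the labels small: a vertex is skipped whenever a previously processed hub already certifies the pair.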