Showing papers on "Graph database" published in 2016


Posted Content
TL;DR: A scalable approach for semi-supervised learning on graph-structured data, based on an efficient variant of convolutional neural networks that operate directly on graphs; it outperforms related methods by a significant margin.
Abstract: We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.
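To make the idea of a convolution that operates directly on a graph concrete, here is a minimal sketch of one propagation step. The abstract does not spell out the formula; the rule used below, ReLU(D^-1/2 (A + I) D^-1/2 H W), is the form commonly associated with this line of work and should be read as an assumption. The names `gcn_layer` and `matmul` are illustrative, and the dense matrices are only for readability: a practical implementation would use sparse operations to get the linear scaling in the number of edges that the abstract describes.

```python
import math


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 . H . W)."""
    n = len(adj)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Symmetric normalisation by the degrees of both endpoints.
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbour features, apply the linear map, then ReLU.
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Two connected nodes with one feature each and an identity weight.
hidden = gcn_layer([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]], [[1.0]])
```

Each output row mixes a node's own features with those of its neighbours, which is how the hidden representation comes to encode both local structure and node features.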

15,696 citations


Book ChapterDOI
08 Oct 2016
TL;DR: This paper proposes the Graph Long Short-Term Memory (Graph LSTM) network, a generalization of LSTMs from sequential or multi-dimensional data to general graph-structured data.
Abstract: By taking the semantic object parsing task as an exemplar application scenario, we propose the Graph Long Short-Term Memory (Graph LSTM) network, which is the generalization of LSTM from sequential data or multi-dimensional data to general graph-structured data. Particularly, instead of evenly and fixedly dividing an image into pixels or patches as in existing multi-dimensional LSTM structures (e.g., Row, Grid and Diagonal LSTMs), we take each arbitrary-shaped superpixel as a semantically consistent node, and adaptively construct an undirected graph for each image, where the spatial relations of the superpixels are naturally used as edges. Constructed on such an adaptive graph topology, the Graph LSTM is more naturally aligned with the visual patterns in the image (e.g., object boundaries or appearance similarities) and provides a more economical information propagation route. Furthermore, for each optimization step over Graph LSTM, we propose to use a confidence-driven scheme to update the hidden and memory states of nodes progressively until all nodes are updated. In addition, for each node, the forget gates are adaptively learned to capture different degrees of semantic correlation with neighboring nodes. Comprehensive evaluations on four diverse semantic object parsing datasets demonstrate the significant superiority of our Graph LSTM over other state-of-the-art solutions.

312 citations


Proceedings ArticleDOI
15 Oct 2016
TL;DR: Graphicionado augments the vertex programming paradigm, allowing different graph analytics applications to be mapped to the same accelerator framework, while maintaining flexibility through a small set of reconfigurable blocks, for high-performance, energy-efficient processing of graph analytics workloads.
Abstract: Graphs are one of the key data structures for many real-world computing applications and the importance of graph analytics is ever-growing. While existing software graph processing frameworks improve programmability of graph analytics, underlying general purpose processors still limit the performance and energy efficiency of graph analytics. We architect a domain-specific accelerator, Graphicionado, for high-performance, energy-efficient processing of graph analytics workloads. For efficient graph analytics processing, Graphicionado exploits not only data structure-centric datapath specialization, but also memory subsystem specialization, all the while taking advantage of the parallelism inherent in this domain. Graphicionado augments the vertex programming paradigm, allowing different graph analytics applications to be mapped to the same accelerator framework, while maintaining flexibility through a small set of reconfigurable blocks. This paper describes Graphicionado pipeline design choices in detail and gives insights on how Graphicionado combats application execution inefficiencies on general-purpose CPUs. Our results show that Graphicionado achieves a 1.76 − 6.54x speedup while consuming 50 − 100x less energy compared to a state-of-the-art software graph analytics processing framework executing 32 threads on a 16-core Haswell Xeon processor.

255 citations


Posted Content
TL;DR: The importance of formalisation for graph query languages is discussed, with a summary of what is known about SPARQL, Cypher, and Gremlin in terms of expressivity and complexity; and an outline of possible future directions for the area.
Abstract: We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges; and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start with graph patterns, in which a graph-structured query is matched against the data. Thereafter we discuss navigational expressions, in which patterns can be matched recursively against the graph to navigate paths of arbitrary length; we give an overview of what kinds of expressions have been proposed, and how they can be combined with graph patterns. We also discuss several semantics under which queries using the previous features can be evaluated, what effects the selection of features and semantics has on complexity, and offer examples of such features in three modern languages that are used to query graphs: SPARQL, Cypher and Gremlin. We conclude by discussing the importance of formalisation for graph query languages; a summary of what is known about SPARQL, Cypher and Gremlin in terms of expressivity and complexity; and an outline of possible future directions for the area.
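As a small illustration of a navigational expression over an edge-labelled graph, the sketch below evaluates a "one or more edges with a given label" path from a start node, similar in spirit to a path expression such as `knows+`. The function name `reachable_via` and the triple-list representation are assumptions made for this example, not something taken from the survey.

```python
from collections import deque


def reachable_via(triples, start, label):
    """Nodes reachable from `start` by one or more edges carrying `label`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Scan all (subject, label, object) triples leaving the current node.
        for subj, lab, obj in triples:
            if subj == node and lab == label and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "likes", "dave"),
]
friends = reachable_via(triples, "alice", "knows")
```

The linear scan per node keeps the sketch short; an indexed adjacency structure would be the obvious choice for anything larger than a toy graph.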

213 citations


Proceedings Article
09 Jul 2016
TL;DR: The rich textual context information in a text corpus is incorporated to expand the semantic structure of the knowledge graph, and each relation is allowed to have different representations for different head and tail entities to better handle 1-to-N, N-to-1 and N-to-N relations.
Abstract: Learning the representations of a knowledge graph has attracted significant research interest in the field of intelligent Web. By regarding each relation as one translation from head entity to tail entity, translation-based methods including TransE, TransH and TransR are simple and effective, and achieve state-of-the-art performance. However, they still suffer from the following issues: (i) low performance when modeling 1-to-N, N-to-1 and N-to-N relations; (ii) limited performance due to the structural sparseness of the knowledge graph. In this paper, we propose a novel knowledge graph representation learning method by taking advantage of the rich context information in a text corpus. The rich textual context information is incorporated to expand the semantic structure of the knowledge graph, and each relation is allowed to have different representations for different head and tail entities to better handle 1-to-N, N-to-1 and N-to-N relations. Experiments on multiple benchmark datasets show that our proposed method successfully addresses the above issues and significantly outperforms the state-of-the-art methods.
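The "relation as a translation from head to tail" idea can be sketched with the scoring function usually attributed to TransE, the distance between head + relation and tail. This illustrates the baseline family the abstract refers to, not the proposed text-enhanced method; the function name `transe_score` and the toy vectors are assumptions for the example.

```python
import math


def transe_score(head, relation, tail):
    """Euclidean distance between (head + relation) and tail; lower is better."""
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(head, relation, tail)))


# A triple whose tail sits exactly at head + relation scores 0.
perfect = transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
# A poorly fitting triple scores higher.
poor = transe_score([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

The sketch also hints at the 1-to-N issue mentioned above: with a single vector per relation, two different tails can both score 0 for the same head and relation only if their embeddings coincide.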

191 citations


Proceedings ArticleDOI
26 Jun 2016
TL;DR: A general approach to speeding up graph computing on CPUs by reducing the cache miss ratio across different graph algorithms, together with a new algorithm that reduces time complexity and improves efficiency through optimization techniques based on a new data structure.
Abstract: CPU cache performance is one of the key factors in the efficiency of database systems. It is reported that cache miss latency accounts for half of the execution time in database systems. To improve CPU cache performance, there are studies that support searching, including cache-oblivious and cache-conscious trees. In this paper, we focus on CPU speedup for graph computing in general by reducing the CPU cache miss ratio for different graph algorithms. The approaches dealing with trees are not applicable to graphs, which are complex in nature. In this paper, we explore a general approach to speed up CPU computing, in order to further enhance the efficiency of graph algorithms without changing the graph algorithms (implementations) or the data structures used. That is, we aim at designing a general solution that is specific neither to a graph algorithm nor to a data structure. The approach studied in this work is graph ordering, which is to find the optimal permutation among all nodes in a given graph by keeping nodes that will be frequently accessed together locally, to minimize the CPU cache miss ratio. We prove the graph ordering problem is NP-hard, and give a basic algorithm with a bounded approximation. To improve on the basic algorithm, we further propose a new algorithm that reduces the time complexity and improves efficiency with new optimization techniques based on a new data structure. We conducted extensive experiments to evaluate our approach in comparison with 9 other possible graph orderings (such as the one obtained by METIS) using 8 large real graphs and 9 representative graph algorithms. We confirm that our approach can achieve high performance by reducing the CPU cache miss ratios.

131 citations


Journal ArticleDOI
TL;DR: A family of languages that enable combined data and topology querying for graph databases is presented, and it is shown to include efficient and highly expressive formalisms for querying both the structure of the data and the data itself.
Abstract: Graph databases have received much attention as of late due to numerous applications in which data is naturally viewed as a graph; these include social networks, RDF and the Semantic Web, biological databases, and many others. There are many proposals for query languages for graph databases that mainly fall into two categories. One views graphs as a particular kind of relational data and uses traditional relational mechanisms for querying. The other concentrates on querying the topology of the graph. These approaches, however, lack the ability to combine data and topology, which would allow queries asking how data changes along paths and patterns enveloping it. In this article, we present a comprehensive study of languages that enable such combination of data and topology querying. These languages come in two flavors. The first follows the standard approach of path queries, which specify how labels of edges change along a path, but now we extend them with ways of specifying how both labels and data change. From the complexity point of view, the right type of formalisms are subclasses of register automata. These, however, are not well suited for querying. To overcome this, we develop several types of extended regular expressions to specify paths with data and study their querying power and complexity. The second approach adopts the popular XML language XPath and extends it from XML documents to graphs. Depending on the exact set of allowed features, we have a family of languages, and our study shows that it includes efficient and highly expressive formalisms for querying both the structure of the data and the data itself.

101 citations


Journal ArticleDOI
TL;DR: This study suggests that graph databases provide a flexible solution for the integration of multiple types of biological data and facilitate exploratory data mining to support hypothesis generation.
Abstract: Systems biology experiments generate large volumes of data of multiple modalities and this information presents a challenge for integration due to a mix of complexity together with rich semantics. Here, we describe how graph databases provide a powerful framework for storage, querying and envisioning of biological data. We show how graph databases are well suited for the representation of biological information, which is typically highly connected, semi-structured and unpredictable. We outline an application case that uses the Neo4j graph database for building and querying a prototype network to provide biological context to asthma related genes. Our study suggests that graph databases provide a flexible solution for the integration of multiple types of biological data and facilitate exploratory data mining to support hypothesis generation.

90 citations


Proceedings ArticleDOI
24 Jun 2016
TL;DR: GraphFrames is presented, an integrated system that lets users combine graph algorithms, pattern matching and relational queries, and optimizes work across them, while enabling optimizations across workflow steps that cannot occur in current systems.
Abstract: Graph data is prevalent in many domains, but it has usually required specialized engines to analyze. This design is onerous for users and precludes optimization across complete workflows. We present GraphFrames, an integrated system that lets users combine graph algorithms, pattern matching and relational queries, and optimizes work across them. GraphFrames generalize the ideas in previous graph-on-RDBMS systems, such as GraphX and Vertexica, by letting the system materialize multiple views of the graph (not just the specific triplet views in these systems) and executing both iterative algorithms and pattern matching using joins. To make applications easy to write, GraphFrames provide a concise, declarative API based on the "data frame" concept in R that can be used for both interactive queries and standalone programs. Under this API, GraphFrames use a graph-aware join optimization algorithm across the whole computation that can select from the available views. We implement GraphFrames over Spark SQL, enabling parallel execution on Spark and integration with custom code. We find that GraphFrames make it easy to express end-to-end workflows and match or exceed the performance of standalone tools, while enabling optimizations across workflow steps that cannot occur in current systems. In addition, we show that GraphFrames' view abstraction makes it easy to further speed up interactive queries by registering the appropriate view, and that the combination of graph and relational data allows for other optimizations, such as attribute-aware partitioning.
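The statement that pattern matching can be executed using joins can be illustrated without Spark. The sketch below finds every two-hop pattern a -> b -> c by joining an edge list with itself on the shared middle vertex. It is a plain-Python stand-in for the idea, not the GraphFrames API; the function name `two_hop_paths` and the sample edges are assumptions for the example.

```python
from collections import defaultdict


def two_hop_paths(edges):
    """Return (a, b, c) for every pair of edges a->b and b->c (hash join on b)."""
    # Build side of the join: index edges by their source vertex.
    by_src = defaultdict(list)
    for src, dst in edges:
        by_src[src].append(dst)
    # Probe side: for each edge a->b, look up the edges that leave b.
    return sorted((a, b, c) for a, b in edges for c in by_src.get(b, []))


edges = [("a", "b"), ("b", "c"), ("b", "d")]
paths = two_hop_paths(edges)
```

Longer patterns follow the same shape, one additional join per extra edge in the pattern, which is where a join optimizer that can pick among materialized views becomes useful.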

90 citations


Proceedings ArticleDOI
14 Jun 2016
TL;DR: TCM is presented, a novel generalized graph stream summary that can effectively and efficiently support analytics over graph streams, which demonstrates its potential to start a new line of research and applications in graph stream management.
Abstract: A graph stream, which refers to the graph with edges being updated sequentially in a form of a stream, has important applications in cyber security and social networks. Due to the sheer volume and highly dynamic nature of graph streams, the practical way of handling them is by summarization. Given a graph stream G, directed or undirected, the problem of graph stream summarization is to summarize G as SG with a much smaller (sublinear) space, linear construction time and constant maintenance cost for each edge update, such that SG allows many queries over G to be approximately conducted efficiently. The widely used practice of summarizing data streams is to treat each stream element independently by e.g., hash- or sample-based methods, without maintaining the connections (or relationships) between elements. Hence, existing methods can only solve ad-hoc problems, without supporting diversified and complicated analytics over graph streams. We present TCM, a novel generalized graph stream summary. Given an incoming edge, it summarizes both node and edge information in constant time. Consequently, the summary forms a graphical sketch where edges capture the connections inside elements, and nodes maintain relationships across elements. We discuss a wide range of supported queries and establish some error bounds. In addition, we experimentally show that TCM can effectively and efficiently support analytics over graph streams, which demonstrates its potential to start a new line of research and applications in graph stream management.
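A rough sketch of how a graph-shaped summary can absorb each edge in constant time: nodes are hashed into a fixed number of buckets, and edge weights are accumulated in a bucket-by-bucket matrix. This is a simplified illustration under stated assumptions (a single hash function, a square matrix, non-negative weights); the class name `GraphSketch` and its methods are hypothetical and are not taken from the paper.

```python
import hashlib


class GraphSketch:
    """Fixed-size summary of an edge stream (illustrative, single hash function)."""

    def __init__(self, width):
        self.width = width
        # cells[i][j] accumulates the weight of all edges from bucket i to bucket j.
        self.cells = [[0] * width for _ in range(width)]

    def _bucket(self, node):
        # Stable hash so results do not depend on Python's per-process hash seed.
        digest = hashlib.sha256(str(node).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add_edge(self, src, dst, weight=1):
        # Constant-time update: two hashes and one counter increment.
        self.cells[self._bucket(src)][self._bucket(dst)] += weight

    def edge_weight(self, src, dst):
        # With non-negative weights, collisions can only inflate the estimate.
        return self.cells[self._bucket(src)][self._bucket(dst)]


sketch = GraphSketch(width=16)
for src, dst in [("a", "b"), ("a", "b"), ("b", "c")]:
    sketch.add_edge(src, dst)
estimate = sketch.edge_weight("a", "b")
```

Because the summary is itself a small weighted graph over buckets, queries that follow connections, such as approximate reachability between buckets, remain possible, which is the property the abstract contrasts with per-element hashing or sampling.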

86 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: A novel two-stage co-segmentation framework is proposed, which introduces the weak background prior to establish a globally close-loop graph to represent the common object and union background separately, and a novel graph optimized-flexible manifold ranking algorithm is proposed to flexibly optimize the graph connection and node labels to co-segment the common objects.
Abstract: Aiming at automatically discovering the common objects contained in a set of relevant images and segmenting them as foreground simultaneously, object co-segmentation has become an active research topic in recent years. Although a number of approaches have been proposed to address this problem, many of them are designed with the misleading assumption, unscalable prior, or low flexibility and thus still suffer from certain limitations, which reduces their capability in the real-world scenarios. To alleviate these limitations, we propose a novel two-stage co-segmentation framework, which introduces the weak background prior to establish a globally close-loop graph to represent the common object and union background separately. Then a novel graph optimized-flexible manifold ranking algorithm is proposed to flexibly optimize the graph connection and node labels to co-segment the common objects. Experiments on three image datasets demonstrate that our method outperforms other state-of-the-art methods.

Proceedings ArticleDOI
24 Jun 2016
TL;DR: GraphTau is introduced, a time-evolving graph processing framework built on top of Apache Spark, a widely used distributed dataflow system that achieves high performance and fault tolerant graph stream processing via a number of optimizations.
Abstract: Time-evolving graph-structured big data arises naturally in many application domains such as social networks and communication networks. However, existing graph processing systems lack support for efficient computations on dynamic graphs. In this paper, we represent most computations on time-evolving graphs as (1) a stream of consistent and resilient graph snapshots, and (2) a small set of operators that manipulate such streams of snapshots. We then introduce GraphTau, a time-evolving graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphTau quickly builds fault-tolerant graph snapshots as each small batch of new data arrives. GraphTau achieves high-performance and fault-tolerant graph stream processing via a number of optimizations. GraphTau also unifies data streaming and graph streaming processing. Our preliminary evaluations on two representative datasets show promising results. Besides the performance benefit, the GraphTau API relieves programmers from handling graph snapshot generation, windowing operators and sophisticated differential computation mechanisms.

Journal ArticleDOI
01 Sep 2016
TL;DR: This paper presents GraphJet, an in-memory graph processing engine that maintains a real-time bipartite interaction graph between users and tweets and organizes the interaction graph into temporally-partitioned index segments that hold adjacency lists.
Abstract: This paper presents GraphJet, a new graph-based system for generating content recommendations at Twitter. As motivation, we trace the evolution of our formulation and approach to the graph recommendation problem, embodied in successive generations of systems. Two trends can be identified: supplementing batch with real-time processing and a broadening of the scope of recommendations from users to content. Both of these trends come together in GraphJet, an in-memory graph processing engine that maintains a real-time bipartite interaction graph between users and tweets. The storage engine implements a simple API, but one that is sufficiently expressive to support a range of recommendation algorithms based on random walks that we have refined over the years. Similar to Cassovary, a previous graph recommendation engine developed at Twitter, GraphJet assumes that the entire graph can be held in memory on a single server. The system organizes the interaction graph into temporally-partitioned index segments that hold adjacency lists. GraphJet is able to support rapid ingestion of edges while concurrently serving lookup queries through a combination of compact edge encoding and a dynamic memory allocation scheme that exploits power-law characteristics of the graph. Each GraphJet server ingests up to one million graph edges per second, and in steady state, computes up to 500 recommendations per second, which translates into several million edge read operations per second.
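The storage layout described above, temporally partitioned segments that each hold adjacency lists, can be sketched as follows. The class `SegmentedGraph`, its parameters, and in particular the policy of dropping the oldest segment once a fixed number of segments exists are assumptions made for this illustration; the abstract does not describe the retention policy or the edge encoding.

```python
from collections import defaultdict, deque


class SegmentedGraph:
    """Bipartite user->tweet edges stored in time-ordered segments (illustrative)."""

    def __init__(self, max_edges_per_segment, max_segments):
        self.max_edges = max_edges_per_segment
        # Assumption: a bounded deque silently discards the oldest segment.
        self.segments = deque(maxlen=max_segments)
        self._new_segment()

    def _new_segment(self):
        self.segments.append({"count": 0, "adj": defaultdict(list)})

    def add_edge(self, user, tweet):
        # Only the newest segment is written to; older ones are read-only.
        if self.segments[-1]["count"] >= self.max_edges:
            self._new_segment()
        segment = self.segments[-1]
        segment["adj"][user].append(tweet)
        segment["count"] += 1

    def neighbors(self, user):
        # A lookup walks the segments from oldest to newest.
        found = []
        for segment in self.segments:
            found.extend(segment["adj"].get(user, []))
        return found


graph = SegmentedGraph(max_edges_per_segment=2, max_segments=2)
for user, tweet in [("u1", "t1"), ("u1", "t2"), ("u1", "t3"),
                    ("u2", "t4"), ("u1", "t5")]:
    graph.add_edge(user, tweet)
recent = graph.neighbors("u1")
```

Writing only to the newest segment is what makes it plausible to ingest edges while serving lookups concurrently, since older segments never change.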

Journal ArticleDOI
TL;DR: This work proposes a novel distributed algorithm, DistGraph, which is the first approach demonstrated to scale to graphs with over a billion vertices and edges, and uses a set of optimizations and efficient collective communication operations to minimize information exchange.
Abstract: We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.

Book ChapterDOI
14 Nov 2016
TL;DR: This article describes a mapping from UML/OCL conceptual schemas to Blueprints, an abstraction layer on top of a variety of graph databases, and Gremlin, a graph traversal language, via an intermediate Graph metamodel.
Abstract: The need to store and manipulate large volumes of (unstructured) data has led to the development of several NoSQL databases for better scalability. Graph databases are a particular kind of NoSQL database that have proven their efficiency in storing and querying highly interconnected data, and have become a promising solution for multiple applications. While the mapping of conceptual schemas to relational databases is a well-studied field of research, there are only a few solutions that target conceptual modeling for NoSQL databases, and even fewer that focus on graph databases. This is especially true when dealing with the mapping of business rules and constraints in the conceptual schema. In this article we describe a mapping from UML/OCL conceptual schemas to Blueprints, an abstraction layer on top of a variety of graph databases, and Gremlin, a graph traversal language, via an intermediate Graph metamodel. Tool support is fully available.

Journal ArticleDOI
TL;DR: Experiments carried out on an archive of aerial images point out that the proposed approach significantly improves the retrieval performance compared to the state-of-the-art unsupervised RS image retrieval methods.
Abstract: This letter introduces a novel unsupervised graph-theoretic approach in the framework of region-based retrieval of remote sensing (RS) images. The proposed approach is characterized by two main steps: 1) modeling each image by a graph, which provides region-based image representation combining both local information and related spatial organization, and 2) retrieving the images in the archive that are most similar to the query image by evaluating graph-based similarities. In the first step, each image is initially segmented into distinct regions and then modeled by an attributed relational graph, where nodes and edges represent region characteristics and their spatial relationships, respectively. In the second step, a novel inexact graph matching strategy, which jointly exploits a subgraph isomorphism algorithm and a spectral graph embedding technique, is applied to match corresponding graphs and to retrieve images in the order of graph similarity. Experiments carried out on an archive of aerial images point out that the proposed approach significantly improves the retrieval performance compared to the state-of-the-art unsupervised RS image retrieval methods.

Proceedings ArticleDOI
16 May 2016
TL;DR: This work proposes STAR, a top-k knowledge graph search framework that has two components: a fast top-k algorithm for star queries, and an assembling algorithm for general graph queries that uses star query as a building block and iteratively sweeps the star match lists with a dynamically adjusted bound.
Abstract: Given a graph query Q posed on a knowledge graph G, top-k graph querying is to find k matches in G with the highest ranking score according to a ranking function. Fast top-k search in knowledge graphs is challenging as both graph traversal and similarity search are expensive. Conventional top-k graph search is typically based on threshold algorithm (TA), which can no longer fit the demand in the new setting. This work proposes STAR, a top-k knowledge graph search framework. It has two components: (a) a fast top-k algorithm for star queries, and (b) an assembling algorithm for general graph queries. The assembling algorithm uses star query as a building block and iteratively sweeps the star match lists with a dynamically adjusted bound. For top-k star graph query where an edge can be matched to a path with bounded length d, we develop a message passing algorithm, achieving time complexity O(d^2|E| + md) and space complexity linear to d|V| (assuming the size of Q and k is bounded by a constant), where m is the maximum node degree in G. STAR can further be leveraged to answer general graph queries by decomposing a query to multiple star queries and joining their results later. Learning-based techniques to optimize query decomposition are also developed. We experimentally verify that STAR is 5–10 times faster than the state-of-the-art TA-style graph search algorithm, and 10–100 times faster than a belief propagation approach.

Journal ArticleDOI
TL;DR: This article explores the "biodiversity knowledge graph" as a network of connected entities, such as taxa, taxonomic names, publications, people, species, sequences, images, and collections, and sketches a set of services and tools needed in order to construct the graph.
Abstract: One way to think about "core" biodiversity data is as a network of connected entities, such as taxa, taxonomic names, publications, people, species, sequences, images, and collections that form the "biodiversity knowledge graph". Many questions in biodiversity informatics can be framed as paths in this graph. This article explores this further, and sketches a set of services and tools we would need in order to construct the graph.

Proceedings ArticleDOI
13 Aug 2016
TL;DR: This paper presents a dedensification technique that losslessly compresses the neighborhood around high-degree nodes, and introduces a query processing technique that enables direct operation of graph query processing operations over the compressed data, without ever having to decompress the data.
Abstract: One of the most common operations on graph databases is graph pattern matching (e.g., graph isomorphism and more general types of "subgraph pattern matching"). In fact, in some graph query languages every single query is expressed as a graph matching operation. Consequently, there has been a significant amount of research effort in optimizing graph matching operations in graph database systems. As graph databases have scaled in recent years, so too has recent work on scaling graph matching operations. However, the performance of recent proposals for scaling graph pattern matching is limited by the presence of high-degree nodes. These high-degree nodes result in an explosion of intermediate result sizes during query execution, and therefore significant performance bottlenecks. In this paper we present a dedensification technique that losslessly compresses the neighborhood around high-degree nodes. Furthermore, we introduce a query processing technique that enables direct operation of graph query processing operations over the compressed data, without ever having to decompress the data. For pattern matching operations, we show how this technique can be implemented as a layer above existing graph database systems, so that the end-user can benefit from this technique without requiring modifications to the core graph database engine code. Our technique reduces the size of the intermediate result sets during query processing, and thereby improves query performance.

Book ChapterDOI
17 Oct 2016
TL;DR: This paper experimentally compares the efficiency of various database engines for the purposes of querying the Wikidata knowledge-base, which can be conceptualised as a directed edge-labelled graph where edges can be annotated with meta-information called qualifiers.
Abstract: In this paper, we experimentally compare the efficiency of various database engines for the purposes of querying the Wikidata knowledge-base, which can be conceptualised as a directed edge-labelled graph where edges can be annotated with meta-information called qualifiers. We take two popular SPARQL databases (Virtuoso, Blazegraph), a popular relational database (PostgreSQL), and a popular graph database (Neo4J) for comparison and discuss various options as to how Wikidata can be represented in the models of each engine. We design a set of experiments to test the relative query performance of these representations in the context of their respective engines. We first execute a large set of atomic lookups to establish a baseline performance for each test setting, and subsequently perform experiments on instances of more complex graph patterns based on real-world examples. We conclude with a summary of the strengths and limitations of the engines observed.

Proceedings ArticleDOI
13 Nov 2016
TL;DR: G-Store, a graph store that employs a novel slide-cache-rewind strategy to pipeline graph I/O and computing, is able to run different algorithms on trillion-edge graphs within tens of minutes, setting a new milestone in semi-external graph processing systems.
Abstract: High-performance graph processing brings great benefits to a wide range of scientific applications, e.g., biology networks, recommendation systems, and social networks, where such graphs have grown to terabytes of data with billions of vertices and trillions of edges. Consequently, storage performance plays a critical role in designing a high-performance computer system for graph analytics. In this paper, we present G-Store, a new graph store that incorporates three techniques to accelerate the I/O and computation of graph algorithms. First, G-Store develops a space-efficient tile format for graph data, which takes advantage of the symmetry present in graphs as well as a new smallest number of bits representation. Second, G-Store utilizes tile-based physical grouping on disks so that multi-core CPUs can achieve high cache and memory performance and fully utilize the throughput from an array of solid-state disks. Third, G-Store employs a novel slide-cache-rewind strategy to pipeline graph I/O and computing. With a modest amount of memory, G-Store utilizes a proactive caching strategy in the system so that all fetched graph data are fully utilized before being evicted from memory. We evaluate G-Store on a number of graphs against two state-of-the-art graph engines and show that G-Store achieves 2 to 8× saving in storage and outperforms both by 2 to 32×. G-Store is able to run different algorithms on trillion-edge graphs within tens of minutes, setting a new milestone in semi-external graph processing systems.

Book ChapterDOI
01 Jan 2016
TL;DR: This chapter provides a concise overview of big data storage systems that are capable of dealing with high velocity, high volumes, and high varieties of data and investigates the challenge of storing data in a secure and privacy-preserving way.
Abstract: This chapter provides an overview of big data storage technologies. It is the result of a survey of the current state of the art in data storage technologies in order to create a cross-sectorial technology roadmap. This chapter provides a concise overview of big data storage systems that are capable of dealing with high velocity, high volumes, and high varieties of data. It describes distributed file systems, NoSQL databases, graph databases, and NewSQL databases. The chapter investigates the challenge of storing data in a secure and privacy-preserving way. The social and economic impact of big data storage technologies is described, open research challenges highlighted, and three selected case studies are provided from the health, finance, and energy sector. Some of the key insights on big data storage are (1) in-memory databases and columnar databases typically outperform traditional relational database systems, (2) the major technical barrier to widespread up-take of big data storage solutions is missing standards, and (3) there is a need to address open research challenges related to the scalability and performance of graph databases.

Journal ArticleDOI
01 Jul 2016
TL;DR: A new distributed graph database, called Weaver, is introduced; it enables efficient, transactional graph analyses as well as strictly serializable ACID transactions on dynamic graphs, using a novel request ordering mechanism called refinable timestamps.
Abstract: Graph databases have become a common infrastructure component. Yet existing systems either operate on offline snapshots, provide weak consistency guarantees, or use expensive concurrency control techniques that limit performance. In this paper, we introduce a new distributed graph database, called Weaver, which enables efficient, transactional graph analyses as well as strictly serializable ACID transactions on dynamic graphs. The key insight that allows Weaver to combine strict serializability with horizontal scalability and high performance is a novel request ordering mechanism called refinable timestamps. This technique couples coarse-grained vector timestamps with a fine-grained timeline oracle to pay the overhead of strong consistency only when needed. Experiments show that Weaver enables a Bitcoin blockchain explorer that is 8x faster than Blockchain.info, and achieves 10.9x higher throughput than the Titan graph database on social network workloads and 4x lower latency than GraphLab on offline graph traversal workloads.
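The idea of paying for strong ordering only when needed can be sketched in Python: compare two vector timestamps cheaply, and fall back to a separate ordering service only when they are concurrent. This is a toy illustration of the general pattern, not Weaver's implementation. The `TimelineOracle` class below is a hypothetical stand-in that simply remembers its first decision per pair; unlike a real oracle, it makes no attempt to keep decisions across many transactions mutually consistent.

```python
from typing import Dict, FrozenSet, Tuple

VectorTs = Tuple[int, ...]


def compare(a: VectorTs, b: VectorTs) -> str:
    """Classify two equal-length vector timestamps."""
    a_le_b = all(x <= y for x, y in zip(a, b))
    b_le_a = all(y <= x for x, y in zip(a, b))
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


class TimelineOracle:
    """Hypothetical stand-in: fixes an order for a pair once, then reuses it."""

    def __init__(self) -> None:
        self._decided: Dict[FrozenSet[str], str] = {}
        self.calls = 0

    def first(self, id_a: str, id_b: str) -> str:
        self.calls += 1
        key = frozenset((id_a, id_b))
        # Arbitrary but stable choice: whoever is named first on the first ask.
        return self._decided.setdefault(key, id_a)


def first_of(tx_a: Tuple[str, VectorTs], tx_b: Tuple[str, VectorTs],
             oracle: TimelineOracle) -> str:
    """Return the ID of the transaction ordered first."""
    (id_a, ts_a), (id_b, ts_b) = tx_a, tx_b
    relation = compare(ts_a, ts_b)
    if relation == "before":
        return id_a
    if relation == "after":
        return id_b
    # Only unordered pairs pay the cost of consulting the oracle.
    return oracle.first(id_a, id_b)


oracle = TimelineOracle()
cheap = first_of(("t1", (1, 2)), ("t2", (2, 2)), oracle)      # ordered by timestamps
refined = first_of(("t3", (2, 1)), ("t4", (1, 2)), oracle)    # needs the oracle
repeat = first_of(("t4", (1, 2)), ("t3", (2, 1)), oracle)     # same answer again
```

In this run only the concurrent pair reaches the oracle; the pair that the vector timestamps already order is resolved locally.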

Proceedings ArticleDOI
26 Jun 2016
TL;DR: This demonstration presents the Cloud Multidatastore Query Language (CloudMdsQL), a functional SQL-like language capable of querying multiple heterogeneous data stores within a single query that may contain embedded invocations to each data store's native query interface, together with its query engine.
Abstract: The proliferation of different cloud data management infrastructures has made multistore systems a major topic in today's cloud landscape. In this demonstration, we present a Cloud Multidatastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store's native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized. Within our demonstration, we focus on two use cases, each involving four diverse data stores (graph, document, relational, and key-value), with their corresponding CloudMdsQL queries. The query execution flows are visualized by an embedded real-time monitoring subsystem. The users can also try out different ad-hoc queries, not necessarily in the context of the use cases.

Journal ArticleDOI
TL;DR: This paper presents an overview of existing data mining and graph processing frameworks that deal with very big graphs, surveys current research on data mining and pattern mining in big graphs, and discusses the main research issues in the field.
Abstract: Big graph mining is an important research area and it has attracted considerable attention. It makes it possible to process, analyze, and extract meaningful information from large amounts of graph data. Big graph mining has been highly motivated not only by the tremendously increasing size of graphs but also by its huge number of applications. Such applications include bioinformatics, chemoinformatics and social networks. One of the most challenging tasks in big graph mining is pattern mining in big graphs. This task consists of using data mining algorithms to discover interesting, unexpected and useful patterns in large amounts of graph data. It also aims to provide a deeper understanding of graph data. In this context, several graph processing frameworks and scalable data mining/pattern mining techniques have been proposed to deal with very big graphs. This paper gives an overview of existing data mining and graph processing frameworks that deal with very big graphs. Then it presents a survey of current research in the field of data mining/pattern mining in big graphs and discusses the main research issues related to this field. It also gives a categorization of distributed data mining and machine learning techniques, graph processing frameworks and large-scale pattern mining approaches.

Journal ArticleDOI
01 Mar 2016
TL;DR: This work develops a new open-source system, called Quegel, for querying big graphs, which treats queries as first-class citizens in its design and provides a convenient interface for constructing graph indexes, which significantly improve query performance but are not supported by existing graph-parallel systems.
Abstract: Pioneered by Google's Pregel, many distributed systems have been developed for large-scale graph analytics. These systems employ a user-friendly "think like a vertex" programming model, and exhibit good scalability for tasks where the majority of graph vertices participate in computation. However, the design of these systems can seriously under-utilize the resources in a cluster for processing light-workload graph queries, where only a small fraction of vertices need to be accessed. In this work, we develop a new open-source system, called Quegel, for querying big graphs. Quegel treats queries as first-class citizens in its design: users only need to specify the Pregel-like algorithm for a generic query, and Quegel processes light-workload graph queries on demand, using a novel superstep-sharing execution model to effectively utilize the cluster resources. Quegel further provides a convenient interface for constructing graph indexes, which significantly improve query performance but are not supported by existing graph-parallel systems. Our experiments verified that Quegel is highly efficient in answering various types of graph queries and is up to orders of magnitude faster than existing systems.

Journal ArticleDOI
Shuai Ma, Jia Li, Chunming Hu, Xuelian Lin, Jinpeng Huai
TL;DR: In this article, the authors argue that big graph search is the new search paradigm that social computing applications call for, and they analyze graph search from an evolutionary point of view, supported by evidence from both industry and academia.
Abstract: On one hand, compared with traditional relational and XML models, graphs have more expressive power and are widely used today. On the other hand, various applications of social computing trigger a pressing need for a new search paradigm. In this article, we argue that big graph search is the one filling this gap. We first introduce the application of graph search in various scenarios. We then formalize the graph search problem, and give an analysis of graph search from an evolutionary point of view, followed by evidence from both industry and academia. After that, we analyze the difficulties and challenges of big graph search. Finally, we present three classes of techniques towards big graph search: query techniques, data techniques and distributed computing techniques.

Posted Content
20 Oct 2016
Authors: Renzo Angles (Universidad de Talca & Center for Semantic Web Research); Marcelo Arenas (Pontificia Universidad Católica de Chile & Center for Semantic Web Research); Pablo Barceló (DCC, Universidad de Chile & Center for Semantic Web Research); Aidan Hogan (DCC, Universidad de Chile & Center for Semantic Web Research); Juan Reutter (Pontificia Universidad Católica de Chile & Center for Semantic Web Research); Domagoj Vrgoč (Pontificia Universidad Católica de Chile & Center for Semantic Web Research)

Journal ArticleDOI
TL;DR: This paper designs iGraph, an incremental graph processing system for dynamic graphs with continuous updates; experimental results show that, on real-life datasets, iGraph outperforms the original GraphX in both graph update and graph computation.
Abstract: With the popularity of social networks, the demand for real-time processing of graph data is increasing. However, most of the existing graph systems adopt a batch processing mode, so the overhead of maintaining and processing dynamic graphs is significantly high. In this paper, we design iGraph, an incremental graph processing system for dynamic graphs with continuous updates. The contributions of iGraph include: 1) a hash-based graph partition strategy to enable fine-grained graph updates; 2) a vertex-based graph computing model to support incremental data processing; 3) hotspot detection and rebalancing methods to address the workload imbalance problem during incremental processing. Through the general-purpose API, iGraph can be used to implement various graph processing algorithms such as PageRank. We have implemented iGraph on Apache Spark, and experimental results show that for real-life datasets, iGraph outperforms the original GraphX with respect to graph update and graph computation.
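The first contribution, hash-based partitioning for fine-grained updates, can be illustrated with a small Python sketch. It is not iGraph's code and does not use Spark; the class name `PartitionedGraph`, the use of CRC32 as the hash, and the choice to place an edge in the partition of its source vertex are all assumptions made for the example.

```python
import zlib
from typing import Dict, List, Set


class PartitionedGraph:
    """Toy adjacency store split into hash partitions (illustrative only)."""

    def __init__(self, num_partitions: int) -> None:
        self.parts: List[Dict[str, Set[str]]] = [dict() for _ in range(num_partitions)]

    def partition_of(self, vertex: str) -> int:
        # CRC32 is used because it is stable across runs, unlike hash() on str.
        return zlib.crc32(vertex.encode("utf-8")) % len(self.parts)

    def add_edge(self, src: str, dst: str) -> int:
        """Insert one edge; only the partition owning src is modified."""
        pid = self.partition_of(src)
        self.parts[pid].setdefault(src, set()).add(dst)
        return pid

    def neighbors(self, vertex: str) -> Set[str]:
        return self.parts[self.partition_of(vertex)].get(vertex, set())


graph = PartitionedGraph(num_partitions=4)
touched = graph.add_edge("alice", "bob")
untouched = [i for i in range(4) if i != touched and not graph.parts[i]]
```

Because the owning partition is computed directly from the vertex ID, a single edge update locates and rewrites one partition and leaves the others as they were, which is the property a batch-oriented layout lacks.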

Proceedings ArticleDOI
TL;DR: This work presents a novel platform for the interactive visualization of very large graphs that involves an offline preprocessing phase that builds the layout of the graph by assigning coordinates to its nodes with respect to a Euclidean plane and translates user operations into simple and very efficient spatial operations in the backend.
Abstract: We present a novel platform for the interactive visualization of very large graphs. The platform enables the user to interact with the visualized graph in a way that is very similar to the exploration of maps at multiple levels. Our approach involves an offline preprocessing phase that builds the layout of the graph by assigning coordinates to its nodes with respect to a Euclidean plane. The respective points are indexed with a spatial data structure, i.e., an R-tree, and stored in a database. Multiple abstraction layers of the graph based on various criteria are also created offline, and they are indexed similarly so that the user can explore the dataset at different levels of granularity, depending on her particular needs. Then, our system translates user operations into simple and very efficient spatial operations (i.e., window queries) in the backend. This technique allows fine-grained access to very large graphs with extremely low latency and memory requirements and without compromising the functionality of the tool. Our web-based prototype supports three main operations: (1) interactive navigation, (2) multi-level exploration, and (3) keyword search on the graph metadata.
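A window query over pre-laid-out node coordinates can be sketched in Python as follows. Note the substitution: the abstract describes an R-tree stored in a database, whereas this example uses a much simpler uniform grid held in memory, chosen only to keep the sketch short. The class name `GridIndex`, the cell size, and the sample coordinates are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Cell = Tuple[int, int]
Entry = Tuple[str, float, float]


class GridIndex:
    """Uniform grid over 2D points, standing in for an R-tree."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells: DefaultDict[Cell, List[Entry]] = defaultdict(list)

    def _cell(self, x: float, y: float) -> Cell:
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, node_id: str, x: float, y: float) -> None:
        self.cells[self._cell(x, y)].append((node_id, x, y))

    def window(self, xmin: float, ymin: float,
               xmax: float, ymax: float) -> List[str]:
        """Return IDs of nodes inside the viewport rectangle, sorted by ID."""
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits: List[str] = []
        # Visit only the grid cells overlapping the window, not every node.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for node_id, x, y in self.cells.get((cx, cy), ()):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(node_id)
        return sorted(hits)


index = GridIndex(cell_size=4.0)
for node_id, x, y in [("a", 1, 1), ("b", 5, 5), ("c", 12, 3), ("d", -2, -2)]:
    index.insert(node_id, x, y)

visible = index.window(0, 0, 6, 6)      # what a viewport over [0,6] x [0,6] shows
panned = index.window(-3, -3, 1, 1)     # the same viewport after panning
```

Panning or zooming then amounts to issuing another `window` call with new bounds, so the work per interaction depends on the viewport contents rather than on the size of the whole graph.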