
Showing papers on "Graph database published in 2015"


Journal ArticleDOI
TL;DR: In this survey, the vertex-centric approach to graph processing is overviewed, TLAV frameworks are deconstructed into four main components and respectively analyzed, and TLAV implementations are reviewed and categorized.
Abstract: The vertex-centric programming model is an established computational paradigm recently incorporated into distributed processing frameworks to address challenges in large-scale graph processing. Billion-node graphs that exceed the memory capacity of commodity machines are not well supported by popular Big Data tools like MapReduce, which are notoriously poor performing for iterative graph algorithms such as PageRank. In response, a new type of framework challenges one to “think like a vertex” (TLAV) and implements user-defined programs from the perspective of a vertex rather than a graph. Such an approach improves locality, demonstrates linear scalability, and provides a natural way to express and compute many iterative graph algorithms. These frameworks are simple to program and widely applicable but, like an operating system, are composed of several intricate, interdependent components, of which a thorough understanding is necessary in order to elicit top performance at scale. To this end, the first comprehensive survey of TLAV frameworks is presented. In this survey, the vertex-centric approach to graph processing is overviewed, TLAV frameworks are deconstructed into four main components and respectively analyzed, and TLAV implementations are reviewed and categorized.

267 citations
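The "think like a vertex" model the survey describes can be sketched in a few lines: every vertex runs the same compute step on messages from its neighbors, synchronized in supersteps. The Python below is an illustrative Pregel-style PageRank on a toy graph, not any framework's actual API; all names are ours.

```python
def pagerank_tlav(adj, supersteps=20, d=0.85):
    """Vertex-centric PageRank sketch. adj maps each vertex to its out-neighbors."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(supersteps):
        # message phase: each vertex sends rank/out-degree along its edges
        inbox = {v: [] for v in adj}
        for v, nbrs in adj.items():
            share = rank[v] / len(nbrs) if nbrs else 0.0
            for u in nbrs:
                inbox[u].append(share)
        # vertex-local compute phase: combine incoming messages
        rank = {v: (1 - d) / n + d * sum(inbox[v]) for v in adj}
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
ranks = pagerank_tlav(graph)
```

On the 3-cycle above the ranks stay uniform at 1/3, which makes the sketch easy to check; real frameworks add partitioning, message combiners, and halting votes on top of this same per-vertex program.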


Proceedings ArticleDOI
27 May 2015
TL;DR: This paper describes the LDBC Social Network Benchmark (SNB), and presents database benchmarking innovation in terms of graph query functionality tested, correlated graph generation techniques, as well as a scalable benchmark driver on a workload with complex graph dependencies.
Abstract: The Linked Data Benchmark Council (LDBC) is now two years underway and has gathered strong industrial participation for its mission to establish benchmarks, and benchmarking practices for evaluating graph data management systems. The LDBC introduced a new choke-point driven methodology for developing benchmark workloads, which combines user input with input from expert systems architects, which we outline. This paper describes the LDBC Social Network Benchmark (SNB), and presents database benchmarking innovation in terms of graph query functionality tested, correlated graph generation techniques, as well as a scalable benchmark driver on a workload with complex graph dependencies. SNB has three query workloads under development: Interactive, Business Intelligence, and Graph Algorithms. We describe the SNB Interactive Workload in detail and illustrate the workload with some early results, as well as the goals for the two other workloads.

262 citations


Book
10 Jun 2015
TL;DR: This second edition of this practical book includes new code samples and diagrams, using the latest Neo4j syntax, as well as information on new functionality.
Abstract: Discover how graph databases can help you manage and query highly connected data. With this practical book, you'll learn how to design and implement a graph database that brings the power of graphs to bear on a broad range of problem domains. Whether you want to speed up your response to user queries or build a database that can adapt as your business evolves, this book shows you how to apply the schema-free graph model to real-world problems. This second edition includes new code samples and diagrams, using the latest Neo4j syntax, as well as information on new functionality. Learn how different organizations are using graph databases to outperform their competitors. With this book's data modeling, query, and code examples, you'll quickly be able to implement your own solution.
- Model data with the Cypher query language and property graph model
- Learn best practices and common pitfalls when modeling with graphs
- Plan and implement a graph database solution in test-driven fashion
- Explore real-world examples to learn how and why organizations use a graph database
- Understand common patterns and components of graph database architecture
- Use analytical techniques and algorithms to mine graph database information

184 citations
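The property graph model the book teaches pairs nodes and relationships with arbitrary key-value properties. A minimal sketch of that data model in Python follows (the data and names are our own illustration); the Cypher equivalent of the query appears as a comment.

```python
# Toy property graph: labeled nodes and typed relationships, each with properties.
nodes = {
    1: {"labels": {"Person"}, "props": {"name": "Alice"}},
    2: {"labels": {"Person"}, "props": {"name": "Bob"}},
}
rels = [
    {"start": 1, "end": 2, "type": "KNOWS", "props": {"since": 2015}},
]

# Cypher: MATCH (a:Person {name: 'Alice'})-[:KNOWS]->(b) RETURN b.name
def knows_of(name):
    ids = [i for i, n in nodes.items() if n["props"].get("name") == name]
    return [nodes[r["end"]]["props"]["name"]
            for r in rels
            if r["start"] in ids and r["type"] == "KNOWS"]

friends = knows_of("Alice")
```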


Proceedings ArticleDOI
TL;DR: These properties enable Gremlin to naturally support imperative and declarative querying, host language agnosticism, user-defined domain specific languages, an extensible compiler/optimizer, single- and multi-machine execution models, hybrid depth- and breadth-first evaluation, as well as the existence of a Universal Gremlin Machine and its respective entailments.
Abstract: Gremlin is a graph traversal machine and language designed, developed, and distributed by the Apache TinkerPop project. Gremlin, as a graph traversal machine, is composed of three interacting components: a graph $G$, a traversal $\Psi$, and a set of traversers $T$. The traversers move about the graph according to the instructions specified in the traversal, where the result of the computation is the ultimate locations of all halted traversers. A Gremlin machine can be executed over any supporting graph computing system such as an OLTP graph database and/or an OLAP graph processor. Gremlin, as a graph traversal language, is a functional language implemented in the user's native programming language and is used to define the $\Psi$ of a Gremlin machine. This article provides a mathematical description of Gremlin and details its automaton and functional properties. These properties enable Gremlin to naturally support imperative and declarative querying, host language agnosticism, user-defined domain specific languages, an extensible compiler/optimizer, single- and multi-machine execution models, hybrid depth- and breadth-first evaluation, as well as the existence of a Universal Gremlin Machine and its respective entailments.

181 citations
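The abstract's three components, a graph $G$, a traversal $\Psi$, and a set of traversers $T$, can be mimicked in a few lines of Python: a traversal is a list of step functions, and evaluation moves traversers through the graph until they halt. The step name below is illustrative, not TinkerPop's actual instruction set.

```python
# Toy graph G: vertex -> {edge label -> out-neighbors}
G = {
    "alice": {"knows": ["bob", "carol"]},
    "bob":   {"knows": ["carol"]},
    "carol": {"knows": []},
}

def out(label):
    # Step: move each traverser to the out-neighbors along `label`.
    def step(traversers):
        return [v for t in traversers for v in G[t].get(label, [])]
    return step

def evaluate(start, psi):
    # The result of the computation is the ultimate locations of all
    # halted traversers, as in the abstract.
    traversers = [start]
    for step in psi:
        traversers = step(traversers)
    return traversers

# Psi roughly corresponding to the Gremlin traversal
# g.V('alice').out('knows').out('knows')
result = evaluate("alice", [out("knows"), out("knows")])
```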


Proceedings ArticleDOI
04 Oct 2015
TL;DR: An analysis of three high-performance graph algorithm codebases, using hardware performance counters on a conventional dual-socket server, shows there is substantial room for a different processor architecture to improve performance without requiring a new memory system.
Abstract: Graph processing is an increasingly important application domain and is typically communication-bound. In this work, we analyze the performance characteristics of three high-performance graph algorithm codebases using hardware performance counters on a conventional dual-socket server. Unlike many other communication-bound workloads, graph algorithms struggle to fully utilize the platform's memory bandwidth and so increasing memory bandwidth utilization could be just as effective as decreasing communication. Based on our observations of simultaneous low compute and bandwidth utilization, we find there is substantial room for a different processor architecture to improve performance without requiring a new memory system.

164 citations


Proceedings ArticleDOI
27 Oct 2015
TL;DR: Gremlin, as described in this article, is a graph traversal machine and language designed, developed, and distributed by the Apache TinkerPop project; it is composed of three interacting components: a graph, a traversal, and a set of traversers.
Abstract: Gremlin is a graph traversal machine and language designed, developed, and distributed by the Apache TinkerPop project. Gremlin, as a graph traversal machine, is composed of three interacting components: a graph, a traversal, and a set of traversers. The traversers move about the graph according to the instructions specified in the traversal, where the result of the computation is the ultimate locations of all halted traversers. A Gremlin machine can be executed over any supporting graph computing system such as an OLTP graph database and/or an OLAP graph processor. Gremlin, as a graph traversal language, is a functional language implemented in the user's native programming language and is used to define the traversal of a Gremlin machine. This article provides a mathematical description of Gremlin and details its automaton and functional properties. These properties enable Gremlin to naturally support imperative and declarative querying, host language agnosticism, user-defined domain specific languages, an extensible compiler/optimizer, single- and multi-machine execution models, hybrid depth- and breadth-first evaluation, as well as the existence of a Universal Gremlin Machine and its respective entailments.

145 citations


Proceedings ArticleDOI
15 Nov 2015
TL;DR: This paper characterizes GraphBIG on real machines, observing extremely irregular memory patterns and significantly diverse behavior across different computations; GraphBIG helps users understand the impact of modern graph computing on the hardware architecture and enables future architecture and system research.
Abstract: With the emergence of data science, graph computing is becoming a crucial tool for processing big connected data. Although efficient implementations of specific graph applications exist, the behavior of full-spectrum graph computing remains unknown. To understand graph computing, we must consider multiple graph computation types, graph frameworks, data representations, and various data sources in a holistic way. In this paper, we present GraphBIG, a benchmark suite inspired by the IBM System G project. To cover major graph computation types and data sources, GraphBIG selects representative data structures, workloads, and data sets from 21 real-world use cases of multiple application domains. We characterized GraphBIG on real machines and observed extremely irregular memory patterns and significantly diverse behavior across different computations. GraphBIG helps users understand the impact of modern graph computing on the hardware architecture and enables future architecture and system research.

141 citations


BookDOI
01 Jan 2015
TL;DR: This work discusses the role of Cloud Computing Architectures in Big Data, Customer Relationship Management and Big Data Mining, and a Multi-granular Evaluation Model through Fuzzy Random Regression to Improve Information.
Abstract:
- Nearest Neighbor Queries on Big Data
- Information Mining for Big Information
- Information Granules Problem: An Efficient Solution of Real-Time Fuzzy Regression Analysis
- How to Understand Connections Based on Big Data: From Cliques to Flexible Granules
- Maintain 'Omics: When e-Maintenance Enters the Big Data Era
- Incrementally Mining Frequent Patterns for Large Database
- Improved Latent Semantic Indexing-based Data Mining Methods and An Application to Big
- The Property of Different Granule and Granular Methods Based on Quotient Space
- Towards An Optimal Task-Driven Information Granulation
- Unified Framework for Construction of Rule Based Classification Systems
- Multi-granular Evaluation Model through Fuzzy Random Regression to Improve Information
- Building Fuzzy Robust Regression Model Based on Granularity and Possibility Distribution
- The Role of Cloud Computing Architectures in Big Data
- Big Data Storage Techniques for Spatial Databases: Implications of Big Data Architecture on Spatial Query Processing
- The Web KnowARR Framework: Orchestrating Computational Intelligence with Graph Databases
- Customer Relationship Management and Big Data Mining
- Performance Competition for ISCIFCM and Application of Computational Intelligence on Analysis of Air Quality Monitoring Big Data
- PEI Models under Uncontrolled Circumstances
- Rough Set Model based Knowledge Acquisition of Market Movements from Economic Data
- Deep Neural Network Modeling for Big Data Weather Forecast
- Current Knowledge and Future Challenge for Visibility Forecasting by Computational Intelligence
- Application of Computational Intelligence on Analysis of Air Quality Monitoring Big Data

139 citations


Journal ArticleDOI
TL;DR: New ways for exploiting the structure of an image database by representing it as a graph are explored, and it is shown how the rich information embedded in such a graph can improve bag-of-words-based location recognition methods.
Abstract: Recognizing the location of a query image by matching it to an image database is an important problem in computer vision, and one for which the representation of the database is a key issue. We explore new ways for exploiting the structure of an image database by representing it as a graph, and show how the rich information embedded in such a graph can improve bag-of-words-based location recognition methods. In particular, starting from a graph based on visual connectivity, we propose a method for selecting a set of overlapping subgraphs and learning a local distance function for each subgraph using discriminative techniques. For a query image, each database image is ranked according to these local distance functions in order to place the image in the right part of the graph. In addition, we propose a probabilistic method for increasing the diversity of these ranked database images, again based on the structure of the image graph. We demonstrate that our methods improve performance over standard bag-of-words methods on several existing location recognition datasets.

114 citations


01 Jan 2015
TL;DR: This paper compares various options for reifying RDF triples, and generates the four RDF datasets pertaining to each model and discusses high-level aspects relating to data sizes, etc.
Abstract: In this paper, we compare various options for reifying RDF triples. We are motivated by the goal of representing Wikidata as RDF, which would allow legacy Semantic Web languages, techniques and tools – for example, SPARQL engines – to be used for Wikidata. However, Wikidata annotates statements with qualifiers and references, which require some notion of reification to model in RDF. We thus investigate four such options: (1) standard reification, (2) n-ary relations, (3) singleton properties, and (4) named graphs. Taking a recent dump of Wikidata, we generate the four RDF datasets pertaining to each model and discuss high-level aspects relating to data sizes, etc. To empirically compare the effect of the different models on query times, we collect a set of benchmark queries with four model-specific versions of each query. We present the results of running these queries against five popular SPARQL implementations: 4store, BlazeGraph, GraphDB, Jena TDB and Virtuoso.

104 citations
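The four reification options the paper compares can be illustrated with a single qualified statement. The sketch below writes each option as plain (subject, predicate, object) tuples in Python; the IRIs are abbreviated and the Wikidata-style statement ("Berlin, population 3500000, determined by census") is invented for illustration.

```python
base = ("Berlin", "population", "3500000")

# (1) Standard reification: an rdf:Statement resource carries s/p/o links,
# and the qualifier attaches to that resource.
standard = [
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "Berlin"),
    ("stmt1", "rdf:predicate", "population"),
    ("stmt1", "rdf:object", "3500000"),
    ("stmt1", "determinationMethod", "census"),
]

# (2) N-ary relation: the statement becomes an intermediate node.
nary = [
    ("Berlin", "population", "node1"),
    ("node1", "rdf:value", "3500000"),
    ("node1", "determinationMethod", "census"),
]

# (3) Singleton property: a unique predicate per statement, linked back
# to the generic predicate.
singleton = [
    ("Berlin", "population#1", "3500000"),
    ("population#1", "singletonPropertyOf", "population"),
    ("population#1", "determinationMethod", "census"),
]

# (4) Named graphs: the triple lives in its own graph, which is annotated.
named_graph = {
    "g1": [base],
    "meta": [("g1", "determinationMethod", "census")],
}
```

The paper's benchmark measures how these shapes trade triple count against query complexity across SPARQL engines.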


Proceedings ArticleDOI
Wen Sun, Achille Fokoue, Kavitha Srinivas, Anastasios Kementsietsidis, Gang Hu, Guotong Xie
27 May 2015
TL;DR: It is shown that existing mature, relational optimizers can be exploited with a novel schema to give better performance for property graph storage and retrieval than popular noSQL graph stores.
Abstract: We show that existing mature, relational optimizers can be exploited with a novel schema to give better performance for property graph storage and retrieval than popular noSQL graph stores. The schema combines relational storage for adjacency information with JSON storage for vertex and edge attributes. We demonstrate that this particular schema design has benefits compared to a purely relational or purely JSON solution. The query translation mechanism translates Gremlin queries with no side effects into SQL queries so that one can leverage relational query optimizers. We also conduct an empirical evaluation of our schema design and query translation mechanism with two existing popular property graph stores. We show that our system is 2-8 times better on query performance, and 10-30 times better in throughput on 4.3 billion edge graphs compared to existing stores.
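The paper's central idea, relational storage for adjacency plus JSON for vertex and edge attributes, can be sketched with SQLite from Python's standard library. Table and column names below are our own illustration, not the paper's actual schema, and the final query stands in for the kind of SQL a Gremlin step like g.V(1).out('knows') might translate to.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, attrs TEXT);  -- attrs: JSON
    CREATE TABLE edges (src INTEGER, dst INTEGER, label TEXT, attrs TEXT);
""")
con.execute("INSERT INTO vertices VALUES (1, ?)", (json.dumps({"name": "alice"}),))
con.execute("INSERT INTO vertices VALUES (2, ?)", (json.dumps({"name": "bob"}),))
con.execute("INSERT INTO edges VALUES (1, 2, 'knows', ?)",
            (json.dumps({"since": 2015}),))

# Adjacency traversal as a relational join, so the SQL optimizer does the work:
rows = con.execute("""
    SELECT v.attrs FROM edges e JOIN vertices v ON v.id = e.dst
    WHERE e.src = 1 AND e.label = 'knows'
""").fetchall()
names = [json.loads(attrs)["name"] for (attrs,) in rows]
```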

Journal ArticleDOI
TL;DR: A comprehensive survey over the state-of-the-art of large scale graph processing platforms, namely, GraphChi, Apache Giraph, GPS, GraphLab and GraphX, and an extensive experimental study of five popular systems in this domain.
Abstract: Graph is a fundamental data structure that captures relationships between different data entities. In practice, graphs are widely used for modeling complicated data in different application domains such as social networks, protein networks, transportation networks, bibliographical networks, knowledge bases and many more. Currently, graphs with millions and billions of nodes and edges have become very common. In principle, graph analytics is an important big data discovery technique. Therefore, with the increasing abundance of large graphs, designing scalable systems for processing and analyzing large scale graphs has become one of the most timely problems facing the big data research community. In general, scalable processing of big graphs is a challenging task due to their size and the inherent irregular structure of graph computations. Thus, in recent years, we have witnessed an unprecedented interest in building big graph processing systems that attempted to tackle these challenges. In this article, we provide a comprehensive survey over the state-of-the-art of large scale graph processing platforms. In addition, we present an extensive experimental study of five popular systems in this domain, namely, GraphChi, Apache Giraph, GPS, GraphLab and GraphX. In particular, we report and analyze the performance characteristics of these systems using five common graph processing algorithms and seven large graph datasets. Finally, we identify a set of the current open research challenges and discuss some promising directions for future research in the domain of large scale graph processing.

Proceedings ArticleDOI
12 Oct 2015
TL;DR: In this paper, the authors proposed graph encryption schemes that efficiently support approximate shortest distance queries on large-scale encrypted graphs, including three oracle encryption schemes, which are provably secure against any semi-honest server.
Abstract: We propose graph encryption schemes that efficiently support approximate shortest distance queries on large-scale encrypted graphs. Shortest distance queries are one of the most fundamental graph operations and have a wide range of applications. Using such graph encryption schemes, a client can outsource large-scale privacy-sensitive graphs to an untrusted server without losing the ability to query it. Other applications include encrypted graph databases and controlled disclosure systems. We propose GRECS (stands for GRaph EnCryption for approximate Shortest distance queries) which includes three oracle encryption schemes that are provably secure against any semi-honest server. Our first construction makes use of only symmetric-key operations, resulting in a computationally-efficient construction. Our second scheme makes use of somewhat-homomorphic encryption and is less computationally-efficient but achieves optimal communication complexity (i.e. uses a minimal amount of bandwidth). Finally, our third scheme is both computationally-efficient and achieves optimal communication complexity at the cost of a small amount of additional leakage. We implemented and evaluated the efficiency of our constructions experimentally. The experiments demonstrate that our schemes are efficient and can be applied to graphs that scale up to 1.6 million nodes and 11 million edges.
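GRECS encrypts a precomputed distance oracle. The plaintext idea underneath, sketched below, is landmark/sketch-based: store each node's exact BFS distances to a few landmarks and bound d(u, v) by routing through the best landmark. The encryption layer, which is the paper's actual contribution, is omitted, and all names are ours.

```python
from collections import deque

def bfs_dist(adj, src):
    # Unweighted single-source shortest distances.
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def build_oracle(adj, landmarks):
    # Oracle = distances from every landmark to every node.
    return {l: bfs_dist(adj, l) for l in landmarks}

def approx_dist(oracle, u, v):
    # Upper bound: min over landmarks l of d(u, l) + d(l, v).
    return min(d[u] + d[v] for d in oracle.values() if u in d and v in d)

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
oracle = build_oracle(adj, landmarks=[1])
est = approx_dist(oracle, 0, 3)
```

On the path graph above the landmark lies on the shortest path, so the estimate is exact; in general the oracle returns an upper bound whose quality depends on landmark selection.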

Posted Content
TL;DR: GRECS (stands for GRaph EnCryption for approximate Shortest distance queries) is proposed which includes three oracle encryption schemes that are provably secure against any semi-honest server and is both computationally-efficient and achieves optimal communication complexity at the cost of a small amount of additional leakage.
Abstract: We propose graph encryption schemes that efficiently support approximate shortest distance queries on large-scale encrypted graphs. Shortest distance queries are one of the most fundamental graph operations and have a wide range of applications. Using such graph encryption schemes, a client can outsource large-scale privacy-sensitive graphs to an untrusted server without losing the ability to query it. Other applications include encrypted graph databases and controlled disclosure systems. We propose GRECS (stands for GRaph EnCryption for approximate Shortest distance queries) which includes three schemes that are provably secure against any semi-honest server. Our first construction makes use of only symmetric-key operations, resulting in a computationally-efficient construction. Our second scheme makes use of somewhat-homomorphic encryption and is less computationally-efficient but achieves optimal communication complexity (i.e., uses a minimal amount of bandwidth). Finally, our third scheme is both computationally-efficient and achieves optimal communication complexity at the cost of a small amount of additional leakage. We implemented and evaluated the efficiency of our constructions experimentally. The experiments demonstrate that our schemes are efficient and can be applied to graphs that scale up to 1.6 million nodes and 11 million edges.

Journal ArticleDOI
Bin Xu, Jiajun Bu, Chun Chen, Can Wang, Deng Cai, Xiaofei He
TL;DR: This paper proposes a novel scalable graph-based ranking model called Efficient Manifold Ranking (EMR), trying to address the shortcomings of MR from two main perspectives: scalable graph construction and efficient ranking computation.
Abstract: Graph-based ranking models have been widely applied in the information retrieval area. In this paper, we focus on a well-known graph-based model: the Ranking on Data Manifold model, or Manifold Ranking (MR). Particularly, it has been successfully applied to content-based image retrieval, because of its outstanding ability to discover the underlying geometrical structure of the given image database. However, manifold ranking is computationally very expensive, which significantly limits its applicability to large databases, especially for the cases where the queries are out of the database (new samples). We propose a novel scalable graph-based ranking model called Efficient Manifold Ranking (EMR), trying to address the shortcomings of MR from two main perspectives: scalable graph construction and efficient ranking computation. Specifically, we build an anchor graph on the database instead of a traditional $k$-nearest-neighbor graph, and design a new form of adjacency matrix utilized to speed up the ranking. An approximate method is adopted for efficient out-of-sample retrieval. Experimental results on some large-scale image databases demonstrate that EMR is a promising method for real-world retrieval applications.
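The manifold-ranking iteration at the core of MR (and of EMR's speedups) diffuses scores over the graph via f ← αSf + (1 − α)y, where y marks the query and S is a symmetrically normalized adjacency. A tiny pure-Python sketch on a toy unweighted path graph; EMR's anchor-graph construction is not reproduced here.

```python
import math

def manifold_rank(adj, query, alpha=0.5, iters=50):
    # S[v][u] = 1 / sqrt(deg(v) * deg(u)) for each edge (v, u).
    deg = {v: len(adj[v]) for v in adj}
    y = {v: 1.0 if v == query else 0.0 for v in adj}
    f = dict(y)
    for _ in range(iters):
        nf = {}
        for v in adj:
            s = sum(f[u] / math.sqrt(deg[v] * deg[u]) for u in adj[v])
            nf[v] = alpha * s + (1 - alpha) * y[v]  # f <- alpha*S*f + (1-alpha)*y
        f = nf
    return f

adj = {"q": ["a"], "a": ["q", "b"], "b": ["a", "c"], "c": ["b"]}
scores = manifold_rank(adj, query="q")
```

On the path q-a-b-c the converged scores decay monotonically with distance from the query, which is the ranking behavior the model exploits.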

Proceedings ArticleDOI
15 Nov 2015
TL;DR: This paper presents a fast distributed graph processing system, namely PGX.D, as a low-overhead, bandwidth-efficient communication framework that supports remote data-pulling patterns and recommends the use of balanced beefy clusters where the sustained random DRAM-access bandwidth in aggregate is matched with the bandwidth of the underlying interconnection fabric.
Abstract: Graph analysis is a powerful method in data analysis. Although several frameworks have been proposed for processing large graph instances in distributed environments, their performance is much lower than using efficient single-machine implementations provided with enough memory. In this paper, we present a fast distributed graph processing system, namely PGX.D. We show that PGX.D outperforms other distributed graph systems like GraphLab significantly (3x -- 90x). Furthermore, PGX.D on 4 to 16 machines is also faster than an implementation optimized for single-machine execution. Using a fast cooperative context-switching mechanism, we implement PGX.D as a low-overhead, bandwidth-efficient communication framework that supports remote data-pulling patterns. Moreover, PGX.D achieves large traffic reduction and good workload balance by applying selective ghost nodes, edge partitioning, and edge chunking transparently to the user. Our analysis confirms that each of these features is indeed crucial for overall performance of certain kinds of graph algorithms. Finally, we advocate the use of balanced beefy clusters where the sustained random DRAM-access bandwidth in aggregate is matched with the bandwidth of the underlying interconnection fabric.

Journal ArticleDOI
TL;DR: This paper proposes a systematic method for edit-distance based similarity search, which retrieves graphs that are similar to a given query graph under the constraint of graph edit distance, and derives two lower bounds from different perspectives, i.e., partition-based and branch-based bounds.
Abstract: Since many graph data are often noisy and incomplete in real applications, it has become increasingly important to retrieve graphs $g$ in the graph database $D$ that approximately match the query graph $q$ , rather than exact graph matching. In this paper, we study the problem of graph similarity search, which retrieves graphs that are similar to a given query graph under the constraint of graph edit distance. We propose a systematic method for edit-distance based similarity search problem. Specifically, we derive two lower bounds, i.e., partition-based and branch-based bounds, from different perspectives. More importantly, a hybrid lower bound incorporating both ideas of the two lower bounds is proposed, which is theoretically proved to have higher (at least not lower) pruning power than using the two lower bounds together. We also present a uniform index structure, namely u-tree, to facilitate effective pruning and efficient query processing. Extensive experiments confirm that our proposed approach outperforms the existing approaches significantly, in terms of both the pruning power and query response time.

Journal ArticleDOI
TL;DR: The optimized ordering of vertices and selection of colours in combination with interactive highlighting techniques increases the traceability of communities along the time axis and allows users to investigate the community structure together with the underlying dynamic graph.
Abstract: The community structure of graphs is an important feature that gives insight into the high-level organization of objects within the graph. In real-world systems, the graph topology is oftentimes not static but changes over time and hence, also the community structure changes. Previous timeline-based approaches either visualize the dynamic graph or the dynamic community structure. In contrast, our approach combines both in a single image and therefore allows users to investigate the community structure together with the underlying dynamic graph. Our optimized ordering of vertices and selection of colours in combination with interactive highlighting techniques increases the traceability of communities along the time axis. Users can identify visual signatures, estimate the reliability of the derived community structure and investigate whether community evolution interacts with changes in the graph topology. The utility of our approach is demonstrated in two application examples.

Journal ArticleDOI
TL;DR: This paper proposes a novel and effective graph based semi-supervised learning method for image annotation derived by a compact graph that can well grasp the manifold structure and theoretically proves that it can be analyzed under a regularized framework.
Abstract: The insufficiency of labeled samples is a major problem in automatic image annotation. However, unlabeled samples are readily available and abundant. Hence, semi-supervised learning methods, which utilize partly labeled samples and a large amount of unlabeled samples, have attracted increased attention in the field of image annotation. During the past decade, graph-based semi-supervised learning has become one of the most important research areas in semi-supervised learning. In this paper, we propose a novel and effective graph-based semi-supervised learning method for image annotation. The new method is derived from a compact graph that can well grasp the manifold structure. In addition, we theoretically prove that the proposed semi-supervised learning method can be analyzed under a regularized framework. It can also be easily extended to deal with out-of-sample data. Simulation results show that the proposed method can achieve better performance compared with other state-of-the-art graph-based semi-supervised learning methods.
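The method belongs to the graph-based label-propagation family: scores from labeled nodes spread to neighbors until convergence, and unlabeled nodes take the resulting sign. A minimal binary sketch of that family follows; the paper's compact-graph construction is not reproduced, and the graph and labels are invented.

```python
def label_propagation(adj, labels, iters=100):
    """labels: {node: +1 or -1} for the labeled seed nodes."""
    f = {v: labels.get(v, 0.0) for v in adj}
    for _ in range(iters):
        for v in adj:
            if v in labels:          # clamp labeled nodes to their labels
                f[v] = labels[v]
            else:                    # unlabeled: average of neighbor scores
                f[v] = sum(f[u] for u in adj[v]) / len(adj[v])
    return {v: (1 if s >= 0 else -1) for v, s in f.items()}

# Chain a-b-c-d with opposite labels at the ends: the boundary falls in the middle.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
pred = label_propagation(adj, {"a": 1, "d": -1})
```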

Proceedings ArticleDOI
02 Apr 2015
TL;DR: A new graph sensemaking hierarchy is proposed that categorizes tools and techniques based on how they operate on the graph data (e.g., local vs global), and the survey concludes with future research directions for graph sensemaking.
Abstract: Making sense of large graph datasets is a fundamental and challenging process that advances science, education and technology. We survey research on graph exploration and visualization approaches aimed at addressing this challenge. Different from existing surveys, our investigation highlights approaches that have strong potential in handling large graphs, algorithmically, visually, or interactively; we also explicitly connect relevant works from multiple research fields — data mining, machine learning, human-computer interaction, information visualization, information retrieval, and recommender systems — to underline their parallel and complementary contributions to graph sensemaking. We ground our discussion in sensemaking research; we propose a new graph sensemaking hierarchy that categorizes tools and techniques based on how they operate on the graph data (e.g., local vs global). We summarize and compare their strengths and weaknesses, and highlight open challenges. We conclude with future research directions for graph sensemaking.

Proceedings ArticleDOI
27 May 2015
TL;DR: This work is the first to propose a join approach for template generation in RDF Q/A, and proposes several structural and probability pruning techniques to speed up joining.
Abstract: A challenging task in natural language question answering (Q/A for short) over RDF knowledge graphs is how to bridge the gap between unstructured natural language questions (NLQ) and graph-structured RDF data (G). One of the effective tools is the "template", which is often used in many existing RDF Q/A systems. However, few of them study how to generate templates automatically. To the best of our knowledge, we are the first to propose a join approach for template generation. Given a workload D of SPARQL queries and a set N of natural language questions, the goal is to find pairs ⟨q, n⟩, for q ∈ D ∧ n ∈ N, where SPARQL query q is the best match for natural language question n. These pairs provide promising hints for automatic template generation. Due to the ambiguity of natural languages, we model the problem above as an uncertain graph join task. We propose several structural and probability pruning techniques to speed up joining. Extensive experiments over real RDF Q/A benchmark datasets confirm both the effectiveness and efficiency of our approach.

Patent
31 Aug 2015
TL;DR: In this paper, a system, method, and apparatus to improve connections within a healthcare ecosystem are provided, where a plurality of reusable interface and route definitions to translate and exchange data messages between source and target systems in the healthcare ecosystem.
Abstract: A system, method, and apparatus to improve connections within a healthcare ecosystem are provided. Example systems, methods, and apparatus can facilitate dynamic interface definition and configuration. An example method includes storing a plurality of reusable interface and route definitions to translate and exchange data messages between source and target systems in a healthcare ecosystem; monitoring message exchanges and message patterns in the healthcare environment via a machine learning system to predict traffic and utilization patterns in the healthcare ecosystem; tracking metadata regarding connections involving the source and target systems and storing the metadata in a graph database; suggesting connections between the source and target systems based on the monitored message exchanges and message patterns and metadata from the graph database using graph analytics; and provisioning an interface between the source and target systems based on a suggested connection, the interface provisioned from the reusable interface and route definitions based on the suggested connection.

Journal ArticleDOI
TL;DR: The design of ImmortalGraph explores an interesting interplay among locality, parallelism, and incremental computation in supporting common mining tasks on temporal graphs, resulting in a high-performance temporal-graph system that is up to 5 times more efficient than existing database solutions for graph queries.
Abstract: Temporal graphs that capture graph changes over time are attracting increasing interest from research communities, for functions such as understanding temporal characteristics of social interactions on a time-evolving social graph. ImmortalGraph is a storage and execution engine designed and optimized specifically for temporal graphs. Locality is at the center of ImmortalGraph’s design: temporal graphs are carefully laid out in both persistent storage and memory, taking into account data locality in both time and graph-structure dimensions. ImmortalGraph introduces the notion of locality-aware batch scheduling in computation, so that common “bulk” operations on temporal graphs are scheduled to maximize the benefit of in-memory data locality. The design of ImmortalGraph explores an interesting interplay among locality, parallelism, and incremental computation in supporting common mining tasks on temporal graphs. The result is a high-performance temporal-graph system that is up to 5 times more efficient than existing database solutions for graph queries. The locality optimizations in ImmortalGraph offer up to an order of magnitude speedup for temporal iterative graph mining compared to a straightforward application of existing graph engines on a series of snapshots.
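The time-dimension locality idea can be illustrated in miniature: keep a per-vertex edge log sorted by timestamp, so reading any snapshot is a contiguous prefix scan per vertex. This sketch is inspired by, not taken from, ImmortalGraph's actual layout.

```python
# Sketch of a time-locality-friendly layout: per-vertex edge logs kept in
# timestamp order, so a snapshot query scans one contiguous prefix per vertex.
from bisect import bisect_right
from collections import defaultdict

class TemporalGraph:
    def __init__(self):
        self._log = defaultdict(list)  # u -> [(t, v), ...] sorted by t

    def add_edge(self, u, v, t):
        self._log[u].append((t, v))
        self._log[u].sort()  # keep per-vertex log in time order

    def neighbors_at(self, u, t):
        """Neighbours of u in the snapshot at time t (edges added at or before t)."""
        log = self._log[u]
        idx = bisect_right([ts for ts, _ in log], t)
        return [v for _, v in log[:idx]]

g = TemporalGraph()
g.add_edge("a", "b", 1)
g.add_edge("a", "c", 3)
print(g.neighbors_at("a", 2))  # ['b']: the edge to "c" arrives later
```

Iterating snapshots in time order then touches each log sequentially, which is the access pattern the batch scheduling described above is designed to exploit.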

Book ChapterDOI
24 Sep 2015
TL;DR: Recent advances, limitations, and future directions in graph modelling and graph databases are discussed, including pattern matching in graphs, which provides, in principle, an arbitrarily complex identity function.
Abstract: Real-world data offers many possibilities to be represented as graphs. As a result we obtain undirected or directed graphs, multigraphs and hypergraphs, labelled or weighted graphs, and their variants. The development of graph modelling also brings new approaches, e.g., considering constraints. Processing graphs in a database setting can be done in many different ways. Some graphs can be represented as JSON or XML structures and processed by their native database tools. More generally, a graph database is any storage system that provides index-free adjacency, i.e., an explicit graph structure. Graph database technology retains some features inherent to traditional databases, e.g., ACID properties and availability. Use cases of graph databases such as Neo4j, OrientDB, InfiniteGraph, FlockDB, AllegroGraph, and others document that graph databases are becoming a common means for managing any connected data. In the Big Data era, important questions concern scalability for large graphs as well as scaling of read/write operations. For example, scaling graph data by distributing it over a network is much more difficult than scaling simpler data models and is still a work in progress. Pattern matching in graphs, which provides, in principle, an arbitrarily complex identity function, also remains a challenge. Mining complete frequent patterns from graph databases is likewise challenging, since the supporting operations are computationally costly. In this paper, we discuss recent advances and limitations in these areas as well as future directions.
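The defining property mentioned above, index-free adjacency, can be shown in a few lines: each node holds direct references to its neighbours, so traversal never consults a global index. This is a didactic sketch, not any particular database's implementation.

```python
# Index-free adjacency in miniature: each node keeps direct references to
# its neighbouring Node objects, so a traversal follows pointers only and
# never performs an index lookup.
class Node:
    def __init__(self, label):
        self.label = label
        self.out = []  # direct references to neighbouring Node objects

    def connect(self, other):
        self.out.append(other)

def reachable(start):
    """Depth-first traversal following object references only."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node.label not in seen:
            seen.add(node.label)
            stack.extend(node.out)
    return seen

a, b, c = Node("a"), Node("b"), Node("c")
a.connect(b)
b.connect(c)
print(sorted(reachable(a)))  # ['a', 'b', 'c']
```

In a relational layout, each hop would instead be a join through an edge table; index-free adjacency makes a hop O(1) per neighbour regardless of total graph size.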

Journal ArticleDOI
TL;DR: This paper presents G*’s design and implementation principles along with evaluation results that document its unique benefits over traditional graph processing systems.
Abstract: From sensor networks to transportation infrastructure to social networks, we are awash in data. Many of these real-world networks tend to be large ("big data") and dynamic, evolving over time. Their evolution can be modeled as a series of graphs. Traditional systems that store and analyze one graph at a time cannot effectively handle the complexity and subtlety inherent in dynamic graphs. Modern analytics require systems capable of storing and processing series of graphs. We present such a system. G* compresses dynamic graph data based on commonalities among the graphs in the series for deduplicated storage on multiple servers. In addition to the obvious space-saving advantage, large-scale graph processing tends to be I/O bound, so faster reads from and writes to stable storage enable faster results. Unlike traditional database and graph processing systems, G* executes complex queries on large graphs using distributed operators to process graph data in parallel. It speeds up queries on multiple graphs by processing graph commonalities only once and sharing the results across relevant graphs. This architecture not only provides scalability; because G* is not limited to processing what fits in RAM, its analysis capabilities are also far greater than those of systems restricted to what they can hold in memory. This paper presents G*'s design and implementation principles along with evaluation results that document its unique benefits over traditional graph processing systems.
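The commonality-based deduplication idea can be sketched as a content-addressed pool: each distinct vertex adjacency is stored once, and snapshots merely reference the shared copies. G*'s actual distributed, operator-based design is far richer; this shows only the storage principle.

```python
# Sketch of commonality-based storage for a graph series: each distinct
# (vertex, adjacency) record is stored once in a shared pool, and every
# snapshot maps its vertices to keys into that pool.
class GraphSeriesStore:
    def __init__(self):
        self._pool = {}       # (vertex, frozen adjacency) -> stored copy
        self._snapshots = []  # one {vertex: pool key} map per snapshot

    def add_snapshot(self, graph):
        snap = {}
        for v, nbrs in graph.items():
            key = (v, frozenset(nbrs))
            self._pool.setdefault(key, sorted(nbrs))  # store once, share after
            snap[v] = key
        self._snapshots.append(snap)

    def stored_copies(self):
        return len(self._pool)

s = GraphSeriesStore()
s.add_snapshot({"a": ["b"], "b": []})
s.add_snapshot({"a": ["b"], "b": ["c"], "c": []})  # only "a" is unchanged
print(s.stored_copies())  # 4: five vertex records deduplicated to four
```

A query over both snapshots can then process the shared record for "a" once and reuse the result, which is exactly the sharing the abstract describes.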

Proceedings Article
01 Jan 2015
TL;DR: This paper introduces a lightweight repartitioner, which dynamically modifies a partitioning using a small amount of resources, and integrates it into Hermes, designed as an extension of the open source Neo4j graph database system to support workloads over partitioned graph data distributed over multiple servers.
Abstract: Social networks are large graphs that require multiple graph database servers to store and manage them. Each database server hosts a graph partition with the objectives of balancing server loads, reducing remote traversals (edge-cuts), and adapting the partitioning to changes in the structure of the graph in the face of changing workloads. To achieve these objectives, a dynamic repartitioning algorithm is required to modify an existing partitioning to maintain good quality partitions while not imposing a significant overhead on the system. In this paper, we introduce a lightweight repartitioner, which dynamically modifies a partitioning using a small amount of resources. In contrast to the existing repartitioning algorithms, our lightweight repartitioner is efficient, making it suitable for use in a real system. We integrated our lightweight repartitioner into Hermes, which we designed as an extension of the open source Neo4j graph database system, to support workloads over partitioned graph data distributed over multiple servers. Using real-world social network data, we show that Hermes leverages the lightweight repartitioner to maintain high quality partitions and provides a 2 to 3 times performance improvement over the de-facto standard random hash-based partitioning.
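A deliberately simple greedy repartitioner in the spirit of the objectives above (reduce edge-cut, respect load balance) can be written in a few lines. This is not the Hermes algorithm, only an illustration of the trade-off it manages.

```python
# Greedy repartitioning pass: move each vertex to the partition holding
# the most of its neighbours (fewer edge-cuts), unless the move would
# push that partition past a load limit (balance constraint).
from collections import Counter

def repartition_pass(adj, part, slack=1):
    """One pass over vertices; part maps vertex -> partition id (mutated)."""
    loads = Counter(part.values())
    max_load = max(loads.values()) + slack
    for v, nbrs in adj.items():
        votes = Counter(part[n] for n in nbrs)
        if not votes:
            continue
        target, _ = votes.most_common(1)[0]
        if target != part[v] and loads[target] + 1 <= max_load:
            loads[part[v]] -= 1
            loads[target] += 1
            part[v] = target
    return part

adj = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: []}
part = {1: 0, 2: 1, 3: 1, 4: 0}
print(repartition_pass(adj, part))  # vertex 1 joins its neighbours in partition 1
```

Running only occasional light passes like this, rather than a full re-partitioning, is what keeps the overhead small enough for an online system.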

Proceedings ArticleDOI
25 May 2015
TL;DR: This article looks at common classes of graph algorithms and recast them into linear algebra operations using the Graph BLAS building blocks.
Abstract: Big data and the Internet of Things era continue to challenge computational systems. Several technology solutions such as NoSQL databases have been developed to deal with this challenge. In order to generate meaningful results from large datasets, analysts often use a graph representation which provides an intuitive way to work with the data. Graph vertices can represent users and events, and edges can represent the relationship between vertices. Graph algorithms are used to extract meaningful information from these very large graphs. At MIT, the Graphulo initiative is an effort to perform graph algorithms directly in NoSQL databases such as Apache Accumulo or SciDB, which have an inherently sparse data storage scheme. Sparse matrix operations have a history of efficient implementations and the Graph Basic Linear Algebra Subprogram (Graph BLAS) community has developed a set of key kernels that can be used to develop efficient linear algebra operations. However, in order to use the Graph BLAS kernels, it is important that common graph algorithms be recast using the linear algebra building blocks. In this article, we look at common classes of graph algorithms and recast them into linear algebra operations using the Graph BLAS building blocks.
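The recasting the article describes can be made concrete with breadth-first search, the canonical GraphBLAS example: each BFS step is a matrix-vector product over the Boolean (OR, AND) semiring. Plain dicts stand in here for a sparse matrix library.

```python
# BFS recast as repeated sparse matrix-vector products over the Boolean
# semiring: frontier' = A^T * frontier, masked by already-visited vertices.
def bool_matvec(adj, frontier):
    """One (OR, AND) mat-vec step: vertices reachable in exactly one hop."""
    out = set()
    for v in frontier:
        out |= adj.get(v, set())
    return out

def bfs_levels(adj, source):
    """Return {vertex: level} by iterating the Boolean mat-vec to a fixpoint."""
    levels, frontier, level = {source: 0}, {source}, 0
    while frontier:
        level += 1
        frontier = {v for v in bool_matvec(adj, frontier) if v not in levels}
        for v in frontier:
            levels[v] = level
    return levels

adj = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}
print(bfs_levels(adj, "a"))  # a at level 0, b and c at 1, d at 2
```

Expressing the traversal this way is what lets a database with sparse storage, such as Accumulo via Graphulo, run the algorithm as bulk linear-algebra operations instead of per-vertex pointer chasing.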

Journal ArticleDOI
01 Jan 2015-Database
TL;DR: This work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database and shows how these models can be linked via annotations and queried.
Abstract: Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries, and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It is grounded in a graph database, reflects the models’ structure, incorporates semantic annotations and simulation descriptions, and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves access to computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried. Database URL: https://sems.uni-rostock.de/projects/masymos/

Journal ArticleDOI
TL;DR: An innovative graph topic model (GTM) is proposed to address this issue. It uses Bernoulli distributions to model the edges between nodes in a graph, so that edges contribute to latent topic discovery and further improve the accuracy of both supervised and unsupervised learning of graphs.
Abstract: Graph mining has been a popular research area because of its numerous application scenarios. Many unstructured and structured data can be represented as graphs, such as documents, chemical molecular structures, and images. However, an issue with current research on graphs is that existing methods cannot adequately discover the topics hidden in graph-structured data, which can be beneficial for both the unsupervised and supervised learning of graphs. Although topic models have proved very successful in discovering latent topics, standard topic models cannot be directly applied to graph-structured data due to the "bag-of-words" assumption. In this paper, an innovative graph topic model (GTM) is proposed to address this issue, which uses Bernoulli distributions to model the edges between nodes in a graph. It can, therefore, make the edges in a graph contribute to latent topic discovery and further improve the accuracy of the supervised and unsupervised learning of graphs. The experimental results on two different types of graph datasets show that the proposed GTM outperforms latent Dirichlet allocation on classification when the topics unveiled by each model are used to represent graphs.
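The Bernoulli edge-modelling idea can be shown in a deliberately simplified form: score a graph's edges under each topic's edge probability and pick the most likely topic. The single per-topic probability here is a simplification for illustration; GTM's actual inference over per-node topic assignments is not reproduced.

```python
# Simplified Bernoulli edge model: each topic has one edge probability,
# and a graph is assigned the topic under which its observed edges (and
# non-edges) have the highest Bernoulli log-likelihood.
from math import log
from itertools import combinations

def graph_loglik(nodes, edges, p_edge):
    """Bernoulli log-likelihood of the observed edges among the nodes."""
    ll = 0.0
    for u, v in combinations(nodes, 2):
        present = (u, v) in edges or (v, u) in edges
        ll += log(p_edge) if present else log(1 - p_edge)
    return ll

def most_likely_topic(nodes, edges, topic_edge_probs):
    return max(topic_edge_probs,
               key=lambda t: graph_loglik(nodes, edges, topic_edge_probs[t]))

nodes = ["n1", "n2", "n3"]
edges = {("n1", "n2"), ("n2", "n3"), ("n1", "n3")}  # fully connected triple
topics = {"sparse": 0.1, "dense": 0.8}  # per-topic edge probability
print(most_likely_topic(nodes, edges, topics))  # "dense"
```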