
Showing papers on "Graph database published in 2013"


Proceedings ArticleDOI
23 Jun 2013
TL;DR: GraphX is introduced, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework and provides powerful new operations to simplify graph construction and transformation.
Abstract: From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g., Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation, which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining. We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
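The Pregel abstraction that GraphX re-expresses over tabular data can be sketched in plain Python. The sketch below is an illustrative stand-in, not GraphX's Scala API: vertices and edges are plain tables (lists of rows), and a superstep loop plays the roles of the "send messages along edges" and "fold messages into vertex state" phases, here computing connected components.

```python
# Sketch of graph computation over tabular data, in the spirit of GraphX's
# approach (vertices and edges stored as plain tables, with a Pregel-style
# superstep loop on top). Illustrative stand-in only, not GraphX's API.

def pregel(vertices, edges, initial, message, combine, update, max_iters=30):
    """vertices: list of vertex ids; edges: list of (src, dst) rows."""
    state = {v: initial(v) for v in vertices}
    for _ in range(max_iters):
        # "map" phase: each edge emits a message to its destination
        inbox = {}
        for src, dst in edges:
            m = message(src, state[src])
            inbox[dst] = m if dst not in inbox else combine(inbox[dst], m)
        # "reduce" phase: each vertex folds its messages into new state
        new_state = {v: update(state[v], inbox.get(v)) for v in vertices}
        if new_state == state:          # fixpoint reached
            break
        state = new_state
    return state

# Connected components: propagate the minimum vertex id along edges.
verts = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)]
components = pregel(
    verts, edges,
    initial=lambda v: v,
    message=lambda src, s: s,
    combine=min,
    update=lambda s, m: s if m is None else min(s, m),
)
```

After the loop converges, every vertex holds the smallest id in its component, which is the standard label-propagation formulation of connected components.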

656 citations


Proceedings ArticleDOI
22 Jun 2013
TL;DR: Trinity is introduced, a general-purpose graph engine over a distributed memory cloud that leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance, supporting fast graph exploration as well as efficient parallel computing.
Abstract: Computations performed by graph algorithms are data driven, and require a high degree of random data access. Despite the great progress made in disk technology, it still cannot provide the level of efficient random access required by graph computation. On the other hand, memory-based approaches usually do not scale due to the capacity limit of single machines. In this paper, we introduce Trinity, a general purpose graph engine over a distributed memory cloud. Through optimized memory management and network communication, Trinity supports fast graph exploration as well as efficient parallel computing. In particular, Trinity leverages graph access patterns in both online and offline computation to optimize memory and communication for best performance. These enable Trinity to support efficient online query processing and offline analytics on large graphs with just a few commodity machines. Furthermore, Trinity provides a high level specification language called TSL for users to declare data schema and communication protocols, which brings great ease-of-use for general purpose graph management and computing. Our experiments show Trinity's performance in both low latency graph queries as well as high throughput graph analytics on web-scale, billion-node graphs.

468 citations


Book
20 Jun 2013
TL;DR: This practical book shows you how to apply the schema-free graph model to real-world problems and design and implement a graph database that brings the power of graphs to bear on a broad range of problem domains.
Abstract: Discover how graph databases can help you manage and query highly connected data. With this practical book, you’ll learn how to design and implement a graph database that brings the power of graphs to bear on a broad range of problem domains. Whether you want to speed up your response to user queries or build a database that can adapt as your business evolves, this book shows you how to apply the schema-free graph model to real-world problems. Learn how different organizations are using graph databases to outperform their competitors. With this book’s data modeling, query, and code examples, you’ll quickly be able to implement your own solution. You will: model data with the Cypher query language and property graph model; learn best practices and common pitfalls when modeling with graphs; plan and implement a graph database solution in test-driven fashion; explore real-world examples to learn how and why organizations use a graph database; understand common patterns and components of graph database architecture; and use analytical techniques and algorithms to mine graph database information.
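The property graph model at the heart of the book (nodes and relationships, both carrying key/value properties) can be sketched in a few lines of Python. All names below are illustrative, not any particular database's API; the comment shows a roughly equivalent Cypher pattern.

```python
# Minimal in-memory property graph, illustrating the labeled-property-graph
# model: nodes and relationships both carry key/value properties. The class
# and method names here are made up for illustration.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}      # node id -> {"labels": set, "props": dict}
        self.rels = []       # (src, type, dst, props)

    def add_node(self, nid, *labels, **props):
        self.nodes[nid] = {"labels": set(labels), "props": props}

    def add_rel(self, src, rel_type, dst, **props):
        self.rels.append((src, rel_type, dst, props))

    def expand(self, nid, rel_type):
        """Follow outgoing relationships of one type (a basic traversal)."""
        return [dst for src, t, dst, _ in self.rels
                if src == nid and t == rel_type]

# Roughly: MATCH (a:Person {name:'Alice'})-[:KNOWS]->(b) RETURN b
g = PropertyGraph()
g.add_node("alice", "Person", name="Alice")
g.add_node("bob", "Person", name="Bob")
g.add_rel("alice", "KNOWS", "bob", since=2013)
friends = g.expand("alice", "KNOWS")
```

Note how the relationship itself carries data (`since=2013`), which is the feature that distinguishes property graphs from plain edge lists.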

415 citations


Proceedings ArticleDOI
22 Jun 2013
TL;DR: This paper presents an efficient and robust subgraph search solution, called TurboISO, which is turbo-charged with two novel concepts, candidate region exploration and the combine and permute strategy (in short, Comb/Perm).
Abstract: Given a query graph q and a data graph g, the subgraph isomorphism search finds all occurrences of q in g and is considered one of the most fundamental query types for many real applications. Although this problem is NP-hard, many algorithms have been proposed to solve it in a reasonable time for real datasets. However, a recent study has shown, through an extensive benchmark with various real datasets, that all existing algorithms have serious problems in their matching order selection. Furthermore, all algorithms blindly permute all possible mappings for query vertices, often leading to useless computations. In this paper, we present an efficient and robust subgraph search solution, called TurboISO, which is turbo-charged with two novel concepts, candidate region exploration and the combine and permute strategy (in short, Comb/Perm). The candidate region exploration identifies on-the-fly candidate subgraphs (i.e., candidate regions), which contain embeddings, and computes a robust matching order for each candidate region explored. The Comb/Perm strategy exploits the novel concept of the neighborhood equivalence class (NEC). Each query vertex in the same NEC has identically matching data vertices. During subgraph isomorphism search, Comb/Perm generates only combinations for each NEC instead of permuting all possible enumerations. Thus, if a chosen combination is determined to not contribute to a complete solution, all possible permutations for that combination will be safely pruned. Extensive experiments with many real datasets show that TurboISO consistently and significantly outperforms all competitors by up to several orders of magnitude.
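To make the problem TurboISO attacks concrete, here is a plain backtracking subgraph-isomorphism search. It is exactly the kind of naive enumeration the paper improves on (a fixed matching order, trying every data vertex for every query vertex); it is not TurboISO itself.

```python
# A plain backtracking subgraph-isomorphism search, for illustration only.
# TurboISO improves on exactly this kind of search with candidate regions
# and the Comb/Perm strategy.

def subgraph_matches(q_edges, g_edges):
    """Find all injective mappings of query vertices onto data vertices
    that preserve every query edge (undirected)."""
    q_verts = sorted({v for e in q_edges for v in e})
    g_adj = {}
    for a, b in g_edges:
        g_adj.setdefault(a, set()).add(b)
        g_adj.setdefault(b, set()).add(a)

    results = []

    def extend(mapping):
        if len(mapping) == len(q_verts):
            results.append(dict(mapping))
            return
        u = q_verts[len(mapping)]            # next query vertex (fixed order)
        for v in g_adj:                      # try every data vertex
            if v in mapping.values():
                continue                     # injectivity
            ok = all(mapping[w] in g_adj[v]  # preserve edges to mapped nbrs
                     for w in mapping
                     if (u, w) in q_edges or (w, u) in q_edges)
            if ok:
                mapping[u] = v
                extend(mapping)
                del mapping[u]

    extend({})
    return results

# Query: a triangle. Data: a triangle plus a pendant vertex.
matches = subgraph_matches(
    q_edges={("a", "b"), ("b", "c"), ("a", "c")},
    g_edges=[(1, 2), (2, 3), (1, 3), (3, 4)],
)
```

The triangle query yields six embeddings here (all orderings of the data triangle), which illustrates the redundant permutations that TurboISO's NEC-based Comb/Perm strategy avoids enumerating.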

328 citations


01 Jan 2013
TL;DR: Graph databases (GDB) are now a viable alternative to Relational Database Systems (RDBMS) and comparisons will be drawn between relational database systems (Oracle, MySQL) and graph databases (Neo4J) focusing on aspects such as data structures, data model features and query facilities.
Abstract: Graph databases (GDB) are now a viable alternative to Relational Database Systems (RDBMS). Chemistry, biology, the semantic web, social networking and recommendation engines are all examples of applications whose data can be represented much more naturally as a graph. Comparisons will be drawn between relational database systems (Oracle, MySQL) and graph databases (Neo4J) focusing on aspects such as data structures, data model features and query facilities. Additionally, several of the inherent and contemporary limitations of current offerings will be explored, comparing and contrasting graph and relational database implementations.

241 citations


Proceedings ArticleDOI
18 Mar 2013
TL;DR: The results show that the graph-based back-end can match and even outperform the traditional JPA implementation and that Cypher is a promising candidate for a standard graph query language, but still leaves room for improvements.
Abstract: NoSQL and especially graph databases are constantly gaining popularity among developers of Web 2.0 applications as they promise to deliver superior performance when handling highly interconnected data compared to traditional relational databases. Apache Shindig is the reference implementation for OpenSocial with its highly interconnected data model. However, the default back-end is based on a relational database. In this paper we describe our experiences with a different back-end based on the graph database Neo4j and compare the alternatives for querying data with each other and the JPA-based sample back-end running on MySQL. Moreover, we analyze why the different approaches often may yield such diverging results concerning throughput. The results show that the graph-based back-end can match and even outperform the traditional JPA implementation and that Cypher is a promising candidate for a standard graph query language, but still leaves room for improvements.

219 citations


Proceedings ArticleDOI
22 Jun 2013
TL;DR: This work studies the problem of querying graph databases, and, in particular, the expressiveness and complexity of evaluation for several general-purpose query languages, such as regular path queries and their extensions with conjunctions and inverses.
Abstract: Graph databases have gained renewed interest in the last years, due to their applications in areas such as the Semantic Web and Social Network Analysis. We study the problem of querying graph databases, and, in particular, the expressiveness and complexity of evaluation for several general-purpose query languages, such as regular path queries and their extensions with conjunctions and inverses. We distinguish between two semantics for these languages. The first one, based on simple paths, easily leads to intractability, while the second one, based on arbitrary paths, allows tractable evaluation for an expressive family of languages. We also study two recent extensions of these languages that have been motivated by modern applications of graph databases. The first one allows paths to be treated as first-class citizens, while the second one permits queries that combine the topology of the graph with its underlying data.
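The tractability of the arbitrary-path semantics comes from the fact that evaluating a regular path query reduces to reachability, which BFS solves in polynomial time. A minimal sketch for the RPQ `knows+` (one or more 'knows'-labeled edges) over a toy edge-labeled graph; the labels and data are made up for illustration:

```python
# Evaluating a regular path query under arbitrary-path semantics reduces to
# reachability, here for the simple RPQ  knows+  (one or more edges with
# label 'knows'). BFS gives polynomial-time evaluation, in contrast to the
# intractable simple-path semantics.

from collections import deque

def rpq_plus(edges, label, start):
    """All nodes reachable from `start` via >= 1 edges with the given label."""
    adj = {}
    for src, lbl, dst in edges:
        if lbl == label:
            adj.setdefault(src, []).append(dst)
    seen, queue = set(), deque(adj.get(start, []))
    while queue:
        n = queue.popleft()
        if n in seen:
            continue
        seen.add(n)
        queue.extend(adj.get(n, []))
    return seen

edges = [("a", "knows", "b"), ("b", "knows", "c"),
         ("c", "likes", "d"), ("b", "knows", "a")]
reach = rpq_plus(edges, "knows", "a")
```

A full RPQ evaluator would run the same BFS over the product of the graph with the query automaton; restricting to a single label keeps the sketch short.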

192 citations


Proceedings ArticleDOI
08 Sep 2013
TL;DR: A distributed graph database comparison framework is presented, along with the results obtained by comparing four important players in the graph databases market: Neo4j, OrientDB, Titan and DEX.
Abstract: In recent years, more and more companies provide services that can no longer be implemented efficiently using relational databases. As such, these companies are forced to use alternative database models such as XML databases, object-oriented databases, document-oriented databases and, more recently, graph databases. Graph databases have existed for only a few years. Although there have been some comparison attempts, they mostly focus on certain aspects only. In this paper, we present a distributed graph database comparison framework and the results we obtained by comparing four important players in the graph databases market: Neo4j, OrientDB, Titan and DEX.

134 citations


Journal ArticleDOI
TL;DR: A comparative analysis of the graph database Neo4j with the most widespread relational database, MySQL, focusing on the models used for storing and retrieving data.
Abstract: The relational model has dominated the computer industry since the 1980s, mainly for storing and retrieving data. Lately, however, the relational database has been losing its importance due to its reliance on a strict schema, which makes it difficult to add new relationships between objects. Another important reason for its decline is that, as the available data grows manifold, it is becoming complicated to work with the relational model because joining a large number of tables does not work efficiently. One proposed solution is to move to graph databases, which aim to overcome such problems. This paper provides a comparative analysis of the graph database Neo4j with the most widespread relational database, MySQL.

129 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: An in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner is presented, along with analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right construction parameters for a specific scenario.
Abstract: We present a distributed graph database system to manage historical data for large evolving information networks, with the goal of enabling temporal and evolutionary queries and analysis. The cornerstone of our system is a novel, user-extensible, highly tunable, and distributed hierarchical index structure called DeltaGraph, that enables compact recording of the historical network information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right construction parameters for a specific scenario. We also present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.
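The delta-based storage idea behind DeltaGraph can be illustrated with a toy store that keeps the full graph only at time 0 and records edge additions/removals afterwards; a snapshot is reconstructed by replaying deltas. DeltaGraph's hierarchical index makes retrieval far cheaper than this linear replay, so treat the sketch as illustration of the idea only:

```python
# A toy delta-based history store: the full edge set is kept only at time 0,
# and every later time records the edge additions/removals. Retrieving a
# snapshot replays deltas up to the requested time. DeltaGraph's hierarchical
# index avoids this linear replay; the names below are made up.

class HistoryStore:
    def __init__(self, initial_edges):
        self.base = set(initial_edges)
        self.deltas = []                      # list of (added, removed) sets

    def record(self, added=(), removed=()):
        self.deltas.append((set(added), set(removed)))

    def snapshot(self, t):
        """Edge set as of time t (t=0 is the base graph)."""
        edges = set(self.base)
        for added, removed in self.deltas[:t]:
            edges |= added
            edges -= removed
        return edges

h = HistoryStore([(1, 2), (2, 3)])
h.record(added=[(3, 4)])                      # time 1
h.record(removed=[(1, 2)], added=[(4, 1)])    # time 2
```

Storing only deltas is what keeps the history compact and non-redundant; the engineering challenge the paper addresses is retrieving arbitrary snapshots quickly despite that compaction.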

129 citations


Proceedings ArticleDOI
23 Dec 2013
TL;DR: Data modeling and query syntax of relational and some classes of NoSQL databases are explained with the help of a case study of a news website like Slashdot.
Abstract: Relational databases have been providing storage for several decades now. However, for today's interactive web and mobile applications the importance of flexibility and scalability in the data model cannot be overstated. The term NoSQL broadly covers all non-relational databases that provide a schema-less and scalable model. NoSQL databases, also termed Internet-age databases, are currently being used by Google, Amazon, Facebook and many other major organizations operating in the era of Web 2.0. Different classes of NoSQL databases, namely key-value pair, document, column-oriented and graph databases, enable programmers to model the data closer to the format used in their application. In this paper, the data modeling and query syntax of relational and some classes of NoSQL databases are explained with the help of a case study of a news website like Slashdot.

Proceedings ArticleDOI
23 Jun 2013
TL;DR: New ways for exploiting the structure of a database by representing it as a graph are explored, and how the rich information embedded in a graph can improve a bag-of-words-based location recognition method is shown.
Abstract: Recognizing the location of a query image by matching it to a database is an important problem in computer vision, and one for which the representation of the database is a key issue. We explore new ways for exploiting the structure of a database by representing it as a graph, and show how the rich information embedded in a graph can improve a bag-of-words-based location recognition method. In particular, starting from a graph on a set of images based on visual connectivity, we propose a method for selecting a set of subgraphs and learning a local distance function for each using discriminative techniques. For a query image, each database image is ranked according to these local distance functions in order to place the image in the right part of the graph. In addition, we propose a probabilistic method for increasing the diversity of these ranked database images, again based on the structure of the image graph. We demonstrate that our methods improve performance over standard bag-of-words methods on several existing location recognition datasets.

Patent
17 Dec 2013
TL;DR: In this paper, a method for measuring distance among and organizing similar concepts representing human knowledge, whose information is contained, as example, in databases of documents, is presented, which relates to techniques to analyze and organize bodies of knowledge into information networks.
Abstract: The present invention relates to techniques to analyze and organize bodies of knowledge into information networks. More particularly, it relates to a method for measuring distance among and organizing similar concepts representing human knowledge, whose information is contained, for example, in databases of documents. In particular, said method comprises: a) obtaining a plurality of types of entities and their relative properties, wherein at least two of said entities share at least one property; b) creating a multi-partite graph; c) making a projection for each type of entity onto each of their types of properties to obtain a proximity matrix, or a weighted graph, for each pair of entity type and property type; d) obtaining a family of proximity matrices for each type of entity; e) querying the computed results in a format so that for each type of entity, portions of proximity matrices, or weighted graphs, of said family, are interactively accessed, represented or displayed. The present invention relates also to a discovery engine based on the above method.

Journal ArticleDOI
TL;DR: A novel and general-purpose graph-based summarizer is presented, namely GraphSum (Graph-based Summarizer), which discovers and exploits association rules to represent the correlations among multiple terms that have been neglected by previous approaches.

Proceedings ArticleDOI
23 Jun 2013
TL;DR: The motivation for GraphBuilder, its architecture, MapReduce algorithms, and performance evaluation of the framework are described, and several graph partitioning methods are developed and evaluated.
Abstract: Graph abstraction is essential for many applications, from finding a shortest path to executing complex machine learning (ML) algorithms like collaborative filtering. Graph construction from raw data for various applications is becoming challenging, due to exponential growth in data, as well as the need for large scale graph processing. Since graph construction is a data-parallel problem, MapReduce is well-suited for this task. We developed GraphBuilder, a scalable framework for graph Extract-Transform-Load (ETL), to offload many of the complexities of graph construction, including graph formation, tabulation, transformation, partitioning, output formatting, and serialization. GraphBuilder is written in Java, for ease of programming, and it scales using the MapReduce model. In this paper, we describe the motivation for GraphBuilder, its architecture, MapReduce algorithms, and performance evaluation of the framework. Since large graphs should be partitioned over a cluster for storing and processing, and since partitioning methods have significant performance impacts, we develop several graph partitioning methods and evaluate their performance. We also open source the framework at https://01.org/graphbuilder/.
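The map-reduce pattern GraphBuilder applies to graph ETL can be mimicked in-process: mappers emit (vertex, neighbor) pairs from raw records, a shuffle groups them by key, and reducers produce deduplicated adjacency lists. This is a stand-in for the data flow only, not GraphBuilder's Java API:

```python
# In-process sketch of MapReduce-style graph construction: map, shuffle,
# reduce. GraphBuilder runs the same pattern on Hadoop at scale; this
# stand-in just shows the data flow on toy "src dst" records.

from collections import defaultdict

def map_phase(records):
    for line in records:                     # raw input: "src dst" lines
        src, dst = line.split()
        yield (src, dst)
        yield (dst, src)                     # undirected: emit both ways

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # the reducer deduplicates parallel edges and sorts each adjacency list
    return {k: sorted(set(vs)) for k, vs in groups.items()}

raw = ["a b", "b c", "a b"]                  # duplicate edge on purpose
adjacency = reduce_phase(shuffle(map_phase(raw)))
```

In a real MapReduce job the shuffle is performed by the framework between the map and reduce tasks; here it is an explicit function so the three phases are visible.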

Proceedings ArticleDOI
22 Jun 2013
TL;DR: TF-label is an efficient and scalable labeling scheme for processing reachability queries that is constructed based on a novel topological folding that recursively folds an input graph in half so as to reduce the label size, thus improving query efficiency.
Abstract: Reachability querying is a basic graph operation with numerous important applications in databases, network analysis, computational biology, software engineering, etc. Although many indexes have been proposed to answer reachability queries, most of them are only efficient for handling relatively small graphs. We propose TF-label, an efficient and scalable labeling scheme for processing reachability queries. TF-label is constructed based on a novel topological folding (TF) that recursively folds an input graph in half so as to reduce the label size, thus improving query efficiency. We show that TF-label is efficient to construct and propose efficient algorithms and optimization schemes. Our experiments verify that TF-label is significantly more scalable and efficient than the state-of-the-art methods in both index construction and query processing.
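What a reachability labeling buys at query time can be seen with a deliberately naive scheme: label every DAG node with its full descendant set (computed in reverse topological order), so a query becomes one set lookup. TF-label exists precisely because such labels grow too large on big graphs; the sketch below is not TF-label's topological folding:

```python
# Naive reachability labeling for a DAG: each node's label is its full set of
# descendants, built in reverse topological order, so a query is one set
# membership test. Label sizes blow up on large graphs, which is the problem
# TF-label's folding addresses.

def build_labels(nodes, edges):
    adj = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for a, b in edges:
        adj[a].append(b)
        indeg[b] += 1
    # Kahn topological order
    order, frontier = [], [n for n in nodes if indeg[n] == 0]
    while frontier:
        n = frontier.pop()
        order.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                frontier.append(m)
    label = {n: set() for n in nodes}
    for n in reversed(order):                # children are already labeled
        for m in adj[n]:
            label[n] |= {m} | label[m]
    return label

labels = build_labels("abcd", [("a", "b"), ("b", "c"), ("a", "d")])

def reachable(u, v):
    return v in labels[u]                    # one hash lookup per query
```

The trade-off is the classic index one: expensive construction and storage in exchange for constant-time queries, and labeling schemes like TF-label shrink the storage side of that trade.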

Proceedings ArticleDOI
08 Apr 2013
TL;DR: In this article, the authors propose to find all instances of a given sample graph in a larger data graph using a single round of map-reduce, using the techniques of multiway joins.
Abstract: The theme of this paper is how to find all instances of a given “sample” graph in a larger “data graph,” using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of [1] for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be “convertible,” in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.
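The triangle case can be read as the conjunctive query E(x,y) AND E(y,z) AND E(z,x) and evaluated as a multiway join with hash lookups. The serial sketch below shows the join logic that the paper distributes across reducers in a single map-reduce round; it is a stand-in for that logic, not the paper's distributed algorithm:

```python
# Serial stand-in for triangle finding as a multiway join: enumerate paths
# x-y-z with hash lookups and close the cycle by checking edge (z, x).
# The paper's contribution is distributing this join in one map-reduce round.

def triangles(edges):
    edge_set = set(edges) | {(b, a) for a, b in edges}   # undirected
    adj = {}
    for a, b in edge_set:
        adj.setdefault(a, set()).add(b)
    found = set()
    for x in adj:
        for y in adj[x]:
            for z in adj[y]:
                if z != x and (z, x) in edge_set:        # close the cycle
                    found.add(frozenset((x, y, z)))      # dedupe orderings
    return found

tris = triangles([(1, 2), (2, 3), (1, 3), (3, 4), (4, 1)])
```

Deduplicating with `frozenset` hides the six symmetric orderings of each triangle, the same redundancy that careful reducer assignment avoids communicating in the distributed setting.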

Proceedings ArticleDOI
11 Aug 2013
TL;DR: This paper proposes the first differentially private algorithm for mining frequent graph patterns using a Markov Chain Monte Carlo (MCMC) sampling based algorithm and establishes the privacy and utility guarantee of the algorithm and proposes an efficient neighboring pattern counting technique.
Abstract: Discovering frequent graph patterns in a graph database offers valuable information in a variety of applications. However, if the graph dataset contains sensitive data of individuals such as mobile phone-call graphs and web-click graphs, releasing discovered frequent patterns may present a threat to the privacy of individuals. Differential privacy has recently emerged as the de facto standard for private data analysis due to its provable privacy guarantee. In this paper we propose the first differentially private algorithm for mining frequent graph patterns. We first show that previous techniques on differentially private discovery of frequent itemsets cannot apply in mining frequent graph patterns due to the inherent complexity of handling structural information in graphs. We then address this challenge by proposing a Markov Chain Monte Carlo (MCMC) sampling based algorithm. Unlike previous work on frequent itemset mining, our techniques do not rely on the output of a non-private mining algorithm. Instead, we observe that both frequent graph pattern mining and the guarantee of differential privacy can be unified into an MCMC sampling framework. In addition, we establish the privacy and utility guarantee of our algorithm and propose an efficient neighboring pattern counting technique as well. Experimental results show that the proposed algorithm is able to output frequent patterns with good precision.

Journal ArticleDOI
01 Oct 2013
TL;DR: The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.
Abstract: We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.

Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper investigates graph behaviors by exploring changes in the working window (the authors call it wind), where a working window is the set of active vertices that a graph algorithm really needs to access in parallel computing.
Abstract: Graph partitioning is a key issue in graph database processing systems for achieving high efficiency on the Cloud. However, balanced graph partitioning is itself difficult because it is known to be NP-complete. In addition, a static graph partitioning cannot keep all graph algorithms efficient over time in parallel on the Cloud, because the workload balance can differ across iterations and across graph algorithms. In this paper, we investigate graph behaviors by exploring changes in the working window (we call it wind), where a working window is the set of active vertices that a graph algorithm really needs to access in parallel computing. We investigated nine classic graph algorithms using real datasets, and propose simple yet effective policies that can achieve both high graph workload balancing and efficient partitioning on the Cloud.
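For BFS, the working window described above is simply the frontier of each iteration, which typically grows and then shrinks; a partitioner aware of that shift can rebalance accordingly. A toy trace of the window per iteration (the graph here is made up for illustration):

```python
# For BFS, the "working window" of each iteration is the frontier: the set of
# vertices the algorithm actually touches. Tracking its size per iteration
# shows how the active workload shifts, which is what window-aware
# partitioning policies exploit.

def bfs_windows(adj, source):
    """Return the active-vertex set of every BFS iteration."""
    windows, visited, frontier = [], {source}, {source}
    while frontier:
        windows.append(set(frontier))
        nxt = set()
        for v in frontier:
            for w in adj.get(v, ()):
                if w not in visited:
                    visited.add(w)
                    nxt.add(w)
        frontier = nxt
    return windows

adj = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6]}
windows = bfs_windows(adj, 1)
sizes = [len(w) for w in windows]            # how the working window shifts
```

A static partition tuned for the wide middle iterations wastes machines on the narrow first and last ones, which is the imbalance the paper's policies target.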

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper proposes a methodology to convert a relational to a graph database by exploiting the schema and the constraints of the source and provides experimental results that show the feasibility of the solution and the efficiency of query answering over the target database.
Abstract: Graph Database Management Systems provide an effective and efficient solution to data storage in current scenarios where data are more and more connected, graph models are widely used, and systems need to scale to large data sets. In this framework, the conversion of the persistent layer of an application from a relational to a graph data store can be convenient, but it is usually a hard task for database administrators. In this paper we propose a methodology to convert a relational to a graph database by exploiting the schema and the constraints of the source. The approach supports the translation of conjunctive SQL queries over the source into graph traversal operations over the target. We provide experimental results that show the feasibility of our solution and the efficiency of query answering over the target database.
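The core of a schema-driven conversion can be sketched as: every row becomes a node, and every foreign-key reference becomes an edge. The table and column names below are made up for illustration, and the sketch omits the constraint handling and query translation that the paper's methodology covers:

```python
# Toy schema-driven relational-to-graph conversion: rows become nodes,
# foreign-key references become edges. Table/column names are hypothetical.

def tables_to_graph(tables, foreign_keys):
    """tables: {name: [row dicts, each with an 'id' key]};
    foreign_keys: [(table, column, referenced_table)]."""
    nodes, edges = {}, []
    for tname, rows in tables.items():
        for row in rows:
            nodes[(tname, row["id"])] = row          # node keyed by (table, pk)
    for tname, col, ref in foreign_keys:
        for row in tables[tname]:
            if row.get(col) is not None:             # NULL FK -> no edge
                edges.append(((tname, row["id"]), col, (ref, row[col])))
    return nodes, edges

tables = {
    "person": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}],
    "order":  [{"id": 10, "buyer": 1}, {"id": 11, "buyer": 2}],
}
nodes, edges = tables_to_graph(tables, [("order", "buyer", "person")])
```

After conversion, a join such as order ⋈ person over the `buyer` key becomes a single edge traversal, which is why conjunctive SQL queries map naturally onto graph traversals.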

Proceedings ArticleDOI
18 Mar 2013
TL;DR: This work investigates the behavior and applicability of XPath-like languages for querying graph databases, concentrating on their expressiveness and complexity of query evaluation, and introduces new types of tests that let them capture first-order logic with data comparisons and prove that the low complexity bounds continue to apply to such extended languages.
Abstract: XPath plays a prominent role as an XML navigational language due to several factors, including its ability to express queries of interest, its close connection to yardstick database query languages (e.g., first-order logic), and the low complexity of query evaluation for many fragments. Another common database model---graph databases---also requires a heavy use of navigation in queries; yet it largely adopts a different approach to querying, relying on reachability patterns expressed with regular constraints. Our goal here is to investigate the behavior and applicability of XPath-like languages for querying graph databases, concentrating on their expressiveness and complexity of query evaluation. We are particularly interested in a model of graph data that combines navigation through graphs with querying data held in the nodes, such as, for example, in a social network scenario. As navigational languages, we use analogs of core and regular XPath and augment them with various tests on data values. We relate these languages to first-order logic, its transitive closure extensions, and finite-variable fragments thereof, proving several capture results. In addition, we describe their relative expressive power. We then show that they behave very well computationally: they have a low-degree polynomial combined complexity, which becomes linear for several fragments. Furthermore, we introduce new types of tests for XPath languages that let them capture first-order logic with data comparisons and prove that the low complexity bounds continue to apply to such extended languages. Therefore, XPath-like languages seem to be very well-suited to query graphs.

Journal ArticleDOI
TL;DR: In this paper, the authors compared the performance of a relational database and a graph database for solving three bioinformatics graph processing problems: finding immediate neighbors and their interactions, finding the best scoring path between two proteins and finding the shortest path between them.
Abstract: Graphs are ubiquitous in bioinformatics and frequently consist of too many nodes and edges to represent in random access memory. These graphs are thus stored in databases to allow for efficient queries using declarative query languages such as Structured Query Language (SQL). Traditional relational databases (e.g. MySQL and PostgreSQL) have long been used for this purpose and are based on decades of research into query optimization. Recently, NoSQL databases have caught a lot of attention because of their advantages in scalability. The term NoSQL is used to refer to schemaless databases such as key/value stores (e.g. Apache Cassandra), document stores (e.g. MongoDB) and graph databases (e.g. AllegroGraph, Neo4j, OpenLink Virtuoso), which do not fit within the traditional relational paradigm. Most NoSQL databases do not have a declarative query language. The widely used Neo4j graph database is an exception (Webber et al., 2013). Its query language Cypher is designed for expressing graph queries, but is still evolving. Graph databases have so far seen only limited use within bioinformatics (Schriml et al., 2012). To illustrate the pros and cons of using a graph database (exemplified by Neo4j v1.8.1) instead of a relational database (PostgreSQL v9.1), we imported into both databases the human interaction network from STRING v9.05 (Franceschini et al., 2013), which is an approximately scale-free network with 20 140 proteins and 2.2 million interactions. Like all graph databases, Neo4j stores edges as direct pointers between nodes, which can thus be traversed in constant time. Because Neo4j uses the property graph model, nodes and edges can have properties associated with them; we use this for storing the protein names and the confidence scores associated with the interactions (Fig. 1). In PostgreSQL, we stored the graph as an indexed table of node pairs, which can be traversed with either logarithmic or constant lookup complexity depending on the type of index used.
On these databases we benchmarked the speed of Cypher and SQL queries for solving three bioinformatics graph processing problems: finding immediate neighbors and their interactions, finding the best scoring path between two proteins, and finding the shortest path between them. We have selected these three tasks because they illustrate well the strengths and weaknesses of graph databases compared with traditional relational databases.
Fig. 1. Relational versus graph database representation of a small protein interaction network. In the relational database, the network is stored as an interactions table (left). By contrast, a graph database directly stores interactions as pointers between protein ...
A common task in STRING is to retrieve a neighbor network. This involves finding the immediate neighbors of a protein and all interactions between them. To express this as a single SQL query requires the use of query nesting and a UNION set operation. Because Cypher currently supports neither of these features, two queries are needed to solve the task: one to find immediate neighbors and a second to find their interactions, which must be run for each of the immediate neighbors. Although this precludes some query optimizations, running all these Cypher queries is 36× faster than running the single SQL query (Table 1). However, it should be noted that a 49× speedup is attainable with PostgreSQL by similarly decomposing the complex query into multiple simple SQL queries. In theory, posing the task as one declarative query maximizes the opportunity for query optimization, but in practice this does not always give good performance. These results also show that even for graph data, using a graph database is not necessarily an advantage.
Table 1. Query benchmark of a relational and a graph database
Finding the best scoring path in a weighted graph is another frequently occurring task.
For example, finding the best scoring path connecting two proteins in the STRING network is a crucial part of the NetworKIN algorithm (Linding et al., 2007). This task can be expressed as a single query both in (recursive) SQL and in Cypher. However, in practice neither query can be executed unless the maximal path length is severely constrained, in which case the Cypher query was 981× faster (Table 1). The poor scalability is due to an exponential explosion in the number of longer paths, which is partly a consequence of the scale-free nature of the network. The task can be efficiently solved using Dijkstra’s algorithm, but neither database is capable of casting queries as dynamic programming problems, although promising results have been achieved with automatic dynamic programming in declarative languages (Zhou et al., 2010). By contrast, the Cypher graph query language has a dedicated function for finding shortest paths, which does not take edge weights into account. This leads to a massive speed improvement for this specific task: Neo4j is able to find the shortest path with no length constraint 2441× faster than PostgreSQL can find the shortest path when constraining the maximal path length to two edges. This shows what is possible when tightly integrating efficient algorithms with graph databases. In summary, graph databases themselves are ready for bioinformatics and can offer great speedups over relational databases on selected problems. The fact that a certain dataset is a graph, however, does not necessarily imply that a graph database is the best choice; it depends on the exact types of queries that need to be performed. Graph queries formulated in terms of paths can be concise and intuitive compared with equivalent SQL queries complicated by joins. Nevertheless, declarative graph query languages leave much to be desired, both feature-wise and performance-wise. Relational databases are a better choice when set operations are needed.
Such operations are not as natural a fit for graph databases and have yet to make it into declarative graph database query languages. These languages are efficient for basic path traversal problems, but to realize the full benefits of using a graph database, it is presently necessary to tightly integrate the relevant algorithms with the graph database. Conflict of Interest: none declared.
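The letter notes that the best scoring path task is efficiently solved by Dijkstra's algorithm, even though neither query language can express it. As a standalone illustration, the idea can be sketched in a few lines of Python; the toy network and the -log(confidence) weighting below are assumptions for the sketch, not the NetworKIN implementation:

```python
import heapq
from math import log

def best_scoring_path(edges, source, target):
    """Dijkstra on -log(confidence) weights: minimizing the summed cost
    maximizes the product of edge confidences along the path."""
    graph = {}
    for u, v, score in edges:
        graph.setdefault(u, []).append((v, -log(score)))
        graph.setdefault(v, []).append((u, -log(score)))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [target], target  # walk predecessors back to the source
    while node != source:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

# toy STRING-like network: (protein_a, protein_b, confidence in (0, 1])
edges = [("A", "B", 0.9), ("B", "C", 0.9), ("A", "C", 0.5), ("C", "D", 0.8)]
print(best_scoring_path(edges, "A", "D"))  # ['A', 'B', 'C', 'D']
```

The two high-confidence hops (0.9 × 0.9 × 0.8) beat the shorter route through the 0.5 edge, which is exactly the behavior Cypher's unweighted `shortestPath` cannot reproduce.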

Journal ArticleDOI
TL;DR: Experimental results demonstrate a 98% detection rate and a 0% false-positive rate for the proposed malware detection system; its graph matching algorithm is based on the Longest Common Subsequence (LCS) algorithm applied to the simplified graphs.
Abstract: Malware stands for malicious software: software that is designed with a harmful intent. A malware detector is a system that attempts to identify malware using Application Programming Interface (API) call graph techniques and/or other techniques. Matching API call graphs with a graph matching algorithm is an NP-complete problem and is slow because of its computational complexity. In this study, a malware detection system based on API call graphs is proposed. Each malware sample is represented as a data-dependent API call graph. After transforming the input sample into a simplified data-dependent graph, a graph matching algorithm is used to calculate the similarity between the input sample and the malware API call graph samples stored in a database. The graph matching algorithm is based on the Longest Common Subsequence (LCS) algorithm, which is applied to the simplified graphs. This strategy reduces the computational complexity by selecting paths with the same edge label in the API call graph. Experimental results on 85 samples demonstrate a 98% detection rate and a 0% false-positive rate for the proposed malware detection system.
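The LCS step at the core of this approach is easy to sketch. The following is a minimal illustration only: the API names, the path extraction, and the normalisation into a similarity score are assumptions, and the paper's full pipeline additionally simplifies the data-dependent graph and matches edge labels:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def path_similarity(sample_path, known_malware_path):
    """Normalised LCS score in [0, 1]: 1.0 means the two API call paths
    are identical; high scores flag close matches to known malware."""
    return lcs_length(sample_path, known_malware_path) / max(
        len(sample_path), len(known_malware_path))

# hypothetical API call paths extracted from two call graphs
sample  = ["CreateFile", "WriteFile", "RegSetValue", "CloseHandle"]
malware = ["CreateFile", "RegSetValue", "CloseHandle"]
print(path_similarity(sample, malware))  # 0.75
```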

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work introduces a data model for time-varying social network data that can be represented as a property graph in the Neo4j graph database, and uses data collected with wearable sensors to study the performance of real-world queries.
Abstract: Representing and efficiently querying time-varying social network data is a central challenge that needs to be addressed in order to support a variety of emerging applications that leverage high-resolution records of human activities and interactions from mobile devices and wearable sensors. In order to support the needs of specific applications, as well as general tasks related to data curation, cleaning, linking, post-processing, and data analysis, data models and data stores are needed that afford efficient and scalable querying of the data. In particular, it is important to design solutions that allow rich queries that simultaneously involve the topology of the social network, temporal information on the presence and interactions of individual nodes, and node metadata. Here we introduce a data model for time-varying social network data that can be represented as a property graph in the Neo4j graph database. We use time-varying social network data collected by using wearable sensors and study the performance of real-world queries, pointing to strengths, weaknesses and challenges of the proposed approach.
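The kind of rich query the abstract describes, combining topology, per-edge temporal intervals, and node metadata, can be illustrated in plain Python. The schema, field names, and data below are illustrative assumptions, not the actual Neo4j property graph from the paper:

```python
# Toy time-varying contact network: each edge carries a validity
# interval, mimicking a temporal property-graph edge.
interactions = [
    {"a": "p1", "b": "p2", "start": 10, "end": 40},
    {"a": "p1", "b": "p3", "start": 50, "end": 90},
    {"a": "p2", "b": "p3", "start": 20, "end": 30},
]
node_meta = {"p1": {"role": "teacher"},
             "p2": {"role": "student"},
             "p3": {"role": "student"}}

def contacts_in_window(person, t0, t1):
    """Who interacted with `person` during any time overlapping [t0, t1]?

    Combines topology (incident edges), temporal information (interval
    overlap) and node metadata (the contact's role) in one query."""
    out = []
    for e in interactions:
        if person in (e["a"], e["b"]) and e["start"] < t1 and e["end"] > t0:
            other = e["b"] if e["a"] == person else e["a"]
            out.append((other, node_meta[other]["role"]))
    return sorted(out)

print(contacts_in_window("p1", 0, 45))  # [('p2', 'student')]
```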

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper proposes a microbenchmark based on social networks and concludes that reachability queries are the ones that give all the database systems the most difficulty, which justifies their inclusion and makes them good candidates for more complex benchmarks.
Abstract: Graphs have become an indispensable tool for the analysis of linked data. As with any data representation, the need for database management systems appears when the data grow in size and complexity. To assess the performance of such systems in specific scenarios representative of real use cases, benchmarks are needed. In this paper we propose a microbenchmark based on social networks. This includes a data generator that synthetically creates social graphs, and a set of low-level atomic queries that model parts of the behavior of social network users. In order to understand how different data management paradigms are stressed, we execute the benchmark over five different database systems representing graph (Dex and Neo4j), RDF (RDF-3X) and relational (Virtuoso and PostgreSQL) data management. We conclude that reachability queries are the ones that give all the database systems the most difficulty, which justifies their inclusion and makes them good candidates for more complex benchmarks.
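The reachability queries the benchmark identifies as hardest are conceptually just graph traversals. A minimal breadth-first search sketch (the toy graph is illustrative, not the benchmark's synthetic generator):

```python
from collections import deque

def reachable(adj, source, target):
    """Breadth-first search: is there a directed path source -> target?"""
    seen, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

# tiny directed friendship graph (illustrative)
adj = {"alice": ["bob"], "bob": ["carol"], "carol": [], "dave": ["alice"]}
print(reachable(adj, "dave", "carol"))   # True
print(reachable(adj, "carol", "alice"))  # False
```

What stresses database systems is precisely that such a traversal has no fixed depth, so it cannot be expressed as a bounded sequence of joins.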

Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper proposes a new framework of processing kGPM with on-the-fly ranked lists based on spanning trees of the cyclic graph query, and proposes a cost model to estimate the least number of tree answers to be consumed in each ranked list for a given kG PM query.
Abstract: There exist many graph-based applications, including bioinformatics, social science, link analysis, citation analysis, and collaborative work, all of which need to deal with a large data graph. Given a large data graph, in this paper we study finding top-k answers for a graph pattern query (kGPM), and in particular we focus on top-k cyclic graph queries, where a graph query is cyclic and can be complex. The capability of supporting kGPM gives users much more flexibility in searching graphs, and the problem itself is challenging. In this paper, we propose a new framework for processing kGPM with on-the-fly ranked lists based on spanning trees of the cyclic graph query. We observe a multidimensional representation for using multiple ranked lists to answer a given kGPM query. Under this representation, we propose a cost model to estimate the least number of tree answers to be consumed in each ranked list for a given kGPM query. This leads to a query optimization approach for kGPM processing, and a top-k algorithm to process kGPM with the optimal query plan. We conducted extensive performance studies using a synthetic dataset and a real dataset, and the results confirm the efficiency of our proposed approach.
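The idea of consuming several ranked lists only as deep as needed can be illustrated with a classic Fagin-style threshold algorithm. This is a toy simplification with made-up scores, not the paper's algorithm: the spanning-tree answer lists, cost model, and query plans there are considerably more involved:

```python
import heapq

def top_k_combined(lists, k):
    """Fagin-style threshold algorithm over ranked lists.

    Each list is sorted descending by score and also indexed by id for
    random access; the combined score of an answer is its sum across
    lists. Scanning stops once the k-th best combined score seen so far
    reaches the threshold (the sum of the last scores seen under sorted
    access), so the tails of the lists are never consumed."""
    index = [{ident: score for score, ident in lst} for lst in lists]
    top, seen = [], set()
    for depth in range(max(len(lst) for lst in lists)):
        last = []  # scores at this depth, used for the stop threshold
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                last.append(0.0)
                continue
            score, ident = lst[depth]
            last.append(score)
            if ident not in seen:
                seen.add(ident)
                combined = sum(ix.get(ident, 0.0) for ix in index)
                top.append((combined, ident))
        top = heapq.nlargest(k, top)
        if len(top) == k and top[-1][0] >= sum(last):
            break
    return top

# two ranked lists of (score, answer_id), sorted descending
l1 = [(0.9, "x"), (0.8, "y"), (0.1, "z")]
l2 = [(0.9, "y"), (0.7, "x"), (0.2, "z")]
print([ident for _, ident in top_k_combined([l1, l2], 2)])  # ['y', 'x']
```

Here the algorithm stops after reading two entries of each list; the low-scoring answer "z" is never fully examined.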

Proceedings ArticleDOI
01 Oct 2013
TL;DR: To address the deficit in computing capability, this article proposes a flexible and extensible clustering algorithm framework with shared-memory parallelism that implements parallel variations of known sequential algorithms.
Abstract: The amount of graph-structured data has recently experienced an enormous growth in many applications. To transform such data into useful information, high-performance analytics algorithms and software tools are necessary. One common graph analytics kernel is community detection (or graph clustering). Despite extensive research on heuristic solvers for this task, only a few parallel codes exist, although parallelism is often necessary to scale to the data volume of real-world applications. We address this deficit in computing capability with a flexible and extensible clustering algorithm framework with shared-memory parallelism. Within this framework we implement our parallel variations of known sequential algorithms and combine them in an ensemble approach. In extensive experiments driven by the algorithm engineering paradigm, we identify the most successful parameters and combinations of these algorithms. The processing rate of our fastest algorithm exceeds 10M edges/second for many large graphs, making it suitable for massive data streams. Moreover, the strongest algorithm we developed yields a very good tradeoff between quality and speed.
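For a concrete feel of the kind of sequential heuristic such a framework parallelizes, here is a label propagation sketch, one of the classic community-detection baselines. The deterministic node order and tie-breaking rule are simplifying assumptions for this demo; the paper's parallel variants differ:

```python
def label_propagation(adj, max_rounds=100):
    """Sequential label propagation: every node repeatedly adopts the
    label most frequent among its neighbours (ties broken towards the
    larger label) until no label changes."""
    labels = {v: v for v in adj}
    for _ in range(max_rounds):
        changed = False
        for v in sorted(adj):  # deterministic order for the demo
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts, key=lambda l: (counts[l], l))
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels

# two 4-cliques joined by a single bridge edge (4-5)
adj = {1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3, 5],
       5: [4, 6, 7, 8], 6: [5, 7, 8], 7: [5, 6, 8], 8: [5, 6, 7]}
labels = label_propagation(adj)
print(sorted(set(labels.values())))  # one label per clique
```

Each node's update touches only its own neighbourhood, which is what makes the scheme amenable to shared-memory parallelism.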

Proceedings ArticleDOI
27 Oct 2013
TL;DR: This paper derives a lower bound, the branch-based bound, which can greatly reduce the search space of graph similarity search, and proposes a tree index structure, namely the b-tree, to facilitate effective pruning and efficient query processing.
Abstract: Due to the many real applications of graph databases, it has become increasingly important to retrieve graphs g (in a graph database D) that approximately match a query graph q, rather than only exact subgraph matches. In this paper, we study the problem of graph similarity search, which retrieves graphs that are similar to a given query graph under the constraint of the minimum edit distance. Specifically, we derive a lower bound, the branch-based bound, which can greatly reduce the search space of graph similarity search. We also propose a tree index structure, namely the b-tree, to facilitate effective pruning and efficient query processing. Extensive experiments confirm that our proposed approach outperforms the existing approaches by orders of magnitude, in terms of both pruning power and query response time.
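The filtering principle, prune any graph whose edit-distance lower bound already exceeds the threshold, can be demonstrated with a much simpler (and weaker) bound than the paper's branch-based one: the node-label multiset difference. The database contents below are hypothetical:

```python
from collections import Counter

def label_count_lower_bound(g_labels, q_labels):
    """Lower bound on graph edit distance from node-label multisets:
    at least this many node insertions/deletions/relabelings are needed
    before the two label multisets can match."""
    cg, cq = Counter(g_labels), Counter(q_labels)
    common = sum((cg & cq).values())  # multiset intersection size
    return max(len(g_labels), len(q_labels)) - common

def candidates(database, q_labels, tau):
    """Filtering step: discard graphs whose lower bound exceeds tau;
    only the survivors need an expensive edit-distance verification."""
    return [gid for gid, g_labels in database.items()
            if label_count_lower_bound(g_labels, q_labels) <= tau]

# hypothetical database of node-label multisets (e.g. atom types)
db = {"g1": ["C", "C", "O", "N"],
      "g2": ["C", "C", "C", "C"],
      "g3": ["C", "O"]}
print(candidates(db, ["C", "C", "O"], tau=1))  # ['g1', 'g3']
```

The branch-based bound in the paper is strictly tighter because it also accounts for the edges around each node, so it prunes more graphs than this label-only baseline.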

Proceedings ArticleDOI
22 Jun 2013
TL;DR: The goal is to introduce languages that work directly over triples and are closed, i.e., they produce sets of triples, rather than graphs, and compares them with relational languages, such as finite-variable logics, and previously studied graph query languages such as adaptations of XPath, regular path queries, and nested regular expressions.
Abstract: Querying RDF data is viewed as one of the main applications of graph query languages, and yet the standard model of graph databases -- essentially labeled graphs -- is different from the triples-based model of RDF. While encodings of RDF databases into graph data exist, we show that even the most natural ones are bound to lose some functionality when used in conjunction with graph query languages. The solution is to work directly with triples, but then many properties taken for granted in the graph database context (e.g., reachability) lose their natural meaning.Our goal is to introduce languages that work directly over triples and are closed, i.e., they produce sets of triples, rather than graphs. Our basic language is called TriAL, or Triple Algebra: it guarantees closure properties by replacing the product with a family of join operations. We extend TriAL with recursion, and explain why such an extension is more intricate for triples than for graphs. We present a declarative language, namely a fragment of datalog, capturing the recursive algebra. For both languages, the combined complexity of query evaluation is given by low-degree polynomials. We compare our languages with relational languages, such as finite-variable logics, and previously studied graph query languages such as adaptations of XPath, regular path queries, and nested regular expressions; many of these languages are subsumed by the recursive triple algebra. We also provide examples of the usefulness of TriAL in querying graph and RDF data.