
Showing papers on "Graph database published in 2017"


Posted Content
TL;DR: Graph Attention Networks (GATs) as discussed by the authors leverage masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations.
Abstract: We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods' features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).
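As a concrete illustration of the mechanism, here is a minimal NumPy sketch of a single masked attention head in the spirit of the paper; the shapes and random weights are stand-ins for learned parameters, not the authors' implementation.

import numpy as np

def gat_head(h, adj, W, a, slope=0.2):
    z = h @ W                                     # linear transform of node features
    N = z.shape[0]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]) for every ordered pair (i, j)
    pairs = np.concatenate([np.repeat(z, N, axis=0), np.tile(z, (N, 1))], axis=1)
    e = (pairs @ a).reshape(N, N)
    e = np.where(e > 0, e, slope * e)             # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                # mask: attend only to neighbours
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)    # softmax per neighbourhood
    return att @ z                                # attention-weighted aggregation

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                       # 4 nodes, 8 input features
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])                    # adjacency with self-loops
out = gat_head(h, adj, rng.normal(size=(8, 5)), rng.normal(size=(10, 1)))
print(out.shape)                                  # (4, 5)

Note how the mask alone encodes the graph: no matrix inversion or full-graph spectral decomposition is needed, which is what makes the layer applicable inductively.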

1,016 citations


Proceedings ArticleDOI
03 Apr 2017
TL;DR: Explicit Semantic Ranking is introduced, a new ranking technique that leverages knowledge graph embedding: it represents queries and documents in the entity space and ranks them based on their semantic connections in the knowledge graph embedding.
Abstract: This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. Analysis of the query log from our academic search engine, SemanticScholar.org, reveals that a major error source is its inability to understand the meaning of research concepts in queries. To address this challenge, ESR represents queries and documents in the entity space and ranks them based on their semantic connections from their knowledge graph embedding. Experiments demonstrate ESR's ability to improve Semantic Scholar's online production system, especially on hard queries where word-based ranking fails.

341 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a survey of the fundamental graph querying functionalities, such as graph patterns and navigational expressions, which are used in modern graph query languages such as SPARQL, Cypher and Gremlin.
Abstract: We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges, and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start with graph patterns, in which a graph-structured query is matched against the data. Thereafter, we discuss navigational expressions, in which patterns can be matched recursively against the graph to navigate paths of arbitrary length; we give an overview of what kinds of expressions have been proposed and how they can be combined with graph patterns. We also discuss several semantics under which queries using the previous features can be evaluated, what effects the selection of features and semantics has on complexity, and offer examples of such features in three modern languages that are used to query graphs: SPARQL, Cypher, and Gremlin. We conclude by discussing the importance of formalisation for graph query languages; a summary of what is known about SPARQL, Cypher, and Gremlin in terms of expressivity and complexity; and an outline of possible future directions for the area.
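To make the two querying functionalities concrete, the same information need - a basic graph pattern extended with a navigational (arbitrary-length path) expression - can be sketched in the three surveyed languages; the Person/knows schema below is illustrative, and SPARQL prefix declarations are omitted.

# people transitively known by Alice, in each of the three languages
cypher = """
MATCH (a:Person {name: 'Alice'})-[:KNOWS*1..]->(b:Person)
RETURN DISTINCT b.name
"""
sparql = """
SELECT DISTINCT ?name WHERE {
  ?a foaf:name "Alice" .
  ?a foaf:knows+ ?b .     # property path = navigational expression
  ?b foaf:name ?name .
}
"""
gremlin = "g.V().has('name','Alice').repeat(out('knows')).emit().values('name').dedup()"

Each query pairs a pattern (the 'Alice' vertex) with recursion over an edge label, the combination whose semantics and complexity the survey analyzes.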

213 citations


Proceedings ArticleDOI
23 Apr 2017
TL;DR: This paper designs a new graph processing engine, named Mosaic, and proposes a new locality-optimizing, space-efficient graph representation---Hilbert-ordered tiles, and a hybrid execution model that enables vertex-centric operations in fast host processors and edge-centric operations in massively parallel coprocessors.
Abstract: Processing a one trillion-edge graph has recently been demonstrated by distributed graph engines running on clusters of tens to hundreds of nodes. In this paper, we employ a single heterogeneous machine with fast storage media (e.g., NVMe SSD) and massively parallel coprocessors (e.g., Xeon Phi) to reach similar dimensions. By fully exploiting the heterogeneous devices, we design a new graph processing engine, named Mosaic, for a single machine. We propose a new locality-optimizing, space-efficient graph representation---Hilbert-ordered tiles, and a hybrid execution model that enables vertex-centric operations in fast host processors and edge-centric operations in massively parallel coprocessors. Our evaluation shows that for smaller graphs, Mosaic consistently outperforms other state-of-the-art out-of-core engines by 3.2-58.6x and shows comparable performance to distributed graph engines. Furthermore, Mosaic can complete one iteration of the Pagerank algorithm on a trillion-edge graph in 21 minutes, outperforming a distributed disk-based engine by 9.2×.
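The locality idea behind Hilbert-ordered tiles can be illustrated with the textbook Hilbert-curve conversion: treating each edge (src, dst) as a point in the adjacency matrix and sorting edges by Hilbert distance keeps edges with nearby endpoints close together on disk. This is only a sketch of the ordering principle, not Mosaic's actual tiling code; n must be a power of two.

def hilbert_d(n, x, y):
    # standard iterative (x, y) -> Hilbert-distance conversion on an n x n grid
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                   # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

edges = [(0, 7), (1, 6), (6, 1), (7, 0), (0, 1)]
edges.sort(key=lambda e: hilbert_d(8, e[0], e[1]))   # locality-preserving edge order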

162 citations


Proceedings ArticleDOI
30 Oct 2017
TL;DR: This paper proposes LDPGen, a novel multi-phase technique that incrementally clusters users based on their connections to different partitions of the whole population, and derives optimal parameters in this process to cluster structurally-similar users together.
Abstract: A large amount of valuable information resides in decentralized social graphs, where no entity has access to the complete graph structure. Instead, each user maintains locally a limited view of the graph. For example, in a phone network, each user keeps a contact list locally in her phone, and does not have access to other users' contacts. The contact lists of all users form an implicit social graph that could be very useful to study the interaction patterns among different populations. However, due to privacy concerns, one could not simply collect the unfettered local views from users and reconstruct a decentralized social network. In this paper, we investigate techniques to ensure local differential privacy of individuals while collecting structural information and generating representative synthetic social graphs. We show that existing local differential privacy and synthetic graph generation techniques are insufficient for preserving important graph properties, due to excessive noise injection, inability to retain important graph structure, or both. Motivated by this, we propose LDPGen, a novel multi-phase technique that incrementally clusters users based on their connections to different partitions of the whole population. Every time a user reports information, LDPGen carefully injects noise to ensure local differential privacy. We derive optimal parameters in this process to cluster structurally-similar users together. Once a good clustering of users is obtained, LDPGen adapts existing social graph generation models to construct a synthetic social graph. We conduct comprehensive experiments over four real datasets to evaluate the quality of the obtained synthetic graphs, using a variety of metrics, including (i) important graph structural measures; (ii) quality of community discovery; and (iii) applicability in social recommendation. Our experiments show that the proposed technique produces high-quality synthetic graphs that well represent the original decentralized social graphs, and significantly outperform those from baseline approaches.
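The user-side building block of local differential privacy can be sketched with classic randomized response over a user's connection bits; the parameterization below is illustrative and much simpler than LDPGen's optimized degree-vector reports.

import math, random

def randomized_response(bits, epsilon):
    # keep each bit with probability e^eps / (1 + e^eps); flipping
    # otherwise satisfies epsilon-LDP for that bit
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [b if random.random() < p_keep else 1 - b for b in bits]

def user_report(neighbors, partition_of, k, epsilon):
    # bit i = "do I have at least one edge into partition i?" (toy view)
    bits = [0] * k
    for v in neighbors:
        bits[partition_of[v]] = 1
    return randomized_response(bits, epsilon)

The aggregator never sees the true bits, yet it can still estimate partition-level edge densities from many noisy reports, which is the kind of structural information a synthetic graph generator consumes.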

159 citations


Posted Content
TL;DR: G-CORE as mentioned in this paper is a graph query language with two key characteristics: it should be composable, meaning that graphs are the input and the output of queries, and it should treat paths as first-class citizens.
Abstract: We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals and strikes a careful balance between path query expressivity and evaluation complexity.

112 citations


Proceedings ArticleDOI
09 May 2017
TL;DR: Graphflow is demonstrated, a prototype active graph database that evaluates general one-time and continuous subgraph queries and supports the property graph data model and the Cypher++ query language, which extends Neo4j's declarative Cypher language with subgraph-condition-action triggers.
Abstract: Many applications detect the emergence or deletion of certain subgraphs in their input graphs continuously. In order to evaluate such continuous subgraph queries, these applications resort to inefficient or highly specialized solutions because existing graph databases are passive systems that only support one-time subgraph queries. We demonstrate Graphflow, a prototype active graph database that evaluates general one-time and continuous subgraph queries. Graphflow supports the property graph data model and the Cypher++ query language, which extends Neo4j's declarative Cypher language with subgraph-condition-action triggers. At the core of Graphflow's query processor are two worst-case optimal join algorithms called Generic Join and our new Delta Generic Join algorithm for one-time and continuous subgraph queries, respectively.

98 citations


Book ChapterDOI
08 Apr 2017
TL;DR: This work presents Paper2vec, a novel neural-network-embedding-based approach for creating scientific paper representations that make use of both textual and graph-based information, and demonstrates the efficacy of the representations on three real-world academic datasets.
Abstract: We present Paper2vec, a novel neural network embedding based approach for creating scientific paper representations which make use of both textual and graph-based information. An academic citation network can be viewed as a graph where individual nodes contain rich textual information. With the current trend of open access to most scientific literature, we presume that the full text of a scientific article contains a vital source of information that aids in various recommendation and prediction tasks concerning this domain. To this end, we propose an approach, Paper2vec, which combines information from both modalities and results in a rich representation for scientific papers. Over the recent past, representation learning techniques have been studied extensively using neural networks; however, they have been modeled independently for text and graph data. Paper2vec leverages recent research in the broader field of unsupervised feature learning from both graphs and text documents. We demonstrate the efficacy of our representations on three real-world academic datasets in two tasks - node classification and link prediction - where Paper2vec is able to outperform the state-of-the-art by a considerable margin.

85 citations


Proceedings ArticleDOI
01 Nov 2017
TL;DR: This work introduces a novel graph kernel based on the k-dimensional Weisfeiler-Lehman algorithm, and devise a stochastic version of the kernel with provable approximation guarantees using conditional Rademacher averages.
Abstract: Most state-of-the-art graph kernels only take local graph properties into account, i.e., the kernel is computed with regard to properties of the neighborhood of vertices or other small substructures. On the other hand, kernels that do take global graph properties into account may not scale well to large graph databases. Here we propose to start exploring the space between local and global graph kernels, so-called glocalized graph kernels, striking the balance between both worlds. Specifically, we introduce a novel graph kernel based on the k-dimensional Weisfeiler-Lehman algorithm. Unfortunately, the k-dimensional Weisfeiler-Lehman algorithm scales exponentially in k. Consequently, we devise a stochastic version of the kernel with provable approximation guarantees using conditional Rademacher averages. On bounded-degree graphs, it can even be computed in constant time. We support our theoretical results with experiments on several graph classification benchmarks, showing that our kernels often outperform the state-of-the-art in terms of classification accuracies.
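For intuition, here is one refinement round of the classic 1-dimensional Weisfeiler-Lehman relabeling that this kernel family builds on; the paper's kernel uses the k-dimensional generalization over vertex tuples, which this sketch does not implement.

def wl_refine(graph, labels, rounds=3):
    # graph: vertex -> list of neighbours; labels: vertex -> int
    for _ in range(rounds):
        sigs = {v: (labels[v], tuple(sorted(labels[u] for u in graph[v])))
                for v in graph}
        # compress each (label, neighbour-multiset) signature to a small int
        table = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        labels = {v: table[s] for v, s in sigs.items()}
    return labels

g = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(wl_refine(g, {v: 0 for v in g}))   # refined labels after 3 rounds

A kernel value between two graphs is then typically the inner product of their label-count histograms.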

80 citations


Journal ArticleDOI
TL;DR: In this paper, graph neural networks (Graph-NNs) are used to compute the embeddings of OOKB entities, exploiting the limited auxiliary knowledge provided at test time.
Abstract: Knowledge base completion (KBC) aims to predict missing information in a knowledge base. In this paper, we address the out-of-knowledge-base (OOKB) entity problem in KBC: how to answer queries concerning test entities not observed at training time. Existing embedding-based KBC models assume that all test entities are available at training time, making it unclear how to obtain embeddings for new entities without costly retraining. To solve the OOKB entity problem without retraining, we use graph neural networks (Graph-NNs) to compute the embeddings of OOKB entities, exploiting the limited auxiliary knowledge provided at test time. The experimental results show the effectiveness of our proposed model in the OOKB setting. Additionally, in the standard KBC setting in which OOKB entities are not involved, our model achieves state-of-the-art performance on the WordNet dataset. The code and dataset are available at this https URL
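The propagation idea can be sketched in a few lines: the embedding of an unseen entity is aggregated from its known neighbours in the auxiliary triples. The mean pooling and translation-style messages below are illustrative stand-ins for the paper's learned Graph-NN transformations.

import numpy as np

def ookb_embedding(aux_triples, ent_emb, rel_emb, target):
    msgs = []
    for h, r, t in aux_triples:
        if t == target and h in ent_emb:
            msgs.append(ent_emb[h] + rel_emb[r])   # estimate of t from (h, r, ?)
        elif h == target and t in ent_emb:
            msgs.append(ent_emb[t] - rel_emb[r])   # estimate of h from (?, r, t)
    return np.mean(msgs, axis=0) if msgs else None

Because only a forward pass over the auxiliary triples is needed, the stored embeddings never have to be retrained.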

77 citations


Proceedings ArticleDOI
22 Mar 2017
TL;DR: This paper proposes a linear time function call graph (FCG) vector representation based on function clustering that has significant performance gains in addition to improved classification accuracy and shows how this representation can enable using graph features together with other non-graph features.
Abstract: In an attempt to preserve the structural information in malware binaries during feature extraction, function call graph-based features have been used in various research works in malware classification. However, the approach usually employed when performing classification on these graphs is based on computing graph similarity using computationally intensive techniques. Due to this, much of the previous work in this area incurs a large performance overhead and does not scale well. In this paper, we propose a linear-time function call graph (FCG) vector representation based on function clustering that yields significant performance gains in addition to improved classification accuracy. We also show how this representation can enable using graph features together with other non-graph features.
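The linear-time representation can be sketched as a single pass over the call graph's edges, counting calls between function clusters; the clustering function itself is a stand-in for the paper's scheme.

from collections import Counter

def fcg_vector(call_edges, cluster_of, k):
    # one pass over the edges: O(|E|)
    counts = Counter((cluster_of[caller], cluster_of[callee])
                     for caller, callee in call_edges)
    # fixed-length vector of cluster-to-cluster call counts (length k * k)
    return [counts[(i, j)] for i in range(k) for j in range(k)]

Because every binary maps to a vector of the same fixed length, the result can be concatenated with non-graph features and fed to any standard classifier, avoiding pairwise graph-similarity computation entirely.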

Journal ArticleDOI
TL;DR: The results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships, and querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
Abstract: Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
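The flavour of the comparison: a multi-hop biological relationship needs chained join statements in MySQL but a single pattern match in Neo4j's Cypher. The table, label, and property names below are illustrative, not the paper's actual schema.

# drugs targeting proteins two interaction hops away from TP53
mysql_query = """
SELECT d.name
FROM protein p
JOIN interacts i1 ON i1.src = p.id
JOIN interacts i2 ON i2.src = i1.dst
JOIN targets t    ON t.protein_id = i2.dst
JOIN drug d       ON d.id = t.drug_id
WHERE p.name = 'TP53';
"""
neo4j_query = """
MATCH (p:Protein {name: 'TP53'})-[:INTERACTS*2]->(:Protein)<-[:TARGETS]-(d:Drug)
RETURN d.name;
"""

Each additional hop adds another join (and join cost) in the relational form, while the graph form only lengthens the path pattern, which is consistent with the latency gap the authors report on complex queries.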

Journal ArticleDOI
01 Apr 2017
TL;DR: gMark as discussed by the authors is a domain- and query-language-independent graph instance and query workload generator that targets and controls the diversity of properties of both the generated instances and the generated workloads coupled to these instances.
Abstract: Massive graph data sets are pervasive in contemporary application domains. Hence, graph database systems are becoming increasingly important. In the experimental study of these systems, it is vital that the research community has shared solutions for the generation of database instances and query workloads having predictable and controllable properties. In this paper, we present the design and engineering principles of gMark, a domain- and query language-independent graph instance and query workload generator. A core contribution of gMark is its ability to target and control the diversity of properties of both the generated instances and the generated workloads coupled to these instances. Further novelties include support for regular path queries, a fundamental graph query paradigm, and schema-driven selectivity estimation of queries, a key feature in controlling workload chokepoints. We illustrate the flexibility and practical usability of gMark by showcasing the framework's capabilities in generating high quality graphs and workloads, and its ability to encode user-defined schemas across a variety of application domains.

BookDOI
31 Oct 2017
TL;DR: This handbook surveys technologies for storing, programming, querying, and analyzing big data, with chapters spanning storage and programming models, query engines and analysis platforms, semantic and linked data management, large-scale graph data management and mining, and emerging architectures and applications.
Abstract: Contents: Big Data Storage Models; Big Data Programming Models; Programming Platforms for Big Data Analysis; Big Data Analysis on Clouds; Data Organization and Curation in Big Data; Big Data Query Engines; Unbounded Data Processing; Semantic Data Integration; Linked Data Management; Non-native RDF Storage Engines; Exploratory Ad-hoc Analysis for Big Data; Pattern Matching over Linked Data Streams; Searching the Big Data: Practices and Experiences in Efficiently Querying Knowledge Bases; Management and Analysis of Big Graph Data; Similarity Search in Large-Scale Graph Databases; Big Graphs Querying, Mining, and Beyond; Link and Graph Mining in the Big Data Era; Granular Social Network Model and Applications; Big Data, IoT and Semantics; SCADA Systems in the Cloud; Quantitative Data Analysis in Finance; Emerging Cost Effective Big Data Architectures; Bringing High Performance Computing to Big Data; Cognitive Computing where Big Data is Driving; Privacy-Preserving Record Linkage for Big Data.

Proceedings ArticleDOI
01 Apr 2017
TL;DR: An efficient, selectivity-aware algorithm partitions the graphs of G into highly selective subgraphs, which are incorporated in a cost-effective, multi-layered indexing structure, ML-Index (Multi-Layered Index), for GED lower bound crosschecking and false-positive graph filtering with theoretical performance guarantees.
Abstract: We consider in this paper the similarity search problem that retrieves relevant graphs from a graph database under the well-known graph edit distance (GED) constraint. Formally, given a graph database G = {g1, g2, …, gn} and a query graph q, we aim to search the graphs gi ∈ G such that the graph edit distance between gi and q, GED(gi, q), is within a user-specified GED threshold. In spite of its theoretical significance and wide applicability, the GED-based similarity search problem is challenging in large graph databases due in particular to the large amount of GED computation incurred, which has proven to be NP-hard. In this paper, we propose a parameterized, partition-based GED lower bound that can be instantiated into a series of tight lower bounds towards synergistically pruning false-positive graphs from G before costly GED computation is performed. We design an efficient, selectivity-aware algorithm to partition the graphs of G into highly selective subgraphs. They are further incorporated in a cost-effective, multi-layered indexing structure, ML-Index (Multi-Layered Index), for GED lower bound crosschecking and false-positive graph filtering with theoretical performance guarantees. Experimental studies on real and synthetic graph databases validate the efficiency and effectiveness of ML-Index, which achieves up to an order of magnitude speedup over the state-of-the-art method for similarity search in graph databases.
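The pruning principle behind the partition-based lower bound is simple: if p disjoint partitions of the query have no exact match in a data graph g, at least p edit operations are required, so GED(g, q) >= p. A schematic sketch, with the containment test left as a stand-in for ML-Index's matching machinery:

def partition_lower_bound(g, query_partitions, contains):
    # each unmatched partition forces at least one edit somewhere in g
    return sum(1 for part in query_partitions if not contains(g, part))

def ged_candidates(db, query_partitions, tau, contains):
    # survive filtering only if the cheap lower bound stays within tau;
    # exact (NP-hard) GED computation runs solely on the survivors
    return [g for g in db
            if partition_lower_bound(g, query_partitions, contains) <= tau]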

Proceedings ArticleDOI
26 Apr 2017
TL;DR: This paper implements an interprocedural analysis technique for PHP applications based on code property graphs that scales well to large amounts of code and is highly adaptable in its nature, and identifies different types of Web application vulnerabilities by means of programmable graph traversals.
Abstract: The Web today is a growing universe of pages and applications teeming with interactive content. The security of such applications is of the utmost importance, as exploits can have a devastating impact on personal and economic levels. The number one programming language in Web applications is PHP, powering more than 80% of the top ten million websites. Yet it was not designed with security in mind and, today, bears a patchwork of fixes and inconsistently designed functions with often unexpected and hardly predictable behavior that typically yield a large attack surface. Consequently, it is prone to different types of vulnerabilities, such as SQL Injection or Cross-Site Scripting. In this paper, we present an interprocedural analysis technique for PHP applications based on code property graphs that scales well to large amounts of code and is highly adaptable in its nature. We implement our prototype using the latest features of PHP 7, leverage an efficient graph database to store code property graphs for PHP, and subsequently identify different types of Web application vulnerabilities by means of programmable graph traversals. We show the efficacy and the scalability of our approach by reporting on an analysis of 1,854 popular open-source projects, comprising almost 80 million lines of code.
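A toy illustration of a programmable traversal over a code property graph: follow data-flow edges from attacker-controlled sources toward sensitive sinks, stopping at sanitizers. Real traversals in the paper run against a graph database over PHP code; the dict-based graph and PHP names here are purely illustrative.

def find_flows(cpg, sources, sinks, sanitizers):
    # cpg: node -> list of data-flow successor nodes
    hits, stack = [], [(s, [s]) for s in sources]
    while stack:
        node, path = stack.pop()
        if node in sanitizers:
            continue                      # the flow is neutralized here
        if node in sinks:
            hits.append(path)             # source reaches sink unsanitized
            continue
        for nxt in cpg.get(node, []):
            if nxt not in path:           # avoid cycles
                stack.append((nxt, path + [nxt]))
    return hits

cpg = {"$_GET": ["concat"], "concat": ["mysql_query"]}
print(find_flows(cpg, {"$_GET"}, {"mysql_query"}, {"mysqli_real_escape_string"}))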

Proceedings ArticleDOI
01 Dec 2017
TL;DR: This paper proposes a model for persisting transactions from Ethereum into a graph database, Neo4j, and proposes leveraging graph compute or analytics against the transactions persisted into a graph database.
Abstract: Cryptocurrency platforms such as Bitcoin and Ethereum have become more popular due to decentralized control and the promise of anonymity. Ethereum is particularly powerful due to its support for smart contracts, which are implemented through Turing-complete scripting languages, and digital tokens that represent fungible tradable goods. Cryptocurrencies are increasingly being used in online black markets such as Silk Road and in ransomware such as CryptoLocker and WannaCry, so it is necessary to understand whether de-anonymization is feasible in order to quantify the promise of anonymity. In this paper, we propose a model for persisting transactions from Ethereum into a graph database, Neo4j, and propose leveraging graph compute and analytics against the transactions persisted in the graph database.
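A minimal sketch of such a persistence model with the official Neo4j Python driver; the node label, relationship type, and property keys are illustrative, and the transactions would be fetched beforehand via an Ethereum client.

from neo4j import GraphDatabase

transactions = [  # normally fetched via an Ethereum JSON-RPC client
    {"from": "0xa1", "to": "0xb2", "hash": "0xc3", "value": 10, "blockNumber": 42},
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def persist_tx(tx, t):
    tx.run(
        "MERGE (a:Account {address: $src}) "
        "MERGE (b:Account {address: $dst}) "
        "CREATE (a)-[:SENT {hash: $hash, value: $value, block: $block}]->(b)",
        src=t["from"], dst=t["to"], hash=t["hash"],
        value=t["value"], block=t["blockNumber"],
    )

with driver.session() as session:
    for t in transactions:
        session.execute_write(persist_tx, t)

With accounts as nodes and transfers as relationships, de-anonymization heuristics become graph queries, e.g. pattern matching on transfer chains or centrality analytics over the SENT edges.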

Proceedings ArticleDOI
14 May 2017
TL;DR: An unsupervised neural-network based NLP idea, Distributed Representation via Word Embedding, is applied to extract latent information from a relational table to enable a new class of SQL-based business intelligence queries called cognitive intelligence queries that use the generated vectors to analyze contextual semantic relationships between database tokens.
Abstract: We investigate opportunities for exploiting Artificial Intelligence (AI) techniques for enhancing capabilities of relational databases. In particular, we explore applications of Natural Language Processing (NLP) techniques to endow relational databases with capabilities that were very hard to realize in practice. We apply an unsupervised neural-network based NLP idea, Distributed Representation via Word Embedding, to extract latent information from a relational table. The word embedding model is based on meaningful textual view of a relational database and captures inter-/intra-attribute relationships between database tokens. For each database token, the model includes a vector that encodes these contextual semantic relationships. These vectors enable processing a new class of SQL-based business intelligence queries called cognitive intelligence (CI) queries that use the generated vectors to analyze contextual semantic relationships between database tokens. The cognitive capabilities enable complex queries such as semantic matching, reasoning queries such as analogies, predictive queries using entities not present in a database, and using knowledge from external sources.
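The pipeline the abstract describes - textify rows, train word embeddings, query by vector similarity - can be sketched with gensim on toy data; the paper's cognitive intelligence queries run inside SQL, not in Python.

from gensim.models import Word2Vec

# textified view of a tiny relational table; each row becomes one "sentence"
rows = [
    ["cust_1", "seattle", "espresso", "morning"],
    ["cust_2", "seattle", "latte", "morning"],
    ["cust_3", "boston", "tea", "evening"],
]
model = Word2Vec(sentences=rows, vector_size=16, window=4, min_count=1,
                 epochs=300, seed=7)
# a semantic-matching primitive: which tokens occur in contexts most
# similar to 'espresso'?
print(model.wv.most_similar("espresso", topn=3))

A CI query would wrap exactly this kind of cosine-similarity lookup in a SQL user-defined function so it can appear in WHERE and ORDER BY clauses.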

Journal ArticleDOI
TL;DR: A novel framework, termed as unsupervised single view feature extraction with structured graph (FESG), which learns both a transformation matrix and an ideal structured graph containing the clustering information is proposed.
Abstract: Many feature extraction methods reduce the dimensionality of data based on the input graph matrix. The graph construction, which reflects relationships among raw data points, is crucial to the quality of the resulting low-dimensional representations. To improve the quality of the graph and make it more suitable for feature extraction tasks, we incorporate a new graph learning mechanism into feature extraction and add an interaction between the learned graph and the low-dimensional representations. Based on this learning mechanism, we propose a novel framework, termed unsupervised single-view feature extraction with structured graph (FESG), which learns both a transformation matrix and an ideal structured graph containing the clustering information. Moreover, we propose a novel way to extend the FESG framework to multi-view learning tasks. The extension, named unsupervised multiple-view feature extraction with structured graph (MFESG), learns an optimal weight for each view automatically without requiring an additional parameter. To show the effectiveness of the framework, we design two concrete formulations within FESG and MFESG, together with two efficient solving algorithms. Promising experimental results on a wide range of real-world datasets validate the effectiveness of our proposed algorithms.

Book ChapterDOI
03 Apr 2017
TL;DR: In the era of big data, graph databases have become increasingly important for NoSQL technologies, and many systems can be modeled as graphs for semantic queries, and the ability to query over the encrypted graphs is retained.
Abstract: In the era of big data, graph databases have become increasingly important for NoSQL technologies, and many systems can be modeled as graphs for semantic queries. Meanwhile, with the advent of cloud computing, data owners are highly motivated to outsource and store their massive potentially-sensitive graph data on remote untrusted servers in an encrypted form, expecting to retain the ability to query over the encrypted graphs.

Book ChapterDOI
01 Jan 2017
TL;DR: This chapter surveys current system approaches for management and analysis of "big graph data", and outlines a recent research framework called Gradoop that is built on the so-called Extended Property Graph Data Model with dedicated support for analyzing not only single graphs but also collections of graphs.
Abstract: Many big data applications in business and science require the management and analysis of huge amounts of graph data. Suitable systems to manage and to analyze such graph data should meet a number of challenging requirements including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining capabilities, ease of use as well as high performance and scalability. In this chapter, we survey current system approaches for management and analysis of "big graph data". We discuss graph database systems, distributed graph processing systems such as Google Pregel and its variations, and graph dataflow approaches based on Apache Spark and Flink. We further outline a recent research framework called Gradoop that is built on the so-called Extended Property Graph Data Model with dedicated support for analyzing not only single graphs but also collections of graphs. Finally, we discuss current and future research challenges.

Proceedings ArticleDOI
09 May 2017
TL;DR: This paper presents the first practical solution for efficient LCR evaluation, leveraging landmark-based indexes for large graphs, and shows through extensive experiments that their indexes are significantly smaller than state-of-the-art LCR indexing techniques, while supporting up to orders of magnitude faster query evaluation times.
Abstract: Consider a directed edge-labeled graph, such as a social network or a citation network. A fundamental query on such data is to determine if there is a path in the graph from a given source vertex to a given target vertex, using only edges with labels in a restricted subset of the edge labels in the graph. Such label-constrained reachability (LCR) queries play an important role in graph analytics, for example, as a core fragment of the so-called regular path queries which are supported in practical graph query languages such as the W3C's SPARQL 1.1, Neo4j's Cypher, and Oracle's PGQL. Current solutions for LCR evaluation, however, do not scale to large graphs which are increasingly common in a broad range of application domains. In this paper we present the first practical solution for efficient LCR evaluation, leveraging landmark-based indexes for large graphs. We show through extensive experiments that our indexes are significantly smaller than state-of-the-art LCR indexing techniques, while supporting up to orders of magnitude faster query evaluation times. Our complete C++ codebase is available as open source for further research.
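A baseline (index-free) evaluation of an LCR query is just a BFS restricted to the allowed label set; the landmark indexes in the paper exist precisely to avoid paying this traversal cost at query time.

from collections import deque

def lcr(graph, src, dst, allowed):
    # graph: node -> list of (label, neighbour) pairs
    seen, q = {src}, deque([src])
    while q:
        v = q.popleft()
        if v == dst:
            return True
        for label, u in graph.get(v, []):
            if label in allowed and u not in seen:
                seen.add(u)
                q.append(u)
    return False

g = {1: [("knows", 2)], 2: [("likes", 3)], 3: []}
print(lcr(g, 1, 3, {"knows"}))           # False: 'likes' edges are not allowed
print(lcr(g, 1, 3, {"knows", "likes"}))  # True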

Proceedings ArticleDOI
TL;DR: The Subgraph Isomorphism Graph Challenge (SIGG) as discussed by the authors is a benchmark for graph analytic systems that can be used to measure and quantitatively compare a wide range of present-day and future systems.
Abstract: The rise of graph analytic systems has created a need for ways to measure and compare the capabilities of these systems. Graph analytics present unique scalability difficulties. The machine learning, high performance computing, and visual analytics communities have wrestled with these difficulties for decades and developed methodologies for creating challenges to move these communities forward. The proposed Subgraph Isomorphism Graph Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a graph challenge that is reflective of many real-world graph analytics processing systems. The Subgraph Isomorphism Graph Challenge is a holistic specification with multiple integrated kernels that can be run together or independently. Each kernel is well defined mathematically and can be implemented in any programming environment. Subgraph isomorphism is amenable to both vertex-centric implementations and array-based implementations (e.g., using the this http URL standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The surrounding kernels provide the context for each kernel that allows rigorous definition of both the input and the output for each kernel. Furthermore, since the proposed graph challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present-day and future systems. Serial implementations in C++, Python, Python with Pandas, Matlab, Octave, and Julia have been developed and their single-threaded performance has been measured. Specifications, data, and software are publicly available at this http URL.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: The fundamental points of graph databases are analyzed, showing their main characteristics and advantages; Neo4j, the top graph database software on the market, is studied and its performance evaluated using the Social Network Benchmark (SNB).
Abstract: The volume of data is growing at an increasing rate. This growth is both in size and in connectivity, where connectivity refers to the increasing presence of relationships between data. Social networks such as Facebook and Twitter store and process petabytes of data each day. Graph databases have gained renewed interest in the last years, due to their applications in areas such as the Semantic Web and Social Network Analysis. Graph databases provide an effective and efficient solution to data storage and querying in these scenarios, where data is rich in relationships. In this paper, we analyze the fundamental points of graph databases, showing their main characteristics and advantages. We study Neo4j, the top graph database software on the market, and evaluate its performance using the Social Network Benchmark (SNB).

Journal ArticleDOI
TL;DR: This work presents a novel uncertain network visualization technique based on node-link diagrams that reveals general limitations of the force-directed layout and allows the user to recognize that some nodes of the graph are at a specific position just by chance.
Abstract: We present a novel uncertain network visualization technique based on node-link diagrams. Nodes expand spatially in our probabilistic graph layout, depending on the underlying probability distributions of edges. The visualization is created by computing a two-dimensional graph embedding that combines samples from the probabilistic graph. A Monte Carlo process is used to decompose a probabilistic graph into its possible instances and to continue with our graph layout technique. Splatting and edge bundling are used to visualize point clouds and network topology. The results provide insights into probability distributions for the entire network, not only for individual nodes and edges. We validate our approach using three data sets that represent a wide range of network types: synthetic data, protein-protein interactions from the STRING database, and travel times extracted from Google Maps. Our approach reveals general limitations of the force-directed layout and allows the user to recognize that some nodes of the graph are at a specific position just by chance.
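The Monte Carlo step can be sketched as follows: sample concrete instances from per-edge probabilities, lay each instance out, and measure how much each node's position varies across samples. networkx's spring layout stands in for the paper's embedding, and splatting/edge bundling are omitted.

import networkx as nx
import numpy as np

def position_spread(nodes, prob_edges, samples=50):
    rng = np.random.default_rng(1)
    pos = {v: [] for v in nodes}
    for s in range(samples):
        G = nx.Graph()
        G.add_nodes_from(nodes)
        for u, v, p in prob_edges:
            if rng.random() < p:          # realize the edge in this instance
                G.add_edge(u, v)
        for v, xy in nx.spring_layout(G, seed=s).items():
            pos[v].append(xy)
    # high spread = the node's placement is largely due to chance
    return {v: float(np.var(np.array(xys), axis=0).sum())
            for v, xys in pos.items()}

print(position_spread([0, 1, 2, 3], [(0, 1, 0.9), (1, 2, 0.5), (2, 3, 0.1)]))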

Journal ArticleDOI
TL;DR: This work defines the graph query language of Regular Queries, which is a natural extension of unions of conjunctive 2-way regular path queries, and formalizes regular queries as nonrecursive Datalog programs extended with the transitive closure of binary predicates.
Abstract: Graph databases are currently one of the most popular paradigms for storing data. One of the key conceptual differences between graph and relational databases is the focus on navigational queries that ask whether some nodes are connected by paths satisfying certain restrictions. This focus has driven the definition of several different query languages and the subsequent study of their fundamental properties. We define the graph query language of Regular Queries, which is a natural extension of unions of conjunctive 2-way regular path queries (UC2RPQs) and unions of conjunctive nested 2-way regular path queries (UCN2RPQs). Regular queries allow expressing complex regular patterns between nodes. We formalize regular queries as nonrecursive Datalog programs extended with the transitive closure of binary predicates. This language has been previously considered, but its algorithmic properties are not well understood. Our main contribution is to show elementary tight bounds for the containment problem for regular queries. Specifically, we show that this problem is 2Expspace-complete. For all extensions of regular queries known to date, the containment problem turns out to be non-elementary. Together with the fact that evaluating regular queries is not harder than evaluating UCN2RPQs, our results show that regular queries achieve a good balance between expressiveness and complexity, and constitute a well-behaved class that deserves further investigation.
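The distinguishing feature - transitive closure of a binary predicate on top of nonrecursive Datalog - amounts to the textbook fixpoint computation below; this sketches the semantics only, not any containment-checking procedure.

def transitive_closure(edges):
    # semi-naive evaluation of: tc(x, y) :- edge(x, y).
    #                           tc(x, y) :- tc(x, z), edge(z, y).
    closure, delta = set(edges), set(edges)
    while delta:                          # iterate until no new facts appear
        new = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        delta = new - closure
        closure |= delta
    return closure

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))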

Proceedings ArticleDOI
09 May 2017
TL;DR: This work revisits SQL recursive queries and shows that the 4 proposed operations, combined with others, are guaranteed to reach a fixpoint, following techniques studied in Datalog, and enhances the recursive WITH clause of SQL'99.
Abstract: To support analytics on massive graphs such as online social networks, RDF, and the Semantic Web, many new graph algorithms are designed to query graphs for a specific problem, and many distributed graph processing systems are developed to support graph querying by programming. In this paper, we focus on RDBMSs, which have been well studied over decades to manage large datasets, and we revisit the issue of how an RDBMS can support graph processing at the SQL level. Our work is motivated by the fact that many relations stored in an RDBMS are closely related to a graph in real applications and need to be used together to query the graph, and that an RDBMS can query and manage data even as the data is updated over time. To support graph processing, in this work, we propose 4 new relational algebra operations: MM-join, MV-join, anti-join, and union-by-update. Here, MM-join and MV-join are join operations between two matrices and between a matrix and a vector, respectively, followed by aggregation over groups, given that a matrix/vector can be represented by a relation. Both deal with the semiring by which many graph algorithms can be supported. The anti-join removes nodes/edges of a graph when they are unnecessary for subsequent computation. The union-by-update handles value updates, for example to compute PageRank. The 4 new relational algebra operations can be defined by the 6 basic relational algebra operations together with group-by & aggregation. We revisit SQL recursive queries and show that the 4 operations, combined with the others, are guaranteed to reach a fixpoint, following the techniques studied in Datalog, and we enhance the recursive WITH clause of SQL'99. We conduct extensive performance studies testing 10 graph algorithms on 9 large real graphs in 3 major RDBMSs, and show that RDBMSs are capable of handling graph processing in reasonable time. The focus of this work is at the SQL level; there is high potential to improve efficiency via main-memory RDBMSs, efficient parallel join processing, and new storage management.
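The recursive WITH clause the paper builds on is runnable against stock SQL engines; here is the reachability flavour of it in SQLite, where UNION's set semantics guarantee a fixpoint even on cyclic graphs.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE edge(src INTEGER, dst INTEGER);
INSERT INTO edge VALUES (1,2),(2,3),(3,4),(4,2);
""")
rows = con.execute("""
WITH RECURSIVE reach(v) AS (
    SELECT 2
    UNION                -- duplicate elimination ensures termination
    SELECT e.dst FROM edge e JOIN reach r ON e.src = r.v
)
SELECT v FROM reach;
""").fetchall()
print(rows)   # nodes reachable from 2, despite the 4 -> 2 cycle

The paper's proposed operations (e.g., union-by-update) extend precisely this style of iteration so that per-node values such as PageRank scores can be updated across iterations rather than merely accumulated.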

Journal ArticleDOI
TL;DR: A new design pattern detection method based on graph theory, which achieves high efficiency and accuracy in detecting design pattern instances in source code by analyzing the behavioral signature of patterns.
Abstract: Design patterns are strategies for solving commonly occurring problems within a given context in software design. In the process of re-engineering, detection of design pattern instances in source code can play a major role in understanding large and complex software systems. However, detecting design pattern instances is not always a straightforward task. In this paper, a new design pattern detection method based on graph theory is presented. The proposed detection process is subdivided into two sequential phases. In the first phase, we consider both the semantics and the syntax of the structural signature of patterns. To do so, the system under study and the patterns to be detected are transformed into semantic graphs, converting the initial problem into that of finding matches for the pattern graph in the system graph. To reduce the exploration space, the system graph is broken into possible subsystem graphs based on a predetermined set of criteria. After a semantic matching algorithm yields candidate instances, the second phase obtains the final matches by analyzing the behavioral signature of the patterns. The performance of the suggested technique is evaluated on three open-source systems in terms of precision and recall. The results demonstrate the high efficiency and accuracy of the proposed method.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper proposes an unsupervised text coherence scoring based on graph construction in which edges are established between semantically similar sentences represented by vertices, and provides three graph construction methods establishing an edge from a given vertex to a preceding adjacent vertex, to a single similar vertex, or to multiple similar vertices.
Abstract: Coherence is a crucial feature of text because it is indispensable for conveying a text's communicative purpose and meaning to its readers. In this paper, we propose an unsupervised text coherence scoring method based on graph construction, in which edges are established between semantically similar sentences represented by vertices. The sentence similarity is calculated based on the cosine similarity of the semantic vectors representing the sentences. We provide three graph construction methods, establishing an edge from a given vertex to a preceding adjacent vertex, to a single similar vertex, or to multiple similar vertices. We evaluated our methods in the document discrimination task and the insertion task by comparing our proposed methods to the supervised (Entity Grid) and unsupervised (Entity Graph) baselines. In the document discrimination task, our method outperformed the unsupervised baseline but not the supervised baseline, while in the insertion task, our method outperformed both baselines.
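The graph construction can be sketched directly: connect each sentence to the preceding sentence whose embedding is most cosine-similar (the 'single similar vertex' variant). The plain vectors below are stand-ins for the paper's semantic sentence vectors.

import numpy as np

def coherence_edges(sent_vecs):
    V = np.asarray(sent_vecs, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    edges = []
    for i in range(1, len(V)):
        sims = V[:i] @ V[i]               # cosine similarity to preceding sentences
        j = int(np.argmax(sims))
        edges.append((i, j, float(sims[j])))
    return edges                          # averaging the weights yields a score

print(coherence_edges([[1, 0], [0.9, 0.1], [0, 1], [0.8, 0.2]]))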

Proceedings ArticleDOI
09 May 2017
TL;DR: On a single server with 244GB memory, ZipG executes tens of thousands of queries from these workloads for raw graph data over half a TB, which leads to an order of magnitude (sometimes as much as 23×) higher throughput than Neo4j and Titan.
Abstract: We present ZipG, a distributed memory-efficient graph store for serving interactive graph queries. ZipG achieves memory efficiency by storing the input graph data using a compressed representation. What differentiates ZipG from other graph stores is its ability to execute a wide range of graph queries directly on this compressed representation. ZipG can thus execute a larger fraction of queries in main memory, achieving query interactivity. ZipG exposes a minimal API that is functionally rich enough to implement published functionalities from several industrial graph stores. We demonstrate this by implementing and evaluating graph queries from Facebook TAO, LinkBench, Graph Search and several other workloads on top of ZipG. On a single server with 244GB memory, ZipG executes tens of thousands of queries from these workloads for raw graph data over half a TB; this leads to an order of magnitude (sometimes as much as 23×) higher throughput than Neo4j and Titan. We get similar gains in distributed settings compared to Titan.