Showing papers on "Graph database published in 2020"


Journal ArticleDOI
TL;DR: A method for systematically digitally encoding papers’ core knowledge contributions in the form of a graph is proposed and the creativity literature is reviewed as a source of inspiration for crafting theoretical contributions.
Abstract: This roadmap identifies two developments for improving the process of literature reviewing. First, a method for systematically digitally encoding papers’ core knowledge contributions in the form of...

121 citations


Posted ContentDOI
17 Jan 2020-bioRxiv
TL;DR: neuPrint, as described in this paper, is a database and analysis ecosystem that organizes connectome data in a manner conducive to biological discovery. It allows users to access the connectome at different levels of abstraction, primarily through a graph database, Neo4j, and its powerfully expressive query language Cypher.
Abstract: Due to technological advances in electron microscopy (EM) and deep learning, it is now practical to reconstruct a connectome, a description of neurons and the connections between them, for significant volumes of neural tissue. The limited scope of past reconstructions meant they were primarily used by domain experts, and performance was not a serious problem. But the new reconstructions, of common laboratory creatures such as the fruit fly Drosophila melanogaster, upend these assumptions. These natural neural networks now contain tens of thousands of neurons and tens of millions of connections between them, with yet larger reconstructions pending, and are of interest to a large community of non-specialists. This requires new tools that are easy to use and efficiently handle large data. We introduce neuPrint to address these data analysis challenges. neuPrint is a database and analysis ecosystem that organizes connectome data in a manner conducive to biological discovery. In particular, we propose a data model that allows users to access the connectome at different levels of abstraction primarily through a graph database, neo4j, and its powerfully expressive query language Cypher. neuPrint is compatible with modern connectome reconstruction workflows, providing tools for assessing reconstruction quality, and offering both batch and incremental updates to match modern connectome reconstruction flows. Finally, we introduce a web interface and programmer API that targets a diverse user skill set. We demonstrate the effectiveness and efficiency of neuPrint through example database queries.

72 citations
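
The abstract above highlights access through Neo4j and Cypher. Below is a minimal, hedged sketch of what such access could look like from Python using the official neo4j driver; the :Neuron label, :ConnectsTo relationship, the bodyId/type/weight properties, and the neuron-type values are assumptions for illustration, not a statement of the exact neuPrint schema.

```python
# Minimal sketch: querying a connectome-style Neo4j database with Cypher.
# Label, relationship, and property names are assumptions, not the exact neuPrint model.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (a:Neuron {type: $pre})-[c:ConnectsTo]->(b:Neuron {type: $post})
RETURN a.bodyId AS pre, b.bodyId AS post, c.weight AS weight
ORDER BY weight DESC LIMIT 10
"""

with driver.session() as session:
    # Hypothetical neuron-type values used purely as example parameters.
    for record in session.run(CYPHER, pre="KCab", post="MBON01"):
        print(record["pre"], record["post"], record["weight"])

driver.close()
```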


Journal ArticleDOI
TL;DR: NG-Tax 2.0 is a semantic framework for FAIR high-throughput analysis and classification of marker gene amplicon sequences including bacterial and archaeal 16S ribosomal RNA (rRNA), eukaryotic 18S rRNA and Ribosomal intergenic transcribed spacer sequences.
Abstract: NG-Tax 2.0 is a semantic framework for FAIR high-throughput analysis and classification of marker gene amplicon sequences, including bacterial and archaeal 16S ribosomal RNA (rRNA), eukaryotic 18S rRNA and ribosomal intergenic transcribed spacer sequences. It can directly use single or merged reads, paired-end reads and unmerged paired-end reads from long-range fragments as input to generate de novo amplicon sequence variants (ASVs). Using the RDF data model, ASVs can be automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance, thereby achieving the level of interoperability required to utilize such data to its full potential. The graph database can be directly queried, allowing for comparative analyses across thousands of samples, and is connected with an interactive R Shiny toolbox for analysis and visualization of (meta)data. Additionally, NG-Tax 2.0 exports an extended BIOM 1.0 (JSON) file as a starting point for further analyses by other means. The extended BIOM file contains new attribute types to include information about the command arguments used, the sequences of the ASVs formed, and classification confidence scores, and is backwards compatible. The performance of NG-Tax 2.0 was compared with DADA2, using the plugin in the QIIME 2 analysis pipeline. Fourteen 16S rRNA gene amplicon mock community samples were obtained from the literature and evaluated. Precision of NG-Tax 2.0 was significantly higher, with an average of 0.95 vs 0.58 for QIIME2-DADA2, while recall was comparable, with averages of 0.85 and 0.77, respectively. NG-Tax 2.0 is written in Java. The code, the ontology, a Galaxy platform implementation, the analysis toolbox, tutorials and example SPARQL queries are freely available at http://wurssb.gitlab.io/ngtax under the MIT License.

71 citations


Proceedings ArticleDOI
TL;DR: This work proposes Asynchronous Propagation Attention Network, an asynchronous continuous-time dynamic graph algorithm for real-time temporal graph embedding that decouples model inference from graph computation so that heavy graph query operations do not slow down model inference.
Abstract: Limited by the time complexity of querying k-hop neighbors in a graph database, most graph algorithms cannot be deployed online and execute millisecond-level inference. This problem dramatically limits the potential of applying graph algorithms in certain areas, such as financial fraud detection. Therefore, we propose Asynchronous Propagation Attention Network, an asynchronous continuous-time dynamic graph algorithm for real-time temporal graph embedding. Traditional graph models usually execute two serial operations: first graph computation and then model inference. We decouple the model inference and graph computation steps so that heavy graph query operations do not slow down model inference. Extensive experiments demonstrate that the proposed method achieves competitive performance while delivering an 8.7× improvement in inference speed.

59 citations


Journal ArticleDOI
Yunsheng Bai, Hao Ding, Ken Gu, Yizhou Sun, Wei Wang
03 Apr 2020
TL;DR: The model, Graph-Sim, achieves state-of-the-art performance on four real-world graph datasets under six out of eight settings, compared to existing popular methods for approximate Graph Edit Distance (GED) and Maximum Common Subgraph (MCS) computation.
Abstract: Graph similarity computation is one of the core operations in many graph-based applications, such as graph similarity search, graph database analysis, graph clustering, etc. Since computing the exact distance/similarity between two graphs is typically NP-hard, a series of approximate methods have been proposed with a trade-off between accuracy and speed. Recently, several data-driven approaches based on neural networks have been proposed, most of which model the graph-graph similarity as the inner product of their graph-level representations, with different techniques proposed for generating one embedding per graph. However, using one fixed-dimensional embedding per graph may fail to fully capture graphs of varying sizes and link structures—a limitation that is especially problematic for the task of graph similarity computation, where the goal is to find the fine-grained difference between two graphs. In this paper, we address the problem of graph similarity computation from another perspective, by directly matching two sets of node embeddings without the need to use fixed-dimensional vectors to represent whole graphs for their similarity computation. The model, Graph-Sim, achieves state-of-the-art performance on four real-world graph datasets under six out of eight settings (here we count a specific dataset and metric combination as one setting), compared to existing popular methods for approximate Graph Edit Distance (GED) and Maximum Common Subgraph (MCS) computation.

55 citations
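
The abstract argues for comparing two sets of node embeddings directly instead of collapsing each graph into one fixed-dimensional vector. The sketch below only illustrates that idea with a pairwise node-similarity matrix and a fixed pooling; the actual model learns on such matrices, so nothing here should be read as the paper's method.

```python
import numpy as np

def node_set_similarity(H1: np.ndarray, H2: np.ndarray) -> float:
    """Crude graph-graph score from two node-embedding sets (n1 x d and n2 x d).

    Builds the pairwise cosine-similarity matrix between node embeddings and
    pools it with a fixed rule; the published model learns on such matrices
    instead, so this only illustrates the set-matching idea.
    """
    H1 = H1 / np.linalg.norm(H1, axis=1, keepdims=True)
    H2 = H2 / np.linalg.norm(H2, axis=1, keepdims=True)
    S = H1 @ H2.T                        # (n1, n2) node-to-node similarities
    return float(S.max(axis=1).mean())   # average best-match similarity

# Example with random embeddings for graphs of different sizes
g1, g2 = np.random.rand(5, 16), np.random.rand(8, 16)
print(node_set_similarity(g1, g2))
```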


Journal ArticleDOI
TL;DR: GraphOne is a graph data store that abstracts the data store away from specialized systems in order to solve the fundamental research problems associated with data store design, and it presents a new data abstraction, GraphView, to enable data access at two different granularities of data ingestion.
Abstract: There is a growing need to perform a diverse set of real-time analytics (batch and stream analytics) on evolving graphs to deliver the value of big data to users. The key requirement from such applications is a data store that supports their diverse data access efficiently, while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, either graph databases or analytics engines, are not designed to achieve high performance for both operations; rather, each excels in one area by keeping a private data store in a specialized way that favors its own operations only. To address this challenge, we have designed and developed GraphOne, a graph data store that abstracts the data store away from the specialized systems in order to solve the fundamental research problems associated with data store design. It combines two complementary graph storage formats (edge list and adjacency list) and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities of data ingestion (called data visibility) for concurrent execution of diverse classes of real-time graph analytics with only a small amount of data duplication. Experimental results show that GraphOne delivers 11.40× and 5.36× average speedups in ingestion rate over LLAMA and Stinger, the two state-of-the-art dynamic graph systems, respectively. Further, it achieves average speedups of 8.75× and 4.14× over LLAMA and of 12.80× and 3.18× over Stinger for BFS and PageRank analytics (batch versions), respectively. GraphOne also gains over a 2,000× speedup over Kickstarter, a state-of-the-art stream analytics engine, in ingesting streaming edges and performing streaming BFS when treating the first half of a synthetic graph as a base snapshot and the rest as streaming edges. GraphOne also achieves an ingestion rate two to three orders of magnitude higher than graph databases. Finally, we demonstrate that it is possible to run concurrent stream analytics from the same data store.

52 citations
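
As a rough, conceptual illustration of the two complementary storage formats mentioned above (an append-only edge list for fast ingestion plus an adjacency-list snapshot for analytics), here is a toy Python sketch; it is a drastic simplification and not GraphOne's actual design or code.

```python
from collections import defaultdict

class TinyDualStore:
    """Toy sketch of an edge-list log plus an adjacency-list snapshot.

    Updates are appended to `log`; `archive()` folds them into the adjacency
    snapshot. Readers combine the snapshot with the still-unarchived tail,
    which is the intuition behind letting ingestion and analytics proceed
    concurrently in the real design.
    """
    def __init__(self):
        self.log = []                  # append-only edge list (non-archived updates)
        self.adj = defaultdict(list)   # archived adjacency-list snapshot

    def add_edge(self, u, v):
        self.log.append((u, v))        # fine-grained, high-velocity ingestion

    def archive(self):
        for u, v in self.log:
            self.adj[u].append(v)
        self.log.clear()

    def neighbors(self, u):
        tail = [v for (x, v) in self.log if x == u]
        return self.adj[u] + tail      # snapshot plus non-archived tail

store = TinyDualStore()
store.add_edge(1, 2); store.add_edge(1, 3)
store.archive()
store.add_edge(1, 4)
print(store.neighbors(1))  # [2, 3, 4]
```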


Proceedings ArticleDOI
Xiaofeng Guo, Xin Peng, Hanzhang Wang, Wanxue Li, Huai Jiang, Dan Ding, Tao Xie, Liangfei Su
08 Nov 2020
TL;DR: A graph-based microservice trace analysis approach, GMTA, is proposed for understanding architecture and diagnosing various problems in industrial-scale microservice systems, and its effectiveness and efficiency for architecture understanding and problem diagnosis are demonstrated.
Abstract: Microservice systems are highly dynamic and complex. For such systems, operation engineers and developers highly rely on trace analysis to understand architectures and diagnose various problems such as service failures and quality degradation. However, the huge number of traces produced at runtime makes it challenging to capture the required information in real-time. To address these challenges, in this paper, we propose a graph-based microservice trace analysis approach, GMTA, for understanding architecture and diagnosing various problems. Built on a graph-based representation, GMTA includes efficient processing of traces produced on the fly. It abstracts traces into different paths and further groups them into business flows. To support various analytical applications, GMTA includes an efficient storage and access mechanism by combining a graph database and a real-time analytics database and using a carefully designed storage structure. Based on GMTA, we construct analytical applications for architecture understanding and problem diagnosis; these applications support various needs such as visualizing service dependencies, making architectural decisions, analyzing changes in service behavior, detecting performance issues, and locating root causes. GMTA has been implemented and deployed in eBay. An experimental study based on trace data produced by eBay demonstrates GMTA's effectiveness and efficiency for architecture understanding and problem diagnosis. Case studies conducted in eBay's monitoring team and Site Reliability Engineering (SRE) team further confirm GMTA's substantial benefits in industrial-scale microservice systems.

49 citations


Proceedings ArticleDOI
20 Apr 2020
TL;DR: GraphGen as mentioned in this paper converts graphs to sequences using minimum DFS codes, which capture the graph structure precisely along with the label information, and learns complex joint distributions between structure and semantic labels through a novel LSTM architecture.
Abstract: Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at: https://github.com/idea-iitd/graphgen.

46 citations
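
The abstract describes converting graphs to sequences of minimum DFS codes. The sketch below only shows the edge-tuple sequence format for one arbitrary DFS over a labeled graph; computing the canonical minimum DFS code, which is what GraphGen actually uses, requires searching over DFS orderings and is omitted here.

```python
import networkx as nx

def dfs_code(G: nx.Graph, start):
    """Emit one DFS code: tuples (t_u, t_v, label_u, label_edge, label_v).

    Follows a single arbitrary DFS order; the *minimum* DFS code is a canonical
    choice over all DFS orders, which this simplified sketch does not compute.
    """
    times, code, stack = {start: 0}, [], [start]
    seen_edges = set()
    while stack:
        u = stack[-1]
        nxt = next((v for v in G.neighbors(u)
                    if (min(u, v), max(u, v)) not in seen_edges), None)
        if nxt is None:
            stack.pop()
            continue
        seen_edges.add((min(u, nxt), max(u, nxt)))
        if nxt not in times:             # forward edge: assign a new DFS timestamp
            times[nxt] = len(times)
            stack.append(nxt)
        code.append((times[u], times[nxt],
                     G.nodes[u]["label"], G[u][nxt]["label"], G.nodes[nxt]["label"]))
    return code

G = nx.Graph()
G.add_node(0, label="C"); G.add_node(1, label="O"); G.add_node(2, label="C")
G.add_edge(0, 1, label="double"); G.add_edge(0, 2, label="single")
print(dfs_code(G, 0))  # e.g. [(0, 1, 'C', 'double', 'O'), (0, 2, 'C', 'single', 'C')]
```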


Journal ArticleDOI
TL;DR: The main purpose of this paper is to systematically review and study the existing data management approaches in the IoT and to classify them into three main classes: SQL databases, NoSQL databases, and graph databases.
Abstract: The Internet of Things (IoT) is built on the idea of receiving data from, and sending data to, each object within the communication network. One of the main issues in this type of network is handling a growing volume of data with various data sources and data types while satisfying the performance requirements of applications. In this regard, data management in the IoT plays an important role in its efficient operation and has become a major research topic. Although data management has a crucial role in the IoT, there is no comprehensive and systematic work analyzing its approaches. Thus, the main purpose of this paper is to systematically review and study the existing data management approaches in the IoT. The data management approaches are classified into three main classes: SQL databases, NoSQL databases, and graph databases. In addition, a detailed comparison of the important mechanisms in each category yields recommendations for further work.

42 citations


Journal ArticleDOI
TL;DR: A novel facial emotion recognition approach based on graph mining is proposed to shift the paradigm for representing the face region; the proposed system shows significant accuracy improvements over currently published work on the SAVEE database.

38 citations


Journal ArticleDOI
01 Jan 2020
TL;DR: This paper provides a systematic overview of the most important heuristics for the graph edit distance and empirically evaluates all compared heuristics within an integrated implementation.
Abstract: Because of its flexibility, intuitiveness, and expressivity, the graph edit distance (GED) is one of the most widely used distance measures for labeled graphs. Since exactly computing GED is NP-hard, over the past years, various heuristics have been proposed. They use techniques such as transformations to the linear sum assignment problem with error correction, local search, and linear programming to approximate GED via upper or lower bounds. In this paper, we provide a systematic overview of the most important heuristics. Moreover, we empirically evaluate all compared heuristics within an integrated implementation.
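
One family of heuristics named above reduces GED to a linear sum assignment problem over node substitutions, deletions, and insertions. The following hedged sketch builds the classic cost matrix for unlabeled graphs with unit costs, solves it with SciPy's Hungarian solver, and reports the edit cost induced by the resulting node mapping, which is an upper bound on GED; it illustrates the general idea, not any specific surveyed method.

```python
import numpy as np
import networkx as nx
from scipy.optimize import linear_sum_assignment

def ged_upper_bound(G1: nx.Graph, G2: nx.Graph) -> int:
    """GED upper bound via an LSAP over node assignments (unit costs, unlabeled).

    Builds the (n1+n2) x (n1+n2) matrix of node substitution, deletion, and
    insertion costs, solves it with the Hungarian algorithm, and returns the
    edit cost *induced* by that node mapping (node edits plus the edge edits
    the mapping implies). Refining the costs with local structure, as the
    surveyed heuristics do, tightens the bound.
    """
    n1, n2 = G1.number_of_nodes(), G2.number_of_nodes()
    nodes1, nodes2 = list(G1.nodes), list(G2.nodes)
    BIG = 10**6
    C = np.zeros((n1 + n2, n1 + n2))
    C[:n1, n2:] = BIG; np.fill_diagonal(C[:n1, n2:], 1)  # deletion of a G1 node
    C[n1:, :n2] = BIG; np.fill_diagonal(C[n1:, :n2], 1)  # insertion of a G2 node
    rows, cols = linear_sum_assignment(C)

    mapping = {nodes1[r]: nodes2[c] for r, c in zip(rows, cols) if r < n1 and c < n2}
    cost = (n1 - len(mapping)) + (n2 - len(mapping))      # node deletions + insertions
    mapped_edges = {(mapping[u], mapping[v]) for u, v in G1.edges
                    if u in mapping and v in mapping}
    for u, v in G1.edges:                                 # edge deletions/relabels
        if u not in mapping or v not in mapping or not G2.has_edge(mapping[u], mapping[v]):
            cost += 1
    for a, b in G2.edges:                                 # edge insertions
        if (a, b) not in mapped_edges and (b, a) not in mapped_edges:
            cost += 1
    return cost

G1, G2 = nx.path_graph(3), nx.cycle_graph(3)
print(ged_upper_bound(G1, G2))  # prints 1; the exact GED here is also 1 (add one edge)
```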

Journal ArticleDOI
TL;DR: ChronoGraph aims at bridging the chasm between point-based and period-based semantics and the gap between temporal and static graph traversals; its graph model and traversal language provide the temporal syntax for both semantics.
Abstract: ChronoGraph is a novel system enabling temporal graph traversals. Compared to snapshot-oriented systems, this traversal-oriented system is suitable for analyzing information diffusion over time without violating a time constraint on temporal paths. The cornerstone of ChronoGraph aims at bridging the chasm between point-based semantics and period-based semantics and the gap between temporal graph traversals and static graph traversals. Therefore, our graph model and traversal language provide the temporal syntax for both semantics, and we present a method converting point-based semantics to period-based ones. Also, ChronoGraph exploits the temporal support and parallelism to handle the temporal degree, which explosively increases compared to static graphs. We demonstrate how three traversal recipes can be implemented on top of our system: temporal breadth-first search (tBFS), temporal depth-first search (tDFS), and temporal single source shortest path (tSSSP). According to our evaluation, our temporal support and parallelism enhance temporal graph traversals in terms of convenience and efficiency. Also, ChronoGraph outperforms existing property graph databases in terms of temporal graph traversals. We prototype ChronoGraph by extending Tinkerpop, a de facto standard for property graphs. Therefore, we expect that our system would be readily accessible to existing property graph users.
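
To make the time-respecting path constraint mentioned above concrete, here is a small, hedged Python sketch of an earliest-arrival temporal traversal in the spirit of tBFS; the adjacency layout of (neighbor, timestamp) pairs is an assumption for illustration, not ChronoGraph's data model.

```python
from collections import deque

def temporal_bfs(adj, source, t_start=0):
    """Earliest-arrival traversal: only follow edges whose timestamp is >= the
    time at which the current vertex was reached, so every traversed path is
    time-respecting. `adj` maps a vertex to a list of (neighbor, timestamp) pairs.
    """
    earliest = {source: t_start}        # earliest arrival time per vertex
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, t in adj.get(u, []):
            if t >= earliest[u] and t < earliest.get(v, float("inf")):
                earliest[v] = t         # reached v earlier than before
                queue.append(v)
    return earliest

adj = {
    "a": [("b", 1), ("c", 5)],
    "b": [("c", 2), ("d", 4)],
    "c": [("d", 3)],  # a->c arrives at t=5, too late for this edge; d is reached via b, c
}
print(temporal_bfs(adj, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```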

Journal ArticleDOI
TL;DR: This paper proposes and builds a manufacturing equipment information query system based on a Knowledge Graph, which uses a shortest-path optimization algorithm to calculate the similarity between nodes during search in order to recommend similar node information.
Abstract: With the development of a new generation of information technology, such as big data and cognitive intelligence, we are in the postmodern era of artificial intelligence. Currently, the manufacturing industry is in a critical period of transitioning to smart manufacturing, but the cognitive capabilities of devices in smart factories are still scarce. The Knowledge Graph (KG) is one of the key technologies of cognitive intelligence, which opens a new path for the horizontal integration of intelligent manufacturing. Therefore, this paper proposes and builds a manufacturing equipment information query system based on a KG. Firstly, a large amount of heterogeneous data containing extensive device information is obtained from the network. Secondly, the conditional random fields (CRF) algorithm is used to extract the entity name, production place, and company name of each device, and the relationships between device entities are then identified by calculating similarity and performing Chinese syntactic analysis. In the validation section, we use the Neo4j graph database for visualization: when the name of a device is entered in the search box, the system returns a graph of related nodes. In addition, a shortest-path optimization algorithm is used to calculate the similarity between nodes during search in order to recommend similar node information.
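
As a hedged illustration of the kind of Neo4j lookup and shortest-path query such a system might run, here is a small Python example using the neo4j driver; the :Device label, the name property, the relationship wildcard, and the device names are hypothetical placeholders, not the schema described in the paper.

```python
# Hedged sketch: device lookup plus a shortest path between two devices in Neo4j.
# The :Device label and all property/relationship names are hypothetical placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOOKUP = """
MATCH (d:Device {name: $name})-[r]-(n)
RETURN type(r) AS relation, n.name AS neighbor
"""

SHORTEST = """
MATCH (a:Device {name: $a}), (b:Device {name: $b}),
      p = shortestPath((a)-[*..6]-(b))
RETURN [x IN nodes(p) | x.name] AS path
"""

with driver.session() as session:
    for rec in session.run(LOOKUP, name="CNC-01"):
        print(rec["relation"], rec["neighbor"])
    rec = session.run(SHORTEST, a="CNC-01", b="Lathe-07").single()
    print(rec["path"] if rec else "no path")

driver.close()
```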

Journal ArticleDOI
TL;DR: This paper presents three direct mappings (schema-dependent and schema-independent) for transforming an RDF database into a property graph database, including data and schema, and shows that two of the proposed mappings satisfy the properties of semantics preservation and information preservation.
Abstract: RDF triplestores and property graph databases are two approaches for data management which are based on modeling, storing and querying graph-like data. In spite of such common principle, they present special features that complicate the task of database interoperability. While there exist some methods to transform RDF graphs into property graphs, and vice versa, they lack compatibility and a solid formal foundation. This paper presents three direct mappings (schema-dependent and schema-independent) for transforming an RDF database into a property graph database, including data and schema. We show that two of the proposed mappings satisfy the properties of semantics preservation and information preservation. The existence of both mappings allows us to conclude that the property graph data model subsumes the information capacity of the RDF data model.
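
A minimal sketch of a schema-independent direct mapping in the spirit described above: triples with literal objects become vertex properties, and triples with IRI objects become labeled edges. It uses rdflib and a plain dict-based property graph, and it ignores datatypes, blank nodes, and schema, so it only illustrates the idea and is not one of the paper's three mappings.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.name, Literal("Alice")))
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, EX.name, Literal("Bob")))

# Schema-independent direct mapping sketch:
#   triple with a literal object -> property on the subject vertex
#   triple with an IRI object    -> edge labeled by the predicate
vertices, edges = {}, []
for s, p, o in g:
    vertices.setdefault(str(s), {})
    if isinstance(o, Literal):
        vertices[str(s)][str(p)] = str(o)
    else:
        vertices.setdefault(str(o), {})
        edges.append((str(s), str(p), str(o)))

print(vertices)
print(edges)
```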

Proceedings ArticleDOI
11 Jun 2020
TL;DR: A comprehensive study of the existing cardinality estimation techniques for subgraph matching queries, scaling far beyond the original experiments, reveals that all existing techniques have serious problems in accuracy for various scenarios and datasets.
Abstract: Despite the crucial role of cardinality estimation in query optimization, there has been no systematic and in-depth study of the existing cardinality estimation techniques for subgraph matching queries. In this paper, for the first time, we present a comprehensive study of the existing cardinality estimation techniques for subgraph matching queries, scaling far beyond the original experiments. We first introduce a novel framework called G-CARE that enables us to realize all existing techniques on top of it and that provides insights on their performance. By using G-CARE, we then reimplement representative cardinality estimation techniques for graph databases as well as relational databases. We next evaluate these techniques w.r.t. accuracy on RDF and non-RDF graphs from different domains with subgraph matching queries of various topologies so far considered. Surprisingly, our results reveal that all existing techniques have serious problems in accuracy for various scenarios and datasets. Intriguingly, a simple sampling method based on an online aggregation technique designed for relational data consistently outperforms all existing techniques.

Posted Content
TL;DR: A general data model for multi-dimensional event data based on labeled property graphs that allows storing structural and temporal relations in a single, integrated graph-based data structure in a systematic way is proposed.
Abstract: Process event data is usually stored either in a sequential process event log or in a relational database. While the sequential, single-dimensional nature of event logs aids querying for (sub)sequences of events based on temporal relations such as "directly/eventually-follows", it does not support querying multi-dimensional event data of multiple related entities. Relational databases allow storing multi-dimensional event data but existing query languages do not support querying for sequences or paths of events in terms of temporal relations. In this paper, we propose a general data model for multi-dimensional event data based on labeled property graphs that allows storing structural and temporal relations in a single, integrated graph-based data structure in a systematic way. We provide semantics for all concepts of our data model, and generic queries for modeling event data over multiple entities that interact synchronously and asynchronously. The queries allow for efficiently converting large real-life event data sets into our data model and we provide 5 converted data sets for further research. We show that typical and advanced queries for retrieving and aggregating such multi-dimensional event data can be formulated and executed efficiently in the existing query language Cypher, giving rise to several new research questions. Specifically, aggregation queries on our data model enable process mining over multiple interrelated entities using off-the-shelf technology.
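
The sketch below illustrates the kind of Cypher modeling described above: events correlated to the same entity are ordered by timestamp and chained with directly-follows relationships, then time-ordered traces are read back. The :Event/:Entity/:CORR/:DF names follow the paper's terminology as summarized above, but the property names (EntityType, timestamp, Activity) are assumptions for illustration.

```python
# Hedged sketch: deriving directly-follows (:DF) relationships between :Event nodes
# correlated (:CORR) to the same :Entity. Property names are assumptions.
from neo4j import GraphDatabase

DERIVE_DF = """
MATCH (n:Entity {EntityType: $etype})<-[:CORR]-(e:Event)
WITH n, e ORDER BY e.timestamp
WITH n, collect(e) AS events
UNWIND range(0, size(events) - 2) AS i
WITH n, events[i] AS e1, events[i + 1] AS e2
MERGE (e1)-[:DF {EntityType: n.EntityType}]->(e2)
"""

QUERY_TRACES = """
MATCH p = (a:Event)-[:DF* {EntityType: $etype}]->(b:Event)
WHERE NOT ()-[:DF {EntityType: $etype}]->(a) AND NOT (b)-[:DF {EntityType: $etype}]->()
RETURN [e IN nodes(p) | e.Activity] AS trace LIMIT 5
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(DERIVE_DF, etype="Order")
    for rec in session.run(QUERY_TRACES, etype="Order"):
        print(rec["trace"])
driver.close()
```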

Proceedings ArticleDOI
11 Jun 2020
TL;DR: A1, as described in this paper, is an in-memory distributed database used by the Bing search engine to support complex queries over structured data; it uses FaRM as its underlying storage layer and builds the graph abstraction and query engine on top.
Abstract: A1 is an in-memory distributed database used by the Bing search engine to support complex queries over structured data. The key enablers for A1 are the availability of cheap DRAM and high-speed RDMA (Remote Direct Memory Access) networking in commodity hardware. A1 uses FaRM [11,12] as its underlying storage layer and builds the graph abstraction and query engine on top. The combination of in-memory storage and RDMA access requires rethinking how data is allocated, organized and queried in a large distributed system. A single A1 cluster can store tens of billions of vertices and edges and support a throughput of 350+ million vertex reads per second with end-to-end query latency in single-digit milliseconds. In this paper we describe the A1 data model, RDMA-optimized data structures, and query execution.

Journal ArticleDOI
TL;DR: This paper reviews state-of-the-art geospatial data processing in the 10 most popular NoSQL databases and summarizes the supported geometry objects, main geometry functions, spatial indexes, query languages, and data formats of these 10 No SQL databases.
Abstract: Geospatial information has been indispensable for many application fields, including traffic planning, urban planning, and energy management. Geospatial data are mainly stored in relational databases that have been developed over several decades, and most geographic information applications are desktop applications. With the arrival of big data, geospatial information applications are also being modified into, e.g., mobile platforms and Geospatial Web Services, which require changeable data schemas, faster query response times, and more flexible scalability than traditional spatial relational databases currently have. To respond to these new requirements, NoSQL (Not only SQL) databases are now being adopted for geospatial data storage, management, and queries. This paper reviews state-of-the-art geospatial data processing in the 10 most popular NoSQL databases. We summarize the supported geometry objects, main geometry functions, spatial indexes, query languages, and data formats of these 10 NoSQL databases. Moreover, the pros and cons of these NoSQL databases are analyzed in terms of geospatial data processing. A literature review and analysis showed that current document databases may be more suitable for massive geospatial data processing than are other NoSQL databases due to their comprehensive support for geometry objects and data formats and their performance, geospatial functions, index methods, and academic development. However, depending on the application scenarios, graph databases, key-value, and wide column databases have their own advantages.
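
As a concrete example of the document-database geospatial support surveyed above, here is a short pymongo sketch that creates a 2dsphere index and runs a $near proximity query in MongoDB; the database, collection, and field names are placeholders.

```python
# Hedged example: GeoJSON storage plus a proximity query in MongoDB (a document
# database). Database, collection, and field names are placeholders.
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
pois = client["geodemo"]["pois"]

pois.create_index([("location", GEOSPHERE)])          # 2dsphere index
pois.insert_one({
    "name": "Central Station",
    "location": {"type": "Point", "coordinates": [13.405, 52.52]},  # [lon, lat]
})

nearby = pois.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [13.40, 52.52]},
            "$maxDistance": 2000,                      # metres
        }
    }
})
for doc in nearby:
    print(doc["name"])
```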

Proceedings ArticleDOI
03 Sep 2020
TL;DR: This work is the first to tackle the problem of schema inference for property graphs, proposing a novel end-to-end inference method that handles complex and nested property values, multi-labeled nodes, and node hierarchies.
Abstract: Property graph instances are typically populated without defining a schema beforehand. Although this ensures great flexibility, the lack of a schema implies missed opportunities for query optimization, data integration and analytics, to name a few. Since several graph instances exist prior to the schema definition, extracting the schema from those instances in a principled way might become a significant yet daunting task. In this paper, we present a novel end-to-end schema inference method for property graph schemas that tackles complex and nested property values, multi-labeled nodes and node hierarchies. Our method consists of three main steps, the first of which builds upon Cypher queries to extract the node and edge serialization of a property graph. The second step builds on a MapReduce-based type inference system, working on the serialized output obtained during the first step. The third step analyzes subtypes and supertypes to infer node hierarchies. We describe our schema inference pipeline and its implementation, in a labels-oriented and a properties-oriented variant. Finally, we experimentally evaluate and compare the scalability and accuracy of our approaches on several real-life datasets. To the best of our knowledge, our work is the first to tackle the problem of schema inference for property graphs.
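
A hedged sketch of the first, extraction-like step described above: serialize each node's label set and property keys with Cypher, then group identical signatures into candidate node types in Python. The MapReduce-based type inference and the hierarchy analysis from the paper are deliberately left out.

```python
# Hedged sketch of the extraction step: pull each node's labels and property keys
# with Cypher, then group them into candidate node types. This leaves out the
# MapReduce-style type inference and the subtype/supertype analysis.
from collections import Counter
from neo4j import GraphDatabase

EXTRACT = "MATCH (n) RETURN labels(n) AS labels, keys(n) AS props"

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
signatures = Counter()
with driver.session() as session:
    for rec in session.run(EXTRACT):
        signatures[(tuple(sorted(rec["labels"])), frozenset(rec["props"]))] += 1
driver.close()

# Each distinct (label set, property-key set) pair is a candidate node type;
# the count shows how many nodes support it.
for (labels, props), count in signatures.most_common():
    print(labels, sorted(props), count)
```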

Journal ArticleDOI
TL;DR: This survey provides a comprehensive overview of the state-of-the-art graph generators by focusing on those that are pertinent and suitable for several data-intensive tasks.
Abstract: The abundance of interconnected data has fueled the design and implementation of graph generators reproducing real-world linking properties or gauging the effectiveness of graph algorithms, techniques, and applications manipulating these data. We consider graph generation across multiple subfields, such as Semantic Web, graph databases, social networks, and community detection, along with general graphs. Despite the disparate requirements of modern graph generators throughout these communities, we analyze them under a common umbrella, covering their functionalities, practical usage, and supported operations. We argue that this classification serves the need of providing scientists, researchers, and practitioners with the right data generator at hand for their work. This survey provides a comprehensive overview of the state-of-the-art graph generators by focusing on those that are pertinent and suitable for several data-intensive tasks. Finally, we discuss open challenges and missing requirements of current graph generators along with their future extensions to new emerging fields.

Journal ArticleDOI
TL;DR: The primary algorithmic finding was that the Louvain algorithm yields Twitter communities whose size distribution matches the tweet and retweet distributions most closely, in terms of the Kullback–Leibler divergence, with Newman–Girvan, Walktrap, and CNM following in that order.
Abstract: Community discovery is an essential topic in social network analysis since it provides a way for recursively decomposing a large social graph into easily interpretable subgraphs. The implementation of four major community discovery algorithms, namely Newman–Girvan (Edge Betweenness), Walktrap, Louvain, and CNM, as Java analytics over Neo4j is described. Their correctness was evaluated functionally in two real Twitter graphs with vastly different characteristics. This was done on the grounds that a successful structural graph partitioning should eventually be reflected in the network functionality domain. Additionally, most real-world graphs lack a list of ground-truth communities, rendering structural verification difficult, while functionality can be easily observed in most cases. Naturally, this renders the evaluation network-specific, as different social networks have different operational characteristics. The primary algorithmic finding was that the Louvain algorithm yields Twitter communities whose size distribution matches the tweet and retweet distributions most closely, in terms of the Kullback–Leibler divergence, with Newman–Girvan, Walktrap, and CNM following in that order.
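
Two of the algorithms named above (Louvain and CNM greedy modularity) are available in recent NetworkX releases, so their community-size distributions can be compared with a few lines of Python, as sketched below on a stand-in graph; the Kullback–Leibler comparison against tweet and retweet distributions requires the paper's Twitter data and is not reproduced here.

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()               # stand-in for a Twitter follower graph

louvain = community.louvain_communities(G, seed=42)      # Louvain (NetworkX >= 2.8)
cnm = community.greedy_modularity_communities(G)         # Clauset-Newman-Moore (CNM)

def size_distribution(parts):
    """Normalized, descending community-size distribution."""
    sizes = sorted((len(c) for c in parts), reverse=True)
    total = sum(sizes)
    return [s / total for s in sizes]

print("Louvain sizes:", size_distribution(louvain))
print("CNM sizes:    ", size_distribution(cnm))
# The paper compares such size distributions to tweet/retweet distributions via
# the Kullback-Leibler divergence; that step needs the Twitter data itself.
```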

Posted Content
TL;DR: This work introduces MalNet, the largest public graph database ever constructed, representing a large-scale ontology of software function call graphs, and provides a detailed analysis of MalNet, discussing its properties and provenance.
Abstract: With the rapid emergence of graph representation learning, the construction of new large-scale datasets is necessary to distinguish model capabilities and accurately assess the strengths and weaknesses of each technique. By carefully analyzing existing graph databases, we identify 3 critical components important for advancing the field of graph representation learning: (1) large graphs, (2) many graphs, and (3) class diversity. To date, no single graph database offers all of these desired properties. We introduce MalNet, the largest public graph database ever constructed, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families. Compared to the popular REDDIT-12K database, MalNet offers 105x more graphs, 44x larger graphs on average, and 63x the classes. We provide a detailed analysis of MalNet, discussing its properties and provenance. The unprecedented scale and diversity of MalNet offer exciting opportunities to advance the frontiers of graph representation learning, enabling new discoveries and research into imbalanced classification, explainability and the impact of class hardness. The database is publicly available at this http URL.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: In this paper, the authors propose a variant of the Relational Algebra equipped with a fixpoint operator for expressing recursive relational queries, which can notably express unions of conjunctive regular path queries.
Abstract: Graph databases have received a lot of attention as they are particularly useful in many applications such as social networks, life sciences and the semantic web. Various languages have emerged to query graph databases, many of which embed forms of recursion, which prove essential for navigating graphs. The relational model has benefited from a huge body of research in the last half century, and that is why many graph databases rely on techniques from relational query engines. Since its introduction, the relational model has seen various attempts to extend it with recursion, and it is now possible to use recursion in several SQL- or Datalog-based database systems. The optimization of recursive queries remains, however, a challenge. We propose mu-RA, a variation of the Relational Algebra equipped with a fixpoint operator for expressing recursive relational queries. mu-RA can notably express unions of conjunctive regular path queries. Leveraging the fact that this fixpoint operator makes recursive terms more amenable to algebraic transformations, we propose new rewrite rules. These rules make it possible to generate new query execution plans that cannot be obtained with previous approaches. We present the syntax and semantics of mu-RA, and the rewriting rules that we specifically devised to tackle the optimization of recursive queries. We report on practical experiments showing that the newly generated plans can provide significant performance improvements for evaluating recursive queries over graphs.
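
To make the role of a fixpoint operator concrete, the sketch below evaluates a recursive reachability query (a simple regular path query) by semi-naive iteration until a fixpoint is reached. This only mirrors the semantics of recursion over relations; mu-RA's contribution, an algebra with rewrite rules for optimizing such terms, is not represented here.

```python
def reachability(edges):
    """Fixpoint evaluation of reach(x, y) := edge(x, y) OR edge(x, z) AND reach(z, y).

    Semi-naive iteration: each round joins only the newly derived facts with the
    base relation, stopping when nothing new appears (the fixpoint).
    """
    reach = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, w) for (x, y) in delta for (z, w) in edges if y == z} - reach
        reach |= new
        delta = new
    return reach

edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(reachability(edges)))
# [('a','b'), ('a','c'), ('a','d'), ('b','c'), ('b','d'), ('c','d')]
```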

Proceedings ArticleDOI
23 Aug 2020
TL;DR: This work proposes GHashing, a novel graph neural network (GNN) based semantic hashing approach for approximate pruning in graph similarity search, which achieves significantly faster query times than state-of-the-art methods while maintaining high recall.
Abstract: Graph similarity search aims to find the most similar graphs to a query in a graph database in terms of a given proximity measure, say Graph Edit Distance (GED). It is a widely studied yet still challenging problem. Most of the studies are based on the pruning-verification framework, which first prunes non-promising graphs and then conducts verification on the small candidate set. Existing methods are capable of managing databases with thousands or tens of thousands of graphs, but fail to scale to even larger databases, due to their exact pruning strategy. Inspired by the recent success of deep-learning-based semantic hashing in image and document retrieval, we propose a novel graph neural network (GNN) based semantic hashing, i.e., GHashing, for approximate pruning. We first train a GNN with ground-truth GED results so that it learns to generate embeddings and hash codes that preserve GED between graphs. Then a hash index is built to enable graph lookup in constant time. To answer a query, we use the hash codes and the continuous embeddings as two-level pruning to retrieve the most promising candidates, which are sent to the exact solver for final verification. Due to the approximate pruning strategy leveraged by our graph hashing technique, our approach achieves significantly faster query time compared to state-of-the-art methods while maintaining a high recall. Experiments show that our approach is on average 20x faster than the only baseline that works on million-scale databases, which demonstrates that GHashing successfully provides a new direction for addressing the graph search problem for large-scale graph databases.
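
The two-level pruning idea above can be sketched in a few lines: binary hash codes give a coarse Hamming-distance filter, continuous embeddings rank the survivors, and only the top candidates would go to an exact GED verifier. In this sketch the codes and embeddings are random placeholders standing in for GNN outputs, and a linear Hamming scan replaces the constant-time hash index used by GHashing.

```python
import numpy as np

rng = np.random.default_rng(0)
N, BITS, DIM = 1000, 32, 64
codes = rng.integers(0, 2, size=(N, BITS), dtype=np.uint8)  # stand-in for GNN hash codes
embs = rng.normal(size=(N, DIM))                            # stand-in for GNN embeddings

def query(q_code, q_emb, hamming_radius=4, top_k=10):
    """Two-level pruning: Hamming filter on hash codes, then rank survivors by
    embedding distance; the final exact-GED verification step is left out."""
    hamming = (codes != q_code).sum(axis=1)                  # level 1: coarse filter
    candidates = np.flatnonzero(hamming <= hamming_radius)
    if candidates.size == 0:
        return np.array([], dtype=int)
    dists = np.linalg.norm(embs[candidates] - q_emb, axis=1) # level 2: fine ranking
    return candidates[np.argsort(dists)[:top_k]]

print(query(codes[0], embs[0]))  # graph 0 should be its own nearest candidate
```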

Journal ArticleDOI
TL;DR: ProgQuery is a platform that allows users to write their own Java program analyses in a declarative fashion, using graph representations; it outperforms the other systems in analysis time and scales better with program size and analysis complexity.
Abstract: Although source code programs are commonly written as textual information, they enclose syntactic and semantic information that is usually represented as graphs. This information is used for many different purposes, such as static program analysis, advanced code search, coding guideline checking, software metrics computation, and extraction of semantic and syntactic information to create predictive models. Most of the existing systems that provide these kinds of services are designed ad hoc for the particular purpose they are aimed at. For this reason, we created ProgQuery, a platform to allow users to write their own Java program analyses in a declarative fashion, using graph representations. We modify the Java compiler to compute seven syntactic and semantic representations, and store them in a Neo4j graph database. Such representations are overlaid, meaning that syntactic and semantic nodes of the different graphs are interconnected to allow combining different kinds of information in the queries/analyses. We evaluate ProgQuery and compare it to the related systems. Our platform outperforms the other systems in analysis time, and scales better to program sizes and analysis complexity. Moreover, the queries coded show that ProgQuery is more expressive than the other approaches. The additional information stored by ProgQuery increases the database size and associated insertion time, but these increases are significantly lower than the query/analysis performance gains obtained.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: Db2 Graph is implemented as a layer inside Db2 and thus can support integrated graph and SQL analytics efficiently; it enables efficient execution of graph queries with the help of the Db2 relational engine, through sophisticated compile-time and runtime optimization strategies.
Abstract: To meet the challenge of analyzing rapidly growing graph and network data created by modern applications, a large number of graph databases have emerged, such as Neo4j and JanusGraph. They mainly target low-latency graph queries, such as finding the neighbors of a vertex with certain properties, and retrieving the shortest path between two vertices. Although many of the graph databases handle graph-only queries very well, they fall short for real-life applications involving graph analysis. This is because graph queries are not all that one does in the analytics workload of a real-life application. They are often only a part of an integrated heterogeneous analytics pipeline, which may include SQL, machine learning, graph, and other analytics. This means graph queries need to be synergistic with other analytics. Unfortunately, most existing graph databases are standalone and cannot easily integrate with other analytics systems. In addition, much graph data (data about relationships between objects or people) is already prevalent in existing non-graph databases, although it is not explicitly stored as graphs. None of the existing graph databases can retrofit graph queries onto these existing data without transferring or transforming the data. In this paper, we propose an in-DBMS graph query approach, IBM Db2 Graph, to support synergistic and retrofittable graph queries inside the IBM Db2 relational database. It is implemented as a layer inside Db2, and thus can support integrated graph and SQL analytics efficiently. Db2 Graph employs a novel graph overlay approach to expose a graph view of the relational data. This approach flexibly retrofits graph queries to existing graph data stored in relational tables, without expensive data transfer or transformation. In addition, it enables efficient execution of graph queries with the help of the Db2 relational engine, through sophisticated compile-time and runtime optimization strategies. Our experimental study, as well as our experience with real customers using Db2 Graph, showed that Db2 Graph can provide very competitive and sometimes even better performance on graph-only queries compared to existing graph databases. Moreover, it optimizes the overall performance of complex analytics workloads.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: This work describes how GSQL, TigerGraph's graph query language, supports the specification of aggregation in graph analytics, and details the design, showing how the ideas transcend GSQL and are eminently portable to the upcoming graph query language standards as well as existing pattern-based declarative query languages.
Abstract: We describe how GSQL, TigerGraph's graph query language, supports the specification of aggregation in graph analytics. GSQL makes several unique design decisions with respect to both the expressive power and the evaluation complexity of the specified aggregation. We detail our design showing how our ideas transcend GSQL and are eminently portable to the upcoming graph query language standards as well as the existing pattern-based declarative query languages.

Journal ArticleDOI
TL;DR: A novel deep learning method based on long short-term memory (LSTM) is proposed that predicts the execution time of a query task in a graph database before it is executed and achieves state-of-the-art prediction performance for query execution time.