
Showing papers on "Graph database published in 2019"


Posted Content
TL;DR: This work presents the first survey and taxonomy of graph database systems, identifying and analyzing fundamental categories of these systems, and outlines graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms).
Abstract: Graph processing has become an important part of multiple areas of computer science, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Numerous graphs such as web or social networks may contain up to trillions of edges. Often, these graphs are also dynamic (their structure changes over time) and have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size of such datasets, combined with the irregular nature of graph processing, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). 45 graph database systems are presented and compared, including Neo4j, OrientDB, and Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we describe research and engineering challenges to outline the future of graph databases.

70 citations


Journal ArticleDOI
01 Jul 2019
TL;DR: In this article, a new solution paradigm was proposed to find the densest subgraphs through a k-core (a kind of dense subgraph of a graph) with theoretical guarantees.
Abstract: Densest subgraph discovery (DSD) is a fundamental problem in graph mining. It has been studied for decades, and is widely used in various areas, including network science, biological analysis, and graph databases. Given a graph G, DSD aims to find a subgraph D of G with the highest density (e.g., the number of edges over the number of vertices in D). Because DSD is difficult to solve, we propose a new solution paradigm in this paper. Our main observation is that the densest subgraph can be accurately found through a k-core (a kind of dense subgraph of G), with theoretical guarantees. Based on this intuition, we develop efficient exact and approximation solutions for DSD. Moreover, our solutions are able to find the densest subgraphs for a wide range of graph density definitions, including clique-based and general pattern-based density. We have performed extensive experimental evaluation on both real and synthetic datasets. Our results show that our algorithms are up to four orders of magnitude faster than existing approaches.
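The paper's key observation lends itself to a compact illustration. The sketch below is a stand-in for the paper's actual algorithms, using networkx: it computes the innermost k-core of a small graph and compares its edge density |E|/|V| to that of the whole graph; the maximum k-core is a well-known cheap approximation for this density measure.

```python
# A minimal sketch of the paper's key observation (not its algorithms):
# the innermost k-core is a cheap, provably good proxy for the densest
# subgraph under the |E|/|V| density measure. Requires networkx.
import networkx as nx

def density(g):
    # Edge density as defined in the abstract: number of edges over
    # number of vertices.
    return g.number_of_edges() / g.number_of_nodes()

G = nx.karate_club_graph()

k_max = max(nx.core_number(G).values())   # largest core number in G
core = nx.k_core(G, k=k_max)              # the innermost (maximum) k-core

print(f"whole graph density: {density(G):.3f}")
print(f"{k_max}-core density:      {density(core):.3f}")
```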

65 citations


Posted Content
TL;DR: TigerGraph's high-level query language, GSQL, is designed for compatibility with SQL, while simultaneously allowing NoSQL programmers to continue thinking in Bulk-Synchronous Processing (BSP) terms and reap the benefits of high- level specification.
Abstract: We present TigerGraph, a graph database system built from the ground up to support massively parallel computation of queries and analytics. TigerGraph's high-level query language, GSQL, is designed for compatibility with SQL, while simultaneously allowing NoSQL programmers to continue thinking in Bulk-Synchronous Processing (BSP) terms and reap the benefits of high-level specification. GSQL is sufficiently high-level to allow declarative SQL-style programming, yet sufficiently expressive to concisely specify the sophisticated iterative algorithms required by modern graph analytics and traditionally coded in general-purpose programming languages like C++ and Java. We report very strong scale-up and scale-out performance over a benchmark we published on GitHub for full reproducibility.

39 citations


Journal ArticleDOI
TL;DR: This article proposes a framework, namely Discovery Information using COmmunity detection (DICO), for identifying overlapped communities of authors from Big Scholarly Data by modeling authors’ interactions through a novel graph-based data model combining jointly document metadata with semantic information.
Abstract: The widespread use of Online Social Networks has also reached the scientific field, in which researchers interact with each other by publishing or citing papers. The huge amount of information about scientific research documents has been described through the term Big Scholarly Data. In this paper we propose a framework, namely Discovery Information using COmmunity detection (DICO), for identifying overlapped communities of authors from Big Scholarly Data by modeling authors' interactions through a novel graph-based data model that jointly combines document metadata with semantic information. In particular, DICO presents three distinctive characteristics: (i) the co-authorship network has been built from publication records using a novel approach for estimating the relationship weight between users; (ii) a new community detection algorithm based on Node Location Analysis has been developed to identify overlapped communities; (iii) some built-in queries are provided to browse the generated network, though any graph-traversal query can be implemented through the Cypher query language. An experimental evaluation has been carried out to assess the efficacy of the proposed community detection algorithm on benchmark networks. Finally, DICO has been tested on a real-world Big Scholarly Dataset to show its usefulness, working on the DBLP+AMiner dataset, which contains 1.7M+ distinct authors and 3M+ papers, handling 25M+ citation relationships.

35 citations


Journal ArticleDOI
TL;DR: A GPU-based Bees Swarm Optimization Miner (GBSO-Miner) is presented, in which the GPU is used as a co-processor to compute the CPU-time-intensive steps of the algorithm; the evaluation reveals that GBSO-Miner is up to 800 times faster than an optimized CPU implementation.

33 citations


Journal ArticleDOI
TL;DR: A software tool has been developed for supporting visual network analysis in a user-friendly way; providing several functionalities such as peptide retrieval and filtering, network construction and visualization, interactive exploration, and exporting data options.
Abstract: Motivation: Bioactive peptides have gained great attention in academia and the pharmaceutical industry since they play an important role in human health. However, the increasing number of bioactive peptide databases is causing the problem of data redundancy and duplicated efforts. Even worse is the fact that the available data are non-standardized and often dirty with data entry errors. Therefore, there is a need for a unified view that enables a more comprehensive analysis of the information on this topic residing at different sites. Results: After collecting web pages from a large variety of bioactive peptide databases, we organized the web content into an integrated graph database (starPepDB) that holds a total of 71,310 nodes and 348,505 relationships. In this graph structure, there are 45,120 nodes representing peptides, and the rest of the nodes are connected to peptides for describing metadata. Additionally, to facilitate a better understanding of the integrated data, a software tool (starPep toolbox) has been developed for supporting visual network analysis in a user-friendly way, providing several functionalities such as peptide retrieval and filtering, network construction and visualization, interactive exploration, and data export options. Availability and implementation: Both starPepDB and starPep toolbox are freely available at http://mobiosd-hub.com/starpep/. Supplementary information: Supplementary data are available at Bioinformatics online.
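As an illustration of the kind of peptide retrieval and filtering the toolbox supports, the following sketch queries a starPepDB-like graph through the official Neo4j Python driver. The label, relationship, and property names (Peptide, ANNOTATED_WITH, Function, name) are illustrative assumptions, not the published schema.

```python
# Hypothetical peptide retrieval from a starPepDB-like Neo4j graph.
# Labels/properties (Peptide, ANNOTATED_WITH, Function, name) are
# assumptions for illustration, not the database's documented schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

QUERY = """
MATCH (p:Peptide)-[:ANNOTATED_WITH]->(f:Function {name: $function})
RETURN p.name AS peptide
LIMIT 25
"""

with driver.session() as session:
    for record in session.run(QUERY, function="antimicrobial"):
        print(record["peptide"])

driver.close()
```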

33 citations


Proceedings ArticleDOI
08 Apr 2019
TL;DR: The results show that (1) the slow verification method in existing IFV algorithms can lead us to over-estimate the gain of filtering; and (2) the modified subgraph querying algorithms with efficient subgraph matching are competitive in time performance and can scale to hundreds of thousands of data graphs and graphs of thousands of vertices.
Abstract: A subgraph query finds all data graphs in a graph database each of which contains the given query graph. Existing work takes the indexing-filtering-verification (IFV) approach to first index all data graphs, then filter out some of them based on the index, and finally test subgraph isomorphism on each of the remaining data graphs. This final test of subgraph isomorphism is a sub-problem of subgraph matching, which finds all subgraph isomorphisms from a query graph to a data graph. As such, in this paper, we study whether, and if so, how to utilize efficient subgraph matching to improve subgraph query processing. Specifically, we modify leading subgraph matching algorithms and integrate them with top-performing subgraph querying algorithms. Our results show that (1) the slow verification method in existing IFV algorithms can lead us to over-estimate the gain of filtering; and (2) our modified subgraph querying algorithms with efficient subgraph matching are competitive in time performance and can scale to hundreds of thousands of data graphs and graphs of thousands of vertices.
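The filtering-then-verification flow studied here can be sketched in a few lines. The toy pipeline below is not one of the paper's algorithms: it filters data graphs by a cheap degree-based necessary condition, with networkx's VF2 matcher standing in for the efficient subgraph-matching algorithms the paper plugs in.

```python
# Toy indexing-filtering-verification (IFV) pipeline. The degree-based
# filter is a cheap necessary condition; networkx's VF2 matcher stands in
# for an efficient subgraph-matching algorithm in the verification step.
import networkx as nx
from networkx.algorithms import isomorphism

def subgraph_query(query, database):
    q_max_deg = max(d for _, d in query.degree())
    for gid, data in database.items():
        # Filtering: discard graphs that cannot possibly contain the query.
        if data.number_of_nodes() < query.number_of_nodes():
            continue
        if max(d for _, d in data.degree()) < q_max_deg:
            continue
        # Verification: (non-induced) subgraph test on the survivors.
        gm = isomorphism.GraphMatcher(data, query)
        if gm.subgraph_is_monomorphic():
            yield gid

db = {"g1": nx.cycle_graph(6),     # no triangle
      "g2": nx.path_graph(4),      # no triangle
      "g3": nx.complete_graph(5)}  # contains triangles
print(list(subgraph_query(nx.cycle_graph(3), db)))  # ['g3']
```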

30 citations


Journal ArticleDOI
Yijian Cheng, Pengjie Ding, Tongtong Wang, Wei Lu, Xiaoyong Du
TL;DR: RDBMSs outperform GDBMSs by a substantial margin under workloads that mainly consist of group-by, sort, and aggregation operations and their combinations, while GDBMSs show their superiority under workloads that mainly consist of multi-table join, pattern match, path identification, and their combinations.
Abstract: Over decades, relational database management systems (RDBMSs) have been the first choice to manage data. Recently, due to the variety property of big data, graph database management systems (GDBMSs) have emerged as an important complement to RDBMSs. As pointed out in the existing literature, both RDBMSs and GDBMSs are capable of managing graph data and relational data; however, the boundary between them still remains unclear. For this reason, in this paper, we first extend a unified benchmark for RDBMSs and GDBMSs over the same datasets using the same query workload under the same metrics. We then conduct extensive experiments to evaluate them and make the following findings: (1) RDBMSs outperform GDBMSs by a substantial margin under workloads that mainly consist of group-by, sort, and aggregation operations, and their combinations; (2) GDBMSs show their superiority under workloads that mainly consist of multi-table join, pattern match, path identification, and their combinations.
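The two workload classes in this finding are easy to exemplify. The hypothetical queries below (table, label, and property names are invented for illustration) contrast an aggregation-heavy SQL query of class (1) with a multi-hop Cypher pattern of class (2) that would require repeated self-joins in SQL.

```python
# Hypothetical queries illustrating the benchmark's two workload classes
# (schema names are assumptions, not the benchmark's actual datasets).

# Class (1): group-by / sort / aggregation -- where RDBMSs win.
SQL_AGGREGATION = """
SELECT country, COUNT(*) AS persons
FROM person
GROUP BY country
ORDER BY persons DESC;
"""

# Class (2): multi-hop pattern matching / path finding -- where GDBMSs
# win; the same variable-length traversal needs repeated self-joins in SQL.
CYPHER_PATTERN = """
MATCH (a:Person {name: $name})-[:KNOWS*1..3]->(b:Person)
RETURN DISTINCT b.name;
"""
```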

29 citations


Journal ArticleDOI
TL;DR: A novel tool, QAnalysis, is built, where doctors enter their analytic requirements in natural language and the tool returns charts and tables, providing a convenient way for doctors to get statistical results directly in natural language.
Abstract: While doctors need to analyze large amounts of electronic medical record (EMR) data to conduct clinical research, the analysis process requires information technology (IT) skills, which are difficult for most doctors in China. In this paper, we build a novel tool, QAnalysis, where doctors enter their analytic requirements in natural language and the tool returns charts and tables. For a given question from a user, we first segment the sentence, and then use a grammar parser to analyze its structure. After linking the segments to concepts and predicates in knowledge graphs, we convert the question into a set of triples connected with different kinds of operators. These triples are converted to queries in Cypher, the query language for Neo4j. Finally, the query is executed on Neo4j, and the results, shown in terms of tables and charts, are returned to the user. The tool supports the top 50 questions we gathered from two hospital departments with the Delphi method. We also gathered 161 questions from clinical research papers with statistical requirements on EMR data. Experimental results show that our tool can directly cover 78.20% of these statistical questions, with precision as high as 96.36%. Extension to uncovered questions is easy to achieve with the help of the knowledge-graph technology we have adopted. A recorded demo can be accessed from https://github.com/NLP-BigDataLab/QAnalysis-project . Our tool shows great flexibility in processing different kinds of statistical questions, providing a convenient way for doctors to get statistical results directly in natural language.
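The triple-to-Cypher step can be sketched as follows. The mapping below is a hypothetical illustration; the schema (Patient, DIAGNOSED_WITH, Disease) and the helper triple_to_cypher are invented, since the paper does not publish its exact mapping rules.

```python
# Hypothetical sketch of the final translation step: one extracted triple
# becomes one Cypher pattern. Schema and helper names are invented for
# illustration; the paper's actual mapping rules are not reproduced here.
def triple_to_cypher(subj_label, predicate, obj_label):
    return (
        f"MATCH (s:{subj_label})-[:{predicate}]->(o:{obj_label}) "
        f"WHERE o.name = $value "
        f"RETURN count(DISTINCT s) AS n"
    )

# "How many patients were diagnosed with diabetes?"
query = triple_to_cypher("Patient", "DIAGNOSED_WITH", "Disease")
print(query)  # execute with session.run(query, value="diabetes") on Neo4j
```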

29 citations


Proceedings ArticleDOI
26 May 2019
TL;DR: The Maven Dependency Graph as discussed by the authors is a dataset of 2.8M artifacts from the Maven Central Repository with metadata such as exact version, date of upload and list of dependencies towards other artifacts.
Abstract: The Maven Central Repository provides an extraordinary source of data to understand complex architecture and evolution phenomena among Java applications. As of September 6, 2018, this repository includes 2.8M artifacts (compiled pieces of code implemented in a JVM-based language), each of which is characterized by metadata such as exact version, date of upload and list of dependencies towards other artifacts. Today, one who wants to analyze the complete ecosystem of Maven artifacts and their dependencies faces two key challenges: (i) this is a huge data set; and (ii) dependency relationships among artifacts are not modeled explicitly and cannot be queried. In this paper, we present the Maven Dependency Graph. This open source data set provides two contributions: a snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database in which we explicitly model all dependencies; and an open source infrastructure to query this huge dataset.
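A query over such a graph might look like the following sketch, which assumes artifacts are modeled as (:Artifact) nodes connected by [:DEPENDS_ON] relationships; the actual labels in the released dataset may differ.

```python
# Hypothetical Cypher over the Maven Dependency Graph, assuming artifacts
# are (:Artifact) nodes linked by [:DEPENDS_ON] edges; the labels in the
# released dataset may differ.
TRANSITIVE_USERS = """
MATCH (dep:Artifact {groupId: $g, artifactId: $a, version: $v})
MATCH (user:Artifact)-[:DEPENDS_ON*1..3]->(dep)
RETURN DISTINCT user.groupId, user.artifactId
LIMIT 100
"""
# e.g. session.run(TRANSITIVE_USERS, g="junit", a="junit", v="4.12")
```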

29 citations


Proceedings Article
01 Jan 2019
TL;DR: GraphOne, as presented in this paper, is a graph data store that abstracts the data store away from specialized systems to solve the fundamental research problems associated with data store design; it combines two complementary graph storage formats (edge list and adjacency list) and uses dual versioning to decouple graph computations from updates.
Abstract: There is a growing need to perform a diverse set of real-time analytics (batch and stream analytics) on evolving graphs to deliver the value of big data to users. The key requirement from such applications is to have a data store that supports their diverse data access efficiently, while concurrently ingesting fine-grained updates at a high velocity. Unfortunately, current graph systems, whether graph databases or analytics engines, are not designed to achieve high performance for both operations; rather, each excels in one area by keeping a private data store organized in a specialized way that favors its own operations only. To address this challenge, we have designed and developed GraphOne, a graph data store that abstracts the data store away from the specialized systems in order to solve the fundamental research problems associated with data store design. It combines two complementary graph storage formats (edge list and adjacency list) and uses dual versioning to decouple graph computations from updates. Importantly, it presents a new data abstraction, GraphView, to enable data access at two different granularities of data ingestion (called data visibility) for concurrent execution of diverse classes of real-time graph analytics with only a small amount of data duplication. Experimental results show that GraphOne delivers 11.40× and 5.36× average speedup in ingestion rate against LLAMA and Stinger, the two state-of-the-art dynamic graph systems, respectively. Further, it achieves an average speedup of 8.75× and 4.14× against LLAMA and 12.80× and 3.18× against Stinger for BFS and PageRank analytics (batch version), respectively. GraphOne also gains over 2,000× speedup against Kickstarter, a state-of-the-art stream analytics engine, in ingesting streaming edges and performing streaming BFS on a synthetic graph when treating the first half as a base snapshot and the rest as streaming edges. GraphOne also achieves an ingestion rate two to three orders of magnitude higher than graph databases. Finally, we demonstrate that it is possible to run concurrent stream analytics from the same data store.
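The edge-list/adjacency-list split at the core of this design can be miniaturized as follows; this toy store is an illustration of the idea described in the abstract, not GraphOne's implementation.

```python
# Toy illustration of the dual-format idea: ingest into an edge log,
# periodically archive into an adjacency list, and let reads combine the
# stable snapshot with the fresh tail. Not GraphOne's implementation.
from collections import defaultdict

class TinyGraphStore:
    def __init__(self, archive_threshold=4):
        self.edge_log = []                  # fresh, non-archived updates
        self.adjacency = defaultdict(list)  # archived snapshot
        self.archive_threshold = archive_threshold

    def add_edge(self, u, v):
        self.edge_log.append((u, v))        # O(1) ingestion path
        if len(self.edge_log) >= self.archive_threshold:
            self.archive()

    def archive(self):
        for u, v in self.edge_log:          # compact the log
            self.adjacency[u].append(v)
        self.edge_log.clear()

    def neighbors(self, u):
        # Snapshot plus fresh tail -- roughly the data-visibility knob
        # that GraphOne's GraphView abstraction exposes.
        return self.adjacency[u] + [v for x, v in self.edge_log if x == u]

store = TinyGraphStore()
for e in [(1, 2), (1, 3), (2, 3), (3, 4), (1, 4)]:
    store.add_edge(*e)
print(store.neighbors(1))  # [2, 3, 4]
```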

Journal ArticleDOI
TL;DR: This paper presents OpenBiodiv: An OBKMS that utilizes semantic publishing workflows, text and data mining, common standards, ontology modelling and graph database technologies to establish a robust infrastructure for managing biodiversity knowledge.
Abstract: Hundreds of years of biodiversity research have resulted in the accumulation of a substantial pool of communal knowledge; however, most of it is stored in silos isolated from each other, such as published articles or monographs. The need for a system to store and manage collective biodiversity knowledge in a community-agreed and interoperable open format has evolved into the concept of the Open Biodiversity Knowledge Management System (OBKMS). This paper presents OpenBiodiv: An OBKMS that utilizes semantic publishing workflows, text and data mining, common standards, ontology modelling and graph database technologies to establish a robust infrastructure for managing biodiversity knowledge. It is presented as a Linked Open Dataset generated from scientific literature. OpenBiodiv encompasses data extracted from more than 5000 scholarly articles published by Pensoft and many more taxonomic treatments extracted by Plazi from journals of other publishers. The data from both sources are converted to Resource Description Framework (RDF) and integrated in a graph database using the OpenBiodiv-O ontology and an RDF version of the Global Biodiversity Information Facility (GBIF) taxonomic backbone. Through the application of semantic technologies, the project showcases the value of open publishing of Findable, Accessible, Interoperable, Reusable (FAIR) data towards the establishment of open science practices in the biodiversity domain.
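A consumer of such a Linked Open Dataset would typically issue SPARQL. The sketch below uses the SPARQLWrapper library against a placeholder endpoint; the endpoint URL and the class IRI are assumptions for illustration, not terms documented here from OpenBiodiv-O.

```python
# Hypothetical SPARQL against an OpenBiodiv-style RDF endpoint. The
# endpoint URL and the class IRI are placeholders, not documented terms.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/openbiodiv/sparql")
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name WHERE {
  ?taxon a <http://example.org/TaxonomicName> ;
         rdfs:label ?name .
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"])
```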

Book ChapterDOI
04 Nov 2019
TL;DR: Apart from proposing a concise schema DDL inspired by Cypher syntax, this work shows how schema validation can be enforced through homomorphisms between PG schemas and PG instances, and how schema evolution can be described through graph rewriting operations.
Abstract: Despite the maturity of commercial graph databases, little consensus has been reached so far on the standardization of data definition languages (DDLs) for property graphs (PG). Discussion on the characteristics of PG schemas is ongoing in many standardization and community groups. Although some basic aspects of a schema are already present in most commercial graph databases, full support for constraining property graphs with more or less flexibility is still missing.

Posted Content
TL;DR: This work provides the first analysis and taxonomy of dynamic and streaming graph processing, focusing on identifying the fundamental system designs and on understanding their support for concurrency, and for different graph updates as well as analytics workloads.
Abstract: Graph processing has become an important part of various areas of computing, including machine learning, medical applications, social network analysis, computational sciences, and others. A growing amount of the associated graph processing workloads are dynamic, with millions of edges added or removed per second. Graph streaming frameworks are specifically crafted to enable the processing of such highly dynamic workloads. Recent years have seen the development of many such frameworks. However, they differ in their general architectures (with key details such as the support for the concurrent execution of graph updates and queries, or the incorporated graph data organization), the types of updates and workloads allowed, and many others. To facilitate the understanding of this growing field, we provide the first analysis and taxonomy of dynamic and streaming graph processing. We focus on identifying the fundamental system designs and on understanding their support for concurrency, and for different graph updates as well as analytics workloads. We also crystallize the meaning of different concepts associated with streaming graph processing, such as dynamic, temporal, online, and time-evolving graphs, edge-centric processing, models for the maintenance of updates, and graph databases. Moreover, we provide a bridge with the very rich landscape of graph streaming theory by giving a broad overview of recent theoretical related advances, and by discussing which graph streaming models and settings could be helpful in developing more powerful streaming frameworks and designs. We also outline graph streaming workloads and research challenges.

Posted Content
TL;DR: The main observation is that a densest subgraph can be accurately found through a k-core (a kind of dense subgraph of G), with theoretical guarantees, and efficient exact and approximation solutions for DSD are developed.
Abstract: Densest subgraph discovery (DSD) is a fundamental problem in graph mining. It has been studied for decades, and is widely used in various areas, including network science, biological analysis, and graph databases. Given a graph G, DSD aims to find a subgraph D of G with the highest density (e.g., the number of edges over the number of vertices in D). Because DSD is difficult to solve, we propose a new solution paradigm in this paper. Our main observation is that a densest subgraph can be accurately found through a k-core (a kind of dense subgraph of G), with theoretical guarantees. Based on this intuition, we develop efficient exact and approximation solutions for DSD. Moreover, our solutions are able to find the densest subgraphs for a wide range of graph density definitions, including clique-based and general pattern-based density. We have performed extensive experimental evaluation on eleven real datasets. Our results show that our algorithms are up to four orders of magnitude faster than existing approaches.

Journal ArticleDOI
TL;DR: A semantic graph model is proposed which can not only represent the scheduling problem with extended constraints but also integrate the entire lifecycle data and inspire a simulation-based ant colony algorithm to acquire a feasible and nearly optimal schedule solution.

Proceedings ArticleDOI
08 Apr 2019
TL;DR: This paper develops two pruning techniques based on geometric properties of the maximal spatial clique to significantly enhance computing efficiency, and shows that maximal spatial clique enumeration can identify groups of spatially close objects in a variety of location-based-service (LBS) applications.
Abstract: Maximal clique enumeration is a fundamental problem in graph databases. In this paper, we investigate this problem in the context of spatial databases. Given a set P of spatial objects in a 2-dimensional space (e.g., geo-locations of users or points of interest) and a distance threshold r, we can build a spatial neighbourhood graph Pr by connecting every pair of objects (vertices) in P within distance r. Given a clique S of Pr, namely a spatial clique, it is immediate that every pairwise distance among objects in S is bounded by r. As the maximal pairwise distance has been widely used to capture the spatial cohesiveness of a group of objects, the maximal spatial clique enumeration technique can identify groups of spatially close objects in a variety of location-based-service (LBS) applications. In addition, we show that maximal spatial clique enumeration can also be used to identify maximal clique pattern instances in co-location pattern mining applications. Given the existing techniques for maximal clique enumeration, which can be immediately applied to the spatial neighbourhood graph Pr, two questions naturally arise for the enumeration of maximal spatial cliques: (1) maximal clique enumeration on general graphs is NP-hard; can we have a polynomial-time solution on the spatial neighbourhood graph? and (2) can we exploit the geometric properties of the spatial clique to speed up the computation? We give a negative answer to the first question by an example where the number of maximal spatial cliques is exponential in the number of objects. The answer to the second question, however, is positive: we develop two pruning techniques based on geometric properties of the maximal spatial clique to significantly enhance computing efficiency. Extensive experiments on real-life geolocation data demonstrate the superior performance of the proposed methods compared with two baseline algorithms.
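Before any pruning, the problem setup itself is simple to reproduce. The baseline sketch below builds the spatial neighbourhood graph Pr and hands it to a general-purpose maximal clique enumerator (networkx), which is exactly the starting point the paper's geometric pruning techniques improve upon.

```python
# Baseline setup without the paper's geometric pruning: build the spatial
# neighbourhood graph P_r, then enumerate maximal cliques generically.
from itertools import combinations
import math
import networkx as nx

def spatial_neighbourhood_graph(points, r):
    g = nx.Graph()
    g.add_nodes_from(range(len(points)))
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= r:   # within distance r
            g.add_edge(i, j)
    return g

points = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6)]
Pr = spatial_neighbourhood_graph(points, r=1.5)
print(list(nx.find_cliques(Pr)))   # maximal spatial cliques, e.g. [0, 1, 2]
```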

Journal ArticleDOI
TL;DR: In this paper, the authors present a literature review on the main applications of OLAP technology in the analysis of information network data, conducting a systematic review that lists the works applying OLAP technologies to graph data.
Abstract: Many real systems produce network data, or highly interconnected data, which can be called information networks. These information networks form a critical component of modern information infrastructure, constituting a large volume of graph data. The analysis of information network data covers several technological areas, among them OLAP technologies. OLAP is a technology that enables multi-dimensional and multi-level analysis of a large volume of data, providing aggregated data visualizations from different perspectives. This article presents a literature review on the main applications of OLAP technology in the analysis of information network data. To achieve this goal, it presents a systematic review listing the works that apply OLAP technologies to graph data. It defines seven comparison criteria (Materialization, Network, Selection, Aggregation, Model, OLAP Operations, Analytics) to qualify the works found based on their functionalities. The works are analyzed according to each criterion and discussed to identify trends and challenges in the application of OLAP to information networks.

Book ChapterDOI
26 Oct 2019
TL;DR: In this article, a procedure for evaluating SPARQL queries is proposed based on an existing worst-case optimal multiway join algorithm called Leapfrog Triejoin.
Abstract: Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature. These algorithms not only offer strong theoretical guarantees of efficiency, but have also been empirically demonstrated to significantly improve query runtimes for relational and graph databases. Despite these promising theoretical and practical results, however, the Semantic Web community has yet to adopt such techniques; to the best of our knowledge, no native RDF database currently supports such join algorithms, where in this paper we demonstrate that this should change. We propose a novel procedure for evaluating SPARQL queries based on an existing worst-case join algorithm called Leapfrog Triejoin. We propose an adaptation of this algorithm for evaluating SPARQL queries, and implement it in Apache Jena. We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and a novel benchmark that we propose based on Wikidata that is designed to provide insights into join performance for a more diverse set of basic graph patterns. Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base version and two other SPARQL engines: Virtuoso and Blazegraph.
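The heart of Leapfrog Triejoin is a worst-case optimal multi-way intersection of sorted iterators. The single-attribute sketch below shows that leapfrogging step over sorted Python lists (integer keys assumed; the full algorithm applies this per variable over trie levels and advances iterators rather than incrementing keys).

```python
# Single-attribute "leapfrog" intersection of k sorted lists -- the core
# step of Leapfrog Triejoin. Each iterator repeatedly leaps (via binary
# search) to the least upper bound of the current maximum key.
from bisect import bisect_left

def leapfrog_intersect(lists):
    if any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)
    key = max(lst[0] for lst in lists)
    out = []
    while True:
        matched = 0
        for i, lst in enumerate(lists):
            pos[i] = bisect_left(lst, key, pos[i])  # leap to first >= key
            if pos[i] == len(lst):
                return out                          # an iterator ran dry
            if lst[pos[i]] == key:
                matched += 1
            else:
                key = lst[pos[i]]                   # new max; retry round
                break
        if matched == len(lists):
            out.append(key)                         # all iterators agree
            key += 1                                # integer keys assumed

print(leapfrog_intersect([[1, 3, 4, 7, 9], [1, 4, 8, 9], [0, 4, 9, 10]]))
# -> [4, 9]
```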

Journal ArticleDOI
TL;DR: In this paper, the authors present an algorithm for the fast computation of the general $N$-point spatial correlation functions of any discrete point set embedded within a Euclidean space $\mathbb{R}^n$.
Abstract: We present an algorithm for the fast computation of the general $N$-point spatial correlation functions of any discrete point set embedded within a Euclidean space $\mathbb{R}^n$. Utilizing the concepts of kd-trees and graph databases, we describe how to count all possible $N$-tuples in binned configurations within a given length scale, e.g., all pairs of points or all triplets of points with side lengths within that scale.
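For the pair-counting (N = 2) base case, the kd-tree part of the idea can be sketched with scipy; the binning below is illustrative, and the paper's contribution of assembling general N-tuples via a graph database is not reproduced here.

```python
# Pair-counting (N = 2) sketch of the kd-tree part of the method: count
# point pairs per separation bin. Binning choices are illustrative.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))      # random points in the unit cube

tree = cKDTree(points)
bins = np.linspace(0.0, 0.1, 11)      # 10 separation bins up to r = 0.1
cumulative = tree.count_neighbors(tree, bins)   # pairs with d <= r
print(np.diff(cumulative))            # pair counts per bin
```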

Book ChapterDOI
27 Aug 2019
TL;DR: This paper introduces GKC, a resolution prover optimized for search in large knowledge bases built upon a shared memory graph database Whitedb, enabling it to solve multiple different queries without a need to repeatedly parse or load the large parsed knowledge base from the disk.
Abstract: This paper introduces GKC, a resolution prover optimized for search in large knowledge bases. The system is built upon a shared memory graph database Whitedb, enabling it to solve multiple different queries without a need to repeatedly parse or load the large parsed knowledge base from the disk. Due to the relatively shallow and simple structure of most of the literals in the knowledge base, the indexing methods used are mostly hash-based. While GKC performs well on large problems from the TPTP set, the system is built for use as a core system for developing a toolset of commonsense reasoning functionalities.

Proceedings ArticleDOI
20 May 2019
TL;DR: RedisGraph is a Redis module developed by Redis Labs to add graph database functionality to the Redis database and is significantly faster than comparable graph databases.
Abstract: RedisGraph is a Redis module developed by Redis Labs to add graph database functionality to the Redis database. RedisGraph represents connected data as adjacency matrices. By representing the data as sparse matrices and employing the power of GraphBLAS (a highly optimized library for sparse matrix operations), RedisGraph delivers a fast and efficient way to store, manage and process graphs. Initial benchmarks indicate that RedisGraph is significantly faster than comparable graph databases.
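The linear-algebra formulation underlying GraphBLAS can be illustrated without Redis at all: with a sparse adjacency matrix, one BFS level is one matrix-vector product. The sketch below uses scipy.sparse as a stand-in for GraphBLAS.

```python
# BFS as repeated sparse matrix-vector products -- the GraphBLAS-style
# formulation RedisGraph builds on (scipy.sparse stands in here).
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
n = 5
rows, cols = zip(*edges)
A = csr_matrix((np.ones(len(edges), dtype=np.int8), (rows, cols)),
               shape=(n, n))          # adjacency matrix of the digraph

frontier = np.zeros(n, dtype=bool)
frontier[0] = True                    # start BFS at vertex 0
visited = frontier.copy()
level = 0
while frontier.any():
    print(f"level {level}:", np.flatnonzero(frontier).tolist())
    # One BFS level == one mat-vec: vertices reachable from the frontier.
    frontier = (A.T @ frontier.astype(np.int8) > 0) & ~visited
    visited |= frontier
    level += 1
```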

Proceedings ArticleDOI
15 Mar 2019
TL;DR: Physical database tuning of an Oracle relational database is performed and the result is compared with a NoSQL graph database; tuning increases relational database performance, yet the graph database still performs better in the proposed scenarios.
Abstract: Relational databases have been used in many organizations of various natures, such as education, health, and business, for the last three decades. SQL databases are designed to manage structured data and show tremendous performance. The Atomicity, Consistency, Isolation, Durability (ACID) properties of relational databases are used to maintain data integrity and consistency. Physical database techniques are used to increase the performance of relational databases. Tablespaces, also called subfolders, are one such physical technique used by the Oracle SQL database; they store data logically in separate data files. Nowadays, organizations generate huge amounts of data of varied nature (unstructured and semi-structured), e.g., videos, images, and blogs. This large amount of data is not handled efficiently by SQL databases. NoSQL databases are used to process and analyze such large amounts of data efficiently; four different types of NoSQL databases are used in industry according to organizational requirements. In this article, we first perform physical database tuning of an Oracle relational database and then compare it with a NoSQL graph database. Relational database performance increased by up to 50% due to the physical tuning technique (tablespaces). Nevertheless, the NoSQL graph database performed better than the tuned relational database in all our proposed scenarios.
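The tablespace technique evaluated here looks roughly as follows when driven from the (real) python-oracledb driver; the credentials, DSN, file path, and sizes are placeholders, and CREATE TABLESPACE requires administrative privileges.

```python
# Sketch of the tablespace technique using the python-oracledb driver.
# Credentials, DSN, file path and sizes are placeholders; CREATE
# TABLESPACE requires DBA privileges.
import oracledb

conn = oracledb.connect(user="system", password="secret",
                        dsn="localhost/XEPDB1")
cur = conn.cursor()

# Give hot tables their own datafile so their I/O is physically
# isolated -- the "subfolder" idea described in the abstract.
cur.execute("""
    CREATE TABLESPACE hot_data
    DATAFILE '/opt/oracle/oradata/hot_data01.dbf' SIZE 512M
    AUTOEXTEND ON NEXT 128M
""")
cur.execute("""
    CREATE TABLE orders (
        id      NUMBER PRIMARY KEY,
        payload VARCHAR2(4000)
    ) TABLESPACE hot_data
""")
conn.commit()
conn.close()
```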

Proceedings ArticleDOI
02 Jul 2019
TL;DR: GraphSE², as proposed in this paper, is an encrypted graph database for online social network services to address massive data breaches, in which social search queries are conducted on a large-scale social graph while set and computational operations are performed on user-generated contents.
Abstract: In this paper, we propose GraphSE², an encrypted graph database for online social network services to address massive data breaches. GraphSE² preserves the functionality of social search, a key enabler for quality social network services, where social search queries are conducted on a large-scale social graph and meanwhile perform set and computational operations on user-generated contents. To enable efficient privacy-preserving social search, GraphSE² provides an encrypted structural data model to facilitate parallel and encrypted graph data access. It is also designed to decompose complex social search queries into atomic operations and realise them via interchangeable protocols in a fast and scalable manner. We build GraphSE² with various queries supported in the Facebook graph search engine and implement a full-fledged prototype. Extensive evaluations on Azure Cloud demonstrate that GraphSE² is practical for querying a social graph with a million users.

Journal ArticleDOI
TL;DR: In this article, the authors analyze graph-based process discovery algorithms by measuring time complexity and performance metrics and comparing them with a widely used algorithm, i.e., Alpha Miner and its extensions.
Abstract: Algorithms for process discovery help analysts understand business processes and problems in a system by creating a process model based on a log of the system. Among existing process discovery algorithms, some are graph-based; of these, several process a graph database to depict a process model. These algorithms claim lower time complexity because of the graph database's ability to store relationships. This research analyses graph-based algorithms by measuring time complexity and performance metrics and comparing them with a widely used algorithm, i.e., Alpha Miner and its extensions. It also gives an outline of graph-based algorithms and the issues they focus on. Based on the evaluations, the graph-based algorithm has higher performance and lower time complexity than the Alpha Miner algorithm.

Journal ArticleDOI
TL;DR: Three potentially synergistic and combinable techniques are proposed, one for each stage of data collection – biographies for data extraction, graph databases for data storage, and checklists for data reporting.

Proceedings ArticleDOI
20 Nov 2019
TL;DR: A novel query execution model, called Expert Model, is proposed, which supports adaptive parallelism control at the fine-grained query operation level and allows tailored optimizations for different categories of query operators, thus achieving high parallelism and good load balancing.
Abstract: The property graph (PG) model is one of the most general graph data models and has been widely adopted in many graph analytics and processing systems. However, existing systems suffer from poor performance in terms of both latency and throughput for processing online analytical workloads on PGs due to design defects such as expensive interactions with external databases, low parallelism, and high network overheads. In this paper, we propose Grasper, a high-performance distributed system for OLAP on property graphs. Grasper adopts RDMA-aware system designs to reduce the network communication cost. We propose a novel query execution model, called Expert Model, which supports adaptive parallelism control at the fine-grained query operation level and allows tailored optimizations for different categories of query operators, thus achieving high parallelism and good load balancing. Experimental results show that Grasper achieves low latency and high throughput on a broad range of online analytical workloads.

Journal ArticleDOI
TL;DR: This work combines domain-driven modeling concepts with scalable graph-based repository technology and a custom language for model-level queries to solve the challenges of IT Landscape models and meet the requirements that arise from this application domain.
Abstract: IT Landscape models are representing the real-world IT infrastructure of a company. They include hardware assets such as physical servers and storage media, as well as virtual components like clusters, virtual machines and applications. These models are a critical source of information in numerous tasks, including planning, error detection and impact analysis. The responsible stakeholders often struggle to keep such a large and densely connected model up-to-date due to its inherent size and complexity, as well as due to the lack of proper tool support. Even though modeling techniques are very suitable for this domain, existing tools do not offer the required features, scalability or flexibility. In order to solve these challenges and meet the requirements that arise from this application domain, we combine domain-driven modeling concepts with scalable graph-based repository technology and a custom language for model-level queries. We analyze in detail how we synthesized these requirements from the application domain and how they relate to the features of our repository. We discuss the architecture of our solution which comprises the entire data management stack, including transactions, queries, versioned persistence and metamodel evolution. Finally, we evaluate our approach in a case study where our open-source repository implementation is employed in a production environment in an industrial context, as well as in a comparative benchmark with an existing state-of-the-art solution.

Posted Content
TL;DR: Gremlin, as presented in this paper, is a graph traversal language and machine that provides a common platform for supporting any graph computing system (such as an OLTP graph database or OLAP graph processors).
Abstract: Graph data management (also called NoSQL) has revealed beneficial characteristics in terms of flexibility and scalability by differently balancing between query expressivity and schema flexibility. This peculiar advantage has resulted in an unforeseen race of developing new task-specific graph systems, query languages and data models, such as property graphs, key-value, wide column, resource description framework (RDF), etc. Present-day graph query languages are focused towards flexible graph pattern matching (a.k.a. sub-graph matching), whereas graph computing frameworks aim towards providing fast parallel (distributed) execution of instructions. This rapid growth in the variety of graph-based data management systems has resulted in a lack of standardization. Gremlin, a graph traversal language and machine, provides a common platform for supporting any graph computing system (such as an OLTP graph database or OLAP graph processors). We present a formalization of graph pattern matching for Gremlin queries. We also study, discuss and consolidate various existing graph algebra operators into an integrated graph algebra.
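A typical OLTP-style Gremlin traversal, issued here through the gremlinpython client, gives a flavor of the language being formalized; the server address and the person/knows schema are illustrative assumptions.

```python
# A small OLTP-style traversal via the gremlinpython client; the server
# address and the person/knows schema are illustrative assumptions.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Friends-of-friends of "alice", de-duplicated.
names = (g.V().has("person", "name", "alice")
          .out("knows").out("knows").dedup()
          .values("name").toList())
print(names)
conn.close()
```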

Proceedings ArticleDOI
25 Jun 2019
TL;DR: Catapult takes a data-driven approach to automatically select canned patterns, thereby taking a concrete step towards the vision of data-driven construction of visual query interfaces.
Abstract: Visual graph query interfaces (a.k.a. GUIs) widen the reach of graph querying frameworks across different users by enabling non-programmers to use them. Consequently, several commercial and academic frameworks for querying a large collection of small- or medium-sized data graphs (e.g., chemical compounds) provide such visual interfaces. The majority of these interfaces expose a fixed set of canned patterns (i.e., small subgraph patterns) to expedite query formulation by enabling pattern-at-a-time in lieu of edge-at-a-time construction mode. Canned patterns to be displayed on a GUI are typically selected manually based on domain knowledge. However, manual generation of canned patterns is labour intensive. Furthermore, these patterns may not sufficiently cover the underlying data graphs to expedite visual formulation of a wide range of subgraph queries. In this paper, we present a generic and extensible framework called Catapult to address these limitations. Catapult takes a data-driven approach to automatically select canned patterns, thereby taking a concrete step towards the vision of data-driven construction of visual query interfaces. Specifically, it first clusters the underlying data graphs based on their topological similarities and then summarizes each cluster to create a cluster summary graph (CSG). The canned patterns within a user-specified pattern budget are then generated from these CSGs by maximizing coverage and diversity, and minimizing the cognitive load of the patterns. An experimental study with real-world datasets and visual graph interfaces demonstrates the superiority of Catapult compared to traditional techniques.