
Showing papers on "Graph database published in 2014"


ReportDOI
01 May 2014
TL;DR: This work presents GraphChi, a disk-based system for computing efficiently on graphs with billions of edges, and builds on the basis of Parallel Sliding Windows to propose a new data structure, Partitioned Adjacency Lists, which is used to design an online graph database, GraphChi-DB.
Abstract: Current systems for graph computation require a distributed computing cluster to handle very large real-world problems, such as analysis on social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains challenging, especially to non-experts. In this work, we present GraphChi, a disk-based system for computing efficiently on graphs with billions of edges. By using a well-known method to break large graphs into small parts, and a novel Parallel Sliding Windows algorithm, GraphChi is able to execute several advanced data mining, graph mining and machine learning algorithms on very large graphs, using just a single consumer-level computer. We show, through experiments and theoretical analysis, that GraphChi performs well on both SSDs and rotational hard drives. We build on the basis of Parallel Sliding Windows to propose a new data structure, Partitioned Adjacency Lists, which we use to design an online graph database, GraphChi-DB. We demonstrate that, on a single PC, GraphChi-DB can process over one hundred thousand graph updates per second, while simultaneously performing computation. GraphChi-DB compares favorably to existing graph databases, particularly on data that is much larger than the available memory. We evaluate our work both experimentally and theoretically. Based on the Parallel Sliding Windows algorithm, we propose new I/O efficient algorithms for solving fundamental graph problems. We also propose a novel algorithm for simulating billions of random walks in parallel on a single computer. By repeating experiments reported for existing distributed systems, we show that with only a fraction of the resources, GraphChi can solve the same problems in a very reasonable time. Our work makes large-scale graph computation available to anyone with a modern PC.
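The core of the Parallel Sliding Windows layout can be sketched in a few lines. The following is a hypothetical in-memory simplification (the real GraphChi keeps shards on disk and advances a sequential window over each one to avoid random I/O): edges are partitioned into shards by destination-vertex interval and sorted by source, so each interval's in-edges live in one shard while its out-edges form one contiguous window of every other shard.

```python
# Minimal in-memory sketch of the Parallel Sliding Windows idea
# (hypothetical simplification; GraphChi itself streams shards from disk).
# Shard i holds all edges whose destination falls in interval i, sorted by
# source. Processing interval i needs shard i (in-edges) plus one window
# of every shard (the out-edges whose source lies in interval i).

def make_shards(edges, intervals):
    """intervals: list of inclusive (lo, hi) destination ranges, one per shard."""
    shards = []
    for lo, hi in intervals:
        shard = sorted((s, d) for s, d in edges if lo <= d <= hi)
        shards.append(shard)
    return shards

def sweep(shards, intervals, update):
    """One full pass: gather each interval's in- and out-edges using only
    sequential scans over the shards, then invoke the user update()."""
    for i, (lo, hi) in enumerate(intervals):
        in_edges = shards[i]                            # all edges into interval i
        out_edges = [(s, d) for sh in shards            # window of each shard whose
                     for (s, d) in sh if lo <= s <= hi] # source lies in interval i
        update(i, in_edges, out_edges)
```

Because every shard is sorted by source, the `out_edges` filter in a disk-based version is a single forward-moving window per shard rather than a scan, which is what keeps the I/O sequential.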

907 citations


Proceedings ArticleDOI
02 Apr 2014
TL;DR: The design and implementation of FaRM is described, a new main memory distributed computing platform that exploits RDMA to improve both latency and throughput by an order of magnitude relative to state of the art main memory systems that use TCP/IP.
Abstract: We describe the design and implementation of FaRM, a new main memory distributed computing platform that exploits RDMA to improve both latency and throughput by an order of magnitude relative to state of the art main memory systems that use TCP/IP. FaRM exposes the memory of machines in the cluster as a shared address space. Applications can use transactions to allocate, read, write, and free objects in the address space with location transparency. We expect this simple programming model to be sufficient for most application code. FaRM provides two mechanisms to improve performance where required: lock-free reads over RDMA, and support for collocating objects and function shipping to enable the use of efficient single machine transactions. FaRM uses RDMA both to directly access data in the shared address space and for fast messaging and is carefully tuned for the best RDMA performance. We used FaRM to build a key-value store and a graph store similar to Facebook's. They both perform well, for example, a 20-machine cluster can perform 167 million key-value lookups per second with a latency of 31µs.

686 citations


Proceedings ArticleDOI
18 May 2014
TL;DR: This paper introduces a novel representation of source code called a code property graph that merges concepts of classic program analysis, namely abstract syntax trees, control flow graphs and program dependence graphs, into a joint data structure that enables it to elegantly model templates for common vulnerabilities with graph traversals that can identify buffer overflows, integer overflows, format string vulnerabilities, or memory disclosures.
Abstract: The vast majority of security breaches encountered today are a direct result of insecure code. Consequently, the protection of computer systems critically depends on the rigorous identification of vulnerabilities in software, a tedious and error-prone process requiring significant expertise. Unfortunately, a single flaw suffices to undermine the security of a system and thus the sheer amount of code to audit plays into the attacker's cards. In this paper, we present a method to effectively mine large amounts of source code for vulnerabilities. To this end, we introduce a novel representation of source code called a code property graph that merges concepts of classic program analysis, namely abstract syntax trees, control flow graphs and program dependence graphs, into a joint data structure. This comprehensive representation enables us to elegantly model templates for common vulnerabilities with graph traversals that, for instance, can identify buffer overflows, integer overflows, format string vulnerabilities, or memory disclosures. We implement our approach using a popular graph database and demonstrate its efficacy by identifying 18 previously unknown vulnerabilities in the source code of the Linux kernel.
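A vulnerability template over a code property graph is, at its core, a graph traversal. The toy sketch below uses hypothetical node names and a plain dict-based graph (the paper's implementation runs Gremlin-style traversals over a graph database); it encodes only the data-dependence (REACHES) part of a CPG and asks whether an attacker-controlled source flows into a dangerous sink:

```python
# Toy code-property-graph query sketch (hypothetical node names; not the
# paper's actual schema). Typed edges merge AST, CFG, and PDG relations
# into one graph; a vulnerability template is then just a traversal, e.g.
# "a value REACHES a dangerous sink from an attacker-controlled source".

nodes = {
    "n1": {"kind": "call", "name": "recv"},    # attacker-controlled source
    "n2": {"kind": "decl", "name": "buf"},
    "n3": {"kind": "call", "name": "strcpy"},  # dangerous sink
}
edges = [
    ("n1", "n2", "REACHES"),  # data-dependence edges (the PDG part of the CPG)
    ("n2", "n3", "REACHES"),
]

def reaches(src, dst):
    """Transitive closure over REACHES edges (iterative depth-first search)."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(b for a, b, k in edges if a == n and k == "REACHES")
    return False

def tainted_sinks(source_name="recv", sink_name="strcpy"):
    sources = [n for n, p in nodes.items() if p["name"] == source_name]
    sinks = [n for n, p in nodes.items() if p["name"] == sink_name]
    return [(s, t) for s in sources for t in sinks if reaches(s, t)]
```

A real template would additionally constrain the traversal with AST properties (e.g. argument positions) and CFG reachability, which is exactly why merging the three representations into one graph pays off.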

461 citations


Journal ArticleDOI
28 Feb 2014
TL;DR: This paper surveys join algorithms with provable worst-case optimal runtime guarantees, giving a simpler and unified description of these algorithms intended for theory-minded readers, algorithm designers, and systems implementors.
Abstract: Evaluating the relational join is one of the central algorithmic and most well-studied problems in database systems. A staggering number of variants have been considered including Block-Nested loop join, Hash-Join, Grace, Sort-merge (see Graefe [17] for a survey, and [4, 7, 24] for discussions of more modern issues). Commercial database engines use finely tuned join heuristics that take into account a wide variety of factors including the selectivity of various predicates, memory, IO, etc. This study of join queries notwithstanding, the textbook description of join processing is suboptimal. This survey describes recent results on join algorithms that have provable worst-case optimality runtime guarantees. We survey recent work and provide a simpler and unified description of these algorithms that we hope is useful for theory-minded readers, algorithm designers, and systems implementors. Much of this progress can be understood by thinking about a simple join evaluation problem that we illustrate with the so-called triangle query, a query that has become increasingly popular in the last decade with the advent of social networks, biological motifs, and graph databases [36, 37].
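The triangle query Q(a,b,c) = R(a,b) ⋈ S(b,c) ⋈ T(a,c) makes the key idea concrete: instead of joining pairwise (where an intermediate result can be quadratically larger than the output), the worst-case optimal algorithms surveyed bind one variable at a time by intersecting the candidate sets from every relation that mentions it. A minimal sketch in that spirit (not a faithful NPRR or Leapfrog Triejoin implementation, which would use sorted tries):

```python
# Attribute-at-a-time evaluation of the triangle query, in the spirit of
# the worst-case optimal join algorithms (NPRR / Leapfrog Triejoin).
# Each variable is bound by intersecting candidates from all relations
# that mention it, avoiding large pairwise intermediates.

def triangles(R, S, T):
    """R, S, T: sets of pairs for R(a,b), S(b,c), T(a,c)."""
    Ra = {a for a, b in R}
    Ta = {a for a, c in T}
    out = []
    for a in Ra & Ta:                      # bind a: must appear in R and T
        Rb = {b for x, b in R if x == a}
        Sb = {b for b, c in S}
        for b in Rb & Sb:                  # bind b: must appear in R(a,·) and S
            Sc = {c for x, c in S if x == b}
            Tc = {c for x, c in T if x == a}
            for c in Sc & Tc:              # bind c: must close the triangle
                out.append((a, b, c))
    return out
```

With trie-shaped indexes the per-variable intersections become merge scans, which is what yields the AGM-bound worst-case guarantee the survey describes.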

208 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: A quantitative roadmap for improving the performance of all these frameworks and bridging the "ninja gap" is offered, and changes to alleviate bottlenecks arising from the algorithms themselves vs. programming model abstractions vs. the framework implementations are recommended.
Abstract: Graph algorithms are becoming increasingly important for analyzing large datasets in many fields. Real-world graph data follows a pattern of sparsity that is not uniform but highly skewed towards a few items. Implementing graph traversal, statistics and machine learning algorithms on such data in a scalable manner is quite challenging. As a result, several graph analytics frameworks (GraphLab, CombBLAS, Giraph, SociaLite and Galois among others) have been developed, each offering a solution with different programming models and targeted at different users. Unfortunately, the "Ninja performance gap" between optimized code and most of these frameworks is very large (2-30X for most frameworks and up to 560X for Giraph) for common graph algorithms, and moreover varies widely with algorithms. This makes the end-users' choice of graph framework dependent not only on ease of use but also on performance. In this work, we offer a quantitative roadmap for improving the performance of all these frameworks and bridging the "ninja gap". We first present hand-optimized baselines that get performance close to hardware limits and higher than any published performance figure for these graph algorithms. We characterize the performance of both this native implementation as well as popular graph frameworks on a variety of algorithms. This study helps end-users delineate bottlenecks arising from the algorithms themselves vs. programming model abstractions vs. the framework implementations. Further, by analyzing the system-level behavior of these frameworks, we obtain bottlenecks that are agnostic to specific algorithms. We recommend changes to alleviate these bottlenecks (and implement some of them) and reduce the performance gap with respect to native code. These changes will enable end-users to choose frameworks based mostly on ease of use.

189 citations


Journal ArticleDOI
01 Aug 2014
TL;DR: This work develops an index, together with effective pruning rules and efficient search algorithms, proposes techniques that use this infrastructure to answer aggregation queries, and contributes an effective maintenance algorithm to handle online updates over RDF repositories.
Abstract: We address efficient processing of SPARQL queries over RDF datasets. The proposed techniques, incorporated into the gStore system, handle, in a uniform and scalable manner, SPARQL queries with wildcards and aggregate operators over dynamic RDF datasets. Our approach is graph based. We store RDF data as a large graph and also represent a SPARQL query as a query graph. Thus, the query answering problem is converted into a subgraph matching problem. To achieve efficient and scalable query processing, we develop an index, together with effective pruning rules and efficient search algorithms. We propose techniques that use this infrastructure to answer aggregation queries. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solutions.
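The conversion of a SPARQL basic graph pattern into subgraph matching, the core idea behind gStore, can be illustrated with a brute-force sketch. This is a hypothetical simplification (the real system uses a signature-based VS-tree index and pruning rather than scanning all triples), with sample data invented for the example:

```python
# Sketch: answering a tiny SPARQL-style basic graph pattern by subgraph
# matching over an RDF graph (hypothetical simplification of gStore's
# approach; the real system prunes candidates with a VS-tree index).

data = [  # RDF triples: (subject, predicate, object)
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "age", "30"),
]

def match(pattern):
    """pattern: triples whose terms starting with '?' are variables.
    Returns every variable binding that embeds the query graph in the data."""
    def unify(triple, fact, env):
        env = dict(env)
        for q, v in zip(triple, fact):
            if q.startswith("?"):
                if env.get(q, v) != v:   # variable already bound differently?
                    return None
                env[q] = v
            elif q != v:                 # constant term must match exactly
                return None
        return env

    results = [{}]
    for triple in pattern:               # extend partial matches triple by triple
        results = [e2 for e in results for f in data
                   if (e2 := unify(triple, f, e)) is not None]
    return results
```

Each query triple pattern is an edge of the query graph; a result is an embedding of that graph into the data graph, which is exactly the subgraph matching problem the abstract describes.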

186 citations


Journal ArticleDOI
TL;DR: The k^2-tree is presented, a novel Web graph representation based on a compact tree structure that takes advantage of large empty areas of the adjacency matrix of the graph and offers the least space usage while supporting fast navigation to predecessors and successors and sharply outperforming the others on the extended queries.
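The k^2-tree's space savings come from pruning empty quadrants of the adjacency matrix early. A minimal sketch for k = 2 (using nested tuples rather than the compact bit-sequence encoding the paper actually uses):

```python
# Minimal k^2-tree sketch (k = 2): recursively split the adjacency matrix
# into k^2 submatrices; an internal node stores 1 and recurses if its
# submatrix contains any edge, or 0 otherwise, so a large empty area of
# the matrix costs a single bit. (The real structure flattens these bits
# into rank/select bit sequences; this nested-tuple form is for clarity.)

def build(matrix, r0, c0, size):
    """Return (bit, children) for the size x size submatrix at (r0, c0)."""
    if all(matrix[r][c] == 0
           for r in range(r0, r0 + size)
           for c in range(c0, c0 + size)):
        return (0, None)
    if size == 1:
        return (1, None)
    h = size // 2
    kids = [build(matrix, r0 + dr, c0 + dc, h)
            for dr in (0, h) for dc in (0, h)]
    return (1, kids)

def has_edge(node, r, c, size):
    """Navigate from the root to check a single cell of the matrix."""
    bit, kids = node
    if bit == 0:
        return False
    if size == 1:
        return True
    h = size // 2
    child = kids[(r >= h) * 2 + (c >= h)]  # quadrant order: NW, NE, SW, SE
    return has_edge(child, r % h, c % h, h)
```

The same top-down navigation supports both successor queries (descend by rows) and predecessor queries (descend by columns), which is why the structure answers both directions quickly.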

139 citations


Proceedings Article
01 Dec 2014
TL;DR: A superior model is proposed to leverage the structure of the knowledge graph via pre-calculating the distinct weight for each training triplet according to its relational mapping property, and is compared with the state-of-the-art method TransE and other prior arts.
Abstract: Many knowledge repositories nowadays contain billions of triplets, i.e. (head-entity, relationship, tail-entity), as relation instances. These triplets form a directed graph with entities as nodes and relationships as edges. However, this kind of symbolic and discrete storage structure makes it difficult for us to exploit the knowledge to enhance other intelligence-acquired applications (e.g. the Question-Answering System), as many AI-related algorithms prefer conducting computation on continuous data. Therefore, a series of emerging approaches have been proposed to facilitate knowledge computing via encoding the knowledge graph into a low-dimensional embedding space. TransE is the latest and most promising approach among them, and can achieve a higher performance with fewer parameters by modeling the relationship as a translational vector from the head entity to the tail entity. Unfortunately, it is not flexible enough to deal well with the various mapping properties of triplets, even though its authors note the harm to performance. In this paper, we thus propose a superior model called TransM to leverage the structure of the knowledge graph via pre-calculating the distinct weight for each training triplet according to its relational mapping property. In this way, the optimization deals with each triplet depending on its own weight. We carry out extensive experiments to compare TransM with the state-of-the-art method TransE and other prior arts. The performance of each approach is evaluated within two different application scenarios on several benchmark datasets. Results show that the model we proposed significantly outperforms the former ones while keeping parameter complexity as low as TransE's.
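The translational scoring behind both models fits in a few lines. TransE scores a triplet (h, r, t) by the norm of h + r - t; TransM, as described above, additionally pre-computes a per-triplet weight from the relation's mapping property (1-to-1, 1-to-N, etc.). The weight here is an opaque parameter, since the paper's exact weighting formula is not reproduced in the abstract:

```python
import numpy as np

# Translation-based triplet scoring. TransE models a relation r as a
# translation vector: a true triplet (h, r, t) should satisfy h + r ≈ t,
# so a lower score is better. TransM (sketched per the abstract) scales
# the score by a pre-calculated per-relation weight w_r reflecting the
# relation's mapping property; w_r is treated as a given input here.

def transe_score(h, r, t):
    """||h + r - t||: small for plausible triplets."""
    return np.linalg.norm(h + r - t)

def transm_score(h, r, t, w_r):
    """TransM-style weighted score (w_r from the relational mapping property)."""
    return w_r * transe_score(h, r, t)
```

During training, these scores enter a margin-based ranking loss against corrupted triplets; the per-triplet weight lets 1-to-N relations contribute less harshly than strict 1-to-1 relations.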

128 citations


Journal ArticleDOI
01 Dec 2014
TL;DR: This paper proposes Multi-Source BFS, an algorithm that is designed to run multiple concurrent BFSs over the same graph on a single CPU core while scaling up as the number of cores increases, and demonstrates how a real graph analytics application---all-vertices closeness centrality---can be efficiently solved with MS-BFS.
Abstract: Graph analytics on social networks, Web data, and communication networks has been widely used in a plethora of applications. Many graph analytics algorithms are based on breadth-first search (BFS) graph traversal, which is not only time-consuming for large datasets but also involves much redundant computation when executed multiple times from different start vertices. In this paper, we propose Multi-Source BFS (MS-BFS), an algorithm that is designed to run multiple concurrent BFSs over the same graph on a single CPU core while scaling up as the number of cores increases. MS-BFS leverages the properties of small-world networks, which apply to many real-world graphs, and enables efficient graph traversal that: (i) shares common computation across concurrent BFSs; (ii) greatly reduces the number of random memory accesses; and (iii) does not incur synchronization costs. We demonstrate how a real graph analytics application---all-vertices closeness centrality---can be efficiently solved with MS-BFS. Furthermore, we present an extensive experimental evaluation with both synthetic and real datasets, including Twitter and Wikipedia, showing that MS-BFS provides almost linear scalability with respect to the number of cores and excellent scalability for increasing graph sizes, outperforming state-of-the-art BFS algorithms by more than one order of magnitude when running a large number of BFSs.
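The central MS-BFS trick is representable in a short sketch: keep one bitmask per vertex recording which of the concurrent searches have reached it, so a single pass over the frontier advances every BFS at once. Here Python's arbitrary-width integers stand in for the fixed-width SIMD registers the paper uses:

```python
# Sketch of the MS-BFS idea: run k BFSs in one traversal. seen[v] is a
# k-bit mask of which searches have visited v; one scan of the frontier
# advances all searches together, sharing memory accesses (Python ints
# stand in for the paper's fixed-width bit registers).

def ms_bfs(adj, sources):
    n, k = len(adj), len(sources)
    seen = [0] * n                       # seen[v]: bit i set => BFS i reached v
    dist = [[-1] * n for _ in range(k)]
    frontier = [0] * n
    for i, s in enumerate(sources):
        frontier[s] |= 1 << i
        seen[s] |= 1 << i
        dist[i][s] = 0
    level = 0
    while any(frontier):
        level += 1
        nxt = [0] * n
        for v, bits in enumerate(frontier):
            if not bits:
                continue
            for u in adj[v]:
                new = bits & ~seen[u]    # searches reaching u for the first time
                if new:
                    nxt[u] |= new
                    seen[u] |= new
                    for i in range(k):
                        if new >> i & 1:
                            dist[i][u] = level
        frontier = nxt
    return dist
```

On small-world graphs the frontiers of different sources overlap heavily after a couple of levels, so the shared neighbor expansion does the work of many BFSs for the price of roughly one, which is where the order-of-magnitude speedup comes from.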

113 citations


Journal ArticleDOI
TL;DR: The survey shows that graph-based representation is an appropriate way of representing text documents and yields improved analysis results over traditional models for different text applications.
Abstract: A common and standard approach to modeling a text document is bag-of-words. This model is suitable for capturing word frequency; however, structural and semantic information is ignored. Graphs are mathematical constructs that can model relationships and structural information effectively. A text can be appropriately represented as a graph by using feature terms as vertices and significant relations between feature terms as edges. Representing text with a graph model supports computations such as term weighting and ranking, which are helpful in many information retrieval applications. This paper presents a systematic survey of existing work on graph-based representation of text, and also focuses on graph-based analysis of text documents for different operations in information retrieval. In the process, a taxonomy of graph-based representation and analysis of text documents is derived, and the results of different methods of graph-based text representation and analysis are discussed. The survey shows that graph-based representation is an appropriate way of representing text documents and yields improved analysis results over traditional models for different text applications.
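A concrete graph-of-words construction makes the contrast with bag-of-words tangible: terms become vertices and an edge links terms that co-occur within a sliding window, weighted by co-occurrence count. This is one common scheme among the several variants such surveys cover, not the only one:

```python
from collections import defaultdict

# Graph-of-words sketch: vertices are feature terms; an edge connects two
# terms that co-occur within a sliding window, with weight equal to the
# co-occurrence count. (One common variant; surveys also cover directed
# and semantically-typed edges.)

def text_to_graph(tokens, window=2):
    """Return {frozenset({term_a, term_b}): co-occurrence count}."""
    weight = defaultdict(int)
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window + 1]:   # look ahead within the window
            if u != t:                           # no self-loops
                weight[frozenset((t, u))] += 1
    return dict(weight)
```

Once the text is a weighted graph, standard graph measures (degree, PageRank-style centrality) give the term-weighting and ranking operations the abstract mentions, in place of raw frequency counts.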

104 citations


Proceedings ArticleDOI
11 May 2014
TL;DR: GraphGen, a vertex-centric framework that targets FPGAs for hardware acceleration of graph computations, is presented, and design case studies using GraphGen to implement stereo matching and handwriting recognition graph applications on Terasic DE4 and Xilinx ML605 FPGA boards are reported.
Abstract: Vertex-centric graph computations are widely used in many machine learning and data mining applications that operate on graph data structures. This paper presents GraphGen, a vertex-centric framework that targets FPGA for hardware acceleration of graph computations. GraphGen accepts a vertex-centric graph specification and automatically compiles it onto an application-specific synthesized graph processor and memory system for the target FPGA platform. We report design case studies using GraphGen to implement stereo matching and handwriting recognition graph applications on Terasic DE4 and Xilinx ML605 FPGA boards. Results show up to 14.6x and 2.9x speedups over software on Intel Core i7 CPU for the two applications, respectively.

Book
05 Dec 2014
TL;DR: Neo4j in Action is a comprehensive guide to Neo4j, aimed at application developers and software architects, that explores the full power of native Java APIs for graph data manipulation and querying and also covers Cypher, Neo4j's graph query language.
Abstract: Summary: Neo4j in Action is a comprehensive guide to Neo4j, aimed at application developers and software architects. Using hands-on examples, you'll learn to model graph domains naturally with Neo4j graph structures. The book explores the full power of native Java APIs for graph data manipulation and querying. Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Technology: Much of the data today is highly connected, from social networks to supply chains to software dependency management, and more connections are continually being uncovered. Neo4j is an ideal graph database tool for highly connected data. It is mature, production-ready, and unique in enabling developers to simply and efficiently model and query connected data.

About the Book: Neo4j in Action is a comprehensive guide to designing, implementing, and querying graph data using Neo4j. Using hands-on examples, you'll learn to model graph domains naturally with Neo4j graph structures. The book explores the full power of native Java APIs for graph data manipulation and querying. It also covers Cypher, Neo4j's graph query language. Along the way, you'll learn how to integrate Neo4j into your domain-driven app using Spring Data Neo4j, as well as how to use Neo4j in standalone server or embedded modes. Knowledge of Java basics is required. No prior experience with graph data or Neo4j is assumed.

What's Inside:
- Graph database patterns
- How to model data in social networks
- How to use Neo4j in your Java applications
- How to configure and set up Neo4j

About the Authors: Aleksa Vukotic is an architect specializing in graph data models. Nicki Watt, Dominic Fox, Tareq Abedrabbo, and Jonas Partner work at OpenCredo, a Neo Technology partner, and have been involved in many projects using Neo4j.

Table of Contents:
PART 1 INTRODUCTION TO NEO4J
- A case for a Neo4j database
- Data modeling in Neo4j
- Starting development with Neo4j
- The power of traversals
- Indexing the data
PART 2 APPLICATION DEVELOPMENT WITH NEO4J
- Cypher: Neo4j query language
- Transactions
- Traversals in depth
- Spring Data Neo4j
PART 3 NEO4J IN PRODUCTION
- Neo4j: embedded versus server mode

Proceedings ArticleDOI
16 Feb 2014
TL;DR: A qualitative study and a performance comparison of 12 open source graph databases using four fundamental graph algorithms on networks containing up to 256 million edges are conducted.
Abstract: With the proliferation of large, irregular, and sparse relational datasets, new storage and analysis platforms have arisen to fill gaps in performance and capability left by conventional approaches built on traditional database technologies and query languages. Many of these platforms apply graph structures and analysis techniques to enable users to ingest, update, query, and compute on the topological structure of the network represented as sets of edges relating sets of vertices. To store and process Facebook-scale datasets, software and algorithms must be able to support data sources with billions of edges, update rates of millions of updates per second, and complex analysis kernels. These platforms must provide intuitive interfaces that enable graph experts and novice programmers to write implementations of common graph algorithms. In this paper, we conduct a qualitative study and a performance comparison of 12 open source graph databases using four fundamental graph algorithms on networks containing up to 256 million edges.

Patent
27 Jun 2014
TL;DR: In this article, techniques are described for representing services, network resources, and relationships between such services and resources in a graph database with which to validate, provision, and manage the services in near real-time.
Abstract: In general, techniques are described for representing services, network resources, and relationships between such services and resources in a graph database with which to validate, provision, and manage the services in near real-time. In one example, a controller device includes at least one processor; and at least one memory to store a graph database comprising a graph that represents network resources and relationships between network resources. The controller device receives, at an application programming interface, a data-interchange formatted message that indicates a service request to configure a network service; queries at least a portion of the graph to determine whether a set of the plurality of network resources can satisfy the service request to provision the network service within the network; and configures the set of the plurality of network resources to provide the network service.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This paper introduces a Scalable Graph processing Class SGC by relaxing some constraints in MMC to make it suitable for scalable graph processing, and defines two graph join operators in SGC, namely, EN join and NE join, using which a wide range of graph algorithms can be designed.
Abstract: MapReduce has become one of the most popular parallel computing paradigms in cloud, due to its high scalability, reliability, and fault-tolerance achieved for a large variety of applications in big data processing. In the literature, there are MapReduce Class MRC and Minimal MapReduce Class MMC to define the memory consumption, communication cost, CPU cost, and number of MapReduce rounds for an algorithm to execute in MapReduce. However, neither of them is designed for big graph processing in MapReduce, since the constraints in MMC can be hardly achieved simultaneously on graphs and the conditions in MRC may induce scalability problems when processing big graph data. In this paper, we study scalable big graph processing in MapReduce. We introduce a Scalable Graph processing Class SGC by relaxing some constraints in MMC to make it suitable for scalable graph processing. We define two graph join operators in SGC, namely, EN join and NE join, using which a wide range of graph algorithms can be designed, including PageRank, breadth first search, graph keyword search, Connected Component (CC) computation, and Minimum Spanning Forest (MSF) computation. Remarkably, to the best of our knowledge, for the two fundamental graph problems CC and MSF computation, this is the first work that can achieve O(log(n)) MapReduce rounds with O(n+m) total communication cost in each round and constant memory consumption on each machine, where n and m are the number of nodes and edges in the graph respectively. We conducted extensive performance studies using two web-scale graphs Twitter and Friendster with different graph characteristics. The experimental results demonstrate that our algorithms can achieve high scalability in big graph processing.

Journal ArticleDOI
01 Mar 2014
TL;DR: The experimental results show that this new graph querying paradigm is promising: It identifies high-quality matches for both keyword and graph queries over real-life knowledge graphs, and outperforms existing methods significantly in terms of effectiveness and efficiency.
Abstract: Querying complex graph databases such as knowledge graphs is a challenging task for non-professional users. Due to their complex schemas and variational information descriptions, it becomes very hard for users to formulate a query that can be properly processed by the existing systems. We argue that for a user-friendly graph query engine, it must support various kinds of transformations such as synonym, abbreviation, and ontology. Furthermore, the derived query results must be ranked in a principled manner. In this paper, we introduce a novel framework enabling schemaless and structureless graph querying (SLQ), where a user need not describe queries precisely as required by most databases. The query engine is built on a set of transformation functions that automatically map keywords and linkages from a query to their matches in a graph. It automatically learns an effective ranking model, without assuming manually labeled training examples, and can efficiently return top ranked matches using graph sketch and belief propagation. The architecture of SLQ is elastic for "plug-in" new transformation functions and query logs. Our experimental results show that this new graph querying paradigm is promising: It identifies high-quality matches for both keyword and graph queries over real-life knowledge graphs, and outperforms existing methods significantly in terms of effectiveness and efficiency.

Journal ArticleDOI
01 Dec 2014
TL;DR: The semantics and efficient online algorithms for this important and intriguing problem of event pattern matching are studied, and approaches are evaluated with extensive experiments over real world datasets in four different domains.
Abstract: A graph is a fundamental and general data structure underlying all data applications. Many applications today call for the management and query capabilities directly on graphs. Real time graph streams, as seen in road networks, social and communication networks, and web requests, are such applications. Event pattern matching requires the awareness of graph structures, which is different from traditional complex event processing. It also requires a focus on the dynamicity of the graph, time order constraints in patterns, and online query processing, which deviates significantly from previous work on subgraph matching as well. We study the semantics and efficient online algorithms for this important and intriguing problem, and evaluate our approaches with extensive experiments over real world datasets in four different domains.

Proceedings ArticleDOI
27 Jun 2014
TL;DR: A taxonomy and unified perspective on NoSQL systems is provided using multiple facets including system architecture, data model, query language, client API, scalability, and availability to help the reader in choosing an appropriate NoSQL system for a given application.
Abstract: The advent of Big Data created a need for out-of-the-box horizontal scalability for data management systems. This ushered in an array of choices for Big Data management under the umbrella term NoSQL. In this paper, we provide a taxonomy and unified perspective on NoSQL systems. Using this perspective, we compare and contrast various NoSQL systems using multiple facets including system architecture, data model, query language, client API, scalability, and availability. We group current NoSQL systems into seven broad categories: Key-Value, Table-type/Column, Document, Graph, Native XML, Native Object, and Hybrid databases. We also describe application scenarios for each category to help the reader in choosing an appropriate NoSQL system for a given application. We conclude the paper by indicating future research directions.

Book ChapterDOI
06 Sep 2014
TL;DR: This work presents a fast approximate nearest neighbor algorithm for semantic segmentation that builds a graph over superpixels from an annotated set of training images and proposes to learn a distance metric that weights the edges in the graph.
Abstract: We present a fast approximate nearest neighbor algorithm for semantic segmentation. Our algorithm builds a graph over superpixels from an annotated set of training images. Edges in the graph represent approximate nearest neighbors in feature space. At test time we match superpixels from a novel image to the training images by adding the novel image to the graph. A move-making search algorithm allows us to leverage the graph and image structure for finding matches. We then transfer labels from the training images to the image under test. To promote good matches between superpixels we propose to learn a distance metric that weights the edges in our graph. Our approach is evaluated on four standard semantic segmentation datasets and achieves results comparable with the state-of-the-art.

Book ChapterDOI
21 Jul 2014
TL;DR: A scalable persistence layer for the de-facto standard MDE framework EMF that exploits the efficiency of graph databases in storing and accessing graph structures, as EMF models are.
Abstract: Several industrial contexts require software engineering methods and tools able to handle large-size artifacts. The central idea of abstraction makes model-driven engineering (MDE) a promising approach in such contexts, but current tools do not scale to very large models (VLMs): already the task of storing and accessing VLMs from a persisting support is currently inefficient. In this paper we propose a scalable persistence layer for the de-facto standard MDE framework EMF. The layer exploits the efficiency of graph databases in storing and accessing graph structures, as EMF models are. A preliminary experimentation shows that typical queries in reverse-engineering EMF models have good performance on such persistence layer, compared to file-based backends.

Journal ArticleDOI
01 Aug 2014
TL;DR: The authors provide an algorithm for computing a tree decomposition that is more efficient and scalable than any previous algorithm, and are the first to use graph structures explicitly to solve PPR quickly.
Abstract: We propose a new scalable algorithm that can compute Personalized PageRank (PPR) very quickly. The Power method is a state-of-the-art algorithm for computing exact PPR; however, it requires many iterations. Thus reducing the number of iterations is the main challenge. We achieve this by exploiting graph structures of web graphs and social networks. The convergence of our algorithm is very fast. In fact, it requires up to 7.5 times fewer iterations than the Power method and is up to five times faster in actual computation time. To the best of our knowledge, this is the first time graph structures have been used explicitly to solve PPR quickly. Our contributions can be summarized as follows.
1. We provide an algorithm for computing a tree decomposition, which is more efficient and scalable than any previous algorithm.
2. Using the above algorithm, we can obtain a core-tree decomposition of any web graph and social network. This allows us to decompose a web graph and a social network into (1) the core, which behaves like an expander graph, and (2) a small tree-width graph, which behaves like a tree in an algorithmic sense.
3. We apply a direct method to the small tree-width graph to construct an LU decomposition.
4. Building on the LU decomposition and using it as a preconditioner, we apply the GMRES method (a state-of-the-art advanced iterative method) to compute PPR for whole web graphs and social networks.
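The Power-method baseline that the paper accelerates is simple to state: iterate x ← (1 - α)·Pᵀx + α·e, where e is the personalization (seed) vector and P the transition matrix. A minimal sketch over an adjacency-list graph, with the dangling-node handling being one common convention rather than the paper's:

```python
import numpy as np

# Power-method baseline for Personalized PageRank: the iterative scheme
# x <- (1 - alpha) * P^T x + alpha * e, where e is the personalization
# vector. Dangling vertices return their mass to the seed here (one
# common convention, assumed for this sketch).

def ppr_power(adj, source, alpha=0.15, iters=100):
    n = len(adj)
    e = np.zeros(n)
    e[source] = 1.0
    x = e.copy()
    for _ in range(iters):
        nxt = alpha * e                      # teleport back to the seed
        for v, nbrs in enumerate(adj):
            if nbrs:
                share = (1 - alpha) * x[v] / len(nbrs)
                for u in nbrs:
                    nxt[u] += share          # spread mass along out-edges
            else:
                nxt[source] += (1 - alpha) * x[v]  # dangling mass to seed
        x = nxt
    return x
```

Each iteration contracts the error by roughly (1 - α), so many iterations are needed for high precision; replacing this loop with an LU-preconditioned GMRES solve over the core-tree decomposition is exactly the speedup the paper claims.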

Patent
24 Feb 2014
TL;DR: In this paper, a method of processing a query to a graph database using a plurality of processors is proposed, where each thread is associated with a unique thread identifier and each sub-graph is defined by one of the thread identifiers.
Abstract: A method of processing a query to a graph database using a plurality of processors. The method comprises providing a plurality of threads to be executed on a plurality of processors, each thread associated with one of a plurality of unique thread identifiers; providing a graph database having a plurality of graph database nodes and a plurality of graph database edges, where each edge represents a relationship between two of the nodes; receiving a query tree comprising a plurality of query nodes connected by a plurality of query tree edges; and searching at least part of the graph database for a match with the query tree, wherein the search is executed by the plurality of processors, each processor searching one of a plurality of sub-graphs of the graph database, and each sub-graph is defined by one of the thread identifiers.
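A hypothetical sketch of the partitioning idea: each worker thread owns the sub-graph of nodes that map to its thread identifier and searches only that partition. All names are illustrative, and the full query-tree matching of the patent is simplified here to a single-node label match:

```python
from concurrent.futures import ThreadPoolExecutor

def owns(node, thread_id, num_threads):
    # The sub-graph owned by a thread is the set of nodes that map to
    # its thread identifier under a simple modulo partition.
    return node % num_threads == thread_id

def search_partition(thread_id, num_threads, labels, query_label):
    """Search one thread's sub-graph for nodes matching the query label."""
    return [n for n, lbl in labels.items()
            if owns(n, thread_id, num_threads) and lbl == query_label]

def parallel_search(num_threads, labels, query_label):
    # One task per thread identifier; each searches a disjoint sub-graph.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = pool.map(
            lambda tid: search_partition(tid, num_threads, labels, query_label),
            range(num_threads))
    return sorted(n for part in parts for n in part)

labels = {1: "person", 2: "city", 3: "person", 7: "person"}
matches = parallel_search(4, labels, "person")  # → [1, 3, 7]
```

Because the partitions are disjoint, no node is examined twice and the per-thread result lists can simply be concatenated.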

BookDOI
01 Jan 2014
TL;DR: This proceedings volume collects research across the breadth of database systems, from data warehousing and query optimization to graph databases and social media data management.
Abstract: Topics include: data warehousing; database integration; mobile databases; cloud, distributed, and parallel databases; high-dimensional and temporal data; image/video retrieval and databases; database performance and tuning; privacy and security in databases; query processing and optimization; semi-structured data and XML; spatial data processing and management; stream and sensor data management; uncertain and probabilistic databases; web databases; graph databases; web service management; social media data management.

Journal ArticleDOI
01 Aug 2014
TL;DR: Vertexica is presented, a graph analytics tool on top of a relational database that is user friendly yet highly efficient, and that leverages relational features to enable much more sophisticated graph analysis.
Abstract: In this paper, we present Vertexica, a graph analytics tool on top of a relational database, which is user friendly and yet highly efficient. Instead of constraining programmers to SQL, Vertexica offers a popular vertex-centric query interface, which is more natural for analysts to express many graph queries. The programmers simply provide their vertex-compute functions and Vertexica takes care of efficiently executing them in the standard SQL engine. The advantage of using Vertexica is its ability to leverage relational features and enable much more sophisticated graph analysis. These include expressing graph algorithms which are difficult in vertex-centric models but straightforward in SQL, and the ability to compose end-to-end data processing pipelines, including pre- and post-processing of graphs as well as combining multiple algorithms for deeper insights. Vertexica has a graphical user interface and we outline several demonstration scenarios including interactive graph analysis, complex graph analysis, and continuous and time series analysis.
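The idea of running vertex-centric computation inside a relational engine can be sketched with SQLite: edges live in a table and each "superstep" is a SQL aggregation, here for out-degree-normalized PageRank. The schema, graph, and iteration count are illustrative assumptions, not Vertexica's actual interface:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE edge (src INTEGER, dst INTEGER);
    CREATE TABLE vertex (id INTEGER PRIMARY KEY, rank REAL);
""")
db.executemany("INSERT INTO edge VALUES (?, ?)",
               [(0, 1), (1, 2), (2, 0), (0, 2)])
db.executemany("INSERT INTO vertex VALUES (?, ?)",
               [(0, 1 / 3), (1, 1 / 3), (2, 1 / 3)])

for _ in range(50):  # fixed number of supersteps, enough to converge here
    db.executescript("""
        -- One superstep: every vertex gathers rank from its in-neighbors,
        -- each in-neighbor's rank divided by its out-degree.
        CREATE TEMP TABLE new_rank AS
        SELECT v.id AS id,
               0.15 / 3 + 0.85 * COALESCE((
                   SELECT SUM(s.rank / d.deg)
                   FROM edge e
                   JOIN vertex s ON s.id = e.src
                   JOIN (SELECT src, COUNT(*) AS deg
                         FROM edge GROUP BY src) d ON d.src = e.src
                   WHERE e.dst = v.id), 0) AS rank
        FROM vertex v;
        UPDATE vertex SET rank =
            (SELECT rank FROM new_rank WHERE new_rank.id = vertex.id);
        DROP TABLE new_rank;
    """)

ranks = dict(db.execute("SELECT id, rank FROM vertex"))
```

The temp table gives the two-phase (read old state, write new state) semantics that a vertex-centric superstep requires; a real system would also add a convergence check instead of a fixed iteration count.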

Proceedings ArticleDOI
01 Jan 2014
TL;DR: Pagrol introduces a new conceptual Hyper Graph Cube model (which is an attributed-graph analogue of the data cube model for relational DBMS) to aggregate attributed graphs at different granularities and levels and provides an efficient MapReduce-based parallel graph cubing algorithm, MRGraph-Cubing, to compute the graph cube for an attributed graph.
Abstract: Attributed graphs are becoming important tools for modeling information networks, such as the Web and various social networks (e.g. Facebook, LinkedIn, Twitter). However, it is computationally challenging to manage and analyze attributed graphs to support effective decision making. In this paper, we propose, Pagrol, a parallel graph OLAP (Online Analytical Processing) system over attributed graphs. In particular, Pagrol introduces a new conceptual Hyper Graph Cube model (which is an attributed-graph analogue of the data cube model for relational DBMS) to aggregate attributed graphs at different granularities and levels. The proposed model supports different queries as well as a new set of graph OLAP Roll-Up/Drill-Down operations. Furthermore, on the basis of Hyper Graph Cube, Pagrol provides an efficient MapReduce-based parallel graph cubing algorithm, MRGraph-Cubing, to compute the graph cube for an attributed graph. Pagrol employs numerous optimization techniques: (a) a self-contained join strategy to minimize I/O cost; (b) a scheme that groups cuboids into batches so as to minimize redundant computations; (c) a cost-based scheme to allocate the batches into bags (each with a small number of batches); and (d) an efficient scheme to process a bag using a single MapReduce job. Results of extensive experimental studies using both real Facebook and synthetic datasets on a 128-node cluster show that Pagrol is effective, efficient and scalable.
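The cuboid idea behind such a graph cube can be illustrated in a few lines: vertices of an attributed graph are grouped by a chosen subset of attributes, and edges are aggregated between the resulting super-vertices. The attribute names and data below are made up for the example, and the MapReduce parallelism of Pagrol is omitted:

```python
from collections import Counter
from itertools import combinations

vertices = {
    1: {"gender": "F", "city": "NY"},
    2: {"gender": "M", "city": "NY"},
    3: {"gender": "F", "city": "LA"},
}
edges = [(1, 2), (2, 3), (1, 3)]

def cuboid(dims):
    """Aggregate the attributed graph on the given attribute subset."""
    key = lambda v: tuple(vertices[v][d] for d in dims)
    super_vertices = Counter(key(v) for v in vertices)          # group sizes
    super_edges = Counter((key(u), key(v)) for u, v in edges)   # edge counts
    return super_vertices, super_edges

# All cuboids of the cube lattice over {gender, city}, from the apex
# (no grouping attribute) down to the base (all attributes).
dims = ["gender", "city"]
cube = {subset: cuboid(subset)
        for k in range(len(dims) + 1)
        for subset in combinations(dims, k)}

sv, se = cube[("gender",)]
# sv counts vertices per gender; se counts edges between gender groups
```

Rolling up corresponds to moving to a cuboid with fewer attributes; drilling down moves the other way. Pagrol's batching and bag-allocation optimizations are about sharing work among these cuboids.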

Patent
26 Sep 2014
TL;DR: In this article, the authors propose a method for building and managing a user-customizable knowledge base, the method comprising acquiring data related to a plurality of entities from a plethora of heterogeneous data sources based on a customized acquisition configuration, wherein the customized acquisition configuration specifies a distinct data wrapper for each of the data sources, extracting entity-related information from the data to form a number of graph databases, and integrating the graph databases by mapping relationships between the entities to create an entity-centric knowledge base.
Abstract: A method for building and managing a user-customizable knowledge base, the method comprising acquiring data related to a plurality of entities from a plurality of heterogeneous data sources based on a customized acquisition configuration, wherein the customized acquisition configuration specifies a distinct data wrapper for each of the data sources, extracting entity-related information from the data to form a number of graph databases, and integrating the graph databases by mapping relationships between the entities to create an entity-centric knowledge base.

Book ChapterDOI
28 Sep 2014
TL;DR: This paper proposes a novel architecture for distributed and incremental queries, and conducts experiments to demonstrate that IncQuery-D, the prototype system, can scale up from a single workstation to a cluster that can handle very large models and complex incremental queries efficiently.
Abstract: Queries are the foundations of data intensive applications. In model-driven software engineering (MDE), model queries are core technologies of tools and transformations. As software models are rapidly increasing in size and complexity, traditional tools exhibit scalability issues that decrease productivity and increase costs [17]. While scalability is a hot topic in the database community and recent NoSQL efforts have partially addressed many shortcomings, this happened at the cost of sacrificing the ad-hoc query capabilities of SQL. Unfortunately, this is a critical problem for MDE applications due to their inherent workload complexity. In this paper, we aim to address both the scalability and ad-hoc querying challenges by adapting incremental graph search techniques – known from the EMF-IncQuery framework – to a distributed cloud infrastructure. We propose a novel architecture for distributed and incremental queries, and conduct experiments to demonstrate that IncQuery-D, our prototype system, can scale up from a single workstation to a cluster that can handle very large models and complex incremental queries efficiently.
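The incremental idea can be shown with a toy example: instead of re-running a query after every model change, a cached result set is updated from the change itself. The pattern here (a single labeled edge) and all names are illustrative, not EMF-IncQuery's actual API:

```python
class IncrementalPatternIndex:
    """Maintains the match set of a one-edge pattern incrementally."""
    def __init__(self, label):
        self.label = label
        self.matches = set()

    def on_insert(self, src, label, dst):
        # A new edge can only add matches; no full re-evaluation needed.
        if label == self.label:
            self.matches.add((src, dst))

    def on_delete(self, src, label, dst):
        self.matches.discard((src, dst))

idx = IncrementalPatternIndex("knows")
idx.on_insert("alice", "knows", "bob")
idx.on_insert("alice", "worksAt", "acme")   # irrelevant to the pattern
idx.on_insert("bob", "knows", "carol")
idx.on_delete("alice", "knows", "bob")
# idx.matches now contains only ("bob", "carol")
```

Real incremental engines such as EMF-IncQuery maintain networks of such partial-match caches (e.g. Rete networks) for multi-edge patterns; distributing those caches across a cluster is the contribution of IncQuery-D.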

Journal ArticleDOI
TL;DR: This work provides a classification of patterns, studies standard graph queries on graph patterns based on regular expressions, provides additional restrictions for tractability, and shows that some intractable cases can be naturally cast as instances of constraint satisfaction problems.
Abstract: Graph data appears in a variety of application domains, and many uses of it, such as querying, matching, and transforming data, naturally result in incompletely specified graph data, that is, graph patterns. While queries need to be posed against such data, techniques for querying patterns are generally lacking, and properties of such queries are not well understood. Our goal is to study the basics of querying graph patterns. The key features of patterns we consider here are node and label variables and edges specified by regular expressions. We provide a classification of patterns, and study standard graph queries on graph patterns. We give precise characterizations of both data and combined complexity for each class of patterns. If complexity is high, we do further analysis of features that lead to intractability, as well as lower-complexity restrictions. Since our patterns are based on regular expressions, query answering for them can be captured by a new automata model. These automata have two modes of acceptance: one captures queries returning nodes, and the other queries returning paths. We study properties of such automata, and the key computational tasks associated with them. Finally, we provide additional restrictions for tractability, and show that some intractable cases can be naturally cast as instances of constraint satisfaction problems.
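A small sketch of how a regular-expression path query over an edge-labeled graph is classically answered: a BFS over pairs (graph node, automaton state), i.e. the standard product construction. The graph, labels, and the hand-built NFA for the expression a b* are illustrative, and the paper's pattern variables are not modeled:

```python
from collections import deque

# Edge-labeled graph: (source, label, target)
graph = [(0, "a", 1), (1, "b", 2), (2, "b", 3), (1, "c", 4)]

# Hand-built NFA for the regex "a b*": state 0 --a--> 1, state 1 --b--> 1.
nfa = {(0, "a"): {1}, (1, "b"): {1}}
accepting = {1}

def reachable_by_regex(start):
    """Nodes reachable from `start` along a path whose labels match the regex."""
    seen = {(start, 0)}          # pairs (graph node, NFA state)
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for u, label, v in graph:
            if u == node:
                for nxt in nfa.get((state, label), ()):
                    if (v, nxt) not in seen:
                        seen.add((v, nxt))
                        queue.append((v, nxt))
    return answers

nodes = reachable_by_regex(0)  # nodes reachable from 0 via a path matching "a b*"
```

The search space has at most |nodes| x |states| pairs, which is why this node-returning query mode is tractable; the hardness results in the paper arise when variables and incompleteness enter the pattern, not from the regular expressions alone.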

Journal ArticleDOI
01 Feb 2014
TL;DR: This paper proposes two graph embedding algorithms based on the Granular Computing paradigm, which are engineered as key procedures of a general-purpose graph classification system.
Abstract: Research on graph-based pattern recognition and Soft Computing systems has attracted many scientists and engineers in several different contexts. This is motivated by the fact that graphs are general structures able to encode both topological and semantic information in data. While the data modeling properties of graphs are of indisputable power, there are still different concerns about the best way to compute similarity functions in an effective and efficient manner. To this end, suitable transformation procedures are usually conceived to address the well-known Inexact Graph Matching problem in an explicit embedding space. In this paper, we propose two graph embedding algorithms based on the Granular Computing paradigm, which are engineered as key procedures of a general-purpose graph classification system. Tests have been conducted on benchmarking datasets relying on both synthetic and real-world data, achieving competitive results in terms of test set classification accuracy.

Patent
27 Jun 2014
TL;DR: A graph database manipulation device includes a processor and a memory configured to store a graph database manipulation application, which configures the processor to obtain a graph database comprising a set of nodes and a set of edges, determine a source node within the set of nodes, locate related nodes based on the source node and the edges, and recursively update the generated representation of sub-related nodes from the perspective of the source node.
Abstract: Systems and methods for visualizing and manipulating graph databases in accordance with embodiments of the invention are disclosed. In one embodiment of the invention, a graph database manipulation device includes a processor and a memory configured to store a graph database manipulation application, wherein the graph database manipulation application configures the processor to obtain a graph database including a set of nodes and a set of edges, determine a source node within the set of nodes, locate a set of related nodes based on the source node and the set of edges, recursively locate a set of sub-related nodes based on the set of related nodes and the set of edges, generate a representation of the set of related nodes from the perspective of the source node, and recursively update the generated representation of the set of sub-related nodes from the perspective of the source node and the related nodes.
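A hypothetical sketch of the recursive expansion the patent describes: starting from a source node, related nodes are located via the edge set, then sub-related nodes are located recursively, yielding a tree-shaped representation from the source node's perspective. Names and the nested-dict representation are illustrative choices:

```python
def expand(source, edges, visited=None):
    """Return a nested dict view of the graph rooted at `source`."""
    if visited is None:
        visited = {source}
    # Related nodes: targets of edges leaving `source`, skipping nodes
    # already placed in the representation (this breaks cycles).
    related = [v for u, v in edges if u == source and v not in visited]
    visited.update(related)
    return {source: [expand(v, edges, visited) for v in related]}

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("d", "a")]  # cycle back to "a"
tree = expand("a", edges)
# tree == {"a": [{"b": [{"d": []}]}, {"c": []}]}
```

The `visited` set is what keeps the recursion terminating on cyclic graph data, which a real graph database must assume.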