Showing papers on "Graph database published in 2012"

PDF

Open Access

Journal Article•DOI•

[...]

Peter T. Wood¹•Institutions (1)

25 Apr 2012

TL;DR: A brief survey of many of the graph query languages that have been proposed, focussing on the core functionality provided in these languages and issues such as expressive power and the computational complexity of query evaluation.

...read moreread less

Abstract: Query languages for graph databases started to be investigated some 25 years ago. With much current data, such as linked data on the Web and social network data, being graph-structured, there has been a recent resurgence in interest in graph query languages. We provide a brief survey of many of the graph query languages that have been proposed, focussing on the core functionality provided in these languages. We also consider issues such as expressive power and the computational complexity of query evaluation.

...read moreread less

292 citations

Journal Article•DOI•

An in-depth comparison of subgraph isomorphism algorithms in graph databases

[...]

Jinsoo Lee¹, Wook-Shin Han¹, Romans Kasperovics¹, Jeong-Hoon Lee¹•Institutions (1)

Kyungpook National University¹

01 Dec 2012

TL;DR: Five state-of-the-art subgraph isomorphism algorithms in a common code base are implemented and compared by comparing them using many real-world datasets and their query loads and report surprising empirical findings.

...read moreread less

Abstract: Finding subgraph isomorphisms is an important problem in many applications which deal with data modeled as graphs. While this problem is NP-hard, in recent years, many algorithms have been proposed to solve it in a reasonable time for real datasets using different join orders, pruning rules, and auxiliary neighborhood information. However, since they have not been empirically compared one another in most research work, it is not clear whether the later work outperforms the earlier work. Another problem is that reported comparisons were often done using the original authors' binaries which were written in different programming environments. In this paper, we address these serious problems by re-implementing five state-of-the-art subgraph isomorphism algorithms in a common code base and by comparing them using many real-world datasets and their query loads. Through our in-depth analysis of experimental results, we report surprising empirical findings.

...read moreread less

292 citations

Proceedings Article•DOI•

Kineograph: taking the pulse of a fast-changing and connected world

[...]

Raymond Cheng¹, Ji Hong², Aapo Kyrola³, Youshan Miao⁴, Xuetian Weng⁵, Ming Wu⁶, Fan Yang⁶, Lidong Zhou⁶, Feng Zhao⁶, Enhong Chen⁴ - Show less +6 more•Institutions (6)

University of Washington¹, Fudan University², Carnegie Mellon University³, University of Science and Technology of China⁴, Peking University⁵, Microsoft⁶

10 Apr 2012

TL;DR: Kineograph is a distributed system that takes a stream of incoming data to construct a continuously changing graph, which captures the relationships that exist in the data feed and supports graph-mining algorithms to extract timely insights from the fast-changing graph structure.

...read moreread less

Abstract: Kineograph is a distributed system that takes a stream of incoming data to construct a continuously changing graph, which captures the relationships that exist in the data feed. As a computing platform, Kineograph further supports graph-mining algorithms to extract timely insights from the fast-changing graph structure. To accommodate graph-mining algorithms that assume a static underlying graph, Kineograph creates a series of consistent snapshots, using a novel and efficient epoch commit protocol. To keep up with continuous updates on the graph, Kineograph includes an incremental graph-computation engine. We have developed three applications on top of Kineograph to analyze Twitter data: user ranking, approximate shortest paths, and controversial topic detection. For these applications, Kineograph takes a live Twitter data feed and maintains a graph of edges between all users and hashtags. Our evaluation shows that with 40 machines processing 100K tweets per second, Kineograph is able to continuously compute global properties, such as user ranks, with less than 2.5-minute timeliness guarantees. This rate of traffic is more than 10 times the reported peak rate of Twitter as of October 2011.

...read moreread less

257 citations

Proceedings Article•DOI•

A Comparison of Current Graph Database Models

[...]

Renzo Angles

01 Apr 2012

TL;DR: A systematic comparison of current graph database models is presented and includes general features (for data storing and querying), data modeling features (i.e., data structures, query languages, and integrity constraints), and the support for essential graph queries.

...read moreread less

Abstract: The limitations of traditional databases, in particular the relational model, to cover the requirements of current applications has lead the development of new database technologies. Among them, the Graph Databases are calling the attention of the database community because in trendy projects where a database is needed, the extraction of worthy information relies on processing the graph-like structure of the data. In this paper we present a systematic comparison of current graph database models. Our review includes general features (for data storing and querying), data modeling features (i.e., data structures, query languages, and integrity constraints), and the support for essential graph queries.

...read moreread less

255 citations

Proceedings Article•DOI•

Query preserving graph compression

[...]

Wenfei Fan¹, Jianzhong Li¹, Xin Wang², Yinghui Wu³•Institutions (3)

Harbin Institute of Technology¹, University of Edinburgh², University of California, Santa Barbara³

20 May 2012

TL;DR: Graphs can be efficiently compressed via a reachability equivalence relation and graph bisimulation, respectively, while reserving query answers, and are reduced in average by 95% for reachability and 57% for graph pattern matching.

...read moreread less

Abstract: It is common to find graphs with millions of nodes and billions of edges in, e.g., social networks. Queries on such graphs are often prohibitively expensive. These motivate us to propose query preserving graph compression, to compress graphs relative to a class Λ of queries of users' choice. We compute a small Gr from a graph G such that (a) for any query Q E Λ Q, Q(G) = Q'(Gr), where Q' E Λ can be efficiently computed from Q; and (b) any algorithm for computing Q(G) can be directly applied to evaluating Q' on Gras is. That is, while we cannot lower the complexity of evaluating graph queries, we reduce data graphs while preserving the answers to all the queries in Λ. To verify the effectiveness of this approach, (1) we develop compression strategies for two classes of queries: reachability and graph pattern queries via (bounded) simulation. We show that graphs can be efficiently compressed via a reachability equivalence relation and graph bisimulation, respectively, while reserving query answers. (2) We provide techniques for aintaining compressed graph Gr in response to changes ΔG to the original graph G. We show that the incremental maintenance problems are unbounded for the two lasses of queries, i.e., their costs are not a function of the size of ΔG and changes in Gr. Nevertheless, we develop incremental algorithms that depend only on ΔG and Gr, independent ofG, i.e., we do not have to decompress Gr to propagate the changes. (3) Using real-life data, we experimentally verify that our compression techniques could reduce graphs in average by 95% for reachability and 57% for graph pattern matching, and that our incremental maintenance algorithms are efficient.

...read moreread less

208 citations

Patent•

Detecting Social Graph Elements for Structured Search Queries

[...]

Yofay Kari Lee¹, Michael Benjamin Cohen¹, Maxime Boucher¹, Alisson Gusatti Azzolini¹, Xiao Li¹, Lars Eilstrup Rasmussen¹ - Show less +2 more•Institutions (1)

Facebook¹

23 Jul 2012

TL;DR: In this paper, a method is described to parse an unstructured text query, parse the text query to identify n-grams, and determine a score that the ngrams correspond to particular nodes and edges from a social graph.

...read moreread less

Abstract: In particular embodiments, a method includes receiving an unstructured text query, parsing the text query to identify n-grams; determining a score that the n-grams correspond to particular nodes and edges from a social graph, identifying those nodes and edges with a score greater than a threshold score, and then generating structured queries that include references to the identified nodes and edges.

...read moreread less

192 citations

Journal Article•DOI•

Graph-based term weighting for information retrieval

[...]

Roi Blanco¹, Christina Lioma²•Institutions (2)

University of A Coruña¹, University of Stuttgart²

01 Feb 2012-Information Retrieval

TL;DR: This work proposes a principled graph-theoretic approach of computing term weights and integrating discourse aspects into retrieval, and experimentally shows that this type of ranking performs comparably to BM25, and can even outperform it, across different TREC datasets and evaluation measures.

...read moreread less

Abstract: A standard approach to Information Retrieval (IR) is to model text as a bag of words. Alternatively, text can be modelled as a graph, whose vertices represent words, and whose edges represent relations between the words, defined on the basis of any meaningful statistical or linguistic relation. Given such a text graph, graph theoretic computations can be applied to measure various properties of the graph, and hence of the text. This work explores the usefulness of such graph-based text representations for IR. Specifically, we propose a principled graph-theoretic approach of (1) computing term weights and (2) integrating discourse aspects into retrieval. Given a text graph, whose vertices denote terms linked by co-occurrence and grammatical modification, we use graph ranking computations (e.g. PageRank Page et al. in The pagerank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998) to derive weights for each vertex, i.e. term weights, which we use to rank documents against queries. We reason that our graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we experimentally show that it performs comparably to a tuned ranking baseline, such as BM25 (Robertson et al. in NIST Special Publication 500-236: TREC-4, 1995). In addition, we integrate into ranking graph properties, such as the average path length, or clustering coefficient, which represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating such properties into ranking allows us to consider issues such as discourse coherence, flow and density during retrieval. We experimentally show that this type of ranking performs comparably to BM25, and can even outperform it, across different TREC (Voorhees and Harman in TREC: Experiment and evaluation in information retrieval, MIT Press, 2005) datasets and evaluation measures.

...read moreread less

185 citations

Journal Article•DOI•

Expressive Languages for Path Queries over Graph-Structured Data

[...]

Pablo Barceló¹, Leonid Libkin², Anthony W. Lin³, Peter T. Wood⁴•Institutions (4)

University of Chile¹, University of Edinburgh², University of Oxford³, Birkbeck, University of London⁴

01 Dec 2012-ACM Transactions on Database Systems

TL;DR: A class of extended CRPQs, called ECRPZs, are proposed, which add regular relations on tuples of paths, and allow path variables in the heads of queries, and are studied for their usefulness in querying graph structured data.

...read moreread less

Abstract: For many problems arising in the setting of graph querying (such as finding semantic associations in RDF graphs, exact and approximate pattern matching, sequence alignment, etc.), the power of standard languages such as the widely studied conjunctive regular path queries (CRPQs) is insufficient in at least two ways. First, they cannot output paths and second, more crucially, they cannot express relationships among paths.We thus propose a class of extended CRPQs, called ECRPQs, which add regular relations on tuples of paths, and allow path variables in the heads of queries. We provide several examples of their usefulness in querying graph structured data, and study their properties. We analyze query evaluation and representation of tuples of paths in the output by means of automata. We present a detailed analysis of data and combined complexity of queries, and consider restrictions that lower the complexity of ECRPQs to that of relational conjunctive queries. We study the containment problem, and look at further extensions with first-order features, and with nonregular relations that add arithmetic constraints on the lengths of paths and numbers of occurrences of labels.

...read moreread less

149 citations

Posted Content•

Efficient Snapshot Retrieval over Historical Graph Data

[...]

Udayan Khurana¹, Amol Deshpande¹•Institutions (1)

University of Maryland, College Park¹

24 Jul 2012-arXiv: Databases

TL;DR: DeltaGraph as discussed by the authors is a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition maintaining the current state for ongoing updates.

...read moreread less

Abstract: We address the problem of managing historical data for large evolving information networks like social networks or citation networks, with the goal to enable temporal and evolutionary queries and analysis. We present the design and architecture of a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition to maintaining the current state for ongoing updates. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compactly recording the historical information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right parameters for a specific scenario. In addition, we present strategies for materializing portions of the historical graph state in memory to further speed up the retrieval process. Secondly, we present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.

...read moreread less

125 citations

Proceedings Article•

Managing large graphs on multi-cores with graph awareness

[...]

Vijayan Prabhakaran¹, Ming Wu¹, Xuetian Weng¹, Frank McSherry¹, Lidong Zhou¹, Maya Haridasan² - Show less +2 more•Institutions (2)

Microsoft¹, Google²

13 Jun 2012

TL;DR: Grace is a graph-aware, in-memory, transactional graph management system, specifically built for real-time queries and fast iterative computations, designed to run on large multi-cores to improve its performance.

...read moreread less

Abstract: Grace is a graph-aware, in-memory, transactional graph management system, specifically built for real-time queries and fast iterative computations. It is designed to run on large multi-cores, taking advantage of the inherent parallelism to improve its performance. Grace contains a number of graph-specific and multi-core-specific optimizations including graph partitioning, careful in-memory vertex ordering, updates batching, and load-balancing. It supports queries, searches, iterative computations, and transactional updates. Grace scales to large graphs (e.g., a Hotmail graph with 320 million vertices) and performs up to two orders of magnitude faster than commercial key-value stores and graph databases.

...read moreread less

121 citations

Patent•

Context-based search for a data store related to a graph node

[...]

Samuel S. Adams¹, Robert R. Friedlander¹, John K. Gerken¹, James R. Kraemer¹•Institutions (1)

IBM¹

23 Aug 2012

TL;DR: In this article, a graph database storage system contains multiple graph nodes, such that a first pointer points from a particular graph node to a particular syntactic context event node in a synthetic context event database.

...read moreread less

Abstract: A graph database storage system contains a graph database that has multiple graph nodes. A first pointer points from a particular graph node to a particular synthetic context event node in a synthetic context event database. A second pointer points from the particular synthetic context event node in the synthetic context event database to a particular data store in a data structure, such that the first pointer and the second pointer associate the particular data store with the particular entity represented in the graph database via the particular synthetic context event node.

...read moreread less

Proceedings Article•DOI•

Managing large dynamic graphs efficiently

[...]

Jayanta Mondal¹, Amol Deshpande¹•Institutions (1)

University of Maryland, College Park¹

20 May 2012

TL;DR: This paper proposes an in-memory, distributed graph data management system aimed at managing a large-scale dynamically changing graph, and supporting low-latency query processing over it, and develops a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication.

...read moreread less

Abstract: There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, and so on. Graph databases that can explicitly reason about the graphical nature of the data, that can support flexible schemas and node-centric or edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing different types of queries over large graphs, to date there is little work on understanding the challenges in distributed graph databases, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large-scale dynamically changing graph, and supporting low-latency query processing over it. The key challenge in a distributed graph database is that, partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose aggressive replication of the nodes in the graph for supporting low-latency querying, and investigate three novel techniques to minimize the communication bandwidth and the storage requirements. First, we develop a hybrid replication policy that monitors node read-write frequencies to dynamically decide what data to replicate, and whether to do eager or lazy replication. Second, we propose a clustering-based approach to amortize the costs of making these replication decisions. Finally, we propose using a fairness criterion to dictate how replication decisions should be made. We provide both theoretical analysis and efficient algorithms for the optimization problems that arise. We have implemented our framework as a middleware on top of the open-source CouchDB key-value store. We evaluate our system on a social graph, and show that our system is able to handle very large graphs efficiently, and that it reduces the network bandwidth consumption significantly.

...read moreread less

Journal Article•DOI•

Large scale cohesive subgraphs discovery for social network visual analysis

[...]

Feng Zhao¹, Anthony K. H. Tung¹•Institutions (1)

National University of Singapore¹

01 Dec 2012

TL;DR: A novel subgraph concept to capture the cohesion in social interactions, an I/O efficient approach to discover cohesive subgraphs are introduced, and an analytic system which allows users to perform intuitive, visual browsing on large scale social networks is proposed.

...read moreread less

Abstract: Graphs are widely used in large scale social network analysis nowadays. Not only analysts need to focus on cohesive subgraphs to study patterns among social actors, but also normal users are interested in discovering what happening in their neighborhood. However, effectively storing large scale social network and efficiently identifying cohesive subgraphs is challenging. In this work we introduce a novel subgraph concept to capture the cohesion in social interactions, and propose an I/O efficient approach to discover cohesive subgraphs.Besides, we propose an analytic system which allows users to perform intuitive, visual browsing on large scale social networks. Our system stores the network as a social graph in the graph database, retrieves a local cohesive subgraph based on the input keywords, and then hierarchically visualizes the subgraph out on orbital layout, in which more important social actors are located in the center. By summarizing textual interactions between social actors as tag cloud, we provide a way to quickly locate active social communities and their interactions in a unified view.

...read moreread less

Proceedings Article•DOI•

An Efficient Graph Indexing Method

[...]

Xiaoli Wang¹, Xiaofeng Ding², Anthony K. H. Tung¹, Shanshan Ying¹, Hai Jin² - Show less +1 more•Institutions (2)

National University of Singapore¹, Huazhong University of Science and Technology²

01 Apr 2012

TL;DR: This paper proposes SEGOS, an indexing and query processing framework for graph similarity search that is easy to be pipelined to support continuous graph pruning and a novel search strategy based on the index.

...read moreread less

Abstract: Graphs are popular models for representing complex structure data and similarity search for graphs has become a fundamental research problem. Many techniques have been proposed to support similarity search based on the graph edit distance. However, they all suffer from certain drawbacks: high computational complexity, poor scalability in terms of database size, or not taking full advantage of indexes. To address these problems, in this paper, we propose SEGOS, an indexing and query processing framework for graph similarity search. First, an effective two-level index is constructed off-line based on sub-unit decomposition of graphs. Then, a novel search strategy based on the index is proposed. Two algorithms adapted from TA and CA methods are seamlessly integrated into the proposed strategy to enhance graph search. More specially, the proposed framework is easy to be pipelined to support continuous graph pruning. Extensive experiments are conducted on two real datasets to evaluate the effectiveness and scalability of our approaches.

...read moreread less

Patent•

Knowledge graph based search system

[...]

Jeffrey Scott Eder

24 Feb 2012

TL;DR: In this paper, a knowledge-based search system for an entity is presented, where the knowledge graphs are used to support the retrieval of relevant search results, and the entity related data are analyzed as required to develop an entity knowledge and one or more knowledge graphs.

...read moreread less

Abstract: Methods, computer program products and systems for developing and implementing a Knowledge Based Search System for an entity. Entity related data are analyzed as required to develop an entity knowledge and one or more knowledge graphs. The knowledge graphs are used to support the retrieval of relevant search results.

...read moreread less

Proceedings Article•DOI•

Improving large graph processing on partitioned graphs in the cloud

[...]

Rishan Chen¹, Mao Yang², Xuetian Weng¹, Byron Choi³, Bingsheng He⁴, Xiaoming Li¹ - Show less +2 more•Institutions (4)

Peking University¹, Microsoft², Hong Kong Baptist University³, Nanyang Technological University⁴

14 Oct 2012

TL;DR: A novelgraph partitioning framework to improve the network performance of graph partitioning itself, partitioned graph storage and vertex-oriented graph processing, and the effectiveness of network performance aware optimizations on the large graph processing engine.

...read moreread less

Abstract: As the study of large graphs over hundreds of gigabytes becomes increasingly popular for various data-intensive applications in cloud computing, developing large graph processing systems has become a hot and fruitful research area. Many of those existing systems support a vertex-oriented execution model and allow users to develop custom logics on vertices. However, the inherently random access pattern on the vertex-oriented computation generates a significant amount of network traffic. While graph partitioning is known to be effective to reduce network traffic in graph processing, there is little attention given to how graph partitioning can be effectively integrated into large graph processing in the cloud environment. In this paper, we develop a novel graph partitioning framework to improve the network performance of graph partitioning itself, partitioned graph storage and vertex-oriented graph processing. All optimizations are specifically designed for the cloud network environment. In experiments, we develop a system prototype following Pregel (the latest vertex-oriented graph engine by Google), and extend it with our graph partitioning framework. We conduct the experiments with a real-world social network and synthetic graphs over 100GB each in a local cluster and on Amazon EC2. Our experimental results demonstrate the efficiency of our graph partitioning framework, and the effectiveness of network performance aware optimizations on the large graph processing engine.

...read moreread less

Book Chapter•DOI•

Application-Only call graph construction

[...]

Karim Ali¹, Ondřej Lhoták¹•Institutions (1)

University of Waterloo¹

11 Jun 2012

TL;DR: Cgc is presented, a tool that generates a sound call graph for the application part of a program without analyzing the code of the library.

...read moreread less

Abstract: Since call graphs are an essential starting point for all inter-procedural analyses, many tools and frameworks have been developed to generate the call graph of a given program. The majority of these tools focus on generating the call graph of the whole program (i.e., both the application and the libraries that the application depends on). A popular compromise to the excessive cost of building a call graph for the whole program is to ignore all the effects of the library code and any calls the library makes back into the application. This results in potential unsoundness in the generated call graph and therefore in any analysis that uses it. In this paper, we present Cgc, a tool that generates a sound call graph for the application part of a program without analyzing the code of the library.

...read moreread less

Proceedings Article•DOI•

Regular path queries on graphs with data

[...]

Leonid Libkin¹, Domagoj Vrgoč¹•Institutions (1)

University of Edinburgh¹

26 Mar 2012

TL;DR: Two types of extensions of regular expressions that are more user-friendly are defined, and query evaluation techniques for them are developed, and it is shown that results extends to analogs of conjunctive regular path queries.

...read moreread less

Abstract: Graph data models received much attention lately due to applications in social networks, semantic web, biological databases and other areas. Typical query languages for graph databases retrieve their topology, while actual data stored in them is usually queried using standard relational mechanisms.Our goal is to develop techniques that combine these two modes of querying, and give us query languages that can ask questions about both data and topology. As the basic querying mechanism we consider regular path queries, with the key difference that conditions on paths between nodes now talk not only about labels but also specify how data changes along the path. Paths that combine edge labels with data values are closely related to data words, so for stating conditions in queries, we look at several data-word formalisms developed recently. We show that many of them immediately lead to intractable data complexity for graph queries, with the notable exception of register automata, which can specify many properties of interest, and have NLogspace data and Pspace combined complexity. As register automata themselves are not easy to use in querying, we define two types of extensions of regular expressions that are more user-friendly, and develop query evaluation techniques for them. For one class, regular expressions with memory, we achieve the same bounds as for automata, and for the other class, regular expressions with equality, we also obtain tractable combined complexity of query evaluation. In addition, we show that results extends to analogs of conjunctive regular path queries.

...read moreread less

Book•

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement

[...]

Eric Redmond, Jim R. Wilson

11 May 2012

TL;DR: The goal with the book was principally to introduce readers to the field of choices they now have, and to introduce databases that had the right combination of representing their genre and relative popularity.

...read moreread less

Abstract: Q&A with Authors Eric Redmond and Jim Wilson How did you pick the seven databases? Eric: We did have some criteria, if not explicit. The databases had to be open source---we didn't want to cover any databases that would tie readers to a company. We wanted at least one implementation for each of the five database genres (Relational, Key-Value, Columnar, Document, Graph). Then we chose databases that exemplified some general concepts we wanted to cover, like the CAP theorem, or mapreduce. Finally, we chose databases that were good counterpoints to each other--so we chose MongoDB and CouchDB (different ways of implementing document stores). Or we chose Riak because it was a Dynamo (Amazon's database) implementation to compare to HBase as a BigTable (Google's database) implementation. Jim: Our goal with the book was principally to introduce readers to the field of choices they now have. Our selections were largely in service of that goal. Even so, it was a pretty long and iterative process. We knew that no matter which ones we picked there'd be people asking why we did or didn't include their favorite. It came down to choosing the genres we wanted to discuss and then picking databases that had the right combination of (A) representing their genre and (B) relative popularity. For example, we picked PostgreSQL since it sticks very closely to the SQL standard and is relatively less well known than other OSS competitors like MySQL. Similarly, even though both HBase and Cassandra are column-oriented databases, we went with HBase because Cassandra is more of a hybrid that incorporates elements from both the BigTable paper and the Dynamo paper. Databases are rapidly changing. What do you wish youd included now? Eric: There are hundreds of database options, but I'm glad to see that our choices are still going strong a year later. However, if I had to do it over again, I would like to have added a Triplestore (like Mulgara), since the semantic web is slowly popularizing this method of data storage. I also would have liked to spend more time on Neo4j's Cypher language, or have covered Hadoop in a bit of detail, since analytics is a huge part of data storage. Jim: Yes, databases are rapidly changing, in two senses. First, the field of available data storage technology has been seeing an explosion in recent years. More and more different sorts of databases are cropping up to fill in various niche needs. In the other sense, the databases themselves are rapidly evolving. Even between minor version releases, modern NoSQL databases incorporate more and more features in order to claim more of the market and remain competitive. In that regard, there's a bit of convergence happening and it makes choosing one even harder as there are more that can meet your needs all the time. I still think the five genres and seven databases we chose satisfy the criteria that we set out to achieve. But there are others I'd like to write about as well. These include some old favorites like SQLite and some databases you might not think of as such, like OpenLDAP and SOLR (an inverted index/search engine). Why did you decide to write this book? Eric: Jim and I discussed writing a book like this for quite some time. About a year and a half ago he sent me an email with no body--the subject was "Seven Databases in Seven Weeks?" The title sold me. We both loved Bruce's "Seven Languages" book, and this seemed the perfect format to explore this emerging field. Jim: As early as March of 2010, Eric and I brainstormed about writing a NoSQL book of some kind. At the time there was a lot of buzz around the term, but also a lot of confusion. We thought we could bring some structure to the discussion and educate people who might not be up to speed yet on all the latest developments. After reading Bruce A. Tate's Seven Languages in Seven Weeks I thought, "What about Seven Databases?" Eric submitted a proposal and a few weeks later we were off to the races. What do you see as up and coming databases? Eric: I've become a big fan of Neo4j. It's one we covered in the book, but in all honesty we chose it because we wanted to explore an open source graph database. But over the past year it's really come into its own. I really do believe this is the year we'll see wider adoption of graph databases. As for ones we did not cover, I think ElasticSearch is clearly gaining traction. OrientDB is also interesting, as it can act as a relational, key-value, document, or a graph database. I think you'll see more of this multi-genre behavior in the future. And as I hinted at before, Triplestores are gaining a bit of traction, too, though their problem-set greatly overlaps with general graph databases. Jim: There are many, of course, but there are at least two that I personally look forward to exploring in more detail: ElasticSearch and doozer. ElasticSearch is a distributed, peer-based, REST/JSON powered document search engine. Using a distributed Lucene index at its core, ElasticSearch allows REST clients to find documents based on fuzzy criteria. Everyone needs a search engine, and ElasticSearch makes it easy. Doozer is a fast, headless consensus engine. It's written in the Go programming language by the smart folks at Heroku. Doozer is great for storing small blobs of important information that absolutely must be consistent (like cluster configuration metadata), but without a single point of failure.

...read moreread less

Proceedings Article•DOI•

SAHAD: Subgraph Analysis in Massive Networks Using Hadoop

[...]

Zhao Zhao¹, Guanying Wang¹, Ali R. Butt¹, Maleq Khan¹, V. S. Anil Kumar¹, Madhav V. Marathe¹ - Show less +2 more•Institutions (1)

Virginia Tech¹

21 May 2012

TL;DR: SAHAD is the first such Hadoop based subgraph/subtree analysis algorithm, and performs significantly better than prior approaches for very large graphs and templates, and is also amenable to running quite easily on Amazon EC2, without needs for any system level optimization.

...read moreread less

Abstract: Relational sub graph analysis, e.g. finding labeled sub graphs in a network, which are isomorphic to a template, is a key problem in many graph related applications. It is computationally challenging for large networks and complex templates. In this paper, we develop SAHAD, an algorithm for relational sub graph analysis using Hadoop, in which the sub graph is in the form of a tree. SAHAD is able to solve a variety of problems closely related with sub graph isomorphism, including counting labeled/unlabeled sub graphs, finding supervised motifs, and computing graph let frequency distribution. We prove that the worst case work complexity for SAHAD is asymptotically very close to that of the best sequential algorithm. On a mid-size cluster with about 40 compute nodes, SAHAD scales to networks with up to 9 million nodes and a quarter billion edges, and templates with up to 12 nodes. To the best of our knowledge, SAHAD is the first such Hadoop based subgraph/subtree analysis algorithm, and performs significantly better than prior approaches for very large graphs and templates. Another unique aspect is that SAHAD is also amenable to running quite easily on Amazon EC2, without needs for any system level optimization.

...read moreread less

Proceedings Article•DOI•

Distributed graph pattern matching

[...]

Shuai Ma¹, Yang Cao¹, Jinpeng Huai¹, Tianyu Wo¹•Institutions (1)

Beihang University¹

16 Apr 2012

TL;DR: This paper proposes distributed algorithms and optimization techniques that exploit the properties of graph simulation and the analyses of distributed algorithms, and experimentally verifies the effectiveness and efficiency of these algorithms, using both real-life and synthetic data.

...read moreread less

Abstract: Graph simulation has been adopted for pattern matching to reduce the complexity and capture the need of novel applications With the rapid development of the Web and social networks, data is typically distributed over multiple machines Hence a natural question raised is how to evaluate graph simulation on distributed data To our knowledge, no such distributed algorithms are in place yet This paper settles this question by providing evaluation algorithms and optimizations for graph simulation in a distributed setting (1) We study the impacts of components and data locality on the evaluation of graph simulation (2) We give an analysis of a large class of distributed algorithms, captured by a message-passing model, for graph simulation We also identify three complexity measures: visit times, makespan and data shipment, for analyzing the distributed algorithms, and show that these measures are essentially controversial with each other (3) We propose distributed algorithms and optimization techniques that exploit the properties of graph simulation and the analyses of distributed algorithms (4) We experimentally verify the effectiveness and efficiency of these algorithms, using both real-life and synthetic data

...read moreread less

Proceedings Article•DOI•

Benchmarking Traversal Operations over Graph Databases

[...]

Marek Ciglan, Alex Averbuch¹, Ladialav Hluchy•Institutions (1)

Swedish Institute of Computer Science¹

01 Apr 2012

TL;DR: This paper addresses the need to compare the performance of different graph databases, and discusses the challenges of developing fair benchmarking methodologies, and describes the design of the graph traversal benchmark and presents its results.

...read moreread less

Abstract: A significant number of graph database systems has emerged in the past few years. Most aim at the management of the property graph data structure: where graph elements can be assigned with properties. In this paper, we address the need to compare the performance of different graph databases, and discuss the challenges of developing fair benchmarking methodologies. We believe that, compared to other database systems, the ability to efficiently traverse over the graph topology is unique to graph databases. As such, we focus our attention on the benchmarking of traversal operations. We describe the design of the graph traversal benchmark and present its results. The benchmark provides the means to compare the performance of different data management systems and gives us insight into the abilities and limitations of modern graph databases.

...read moreread less

Proceedings Article•DOI•

Workflow Scheduling to Minimize Data Movement Using Multi-constraint Graph Partitioning

[...]

Masahiro Tanaka¹, Osamu Tatebe¹•Institutions (1)

University of Tsukuba¹

13 May 2012

TL;DR: This work proposes a new method of task assignment based on Multi-Constraint Graph Partitioning that relates the dimension of weight vectors to the rank of a task phase defined by traversing the task graph.

...read moreread less

Abstract: Among scheduling algorithms of scientific workflows, the graph partitioning is a technique to minimize data transfer between nodes or clusters. However, when the graph partitioning is simply applied to a complex workflow DAG, tasks in each parallel phase are not always evenly assigned to computation nodes since the graph partitioning algorithm is not aware of edge directions that represent task dependencies. Thus, we propose a new method of task assignment based on Multi-Constraint Graph Partitioning. This method relates the dimension of weight vectors to the rank of a task phase defined by traversing the task graph. Our algorithm is implemented in the Pwrake workflow system and evaluated the performance of the Montage workflow using a computer cluster. The result shows that the file size accessed from remote nodes is reduced from 88% to 14% of the total file size accessed during the workflow and that the elapsed time is reduced by 31%.

...read moreread less

Book Chapter•DOI•

Regular path queries on large graphs

[...]

André Koschmieder¹, Ulf Leser¹•Institutions (1)

Humboldt University of Berlin¹

25 Jun 2012

TL;DR: An algorithm is devised which decomposes an RPQ into a series of smaller RPQs using rare labels, i.e., elements of the query with few matches, as way-points, and which outperforms the automata-based approach, often by orders of magnitude.

...read moreread less

Abstract: The significance of regular path queries (RPQs) on graph-like data structures has grown steadily over the past decade. RPQs are, often in restricted forms, part of graph-oriented query languages such as XQuery/XPath and SPARQL, and have applications in areas such as semantic, social, and biomedical networks. However, existing systems for evaluating RPQs are restricted either in the type of the graph (e.g., only trees), the type of regular expressions (e.g., only single steps), and/or the size of the graphs they can handle. No method has yet been developed that would be capable of efficiently evaluating general RPQs on large graphs, i.e., with millions of nodes/edges. We present a novel approach for answering RPQs on large graphs. Our method exploits the fact that not all labels in a graph are equally frequent. We devise an algorithm which decomposes an RPQ into a series of smaller RPQs using rare labels, i.e., elements of the query with few matches, as way-points. A search thereby is decomposed into a set of smaller search problems which are tackled in a bi-directional fashion, supported by a set of graph indexes. Comparison of our algorithm with two approaches following the traditional methods for tackling such problems, i.e., the usage of automata, reveals that (a) the automata-based methods are not able to handle large graphs due to the amount of memory they require, and that (b) our algorithm outperforms the automata-based approach, often by orders of magnitude. Another advantage of our algorithm is that it can be parallelized easily.

...read moreread less

Proceedings Article•DOI•

Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency

[...]

Jatin Chhugani¹, Nadathur Satish¹, Changkyu Kim¹, Jason Sewall¹, Pradeep Dubey¹ - Show less +1 more•Institutions (1)

Intel¹

21 May 2012

TL;DR: This work presents a scalable Breadth-First Search Traversal algorithm for modern multi-socket, multi-core CPUs, which uses lock- and atomic-free operations on a cache-resident structure for arbitrary sized graphs to filter out expensive main memory accesses, and completely and efficiently utilizes all available bandwidth resources.

...read moreread less

Abstract: Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversal is a key component in graph algorithms such as reach ability and graph matching. Since the scale of data stored and queried in these databases is increasing, it is important to obtain high performing implementations of graph traversal that can efficiently utilize the processing power of modern processors. In this work, we present a scalable Breadth-First Search Traversal algorithm for modern multi-socket, multi-core CPUs. Our algorithm uses lock- and atomic-free operations on a cache-resident structure for arbitrary sized graphs to filter out expensive main memory accesses, and completely and efficiently utilizes all available bandwidth resources. We propose a work distribution approach for multi-socket platforms that ensures load-balancing while keeping cross-socket communication low. We provide a detailed analytical model that accurately projects the performance of our single- and multi-socket traversal algorithms to within 5-10% of obtained performance. Our analytical model serves as a useful tool to analyze performance bottlenecks on modern CPUs. When measured on various synthetic and real-world graphs with a wide range of graph sizes, vertex degrees and graph diameters, our implementation on a dual-socket Intel (R) Xeon (R) X5570 (Intel micro architecture code name Nehalem) system achieves 1.5X -- 13.2X performance speedup over the best reported numbers. We achieve around 1 Billion traversed edges per second on a scale free R-MAT graph with 64M vertices and 2 Billion edges on a dual-socket Nehalem system. Our optimized algorithm is useful as a building block for efficient multi-node implementations and future exascale systems, thereby allowing them to ride the trend of increasing per-node compute and bandwidth resources.

...read moreread less

Patent•

System and method for sharing investigation results

[...]

Kevin Richards¹, David Cohen¹, Khan Tasinga¹•Institutions (1)

Palantir Technologies¹

05 Nov 2012

TL;DR: In this article, a computer-based investigative analysis system is disclosed in which a user can share results of an investigation with other users in a way that allows the sharing user to visualize how the results will be shared before they are shared.

...read moreread less

Abstract: A computer-based investigative analysis system is disclosed in which a user can share results of an investigation with other users in a way that allows the sharing user to visualize how the results will be shared before they are shared. The results are shared in the form of a visual graph having nodes, edges, and other presentation elements. The nodes represent data objects that are the subject of the investigation and the edges represent connections between the data objects. The graph is shared in the form of an automatically generated redacted graph omitting nodes, edges, and presentation elements for which the other users do not have permission to view. Before sharing the graph, the sharing user is presented with a visualization of the automatically generated redacted graph providing the user an opportunity to confirm that sharing the redacted graph will not constitute an unauthorized information leakage.

...read moreread less

Proceedings Article•DOI•

G-SPARQL: a hybrid engine for querying large attributed graphs

[...]

Sherif Sakr¹, Sameh Elnikety², Yuxiong He²•Institutions (2)

University of New South Wales¹, Microsoft²

29 Oct 2012

TL;DR: An algebraic compilation mechanism for the proposed query language, G-SPARQL, which is extended from the relational algebra and based on the basic construct of building SPARQL queries, the Triple Pattern is described.

...read moreread less

Abstract: We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language expresses types of queries which of large interest for applications which model their data as large graphs such as: pattern matching, reachability and shortest path queries. Each query can combine both of structural predicates and value-based predicates (on the attributes of the graph nodes and edges). We describe an algebraic compilation mechanism for our proposed query language which is extended from the relational algebra and based on the basic construct of building SPARQL queries, the Triple Pattern. We describe a hybrid Memory/Disk representation of large attributed graphs where only the topology of the graph is maintained in memory while the data of the graph is stored in a relational database. The execution engine of our proposed query language splits parts of the query plan to be pushed inside the relational database while the execution of other parts of the query plan are processed using memory-based algorithms, as necessary. Experimental results on real datasets demonstrate the efficiency and the scalability of our approach and show that our approach outperforms native graph databases by several factors.

...read moreread less

Proceedings Article•DOI•

Managing and mining large graphs: systems and implementations

[...]

Bin Shao¹, Haixun Wang¹, Yanghua Xiao²•Institutions (2)

Microsoft¹, Fudan University²

20 May 2012

TL;DR: In this article, the authors provide perspectives from a variety of standpoints on the goals and the means for developing a general purpose graph system, highlighting the challenges posed by the graph data, the constraints of architectural design, the different types of application needs, and the power of different programming models that support such needs.

...read moreread less

Abstract: We are facing challenges at all levels ranging from infrastructures to programming models for managing and mining large graphs. A lot of algorithms on graphs are ad-hoc in the sense that each of them assumes that the underlying graph data can be organized in a certain way that maximizes the performance of the algorithm. In other words, there is no standard graph systems based on which graph algorithms are developed and optimized. In response to this situation, a lot of graph systems have been proposed recently. In this tutorial, we discuss several representative systems. Still, we focus on providing perspectives from a variety of standpoints on the goals and the means for developing a general purpose graph system. We highlight the challenges posed by the graph data, the constraints of architectural design, the different types of application needs, and the power of different programming models that support such needs.This tutorial is complementary to the related tutorial "Managing and Mining Large Graphs: Patterns and Algorithms".

...read moreread less

Proceedings Article•DOI•

Using Pregel-like Large Scale Graph Processing Frameworks for Social Network Analysis

[...]

L. Quick, P. Wilkinson, D. Hardcastle

26 Aug 2012

TL;DR: Several important undirected graph algorithms for social network analysis which fit within the Pregel framework are presented, and various graph componentisation methods, diameter estimation, degrees of separations, are discussed.

...read moreread less

Abstract: Pregel is a system for large scale graph processing developed at Google. It provides a scalable framework for running graph analytics on clusters of commodity machines. In this paper, we present several important undirected graph algorithms for social network analysis which fit within this framework. We discuss various graph componentisation methods, diameter estimation, degrees of separations, along with triangle, k-core and k-truss finding and computing clustering coefficients. Finally we present some experimental results using our own implementation of the Pregel framework, and examine key features of the general framework and algorithmic design.

...read moreread less

Proceedings Article•DOI•

Personalized video recommendation through tripartite graph propagation

[...]

Bisheng Chen¹, Jingdong Wang², Qinghua Huang¹, Tao Mei²•Institutions (2)

South China University of Technology¹, Microsoft²

29 Oct 2012

TL;DR: This paper develops an iterative propagation scheme over the tripartite graph to compute the preference information of each user and demonstrates that the proposed method outperforms existing state-of-the-art approaches, co-views and random walks on the user-video bipartite graph.

...read moreread less

Abstract: The rapid growth of the number of videos on the Internet provides enormous potential for users to find content of interest to them. Video search, such as Google, Youtube, Bing, is a popular way to help users to find desired videos. However, it is still very challenging to discover new video contents for users. In this paper, we address the problem of providing personalized video suggestions for users. Rather than only exploring the user-video graph that is formulated using the click-through information, we also investigate other two useful graphs, the user-query graph indicating if a user ever issues a query, and the query-video graph indicating if a video appears in the search result of a query. The two graphs act as a bridge to connect users and videos, and have a large potential to improve the recommendation as the queries issued by a user essentially imply his interest. As a result, we reach a tripartite graph over (user, video, query). We develop an iterative propagation scheme over the tripartite graph to compute the preference information of each user. Experimental results on a dataset of 2,893 users, 23,630 queries and 55,114 videos collected during Feb. 1-28, 2011 demonstrate that the proposed method outperforms existing state-of-the-art approaches, co-views and random walks on the user-video bipartite graph.

...read moreread less

Collapse