scispace - formally typeset
Author

Andrey Gubichev

Other affiliations: Max Planck Society, Google
Bio: Andrey Gubichev is an academic researcher from Technische Universität München. The author has contributed to research in topics: RDF & Query optimization. The author has an h-index of 14 and has co-authored 19 publications receiving 1,070 citations. Previous affiliations of Andrey Gubichev include Max Planck Society & Google.

Papers
Journal ArticleDOI
01 Nov 2015
TL;DR: This paper introduces the Join Order Benchmark (JOB) and experimentally revisits the main components of the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries.
Abstract: Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. We investigate the quality of industrial-strength cardinality estimators and find that all estimators routinely produce large errors. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. Finally, we investigate plan enumeration techniques comparing exhaustive dynamic programming with heuristic algorithms and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates.

449 citations
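The plan enumeration the abstract refers to can be illustrated with textbook exhaustive dynamic programming over table subsets. Below is a minimal sketch, assuming a simple C_out-style cost (sum of intermediate-result sizes) and externally supplied cardinality estimates; the function name and cost model are illustrative assumptions, not the paper's implementation:

```python
from itertools import combinations

def dp_join_order(tables, join_card):
    """Exhaustive dynamic programming over table subsets.

    tables: list of table names
    join_card: maps a frozenset of two or more tables to the estimated
               cardinality of joining exactly those tables
    Cost model: sum of intermediate-result sizes (a C_out-style cost);
    base tables cost nothing.
    Returns (cost, plan), where plan is a nested pair tree of table names.
    """
    best = {frozenset([t]): (0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            out = join_card[subset]
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    cost = best[left][0] + best[right][0] + out
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))
    return best[frozenset(tables)]

# Illustrative estimates: "A" and "C" share no join predicate, so their
# pairing is a huge cross product that the enumeration learns to avoid.
est = {
    frozenset({"A", "B"}): 50,
    frozenset({"B", "C"}): 10,
    frozenset({"A", "C"}): 1_000_000,
    frozenset({"A", "B", "C"}): 30,
}
cost, plan = dp_join_order(["A", "B", "C"], est)
# cost == 40; plan == ("A", ("B", "C")): join B with C first, then add A
```

Real optimizers restrict this enumeration to connected subgraphs of the join graph instead of pricing cross products explicitly; the simplification keeps the sketch short.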

Proceedings ArticleDOI
27 May 2015
TL;DR: This paper describes the LDBC Social Network Benchmark (SNB) and presents database benchmarking innovation in terms of the graph query functionality tested, correlated graph generation techniques, and a scalable benchmark driver for a workload with complex graph dependencies.
Abstract: The Linked Data Benchmark Council (LDBC) has now been underway for two years and has gathered strong industrial participation for its mission to establish benchmarks and benchmarking practices for evaluating graph data management systems. The LDBC introduced a new choke-point-driven methodology for developing benchmark workloads, which combines user input with input from expert systems architects; we outline this methodology. This paper describes the LDBC Social Network Benchmark (SNB) and presents database benchmarking innovation in terms of the graph query functionality tested, correlated graph generation techniques, and a scalable benchmark driver for a workload with complex graph dependencies. SNB has three query workloads under development: Interactive, Business Intelligence, and Graph Algorithms. We describe the SNB Interactive workload in detail, illustrate it with some early results, and state the goals for the two other workloads.

262 citations

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper presents a scalable sketch-based index structure that not only supports estimation of node distances but also computes the corresponding shortest paths themselves, leading to near-exact shortest-path approximations in real-world graphs.
Abstract: Computing shortest paths between two given nodes is a fundamental operation over graphs, but it is known to be nontrivial over large disk-resident instances of graph data. While a number of techniques exist for answering reachability queries and approximating node distances efficiently, determining actual shortest paths (i.e., the sequence of nodes involved) is often neglected. However, in applications arising in massive online social networks, biological networks, and knowledge graphs, it is often essential to find many, if not all, shortest paths between two given nodes. In this paper, we address this problem and present a scalable sketch-based index structure that not only supports estimation of node distances but also computes the corresponding shortest paths themselves. Generating the actual path information allows for further improvements to the estimation accuracy of distances (and paths), leading to near-exact shortest-path approximations in real-world graphs. We evaluate our techniques - implemented within a fully functional RDF graph database system - over large real-world social and biological networks of sizes ranging from tens of thousands to millions of nodes and edges. Experiments on several datasets show that we can achieve query response times providing several orders of magnitude speedup over traditional path computations while keeping the estimation errors between 0% and 1% on average.

172 citations
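The abstract's central idea, that materializing the actual landmark paths tightens distance estimates, can be sketched as follows. This is a minimal single-landmark illustration; the function names and the shortcutting rule are assumptions for illustration, not the paper's exact algorithm:

```python
from collections import deque

def bfs_tree(graph, root):
    """Parent pointers of a BFS tree from root (unweighted, undirected graph)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def tree_path(parent, u):
    """Vertices from u up to the BFS root (the landmark)."""
    path = []
    while u is not None:
        path.append(u)
        u = parent[u]
    return path

def estimate_path(sketches, u, v):
    """Concatenate u -> landmark and landmark -> v, shortcutting at the
    first vertex the two tree paths share. Keeping the paths (not just the
    distances) is what allows this tightening."""
    best = None
    for parent in sketches:
        if u not in parent or v not in parent:
            continue
        pu, pv = tree_path(parent, u), tree_path(parent, v)
        pos = {x: i for i, x in enumerate(pv)}
        for i, x in enumerate(pu):
            if x in pos:  # first common vertex, possibly before the landmark
                cand = pu[:i + 1] + pv[:pos[x]][::-1]
                break
        if best is None or len(cand) < len(best):
            best = cand
    return best  # list of vertices; len(best) - 1 is the distance estimate

# Landmark "a" is two hops from both "c" and "e"; the plain landmark
# estimate c -> a -> e has length 4, but shortcutting at the shared
# vertex "b" recovers an exact two-hop path:
graph = {"a": ["b"], "b": ["a", "c", "e"], "c": ["b", "d"],
         "d": ["c", "e"], "e": ["d", "b"]}
sketches = [bfs_tree(graph, "a")]
# estimate_path(sketches, "c", "e") == ["c", "b", "e"]
```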

Journal ArticleDOI
01 Oct 2018
TL;DR: This paper introduces the Join Order Benchmark, which works on real-life data riddled with correlations and comprises 113 complex join queries, and investigates plan enumeration techniques, comparing exhaustive dynamic programming with heuristic algorithms and finding that exhaustive enumeration improves performance despite the suboptimal cardinality estimates.
Abstract: Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark, which works on real-life data riddled with correlations and comprises 113 complex join queries. We experimentally revisit the main components in the classic query optimizer architecture using this complex, real-world data set and realistic multi-join queries. For this purpose, we describe cardinality-estimate injection and extraction techniques that allow us to compare the cardinality estimators of multiple industrial SQL implementations on equal footing, and to characterize the value of having perfect cardinality estimates. Our investigation shows that all industrial-strength cardinality estimators routinely produce large errors: though cardinality estimation using table samples solves the problem for single-table queries, there are still no techniques in industrial systems that can deal accurately with join-crossing correlated query predicates. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. We investigate plan enumeration techniques, comparing exhaustive dynamic programming with heuristic algorithms, and find that exhaustive enumeration improves performance despite the suboptimal cardinality estimates. Finally, we extend our investigation from main-memory-only processing to disk-based query processing. Here, we find that though accurate cardinality estimation should be the first priority, other aspects such as modeling random versus sequential I/O are also important for predicting query runtime.

105 citations
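The "large errors" reported in this line of work are commonly quantified with the q-error, the multiplicative factor by which an estimate deviates from the true cardinality. The metric choice is standard in this literature, though the abstract above does not name it; a minimal sketch:

```python
def q_error(estimate, actual):
    """Factor by which an estimate deviates from the true cardinality.
    1.0 is a perfect estimate; the metric treats over- and
    underestimation symmetrically."""
    if estimate <= 0 or actual <= 0:
        raise ValueError("cardinalities must be positive")
    return max(estimate / actual, actual / estimate)

# An estimator that guesses 100 rows for a true result of 10,000 rows is
# off by the same factor as one guessing 1,000,000 rows:
# q_error(100, 10_000) == q_error(1_000_000, 10_000) == 100.0
```

The symmetry is the point: a relative-error metric would score a 100x underestimate as at most 100% error, hiding exactly the mistakes that derail join ordering.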

Proceedings Article
01 Jan 2017
TL;DR: Index-based join sampling is proposed: a novel cardinality estimation technique for main-memory databases that relies on sampling and existing index structures to obtain accurate estimates, significantly improving estimation as well as overall plan quality.
Abstract: After four decades of research, today’s database systems still suffer from poor query execution plans. Bad plans are usually caused by poor cardinality estimates, which have been called the “Achilles’ heel” of modern query optimizers. In this work we propose index-based join sampling, a novel cardinality estimation technique for main-memory databases that relies on sampling and existing index structures to obtain accurate estimates. Results on a real-world data set show that this approach significantly improves estimation as well as overall plan quality. The additional sampling effort is quite low and can be configured to match the desired application profile. The technique can be easily integrated into most systems.

102 citations
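The core scale-up step behind such a sampling-based estimator can be sketched as follows. The names, and the dict standing in for an index lookup, are assumptions for illustration rather than the paper's implementation (which also propagates samples through multi-way join trees):

```python
def estimate_join_size(r_sample, r_total, s_index, key):
    """Estimate |R join S| by probing an index on S's join column with a
    uniform sample of R and scaling the match count up to |R|.

    r_sample: list of row dicts sampled uniformly from R
    r_total:  |R|, the full size of R
    s_index:  dict mapping join-key -> number of matching S rows
              (a stand-in for an index lookup that returns a match count)
    key:      name of the join attribute
    """
    if not r_sample:
        return 0.0
    matches = sum(s_index.get(row[key], 0) for row in r_sample)
    return matches * r_total / len(r_sample)

# Two sampled R rows, |R| = 100: keys 1 and 2 match 3 and 0 S rows,
# so the estimate is (3 + 0) * 100 / 2 = 150 joined rows.
# estimate_join_size([{"k": 1}, {"k": 2}], 100, {1: 3, 2: 0}, "k") == 150.0
```

Because the index answers each probe in logarithmic time, the estimate stays cheap even when S is large, which is what makes the sampling effort "quite low" in practice.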


Cited by
Proceedings ArticleDOI
27 May 2018
TL;DR: This work describes Cypher 9, the first version of the language governed by the openCypher Implementers Group; it introduces the language by example and provides a formal semantic definition of the core read-query features of Cypher, including its variant of the property graph data model.
Abstract: The Cypher property graph query language is an evolving language, originally designed and implemented as part of the Neo4j graph database, and currently used by several commercial database products and researchers. We describe Cypher 9, which is the first version of the language governed by the openCypher Implementers Group. We first introduce the language by example and describe its uses in industry. We then provide a formal semantic definition of the core read-query features of Cypher, including its variant of the property graph data model and its ASCII-art graph pattern matching mechanism for expressing subgraphs of interest to an application. We compare the features of Cypher to other property graph query languages, and describe extensions, at an advanced stage of development, which will form part of Cypher 10, turning it into a compositional language that supports graph projections and multiple named graphs.

353 citations

Posted Content
TL;DR: This work proposes a new exact method for shortest-path distance queries on large-scale networks that can handle social networks and web graphs with hundreds of millions of edges, which are two orders of magnitude larger than the limits of previous exact methods.
Abstract: We propose a new exact method for shortest-path distance queries on large-scale networks. Our method precomputes distance labels for vertices by performing a breadth-first search from every vertex. Though this seems too obvious and too inefficient at first glance, the key ingredient introduced here is pruning during the breadth-first searches. While we can still answer the correct distance for any pair of vertices from the labels, pruning surprisingly reduces both the search space and the sizes of the labels. Moreover, we show that we can perform 32 or 64 breadth-first searches simultaneously by exploiting bitwise operations. We experimentally demonstrate that the combination of these two techniques is efficient and robust on various kinds of large-scale real-world networks. In particular, our method can handle social networks and web graphs with hundreds of millions of edges, which are two orders of magnitude larger than the limits of previous exact methods, with query times comparable to those of previous methods.

278 citations
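The pruned-BFS idea described above can be sketched as follows, omitting the paper's bit-parallel 32/64-way BFS optimization; in practice the vertex order should be by decreasing degree or centrality, which is what makes the pruning effective:

```python
from collections import deque

def build_labels(graph, order):
    """Pruned landmark labeling: run a BFS from each vertex in `order`,
    but prune any vertex whose distance the labels built so far can
    already answer. Returns the 2-hop labels and a distance-query function."""
    labels = {v: {} for v in graph}  # vertex -> {landmark: distance}

    def query(u, v):
        common = labels[u].keys() & labels[v].keys()
        return min((labels[u][w] + labels[v][w] for w in common),
                   default=float("inf"))

    for root in order:
        queue = deque([(root, 0)])
        seen = {root}
        while queue:
            u, d = queue.popleft()
            if query(root, u) <= d:
                continue  # pruned: existing labels already cover this pair
            labels[u][root] = d
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append((v, d + 1))
    return labels, query

# On a path graph 1-2-3-4, processing vertices in the order [2, 3, 1, 4]
# leaves vertex 2 with the single label {2: 0}: every later BFS prunes
# before labeling it, yet all pairwise distances remain answerable.
```

The pruning test is the whole trick: skipping a vertex also cuts off its entire unvisited subtree, so label sets stay far smaller than full BFS trees while queries remain exact.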

Proceedings ArticleDOI
22 Jun 2013
TL;DR: In this article, a new exact method for shortest-path distance queries on large-scale networks is proposed; the key ingredient is pruning during breadth-first searches.
Abstract: We propose a new exact method for shortest-path distance queries on large-scale networks. Our method precomputes distance labels for vertices by performing a breadth-first search from every vertex. Though this seems too obvious and too inefficient at first glance, the key ingredient introduced here is pruning during the breadth-first searches. While we can still answer the correct distance for any pair of vertices from the labels, pruning surprisingly reduces both the search space and the sizes of the labels. Moreover, we show that we can perform 32 or 64 breadth-first searches simultaneously by exploiting bitwise operations. We experimentally demonstrate that the combination of these two techniques is efficient and robust on various kinds of large-scale real-world networks. In particular, our method can handle social networks and web graphs with hundreds of millions of edges, which are two orders of magnitude larger than the limits of previous exact methods, with query times comparable to those of previous methods.

270 citations

Proceedings ArticleDOI
27 May 2015
TL;DR: This paper describes the LDBC Social Network Benchmark (SNB) and presents database benchmarking innovation in terms of the graph query functionality tested, correlated graph generation techniques, and a scalable benchmark driver for a workload with complex graph dependencies.
Abstract: The Linked Data Benchmark Council (LDBC) has now been underway for two years and has gathered strong industrial participation for its mission to establish benchmarks and benchmarking practices for evaluating graph data management systems. The LDBC introduced a new choke-point-driven methodology for developing benchmark workloads, which combines user input with input from expert systems architects; we outline this methodology. This paper describes the LDBC Social Network Benchmark (SNB) and presents database benchmarking innovation in terms of the graph query functionality tested, correlated graph generation techniques, and a scalable benchmark driver for a workload with complex graph dependencies. SNB has three query workloads under development: Interactive, Business Intelligence, and Graph Algorithms. We describe the SNB Interactive workload in detail, illustrate it with some early results, and state the goals for the two other workloads.

262 citations

Proceedings ArticleDOI
01 Apr 2012
TL;DR: A systematic comparison of current graph database models is presented, covering general features (for data storing and querying), data modeling features (i.e., data structures, query languages, and integrity constraints), and the support for essential graph queries.
Abstract: The limitations of traditional databases, in particular the relational model, in covering the requirements of current applications have led to the development of new database technologies. Among them, graph databases are attracting the attention of the database community because, in trendy projects where a database is needed, the extraction of valuable information relies on processing the graph-like structure of the data. In this paper we present a systematic comparison of current graph database models. Our review includes general features (for data storing and querying), data modeling features (i.e., data structures, query languages, and integrity constraints), and the support for essential graph queries.

255 citations