Journal ArticleDOI

Multi-core, main-memory joins: sort vs. hash revisited

TL;DR: The experiments show that, contrary to claims that sort-merge is now the better choice, radix-hash join is still clearly superior, and sort-merge approaches the performance of radix-hash only when very large amounts of data are involved.
Abstract: In this paper we experimentally study the performance of main-memory, parallel, multi-core join algorithms, focusing on sort-merge and (radix-)hash join. The relative performance of these two join approaches has been a topic of discussion for a long time. With the advent of modern multi-core architectures, it has been argued that sort-merge join is now a better choice than radix-hash join. This claim is justified based on the width of SIMD instructions (sort-merge outperforms radix-hash join once SIMD is sufficiently wide), and NUMA awareness (sort-merge is superior to hash join in NUMA architectures). We conduct extensive experiments on the original and optimized versions of these algorithms. The experiments show that, contrary to these claims, radix-hash join is still clearly superior, and sort-merge approaches the performance of radix-hash only when very large amounts of data are involved. The paper also provides the fastest implementations of these algorithms, and covers many aspects of modern hardware architectures relevant not only for joins but for any parallel data processing operator.
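To make the comparison concrete, the sketch below shows the two phases that give radix-hash join its name: one radix pass that scatters both inputs into cache-sized partitions, followed by a build and probe of a small hash table per partition. It is a minimal, single-threaded illustration; the constants, the hash function, and all names are assumptions, not the paper's multi-threaded, NUMA-aware implementation.

```cpp
// Minimal, single-threaded sketch of a radix-partitioned hash join.
// Illustrative only: the paper's optimized implementations are multi-threaded,
// NUMA-aware, and tuned to TLB and cache sizes.
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

constexpr unsigned kRadixBits = 6;               // 64 partitions (assumed constant)
constexpr unsigned kFanout    = 1u << kRadixBits;

static uint32_t part_of(uint32_t key) { return key & (kFanout - 1); }

// One radix pass: scatter tuples into partitions small enough to fit in cache.
static std::vector<std::vector<Tuple>> radix_partition(const std::vector<Tuple>& in) {
    std::vector<std::vector<Tuple>> parts(kFanout);
    for (const Tuple& t : in) parts[part_of(t.key)].push_back(t);
    return parts;
}

// Join matching partitions with a tiny chained hash table built per partition.
static std::vector<std::pair<Tuple, Tuple>> radix_hash_join(const std::vector<Tuple>& R,
                                                            const std::vector<Tuple>& S) {
    auto pr = radix_partition(R);
    auto ps = radix_partition(S);
    std::vector<std::pair<Tuple, Tuple>> out;
    for (unsigned p = 0; p < kFanout; ++p) {
        std::vector<std::vector<Tuple>> table(2 * pr[p].size() + 1);   // build side: R
        auto slot = [&](uint32_t k) { return (k * 0x9E3779B1u) % table.size(); };
        for (const Tuple& r : pr[p]) table[slot(r.key)].push_back(r);
        for (const Tuple& s : ps[p])                                   // probe side: S
            for (const Tuple& r : table[slot(s.key)])
                if (r.key == s.key) out.push_back({r, s});
    }
    return out;
}

int main() {
    std::vector<Tuple> R = {{1, 10}, {2, 20}, {7, 70}};
    std::vector<Tuple> S = {{2, 200}, {7, 700}, {9, 900}};
    std::cout << radix_hash_join(R, S).size() << " matches\n";         // prints "2 matches"
}
```

The point of the radix pass is that each per-partition hash table fits in cache, so the random accesses of the probe phase stay cheap.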


Citations
Journal ArticleDOI
TL;DR: This survey aims to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks.
Abstract: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bounded disk-based systems. Some issues such as fault-tolerance and consistency are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as their data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technologies in memory management, and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.

391 citations

Proceedings ArticleDOI
18 Jun 2014
TL;DR: The morsel-driven query execution framework is presented, where scheduling becomes a fine-grained, NUMA-aware run-time task and the degree of parallelism is not baked into the plan but can change elastically during query execution, so the dispatcher can react to the execution speed of different morsels and adjust resources dynamically in response to newly arriving queries in the workload.
Abstract: With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for plan-driven parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA). In response, we present the morsel-driven query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (morsels) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to the execution speed of different morsels and also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of data locality of the NUMA-local morsels and operator state, such that the great majority of executions takes place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.
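As a flavor of the scheduling idea, here is a minimal sketch in which worker threads repeatedly pull a small fragment of the input (a morsel) from a shared cursor and run a complete pipeline over it. The morsel size and all names are assumptions; the real dispatcher additionally tracks NUMA locality and adapts the set of workers.

```cpp
// Minimal sketch of morsel-driven dispatch: workers repeatedly claim a small
// fragment ("morsel") of the input and run the whole operator pipeline on it.
// Illustrative only; the actual dispatcher is NUMA-aware and elastic.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t kMorselSize = 10000;   // tuples per morsel (assumed)

void run_pipeline_parallel(std::size_t input_size,
                           const std::function<void(std::size_t, std::size_t)>& pipeline,
                           unsigned num_threads) {
    std::atomic<std::size_t> next{0};        // shared cursor acting as the dispatcher
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                std::size_t begin = next.fetch_add(kMorselSize);
                if (begin >= input_size) break;                         // no morsels left
                std::size_t end = std::min(begin + kMorselSize, input_size);
                pipeline(begin, end);        // run the full pipeline on [begin, end)
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Because parallelism lives entirely in the dispatch loop rather than in the plan, the number of workers can differ from query to query without re-planning.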

243 citations

Proceedings ArticleDOI
07 Dec 2013
TL;DR: Widx is introduced, an on-chip accelerator for database hash index lookups, which achieves both high performance and flexibility by decoupling key hashing from the list traversal, and processing multiple keys in parallel on a set of programmable walker units.
Abstract: The explosive growth in digital data and its growing role in real-time decision support motivate the design of high-performance database management systems (DBMSs). Meanwhile, slowdown in supply voltage scaling has stymied improvements in core performance and ushered in an era of power-limited chips. These developments motivate the design of DBMS accelerators that (a) maximize utility by accelerating the dominant operations, and (b) provide flexibility in the choice of DBMS, data layout, and data types. We study data analytics workloads on contemporary in-memory databases and find hash index lookups to be the largest single contributor to the overall execution time. The critical path in hash index lookups consists of ALU-intensive key hashing followed by pointer chasing through a node list. Based on these observations, we introduce Widx, an on-chip accelerator for database hash index lookups, which achieves both high performance and flexibility by (1) decoupling key hashing from the list traversal, and (2) processing multiple keys in parallel on a set of programmable walker units. Widx reduces design cost and complexity through its tight integration with a conventional core, thus eliminating the need for a dedicated TLB and cache. An evaluation of Widx on a set of modern data analytics workloads (TPC-H, TPC-DS) using full-system simulation shows an average speedup of 3.1× over an aggressive OoO core on bulk hash table operations, while reducing the OoO core energy by 83%.
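The critical path described above, ALU-heavy hashing followed by a pointer chase through a node list, can be mimicked in software by batching lookups so that hashing is separated from traversal, loosely analogous to Widx's decoupled hashing unit and parallel walkers. The sketch below is a software analogue under those stated assumptions, not the accelerator itself.

```cpp
// Software analogue of the hash-index lookup critical path: ALU-heavy key
// hashing followed by pointer chasing through a chained bucket. Batching the
// keys separates the hash phase from the traversal phase, loosely analogous to
// Widx's decoupled hashing and parallel walkers. Illustrative only.
#include <cstdint>
#include <vector>

struct Node { uint64_t key; uint64_t value; Node* next; };

struct HashIndex {
    std::vector<Node*> buckets;                       // chained buckets
    uint64_t hash(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ull) % buckets.size(); }
};

// Phase 1: hash a whole batch of keys (pure ALU work, no cache misses).
// Phase 2: walk the node lists; because the probes are independent, an OoO
// core (or a set of hardware walkers) can overlap several traversals.
void batched_lookup(const HashIndex& idx, const std::vector<uint64_t>& keys,
                    std::vector<const Node*>& out) {
    std::vector<uint64_t> slots(keys.size());
    for (size_t i = 0; i < keys.size(); ++i) slots[i] = idx.hash(keys[i]);   // phase 1
    out.assign(keys.size(), nullptr);
    for (size_t i = 0; i < keys.size(); ++i) {                               // phase 2
        for (const Node* n = idx.buckets[slots[i]]; n != nullptr; n = n->next)
            if (n->key == keys[i]) { out[i] = n; break; }
    }
}
```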

198 citations

Proceedings ArticleDOI
27 May 2015
TL;DR: This paper presents novel vectorized designs and implementations of database operators, based on advanced SIMD operations, such as gathers and scatters, and highlights the impact of efficient vectorization on the algorithmic design of in-memorydatabase operators, as well as the architectural design and power efficiency of hardware.
Abstract: Analytical databases are continuously adapting to the underlying hardware in order to saturate all sources of parallelism. At the same time, hardware evolves in multiple directions to explore different trade-offs. The MIC architecture, one such example, strays from the mainstream CPU design by packing a larger number of simpler cores per chip, relying on SIMD instructions to fill the performance gap. Databases have been attempting to utilize the SIMD capabilities of CPUs. However, mainstream CPUs have only recently adopted wider SIMD registers and more advanced instructions, since they do not rely primarily on SIMD for efficiency. In this paper, we present novel vectorized designs and implementations of database operators, based on advanced SIMD operations, such as gathers and scatters. We study selections, hash tables, and partitioning; and combine them to build sorting and joins. Our evaluation on the MIC-based Xeon Phi co-processor as well as the latest mainstream CPUs shows that our vectorization designs are up to an order of magnitude faster than the state-of-the-art scalar and vector approaches. Also, we highlight the impact of efficient vectorization on the algorithmic design of in-memory database operators, as well as the architectural design and power efficiency of hardware, by making simple cores comparably fast to complex cores. This work is applicable to CPUs and co-processors with advanced SIMD capabilities, using either many simple cores or fewer complex cores.
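As a flavor of gather-based vectorization, the sketch below probes eight keys of an open-addressing hash table at once with AVX2 intrinsics. The multiplicative hash, table layout, and function name are assumptions for illustration; the paper's implementations target wider SIMD (e.g. on the Xeon Phi) and also cover scatters, selections, partitioning, sorting, and joins.

```cpp
// Sketch of a vectorized hash-table probe with AVX2 gathers: eight probe keys
// are hashed, their candidate slots are fetched in one gather, and matches are
// returned as a bitmask. Open-addressing layout and hash are assumptions; no
// collision handling (linear probing or chains) is shown.
#include <immintrin.h>
#include <cstdint>

// table_keys: array of table_size slot keys, table_size a power of two.
// keys:       eight probe keys. Bit i of the result is set if keys[i] hits.
int probe8_avx2(const int32_t* table_keys, uint32_t table_size, const int32_t* keys) {
    __m256i k    = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(keys));
    __m256i h    = _mm256_mullo_epi32(k, _mm256_set1_epi32(static_cast<int32_t>(0x9E3779B1u)));
    __m256i slot = _mm256_and_si256(h, _mm256_set1_epi32(static_cast<int32_t>(table_size - 1)));
    __m256i cand = _mm256_i32gather_epi32(table_keys, slot, 4);   // 8 random reads, one instruction
    __m256i eq   = _mm256_cmpeq_epi32(cand, k);
    return _mm256_movemask_ps(_mm256_castsi256_ps(eq));
}
```

Compile with AVX2 enabled (e.g. -mavx2); the gather turns eight random reads into a single instruction, which is exactly the access pattern that dominates hash-based operators.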

193 citations


Cites background, methods, or results from "Multi-core, main-memory joins: sort..."

  • ...The former is dominated by sorting [4, 14]....

    [...]

  • ...evaluated join variants on multiple CPUs [1, 4]....

    [...]

  • ...The scalar code for buffered shuffling is thoroughly described in previous work [4, 26]....

    [...]

  • ...Thread parallelism is achieved, for individual operators, by splitting the input equally among threads [3, 4, 5, 8, 14, 31, 40], and in the case of queries that combine multiple operators, by using the pipeline breaking points of the query plan to split the materialized data in chunks that are distributed to threads dynamically [18, 28]....

    [...]

  • ...For example, join and aggregation operators can use hash partitioning to split the input into small partitions that are distributed among threads and now fit in the cache [3, 4, 5, 14, 19, 26]....

    [...]

Journal ArticleDOI
TL;DR: This article covers the design and implementation of modern column-oriented database systems, with a specific focus on three influential research prototypes, MonetDB, C-Store, and X100, which form the basis for several well-known commercial column-store implementations.
Abstract: Database system performance is directly related to the efficiency of the system at storing data on primary storage (for example, disk) and moving it into CPU registers for processing. For this reason, there is a long history in the database community of research exploring physical storage alternatives, including sophisticated indexing, materialized views, and vertical and horizontal partitioning. In recent years, there has been renewed interest in so-called column-oriented systems, sometimes also called column-stores. Column-store systems completely vertically partition a database into a collection of individual columns that are stored separately. By storing each column separately on disk, these column-based systems enable queries to read just the attributes they need, rather than having to read entire rows from disk and discard unneeded attributes once they are in memory. The Design and Implementation of Modern Column-Oriented Database Systems discusses modern column-stores, their architecture and evolution as well as the benefits they can bring in data analytics. There is a specific focus on three influential research prototypes, MonetDB, MonetDB/X100, and C-Store. These systems have formed the basis for several well-known commercial column-store implementations. Their similarities and differences are described and they are discussed in terms of their specific architectural features for compression, late materialization, join processing, vectorization and adaptive indexing (database cracking). The Design and Implementation of Modern Column-Oriented Database Systems is an excellent reference on the topic for database researchers and practitioners.
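A toy contrast makes the storage-layout point concrete; the schema and names below are invented for illustration and are not tied to any of the systems named above.

```cpp
// Toy contrast between row-oriented and column-oriented layouts: the columnar
// scan touches only the bytes of the one attribute it needs. Illustrative only;
// real column stores add compression, vectorized execution, late
// materialization, and more.
#include <cstdint>
#include <vector>

struct Row { int32_t id; int32_t price; int32_t quantity; };   // row store: whole tuples together

struct ColumnTable {                                           // column store: one array per attribute
    std::vector<int32_t> id, price, quantity;
};

int64_t sum_price_rows(const std::vector<Row>& t) {
    int64_t s = 0;
    for (const Row& r : t) s += r.price;     // drags id and quantity through the cache as well
    return s;
}

int64_t sum_price_columns(const ColumnTable& t) {
    int64_t s = 0;
    for (int32_t p : t.price) s += p;        // reads just the price column
    return s;
}
```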

190 citations
