
Showing papers by "Ulrich Meyer published in 2017"


Journal ArticleDOI
23 Aug 2017
TL;DR: In this article, a warp-synchronous programming model and warp-wide communications are used to avoid branch divergence and reduce memory usage for multisplit-based sort.
Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on Graphics Processing Units (GPUs), programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
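The sort-based workaround described above is easy to state concretely. Below is a minimal sketch (plain Python, not the paper's GPU code; the bucket_of function and the sample keys are illustrative assumptions) of the first variant: build an auxiliary array of bucket IDs, then stable-sort the data by those IDs.

```python
# Sketch of multisplit implemented via a sort of an auxiliary bucket-ID array.
# This is the baseline the abstract criticizes, not the paper's method.

def multisplit_via_sort(keys, bucket_of):
    # Auxiliary array of bucket IDs, one per input element.
    bucket_ids = [bucket_of(k) for k in keys]
    # A stable sort of the indices by bucket ID keeps the original order
    # of elements that fall into the same bucket.
    order = sorted(range(len(keys)), key=lambda i: bucket_ids[i])
    return [keys[i] for i in order]

# Example: split 8-bit keys into 4 contiguous range buckets.
keys = [200, 3, 130, 64, 255, 12, 90, 180]
print(multisplit_via_sort(keys, bucket_of=lambda k: k // 64))
# -> [3, 12, 64, 90, 130, 180, 200, 255], grouped by bucket
#    (order inside a bucket follows the input order, not key value)
```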

13 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: EM-LFR, the first external-memory algorithm able to generate massive complex networks following the LFR benchmark, is presented; evidence that both implementations yield graphs with matching properties is given by applying clustering algorithms to generated instances.
Abstract: LFR is a popular benchmark graph generator used to evaluate community detection algorithms. We present EM-LFR, the first external memory algorithm able to generate massive complex networks following the LFR benchmark. Its most expensive component is the generation of random graphs with prescribed degree sequences, which can be divided into two steps: the graphs are first materialized deterministically using the Havel-Hakimi algorithm, and then randomized. Our main contributions are EM-HH and EM-ES, two I/O-efficient external memory algorithms for these two steps. We also propose EM-CM/ES, an alternative sampling scheme using the Configuration Model and rewiring steps to obtain a random simple graph. In an experimental evaluation we demonstrate their performance; our implementation is able to handle graphs with more than 37 billion edges on a single machine, is competitive with a massively parallel distributed algorithm, and is faster than a state-of-the-art internal memory implementation even on instances fitting in main memory. EM-LFR's implementation is capable of generating large graph instances orders of magnitude faster than the original implementation. We give evidence that both implementations yield graphs with matching properties by applying clustering algorithms to generated instances. Similarly, we analyse the evolution of graph properties as EM-ES is executed on networks obtained with EM-CM/ES and find that the alternative approach can accelerate the sampling process.
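The two-step pipeline the abstract describes (deterministic materialization with Havel-Hakimi, then randomization through degree-preserving edge switches) can be illustrated with a small in-memory sketch. The Python below is only a sequential model under simplifying assumptions; EM-HH and EM-ES are I/O-efficient external-memory algorithms and do not work this way internally.

```python
import random

def havel_hakimi(degrees):
    """Return an edge list realizing `degrees`, or raise if not graphical."""
    nodes = list(range(len(degrees)))
    remaining = list(degrees)
    edges = []
    while True:
        nodes.sort(key=lambda v: -remaining[v])   # highest remaining degree first
        v = nodes[0]
        d = remaining[v]
        if d == 0:
            return edges                          # all degrees satisfied
        if d > len(nodes) - 1:
            raise ValueError("degree sequence is not graphical")
        for u in nodes[1:d + 1]:                  # connect v to the d next-highest nodes
            if remaining[u] == 0:
                raise ValueError("degree sequence is not graphical")
            edges.append((v, u))
            remaining[u] -= 1
        remaining[v] = 0

def edge_switches(edges, num_swaps, rng=random):
    """Randomize an edge list by degree-preserving edge switches,
    rejecting swaps that would create self-loops or multi-edges."""
    present = set(frozenset(e) for e in edges)
    for _ in range(num_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue                              # would create a self-loop / touch too few nodes
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                              # would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

graph = havel_hakimi([3, 3, 2, 2, 2])             # small, graphical example sequence
print(edge_switches(graph, num_swaps=100))
```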

10 citations


Journal ArticleDOI
TL;DR: This work provides a parallel model and multiple implementations for the multisplit problem, and achieves comparable performance to the fastest GPU sort routines, sorting 32-bit keys and key-value pairs with a throughput of 3.0 Gkeys/s and 2.1 Gpairs/s.
Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
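For contrast with the sort-based workarounds, a work-efficient multisplit needs only a histogram of bucket sizes, an exclusive prefix sum over the bucket counts, and a stable scatter. The sketch below is a sequential Python reference of that pattern, not the paper's warp-synchronous CUDA implementation; the function and variable names are illustrative assumptions.

```python
# Sequential reference for key-value multisplit: histogram, exclusive prefix
# sum, stable scatter. O(n) work, no sorting inside buckets.

def multisplit(keys, values, num_buckets, bucket_of):
    counts = [0] * num_buckets
    for k in keys:                      # histogram of bucket sizes
        counts[bucket_of(k)] += 1
    offsets = [0] * num_buckets         # exclusive prefix sum = bucket start positions
    for b in range(1, num_buckets):
        offsets[b] = offsets[b - 1] + counts[b - 1]
    out_keys = [None] * len(keys)
    out_vals = [None] * len(keys)
    for k, v in zip(keys, values):      # stable scatter into the next free slot of each bucket
        b = bucket_of(k)
        out_keys[offsets[b]] = k
        out_vals[offsets[b]] = v
        offsets[b] += 1
    return out_keys, out_vals

keys = [7, 42, 3, 99, 17, 64]
vals = ["a", "b", "c", "d", "e", "f"]
print(multisplit(keys, vals, num_buckets=4, bucket_of=lambda k: k // 32))
# buckets end up contiguous: 0:[7, 3, 17], 1:[42], 2:[64], 3:[99]
```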

9 citations


Journal Article
TL;DR: A number of recent results for large-scale graph generation obtained within the DFG priority programme SPP 1736 (Algorithms for Big Data) are surveyed.
Abstract: Artificially generated input graphs play an important role in algorithm engineering for systematic testing and tuning. In big data settings, however, not only processing huge graphs but also the efficient generation of appropriate test instances itself becomes challenging. In this context we survey a number of recent results for large-scale graph generation obtained within the DFG priority programme SPP 1736 (Algorithms for Big Data).

5 citations


Posted Content
05 Jan 2017
TL;DR: In this article, a warp-synchronous programming model and warp-wide communications are used to avoid branch divergence and reduce memory usage for the multisplit problem on GPUs.
Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
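The final claim, using multisplit as the building block of a radix sort, amounts to repeatedly applying a stable multisplit to successive digit groups of the key. The Python sketch below assumes 32-bit keys and 8 bits (256 buckets) per pass, matching the paper's bucket range; it is a sequential model, not the paper's GPU kernels.

```python
# LSD radix sort expressed as four stable multisplit passes over 8-bit digits.

def stable_multisplit(keys, bucket_of, num_buckets=256):
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:                       # stable: input order preserved within a bucket
        buckets[bucket_of(k)].append(k)
    return [k for b in buckets for k in b]

def radix_sort_32(keys, bits_per_pass=8):
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, 32, bits_per_pass):   # least-significant digit first
        keys = stable_multisplit(keys, lambda k: (k >> shift) & mask,
                                 num_buckets=mask + 1)
    return keys

print(radix_sort_32([0xDEADBEEF, 42, 7, 0x1000000, 99, 0xFFFFFFFF]))
# -> [7, 42, 99, 16777216, 3735928559, 4294967295]
```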

2 citations


01 Aug 2017
TL;DR: In this article, an extended study of multisplit for a small (up to 256) number of buckets is presented, where warp-synchronous programming is used to avoid branch divergence and reduce memory usage.
Abstract: GPU Multisplit: an extended study of a parallel algorithm. Saman Ashkiani (University of California, Davis), Andrew Davidson (University of California, Davis), Ulrich Meyer (Goethe-Universität Frankfurt am Main), John D. Owens (University of California, Davis).
Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
CCS Concepts: • Computing methodologies → Parallel algorithms; • Computer systems organization → Single instruction, multiple data; • Theory of computation → Shared memory algorithms.
Additional Key Words and Phrases: Graphics Processing Unit (GPU), multisplit, bucketing, warp-synchronous programming, radix sort, histogram, shuffle, ballot.
ACM Reference format: Saman Ashkiani, Andrew Davidson, Ulrich Meyer, and John D. Owens. 2017. GPU Multisplit: an extended study of a parallel algorithm. ACM Trans. Parallel Comput. 9, 4, Article 39 (September 2017), 44 pages.
Introduction: This paper studies the multisplit primitive for GPUs. Multisplit divides a set of items (keys or key-value pairs) into contiguous buckets, where each bucket contains items whose keys satisfy a programmer-specified criterion (such as falling into a particular range). Multisplit is broadly useful in a wide range of applications, some of which we will cite later in this introduction. This paper is an extended version of initial results published at PPoPP 2016 [3]; the source code is available at https://github.com/owensgroup/GpuMultisplit.
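The warp-synchronous idea named in the abstract and keywords (ballot, shuffle) can be illustrated without a GPU: within a 32-lane warp, a ballot over "is your bucket equal to mine?" gives every lane a bitmask from which it reads its bucket's warp-local count and its own rank via popcounts. The Python below merely simulates that mechanism under simplified assumptions and is not the authors' CUDA code.

```python
# Simulation of warp-wide bucket ranking via ballot + popcount, the pattern
# that avoids branch divergence and shared-memory traffic on real hardware.
WARP_SIZE = 32

def ballot(predicates):
    """Simulate a warp ballot: bit i is set iff lane i's predicate holds."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

def warp_multisplit_ranks(lane_buckets):
    ranks, counts = [], []
    for lane, my_bucket in enumerate(lane_buckets):
        same = ballot([b == my_bucket for b in lane_buckets])
        counts.append(bin(same).count("1"))                      # bucket size within the warp
        ranks.append(bin(same & ((1 << lane) - 1)).count("1"))   # same-bucket lanes below me
    return ranks, counts

# Example: 32 lanes, each holding a key that falls into one of 4 buckets.
lane_buckets = [lane % 4 for lane in range(WARP_SIZE)]
ranks, counts = warp_multisplit_ranks(lane_buckets)
print(ranks[:8], counts[:8])   # each bucket holds 8 of the 32 lanes
```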

1 citation