
Showing papers by "Ulrich Meyer published in 2017"


Journal ArticleDOI
23 Aug 2017
TL;DR: In this article, a warp-synchronous programming model and warp-wide communications are used to avoid branch divergence and reduce memory usage for multisplit-based sort.
Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on Graphics Processing Units (GPUs), programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
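The sort-based workaround described above is easy to state concretely. Below is a minimal sketch (plain Python, not the paper's GPU code; the bucket_of function and the sample keys are illustrative assumptions) of the first variant: build an auxiliary array of bucket IDs, then stable-sort the data by those IDs.

```python
# Sketch of multisplit implemented via a sort of an auxiliary bucket-ID array.
# This is the baseline the abstract criticizes, not the paper's method.

def multisplit_via_sort(keys, bucket_of):
    # Auxiliary array of bucket IDs, one per input element.
    bucket_ids = [bucket_of(k) for k in keys]
    # A stable sort of the indices by bucket ID keeps the original order
    # of elements that fall into the same bucket.
    order = sorted(range(len(keys)), key=lambda i: bucket_ids[i])
    return [keys[i] for i in order]

# Example: split 8-bit keys into 4 contiguous range buckets.
keys = [200, 3, 130, 64, 255, 12, 90, 180]
print(multisplit_via_sort(keys, bucket_of=lambda k: k // 64))
# -> [3, 12, 64, 90, 130, 180, 200, 255], grouped by bucket
#    (order inside a bucket follows the input order, not key value)
```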

13 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: EM-LFR, the first external-memory algorithm able to generate massive complex networks following the LFR benchmark, is presented; evidence that both implementations yield graphs with matching properties is given by applying clustering algorithms to generated instances.
Abstract: LFR is a popular benchmark graph generator used to evaluate community detection algorithms. We present EM-LFR, the first external memory algorithm able to generate massive complex networks following the LFR benchmark. Its most expensive component is the generation of random graphs with prescribed degree sequences, which can be divided into two steps: the graphs are first materialized deterministically using the Havel-Hakimi algorithm, and then randomized. Our main contributions are EM-HH and EM-ES, two I/O-efficient external memory algorithms for these two steps. We also propose EM-CM/ES, an alternative sampling scheme using the Configuration Model and rewiring steps to obtain a random simple graph. In an experimental evaluation we demonstrate their performance; our implementation is able to handle graphs with more than 37 billion edges on a single machine, is competitive with a massively parallel distributed algorithm, and is faster than a state-of-the-art internal memory implementation even on instances fitting in main memory. EM-LFR's implementation is capable of generating large graph instances orders of magnitude faster than the original implementation. We give evidence that both implementations yield graphs with matching properties by applying clustering algorithms to generated instances. Similarly, we analyse the evolution of graph properties as EM-ES is executed on networks obtained with EM-CM/ES and find that the alternative approach can accelerate the sampling process.
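The two-step pipeline the abstract describes (deterministic materialization with Havel-Hakimi, then randomization through degree-preserving edge switches) can be illustrated with a small in-memory sketch. The Python below is only a sequential model under simplifying assumptions; EM-HH and EM-ES are I/O-efficient external-memory algorithms and do not work this way internally.

```python
import random

def havel_hakimi(degrees):
    """Return an edge list realizing `degrees`, or raise if not graphical."""
    nodes = list(range(len(degrees)))
    remaining = list(degrees)
    edges = []
    while True:
        nodes.sort(key=lambda v: -remaining[v])   # highest remaining degree first
        v = nodes[0]
        d = remaining[v]
        if d == 0:
            return edges                          # all degrees satisfied
        if d > len(nodes) - 1:
            raise ValueError("degree sequence is not graphical")
        for u in nodes[1:d + 1]:                  # connect v to the d next-highest nodes
            if remaining[u] == 0:
                raise ValueError("degree sequence is not graphical")
            edges.append((v, u))
            remaining[u] -= 1
        remaining[v] = 0

def edge_switches(edges, num_swaps, rng=random):
    """Randomize an edge list by degree-preserving edge switches,
    rejecting swaps that would create self-loops or multi-edges."""
    present = set(frozenset(e) for e in edges)
    for _ in range(num_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue                              # would create a self-loop / touch too few nodes
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                              # would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

graph = havel_hakimi([3, 3, 2, 2, 2])             # small, graphical example sequence
print(edge_switches(graph, num_swaps=100))
```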

10 citations


Journal ArticleDOI
TL;DR: This work provides a parallel model and multiple implementations for the multisplit problem, and achieves comparable performance to the fastest GPU sort routines, sorting 32-bit keys and key-value pairs with a throughput of 3.0 Gkeys/s and 2.1 Gpairs/s.
Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
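For contrast with the sort-based workarounds, a work-efficient multisplit needs only a histogram of bucket sizes, an exclusive prefix sum over the bucket counts, and a stable scatter. The sketch below is a sequential Python reference of that pattern, not the paper's warp-synchronous CUDA implementation; the function and variable names are illustrative assumptions.

```python
# Sequential reference for key-value multisplit: histogram, exclusive prefix
# sum, stable scatter. O(n) work, no sorting inside buckets.

def multisplit(keys, values, num_buckets, bucket_of):
    counts = [0] * num_buckets
    for k in keys:                      # histogram of bucket sizes
        counts[bucket_of(k)] += 1
    offsets = [0] * num_buckets         # exclusive prefix sum = bucket start positions
    for b in range(1, num_buckets):
        offsets[b] = offsets[b - 1] + counts[b - 1]
    out_keys = [None] * len(keys)
    out_vals = [None] * len(keys)
    for k, v in zip(keys, values):      # stable scatter into the next free slot of each bucket
        b = bucket_of(k)
        out_keys[offsets[b]] = k
        out_vals[offsets[b]] = v
        offsets[b] += 1
    return out_keys, out_vals

keys = [7, 42, 3, 99, 17, 64]
vals = ["a", "b", "c", "d", "e", "f"]
print(multisplit(keys, vals, num_buckets=4, bucket_of=lambda k: k // 32))
# buckets end up contiguous: 0:[7, 3, 17], 1:[42], 2:[64], 3:[99]
```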

9 citations


Journal Article
TL;DR: A number of recent results for large-scale graph generation obtained within the DFG priority programme SPP 1736 (Algorithms for Big Data) are surveyed.
Abstract: Artificially generated input graphs play an important role in algorithm engineering for systematic testing and tuning. In big data settings, however, not only processing huge graphs but also the efficient generation of appropriate test instances itself becomes challenging. In this context we survey a number of recent results for large-scale graph generation obtained within the DFG priority programme SPP 1736 (Algorithms for Big Data).

5 citations


Posted Content
05 Jan 2017
TL;DR: In this article, a warp-synchronous programming model and warp-wide communications are used to avoid branch divergence and reduce memory usage for the multisplit problem on GPUs.
Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
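The final claim, using multisplit as the building block of a radix sort, amounts to repeatedly applying a stable multisplit to successive digit groups of the key. The Python sketch below assumes 32-bit keys and 8 bits (256 buckets) per pass, matching the paper's bucket range; it is a sequential model, not the paper's GPU kernels.

```python
# LSD radix sort expressed as four stable multisplit passes over 8-bit digits.

def stable_multisplit(keys, bucket_of, num_buckets=256):
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:                       # stable: input order preserved within a bucket
        buckets[bucket_of(k)].append(k)
    return [k for b in buckets for k in b]

def radix_sort_32(keys, bits_per_pass=8):
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, 32, bits_per_pass):   # least-significant digit first
        keys = stable_multisplit(keys, lambda k: (k >> shift) & mask,
                                 num_buckets=mask + 1)
    return keys

print(radix_sort_32([0xDEADBEEF, 42, 7, 0x1000000, 99, 0xFFFFFFFF]))
# -> [7, 42, 99, 16777216, 3735928559, 4294967295]
```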

2 citations


01 Aug 2017
TL;DR: In this article, an extended study of multisplit for a small (up to 256) number of buckets is presented, where warp-synchronous programming is used to avoid branch divergence and reduce memory usage.
Abstract: GPU Multisplit: an extended study of a parallel algorithm. Saman Ashkiani (University of California, Davis), Andrew Davidson (University of California, Davis), Ulrich Meyer (Goethe-Universität Frankfurt am Main), John D. Owens (University of California, Davis).
Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. One way is to first generate an auxiliary array of bucket IDs and then sort input data based on it. In case smaller indexed buckets possess smaller valued keys, another way for multisplit is to directly sort input data. Both methods are inefficient and require more work than necessary: the former requires more expensive data movements while the latter spends unnecessary effort in sorting elements within each bucket. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small (up to 256) number of buckets. We use warp-synchronous programming models and emphasize warp-wide communications to avoid branch divergence and reduce memory usage. We also hierarchically reorder input elements to achieve better coalescing of global memory accesses. On a GeForce GTX 1080 GPU, we can reach a peak throughput of 18.93 Gkeys/s (or 11.68 Gpairs/s) for a key-only (or key-value) multisplit. Finally, we demonstrate how multisplit can be used as a building block for radix sort. In our multisplit-based sort implementation, we achieve comparable performance to the fastest GPU sort routines, sorting 32-bit keys (and key-value pairs) with a throughput of 3.0 Gkeys/s (and 2.1 Gpairs/s).
CCS Concepts: • Computing methodologies → Parallel algorithms; • Computer systems organization → Single instruction, multiple data; • Theory of computation → Shared memory algorithms.
Additional Key Words and Phrases: Graphics Processing Unit (GPU), multisplit, bucketing, warp-synchronous programming, radix sort, histogram, shuffle, ballot.
ACM Reference format: Saman Ashkiani, Andrew Davidson, Ulrich Meyer, and John D. Owens. 2017. GPU Multisplit: an extended study of a parallel algorithm. ACM Trans. Parallel Comput. 9, 4, Article 39 (September 2017), 44 pages.
Introduction: This paper studies the multisplit primitive for GPUs. Multisplit divides a set of items (keys or key-value pairs) into contiguous buckets, where each bucket contains items whose keys satisfy a programmer-specified criterion (such as falling into a particular range). Multisplit is broadly useful in a wide range of applications, some of which we will cite later in this introduction. This paper is an extended version of initial results published at PPoPP 2016 [3]; the source code is available at https://github.com/owensgroup/GpuMultisplit.
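The warp-synchronous idea named in the abstract and keywords (ballot, shuffle) can be illustrated without a GPU: within a 32-lane warp, a ballot over "is your bucket equal to mine?" gives every lane a bitmask from which it reads its bucket's warp-local count and its own rank via popcounts. The Python below merely simulates that mechanism under simplified assumptions and is not the authors' CUDA code.

```python
# Simulation of warp-wide bucket ranking via ballot + popcount, the pattern
# that avoids branch divergence and shared-memory traffic on real hardware.
WARP_SIZE = 32

def ballot(predicates):
    """Simulate a warp ballot: bit i is set iff lane i's predicate holds."""
    mask = 0
    for lane, p in enumerate(predicates):
        if p:
            mask |= 1 << lane
    return mask

def warp_multisplit_ranks(lane_buckets):
    ranks, counts = [], []
    for lane, my_bucket in enumerate(lane_buckets):
        same = ballot([b == my_bucket for b in lane_buckets])
        counts.append(bin(same).count("1"))                      # bucket size within the warp
        ranks.append(bin(same & ((1 << lane) - 1)).count("1"))   # same-bucket lanes below me
    return ranks, counts

# Example: 32 lanes, each holding a key that falls into one of 4 buckets.
lane_buckets = [lane % 4 for lane in range(WARP_SIZE)]
ranks, counts = warp_multisplit_ranks(lane_buckets)
print(ranks[:8], counts[:8])   # each bucket holds 8 of the 32 lanes
```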

1 citation