Proceedings ArticleDOI
Low-Span Parallel Algorithms for the Binary-Forking Model
Zafar Ahmad,Rezaul Chowdhury,Rathish Das,Pramod Ganapathi,Aaron Gregory,Mohammad Mahdi Javanmard +5 more
- pp 22-34
TLDR
This paper proposes a randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work for the binary-forking model.
Abstract:
The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes the task challenging. In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blow-up in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT). Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a log log n-factor blow-up in work, as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).
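The Θ(log n) spawn cost described in the abstract can be illustrated with a small sketch. It is not taken from the paper: the function name `binary_fork_for` and the use of Python threads are assumptions made only for illustration. Each call forks at most one child, so launching n tasks takes a chain of about log2(n) forks.

```python
import threading

def binary_fork_for(lo, hi, body):
    """Run body(i) for every i in [lo, hi) using only binary forks.

    Hypothetical helper, not the paper's code. Returns the longest chain
    of forks on any path, which is ceil(log2(hi - lo)) for hi - lo >= 1.
    """
    if hi - lo <= 0:
        return 0
    if hi - lo == 1:
        body(lo)              # a single task needs no fork
        return 0
    mid = (lo + hi) // 2
    child_depth = []
    # Fork exactly one child for the left half ...
    child = threading.Thread(
        target=lambda: child_depth.append(binary_fork_for(lo, mid, body)))
    child.start()
    # ... while the current thread continues with the right half.
    own_depth = binary_fork_for(mid, hi, body)
    child.join()              # synchronize with the forked child
    return 1 + max(own_depth, child_depth[0])

if __name__ == "__main__":
    squares = [0] * 8
    depth = binary_fork_for(0, 8, lambda i: squares.__setitem__(i, i * i))
    print(depth, squares)
```

With 8 tasks the returned fork depth is 3, whereas a PRAM-style analysis would charge nothing for launching the same 8 tasks.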
Citations
Proceedings ArticleDOI
Automatic HBM Management: Models and Algorithms
Daniel DeLayo,Kenny Zhang,Kunal Agrawal,Michael L. Bender,Jonathan W. Berry,Rathish Das,Benjamin Moseley,Cynthia A. Phillips +7 more
TL;DR: This paper evaluated algorithms for managing High-Bandwidth Memory automatically, chose a theoretical model, validated it against real hardware, and implemented a basic simulator to determine theoretically and empirically whether investment in priority-based DRAM controller hardware can be justified.
Proceedings ArticleDOI
High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems
TL;DR: In this paper, the authors revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface that can easily be extended to two related problems, histogram and collect-reduce.
Proceedings ArticleDOI
Optimal Parallel Sorting with Comparison Errors
Michael T. Goodrich,Riko Jacob +1 more
TL;DR: In this article, a comparison-based parallel algorithm for sorting n comparable items subject to comparison errors is presented, which achieves a small maximum dislocation and a small total dislocation of the elements in the output permutation.
Journal ArticleDOI
A Work-Efficient Parallel Algorithm for Longest Increasing Subsequence
TL;DR: This paper proposes a parallel LIS algorithm that costs O(n log k) work, Õ(k) span, and O(n) space, and is much simpler than the previous parallel LIS algorithms.
References
Journal ArticleDOI
An algorithm for the machine calculation of complex Fourier series
J.W. Cooley,John W. Tukey +1 more
TL;DR: Good generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series, applicable to certain problems in which one must multiply an N-vector by an N X N matrix which can be factored into m sparse matrices.
Journal ArticleDOI
Gaussian elimination is not optimal
TL;DR: Strassen gives an algorithm that computes the coefficients of the product of two square matrices A and B of order n with fewer than 4.7 · n^(log 7) arithmetical operations (all logarithms in this paper are to base 2).
Proceedings ArticleDOI
The implementation of the Cilk-5 multithreaded language
TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.
Journal ArticleDOI
Parallel merge sort
TL;DR: A parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time; the constant in the running time is small.
Journal ArticleDOI
On computing the discrete Fourier transform
TL;DR: New algorithms for computing the Discrete Fourier Transform of n points are described, which use substantially fewer multiplications than the best algorithm previously known, and about the same number of additions.