Proceedings ArticleDOI

Brief announcement: efficient cache oblivious algorithms for randomized divide-and-conquer on the multicore model

TL;DR: This paper presents a cache-oblivious framework for randomized divide-and-conquer algorithms on the multicore model with private caches, where the number of processors, the size of an individual cache memory, and the block size are fixed parameters.
Abstract: In this paper we present a cache-oblivious framework for randomized divide and conquer algorithms on the multicore model with private caches. We first derive an O(n/p log n + log n log log n) expected parallel depth algorithm for sorting n numbers with expected O(n/B log_M n) cache misses, where p, M and B denote the number of processors, the size of an individual cache memory and the block size, respectively. Although similar results have been obtained recently for sorting, we feel that our approach is simpler and more general, and we apply it to obtain an algorithm for 3D convex hulls with similar bounds. We also present a simple randomized processor allocation technique without the explicit knowledge of the number of processors that is likely to find additional applications in resource oblivious environments.
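At its core, the sorting step of such a randomized divide-and-conquer framework is a sample sort: pick random splitters, partition the input into buckets, and recurse, with the buckets processed in parallel. Below is a minimal serial sketch for illustration only — `fanout` and `cutoff` are arbitrary parameters, distinct keys are assumed, and none of the paper's cache-oblivious or processor-allocation machinery is shown:

```python
import random

def sample_sort(keys, fanout=4, cutoff=32):
    """Randomized divide-and-conquer sort: choose random splitters,
    partition into buckets, recurse on each bucket.
    Assumes distinct keys for simplicity."""
    if len(keys) <= cutoff:
        return sorted(keys)  # base case: small inputs sorted directly
    # choose fanout-1 random splitters and sort them
    splitters = sorted(random.sample(keys, fanout - 1))
    buckets = [[] for _ in range(fanout)]
    for k in keys:
        # place k in the first bucket whose splitter exceeds it
        i = 0
        while i < fanout - 1 and k >= splitters[i]:
            i += 1
        buckets[i].append(k)
    out = []
    for b in buckets:  # in the parallel setting, each bucket
        out.extend(sample_sort(b, fanout, cutoff))  # is sorted concurrently
    return out
```

Random splitters keep the expected bucket sizes balanced, which is what drives the expected depth and cache-miss bounds in analyses of this kind.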
Citations
Journal ArticleDOI
TL;DR: In this article, the authors present two cache-oblivious sorting-based convex hull algorithms in the Binary Forking Model; one achieves $O(n)$ work, $O(\log n)$ span, and $O(n/B)$ serial cache complexity, where $B$ is the cache line size.
Abstract: We present two cache-oblivious sorting-based convex hull algorithms in the Binary Forking Model. The first is an algorithm for a presorted set of points which achieves $O(n)$ work, $O(\log n)$ span, and $O(n/B)$ serial cache complexity, where $B$ is the cache line size. These are all optimal worst-case bounds for cache-oblivious algorithms in the Binary Forking Model. The second adapts Cole and Ramachandran's cache-oblivious sorting algorithm, matching its properties including achieving $O(n \log n)$ work, $O(\log n \log \log n)$ span, and $O(n/B \log_M n)$ serial cache complexity. Here $M$ is the size of the private cache.
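The presorted case admits linear-work hulls because a single sweep suffices once the points are ordered. As an illustration of that standard idea, here is a serial 2D sketch of the classic monotone-chain construction — not the paper's Binary Forking Model algorithm:

```python
def hull_presorted(points):
    """Andrew's monotone-chain convex hull for points already sorted
    by x (then y): O(n) work on presorted input."""
    def cross(o, a, b):
        # z-component of (a-o) x (b-o); positive means a counterclockwise turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    def half(pts):
        chain = []
        for p in pts:
            # pop while the last turn is not strictly counterclockwise
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain
    lower = half(points)
    upper = half(list(reversed(points)))
    return lower[:-1] + upper[:-1]  # drop duplicated endpoints
```

Each point is pushed and popped at most once per chain, which gives the $O(n)$ work bound the abstract cites for the presorted setting.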
References
Journal ArticleDOI
TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Abstract: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).

1,344 citations
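The merge-sort variant described above forms sorted runs of memory size and then merges many runs at a time. A serial sketch for illustration — `run_len` and `fanout` are stand-ins for M and M/B, and no actual disk I/O is modeled:

```python
import heapq

def multiway_merge_sort(keys, fanout=4, run_len=8):
    """External-memory-style merge sort sketch: form sorted runs
    (each fitting in "memory"), then repeatedly merge `fanout` runs
    at a time, mirroring the multiway merge of the I/O model."""
    # phase 1: one pass forming sorted runs of length run_len
    runs = [sorted(keys[i:i + run_len]) for i in range(0, len(keys), run_len)]
    # phase 2: merge passes, each reducing the run count by ~fanout
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), fanout):
            merged.append(list(heapq.merge(*runs[i:i + fanout])))
        runs = merged
    return runs[0] if runs else []
```

With fanout M/B, the number of merge passes is O(log_{M/B}(n/M)), which is where the optimal I/O bound for sorting comes from.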

Proceedings ArticleDOI
14 Jun 2008
TL;DR: This paper presents two sorting algorithms, a distribution sort and a mergesort, and studies sorting lower bounds in a computational model, called the parallel external-memory (PEM) model, that formalizes the essential properties of the algorithms for private-cache CMPs.
Abstract: In this paper, we study parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs.

127 citations
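Prefix sums, one of the building blocks named above, have a classic work-efficient parallel formulation (the Blelloch scan). The following is a serial rendering of that standard textbook technique for illustration — not the paper's PEM algorithm — and it assumes a power-of-two input length:

```python
def prefix_sums(a):
    """Work-efficient exclusive prefix sums (Blelloch scan), shown
    serially: an up-sweep builds partial sums in a binary tree, a
    down-sweep distributes them. The inner loop at each level touches
    independent indices, so on p cores the levels run in O(log n)
    parallel steps with O(n) total work."""
    n = len(a)
    assert n and (n & (n - 1)) == 0, "sketch assumes power-of-two n"
    t = list(a)
    d = 1
    while d < n:  # up-sweep (reduce) phase
        for i in range(d * 2 - 1, n, d * 2):
            t[i] += t[i - d]
        d *= 2
    t[n - 1] = 0  # clear the root for an exclusive scan
    while d > 1:  # down-sweep phase
        d //= 2
        for i in range(d * 2 - 1, n, d * 2):
            t[i - d], t[i] = t[i], t[i] + t[i - d]
    return t
```

For example, `prefix_sums([1, 2, 3, 4])` yields `[0, 1, 3, 6]`: each output position holds the sum of all earlier inputs.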

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.

124 citations

Book ChapterDOI
15 Sep 2008
TL;DR: It is suggested that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for a bridging model aimed at capturing the most basic resource parameters of multi-core architectures.
Abstract: We propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for such a bridging model. Portable algorithms would contain efficient designs for all reasonable ranges of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines.

87 citations

Proceedings ArticleDOI
28 Oct 1981
TL;DR: A probabilistic parallel algorithm to sort n keys drawn from an arbitrary totally ordered set such that the average runtime is bounded by O(log n), which means the product of time and number of processors meets the information-theoretic lower bound for sorting.
Abstract: We describe a probabilistic parallel algorithm to sort n keys drawn from an arbitrary totally ordered set. This algorithm can be implemented on a parallel computer consisting of n RAMs, each with a small private memory, and a common memory of size O(n), such that the average runtime is bounded by O(log n). Hence for this algorithm the product of time and number of processors meets the information-theoretic lower bound for sorting.

61 citations