Posted Content

Efficient cache oblivious algorithms for randomized divide-and-conquer on the multicore model

TL;DR: A cache-oblivious framework is presented for randomized divide-and-conquer algorithms on the multicore model with private caches, together with a simple randomized processor-allocation technique that needs no explicit knowledge of the number of processors and is likely to find additional applications in resource-oblivious environments.
Abstract: In this paper we present randomized algorithms for sorting and convex hull that achieve optimal performance (for speed-up and cache misses) on the multicore model with private caches. Our algorithms are cache oblivious and generalize the randomized divide-and-conquer strategy given by Reischuk and by Reif and Sen. Although this approach yielded optimal speed-up in the PRAM model, we require additional techniques to optimize cache misses in an oblivious setting. Under a mild assumption on the input and the number of processors, our algorithm achieves optimal time and cache misses with high probability. Although similar results have been obtained recently for sorting, we feel that our approach is simpler and more general, and we apply it to obtain an optimal parallel algorithm for 3D convex hulls with similar bounds. We also present a simple randomized processor-allocation technique that requires no explicit knowledge of the number of processors and is likely to find additional applications in resource-oblivious environments.
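The randomized divide-and-conquer strategy can be illustrated with a toy sample sort: pick a few random splitters, partition the input into buckets, and recurse on each bucket. This is only an illustrative sequential sketch under assumed names (`sample_sort` and the splitter count `k` are invented here), not the paper's algorithm; on a multicore, the buckets would be handed to different processors.

```python
import random

def sample_sort(a, k=4):
    """Toy randomized divide-and-conquer sort (illustrative only):
    pick k random splitters, partition into k+1 buckets, recurse.
    On a multicore, the buckets are independent subproblems."""
    if len(a) <= k:
        return sorted(a)
    splitters = sorted(random.sample(a, k))
    if splitters[0] == splitters[-1]:
        # Degenerate sample (many duplicates): fall back to a direct sort.
        return sorted(a)
    buckets = [[] for _ in range(k + 1)]
    for x in a:
        # Binary search for the first splitter strictly greater than x.
        lo, hi = 0, k
        while lo < hi:
            mid = (lo + hi) // 2
            if x < splitters[mid]:
                hi = mid
            else:
                lo = mid + 1
        buckets[lo].append(x)
    out = []
    for b in buckets:                 # independent subproblems
        out.extend(sample_sort(b, k))
    return out
```

With random splitters the bucket sizes are balanced with high probability, which is what drives the optimal speed-up in the randomized divide-and-conquer analyses cited above.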
Citations
Proceedings ArticleDOI
25 Jun 2012
TL;DR: In this paper, a cache-oblivious framework for randomized divide-and-conquer algorithms on the multicore model with private caches is presented, where p, M and B denote the number of processors, the size of an individual cache memory, and the block size, respectively.
Abstract: In this paper we present a cache-oblivious framework for randomized divide and conquer algorithms on the multicore model with private caches. We first derive an O(n/p log n + log n log log n) expected parallel depth algorithm for sorting n numbers with expected O(n/B log_M n) cache misses, where p, M and B denote the number of processors, the size of an individual cache memory, and the block size, respectively. Although similar results have been obtained recently for sorting, we feel that our approach is simpler and more general, and we apply it to obtain an algorithm for 3D convex hulls with similar bounds. We also present a simple randomized processor allocation technique without the explicit knowledge of the number of processors that is likely to find additional applications in resource oblivious environments.

1 citation

References
Journal ArticleDOI
TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Abstract: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).
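The merge-sort I/O bound above can be made concrete with a small cost calculator: each pass of (M/B)-way merging scans all N/B blocks once, and ceil(log_{M/B}(N/B)) passes suffice, giving Theta((N/B) log_{M/B}(N/B)). A rough sketch (the function name is made up and constant factors are ignored):

```python
import math

def merge_sort_io(N, M, B):
    """Rough block-transfer count for (M/B)-way external merge sort,
    matching the Aggarwal-Vitter bound Theta((N/B) log_{M/B}(N/B)).
    Illustrative only: constants and the read/write split are ignored."""
    n_blocks = math.ceil(N / B)      # blocks occupied by the input
    fanout = max(2, M // B)          # runs merged per pass (memory holds M/B blocks)
    passes, reach = 0, 1
    while reach < n_blocks:          # integer ceil(log_fanout(n_blocks))
        reach *= fanout
        passes += 1
    return n_blocks * max(1, passes)
```

For example, sorting 2^26 records with M = 2^20 and B = 2^10 occupies 2^16 blocks and needs only two 1024-way merge passes, so the estimate is 2 * 2^16 block transfers.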

1,344 citations


"Efficient cache oblivious algorithm..." refers to methods in this paper

  • ...REFERENCES [1] A. Aggarwal and J. S. Vitter....

  • ...For the external memory model, Aggarwal and Vitter [1] designed a version of merge sort that uses at most O((N/B) log_{M/B}(N/B)) I/Os, and this is optimal....

Journal ArticleDOI
TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time to execute a fully strict computation on P processors using this scheduler is T1/P + O(T∞).
Abstract: This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is T1/P + O(T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most S1·P, where S1 is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most O(P·T∞·(1 + nd)·Smax), where Smax is the size of the largest activation record of any thread and nd is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
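The work-stealing discipline analyzed above — each processor treats its own deque as a LIFO stack, while an idle processor steals the oldest task from a random victim's opposite end — can be sketched as a toy single-threaded simulation. All names here are invented for illustration; a real scheduler runs the processors concurrently with synchronized deque access.

```python
import random
from collections import deque

def work_stealing_run(p, root_tasks, spawn):
    """Toy round-based simulation of work stealing on p processors.
    Owners push/pop at the bottom of their own deque (LIFO); an idle
    processor steals from the top (oldest end) of a random victim.
    `spawn(task)` returns the child tasks that `task` creates."""
    deques = [deque() for _ in range(p)]
    for i, t in enumerate(root_tasks):
        deques[i % p].append(t)
    executed, steps = [], 0
    while any(deques):
        for i in range(p):
            if deques[i]:
                task = deques[i].pop()           # owner takes its newest task
            else:
                victims = [j for j in range(p) if deques[j]]
                if not victims:
                    continue
                task = deques[random.choice(victims)].popleft()  # steal oldest
            executed.append(task)
            for child in spawn(task):            # children go to the executor's deque
                deques[i].append(child)
        steps += 1
    return executed, steps
```

Stealing from the oldest end tends to transfer large unexplored subtrees, which is what keeps the number of steals (and hence communication) low in the analysis above.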

1,202 citations

Journal ArticleDOI
Richard Cole1
TL;DR: A parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time; the constant in the running time is small.
Abstract: We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and $O(\log n)$ time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and $O(\log n)$ time. The constant in the running time is still moderate, though not as small.
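Cole's pipelined merging is intricate, but the structure it accelerates is ordinary round-based mergesort, in which each round's pairwise merges are independent and could run on separate processors. The sketch below (names invented, sequential) counts those rounds, about log2 n of them; Cole's contribution is pipelining the merges so the total time is O(log n) rather than the O(log^2 n) that naive round-by-round merging gives.

```python
def merge(x, y):
    """Standard two-way merge of sorted lists."""
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            out.append(x[i]); i += 1
        else:
            out.append(y[j]); j += 1
    out.extend(x[i:]); out.extend(y[j:])
    return out

def round_based_merge_sort(a):
    """Bottom-up mergesort organized into rounds: within a round, all
    merges touch disjoint runs and are therefore parallelizable.
    Returns the sorted list and the number of rounds (= ceil(log2 n))."""
    runs = [[x] for x in a] or [[]]
    rounds = 0
    while len(runs) > 1:
        nxt = [merge(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:          # odd run out: carry it to the next round
            nxt.append(runs[-1])
        runs = nxt
        rounds += 1
    return runs[0], rounds
```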

847 citations


"Efficient cache oblivious algorithm..." refers to methods in this paper

  • ...e and O((n/B) log_M n) cache misses. Note that both the time and the cache misses achieved by this algorithm are optimal. Arge et al. [4] formalized the PEM model and presented a merge sort algorithm based on [10] that runs in O(log n) time and has optimal cache misses. Note that their model is both cache aware and processor aware. This algorithm is very similar to merge sort implemented on [2]. They proved the...

  • ...ssed above, where Õ is used to represent an expected bound. To sort n elements using p processors: CRCW PRAM, Õ((n/p) log n) time, n ≥ p, Reischuk [14]; EREW PRAM, O((n/p) log n) time, n ≥ p, Cole [10]; cache aware, O((n/p) log n) time and O((n/(pB)) log_{M/B}(n/B)) cache misses, M ≥ B², Goodrich [4]; cache oblivious, O((n/p) log² n) time and O((n/B) log_M n) cache misses, M ≥ B², Ramachandran [5]; cache oblivious, depth O(log² n) and O((n/B) log_{M/B}(n/B)) cache misses, M = Ω(B²), Blelloch [7...

Proceedings ArticleDOI
20 Nov 1994
TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time TP to execute a fully strict computation on P processors using this work-stealing scheduler is TP = O(T1/P + T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors.
Abstract: This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time TP to execute a fully strict computation on P processors using our work-stealing scheduler is TP = O(T1/P + T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space SP required by the execution satisfies SP ≤ S1·P. We also show that the expected total communication of the algorithm is at most O(T∞·Smax·P), where Smax is the size of the largest activation record of any thread, thereby justifying the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

660 citations

Book ChapterDOI
28 May 2003
TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal across a multilevel cache hierarchy, and it is shown that the assumption of optimal replacement made by the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract: Computers with multiple levels of caching have traditionally required techniques such as data blocking in order for algorithms to exploit the cache hierarchy effectively. These "cache-aware" algorithms must be properly tuned to achieve good performance using so-called "voodoo" parameters which depend on hardware properties, such as cache size and cache-line length. Surprisingly, however, for a variety of problems - including matrix multiplication, FFT, and sorting - asymptotically optimal "cache-oblivious" algorithms do exist that contain no voodoo parameters. They perform an optimal amount of work and move data optimally among multiple levels of cache. Since they need not be tuned, cache-oblivious algorithms are more portable than traditional cache-aware algorithms. We employ an "ideal-cache" model to analyze these algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal across a multilevel cache hierarchy. We also show that the assumption of optimal replacement made by the ideal-cache model can be simulated efficiently by LRU replacement. We also provide some empirical results on the effectiveness of cache-oblivious algorithms in practice.
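The ideal-cache recipe above — divide until subproblems fit in whatever cache exists, with no tuned "voodoo" parameters — is commonly illustrated with cache-oblivious matrix transpose, which recursively splits the larger dimension. A sketch under invented names; the small `base` cutoff here only limits Python call overhead and is not a cache parameter the algorithm's cache behavior depends on.

```python
def transpose(A, T, r0=0, c0=0, rows=None, cols=None, base=16):
    """Cache-oblivious transpose of the rows x cols block of A starting
    at (r0, c0) into T (so T[j][i] = A[i][j]).  Recursively halving the
    larger dimension yields sub-blocks that eventually fit in every
    level of any cache hierarchy, without knowing cache or line sizes."""
    if rows is None:
        rows, cols = len(A), len(A[0])
    if rows <= base and cols <= base:
        for i in range(r0, r0 + rows):       # small block: copy directly
            for j in range(c0, c0 + cols):
                T[j][i] = A[i][j]
    elif rows >= cols:                       # split the taller dimension
        h = rows // 2
        transpose(A, T, r0, c0, h, cols, base)
        transpose(A, T, r0 + h, c0, rows - h, cols, base)
    else:                                    # split the wider dimension
        h = cols // 2
        transpose(A, T, r0, c0, rows, h, base)
        transpose(A, T, r0, c0 + h, rows, cols - h, base)
    return T
```

Always splitting the larger dimension keeps sub-blocks roughly square, which is what makes the recursion incur the optimal O(nm/B) misses in the ideal-cache model.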

604 citations


"Efficient cache oblivious algorithm..." refers to methods in this paper

  • ...[5] presented a sequential cache-oblivious sorting algorithm called Funnelsort, which can sort n keys in O(n log n) time and O((n/B) log_M n) cache misses....
