Posted Content

Efficient cache oblivious algorithms for randomized divide-and-conquer on the multicore model

TL;DR: A cache-oblivious framework is presented for randomized divide-and-conquer algorithms on the multicore model with private caches, together with a simple randomized processor-allocation technique that needs no explicit knowledge of the number of processors and is likely to find additional applications in resource-oblivious environments.
Abstract: In this paper we present randomized algorithms for sorting and convex hull that achieve optimal performance (for speed-up and cache misses) on the multicore model with private caches. Our algorithms are cache oblivious and generalize the randomized divide-and-conquer strategy given by Reischuk and by Reif and Sen. Although this approach yielded optimal speed-up in the PRAM model, we require additional techniques to optimize cache misses in an oblivious setting. Under a mild assumption on the input and the number of processors, our algorithm achieves optimal time and cache misses with high probability. Although similar results have been obtained recently for sorting, we feel that our approach is simpler and more general, and we apply it to obtain an optimal parallel algorithm for 3D convex hulls with similar bounds. We also present a simple randomized processor-allocation technique that requires no explicit knowledge of the number of processors and is likely to find additional applications in resource-oblivious environments.
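The randomized divide-and-conquer strategy can be illustrated with a toy sample sort: pick a few random splitters, partition the input into buckets, and recurse on each bucket. This is only an illustrative sequential sketch under assumed names (`sample_sort` and the splitter count `k` are invented here), not the paper's algorithm; on a multicore, the buckets would be handed to different processors.

```python
import random

def sample_sort(a, k=4):
    """Toy randomized divide-and-conquer sort (illustrative only):
    pick k random splitters, partition into k+1 buckets, recurse.
    On a multicore, the buckets are independent subproblems."""
    if len(a) <= k:
        return sorted(a)
    splitters = sorted(random.sample(a, k))
    if splitters[0] == splitters[-1]:
        # Degenerate sample (many duplicates): fall back to a direct sort.
        return sorted(a)
    buckets = [[] for _ in range(k + 1)]
    for x in a:
        # Binary search for the first splitter strictly greater than x.
        lo, hi = 0, k
        while lo < hi:
            mid = (lo + hi) // 2
            if x < splitters[mid]:
                hi = mid
            else:
                lo = mid + 1
        buckets[lo].append(x)
    out = []
    for b in buckets:                 # independent subproblems
        out.extend(sample_sort(b, k))
    return out
```

With random splitters the bucket sizes are balanced with high probability, which is what drives the optimal speed-up in the randomized divide-and-conquer analyses cited above.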
Citations
Proceedings ArticleDOI
25 Jun 2012
TL;DR: In this paper, a cache-oblivious framework for randomized divide-and-conquer algorithms on the multicore model with private caches is presented, where p, M and B denote the number of processors, the size of an individual cache memory, and the block size, respectively.
Abstract: In this paper we present a cache-oblivious framework for randomized divide and conquer algorithms on the multicore model with private caches. We first derive an O(n/p log n + log n log log n) expected parallel depth algorithm for sorting n numbers with expected O(n/B log_M n) cache misses, where p, M and B denote the number of processors, the size of an individual cache memory, and the block size, respectively. Although similar results have been obtained recently for sorting, we feel that our approach is simpler and more general, and we apply it to obtain an algorithm for 3D convex hulls with similar bounds. We also present a simple randomized processor allocation technique without the explicit knowledge of the number of processors that is likely to find additional applications in resource oblivious environments.

1 citation

References
Journal ArticleDOI
TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Abstract: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).
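The merge-sort I/O bound above can be made concrete with a small cost calculator: each pass of (M/B)-way merging scans all N/B blocks once, and ceil(log_{M/B}(N/B)) passes suffice, giving Theta((N/B) log_{M/B}(N/B)). A rough sketch (the function name is made up and constant factors are ignored):

```python
import math

def merge_sort_io(N, M, B):
    """Rough block-transfer count for (M/B)-way external merge sort,
    matching the Aggarwal-Vitter bound Theta((N/B) log_{M/B}(N/B)).
    Illustrative only: constants and the read/write split are ignored."""
    n_blocks = math.ceil(N / B)      # blocks occupied by the input
    fanout = max(2, M // B)          # runs merged per pass (memory holds M/B blocks)
    passes, reach = 0, 1
    while reach < n_blocks:          # integer ceil(log_fanout(n_blocks))
        reach *= fanout
        passes += 1
    return n_blocks * max(1, passes)
```

For example, sorting 2^26 records with M = 2^20 and B = 2^10 occupies 2^16 blocks and needs only two 1024-way merge passes, so the estimate is 2 * 2^16 block transfers.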

1,344 citations


"Efficient cache oblivious algorithm..." refers to methods in this paper

  • ...REFERENCES [1] A. Aggarwal and J. S. Vitter....

  • ...For the external memory model, Aggarwal and Vitter [1] designed a version of merge sort that uses at most O((N/B) log_{M/B}(N/B)) I/Os, and this is optimal....

Journal ArticleDOI
TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time to execute a fully strict computation on P processors using this scheduler is T1/P + O(T∞).
Abstract: This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is T1/P + O(T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most S1·P, where S1 is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most O(P·T∞·(1 + nd)·Smax), where Smax is the size of the largest activation record of any thread and nd is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
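The work-stealing discipline analyzed above — each processor treats its own deque as a LIFO stack, while an idle processor steals the oldest task from a random victim's opposite end — can be sketched as a toy single-threaded simulation. All names here are invented for illustration; a real scheduler runs the processors concurrently with synchronized deque access.

```python
import random
from collections import deque

def work_stealing_run(p, root_tasks, spawn):
    """Toy round-based simulation of work stealing on p processors.
    Owners push/pop at the bottom of their own deque (LIFO); an idle
    processor steals from the top (oldest end) of a random victim.
    `spawn(task)` returns the child tasks that `task` creates."""
    deques = [deque() for _ in range(p)]
    for i, t in enumerate(root_tasks):
        deques[i % p].append(t)
    executed, steps = [], 0
    while any(deques):
        for i in range(p):
            if deques[i]:
                task = deques[i].pop()           # owner takes its newest task
            else:
                victims = [j for j in range(p) if deques[j]]
                if not victims:
                    continue
                task = deques[random.choice(victims)].popleft()  # steal oldest
            executed.append(task)
            for child in spawn(task):            # children go to the executor's deque
                deques[i].append(child)
        steps += 1
    return executed, steps
```

Stealing from the oldest end tends to transfer large unexplored subtrees, which is what keeps the number of steals (and hence communication) low in the analysis above.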

1,202 citations

Journal ArticleDOI
Richard Cole1
TL;DR: A parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time; the constant in the running time is small.
Abstract: We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and $O(\log n)$ time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and $O(\log n)$ time. The constant in the running time is still moderate, though not as small.
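Cole's pipelined merging is intricate, but the structure it accelerates is ordinary round-based mergesort, in which each round's pairwise merges are independent and could run on separate processors. The sketch below (names invented, sequential) counts those rounds, about log2 n of them; Cole's contribution is pipelining the merges so the total time is O(log n) rather than the O(log^2 n) that naive round-by-round merging gives.

```python
def merge(x, y):
    """Standard two-way merge of sorted lists."""
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            out.append(x[i]); i += 1
        else:
            out.append(y[j]); j += 1
    out.extend(x[i:]); out.extend(y[j:])
    return out

def round_based_merge_sort(a):
    """Bottom-up mergesort organized into rounds: within a round, all
    merges touch disjoint runs and are therefore parallelizable.
    Returns the sorted list and the number of rounds (= ceil(log2 n))."""
    runs = [[x] for x in a] or [[]]
    rounds = 0
    while len(runs) > 1:
        nxt = [merge(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:          # odd run out: carry it to the next round
            nxt.append(runs[-1])
        runs = nxt
        rounds += 1
    return runs[0], rounds
```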

847 citations


"Efficient cache oblivious algorithm..." refers to methods in this paper

  • ...e and O((n/B) log_M n) cache misses. Note that both the time and the cache misses achieved by this algorithm are optimal. Arge et al. [4] formalized the PEM model and presented a merge sort algorithm based on [10] that runs in O(log n) time and has optimal cache misses. Note that their model is both cache aware and processor aware. This algorithm is very similar to merge sort implemented on [2]. They proved the...

  • ...ssed above, where Õ is used to represent an expected bound. To sort n elements using p processors: CRCW PRAM, Õ((n/p) log n) time, n ≥ p, Reischuk [14]; EREW PRAM, O((n/p) log n) time, n ≥ p, Cole [10]; cache aware, O((n/p) log n) time and O((n/(pB)) log_{M/B}(n/B)) cache misses, M ≥ B², Goodrich [4]; cache oblivious, O((n/p) log² n) time and O((n/B) log_M n) cache misses, M ≥ B², Ramachandran [5]; cache oblivious, depth O(log² n) and O((n/B) log_{M/B}(n/B)) cache misses, M = Ω(B²), Blelloch [7...

Proceedings ArticleDOI
20 Nov 1994
TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time TP to execute a fully strict computation on P processors using this work-stealing scheduler is TP = O(T1/P + T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors.
Abstract: This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time TP to execute a fully strict computation on P processors using our work-stealing scheduler is TP = O(T1/P + T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space SP required by the execution satisfies SP ≤ S1·P. We also show that the expected total communication of the algorithm is at most O(T∞·Smax·P), where Smax is the size of the largest activation record of any thread, thereby justifying the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

660 citations

Book ChapterDOI
28 May 2003
TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal across a multilevel cache hierarchy, and it is shown that the assumption of optimal replacement made by the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract: Computers with multiple levels of caching have traditionally required techniques such as data blocking in order for algorithms to exploit the cache hierarchy effectively. These "cache-aware" algorithms must be properly tuned to achieve good performance using so-called "voodoo" parameters which depend on hardware properties, such as cache size and cache-line length. Surprisingly, however, for a variety of problems - including matrix multiplication, FFT, and sorting - asymptotically optimal "cache-oblivious" algorithms do exist that contain no voodoo parameters. They perform an optimal amount of work and move data optimally among multiple levels of cache. Since they need not be tuned, cache-oblivious algorithms are more portable than traditional cache-aware algorithms. We employ an "ideal-cache" model to analyze these algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal across a multilevel cache hierarchy. We also show that the assumption of optimal replacement made by the ideal-cache model can be simulated efficiently by LRU replacement. We also provide some empirical results on the effectiveness of cache-oblivious algorithms in practice.
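The ideal-cache recipe above — divide until subproblems fit in whatever cache exists, with no tuned "voodoo" parameters — is commonly illustrated with cache-oblivious matrix transpose, which recursively splits the larger dimension. A sketch under invented names; the small `base` cutoff here only limits Python call overhead and is not a cache parameter the algorithm's cache behavior depends on.

```python
def transpose(A, T, r0=0, c0=0, rows=None, cols=None, base=16):
    """Cache-oblivious transpose of the rows x cols block of A starting
    at (r0, c0) into T (so T[j][i] = A[i][j]).  Recursively halving the
    larger dimension yields sub-blocks that eventually fit in every
    level of any cache hierarchy, without knowing cache or line sizes."""
    if rows is None:
        rows, cols = len(A), len(A[0])
    if rows <= base and cols <= base:
        for i in range(r0, r0 + rows):       # small block: copy directly
            for j in range(c0, c0 + cols):
                T[j][i] = A[i][j]
    elif rows >= cols:                       # split the taller dimension
        h = rows // 2
        transpose(A, T, r0, c0, h, cols, base)
        transpose(A, T, r0 + h, c0, rows - h, cols, base)
    else:                                    # split the wider dimension
        h = cols // 2
        transpose(A, T, r0, c0, rows, h, base)
        transpose(A, T, r0, c0 + h, rows, cols - h, base)
    return T
```

Always splitting the larger dimension keeps sub-blocks roughly square, which is what makes the recursion incur the optimal O(nm/B) misses in the ideal-cache model.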

604 citations


"Efficient cache oblivious algorithm..." refers to methods in this paper

  • ...[5] presented a sequential cache-oblivious sorting algorithm called Funnelsort, which can sort n keys in O(n log n) time and O((n/B) log_M n) cache misses....
