Book Chapter · DOI

A Bridging Model for Multi-core Computing

15 Sep 2008, pp. 13–28
TL;DR: It is suggested that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for a bridging model aimed at capturing the most basic resource parameters of multi-core architectures.
Abstract: We propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for such a bridging model. Portable algorithms would contain efficient designs for all reasonable ranges of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines.
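
As a concrete reading of these parameters, the sketch below (not from the paper) describes each level of a multi-level machine by a resource quadruple of component count, bandwidth cost, synchronization cost, and memory size, and charges a superstep for work, communication, and synchronization. The cost formula and all parameter values are illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative sketch of a multi-level bridging-model parameter set:
# each level of the machine tree carries a component count p, a
# bandwidth cost g, a synchronization cost L, and a memory size m.
# The cost formula is a simplified reading of such models, not the
# paper's exact definitions.

@dataclass
class Level:
    p: int      # number of subcomponents at this level
    g: float    # cost per word communicated with the level above
    L: float    # cost of a barrier synchronization at this level
    m: int      # memory/cache size in words at this level

def superstep_cost(work: float, words: float, level: Level) -> float:
    """One superstep: local work, plus g per word exchanged with the
    parent level, plus the synchronization charge."""
    return work + level.g * words + level.L

# Hypothetical two-level machine: 8 cores under one chip-level cache.
core = Level(p=8, g=1.0, L=10.0, m=32 * 1024)
chip = Level(p=1, g=8.0, L=1_000.0, m=8 * 1024 * 1024)

print(superstep_cost(10_000, 512, core))    # per-core superstep
print(superstep_cost(80_000, 4_096, chip))  # chip-level superstep
```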


Citations
Posted Content
TL;DR: In this article, the authors study the oblivious RAM simulation problem, achieving a small logarithmic or polylogarithmic amortized increase in access times with a very high probability of success, while keeping the external storage of size O(n).
Abstract: Suppose a client, Alice, has outsourced her data to an external storage provider, Bob, because he has capacity for her massive data set, of size n, whereas her private storage is much smaller--say, of size O(n^{1/r}), for some constant r > 1. Alice trusts Bob to maintain her data, but she would like to keep its contents private. She can encrypt her data, of course, but she also wishes to keep her access patterns hidden from Bob as well. We describe schemes for the oblivious RAM simulation problem with a small logarithmic or polylogarithmic amortized increase in access times, with a very high probability of success, while keeping the external storage to be of size O(n). To achieve this, our algorithmic contributions include a parallel MapReduce cuckoo-hashing algorithm and an external-memory data-oblivious sorting algorithm.
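
For orientation, the following is a minimal sequential cuckoo-hashing sketch of the data structure underlying the construction; the paper's contribution is a parallel MapReduce variant, which this does not attempt to reproduce. All names and constants are illustrative.

```python
import random

# Minimal sequential cuckoo-hashing sketch: two tables, two hash
# functions, and insertions that evict ("kick") a resident key to its
# slot in the other table.

class CuckooHash:
    def __init__(self, size=64, max_kicks=32):
        self.size = size
        self.tables = [[None] * size, [None] * size]
        self.seeds = (random.random(), random.random())
        self.max_kicks = max_kicks

    def _slot(self, key, i):
        return hash((self.seeds[i], key)) % self.size

    def insert(self, key):
        i = 0
        for _ in range(self.max_kicks):
            j = self._slot(key, i)
            if self.tables[i][j] is None:
                self.tables[i][j] = key
                return
            # Evict the resident and re-insert it into the other table.
            key, self.tables[i][j] = self.tables[i][j], key
            i = 1 - i
        raise RuntimeError("too many kicks; a full version would rehash")

    def contains(self, key):
        return any(self.tables[i][self._slot(key, i)] == key for i in (0, 1))

table = CuckooHash()
for k in range(20):
    table.insert(k)
assert all(table.contains(k) for k in range(20))
```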

194 citations

Proceedings Article · DOI
13 Jun 2010
TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
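
To illustrate the design style (low-depth nested parallelism whose natural sequential order scans memory contiguously), here is a sketch of a divide-and-conquer reduction; it is an example of the approach, not an algorithm from the paper.

```python
# A divide-and-conquer reduction whose two halves are independent (a
# parallel runtime could execute them as spawned tasks), and whose
# natural sequential order is a single left-to-right scan, giving the
# optimal cache-oblivious cost of O(n/B) misses.

def reduce_sum(a, lo, hi):
    if hi - lo <= 1024:                  # base case: sequential scan
        return sum(a[lo:hi])
    mid = (lo + hi) // 2
    left = reduce_sum(a, lo, mid)        # in a parallel runtime: spawn
    right = reduce_sum(a, mid, hi)       # independent of `left`
    return left + right                  # join: O(log n) depth overall

data = list(range(1_000_000))
assert reduce_sum(data, 0, len(data)) == sum(data)
```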

124 citations


Cites background from "A Bridging Model for Multi-core Com..."

  • ...Clearly this lower bound also applies to more general multi-level cache models such as studied in [3, 4, 30, 42, 46, 49]....

  • ...It would be interesting to extend these results to more general multi-level models [3, 4, 18, 30, 42, 46, 49], while preserving the goal of supporting a simple model for algorithm design and analysis....

Journal ArticleDOI
TL;DR: GV100 is NVIDIA's latest flagship GPU that features enhancements to NVLink, a redesigned streaming multiprocessor (SM), and independent thread scheduling enhancements to the single instruction, multiple threads (SIMT) model.
Abstract: GV100 is NVIDIA's latest flagship GPU. It has been designed with many features to improve performance and programmability. It features enhancements to NVLink, a redesigned streaming multiprocessor (SM), and independent thread scheduling enhancements to the single instruction, multiple threads (SIMT) model. GV100 also adds new tensor cores for an order of magnitude throughput improvement for deep-learning kernels.

101 citations

Proceedings Article
19 Jun 2016
TL;DR: This paper introduces a new technique for mapping Deep Recurrent Neural Networks efficiently onto GPUs that uses persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps.
Abstract: This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16× reduction in activation memory footprint, enables model training with 12× more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
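
A rough traffic model shows why weight reuse dominates at small mini-batch sizes: a direct implementation re-reads the recurrent weight matrix from DRAM at every timestep, while a persistent kernel loads it once. The model and the numbers below are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope model of off-chip traffic for an RNN layer.

def bytes_moved(hidden, timesteps, batch, persistent, dtype_bytes=4):
    weight_bytes = hidden * hidden * dtype_bytes
    # Read the hidden state in and write it out at every timestep.
    act_bytes = 2 * hidden * batch * timesteps * dtype_bytes
    # Persistent kernels load the weights once; naive kernels reload
    # them every timestep.
    weight_traffic = weight_bytes if persistent else weight_bytes * timesteps
    return weight_traffic + act_bytes

h, T, b = 1152, 256, 4  # hypothetical layer width, timesteps, mini-batch
naive = bytes_moved(h, T, b, persistent=False)
persist = bytes_moved(h, T, b, persistent=True)
print(f"total off-chip traffic reduced ~{naive / persist:.0f}x at mini-batch {b}")
```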

88 citations

Journal Article · DOI
TL;DR: The ICE abstraction may take CS from serial (single-core) computing to effective parallel (many-core) computing.
Abstract: The ICE abstraction may take CS from serial (single-core) computing to effective parallel (many-core) computing.

51 citations

References
Journal Article · DOI
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, and results are given quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model; yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
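
For reference, the BSP cost model charges each superstep w + g·h + l, where w is the maximum local work, h is the maximum number of words any processor sends or receives, g is the bandwidth cost per word, and l is the barrier synchronization cost. A minimal sketch with made-up parameter values:

```python
# Total BSP cost of a program, summed over its supersteps.

def bsp_cost(supersteps, g, l):
    """supersteps: list of (w, h) pairs; returns the total model cost."""
    return sum(w + g * h + l for (w, h) in supersteps)

# A hypothetical 3-superstep program on a machine with g=4, l=100.
program = [(10_000, 200), (5_000, 800), (12_000, 50)]
print(bsp_cost(program, g=4, l=100))
```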

3,885 citations


"A Bridging Model for Multi-core Com..." refers background or methods in this paper

  • ...The originally proposed bridging model was the BSP model [28]....

  • ...We have argued previously that the general problem of parallel computing should be approached via two notions [13, 28]....

Journal Article · DOI
TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Abstract: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).
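
The tight sorting bound from this paper is Theta((N/(PB)) · log(N/B)/log(M/B)) I/Os, where N is the number of records, M the internal memory size, B the block size, and P the blocks per transfer. The helper below simply evaluates that expression for concrete, made-up parameters.

```python
import math

# Evaluate the Aggarwal-Vitter sorting bound (up to constant factors).

def sort_io_bound(N, M, B, P=1):
    return (N / (P * B)) * math.log(N / B) / math.log(M / B)

# Example: 1e9 records, internal memory of 1e6 records, blocks of 1e3.
print(f"{sort_io_bound(N=1e9, M=1e6, B=1e3):,.0f} I/Os (up to constants)")
```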

1,344 citations


"A Bridging Model for Multi-core Com..." refers methods in this paper

  • ...Our lower bound results for straight-line programs we derive using the approach of Irony, Toledo and Tiskin [25] (and also of [23, 33]), while the result for sorting uses an adversarial argument of Aggarwal and Vitter [3]....

  • ...The bounds for sorting follow from the following adversarial argument adapted from Aggarwal and Vitter [3]: Let S = Mi. Consider any total ordering of all the level i component supersteps that respects all the time dependencies....

  • ...Our proofs of optimality for communication and synchronization given in this section and the one to follow all derive from lower bounds on the number of communication steps required in distributed algorithms and are direct applications of previous work, particularly of Hong and Kung [23], Aggarwal and Vitter [3], Savage [33, 34] and Irony, Toledo and Tiskin [25]....

Journal Article · DOI
TL;DR: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores.
Abstract: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster.
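
As a worked instance of the corollary, the sketch below implements Hill and Marty's symmetric-chip speedup formula, with single-core performance modeled as sqrt(r) as in the paper; the BCE budget and parallel fraction chosen here are illustrative.

```python
import math

# Symmetric-multicore corollary to Amdahl's law: a chip has a budget of
# n base-core-equivalents (BCEs), spent on n/r cores of r BCEs each,
# where a single core's performance scales as perf(r) = sqrt(r).
# f is the parallelizable fraction of the work.

def perf(r):
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

# With f=0.99 and a 256-BCE budget, compare many small cores to fewer,
# larger ones: mid-sized cores win, illustrating the paper's point.
for r in (1, 4, 16, 64):
    print(f"r={r:3d} cores={256 // r:4d} "
          f"speedup={symmetric_speedup(0.99, 256, r):6.1f}")
```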

1,245 citations


"A Bridging Model for Multi-core Com..." refers background in this paper

  • ...Another issue not discussed here is the role of non-homogeneous cores [16, 21]....

Proceedings Article · DOI
01 May 1978
TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented; nondeterministic parallel RAMs can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines.
Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.
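
As a toy illustration of the model (not a construction from the paper): p processors proceed in lockstep against a shared memory, with all reads of a step completing before any writes. The sketch below runs a pointer-doubling prefix sum in O(log n) synchronous steps, a CREW-style program.

```python
# One synchronous PRAM-style round: all processors read, then all write.

def pram_round(n, d, mem):
    # Phase 1: every processor i reads its operand synchronously.
    reads = [mem[i - d] if i >= d else 0 for i in range(n)]
    # Phase 2: every processor i writes its result synchronously.
    for i in range(n):
        mem[i] += reads[i]

def prefix_sums(values):
    mem, d = list(values), 1
    while d < len(mem):               # O(log n) rounds in total
        pram_round(len(mem), d, mem)
        d *= 2
    return mem

print(prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```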

951 citations


Additional excerpts

  • ...BSP with (p1 ≥ 1; g1 = 1; L1 = 0; m1) is the PRAM [11, 20] model....

Book Chapter · DOI
02 Jan 1991
TL;DR: The authors survey parallel algorithms for shared-memory machines, covering the theoretical foundations of parallel algorithms and parallel architectures and the question of the appropriate logical organization of a massively parallel computer.
Abstract: This chapter discusses parallel algorithms for shared-memory machines. Parallel computation is rapidly becoming a dominant theme in all areas of computer science and its applications. It is estimated that, within a decade, virtually all developments in computer architecture, systems programming, computer applications and the design of algorithms will be taking place within the context of parallel computation. In preparation for this revolution, theoretical computer scientists have begun to develop a body of theory centered on parallel algorithms and parallel architectures. As there is no consensus yet on the appropriate logical organization of a massively parallel computer, and as the speed of parallel algorithms is constrained as much by limits on interprocessor communication as it is by purely computational issues, it is not surprising that a variety of abstract models of parallel computation have been pursued. Closest to the hardware level are the VLSI models, which focus on the technological limits of today's chips, in which gates and wires are packed into a small number of planar layers.

812 citations