Book Chapter · DOI

A Bridging Model for Multi-core Computing

15 Sep 2008, pp. 13–28
TL;DR: It is suggested that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for a bridging model aimed at capturing the most basic resource parameters of multi-core architectures.
Abstract: We propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully pursued as an effort in designing portable algorithms for such a bridging model. Portable algorithms would contain efficient designs for all reasonable ranges of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines.
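
As a concrete reading of these parameters, the sketch below (not from the paper) describes each level of a multi-level machine by a resource quadruple of component count, bandwidth cost, synchronization cost, and memory size, and charges a superstep for work, communication, and synchronization. The cost formula and all parameter values are illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative sketch of a multi-level bridging-model parameter set:
# each level of the machine tree carries a component count p, a
# bandwidth cost g, a synchronization cost L, and a memory size m.
# The cost formula is a simplified reading of such models, not the
# paper's exact definitions.

@dataclass
class Level:
    p: int      # number of subcomponents at this level
    g: float    # cost per word communicated with the level above
    L: float    # cost of a barrier synchronization at this level
    m: int      # memory/cache size in words at this level

def superstep_cost(work: float, words: float, level: Level) -> float:
    """One superstep: local work, plus g per word exchanged with the
    parent level, plus the synchronization charge."""
    return work + level.g * words + level.L

# Hypothetical two-level machine: 8 cores under one chip-level cache.
core = Level(p=8, g=1.0, L=10.0, m=32 * 1024)
chip = Level(p=1, g=8.0, L=1_000.0, m=8 * 1024 * 1024)

print(superstep_cost(10_000, 512, core))    # per-core superstep
print(superstep_cost(80_000, 4_096, chip))  # chip-level superstep
```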


Citations
Posted Content
TL;DR: In this article, the authors study the oblivious RAM simulation problem, achieving a small logarithmic or polylogarithmic amortized increase in access times with a very high probability of success, while keeping the external storage of size O(n).
Abstract: Suppose a client, Alice, has outsourced her data to an external storage provider, Bob, because he has capacity for her massive data set, of size n, whereas her private storage is much smaller--say, of size O(n^{1/r}), for some constant r > 1. Alice trusts Bob to maintain her data, but she would like to keep its contents private. She can encrypt her data, of course, but she also wishes to keep her access patterns hidden from Bob as well. We describe schemes for the oblivious RAM simulation problem with a small logarithmic or polylogarithmic amortized increase in access times, with a very high probability of success, while keeping the external storage to be of size O(n). To achieve this, our algorithmic contributions include a parallel MapReduce cuckoo-hashing algorithm and an external-memory data-oblivious sorting algorithm.
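
For orientation, the following is a minimal sequential cuckoo-hashing sketch of the data structure underlying the construction; the paper's contribution is a parallel MapReduce variant, which this does not attempt to reproduce. All names and constants are illustrative.

```python
import random

# Minimal sequential cuckoo-hashing sketch: two tables, two hash
# functions, and insertions that evict ("kick") a resident key to its
# slot in the other table.

class CuckooHash:
    def __init__(self, size=64, max_kicks=32):
        self.size = size
        self.tables = [[None] * size, [None] * size]
        self.seeds = (random.random(), random.random())
        self.max_kicks = max_kicks

    def _slot(self, key, i):
        return hash((self.seeds[i], key)) % self.size

    def insert(self, key):
        i = 0
        for _ in range(self.max_kicks):
            j = self._slot(key, i)
            if self.tables[i][j] is None:
                self.tables[i][j] = key
                return
            # Evict the resident and re-insert it into the other table.
            key, self.tables[i][j] = self.tables[i][j], key
            i = 1 - i
        raise RuntimeError("too many kicks; a full version would rehash")

    def contains(self, key):
        return any(self.tables[i][self._slot(key, i)] == key for i in (0, 1))

table = CuckooHash()
for k in range(20):
    table.insert(k)
assert all(table.contains(k) for k in range(20))
```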

194 citations

Proceedings Article · DOI
13 Jun 2010
TL;DR: This paper describes several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
Abstract: In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
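
To illustrate the design style (low-depth nested parallelism whose natural sequential order scans memory contiguously), here is a sketch of a divide-and-conquer reduction; it is an example of the approach, not an algorithm from the paper.

```python
# A divide-and-conquer reduction whose two halves are independent (a
# parallel runtime could execute them as spawned tasks), and whose
# natural sequential order is a single left-to-right scan, giving the
# optimal cache-oblivious cost of O(n/B) misses.

def reduce_sum(a, lo, hi):
    if hi - lo <= 1024:                  # base case: sequential scan
        return sum(a[lo:hi])
    mid = (lo + hi) // 2
    left = reduce_sum(a, lo, mid)        # in a parallel runtime: spawn
    right = reduce_sum(a, mid, hi)       # independent of `left`
    return left + right                  # join: O(log n) depth overall

data = list(range(1_000_000))
assert reduce_sum(data, 0, len(data)) == sum(data)
```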

124 citations


Cites background from "A Bridging Model for Multi-core Com..."

  • ...Clearly this lower bound also applies to more general multi-level cache models such as studied in [3, 4, 30, 42, 46, 49]....

  • ...It would be interesting to extend these results to more general multi-level models [3, 4, 18, 30, 42, 46, 49], while preserving the goal of supporting a simple model for algorithm design and analysis....

Journal ArticleDOI
TL;DR: GV100 is NVIDIA's latest flagship GPU that features enhancements to NVLink, a redesigned streaming multiprocessor (SM), and independent thread scheduling enhancements to the single instruction, multiple threads (SIMT) model.
Abstract: GV100 is NVIDIA's latest flagship GPU. It has been designed with many features to improve performance and programmability. It features enhancements to NVLink, a redesigned streaming multiprocessor (SM), and independent thread scheduling enhancements to the single instruction, multiple threads (SIMT) model. GV100 also adds new tensor cores for an order of magnitude throughput improvement for deep-learning kernels.

101 citations

Proceedings Article
19 Jun 2016
TL;DR: This paper introduces a new technique for mapping Deep Recurrent Neural Networks efficiently onto GPUs that uses persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps.
Abstract: This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16× reduction in activation memory footprint, enables model training with 12× more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
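
A rough traffic model shows why weight reuse dominates at small mini-batch sizes: a direct implementation re-reads the recurrent weight matrix from DRAM at every timestep, while a persistent kernel loads it once. The model and the numbers below are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope model of off-chip traffic for an RNN layer.

def bytes_moved(hidden, timesteps, batch, persistent, dtype_bytes=4):
    weight_bytes = hidden * hidden * dtype_bytes
    # Read the hidden state in and write it out at every timestep.
    act_bytes = 2 * hidden * batch * timesteps * dtype_bytes
    # Persistent kernels load the weights once; naive kernels reload
    # them every timestep.
    weight_traffic = weight_bytes if persistent else weight_bytes * timesteps
    return weight_traffic + act_bytes

h, T, b = 1152, 256, 4  # hypothetical layer width, timesteps, mini-batch
naive = bytes_moved(h, T, b, persistent=False)
persist = bytes_moved(h, T, b, persistent=True)
print(f"total off-chip traffic reduced ~{naive / persist:.0f}x at mini-batch {b}")
```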

88 citations

Journal Article · DOI
TL;DR: The ICE abstraction may take CS from serial (single-core) computing to effective parallel (many-core) computing.
Abstract: The ICE abstraction may take CS from serial (single-core) computing to effective parallel (many-core) computing.

51 citations

References
Journal Article · DOI
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, and results are given quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model; yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
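
For reference, the BSP cost model charges each superstep w + g·h + l, where w is the maximum local work, h is the maximum number of words any processor sends or receives, g is the bandwidth cost per word, and l is the barrier synchronization cost. A minimal sketch with made-up parameter values:

```python
# Total BSP cost of a program, summed over its supersteps.

def bsp_cost(supersteps, g, l):
    """supersteps: list of (w, h) pairs; returns the total model cost."""
    return sum(w + g * h + l for (w, h) in supersteps)

# A hypothetical 3-superstep program on a machine with g=4, l=100.
program = [(10_000, 200), (5_000, 800), (12_000, 50)]
print(bsp_cost(program, g=4, l=100))
```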

3,885 citations


"A Bridging Model for Multi-core Com..." refers background or methods in this paper

  • ...The originally proposed bridging model was the BSP model [28]....

  • ...We have argued previously that the general problem of parallel computing should be approached via two notions [13, 28]....

Journal Article · DOI
TL;DR: Tight upper and lower bounds are provided for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition.
Abstract: We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match. Secondary storage is modeled as a magnetic disk capable of transferring P blocks each containing B records in a single time unit; the records in each block must be input from or output to B contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for P = 1 that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case B = P = O(1).
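
The tight sorting bound from this paper is Theta((N/(PB)) · log(N/B)/log(M/B)) I/Os, where N is the number of records, M the internal memory size, B the block size, and P the blocks per transfer. The helper below simply evaluates that expression for concrete, made-up parameters.

```python
import math

# Evaluate the Aggarwal-Vitter sorting bound (up to constant factors).

def sort_io_bound(N, M, B, P=1):
    return (N / (P * B)) * math.log(N / B) / math.log(M / B)

# Example: 1e9 records, internal memory of 1e6 records, blocks of 1e3.
print(f"{sort_io_bound(N=1e9, M=1e6, B=1e3):,.0f} I/Os (up to constants)")
```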

1,344 citations


"A Bridging Model for Multi-core Com..." refers methods in this paper

  • ...Our lower bound results for straight-line programs we derive using the approach of Irony, Toledo and Tiskin [25] (and also of [23, 33]), while the result for sorting uses an adversarial argument of Aggarwal and Vitter [3]....

  • ...The bounds for sorting follow from the following adversarial argument adapted from Aggarwal and Vitter [3]: Let S = Mi. Consider any total ordering of all the level i component supersteps that respects all the time dependencies....

  • ...Our proofs of optimality for communication and synchronization given in this section and the one to follow all derive from lower bounds on the number of communication steps required in distributed algorithms and are direct applications of previous work, particularly of Hong and Kung [23], Aggarwal and Vitter [3], Savage [33, 34] and Irony, Toledo and Tiskin [25]....

Journal Article · DOI
TL;DR: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores.
Abstract: Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster.
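
As a worked instance of the corollary, the sketch below implements Hill and Marty's symmetric-chip speedup formula, with single-core performance modeled as sqrt(r) as in the paper; the BCE budget and parallel fraction chosen here are illustrative.

```python
import math

# Symmetric-multicore corollary to Amdahl's law: a chip has a budget of
# n base-core-equivalents (BCEs), spent on n/r cores of r BCEs each,
# where a single core's performance scales as perf(r) = sqrt(r).
# f is the parallelizable fraction of the work.

def perf(r):
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

# With f=0.99 and a 256-BCE budget, compare many small cores to fewer,
# larger ones: mid-sized cores win, illustrating the paper's point.
for r in (1, 4, 16, 64):
    print(f"r={r:3d} cores={256 // r:4d} "
          f"speedup={symmetric_speedup(0.99, 256, r):6.1f}")
```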

1,245 citations


"A Bridging Model for Multi-core Com..." refers background in this paper

  • ...Another issue not discussed here is the role of non-homogeneous cores [16, 21]....

Proceedings Article · DOI
01 May 1978
TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented; nondeterministic parallel RAMs can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines.
Abstract: A model of computation based on random access machines operating in parallel and sharing a common memory is presented. The computational power of this model is related to that of traditional models. In particular, deterministic parallel RAM's can accept in polynomial time exactly the sets accepted by polynomial tape bounded Turing machines; nondeterministic RAM's can accept in polynomial time exactly the sets accepted by nondeterministic exponential time bounded Turing machines. Similar results hold for other classes. The effect of limiting the size of the common memory is also considered.
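
As a toy illustration of the model (not a construction from the paper): p processors proceed in lockstep against a shared memory, with all reads of a step completing before any writes. The sketch below runs a pointer-doubling prefix sum in O(log n) synchronous steps, a CREW-style program.

```python
# One synchronous PRAM-style round: all processors read, then all write.

def pram_round(n, d, mem):
    # Phase 1: every processor i reads its operand synchronously.
    reads = [mem[i - d] if i >= d else 0 for i in range(n)]
    # Phase 2: every processor i writes its result synchronously.
    for i in range(n):
        mem[i] += reads[i]

def prefix_sums(values):
    mem, d = list(values), 1
    while d < len(mem):               # O(log n) rounds in total
        pram_round(len(mem), d, mem)
        d *= 2
    return mem

print(prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```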

951 citations


Additional excerpts

  • ...BSP with (p1 ≥ 1; g1 = 1; L1 = 0; m1) is the PRAM [11, 20] model....

Book Chapter · DOI
02 Jan 1991
TL;DR: The authors survey parallel algorithms for shared-memory machines, covering the theoretical foundations of parallel algorithms and parallel architectures and the question of the appropriate logical organization of a massively parallel computer.
Abstract: This chapter discusses parallel algorithms for shared-memory machines. Parallel computation is rapidly becoming a dominant theme in all areas of computer science and its applications. It is estimated that, within a decade, virtually all developments in computer architecture, systems programming, computer applications and the design of algorithms will be taking place within the context of parallel computation. In preparation for this revolution, theoretical computer scientists have begun to develop a body of theory centered on parallel algorithms and parallel architectures. As there is no consensus yet on the appropriate logical organization of a massively parallel computer, and as the speed of parallel algorithms is constrained as much by limits on interprocessor communication as it is by purely computational issues, it is not surprising that a variety of abstract models of parallel computation have been pursued. Closest to the hardware level are the VLSI models, which focus on the technological limits of today's chips, in which gates and wires are packed into a small number of planar layers.

812 citations