
Showing papers on "Bulk synchronous parallel" published in 1994


Journal ArticleDOI
TL;DR: It is shown that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters p, g, and L.
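For reference, the BSP cost calculus behind these p, g, and L parameters charges each superstep w + g·h + L, where w is the maximum local computation, h the maximum number of words any processor sends or receives, and L the barrier latency. A minimal sketch of that accounting in Python; the workload numbers below are made up, not taken from the paper:

```python
# Illustrative BSP cost model: one superstep costs w + g*h + L.
# The example workload at the bottom is hypothetical.

def superstep_cost(w: int, h: int, g: float, L: float) -> float:
    """Standard BSP cost of a single superstep."""
    return w + g * h + L

def program_cost(supersteps, g: float, L: float) -> float:
    """Total cost is the sum of the per-superstep costs."""
    return sum(superstep_cost(w, h, g, L) for w, h in supersteps)

if __name__ == "__main__":
    # Three hypothetical supersteps: (local work, max words communicated).
    steps = [(10_000, 200), (5_000, 50), (20_000, 400)]
    print(program_cost(steps, g=4.0, L=100.0))  # 37900.0
```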

307 citations


Proceedings Article
01 Jan 1994
TL;DR: This paper theoretically and experimentally analyses the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures.
Abstract: Bulk synchronous parallel architectures offer the prospect of achieving both scalable parallel performance and architecture independent parallel software. They provide a robust model on which to base the future development of general purpose parallel computing systems. In this paper, we theoretically and experimentally analyse the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures. The computations considered include the iterative solution of sparse linear systems, molecular dynamics, and the solution of partial differential equations on a multidimensional discrete grid. We analyse these computations in a uniform manner by formulating their basic procedures as a sparse matrix-vector multiplication.
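Since the paper reduces its computations to sparse matrix-vector multiplication, here is a minimal sequential CSR sketch of that basic procedure; the BSP partitioning and communication steps analysed in the paper are not shown:

```python
# Sequential sparse matrix-vector product y = A @ x in CSR form -- the
# basic procedure the paper uses to express its scientific computations.
# (The BSP distribution of rows across processors is omitted.)

def csr_matvec(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 example: A = [[2, 0], [1, 3]], x = [1, 1]  ->  y = [2, 4]
print(csr_matvec([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0]))
```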

91 citations


Book ChapterDOI
06 Sep 1994
TL;DR: An important class of programs for shared-memory architectures is discussed, along with how they can be mapped to the LogP machine; a constant factor delay with respect to the optimal LogP execution time can be guaranteed.
Abstract: Currently, many parallel algorithms are defined for shared-memory architectures. The preferred machine model for designing these algorithms is the PRAM. However, this model does not take into account properties of existing architectures. Recently, Culler et al. defined the LogP machine model, which better reflects the behaviour of massively parallel computers. We discuss an important class of programs for shared-memory architectures and show how they can be mapped to the LogP machine. We define this class and show how to compute the mapping at compile time. For this mapping, a constant factor delay with respect to the optimal LogP execution time can be guaranteed.
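For orientation, the LogP model of Culler et al. charges a single message 2o + L end to end (send overhead, latency, receive overhead) and spaces successive sends from one processor by the gap g. A back-of-envelope sketch of those costs; the parameter values are hypothetical, and this is not the paper's compile-time mapping:

```python
# Back-of-envelope LogP message costs (L = latency, o = overhead,
# g = gap). Parameter values below are hypothetical.

def one_message(o: float, L: float) -> float:
    """Delivery time of a single message: send overhead + wire + receive."""
    return 2 * o + L

def k_messages(k: int, o: float, g: float, L: float) -> float:
    """A train of k messages from one sender to one receiver: the last
    message is injected after (k-1)*max(g, o), then incurs 2*o + L."""
    return (k - 1) * max(g, o) + 2 * o + L

print(one_message(o=2.0, L=10.0))           # 14.0
print(k_messages(8, o=2.0, g=4.0, L=10.0))  # 42.0
```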

25 citations


01 Jan 1994
TL;DR: It is shown that the characteristics of a particular parallel machine to be used need to be considered in transforming a given task into a parallel algorithm that executes effectively.
Abstract: Data parallelism is a model of parallel computing in which the same set of instructions is applied to all the elements in a data set. A sampling of data parallel algorithms is presented. The examples are certainly not exhaustive, but address many issues involved in designing data parallel algorithms. Case studies are used to illustrate some algorithm design techniques and to highlight some implementation decisions that influence the overall performance of a parallel algorithm. It is shown that the characteristics of the particular parallel machine to be used need to be considered in transforming a given task into a parallel algorithm that executes effectively. (Data Parallel Algorithms, by Howard Jay Siegel, Lee Wang, John John E. So, and Muthucumaru Maheswaran)
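To make the data-parallel model concrete (the same instruction sequence applied to every element of a data set), a minimal sketch using a pool of worker processes; the per-element operation is a made-up example, not one of the paper's case studies:

```python
# Data parallelism: apply the same operation to every element of a data
# set, here distributed over a pool of worker processes.
from multiprocessing import Pool

def op(x: float) -> float:
    # The per-element instruction sequence (hypothetical example).
    return x * x + 1.0

if __name__ == "__main__":
    data = list(range(16))
    with Pool(processes=4) as pool:
        result = pool.map(op, data)  # same op on all elements, in parallel
    print(result)
```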

24 citations


01 Jan 1994
TL;DR: PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) provides a latency corresponding to execution of just a few floating-point operations, and can be implemented at a cost of less than $50/PC, including cables.
Abstract: There are a lot of 386/486/Pentium-based personal computers (PCs) out there. They are affordable, reliable, and offer good performance. Thus, it is only natural to think of networking multiple PCs to create a high-performance parallel machine; the problem is that conventional networking systems cannot provide low latency synchronization and communication. Low latency allows fine grain parallelism; the longer the latency, the fewer the programs that can achieve good speedup through use of parallelism. Typical parallel machines constructed using PC networks (e.g., PVM software using Ethernet hardware) generally have latencies between 0.001s and 0.1s. Even the "best" commercially-available parallel computers can do no better than a latency corresponding to the time to execute hundreds to thousands of floating-point operations. In contrast, PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) provides a latency corresponding to execution of just a few floating-point operations. Despite this, PAPERS can be implemented at a cost of less than $50/PC, including cables. This work was supported in part by the Office of Naval Research (ONR) under grant number N00014-91-J-4013 and by the National Science Foundation (NSF) under award number 9015696-CDA.

21 citations


01 Jan 1994
TL;DR: A brief overview of the Proteus system is presented, and its use in the exploration and development of several non-trivial algorithms, including the fast multipole algorithm for N-body computations, is described.
Abstract: The Proteus language is a wide-spectrum parallel programming notation that supports the expression of both high-level architecture-independent specifications and lower-level architecture-specific implementations. A methodology based on successive refinement and interactive experimentation supports the development of parallel algorithms from specification to various efficient architecture-dependent implementations. The Proteus system combines the language and tools supporting this methodology. This paper presents a brief overview of the Proteus system and describes its use in the exploration and development of several non-trivial algorithms, including the fast multipole algorithm for N-body computations.

17 citations


Proceedings ArticleDOI
11 Dec 1994
TL;DR: This paper is a survey of currently available software tools that facilitate the design of parallel discrete-event simulations.
Abstract: A number of algorithms have been developed to support parallel execution of discrete-event simulation models. In general, these algorithms are complex and implementing them directly in a simulation model is a difficult and resource-intensive programming task. Parallel simulation languages and environments can be of considerable help in hiding the complexity of the underlying synchronization algorithm and in providing a simpler virtual machine to the model designer. This paper is a survey of currently available software tools that facilitate the design of parallel discrete-event simulations.

12 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: The broad thesis presented suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem.
Abstract: The broad thesis presented suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. It is too early to reach definite conclusions regarding the significance of this thesis. However, using some imagination, validity of the thesis and some arguments supporting it may lead to several far-reaching outcomes: (1) Reliance on "predictability of reference" in the design of computer systems will increase. (2) Parallel algorithms will be taught as part of the standard computer science and engineering undergraduate curricula, irrespective of whether (or when) parallel processing will become ubiquitous in the general-purpose computing world. (3) A strategic agenda for high-performance parallel computing: a multistage agenda, which in no stage compromises user-friendliness of the programmer's model, and thereby potentially alleviates the so-called "parallel software crisis". Stimulating a debate is one goal of our presentation.

12 citations


01 Aug 1994
TL;DR: This thesis introduces and evaluates three methods for automatically transforming a parallel algorithm into a less parallel one that takes advantage of only the parallelism available at run time; the methods combine the efficiency of good serial algorithms with the ease of writing, reading, debugging, and detecting parallelism in high-level programs.
Abstract: Programs often exhibit more parallelism than is actually available in the target architecture. This thesis introduces and evaluates three methods: loop unrolling, loop common expression elimination, and loop differencing, for automatically transforming a parallel algorithm into a less parallel one that takes advantage of only the parallelism available at run time. The resulting program performs less computation to produce its results; the running time is not just improved via second-order effects such as improving use of the memory hierarchy or reducing overhead (such optimizations can further improve performance). The asymptotic complexity is not usually reduced, but the constant factors can be lowered significantly, often by a factor of 4 or more. The basis for these methods is the detection of loop common expressions, or common subexpressions in different iterations of a parallel loop. The loop differencing method also permits computation of just the change in an expression from iteration to iteration. We define the class of generalized stencil computations, in which loop common expressions can be easily found; each result combines w operands, so a naive implementation requires w operand evaluations and w - 1 combining operations per result. Unrolling and application of the two-phase common subexpression elimination algorithm, which we introduce and which significantly outperforms other common subexpression elimination algorithms, can reduce its cost to less than 2 operand evaluations and 3 combining operations per result. Loop common expression elimination decreases these costs to 1 and log w, respectively; when combined with unrolling they drop to 1 operand evaluation and 4 combining operations per result. Loop differencing reduces the per-result costs to 2 operand evaluations and 2 combining operations. We discuss the tradeoffs among these techniques and when each should be applied. We can achieve such speedups because, while the maximally parallel implementation of an algorithm achieves the greatest speedup on a parallel machine with sufficiently many processors, it may be inefficient when run on a machine with too few processors. Serial implementations, on the other hand, run faster on single-processor computers but often contain dependences which prevent parallelization. Our methods combine the efficiency of good serial algorithms with the ease of writing, reading, debugging, and detecting parallelism in high-level programs. Our three methods are primarily applicable to MIMD and SIMD implementations of data-parallel languages when the data set size is larger than the number of processors (including uniprocessor implementations), but they can also improve the performance of parallel programs without serializing them. The methods may be applied as an optimization of a parallelizing compiler after a serial program's parallelism has been exposed, and they are also applicable to some purely serial programs which manipulate arrays or other structured data. The techniques have been implemented, and preliminary timing results are reported. Real-world computations are used as examples throughout, and an appendix lists more potential applications. This technical report is a revision (clarifying and expanding some sections) of the author's M.S. thesis [48], supervised by Charles Leiserson. This work was supported by a National Defense and Science Graduate Fellowship, by Defense Advanced Research Projects Agency contract N00014-91-J-1698, and by Microsoft Corporation.
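The windowed sum is the classic instance of a generalized stencil, and it shows the gap the thesis targets between the naive w-operand evaluation and an incremental update. A minimal sketch of the loop-differencing idea, not the thesis's actual transformation system:

```python
# A generalized-stencil example: each result combines w operands.
# Naive version: w operand evaluations and w - 1 additions per result.
def windowed_sums_naive(a, w):
    return [sum(a[i:i + w]) for i in range(len(a) - w + 1)]

# Loop differencing: compute only the change from one iteration to the
# next -- 2 operand evaluations and 2 combining operations per result.
def windowed_sums_diff(a, w):
    s = sum(a[:w])
    out = [s]
    for i in range(len(a) - w):
        s += a[i + w] - a[i]  # incremental update between iterations
        out.append(s)
    return out

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert windowed_sums_naive(a, 3) == windowed_sums_diff(a, 3)
print(windowed_sums_diff(a, 3))  # [8, 6, 10, 15, 16, 17]
```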

11 citations



Journal ArticleDOI
TL;DR: ProperSYN, a portable parallel algorithm for logic synthesis based on the Transduction method, is described; it uses an asynchronous message-driven data-flow model of computation, with no explicit synchronizing barriers separating different phases of parallel computation as used in many previously developed parallel algorithms.
Abstract: Combinational logic synthesis is a very important phase of VLSI system design. But the logic synthesis process requires large computing times if near optimal quality of the logic network is desired. Parallel processing is fast becoming an attractive solution to reduce the computational time. Recently, researchers have started to investigate parallel algorithms for problems in logic synthesis and verification. Much of the work in parallel algorithms for CAD reported to date, however, suffers from a major limitation. The parallel algorithms proposed for the CAD applications are designed with a specific underlying parallel architecture in mind. Moreover, incompatibilities in programming environments also make it difficult to port these programs across different parallel machines. As a result, a parallel algorithm needs to be developed afresh for every target parallel architecture. The ongoing ProperCAD project offers an attractive solution to that problem. It allows the development and implementation of a parallel algorithm on the CHARM runtime system such that it can be executed on all the parallel machines without any change in the program. In this paper, we describe a portable parallel algorithm for logic synthesis based on the Transduction method, called ProperSYN. This algorithm uses an asynchronous message-driven data-flow model of computation, with no explicit synchronizing barriers separating different phases of parallel computation as used in many previously developed parallel algorithms. Our algorithm is therefore more scalable to large numbers of processors. The algorithm has been implemented and it runs on a variety of parallel machines. We present results on several benchmark circuits for shared memory MIMD machines like the Sequent Symmetry and Encore Multimax, distributed memory MIMD machines like the Intel iPSC/860 hypercube, and distributed processing systems like networks of SUN workstations.

Book ChapterDOI
Jeanne Ferrante1
08 Aug 1994
TL;DR: This paper extends traditional analysis to array section analysis for parallel languages which include additional control and synchronization structures to aid in the development of explicitly parallel programming languages.
Abstract: Data flow analysis has been used by compilers in diverse contexts, from optimization to register allocation. Traditional analysis of sequential programs has centered on scalar variables. More recently, several researchers have investigated analysis of array sections for optimizations on modern architectures. This information has been used to distribute data, optimize data movement and vectorize or parallelize programs. As multiprocessors become more common-place, we believe there will be considerable interest in explicitly parallel programming languages. In this paper, we extend traditional analysis to array section analysis for parallel languages which include additional control and synchronization structures.

Book ChapterDOI
04 Jul 1994
TL;DR: This paper presents a brief introduction to an alternative memory-level scheme which offers the prospect of achieving both efficient and transparent synchronization in bulk synchronous parallel architectures.
Abstract: Bulk synchronous parallel architecture incorporates a scalable and transparent communication model. The task-level synchronization mechanism of the machine, however, is not transparent to the user and can be inefficient when applied to the coordination of irregular parallelism. This paper presents a brief introduction to an alternative memory-level scheme which offers the prospect of achieving both efficient and transparent synchronization. This scheme, based on a discrete event simulation paradigm, supports a sequential style of programming and, coupled with the BSP communication model, leads to the emergence of a virtual von Neumann parallel computer.

Proceedings ArticleDOI
02 May 1994
TL;DR: It is asserted that a parallel programming methodology must be based on a three-level decomposition, and the notion of algorithms which scale on a distributed memory parallel computer is defined.
Abstract: We compare various models of parallel machines and show that they can be classified in two classes: algorithm oriented or execution oriented. None of them is really satisfying from the user's point of view; hence bridging models have been proposed. Contrary to what is done in the sequential world, where a two-level decomposition is used (programming-compiling), we assert that a parallel programming methodology must be based on a three-level decomposition. We define the notion of algorithms which scale on a distributed memory parallel computer. We propose such a methodology and advocate its advantages. Then we point out the main difficulties in parallel programming.

Proceedings ArticleDOI
31 Jan 1994
TL;DR: The authors compare estimates generated by ES to measurements made of a parallel mergesort executing on an Intel iPSC/860 hypercube.
Abstract: ES is a tool for estimating the execution times of parallel algorithms on MIMD parallel systems. ES allows the user to model arbitrary task execution times, explicit task precedence and synchronization constraints, resource contention among tasks, and a variety of scheduling policies for shared resources. Given a model of a parallel algorithm and a parallel system, ES constructs a sequencing tree that represents some or all of the possible sequences of events that may occur during the execution of the algorithm on the system, and uses it to estimate the mean and standard deviation of the execution time of the parallel algorithm. The authors compare estimates generated by ES to measurements made of a parallel mergesort executing on an Intel iPSC/860 hypercube.


05 Dec 1994
TL;DR: It is argued that shared-memory BSP is efficiently implementable on a wide variety of parallel hardware, and that BSP forms a useful basis for providing an even higher level programming interface based on Sequential Consistency (SC).
Abstract: For parallel programs to become portable, they must be executable with uniform efficiency on a variety of hardware platforms, which is not the case at present. In 1990, Valiant proposed Bulk-Synchronous Parallelism (BSP) as a model on which portable parallel programs can be built. We argue that shared-memory BSP is efficiently implementable on a wide variety of parallel hardware, and that BSP forms a useful basis for providing an even higher level programming interface based on Sequential Consistency (SC). A list of memory and thread management features needed to support BSP and SC parallel programs is given, under the assumption that the parallel computer is space-shared among multiple parallel tasks, rather than time-shared. Known techniques to realize efficiently the most important of these features are sketched.
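A minimal sketch of what a shared-memory BSP superstep looks like, assuming barrier-separated read and write phases over shared storage; this illustrates the model only, not the memory and thread management features the paper enumerates:

```python
# Minimal shared-memory BSP sketch: threads alternate local computation
# with barriers, so values written in superstep t are read in t+1.
import threading

P = 4
barrier = threading.Barrier(P)
shared = [0] * P  # one cell per thread

def worker(pid: int, supersteps: int):
    for _ in range(supersteps):
        local = shared[(pid + 1) % P] + 1  # read neighbour's last value
        barrier.wait()                     # all reads finish before writes
        shared[pid] = local                # publish for the next superstep
        barrier.wait()                     # bulk synchronization

threads = [threading.Thread(target=worker, args=(i, 3)) for i in range(P)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared)  # [3, 3, 3, 3]
```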

Book ChapterDOI
24 Feb 1994
TL;DR: A simple optimal parallel algorithm is presented for pre-processing the input tree such that path queries can be answered efficiently and the paths between pairs of nodes in an n-node tree can be reported.
Abstract: We present optimal parallel solutions to reporting paths between pairs of nodes in an n-node tree. Our algorithms are deterministic and designed to run on an exclusive read exclusive write parallel random-access machine (EREW PRAM). In particular, we provide a simple optimal parallel algorithm for pre-processing the input tree such that the path queries can be answered efficiently. Our algorithm for preprocessing runs in O(log n) time using O(n/log n) processors. Using the preprocessing, we can report paths between k node pairs in O(log n + log k) time using O(k + (n + S)/log n) processors on an EREW PRAM, where S is the size of the output. In particular, we can report the path between a single pair of distinct nodes in O(log n) time using O(L/log n) processors, where L denotes the length of the path.
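For intuition about the queries being answered, a minimal sequential sketch that reports the path between two nodes by climbing parent pointers to their meeting ancestor; the paper's EREW PRAM preprocessing and processor bounds are not reproduced here:

```python
# Report the path between nodes u and v in a rooted tree, sequentially.
# parent[r] == r marks the root. This specifies the query only; the
# paper answers it in O(log n) time on an EREW PRAM after preprocessing.

def path(parent, u, v):
    # Collect all ancestors of u, then walk up from v until we meet them.
    up, seen = [], {}
    x = u
    while True:
        seen[x] = len(up)
        up.append(x)
        if parent[x] == x:
            break
        x = parent[x]
    down = []
    x = v
    while x not in seen:
        down.append(x)
        x = parent[x]
    # Path: u up to the meeting ancestor, then back down to v.
    return up[:seen[x] + 1] + down[::-1]

# Tree: 0 is the root; 1 and 2 are children of 0; 3 is a child of 1.
parent = {0: 0, 1: 0, 2: 0, 3: 1}
print(path(parent, 3, 2))  # [3, 1, 0, 2]
```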

Proceedings ArticleDOI
22 Aug 1994
TL;DR: This paper describes a compiler environment for BSP-L, an experimental programming language based on the Bulk Synchronous Parallel model of computation, whose goal is to enable the generation of highly efficient, architecture independent software for a wide range of high performance parallel computers.
Abstract: This paper describes a compiler environment for BSP-L, an experimental programming language based on the Bulk Synchronous Parallel model of computation, whose goal is to enable the generation of highly efficient, architecture independent software for a wide range of high performance parallel computers.

Book ChapterDOI
04 Jul 1994
TL;DR: This paper uses a portable parallel programming environment, the pSystem, to evaluate and compare the performance of various scheduling algorithms on shared memory parallel machines.
Abstract: The efficiency of scheduling algorithms is essential in order to attain optimal performances from parallel programming systems. In this paper we use a portable parallel programming environment we have implemented, the pSystem, to evaluate and compare the performance of various scheduling algorithms on shared memory parallel machines.

01 Mar 1994
TL;DR: This paper describes the behavior of ports, along with the other parallel constructs of Pscheme, a parallel dialect of Scheme that can be used to build higher-level parallel programming abstractions, such as futures, semaphores, and Ada-style rendezvous.
Abstract: In this paper, we describe Pscheme, a parallel dialect of Scheme. The primary construct for specifying parallelism, synchronization, and communication is a natural extension of first-class continuations which we call a port. We describe the behavior of ports, along with the other parallel constructs of Pscheme. Because the user has precise control over the parallel computation, the Pscheme constructs can be used to build higher-level parallel programming abstractions, such as futures, semaphores, and Ada-style rendezvous. We provide the Pscheme code for these abstractions and discuss the current implementation of Pscheme on a shared-memory multiprocessor.
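Pscheme's ports are an extension of first-class continuations and are specific to Scheme; purely as an analogy in this document's example language, here is the futures abstraction the abstract mentions, sketched on top of a one-shot queue rather than ports:

```python
# Analogy only: Pscheme builds futures on top of ports (continuation
# extensions). The same higher-level abstraction, built here on a
# one-shot queue and a worker thread in Python.
import threading
import queue

def future(fn, *args):
    """Start fn(*args) in parallel; return a zero-argument 'touch'."""
    q = queue.Queue(maxsize=1)
    threading.Thread(target=lambda: q.put(fn(*args))).start()
    result = []
    def touch():
        if not result:
            result.append(q.get())  # block until the value arrives
        return result[0]
    return touch

f = future(sum, range(1_000_000))
print(f())  # blocks until ready, then prints 499999500000
```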


Proceedings ArticleDOI
01 Jan 1994
TL;DR: This paper describes an implementation of a parallel algorithm for the Boltzmann Machine as a network of two layers of managers and multiple groups of neurons, featuring large-scale parallel processing using many simple single-bit ALUs and effective expansion realized by multiple chips connected by simple bus lines.
Abstract: This paper describes an implementation of a parallel algorithm in the Boltzmann Machine (BM). The implementation is a network of two layers of managers and multiple groups of neurons. The features of the network are large-scale parallel processing using many simple single-bit ALUs, and effective expansion realized by multiple chips connected by simple bus lines.

Proceedings ArticleDOI
19 Dec 1994
TL;DR: A parallel programming environment based on message passing is introduced, which makes it simple to develop parallel applications and achieves high performance.
Abstract: With the development of parallel processing technology, more and more high-performance parallel computer systems have been developed. A convenient and flexible parallel programming environment plays an important role in the spread of parallel computing. How to write efficient parallel codes and how to convert existing sequential applications into parallel codes have become very important issues in parallel processing. We introduce a parallel programming environment based on message passing, which makes it simple to develop parallel applications and achieves high performance.