
Showing papers on "Bulk synchronous parallel" published in 1994


Journal ArticleDOI
TL;DR: It is shown that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters p, g, and L.
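For reference, the BSP cost calculus behind these p, g, and L parameters charges each superstep w + g·h + L, where w is the maximum local computation, h the maximum number of words any processor sends or receives, and L the barrier latency. A minimal sketch of that accounting in Python; the workload numbers below are made up, not taken from the paper:

```python
# Illustrative BSP cost model: one superstep costs w + g*h + L.
# The example workload at the bottom is hypothetical.

def superstep_cost(w: int, h: int, g: float, L: float) -> float:
    """Standard BSP cost of a single superstep."""
    return w + g * h + L

def program_cost(supersteps, g: float, L: float) -> float:
    """Total cost is the sum of the per-superstep costs."""
    return sum(superstep_cost(w, h, g, L) for w, h in supersteps)

if __name__ == "__main__":
    # Three hypothetical supersteps: (local work, max words communicated).
    steps = [(10_000, 200), (5_000, 50), (20_000, 400)]
    print(program_cost(steps, g=4.0, L=100.0))  # 37900.0
```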

307 citations


Proceedings Article
01 Jan 1994
TL;DR: This paper theoretically and experimentally analyses the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures.
Abstract: Bulk synchronous parallel architectures offer the prospect of achieving both scalable parallel performance and architecture independent parallel software. They provide a robust model on which to base the future development of general purpose parallel computing systems. In this paper, we theoretically and experimentally analyse the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures. The computations considered include the iterative solution of sparse linear systems, molecular dynamics, and the solution of partial differential equations on a multidimensional discrete grid. We analyse these computations in a uniform manner by formulating their basic procedures as a sparse matrix-vector multiplication.
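Since the paper reduces its computations to sparse matrix-vector multiplication, here is a minimal sequential CSR sketch of that basic procedure; the BSP partitioning and communication steps analysed in the paper are not shown:

```python
# Sequential sparse matrix-vector product y = A @ x in CSR form -- the
# basic procedure the paper uses to express its scientific computations.
# (The BSP distribution of rows across processors is omitted.)

def csr_matvec(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 example: A = [[2, 0], [1, 3]], x = [1, 1]  ->  y = [2, 4]
print(csr_matvec([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0]))
```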

91 citations


Book ChapterDOI
06 Sep 1994
TL;DR: An important class of programs for shared-memory architectures is discussed, along with how they can be mapped to the LogP machine; a constant factor delay with respect to the optimal LogP execution time can be guaranteed.
Abstract: Currently, many parallel algorithms are defined for shared-memory architectures. The preferred machine model for designing these algorithms is the PRAM. However, this model does not take into account properties of existing architectures. Recently, Culler et al. defined the LogP machine model, which better reflects the behaviour of massively parallel computers. We discuss an important class of programs for shared-memory architectures and show how they can be mapped to the LogP machine. We define this class and show how to compute the mapping at compile time. For this mapping, a constant factor delay with respect to the optimal LogP execution time can be guaranteed.
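For orientation, the LogP model of Culler et al. charges a single message 2o + L end to end (send overhead, latency, receive overhead) and spaces successive sends from one processor by the gap g. A back-of-envelope sketch of those costs; the parameter values are hypothetical, and this is not the paper's compile-time mapping:

```python
# Back-of-envelope LogP message costs (L = latency, o = overhead,
# g = gap). Parameter values below are hypothetical.

def one_message(o: float, L: float) -> float:
    """Delivery time of a single message: send overhead + wire + receive."""
    return 2 * o + L

def k_messages(k: int, o: float, g: float, L: float) -> float:
    """A train of k messages from one sender to one receiver: the last
    message is injected after (k-1)*max(g, o), then incurs 2*o + L."""
    return (k - 1) * max(g, o) + 2 * o + L

print(one_message(o=2.0, L=10.0))           # 14.0
print(k_messages(8, o=2.0, g=4.0, L=10.0))  # 42.0
```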

25 citations


01 Jan 1994
TL;DR: It is shown that the characteristics of a particular parallel machine to be used need to be considered in transforming a given task into a parallel algorithm that executes effectively.
Abstract: Data parallelism is a model of parallel computing in which the same set of instructions is applied to all the elements in a data set. A sampling of data parallel algorithms is presented. The examples are certainly not exhaustive, but address many issues involved in designing data parallel algorithms. Case studies are used to illustrate some algorithm design techniques and to highlight some implementation decisions that influence the overall performance of a parallel algorithm. It is shown that the characteristics of the particular parallel machine to be used need to be considered in transforming a given task into a parallel algorithm that executes effectively. (Data Parallel Algorithms, by Howard Jay Siegel, Lee Wang, John John E. So, and Muthucumaru Maheswaran)
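To make the data-parallel model concrete (the same instruction sequence applied to every element of a data set), a minimal sketch using a pool of worker processes; the per-element operation is a made-up example, not one of the paper's case studies:

```python
# Data parallelism: apply the same operation to every element of a data
# set, here distributed over a pool of worker processes.
from multiprocessing import Pool

def op(x: float) -> float:
    # The per-element instruction sequence (hypothetical example).
    return x * x + 1.0

if __name__ == "__main__":
    data = list(range(16))
    with Pool(processes=4) as pool:
        result = pool.map(op, data)  # same op on all elements, in parallel
    print(result)
```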

24 citations


01 Jan 1994
TL;DR: PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) provides a latency corresponding to execution of just a few floating-point operations, and can be implemented at a cost of less than $50/PC, including cables.
Abstract: There are a lot of 386/486/Pentium-based personal computers (PCs) out there. They are affordable, reliable, and offer good performance. Thus, it is only natural to think of networking multiple PCs to create a high-performance parallel machine; the problem is that conventional networking systems cannot provide low latency synchronization and communication. Low latency allows fine grain parallelism; the longer the latency, the fewer the programs that can achieve good speedup through use of parallelism. Typical parallel machines constructed using PC networks (e.g., PVM software using Ethernet hardware) generally have latencies between 0.001s and 0.1s. Even the "best" commercially-available parallel computers can do no better than a latency corresponding to the time to execute hundreds to thousands of floating-point operations. In contrast, PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) provides a latency corresponding to execution of just a few floating-point operations. Despite this, PAPERS can be implemented at a cost of less than $50/PC, including cables. This work was supported in part by the Office of Naval Research (ONR) under grant number N00014-91-J-4013 and by the National Science Foundation (NSF) under award number 9015696-CDA.

21 citations


01 Jan 1994
TL;DR: A brief overview of the Proteus system is presented, and its use in the exploration and development of several non-trivial algorithms, including the fast multipole algorithm for N-body computations, is described.
Abstract: The Proteus language is a wide-spectrum parallel programming notation that supports the expression of both high-level architecture-independent specifications and lower-level architecture-specific implementations. A methodology based on successive refinement and interactive experimentation supports the development of parallel algorithms from specification to various efficient architecture-dependent implementations. The Proteus system combines the language and tools supporting this methodology. This paper presents a brief overview of the Proteus system and describes its use in the exploration and development of several non-trivial algorithms, including the fast multipole algorithm for N-body computations.

17 citations


Proceedings ArticleDOI
11 Dec 1994
TL;DR: This paper is a survey of currently available software tools that facilitate the design of parallel discrete-event simulations.
Abstract: A number of algorithms have been developed to support parallel execution of discrete-event simulation models. In general, these algorithms are complex and implementing them directly in a simulation model is a difficult and resource-intensive programming task. Parallel simulation languages and environments can be of considerable help in hiding the complexity of the underlying synchronization algorithm and in providing a simpler virtual machine to the model designer. This paper is a survey of currently available software tools that facilitate the design of parallel discrete-event simulations.

12 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: The broad thesis presented suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem.
Abstract: The broad thesis presented suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. It is too early to reach definite conclusions regarding the significance of this thesis. However, using some imagination, validity of the thesis and some arguments supporting it may lead to several far-reaching outcomes: (1) Reliance on "predictability of reference" in the design of computer systems will increase. (2) Parallel algorithms will be taught as part of the standard computer science and engineering undergraduate curricula, irrespective of whether (or when) parallel processing will become ubiquitous in the general-purpose computing world. (3) A strategic agenda for high-performance parallel computing: a multistage agenda, which in no stage compromises user-friendliness of the programmer's model, and thereby potentially alleviates the so-called "parallel software crisis". Stimulating a debate is one goal of our presentation.

12 citations


01 Aug 1994
TL;DR: This thesis introduces and evaluates three methods for automatically transforming a parallel algorithm into a less parallel one that takes advantage of only the parallelism available at run time; the methods combine the efficiency of good serial algorithms with the ease of writing, reading, debugging, and detecting parallelism in high-level programs.
Abstract: Programs often exhibit more parallelism than is actually available in the target architecture. This thesis introduces and evaluates three methods: loop unrolling, loop common expression elimination, and loop differencing, for automatically transforming a parallel algorithm into a less parallel one that takes advantage of only the parallelism available at run time. The resulting program performs less computation to produce its results; the running time is not just improved via second-order effects such as improving use of the memory hierarchy or reducing overhead (such optimizations can further improve performance). The asymptotic complexity is not usually reduced, but the constant factors can be lowered significantly, often by a factor of 4 or more. The basis for these methods is the detection of loop common expressions, or common subexpressions in different iterations of a parallel loop. The loop differencing method also permits computation of just the change in an expression from iteration to iteration. We define the class of generalized stencil computations, in which loop common expressions can be easily found; each result combines w operands, so a naive implementation requires w operand evaluations and w - 1 combining operations per result. Unrolling and application of the two-phase common subexpression elimination algorithm, which we introduce and which significantly outperforms other common subexpression elimination algorithms, can reduce its cost to less than 2 operand evaluations and 3 combining operations per result. Loop common expression elimination decreases these costs to 1 and log w, respectively; when combined with unrolling they drop to 1 operand evaluation and 4 combining operations per result. Loop differencing reduces the per-result costs to 2 operand evaluations and 2 combining operations. We discuss the tradeoffs among these techniques and when each should be applied. We can achieve such speedups because, while the maximally parallel implementation of an algorithm achieves the greatest speedup on a parallel machine with sufficiently many processors, it may be inefficient when run on a machine with too few processors. Serial implementations, on the other hand, run faster on single-processor computers but often contain dependences which prevent parallelization. Our methods combine the efficiency of good serial algorithms with the ease of writing, reading, debugging, and detecting parallelism in high-level programs. Our three methods are primarily applicable to MIMD and SIMD implementations of data-parallel languages when the data set size is larger than the number of processors (including uniprocessor implementations), but they can also improve the performance of parallel programs without serializing them. The methods may be applied as an optimization of a parallelizing compiler after a serial program's parallelism has been exposed, and they are also applicable to some purely serial programs which manipulate arrays or other structured data. The techniques have been implemented, and preliminary timing results are reported. Real-world computations are used as examples throughout, and an appendix lists more potential applications. This technical report is a revision (clarifying and expanding some sections) of the author's M.S. thesis [48], supervised by Charles Leiserson. This work was supported by a National Defense and Science Graduate Fellowship, by Defense Advanced Research Projects Agency contract N00014-91-J-1698, and by Microsoft Corporation.
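The windowed sum is the classic instance of a generalized stencil, and it shows the gap the thesis targets between the naive w-operand evaluation and an incremental update. A minimal sketch of the loop-differencing idea, not the thesis's actual transformation system:

```python
# A generalized-stencil example: each result combines w operands.
# Naive version: w operand evaluations and w - 1 additions per result.
def windowed_sums_naive(a, w):
    return [sum(a[i:i + w]) for i in range(len(a) - w + 1)]

# Loop differencing: compute only the change from one iteration to the
# next -- 2 operand evaluations and 2 combining operations per result.
def windowed_sums_diff(a, w):
    s = sum(a[:w])
    out = [s]
    for i in range(len(a) - w):
        s += a[i + w] - a[i]  # incremental update between iterations
        out.append(s)
    return out

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert windowed_sums_naive(a, 3) == windowed_sums_diff(a, 3)
print(windowed_sums_diff(a, 3))  # [8, 6, 10, 15, 16, 17]
```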

11 citations



Journal ArticleDOI
TL;DR: ProperSYN, a portable parallel algorithm for logic synthesis based on the Transduction method, is described; it uses an asynchronous message-driven data-flow model of computation, with no explicit synchronizing barriers separating different phases of parallel computation as used in many previously developed parallel algorithms.
Abstract: Combinational logic synthesis is a very important phase of VLSI system design. But the logic synthesis process requires large computing times if near optimal quality of the logic network is desired. Parallel processing is fast becoming an attractive solution to reduce the computational time. Recently, researchers have started to investigate parallel algorithms for problems in logic synthesis and verification. Much of the work in parallel algorithms for CAD reported to date, however, suffers from a major limitation. The parallel algorithms proposed for the CAD applications are designed with a specific underlying parallel architecture in mind. Moreover, incompatibilities in programming environments also make it difficult to port these programs across different parallel machines. As a result, a parallel algorithm needs to be developed afresh for every target parallel architecture. The ongoing ProperCAD project offers an attractive solution to that problem. It allows the development and implementation of a parallel algorithm on the CHARM runtime system such that it can be executed on all the parallel machines without any change in the program. In this paper, we describe a portable parallel algorithm for logic synthesis based on the Transduction method, called ProperSYN. This algorithm uses an asynchronous message-driven data-flow model of computation, with no explicit synchronizing barriers separating different phases of parallel computation as used in many previously developed parallel algorithms. Our algorithm is therefore more scalable to large numbers of processors. The algorithm has been implemented and it runs on a variety of parallel machines. We present results on several benchmark circuits for shared memory MIMD machines like the Sequent Symmetry and Encore Multimax, distributed memory MIMD machines like the Intel iPSC/860 hypercube, and distributed processing systems like networks of SUN workstations.

Book ChapterDOI
Jeanne Ferrante1
08 Aug 1994
TL;DR: This paper extends traditional analysis to array section analysis for parallel languages which include additional control and synchronization structures to aid in the development of explicitly parallel programming languages.
Abstract: Data flow analysis has been used by compilers in diverse contexts, from optimization to register allocation. Traditional analysis of sequential programs has centered on scalar variables. More recently, several researchers have investigated analysis of array sections for optimizations on modern architectures. This information has been used to distribute data, optimize data movement and vectorize or parallelize programs. As multiprocessors become more common-place, we believe there will be considerable interest in explicitly parallel programming languages. In this paper, we extend traditional analysis to array section analysis for parallel languages which include additional control and synchronization structures.

Book ChapterDOI
04 Jul 1994
TL;DR: This paper presents a brief introduction to an alternative memory-level scheme which offers the prospect of achieving both efficient and transparent synchronization in bulk synchronous parallel architectures.
Abstract: Bulk synchronous parallel architecture incorporates a scalable and transparent communication model. The task-level synchronization mechanism of the machine, however, is not transparent to the user and can be inefficient when applied to the coordination of irregular parallelism. This paper presents a brief introduction to an alternative memory-level scheme which offers the prospect of achieving both efficient and transparent synchronization. This scheme, based on a discrete event simulation paradigm, supports a sequential style of programming and, coupled with the BSP communication model, leads to the emergence of a virtual von Neumann parallel computer.

Proceedings ArticleDOI
02 May 1994
TL;DR: It is asserted that a parallel programming methodology must be based on a three-level decomposition, and the notion of algorithms which scale on a distributed memory parallel computer is defined.
Abstract: We compare various models of parallel machines and show that they can be classified in two classes: algorithm oriented or execution oriented. None of them is really satisfying from the user's point of view; hence bridging models have been proposed. Contrary to what is done in the sequential world, where a two-level decomposition is used (programming-compiling), we assert that a parallel programming methodology must be based on a three-level decomposition. We define the notion of algorithms which scale on a distributed memory parallel computer. We propose such a methodology and advocate its advantages. Then we point out the main difficulties in parallel programming.

Proceedings ArticleDOI
31 Jan 1994
TL;DR: The authors compare estimates generated by ES to measurements made of a parallel mergesort executing on an Intel iPSC/860 hypercube.
Abstract: ES is a tool for estimating the execution times of parallel algorithms on MIMD parallel systems. ES allows the user to model arbitrary task execution times, explicit task precedence and synchronization constraints, resource contention among tasks, and a variety of scheduling policies for shared resources. Given a model of a parallel algorithm and a parallel system, ES constructs a sequencing tree that represents some or all of the possible sequences of events that may occur during the execution of the algorithm on the system, and uses it to estimate the mean and standard deviation of the execution time of the parallel algorithm. The authors compare estimates generated by ES to measurements made of a parallel mergesort executing on an Intel iPSC/860 hypercube.


05 Dec 1994
TL;DR: It is argued that shared-memory BSP is efficiently implementable on a wide variety of parallel hardware, and that BSP forms a useful basis for providing an even higher level programming interface based on Sequential Consistency (SC).
Abstract: For parallel programs to become portable, they must be executable with uniform efficiency on a variety of hardware platforms, which is not the case at present. In 1990, Valiant proposed Bulk-Synchronous Parallelism (BSP) as a model on which portable parallel programs can be built. We argue that shared-memory BSP is efficiently implementable on a wide variety of parallel hardware, and that BSP forms a useful basis for providing an even higher level programming interface based on Sequential Consistency (SC). A list of memory and thread management features needed to support BSP and SC parallel programs is given, under the assumption that the parallel computer is space-shared among multiple parallel tasks, rather than time-shared. Known techniques to realize efficiently the most important of these features are sketched.
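A minimal sketch of what a shared-memory BSP superstep looks like, assuming barrier-separated read and write phases over shared storage; this illustrates the model only, not the memory and thread management features the paper enumerates:

```python
# Minimal shared-memory BSP sketch: threads alternate local computation
# with barriers, so values written in superstep t are read in t+1.
import threading

P = 4
barrier = threading.Barrier(P)
shared = [0] * P  # one cell per thread

def worker(pid: int, supersteps: int):
    for _ in range(supersteps):
        local = shared[(pid + 1) % P] + 1  # read neighbour's last value
        barrier.wait()                     # all reads finish before writes
        shared[pid] = local                # publish for the next superstep
        barrier.wait()                     # bulk synchronization

threads = [threading.Thread(target=worker, args=(i, 3)) for i in range(P)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(shared)  # [3, 3, 3, 3]
```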

Book ChapterDOI
24 Feb 1994
TL;DR: A simple optimal parallel algorithm is presented for pre-processing the input tree such that path queries can be answered efficiently and the paths between pairs of nodes in an n-node tree can be reported.
Abstract: We present optimal parallel solutions to reporting paths between pairs of nodes in an n-node tree. Our algorithms are deterministic and designed to run on an exclusive read exclusive write parallel random-access machine (EREW PRAM). In particular, we provide a simple optimal parallel algorithm for pre-processing the input tree such that the path queries can be answered efficiently. Our algorithm for preprocessing runs in O(log n) time using O(n/log n) processors. Using the preprocessing, we can report paths between k node pairs in O(log n + log k) time using O(k + (n + S)/log n) processors on an EREW PRAM, where S is the size of the output. In particular, we can report the path between a single pair of distinct nodes in O(log n) time using O(L/log n) processors, where L denotes the length of the path.
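For intuition about the queries being answered, a minimal sequential sketch that reports the path between two nodes by climbing parent pointers to their meeting ancestor; the paper's EREW PRAM preprocessing and processor bounds are not reproduced here:

```python
# Report the path between nodes u and v in a rooted tree, sequentially.
# parent[r] == r marks the root. This specifies the query only; the
# paper answers it in O(log n) time on an EREW PRAM after preprocessing.

def path(parent, u, v):
    # Collect all ancestors of u, then walk up from v until we meet them.
    up, seen = [], {}
    x = u
    while True:
        seen[x] = len(up)
        up.append(x)
        if parent[x] == x:
            break
        x = parent[x]
    down = []
    x = v
    while x not in seen:
        down.append(x)
        x = parent[x]
    # Path: u up to the meeting ancestor, then back down to v.
    return up[:seen[x] + 1] + down[::-1]

# Tree: 0 is the root; 1 and 2 are children of 0; 3 is a child of 1.
parent = {0: 0, 1: 0, 2: 0, 3: 1}
print(path(parent, 3, 2))  # [3, 1, 0, 2]
```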

Proceedings ArticleDOI
22 Aug 1994
TL;DR: This paper describes a compiler environment for BSP-L, an experimental programming language based on the Bulk Synchronous Parallel model of computation, whose goal is to enable the generation of highly efficient, architecture independent software for a wide range of high performance parallel computers.
Abstract: This paper describes a compiler environment for BSP-L, an experimental programming language based on the Bulk Synchronous Parallel model of computation, whose goal is to enable the generation of highly efficient, architecture independent software for a wide range of high performance parallel computers.

Book ChapterDOI
04 Jul 1994
TL;DR: This paper uses a portable parallel programming environment, the pSystem, to evaluate and compare the performance of various scheduling algorithms on shared memory parallel machines.
Abstract: The efficiency of scheduling algorithms is essential in order to attain optimal performances from parallel programming systems. In this paper we use a portable parallel programming environment we have implemented, the pSystem, to evaluate and compare the performance of various scheduling algorithms on shared memory parallel machines.

01 Mar 1994
TL;DR: This paper describes the behavior of ports, along with the other parallel constructs of Pscheme, a parallel dialect of Scheme that can be used to build higher-level parallel programming abstractions, such as futures, semaphores, and Ada-style rendezvous.
Abstract: In this paper, we describe Pscheme, a parallel dialect of Scheme. The primary construct for specifying parallelism, synchronization, and communication is a natural extension of first-class continuations which we call a port. We describe the behavior of ports, along with the other parallel constructs of Pscheme. Because the user has precise control over the parallel computation, the Pscheme constructs can be used to build higher-level parallel programming abstractions, such as futures, semaphores, and Ada-style rendezvous. We provide the Pscheme code for these abstractions and discuss the current implementation of Pscheme on a shared-memory multiprocessor.
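Pscheme's ports are an extension of first-class continuations and are specific to Scheme; purely as an analogy in this document's example language, here is the futures abstraction the abstract mentions, sketched on top of a one-shot queue rather than ports:

```python
# Analogy only: Pscheme builds futures on top of ports (continuation
# extensions). The same higher-level abstraction, built here on a
# one-shot queue and a worker thread in Python.
import threading
import queue

def future(fn, *args):
    """Start fn(*args) in parallel; return a zero-argument 'touch'."""
    q = queue.Queue(maxsize=1)
    threading.Thread(target=lambda: q.put(fn(*args))).start()
    result = []
    def touch():
        if not result:
            result.append(q.get())  # block until the value arrives
        return result[0]
    return touch

f = future(sum, range(1_000_000))
print(f())  # blocks until ready, then prints 499999500000
```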


Proceedings ArticleDOI
01 Jan 1994
TL;DR: This paper describes an implementation of a parallel algorithm for the Boltzmann Machine as a network of two layers of managers and multiple groups of neurons, featuring large-scale parallel processing using many simple single-bit ALUs and effective expansion realized by multiple chips connected by simple bus lines.
Abstract: This paper describes an implementation of a parallel algorithm in the Boltzmann Machine (BM). The implementation is a network of two layers of managers and multiple groups of neurons. The features of the network are large-scale parallel processing using many simple single-bit ALUs, and effective expansion realized by multiple chips connected by simple bus lines.

Proceedings ArticleDOI
19 Dec 1994
TL;DR: A parallel programming environment based on message passing is introduced, which makes it simple to develop parallel applications and achieves high performance.
Abstract: With the development of parallel processing technology, more and more high-performance parallel computer systems have been developed. A convenient and flexible parallel programming environment plays an important role in the spread of parallel computing. How to write efficient parallel codes and how to convert existing sequential applications into parallel codes have become very important issues in parallel processing. We introduce a parallel programming environment based on message passing, which makes it simple to develop parallel applications and achieves high performance.