
Showing papers on "Parallel algorithm published in 1990"


Journal ArticleDOI
TL;DR: The best heuristic methods known to date for the flow shop sequencing problem are compared, the complexity of the best of them is improved, and a parallel taboo search algorithm is presented; experimental results show that this heuristic achieves very good speed-up.
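
For orientation, a minimal Python sketch of the kind of search such an algorithm performs (an illustrative reconstruction, not the authors' code: the makespan recurrence and the swap neighbourhood are standard textbook choices, and it is the candidate-evaluation loop that a parallel taboo search distributes over processors):

    # Tabu search for permutation flow shop scheduling (toy version).
    # p[j][k] = processing time of job j on machine k.
    from collections import deque

    def makespan(perm, p, m):
        c = [0.0] * m                        # completion time per machine
        for j in perm:
            c[0] += p[j][0]
            for k in range(1, m):
                c[k] = max(c[k], c[k - 1]) + p[j][k]
        return c[-1]

    def tabu_search(p, m, iters=200, tenure=7):
        n = len(p)
        cur = list(range(n))
        best, best_cost = cur[:], makespan(cur, p, m)
        tabu = deque(maxlen=tenure)          # recently used swap moves
        for _ in range(iters):
            candidates = []
            for i in range(n - 1):
                for j in range(i + 1, n):    # the loop a parallel version splits up
                    if (i, j) in tabu:
                        continue
                    nxt = cur[:]
                    nxt[i], nxt[j] = nxt[j], nxt[i]
                    candidates.append((makespan(nxt, p, m), (i, j), nxt))
            if not candidates:
                break
            cost, move, cur = min(candidates)
            tabu.append(move)
            if cost < best_cost:
                best, best_cost = cur[:], cost
        return best, best_cost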

811 citations


Journal ArticleDOI
TL;DR: Gamma is a relational database machine running on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives; all relations are horizontally partitioned across the disk drives, enabling them to be scanned in parallel.
Abstract: The design of the Gamma database machine and the techniques employed in its implementation are described. Gamma is a relational database machine currently operating on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives. Gamma employs three key technical ideas which enable the architecture to be scaled to hundreds of processors. First, all relations are horizontally partitioned across multiple disk drives, enabling relations to be scanned in parallel. Second, parallel algorithms based on hashing are used to implement the complex relational operators, such as join and aggregate functions. Third, dataflow scheduling techniques are used to coordinate multioperator queries. By using these techniques, it is possible to control the execution of very complex queries with minimal coordination. The design of the Gamma software is described and a thorough performance evaluation of the iPSC/2 hypercube version of Gamma is presented.
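
A minimal sketch of the hash-partitioning idea behind Gamma's parallel join (illustrative only; the tuple format, node count, and function names are invented here, and the per-node loop stands in for genuinely concurrent execution):

    # Hash-partitioned parallel equijoin, Gamma-style (toy version).
    def partition(tuples, key, n_nodes):
        # Route each tuple by hashing its join attribute, so matching
        # tuples of both relations land on the same node.
        buckets = [[] for _ in range(n_nodes)]
        for t in tuples:
            buckets[hash(t[key]) % n_nodes].append(t)
        return buckets

    def local_hash_join(r_part, s_part, r_key, s_key):
        # Classic build/probe hash join run independently on each node.
        table = {}
        for r in r_part:
            table.setdefault(r[r_key], []).append(r)
        return [(r, s) for s in s_part for r in table.get(s[s_key], [])]

    def parallel_join(R, S, r_key, s_key, n_nodes=4):
        r_parts = partition(R, r_key, n_nodes)
        s_parts = partition(S, s_key, n_nodes)
        out = []
        for node in range(n_nodes):      # conceptually one task per node
            out += local_hash_join(r_parts[node], s_parts[node], r_key, s_key)
        return out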

662 citations


Journal ArticleDOI
TL;DR: An algorithm based on weighted recursive least-squares theory is developed in the wavenumber domain, which is efficient because interpolation and noise removal are performed recursively, and is highly suitable for implementation via the massively parallel computational architectures currently available.
Abstract: In several applications it is required to reconstruct a high-resolution noise-free image from multiple frames of undersampled low-resolution noisy images. Using the aliasing relationship between the undersampled frames and the reference image, an algorithm based on weighted recursive least-squares theory is developed in the wavenumber domain. This algorithm is efficient because interpolation and noise removal are performed recursively, and is highly suitable for implementation via the massively parallel computational architectures currently available. Success in the use of the algorithm is demonstrated through various simulated examples.

567 citations


Journal ArticleDOI
TL;DR: A simple and efficient method for evaluating the performance of an algorithm, rendered as a directed acyclic graph, on any parallel computer is presented and its application to several common algorithms shows that it is surprisingly accurate.
Abstract: A simple and efficient method for evaluating the performance of an algorithm, rendered as a directed acyclic graph, on any parallel computer is presented. The crucial ingredient is an efficient approximation algorithm for a particular scheduling problem. The only parameter of the parallel computer needed by our method is the message-to-instruction ratio $\tau$. Although the method used in this paper does not take into account the number of processors available, its application to several common algorithms shows that it is surprisingly accurate.
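
A crude illustration of why a single parameter can capture so much (this is not the paper's approximation algorithm, just the effect of charging $\tau$ on every edge of a task graph):

    # Longest path through a DAG, with tau charged per precedence edge.
    from functools import lru_cache

    def schedule_length_bound(succ, work, tau):
        # succ: node -> successors; work: node -> instruction count.
        @lru_cache(maxsize=None)
        def finish(v):
            later = [tau + finish(w) for w in succ.get(v, ())]
            return work[v] + (max(later) if later else 0)
        return max(finish(v) for v in work)

    dag = {"a": ("b", "c"), "b": ("d",), "c": ("d",)}
    work = {"a": 1, "b": 5, "c": 5, "d": 1}
    print(schedule_length_bound(dag, work, tau=0))  # 7: free communication
    print(schedule_length_bound(dag, work, tau=3))  # 13: each message costs 3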

422 citations


Proceedings ArticleDOI
01 May 1990
TL;DR: This paper rejects the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, which allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.
Abstract: Many parallel algorithms are naturally expressed at a fine level of granularity, often finer than a MIMD parallel system can exploit efficiently. Most builders of parallel systems have looked to either the programmer or a parallelizing compiler to increase the granularity of such algorithms. In this paper we explore a third approach to the granularity problem by analyzing two strategies for combining parallel tasks dynamically at run-time. We reject the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, where tasks are created only retroactively as processing resources become available. These strategies grew out of work on Mul-T [14], an efficient parallel implementation of Scheme, but could be used with other applicative languages as well. We describe our Mul-T implementations of lazy task creation for two contrasting machines, and present performance statistics which show the method's effectiveness. Lazy task creation allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems.

344 citations


Book
01 Dec 1990
TL;DR: This straightforward tutorial explains why parallelism is a powerful and proven way to run programs fast and provides the instruction that will transform ordinary programmers into parallel programmers.
Abstract: In the not-too-distant future every programmer, software engineer, and computer scientist will need to understand parallelism, a powerful and proven way to run programs fast. The authors of this straightforward tutorial explain why this is so and provide the instruction that will transform ordinary programmers into parallel programmers. "How to Write Parallel Programs" focuses on programming techniques for the largest class of parallel machines - general purpose asynchronous or MIMD machines. It outlines the basic parallel algorithm classes and the three basic programming paradigms, takes up the implementation techniques for these paradigms, and presents a series of case studies explaining code and discussing its measured performance. Because parallel programming requires both a computing language and a coordination language, the authors use C and Linda (a language they developed) as a combination that can be simply and efficiently implemented on a wide range of machines. The techniques discussed, however, can be applied in any comparable language environment. Contents: Introduction. The Three Basic Models of Parallelism. Programming Techniques for the Three Basic Models. A Simple Problem, in Detail. Case Studies. From Parallelism to Coordination. Conclusions. Appendix: Linda User's Manual.
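
Because parallel programming requires both a computing language and a coordination language, a taste of the coordination half can be given in a few lines. Below is a toy tuple space in the spirit of Linda's out/in/rd operations (a teaching sketch in Python, not C-Linda; None plays the role of a wildcard field):

    import threading

    class TupleSpace:
        def __init__(self):
            self._tuples = []
            self._cv = threading.Condition()

        def out(self, *tup):                 # deposit a tuple
            with self._cv:
                self._tuples.append(tup)
                self._cv.notify_all()

        def _match(self, pattern):
            for t in self._tuples:
                if len(t) == len(pattern) and all(
                        p is None or p == v for p, v in zip(pattern, t)):
                    return t
            return None

        def rd(self, *pattern):              # read a matching tuple (blocks)
            with self._cv:
                while (t := self._match(pattern)) is None:
                    self._cv.wait()
                return t

        def in_(self, *pattern):             # withdraw a matching tuple
            with self._cv:
                while (t := self._match(pattern)) is None:
                    self._cv.wait()
                self._tuples.remove(t)
                return t

    ts = TupleSpace()
    ts.out("task", 42)
    print(ts.in_("task", None))              # ('task', 42)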

324 citations


01 Jan 1990
TL;DR: This chapter discusses parallel algorithms for shared-memory machines and the variety of abstract models of parallel computation that have been pursued, the closest to the hardware level being the VLSI models, which focus on the technological limits of today's chips, in which gates and wires are packed into a small number of planar layers.
Abstract: Publisher Summary This chapter discusses parallel algorithms for shared-memory machines. Parallel computation is rapidly becoming a dominant theme in all areas of computer science and its applications. It is estimated that, within a decade, virtually all developments in computer architecture, systems programming, computer applications and the design of algorithms will be taking place within the context of parallel computation. In preparation for this revolution, theoretical computer scientists have begun to develop a body of theory centered on parallel algorithms and parallel architectures. As there is no consensus yet on the appropriate logical organization of a massively parallel computer, and as the speed of parallel algorithms is constrained as much by limits on interprocessor communication as it is by purely computational issues, it is not surprising that a variety of abstract models of parallel computation have been pursued. Closest to the hardware level are the VLSI models, which focus on the technological limits of today's chips, in which gates and wires are packed into a small number of planar layers.

284 citations


Journal ArticleDOI
Alan H. Karp1, Horace P. Flatt1
TL;DR: A new metric that has some advantages over the others is introduced that is illustrated with data from the Linpack benchmark report and the winners of the Gordon Bell Award.
Abstract: Many metrics are used for measuring the performance of a parallel algorithm running on a parallel processor. This article introduces a new metric that has some advantages over the others. Its use is illustrated with data from the Linpack benchmark report and the winners of the Gordon Bell Award.
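
The metric introduced here is the experimentally determined serial fraction: given measured speedup psi on p processors, e = (1/psi - 1/p) / (1 - 1/p). An e that stays flat as p grows points to Amdahl-style serial code, while a growing e exposes parallel overhead. A one-line implementation (the formula is standard; the sample numbers are made up):

    def serial_fraction(speedup, p):
        # Karp-Flatt experimentally determined serial fraction.
        return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

    print(round(serial_fraction(6.0, 8), 4))    # 0.0476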

247 citations


Journal ArticleDOI
13 Mar 1990
TL;DR: The relationship between various models of parallel computation is investigated, using a newly defined concept of efficient simulation, and it is proved that the class PE is invariant across the shared memory models (PRAM's) and fully connected message passing machines.
Abstract: Theoretical research on parallel algorithms has focused on NC theory. This motivates the development of parallel algorithms that are extremely fast, but possibly wasteful in their use of processors. Such algorithms seem of limited interest for real applications currently run on parallel computers. This paper explores an alternative approach that emphasizes the efficiency of parallel algorithms. We define a complexity class PE of problems that can be solved by parallel algorithms that are efficient (the speedup is proportional to the number of processors used) and polynomially faster than sequential algorithms. Other complexity classes are also defined, in terms of time and efficiency: A class that has a slightly weaker efficiency requirement than PE, and a class that is a natural generalization of NC. We investigate the relationship between various models of parallel computation, using a newly defined concept of efficient simulation. This includes new models that reflect asynchrony and high communication latency in parallel computers. We prove that the class PE is invariant across the shared memory models (PRAMs) and fully connected message passing machines. These results show that our definitions are robust. Many open problems motivated by our approach are listed.

244 citations


Journal ArticleDOI
TL;DR: The purpose is to review the current status and to provide an overall perspective of parallel algorithms for solving dense, banded, or block-structured problems arising in the major areas of direct solution of linear systems, least squares computations, eigenvalue and singular value computation, and rapid elliptic solvers.
Abstract: Scientific and engineering research is becoming increasingly dependent upon the development and implementation of efficient parallel algorithms on modern high-performance computers. Numerical linear algebra is an indispensable tool in such research and this paper attempts to collect and describe a selection of some of its more important parallel algorithms. The purpose is to review the current status and to provide an overall perspective of parallel algorithms for solving dense, banded, or block-structured problems arising in the major areas of direct solution of linear systems, least squares computations, eigenvalue and singular value computations, and rapid elliptic solvers. A major emphasis is given here to certain computational primitives whose efficient execution on parallel and vector computers is essential in order to obtain high performance algorithms.

203 citations


Journal ArticleDOI
TL;DR: In this paper, a parallel version of the fast multipole method (FMM) is presented for the evaluation of the potential and force fields in systems of particles whose interactions are Coulombic or gravitational in nature.
Abstract: This paper presents a parallel version of the fast multipole method (FMM). The FMM is a recently developed scheme for the evaluation of the potential and force fields in systems of particles whose interactions are Coulombic or gravitational in nature. The sequential method requires O(N) operations to obtain the fields due to N charges, rather than the O(N^2) operations required by the direct calculation. Here, we describe the modifications necessary for implementation of the method on parallel architectures and show that the expected time requirements grow as log N when using N processors. Numerical results are given for a shared memory machine (the Encore Multimax 320).

Journal ArticleDOI
TL;DR: A new recursive prediction error algorithm is derived for the training of feedforward layered neural networks that enables the weights in each neuron of the network to be updated in an efficient parallel manner and has better convergence properties than the classical back propagation algorithm.
Abstract: A new recursive prediction error algorithm is derived for the training of feedforward layered neural networks. The algorithm enables the weights in each neuron of the network to be updated in an efficient parallel manner and has better convergence properties than the classical back propagation algorithm. The relationship between this new parallel algorithm and other existing learning algorithms is discussed. Examples taken from the fields of communication channel equalization and nonlinear systems modelling are used to demonstrate the superior performance of the new algorithm compared with the back propagation routine.
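
To convey the flavour of recursive prediction error updates, here is a generic recursive least-squares update for a single linear-in-parameters unit (a heavily simplified sketch: the paper's algorithm is a Gauss-Newton-style recursion applied through a layered nonlinear network, which this toy does not attempt):

    import numpy as np

    def rls_update(w, P, x, d, lam=0.99):
        # w: weights, P: inverse correlation matrix, x: input, d: target.
        k = P @ x / (lam + x @ P @ x)          # gain vector
        w = w + k * (d - w @ x)                # prediction-error correction
        P = (P - np.outer(k, x @ P)) / lam     # covariance update
        return w, P

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    w, P = np.zeros(2), 100.0 * np.eye(2)
    for _ in range(200):
        x = rng.standard_normal(2)
        w, P = rls_update(w, P, x, true_w @ x)
    print(np.round(w, 3))                      # close to [ 2. -1.]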

Journal ArticleDOI
TL;DR: The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection, which allows the authors to isolate and replace faulty processors with spare processors.
Abstract: The design of fault-tolerant hypercube multiprocessor architecture is discussed. The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection. System-level error detection mechanisms have been implemented for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: matrix multiplication, Gaussian elimination, and fast Fourier transform. Schemes for other applications are under development. Extensive studies have been done of error coverage of the system-level error detection schemes in the presence of finite-precision arithmetic, which affects the system-level encodings. Two reconfiguration schemes are proposed that allow the authors to isolate and replace faulty processors with spare processors.
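
The checksum idea underlying such algorithm-based schemes fits in a few lines; the sketch below shows a Huang-Abraham-style encoding for matrix multiplication (illustrative; the paper's system-level mechanisms for Gaussian elimination and FFT use analogous but different encodings, and finite-precision effects are handled here only by the tolerance in allclose):

    import numpy as np

    def abft_matmul(A, B):
        Ac = np.vstack([A, A.sum(axis=0, keepdims=True)])  # checksum row
        Br = np.hstack([B, B.sum(axis=1, keepdims=True)])  # checksum column
        Cf = Ac @ Br                                       # full-checksum product
        C = Cf[:-1, :-1]
        ok = (np.allclose(Cf[:-1, -1], C.sum(axis=1)) and
              np.allclose(Cf[-1, :-1], C.sum(axis=0)))
        return C, ok                # ok flips if a processor corrupts a block

    A = np.arange(6.0).reshape(2, 3)
    B = np.arange(12.0).reshape(3, 4)
    C, ok = abft_matmul(A, B)
    print(ok)                       # True on a fault-free run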

Journal ArticleDOI
TL;DR: Several texture segmentation algorithms based on deterministic and stochastic relaxation principles, and their implementation on parallel networks, are described, and results of the various schemes in classifying some real textured images are presented.
Abstract: Several texture segmentation algorithms based on deterministic and stochastic relaxation principles, and their implementation on parallel networks, are described. The segmentation process is posed as an optimization problem and two different optimality criteria are considered. The first criterion involves maximizing the posterior distribution of the intensity field given the label field (maximum a posteriori estimate). The posterior distribution of the texture labels is derived by modeling the textures as Gauss Markov random fields (GMRFs) and characterizing the distribution of different texture labels by a discrete multilevel Markov model. A stochastic learning algorithm is proposed. This iterated hill-climbing algorithm combines fast convergence of deterministic relaxation with the sustained exploration of the stochastic algorithms, but is guaranteed to find only a local minimum. The second optimality criterion requires minimizing the expected percentage of misclassification per pixel by maximizing the posterior marginal distribution, and the maximum posterior marginal algorithm is used to obtain the corresponding solution. All these methods implemented on parallel networks can be easily extended for hierarchical segmentation; results of the various schemes in classifying some real textured images are presented.

Journal ArticleDOI
TL;DR: This work investigates the feasibility of applying connectionist networks with hidden units to forecasting and process control and develops a particular approach which embeds input-output pairs in a state space using delay coordinates.
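
The delay-coordinate construction mentioned is easy to make concrete (a standard embedding; the window length and horizon below are arbitrary):

    import numpy as np

    def delay_embed(series, d=4, horizon=1):
        # Pair each window of d past values with the value to forecast.
        X, y = [], []
        for t in range(d - 1, len(series) - horizon):
            X.append(series[t - d + 1 : t + 1])
            y.append(series[t + horizon])
        return np.array(X), np.array(y)

    s = np.sin(np.linspace(0.0, 20.0, 200))
    X, y = delay_embed(s)
    print(X.shape, y.shape)          # (196, 4) (196,)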

Proceedings ArticleDOI
16 Jun 1990
TL;DR: A parallel 3-D thinning algorithm which conserves medial surfaces is presented and some new topological predicates are given which are very simple to calculate and it is proved that the thinning operation based on those new predicates does not disconnect a3-D object.
Abstract: A parallel 3-D thinning algorithm which conserves medial surfaces is presented. A new characterization of simple points is proposed and some new topological predicates are given which are very simple to calculate. Some new geometrical predicates are also given. It is proved that the thinning operation based on those new predicates does not disconnect a 3-D object. Experiments show that the method gives a satisfactory result.

Journal ArticleDOI
TL;DR: A constant time sorting algorithm is derived on a three-dimensional processor array equipped with a reconfigurable bus system, which is far more feasible than the CRCW PRAM model.

Journal ArticleDOI
TL;DR: A discussion is presented of two ways of mapping the cells in a two-dimensional area of a chip onto processors in an n-dimensional hypercube such that both small and large cell moves can be applied.
Abstract: A discussion is presented of two ways of mapping the cells in a two-dimensional area of a chip onto processors in an n-dimensional hypercube such that both small and large cell moves can be applied. Two types of move are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support such a parallel cost evaluation. A novel tree broadcasting strategy is presented for the hypercube that is used extensively in the algorithm for updating cell locations in the parallel environment. A dynamic parallel annealing schedule is proposed that estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control. The performance on an Intel iPSC-2/D4/MX hypercube is reported.
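
The broadcast pattern such strategies build on is the classic dimension-by-dimension sweep: in step k, every node that already holds the message forwards it across dimension k, covering all 2^n nodes in n steps. A simulated sketch (illustrative; the paper's tree broadcasting strategy is an engineered variant of this recursive doubling):

    def hypercube_broadcast(n_dims, root=0):
        have = {root}                       # nodes holding the message
        for k in range(n_dims):
            for node in list(have):
                have.add(node ^ (1 << k))   # partner across dimension k
        return have

    print(sorted(hypercube_broadcast(3)))   # all 8 nodes after 3 steps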

Journal ArticleDOI
TL;DR: An efficient technique for parallel manipulation of data structures that avoids memory access conflicts is presented and is used in a new parallel radix sort algorithm that is optimal for keys whose values are over a small range.
Abstract: We present an efficient technique for parallel manipulation of data structures that avoids memory access conflicts. That is, this technique works on the Exclusive Read/Exclusive Write (EREW) model of computation, which is the weakest shared memory, MIMD machine model. It is used in a new parallel radix sort algorithm that is optimal for keys whose values are over a small range. Using the radix sort and known results for parallel prefix on linked lists, we develop parallel algorithms that efficiently solve various computations on trees and “unicyclic graphs.” Finally, we develop parallel algorithms for connected components, spanning trees, minimum spanning trees, and other graph problems. All of the graph algorithms achieve linear speedup for all but the sparsest graphs.
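
One pass of the radix sort can be sketched as follows (sequential Python for readability; in the paper's EREW setting the digit histogram and the exclusive prefix sum are computed by a parallel scan, and the scatter is conflict-free by construction):

    def counting_pass(keys, radix, shift):
        count = [0] * radix
        for k in keys:                       # digit histogram
            count[(k >> shift) % radix] += 1
        start, total = [0] * radix, 0
        for d in range(radix):               # exclusive prefix sum
            start[d], total = total, total + count[d]
        out = [0] * len(keys)
        for k in keys:                       # stable scatter
            d = (k >> shift) % radix
            out[start[d]] = k
            start[d] += 1
        return out

    def radix_sort(keys, radix=16, bits=16):
        for shift in range(0, bits, radix.bit_length() - 1):
            keys = counting_pass(keys, radix, shift)
        return keys

    print(radix_sort([170, 45, 75, 90, 2, 802, 24, 66]))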

Journal ArticleDOI
TL;DR: A parallel algorithm for tiling with polyominoes is presented and can be used for placement of components or cells in a very large-scale integrated circuit (VLSI) chip, designing and compacting printed circuit boards, and solving a variety of two- or three-dimensional packing problems.
Abstract: A parallel algorithm for tiling with polyominoes is presented. The tiling problem is to pack polyominoes in a finite checkerboard. The algorithm using l*m*n processing elements requires O(1) time, where l is the number of different kinds of polyominoes on an m*n checkerboard. The algorithm can be used for placement of components or cells in a very large-scale integrated circuit (VLSI) chip, designing and compacting printed circuit boards, and solving a variety of two- or three-dimensional packing problems.

Journal ArticleDOI
TL;DR: Two algorithms are presented for computing the discrete cosine transform (DCT) on existing VLSI structures, and a new prime factor DCT algorithm is presented for the class of DCTs of length N = N1*N2, where N1 and N2 are relatively prime and odd numbers.
Abstract: Two algorithms are presented for computing the discrete cosine transform (DCT) on existing VLSI structures. First, it is shown that the N-point DCT can be implemented on the existing systolic architecture for the N-point discrete Fourier transform (DFT) by introducing some modifications. Second, a new prime factor DCT algorithm is presented for the class of DCTs of length N = N1*N2, where N1 and N2 are relatively prime and odd numbers. It is shown that the proposed algorithm can be implemented on the already existing VLSI structures for prime factor DFT. The number of multipliers required is comparable to that required for the other fast DCT algorithms. It is shown that the discrete sine transform (DST) can be computed by the same structure.

Journal ArticleDOI
TL;DR: It is shown that the sequential refinement calculus can be used as such for most of the derivation steps of a parallel version of the Gaussian elimination method for solving simultaneous linear equation systems.

Book ChapterDOI
01 Oct 1990
TL;DR: The paper discusses the trade-offs between the communication overheads involved and the number of processors employed, using various communication networks between processors.
Abstract: The paper discusses the parallel implementation of the genetic algorithm on transputer-based parallel processing systems. It considers the implementation of the batch version of the algorithm using a problem from the domain of real-time control. With the problem chosen, the evaluation of a member of the population takes a relatively long time compared with the generation of a member, and so emphasis is laid on parallel evaluation. However, any distribution of processing over a number of processors will involve some communication overheads which are not present when the processing is done on one processor. This overhead will vary depending upon the communication network used. The paper discusses the trade-offs between the communication overheads involved and the number of processors employed, using various communication networks between processors.
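
A master-worker sketch of the parallel-evaluation idea (Python processes standing in for transputers; the fitness function and GA operators are invented placeholders):

    from multiprocessing import Pool
    import random

    def fitness(ind):                 # placeholder for a costly evaluation
        return -sum((g - 0.5) ** 2 for g in ind)

    def next_generation(pop, scores, rate=0.1):
        ranked = [ind for _, ind in sorted(zip(scores, pop), reverse=True)]
        parents = ranked[: len(pop) // 2]
        children = []
        while len(children) < len(pop):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]             # one-point crossover
            children.append([g + random.gauss(0, rate)
                             if random.random() < rate else g
                             for g in child])     # mutation
        return children

    if __name__ == "__main__":
        pop = [[random.random() for _ in range(8)] for _ in range(40)]
        with Pool() as pool:
            for _ in range(20):
                scores = pool.map(fitness, pop)   # evaluations in parallel
                pop = next_generation(pop, scores)
        print(round(max(map(fitness, pop)), 4))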

Journal ArticleDOI
TL;DR: Parallel algorithms on SIMD (single-instruction stream, multiple-data stream) machines for hierarchical clustering and cluster validity computation are proposed; the machine model uses a parallel memory system and an alignment network to facilitate parallel access to both the pattern matrix and the proximity matrix.
Abstract: Parallel algorithms on SIMD (single-instruction stream, multiple-data stream) machines for hierarchical clustering and cluster validity computation are proposed. The machine model uses a parallel memory system and an alignment network to facilitate parallel access to both the pattern matrix and the proximity matrix. For a problem with N patterns, the number of memory accesses is reduced from O(N^3) on a sequential machine to O(N^2) on an SIMD machine with N PEs.

Journal ArticleDOI
TL;DR: It is shown that parallel processing of HTD faults does indeed result in high fault coverage, which is otherwise not achievable by a uniprocessor algorithm, and the parallel algorithm exhibits superlinear speedups in some cases due to search anomalies.
Abstract: For circuits of VLSI complexity, test generation time can be prohibitive. Most of the time is consumed by hard-to-detect (HTD) faults, which might remain undetected even after a large number of backtracks. The problems inherent in a uniprocessor implementation of a test generation algorithm are identified, and a parallel test generation method which tries to achieve a high fault coverage for HTD faults in a reasonable amount of time is proposed. A dynamic search space allocation strategy which allocates disjoint search spaces to minimize the redundant work is proposed. The search space allocation strategy tries to utilize the partial solutions generated by other processors to increase the probability of searching in a solution area. The parallel test generation algorithm has been implemented on an Intel iPSC/2 hypercube. It is shown that parallel processing of HTD faults does indeed result in high fault coverage, which is otherwise not achievable by a uniprocessor algorithm. The parallel algorithm exhibits superlinear speedups in some cases due to search anomalies.

Journal ArticleDOI
TL;DR: The problem of computing a fixed point of a nonexpansive function f is considered, and simulation results illustrating the attainable speedup and the effects of asynchronism are presented.
Abstract: The problem of computing a fixed point of a nonexpansive function f is considered. Sufficient conditions are provided under which a parallel, partially asynchronous implementation of the iteration $x := f(x)$ converges. These results are then applied to (i) quadratic programming subject to box constraints, (ii) strictly convex cost network flow optimization, (iii) an agreement and a Markov chain problem, (iv) neural network optimization, and (v) finding the least element of a polyhedral set determined by a weakly diagonally dominant, Leontief system. Finally, simulation results illustrating the attainable speedup and the effects of asynchronism are presented.
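
A toy rendering of the partially asynchronous iteration (the map f below is an invented contraction, hence nonexpansive, and the random staleness stands in for communication delay):

    import random

    def async_fixed_point(f, x, sweeps=300, max_stale=3):
        history = [list(x)]
        for _ in range(sweeps):
            # Update one coordinate using a possibly outdated iterate.
            stale = history[-random.randint(1, min(max_stale, len(history)))]
            i = random.randrange(len(x))
            x[i] = f(stale)[i]
            history.append(list(x))
        return x

    target = [1.0, 2.0, 3.0]
    f = lambda v: [0.5 * vi + 0.5 * ti for vi, ti in zip(v, target)]
    print([round(v, 3) for v in async_fixed_point(f, [0.0, 0.0, 0.0])])
    # drifts to the fixed point [1.0, 2.0, 3.0] despite asynchrony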

Journal ArticleDOI
TL;DR: A parallel algorithm for the rotation of digitized images is presented; it combines the decomposition of the process into subprocesses, each allocated to a processor for execution, with the decomposition of the data into smaller portions.

Journal ArticleDOI
TL;DR: A new randomized parallel algorithm determines the Smith normal form of a matrix whose entries are univariate polynomials with coefficients in an arbitrary field; the algorithm is probabilistic of Las Vegas type and reduces the problem of Smith form computation to two Hermite form computations.

Journal ArticleDOI
01 Jun 1990
TL;DR: The goal of the Pandore system is to allow the execution of parallel algorithms on DMPCs (Distributed Memory Parallel Computers) without the programmer having to take the low-level characteristics of the target distributed computer into account.
Abstract: The goal of the Pandore system is to allow the execution of parallel algorithms on DMPCs (Distributed Memory Parallel Computers) without having to take into account the low-level characteristics of the target distributed computer when programming the algorithm. No explicit process definition or interprocess communication is needed. Parallelization is achieved through logical data organization. The Pandore system provides the user with a means to specify data partitioning and data distribution over a domain of virtual processors for each parallel step of the algorithm. At compile time, Pandore splits the original program into parallel processes. Each process executes the appropriate parts of the original code, according to the given data decomposition. In order to achieve correct utilization of the data structures distributed over the processors, the Pandore system provides an execution scheme based on a communication layer, which is an abstraction of a message-passing architecture. This intermediate level is then implemented using the effective primitives of the real architecture (in our specific case, an Intel iPSC/2).
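
The owner-computes mapping that such a data-distribution specification induces can be sketched in a few lines (illustrative only; the block rule and the names below are not Pandore syntax):

    def owner(i, n, n_procs):
        # Block distribution: ceiling-sized contiguous chunks.
        block = (n + n_procs - 1) // n_procs
        return i // block

    def my_indices(me, n, n_procs):
        # The guard a compiler inserts so each process touches only
        # the array elements it owns.
        return [i for i in range(n) if owner(i, n, n_procs) == me]

    n, n_procs = 10, 4
    for p in range(n_procs):
        print(p, my_indices(p, n, n_procs))
    # 0 [0, 1, 2] / 1 [3, 4, 5] / 2 [6, 7, 8] / 3 [9]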

Book
18 Oct 1990
TL;DR: Most of the algorithms in this book are for hypercubes in which the number of processors is a function of problem size; for image processing problems, however, the book also includes algorithms for an MIMD hypercube with a small number of processors.
Abstract: Fundamental algorithms for SIMD and MIMD hypercubes are developed. These include algorithms for such problems as data broadcasting, data sum, prefix sum, shift, data circulation, data accumulation, sorting, random access reads and writes, and data permutation. The fundamental algorithms are then used to obtain efficient hypercube algorithms for matrix multiplication, image processing problems such as convolution, template matching, Hough transform, clustering and image processing transformations, and string editing. Most of the algorithms in this book are for hypercubes in which the number of processors is a function of problem size. However, for image processing problems, the book also includes algorithms for an MIMD hypercube with a small number of processors. Experimental results on an NCUBE/77 MIMD hypercube are also presented. The book is suitable for use in a one-semester or one-quarter course on hypercube algorithms. For students with no prior exposure to parallel algorithms, it is recommended that one week be spent on the material in chapter 1, about six weeks on chapter 2, and one week on chapter 3. The remainder of the term can be spent covering topics from the rest of the book.
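
As a sample of the fundamentals, the data-sum operation on a hypercube is recursive doubling: in step k, every node exchanges its partial sum with its neighbour across dimension k, so after log2(P) steps all P nodes hold the global sum. A simulated sketch (a Python list stands in for the processors):

    def hypercube_allreduce_sum(values):
        vals = list(values)            # vals[node] = node's partial sum
        p, k = len(vals), 1            # p must be a power of two
        while k < p:
            vals = [vals[node] + vals[node ^ k] for node in range(p)]
            k <<= 1
        return vals

    print(hypercube_allreduce_sum([1, 2, 3, 4, 5, 6, 7, 8]))
    # every node ends with 36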