
Showing papers on "Parallel algorithm published in 1984"


Journal ArticleDOI
TL;DR: A fast parallel thinning algorithm that consists of two subiterations: one aimed at deleting the south-east boundary points and the north-west corner points, while the other is aimed at deleting the north-west boundary points and the south-east corner points.
Abstract: A fast parallel thinning algorithm is proposed in this paper. It consists of two subiterations: one aimed at deleting the south-east boundary points and the north-west corner points, while the other is aimed at deleting the north-west boundary points and the south-east corner points. End points and pixel connectivity are preserved. Each pattern is thinned down to a skeleton of unitary thickness. Experimental results show that this method is very effective. 12 references.

2,243 citations
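The two-subiteration scheme above is widely known as the Zhang-Suen algorithm. The following is a minimal sequential sketch of it; the P2..P9 neighbor labeling and the exact deletion tests follow the common textbook formulation and are assumptions here, not quoted from the paper, and the parallel character shows up only in the batched (simultaneous) deletion within each subiteration.

```python
def thin(image):
    """Two-subiteration (Zhang-Suen-style) thinning of a binary image.

    `image` is a list of lists of 0/1; border pixels are assumed to be 0.
    Returns a new image thinned toward a unit-width skeleton.
    """
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])

    def neighbors(r, c):
        # P2..P9, clockwise starting from the pixel directly above.
        return [img[r-1][c], img[r-1][c+1], img[r][c+1], img[r+1][c+1],
                img[r+1][c], img[r+1][c-1], img[r][c-1], img[r-1][c-1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    p = neighbors(r, c)
                    b = sum(p)  # number of object neighbors
                    # a = number of 0 -> 1 transitions around the pixel
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    if step == 0:  # south-east boundary / north-west corners
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:          # north-west boundary / south-east corners
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((r, c))
            for r, c in to_delete:  # simultaneous deletion = parallel step
                img[r][c] = 0
            changed = changed or bool(to_delete)
    return img
```

The conditions b in [2, 6] and a == 1 are what preserve end points and pixel connectivity, as the abstract claims.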


Journal ArticleDOI
TL;DR: It is shown that it is quite possible for a parallel branch-and-bound algorithm using n2 processors to take more time than one using n1 processors (n1 < n2), and it is also possible to achieve speed-ups in excess of the ratio n2/n1.
Abstract: We consider the effects of parallelizing branch-and-bound algorithms by expanding several live nodes simultaneously. It is shown that it is quite possible for a parallel branch-and-bound algorithm using n2 processors to take more time than one using n1 processors, even though n1 < n2.

249 citations


Book ChapterDOI
TL;DR: The chapter presents a unified treatment of various parallel sorting algorithms by bringing out clearly the relation between the architecture of parallel computers and the structure of algorithms.
Abstract: Publisher Summary This chapter presents a survey on various parallel sorting algorithms. Sorting is a nontrivial problem and has widespread commercial and business applications. Serial algorithms for sorting have been available since the days of punched-card machines. At present, there is a considerable body of literature on serial sorting algorithms. Parallel algorithms for sorting are of recent origin and came into existence over the past decade. The chapter presents a unified treatment of various parallel sorting algorithms by bringing out clearly the relation between the architecture of parallel computers and the structure of algorithms. In the design of parallel algorithms in general, and of parallel sorting algorithms in particular, two models have been widely used: (1) models based on fixed interconnection networks, such as the mesh-connected network of a single-instruction multiple-data (SIMD) machine, and (2) models based on a global memory shared by the various processors. The special-purpose network-sorting algorithms are described, and algorithms for SIMD machines are given.

197 citations


Proceedings ArticleDOI
01 Jun 1984
TL;DR: Specific applications of that theory to loop interchange are discussed as it has been implemented in PFC (Parallel Fortran Converter) -- a program which attempts to uncover operations in sequential Fortran code that may be safely rewritten as vector operations.
Abstract: Parallel and vector machines are becoming increasingly important to many computation intensive applications. Effectively utilizing such architectures, particularly from sequential languages such as Fortran, has demanded increasingly sophisticated compilers. In general, a compiler needs to significantly reorder a program in order to generate code optimal for a specific architecture. Because DO loops typically control the execution of a number of statements, the order in which loops are executed can dramatically affect the performance of a machine on a particular section of code. In particular, loop interchange can often be used to enhance the performance of code on parallel or vector machines. Determining when loops may be safely and profitably interchanged requires a study of the data dependences in the program. This work discusses specific applications of that theory to loop interchange. This theory is described as it has been implemented in PFC (Parallel Fortran Converter) -- a program which attempts to uncover operations in sequential Fortran code that may be safely rewritten as vector operations.

196 citations


Proceedings ArticleDOI
01 Dec 1984
TL;DR: A parallel algorithm is presented which accepts as input a graph G and produces a maximal independent set of vertices in G; it uses a "dynamic pigeonhole principle" that generalizes the conventional pigeonhole principle.
Abstract: A parallel algorithm is presented which accepts as input a graph G and produces a maximal independent set of vertices in G. On a P-RAM without the concurrent write or concurrent read features, the algorithm executes in O((log n)^4) time and uses O((n/log n)^3) processors, where n is the number of vertices in G. The algorithm has several novel features that may find other applications. These include the use of balanced incomplete block designs to replace random sampling by deterministic sampling, and the use of a "dynamic pigeonhole principle" that generalizes the conventional pigeonhole principle.

167 citations
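The paper's deterministic parallel construction is intricate; as a point of reference, the object it computes — a maximal independent set — can be illustrated with a simple sequential greedy sketch. This is a stand-in for illustration only, not the paper's algorithm.

```python
def greedy_mis(n, edges):
    """Return a maximal independent set of the graph ([0..n-1], edges).

    Sequential greedy stand-in: a set is independent (no two members
    adjacent) and maximal (every other vertex has a neighbor in it).
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    mis, blocked = set(), set()
    for v in range(n):
        if v not in blocked:
            mis.add(v)            # v joins the independent set
            blocked |= adj[v]     # its neighbors can never join
    return mis
```

On the path 0-1-2-3 the greedy sweep selects {0, 2}, which is both independent and maximal.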


Journal ArticleDOI
TL;DR: Algorithms and data structures developed to solve graph problems on parallel computers are surveyed and most algorithms use relatively simple data structures, although a few algorithms using linked lists, heaps, and trees are also discussed.
Abstract: Algorithms and data structures developed to solve graph problems on parallel computers are surveyed. The problems discussed relate to searching graphs and finding connected components, maximal cliques, maximum cardinality matchings, minimum spanning trees, shortest paths, and traveling salesman tours. The algorithms are based on a number of models of parallel computation, including systolic arrays, associative processors, array processors, and multiple CPU computers. The most popular model is a direct extension of the standard RAM model of sequential computation. It may not, however, be the best basis for the study of parallel algorithms. More emphasis has been focused recently on communications issues in the analysis of the complexity of parallel algorithms; thus parallel models are coming to be more complementary to implementable architectures. Most algorithms use relatively simple data structures, such as the adjacency matrix and adjacency lists, although a few algorithms using linked lists, heaps, and trees are also discussed.

164 citations


Journal ArticleDOI
TL;DR: Two models of the cost of data movement in parallel numerical algorithms are described, one suitable for shared memory multiprocessors where each processor has vector capabilities and the other applicable to highly parallel nonshared memory MIMD systems.
Abstract: This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.

108 citations


Journal ArticleDOI
TL;DR: Two general schemes for performing branch-and-bound (B&B) search in parallel are discussed, applicable in principle to most of the problems which can be solved by B&B.
Abstract: This paper discusses two general schemes for performing branch-and-bound (B&B) search in parallel. These schemes are applicable in principle to most of the problems which can be solved by B&B. The schemes are implemented for SSS*, a versatile algorithm having applications in game tree search, structural pattern analysis, and AND/OR graph search. The performance of parallel SSS* is studied in the context of AND/OR tree and game tree search. The paper concludes with comments on potential applications of these parallel implementations of SSS* in structural pattern analysis and game playing.

99 citations


Journal ArticleDOI
TL;DR: Fast parallel algorithms are presented for the following problems in symbolic manipulation of univariate polynomials: computing all entries of the extended Euclidean scheme of two polynomials over an arbitrary field, gcd and lcm of many polynomials, factoring polynomials over finite fields, and the squarefree decomposition of polynomials over fields of characteristic zero and over finite fields.
Abstract: Fast parallel algorithms are presented for the following problems in symbolic manipulation of univariate polynomials: computing all entries of the extended Euclidean scheme of two polynomials over an arbitrary field, gcd and lcm of many polynomials, factoring polynomials over finite fields, and the squarefree decomposition of polynomials over fields of characteristic zero and over finite fields. For the following estimates, assume that the input polynomials have degree at most n, and the finite field has $p^d $ elements. The Euclidean algorithm is deterministic and runs in parallel time $O(\log ^2 n)$. All the other algorithms are probabilistic (Las Vegas) in the general case, but when applicable to ${\bf Q}$ or ${\bf R}$, they can be implemented deterministically over these fields. The algorithms for gcd and lcm use parallel time $O(\log ^2 n)$. The factoring algorithm runs in parallel time $O(\log ^2 n\log ^2 (d + 1)\log p)$. The algorithm for squarefree decomposition runs in parallel time $O(\log ^2 n)$...

96 citations
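The Euclidean remainder sequence at the heart of the first result can be illustrated sequentially. This is a minimal sketch over the rationals using exact `Fraction` arithmetic, with polynomials as coefficient lists (highest degree first, nonzero leading coefficient assumed); the paper's contribution is computing this scheme in polylogarithmic parallel time, which the sketch does not attempt.

```python
from fractions import Fraction

def poly_mod(a, b):
    """Remainder of polynomial division a mod b.

    Polynomials are coefficient lists, highest degree first; Fractions
    keep the arithmetic exact.
    """
    r = [Fraction(x) for x in a]
    b = [Fraction(x) for x in b]
    while len(r) >= len(b):
        coef = r[0] / b[0]
        for i in range(len(b)):
            r[i] -= coef * b[i]   # cancel the leading term of r
        r.pop(0)                  # leading coefficient is now zero
    while r and r[0] == 0:        # strip leading zeros of the remainder
        r.pop(0)
    return r

def poly_gcd(a, b):
    """Monic gcd of two nonzero polynomials via the Euclidean scheme."""
    while b:
        a, b = b, poly_mod(a, b)
    return [c / a[0] for c in a]  # normalize to a monic polynomial
```

For example, gcd(x^2 - 1, x^2 + 2x + 1) comes out as the monic common factor x + 1, i.e. coefficient list [1, 1].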


Proceedings ArticleDOI
06 Aug 1984
TL;DR: Five abstract algorithms for the parallel execution of production systems on the DADO machine are specified, designed to capture the inherent parallelism in a variety of different production system programs.
Abstract: In this paper we specify five abstract algorithms for the parallel execution of production systems on the DADO machine. Each algorithm is designed to capture the inherent parallelism in a variety of different production system programs. Ongoing research aims to substantiate our conclusions by empirically evaluating the performance of each algorithm on the DADO2 prototype, presently under construction at Columbia University.

83 citations


Journal ArticleDOI
TL;DR: In this article, an efficient parallel algorithm to obtain maximum matchings in convex bipartite graphs is developed, which can be used to obtain efficient parallel algorithms for several scheduling problems.

Journal ArticleDOI
TL;DR: A parallel Earley's recognition algorithm in terms of an ``X*'' operator is presented, which can be executed on a triangular-shape VLSI array by restricting the input context-free grammar to be ε-free, and which gives the correct error count.
Abstract: Earley's algorithm has been commonly used for the parsing of general context-free languages and the error-correcting parsing in syntactic pattern recognition. The time complexity for parsing is O(n^3). This paper presents a parallel Earley's recognition algorithm in terms of an ``X*'' operator. By restricting the input context-free grammar to be ε-free, the parallel algorithm can be executed on a triangular-shape VLSI array. This array system has an efficient way of moving data to the right place at the right time. Simulation results show that this system can recognize a string with length n in 2n + 1 system time. We also present a parallel parse-extraction algorithm, a complete parsing algorithm, and an error-correcting recognition algorithm. The parallel complete parsing algorithm has been simulated on a processor array which is similar to the triangular VLSI array. For an input string of length n the processor array will give the correct right-parse at system time 2n + 1 if the string is accepted. The error-correcting recognition algorithm has also been simulated on a triangular VLSI array. This array recognizes an erroneous string of length n in time 2n + 1 and gives the correct error count. These parallel algorithms are especially useful for syntactic pattern recognition.

Proceedings ArticleDOI
01 Dec 1984
TL;DR: This algorithm is a nice example of utilizing another parallel algorithm that does not seem to be closely related to the problem, namely the algorithm for finding the connected components and a spanning forest of an undirected graph.
Abstract: A parallel algorithm for finding Euler circuits in graphs is presented. Its depth is log |E| and it employs |E| processors. The computational model considered is the PRAM (the shared memory model). This algorithm is a nice example of utilizing another parallel algorithm that does not seem to be closely related to our problem, namely the algorithm for finding the connected components and a spanning forest of an undirected graph.
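For reference, the object being computed can be sketched with the classical sequential method (Hierholzer's algorithm). The paper's point is achieving logarithmic depth via a connected-components subroutine, which this sequential stand-in does not attempt; it assumes a connected undirected graph in which every vertex has even degree.

```python
def euler_circuit(adj):
    """Hierholzer's sequential algorithm for an Euler circuit.

    `adj` maps each vertex to a list of neighbors (each undirected edge
    listed in both directions). Assumes the graph is connected and all
    degrees are even, so an Euler circuit exists.
    """
    rem = {u: list(vs) for u, vs in adj.items()}  # edges yet to be used
    start = next(iter(rem))
    stack, circuit = [start], []
    while stack:
        v = stack[-1]
        if rem[v]:
            w = rem[v].pop()     # traverse edge (v, w) ...
            rem[w].remove(v)     # ... and consume its mirror copy
            stack.append(w)
        else:
            circuit.append(stack.pop())  # v is finished; emit it
    return circuit
```

The returned vertex sequence traverses every edge exactly once and ends where it starts.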

Journal ArticleDOI
TL;DR: Three global balancing algorithms are presented, one of which uses folding with the other two adopting parallel procedures, which show improvement in time efficiency over some sequential algorithms when applied to large binary search trees.
Abstract: A binary search tree can be globally balanced by readjustment of pointers or with a sorting process in O(n) time, n being the total number of nodes. This paper presents three global balancing algorithms, one of which uses folding, with the other two adopting parallel procedures. These algorithms show improvement in time efficiency over some sequential algorithms [1, 2, 7] when applied to large binary search trees. A comparison of various algorithms is presented.
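The O(n) global-balancing idea can be sketched sequentially: read the keys out in sorted (inorder) order, then rebuild around medians. This is an illustrative stand-in, not any of the paper's three algorithms; nodes are represented as `(left, key, right)` tuples, and the list slicing below adds overhead kept for clarity.

```python
def inorder(node, out):
    """Collect the keys of a binary search tree in sorted order."""
    if node:
        inorder(node[0], out)   # node = (left, key, right)
        out.append(node[1])
        inorder(node[2], out)

def build_balanced(keys):
    """Rebuild a perfectly balanced BST from a sorted key list."""
    if not keys:
        return None
    m = len(keys) // 2          # median becomes the root
    return (build_balanced(keys[:m]), keys[m], build_balanced(keys[m+1:]))

def rebalance(root):
    """Globally balance a BST: inorder sweep + median rebuild."""
    keys = []
    inorder(root, keys)
    return build_balanced(keys)

def height(node):
    return 0 if node is None else 1 + max(height(node[0]), height(node[2]))
```

A degenerate 7-node chain (height 7) rebalances to height 3 while preserving the sorted key order.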

Journal ArticleDOI
01 Jul 1984
TL;DR: Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
Abstract: Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

Journal ArticleDOI
Karnin
TL;DR: A time-memory-processor tradeoff for the knapsack problem is proposed, being the only one which outperforms the CmCs = 2^n curve.
Abstract: A time-memory-processor tradeoff for the knapsack problem is proposed. While an exhaustive search over all possible solutions of an n-component knapsack requires T = O(2^n) running time, our parallel algorithm solves the problem in O(2^(n/2)) operations and requires only O(2^(n/6)) processors and memory cells. It is an improvement over previous time-memory-processor tradeoffs, being the only one which outperforms the CmCs = 2^n curve. Cm is the cost of the machine, i.e., the number of its processors and memory cells, and Cs is the cost per solution, which is the product of the machine cost by the running time.
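The O(2^(n/2)) operation count is the classical meet-in-the-middle baseline (Horowitz-Sahni style), which the paper improves on in processor and memory cost. A minimal sketch of that baseline for the subset-sum form of the knapsack problem — this shows the 2^(n/2) split only, not the paper's tradeoff:

```python
from bisect import bisect_left

def subset_sum(weights, target):
    """Meet-in-the-middle subset-sum decision in O(2^(n/2)) time.

    Split the items into two halves, enumerate each half's subset sums,
    and search for a complementary pair summing to `target`.
    """
    half = len(weights) // 2
    left, right = weights[:half], weights[half:]

    def sums(items):
        out = {0}
        for w in items:
            out |= {s + w for s in out}  # doubles at most per item
        return out

    right_sums = sorted(sums(right))
    for s in sums(left):
        # binary search for the complementary sum in the right half
        i = bisect_left(right_sums, target - s)
        if i < len(right_sums) and right_sums[i] == target - s:
            return True
    return False
```

Each half contributes at most 2^(n/2) sums, so sorting plus searching stays within the 2^(n/2) budget (up to logarithmic factors), versus 2^n for exhaustive search.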

Book
01 Jan 1984
TL;DR: The overall result is that the larger the problem, the closer the algorithms approach optimal speedup; a general transformation to quotient networks allows algorithms to be designed assuming any number of processing elements.
Abstract: We present and analyze several practical parallel algorithms for multicomputers. Chapter four presents two distributed algorithms for implementing alpha-beta search on a tree of processors. Each processor is an independent computer with its own memory and is connected by communication lines to each of its nearest neighbors. Measurements of the first algorithm's performance on the Arachne distributed operating system are presented. For each algorithm, a theoretical model is developed that predicts speedup with arbitrarily many processors. Chapter five shows how locally-defined iterative methods give rise to natural multicomputer algorithms. We consider two interconnection topologies, the grid and the tree. Each processor (or terminal processor in the case of a tree multicomputer) engages in serial computation on its region and communicates border values to its neighbors when those values become available. As a focus for our investigation we consider the numerical solution of elliptic partial differential equations. We concentrate on the Dirichlet problem for Laplace's equation on a square region, but our results can be generalized to situations involving arbitrarily shaped domains (of any number of dimensions) and elliptic equations with variable coefficients. Our analysis derives the running time of the grid and the tree algorithms with respect to per-message overhead, per-point communication time, and per-point computation time. The overall result is that the larger the problem, the closer the algorithms approach optimal speedup. We also show how to apply the tree algorithms to non-uniform regions. A large-network algorithm solves a problem of size N on a network of N processors. Chapter six presents a general method for transforming large-network algorithms into quotient-network algorithms, which solve problems of size N on networks with fewer processors. This transformation allows algorithms to be designed assuming any number of processing elements. 
The implementation of such algorithms on a quotient network results in no loss of efficiency, and often a great savings in hardware cost.

BookDOI
01 Oct 1984
TL;DR: Correlation of Algorithms, Software and Hardware of Parallel Computers.
Abstract: 1. Synthesis of Parallel Numerical Algorithms.- 2. Complexity of Parallel Algorithms.- 3. Automatic Construction of Parallel Programs.- 4. Formal Models of Parallel Computations.- 5. On Parallel Languages.- 6. Proving Correctness and Automatic Synthesis of Parallel Programs.- 7. Operating Systems for Modular Partially Reconfigurable Multiprocessor-Systems.- 8. Algorithms for Scheduling Homogeneous Multiprocessor Computers.- 9. Algorithms for Scheduling Inhomogeneous Multiprocessor Computers.- 10. Parallel Processors and Multicomputer Systems.- 11. Data Flow Computer Architecture.- 12. Correlation of Algorithms, Software and Hardware of Parallel Computers.

Journal ArticleDOI
TL;DR: The speed-up of this algorithm is optimal in the sense that the depth of the algorithm is of the order of the running time of the fastest known sequential algorithm divided by the number of processors used.

Journal ArticleDOI
TL;DR: A couple of approximate inversion techniques are presented which provide a parallel enhancement to several iterative methods for solving linear systems arising from the discretization of boundary value problems.
Abstract: A couple of approximate inversion techniques are presented which provide a parallel enhancement to several iterative methods for solving linear systems arising from the discretization of boundary value problems. In particular, the Jacobi, Gauss‐Seidel, and successive overrelaxation methods can be improved substantially in a parallel environment by the extensions considered. A special case convergence proof is presented. The use of our approximate inverses with the preconditioned conjugate gradient method is examined and comparisons are made with some recently proposed algorithms in this area that also employ approximate inverses. The methods considered are compared under sequential and parallel hardware assumptions.
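As a baseline for the iterative methods named above, here is a minimal sequential Jacobi sketch (not the paper's approximate-inverse preconditioner). Each component update depends only on the previous iterate, which is exactly the independence a parallel environment exploits.

```python
def jacobi(A, b, iters=100):
    """Jacobi iteration for Ax = b (A as a list of rows).

    Every component of the new iterate is computed from the old one,
    so all n updates per sweep are independent and parallelizable.
    Converges for diagonally dominant A.
    """
    n = len(A)
    x = [0.0] * n
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x
```

For the diagonally dominant system 4x + y = 1, x + 3y = 2 the iterates converge to (1/11, 7/11). Gauss-Seidel and SOR reuse fresh values within a sweep, which is what makes them harder to parallelize and motivates the paper's extensions.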

Journal ArticleDOI
Selim G. Akl
TL;DR: A parallel algorithm is presented for selecting the kth smallest element of a totally ordered (but not sorted) set of n elements, 1 ≤ k ≤ n, for an optimal total cost of O(n).
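The selection problem itself can be illustrated with a sequential quickselect sketch, whose expected O(n) work matches the total cost (processors times time) the TL;DR cites. This is a stand-in, not the paper's parallel algorithm.

```python
import random

def select_kth(items, k):
    """Return the k-th smallest element (1-based) of an unsorted list.

    Expected O(n) sequential quickselect: partition around a random
    pivot and recurse into the side that contains rank k.
    """
    items = list(items)
    assert 1 <= k <= len(items)
    while True:
        pivot = random.choice(items)
        lo = [x for x in items if x < pivot]
        eq = [x for x in items if x == pivot]
        if k <= len(lo):
            items = lo                     # answer is below the pivot
        elif k <= len(lo) + len(eq):
            return pivot                   # rank k falls on the pivot
        else:
            k -= len(lo) + len(eq)         # answer is above the pivot
            items = [x for x in items if x > pivot]
```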

Journal ArticleDOI
TL;DR: A parallel algorithm for depth-first searching of a directed acyclic graph (DAG) on a shared memory model of a SIMD computer is proposed, which uses two parallel tree traversal algorithms, one for the preorder traversal and the other for the postorder traversal of an ordered tree.
Abstract: A parallel algorithm for depth-first searching of a directed acyclic graph (DAG) on a shared memory model of a SIMD computer is proposed. The algorithm uses two parallel tree traversal algorithms, one for the preorder traversal and the other for the postorder traversal of an ordered tree. Each of these traversal algorithms has a time complexity of O(log n) when O(n) processors are used, n being the number of vertices in the tree. The parallel depth-first search algorithm for a directed acyclic graph G with n vertices has a time complexity of O((log n)^2) when O(n^2.81/log n) processors are used.

Book ChapterDOI
03 Dec 1984
TL;DR: The first result states that cfl's can be recognized on a cube-connected computer or on a perfect-shuffle computer in log^2 n time using n^6 processors; the second result can be viewed as an application of parallel algorithms to the design of efficient sequential algorithms.
Abstract: In this paper we present two results concerning the time and space complexity of context-free recognition. The first result states that cfl's can be recognized on a cube-connected computer (CCC) or on a perfect-shuffle computer (PSC) in log^2 n time using n^6 processors. There are known algorithms with the same parallel time complexity, but they use more powerful models of computation. The second result states that deterministic cfl's can be recognized in polynomial time using one log^2 n bounded pushdown store and log n tape. Known algorithms use log^2 n tape. Since the algorithm is a simulation of a deterministic pda, it may be looked upon as an efficient reduction of the height of the pushdown store. The second result is obtained by applying a transformation of a fast parallel recognition of deterministic cfl's, and it can be viewed as an application of parallel algorithms to the design of efficient sequential algorithms.

Journal ArticleDOI
Selim G. Akl
TL;DR: It is shown that the convex hull algorithm leads to a parallel sorting algorithm whose total cost is O(n log n), which is optimal; this performance matches that of the best currently known sequential convex hull algorithm.
Abstract: A parallel algorithm is presented for computing the convex hull of a set of n points in the plane. The algorithm uses n^(1-ε) processors, 0 < ε < 1.
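A sequential sketch of the problem being parallelized: Andrew's monotone-chain method computes the planar convex hull in O(n log n), the same total cost (processors times time) the paper's parallel algorithm attains. This is a reference implementation of the problem, not the paper's algorithm.

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull, O(n log n) sequentially.

    Returns the hull vertices in counterclockwise order, collinear
    points excluded.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 iff o -> a -> b makes a counterclockwise (left) turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()        # discard points that turn the wrong way
            chain.append(p)
        return chain

    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]  # drop duplicated endpoints
```

Sorting dominates the cost, which is why a parallel hull algorithm of this total cost also yields the optimal parallel sort mentioned in the TL;DR.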

Journal ArticleDOI
TL;DR: PASM is a multifunction partitionable SIMD/MIMD system being designed at Purdue for parallel image understanding that will incorporate over 1,000 complex processing elements.
Abstract: PASM is a multifunction partitionable SIMD/MIMD system being designed at Purdue for parallel image understanding. It is to be a large-scale, dynamically reconfigurable multimicroprocessor system, which will incorporate over 1,000 complex processing elements. Parallel algorithm studies and simulations have been used to analyze application tasks in order to guide design decisions. A prototype of PASM is under construction (funded by an equipment grant from IBM), including 30 Motorola MC68010 processors, a multistage interconnection network, five disk drives, and connections to the Purdue Engineering Computer Network (for access to peripherals, terminals, software development tools, etc.). PASM is to serve as a vehicle for studying the use of parallelism for performing the numeric and symbolic processing needed for tasks such as computer vision. The PASM design concepts and prototype are overviewed and brief examples of parallel algorithms are given.

Journal ArticleDOI
TL;DR: Parallel Breadth-First Search (BFS) algorithms for ordered trees and graphs on a shared memory model of a Single Instruction-stream Multiple Data-stream computer are proposed.
Abstract: Parallel Breadth-First Search (BFS) algorithms for ordered trees and graphs on a shared memory model of a Single Instruction-stream Multiple Data-stream computer are proposed. The parallel BFS algorithm for trees computes the BFS rank of each node of an ordered tree consisting of n nodes in time O(β log n) when O(n^(1+1/β)) processors are used, β being an integer greater than or equal to 2. The parallel BFS algorithm for graphs produces Breadth-First Spanning Trees (BFSTs) of a directed graph G having n nodes in time O(log d · log n) using O(n^3) processors, where d is the diameter of G. If G is a strongly connected graph or a connected undirected graph, the BFS algorithm produces n BFSTs, each BFST having a different start node.
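The BFS rank being computed is simply a vertex's position in breadth-first visit order. A sequential reference sketch (stand-in for the parallel algorithms, adjacency given as a dict of neighbor lists):

```python
from collections import deque

def bfs_order(adj, start):
    """Breadth-first visit order from `start`; index = BFS rank."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)                  # rank of v = len(order) - 1
        for w in adj.get(v, ()):
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order
```

The tree edges from each vertex to the vertex that discovered it form the Breadth-First Spanning Tree the abstract refers to.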

Journal ArticleDOI
TL;DR: The worst-case performance of a parallel stable-marriage algorithm is stated, and a theoretical analysis shows that the probability of this worst case occurring is extremely small.
Abstract: In this paper a parallel algorithm to solve the stable marriage problem is given. The worst case performance of this algorithm is stated. A theoretical analysis shows that the probability of the occurrence of this worst case is extremely small. For instance, if there are sixteen men and sixteen women involved, then the probability that the worst case occurs is only 10^(-45). Possible future research is also discussed in this paper.
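The underlying problem is classically solved by the Gale-Shapley proposal algorithm; here is a sequential sketch of it as a point of reference (the paper's contribution is a parallel formulation, which this is not). Preference lists give partner indices in decreasing order of preference.

```python
def stable_marriage(men_prefs, women_prefs):
    """Gale-Shapley proposal algorithm; returns wife[m] for each man m."""
    n = len(men_prefs)
    # rank[w][m] = position of man m on woman w's list (lower = better)
    rank = [[0] * n for _ in range(n)]
    for w in range(n):
        for pos, m in enumerate(women_prefs[w]):
            rank[w][m] = pos
    next_choice = [0] * n        # next woman each man will propose to
    husband = [None] * n
    free = list(range(n))
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if husband[w] is None:
            husband[w] = m
        elif rank[w][m] < rank[w][husband[w]]:
            free.append(husband[w])   # w trades up; old partner is free
            husband[w] = m
        else:
            free.append(m)            # w rejects m; he proposes again
    wife = [None] * n
    for w, m in enumerate(husband):
        wife[m] = w
    return wife
```

The resulting matching is stable: no man and woman both prefer each other to their assigned partners.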

Journal ArticleDOI
01 Dec 1984
TL;DR: The present study suggests the possibility of both reducing the real time processing and increasing the scope of computational modeling in the Heterogeneous Element Processor (HEP) multiple instruction stream computer.
Abstract: A parallelized point rowwise Successive Over-Relaxation (SOR) iterative algorithm is developed for the Heterogeneous Element Processor (HEP) multiple instruction stream computer. The classical point SOR method is not easily vectorizable with rowwise ordering of the grid points, but it can be effectively parallelized on a multiple instruction stream machine without loss in computational or convergence rate. The details of the implementation are presented, including restructuring of a serial FORTRAN program and techniques needed to exploit the parallel processing architectural concept of the HEP. The parallelized algorithm is analyzed in detail. The lessons learned in this study are documented and may provide guidelines for similar future coding, since programming a multiple instruction stream machine requires new approaches and restructuring techniques that are totally different from those needed for programming an algorithm on a vector processor. To assess the capabilities of the parallelized algorithm, it was used to solve Laplace's equation on a rectangular field with Dirichlet boundary conditions. Computer run times are presented which indicate significant speed gain over a scalar version of the code. For a moderate to large size problem, seventeen or more processes are required to make efficient use of the parallel processing hardware. Also, to demonstrate the capability of the algorithm for a realistic problem, it was used to obtain the numerical solution of a viscous incompressible fluid in a square cavity. Since point iterative relaxation schemes are at the core of many systems of elliptic as well as non-elliptic partial differential equations occurring in engineering and scientific applications, the present study suggests the possibility of both reducing the real time processing and increasing the scope of computational modeling.
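The test problem from the abstract — Laplace's equation on a rectangle with Dirichlet boundary conditions — can be sketched with a serial point SOR sweep. This is the serial form the paper restructures for the HEP, not the parallelized version; the relaxation factor `omega = 1.5` is an illustrative choice, not taken from the paper.

```python
def sor_laplace(grid, omega=1.5, iters=200):
    """Point rowwise SOR sweeps for Laplace's equation.

    `grid` is a list of rows; the border cells hold the fixed Dirichlet
    boundary values and interior cells are relaxed in place toward the
    average of their four neighbors, over-corrected by `omega`.
    """
    g = [row[:] for row in grid]
    rows, cols = len(g), len(g[0])
    for _ in range(iters):
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                gs = 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1])
                g[i][j] += omega * (gs - g[i][j])   # over-relaxed update
    return g
```

With constant boundary values the interior relaxes to that same constant, which makes a convenient sanity check. Because each update reads the freshest neighbor values, the sweeps do not vectorize directly; the paper exploits MIMD instruction streams instead.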

Journal ArticleDOI
TL;DR: Numerical results are presented which show the algorithm finds simple and multiple zeros to an accuracy (usually) limited by the accuracy of polynomial evaluation.
Abstract: A method for finding simple roots of arbitrary polynomials based on divided differences is discussed. Theoretical background is presented for the case of simple roots. Numerical results are presented which show the algorithm finds simple and multiple zeros to an accuracy (usually) limited by the accuracy of polynomial evaluation. The method is designed for a SIMD parallel computer. The algorithm is compared to two other frequently used polynomial root finders, the Jenkins-Traub algorithm and Laguerre's method.

Proceedings Article
06 Aug 1984
TL;DR: This work describes efficient multiresolution iterative algorithms for computing lightness, shape-from-shading, and optical flow, and evaluates the performance of these algorithms using synthesized images.
Abstract: Problems in machine vision that are posed as variational principles or partial differential equations can often be solved by local, iterative, and parallel algorithms. A disadvantage of these algorithms is that they are inefficient at propagating constraints across large visual representations. Application of multigrid methods has overcome this drawback with regard to the computation of visible-surface representations. We argue that our multiresolution approach has wide applicability in vision. In particular, we describe efficient multiresolution iterative algorithms for computing lightness, shape-from-shading, and optical flow, and evaluate the performance of these algorithms using synthesized images.