
Showing papers on "Parallel algorithm published in 1986"


Journal ArticleDOI
TL;DR: Two basic design strategies are used to develop very simple and fast parallel algorithms for the maximal independent set (MIS) problem.
Abstract: Two basic design strategies are used to develop very simple and fast parallel algorithms for the maximal independent set (MIS) problem. The first strategy consists of assigning identical copies o...

1,117 citations
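A round-based randomized strategy commonly associated with this problem (Luby-style: each surviving vertex draws a random priority, local minima join the set, and winners plus their neighbours drop out) can be sketched as follows. This is a sequential simulation in that spirit, not necessarily the paper's exact first strategy; the function name and graph representation are illustrative:

```python
import random

def randomized_mis(adj):
    """Round-based maximal independent set in the spirit of Luby's
    algorithm.  adj maps each vertex to its set of neighbours.
    Each round: every surviving vertex draws a random priority;
    vertices that beat all surviving neighbours join the MIS, and
    they and their neighbours are removed.  (Sequential simulation
    of the parallel rounds.)"""
    live = set(adj)
    mis = set()
    while live:
        prio = {v: random.random() for v in live}
        # A vertex wins if its priority beats every surviving neighbour.
        winners = {v for v in live
                   if all(prio[v] < prio[u] for u in adj[v] if u in live)}
        mis |= winners
        removed = set(winners)
        for v in winners:
            removed |= adj[v] & live
        live -= removed
    return mis
```

Each round removes at least the globally smallest-priority vertex, so the loop always terminates; the point of the analysis in the paper is that, in expectation, a constant fraction of the edges disappears per round.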


Journal ArticleDOI
TL;DR: The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
Abstract: Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.

1,000 citations
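A standard example of the data parallel style described here is the log-step inclusive prefix sum, in which every array element is updated by the same whole-array operation at each step. The sketch below follows the scan commonly associated with Hillis and Steele; the exact formulation is illustrative:

```python
def hillis_steele_scan(xs):
    """Inclusive prefix sum in O(log n) data-parallel steps: at step d,
    every element adds the value 2**d positions to its left (if any).
    Each step is one whole-array operation, which is what makes the
    formulation data parallel rather than control parallel."""
    xs = list(xs)
    d = 1
    while d < len(xs):
        xs = [xs[i] + (xs[i - d] if i >= d else 0) for i in range(len(xs))]
        d *= 2
    return xs
```

Prefix sum looks inherently serial when written as a loop carrying a running total, which is exactly the kind of problem the abstract refers to.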


Journal ArticleDOI
Zvi Galil1
TL;DR: The techniques used for designing the most efficient algorithms for finding a maximum cardinality or weighted matching in (general or bipartite) graphs are surveyed.
Abstract: This paper surveys the techniques used for designing the most efficient algorithms for finding a maximum cardinality or weighted matching in (general or bipartite) graphs. It also lists some open problems concerning possible improvements in existing algorithms and the existence of fast parallel algorithms for these problems.

479 citations


Journal ArticleDOI
TL;DR: This paper develops multiresolution iterative algorithms for computing lightness, shape-from-shading, and optical flow, examines their efficiency using synthetic image inputs, and describes a multigrid methodology that is broadly applicable in early vision.
Abstract: Image analysis problems, posed mathematically as variational principles or as partial differential equations, are amenable to numerical solution by relaxation algorithms that are local, iterative, and often parallel. Although they are well suited structurally for implementation on massively parallel, locally interconnected computational architectures, such distributed algorithms are seriously handicapped by an inherent inefficiency at propagating constraints between widely separated processing elements. Hence, they converge extremely slowly when confronted by the large representations of early vision. Application of multigrid methods can overcome this drawback, as we showed in previous work on 3-D surface reconstruction. In this paper, we develop multiresolution iterative algorithms for computing lightness, shape-from-shading, and optical flow, and we examine the efficiency of these algorithms using synthetic image inputs. The multigrid methodology that we describe is broadly applicable in early vision. Notably, it is an appealing strategy to use in conjunction with regularization analysis for the efficient solution of a wide range of ill-posed image analysis problems.

424 citations
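To illustrate why a multigrid cycle propagates constraints faster than pure local relaxation, here is a minimal sketch of a 1-D Poisson V-cycle with weighted-Jacobi smoothing, full-weighting restriction, and linear interpolation. The model problem and all parameters are illustrative, not the vision problems treated in the paper:

```python
def residual(u, f, h):
    """r = f - A u for the 1-D Poisson operator
    (A u)_i = (2 u_i - u_{i-1} - u_{i+1}) / h^2, zero Dirichlet ends."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2 * u[i] - left - right) / h ** 2)
    return r

def jacobi(u, f, h, sweeps, w=2.0 / 3.0):
    """Weighted-Jacobi smoothing: damps high-frequency error quickly,
    but moves information only one grid point per sweep."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [u[i] + w * h ** 2 / 2 * r[i] for i in range(len(u))]
    return u

def v_cycle(u, f, h):
    """One V-cycle: smooth, restrict the residual to a grid with half
    the points, recursively solve there for a correction, interpolate
    it back, and smooth again.  Grid sizes must be 2**k - 1."""
    if len(u) == 1:
        return [f[0] * h ** 2 / 2]          # exact solve on one point
    u = jacobi(u, f, h, 2)
    r = residual(u, f, h)
    nc = (len(u) - 1) // 2
    # Full-weighting restriction: coarse point i sits at fine index 2i+1.
    rc = [(r[2 * i] + 2 * r[2 * i + 1] + r[2 * i + 2]) / 4 for i in range(nc)]
    ec = v_cycle([0.0] * nc, rc, 2 * h)
    e = [0.0] * len(u)
    for i in range(nc):                      # linear interpolation back
        e[2 * i + 1] += ec[i]
        e[2 * i] += ec[i] / 2
        e[2 * i + 2] += ec[i] / 2
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h, 2)
```

The coarse-grid correction is what carries long-range constraint information in one step, which plain relaxation can only do one grid point at a time.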


Journal ArticleDOI
TL;DR: The problem of constructing a perfect matching in a graph is in the complexity class Random NC; i.e., the problem is solvable in polylog time by a randomized parallel algorithm using a polynomial-bounded number of processors.
Abstract: We show that the problem of constructing a perfect matching in a graph is in the complexity class Random NC; i.e., the problem is solvable in polylog time by a randomized parallel algorithm using a polynomial-bounded number of processors. We also show that several related problems lie in Random NC. These include:

287 citations
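The randomized machinery behind such results rests on Lovász's lemma: substitute random values for the indeterminates of the Tutte matrix, and with high probability the determinant is nonzero exactly when a perfect matching exists. Below is a sequential sketch of that decision test (the paper goes further and actually constructs the matching); the prime, trial count, and function names are illustrative:

```python
import random

def det_mod(M, p):
    """Determinant of a square matrix over GF(p) by Gaussian
    elimination (p must be prime so pivots are invertible)."""
    n = len(M)
    M = [row[:] for row in M]
    det = 1
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c]), None)
        if pivot is None:
            return 0
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det                        # row swap flips the sign
        det = det * M[c][c] % p
        inv = pow(M[c][c], p - 2, p)          # Fermat inverse mod p
        for r in range(c + 1, n):
            f = M[r][c] * inv % p
            for k in range(c, n):
                M[r][k] = (M[r][k] - f * M[c][k]) % p
    return det % p

def has_perfect_matching(n, edges, p=1_000_003, trials=3):
    """Lovasz's randomized test: the Tutte matrix with random entries
    mod p has nonzero determinant iff the graph has a perfect
    matching, with error probability at most n/p per trial.  The
    error is one-sided: a nonzero determinant is conclusive."""
    for _ in range(trials):
        T = [[0] * n for _ in range(n)]
        for u, v in edges:
            x = random.randrange(1, p)
            T[u][v], T[v][u] = x, (-x) % p    # skew-symmetric entries
        if det_mod(T, p):
            return True
    return False
```

The determinant is what makes the problem parallelizable: it can be evaluated in polylog time with polynomially many processors, whereas no fast deterministic parallel matching algorithm was known.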


Journal ArticleDOI
TL;DR: A parallel randomized algorithm to find a maximal matching is presented that improves the best known deterministic algorithm by a factor of log^2 |E|.

248 citations


Proceedings ArticleDOI
27 Oct 1986
TL;DR: A novel scheduling problem is defined and solved by repeated, rapid, approximate reschedulings, which yields the first optimal PRAM algorithm for list ranking, running in logarithmic time.
Abstract: We study two parallel scheduling problems and their use in designing parallel algorithms. First, we define a novel scheduling problem; it is solved by repeated, rapid, approximate reschedulings. This leads to the first optimal PRAM algorithm for list ranking, which runs in logarithmic time. Our second scheduling result is for computing prefix sums of log n-bit numbers. We give an optimal parallel algorithm for the problem which runs in sublogarithmic time. These two scheduling results together lead to logarithmic-time PRAM algorithms for the connectivity, biconnectivity and minimum spanning tree problems. The connectivity and biconnectivity algorithms are optimal unless m = o(n log* n), in graphs of n vertices and m edges.

196 citations
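For contrast with the optimal algorithm above, the classic non-optimal approach to list ranking is pointer jumping, which the paper improves upon. A sequential simulation of its O(log n) parallel rounds, with illustrative names:

```python
def list_rank(succ):
    """Pointer-jumping list ranking.  succ[i] is the successor of
    node i, with succ[i] == i marking the tail.  Each of the
    O(log n) rounds updates every node at once (simulated here by
    building new arrays), doubling the distance each pointer spans.
    Returns dist[i] = number of links from node i to the tail.
    Work is O(n log n); the paper's contribution is getting this
    down to optimal O(n) work via scheduling."""
    n = len(succ)
    dist = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(n.bit_length()):
        dist = [dist[i] + dist[succ[i]] for i in range(n)]
        succ = [succ[succ[i]] for i in range(n)]
    return dist
```

Once a node's pointer reaches the tail its distance stops growing, so the extra rounds from the crude n.bit_length() bound are harmless.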


Journal ArticleDOI
TL;DR: The improved algorithm overcomes some of the disadvantages found in [5] by preserving necessary and essential structures for certain patterns which should not be deleted, and it maintains very fast speed, running about 1.5 to 2.3 times faster than the four-step and two-step methods described in [3].
Abstract: A fast parallel thinning algorithm for digital patterns is presented. This algorithm is an improved version of the algorithms introduced by Zhang and Suen [5] and Stefanelli and Rosenfeld [3]. An experiment using an Apple II and an Epson printer was conducted. The results show that the improved algorithm overcomes some of the disadvantages found in [5] by preserving necessary and essential structures for certain patterns which should not be deleted, and it maintains very fast speed, running about 1.5 to 2.3 times faster than the four-step and two-step methods described in [3], although the resulting skeletons look basically the same.

146 citations


Journal ArticleDOI
TL;DR: With the correct choice of ordering, the algorithm can be implemented using systolic array processors (Gentleman, personal communication), and it can also be used to compute any CS decomposition of a unitary matrix.
Abstract: An algorithm is described for computing the generalized singular value decomposition of A (m × n) and B (p × n). Unitary matrices U, V and Q are developed so that U^H A Q and V^H B Q have as many nonzero parallel rows as possible, and these correspond to the common row space of the two matrices. The algorithm consists of an iterative sequence of cycles where each cycle is made up of the serial application of 2 × 2 generalized singular value decompositions. Convergence appears to be at least quadratic. With the correct choice of ordering, the algorithm can be implemented using systolic array processors (Gentleman, personal communication). The algorithm can also be used to compute any CS decomposition of a unitary matrix.

137 citations


Journal ArticleDOI
TL;DR: A general method for efficiently searching undirected graphs in parallel, called ear-decomposition search (EDS), based on depth-first search (DFS), is presented.

134 citations


Book ChapterDOI
01 Jun 1986

Journal ArticleDOI
01 May 1986
TL;DR: It is observed that to obtain this limited factor of 10-fold speed-up, it is necessary to exploit parallelism at a very fine granularity, and it is proposed that a suitable architecture to exploit such fine-grain parallelism is a bus-based shared-memory multiprocessor with 32-64 processors.
Abstract: Rule-based systems, on the surface, appear to be capable of exploiting large amounts of parallelism—it is possible to match each rule to the data memory in parallel. In practice, however, we show that the speed-up from parallelism is quite limited, less than 10-fold. The reasons for the small speed-up are: (1) the small number of rules relevant to each change to data memory; (2) the large variation in the processing required by the relevant rules; and (3) the small number of changes made to data memory between synchronization steps. Furthermore, we observe that to obtain this limited factor of 10-fold speed-up, it is necessary to exploit parallelism at a very fine granularity. We propose that a suitable architecture to exploit such fine-grain parallelism is a bus-based shared-memory multiprocessor with 32-64 processors. Using such a multiprocessor (with individual processors working at 2 MIPS), it is possible to obtain execution speeds of about 3800 rule-firings/sec. This speed is significantly higher than that obtained by other proposed parallel implementations of rule-based systems.

Journal ArticleDOI
TL;DR: Two parallel formulations of the statistical cooling algorithm are proposed, i.e. a systolic algorithm and a clustered algorithm, based on the requirement that quasi-equilibrium is preserved throughout the optimization process.

Journal ArticleDOI
TL;DR: A parallel O(log^3 |E|) algorithm for finding a maximal matching in a graph G(V, E) is presented; the model of computation is the CRCW-PRAM, and |V| + |E| processors are used.

Journal ArticleDOI
01 Jul 1986
TL;DR: It is shown that the time lower bound for computing the inverse dynamics of an n-link robot manipulator in parallel using p processors is O(k1⌈n/p⌉ + k2⌈log2 p⌉), where k1 and k2 are constants.
Abstract: It is shown that the time lower bound for computing the inverse dynamics of an n-link robot manipulator in parallel using p processors is O(k1⌈n/p⌉ + k2⌈log2 p⌉), where k1 and k2 are constants. A novel parallel algorithm for computing the inverse dynamics using the Newton-Euler equations of motion was developed for a single-instruction-stream multiple-data-stream computer with p processors to achieve this time lower bound. When p = n, the proposed parallel algorithm achieves Minsky's time lower bound O(⌈log2 n⌉), which is the conjectured bound for parallel evaluation. The proposed p-fold parallel algorithm can best be described as consisting of p parallel blocks with pipelined elements within each block. The results from the computations in the p blocks form a new homogeneous linear recurrence of size p, which can be computed using the recursive doubling algorithm. A modified inverse perfect shuffle interconnection scheme was suggested to interconnect the p processors. Furthermore, the proposed parallel algorithm lends itself to a systolic pipelined architecture, requiring three floating-point operations per complete set of joint torques.
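The recursive doubling step mentioned above can be sketched for a first-order linear recurrence: each term x[i] = a[i]·x[i-1] + b[i] is an affine map, and map composition is associative, so all prefixes can be combined in O(log p) parallel rounds. The formulation below is a simplified scalar illustration, not the paper's dynamics recurrence:

```python
def recurrence_by_doubling(a, b, x0):
    """Solve x[i] = a[i]*x[i-1] + b[i] by recursive doubling.
    Each step is the affine map x -> a*x + b, represented as the
    pair (a, b); composing a later map (a2, b2) after an earlier
    one (a1, b1) gives (a2*a1, a2*b1 + b2).  A Hillis-Steele style
    scan over the maps combines all prefixes in O(log n) rounds."""
    n = len(a)
    maps = list(zip(a, b))                   # (a, b) represents x -> a*x + b
    d = 1
    while d < n:
        maps = [(maps[i][0] * maps[i - d][0],
                 maps[i][0] * maps[i - d][1] + maps[i][1])
                if i >= d else maps[i]
                for i in range(n)]
        d *= 2
    return [ai * x0 + bi for ai, bi in maps]
```

After the scan, maps[i] is the composition of the first i+1 steps, so applying it to x0 yields x[i+1] directly; this is how an apparently serial recurrence parallelizes.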

Proceedings ArticleDOI
01 Nov 1986
TL;DR: It is shown that the rank of a matrix over an arbitrary field can be computed in O(log^2 n) time using a polynomial number of processors.
Abstract: It is shown that the rank of a matrix over an arbitrary field can be computed in O(log^2 n) time using a polynomial number of processors.

Journal ArticleDOI
Marina C. Chen
TL;DR: The fact that Crystal is a general purpose language for parallel programming allows new design methods and synthesis techniques, properties and theorems about problems in specific application domains, and new insights into any given problem to be integrated readily within the existing design framework.

Journal ArticleDOI
TL;DR: It is shown that four complete problems for P (nonsparse versions of unification, path system accessibility, monotone circuit value, and ordered depth-first search) are parallelizable.
Abstract: Previous theoretical work in computational complexity has suggested that any problem which is log-space complete for P is not likely in NC, and thus not parallelizable. In practice, this is not the case. To resolve this paradox, we introduce new complexity classes PC and PC* that capture the practical notion of parallelizability we discuss in this paper. We show that four complete problems for P (nonsparse versions of unification, path system accessibility, monotone circuit value, and ordered depth-first search) are parallelizable. That is, their running times are O(E + V) on a sequential RAM and O(E/P + V log P) on an EXCLUSIVE-READ EXCLUSIVE-WRITE Parallel RAM with P processors, where V and E are the numbers of vertices and edges in the input instance of the problem. These problems are in PC and PC*, since an appropriate choice of P can speed up their sequential running times by a factor of μ(P). Several interesting open questions are raised regarding these new parallel complexity classes PC and PC*. Unification is particularly important because it is a basic operation in theorem proving, in type inference algorithms, and in logic programming languages such as Prolog. A fast parallel implementation of Prolog is needed for software development in the Fifth Generation project.

Journal ArticleDOI
TL;DR: A parallel algorithm is developed for Cholesky factorization on a shared-memory multiprocessor, based on self-scheduling of a pool of tasks; the most promising variant, which the authors call column-Cholesky, is identified and implemented for the Denelcor HEP multiprocessor.

01 Nov 1986
TL;DR: The main tool in this environment is a package called SCHEDULE which has been designed to aid a programmer familiar with a Fortran programming environment to implement a parallel algorithm in a manner that will lend itself to transporting the resulting program across a wide variety of parallel machines.
Abstract: This paper describes an environment for the transportable implementation of parallel algorithms in a Fortran setting. By this we mean that a user's code is virtually identical for each machine. The main tool in this environment is a package called SCHEDULE which has been designed to aid a programmer familiar with a Fortran programming environment to implement a parallel algorithm in a manner that will lend itself to transporting the resulting program across a wide variety of parallel machines. The package is designed to allow existing Fortran subroutines to be called through SCHEDULE, without modification, thereby permitting users access to a wide body of existing library software in a parallel setting. Machine intrinsics are invoked within the SCHEDULE package, and considerable effort may be required on our part to move SCHEDULE from one machine to another. On the other hand, the user of SCHEDULE is relieved of the burden of modifying each code he desires to transport from one machine to another. 17 refs., 11 figs., 1 tab.

Journal ArticleDOI
Joseph W. H. Liu
01 Oct 1986
TL;DR: A new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix and give an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.
Abstract: In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix. The task scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model for multiple processors is described. It is based on a heuristic critical path scheduling method. This will give an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.

01 Jan 1986
TL;DR: In this article, a parallel algorithm for Cholesky factorization on a shared-memory multiprocessor is presented. The algorithm is based on self-scheduling of a pool of tasks.
Abstract: A parallel algorithm is developed for Cholesky factorization on a shared-memory multiprocessor. The algorithm is based on self-scheduling of a pool of tasks. The subtasks in several variants of the basic elimination algorithm are analyzed for potential concurrency in terms of precedence relations, work profiles, and processor utilization. This analysis is supported by simulation results. The most promising variant, which the authors call column-Cholesky, is identified and implemented for the Denelcor HEP multiprocessor. Experimental results are given for this machine.
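A sequential sketch of the column-oriented (left-looking) Cholesky variant that these papers parallelize: the task for column j first applies the updates of all earlier columns (commonly called cmod) and then scales by the pivot (cdiv). In the parallel versions discussed above, these column tasks form the pool that processors self-schedule; the scheduling itself is omitted here:

```python
import math

def column_cholesky(A):
    """Left-looking column-Cholesky factorization A = L L^T for a
    symmetric positive definite matrix A (list of lists).  Column j
    is one task: cmod(j, k) subtracts the contribution of each
    earlier column k, then cdiv(j) scales by the pivot.  The
    precedence constraint is that cmod(j, k) needs column k
    finished, which is what the parallel task graphs capture."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col = [A[i][j] for i in range(n)]
        for k in range(j):                   # cmod(j, k) for all k < j
            for i in range(j, n):
                col[i] -= L[i][k] * L[j][k]
        piv = math.sqrt(col[j])              # cdiv(j): scale by the pivot
        for i in range(j, n):
            L[i][j] = col[i] / piv
    return L
```

The sparse setting adds structure on top of this: cmod(j, k) is needed only when L[j][k] is nonzero, which is why the medium-grained task graph mirrors the filled graph of the matrix.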

Journal ArticleDOI
27 Oct 1986
TL;DR: A parallel algorithm is presented for testing a graph for planarity and for finding an embedding of a planar graph; it uses a sophisticated data structure for representing sets of embeddings, the PQ-tree of [Booth and Lueker, 76].
Abstract: We describe a parallel algorithm for testing a graph for planarity, and for finding an embedding of a planar graph. For a graph on n vertices, the algorithm runs in O(log^2 n) steps on n processors of a parallel RAM. The previous best algorithm for planarity testing in parallel polylog time ([Ja'Ja' and Simon, 82]) used a reduction to solving linear systems, and hence required Ω(n^2.49) processors by known methods, whereas our processor bounds are within a polylog factor of optimal. The most significant aspect of our parallel algorithms is the use of a sophisticated data structure for representing sets of embeddings, the PQ-tree of [Booth and Lueker, 76]. Previously no parallel algorithms for PQ-trees were known. We have efficient parallel algorithms for manipulating PQ-trees, which we use in our planarity algorithm.

Journal ArticleDOI
Aggarwal
TL;DR: The problem of finding the maximum of a set of values stored one per processor on a two-dimensional array of processors with a time-shared global bus is considered, and the algorithm given by Bokhari is shown to be optimal, within a multiplicative constant, for this network and for other d-dimensional arrays.
Abstract: The problem of finding the maximum of a set of values stored one per processor on a two-dimensional array of processors with a time-shared global bus is considered. The algorithm given by Bokhari is shown to be optimal, within a multiplicative constant, for this network and for other d-dimensional arrays. We generalize this model and demonstrate optimal bounds for finding the maximum of a set of values stored in a d-dimensional array with k time-shared global buses.

Journal ArticleDOI
TL;DR: An algorithm for merging k sorted lists of n/k elements using k processors is presented, and its worst-case complexity is proved to be 2n, regardless of the number of processors, neglecting the cost arising from possible conflicts on the broadcast channel.
Abstract: The paper addresses ways in which one can use "broadcast communication" in distributed algorithms and the relevant issues of design and complexity. We present an algorithm for merging k sorted lists of n/k elements using k processors and prove its worst-case complexity to be 2n, regardless of the number of processors, neglecting the cost arising from possible conflicts on the broadcast channel. We also show that this algorithm is optimal under single-channel broadcast communication. In a variation of the algorithm, we show that by using an extra local memory of O(k) the number of broadcasts is reduced to n. When the algorithm is used for sorting n elements with k processors, where each processor first sorts its own list and the lists are then merged, it has a complexity of O(n/k log(n/k) + n), and is thus asymptotically optimal for large n. We also discuss the cost incurred by the channel access scheme and prove that resolving conflicts whenever k processors are involved introduces a cost factor of at least log k.
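The merging scheme can be simulated sequentially: in each round the processor holding the smallest current head wins the channel and broadcasts its element, so there is one broadcast per output element. A sketch using a heap to pick the winner (the heap is only a simulation device, not part of the broadcast model, and the function name is illustrative):

```python
import heapq

def broadcast_merge(lists):
    """Merge k sorted lists as in the broadcast model: each round,
    the processor with the smallest current head 'broadcasts' it,
    every processor appends it to the shared output, and the winner
    advances its local pointer.  The heap over the k heads stands in
    for the channel-arbitration step that picks the winner."""
    heads = [(lst[0], j, 0) for j, lst in enumerate(lists) if lst]
    heapq.heapify(heads)
    out = []
    while heads:
        val, j, i = heapq.heappop(heads)
        out.append(val)                      # the broadcast element
        if i + 1 < len(lists[j]):
            heapq.heappush(heads, (lists[j][i + 1], j, i + 1))
    return out
```

One broadcast per output element matches the flavor of the paper's bound, which counts channel usage rather than comparisons.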

Journal ArticleDOI
TL;DR: A new algorithm, which is a variant of the sign algorithm, is proposed for the adaptive adjustment of an FIR digital filter with the aim of improving the original convergence characteristics, yet retaining the advantage of hardware simplicity.
Abstract: A new algorithm, which is a variant of the sign algorithm, is proposed for the adaptive adjustment of an FIR digital filter with the aim of improving the original convergence characteristics, yet retaining the advantage of hardware simplicity. Based on a recently proposed theory for the sign algorithm, a practical design method is derived for the new algorithm, and it is shown by computer simulation that the new algorithm in fact performs significantly better than the original algorithm.
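For reference, the baseline sign algorithm that the paper's variant improves on replaces the LMS error term by its sign, so the weight update needs no multiplication by the error. A sketch with illustrative parameter names (this is the original rule, not the paper's new variant):

```python
def sign_lms(x, d, taps, mu):
    """Baseline sign algorithm for an adaptive FIR filter: like LMS,
    but the update uses only the sign of the error, which is the
    source of its hardware simplicity.
    x: input samples, d: desired samples, taps: filter length,
    mu: step size.  Returns final weights and per-sample errors."""
    w = [0.0] * taps
    errs = []
    for n in range(taps - 1, len(x)):
        frame = x[n - taps + 1:n + 1][::-1]      # most recent sample first
        y = sum(wi * xi for wi, xi in zip(w, frame))
        e = d[n] - y
        s = (e > 0) - (e < 0)                    # sign of the error
        w = [wi + mu * s * xi for wi, xi in zip(w, frame)]
        errs.append(e)
    return w, errs
```

The trade-off the paper addresses is visible here: quantizing the error to its sign slows convergence relative to LMS, which is what the proposed variant is designed to recover.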

Journal ArticleDOI
TL;DR: A parallel nonlinear Gauss–Seidel algorithm for approximating the solution of Au + φ(u) = f, where A is an M-matrix, is introduced and studied, and the speed-up on the Denelcor HEP parallel processing computer is recorded.
Abstract: Multi-splittings of a matrix are used to generate parallel algorithms to approximate the solutions of nonlinear algebraic systems. A parallel nonlinear Gauss–Seidel algorithm for approximating the solution of Au + φ(u) = f, where A is an M-matrix, is introduced and studied. Also, a parallel Newton–SOR method is defined for the problem F(u) = 0, where the Jacobian F′(u) is an M-matrix. An illustration and comparison of these methods with their serial versions is given. The speed-up on the Denelcor HEP parallel processing computer is also recorded.

Proceedings ArticleDOI
01 Aug 1986
TL;DR: This work presents techniques which result in improved parallel algorithms for a number of problems whose efficient sequential algorithms use the plane-sweeping paradigm, and never uses the AKS sorting network in any of them.
Abstract: We present techniques which result in improved parallel algorithms for a number of problems whose efficient sequential algorithms use the plane-sweeping paradigm. The problems for which we give improved algorithms include intersection detection, trapezoidal decomposition, triangulation, and planar point location. Our technique can be used to improve on the previous time bound while keeping the space and processor bounds the same, or improve on the previous space bound while keeping the time and processor bounds the same. We also give efficient parallel algorithms for visibility from a point, 3-dimensional maxima, multiple range-counting, and rectilinear segment intersection counting. We never use the AKS sorting network in any of our algorithms.

Journal ArticleDOI
Baru, Su
TL;DR: The architecture of this system is compared to those of conventional local area networks and shared-memory systems in order to establish the distinct nature and characteristics of a multicomputer system based on the SM3 concept.
Abstract: The architecture of a multicomputer system with switchable main memory modules (SM3) is presented. This architecture supports the efficient execution of parallel algorithms for nonnumeric processing by 1) allowing the sharing of switchable main memory modules between computers, 2) supporting dynamic partitioning of the system, and 3) employing global control lines to efficiently support interprocessor communication. Data transfer time is reduced to memory switching time by allowing some main memory modules to be switched between processors. Dynamic partitioning gives a common bus system the capability of an MIMD machine while performing global operations. The global control lines establish a quick and efficient high-level protocol in the system. The network is supervised by a control computer which oversees network partitioning and other global functions. The hardware involved is quite simple and the network is easily extensible. A simulation study using discrete event simulation techniques has been carried out and the results of the study are presented. The architecture of this system is compared to those of conventional local area networks and shared-memory systems in order to establish the distinct nature and characteristics of a multicomputer system based on the SM3 concept.

Journal ArticleDOI
TL;DR: In this article, the authors show that the greedy algorithm introduced in [1] and [5] to perform the parallel QR decomposition of a dense rectangular matrix of size m × n is optimal.
Abstract: We show that the greedy algorithm introduced in [1] and [5] to perform the parallel QR decomposition of a dense rectangular matrix of size m × n is optimal. Then we assume that m/n^2 tends to zero as m and n go to infinity, and prove that the complexity of such a decomposition is asymptotically 2n, when an unlimited number of processors is available.