
Showing papers on "Parallel algorithm" published in 1988


Journal ArticleDOI
TL;DR: An alternative method based on the preflow concept of Karzanov is introduced; it runs as fast as any other known method on dense graphs, achieving an O(n³) time bound on an n-vertex graph, and is faster on graphs of moderate density.
Abstract: All previously known efficient maximum-flow algorithms work by finding augmenting paths, either one path at a time (as in the original Ford and Fulkerson algorithm) or all shortest-length augmenting paths at once (using the layered network approach of Dinic). An alternative method based on the preflow concept of Karzanov is introduced. A preflow is like a flow, except that the total amount flowing into a vertex is allowed to exceed the total amount flowing out. The method maintains a preflow in the original network and pushes local flow excess toward the sink along what are estimated to be shortest paths. The algorithm and its analysis are simple and intuitive, yet the algorithm runs as fast as any other known method on dense graphs, achieving an O(n³) time bound on an n-vertex graph. By incorporating the dynamic tree data structure of Sleator and Tarjan, we obtain a version of the algorithm running in O(nm log(n²/m)) time on an n-vertex, m-edge graph. This is as fast as any known method for any graph density and faster on graphs of moderate density. The algorithm also admits efficient distributed and parallel implementations. A parallel implementation running in O(n² log n) time using n processors and O(m) space is obtained. This time bound matches that of the Shiloach-Vishkin algorithm, which also uses n processors but requires O(n²) space.
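
As a rough illustration of the preflow idea, here is a generic push-relabel sketch in Python; it is a teaching baseline, not the paper's optimized O(n³) variant or its dynamic-tree version, and all names are ours.

```python
from collections import defaultdict, deque

def max_flow_push_relabel(n, edges, s, t):
    """Generic push-relabel maximum flow (a simple sketch, not the
    paper's FIFO analysis or dynamic-tree refinement)."""
    cap, adj = defaultdict(int), defaultdict(list)
    for u, v, c in edges:
        if not cap[(u, v)] and not cap[(v, u)]:
            adj[u].append(v); adj[v].append(u)
        cap[(u, v)] += c
    flow = defaultdict(int)
    height, excess = [0] * n, [0] * n
    height[s] = n                        # source starts at height n
    for v in adj[s]:                     # saturate all source edges: a preflow
        c = cap[(s, v)]
        flow[(s, v)] += c; flow[(v, s)] -= c
        excess[v] += c
    active = deque(v for v in adj[s] if excess[v] > 0 and v != t)
    while active:
        u = active.popleft()
        while excess[u] > 0:             # discharge u completely
            pushed = False
            for v in adj[u]:
                # push along admissible residual edges (height drops by 1)
                if cap[(u, v)] - flow[(u, v)] > 0 and height[u] == height[v] + 1:
                    d = min(excess[u], cap[(u, v)] - flow[(u, v)])
                    if excess[v] == 0 and v not in (s, t):
                        active.append(v)
                    flow[(u, v)] += d; flow[(v, u)] -= d
                    excess[u] -= d; excess[v] += d
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:               # relabel: lift u just above its lowest
                height[u] = 1 + min(height[v] for v in adj[u]
                                    if cap[(u, v)] - flow[(u, v)] > 0)
    return sum(flow[(s, v)] for v in adj[s])

print(max_flow_push_relabel(4, [(0, 1, 3), (0, 2, 2), (1, 3, 2), (2, 3, 3)], 0, 3))  # 4
```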

1,700 citations


Book
01 Mar 1988

1,148 citations


Journal ArticleDOI
TL;DR: Hierarchical network structures are developed that have the property that the optimal global estimate based on all the available information can be reconstructed from estimates computed by local processor nodes solely on the basis of their own local information and transmitted to a central processor.
Abstract: Various multisensor network scenarios with signal processing tasks that are amenable to multiprocessor implementation are described. The natural origins of such multitasking are emphasized, and novel parallel structures for state estimation using the Kalman filter are proposed that extend existing results in several directions. In particular, hierarchical network structures are developed that have the property that the optimal global estimate based on all the available information can be reconstructed from estimates computed by local processor nodes solely on the basis of their own local information and transmitted to a central processor. The algorithms potentially yield an approximately linear speedup rate, are reasonably failure-resistant, and are optimized with respect to communication bandwidth and memory requirements at the various processors.
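
As a worked special case of the reconstruction property, the sketch below fuses independent local estimates of a static state in information form; it omits dynamics and any shared prior, so it only hints at the structure of the hierarchical algorithms (names and numbers are illustrative).

```python
import numpy as np

def fuse(estimates, covariances):
    """Information-form fusion of independent local estimates of the
    same state: a simplified, static stand-in for the hierarchical
    reconstruction described in the abstract."""
    info = sum(np.linalg.inv(P) for P in covariances)
    info_state = sum(np.linalg.inv(P) @ x
                     for x, P in zip(estimates, covariances))
    P_global = np.linalg.inv(info)
    return P_global @ info_state, P_global

# two local sensors estimating a scalar state (hypothetical numbers)
x_hat, P = fuse([np.array([1.0]), np.array([1.2])],
                [np.eye(1) * 0.5, np.eye(1) * 1.0])
print(x_hat)   # weighted toward the more confident sensor: ~[1.067]
```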

482 citations


Book
01 Jan 1988
TL;DR: The emphasis of the book is on designing algorithms within the timeless and abstracted context of a high-level programming language rather than depending on highly specific computer architectures.
Abstract: From the Publisher: This text is an introduction to the field of efficient parallel algorithms and to techniques for efficient parallelisation. It is largely self-contained and presumes no special knowledge of parallel computers or particular mathematics. The emphasis of the book is on designing algorithms within the timeless and abstracted context of a high-level programming language rather than depending on highly specific computer architectures. This approach concentrates on the essence of algorithmic theory, and on determining and taking advantage of the inherently parallel nature of certain types of problem. The authors present regularly used techniques and a range of algorithms which includes some of the more celebrated. The text is targeted at non-specialists who are considering entering the field of parallel algorithms. It will be particularly useful for courses aimed at advanced undergraduate or new postgraduate students of computer science and mathematics.

466 citations


01 Jan 1988
TL;DR: A survey of the growing body of theory concerned with parallel algorithms and the complexity of parallel computation, which considers the parallel random-access machine (PRAM), in which it is assumed that each processor has random access in unit time to any cell of a global memory.
Abstract: This paper is a survey of the growing body of theory concerned with parallel algorithms and the complexity of parallel computation. The principal model of computation that we consider is the parallel random-access machine (PRAM), in which it is assumed that each processor has random access in unit time to any cell of a global memory. This model permits the logical structure of parallel computation to be studied in a context divorced from issues of interprocessor communication. Section 2 surveys efficient parallel algorithms for bookkeeping operations such as compacting an array by squeezing out its "dead" elements, for evaluating algebraic expressions, for searching a graph and decomposing it into various kinds of components, and for sorting, merging and selection. These algorithms are typically completely different from the best sequential algorithms for the same problems, and their discovery has required the creation of a new set of paradigms for the construction of parallel algorithms. Section 3 studies the relationships among several variants of the PRAM model which differ in their implementation of concurrent reading and/or concurrent writing, presents lower bounds on the time to solve certain elementary problems on various kinds of PRAMs, and compares the PRAM with other models such as bounded-fan-in and unbounded-fan-in circuits, alternating Turing machines and vector machines. Section 3 also introduces NC, a hierarchy of problems solvable by deterministic algorithms that operate in polylog time using a polynomial-bounded number of processors. Section 4 discusses specific problems within NC. Among the problems shown to lie at low levels within this hierarchy are the basic arithmetic operations, transitive closure and Boolean matrix multiplication, the computation of the determinant, the rank and inverse of a matrix, the evaluation of certain classes of straight-line programs and the construction of a maximal independent set of vertices in a graph. Section 4 also discusses the randomized version of NC, and gives fast randomized parallel algorithms for problems such as finding a maximum matching in a graph. Section 4 concludes by exhibiting several problems that are complete in the sequential complexity class P with respect to logspace reducibility, and hence unlikely to lie in NC.
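
For instance, the "dead"-element compaction mentioned in Section 2 reduces to an exclusive prefix sum; a sequential Python simulation of the PRAM pattern (our names) looks like this.

```python
def compact(a, alive):
    """PRAM-style array compaction: an exclusive prefix sum over the 0/1
    survival flags gives each live element its output slot, after which
    all writes are independent (concurrent on a PRAM)."""
    if not a:
        return []
    idx = [0] * len(a)
    for i in range(1, len(a)):      # a parallel scan does this in O(log n) depth
        idx[i] = idx[i - 1] + alive[i - 1]
    out = [None] * (idx[-1] + alive[-1])
    for i in range(len(a)):         # independent per-element writes
        if alive[i]:
            out[idx[i]] = a[i]
    return out

print(compact([7, 3, 9, 4], [1, 0, 1, 0]))   # [7, 9]
```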

383 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present efficient parallel algorithms for several basic problems in computational geometry: convex hulls, Voronoi diagrams, detecting line segment intersections, triangulating simple polygons, minimizing a circumscribing triangle, and recursive data-structures for three-dimensional queries.
Abstract: We present efficient parallel algorithms for several basic problems in computational geometry: convex hulls, Voronoi diagrams, detecting line segment intersections, triangulating simple polygons, minimizing a circumscribing triangle, and recursive data-structures for three-dimensional queries.

311 citations


Journal ArticleDOI
01 Jun 1988
TL;DR: A parallel algorithm for the rasterization of polygons is presented that is particularly well suited for 3D Z-buffered graphics implementations; the value of each edge function can be interpolated with hardware similar to that required to interpolate color and Z pixel values.
Abstract: A parallel algorithm for the rasterization of polygons is presented that is particularly well suited for 3D Z-buffered graphics implementations. The algorithm represents each edge of a polygon by a linear edge function that has a value greater than zero on one side of the edge and less than zero on the opposite side. The value of the function can be interpolated with hardware similar to that required to interpolate color and Z pixel values. In addition, the edge functions of adjacent pixels may be easily computed in parallel. The coefficients of the edge function can be computed from floating-point endpoints in such a way that sub-pixel precision of the endpoints is retained elegantly.
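
A minimal point-sampling sketch of the edge-function test (our names; a real implementation would evaluate the functions incrementally and in parallel, with the sub-pixel-exact coefficient setup the paper describes; non-negative vertex coordinates are assumed here).

```python
def edge(ax, ay, bx, by, px, py):
    # linear edge function: > 0 on one side of the directed edge a->b, < 0 on the other
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax)

def rasterize_triangle(v0, v1, v2):
    """Sign-test rasterizer: a pixel is covered when all three edge
    functions agree in sign at its centre.  Each pixel's test is
    independent, which is what makes the method amenable to SIMD/parallel
    evaluation."""
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    covered = []
    for y in range(int(min(ys)), int(max(ys)) + 1):      # bounding-box scan
        for x in range(int(min(xs)), int(max(xs)) + 1):
            e0 = edge(*v0, *v1, x + 0.5, y + 0.5)
            e1 = edge(*v1, *v2, x + 0.5, y + 0.5)
            e2 = edge(*v2, *v0, x + 0.5, y + 0.5)
            if (e0 >= 0 and e1 >= 0 and e2 >= 0) or \
               (e0 <= 0 and e1 <= 0 and e2 <= 0):
                covered.append((x, y))
    return covered

print(len(rasterize_triangle((0.0, 0.0), (8.0, 0.0), (0.0, 8.0))))  # 36 pixel centres
```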

259 citations


Journal ArticleDOI
24 Oct 1988
TL;DR: A parallel algorithm for the Δ+1 vertex coloring problem with running time O(log³ n log log n) using a linear number of processors on a concurrent-read-concurrent-write parallel random-access machine.
Abstract: Some general techniques are developed for removing randomness from randomized NC algorithms without a blowup in the number of processors. One of the requirements for the application of these techniques is that the analysis of the randomized algorithm uses only pairwise independence. The main new result is a parallel algorithm for the Δ+1 vertex coloring problem with running time O(log³ n log log n) using a linear number of processors on a concurrent-read-concurrent-write parallel random-access machine. The techniques also apply to several other problems, including the maximal-independent-set problem and the maximal-matching problem. The application of the general technique to these last two problems is mostly of academic interest, because NC algorithms using a linear number of processors that have better running times have been previously found.
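
The derandomization idea can be illustrated on a toy problem: when an analysis needs only pairwise independence, the seed space of the family h(x) = ((ax + b) mod p) mod 2 has just p² points and can be searched exhaustively. The sketch below does this for max-cut rather than the paper's coloring algorithm, and the mod-2 step introduces a small bias since p is odd.

```python
def derandomized_max_cut(n, edges, p=31):
    """Exhaustive search of the pairwise-independent seed space (a, b),
    with side(x) = ((a*x + b) % p) & 1.  Under pairwise independence each
    edge is cut with probability ~1/2, so some of the p*p seeds attains
    roughly m/2.  p is any prime >= n (31 is an assumption for small n)."""
    best_cut, best_side = -1, None
    for a in range(p):
        for b in range(p):
            side = [((a * x + b) % p) & 1 for x in range(n)]
            cut = sum(side[u] != side[v] for u, v in edges)
            if cut > best_cut:
                best_cut, best_side = cut, side
    return best_cut, best_side

print(derandomized_max_cut(4, [(0, 1), (1, 2), (2, 3), (3, 0)])[0])   # 4
```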

210 citations


Proceedings Article
01 Jan 1988
TL;DR: It would be very interesting if the authors could combine stages (3) and (4) into a single step whereby the performance of the algorithm is measured as the makespan of the schedule (elapsed time for computing the last result).
Abstract: Harnessing the massively parallel architectures soon to become available into efficient algorithmic cooperation is one of the most important intellectual challenges facing Computer Science today. To the theoretician, the task seems similar to that of understanding the issues involved in the performance of sequential algorithms (which motivated Knuth's books, among other important works), only infinitely more complex. In sequential computation, the design process involves (a) choosing an algorithm and (b) analyzing it (mostly, counting its steps). In the parallel context, however, we have at least four stages: (1) Choose the algorithm (say, a directed acyclic graph (dag) indicating the elementary computations and their interdependence, a model in which evaluation of sequential performance is trivial). (2) Choose a particular multiprocessor architecture. (3) Find a schedule whereby the algorithm is executed on the processors (so that all necessary data are available at the appropriate processor at the time of each computation). (4) Only now can we talk about the performance of the algorithm, measured as the makespan of the schedule (elapsed time for computing the last result). In our opinion, it is this multi-layered nature of the problem that lies at the heart of the difficulties encountered in the development of the necessary ideas, principles, and tools for the design of parallel algorithms. Is there a way to shortcut the process, thus improving our chances of finally gaining some insight into parallel algorithms? It would be very interesting if we could combine stages (3) and (4) into a single step whereby the performance of the algorithm chosen …
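
Stage (3) and the makespan of stage (4) can be made concrete in a simplified setting: a greedy unit-time list schedule of a dag on p identical processors, ignoring communication (a sketch with our names, not the authors' proposal).

```python
def makespan(tasks, deps, p):
    """Greedy list schedule of unit-time tasks on p processors.
    tasks: task ids; deps: dict task -> set of predecessor tasks."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    succ = {t: [] for t in tasks}
    for t, preds in deps.items():
        for q in preds:
            succ[q].append(t)
    ready = [t for t in tasks if indeg[t] == 0]
    time = 0
    while ready:
        running, ready = ready[:p], ready[p:]   # at most p tasks per step
        time += 1
        for t in running:                       # release the successors
            for u in succ[t]:
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.append(u)
    return time

# diamond dag a -> {b, c} -> d: 4 steps on 1 processor, 3 steps on 2
print(makespan(['a', 'b', 'c', 'd'],
               {'b': {'a'}, 'c': {'a'}, 'd': {'b', 'c'}}, 2))   # 3
```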

176 citations


Journal ArticleDOI
TL;DR: Several parallel algorithms are presented for solving triangular systems of linear equations on distributed-memory multiprocessors and new wavefront algorithms are developed for both row-oriented and column-oriented matrix storage.
Abstract: Several parallel algorithms are presented for solving triangular systems of linear equations on distributed-memory multiprocessors. New wavefront algorithms are developed for both row-oriented and column-oriented matrix storage. Performance of the new algorithms and several previously proposed algorithms is analyzed theoretically and illustrated empirically using implementations on commercially available hypercube multiprocessors.
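
The sequential kernel these wavefront algorithms parallelize is plain forward substitution; a row-oriented baseline (our names) is shown below, with the cross-iteration dependence that the pipelined distributed versions overlap.

```python
def forward_substitution(L, b):
    """Row-oriented solve of Lx = b for lower-triangular L.  Each x[i]
    depends on all earlier x[j]; the paper's wavefront algorithms pipeline
    exactly this dependence across processors holding rows or columns of L."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))   # inner product with row i
        x[i] = (b[i] - s) / L[i][i]
    return x

print(forward_substitution([[2.0, 0.0], [1.0, 4.0]], [2.0, 9.0]))  # [1.0, 2.0]
```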

160 citations


Proceedings Article
21 Aug 1988
TL;DR: This paper presents many different parallel formulations of the A*/Branch-and-Bound search algorithm and identifies problem characteristics that make certain formulations more (or less) suitable for some search problems.
Abstract: This paper presents many different parallel formulations of the A*/Branch-and-Bound search algorithm. The parallel formulations primarily differ in the data structures used. Some formulations are suited only for shared-memory architectures, whereas others are suited for distributed-memory architectures as well. These parallel formulations have been implemented to solve the vertex cover problem and the traveling salesman problem (TSP) on the BBN Butterfly parallel processor. Using appropriate data structures, we are able to obtain fairly linear speedups for as many as 100 processors. We also discovered problem characteristics that make certain formulations more (or less) suitable for some search problems. Since the best-first search paradigm of A*/Branch-and-Bound is very commonly used, we expect these parallel formulations to be effective for a variety of problems. Concurrent and distributed priority queues used in these parallel formulations can be used in many parallel algorithms other than parallel A*/branch-and-bound.
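
A sequential sketch of the best-first branch-and-bound paradigm applied to vertex cover, using one standard formulation (branch on an uncovered edge, bound via a greedy maximal matching); this is not necessarily the exact formulation benchmarked in the paper.

```python
import heapq, itertools

def min_vertex_cover(edges):
    """Best-first branch and bound for minimum vertex cover: branch on an
    uncovered edge (u, v), taking u in one child and v in the other."""
    def bound(cover, rem):
        used, m = set(), 0
        for a, b in rem:                 # greedy maximal matching on the rest
            if a not in used and b not in used:
                used.update((a, b)); m += 1
        return len(cover) + m            # matching size is a valid lower bound
    tick = itertools.count()             # tie-breaker for the heap
    heap = [(bound(set(), edges), next(tick), frozenset(), tuple(edges))]
    while heap:
        _, _, cover, rem = heapq.heappop(heap)
        if not rem:                      # first completed node is optimal
            return set(cover)
        u, v = rem[0]
        for w in (u, v):
            c2 = cover | {w}
            r2 = tuple(e for e in rem if w not in e)
            heapq.heappush(heap, (bound(c2, r2), next(tick), c2, r2))

# triangle plus a pendant edge (hypothetical instance): optimum has size 2
print(min_vertex_cover([(0, 1), (1, 2), (0, 2), (2, 3)]))   # e.g. {0, 2}
```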

Book
01 Jan 1988
TL;DR: Part 1, Fundamentals of parallel computation: general principles of parallel computing, parallel techniques and algorithms, parallel sorting algorithms, and future trends in algorithm development.
Abstract: Part 1, Fundamentals of parallel computation: general principles of parallel computing; parallel techniques and algorithms; parallel sorting algorithms. Part 2, Numerical linear algebra: solution of a system of linear algebraic equations; the symmetric eigenvalue problem (Jacobi method); QR factorization; singular-value decomposition and related problems; future trends in algorithm development.

Journal ArticleDOI
TL;DR: A new contour generating serial algorithm is faster and more efficient than conventional contour tracing and parallel algorithms.
Abstract: A new contour generating serial algorithm is faster and more efficient than conventional contour tracing and parallel algorithms.

Book ChapterDOI
28 Jun 1988
TL;DR: This paper describes a parallel algorithm for list ranking that matches the performance of the Cole-Vishkin [CV86a] algorithm while remaining simple, with reasonable constant factors.
Abstract: In this paper we describe a simple parallel algorithm for list ranking. The algorithm is deterministic and runs in O(log n) time on an EREW PRAM with n/log n processors. The algorithm matches the performance of the Cole-Vishkin [CV86a] algorithm but is simpler and has reasonable constant factors.
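
The underlying idea can be seen in Wyllie's basic pointer-jumping scheme, the O(n log n)-work baseline of which this paper's algorithm is the work-optimal refinement (sketch and names ours).

```python
def list_rank(succ):
    """Wyllie-style pointer jumping.  succ[i] is the next node in the
    list, with succ[tail] == tail; returns each node's distance to the
    tail.  Every round is one PRAM step; ceil(log2 n) rounds suffice."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(max(1, (n - 1).bit_length())):
        # both arrays are rebuilt from the old ones: a synchronous step
        rank, succ = ([rank[i] + rank[succ[i]] for i in range(n)],
                      [succ[succ[i]] for i in range(n)])
    return rank

print(list_rank([1, 2, 3, 3]))   # [3, 2, 1, 0]
```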

Journal ArticleDOI
01 Mar 1988
TL;DR: It is shown that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDMᵀ, LDLᵀ, Doolittle and Cholesky algorithms, can be classified into four task graph models.
Abstract: This paper introduces a graph-theoretic approach to analyse the performances of several parallel Gaussian-like triangularization algorithms on an MIMD computer. We show that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDMᵀ, LDLᵀ, Doolittle and Cholesky algorithms, can be classified into four task graph models. We derive new complexity results and compare the asymptotic performances of these parallel versions.
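
For concreteness, here is Doolittle LU without pivoting in the kij ("SAXPY"-style) loop ordering; permuting the three loops yields the GAXPY and DOT variants, which perform the same arithmetic but induce different task graphs (our sketch, not the paper's notation).

```python
def lu_kij(A):
    """Doolittle LU without pivoting, kij ordering: step k applies a
    rank-1 elimination to the trailing rows, one saxpy per row."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                  # multiplier l_ik
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]    # saxpy update of row i
    return A    # L strictly below the diagonal, U on and above it
```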

Journal ArticleDOI
TL;DR: A modified version of the fast parallel thinning algorithm proposed by Zhang and Suen is presented; it preserves the original algorithm's merits, such as immunity to contour noise and good performance in thinning crossing lines, while overcoming its shortcomings.
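
For reference, a sketch of one subiteration of the original Zhang-Suen scheme that the modification builds on; the paper's specific changes are not reproduced here.

```python
def zhang_suen_pass(img, step):
    """One subiteration of the classic Zhang-Suen thinning scheme.
    img: 2D list of 0/1 pixels, with an all-zero border assumed."""
    to_delete = []
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            if not img[y][x]:
                continue
            # neighbours P2..P9, clockwise from north
            p = [img[y-1][x], img[y-1][x+1], img[y][x+1], img[y+1][x+1],
                 img[y+1][x], img[y+1][x-1], img[y][x-1], img[y-1][x-1]]
            b = sum(p)                                       # nonzero neighbours
            a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
            if step == 1:   # P2*P4*P6 == 0 and P4*P6*P8 == 0
                ok = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
            else:           # P2*P4*P8 == 0 and P2*P6*P8 == 0
                ok = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
            if 2 <= b <= 6 and a == 1 and ok:
                to_delete.append((y, x))
    for y, x in to_delete:
        img[y][x] = 0
    return bool(to_delete)

def thin(img):
    # alternate the two subiterations until neither deletes a pixel
    while zhang_suen_pass(img, 1) | zhang_suen_pass(img, 2):
        pass
    return img
```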

Journal ArticleDOI
01 Apr 1988
TL;DR: Lower and upper bounds on the deterministic and randomized complexity of parallel search algorithms are derived to establish that randomized parallel algorithms are much more powerful than deterministic ones, and to show that even randomized algorithms cannot make effective use of extremely large numbers of processors.
Abstract: This paper studies parallel search algorithms within the framework of independence systems. It is motivated by earlier work on parallel algorithms for concrete problems such as the determination of a maximal independent set of vertices or a maximum matching in a graph, and by the general question of determining the parallel complexity of a search problem when an oracle is available to solve the associated decision problem. Our results provide a parallel analogue of the self-reducibility process that is so useful in sequential computation. An abstract independence system is specified by a ground set E and a family of subsets of E called the independent sets; it is required that every subset of an independent set be independent. We investigate parallel algorithms for determining a maximal independent set through oracle queries of the form "Is the set A independent?", as well as parallel algorithms for determining a maximum independent set through queries to a more powerful oracle called a rank oracle. We also study these problems for three special types of independence systems: matroids, graphic matroids and partition matroids. We derive lower and upper bounds on the deterministic and randomized complexity of these problems. These bounds are sharp enough to give a clear picture of the processor-time trade-offs that are possible, to establish that randomized parallel algorithms are much more powerful than deterministic ones, and to show that even randomized algorithms cannot make effective use of extremely large numbers of processors.
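
The sequential self-reducibility baseline costs one oracle query per ground-set element; the paper asks how far this chain of adaptive queries can be compressed in parallel. A sketch with a hypothetical partition-matroid oracle:

```python
def maximal_independent_set(ground, is_independent):
    """Greedy sequential baseline: grow S one oracle query at a time.
    is_independent(S) answers the decision query "Is the set S independent?"."""
    S = set()
    for e in ground:
        if is_independent(S | {e}):
            S.add(e)
    return S

# hypothetical partition matroid: independent = at most one element per block
block = {1: "a", 2: "a", 3: "b", 4: "b"}
indep = lambda S: len({block[e] for e in S}) == len(S)
print(maximal_independent_set([1, 2, 3, 4], indep))   # {1, 3}
```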

Patent
24 Nov 1988
TL;DR: In this paper, a parallel algorithm is presented for rendering an important graphic primitive: a smoothly shaded, three-dimensional color triangle with anti-aliased edges.
Abstract: SIMD computer architecture is used in conjunction with a host processor and coordinate processor to render quality, three-dimensional, anti-aliased shaded color images into the frame buffer of a video display system. The method includes a parallel algorithm for rendering an important graphic primitive: a smoothly shaded color three-dimensional triangle with anti-aliased edges. By taking advantage of the SIMD architecture and said parallel algorithm, the very time-consuming pixel-by-pixel computations are broken down for parallel execution. A single coordinate processor computes and transmits an overall triangle record which is essentially the same for all blocks of pixels within a given bounding box, which in turn surrounds each triangle. The individual pixel data is produced by a group of M×N pixel processors and stored in the frame buffer in a series of repetitive steps wherein each step corresponds to the processing of an M×N block of pixels within the bounding box of the triangle. Thus, each pixel processor performs the same operation, modifying its computations in accordance with triangle data received from the coordinate processor and positional data unique to its own sequential connectivity to the frame buffer, thus allowing parallel access to the frame buffer.

Journal ArticleDOI
TL;DR: An off-road vehicle with a suspension system that has eight closed loops is used to illustrate the parallel processor algorithm and to investigate parallel processing speed-up and overhead.
Abstract: A high speed dynamic simulation algorithm that exploits emerging parallel processor computer technology is presented. Medium grain parallelism is defined by the graph structure of a mechanism and the recursive algorithm derived in parts I and II of this paper, for both open and closed loop systems. An off-road vehicle with a suspension system that has eight closed loops is used to illustrate the parallel processor algorithm. A shared memory multiprocessor is used to implement the algorithm and to investigate parallel processing speed-up and overhead. Real-time simulation of a ground vehicle is demonstrated.

Journal ArticleDOI
01 Sep 1988
TL;DR: A number of algorithmic tools that have been found useful in the construction of parallel algorithms are described; among these are prefix computation, ranking, Euler tours, ear decomposition, and matrix calculations.
Abstract: We have described a number of algorithmic tools that have been found useful in the construction of parallel algorithms; among these are prefix computation, ranking, Euler tours, ear decomposition, and matrix calculations. We have also described some of the applications of these tools, and listed many other applications. These algorithms seem likely to be useful not only in their own right, but also as examples of ways to break up other problems into parts suitable for parallel solution.
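
As an example of these tools, the Euler-tour technique links the directed edges of a tree into a single circuit; each successor pointer is computable independently, hence in constant parallel time. A sequential sketch (our names):

```python
def euler_tour(adj, root):
    """Euler tour of a tree from circular adjacency lists: the successor
    of directed edge (u, v) is (v, w), where w follows u in v's list.
    Every successor is independent of the others, so a PRAM computes
    them all in O(1) time."""
    pos = {(v, u): i for v in adj for i, u in enumerate(adj[v])}
    succ = {(u, v): (v, adj[v][(pos[(v, u)] + 1) % len(adj[v])])
            for u in adj for v in adj[u]}
    tour, e = [], (root, adj[root][0])
    for _ in range(sum(len(a) for a in adj.values())):  # 2(n-1) directed edges
        tour.append(e)
        e = succ[e]
    return tour

# star with center 0 (hypothetical tree), adjacency lists in circular order
adj = {0: [1, 2], 1: [0], 2: [0]}
print(euler_tour(adj, 0))   # [(0, 1), (1, 0), (0, 2), (2, 0)]
```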

Proceedings ArticleDOI
06 Jan 1988
TL;DR: This paper presents an algorithm for hidden surface removal for a class of polyhedral surfaces which have the property that they can be ordered relatively quickly, like terrain maps, and also presents a parallel algorithm based on a similar approach.
Abstract: In this paper we present an algorithm for hidden surface removal for a class of polyhedral surfaces which have the property that they can be ordered relatively quickly, like terrain maps. A distinguishing feature of this algorithm is that its running time is sensitive to the actual size of the visible image rather than the total number of intersections in the image plane, which can be much larger than the visible image. The time complexity of this algorithm is O((k + n) log n log log n), where n and k are respectively the input and the output sizes. Thus, in a significant number of situations this will be faster than the worst-case optimal algorithms, which have running time O(n²) irrespective of the output size (whereas the output size k is O(n²) only in the worst case). We also present a parallel algorithm based on a similar approach which runs in time O(log⁴(n + k)) using O((n + k)/log(n + k)) processors in a CREW PRAM model. All our bounds are obtained using amortized analysis.

Journal ArticleDOI
TL;DR: A new parallel algorithm is given to evaluate a straight-line program over a commutative semi-ring R of degree d and size n in O(log n (log nd)) time using M(n) processors.
Abstract: A new parallel algorithm is given to evaluate a straight-line program. The algorithm evaluates a program over a commutative semi-ring R of degree d and size n in time O(log n (log nd)) using M(n) processors, where M(n) is the number of processors required for multiplying n×n matrices over the semi-ring R in O(log n) time.
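
Sequentially, evaluating such a program is a single pass over its instructions; the paper's contribution is doing it in polylog parallel time. A toy evaluator, with our own encoding of instructions:

```python
def evaluate(program, inputs):
    """One sequential pass over a straight-line program.  Each instruction
    is (name, (op, left, right)) with op in {'+', '*'} and operands naming
    inputs or earlier results (our encoding, not the paper's)."""
    vals = dict(inputs)
    for name, (op, l, r) in program:
        vals[name] = vals[l] + vals[r] if op == '+' else vals[l] * vals[r]
    return vals

# (x + y) * x : a size-2, degree-2 program over the usual (+, *) semiring
prog = [("t1", ('+', 'x', 'y')), ("t2", ('*', 't1', 'x'))]
print(evaluate(prog, {'x': 3, 'y': 4})["t2"])   # 21
```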

Journal ArticleDOI
Richard Cole
TL;DR: An optimally efficient parallel algorithm for selection on the EREW PRAM that requires a linear number of operations and O(log n log*n/log log n) time is given.

Book ChapterDOI
01 Jan 1988
TL;DR: A deterministic parallel algorithm for parallel tree contraction that is optimal in the sense that the product P · T is equal to the input size and gives an O(log n) time algorithm when P = n/log n.
Abstract: A deterministic parallel algorithm for parallel tree contraction is presented in this paper. The algorithm takes T = O(n/P) time and uses P (P ≤ n/log n) processors, where n is the number of vertices in the tree, on an Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM). This algorithm improves the results of Miller and Reif [MR85,MR87], who use the CRCW randomized PRAM model to get the same complexity and processor count. The algorithm is optimal in the sense that the product P · T is equal to the input size, and it gives an O(log n) time algorithm when P = n/log n. Since the algorithm requires O(n) space, which is the input size, it is optimal in space as well. Techniques for prudent parallel tree contraction are also discussed, as well as implementation techniques for fixed-connection machines.

Journal ArticleDOI
TL;DR: It is concluded that, in the absence of loop-unrolling, LU factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting.
Abstract: In this paper, we consider the effect that the data-storage scheme and pivoting scheme have on the efficiency of LU factorization on a distributed-memory multiprocessor. Our presentation will focus on the hypercube architecture, but most of our results are applicable to distributed-memory architectures in general. We restrict our attention to two commonly used storage schemes (storage by rows and by columns) and investigate partial pivoting both by rows and by columns, yielding four factorization algorithms. Our goal is to determine which of these four algorithms admits the most efficient parallel implementation. We analyze factors such as load distribution, pivoting cost, and potential for pipelining. We conclude that, in the absence of loop-unrolling, LU factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting. The two schemes that can be pipelined are pivoting by interchanging rows when the coefficient matrix is distributed to the processors by columns, and pivoting by interchanging columns when the matrix is distributed to the processors by rows.
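
The sequential kernel that all four distributed variants implement is LU factorization with partial pivoting; a row-pivoting baseline (names ours):

```python
def lu_partial_pivoting(A):
    """In-place LU with partial pivoting by rows: at step k, search the
    pivot column, interchange rows, then eliminate the trailing rows."""
    n = len(A)
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))   # pivot search
        A[k], A[p] = A[p], A[k]                            # row interchange
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, piv

LU, piv = lu_partial_pivoting([[0.0, 2.0], [1.0, 1.0]])
print(LU, piv)   # [[1.0, 1.0], [0.0, 2.0]] with piv = [1, 0]
```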

Journal ArticleDOI
TL;DR: This work considers solving triangular systems of linear equations on a distributed-memory multiprocessor which allows for a ring embedding and proposes a parallel algorithm, applicable when the triangular matrix is distributed by column in a wrap fashion.
Abstract: We consider solving triangular systems of linear equations on a distributed-memory multiprocessor which allows for a ring embedding. Specifically, we propose a parallel algorithm, applicable when the triangular matrix is distributed by column in a wrap fashion. Numerical experiments indicate that the new algorithm is very efficient in some circumstances (in particular, when the size of the problem is sufficiently large relative to the number of processors).A theoretical analysis confirms that the total running time varies linearly, with respect to the matrix order, up to a threshold value of the matrix order, after which the dependence is quadratic. Moreover, we show that total message traffic is essentially the minimum possible.Finally, we describe an analogous row-oriented algorithm.

Journal ArticleDOI
TL;DR: Experimental results for sorting integers, two-dimensional fast Fourier transforms (FFT), and constraint-satisfied searching are presented, illustrating the power of the SMP cluster programming methodology.

Journal ArticleDOI
TL;DR: A comparison of recently proposed parallel text search methods to alternative available search strategies that use serial processing machines suggests parallel methods do not provide large-scale gains in either retrieval effectiveness or efficiency.
Abstract: A comparison of recently proposed parallel text search methods to alternative available search strategies that use serial processing machines suggests parallel methods do not provide large-scale gains in either retrieval effectiveness or efficiency.

Journal ArticleDOI
TL;DR: In this paper, iterative algorithms based on the conjugate gradient method are developed for hypercubes designed for coarse-grained parallelism, and the communication requirements of different schemes for mapping finite-element meshes onto the processors of a hypercube are analyzed with respect to the effect of communication parameters of the architecture.
Abstract: Finite-element discretization produces linear equations in the form Ax=b, where A is large, sparse, and banded with proper ordering of the variables x. The solution of such equations on distributed-memory message-passing multiprocessors implementing the hypercube topology is addressed. Iterative algorithms based on the conjugate gradient method are developed for hypercubes designed for coarse-grained parallelism. The communication requirements of different schemes for mapping finite-element meshes onto the processors of a hypercube are analyzed with respect to the effect of communication parameters of the architecture. Experimental results for a 16-node Intel 80386-based iPSC/2 hypercube are presented and discussed.
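
For reference, a textbook conjugate gradient iteration; on a hypercube, the matrix-vector product and the two inner products are the steps whose communication cost the paper analyzes (sketch and names ours; float inputs assumed).

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Plain CG for symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p                  # distributed mat-vec in the paper's setting
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r              # global reduction across processors
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))   # ~[0.0909, 0.6364]
```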

Journal ArticleDOI
TL;DR: The proposed parallel algorithm has attractive convergence properties and can be specialized to parallel algorithms for tackling definite quadratic programs, linear programs, systems of linear equations, and systems of generalized nonlinear inequalities.
Abstract: A parallel algorithm is proposed in this paper for solving the problem min{ q(x) : x ∈ C₁ ∩ ⋯ ∩ Cₘ }, where q is a uniformly convex function and the Cᵢ are closed convex sets in Rⁿ. In each iteration of the method, we solve in parallel m independent subproblems, each minimizing a definite quadratic function over an individual set Cᵢ. The method has attractive convergence properties and can be implemented as parallel algorithms for tackling definite quadratic programs, linear programs, systems of linear equations and systems of generalized nonlinear inequalities.
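
A simplified stand-in for the method's structure: when q(x) = ‖x − a‖², each of the m independent subproblems reduces to a Euclidean projection, and one natural way to combine them is averaging. This averaged-projection sketch is not the paper's exact update, and the half-spaces below are hypothetical.

```python
import numpy as np

def parallel_projection_step(x, projections):
    """One iteration: the m projections are independent, so they could run
    on m processors; their results are then combined by averaging."""
    return sum(P(x) for P in projections) / len(projections)

# two half-spaces in R^2 (hypothetical example): x0 >= 1 and x1 >= 2
P1 = lambda x: np.array([max(x[0], 1.0), x[1]])
P2 = lambda x: np.array([x[0], max(x[1], 2.0)])
x = np.zeros(2)
for _ in range(50):
    x = parallel_projection_step(x, [P1, P2])
print(x)   # approaches a point in the intersection, near [1, 2]
```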