
Showing papers on "Parallel algorithm published in 1994"


Journal ArticleDOI
TL;DR: A low-complexity heuristic for scheduling parallel tasks on an unbounded number of completely connected processors, named the dominant sequence clustering algorithm (DSC), which guarantees a performance within a factor of 2 of the optimum for general coarse-grain DAG's.
Abstract: We present a low-complexity heuristic, named the dominant sequence clustering algorithm (DSC), for scheduling parallel tasks on an unbounded number of completely connected processors. The performance of DSC is, on average, comparable to or even better than that of other higher-complexity algorithms. We assume no task duplication and nonzero communication overhead between processors. Finding the optimum solution for arbitrary directed acyclic task graphs (DAG's) is NP-complete. DSC finds optimal schedules for special classes of DAG's, such as fork, join, coarse-grain trees, and some fine-grain trees. It guarantees a performance within a factor of 2 of the optimum for general coarse-grain DAG's. We compare DSC with three higher-complexity general scheduling algorithms: the ETF by J.J. Hwang, Y.C. Chow, F.D. Anger, and C.Y. Lee (1989); V. Sarkar's (1989) clustering algorithm; and the MD by M.Y. Wu and D. Gajski (1990). We also give a sample of important practical applications where DSC has been found useful.
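
To make the idea of a dominant sequence concrete, here is a small Python sketch (not the DSC heuristic itself) that computes the length of the longest path through a clustered task DAG when intra-cluster communication is zeroed out. It ignores serialization of tasks placed in the same cluster, and the task graph, costs, and clusterings are made-up examples.

    # Sketch: length of the "dominant sequence" (critical path including
    # communication delays) of a clustered task DAG -- the quantity DSC
    # repeatedly tries to shorten by merging tasks into clusters.
    from collections import defaultdict

    def dominant_sequence_length(tasks, edges, cluster):
        """tasks: {task: compute cost}; edges: {(u, v): communication cost};
        cluster: {task: cluster id}.  Edges inside a cluster cost nothing."""
        preds = defaultdict(list)
        for (u, v) in edges:
            preds[v].append(u)
        memo = {}

        def finish_time(v):                      # longest path ending at v
            if v not in memo:
                start = 0
                for u in preds[v]:
                    comm = 0 if cluster[u] == cluster[v] else edges[(u, v)]
                    start = max(start, finish_time(u) + comm)
                memo[v] = start + tasks[v]
            return memo[v]

        return max(finish_time(v) for v in tasks)

    # Fork-join example: a -> {b, c} -> d.  Putting a, c and d in one cluster
    # zeroes the heavy a->c and c->d messages and shortens the dominant sequence.
    tasks = {"a": 2, "b": 3, "c": 4, "d": 1}
    edges = {("a", "b"): 5, ("a", "c"): 5, ("b", "d"): 2, ("c", "d"): 2}
    print(dominant_sequence_length(tasks, edges, {"a": 0, "b": 1, "c": 2, "d": 3}))  # 14
    print(dominant_sequence_length(tasks, edges, {"a": 0, "b": 1, "c": 0, "d": 0}))  # 13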

694 citations


Journal ArticleDOI
TL;DR: A classification scheme is described that is based on where the sort from object coordinates to screen coordinates occurs, which it is believed is fundamental whenever both geometry processing and rasterization are performed in parallel.
Abstract: We describe a classification scheme that we believe provides a more structured framework for reasoning about parallel rendering. The scheme is based on where the sort from object coordinates to screen coordinates occurs, which we believe is fundamental whenever both geometry processing and rasterization are performed in parallel. This classification scheme supports the analysis of computational and communication costs, and encompasses the bulk of current and proposed highly parallel renderers - both hardware and software. We begin by reviewing the standard feed-forward rendering pipeline, showing how different ways of parallelizing it lead to three classes of rendering algorithms. Next, we consider each of these classes in detail, analyzing their aggregate processing and communication costs, possible variations, and constraints they may impose on rendering applications. Finally, we use these analyses to compare the classes and identify when each is likely to be preferable.

612 citations


Proceedings ArticleDOI
14 Dec 1994
TL;DR: Serial and parallel one-pass algorithms are presented for solving the system of equations that arises from discretizing the Hamilton-Jacobi equation associated with a trajectory optimization problem with positive running cost.
Abstract: Presents serial and parallel algorithms for solving a system of equations that arises from the discretization of the Hamilton-Jacobi equation associated with a trajectory optimization problem of the following type. A vehicle starts at a prespecified point x_0 and follows a unit speed trajectory x(t) inside a region in R^m, until an unspecified time T at which the region is exited. A trajectory minimising a cost function of the form ∫_0^T r(x(t)) dt + q(x(T)) is sought. The discretized Hamilton-Jacobi equation corresponding to this problem is usually solved using iterative methods. Nevertheless, assuming that the function r is positive, one is able to exploit the problem structure and develop one-pass algorithms for the discretized problem. The first algorithm resembles Dijkstra's shortest path algorithm and runs in time O(n log n), where n is the number of grid points. The second algorithm uses a somewhat different discretization and borrows some ideas from Dial's shortest path algorithm; it runs in time O(n), which is the best possible, under some fairly mild assumptions. Finally, the author shows that the latter algorithm can be efficiently parallelized: for two-dimensional problems and with p processors, its running time becomes O(n/p), provided that p = O(√n / log n).
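
As an illustration of the Dijkstra-like idea, the following Python sketch runs label-setting over a uniform grid with a positive running cost r; the 4-neighbour edge cost h*(r(u)+r(v))/2 is an assumed discretization chosen for the example, not the one used in the paper, and the binary heap gives the O(n log n) behaviour mentioned above.

    # Sketch: Dijkstra-style label-setting on a uniform grid with positive
    # running cost r.  The 4-neighbour edge cost h * (r(u) + r(v)) / 2 is an
    # assumed discretization for illustration, not the one used in the paper.
    import heapq

    def grid_value_function(r, h, goals):
        """r: 2D list of positive running costs; h: grid spacing;
        goals: cells (i, j) where the terminal cost is taken to be zero."""
        n, m = len(r), len(r[0])
        V = [[float("inf")] * m for _ in range(n)]
        heap = []
        for (i, j) in goals:
            V[i][j] = 0.0
            heapq.heappush(heap, (0.0, i, j))
        while heap:                                   # each cell settled once
            v, i, j = heapq.heappop(heap)
            if v > V[i][j]:
                continue
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    cand = v + h * (r[i][j] + r[a][b]) / 2.0
                    if cand < V[a][b]:
                        V[a][b] = cand
                        heapq.heappush(heap, (cand, a, b))
        return V

    # Uniform cost field: the value function is h times the Manhattan distance.
    r = [[1.0] * 5 for _ in range(5)]
    print(grid_value_function(r, h=0.1, goals=[(0, 0)])[4][4])  # ~0.8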

589 citations


Book
01 Mar 1994
TL;DR: A textbook covering PRAM algorithms, parallel architectures, parallel programming languages, mapping and scheduling, and parallel algorithms for matrix multiplication, the fast Fourier transform, linear systems, sorting, dictionary operations, graph problems, and combinatorial search.
Abstract: Contents: PRAM algorithms; processor arrays, multiprocessors and multicomputers; parallel programming languages; mapping and scheduling; elementary parallel algorithms; matrix multiplication; the fast Fourier transform; solving linear systems; sorting; dictionary operations; graph algorithms; combinatorial search. Appendices: graph theoretic terminology; review of complex numbers; parallel algorithm design strategies.

472 citations


Journal ArticleDOI
TL;DR: A fast parallel algorithm is given that provides good solutions to very large problems in a very short computation time and identifies a type of problem for which taboo search provides an optimal solution in a polynomial mean time in practice.
Abstract: We apply the global optimization technique called taboo search to the job shop scheduling problem and show that our method is typically more efficient than the shifting bottleneck procedure, and also more efficient than a recently proposed simulated annealing implementation. We also identify a type of problem for which taboo search provides an optimal solution in a polynomial mean time in practice, while an implementation of the shifting bottleneck procedure seems to take an exponential amount of computation time. Included are computational results that establish new best solutions for a number of benchmark problems from the literature. Finally, we give a fast parallel algorithm that provides good solutions to very large problems in a very short computation time.
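
The sketch below shows a generic taboo-search loop (swap neighbourhood, fixed-length taboo list, aspiration by best-so-far) on a toy permutation objective; the paper's job-shop moves, neighbourhood, and parallelization are not reproduced here, and all parameter values are illustrative.

    # Sketch: a generic taboo-search loop with a swap neighbourhood, a
    # fixed-length taboo list and a best-so-far aspiration rule.  The
    # objective is a toy stand-in for the job-shop makespan.
    import random

    def taboo_search(cost, n, iterations=200, tenure=7, samples=30, seed=0):
        rng = random.Random(seed)
        current = list(range(n))
        rng.shuffle(current)
        best, best_cost = current[:], cost(current)
        taboo = []                                    # recently applied swaps
        for _ in range(iterations):
            candidates = []
            for _ in range(samples):                  # sample the neighbourhood
                i, j = rng.sample(range(n), 2)
                move = (min(i, j), max(i, j))
                neigh = current[:]
                neigh[i], neigh[j] = neigh[j], neigh[i]
                c = cost(neigh)
                # aspiration: a taboo move is allowed if it beats the best so far
                if move not in taboo or c < best_cost:
                    candidates.append((c, move, neigh))
            if not candidates:
                continue
            c, move, current = min(candidates, key=lambda t: t[0])
            taboo.append(move)
            if len(taboo) > tenure:
                taboo.pop(0)
            if c < best_cost:
                best, best_cost = current[:], c
        return best, best_cost

    # Toy objective: distance of the permutation from sorted order.
    print(taboo_search(lambda p: sum(abs(v - i) for i, v in enumerate(p)), n=12))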

347 citations


Journal ArticleDOI
TL;DR: Initial benchmark results of NESL show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.

329 citations


Journal ArticleDOI
TL;DR: A new characterization of branch-and-bound algorithms is given, which consists of isolating the performed operations without specifying any particular order for their execution.
Abstract: We present a detailed and up-to-date survey of the literature on parallel branch-and-bound algorithms. We synthesize previous work in this area and propose a new classification of parallel branch-and-bound algorithms. This classification is used to analyze the methods proposed in the literature. To facilitate our analysis, we give a new characterization of branch-and-bound algorithms, which consists of isolating the performed operations without specifying any particular order for their execution.
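
For readers unfamiliar with the operations this characterization isolates, here is a minimal serial branch-and-bound sketch on a 0/1 knapsack instance (selection from a pool of open subproblems, branching, bounding by the fractional relaxation, pruning against the incumbent); the parallel strategies surveyed differ mainly in how the pool and the incumbent are shared, which is not shown.

    # Sketch: the basic branch-and-bound operations -- select a subproblem,
    # branch, bound with the fractional relaxation, prune against the
    # incumbent -- on a 0/1 knapsack instance.
    import heapq

    def knapsack_bb(values, weights, capacity):
        order = sorted(range(len(values)),
                       key=lambda i: values[i] / weights[i], reverse=True)
        v = [values[i] for i in order]
        w = [weights[i] for i in order]

        def bound(k, val, cap):        # fractional relaxation over items k..end
            for i in range(k, len(v)):
                if w[i] <= cap:
                    cap -= w[i]
                    val += v[i]
                else:
                    return val + v[i] * cap / w[i]
            return val

        best = 0
        pool = [(-bound(0, 0, capacity), 0, 0, capacity)]    # best bound first
        while pool:
            neg_ub, k, val, cap = heapq.heappop(pool)
            if -neg_ub <= best or k == len(v):               # prune or leaf
                best = max(best, val)
                continue
            for take in (1, 0):                              # branch on item k
                if take and w[k] > cap:
                    continue
                nval, ncap = val + take * v[k], cap - take * w[k]
                best = max(best, nval)       # the partial selection is feasible
                ub = bound(k + 1, nval, ncap)
                if ub > best:
                    heapq.heappush(pool, (-ub, k + 1, nval, ncap))
        return best

    print(knapsack_bb([60, 100, 120], [10, 20, 30], 50))  # 220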

319 citations


Book
01 Apr 1994
TL;DR: A textbook on computability and computational complexity, covering the classes P and NP, optimization problems, space complexity, probabilistic algorithms, interactive proof systems, and models and algorithms for parallel computation.
Abstract: 1. Mathematical Preliminaries. 2. Elements of Computability Theory. 4. The Class P. 5. The Class NP. 6. The Complexity of Optimization Problems. 7. Beyond NP. 8. Space-Complexity Classes. 9. Probabilistic Algorithms and Complexity Classes. 10. Interactive Proof Systems. 11. Models of Parallel Computers. 12. Parallel Algorithms.

312 citations


Journal ArticleDOI
TL;DR: It is proved that prefix sums of n integers of at most b bits can be found on a COMMON CRCW PRAM with a linear time-processor product, and that the algorithm is optimally fast for any polynomial number of processors.
Abstract: We prove that prefix sums of n integers of at most b bits can be found on a COMMON CRCW PRAM with a linear time-processor product. The algorithm is optimally fast for any polynomial number of processors. This is a generalisation of a previous result, which was valid only for O(log n)-bit numbers. Application of this algorithm to an r-way parallel merge sort algorithm is also considered. We also consider a more realistic PRAM variant, in which the word size, m, may be smaller than b (m ≥ log n). On this model, prefix sums can still be found in optimal time.
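
The paper's exact time bounds did not survive extraction above, so as background only, here is the standard work-efficient prefix-sum (up-sweep/down-sweep) pattern written sequentially; it is not the b-bit CRCW PRAM algorithm of the paper, and each inner loop marks a step that would run in parallel.

    # Sketch: the standard work-efficient prefix-sum pattern (up-sweep /
    # down-sweep), written sequentially; each inner loop is one parallel step.
    def exclusive_scan(a):
        x = list(a)
        n = len(x)
        assert n and n & (n - 1) == 0, "power-of-two length for simplicity"
        d = 1
        while d < n:                       # up-sweep: build partial sums
            for i in range(0, n, 2 * d):
                x[i + 2 * d - 1] += x[i + d - 1]
            d *= 2
        x[n - 1] = 0
        d = n // 2
        while d >= 1:                      # down-sweep: distribute prefixes
            for i in range(0, n, 2 * d):
                t = x[i + d - 1]
                x[i + d - 1] = x[i + 2 * d - 1]
                x[i + 2 * d - 1] += t
            d //= 2
        return x

    print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]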

311 citations


Journal ArticleDOI
TL;DR: A parallel volume-rendering algorithm, which consists of two parts: parallel ray tracing and parallel compositing, which is particularly effective for massively parallel processing, as it always uses all processing units by repeatedly subdividing the partial images and distributing them to the appropriate processing units.
Abstract: We describe a parallel volume-rendering algorithm, which consists of two parts: parallel ray tracing and parallel compositing. In the most recent implementation on the Connection Machine CM-5 and networked workstations, the parallel volume renderer evenly distributes data to the computing resources available. Without the need to communicate with other processing units, each subvolume is ray traced locally and generates a partial image. The parallel compositing process then merges all resulting partial images in depth order to produce the complete image. The compositing algorithm is particularly effective for massively parallel processing, as it always uses all processing units by repeatedly subdividing the partial images and distributing them to the appropriate processing units. Test results on both the CM-5 and the workstations are promising. They do, however, expose different performance issues for each platform.
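
A minimal sketch of the compositing half of the pipeline, assuming premultiplied RGBA partial images merged front to back with the "over" operator; the paper's contribution is doing this merge so that all processors stay busy (by repeatedly splitting and exchanging image regions), which is only indicated in the comments.

    # Sketch: depth-ordered "over" compositing of partial RGBA images with
    # premultiplied colour.  In the parallel algorithm each processor would
    # composite only a shrinking sub-image and hand the rest to its partner,
    # which keeps every processor busy; the merge itself is shown serially.
    def over(front, back):
        fr, fg, fb, fa = front
        br, bg, bb, ba = back
        t = 1.0 - fa
        return (fr + t * br, fg + t * bg, fb + t * bb, fa + t * ba)

    def composite_images(partial_images):
        """partial_images: equally sized pixel lists, ordered front to back."""
        result = partial_images[0]
        for image in partial_images[1:]:
            result = [over(f, b) for f, b in zip(result, image)]
        return result

    # Two 1-pixel "images": half-transparent red in front of opaque blue.
    front = [(0.5, 0.0, 0.0, 0.5)]
    back = [(0.0, 0.0, 1.0, 1.0)]
    print(composite_images([front, back]))  # [(0.5, 0.0, 0.5, 1.0)]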

311 citations


Journal ArticleDOI
TL;DR: It is shown that optimality to within a multiplicative factor close to one can be achieved for the problems of Gauss-Jordan elimination and sorting, by transportable algorithms that can be applied for a wide range of values of the parameters p, g, and L.

Book
01 Jun 1994
TL;DR: In this paper, a tight lower bound of the VLSI layout area of the binary de Bruijn multiprocessor network (BDM) is derived; a procedure for an area-optimal VLSI layout is also described.
Abstract: It is shown that the binary de Bruijn multiprocessor network (BDM) can solve a wide variety of classes of problems. The BDM admits an N-node linear array, an N-node ring, (N-1)-node complete binary trees, ((3N/4)-2)-node tree machines, and an N-node one-step shuffle-exchange network, where N (= 2^k, k an integer) is the total number of nodes. The de Bruijn multiprocessor networks are proved to be fault-tolerant as well as extensible. A tight lower bound of the VLSI layout area of the BDM is derived; a procedure for an area-optimal VLSI layout is also described. It is demonstrated that the BDM is more versatile than the shuffle-exchange and the cube-connected cycles. Recent work has classified sorting architectures into (1) sequential input/sequential output, (2) parallel input/sequential output, (3) parallel input/parallel output, (4) sequential input/parallel output, and (5) hybrid input/hybrid output. It is demonstrated that the de Bruijn multiprocessor networks can sort data items in all of the abovementioned categories. No other network which can sort data items in all the categories is known.
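
The shift structure that underlies these embeddings is easy to state: in a binary de Bruijn network with N = 2^k nodes, node x is connected to the nodes obtained by shifting its k-bit label. A small illustrative sketch:

    # Sketch: neighbours in a binary de Bruijn network with N = 2**k nodes.
    # Node x connects to the nodes obtained by shifting its k-bit label left
    # and appending 0 or 1 -- the shuffle structure behind the embeddings.
    def de_bruijn_neighbours(x, k):
        n = 1 << k
        return [((x << 1) | b) % n for b in (0, 1)]

    k = 3
    for x in range(1 << k):
        print(format(x, "03b"), "->",
              [format(y, "03b") for y in de_bruijn_neighbours(x, k)])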

Journal ArticleDOI
TL;DR: This paper analyses the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible to estimate the size of total work at a given processor.

Proceedings ArticleDOI
26 Oct 1994
TL;DR: A new coarse-grained GA architecture, the Injection Island GA (iiGA), is proposed and the preliminary results of iiGA's show them to be a promising new approach to coarse-grain GA's.
Abstract: This paper describes a number of different coarse-grain GA's, including various migration strategies and connectivity schemes to address the premature convergence problem. These approaches are evaluated on a graph partitioning problem. Our experiments showed, first, that the sequential GA's used are not as effective as parallel GA's for this graph partition problem. Second, for coarse-grain GA's, the results indicate that using a large number of nodes and exchanging individuals asynchronously among them is very effective. Third, GA's that exchange solutions based on population similarity instead of a fixed connection topology get better results without any degradation in speed. Finally, we propose a new coarse-grained GA architecture, the Injection Island GA (iiGA). The preliminary results of iiGA's show them to be a promising new approach to coarse-grain GA's.

Journal ArticleDOI
TL;DR: The objectives of this paper are to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures.

Journal ArticleDOI
TL;DR: Theoretical results show that a large class of algorithm-machine combinations is scalable and the scalability can be predicted through premeasured machine parameters, and a harmony between speedup and scalability has been observed.
Abstract: Scalability has become an important consideration in parallel algorithm and machine designs. The word scalable, or scalability, has been widely and often used in the parallel processing community. However, there is no adequate, commonly accepted definition of scalability available. Scalabilities of computer systems and programs are difficult to quantify, evaluate, and compare. In this paper, scalability is formally defined for algorithm-machine combinations. A practical method is proposed to provide a quantitative measurement of the scalability. The relation between the newly proposed scalability and other existing parallel performance metrics is studied. A harmony between speedup and scalability has been observed. Theoretical results show that a large class of algorithm-machine combinations is scalable and the scalability can be predicted through premeasured machine parameters. Two algorithms have been studied on an nCUBE 2 multicomputer and on a MasPar MP-1 computer. These case studies have shown how scalabilities can be measured, computed, and predicted. Performance instrumentation and visualization tools also have been used and developed to understand the scalability-related behavior.
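
As a rough illustration only: one common way to formalize such a metric is an isospeed-style ratio, assumed here rather than taken verbatim from the paper, in which W is the work executed on p processors and W' is the work needed on p' processors to keep the average speed per processor unchanged.

    # Sketch (assumed formalization, not quoted from the paper): an
    # isospeed-style scalability number.  W is the work run on p processors;
    # W_prime is the work needed on p_prime processors to keep the average
    # speed per processor unchanged.  Ideal scaling gives 1.0.
    def isospeed_scalability(p, W, p_prime, W_prime):
        return (p_prime * W) / (p * W_prime)

    # Example: going from 64 to 256 processors needs 5x the work to hold speed.
    print(isospeed_scalability(64, 1.0e9, 256, 5.0e9))  # 0.8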

Journal ArticleDOI
TL;DR: This work considers how to formulate a parallel analytical molecular surface algorithm that has expected linear complexity with respect to the total number of atoms in a molecule, and aims to compute and display these surfaces at interactive rates, by taking advantage of advances in computational geometry.
Abstract: We consider how we set out to formulate a parallel analytical molecular surface algorithm that has expected linear complexity with respect to the total number of atoms in a molecule. To achieve this goal, we avoided computing the complete 3D regular triangulation over the entire set of atoms, a process that takes time O(n log n), where n is the number of atoms in the molecule. We aim to compute and display these surfaces at interactive rates, by taking advantage of advances in computational geometry, making further algorithmic improvements and parallelizing the computations.

Journal ArticleDOI
TL;DR: This article demonstrates that simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management on the best known visualization algorithms.
Abstract: Recently, a new class of scalable, shared-address-space multiprocessors has emerged. Like message-passing machines, these multiprocessors have a distributed interconnection network and physically distributed main memory. However, they provide hardware support for efficient implicit communication through a shared address space, and they automatically exploit temporal locality by caching both local and remote data in a processor's hardware cache. In this article, we show that these architectural characteristics make it much easier to obtain very good speedups on the best known visualization algorithms. Simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management. We demonstrate our claims through parallel versions of three state-of-the-art algorithms: a recent hierarchical radiosity algorithm by Hanrahan et al. (1991), a parallelized ray-casting volume renderer by Levoy (1992), and an optimized ray-tracer by Spach and Pulleyblank (1992). We also discuss a new shear-warp volume rendering algorithm that provides the first demonstration of interactive frame rates for a 256×256×256 voxel data set on a general-purpose multiprocessor.

Journal ArticleDOI
TL;DR: The Maisie simulation language is presented, a set of optimizations are described, and the use of the language in the design of efficient parallel simulations is illustrated.
Abstract: Maisie is a C-based discrete-event simulation language that was designed to cleanly separate a simulation model from the underlying algorithm (sequential or parallel) used for the execution of the model. With few modifications, a Maisie program may be executed by using a sequential simulation algorithm, a parallel conservative algorithm or a parallel optimistic algorithm. The language constructs allow the run-time system to implement optimizations that reduce recomputation and state saving overheads for optimistic simulations and synchronization overheads for conservative implementations. This paper presents the Maisie simulation language, describes a set of optimizations, and illustrates the use of the language in the design of efficient parallel simulations.

Journal ArticleDOI
TL;DR: This work considers the application of the genetic algorithm to a particular problem, the Assembly Line Balancing Problem, and carries out extensive computational testing to find appropriate values for the various parameters associated with this genetic algorithm.
Abstract: Genetic algorithms are one example of the use of a random element within an algorithm for combinatorial optimization. We consider the application of the genetic algorithm to a particular problem, the Assembly Line Balancing Problem. A general description of genetic algorithms is given, and their specialized use on our test-bed problems is discussed. We carry out extensive computational testing to find appropriate values for the various parameters associated with this genetic algorithm. These experiments underscore the importance of the correct choice of a scaling parameter and mutation rate to ensure the good performance of a genetic algorithm. We also describe a parallel implementation of the genetic algorithm and give some comparisons between the parallel and serial implementations. Both versions of the algorithm are shown to be effective in producing good solutions for problems of this type (with appropriately chosen parameters).
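
The sketch below is a minimal generational GA that exposes the two parameters the study highlights, a linear fitness-scaling coefficient and the mutation rate, applied to a toy bit-counting objective rather than the assembly-line encoding; all parameter values are illustrative.

    # Sketch: a minimal generational GA exposing the two parameters the study
    # emphasizes -- a linear fitness-scaling coefficient and the mutation rate.
    # The bit-counting objective is a toy stand-in for line balancing.
    import random

    def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                          scaling=1.5, mutation_rate=0.02, seed=1):
        rng = random.Random(seed)
        pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
        for _ in range(generations):
            raw = [fitness(ind) for ind in pop]
            avg = sum(raw) / len(raw)
            # linear scaling around the average controls selection pressure
            scaled = [max(avg + scaling * (f - avg), 0.0) for f in raw]
            total = sum(scaled) or 1.0

            def select():                              # roulette-wheel selection
                r, acc = rng.uniform(0, total), 0.0
                for ind, s in zip(pop, scaled):
                    acc += s
                    if acc >= r:
                        return ind
                return pop[-1]

            nxt = []
            while len(nxt) < pop_size:
                a, b = select(), select()
                cut = rng.randrange(1, n_bits)         # one-point crossover
                child = a[:cut] + b[cut:]
                child = [bit ^ (rng.random() < mutation_rate) for bit in child]
                nxt.append(child)
            pop = nxt
        return max(pop, key=fitness)

    best = genetic_algorithm(sum)                      # toy objective: count 1 bits
    print(sum(best), "of 20 bits set")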

Journal ArticleDOI
TL;DR: A synchronous parallel tabu search heuristic for the Vehicle Routing Problem with Time Windows, designed to run on a Multiple-Instruction Multiple-Data (MIMD) computer architecture.

Journal ArticleDOI
TL;DR: The superior convergence property of the parallel hybrid neural network learning algorithm presented in this paper is demonstrated.
Abstract: A new algorithm is presented for training of multilayer feedforward neural networks by integrating a genetic algorithm with an adaptive conjugate gradient neural network learning algorithm. The parallel hybrid learning algorithm has been implemented in C on an MIMD shared memory machine (Cray Y-MP8/864 supercomputer). It has been applied to two different domains, engineering design and image recognition. The performance of the algorithm has been evaluated by applying it to three examples. The superior convergence property of the parallel hybrid neural network learning algorithm presented in this paper is demonstrated.

Journal ArticleDOI
TL;DR: A wide class of problems, the divide & conquer class (D&Q), is shown to be easily and efficiently solvable on the HHC topology, and parallel algorithms are provided to describe how a D&Q problem can be solved efficiently on an HHC structure.
Abstract: Interconnection networks play a crucial role in the performance of parallel systems. This paper introduces a new interconnection topology that is called the hierarchical hypercube (HHC). This topology is suitable for massively parallel systems with thousands of processors. An appealing property of this network is the low number of connections per processor, which enhances the VLSI design and fabrication of the system. Other alluring features include symmetry and logarithmic diameter, which imply easy and fast algorithms for communication. Moreover, the HHC is scalable; that is, it can embed HHC's of lower dimensions. The paper presents two algorithms for data communication in the HHC. The first algorithm is for one-to-one transfer, and the second is for one-to-all broadcasting. Both algorithms take O(log_2 k) time, where k is the total number of processors in the system. A wide class of problems, the divide & conquer class (D&Q), is shown to be easily and efficiently solvable on the HHC topology. Parallel algorithms are provided to describe how a D&Q problem can be solved efficiently on an HHC structure. The solution of a D&Q problem instance having up to k inputs requires O(log_2 k) time.
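
For intuition about the logarithmic broadcast, here is a recursive-doubling one-to-all broadcast on a plain hypercube, simulated round by round; the HHC algorithm in the paper is hierarchical, but it rests on the same doubling idea.

    # Sketch: one-to-all broadcast by recursive doubling on a plain hypercube
    # with N = 2**dim nodes, simulated round by round; every node holds the
    # message after dim = log2(N) rounds.  The HHC version is hierarchical.
    def hypercube_broadcast(dim, source=0):
        has_msg = {source}
        for d in range(dim):                   # one communication round per dimension
            for node in list(has_msg):
                has_msg.add(node ^ (1 << d))   # send across dimension d
        return has_msg

    dim = 4
    print(len(hypercube_broadcast(dim)) == 2 ** dim)   # True, after 4 rounds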

Journal ArticleDOI
TL;DR: A new algorithm for passively estimating the ranges and bearings of multiple narrow-band sources using a uniform linear sensor array is presented, which reduces the global 2D search over range and bearing to 2(m-1) independent 1D searches.
Abstract: A new algorithm for passively estimating the ranges and bearings of multiple narrow-band sources using a uniform linear sensor array is presented. The algorithm is computationally efficient and converges globally. It minimizes the MUSIC cost function subject to geometrical constraints imposed by the curvature of the received wavefronts. The estimation problem is reduced to one of solving a set of two coupled 2D polynomial equations. The proposed algorithm solves this nonlinear problem using a modification of the path-following (or homotopy) method. For an array having m sensors, the algorithm reduces the global 2D search over range and bearing to 2(m-1) independent 1D searches. This imparts a high degree of parallelism that can be exploited to obtain source location estimates very efficiently.

Book ChapterDOI
01 Jan 1994
TL;DR: The alternating direction method of multipliers decomposition algorithm for convex programming, as recently generalized by Eckstein and Bertsekas, is considered; some reformulations of the algorithm are given, and several alternative means for deriving them are discussed.
Abstract: We consider the alternating direction method of multipliers decomposition algorithm for convex programming, as recently generalized by Eckstein and Bertsekas. We give some reformulations of the algorithm, and discuss several alternative means for deriving them. We then apply these reformulations to a number of optimization problems, such as minimum convex-cost transportation and multicommodity flow problems. The convex transportation version is closely related to a linear-cost transportation algorithm proposed earlier by Bertsekas and Tsitsiklis. Finally, we construct a simple data-parallel implementation of the convex-cost transportation algorithm for the CM-5 family of parallel computers, and give computational results. The method appears to converge quite quickly on sparse quadratic-cost transportation problems, even if they are very large; for example, we solve problems with over a million arcs in roughly 100 iterations, which equates to about 30 seconds of run time on a system with 256 processing nodes. Substantially better timings can probably be achieved with a more careful implementation.
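
A generic sketch of the alternating direction method of multipliers iteration (x-update, z-update, dual update), shown here on a small lasso problem rather than the paper's transportation or multicommodity flow formulations; numpy is assumed, and the problem data are synthetic.

    # Sketch: the generic ADMM iteration (x-update, z-update, dual update) on
    #   minimize 0.5*||A x - b||^2 + lam*||z||_1   subject to   x = z.
    # Synthetic data; the paper applies the same splitting to transportation
    # and multicommodity flow problems instead.
    import numpy as np

    def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
        n = A.shape[1]
        x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u: scaled dual
        M = np.linalg.inv(A.T @ A + rho * np.eye(n))      # factor once, reuse
        Atb = A.T @ b
        for _ in range(iters):
            x = M @ (Atb + rho * (z - u))                 # quadratic x-update
            v = x + u
            z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft threshold
            u = u + x - z                                 # multiplier update
        return z

    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 10))
    x_true = np.zeros(10)
    x_true[:3] = [1.0, -2.0, 0.5]
    b = A @ x_true + 0.01 * rng.standard_normal(40)
    print(np.round(admm_lasso(A, b), 2))   # roughly recovers x_true, rest near 0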

Proceedings ArticleDOI
01 Aug 1994
TL;DR: Optimizations for the first two algorithms improve their performance significantly without changing their asymptotic complexity, and a hybrid that combines features of the others is generally the fastest of those tested.
Abstract: This paper presents a comparison of the pragmatic aspects of some parallel algorithms for finding connected components, together with optimizations on these algorithms. The algorithms being compared are two similar algorithms by Shiloach-Vishkin [22] and Awerbuch-Shiloach [2], a randomized contraction algorithm based on algorithms by Reif [21] and Phillips [20], and a hybrid algorithm [11]. Improvements are given for the first two to improve performance significantly, although without improving their asymptotic complexity. The hybrid combines features of the others and is generally the fastest of those tested. Timings were made using NESL [4] code as executed on a Connection Machine 2 and Cray Y-MP/C90.
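
A simplified sequential rendering of the two primitive steps these algorithms share, hooking a root to a smaller neighbouring label and pointer jumping, is sketched below; it is not any of the optimized variants timed in the paper, and each inner loop corresponds to one parallel step.

    # Sketch: connected components by repeated hooking (attach a root to a
    # smaller neighbouring label) plus pointer jumping; each inner loop is a
    # step the parallel algorithms perform simultaneously on all vertices.
    def connected_components(n, edges):
        parent = list(range(n))
        changed = True
        while changed:
            changed = False
            for u, v in edges:                       # hooking
                ru, rv = parent[u], parent[v]
                if ru < rv and parent[rv] == rv:
                    parent[rv] = ru
                    changed = True
                elif rv < ru and parent[ru] == ru:
                    parent[ru] = rv
                    changed = True
            for v in range(n):                       # pointer jumping
                while parent[v] != parent[parent[v]]:
                    parent[v] = parent[parent[v]]
        return parent

    print(connected_components(8, [(0, 1), (1, 2), (4, 5), (6, 4)]))
    # [0, 0, 0, 3, 4, 4, 4, 7]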

Journal ArticleDOI
TL;DR: In this paper, new algorithms for adaptive eigendecomposition of time-varying data covariance matrices are presented, based on a first-order perturbation analysis of the rank-one update for covariance matrix estimates with exponential windows.
Abstract: In this paper, new algorithms for adaptive eigendecomposition of time-varying data covariance matrices are presented. The algorithms are based on a first-order perturbation analysis of the rank-one update for covariance matrix estimates with exponential windows. Different assumptions on the eigenvalue structure lead to three distinct algorithms with varying degrees of complexity. A stabilization technique is presented and both issues of initialization and computational complexity are discussed. Computer simulations indicate that the new algorithms can achieve the same performance as a direct approach in which the exact eigendecomposition of the updated sample covariance matrix is obtained at each iteration. Previous algorithms with similar performance require O(LM^2) complex operations per iteration, where L and M respectively denote the data vector and signal-subspace dimensions, and involve either some form of Gram-Schmidt orthogonalization or a nonlinear eigenvalue search. The new algorithms have parallel structures, sequential operation counts of order O(LM) or less, and do not involve any of the above steps. One particular algorithm can be used to update the complete signal-subspace eigenstructure in 5LM complex operations. This represents an order of magnitude improvement in computational complexity over existing algorithms with similar performance. Finally, a simplified local convergence analysis of one of the algorithms shows that it is stable and converges in the mean to the true eigendecomposition. The convergence is geometrical and is characterized by a single time constant.
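
The following sketch applies generic first-order perturbation formulas to an exponentially windowed rank-one covariance update and checks them against an exact eigendecomposition; it assumes well-separated eigenvalues, uses numpy, and omits the stabilization and cost-reducing structure that distinguish the paper's algorithms (which, unlike this sketch, avoid re-orthogonalization).

    # Sketch: generic first-order perturbation update of an eigendecomposition
    # under the exponentially windowed rank-one update
    #   C_new = beta * C + (1 - beta) * x x^T,
    # assuming well-separated eigenvalues.  The paper's algorithms add
    # stabilization and structure that avoid the re-orthogonalization used here.
    import numpy as np

    def perturbation_update(eigvals, Q, x, beta=0.98):
        alpha = 1.0 - beta
        y = Q.T @ x                                  # update in the eigenbasis
        lam = beta * eigvals + alpha * y ** 2        # first-order eigenvalues
        Q_new = Q.copy()
        for i in range(len(eigvals)):
            c = np.zeros(len(eigvals))
            for j in range(len(eigvals)):
                if j != i:
                    c[j] = alpha * y[i] * y[j] / (beta * (eigvals[i] - eigvals[j]))
            Q_new[:, i] = Q[:, i] + Q @ c            # first-order eigenvectors
        Q_new, _ = np.linalg.qr(Q_new)               # re-orthonormalize for safety
        return lam, Q_new

    rng = np.random.default_rng(0)
    C = np.cov(rng.standard_normal((6, 200)))
    eigvals, Q = np.linalg.eigh(C)
    x = rng.standard_normal(6)
    lam, _ = perturbation_update(eigvals, Q, x)
    exact = np.linalg.eigvalsh(0.98 * C + 0.02 * np.outer(x, x))
    print(np.max(np.abs(np.sort(lam) - exact)))      # small relative to eigenvalues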

Book
01 Jan 1994
TL;DR: This text discusses the design and use of practical parallel algorithms for solving problems in a growing application area whose computational requirements are enormous - VLSI CAD applications.
Abstract: Parallel computing is becoming an increasingly cost-effective and affordable means for providing enormous computing power, and massively parallel (MPP) machines have been relatively easy to build. However, designing good parallel algorithms that can efficiently use the hardware resources to get the maximum performance remains a challenge. This text discusses the design and use of practical parallel algorithms for solving problems in a growing application area whose computational requirements are enormous - VLSI CAD applications. It also examines practical parallel algorithms (written in C and pseudo-C) for all forms of parallel programming - shared-memory MIMD, message-passing distributed MIMD, and SIMD - for a variety of interesting applications, with experimental results.

Journal ArticleDOI
Cherng Min Ma
TL;DR: Sufficient conditions are established under which 3D parallel thinning algorithms preserve topology, extending Ronse's 2D conditions, which let a 2D parallel thinning algorithm be proved topology preserving by checking only a small number of configurations.
Abstract: Topology preservation is a major concern of parallel thinning algorithms for 2D and 3D binary images. To prove that a parallel thinning algorithm preserves topology, one must show that it preserves topology for all possible images. But it would be difficult to check all images, since there are too many possible images. Efficient sufficient conditions which can simplify such proofs for the 2D case were proposed by Ronse [Discrete Appl. Math. 21, 1988, 69-79]. By Ronse's results, a 2D parallel thinning algorithm can be proved to be topology preserving by checking a rather small number of configurations. This paper establishes sufficient conditions for 3D parallel thinning algorithms to preserve topology.
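
As a 2D illustration of the kind of local test such conditions reduce proofs to, here is the classic simple-point check: deleting a pixel preserves topology exactly when its 3x3 neighbourhood has one 8-connected foreground component and one 4-connected background component touching the pixel 4-adjacently. The paper's 3D conditions are not reproduced here.

    # Sketch: the classic 2D simple-point test.  Deleting pixel p preserves
    # topology iff its 3x3 neighbourhood contains exactly one 8-connected
    # foreground component and exactly one 4-connected background component
    # that touches p 4-adjacently.
    OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

    def _components(cells, adjacent):
        comps, seen = [], set()
        for c in cells:
            if c in seen:
                continue
            stack, comp = [c], set()
            while stack:
                q = stack.pop()
                if q in comp:
                    continue
                comp.add(q)
                seen.add(q)
                stack.extend(r for r in cells if r not in comp and adjacent(q, r))
            comps.append(comp)
        return comps

    def is_simple(neigh):
        """neigh: dict {(di, dj): 0 or 1} over the 8 offsets around p."""
        adj8 = lambda a, b: max(abs(a[0] - b[0]), abs(a[1] - b[1])) == 1
        adj4 = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
        fg = [o for o in OFFSETS if neigh[o] == 1]
        bg = [o for o in OFFSETS if neigh[o] == 0]
        bg_touching = [c for c in _components(bg, adj4)
                       if any(abs(o[0]) + abs(o[1]) == 1 for o in c)]
        return len(_components(fg, adj8)) == 1 and len(bg_touching) == 1

    # Foreground only above and below p: deleting p would split the object.
    print(is_simple({o: int(o in [(-1, 0), (1, 0)]) for o in OFFSETS}))  # False
    # Foreground filling the whole top row: p can be deleted safely.
    print(is_simple({o: int(o[0] == -1) for o in OFFSETS}))              # True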

Journal ArticleDOI
01 Jun 1994
TL;DR: In this paper, the authors discuss two physical systems from separate disciplines that make use of the same algorithmic and mathematical structures to reduce the number of operations necessary to complete a realistic simulation.
Abstract: We discuss two physical systems from separate disciplines that make use of the same algorithmic and mathematical structures to reduce the number of operations necessary to complete a realistic simulation. In the gravitational N-body problem, the acceleration of an object is given by the familiar Newtonian laws of motion and gravitation. The computational load is reduced by treating groups of bodies as single multipole sources rather than individual bodies. In the simulation of incompressible flows, the flow may be modeled by the dynamics of a set of N interacting vortices. Vortices are vector objects in three dimensions, but their interactions are mathematically similar to those of gravitating masses. The multipole approximation can be used to greatly reduce the time needed to compute the interactions between vortices. Both types of simulations were carried out on the Intel Touchstone Delta, a parallel MIMD computer with 512 processors. Timings are reported for systems of up to 10 million bodies, and demonstrate that the implementation scales well on massively parallel systems. The majority of the code is common between the two applications, which differ only in certain physics modules. In particular, the code for parallel tree construction and traversal is shared.
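
The common kernel of the two applications is replacing a distant group of bodies by a single aggregate source. The sketch below keeps only the lowest (centre-of-mass) term and compares it with the direct sum; the real codes retain higher multipole moments and organize the groups in a parallel tree, neither of which is shown.

    # Sketch: a distant group of bodies replaced by a single aggregate source.
    # Only the lowest (centre-of-mass) term is kept; the real codes retain
    # higher multipole moments and organize the groups in a parallel tree.
    def direct_accel(target, bodies, G=1.0):
        ax = ay = 0.0
        for m, x, y in bodies:
            dx, dy = x - target[0], y - target[1]
            r3 = (dx * dx + dy * dy) ** 1.5
            ax += G * m * dx / r3
            ay += G * m * dy / r3
        return ax, ay

    def centre_of_mass_accel(target, bodies, G=1.0):
        M = sum(m for m, _, _ in bodies)
        cx = sum(m * x for m, x, _ in bodies) / M
        cy = sum(m * y for m, _, y in bodies) / M
        dx, dy = cx - target[0], cy - target[1]
        r3 = (dx * dx + dy * dy) ** 1.5
        return G * M * dx / r3, G * M * dy / r3

    # A tight clump of bodies far away from the evaluation point.
    clump = [(1.0, 100.0 + 0.1 * i, 50.0 - 0.1 * i) for i in range(10)]
    print(direct_accel((0.0, 0.0), clump))
    print(centre_of_mass_accel((0.0, 0.0), clump))   # nearly identical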