
Showing papers on "Parallel algorithm published in 1995"


Journal ArticleDOI
TL;DR: In this article, three parallel algorithms for classical molecular dynamics are presented; each can be implemented on any distributed-memory parallel machine that allows for message-passing of data between independently executing processors.

32,670 citations


Journal ArticleDOI
TL;DR: Serial and parallel algorithms are presented for solving a system of equations that arises from the discretization of the Hamilton-Jacobi equation associated with a trajectory optimization problem.
Abstract: We present serial and parallel algorithms for solving a system of equations that arises from the discretization of the Hamilton-Jacobi equation associated to a trajectory optimization problem of the following type. A vehicle starts at a prespecified point x_0 and follows a unit speed trajectory x(t) inside a region in R^m until an unspecified time T at which the region is exited. A trajectory minimizing a cost function of the form ∫_0^T r(x(t))dt + q(x(T)) is sought. The discretized Hamilton-Jacobi equation corresponding to this problem is usually solved using iterative methods. Nevertheless, assuming that the function r is positive, we are able to exploit the problem structure and develop one-pass algorithms for the discretized problem. The first algorithm resembles Dijkstra's shortest path algorithm and runs in time O(n log n), where n is the number of grid points. The second algorithm uses a somewhat different discretization and borrows some ideas from a variation of Dial's shortest path algorithm (1969) that we develop here; it runs in time O(n), which is the best possible, under some fairly mild assumptions. Finally, we show that the latter algorithm can be efficiently parallelized: for two-dimensional problems and with p processors, its running time becomes O(n/p), provided that p = O(√n / log n).
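
The one-pass idea above can be illustrated with a small sketch. The following Python fragment runs a Dijkstra-like sweep over a 4-connected grid, finalizing each grid point once; the neighbour cost h·(r[u]+r[v])/2 is a simplified stand-in for the paper's actual discretization, and all names are illustrative.

    import heapq

    def one_pass_value(r, q_exit, exits, h=1.0):
        """Dijkstra-like one-pass solver on a 4-connected grid (sketch).
        r      : 2D list of positive running costs r(x) at grid points
        q_exit : 2D list of terminal costs q(x), used only at exit points
        exits  : set of (i, j) exit grid points
        h      : grid spacing
        The edge cost h * (r[u] + r[v]) / 2 is a simplification used only
        to show the O(n log n) label-setting structure."""
        n, m = len(r), len(r[0])
        INF = float("inf")
        V = [[INF] * m for _ in range(n)]
        heap = []
        for (i, j) in exits:
            V[i][j] = q_exit[i][j]
            heapq.heappush(heap, (V[i][j], i, j))
        while heap:
            v, i, j = heapq.heappop(heap)
            if v > V[i][j]:
                continue                      # stale heap entry
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    cand = v + h * (r[i][j] + r[a][b]) / 2.0
                    if cand < V[a][b]:        # each point is finalized once
                        V[a][b] = cand
                        heapq.heappush(heap, (cand, a, b))
        return V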

816 citations


Journal ArticleDOI
01 Aug 1995
TL;DR: This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics, and a general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly.
Abstract: Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. O(n^2) algorithms, where n is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe O(n) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an n-node CRCW PRAM and O(n log n) algorithms for these metrics (except average link and complete link) on n/log n node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.
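
As a point of reference for the O(n) PRAM results above, the serial O(n^2) single-link case can be computed through a minimum spanning tree, as in this short sketch (Euclidean points assumed; this is the sequential baseline, not the paper's parallel algorithm):

    import math

    def single_link_dendrogram(points):
        """O(n^2) single-link clustering via Prim's MST: the serial baseline
        that a parallel single-link algorithm must beat.  Returns the MST
        edges sorted by length; cutting them longest-first yields the
        single-link dendrogram levels."""
        n = len(points)
        dist = lambda a, b: math.dist(points[a], points[b])
        in_tree = [False] * n
        best = [math.inf] * n            # best[j]: distance from j to the tree
        best_from = [0] * n
        in_tree[0] = True
        for j in range(1, n):
            best[j] = dist(0, j)
        edges = []
        for _ in range(n - 1):
            u = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
            edges.append((best[u], best_from[u], u))
            in_tree[u] = True
            for j in range(n):
                if not in_tree[j]:
                    d = dist(u, j)
                    if d < best[j]:
                        best[j], best_from[j] = d, u
        return sorted(edges)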

429 citations


Journal ArticleDOI
11 Jan 1995
TL;DR: The algorithm is implemented on the CM-5 and is run repeatedly on two deceptive problems to demonstrate the added implicit parallelism and faster convergence which can result from larger population sizes.
Abstract: This paper introduces and analyzes a parallel method of simulated annealing. Borrowing from genetic algorithms, an effective combination of simulated annealing and genetic algorithms, called parallel recombinative simulated annealing, is developed. This new algorithm strives to retain the desirable asymptotic convergence properties of simulated annealing, while adding the population approach and recombinative power of genetic algorithms. The algorithm iterates a population of solutions rather than a single solution, employing a binary recombination operator as well as a unary neighborhood operator. Proofs of global convergence are given for two variations of the algorithm. Convergence behavior is examined, and empirical distributions are compared to Boltzmann distributions. Parallel recombinative simulated annealing is amenable to straightforward implementation on SIMD, MIMD, or shared-memory machines. The algorithm, implemented on the CM-5, is run repeatedly on two deceptive problems to demonstrate the added implicit parallelism and faster convergence which can result from larger population sizes.
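
A minimal serial sketch of one generation of the method described above, assuming an even population size and user-supplied energy, crossover, and mutation callables; the paper's exact acceptance rule and operators may differ:

    import math, random

    def prsa_generation(pop, energy, temperature, crossover, mutate):
        """One generation of (a serial sketch of) parallel recombinative
        simulated annealing: recombine pairs, mutate, then hold a Boltzmann
        trial between each parent/child pair.  Assumes an even population;
        energy, crossover and mutate are problem-specific callables."""
        random.shuffle(pop)
        next_pop = []
        for a, b in zip(pop[0::2], pop[1::2]):
            c, d = crossover(a, b)
            c, d = mutate(c), mutate(d)
            for parent, child in ((a, c), (b, d)):
                delta = energy(child) - energy(parent)
                # Metropolis acceptance: keep the child if it is better,
                # or with probability exp(-delta/T) if it is worse.
                if delta <= 0 or random.random() < math.exp(-delta / temperature):
                    next_pop.append(child)
                else:
                    next_pop.append(parent)
        return next_pop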

326 citations


Book
01 Jan 1995
TL;DR: In this article, the authors introduce a new metric for measuring the performance of a parallel algorithm running on a parallel processor that has some advantages over existing metrics; its use is illustrated with data from the Linpack benchmark report and the winners of the Gordon Bell Award.
Abstract: Many metrics are used for measuring the performance of a parallel algorithm running on a parallel processor. This article introduces a new metric that has some advantages over the others. Its use is illustrated with data from the Linpack benchmark report and the winners of the Gordon Bell Award.

267 citations


Journal ArticleDOI
TL;DR: A new algorithm for computing optical flow in a differential framework based on a robust version of total least squares is developed, incorporating only past time frames.
Abstract: We have developed a new algorithm for computing optical flow in a differential framework. The image sequence is first convolved with a set of linear, separable spatiotemporal filter kernels similar to those that have been used in other early vision problems such as texture and stereopsis. The brightness constancy constraint can then be applied to each of the resulting images, giving us, in general, an overdetermined system of equations for the optical flow at each pixel. There are three principal sources of error: (a) stochastic error due to sensor noise, (b) systematic errors in the presence of large displacements, and (c) errors due to failure of the brightness constancy model. Our analysis of these errors leads us to develop an algorithm based on a robust version of total least squares. Each optical flow vector computed has an associated reliability measure which can be used in subsequent processing. The performance of the algorithm on the data set used by Barron et al. (IJCV 1994) compares favorably with other techniques. In addition to being separable, the filters used are also causal, incorporating only past time frames. The algorithm is fully parallel and has been implemented on a multiple processor machine.
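
The per-pixel estimation step can be sketched as an ordinary total-least-squares solve; the robust weighting the paper adds is omitted here, and the variable names are illustrative:

    import numpy as np

    def tls_flow(Ix, Iy, It):
        """Total-least-squares flow estimate at one pixel (sketch).
        Ix, Iy, It are 1-D arrays of the spatial and temporal derivative
        responses of the K filtered images at that pixel, so each row of
        A = [Ix Iy It] is one brightness-constancy equation.  The TLS
        solution is the eigenvector of A^T A with smallest eigenvalue,
        rescaled so its last component is 1; the flow is its first two
        components."""
        A = np.stack([Ix, Iy, It], axis=1)      # K x 3 system
        w, V = np.linalg.eigh(A.T @ A)          # eigenvalues in ascending order
        v = V[:, 0]                             # smallest eigenvector
        if abs(v[2]) < 1e-12:
            return None                         # unreliable (aperture problem)
        return v[0] / v[2], v[1] / v[2]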

264 citations


Journal ArticleDOI
TL;DR: The proposed systolic array and the parallel filter architectures implement these on-line algorithms and are optimal both with respect to area and time (under the word-serial model).
Abstract: This paper presents a wide range of algorithms and architectures for computing the 1D and 2D discrete wavelet transform (DWT) and the 1D and 2D continuous wavelet transform (CWT). The algorithms and architectures presented are independent of the size and nature of the wavelet function. New on-line algorithms are proposed for the DWT and the CWT that require significantly less storage. The proposed systolic array and the parallel filter architectures implement these on-line algorithms and are optimal both with respect to area and time (under the word-serial model). Moreover, these architectures are very regular and support single chip implementations in VLSI. The proposed SIMD architectures implement the existing pyramid and à trous algorithms and are optimal with respect to time.

244 citations


Journal ArticleDOI
TL;DR: A subpixel addressing mechanism (called linear interpolation) is utilized for intermediate pixel addressing in the differentiation step, which results in improved accuracy of corner localization and reduced computational complexity.

207 citations


Proceedings ArticleDOI
29 May 1995
TL;DR: In this paper, it is shown that a unit-cost RAM with a word length of w bits can sort n integers in the range 0..2^w - 1 in O(n log log n) time, for arbitrary w ≥ log n, a significant improvement over the bound achieved by the fusion trees of Fredman and Willard; provided that w ≥ (log n)^(2+ε) for some fixed ε > 0, the sorting can even be accomplished in linear expected time with a randomized algorithm.
Abstract: We show that a unit-cost RAM with a word length of w bits can sort n integers in the range 0..2^w - 1 in O(n log log n) time, for arbitrary w ≥ log n, a significant improvement over the bound achieved by the fusion trees of Fredman and Willard. Provided that w ≥ (log n)^(2+ε) for some fixed ε > 0, the sorting can even be accomplished in linear expected time with a randomized algorithm. Both of our algorithms parallelize without loss on a unit-cost PRAM with a word length of w bits. The first one yields an algorithm that uses O(log n) time and O(n log log n) operations on a deterministic CRCW PRAM. The second one yields an algorithm that uses O(log n) expected time and O(n) expected operations on a randomized EREW PRAM, provided that w ≥ (log n)^(2+ε) for some fixed ε > 0. Our deterministic and randomized sequential and parallel algorithms generalize to the lexicographic sorting problem of sorting multiple-precision integers represented in several words.

194 citations


Book
01 Jan 1995
TL;DR: A thorough introduction to this technology, explaining the fundamentals of parallelism in a logical and readable way, developing the algorithms of vector processors, shared-memory parallel machines and distributed-memory machines emphasising the link between architectures, models and algorithms.
Abstract: From the Publisher: Developments in parallel computing in recent years have made it possible to build multi-processor architectures that enable efficiency and speed in today's computing environment. Parallel Algorithms and Architectures provides a thorough introduction to this technology, explaining the fundamentals of parallelism in a logical and readable way. Progressing from theory to implementation, the text develops the algorithms of vector processors, shared-memory parallel machines and distributed-memory machines emphasising the link between architectures, models and algorithms. In addition, the book addresses a number of issues that are of great practical importance to people developing parallel programs, including coverage of LINPACK and BLAS, vectorisation, task placement and scheduling. Parallel Algorithms and Architectures is ideal for both computer science students and people in industry who require an understanding of parallelism.

194 citations


Proceedings ArticleDOI
Neal E. Young
22 Jan 1995
TL;DR: A new technique called oblivious rounding is introduced, a variant of randomized rounding that avoids the bottleneck of first solving the linear program; this yields more efficient algorithms and brings probabilistic methods to bear on a new class of problems.
Abstract: We introduce a new technique called oblivious rounding, a variant of randomized rounding that avoids the bottleneck of first solving the linear program. Avoiding this bottleneck yields more efficient algorithms and brings probabilistic methods to bear on a new class of problems. We give oblivious rounding algorithms that approximately solve general packing and covering problems, including a parallel algorithm to find sparse strategies for matrix games.
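
For the matrix-game application, a multiplicative-weights-style sketch conveys the flavour of building a sparse strategy without solving the LP first; this is not the paper's exact algorithm, and the step count and update rule below are standard textbook choices:

    import math

    def sparse_strategy(A, eps):
        """Greedy sketch for a sparse near-optimal mixed strategy for the
        row player of a matrix game A (entries in [0, 1]).  At each of
        T = O(log n / eps^2) steps pick the row that looks best against an
        exponential penalty over columns; the uniform mixture over the
        picked rows is then sparse by construction."""
        m, n = len(A), len(A[0])
        T = max(1, int(4 * math.log(n) / (eps * eps)))
        weights = [1.0] * n                 # column penalties
        picks = []
        for _ in range(T):
            # row maximizing the payoff against the current column penalties
            i = max(range(m),
                    key=lambda r: sum(A[r][j] * weights[j] for j in range(n)))
            picks.append(i)
            for j in range(n):
                weights[j] *= math.exp(-eps * A[i][j])   # down-weight covered columns
        strategy = [0.0] * m                # uniform mixture over picked rows
        for i in picks:
            strategy[i] += 1.0 / len(picks)
        return strategy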

Journal ArticleDOI
TL;DR: This paper presents a distributed genetic algorithm for optimization of large structures on a cluster of workstations connected via a local area network (LAN) based on its adaptability to a high degree of parallelism.
Abstract: Parallel algorithms for optimization of structures reported in the literature have been restricted to shared-memory multiprocessors. This paper presents a distributed genetic algorithm for optimization of large structures on a cluster of workstations connected via a local area network (LAN). The selection of genetic algorithm is based on its adaptability to a high degree of parallelism. Two different approaches are used to transform the constrained structural optimization problem to an unconstrained optimization problem: a penalty-function method and augmented Lagrangian approach. For the solution of the resulting simultaneous linear equations the iterative preconditioned conjugate gradient (PCG) method is used because of its low memory requirement. A dynamic load-balancing mechanism is developed to account for the unpredictable multiuser, multitasking environment of a networked cluster of workstations, heterogeneity of machines, and indeterminate nature of the iterative PCG equation solver. The algorithm ...
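
The penalty-function transformation mentioned above can be sketched in a few lines; the quadratic exterior penalty below is one common choice and is only illustrative of how the GA fitness is formed:

    def penalized_weight(weight, constraint_violations, penalty_coeff):
        """Exterior penalty-function transformation: turn the constrained
        sizing problem (minimize weight subject to g_i(x) <= 0) into an
        unconstrained fitness for the genetic algorithm.  Names are
        illustrative; the paper also describes an augmented Lagrangian
        alternative."""
        penalty = sum(max(0.0, g) ** 2 for g in constraint_violations)
        return weight + penalty_coeff * penalty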

Journal ArticleDOI
TL;DR: The key issues addressed by Matcher are described, along with the underlying parallel algorithm, which generates the data structures needed for handling arbitrary and non-conforming fluid/structure interfaces in aeroelastic computations.

19 Sep 1995
TL;DR: NESL, as described in this paper, is a strongly-typed, applicative, data-parallel language intended as a portable interface for programming a variety of parallel and vector computers and as a basis for teaching parallel algorithms.
Abstract: : This report describes NESL, a strongly-typed, applicative, data-parallel language. NESL is intended to be used as a portable interface for programming a variety of parallel and vector computers, and as a basis for teaching parallel algorithms. Parallelism is supplied through a simple set of data-parallel constructs based on sequences, including a mechanism for applying any function over the elements of a sequence in parallel and a rich set of parallel functions that manipulate sequences.

Journal ArticleDOI
TL;DR: A new algorithm for the fast Hough transform (FHT) is described that satisfactorily solves the problems of other fast algorithms proposed in the literature (erroneous solutions, point redundancy, scaling, and detection of straight lines of different sizes) and needs less storage space.
Abstract: The authors describe a new algorithm for the fast Hough transform (FHT) that satisfactorily solves the problems of other fast algorithms proposed in the literature (erroneous solutions, point redundancy, scaling, and detection of straight lines of different sizes) and needs less storage space. By using the information generated by the algorithm for the detection of straight lines, they manage to detect the segments of the image without appreciable computational overhead. They also discuss the performance and the parallelization of the algorithm and show its efficiency with some examples.

01 Aug 1995
TL;DR: The performance results show that, while both algorithms parallelize easily and obtain good speedup and scale-up results, the parallel SEAR version performs better than parallel SPEAR, despite the fact that it uses more communication.
Abstract: The field of knowledge discovery in databases, or "Data Mining", has received increasing attention during recent years as large organizations have begun to realize the potential value of the information that is stored implicitly in their databases. One specific data mining task is the mining of Association Rules, particularly from retail data. The task is to determine patterns (or rules) that characterize the shopping behavior of customers from a large database of previous consumer transactions. The rules can then be used to focus marketing efforts such as product placement and sales promotions. Because early algorithms required an unpredictably large number of IO operations, reducing IO cost has been the primary target of the algorithms presented in the literature. One of the most recently proposed algorithms, called PARTITION, uses a new TID-list data representation and a new partitioning technique. The partitioning technique reduces IO cost to a constant amount by processing one database portion at a time in memory. We implemented an algorithm called SPTID that incorporates both TID-lists and partitioning to study their benefits. For comparison, a non-partitioning algorithm called SEAR, which is based on a new prefix-tree data structure, is used. Our experiments with SPTID and SEAR indicate that TID-lists have inherent inefficiencies; furthermore, because all of the algorithms tested tend to be CPU-bound, trading CPU-overhead against I/O operations by partitioning did not lead to better performance. In order to scale mining algorithms to the huge databases (e.g., multiple Terabytes) that large organizations will manage in the near future, we implemented parallel versions of SEAR and SPEAR (its partitioned counterpart). The performance results show that, while both algorithms parallelize easily and obtain good speedup and scale-up results, the parallel SEAR version performs better than parallel SPEAR, despite the fact that it uses more communication.
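
The core counting step that such miners parallelize can be sketched level-wise as follows; the prefix-tree and TID-list structures discussed above are replaced here by a plain Counter for brevity:

    from itertools import combinations
    from collections import Counter

    def frequent_itemsets(transactions, min_support, max_size=3):
        """Level-wise support counting, the basic operation behind
        Apriori-style association-rule miners (illustrative sketch, not the
        paper's data structures).  Partitioning the transaction list across
        processes and summing the per-partition Counters is what makes the
        counting phase straightforward to parallelize."""
        frequent = {}
        for k in range(1, max_size + 1):
            counts = Counter()
            for t in transactions:
                for itemset in combinations(sorted(t), k):
                    # extend only itemsets whose (k-1)-subsets were frequent
                    if k == 1 or all(sub in frequent
                                     for sub in combinations(itemset, k - 1)):
                        counts[itemset] += 1
            level = {s: c for s, c in counts.items() if c >= min_support}
            if not level:
                break
            frequent.update(level)
        return frequent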

Proceedings ArticleDOI
25 Apr 1995
TL;DR: A new heuristic algorithm for optimizing the assignment of priorities to tasks and messages in distributed hard realtime systems that executes two orders of magnitude faster than simulated annealing, finds better solutions, and finds solutions in cases where the latter method fails.
Abstract: Recent advances in the analysis of distributed realtime systems have made it possible to predict if hard realtime requirements will be met. However, it is still difficult to find a feasible priority assignment when the utilization levels of the CPUs and communication networks are pushed near to their limits. This paper presents a new heuristic algorithm for optimizing the assignment of priorities to tasks and messages in distributed hard realtime systems. The algorithm is based on the knowledge of the parameters that influence the worst-case response time of a distributed application. This algorithm is compared to simulated annealing, which is a general optimization technique for discrete functions that had been previously used for solving similar problems. On average, our heuristic algorithm executes two orders of magnitude faster than simulated annealing, finds better solutions, and finds solutions in cases where the latter method fails.

ReportDOI
31 Dec 1995
TL;DR: A new parallel algorithm for mesh smoothing that has a fast parallel runtime both in theory and in practice and experimental results obtained on the IBM SP system demonstrating the efficiency of this approach are presented.
Abstract: Automatic mesh generation and adaptive refinement methods have proven to be very successful tools for the efficient solution of complex finite element applications. A problem with these methods is that they can produce poorly shaped elements; such elements are undesirable because they introduce numerical difficulties in the solution process. However, the shape of the elements can be improved through the determination of new geometric locations for mesh vertices by using a mesh smoothing algorithm. In this paper the authors present a new parallel algorithm for mesh smoothing that has a fast parallel runtime both in theory and in practice. The authors present an efficient implementation of the algorithm that uses non-smooth optimization techniques to find the new location of each vertex. Finally, they present experimental results obtained on the IBM SP system demonstrating the efficiency of this approach.
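
For contrast with the optimization-based approach above, the usual Laplacian-smoothing baseline fits in a few lines; the paper replaces the centroid move below with a local non-smooth optimization per vertex and processes an independent set of vertices at a time so the sweeps parallelize:

    def laplacian_smooth(coords, adjacency, interior, sweeps=5):
        """Simple Laplacian smoothing: move each interior vertex to the
        centroid of its neighbours.  coords maps vertex -> coordinate tuple,
        adjacency maps vertex -> list of neighbouring vertices, interior is
        the set of vertices allowed to move.  Illustrative baseline only."""
        for _ in range(sweeps):
            for v in interior:
                nbrs = adjacency[v]
                coords[v] = tuple(
                    sum(coords[u][d] for u in nbrs) / len(nbrs)
                    for d in range(len(coords[v]))
                )
        return coords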

Journal ArticleDOI
TL;DR: An efficient algorithm to find exact (tight) bounds on the separation time of events in an arbitrary process graph without conditional behavior is presented, which will form a basis for exploration of timing-constrained synthesis techniques.
Abstract: Determining the time separation of events is a fundamental problem in the analysis, synthesis, and optimization of concurrent systems. Applications range from logic optimization of asynchronous digital circuits to evaluation of execution times of programs for real-time systems. We present an efficient algorithm to find exact (tight) bounds on the separation time of events in an arbitrary process graph without conditional behavior. This result is more general than the methods presented in several previously published papers as it handles cyclic graphs and yields the tightest possible bounds on event separations. The algorithm is based on a functional decomposition technique that permits the implicit evaluation of an infinitely unfolded process graph. Examples are presented that demonstrate the utility and efficiency of the solution. The algorithm will form a basis for exploration of timing-constrained synthesis techniques.

Proceedings ArticleDOI
23 Oct 1995
TL;DR: It is shown that for degrees of parallelism of typical practical interest, the Gauss-Seidel updates may be computed in parallel with little loss in convergence speed.
Abstract: While Bayesian methods can significantly improve the quality of tomographic reconstructions, they require the solution of large iterative optimization problems. Recent results indicate that the convergence of these optimization problems can be improved by using sequential pixel updates, or Gauss-Seidel iterations. However, Gauss-Seidel iterations may be perceived as less useful when parallel computing architectures are used. We show that for degrees of parallelism of typical practical interest, the Gauss-Seidel updates may be computed in parallel with little loss in convergence speed. In this case, the theoretical speedup of parallel implementations is nearly linear with the number of processors.
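
A generic sketch of the grouped update pattern: pixels within a block see the same stale state (and can therefore be updated concurrently), while blocks are visited sequentially. This is written for a plain linear system rather than the Bayesian tomography objective:

    import numpy as np

    def gauss_seidel_sweep(A, x, b, blocks):
        """One sweep of block-grouped Gauss-Seidel updates for Ax = b.
        Variables inside a block are updated from the same (stale) state,
        so a block can be split across processors; blocks are visited
        sequentially, which preserves most of the Gauss-Seidel convergence
        behaviour.  x is updated in place and returned."""
        for block in blocks:
            x_old = x.copy()                      # state shared by this block
            for i in block:
                r = b[i] - A[i, :] @ x_old + A[i, i] * x_old[i]
                x[i] = r / A[i, i]
        return x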

Journal ArticleDOI
TL;DR: It is proved that, by use of certain randomizations on the input system, the parallel speedup is roughly the number of vectors in the blocks when as many processors are used.
Abstract: By using projections by a block of vectors in place of a single vector it is possible to parallelize the outer loop of iterative methods for solving sparse linear systems. We analyze such a scheme proposed by Coppersmith for Wiedemann's coordinate recurrence algorithm, which is based in part on the Krylov subspace approach. We prove that by use of certain randomizations on the input system the parallel speedup is roughly the number of vectors in the blocks when as many processors are used. Our analysis is valid for fields of entries that have sufficiently large cardinality. Our analysis also deals with an arising subproblem of solving a singular block Toeplitz system by use of the theory of Toeplitz-like matrices.

Journal ArticleDOI
TL;DR: The presented algorithm is able to generate almost optimal packing schemes, and even in its sequential version it is empirically shown to be superior to approaches such as random search or simulated annealing.

Proceedings ArticleDOI
20 Jul 1995
TL;DR: The paper identifies a class of parallel schedules that are provably efficient in both time and space, and describes a scheduler for implementing high-level languages with nested parallelism that generates schedules in this class.
Abstract: Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any computation with w units of work and critical path length d, and for any sequential schedule that takes space s1, we provide a parallel schedule that takes fewer than w/p + d steps on p processors and requires less than s1 + p·d space. This matches the lower bound that we show, and significantly improves upon the best previous bound of s1·p space for the common case where d ≪ s1. The paper then describes a scheduler for implementing high-level languages with nested parallelism that generates schedules in this class. During program execution, as the structure of the computation is revealed, the scheduler keeps track of the active tasks, allocates the tasks to the processors, and performs the necessary task synchronization. The scheduler is itself a parallel algorithm, and incurs at most a constant factor overhead in time and space, even when the scheduling granularity is individual units of work. The algorithm is the first efficient solution to the scheduling problem discussed here, even if space considerations are ignored.

Journal ArticleDOI
01 Feb 1995
TL;DR: An adaptive interacting multiple-model algorithm (AIMM) for use in manoeuvring target tracking that does not need predefined models and can be implemented on parallel machines.
Abstract: The paper describes an adaptive interacting multiple-model algorithm (AIMM) for use in manoeuvring target tracking. The algorithm does not need predefined models. A two-stage Kalman estimator is used to estimate the acceleration of the target. This acceleration value is then fed to the subfilters in an interacting multiple-model (IMM) algorithm, where the subfilters have different acceleration parameters. Results compare the performance of the AIMM algorithm with the IMM algorithm, using simulations of different manoeuvring-target scenarios. Also considered are the relative computational requirements, and the ease with which the algorithms can be implemented on parallel machines.

Journal ArticleDOI
Gilles Bertrand
TL;DR: A new 3D parallel thinning algorithm for medial surfaces that works in cubic grids with the 6-connectivity is proposed, based on a precise definition of end points which are points belonging to surfaces or curves.

Book
01 Jan 1995
TL;DR: Isoefficiency analysis helps us determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.
Abstract: Isoefficiency analysis helps us determine the best algorithm/architecture combination for a particular problem without explicitly analyzing all possible combinations under all possible conditions.
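
A small numerical sketch of the isoefficiency idea: with sequential work W and total parallel overhead T_o(W, p), efficiency is E = W / (W + T_o), and the isoefficiency function is the rate at which W must grow with p to hold E fixed. The overhead model below (proportional to p log p, as for a hypercube reduction) and its constant are assumptions for illustration:

    import math

    def efficiency(W, p, overhead):
        """Parallel efficiency E = W / (W + T_o(W, p)) for sequential work W
        on p processors with total overhead T_o given by 'overhead'."""
        return W / (W + overhead(W, p))

    def isoefficiency_W(p, overhead, target_E=0.8):
        """Smallest W (up to a factor of 2) keeping efficiency at target_E:
        the isoefficiency function evaluated numerically."""
        W = 1.0
        while efficiency(W, p, overhead) < target_E:
            W *= 2
        return W

    # assumed overhead T_o = 10 * p * log2(p), giving W = Theta(p log p)
    hypercube_reduce = lambda W, p: 10.0 * p * math.log2(p)
    print(isoefficiency_W(256, hypercube_reduce))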

Journal ArticleDOI
01 Oct 1995
TL;DR: This parallel model is functionally equivalent to the National Center for Atmospheric Research's Community Climate Model, CCM2, but is structured to exploit distributed memory multi-computers and incorporates parallel spectral transform, semi-Lagrangian transport, and load balancing algorithms.
Abstract: We describe the design of a parallel global atmospheric circulation model, PCCM2. This parallel model is functionally equivalent to the National Center for Atmospheric Research's Community Climate Model, CCM2, but is structured to exploit distributed memory multi-computers. PCCM2 incorporates parallel spectral transform, semi-Lagrangian transport, and load balancing algorithms. We present detailed performance results on the IBM SP2 and Intel Paragon. These results provide insights into the scalability of the individual parallel algorithms and of the parallel model as a whole.

Proceedings ArticleDOI
04 Jan 1995
TL;DR: This paper describes H-BSP, a general-purpose parallel computing environment for developing transportable algorithms based on the Bulk Synchronous Parallel model, and the role of unbundled compiler technology in facilitating the development of such an environment.
Abstract: A necessary condition for the establishment, on a substantial basis, of a parallel software industry would appear to be the availability of technology for generating transportable software, i.e. architecture independent software which delivers scalable performance for a wide variety of applications on a wide range of multiprocessor computers. This paper describes H-BSP, a general purpose parallel computing environment for developing transportable algorithms. H-BSP is based on the Bulk Synchronous Parallel Model (BSP), in which a computation involves a number of supersteps, each having several parallel computational threads that synchronize at the end of the superstep. The BSP Model deals explicitly with the notion of communication among computational threads and introduces parameters g and L that quantify the ratio of communication throughput to computation throughput, and the synchronization period, respectively. These two parameters, together with the number of processors and the problem size, are used to quantify the performance and, therefore, the transportability of given classes of algorithms across machines having different values for these parameters. This paper describes the role of unbundled compiler technology in facilitating the development of such a parallel computer environment.
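
The g and L parameters mentioned above enter the BSP cost model directly: a superstep with maximum local work w and maximum h-relation h costs w + g·h + L, and a program's predicted time is the sum over its supersteps. A toy evaluation follows (the machine parameters are made-up values for illustration):

    def bsp_cost(supersteps, g, L):
        """Predicted BSP running time: each superstep contributes its maximum
        local work w, plus g times its maximum h-relation (words sent or
        received by any one processor), plus the synchronization cost L.
        'supersteps' is a list of (w, h) pairs; g and L are the machine
        parameters described above."""
        return sum(w + g * h + L for (w, h) in supersteps)

    # e.g. three supersteps on a machine with g = 4 (cycles/word) and L = 100
    print(bsp_cost([(1000, 50), (400, 10), (2000, 0)], g=4, L=100))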

Journal ArticleDOI
01 Aug 1995
TL;DR: The EM reconstruction algorithm for volume acquisition from current generation retracted-septa PET scanners is implemented, and extensive use of EM system matrix (C_ij) symmetries reduces the storage cost by a factor of 188.
Abstract: We have implemented the EM reconstruction algorithm for volume acquisition from current generation retracted-septa PET scanners. Although the software was designed for a GE Advance scanner, it is easily adaptable to other 3D scanners. The reconstruction software was written for an Intel iPSC/860 parallel computer with 128 compute nodes. Running on 32 processors, the algorithm requires approximately 55 minutes per iteration to reconstruct a 128 × 128 × 35 image. No projection data compression schemes or other approximations were used in the implementation. Extensive use of EM system matrix (C_ij) symmetries (including the 8-fold in-plane symmetries, 2-fold axial symmetries, and axial parallel line redundancies) reduces the storage cost by a factor of 188. The parallel algorithm operates on distributed projection data which are decomposed by base-symmetry angles. Symmetry operators copy and index the C_ij chord to the form required for the particular symmetry. The use of asynchronous reads, lookup tables, and optimized image indexing improves computational performance.
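
The update being parallelized is the standard EM (MLEM) iteration; a dense NumPy sketch of one iteration is shown below. The real implementation exploits the C_ij symmetries and distributed projection data described above rather than storing a dense system matrix:

    import numpy as np

    def em_iteration(lam, C, y, eps=1e-12):
        """One MLEM update for emission tomography: lam is the current image
        estimate, C the system matrix (C[i, j] = probability that an emission
        in voxel j is detected in projection bin i), y the measured
        projections.  Dense illustrative version only."""
        proj = C @ lam                              # forward projection
        ratio = y / np.maximum(proj, eps)           # compare with measurements
        sens = C.sum(axis=0)                        # sensitivity image
        return lam * (C.T @ ratio) / np.maximum(sens, eps)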

Journal ArticleDOI
01 Nov 1995
TL;DR: The REFINE multiprocessor is shown to offer a cost-effective alternative to the Boolean n-cube multiprocessor architecture without substantial loss in performance.
Abstract: A reconfigurable interconnection network based on a multi-ring architecture called REFINE is described. REFINE embeds a single 1-factor of the Boolean hypercube in any given configuration. The mathematical properties of the REFINE topology and the hardware for the reconfiguration switch are described. The REFINE topology is scalable in the sense that the number of interprocessor communication links scales linearly with network size whereas the network diameter scales logarithmically with network size. Primitive parallel operations on the REFINE topology are described and analyzed. These primitive operations could be used as building blocks for more complex parallel algorithms. A large class of algorithms for the Boolean n-cube which includes the FFT and the Batcher's bitonic sort is shown to map efficiently on the REFINE topology. The REFINE multiprocessor is shown to offer a cost-effective alternative to the Boolean n-cube multiprocessor architecture without substantial loss in performance.