
Showing papers on "Parallel algorithm published in 2001"


Journal ArticleDOI
01 May 2001
TL;DR: In this article, component averaging (CAV) is introduced as a new iterative parallel technique suitable for large and sparse unstructured systems of linear equations; it simultaneously projects the current iterate onto all the system's hyperplanes and is thus inherently parallel.
Abstract: Component averaging (CAV) is introduced as a new iterative parallel technique suitable for large and sparse unstructured systems of linear equations. It simultaneously projects the current iterate onto all the system's hyperplanes, and is thus inherently parallel. However, instead of orthogonal projections and scalar weights (as used, for example, in Cimmino's method), it uses oblique projections and diagonal weighting matrices, with weights related to the sparsity of the system matrix. These features provide for a practical convergence rate which approaches that of algebraic reconstruction technique (ART) (Kaczmarz's row-action algorithm) – even on a single processor. Furthermore, the new algorithm also converges in the inconsistent case. A proof of convergence is provided for unit relaxation, and the fast convergence is demonstrated on image reconstruction problems of the Herman head phantom obtained within the SNARK93 image reconstruction software package. Both reconstructed images and convergence plots are presented. The practical consequences of the new technique are far reaching for real-world problems in which iterative algorithms are used for solving large, sparse, unstructured and often inconsistent systems of linear equations.

233 citations
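
As a rough illustration of the update this abstract describes, here is a minimal single-processor sketch of one CAV sweep (the function name and the toy system are ours; the weights s[j] count the nonzeros in column j, as in the paper):

```python
import numpy as np
from scipy.sparse import csr_matrix

def cav_sweep(A, b, x, lam=1.0):
    """One component-averaging sweep: all equations are processed
    simultaneously (hence inherently parallel), with oblique projections
    weighted by s[j] = number of nonzeros in column j of A."""
    s = np.asarray((A != 0).sum(axis=0)).ravel()   # sparsity weights
    norms = A.multiply(A) @ s                      # per-row sum_j s_j * a_ij^2
    residual = b - A @ x
    return x + lam * (A.T @ (residual / norms))    # assumes no all-zero rows

# toy consistent system; iterates approach the solution [1, 2]
A = csr_matrix(np.array([[1.0, 0.0], [1.0, 1.0]]))
b = np.array([1.0, 3.0])
x = np.zeros(2)
for _ in range(50):
    x = cav_sweep(A, b, x)
```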


Journal ArticleDOI
TL;DR: This paper describes the essential elements of a parallel algorithm for the FDTD method using the MPI (message passing interface) library, and uses a new method that makes it unnecessary to split the field components.
Abstract: In this paper, we describe the essential elements of a parallel algorithm for the FDTD method using the MPI (message passing interface) library. To simplify and accelerate the algorithm, an MPI Cartesian 2D topology is used. The inter-process communications are optimized by the use of derived data types. A general approach is also explained for parallelizing the auxiliary tools, such as far-field computation, thin-wire treatment, etc. For PMLs, we have used a new method that makes it unnecessary to split the field components. This considerably simplifies the computer programming, and is compatible with the parallel algorithm.

224 citations
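
A minimal mpi4py sketch of the Cartesian 2D topology and ghost-layer exchange the abstract mentions (all names and sizes are illustrative; the paper uses MPI derived datatypes for the strided column data, whereas here numpy simply copies it):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)          # factor ranks into a 2D grid
cart = comm.Create_cart(dims, periods=[False, False])
left, right = cart.Shift(0, 1)                       # x-neighbors (PROC_NULL at edges)
down, up = cart.Shift(1, 1)                          # y-neighbors

n = 32
field = np.zeros((n + 2, n + 2))                     # local block + one ghost layer

def exchange_x(f):
    """Send our first interior row left, receive the right ghost row.
    (The mirror-image exchange and the y-direction are analogous.)"""
    recv = np.empty(n + 2)
    cart.Sendrecv(np.ascontiguousarray(f[1, :]), dest=left, sendtag=0,
                  recvbuf=recv, source=right, recvtag=0)
    if right != MPI.PROC_NULL:
        f[n + 1, :] = recv
```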


Proceedings ArticleDOI
14 Oct 2001
TL;DR: The main contribution is that the algorithms solve mixed packing and covering problems (in contrast to pure packing or pure covering problems, which have only "≤" or only "≥" inequalities) and run in time independent of the so-called width of the problem.
Abstract: We describe sequential and parallel algorithms that approximately solve linear programs with no negative coefficients (aka mixed packing and covering problems). For explicitly given problems, our fastest sequential algorithm returns a solution satisfying all constraints within a 1±ε factor in O(md log(m)/ε²) time, where m is the number of constraints and d is the maximum number of constraints any variable appears in. Our parallel algorithm runs in time polylogarithmic in the input size times ε⁻⁴ and uses a total number of operations comparable to the sequential algorithm. The main contribution is that the algorithms solve mixed packing and covering problems (in contrast to pure packing or pure covering problems, which have only "≤" or only "≥" inequalities, but not both) and run in time independent of the so-called width of the problem.

202 citations
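
To pin down the problem form and the 1±ε guarantee quoted above, here is a trivial checker (the approximation algorithm itself is the paper's contribution and is not reproduced here; all names are ours):

```python
import numpy as np

def is_mixed_feasible(P, p, C, c, x, eps):
    """x >= 0 approximately satisfies a mixed problem if every packing
    row obeys Px <= (1+eps)p and every covering row obeys Cx >= (1-eps)c.
    All entries of P, C, p, c are nonnegative (no negative coefficients)."""
    x = np.asarray(x, dtype=float)
    return bool((x >= 0).all()
                and (P @ x <= (1 + eps) * p).all()
                and (C @ x >= (1 - eps) * c).all())
```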


Proceedings ArticleDOI
27 May 2001
TL;DR: Results show that PQGA is superior to QGA as well as other conventional genetic algorithms, and is able to possess the two characteristics of exploration and exploitation simultaneously.
Abstract: This paper proposes a new parallel evolutionary algorithm called the parallel quantum-inspired genetic algorithm (PQGA). The quantum-inspired genetic algorithm (QGA) is based on concepts and principles of quantum computing such as qubits and the superposition of states. Instead of a binary, numeric, or symbolic representation, QGA adopts the qubit chromosome, whose probabilistic nature lets it represent a linear superposition of solutions. QGA is suitable for parallel structures because of its rapid convergence and good global search capability; that is, QGA possesses the two characteristics of exploration and exploitation simultaneously. The effectiveness and applicability of PQGA are demonstrated by experimental results on the knapsack problem, a well-known combinatorial optimization problem. The results show that PQGA is superior to QGA as well as other conventional genetic algorithms.

190 citations
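
As a sketch of the qubit-chromosome representation the abstract describes (standard in quantum-inspired GAs; the rotation-gate update that drives the population toward good knapsack solutions is omitted, and the names are ours):

```python
import numpy as np

def observe(theta):
    """Collapse a qubit chromosome to a classical bit string.
    Gene i is the pair (alpha_i, beta_i) = (cos theta_i, sin theta_i),
    so bit i comes out 1 with probability |beta_i|^2 = sin(theta_i)^2."""
    return (np.random.rand(theta.size) < np.sin(theta) ** 2).astype(int)

# a fresh chromosome starts in the uniform superposition (theta = pi/4):
# every bit string is equally likely until rotation gates bias it
chromosome = np.full(8, np.pi / 4)
print(observe(chromosome))        # e.g. [0 1 1 0 0 1 0 1]
```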


Journal ArticleDOI
TL;DR: A new parallel algorithm for data mining of association rules on shared-memory multiprocessors is presented, the degree of parallelism, synchronization, and data locality issues are studied, and proposed optimizations for fast frequency computation are presented.
Abstract: In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm.

168 citations


Proceedings ArticleDOI
29 Nov 2001
TL;DR: A new parallel algorithm MLFPT (multiple local frequent pattern tree) for parallel mining of frequent patterns, based on FP-growth mining, that uses only two full I/O scans of the database, eliminating the need for generating candidate items, and distributing the work fairly among processors.
Abstract: In this paper we introduce a new parallel algorithm MLFPT (multiple local frequent pattern tree) for parallel mining of frequent patterns, based on FP-growth mining, that uses only two full I/O scans of the database, eliminating the need for generating candidate items, and distributing the work fairly among processors. We have devised partitioning strategies at different stages of the mining process to achieve near optimal balancing between processors. We have successfully tested our algorithm on datasets larger than 50 million transactions.

166 citations
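
A toy sketch of the two-scan, candidate-free structure described above (scan 1 finds globally frequent items; scan 2 lets each simulated processor build its own local prefix tree; the FP-growth mining phase and the paper's balancing strategies are omitted, and all names are ours):

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_local_trees(db, min_sup, n_workers):
    # scan 1: global item frequencies
    freq = Counter(item for t in db for item in t)
    order = {it: r for r, (it, c) in enumerate(freq.most_common())
             if c >= min_sup}                      # frequent items, ranked
    # scan 2: partition transactions, one local tree per worker
    trees = [Node(None) for _ in range(n_workers)]
    for i, t in enumerate(db):
        node = trees[i % n_workers]
        for it in sorted((x for x in t if x in order), key=order.get):
            node = node.children.setdefault(it, Node(it))
            node.count += 1
    return trees

trees = build_local_trees([{"a","b"}, {"a","c"}, {"a","b","c"}, {"b"}],
                          min_sup=2, n_workers=2)
```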


Journal ArticleDOI
TL;DR: This article is concerned with optimization of very large steel structures subjected to the actual constraints of the American Institute of Steel Construction (AISC) ASD and LRFD specifications on high-performance multiprocessor machines using biologically inspired genetic algorithms.
Abstract: This article is concerned with optimization of very large steel structures subjected to the actual constraints of the American Institute of Steel Construction (AISC) ASD and LRFD specifications on high-performance multiprocessor machines using biologically inspired genetic algorithms. First, parallel fuzzy genetic algorithms (GAs) are presented for optimization of steel structures using distributed memory message passing with the Message Passing Interface (MPI) in two different schemes: the processor farming scheme and the migration scheme. Next, two bilevel parallel GAs are presented for large-scale structural optimization through judicious combination of shared memory data parallel processing using the OpenMP Application Programming Interface (API) and distributed memory message passing parallel processing using MPI. Speedup results are presented for the parallel algorithms.

163 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose a data association algorithm, termed m-best S-D, that can determine in O(mSkn³) time (m assignments, S ≥ 3 lists of size n, k relaxations) the (approximately) m-best solutions to an S-D assignment problem.
Abstract: In this paper we describe a novel data association algorithm, termed m-best S-D, that determines in O(mSkn³) time (m assignments, S ≥ 3 lists of size n, k relaxations) the (approximately) m-best solutions to an S-D assignment problem. The m-best S-D algorithm is applicable to tracking problems where either the sensors are synchronized or the sensors and/or the targets are very slow moving. The significance of this work is that the m-best S-D assignment algorithm (in a sliding window mode) can provide for an efficient implementation of a suboptimal multiple hypothesis tracking (MHT) algorithm by obviating the need for a brute force enumeration of an exponential number of joint hypotheses. We first describe the general problem for which the m-best S-D applies. Specifically, given line of sight (LOS) (i.e., incomplete position) measurements from S sensors, sets of complete position measurements are extracted, namely, the 1st, 2nd, ..., mth best (in terms of likelihood) sets of composite measurements are determined by solving a static S-D assignment problem. Utilizing the joint likelihood functions used to determine the m best S-D assignment solutions, the composite measurements are then quantified with a probability of being correct using a JPDA-like (joint probabilistic data association) technique. Lists of composite measurements from successive scans, along with their corresponding probabilities, are used in turn with a state estimator in a dynamic 2-D assignment algorithm to estimate the states of moving targets over time. The dynamic assignment cost coefficients are based on a likelihood function that incorporates the "true" composite measurement probabilities obtained from the (static) m-best S-D assignment solutions. We demonstrate the merits of the m-best S-D algorithm by applying it to a simulated multitarget passive sensor track formation and maintenance problem, consisting of multiple time samples of LOS measurements originating from multiple (S=7) synchronized high frequency direction finding sensors.

161 citations
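
The dynamic 2-D assignment stage mentioned above is, at its core, a linear assignment problem; a minimal sketch with scipy (the cost values are made up, and the m-best enumeration, typically done by Murty-type partitioning, is not shown):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] ~ negative log-likelihood that composite measurement j
# belongs to track i (illustrative numbers only)
cost = np.array([[0.2, 1.5, 2.0],
                 [1.4, 0.3, 1.9],
                 [2.1, 1.6, 0.4]])
rows, cols = linear_sum_assignment(cost)      # optimal one-to-one pairing
print(list(zip(rows, cols)), cost[rows, cols].sum())
```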


Proceedings ArticleDOI
02 May 2001
TL;DR: This paper presents parallel algorithms for constructing state spaces (or Labeled Transition Systems) on a network or a cluster of workstations by using parallelization techniques and shows close to ideal speedups and a good load balancing between network nodes.
Abstract: The verification of concurrent finite-state systems by model-checking often requires generating (a large part of) the state space of the system under analysis. Because of the state explosion problem, this may be a resource-consuming operation, both in terms of memory and CPU time. In this paper, we aim at improving the performance of state space construction by using parallelization techniques. We present parallel algorithms for constructing state spaces (or Labeled Transition Systems) on a network or a cluster of workstations. Each node in the network builds a part of the state space, all parts being merged to form the whole state space upon termination of the parallel computation. These algorithms have been implemented with the CADP verification tool set and experimented on various concurrent applications specified in LOTOS. The results obtained show close to ideal speedups and a good load balancing between network nodes.

150 citations
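
A sequential toy model of the partitioning idea in this abstract: each state is owned by hash(state) mod n nodes, every node keeps its own visited set and worklist, and cross-owner successors are "sent" to their owner (here, simply enqueued). All names are ours:

```python
from collections import deque

def explore(initial, successors, n_nodes):
    owner = lambda s: hash(s) % n_nodes
    visited = [set() for _ in range(n_nodes)]
    queues = [deque() for _ in range(n_nodes)]
    queues[owner(initial)].append(initial)
    while any(queues):
        for q, seen in zip(queues, visited):
            while q:
                s = q.popleft()
                if s in seen:
                    continue
                seen.add(s)
                for t in successors(s):          # "send" t to its owner
                    queues[owner(t)].append(t)
    return visited          # the whole state space, merged over the parts

# toy LTS on states 0..99 with transitions s -> s+1 and s -> 2s (mod 100)
parts = explore(0, lambda s: {(s + 1) % 100, (2 * s) % 100}, n_nodes=4)
print(sum(len(p) for p in parts))                # 100 states, split 4 ways
```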


Book ChapterDOI
14 May 2001
TL;DR: An implementation of the RSA cryptosystem using RNS Montgomery multiplication is described, and an implementation method using the Chinese Remainder Theorem (CRT) is presented.
Abstract: We propose a fast parallel algorithm for Montgomery multiplication based on Residue Number Systems (RNS). An implementation of the RSA cryptosystem using the RNS Montgomery multiplication is described in this paper. We discuss how to choose the base size of the RNS and the number of parallel processing units. An implementation method using the Chinese Remainder Theorem (CRT) is also presented. An LSI prototype adopting the proposed Cox-Rower Architecture achieves 1024-bit RSA transactions in 4.2 msec without CRT and 2.4 msec with CRT, when the operating frequency is 80 MHz and the total number of logic gates is 333 Kgates for 11 parallel processing units.

128 citations
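
The CRT speedup mentioned at the end is standard RSA-CRT; a toy sketch with textbook-sized numbers (the paper's actual contribution, Montgomery multiplication carried out in a residue number system on the Cox-Rower hardware, is not reproduced):

```python
# toy RSA key: n = 61 * 53 = 3233, e = 17, d = 2753
p, q, e, d = 61, 53, 17, 2753
n = p * q

def decrypt_crt(c):
    """Two half-size exponentiations plus Garner recombination."""
    mp = pow(c, d % (p - 1), p)       # c^d mod p
    mq = pow(c, d % (q - 1), q)       # c^d mod q
    q_inv = pow(q, -1, p)             # needs Python >= 3.8
    return mq + q * ((q_inv * (mp - mq)) % p)

c = pow(65, e, n)                     # encrypt m = 65
assert decrypt_crt(c) == 65
```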


Journal ArticleDOI
TL;DR: Some local and parallel discretizations and adaptive finite element algorithms are proposed and analyzed for nonlinear elliptic boundary value problems in both two and three dimensions for finite element solutions on general shape-regular grids.
Abstract: In this paper, some local and parallel discretizations and adaptive finite element algorithms are proposed and analyzed for nonlinear elliptic boundary value problems in both two and three dimensions. The main technique is to use a standard finite element discretization on a coarse grid to approximate low frequencies and then to apply some linearized discretization on a fine grid to correct the resulting residual (which contains mostly high frequencies) by some local/parallel procedures. The theoretical tools for analyzing these methods are some local a priori and a posteriori error estimates for finite element solutions on general shape-regular grids that are also obtained in this paper.
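
In symbols, the coarse-solve-plus-local-correction step described above can be sketched as follows (notation ours, following standard two-grid formulations):

```latex
% coarse, global: nonlinear solve in the coarse space V_H
\text{find } u_H \in V_H : \quad a(u_H;\, v) = (f, v) \quad \forall v \in V_H,
% fine, local/parallel: linearized correction on each subdomain \Omega_j
\text{find } e_h \in V_h(\Omega_j) : \quad
    a'(u_H)(e_h, v) = (f, v) - a(u_H;\, v) \quad \forall v \in V_h(\Omega_j),
% combine: u_h \approx u_H + e_h, the correction carrying the high frequencies
```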

Journal ArticleDOI
TL;DR: Scalability tests of these algorithms for density-functional-theory based electronic-structure calculations show that the linear-scaling DFT algorithm is highly scalable.

Journal ArticleDOI
TL;DR: This paper presents a performance model for long-running parallel computations that execute with checkpointing enabled, discusses how it is relevant to today's parallel computing environments and software, and presents case studies of using the model to select runtime parameters.

Journal ArticleDOI
TL;DR: The algorithm proposed is based on concepts used in parallel genetic algorithms and local search heuristics and employs the Island model in which the migration frequency must not be very high.

Journal ArticleDOI
TL;DR: pSPADE decomposes the original search space into smaller suffix-based classes, which can be solved in main memory using efficient search techniques and simple join operations with no synchronization.

Journal ArticleDOI
TL;DR: The GA, an evolution-like algorithm that is applied to a large population of RNA structures based on a pool of helical stems derived from an RNA sequence, evolves this population in parallel.
Abstract: A massively parallel Genetic Algorithm (GA) has been applied to RNA sequence folding on three different computer architectures. The GA, an evolution-like algorithm that is applied to a large population of RNA structures based on a pool of helical stems derived from an RNA sequence, evolves this population in parallel. The algorithm was originally designed and developed for a 16,384-processor SIMD (Single Instruction Multiple Data) MasPar MP-2. More recently it has been adapted to a 64-processor MIMD (Multiple Instruction Multiple Data) SGI ORIGIN 2000 and a 512-processor MIMD CRAY T3E. The MIMD version of the algorithm raises issues concerning RNA structure data layout and processor communication. In addition, the effects of population variation on the predicted results are discussed. Also presented are the scaling properties of the algorithm from the perspective of the number of physical processors utilized and the number of virtual processors (RNA structures) operated upon.

Journal ArticleDOI
TL;DR: The performance results with a commercial air traffic control simulation demonstrate that cloning can significantly reduce the time required to compute multiple alternate futures.
Abstract: We present a cloning mechanism that enables the evaluation of multiple simulated futures. Performance of the mechanism is analyzed and evaluated experimentally on a shared memory multiprocessor. A running parallel discrete event simulation is dynamically cloned at decision points to explore different execution paths concurrently. In this way, what-if and alternative scenario analysis can be performed in applications such as gaming or tactical and strategic battle management. A construct called virtual logical processes avoids repeating common computations among clones and improves efficiency. The advantages of cloning are preserved regardless of the number of clones (or execution paths). Our performance results with a commercial air traffic control simulation demonstrate that cloning can significantly reduce the time required to compute multiple alternate futures.

Journal ArticleDOI
01 Feb 2001
TL;DR: Numerical tests indicate that the sequential version of the Tabu search algorithm is highly competitive with the best existing heuristics and that the parallel algorithm outperforms all of these algorithms.
Abstract: We present a Tabu search algorithm for the vehicle routing problem under capacity and distance restrictions. The neighborhood search is based on compound moves generated by a node-ejection chain process. During the course of the algorithm, two types of neighborhood structures are used, and crossing infeasible solutions is allowed. Then, a parallel version of the algorithm which exploits the moves' characteristics is described. Parallel processing is used to explore the solution space more extensively and to accelerate the search process. Tests are carried out on a Sun SPARC workstation, and the parallel algorithm uses a network of four of these machines. Numerical tests indicate that the sequential version of the algorithm is highly competitive with the best existing heuristics and that the parallel algorithm outperforms all of these algorithms.

Journal ArticleDOI
TL;DR: A multiblock parallel Euler/Navier-Stokes solver using multigrid and dual-time stepping is developed along with the moving mesh method for unsteady flow calculations of airfoils and wings with deforming shapes as found in flutter simulations.
Abstract: A novel parallel dynamic moving mesh algorithm designed for multiblock parallel unsteady flow calculations using body-fitted grids is presented. The moving grid algorithm within each block uses a method of arc-length-based transfinite interpolation, which is performed independently on local processors where the blocks reside. A spring network approach is used to determine the motion of the corner points of the blocks, which may be connected in an unstructured fashion in a general multiblock method. A smoothing operator is applied to the points of the block face boundaries and edges to maintain grid smoothness and grid angles. A multiblock parallel Euler/Navier-Stokes solver using multigrid and dual-time stepping is developed along with the moving mesh method. Computational results are presented for unsteady flow calculations of airfoils and wings with deforming shapes, as found in flutter simulations.

Journal ArticleDOI
TL;DR: When BICAV is optimized for block size and relaxation parameters, its very first iterates are far superior to those of CAV, and more or less on a par with ART.
Abstract: Component averaging (CAV) was recently introduced by Censor, Gordon, and Gordon as a new iterative parallel technique suitable for large and sparse unstructured systems of linear equations. Based on earlier work of Byrne and Censor, it uses diagonal weighting matrices, with pixel-related weights determined by the sparsity of the system matrix. CAV is inherently parallel (similar to the very slowly converging Cimmino method) but its practical convergence on problems of image reconstruction from projections is similar to that of the algebraic reconstruction technique (ART). Parallel techniques are becoming more important for practical image reconstruction since they are relevant not only for supercomputers but also for the increasingly prevalent multiprocessor workstations. This paper reports on experimental results with a block-iterative version of component averaging (BICAV). When BICAV is optimized for block size and relaxation parameters, its very first iterates are far superior to those of CAV, and more or less on a par with ART. Similar to CAV, BICAV is also inherently parallel. The fast convergence is demonstrated on problems of image reconstruction from projections, using the SNARK93 image reconstruction software package. Detailed plots of various measures of convergence, and reconstructed images are presented.

Journal ArticleDOI
TL;DR: The design of a parallel algorithm that uses moving fluids in a three-dimensional microfluidic system to solve a nondeterministically polynomial complete problem (the maximal clique problem) in polynomial time is described.
Abstract: This paper describes the design of a parallel algorithm that uses moving fluids in a three-dimensional microfluidic system to solve a nondeterministically polynomial complete problem (the maximal clique problem) in polynomial time. This algorithm relies on (i) parallel fabrication of the microfluidic system, (ii) parallel searching of all potential solutions by using fluid flow, and (iii) parallel optical readout of all solutions. This algorithm was implemented to solve the maximal clique problem for a simple graph with six vertices. The successful implementation of this algorithm to compute solutions for small-size graphs with fluids in microchannels is not useful, per se, but does suggest broader application for microfluidics in computation and control.

Journal ArticleDOI
TL;DR: Two new system architectures, overlap-state sequential and split-and-merge parallel, are proposed based on a novel boundary postprocessing technique for the computation of the discrete wavelet transform (DWT) to introduce multilevel partial computations for samples near data boundaries.
Abstract: In this paper, two new system architectures, overlap-state sequential and split-and-merge parallel, are proposed based on a novel boundary postprocessing technique for the computation of the discrete wavelet transform (DWT). The basic idea is to introduce multilevel partial computations for samples near data boundaries based on a finite state machine model of the DWT derived from the lifting scheme. The key observation is that these partially computed (lifted) results can also be stored back to their original locations and the transform can be continued anytime later, as long as these partially computed results are preserved. It is shown that such an extension of the in-place calculation feature of the original lifting algorithm greatly helps to reduce the extra buffer and communication overheads in sequential and parallel system implementations, respectively. Performance analysis and experimental results show that, for the Daubechies (9,7) wavelet filters (see J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247-269, 1998), using the proposed boundary postprocessing technique, the minimal required buffer size in the line-based sequential DWT algorithm is 40% less than in the best available approach. In the parallel DWT algorithm we show 30% faster performance than existing approaches.
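
To illustrate the in-place lifting property the overlap-state idea builds on, here is one level of the simpler Le Gall 5/3 wavelet via lifting (the paper analyzes the (9,7) filters, which have the same in-place structure with more lifting steps; names and boundary handling are ours):

```python
import numpy as np

def lift_53_inplace(x):
    """Forward 5/3 DWT of an even-length float array, no auxiliary buffer:
    after the two lifting steps the even slots hold the low-pass band and
    the odd slots the high-pass band.  A partially lifted block can be
    stored as-is and finished later -- the 'overlap-state' observation."""
    # predict: odd samples become high-pass residuals (symmetric right edge)
    x[1::2] -= np.floor((x[::2] + np.append(x[2::2], x[-2])) / 2)
    # update: even samples become the low-pass signal
    d = x[1::2]
    x[::2] += np.floor((np.append(d[0], d[:-1]) + d + 2) / 4)
    return x

x = np.arange(16, dtype=float)
lift_53_inplace(x)      # evens ~ smoothed signal, odds ~ 0 on this ramp
```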

Proceedings ArticleDOI
01 May 2001
TL;DR: This paper investigates the approach of using low cost PC cluster to parallelize the computation of iceberg-cube queries and recommends a “recipe” which uses PT as the default algorithm, but may also deploy ASL under specific circumstances.
Abstract: In this paper, we investigate the approach of using low cost PC cluster to parallelize the computation of iceberg-cube queries. We concentrate on techniques directed towards online querying of large, high-dimensional datasets where it is assumed that the total cube has net been precomputed. The algorithmic space we explore considers trade-offs between parallelism, computation and I/0. Our main contribution is the development and a comprehensive evaluation of various novel, parallel algorithms. Specifically: (1) Algorithm RP is a straightforward parallel version of BUC [BR99]; (2) Algorithm BPP attempts to reduce I/0 by outputting results in a more efficient way; (3) Algorithm ASL, which maintains cells in a cuboid in a skiplist, is designed to put the utmost priority on load balancing; and (4) alternatively, Algorithm PT load-balances by using binary partitioning to divide the cube lattice as evenly as possible.We present a thorough performance evaluation on all these algorithms on a variety of parameters, including the dimensionality of the cube, the sparseness of the cube, the selectivity of the constraints, the number of processors, and the size of the dataset. A key finding is that it is not a one-algorithm-fit-all situation. We recommend a “recipe” which uses PT as the default algorithm, but may also deploy ASL under specific circumstances.

Journal ArticleDOI
TL;DR: This paper resolves a long-standing open problem on whether the concurrent write capability of the parallel random access machine (PRAM) is essential for solving fundamental graph problems like connected components and minimum spanning trees in logarithmic time.
Abstract: This paper resolves a long-standing open problem on whether the concurrent write capability of the parallel random access machine (PRAM) is essential for solving fundamental graph problems like connected components and minimum spanning trees in O(log n) time. Specifically, we present a new algorithm to solve these problems in O(log n) time using a linear number of processors on the exclusive-read exclusive-write PRAM. The logarithmic time bound is actually optimal since it is well known that even computing the "OR" of n bits requires Ω(log n) time on the exclusive-write PRAM. The efficiency achieved by the new algorithm is based on a new schedule which can exploit a high degree of parallelism.
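
For contrast, a sequential sketch of the classic hook-and-compress scheme that such parallel connected-components algorithms are built on (each round hooks labels onto smaller neighboring labels, then pointer-jumps; the paper's contribution, scheduling this without concurrent writes, is not modeled here):

```python
def connected_components(n, edges):
    label = list(range(n))
    changed = True
    while changed:                       # O(log n) rounds in the PRAM setting
        changed = False
        for u, v in edges:               # hooking: adopt the smaller label
            a, b = label[u], label[v]
            if a != b:
                label[max(a, b)] = min(a, b)
                changed = True
        for v in range(n):               # pointer jumping: compress chains
            while label[v] != label[label[v]]:
                label[v] = label[label[v]]
    return label

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))   # [0, 0, 0, 3, 3, 5]
```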

Book ChapterDOI
28 May 2001
TL;DR: Using a simple model of hierarchical memories, mathematics is employed to determine a locally-optimal strategy for blocking matrices and the resulting family of algorithms yields performance that is superior to that of methods that automatically tune such kernels.
Abstract: During the last half-decade, a number of research efforts have centered around developing software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software end-products of both projects employ brute force to search a parameter space for blockings that accommodate multiple levels of memory hierarchy. We take a different approach: using a simple model of hierarchical memories, we employ mathematics to determine a locally-optimal strategy for blocking matrices. The theoretical results show that, depending on the shape of the matrices involved, different strategies are locally-optimal. Rather than determining a blocking strategy at library generation time, the theoretical results show that, ideally, one should pursue a heuristic that allows the blocking strategy to be determined dynamically at run-time as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner-kernel for a small matrix multiplication, the approach yields performance that is superior to that of methods that automatically tune such kernels. Preliminary results, for the Intel Pentium III processor, support the theoretical insights.
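
A minimal sketch of the blocking being discussed (in the paper the block sizes fall out of the memory-hierarchy model and, ideally, the operand shapes at run time; here they are plain parameters, and the names are ours):

```python
import numpy as np

def blocked_matmul(A, B, mb=64, nb=64, kb=64):
    """C = A @ B computed block by block, so each small product
    A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb] stays cache-resident."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for p in range(0, k, kb):
                C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
    return C

A, B = np.random.rand(200, 300), np.random.rand(300, 100)
assert np.allclose(blocked_matmul(A, B), A @ B)
```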

Journal ArticleDOI
TL;DR: A numerical method designed for modelling different kinds of astrophysical flows in three dimensions employs the local shearing-box technique and uses parallel algorithms to increase the performance of standard serial methods.
Abstract: In this paper we describe a numerical method designed for modelling different kinds of astrophysical flows in three dimensions. Our method is a standard explicit finite difference method employing the local shearing-box technique. To model the features of astrophysical systems, which are usually compressible, magnetised and turbulent, it is desirable to have high spatial resolution and a large domain size so as to model as many features as possible, on various scales, within a particular system. In addition, the time-scales involved are usually wide-ranging, also requiring significant amounts of CPU time. These two limits (resolution and time-scales) place huge demands on computational capabilities. The model we have developed therefore uses parallel algorithms to increase the performance of standard serial methods. The aim of this paper is to report the numerical methods we use and the techniques invoked for parallelising the code. The justification of these methods is given by the extensive tests presented herein.

Journal ArticleDOI
TL;DR: This work presents and analyzes an asynchronous algorithm that is a generalization of the naive two-processor algorithm where the two processes each start at one side of the array and walk towards each other until they collide.
Abstract: The problem of using P processes to write a given value to all positions of a shared array of size N is called the Write-All problem. We present and analyze an asynchronous algorithm with work complexity O(N·P^(log((x+1)/x))), where x ≈ N^(1/log P) (assuming N = x^k and P = 2^k). Our algorithm is a generalization of the naive two-processor algorithm where the two processes each start at one side of the array and walk towards each other until they collide.
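
The naive two-process algorithm that the paper generalizes, sketched with threads (the array starts all 0; each worker writes 1s from its own end and stops at the first cell the other has already covered):

```python
import threading

def write_all_two(n):
    a = [0] * n
    def sweep(indices):
        for i in indices:
            if a[i] == 1:         # collision: the other worker got here first
                break
            a[i] = 1              # both may write the same value; that's fine
    t1 = threading.Thread(target=sweep, args=(range(n),))
    t2 = threading.Thread(target=sweep, args=(reversed(range(n)),))
    t1.start(); t2.start(); t1.join(); t2.join()
    return a

assert write_all_two(10) == [1] * 10   # fully covered even if one worker stalls
```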

Journal ArticleDOI
TL;DR: Numerical results show that this new meshfree contact algorithm can accurately predict the contact as well as the separation of projectile and target.

Journal ArticleDOI
TL;DR: A structured neural network implementing the gradient projection algorithm is developed to solve the quadratic programming problem in constrained model predictive control in a massively parallel fashion with guaranteed convergence to optimal solution.

Journal ArticleDOI
TL;DR: Domain decomposition algorithms for the parallel numerical solution of parabolic equations are studied for steady-state or slowly varying unsteady computation, showing that the resulting schemes are of second-order global accuracy in space and stable in the sense of Osher or in $L_\infty$.
Abstract: Domain decomposition algorithms for the parallel numerical solution of parabolic equations are studied for steady-state or slowly varying unsteady computation. Implicit schemes are used in order to march with large time steps. Parallelization is achieved by approximating interface values using explicit computation. Various techniques are examined, including a multistep second-order explicit scheme and a one-step high-order scheme. We show that the resulting schemes are of second-order global accuracy in space and stable in the sense of Osher or in $L_\infty$. They are optimized with respect to parallel efficiency.
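
A 1-D sketch of the splitting this abstract describes, for the heat equation u_t = u_xx (the interface value is predicted with one explicit step, after which each subdomain does an independent, hence parallel, implicit solve; the discretization details and names are ours, not the paper's exact schemes):

```python
import numpy as np

def dd_heat_step(u, nu, iface):
    """One time step, nu = dt/dx^2.  Step 1: explicit (FTCS) prediction at
    the interface points.  Step 2: backward-Euler solve strictly inside
    each subdomain, using the predicted interfaces as Dirichlet data."""
    u_new = u.copy()
    for i in iface:
        u_new[i] = u[i] + nu * (u[i - 1] - 2 * u[i] + u[i + 1])
    bounds = [0] + list(iface) + [len(u) - 1]
    for a, b in zip(bounds[:-1], bounds[1:]):
        m = b - a - 1                      # unknowns strictly between a and b
        if m <= 0:
            continue
        A = (np.diag((1 + 2 * nu) * np.ones(m))
             + np.diag(-nu * np.ones(m - 1), 1)
             + np.diag(-nu * np.ones(m - 1), -1))
        rhs = u[a + 1:b].copy()
        rhs[0] += nu * u_new[a]            # Dirichlet data from step 1
        rhs[-1] += nu * u_new[b]
        u_new[a + 1:b] = np.linalg.solve(A, rhs)
    return u_new

x = np.linspace(0.0, 1.0, 41)
u = np.sin(np.pi * x)                      # exact solution decays as exp(-pi^2 t)
for _ in range(20):
    u = dd_heat_step(u, nu=0.4, iface=[20])
```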