
Showing papers on "Parallel algorithm published in 2002"


01 Jul 2002
TL;DR: In this paper, a parallel implementation of Metropolis-Coupled Markov Chain Monte Carlo (MCMC) has been proposed to explore multiple peaks in the posterior distribution of trees while maintaining a fast execution time.
Abstract: Bayesian estimation of phylogeny is based on the posterior probability distribution of trees. Currently, the only numerical method that can effectively approximate posterior probabilities of trees is Markov Chain Monte Carlo (MCMC). Standard implementations of MCMC can be prone to entrapment in local optima. A variant of MCMC, known as Metropolis-Coupled MCMC, allows multiple peaks in the landscape of trees to be more readily explored, but at the cost of increased execution time. This paper presents a parallel algorithm for Metropolis-Coupled MCMC. The proposed parallel algorithm retains the ability to explore multiple peaks in the posterior distribution of trees while maintaining a fast execution time. The algorithm has been implemented using two parallel programming models: the Message Passing Interface (MPI) and the Cashmere software distributed shared memory protocol. Performance results indicate nearly linear speed improvement in both programming models for small and large data sets. (MrBayes v3.0 is available at http://morphbank.ebc.uu.se/mrbayes/.)
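The chain-swapping idea at the core of Metropolis-coupled MCMC is easy to sketch: several chains sample increasingly flattened ("heated") versions of the posterior, and adjacent chains periodically propose to swap states, letting the cold chain hop between peaks. The toy bimodal target, temperature schedule, and proposal width below are illustrative assumptions, not MrBayes's actual model:

```python
import math, random

random.seed(0)

def log_target(x):
    # toy bimodal "posterior": equal mixture of Gaussians at -2 and +2
    a, b = -(x - 2.0) ** 2, -(x + 2.0) ** 2
    return max(a, b) + math.log1p(math.exp(min(a, b) - max(a, b)))

def mc3(n_chains=4, n_steps=5000, heat=0.5):
    # chain i samples target^beta_i with beta_i = 1/(1 + heat*i);
    # only the cold chain (i = 0, beta = 1) is recorded
    betas = [1.0 / (1.0 + heat * i) for i in range(n_chains)]
    xs = [0.0] * n_chains
    samples = []
    for _ in range(n_steps):
        for i in range(n_chains):                 # one Metropolis step per chain
            prop = xs[i] + random.gauss(0.0, 1.0)
            delta = betas[i] * (log_target(prop) - log_target(xs[i]))
            if random.random() < math.exp(min(0.0, delta)):
                xs[i] = prop
        i = random.randrange(n_chains - 1)        # propose swapping chains i, i+1
        log_r = (betas[i] - betas[i + 1]) * (log_target(xs[i + 1]) - log_target(xs[i]))
        if random.random() < math.exp(min(0.0, log_r)):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
        samples.append(xs[0])
    return samples

samples = mc3()
print(min(samples), max(samples))   # the cold chain should visit both modes
```

In the parallel algorithm each chain runs on its own processor; only the swap step requires communication, which is why near-linear speedup is achievable.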

965 citations


Journal ArticleDOI
TL;DR: This paper describes an implementation of a parallel AMG code, using the algorithm of A.J. Cleary, and considers three basic coarsening schemes and certain modifications to the basic schemes, designed to address specific performance issues.

849 citations


Patent
25 Feb 2002
TL;DR: In this paper, a massively parallel supercomputer of hundreds of teraOPS scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC); within each ASIC node is a plurality of processing elements, each of which consists of a central processing unit (CPU) and a plurality of floating point processors.
Abstract: A novel massively parallel supercomputer of hundreds of teraOPS scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC). Within each ASIC node is a plurality of processing elements, each of which consists of a central processing unit (CPU) and a plurality of floating point processors to enable an optimal balance of computational performance, packaging density, low cost, and power and cooling requirements. The plurality of processors within a single node may be used individually or simultaneously to work on any combination of computation or communication as required by the particular algorithm being solved or executed at any point in time. The system-on-a-chip ASIC nodes are interconnected by multiple independent networks that maximize packet-communication throughput and minimize latency. In the preferred embodiment, the multiple networks include three high-speed networks for parallel algorithm message passing: a Torus, a Global Tree, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. For particular classes of parallel algorithms, or parts of parallel calculations, this architecture exhibits exceptional computational performance, and may be enabled to perform calculations for new classes of parallel algorithms. Additional networks are provided for external connectivity and are used for Input/Output, System Management and Configuration, and Debug and Monitoring functions. Special node packaging techniques implementing midplanes and other hardware devices facilitate partitioning of the supercomputer into multiple networks for optimizing supercomputing resources.

329 citations


Book ChapterDOI
21 Apr 2002
TL;DR: Experiments demonstrate that a wide set of unsymmetric linear systems can be solved and high performance is consistently achieved for large sparse unsymmetric matrices from real world applications.
Abstract: Supernode pivoting for unsymmetric matrices coupled with supernode partitioning and asynchronous computation can achieve high gigaflop rates for parallel sparse LU factorization on shared memory parallel computers. Progress in weighted graph matching algorithms helps to extend these concepts further, and prepermutation of rows is used to place large matrix entries on the diagonal. Supernode pivoting allows dynamic interchanges of columns and rows during the factorization process while retaining BLAS-3 level efficiency. An enhanced left-right looking scheduling scheme results in good speedup on SMP machines without increasing the operation count. These algorithms have been integrated into the recent unsymmetric version of the PARDISO solver. Experiments demonstrate that a wide set of unsymmetric linear systems can be solved and high performance is consistently achieved for large sparse unsymmetric matrices from real world applications.

323 citations


Journal ArticleDOI
TL;DR: The Intelligent Particle Swarm Optimization (IPSO) algorithm as mentioned in this paper uses concepts such as group experiences, unpleasant memories (tabu to be avoided), local landscape models based on virtual neighbors, and memetic replication of successful behavior parameters.
Abstract: The paper describes a new stochastic heuristic algorithm for global optimization. The new optimization algorithm, called intelligent-particle swarm optimization (IPSO), offers more intelligence to particles by using concepts such as: group experiences, unpleasant memories (tabu to be avoided), local landscape models based on virtual neighbors, and memetic replication of successful behavior parameters. The new individual complexity is amplified at the group level and consequently generates a more efficient optimization procedure. A simplified version of the IPSO algorithm was implemented and compared with the classical PSO algorithm for a simple test function and for Loney's solenoid.
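For reference, the classical global-best PSO update that IPSO builds on can be sketched as follows; the inertia and acceleration coefficients are common textbook values, and none of IPSO's extensions (tabu memories, virtual neighbors, memetic parameter replication) are implemented here:

```python
import random

random.seed(1)

def pso(f, dim=2, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5):
    # classical PSO: each particle tracks its personal best and is pulled
    # toward both it and the swarm's global best
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# minimize the 2-D sphere function as a stand-in for the paper's test problems
best, best_val = pso(lambda p: sum(x * x for x in p))
print(best_val)
```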

276 citations


Journal ArticleDOI
TL;DR: ROSS demonstrates for the first time that stable, highly efficient execution using little memory above what the sequential model would require is possible for low-event granularity simulation models.

214 citations


Book ChapterDOI
12 Feb 2002
TL;DR: This paper proposes a fast elliptic curve multiplication algorithm applicable to any type of curve over finite fields Fp (p a prime), together with criteria which make the algorithm resistant against side channel attacks (SCA).
Abstract: This paper proposes a fast elliptic curve multiplication algorithm applicable to any type of curve over finite fields Fp (p a prime), based on [Mon87], together with criteria which make our algorithm resistant against side channel attacks (SCA). The algorithm improves both the addition chain and the addition formula used in the scalar multiplication. Our addition chain requires no table look-up (or only a very small number of pre-computed points), and a prominent property is that it can be implemented in parallel. The computing time for n-bit scalar multiplication is one ECDBL + (n - 1) ECADDs in the parallel case and (n - 1) ECDBLs + (n - 1) ECADDs in the single case. We also propose faster addition formulas which use only the x-coordinates of the points. By combining our addition chain and addition formulas, we establish a faster scalar multiplication resistant against SCA in both single and parallel computation. The improvement of our scalar multiplications over the previous method is about 37% for two processors and 5.7% for a single processor. Our scalar multiplication is suitable for implementation on smart cards.
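The addition chain described here is essentially a Montgomery ladder: every scalar bit triggers the same add/double pattern, which resists timing and power analysis and lets the two operations run on separate processors. A sketch on a toy textbook curve (y² = x³ + x + 1 over F₂₃, full affine points rather than the paper's x-only formulas; the curve and base point are illustrative, not from the paper):

```python
def ec_add(P, Q, a, p):
    # affine addition on y^2 = x^3 + ax + b over F_p (None = point at infinity)
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                               # inverse points (or 2-torsion)
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ladder(k, P, a, p):
    # Montgomery ladder: one add and one double per bit, in a fixed
    # pattern regardless of the bit value; the two ec_add calls per bit
    # are independent and could run on two processors
    R0, R1 = None, P
    for bit in bin(k)[2:]:
        if bit == '0':
            R0, R1 = ec_add(R0, R0, a, p), ec_add(R0, R1, a, p)
        else:
            R0, R1 = ec_add(R0, R1, a, p), ec_add(R1, R1, a, p)
    return R0

# toy curve y^2 = x^3 + x + 1 over F_23 with base point P = (0, 1)
p, a, P = 23, 1, (0, 1)
print(ladder(9, P, a, p))
```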

197 citations


Journal ArticleDOI
TL;DR: The conclusion is that super-linear performance is possible for PEAs, theoretically and in practice, both in homogeneous and in heterogeneous parallel machines.

182 citations


Journal ArticleDOI
TL;DR: A parallel formulation of a multi‐constraint graph‐partitioning algorithm, as well as a new partitioning algorithm for dynamic multi‐phase simulations, which are able to minimize the data redistribution required to balance the load better than a naive scratch‐remap approach.
Abstract: Sequential multi-constraint graph partitioners have been developed to address the static load balancing requirements of multi-phase simulations. These work well when (i) the graph that models the computation fits into the memory of a single processor, and (ii) the simulation does not require dynamic load balancing. The efficient execution of very large or dynamically adapting multi-phase simulations on high-performance parallel computers requires that the multi-constraint partitionings are computed in parallel. This paper presents a parallel formulation of a multi-constraint graph-partitioning algorithm, as well as a new partitioning algorithm for dynamic multi-phase simulations. We describe these algorithms and give experimental results conducted on a 128-processor Cray T3E. These results show that our parallel algorithms are able to efficiently compute partitionings of similar edge-cuts as serial multi-constraint algorithms, and can scale to very large graphs. Our dynamic multi-constraint algorithm is also able to minimize the data redistribution required to balance the load better than a naive scratch-remap approach. We have shown that both of our parallel multi-constraint graph partitioners are as scalable as the widely-used parallel graph partitioner implemented in PARMETIS. Both of our parallel multi-constraint graph partitioners are very fast, as they are able to compute three-constraint 128-way partitionings of a 7.5 million vertex graph in under 7 s on 128 processors of a Cray T3E. Copyright © 2002 John Wiley & Sons, Ltd.

174 citations


Proceedings ArticleDOI
09 Jan 2002
TL;DR: A parallel simulated annealing algorithm to solve the vehicle routing problem with time windows is presented; the empirical evidence indicates that parallel simulated annealing can be applied with success to bicriterion optimization problems.
Abstract: A parallel simulated annealing algorithm to solve the vehicle routing problem with time windows is presented. The objective is to find the best possible solutions to some well-known instances of the problem by using parallelism. The empirical evidence indicates that parallel simulated annealing can be applied with success to bicriterion optimization problems.
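The sequential skeleton being parallelized is plain simulated annealing with a routing-style neighborhood; the toy one-dimensional routing problem, 2-opt move, and geometric cooling schedule below are illustrative stand-ins for the VRPTW instances:

```python
import math, random

random.seed(2)

# toy "route" problem: visit cities on a line; visiting them in sorted
# order is optimal, with length max(cities) - min(cities)
cities = [random.uniform(0, 100) for _ in range(12)]

def tour_len(order):
    return sum(abs(cities[order[i]] - cities[order[i - 1]])
               for i in range(1, len(order)))

def anneal(t0=100.0, cooling=0.999, steps=20000):
    cur = list(range(len(cities)))
    random.shuffle(cur)
    cur_len, t = tour_len(cur), t0
    best, best_len = cur[:], cur_len
    for _ in range(steps):
        i, j = sorted(random.sample(range(len(cities)), 2))
        cand = cur[:i] + cur[i:j + 1][::-1] + cur[j + 1:]   # 2-opt style reversal
        cand_len = tour_len(cand)
        # accept improvements always, worsenings with Boltzmann probability
        if cand_len < cur_len or random.random() < math.exp((cur_len - cand_len) / t):
            cur, cur_len = cand, cand_len
            if cur_len < best_len:
                best, best_len = cur[:], cur_len
        t *= cooling
    return best, best_len

best, best_len = anneal()
optimum = max(cities) - min(cities)
print(best_len, optimum)
```

In the parallel version of such algorithms, several annealing processes run concurrently and periodically exchange their best solutions.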

170 citations


Journal ArticleDOI
TL;DR: In this paper, a multivariate ensemble Kalman filter (MvEnKF) is implemented on a massively parallel computer architecture for the Poseidon ocean circulation model and tested with a Pacific basin model configuration.
Abstract: A multivariate ensemble Kalman filter (MvEnKF) implemented on a massively parallel computer architecture has been developed for the Poseidon ocean circulation model and tested with a Pacific basin model configuration. There are about 2 million prognostic state-vector variables. Parallelism for the data assimilation step is achieved by regionalization of the background-error covariances that are calculated from the phase‐space distribution of the ensemble. Each processing element (PE) collects elements of a matrix measurement functional from nearby PEs. To avoid the introduction of spurious long-range covariances associated with finite ensemble sizes, the background-error covariances are given compact support by means of a Hadamard (element by element) product with a three-dimensional canonical correlation function. The methodology and the MvEnKF implementation are discussed. To verify the proper functioning of the algorithms, results from an initial experiment with in situ temperature data are presented. Furthermore, it is shown that the regionalization of the background covariances has a negligible impact on the quality of the analyses. Even though the parallel algorithm is very efficient for large numbers of observations, individual PE memory, rather than speed, dictates how large an ensemble can be used in practice on a platform with distributed memory.

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of nonpreemptive scheduling to minimize average (weighted) completion time, allowing for release dates, parallel machines, and precedence constraints.
Abstract: We consider the problem of nonpreemptive scheduling to minimize average (weighted) completion time, allowing for release dates, parallel machines, and precedence constraints. Recent work has led to constant-factor approximations for this problem based on solving a preemptive or linear programming relaxation and then using the solution to get an ordering on the jobs. We introduce several new techniques which generalize this basic paradigm. We use these ideas to obtain improved approximation algorithms for one-machine scheduling to minimize average completion time with release dates. In the process, we obtain an optimal randomized on-line algorithm for the same problem that beats a lower bound for deterministic on-line algorithms. We consider extensions to the case of parallel machine scheduling, and for this we introduce two new ideas: first, we show that a preemptive one-machine relaxation is a powerful tool for designing parallel machine scheduling algorithms that simultaneously produce good approximations and have small running times; second, we show that a nongreedy "rounding" of the relaxation yields better approximations than a greedy one. We also prove a general theorem relating the value of one-machine relaxations to that of the schedules obtained for the original m-machine problems. This theorem applies even when there are precedence constraints on the jobs. We apply this result to obtain improved approximation ratios for precedence graphs such as in-trees, out-trees, and series-parallel graphs.
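The job orderings referred to above generalize Smith's weighted-shortest-processing-time rule, which is exactly optimal for a single machine when all release dates are zero; a minimal sketch (the job data is made up, and the brute-force check is only for illustration):

```python
from itertools import permutations  # only for the brute-force check below

def wspt_order(jobs):
    # Smith's rule: schedule in nondecreasing processing_time / weight
    return sorted(range(len(jobs)), key=lambda j: jobs[j][0] / jobs[j][1])

def weighted_completion(jobs, order):
    # sum of weight_j * completion_time_j under the given order
    t = total = 0
    for j in order:
        t += jobs[j][0]
        total += jobs[j][1] * t
    return total

jobs = [(3, 1), (1, 4), (2, 2)]     # (processing_time, weight)
order = wspt_order(jobs)
best = min(weighted_completion(jobs, list(o)) for o in permutations(range(3)))
print(order, weighted_completion(jobs, order), best)
```

With release dates, precedence constraints, or parallel machines this rule is no longer optimal, which is exactly where the paper's relaxation-and-rounding techniques come in.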

Proceedings ArticleDOI
16 Nov 2002
TL;DR: This work presents a parallel algorithm for inverse problems governed by time-dependent PDEs, and scalability results for an inverse wave propagation problem of determining the material field of an acoustic medium.
Abstract: One of the outstanding challenges of computational science and engineering is large-scale nonlinear parameter estimation of systems governed by partial differential equations. These are known as inverse problems, in contradistinction to the forward problems that usually characterize large-scale simulation. Inverse problems are significantly more difficult to solve than forward problems, due to ill-posedness, large dense ill-conditioned operators, multiple minima, space-time coupling, and the need to solve the forward problem repeatedly. We present a parallel algorithm for inverse problems governed by time-dependent PDEs, and scalability results for an inverse wave propagation problem of determining the material field of an acoustic medium. The difficulties mentioned above are addressed through a combination of total variation regularization, preconditioned matrix-free Gauss-Newton-Krylov iteration, algorithmic checkpointing, and multiscale continuation. We are able to solve a synthetic inverse wave propagation problem though a pelvic bone geometry involving 2.1 million inversion parameters in 3 hours on 256 processors of the Terascale Computing System at the Pittsburgh Supercomputing Center.

Proceedings ArticleDOI
07 Aug 2002
TL;DR: A complete, local, and parallel reconfiguration algorithm for metamorphic robots made up of Telecubes, six degree of freedom cube shaped modules currently being developed at PARC is presented.
Abstract: We present a complete, local, and parallel reconfiguration algorithm for metamorphic robots made up of Telecubes, six degree of freedom cube shaped modules currently being developed at PARC. We show that by using 2 /spl times/ 2 /spl times/ 2 meta-modules we can achieve completeness of reconfiguration space using only local rules. Furthermore, this reconfiguration can be done in place and massively in parallel with many simultaneous module movements. Finally we present a loose quadratic upper bound on the total number of module movements required by the algorithm.

Book ChapterDOI
27 Aug 2002
TL;DR: In an implementation of PFCM to cluster a large data set from an insurance company, the proposed algorithm is demonstrated to have almost ideal speedups as well as an excellent scaleup with respect to the size of the data sets.
Abstract: The parallel fuzzy c-means (PFCM) algorithm for clustering large data sets is proposed in this paper. The proposed algorithm is designed to run on parallel computers of the Single Program Multiple Data (SPMD) model type with the Message Passing Interface (MPI). A comparison is made between PFCM and an existing parallel k-means (PKM) algorithm in terms of their parallelisation capability and scalability. In an implementation of PFCM to cluster a large data set from an insurance company, the proposed algorithm is demonstrated to have almost ideal speedups as well as an excellent scaleup with respect to the size of the data sets.
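The per-iteration updates that PFCM distributes over MPI processes are the standard fuzzy c-means membership and center formulas; a sequential sketch on scalar data (the initialization and fixed iteration count are simplifying assumptions, not the paper's scheme):

```python
import random

random.seed(3)

def fcm(data, c=2, m=2.0, iters=50):
    # standard fuzzy c-means; in PFCM each process would apply the same
    # two updates to its own shard of `data` and combine partial sums
    centers = [min(data), max(data)]   # deterministic init (assumes c == 2)
    for _ in range(iters):
        # membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        u = []
        for x in data:
            dists = [abs(x - v) or 1e-12 for v in centers]
            u.append([1.0 / sum((d_i / d_k) ** (2.0 / (m - 1.0)) for d_k in dists)
                      for d_i in dists])
        # center update: mean of the data weighted by u^m
        centers = [sum(u[j][i] ** m * data[j] for j in range(len(data)))
                   / sum(u[j][i] ** m for j in range(len(data)))
                   for i in range(c)]
    return sorted(centers)

# two well-separated clusters around 0 and 10
data = [random.gauss(0, 0.5) for _ in range(50)] + \
       [random.gauss(10, 0.5) for _ in range(50)]
print(fcm(data))
```

The center update is a pair of global sums, which is why the algorithm parallelizes so cleanly: each process reduces its local numerator and denominator, and a single all-reduce per iteration combines them.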

Journal ArticleDOI
TL;DR: This paper uses Java to implement a distributed PGA model, and finds out that heterogeneous computing can be as efficient or even more efficient than homogeneous computing for parallel heuristics.

Journal ArticleDOI
TL;DR: An overview of the forward problem of cardiac electrophysiology is given and the error introduced by solving the equations decoupled is demonstrated.
Abstract: The purpose of this article is to give an overview of the forward problem of cardiac electrophysiology. The relevant models are derived and the mathematical problem formulated. Different solution strategies are discussed. In particular, the error introduced by solving the equations decoupled is demonstrated. Some novel techniques to deal with this problem are presented.

Proceedings ArticleDOI
18 Aug 2002
TL;DR: An efficient off-line scheduling algorithm in which real-time tasks with precedence constraints are executed in a heterogeneous environment is investigated; it provides more features and capabilities than existing algorithms that schedule only independent tasks in real-time homogeneous systems.
Abstract: In this paper, we investigate an efficient off-line scheduling algorithm in which real-time tasks with precedence constraints are executed in a heterogeneous environment. It provides more features and capabilities than existing algorithms that schedule only independent tasks in real-time homogeneous systems. In addition, the proposed algorithm takes the heterogeneities of computation, communication and reliability into account, thereby improving the reliability. To provide fault-tolerant capability, the algorithm employs a primary-backup copy scheme that enables the system to tolerate permanent failures in any single processor. In this scheme, a backup copy is allowed to overlap with other backup copies on the same processor, as long as their corresponding primary copies are allocated to different processors. Tasks are judiciously allocated to processors so as to reduce the schedule length as well as the reliability cost, defined to be the product of processor failure rate and task execution time. In addition, the time for detecting and handling a permanent fault is incorporated into the scheduling scheme, thus making the algorithm more practical. To quantify the combined performance of fault-tolerance and schedulability, the performability measure is introduced. Compared with the existing scheduling algorithms in the literature, our scheduling algorithm achieves an average of 16.4% improvement in reliability and an average of 49.3% improvement in performability.

Journal ArticleDOI
TL;DR: The usefulness of a parallel genetic algorithm for phylogenetic inference under the maximum-likelihood (ML) optimality criterion is investigated and the parallelization strategy appears to be highly effective at improving computation time for large phylogenetic problems using the genetic algorithm.
Abstract: We investigated the usefulness of a parallel genetic algorithm for phylogenetic inference under the maximum-likelihood (ML) optimality criterion. Parallelization was accomplished by assigning each "individual" in the genetic algorithm "population" to a separate processor so that the number of processors used was equal to the size of the evolving population (plus one additional processor for the control of operations). The genetic algorithm incorporated branch-length and topological mutation, recombination, selection on the ML score, and (in some cases) migration and recombination among subpopulations. We tested this parallel genetic algorithm with large (228 taxa) data sets of both empirically observed DNA sequence data (for angiosperms) as well as simulated DNA sequence data. For both observed and simulated data, search-time improvement was nearly linear with respect to the number of processors, so the parallelization strategy appears to be highly effective at improving computation time for large phylogenetic problems using the genetic algorithm. We also explored various ways of optimizing and tuning the parameters of the genetic algorithm. Under the conditions of our analyses, we did not find the best-known solution using the genetic algorithm approach before terminating each run. We discuss some possible limitations of the current implementation of this genetic algorithm as well as of avenues for its future improvement.

Journal ArticleDOI
TL;DR: Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
Abstract: The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen (1969) and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between 10 percent and 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
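A common concrete instance of such recursive layouts is Morton (Z-order) indexing: interleaving the row and column bits makes every recursive quadrant of the matrix contiguous in memory, which is what improves cache behavior. A sketch (the 4x4 example and bit width are illustrative):

```python
def morton_index(row, col, bits=16):
    # interleave row/column bits; consecutive Z-order indices follow the
    # same recursive quadrant order as a recursively blocked matrix
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)
        idx |= ((col >> b) & 1) << (2 * b)
    return idx

# lay out a 4x4 matrix: each 2x2 tile becomes contiguous in memory
layout = sorted((morton_index(r, c), (r, c)) for r in range(4) for c in range(4))
print([rc for _, rc in layout])
```

The paper's conclusion that element-level recursion is counterproductive corresponds to stopping this recursion at small canonically ordered tiles rather than at individual elements.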

Book ChapterDOI
11 Apr 2002
TL;DR: The design of (several variants of) a local parallel model-checking algorithm for the alternation-free fragment of the µ-calculus is described, which exploits a characterisation of the problem for this fragment in terms of two-player games.
Abstract: We describe the design of (several variants of) a local parallel model-checking algorithm for the alternation-free fragment of the µ-calculus. It exploits a characterisation of the problem for this fragment in terms of two-player games. For the corresponding winner, our algorithm determines in parallel a winning strategy, which may be employed for debugging the underlying system interactively, and is designed to run on a network of workstations. Depending on the variant, its complexity is linear or quadratic. A prototype implementation within the verification tool Truth shows promising results in practice.

Proceedings ArticleDOI
16 Sep 2002
TL;DR: The Imagine stream programming system as discussed by the authors is a set of software tools and algorithms used to program media applications in the stream programming model and achieves real-time performance on a variety of media processing applications with high computation rates (4-15 billion achieved operations per second).
Abstract: Media applications, such as image processing, signal processing, video, and graphics, require high computation rates and data bandwidths. The stream programming model is a natural and powerful way to describe these applications. Expressing media applications in this model allows hardware and software systems to take advantage of their concurrency and locality in order to meet their high computational demands. The Imagine stream programming system, a set of software tools and algorithms, is used to program media applications in the stream programming model. We achieve real-time performance on a variety of media processing applications with high computation rates (4-15 billion achieved operations per second) and high efficiency (84-95% occupancy on the arithmetic clusters).

Journal ArticleDOI
TL;DR: A parallel algorithm for maximum a posteriori (MAP) decoders that divides a whole noisy codeword into sub-blocks and uses the forward and backward variables computed in the previous iteration to provide boundary distributions for each sub-block MAP decoder.
Abstract: To reduce the computational decoding delay of turbo codes, we propose a parallel algorithm for maximum a posteriori (MAP) decoders. We divide a whole noisy codeword into sub-blocks and use multiple processors to perform sub-block MAP decoding in parallel. Unlike the previously proposed approach with sub-block overlapping, we utilize the forward and backward variables computed in the previous iteration to provide boundary distributions for each sub-block MAP decoder. Our scheme depicts asymptotically optimal performance in the sense that the BER is the same as that of the regular turbo decoder.

Journal ArticleDOI
TL;DR: An extended TDS algorithm is proposed whose optimality condition is less restricted and where the length of the generated schedule is shorter than the original T DS algorithm.
Abstract: Under the condition that the communication time is relatively shorter than the computation time for a given task, the task duplication-based scheduling (TDS) algorithm proposed by S. Darbha and D.P. Agrawal (1998) generates an optimal schedule. In this paper, we propose an extended TDS algorithm whose optimality condition is less restricted and where the length of the generated schedule is shorter than the original TDS algorithm.

Journal ArticleDOI
TL;DR: In this article, a new algorithm for performing parallel Sn sweeps on unstructured meshes is developed, which uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned m...
Abstract: A new algorithm for performing parallel Sn sweeps on unstructured meshes is developed. The algorithm uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned m...

Proceedings ArticleDOI
23 Oct 2002
TL;DR: An improved version of the Bi CGStab (IBiCGStab) method for the solutions of large and sparse linear systems of equations with unsymmetric coefficient matrices is proposed, which combines elements of numerical stability and parallel algorithm design without increasing the computational costs.
Abstract: In this paper, an improved version of the BiCGStab (IBiCGStab) method for the solutions of large and sparse linear systems of equations with unsymmetric coefficient matrices is proposed. The method combines elements of numerical stability and parallel algorithm design without increasing the computational costs. The algorithm is derived such that all inner products of a single iteration step are independent and the communication time required for the inner products can be overlapped efficiently with the computation time of vector updates. Therefore, the cost of global communication, which represents the bottleneck of the parallel performance, can be significantly reduced. The resulting IBiCGStab algorithm maintains the favorable properties of the original method while not increasing computational costs. A data distribution suitable for both irregularly and regularly structured matrices, based on the analysis of the nonzero matrix elements, is presented. The communication scheme is supported by overlapping execution of computation and communication to reduce waiting times. The efficiency of this method is demonstrated by numerical experimental results carried out on a massively parallel distributed memory system.
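For orientation, the baseline BiCGStab iteration that IBiCGStab restructures looks as follows. This is the textbook sequential form on a small dense system, without the paper's reordered, overlappable inner products; note the two inner products per half-step that become global reductions in parallel:

```python
def bicgstab(A, b, tol=1e-10, max_iter=100):
    # textbook (van der Vorst) BiCGStab on dense lists; no preconditioning
    n = len(b)
    matvec = lambda u: [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    dot = lambda u, w: sum(x * y for x, y in zip(u, w))
    x = [0.0] * n
    r = b[:]                      # r0 = b - A*x0 with x0 = 0
    r_hat = r[:]                  # shadow residual
    rho = alpha = omega = 1.0
    p = [0.0] * n
    v = [0.0] * n
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = [r[i] + beta * (p[i] - omega * v[i]) for i in range(n)]
        v = matvec(p)
        alpha = rho / dot(r_hat, v)
        s = [r[i] - alpha * v[i] for i in range(n)]
        if dot(s, s) ** 0.5 < tol:              # half-step already converged
            x = [x[i] + alpha * p[i] for i in range(n)]
            break
        t = matvec(s)
        omega = dot(t, s) / dot(t, t)
        x = [x[i] + alpha * p[i] + omega * s[i] for i in range(n)]
        r = [s[i] - omega * t[i] for i in range(n)]
        if dot(r, r) ** 0.5 < tol:
            break
    return x

A = [[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
b = [1.0, 2.0, 3.0]
x = bicgstab(A, b)
print(x)
```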

Journal ArticleDOI
TL;DR: The proposed parallel micro genetic algorithm (PMGA) is shown to be viable to the online implementation of the constrained ED due to substantial generator fuel cost savings and high speedup upper bounds.
Abstract: This paper proposes a parallel micro genetic algorithm (PMGA) for solving ramp rate constrained economic dispatch (ED) problems for generating units with nonmonotonically and monotonically increasing incremental cost (IC) functions. The developed PMGA algorithm is implemented on the thirty-two-processor Beowulf cluster with ethernet switches network on the systems with the number of generating units ranging from 10 to 80 over the entire dispatch periods. The PMGA algorithm carefully schedules its processors, computational loads, and synchronization overhead for the best performance. The speedup upper bounds and the synchronization overheads on the Beowulf cluster are shown on different system sizes and different migration frequencies. The proposed PMGA is shown to be viable to the online implementation of the constrained ED due to substantial generator fuel cost savings and high speedup upper bounds.

Journal ArticleDOI
TL;DR: This work presents a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph that is optimal w.r.t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures.
Abstract: We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r. t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures. We also give a simple, general processor allocation scheme for tree-like computations.
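Parallel MSF algorithms, including randomized ones like this, typically build on Borůvka steps: in each round every component selects its minimum-weight incident edge, and all selections can proceed independently, which is the source of the parallelism. A sequential sketch of that building block (not the paper's optimal EREW algorithm):

```python
def boruvka_msf(n, edges):
    # edges: list of (u, v, weight); returns (forest_edges, total_weight).
    # Each round, every component picks its cheapest outgoing edge; the
    # number of components at least halves per round.
    parent = list(range(n))
    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest, total = [], 0.0
    while True:
        cheapest = {}
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][2]:
                    cheapest[r] = (u, v, w)
        if not cheapest:
            return forest, total
        for u, v, w in cheapest.values():
            ru, rv = find(u), find(v)   # re-check: earlier unions may merge
            if ru != rv:
                parent[ru] = rv
                forest.append((u, v, w))
                total += w

# two components: a 4-vertex graph plus an isolated edge
edges = [(0, 1, 4), (0, 2, 3), (1, 2, 1), (1, 3, 2), (2, 3, 5), (4, 5, 1)]
forest, total = boruvka_msf(6, edges)
print(total)
```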

Journal ArticleDOI
01 Aug 2002
TL;DR: An overview of the recent research in video compression using parallel processing is presented, outlining the basic philosophy of each approach and providing examples, and suggesting future research directions.
Abstract: Driven by the rapidly increasing demand for audio-visual applications, digital video compression technology has become a mature field, offering several available products based on both hardware and software implementations. Taking advantage of spatial, temporal, and statistical redundancies in video data, a video compression system aims to maximize the compression ratio while maintaining a high picture quality. Despite the tremendous progress in this area, video compression remains a challenging research problem due to its computational requirements and also because of the need for higher picture quality at lower data rates. Designing efficient coding algorithms continues to be a prolific area of research. To circumvent the computational requirements, researchers have resorted to parallel processing with a variety of approaches using dedicated parallel VLSI architectures as well as software on general-purpose available multiprocessor systems. Despite the availability of fast single processors, parallel processing helps to explore advanced algorithms and to build more sophisticated systems. This paper presents an overview of the recent research in video compression using parallel processing. The paper provides a discussion of the basic compression techniques, existing video coding standards, and various parallelization approaches. Since video compression is multi-step in nature using various algorithms, parallel processing can be exploited at an individual algorithm or at a complete system level. The paper covers a broad spectrum of such approaches, outlining the basic philosophy of each approach and providing examples. We contrast these approaches when possible, highlight their pros and cons, and suggest future research directions. While the emphasis of this paper is on software-based methods, a significant discussion of hardware and VLSI is also included.

Proceedings ArticleDOI
27 Oct 2002
TL;DR: An external memory algorithm for fast display of very large and complex geometric environments and a novel prioritized prefetching technique that takes into account LOD-switching and visibility-based events between successive frames is presented.
Abstract: We present an external memory algorithm for fast display of very large and complex geometric environments. We represent the model using a scene graph and employ different culling techniques for rendering acceleration. Our algorithm uses a parallel approach to render the scene as well as fetch objects from the disk in a synchronous manner. We present a novel prioritized prefetching technique that takes into account LOD-switching and visibility-based events between successive frames. We have applied our algorithm to large gigabyte sized environments that are composed of thousands of objects and tens of millions of polygons. The memory overhead of our algorithm is output sensitive and is typically tens of megabytes. In practice, our approach scales with the model sizes, and its rendering performance is comparable to that of an in-core algorithm.