
Showing papers on "Parallel algorithm published in 2002"


01 Jul 2002
TL;DR: In this paper, a parallel implementation of Metropolis-Coupled Markov Chain Monte Carlo (MCMC) has been proposed to explore multiple peaks in the posterior distribution of trees while maintaining a fast execution time.
Abstract: Bayesian estimation of phylogeny is based on the posterior probability distribution of trees. Currently, the only numerical method that can effectively approximate posterior probabilities of trees is Markov Chain Monte Carlo (MCMC). Standard implementations of MCMC can be prone to entrapment in local optima. A variant of MCMC, known as Metropolis-Coupled MCMC, allows multiple peaks in the landscape of trees to be more readily explored, but at the cost of increased execution time. This paper presents a parallel algorithm for Metropolis-Coupled MCMC. The proposed parallel algorithm retains the ability to explore multiple peaks in the posterior distribution of trees while maintaining a fast execution time. The algorithm has been implemented using two parallel programming models: the Message Passing Interface (MPI) and the Cashmere software distributed shared memory protocol. Performance results indicate nearly linear speed improvement in both programming models for small and large data sets. (MrBayes v3.0 is available at http://morphbank.ebc.uu.se/mrbayes/.)
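The chain-swapping idea at the core of Metropolis-coupled MCMC is easy to sketch: several chains sample increasingly flattened ("heated") versions of the posterior, and adjacent chains periodically propose to swap states, letting the cold chain hop between peaks. The toy bimodal target, temperature schedule, and proposal width below are illustrative assumptions, not MrBayes's actual model:

```python
import math, random

random.seed(0)

def log_target(x):
    # toy bimodal "posterior": equal mixture of Gaussians at -2 and +2
    a, b = -(x - 2.0) ** 2, -(x + 2.0) ** 2
    return max(a, b) + math.log1p(math.exp(min(a, b) - max(a, b)))

def mc3(n_chains=4, n_steps=5000, heat=0.5):
    # chain i samples target^beta_i with beta_i = 1/(1 + heat*i);
    # only the cold chain (i = 0, beta = 1) is recorded
    betas = [1.0 / (1.0 + heat * i) for i in range(n_chains)]
    xs = [0.0] * n_chains
    samples = []
    for _ in range(n_steps):
        for i in range(n_chains):                 # one Metropolis step per chain
            prop = xs[i] + random.gauss(0.0, 1.0)
            delta = betas[i] * (log_target(prop) - log_target(xs[i]))
            if random.random() < math.exp(min(0.0, delta)):
                xs[i] = prop
        i = random.randrange(n_chains - 1)        # propose swapping chains i, i+1
        log_r = (betas[i] - betas[i + 1]) * (log_target(xs[i + 1]) - log_target(xs[i]))
        if random.random() < math.exp(min(0.0, log_r)):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
        samples.append(xs[0])
    return samples

samples = mc3()
print(min(samples), max(samples))   # the cold chain should visit both modes
```

In the parallel algorithm each chain runs on its own processor; only the swap step requires communication, which is why near-linear speedup is achievable.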

965 citations


Journal ArticleDOI
TL;DR: This paper describes an implementation of a parallel AMG code, using the algorithm of A.J. Cleary, and considers three basic coarsening schemes and certain modifications to the basic schemes, designed to address specific performance issues.

849 citations


Patent
25 Feb 2002
TL;DR: In this paper, a massively parallel supercomputer of hundreds of teraOPS scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC); within each ASIC node is a plurality of processing elements, each of which consists of a central processing unit (CPU) and a plurality of floating point processors.
Abstract: A novel massively parallel supercomputer of hundreds of teraOPS scale includes node architectures based upon System-On-a-Chip technology, i.e., each processing node comprises a single Application Specific Integrated Circuit (ASIC). Within each ASIC node is a plurality of processing elements, each of which consists of a central processing unit (CPU) and a plurality of floating point processors to enable an optimal balance of computational performance, packaging density, low cost, and power and cooling requirements. The plurality of processors within a single node may be used individually or simultaneously to work on any combination of computation or communication as required by the particular algorithm being solved or executed at any point in time. The system-on-a-chip ASIC nodes are interconnected by multiple independent networks that maximize packet-communication throughput and minimize latency. In the preferred embodiment, the multiple networks include three high-speed networks for parallel algorithm message passing: a Torus, a Global Tree, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. For particular classes of parallel algorithms, or parts of parallel calculations, this architecture exhibits exceptional computational performance, and may be enabled to perform calculations for new classes of parallel algorithms. Additional networks are provided for external connectivity and are used for Input/Output, System Management and Configuration, and Debug and Monitoring functions. Special node packaging techniques implementing midplanes and other hardware devices facilitate partitioning of the supercomputer into multiple networks for optimizing supercomputing resources.

329 citations


Book ChapterDOI
21 Apr 2002
TL;DR: Experiments demonstrate that a wide set of unsymmetric linear systems can be solved and high performance is consistently achieved for large sparse unsymmetric matrices from real world applications.
Abstract: Supernode pivoting for unsymmetric matrices coupled with supernode partitioning and asynchronous computation can achieve high gigaflop rates for parallel sparse LU factorization on shared memory parallel computers. Progress in weighted graph matching algorithms helps to extend these concepts further, and prepermutation of rows is used to place large matrix entries on the diagonal. Supernode pivoting allows dynamic interchanges of columns and rows during the factorization process while retaining BLAS-3 level efficiency. An enhanced left-right looking scheduling scheme results in good speedup on SMP machines without increasing the operation count. These algorithms have been integrated into the recent unsymmetric version of the PARDISO solver. Experiments demonstrate that a wide set of unsymmetric linear systems can be solved and high performance is consistently achieved for large sparse unsymmetric matrices from real world applications.

323 citations


Journal ArticleDOI
TL;DR: The Intelligent Particle Swarm Optimization (IPSO) algorithm as mentioned in this paper uses concepts such as group experiences, unpleasant memories (tabu to be avoided), local landscape models based on virtual neighbors, and memetic replication of successful behavior parameters.
Abstract: The paper describes a new stochastic heuristic algorithm for global optimization. The new optimization algorithm, called intelligent-particle swarm optimization (IPSO), offers more intelligence to particles by using concepts such as: group experiences, unpleasant memories (tabu to be avoided), local landscape models based on virtual neighbors, and memetic replication of successful behavior parameters. The new individual complexity is amplified at the group level and consequently generates a more efficient optimization procedure. A simplified version of the IPSO algorithm was implemented and compared with the classical PSO algorithm for a simple test function and for Loney's solenoid.
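For reference, the classical global-best PSO update that IPSO builds on can be sketched as follows; the inertia and acceleration coefficients are common textbook values, and none of IPSO's extensions (tabu memories, virtual neighbors, memetic parameter replication) are implemented here:

```python
import random

random.seed(1)

def pso(f, dim=2, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5):
    # classical PSO: each particle tracks its personal best and is pulled
    # toward both it and the swarm's global best
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# minimize the 2-D sphere function as a stand-in for the paper's test problems
best, best_val = pso(lambda p: sum(x * x for x in p))
print(best_val)
```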

276 citations


Journal ArticleDOI
TL;DR: ROSS demonstrates for the first time that stable, highly efficient execution using little memory above what the sequential model would require is possible for low-event granularity simulation models.

214 citations


Book ChapterDOI
12 Feb 2002
TL;DR: This paper proposes a fast elliptic curve multiplication algorithm applicable to any type of curve over finite fields Fp (p a prime), together with criteria which make the algorithm resistant against side channel attacks (SCA).
Abstract: This paper proposes a fast elliptic curve multiplication algorithm applicable to any type of curve over finite fields Fp (p a prime), based on [Mon87], together with criteria which make our algorithm resistant against side channel attacks (SCA). The algorithm improves both the addition chain and the addition formula used in the scalar multiplication. Our addition chain requires no table look-up (or only a very small number of pre-computed points), and a prominent property is that it can be implemented in parallel. The computing time for n-bit scalar multiplication is one ECDBL + (n - 1) ECADDs in the parallel case and (n - 1) ECDBLs + (n - 1) ECADDs in the single case. We also propose faster addition formulas which use only the x-coordinates of the points. By combining our addition chain and addition formulas, we establish a faster scalar multiplication resistant against SCA in both single and parallel computation. The improvement of our scalar multiplications over the previous method is about 37% for two processors and 5.7% for a single processor. Our scalar multiplication is suitable for implementation on smart cards.
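The addition chain described here is essentially a Montgomery ladder: every scalar bit triggers the same add/double pattern, which resists timing and power analysis and lets the two operations run on separate processors. A sketch on a toy textbook curve (y² = x³ + x + 1 over F₂₃, full affine points rather than the paper's x-only formulas; the curve and base point are illustrative, not from the paper):

```python
def ec_add(P, Q, a, p):
    # affine addition on y^2 = x^3 + ax + b over F_p (None = point at infinity)
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                               # inverse points (or 2-torsion)
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ladder(k, P, a, p):
    # Montgomery ladder: one add and one double per bit, in a fixed
    # pattern regardless of the bit value; the two ec_add calls per bit
    # are independent and could run on two processors
    R0, R1 = None, P
    for bit in bin(k)[2:]:
        if bit == '0':
            R0, R1 = ec_add(R0, R0, a, p), ec_add(R0, R1, a, p)
        else:
            R0, R1 = ec_add(R0, R1, a, p), ec_add(R1, R1, a, p)
    return R0

# toy curve y^2 = x^3 + x + 1 over F_23 with base point P = (0, 1)
p, a, P = 23, 1, (0, 1)
print(ladder(9, P, a, p))
```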

197 citations


Journal ArticleDOI
TL;DR: The conclusion is that super-linear performance is possible for PEAs, theoretically and in practice, both in homogeneous and in heterogeneous parallel machines.

182 citations


Journal ArticleDOI
TL;DR: A parallel formulation of a multi‐constraint graph‐partitioning algorithm, as well as a new partitioning algorithm for dynamic multi‐phase simulations, which are able to minimize the data redistribution required to balance the load better than a naive scratch‐remap approach.
Abstract: Sequential multi-constraint graph partitioners have been developed to address the static load balancing requirements of multi-phase simulations. These work well when (i) the graph that models the computation fits into the memory of a single processor, and (ii) the simulation does not require dynamic load balancing. The efficient execution of very large or dynamically adapting multi-phase simulations on high-performance parallel computers requires that the multi-constraint partitionings are computed in parallel. This paper presents a parallel formulation of a multi-constraint graph-partitioning algorithm, as well as a new partitioning algorithm for dynamic multi-phase simulations. We describe these algorithms and give experimental results conducted on a 128-processor Cray T3E. These results show that our parallel algorithms are able to efficiently compute partitionings of similar edge-cuts as serial multi-constraint algorithms, and can scale to very large graphs. Our dynamic multi-constraint algorithm is also able to minimize the data redistribution required to balance the load better than a naive scratch-remap approach. We have shown that both of our parallel multi-constraint graph partitioners are as scalable as the widely-used parallel graph partitioner implemented in PARMETIS. Both of our parallel multi-constraint graph partitioners are very fast, as they are able to compute three-constraint 128-way partitionings of a 7.5 million vertex graph in under 7 s on 128 processors of a Cray T3E. Copyright © 2002 John Wiley & Sons, Ltd.

174 citations


Proceedings ArticleDOI
09 Jan 2002
TL;DR: A parallel simulated annealing algorithm to solve the vehicle routing problem with time windows is presented; the empirical evidence indicates that parallel simulated annealing can be applied with success to bicriterion optimization problems.
Abstract: A parallel simulated annealing algorithm to solve the vehicle routing problem with time windows is presented. The objective is to find the best possible solutions to some well-known instances of the problem by using parallelism. The empirical evidence indicates that parallel simulated annealing can be applied with success to bicriterion optimization problems.
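The sequential skeleton being parallelized is plain simulated annealing with a routing-style neighborhood; the toy one-dimensional routing problem, 2-opt move, and geometric cooling schedule below are illustrative stand-ins for the VRPTW instances:

```python
import math, random

random.seed(2)

# toy "route" problem: visit cities on a line; visiting them in sorted
# order is optimal, with length max(cities) - min(cities)
cities = [random.uniform(0, 100) for _ in range(12)]

def tour_len(order):
    return sum(abs(cities[order[i]] - cities[order[i - 1]])
               for i in range(1, len(order)))

def anneal(t0=100.0, cooling=0.999, steps=20000):
    cur = list(range(len(cities)))
    random.shuffle(cur)
    cur_len, t = tour_len(cur), t0
    best, best_len = cur[:], cur_len
    for _ in range(steps):
        i, j = sorted(random.sample(range(len(cities)), 2))
        cand = cur[:i] + cur[i:j + 1][::-1] + cur[j + 1:]   # 2-opt style reversal
        cand_len = tour_len(cand)
        # accept improvements always, worsenings with Boltzmann probability
        if cand_len < cur_len or random.random() < math.exp((cur_len - cand_len) / t):
            cur, cur_len = cand, cand_len
            if cur_len < best_len:
                best, best_len = cur[:], cur_len
        t *= cooling
    return best, best_len

best, best_len = anneal()
optimum = max(cities) - min(cities)
print(best_len, optimum)
```

In the parallel version of such algorithms, several annealing processes run concurrently and periodically exchange their best solutions.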

170 citations


Journal ArticleDOI
TL;DR: In this paper, a multivariate ensemble Kalman filter (MvEnKF) is implemented on a massively parallel computer architecture for the Poseidon ocean circulation model and tested with a Pacific basin model configuration.
Abstract: A multivariate ensemble Kalman filter (MvEnKF) implemented on a massively parallel computer architecture has been developed for the Poseidon ocean circulation model and tested with a Pacific basin model configuration. There are about 2 million prognostic state-vector variables. Parallelism for the data assimilation step is achieved by regionalization of the background-error covariances that are calculated from the phase‐space distribution of the ensemble. Each processing element (PE) collects elements of a matrix measurement functional from nearby PEs. To avoid the introduction of spurious long-range covariances associated with finite ensemble sizes, the background-error covariances are given compact support by means of a Hadamard (element by element) product with a three-dimensional canonical correlation function. The methodology and the MvEnKF implementation are discussed. To verify the proper functioning of the algorithms, results from an initial experiment with in situ temperature data are presented. Furthermore, it is shown that the regionalization of the background covariances has a negligible impact on the quality of the analyses. Even though the parallel algorithm is very efficient for large numbers of observations, individual PE memory, rather than speed, dictates how large an ensemble can be used in practice on a platform with distributed memory.

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of nonpreemptive scheduling to minimize average (weighted) completion time, allowing for release dates, parallel machines, and precedence constraints.
Abstract: We consider the problem of nonpreemptive scheduling to minimize average (weighted) completion time, allowing for release dates, parallel machines, and precedence constraints. Recent work has led to constant-factor approximations for this problem based on solving a preemptive or linear programming relaxation and then using the solution to get an ordering on the jobs. We introduce several new techniques which generalize this basic paradigm. We use these ideas to obtain improved approximation algorithms for one-machine scheduling to minimize average completion time with release dates. In the process, we obtain an optimal randomized on-line algorithm for the same problem that beats a lower bound for deterministic on-line algorithms. We consider extensions to the case of parallel machine scheduling, and for this we introduce two new ideas: first, we show that a preemptive one-machine relaxation is a powerful tool for designing parallel machine scheduling algorithms that simultaneously produce good approximations and have small running times; second, we show that a nongreedy "rounding" of the relaxation yields better approximations than a greedy one. We also prove a general theorem relating the value of one-machine relaxations to that of the schedules obtained for the original m-machine problems. This theorem applies even when there are precedence constraints on the jobs. We apply this result to obtain improved approximation ratios for precedence graphs such as in-trees, out-trees, and series-parallel graphs.
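The job orderings referred to above generalize Smith's weighted-shortest-processing-time rule, which is exactly optimal for a single machine when all release dates are zero; a minimal sketch (the job data is made up, and the brute-force check is only for illustration):

```python
from itertools import permutations  # only for the brute-force check below

def wspt_order(jobs):
    # Smith's rule: schedule in nondecreasing processing_time / weight
    return sorted(range(len(jobs)), key=lambda j: jobs[j][0] / jobs[j][1])

def weighted_completion(jobs, order):
    # sum of weight_j * completion_time_j under the given order
    t = total = 0
    for j in order:
        t += jobs[j][0]
        total += jobs[j][1] * t
    return total

jobs = [(3, 1), (1, 4), (2, 2)]     # (processing_time, weight)
order = wspt_order(jobs)
best = min(weighted_completion(jobs, list(o)) for o in permutations(range(3)))
print(order, weighted_completion(jobs, order), best)
```

With release dates, precedence constraints, or parallel machines this rule is no longer optimal, which is exactly where the paper's relaxation-and-rounding techniques come in.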

Proceedings ArticleDOI
16 Nov 2002
TL;DR: This work presents a parallel algorithm for inverse problems governed by time-dependent PDEs, and scalability results for an inverse wave propagation problem of determining the material field of an acoustic medium.
Abstract: One of the outstanding challenges of computational science and engineering is large-scale nonlinear parameter estimation of systems governed by partial differential equations. These are known as inverse problems, in contradistinction to the forward problems that usually characterize large-scale simulation. Inverse problems are significantly more difficult to solve than forward problems, due to ill-posedness, large dense ill-conditioned operators, multiple minima, space-time coupling, and the need to solve the forward problem repeatedly. We present a parallel algorithm for inverse problems governed by time-dependent PDEs, and scalability results for an inverse wave propagation problem of determining the material field of an acoustic medium. The difficulties mentioned above are addressed through a combination of total variation regularization, preconditioned matrix-free Gauss-Newton-Krylov iteration, algorithmic checkpointing, and multiscale continuation. We are able to solve a synthetic inverse wave propagation problem though a pelvic bone geometry involving 2.1 million inversion parameters in 3 hours on 256 processors of the Terascale Computing System at the Pittsburgh Supercomputing Center.

Proceedings ArticleDOI
07 Aug 2002
TL;DR: A complete, local, and parallel reconfiguration algorithm for metamorphic robots made up of Telecubes, six degree of freedom cube shaped modules currently being developed at PARC is presented.
Abstract: We present a complete, local, and parallel reconfiguration algorithm for metamorphic robots made up of Telecubes, six degree of freedom cube shaped modules currently being developed at PARC. We show that by using 2 /spl times/ 2 /spl times/ 2 meta-modules we can achieve completeness of reconfiguration space using only local rules. Furthermore, this reconfiguration can be done in place and massively in parallel with many simultaneous module movements. Finally we present a loose quadratic upper bound on the total number of module movements required by the algorithm.

Book ChapterDOI
27 Aug 2002
TL;DR: In an implementation of PFCM to cluster a large data set from an insurance company, the proposed algorithm is demonstrated to have almost ideal speedups as well as an excellent scaleup with respect to the size of the data sets.
Abstract: The parallel fuzzy c-means (PFCM) algorithm for clustering large data sets is proposed in this paper. The proposed algorithm is designed to run on parallel computers of the Single Program Multiple Data (SPMD) model type with the Message Passing Interface (MPI). A comparison is made between PFCM and an existing parallel k-means (PKM) algorithm in terms of their parallelisation capability and scalability. In an implementation of PFCM to cluster a large data set from an insurance company, the proposed algorithm is demonstrated to have almost ideal speedups as well as an excellent scaleup with respect to the size of the data sets.
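The per-iteration updates that PFCM distributes over MPI processes are the standard fuzzy c-means membership and center formulas; a sequential sketch on scalar data (the initialization and fixed iteration count are simplifying assumptions, not the paper's scheme):

```python
import random

random.seed(3)

def fcm(data, c=2, m=2.0, iters=50):
    # standard fuzzy c-means; in PFCM each process would apply the same
    # two updates to its own shard of `data` and combine partial sums
    centers = [min(data), max(data)]   # deterministic init (assumes c == 2)
    for _ in range(iters):
        # membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        u = []
        for x in data:
            dists = [abs(x - v) or 1e-12 for v in centers]
            u.append([1.0 / sum((d_i / d_k) ** (2.0 / (m - 1.0)) for d_k in dists)
                      for d_i in dists])
        # center update: mean of the data weighted by u^m
        centers = [sum(u[j][i] ** m * data[j] for j in range(len(data)))
                   / sum(u[j][i] ** m for j in range(len(data)))
                   for i in range(c)]
    return sorted(centers)

# two well-separated clusters around 0 and 10
data = [random.gauss(0, 0.5) for _ in range(50)] + \
       [random.gauss(10, 0.5) for _ in range(50)]
print(fcm(data))
```

The center update is a pair of global sums, which is why the algorithm parallelizes so cleanly: each process reduces its local numerator and denominator, and a single all-reduce per iteration combines them.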

Journal ArticleDOI
TL;DR: This paper uses Java to implement a distributed PGA model, and finds out that heterogeneous computing can be as efficient or even more efficient than homogeneous computing for parallel heuristics.

Journal ArticleDOI
TL;DR: An overview of the forward problem of cardiac electrophysiology is given and the error introduced by solving the equations decoupled is demonstrated.
Abstract: The purpose of this article is to give an overview of the forward problem of cardiac electrophysiology. The relevant models are derived and the mathematical problem formulated. Different solution strategies are discussed. In particular, the error introduced by solving the equations decoupled is demonstrated. Some novel techniques to deal with this problem are presented.

Proceedings ArticleDOI
18 Aug 2002
TL;DR: An efficient off-line scheduling algorithm in which real-time tasks with precedence constraints are executed in a heterogeneous environment is investigated; it provides more features and capabilities than existing algorithms that schedule only independent tasks in real-time homogeneous systems.
Abstract: In this paper, we investigate an efficient off-line scheduling algorithm in which real-time tasks with precedence constraints are executed in a heterogeneous environment. It provides more features and capabilities than existing algorithms that schedule only independent tasks in real-time homogeneous systems. In addition, the proposed algorithm takes the heterogeneities of computation, communication and reliability into account, thereby improving the reliability. To provide fault-tolerant capability, the algorithm employs a primary-backup copy scheme that enables the system to tolerate permanent failures in any single processor. In this scheme, a backup copy is allowed to overlap with other backup copies on the same processor, as long as their corresponding primary copies are allocated to different processors. Tasks are judiciously allocated to processors so as to reduce the schedule length as well as the reliability cost, defined to be the product of processor failure rate and task execution time. In addition, the time for detecting and handling a permanent fault is incorporated into the scheduling scheme, thus making the algorithm more practical. To quantify the combined performance of fault-tolerance and schedulability, the performability measure is introduced. Compared with the existing scheduling algorithms in the literature, our scheduling algorithm achieves an average of 16.4% improvement in reliability and an average of 49.3% improvement in performability.

Journal ArticleDOI
TL;DR: The usefulness of a parallel genetic algorithm for phylogenetic inference under the maximum-likelihood (ML) optimality criterion is investigated and the parallelization strategy appears to be highly effective at improving computation time for large phylogenetic problems using the genetic algorithm.
Abstract: We investigated the usefulness of a parallel genetic algorithm for phylogenetic inference under the maximum-likelihood (ML) optimality criterion. Parallelization was accomplished by assigning each "individual" in the genetic algorithm "population" to a separate processor so that the number of processors used was equal to the size of the evolving population (plus one additional processor for the control of operations). The genetic algorithm incorporated branch-length and topological mutation, recombination, selection on the ML score, and (in some cases) migration and recombination among subpopulations. We tested this parallel genetic algorithm with large (228 taxa) data sets of both empirically observed DNA sequence data (for angiosperms) as well as simulated DNA sequence data. For both observed and simulated data, search-time improvement was nearly linear with respect to the number of processors, so the parallelization strategy appears to be highly effective at improving computation time for large phylogenetic problems using the genetic algorithm. We also explored various ways of optimizing and tuning the parameters of the genetic algorithm. Under the conditions of our analyses, we did not find the best-known solution using the genetic algorithm approach before terminating each run. We discuss some possible limitations of the current implementation of this genetic algorithm as well as of avenues for its future improvement.

Journal ArticleDOI
TL;DR: Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
Abstract: The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen (1969) and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between 10 percent and 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
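A common concrete instance of such recursive layouts is Morton (Z-order) indexing: interleaving the row and column bits makes every recursive quadrant of the matrix contiguous in memory, which is what improves cache behavior. A sketch (the 4x4 example and bit width are illustrative):

```python
def morton_index(row, col, bits=16):
    # interleave row/column bits; consecutive Z-order indices follow the
    # same recursive quadrant order as a recursively blocked matrix
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)
        idx |= ((col >> b) & 1) << (2 * b)
    return idx

# lay out a 4x4 matrix: each 2x2 tile becomes contiguous in memory
layout = sorted((morton_index(r, c), (r, c)) for r in range(4) for c in range(4))
print([rc for _, rc in layout])
```

The paper's conclusion that element-level recursion is counterproductive corresponds to stopping this recursion at small canonically ordered tiles rather than at individual elements.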

Book ChapterDOI
11 Apr 2002
TL;DR: The design of (several variants of) a local parallel model-checking algorithm for the alternation-free fragment of the µ-calculus is described, which exploits a characterisation of the problem for this fragment in terms of two-player games.
Abstract: We describe the design of (several variants of) a local parallel model-checking algorithm for the alternation-free fragment of the µ-calculus. It exploits a characterisation of the problem for this fragment in terms of two-player games. For the corresponding winner, our algorithm determines in parallel a winning strategy, which may be employed for debugging the underlying system interactively, and is designed to run on a network of workstations. Depending on the variant, its complexity is linear or quadratic. A prototype implementation within the verification tool Truth shows promising results in practice.

Proceedings ArticleDOI
16 Sep 2002
TL;DR: The Imagine stream programming system as discussed by the authors is a set of software tools and algorithms used to program media applications in the stream programming model and achieves real-time performance on a variety of media processing applications with high computation rates (4-15 billion achieved operations per second).
Abstract: Media applications, such as image processing, signal processing, video, and graphics, require high computation rates and data bandwidths. The stream programming model is a natural and powerful way to describe these applications. Expressing media applications in this model allows hardware and software systems to take advantage of their concurrency and locality in order to meet their high computational demands. The Imagine stream programming system, a set of software tools and algorithms, is used to program media applications in the stream programming model. We achieve real-time performance on a variety of media processing applications with high computation rates (4-15 billion achieved operations per second) and high efficiency (84-95% occupancy on the arithmetic clusters).

Journal ArticleDOI
TL;DR: A parallel algorithm for maximum a posteriori (MAP) decoders that divides a whole noisy codeword into sub-blocks and uses the forward and backward variables computed in the previous iteration to provide boundary distributions for each sub-block MAP decoder.
Abstract: To reduce the computational decoding delay of turbo codes, we propose a parallel algorithm for maximum a posteriori (MAP) decoders. We divide a whole noisy codeword into sub-blocks and use multiple processors to perform sub-block MAP decoding in parallel. Unlike the previously proposed approach with sub-block overlapping, we utilize the forward and backward variables computed in the previous iteration to provide boundary distributions for each sub-block MAP decoder. Our scheme depicts asymptotically optimal performance in the sense that the BER is the same as that of the regular turbo decoder.

Journal ArticleDOI
TL;DR: An extended TDS algorithm is proposed whose optimality condition is less restricted and where the length of the generated schedule is shorter than the original T DS algorithm.
Abstract: Under the condition that the communication time is relatively shorter than the computation time for a given task, the task duplication-based scheduling (TDS) algorithm proposed by S. Darbha and D.P. Agrawal (1998) generates an optimal schedule. In this paper, we propose an extended TDS algorithm whose optimality condition is less restricted and where the length of the generated schedule is shorter than the original TDS algorithm.

Journal ArticleDOI
TL;DR: In this article, a new algorithm for performing parallel Sn sweeps on unstructured meshes is developed, which uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned m...
Abstract: A new algorithm for performing parallel Sn sweeps on unstructured meshes is developed. The algorithm uses a low-complexity list ordering heuristic to determine a sweep ordering on any partitioned m...

Proceedings ArticleDOI
23 Oct 2002
TL;DR: An improved version of the Bi CGStab (IBiCGStab) method for the solutions of large and sparse linear systems of equations with unsymmetric coefficient matrices is proposed, which combines elements of numerical stability and parallel algorithm design without increasing the computational costs.
Abstract: In this paper, an improved version of the BiCGStab (IBiCGStab) method for the solutions of large and sparse linear systems of equations with unsymmetric coefficient matrices is proposed. The method combines elements of numerical stability and parallel algorithm design without increasing the computational costs. The algorithm is derived such that all inner products of a single iteration step are independent and the communication time required for the inner products can be overlapped efficiently with the computation time of vector updates. Therefore, the cost of global communication, which represents the bottleneck of the parallel performance, can be significantly reduced. The resulting IBiCGStab algorithm maintains the favorable properties of the original method while not increasing computational costs. A data distribution suitable for both irregularly and regularly structured matrices, based on the analysis of the nonzero matrix elements, is presented. The communication scheme is supported by overlapping execution of computation and communication to reduce waiting times. The efficiency of this method is demonstrated by numerical experimental results carried out on a massively parallel distributed memory system.
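For orientation, the baseline BiCGStab iteration that IBiCGStab restructures looks as follows. This is the textbook sequential form on a small dense system, without the paper's reordered, overlappable inner products; note the two inner products per half-step that become global reductions in parallel:

```python
def bicgstab(A, b, tol=1e-10, max_iter=100):
    # textbook (van der Vorst) BiCGStab on dense lists; no preconditioning
    n = len(b)
    matvec = lambda u: [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
    dot = lambda u, w: sum(x * y for x, y in zip(u, w))
    x = [0.0] * n
    r = b[:]                      # r0 = b - A*x0 with x0 = 0
    r_hat = r[:]                  # shadow residual
    rho = alpha = omega = 1.0
    p = [0.0] * n
    v = [0.0] * n
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = [r[i] + beta * (p[i] - omega * v[i]) for i in range(n)]
        v = matvec(p)
        alpha = rho / dot(r_hat, v)
        s = [r[i] - alpha * v[i] for i in range(n)]
        if dot(s, s) ** 0.5 < tol:              # half-step already converged
            x = [x[i] + alpha * p[i] for i in range(n)]
            break
        t = matvec(s)
        omega = dot(t, s) / dot(t, t)
        x = [x[i] + alpha * p[i] + omega * s[i] for i in range(n)]
        r = [s[i] - omega * t[i] for i in range(n)]
        if dot(r, r) ** 0.5 < tol:
            break
    return x

A = [[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
b = [1.0, 2.0, 3.0]
x = bicgstab(A, b)
print(x)
```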

Journal ArticleDOI
TL;DR: The proposed parallel micro genetic algorithm (PMGA) is shown to be viable to the online implementation of the constrained ED due to substantial generator fuel cost savings and high speedup upper bounds.
Abstract: This paper proposes a parallel micro genetic algorithm (PMGA) for solving ramp rate constrained economic dispatch (ED) problems for generating units with nonmonotonically and monotonically increasing incremental cost (IC) functions. The developed PMGA algorithm is implemented on the thirty-two-processor Beowulf cluster with ethernet switches network on the systems with the number of generating units ranging from 10 to 80 over the entire dispatch periods. The PMGA algorithm carefully schedules its processors, computational loads, and synchronization overhead for the best performance. The speedup upper bounds and the synchronization overheads on the Beowulf cluster are shown on different system sizes and different migration frequencies. The proposed PMGA is shown to be viable to the online implementation of the constrained ED due to substantial generator fuel cost savings and high speedup upper bounds.

Journal ArticleDOI
TL;DR: This work presents a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph that is optimal w.r.t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures.
Abstract: We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r. t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures. We also give a simple, general processor allocation scheme for tree-like computations.
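Parallel MSF algorithms, including randomized ones like this, typically build on Borůvka steps: in each round every component selects its minimum-weight incident edge, and all selections can proceed independently, which is the source of the parallelism. A sequential sketch of that building block (not the paper's optimal EREW algorithm):

```python
def boruvka_msf(n, edges):
    # edges: list of (u, v, weight); returns (forest_edges, total_weight).
    # Each round, every component picks its cheapest outgoing edge; the
    # number of components at least halves per round.
    parent = list(range(n))
    def find(x):                        # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest, total = [], 0.0
    while True:
        cheapest = {}
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][2]:
                    cheapest[r] = (u, v, w)
        if not cheapest:
            return forest, total
        for u, v, w in cheapest.values():
            ru, rv = find(u), find(v)   # re-check: earlier unions may merge
            if ru != rv:
                parent[ru] = rv
                forest.append((u, v, w))
                total += w

# two components: a 4-vertex graph plus an isolated edge
edges = [(0, 1, 4), (0, 2, 3), (1, 2, 1), (1, 3, 2), (2, 3, 5), (4, 5, 1)]
forest, total = boruvka_msf(6, edges)
print(total)
```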

Journal ArticleDOI
01 Aug 2002
TL;DR: An overview of the recent research in video compression using parallel processing is presented, outlining the basic philosophy of each approach and providing examples, and suggesting future research directions.
Abstract: Driven by the rapidly increasing demand for audio-visual applications, digital video compression technology has become a mature field, offering several available products based on both hardware and software implementations. Taking advantage of spatial, temporal, and statistical redundancies in video data, a video compression system aims to maximize the compression ratio while maintaining a high picture quality. Despite the tremendous progress in this area, video compression remains a challenging research problem due to its computational requirements and also because of the need for higher picture quality at lower data rates. Designing efficient coding algorithms continues to be a prolific area of research. To circumvent the computational requirements, researchers have resorted to parallel processing with a variety of approaches using dedicated parallel VLSI architectures as well as software on general-purpose available multiprocessor systems. Despite the availability of fast single processors, parallel processing helps to explore advanced algorithms and to build more sophisticated systems. This paper presents an overview of the recent research in video compression using parallel processing. The paper provides a discussion of the basic compression techniques, existing video coding standards, and various parallelization approaches. Since video compression is multi-step in nature using various algorithms, parallel processing can be exploited at an individual algorithm or at a complete system level. The paper covers a broad spectrum of such approaches, outlining the basic philosophy of each approach and providing examples. We contrast these approaches when possible, highlight their pros and cons, and suggest future research directions. While the emphasis of this paper is on software-based methods, a significant discussion of hardware and VLSI is also included.

Proceedings ArticleDOI
27 Oct 2002
TL;DR: An external memory algorithm for fast display of very large and complex geometric environments and a novel prioritized prefetching technique that takes into account LOD-switching and visibility-based events between successive frames is presented.
Abstract: We present an external memory algorithm for fast display of very large and complex geometric environments. We represent the model using a scene graph and employ different culling techniques for rendering acceleration. Our algorithm uses a parallel approach to render the scene as well as fetch objects from the disk in a synchronous manner. We present a novel prioritized prefetching technique that takes into account LOD-switching and visibility-based events between successive frames. We have applied our algorithm to large gigabyte sized environments that are composed of thousands of objects and tens of millions of polygons. The memory overhead of our algorithm is output sensitive and is typically tens of megabytes. In practice, our approach scales with the model sizes, and its rendering performance is comparable to that of an in-core algorithm.