
Showing papers on "Parallel algorithm published in 2007"


Journal ArticleDOI
TL;DR: The inter-relationships between graph problems, software, and parallel hardware in the current state of the art are presented and the range of these challenges suggests a research agenda for the development of scalable high-performance software for graph problems.
Abstract: Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for developing mainstream parallel scientific applications are not necessarily effective for large-scale graph problems. In this paper we present the inter-relationships between graph problems, software, and parallel hardware in the current state of the art and discuss how those issues present inherent challenges in solving large-scale graph problems. The range of these challenges suggests a research agenda for the development of scalable high-performance software for graph problems.

488 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: A massively parallel machine called Anton is described, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems and is designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation.
Abstract: The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macromolecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, chemistry and medicine. A wide range of biologically interesting phenomena, however, occur over time scales on the order of a millisecond--about three orders of magnitude beyond the duration of the longest current MD simulations. In this paper, we describe a massively parallel machine called Anton, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems. The machine, which is scheduled for completion by the end of 2008, is based on 512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network. Anton has been designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation. The remainder of the simulation algorithm is executed by a programmable portion of each chip that achieves a substantial degree of parallelism while preserving the flexibility necessary to accommodate anticipated advances in physical models and simulation methods.

340 citations


Journal ArticleDOI
TL;DR: A robust heuristic approach is proposed for the VRPTW using travel distance as the main objective through an efficient genetic algorithm and a set partitioning formulation that outperforms all previously known and published heuristic methods in terms of the minimal travel distance.

250 citations


Proceedings ArticleDOI
12 Dec 2007
TL;DR: A new scheduling algorithm is proposed that builds on two conventional scheduling algorithms, Min-Min and Max-Min, combining their strengths while covering their weaknesses; it selects between the two based on the standard deviation of the expected completion times of tasks on resources.
Abstract: Today, the high cost of supercomputers on the one hand and the need for large-scale computational resources on the other have led to the use of networks of computational resources known as Grids. Numerous research groups in universities, research labs, and industries around the world are now working on a type of Grid called Computational Grids that enable the aggregation of distributed resources for solving large-scale data-intensive problems in science, engineering, and commerce. Several institutions and universities have started research and teaching programs on Grid computing as part of their parallel and distributed computing curricula. To better use the tremendous capabilities of this distributed system, effective and efficient scheduling algorithms are needed. In this paper, we introduce a new scheduling algorithm based on two conventional scheduling algorithms, Min-Min and Max-Min, that exploits their strengths while covering their weaknesses. It selects between the two algorithms based on the standard deviation of the expected completion times of tasks on resources. We evaluate our scheduling heuristic, the Selective algorithm, within a grid simulator called GridSim, and compare our approach to its two basic heuristics. The experimental results show that the new heuristic can lead to significant performance gains for a variety of scenarios.

211 citations


Patent
26 Jun 2007
TL;DR: In this paper, a massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements.
Abstract: A novel massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing, including a Torus, a collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. Novel use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.

179 citations


Journal ArticleDOI
TL;DR: A variant of a least squares ensemble (Kalman) filter that is suitable for implementation on parallel architectures and produces results that are identical to those from sequential algorithms when forward observation operators that relate the model state vector to the expected value of observations are linear.
Abstract: A variant of a least squares ensemble (Kalman) filter that is suitable for implementation on parallel architectures is presented. This parallel ensemble filter produces results that are identical to those from sequential algorithms already described in the literature when forward observation operators that relate the model state vector to the expected value of observations are linear (although actual results may differ due to floating point arithmetic round-off error). For nonlinear forward observation operators, the sequential and parallel algorithms solve different linear approximations to the full problem but produce qualitatively similar results. The parallel algorithm can be implemented to produce identical answers with the state variable prior ensembles arbitrarily partitioned onto a set of processors for the assimilation step (no caveat on round-off is needed for this result). Example implementations of the parallel algorithm are described for environments with low (high) communication latency.
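The least squares (Kalman) update behind both the sequential and parallel filters can be sketched for a single scalar observation. The perturbed-observation form and all names below are illustrative assumptions, not the paper's algorithm, but they show why each state element's regression update is independent, which is what permits arbitrary partitioning of the state across processors:

```python
import random

def enkf_update(ensemble, obs_value, obs_var, H, rng=None):
    """One scalar-observation ensemble Kalman update (perturbed-obs
    form; illustrative names and layout). `ensemble` is a list of state
    vectors (lists of floats); `H` maps a state vector to the expected
    observation."""
    rng = rng or random.Random(0)
    n = len(ensemble)
    hx = [H(x) for x in ensemble]                  # obs-space ensemble
    hbar = sum(hx) / n
    var_h = sum((h - hbar) ** 2 for h in hx) / (n - 1)
    dim = len(ensemble[0])
    xbar = [sum(x[k] for x in ensemble) / n for k in range(dim)]
    # each state element regresses independently on the observed variable,
    # so the state vector can be partitioned across processors
    gains = [sum((x[k] - xbar[k]) * (h - hbar) for x, h in zip(ensemble, hx))
             / (n - 1) / (var_h + obs_var) for k in range(dim)]
    for i, x in enumerate(ensemble):
        innov = obs_value + rng.gauss(0.0, obs_var ** 0.5) - hx[i]
        for k in range(dim):
            x[k] += gains[k] * innov
    return ensemble
```

Only `hx`, `var_h`, and the innovations involve the whole ensemble; the per-element gain and increment need no communication between state partitions.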

159 citations


Journal ArticleDOI
TL;DR: In this paper, a made-to-measure (M2M) algorithm for constructing N-particle models of stellar systems from observational data (χ²M2M), extending earlier ideas by Syer & Tremaine, is described.
Abstract: We describe a made-to-measure (M2M) algorithm for constructing N-particle models of stellar systems from observational data (χ²M2M), extending earlier ideas by Syer & Tremaine. The algorithm properly accounts for observational errors, is flexible, and can be applied to various systems and geometries. We implement this algorithm in a parallel code NMAGIC and carry out a sequence of tests to illustrate its power and performance. (i) We reconstruct an isotropic Hernquist model from density moments and projected kinematics and recover the correct differential energy distribution and intrinsic kinematics. (ii) We build a self-consistent oblate three-integral maximum rotator model and compare how the distribution function is recovered from integral field and slit kinematic data. (iii) We create a non-rotating and a figure rotating triaxial stellar particle model, reproduce the projected kinematics of the figure rotating system by a non-rotating system of the same intrinsic shape, and illustrate the signature of pattern rotation in this model. From these tests, we comment on the dependence of the results from χ²M2M on the initial model, the geometry, and the amount of available data.

146 citations


Proceedings ArticleDOI
26 Mar 2007
TL;DR: A multithreaded algorithm for connected components and a new heuristic for inexact subgraph isomorphism are introduced and explored, and the performance of these and other basic graph algorithms on large scale-free graphs is explored.
Abstract: Search-based graph queries, such as finding short paths and isomorphic subgraphs, are dominated by memory latency. If input graphs can be partitioned appropriately, large cluster-based computing platforms can run these queries. However, the lack of compute-bound processing at each vertex of the input graph and the constant need to retrieve neighbors implies low processor utilization. Furthermore, graph classes such as scale-free social networks lack the locality to make partitioning clearly effective. Massive multithreading is an alternative architectural paradigm, in which a large shared memory is combined with processors that have extra hardware to support many thread contexts. The processor speed is typically slower than normal, and there is no data cache. Rather than mitigating memory latency, multithreaded machines tolerate it. This paradigm is well aligned with the problem of graph search, as the high ratio of memory requests to computation can be tolerated via multithreading. In this paper, we introduce the multithreaded graph library (MTGL), generic graph query software for processing semantic graphs on multithreaded computers. This library currently runs on serial machines and the Cray MTA-2, but Sandia is developing a run-time system that will make it possible to run MTGL-based code on symmetric multiprocessors. We also introduce a multithreaded algorithm for connected components and a new heuristic for inexact subgraph isomorphism. We explore the performance of these and other basic graph algorithms on large scale-free graphs. We conclude with a performance comparison between the Cray MTA-2 and Blue Gene/L for s-t connectivity.
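The s-t connectivity benchmark mentioned above reduces to breadth-first search. A minimal level-synchronous sketch (not MTGL code) makes visible the per-level parallelism that a massively multithreaded machine exploits:

```python
def st_connected(adj, s, t):
    """Level-synchronous breadth-first search for s-t connectivity.
    All expansions within one frontier level are independent, which is
    the parallelism a machine like the MTA-2 tolerates latency with.
    `adj` maps a vertex to an iterable of neighbours (assumed layout)."""
    if s == t:
        return True
    visited = {s}
    frontier = [s]
    while frontier:
        nxt = []
        for v in frontier:                 # each expansion could be a thread
            for w in adj.get(v, ()):
                if w == t:
                    return True
                if w not in visited:
                    visited.add(w)
                    nxt.append(w)
        frontier = nxt
    return False
```

On scale-free graphs the frontier grows very quickly, so almost all work sits in a few wide levels; that is exactly where many thread contexts pay off.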

108 citations


Journal ArticleDOI
TL;DR: A parallel CPM algorithm for simulations of morphogenesis, which includes cell–cell adhesion, a cell volume constraint, and cell haptotaxis is presented, which satisfies the balance condition, which is sufficient for convergence of the underlying Markov chain.

104 citations


Journal ArticleDOI
TL;DR: This work proposes implementing a parallel EA on consumer graphics cards, which one can find in many PCs, letting more people use the authors' parallel algorithm to solve large-scale, real-world problems such as data mining.
Abstract: We propose implementing a parallel EA on consumer graphics cards, which we can find in many PCs. This lets more people use our parallel algorithm to solve large-scale, real-world problems such as data mining. Parallel evolutionary algorithms run on consumer-grade graphics hardware achieve better execution times than ordinary evolutionary algorithms and offer greater accessibility than those run on high-performance computers.

102 citations


Journal ArticleDOI
TL;DR: Various results of optimized coverage patterns are shown herein to illustrate the effectiveness and validity of the GA technique.
Abstract: A parallel genetic algorithm (GA) optimization tool has been developed for the synthesis of arbitrarily shaped beam coverage using planar 2D phased-array antennas. Typically, the synthesis of a contoured beam footprint using a planar 2D array is difficult because of the inherently large number of degrees of freedom involved (in general, the amplitude and phase of each element must be determined). We make use of a parallel GA tool in this study to compensate for this aspect of the design problem. The algorithm essentially compares a desired pattern envelope with that of trial arrays, and quantifies the effectiveness or desirability of each test case via a fitness function. The GA uses this information to rank and select subsequent arrays over a given number of generations via the conventional stochastic operators, i.e., selection, crossover, and mutation. Each fitness evaluation of a trial pattern is done on a node of the Aerospace Fellowship cluster supercomputer, which increases the speed of the algorithm linearly with the number of nodes. Because of the continuous nature of the parameters for this optimization problem, a real parameter encoding scheme is employed for the GA chromosome in order to avoid the quantization errors associated with a binary representation. A benchmark 10 × 10 (100-element) array is employed, and various results of optimized coverage patterns are shown herein to illustrate the effectiveness and validity of the technique.
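The generation loop described above can be sketched as follows; the operators and parameters are generic real-coded GA choices, not the paper's, and the fitness evaluations on the first line are the independent computations farmed out one individual per cluster node:

```python
import random

def ga_step(pop, fitness, rng, elite=1, mut_sigma=0.1):
    """One generation of a real-coded GA. The fitness evaluations are
    mutually independent, so the sort's key calls are exactly the map a
    cluster distributes across nodes. Selection, crossover, and
    mutation below are generic illustrative choices."""
    scored = sorted(pop, key=fitness, reverse=True)   # parallelizable map
    nxt = scored[:elite]                              # elitism
    while len(nxt) < len(pop):
        a, b = rng.sample(scored[: len(pop) // 2], 2)  # truncation selection
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]                      # one-point crossover
        child = [g + rng.gauss(0.0, mut_sigma) for g in child]  # mutation
        nxt.append(child)
    return nxt
```

Because the elite individual is carried over unchanged, the best fitness in the population never decreases from one generation to the next.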

Proceedings Article
15 Jun 2007
TL;DR: In this article, the authors present three parallel algorithms for UCT for 9×9 Go, and they all improve the results of the programs that use them against GNU GO 3.6.
Abstract: We present three parallel algorithms for UCT. For 9×9 Go, they all improve the results of the programs that use them against GNU GO 3.6. The simplest one, the single-run algorithm, uses very few communications and shows improvements comparable to the more complex ones. Further improvements may be possible by sharing more information in the multiple-runs algorithm.
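A possible reading of the cheapest "single-run" variant is root parallelism: independent searches whose root statistics are merged once at the end. The sketch below illustrates this on a one-level toy tree, where UCT reduces to UCB1; everything here is an illustrative assumption rather than the authors' Go implementation:

```python
import math
import random

def ucb1_search(arm_means, iters, rng):
    """One independent UCT-style search, reduced to UCB1 on a one-level
    toy tree (arms = moves, rewards ~ Bernoulli). Returns visit counts."""
    n = [0] * len(arm_means)
    w = [0.0] * len(arm_means)
    for t in range(1, iters + 1):
        def score(a):
            if n[a] == 0:
                return float("inf")        # visit unexplored moves first
            return w[a] / n[a] + math.sqrt(2 * math.log(t) / n[a])
        a = max(range(len(arm_means)), key=score)
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        n[a] += 1
        w[a] += reward
    return n

def single_run_parallel(arm_means, iters, workers=4):
    """'Single-run' root parallelism: independent searches, a single
    merge of root visit counts at the end (one search per processor)."""
    totals = [0] * len(arm_means)
    for k in range(workers):
        counts = ucb1_search(arm_means, iters, random.Random(k))
        totals = [a + b for a, b in zip(totals, counts)]
    return max(range(len(arm_means)), key=lambda a: totals[a])
```

The only communication is one vector of visit counts per worker, matching the abstract's observation that the simplest variant uses very few communications.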

Journal ArticleDOI
TL;DR: A new branch-and-price optimization algorithm, termed “primal box”, and a specific branching variable selection rule that significantly reduces the number of explored nodes are proposed, which solve problems of large size to optimality within reasonable computational time.

Journal ArticleDOI
TL;DR: This paper studies designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems, and proposes three parameterized algorithms which can be tuned according to the problem size and the available hardware resources.
Abstract: The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on the Cray XD1, a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of XD1.
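The block-size trade-off the paper tunes can be illustrated in software with a blocked matrix multiplication, where the block size `bs` stands in for the on-chip-memory versus bandwidth knob. This is a sketch of the trade-off only, not the linear-array hardware design:

```python
def blocked_matmul(A, B, bs):
    """Blocked matrix multiplication over lists of lists. The block
    size `bs` plays the role of the tuning parameter: larger blocks
    need more fast (on-chip) storage but re-read the operands from
    slow memory fewer times."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                # one bs x bs x bs tile of work: operands fit in fast memory
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            C[i][j] += aik * B[k][j]
    return C
```

The result is identical for any `bs`; only the memory traffic pattern changes, which is the axis the paper's parameterized designs explore.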

Journal ArticleDOI
TL;DR: This paper evaluates the possibility of defining a hybrid customizable neighbourhood search algorithm for combinatorial problems as a combination of a subset of concepts and features from three reference metaheuristics, i.e., TS, SA, and VNS, and assesses the effectiveness of the resulting hybrid metaheuristic (HMH).

Proceedings ArticleDOI
21 Oct 2007
TL;DR: A general computational pattern that works well with early phase termination is identified and it is explained why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
Abstract: We present a new technique, early phase termination, for eliminating idle processors in parallel computations that use barrier synchronization. This technique simply terminates each parallel phase as soon as there are too few remaining tasks to keep all of the processors busy. Although this technique completely eliminates the idling that would otherwise occur at barrier synchronization points, it may also change the computation and therefore the result that the computation produces. We address this issue by providing probabilistic distortion models that characterize how the use of early phase termination distorts the result that the computation produces. Our experimental results show that for our set of benchmark applications, 1) early phase termination can improve the performance of the parallel computation, 2) the distortion is small (or can be made to be small with the use of an appropriate compensation technique) and 3) the distortion models provide accurate and tight distortion bounds. These bounds can enable users to evaluate the effect of early phase termination and confidently accept results from parallel computations that use this technique if they find the distortion bounds to be acceptable. Finally, we identify a general computational pattern that works well with early phase termination and explain why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
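The termination rule is simple to state in code. The sketch below simulates one barrier-synchronized phase sequentially; `n_procs`, the task list, and the skip accounting are hypothetical inputs used only to illustrate the rule:

```python
def phase_with_early_termination(tasks, worker_fn, n_procs):
    """Run one barrier-synchronized phase, terminating it as soon as
    fewer tasks remain than there are processors (so no processor ever
    idles at the barrier). Returns the results of the tasks that ran
    and the number of tasks that were dropped. Sequential simulation:
    real processors are modeled only by the remaining-task count."""
    done, skipped = [], 0
    for i in range(len(tasks)):
        if len(tasks) - i < n_procs:
            skipped = len(tasks) - i       # too few tasks to keep all busy
            break
        done.append(worker_fn(tasks[i]))
    return done, skipped
```

The dropped tasks are what distorts the result; the paper's probabilistic distortion models bound how far the perturbed answer can drift from the exact one.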

Proceedings ArticleDOI
18 Sep 2007
TL;DR: A FGPGA method based on GPU acceleration is proposed, which maps a parallel GA to texture rendering on consumer-level graphics cards, increases the population size, speeds up execution, and provides ordinary users with a feasible FGPGA solution.
Abstract: Fine-grained parallel genetic algorithm (FGPGA), though a popular and robust strategy for solving complicated optimization problems, is sometimes inconvenient to use as its population size is restricted by heavy data communication, and the parallel computers are relatively difficult to use, manage, and maintain and may not be accessible to most researchers. In this paper, we propose a FGPGA method based on GPU acceleration, which maps a parallel GA to texture rendering on consumer-level graphics cards. The analytical results demonstrate that the proposed method increases the population size, speeds up its execution, and provides ordinary users with a feasible FGPGA solution.

Journal ArticleDOI
TL;DR: In this article, a universal approach to grid motion is presented, which is applicable to any grid type, unstructured, hybrid or structured single- and multiblock, and is shown to preserve grid quality even when a large deformation case is considered.
Abstract: Numerical simulation of unsteady flows including moving boundaries, whether rigid, prescribed, or deforming, requires the mesh to move/deform also. Many approaches to mesh motion have been considered, but the approach adopted often depends on both the meshing approach used and the proposed application. A universal approach to grid motion is presented here. The method developed has several important properties, the most significant of which is that the scheme requires no grid point connectivity information, i.e. each point can be moved totally independently of its neighbours. This has two major implications: first, it means the scheme is universal, i.e. applicable to any grid type, unstructured, hybrid or structured single- and multiblock. Second, and equally as important, is that the scheme is perfectly parallel, as no communication is required between points/blocks. Each block can be updated independently of its neighbours so there is no connectivity data required, and this also means the flow-solver does not have to carry the grid motion parameterization data in memory. The scheme accounts for moving surface rotations as well as displacements, and this ensures grid quality is preserved by maintaining orthogonality. The method is applied to a four-bladed lifting rotor in forward flight. This is a particularly challenging unsteady problem as there are multiple bodies in relative motion, each one with its own axis system, as each blade has a cyclic pitch variation. Grid quality is proven to be preserved even when a large deformation case is considered and, significantly, the scheme adds only around 1% to the cost of a flow solution real time step. Copyright © 2006 John Wiley & Sons, Ltd.
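A connectivity-free point-motion scheme of the kind described can be sketched with inverse-distance weighting of surface displacements. The weight function below, and the omission of the paper's rotation handling, are illustrative simplifications, not the authors' formulation:

```python
def move_point(p, surf_pts, surf_disps, power=3.0):
    """Displace one volume grid point from surface motion alone, via
    inverse-distance weighting of the surface-point displacements.
    Each point uses only its own position and the surface data, so
    points need no connectivity information and can be moved fully in
    parallel, as in the paper; the weight exponent is an assumption."""
    wsum = 0.0
    disp = [0.0, 0.0, 0.0]
    for q, d in zip(surf_pts, surf_disps):
        r = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        if r == 0.0:
            # point lies on the moving surface: follow it exactly
            return [a + b for a, b in zip(p, d)]
        w = r ** -power
        wsum += w
        disp = [acc + w * dk for acc, dk in zip(disp, d)]
    return [a + b / wsum for a, b in zip(p, disp)]
```

Points near the surface follow it almost rigidly while far-field points barely move, which is the behaviour that preserves near-wall grid quality.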

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper presents a programming and execution model for multi-core architectures with memory hierarchy, and proposes a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators.
Abstract: Dynamic programming is an efficient technique for solving combinatorial search and optimization problems, and many parallel dynamic programming algorithms exist. The purpose of this paper is to study a family of dynamic programming algorithms where data dependences appear between non-consecutive stages; in other words, the data dependence is non-uniform. This kind of dynamic programming is typically called nonserial polyadic dynamic programming. Owing to the non-uniform data dependence, it is harder to optimize this problem for parallelism and locality on parallel architectures. In this paper, we address the challenge of exploiting fine-grain parallelism and locality of nonserial polyadic dynamic programming on a multi-core architecture. We present a programming and execution model for multi-core architectures with a memory hierarchy. In the framework of the new model, parallelism and locality benefit from a data dependence transformation. We propose a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators. The new parallel algorithm tolerates memory access latency using multithreading and is easily improved with a tiling technique. We formulate and analytically solve the optimization problem of determining the tile size that minimizes the total execution time. Experiments on a simulator validate the proposed model and show that the fine-grain parallel algorithm achieves sub-linear speedup and potentially high scalability on multi-core architectures.
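The matrix-chain-style recurrence below is a standard instance of nonserial polyadic dynamic programming and shows the non-uniform dependence the paper targets. The cost function `w` and the plain diagonal sweep are generic illustrations, not the paper's transformed, tiled schedule:

```python
def nonserial_polyadic_dp(n, w):
    """Fill the table c[i][j] = min over i<k<j of (c[i][k] + c[k][j])
    plus w(i, j), a standard nonserial polyadic recurrence: cell (i, j)
    depends on cells from every earlier diagonal, not just the previous
    one. Sweeping by diagonals makes the parallelism explicit: every
    cell on one diagonal is independent of the others, which is the
    fine-grain unit a tiled multi-core schedule would distribute."""
    c = [[0.0] * n for _ in range(n)]
    for d in range(2, n):                   # diagonal index = j - i
        for i in range(0, n - d):           # cells on a diagonal: parallel
            j = i + d
            c[i][j] = min(c[i][k] + c[k][j] for k in range(i + 1, j)) + w(i, j)
    return c
```

The non-uniform dependence is visible in the inner `min`: computing one cell reads an entire row and column prefix, which is why locality needs the dependence transformation and tiling the paper proposes.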

Book ChapterDOI
09 Sep 2007
TL;DR: A distributed algorithm due to Hoepman is analysed and it is shown how this can be turned into a parallel algorithm that scales well using up to 32 processors.
Abstract: We consider the problem of computing a weighted edge matching in a large graph using a parallel algorithm. This problem has application in several areas of combinatorial scientific computing. Since an exact algorithm for the weighted matching problem is both fairly expensive to compute and hard to parallelise, we instead consider fast approximation algorithms. We analyse a distributed algorithm due to Hoepman [8] and show how this can be turned into a parallel algorithm. Through experiments using both complete and sparse graphs, we show that our new parallel algorithm scales well on up to 32 processors.
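A sequential equivalent of the locally-dominant-edge idea behind Hoepman-style matching is the classic greedy half-approximation, sketched below; the edge-list layout is an assumption, and the distributed version finds the same dominant edges without the global sort:

```python
def dominant_edge_matching(edges):
    """Half-approximate weighted matching. Scanning edges by decreasing
    weight and greedily matching free endpoints yields the same kind of
    result as repeatedly taking locally dominant edges (edges heaviest
    at both endpoints), the idea the parallel algorithm builds on.
    `edges` is a list of (weight, u, v) tuples (assumed layout)."""
    matched = set()
    matching = []
    for w, u, v in sorted(edges, reverse=True):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

The greedy result has at least half the weight of an optimal matching; the parallel appeal is that locally dominant edges can be identified from each vertex's neighbourhood alone.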

Proceedings ArticleDOI
04 Jun 2007
TL;DR: It is argued that implicitly parallel programming models are critical for addressing the software development crises and software scalability challenges for many-core microprocessors.
Abstract: This paper argues for an implicitly parallel programming model for many-core microprocessors, and provides initial technical approaches towards this goal. In an implicitly parallel programming model, programmers maximize algorithm- level parallelism, express their parallel algorithms by asserting high-level properties on top of a traditional sequential programming language, and rely on parallelizing compilers and hardware support to perform parallel execution under the hood. In such a model, compilers and related tools require much more advanced program analysis capabilities and programmer assertions than what are currently available so that a comprehensive understanding of the input program's concurrency can be derived. Such an understanding is then used to drive automatic or interactive parallel code generation tools for a diverse set of parallel hardware organizations. The chip-level architecture and hardware should maintain parallel execution state in such a way that a strictly sequential execution state can always be derived for the purpose of verifying and debugging the program. We argue that implicitly parallel programming models are critical for addressing the software development crises and software scalability challenges for many-core microprocessors.

Proceedings ArticleDOI
01 Jan 2007
TL;DR: This paper examines existing ATC methods, providing an alternative to existing nested coordination schemes by using the block coordinate descent method (BCD), and applies diagonal quadratic approximation (DQA) by linearizing the cross term of the augmented Lagrangian function to create separable subproblems.
Abstract: Analytical Target Cascading (ATC) is an effective decomposition approach used for engineering design optimization problems that have hierarchical structures. With ATC, the overall system is split into subsystems, which are solved separately and coordinated via target/response consistency constraints. As parallel computing becomes more common, it is desirable to have separable subproblems in ATC so that each subproblem can be solved concurrently to increase computational throughput. In this paper, we first examine existing ATC methods, providing an alternative to existing nested coordination schemes by using the block coordinate descent method (BCD). Then we apply diagonal quadratic approximation (DQA) by linearizing the cross term of the augmented Lagrangian function to create separable subproblems. Local and global convergence proofs are described for this method. To further reduce overall computational cost, we introduce the truncated DQA (TDQA) method, which limits the number of inner loop iterations of DQA. These two new methods are empirically compared to existing methods using test problems from the literature. Results show that the computational cost of nested loop methods is reduced by using BCD and that, in general, the truncated methods, TDQA and ALAD, outperform other nested loop methods, with lower overall computational cost than the best previously reported results. © 2007 ASME

Proceedings ArticleDOI
15 Sep 2007
TL;DR: This study reports the experience implementing a realistic application using transactional memory (TM) and evaluates the exploitable parallelism of a transactional parallel implementation and explores how it can be adapted to deliver better performance.
Abstract: Transactional memory proposes an alternative synchronization primitive to traditional locks. Its promise is to simplify the software development of multi-threaded applications while at the same time delivering the performance of parallel applications that use (complex and error-prone) fine-grain locking. This study reports our experience implementing a realistic application using transactional memory (TM). The application is Lee's routing algorithm and was selected for its abundance of parallelism but the difficulty of expressing it with locks. Each route between a source and a destination point in a grid can be considered a unit of parallelism. Starting from this simple approach, we evaluate the exploitable parallelism of a transactional parallel implementation and explore how it can be adapted to deliver better performance. The adaptations neither introduce locks nor alter the essence of the implemented algorithm, but deliver up to 20 times more parallelism. The adaptations are derived from understanding the application itself and TM. The evaluation simulates an abstracted TM system and, thus, the results are independent of any specific software or hardware TM implementation, and describe properties of the application.

Proceedings ArticleDOI
12 Aug 2007
TL;DR: This paper develops a fully dynamic distributed algorithm for maintaining sparse spanners that improves drastically the quiescence time and improves significantly upon the state-of-the-art algorithm in all efficiency parameters.
Abstract: Currently, there are no known explicit algorithms for the great majority of graph problems in the dynamic distributed message-passing model. Instead, most state-of-the-art dynamic distributed algorithms are constructed by composing a static algorithm for the problem at hand with a simulation technique that converts static algorithms to dynamic ones. We argue that this powerful methodology does not provide satisfactory solutions for many important dynamic distributed problems, and this necessitates developing algorithms for these problems from scratch. In this paper we develop a fully dynamic distributed algorithm for maintaining sparse spanners. Our algorithm improves drastically the quiescence time of the state-of-the-art algorithm for the problem. Moreover, we show that the quiescence time of our algorithm is optimal up to a small constant factor. In addition, our algorithm improves significantly upon the state-of-the-art algorithm in all efficiency parameters, specifically, it has smaller quiescence message and space complexities, and smaller local processing time. Finally, our algorithm is self-contained and fairly simple, and is, consequently, amenable to implementation on unsophisticated network devices.

Proceedings ArticleDOI
12 Mar 2007
TL;DR: Three versions of the algorithm for regular expression matching on streams with out of order data are implemented and it is shown by experimental study that the algorithms are highly effective in matching regular expressions on IP packet streams.
Abstract: We present an efficient algorithm for regular expression matching on streams with out of order data, while maintaining a small state and without complete stream reconstruction. We have implemented three versions of the algorithm - sequential, parallel and mixed - and show by experimental study that the algorithms are highly effective in matching regular expressions on IP packet streams.

Book ChapterDOI
18 Dec 2007
TL;DR: A new approach to high performance molecular dynamics simulations on graphics processing units using the Compute Unified Device Architecture (CUDA) to design and implement a new parallel algorithm with significant runtime savings on an off-the-shelf computer graphics card.
Abstract: Molecular dynamics simulations are a common and often repeated task in molecular biology. The need for speeding up this treatment comes from the requirement for large system simulations with many atoms and numerous time steps. In this paper we present a new approach to high performance molecular dynamics simulations on graphics processing units. Using modern graphics processing units for high performance computing is facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) to design and implement a new parallel algorithm. This results in an implementation with significant runtime savings on an off-the-shelf computer graphics card.
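The force evaluation that dominates an MD step can be sketched with an all-pairs Lennard-Jones loop. Each atom's accumulation is independent, which is the one-thread-per-atom decomposition a CUDA kernel would use; the parameters and the O(n²) pair loop are a generic sketch, not the paper's kernel:

```python
def lj_forces(pos, eps=1.0, sigma=1.0):
    """All-pairs Lennard-Jones forces in reduced units. The outer loop
    bodies are independent (each atom accumulates only its own force),
    which is what maps naturally to one GPU thread per atom. The force
    on atom i from atom j is 24*eps*(2*s12 - s6)/r^2 times the
    separation vector, with s6 = (sigma/r)^6 and s12 = s6^2."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):                      # one GPU thread per atom
        for j in range(n):
            if i == j:
                continue
            d = [a - b for a, b in zip(pos[i], pos[j])]
            r2 = sum(x * x for x in d)
            s6 = (sigma * sigma / r2) ** 3
            f = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
            forces[i] = [fa + f * x for fa, x in zip(forces[i], d)]
    return forces
```

Each pair is computed twice here (once per endpoint); on a GPU that redundancy is usually cheaper than synchronizing shared accumulators across threads.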

BookDOI
20 Dec 2007
TL;DR: A Programming Model and Architectural Extensions for Fine-Grain Parallelism, and Applications Using FG to reduce the Effect of Latency in Parallel Programs Running on Clusters.
Abstract (table of contents):
PREFACE
MODELS
Evolving Computational Systems (S.G. Akl)
Decomposable BSP: A Bandwidth-Latency Model for Parallel and Hierarchical Computation (G. Bilardi, A. Pietracaprina, and G. Pucci)
Membrane Systems: A "Natural" Way of Computing with Cells (O.H. Ibarra and A. Paun)
Optical Transpose Systems: Models and Algorithms (C.-F. Wang and S. Sahni)
Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform (U. Vishkin, G. Caragea, and B. Lee)
Deterministic and Randomized Sorting Algorithms for the Parallel Disks Model (S. Rajasekaran)
A Programming Model and Architectural Extensions for Fine-Grain Parallelism (A. Gontmakher, A. Mendelson, A. Schuster, and G. Shklover)
Computing with Mobile Agents in Distributed Networks (E. Kranakis, D. Krizanc, and S. Rajsbaum)
Transitional Issues: Fine-Grain to Coarse-Grain Multicomputers (S. Olariu)
Distributed Computing in the Presence of Mobile Faults (N. Santoro and P. Widmayer)
A Hierarchical Performance Model for Reconfigurable Computers (R. Scorfano and V.K. Prasanna)
Hierarchical Performance Modeling and Analysis of Distributed Software Systems (R.A. Ammar)
Randomized Packet Routing, Selection, and Sorting on the POPS Network (J. Davila and S. Rajasekaran)
Dynamic Reconfiguration on the R-Mesh (R. Vaidyanathan and J.L. Trahan)
Fundamental Algorithms on the Reconfigurable Mesh (K. Nakano)
Reconfigurable Computing with Optical Buses (A.G. Bourgeois)
ALGORITHMS
Distributed Peer-to-Peer Data Structures (M.T. Goodrich and M.J. Nelson)
Parallel Algorithms via the Probabilistic Method (L. Kliemann and A. Srivastav)
Broadcasting on Networks of Workstations (S. Khuller, Y.-A. Kim, and Y.-C. Wan)
Atomic Selfish Routing in Networks: A Survey (S. Kontogiannis and P. Spirakis)
Scheduling in Grid Environments (Y-C. Lee and A.Y. Zomaya)
QoS Scheduling in Network and Storage Systems (P.J. Varman and A. Gulati)
Optimal Parallel Scheduling Algorithms in WDM Packet Interconnects (Y. Yang)
Online Real-Time Scheduling Algorithms for Multiprocessor Systems (M.A. Palis)
Parallel Algorithms for Maximal Independent Set and Maximal Matching (Y. Han)
Efficient Parallel Graph Algorithms for Shared-Memory Multiprocessors (D.A. Bader and G. Cong)
Parallel Algorithms for Volumetric Surface Construction (J. JaJa, Q. Shi, and A. Varshney)
Mesh-Based Parallel Algorithms for Ultra-Fast Computer Vision (S. Olariu)
Prospectus for a Dense Linear Algebra Software Library (J. Demmel and J. Dongarra)
Parallel Algorithms on Strings (W. Rytter)
Design of Multithreaded Algorithms for Combinatorial Problems (D.A. Bader, K. Madduri, G. Cong, and J. Feo)
Parallel Data Mining Algorithms for Association Rules and Clustering (J. Li, Y. Liu, W.-K. Liao, and A. Choudhary)
An Overview of Mobile Computing Algorithmics (S. Olariu and A.Y. Zomaya)
APPLICATIONS
Using FG to Reduce the Effect of Latency in Parallel Programs Running on Clusters (T.H. Cormen and E.R. Davidson)
High-Performance Techniques for Parallel I/O (A. Ching, K. Coloma, A. Choudhary, and W.-K. Liao)
Message Dissemination Using Modern Communication Primitives (T. Gonzalez)
Online Computation in Large Networks (S. Albers)
Online Call Admission Control in Wireless Cellular Networks (I. Caragiannis, C. Kaklamanis, and E. Papaioannou)
Minimum Energy Communication in Ad Hoc Wireless Networks (I. Caragiannis, C. Kaklamanis, and P. Kanellopoulos)
Power Aware Mapping of Real-Time Tasks to Multiprocessors (D. Zhu, B.R. Childers, D. Mosse, and R. Melhem)
Perspectives on Robust Resource Allocation for Heterogeneous Parallel and Distributed Systems (S. Ali, H.J. Siegel, and A.A. Maciejewski)
A Transparent Distributed Runtime for Java (M. Factor, A. Schuster, and K. Shagin)
Scalability of Parallel Programs (A. Grama and V. Kumar)
Spatial Domain Decomposition Methods in Parallel Scientific Computing (Sudip Seal and Srinivas Aluru)
Game Theoretical Solutions for Data Replication in Distributed Computing Systems (S.U. Khan and I. Ahmad)
Effectively Managing Data on a Grid (C.L. Ruby and R. Miller)
Fast and Scalable Parallel Matrix Multiplication and Its Applications on Distributed Memory Systems (K. Li)
INDEX

Proceedings ArticleDOI
26 Mar 2007
TL;DR: A generic work-partitioning technique on the Cell to hide memory access latency, and applies this to efficiently implement list ranking, a representative problem from the class of combinatorial and graph-theoretic applications.
Abstract: The Sony-Toshiba-IBM Cell Broadband Engine is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE), with eight SIMD coprocessing units (SPEs) integrated on-chip. We present a complexity model for designing algorithms on the Cell processor, along with a systematic procedure for algorithm analysis. To estimate the execution time of the algorithm, we consider the computational complexity, memory access patterns (DMA transfer sizes and latency), and the complexity of branching instructions. This model, coupled with the analysis procedure, simplifies algorithm design on the Cell and enables quick identification of potential implementation bottlenecks. Using the model, we design an efficient implementation of list ranking, a representative problem from the class of combinatorial and graph-theoretic applications. Due to its highly irregular memory patterns, list ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures. We describe a generic work-partitioning technique on the Cell to hide memory access latency, and apply this to efficiently implement list ranking. We run our algorithm on a 3.2 GHz Cell processor using an IBM QS20 Cell Blade and demonstrate a substantial speedup for list ranking on the Cell in comparison to traditional cache-based microprocessors. For a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.
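For readers unfamiliar with the problem, list ranking asks for each node's distance from the end of a linked list. The classic parallel formulation is Wyllie's pointer jumping, sketched below in plain Python as an illustration of the problem's structure only; it is not the paper's Cell-specific algorithm.

```python
def list_rank(succ):
    # Wyllie's pointer-jumping algorithm for list ranking. succ[i] is
    # the successor of node i, with succ[i] == i marking the tail.
    # Each round halves every node's remaining distance, so O(log n)
    # rounds suffice; on a PRAM the inner loop runs in parallel.
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    done = False
    while not done:
        done = True
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):                # conceptually a parallel-for
            if nxt[i] != nxt[nxt[i]]:
                done = False
            new_rank[i] = rank[i] + rank[nxt[i]]
            new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank   # distance of each node from the tail
```

The reads through `nxt[nxt[i]]` are exactly the irregular, data-dependent accesses the abstract refers to: they defeat hardware caches and prefetchers, which is why latency-hiding via software-managed DMA pays off on the Cell.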

Journal ArticleDOI
TL;DR: The design and implementation of a memory scalable parallel symbolic factorization algorithm for general sparse unsymmetric matrices using a graph partitioning approach applied to the graph of A to partition the matrix in such a way that is good for sparsity preservation as well as for parallel factorization.
Abstract: This paper presents the design and implementation of a memory scalable parallel symbolic factorization algorithm for general sparse unsymmetric matrices. Our parallel algorithm uses a graph partitioning approach, applied to the graph of $|A|+|A|^T$, to partition the matrix in a way that is good both for sparsity preservation and for parallel factorization. The partitioning yields a so-called separator tree, which represents the dependencies among the computations. We use the separator tree to distribute the input matrix over the processors using a block cyclic approach and a subtree-to-subprocessor mapping. The parallel algorithm performs a bottom-up traversal of the separator tree. With a combination of right-looking and left-looking partial factorizations, the algorithm obtains one column structure of $L$ and one row structure of $U$ at each step. The algorithm is implemented in C and MPI. From a performance study on large matrices, we show that the parallel algorithm significantly reduces the memory requirement of the symbolic factorization step, as well as the overall memory requirement of the parallel solver. It also often reduces the runtime relative to the sequential algorithm, even though that runtime is already relatively small. In general, the parallel algorithm prevents the symbolic factorization step from being a time or memory bottleneck of the parallel solver.
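To make "symbolic factorization" concrete: the step predicts the nonzero structure of the factors, including fill-in, before any numerical work. The sketch below is the much simpler sequential symmetric (Cholesky-style) case, for illustration only; the paper's algorithm handles the unsymmetric case in parallel via the graph of |A|+|A|^T.

```python
def symbolic_cholesky(A_struct, n):
    # Illustrative sequential symbolic factorization for a sparse
    # symmetric matrix. A_struct[j] is the set of row indices i >= j
    # with A[i][j] != 0; L[j] becomes the predicted structure of
    # column j of the factor, fill-in included.
    L = [set(A_struct[j]) | {j} for j in range(n)]
    parent = [None] * n
    for j in range(n):
        below = sorted(r for r in L[j] if r > j)
        if below:
            parent[j] = below[0]          # elimination-tree parent
            # Eliminating column j creates fill: the rest of its
            # pattern is inherited by the parent column.
            L[parent[j]] |= {r for r in L[j] if r > parent[j]}
    return L, parent
```

The `parent` array is the elimination tree, the symmetric analogue of the separator tree's dependency structure: a column can only be processed after all of its children, which is what exposes parallelism across independent subtrees.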

Journal ArticleDOI
TL;DR: A correction is presented to Ma and Sonka's thinning algorithm from 'A fully parallel 3D thinning algorithm and its applications', which fails to preserve the connectivity of 3D objects.