
Showing papers on "Parallel algorithm published in 2007"


Journal ArticleDOI
TL;DR: The inter-relationships between graph problems, software, and parallel hardware in the current state of the art are presented and the range of these challenges suggests a research agenda for the development of scalable high-performance software for graph problems.
Abstract: Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for developing mainstream parallel scientific applications are not necessarily effective for large-scale graph problems. In this paper we present the inter-relationships between graph problems, software, and parallel hardware in the current state of the art and discuss how those issues present inherent challenges in solving large-scale graph problems. The range of these challenges suggests a research agenda for the development of scalable high-performance software for graph problems.

488 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: A massively parallel machine called Anton is described, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems and is designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation.
Abstract: The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macromolecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, chemistry and medicine. A wide range of biologically interesting phenomena, however, occur over time scales on the order of a millisecond--about three orders of magnitude beyond the duration of the longest current MD simulations. In this paper, we describe a massively parallel machine called Anton, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems. The machine, which is scheduled for completion by the end of 2008, is based on 512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network. Anton has been designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation. The remainder of the simulation algorithm is executed by a programmable portion of each chip that achieves a substantial degree of parallelism while preserving the flexibility necessary to accommodate anticipated advances in physical models and simulation methods.

340 citations


Journal ArticleDOI
TL;DR: A robust heuristic approach is proposed for the VRPTW using travel distance as the main objective through an efficient genetic algorithm and a set partitioning formulation that outperforms all previously known and published heuristic methods in terms of the minimal travel distance.

250 citations


Proceedings ArticleDOI
12 Dec 2007
TL;DR: A new scheduling algorithm is proposed that builds on two conventional scheduling algorithms, Min-Min and Max-Min, combining their strengths while covering their weaknesses; it selects between the two based on the standard deviation of the expected completion times of tasks on resources.
Abstract: Today, the high cost of supercomputers on the one hand and the need for large-scale computational resources on the other have led to the use of networks of computational resources known as Grids. Numerous research groups in universities, research labs, and industries around the world are now working on a type of Grid called Computational Grids that enable the aggregation of distributed resources for solving large-scale data-intensive problems in science, engineering, and commerce. Several institutions and universities have started research and teaching programs on Grid computing as part of their parallel and distributed computing curricula. To better use the tremendous capabilities of this distributed system, effective and efficient scheduling algorithms are needed. In this paper, we introduce a new scheduling algorithm based on two conventional scheduling algorithms, Min-Min and Max-Min, that exploits their strengths while covering their weaknesses. It selects between the two algorithms based on the standard deviation of the expected completion times of tasks on resources. We evaluate our scheduling heuristic, the Selective algorithm, within a grid simulator called GridSim, and compare our approach to its two basic heuristics. The experimental results show that the new heuristic can lead to significant performance gains for a variety of scenarios.

211 citations


Patent
26 Jun 2007
TL;DR: In this paper, a massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements.
Abstract: A novel massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing, including a Torus, a collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. Novel use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.

179 citations


Journal ArticleDOI
TL;DR: A variant of a least squares ensemble (Kalman) filter that is suitable for implementation on parallel architectures and produces results that are identical to those from sequential algorithms when forward observation operators that relate the model state vector to the expected value of observations are linear.
Abstract: A variant of a least squares ensemble (Kalman) filter that is suitable for implementation on parallel architectures is presented. This parallel ensemble filter produces results that are identical to those from sequential algorithms already described in the literature when forward observation operators that relate the model state vector to the expected value of observations are linear (although actual results may differ due to floating point arithmetic round-off error). For nonlinear forward observation operators, the sequential and parallel algorithms solve different linear approximations to the full problem but produce qualitatively similar results. The parallel algorithm can be implemented to produce identical answers with the state variable prior ensembles arbitrarily partitioned onto a set of processors for the assimilation step (no caveat on round-off is needed for this result). Example implementations of the parallel algorithm are described for environments with low (high) communication latency.
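The least squares (Kalman) update behind both the sequential and parallel filters can be sketched for a single scalar observation. The perturbed-observation form and all names below are illustrative assumptions, not the paper's algorithm, but they show why each state element's regression update is independent, which is what permits arbitrary partitioning of the state across processors:

```python
import random

def enkf_update(ensemble, obs_value, obs_var, H, rng=None):
    """One scalar-observation ensemble Kalman update (perturbed-obs
    form; illustrative names and layout). `ensemble` is a list of state
    vectors (lists of floats); `H` maps a state vector to the expected
    observation."""
    rng = rng or random.Random(0)
    n = len(ensemble)
    hx = [H(x) for x in ensemble]                  # obs-space ensemble
    hbar = sum(hx) / n
    var_h = sum((h - hbar) ** 2 for h in hx) / (n - 1)
    dim = len(ensemble[0])
    xbar = [sum(x[k] for x in ensemble) / n for k in range(dim)]
    # each state element regresses independently on the observed variable,
    # so the state vector can be partitioned across processors
    gains = [sum((x[k] - xbar[k]) * (h - hbar) for x, h in zip(ensemble, hx))
             / (n - 1) / (var_h + obs_var) for k in range(dim)]
    for i, x in enumerate(ensemble):
        innov = obs_value + rng.gauss(0.0, obs_var ** 0.5) - hx[i]
        for k in range(dim):
            x[k] += gains[k] * innov
    return ensemble
```

Only `hx`, `var_h`, and the innovations involve the whole ensemble; the per-element gain and increment need no communication between state partitions.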

159 citations


Journal ArticleDOI
TL;DR: In this paper, a made-to-measure (M2M) algorithm for constructing N-particle models of stellar systems from observational data (χ²M2M), extending earlier ideas by Syer & Tremaine, is described.
Abstract: We describe a made-to-measure (M2M) algorithm for constructing N-particle models of stellar systems from observational data (χ²M2M), extending earlier ideas by Syer & Tremaine. The algorithm properly accounts for observational errors, is flexible, and can be applied to various systems and geometries. We implement this algorithm in a parallel code NMAGIC and carry out a sequence of tests to illustrate its power and performance. (i) We reconstruct an isotropic Hernquist model from density moments and projected kinematics and recover the correct differential energy distribution and intrinsic kinematics. (ii) We build a self-consistent oblate three-integral maximum rotator model and compare how the distribution function is recovered from integral field and slit kinematic data. (iii) We create a non-rotating and a figure rotating triaxial stellar particle model, reproduce the projected kinematics of the figure rotating system by a non-rotating system of the same intrinsic shape, and illustrate the signature of pattern rotation in this model. From these tests, we comment on the dependence of the results from χ²M2M on the initial model, the geometry, and the amount of available data.

146 citations


Proceedings ArticleDOI
26 Mar 2007
TL;DR: A multithreaded algorithm for connected components and a new heuristic for inexact subgraph isomorphism are introduced and explored, and the performance of these and other basic graph algorithms on large scale-free graphs is explored.
Abstract: Search-based graph queries, such as finding short paths and isomorphic subgraphs, are dominated by memory latency. If input graphs can be partitioned appropriately, large cluster-based computing platforms can run these queries. However, the lack of compute-bound processing at each vertex of the input graph and the constant need to retrieve neighbors implies low processor utilization. Furthermore, graph classes such as scale-free social networks lack the locality to make partitioning clearly effective. Massive multithreading is an alternative architectural paradigm, in which a large shared memory is combined with processors that have extra hardware to support many thread contexts. The processor speed is typically slower than normal, and there is no data cache. Rather than mitigating memory latency, multithreaded machines tolerate it. This paradigm is well aligned with the problem of graph search, as the high ratio of memory requests to computation can be tolerated via multithreading. In this paper, we introduce the multithreaded graph library (MTGL), generic graph query software for processing semantic graphs on multithreaded computers. This library currently runs on serial machines and the Cray MTA-2, but Sandia is developing a run-time system that will make it possible to run MTGL-based code on symmetric multiprocessors. We also introduce a multithreaded algorithm for connected components and a new heuristic for inexact subgraph isomorphism. We explore the performance of these and other basic graph algorithms on large scale-free graphs. We conclude with a performance comparison between the Cray MTA-2 and Blue Gene/L for s-t connectivity.
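The s-t connectivity benchmark mentioned above reduces to breadth-first search. A minimal level-synchronous sketch (not MTGL code) makes visible the per-level parallelism that a massively multithreaded machine exploits:

```python
def st_connected(adj, s, t):
    """Level-synchronous breadth-first search for s-t connectivity.
    All expansions within one frontier level are independent, which is
    the parallelism a machine like the MTA-2 tolerates latency with.
    `adj` maps a vertex to an iterable of neighbours (assumed layout)."""
    if s == t:
        return True
    visited = {s}
    frontier = [s]
    while frontier:
        nxt = []
        for v in frontier:                 # each expansion could be a thread
            for w in adj.get(v, ()):
                if w == t:
                    return True
                if w not in visited:
                    visited.add(w)
                    nxt.append(w)
        frontier = nxt
    return False
```

On scale-free graphs the frontier grows very quickly, so almost all work sits in a few wide levels; that is exactly where many thread contexts pay off.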

108 citations


Journal ArticleDOI
TL;DR: A parallel CPM algorithm for simulations of morphogenesis, which includes cell–cell adhesion, a cell volume constraint, and cell haptotaxis is presented, which satisfies the balance condition, which is sufficient for convergence of the underlying Markov chain.

104 citations


Journal ArticleDOI
TL;DR: This work proposes implementing a parallel EA on consumer graphics cards, which one can find in many PCs, letting more people use the authors' parallel algorithm to solve large-scale, real-world problems such as data mining.
Abstract: We propose implementing a parallel EA on consumer graphics cards, which we can find in many PCs. This lets more people use our parallel algorithm to solve large-scale, real-world problems such as data mining. Parallel evolutionary algorithms run on consumer-grade graphics hardware achieve better execution times than ordinary evolutionary algorithms and offer greater accessibility than those run on high-performance computers.

102 citations


Journal ArticleDOI
TL;DR: Various results of optimized coverage patterns are shown herein to illustrate the effectiveness and validity of the GA technique.
Abstract: A parallel genetic algorithm (GA) optimization tool has been developed for the synthesis of arbitrarily shaped beam coverage using planar 2D phased-array antennas. Typically, the synthesis of a contoured beam footprint using a planar 2D array is difficult because of the inherently large number of degrees of freedom involved (in general, the amplitude and phase of each element must be determined). We make use of a parallel GA tool in this study to compensate for this aspect of the design problem. The algorithm essentially compares a desired pattern envelope with that of trial arrays, and quantifies the effectiveness or desirability of each test case via a fitness function. The GA uses this information to rank and select subsequent arrays over a given number of generations via the conventional stochastic operators, i.e., selection, crossover, and mutation. Each fitness evaluation of a trial pattern is done on a node of the Aerospace Fellowship cluster supercomputer, which increases the speed of the algorithm linearly with the number of nodes. Because of the continuous nature of the parameters for this optimization problem, a real parameter encoding scheme is employed for the GA chromosome in order to avoid the quantization errors associated with a binary representation. A benchmark 10 × 10 (100-element) array is employed, and various results of optimized coverage patterns are shown herein to illustrate the effectiveness and validity of the technique.
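The generation loop described above can be sketched as follows; the operators and parameters are generic real-coded GA choices, not the paper's, and the fitness evaluations on the first line are the independent computations farmed out one individual per cluster node:

```python
import random

def ga_step(pop, fitness, rng, elite=1, mut_sigma=0.1):
    """One generation of a real-coded GA. The fitness evaluations are
    mutually independent, so the sort's key calls are exactly the map a
    cluster distributes across nodes. Selection, crossover, and
    mutation below are generic illustrative choices."""
    scored = sorted(pop, key=fitness, reverse=True)   # parallelizable map
    nxt = scored[:elite]                              # elitism
    while len(nxt) < len(pop):
        a, b = rng.sample(scored[: len(pop) // 2], 2)  # truncation selection
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]                      # one-point crossover
        child = [g + rng.gauss(0.0, mut_sigma) for g in child]  # mutation
        nxt.append(child)
    return nxt
```

Because the elite individual is carried over unchanged, the best fitness in the population never decreases from one generation to the next.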

Proceedings Article
15 Jun 2007
TL;DR: In this article, the authors present three parallel algorithms for UCT for 9×9 Go, and they all improve the results of the programs that use them against GNU GO 3.6.
Abstract: We present three parallel algorithms for UCT. For 9×9 Go, they all improve the results of the programs that use them against GNU GO 3.6. The simplest one, the single-run algorithm, uses very few communications and shows improvements comparable to the more complex ones. Further improvements may be possible by sharing more information in the multiple-runs algorithm.
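A possible reading of the cheapest "single-run" variant is root parallelism: independent searches whose root statistics are merged once at the end. The sketch below illustrates this on a one-level toy tree, where UCT reduces to UCB1; everything here is an illustrative assumption rather than the authors' Go implementation:

```python
import math
import random

def ucb1_search(arm_means, iters, rng):
    """One independent UCT-style search, reduced to UCB1 on a one-level
    toy tree (arms = moves, rewards ~ Bernoulli). Returns visit counts."""
    n = [0] * len(arm_means)
    w = [0.0] * len(arm_means)
    for t in range(1, iters + 1):
        def score(a):
            if n[a] == 0:
                return float("inf")        # visit unexplored moves first
            return w[a] / n[a] + math.sqrt(2 * math.log(t) / n[a])
        a = max(range(len(arm_means)), key=score)
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        n[a] += 1
        w[a] += reward
    return n

def single_run_parallel(arm_means, iters, workers=4):
    """'Single-run' root parallelism: independent searches, a single
    merge of root visit counts at the end (one search per processor)."""
    totals = [0] * len(arm_means)
    for k in range(workers):
        counts = ucb1_search(arm_means, iters, random.Random(k))
        totals = [a + b for a, b in zip(totals, counts)]
    return max(range(len(arm_means)), key=lambda a: totals[a])
```

The only communication is one vector of visit counts per worker, matching the abstract's observation that the simplest variant uses very few communications.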

Journal ArticleDOI
TL;DR: A new branch-and-price optimization algorithm, termed “primal box”, and a specific branching variable selection rule that significantly reduces the number of explored nodes are proposed, which solve problems of large size to optimality within reasonable computational time.

Journal ArticleDOI
TL;DR: This paper studies designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems, and proposes three parameterized algorithms which can be tuned according to the problem size and the available hardware resources.
Abstract: The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on the Cray XD1, a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of XD1.
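The block-size trade-off the paper tunes can be illustrated in software with a blocked matrix multiplication, where the block size `bs` stands in for the on-chip-memory versus bandwidth knob. This is a sketch of the trade-off only, not the linear-array hardware design:

```python
def blocked_matmul(A, B, bs):
    """Blocked matrix multiplication over lists of lists. The block
    size `bs` plays the role of the tuning parameter: larger blocks
    need more fast (on-chip) storage but re-read the operands from
    slow memory fewer times."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                # one bs x bs x bs tile of work: operands fit in fast memory
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            C[i][j] += aik * B[k][j]
    return C
```

The result is identical for any `bs`; only the memory traffic pattern changes, which is the axis the paper's parameterized designs explore.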

Journal ArticleDOI
TL;DR: This paper evaluates the possibility of defining a hybrid customizable neighbourhood search algorithm for combinatorial problems as a combination of a subset of concepts and features from three reference metaheuristics, i.e., TS, SA, and VNS, and assesses the effectiveness of the resulting hybrid metaheuristic (HMH).

Proceedings ArticleDOI
21 Oct 2007
TL;DR: A general computational pattern that works well with early phase termination is identified and it is explained why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
Abstract: We present a new technique, early phase termination, for eliminating idle processors in parallel computations that use barrier synchronization. This technique simply terminates each parallel phase as soon as there are too few remaining tasks to keep all of the processors busy. Although this technique completely eliminates the idling that would otherwise occur at barrier synchronization points, it may also change the computation and therefore the result that the computation produces. We address this issue by providing probabilistic distortion models that characterize how the use of early phase termination distorts the result that the computation produces. Our experimental results show that for our set of benchmark applications, 1) early phase termination can improve the performance of the parallel computation, 2) the distortion is small (or can be made to be small with the use of an appropriate compensation technique) and 3) the distortion models provide accurate and tight distortion bounds. These bounds can enable users to evaluate the effect of early phase termination and confidently accept results from parallel computations that use this technique if they find the distortion bounds to be acceptable. Finally, we identify a general computational pattern that works well with early phase termination and explain why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
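The termination rule is simple to state in code. The sketch below simulates one barrier-synchronized phase sequentially; `n_procs`, the task list, and the skip accounting are hypothetical inputs used only to illustrate the rule:

```python
def phase_with_early_termination(tasks, worker_fn, n_procs):
    """Run one barrier-synchronized phase, terminating it as soon as
    fewer tasks remain than there are processors (so no processor ever
    idles at the barrier). Returns the results of the tasks that ran
    and the number of tasks that were dropped. Sequential simulation:
    real processors are modeled only by the remaining-task count."""
    done, skipped = [], 0
    for i in range(len(tasks)):
        if len(tasks) - i < n_procs:
            skipped = len(tasks) - i       # too few tasks to keep all busy
            break
        done.append(worker_fn(tasks[i]))
    return done, skipped
```

The dropped tasks are what distorts the result; the paper's probabilistic distortion models bound how far the perturbed answer can drift from the exact one.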

Proceedings ArticleDOI
18 Sep 2007
TL;DR: A FGPGA method based on GPU acceleration is proposed, which maps a parallel GA to texture rendering on consumer-level graphics cards, increases the population size, speeds up execution, and provides ordinary users with a feasible FGPGA solution.
Abstract: Fine-grained parallel genetic algorithm (FGPGA), though a popular and robust strategy for solving complicated optimization problems, is sometimes inconvenient to use as its population size is restricted by heavy data communication, and the parallel computers are relatively difficult to use, manage, and maintain and may not be accessible to most researchers. In this paper, we propose a FGPGA method based on GPU acceleration, which maps a parallel GA to texture rendering on consumer-level graphics cards. The analytical results demonstrate that the proposed method increases the population size, speeds up its execution, and provides ordinary users with a feasible FGPGA solution.

Journal ArticleDOI
TL;DR: In this article, a universal approach to grid motion is presented, which is applicable to any grid type, unstructured, hybrid or structured single- and multiblock, and is shown to preserve grid quality even when a large deformation case is considered.
Abstract: Numerical simulation of unsteady flows including moving boundaries, whether rigid, prescribed, or deforming, requires the mesh to move/deform also. Many approaches to mesh motion have been considered, but the approach adopted often depends on both the meshing approach used and the proposed application. A universal approach to grid motion is presented here. The method developed has several important properties, the most significant of which is that the scheme requires no grid point connectivity information, i.e. each point can be moved totally independently of its neighbours. This has two major implications: first, it means the scheme is universal, i.e. applicable to any grid type, unstructured, hybrid or structured single- and multiblock. Second, and equally as important, is that the scheme is perfectly parallel, as no communication is required between points/blocks. Each block can be updated independently of its neighbours so there is no connectivity data required, and this also means the flow-solver does not have to carry the grid motion parameterization data in memory. The scheme accounts for moving surface rotations as well as displacements, and this ensures grid quality is preserved by maintaining orthogonality. The method is applied to a four-bladed lifting rotor in forward flight. This is a particularly challenging unsteady problem as there are multiple bodies in relative motion, each one with its own axis system, as each blade has a cyclic pitch variation. Grid quality is proven to be preserved even when a large deformation case is considered and, significantly, the scheme adds only around 1% to the cost of a flow solution real time step. Copyright © 2006 John Wiley & Sons, Ltd.
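A connectivity-free point-motion scheme of the kind described can be sketched with inverse-distance weighting of surface displacements. The weight function below, and the omission of the paper's rotation handling, are illustrative simplifications, not the authors' formulation:

```python
def move_point(p, surf_pts, surf_disps, power=3.0):
    """Displace one volume grid point from surface motion alone, via
    inverse-distance weighting of the surface-point displacements.
    Each point uses only its own position and the surface data, so
    points need no connectivity information and can be moved fully in
    parallel, as in the paper; the weight exponent is an assumption."""
    wsum = 0.0
    disp = [0.0, 0.0, 0.0]
    for q, d in zip(surf_pts, surf_disps):
        r = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        if r == 0.0:
            # point lies on the moving surface: follow it exactly
            return [a + b for a, b in zip(p, d)]
        w = r ** -power
        wsum += w
        disp = [acc + w * dk for acc, dk in zip(disp, d)]
    return [a + b / wsum for a, b in zip(p, disp)]
```

Points near the surface follow it almost rigidly while far-field points barely move, which is the behaviour that preserves near-wall grid quality.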

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper presents a programming and execution model for multi-core architectures with memory hierarchy, and proposes a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators.
Abstract: Dynamic programming is an efficient technique for solving combinatorial search and optimization problems, and many parallel dynamic programming algorithms exist. The purpose of this paper is to study a family of dynamic programming algorithms where data dependences appear between non-consecutive stages; in other words, the data dependence is non-uniform. This kind of dynamic programming is typically called nonserial polyadic dynamic programming. Owing to the non-uniform data dependence, it is harder to optimize this problem for parallelism and locality on parallel architectures. In this paper, we address the challenge of exploiting fine-grain parallelism and locality of nonserial polyadic dynamic programming on a multi-core architecture. We present a programming and execution model for multi-core architectures with a memory hierarchy. In the framework of the new model, parallelism and locality benefit from a data dependence transformation. We propose a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators. The new parallel algorithm tolerates memory access latency using multithreading and is easily improved with a tiling technique. We formulate and analytically solve the optimization problem of determining the tile size that minimizes the total execution time. Experiments on a simulator validate the proposed model and show that the fine-grain parallel algorithm achieves sub-linear speedup and potentially high scalability on multi-core architectures.
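The matrix-chain-style recurrence below is a standard instance of nonserial polyadic dynamic programming and shows the non-uniform dependence the paper targets. The cost function `w` and the plain diagonal sweep are generic illustrations, not the paper's transformed, tiled schedule:

```python
def nonserial_polyadic_dp(n, w):
    """Fill the table c[i][j] = min over i<k<j of (c[i][k] + c[k][j])
    plus w(i, j), a standard nonserial polyadic recurrence: cell (i, j)
    depends on cells from every earlier diagonal, not just the previous
    one. Sweeping by diagonals makes the parallelism explicit: every
    cell on one diagonal is independent of the others, which is the
    fine-grain unit a tiled multi-core schedule would distribute."""
    c = [[0.0] * n for _ in range(n)]
    for d in range(2, n):                   # diagonal index = j - i
        for i in range(0, n - d):           # cells on a diagonal: parallel
            j = i + d
            c[i][j] = min(c[i][k] + c[k][j] for k in range(i + 1, j)) + w(i, j)
    return c
```

The non-uniform dependence is visible in the inner `min`: computing one cell reads an entire row and column prefix, which is why locality needs the dependence transformation and tiling the paper proposes.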

Book ChapterDOI
09 Sep 2007
TL;DR: A distributed algorithm due to Hoepman is analysed and it is shown how this can be turned into a parallel algorithm that scales well using up to 32 processors.
Abstract: We consider the problem of computing a weighted edge matching in a large graph using a parallel algorithm. This problem has application in several areas of combinatorial scientific computing. Since an exact algorithm for the weighted matching problem is both fairly expensive to compute and hard to parallelise, we instead consider fast approximation algorithms. We analyse a distributed algorithm due to Hoepman [8] and show how this can be turned into a parallel algorithm. Through experiments using both complete and sparse graphs, we show that our new parallel algorithm scales well on up to 32 processors.
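A sequential equivalent of the locally-dominant-edge idea behind Hoepman-style matching is the classic greedy half-approximation, sketched below; the edge-list layout is an assumption, and the distributed version finds the same dominant edges without the global sort:

```python
def dominant_edge_matching(edges):
    """Half-approximate weighted matching. Scanning edges by decreasing
    weight and greedily matching free endpoints yields the same kind of
    result as repeatedly taking locally dominant edges (edges heaviest
    at both endpoints), the idea the parallel algorithm builds on.
    `edges` is a list of (weight, u, v) tuples (assumed layout)."""
    matched = set()
    matching = []
    for w, u, v in sorted(edges, reverse=True):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

The greedy result has at least half the weight of an optimal matching; the parallel appeal is that locally dominant edges can be identified from each vertex's neighbourhood alone.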

Proceedings ArticleDOI
04 Jun 2007
TL;DR: It is argued that implicitly parallel programming models are critical for addressing the software development crises and software scalability challenges for many-core microprocessors.
Abstract: This paper argues for an implicitly parallel programming model for many-core microprocessors, and provides initial technical approaches towards this goal. In an implicitly parallel programming model, programmers maximize algorithm- level parallelism, express their parallel algorithms by asserting high-level properties on top of a traditional sequential programming language, and rely on parallelizing compilers and hardware support to perform parallel execution under the hood. In such a model, compilers and related tools require much more advanced program analysis capabilities and programmer assertions than what are currently available so that a comprehensive understanding of the input program's concurrency can be derived. Such an understanding is then used to drive automatic or interactive parallel code generation tools for a diverse set of parallel hardware organizations. The chip-level architecture and hardware should maintain parallel execution state in such a way that a strictly sequential execution state can always be derived for the purpose of verifying and debugging the program. We argue that implicitly parallel programming models are critical for addressing the software development crises and software scalability challenges for many-core microprocessors.

Proceedings ArticleDOI
01 Jan 2007
TL;DR: This paper examines existing ATC methods, providing an alternative to existing nested coordination schemes by using the block coordinate descent method (BCD), and applies diagonal quadratic approximation (DQA) by linearizing the cross term of the augmented Lagrangian function to create separable subproblems.
Abstract: Analytical Target Cascading (ATC) is an effective decomposition approach used for engineering design optimization problems that have hierarchical structures. With ATC, the overall system is split into subsystems, which are solved separately and coordinated via target/response consistency constraints. As parallel computing becomes more common, it is desirable to have separable subproblems in ATC so that each subproblem can be solved concurrently to increase computational throughput. In this paper, we first examine existing ATC methods, providing an alternative to existing nested coordination schemes by using the block coordinate descent method (BCD). Then we apply diagonal quadratic approximation (DQA) by linearizing the cross term of the augmented Lagrangian function to create separable subproblems. Local and global convergence proofs are described for this method. To further reduce overall computational cost, we introduce the truncated DQA (TDQA) method, which limits the number of inner loop iterations of DQA. These two new methods are empirically compared to existing methods using test problems from the literature. Results show that the computational cost of nested loop methods is reduced by using BCD and that, in general, the truncated methods, TDQA and ALAD, outperform other nested loop methods, with lower overall computational cost than the best previously reported results. © 2007 ASME

Proceedings ArticleDOI
15 Sep 2007
TL;DR: This study reports the experience implementing a realistic application using transactional memory (TM) and evaluates the exploitable parallelism of a transactional parallel implementation and explores how it can be adapted to deliver better performance.
Abstract: Transactional memory proposes an alternative synchronization primitive to traditional locks. Its promise is to simplify the software development of multi-threaded applications while at the same time delivering the performance of parallel applications that use (complex and error-prone) fine-grain locking. This study reports our experience implementing a realistic application using transactional memory (TM). The application is Lee's routing algorithm and was selected for its abundance of parallelism but the difficulty of expressing it with locks. Each route between a source and a destination point in a grid can be considered a unit of parallelism. Starting from this simple approach, we evaluate the exploitable parallelism of a transactional parallel implementation and explore how it can be adapted to deliver better performance. The adaptations neither introduce locks nor alter the essence of the implemented algorithm, but deliver up to 20 times more parallelism. The adaptations are derived from understanding the application itself and TM. The evaluation simulates an abstracted TM system and, thus, the results are independent of any specific software or hardware TM implementation, and describe properties of the application.

Proceedings ArticleDOI
12 Aug 2007
TL;DR: This paper develops a fully dynamic distributed algorithm for maintaining sparse spanners that improves drastically the quiescence time and improves significantly upon the state-of-the-art algorithm in all efficiency parameters.
Abstract: Currently, there are no known explicit algorithms for the great majority of graph problems in the dynamic distributed message-passing model. Instead, most state-of-the-art dynamic distributed algorithms are constructed by composing a static algorithm for the problem at hand with a simulation technique that converts static algorithms to dynamic ones. We argue that this powerful methodology does not provide satisfactory solutions for many important dynamic distributed problems, and this necessitates developing algorithms for these problems from scratch. In this paper we develop a fully dynamic distributed algorithm for maintaining sparse spanners. Our algorithm improves drastically the quiescence time of the state-of-the-art algorithm for the problem. Moreover, we show that the quiescence time of our algorithm is optimal up to a small constant factor. In addition, our algorithm improves significantly upon the state-of-the-art algorithm in all efficiency parameters, specifically, it has smaller quiescence message and space complexities, and smaller local processing time. Finally, our algorithm is self-contained and fairly simple, and is, consequently, amenable to implementation on unsophisticated network devices.

Proceedings ArticleDOI
12 Mar 2007
TL;DR: Three versions of the algorithm for regular expression matching on streams with out of order data are implemented and it is shown by experimental study that the algorithms are highly effective in matching regular expressions on IP packet streams.
Abstract: We present an efficient algorithm for regular expression matching on streams with out of order data, while maintaining a small state and without complete stream reconstruction. We have implemented three versions of the algorithm - sequential, parallel and mixed - and show by experimental study that the algorithms are highly effective in matching regular expressions on IP packet streams.

Book ChapterDOI
18 Dec 2007
TL;DR: A new approach to high performance molecular dynamics simulations on graphics processing units using the Compute Unified Device Architecture (CUDA) to design and implement a new parallel algorithm with significant runtime savings on an off-the-shelf computer graphics card.
Abstract: Molecular dynamics simulations are a common and often repeated task in molecular biology. The need for speeding up this treatment comes from the requirement for large system simulations with many atoms and numerous time steps. In this paper we present a new approach to high performance molecular dynamics simulations on graphics processing units. Using modern graphics processing units for high performance computing is facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) to design and implement a new parallel algorithm. This results in an implementation with significant runtime savings on an off-the-shelf computer graphics card.
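The force evaluation that dominates an MD step can be sketched with an all-pairs Lennard-Jones loop. Each atom's accumulation is independent, which is the one-thread-per-atom decomposition a CUDA kernel would use; the parameters and the O(n²) pair loop are a generic sketch, not the paper's kernel:

```python
def lj_forces(pos, eps=1.0, sigma=1.0):
    """All-pairs Lennard-Jones forces in reduced units. The outer loop
    bodies are independent (each atom accumulates only its own force),
    which is what maps naturally to one GPU thread per atom. The force
    on atom i from atom j is 24*eps*(2*s12 - s6)/r^2 times the
    separation vector, with s6 = (sigma/r)^6 and s12 = s6^2."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):                      # one GPU thread per atom
        for j in range(n):
            if i == j:
                continue
            d = [a - b for a, b in zip(pos[i], pos[j])]
            r2 = sum(x * x for x in d)
            s6 = (sigma * sigma / r2) ** 3
            f = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
            forces[i] = [fa + f * x for fa, x in zip(forces[i], d)]
    return forces
```

Each pair is computed twice here (once per endpoint); on a GPU that redundancy is usually cheaper than synchronizing shared accumulators across threads.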

BookDOI
20 Dec 2007
TL;DR: A Programming Model and Architectural Extensions for Fine-Grain Parallelism, and Applications Using FG to reduce the Effect of Latency in Parallel Programs Running on Clusters.
Abstract (table of contents):
PREFACE
MODELS
Evolving Computational Systems (S.G. Akl)
Decomposable BSP: A Bandwidth-Latency Model for Parallel and Hierarchical Computation (G. Bilardi, A. Pietracaprina, and G. Pucci)
Membrane Systems: A "Natural" Way of Computing with Cells (O.H. Ibarra and A. Paun)
Optical Transpose Systems: Models and Algorithms (C.-F. Wang and S. Sahni)
Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform (U. Vishkin, G. Caragea, and B. Lee)
Deterministic and Randomized Sorting Algorithms for the Parallel Disks Model (S. Rajasekaran)
A Programming Model and Architectural Extensions for Fine-Grain Parallelism (A. Gontmakher, A. Mendelson, A. Schuster, and G. Shklover)
Computing with Mobile Agents in Distributed Networks (E. Kranakis, D. Krizanc, and S. Rajsbaum)
Transitional Issues: Fine-Grain to Coarse-Grain Multicomputers (S. Olariu)
Distributed Computing in the Presence of Mobile Faults (N. Santoro and P. Widmayer)
A Hierarchical Performance Model for Reconfigurable Computers (R. Scorfano and V.K. Prasanna)
Hierarchical Performance Modeling and Analysis of Distributed Software Systems (R.A. Ammar)
Randomized Packet Routing, Selection, and Sorting on the POPS Network (J. Davila and S. Rajasekaran)
Dynamic Reconfiguration on the R-Mesh (R. Vaidyanathan and J.L. Trahan)
Fundamental Algorithms on the Reconfigurable Mesh (K. Nakano)
Reconfigurable Computing with Optical Buses (A.G. Bourgeois)
ALGORITHMS
Distributed Peer-to-Peer Data Structures (M.T. Goodrich and M.J. Nelson)
Parallel Algorithms via the Probabilistic Method (L. Kliemann and A. Srivastav)
Broadcasting on Networks of Workstations (S. Khuller, Y.-A. Kim, and Y.-C. Wan)
Atomic Selfish Routing in Networks: A Survey (S. Kontogiannis and P. Spirakis)
Scheduling in Grid Environments (Y-C. Lee and A.Y. Zomaya)
QoS Scheduling in Network and Storage Systems (P.J. Varman and A. Gulati)
Optimal Parallel Scheduling Algorithms in WDM Packet Interconnects (Y. Yang)
Online Real-Time Scheduling Algorithms for Multiprocessor Systems (M.A. Palis)
Parallel Algorithms for Maximal Independent Set and Maximal Matching (Y. Han)
Efficient Parallel Graph Algorithms for Shared-Memory Multiprocessors (D.A. Bader and G. Cong)
Parallel Algorithms for Volumetric Surface Construction (J. JaJa, Q. Shi, and A. Varshney)
Mesh-Based Parallel Algorithms for Ultra-Fast Computer Vision (S. Olariu)
Prospectus for a Dense Linear Algebra Software Library (J. Demmel and J. Dongarra)
Parallel Algorithms on Strings (W. Rytter)
Design of Multithreaded Algorithms for Combinatorial Problems (D.A. Bader, K. Madduri, G. Cong, and J. Feo)
Parallel Data Mining Algorithms for Association Rules and Clustering (J. Li, Y. Liu, W.-K. Liao, and A. Choudhary)
An Overview of Mobile Computing Algorithmics (S. Olariu and A.Y. Zomaya)
APPLICATIONS
Using FG to Reduce the Effect of Latency in Parallel Programs Running on Clusters (T.H. Cormen and E.R. Davidson)
High-Performance Techniques for Parallel I/O (A. Ching, K. Coloma, A. Choudhary, and W.-K. Liao)
Message Dissemination Using Modern Communication Primitives (T. Gonzalez)
Online Computation in Large Networks (S. Albers)
Online Call Admission Control in Wireless Cellular Networks (I. Caragiannis, C. Kaklamanis, and E. Papaioannou)
Minimum Energy Communication in Ad Hoc Wireless Networks (I. Caragiannis, C. Kaklamanis, and P. Kanellopoulos)
Power Aware Mapping of Real-Time Tasks to Multiprocessors (D. Zhu, B.R. Childers, D. Mosse, and R. Melhem)
Perspectives on Robust Resource Allocation for Heterogeneous Parallel and Distributed Systems (S. Ali, H.J. Siegel, and A.A. Maciejewski)
A Transparent Distributed Runtime for Java (M. Factor, A. Schuster, and K. Shagin)
Scalability of Parallel Programs (A. Grama and V. Kumar)
Spatial Domain Decomposition Methods in Parallel Scientific Computing (Sudip Seal and Srinivas Aluru)
Game Theoretical Solutions for Data Replication in Distributed Computing Systems (S.U. Khan and I. Ahmad)
Effectively Managing Data on a Grid (C.L. Ruby and R. Miller)
Fast and Scalable Parallel Matrix Multiplication and Its Applications on Distributed Memory Systems (K. Li)
INDEX

Proceedings ArticleDOI
26 Mar 2007
TL;DR: A generic work-partitioning technique on the Cell to hide memory access latency, and applies this to efficiently implement list ranking, a representative problem from the class of combinatorial and graph-theoretic applications.
Abstract: The Sony-Toshiba-IBM Cell Broadband Engine is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE), with eight SIMD coprocessing units (SPEs) integrated on-chip. We present a complexity model for designing algorithms on the Cell processor, along with a systematic procedure for algorithm analysis. To estimate the execution time of the algorithm, we consider the computational complexity, memory access patterns (DMA transfer sizes and latency), and the complexity of branching instructions. This model, coupled with the analysis procedure, simplifies algorithm design on the Cell and enables quick identification of potential implementation bottlenecks. Using the model, we design an efficient implementation of list ranking, a representative problem from the class of combinatorial and graph-theoretic applications. Due to its highly irregular memory patterns, list ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures. We describe a generic work-partitioning technique on the Cell to hide memory access latency, and apply this to efficiently implement list ranking. We run our algorithm on a 3.2 GHz Cell processor using an IBM QS20 Cell Blade and demonstrate a substantial speedup for list ranking on the Cell in comparison to traditional cache-based microprocessors. For a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.
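For readers unfamiliar with the problem, list ranking asks for each node's distance from the end of a linked list. The classic parallel formulation is Wyllie's pointer jumping, sketched below in plain Python as an illustration of the problem's structure only; it is not the paper's Cell-specific algorithm.

```python
def list_rank(succ):
    # Wyllie's pointer-jumping algorithm for list ranking. succ[i] is
    # the successor of node i, with succ[i] == i marking the tail.
    # Each round halves every node's remaining distance, so O(log n)
    # rounds suffice; on a PRAM the inner loop runs in parallel.
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    done = False
    while not done:
        done = True
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):                # conceptually a parallel-for
            if nxt[i] != nxt[nxt[i]]:
                done = False
            new_rank[i] = rank[i] + rank[nxt[i]]
            new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank   # distance of each node from the tail
```

The reads through `nxt[nxt[i]]` are exactly the irregular, data-dependent accesses the abstract refers to: they defeat hardware caches and prefetchers, which is why latency-hiding via software-managed DMA pays off on the Cell.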

Journal ArticleDOI
TL;DR: The design and implementation of a memory scalable parallel symbolic factorization algorithm for general sparse unsymmetric matrices using a graph partitioning approach applied to the graph of A to partition the matrix in such a way that is good for sparsity preservation as well as for parallel factorization.
Abstract: This paper presents the design and implementation of a memory scalable parallel symbolic factorization algorithm for general sparse unsymmetric matrices. Our parallel algorithm uses a graph partitioning approach, applied to the graph of $|A|+|A|^T$, to partition the matrix in a way that is good both for sparsity preservation and for parallel factorization. The partitioning yields a so-called separator tree, which represents the dependencies among the computations. We use the separator tree to distribute the input matrix over the processors using a block cyclic approach and a subtree-to-subprocessor mapping. The parallel algorithm performs a bottom-up traversal of the separator tree. With a combination of right-looking and left-looking partial factorizations, the algorithm obtains one column structure of $L$ and one row structure of $U$ at each step. The algorithm is implemented in C and MPI. From a performance study on large matrices, we show that the parallel algorithm significantly reduces the memory requirement of the symbolic factorization step, as well as the overall memory requirement of the parallel solver. It also often reduces the runtime relative to the sequential algorithm, even though that runtime is already relatively small. In general, the parallel algorithm prevents the symbolic factorization step from being a time or memory bottleneck of the parallel solver.
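To make "symbolic factorization" concrete: the step predicts the nonzero structure of the factors, including fill-in, before any numerical work. The sketch below is the much simpler sequential symmetric (Cholesky-style) case, for illustration only; the paper's algorithm handles the unsymmetric case in parallel via the graph of |A|+|A|^T.

```python
def symbolic_cholesky(A_struct, n):
    # Illustrative sequential symbolic factorization for a sparse
    # symmetric matrix. A_struct[j] is the set of row indices i >= j
    # with A[i][j] != 0; L[j] becomes the predicted structure of
    # column j of the factor, fill-in included.
    L = [set(A_struct[j]) | {j} for j in range(n)]
    parent = [None] * n
    for j in range(n):
        below = sorted(r for r in L[j] if r > j)
        if below:
            parent[j] = below[0]          # elimination-tree parent
            # Eliminating column j creates fill: the rest of its
            # pattern is inherited by the parent column.
            L[parent[j]] |= {r for r in L[j] if r > parent[j]}
    return L, parent
```

The `parent` array is the elimination tree, the symmetric analogue of the separator tree's dependency structure: a column can only be processed after all of its children, which is what exposes parallelism across independent subtrees.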

Journal ArticleDOI
TL;DR: A correction is presented to Ma and Sonka's thinning algorithm from 'A fully parallel 3D thinning algorithm and its applications', which fails to preserve the connectivity of 3D objects.