
Showing papers on "Parallel algorithm" published in 2010


Proceedings ArticleDOI
05 Jun 2010
TL;DR: The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines.
Abstract: MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
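FlumeJava itself is a Java library; purely as an illustration of the deferred-evaluation idea the abstract describes, here is a minimal C++ sketch (all names hypothetical, nothing resembling FlumeJava's actual API) of a collection that records element-wise stages and fuses them into a single pass when the result is finally demanded.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Conceptual sketch only: a lazy integer collection that records stages
// instead of executing them, then fuses them into one pass on demand --
// the deferred-evaluation/optimization idea, minus distributed execution.
class LazyInts {
public:
    explicit LazyInts(std::vector<int> data) : data_(std::move(data)) {}

    LazyInts& map(std::function<int(int)> f) {   // record, don't execute
        stages_.push_back(std::move(f));
        return *this;
    }

    std::vector<int> run() const {               // "optimize" = fuse stages
        std::vector<int> out;
        out.reserve(data_.size());
        for (int v : data_) {                    // one pass; parallelizable by chunks
            for (const auto& f : stages_) v = f(v);
            out.push_back(v);
        }
        return out;
    }

private:
    std::vector<int> data_;
    std::vector<std::function<int(int)>> stages_;
};

// Example: two logical passes execute as one fused loop.
//   LazyInts p({1, 2, 3});
//   auto r = p.map([](int x) { return x + 1; })
//             .map([](int x) { return x * 2; })
//             .run();   // yields {4, 6, 8}
```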

421 citations


Journal ArticleDOI
01 Jun 2010
TL;DR: The paper motivates the need for new algorithms that split the computation so as to fully exploit the power of each hybrid component, and envisions a dense linear algebra (DLA) library similar to LAPACK but for hybrid manycore/GPU systems.
Abstract: We highlight the trends leading to the increased appeal of using hybrid multicore+GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.
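For context on the computation being split, the following is a minimal C++ sketch of conventional blocked, right-looking LU factorization without pivoting, not the paper's hybrid algorithm: step 1 is the tall, skinny panel work typically kept on the multicore CPU, while step 3 is the large matrix-matrix update usually offloaded to the GPU.

```cpp
#include <algorithm>
#include <vector>

// Textbook blocked, right-looking LU without pivoting of a column-major
// n x n matrix, overwriting A with L (unit lower triangular) and U.
void blocked_lu(std::vector<double>& A, int n, int nb) {
    auto a = [&](int i, int j) -> double& { return A[i + j * n]; };
    for (int k = 0; k < n; k += nb) {
        const int kb = std::min(nb, n - k);
        // 1) Unblocked LU of the panel A(k:n, k:k+kb).
        for (int j = k; j < k + kb; ++j) {
            for (int i = j + 1; i < n; ++i) a(i, j) /= a(j, j);
            for (int jj = j + 1; jj < k + kb; ++jj)
                for (int i = j + 1; i < n; ++i)
                    a(i, jj) -= a(i, j) * a(j, jj);
        }
        // 2) Triangular solve for the block row: U12 = L11^{-1} * A12.
        for (int j = k + kb; j < n; ++j)
            for (int i = k; i < k + kb; ++i)
                for (int p = k; p < i; ++p)
                    a(i, j) -= a(i, p) * a(p, j);
        // 3) Trailing-matrix update: A22 -= L21 * U12 (the GPU-friendly GEMM).
        for (int j = k + kb; j < n; ++j)
            for (int p = k; p < k + kb; ++p)
                for (int i = k + kb; i < n; ++i)
                    a(i, j) -= a(i, p) * a(p, j);
    }
}
```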

398 citations


Journal ArticleDOI
TL;DR: In this article, the synergy between the LATIN multiscale method and the Proper Generalized Decomposition (PGD), which is the key to the method's performance, is discussed.

275 citations


Journal ArticleDOI
TL;DR: The proposed algorithm not only improved several of the known solutions, but also exhibited very satisfactory scalability.

262 citations


Proceedings ArticleDOI
09 Jan 2010
TL;DR: To combine the benefits of the basic algorithms, this work proposes hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively.
Abstract: We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.
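As a reference point for one of the three building blocks the paper studies, here is a serial C++ sketch of parallel cyclic reduction (PCR) in its generic textbook form, not the paper's tuned implementation; every iteration of the inner loop is independent across rows, which is what a GPU kernel exploits.

```cpp
#include <vector>

// Serial sketch of parallel cyclic reduction (PCR) for a tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0..n-1.
// Out-of-range neighbours are treated as identity rows (b = 1, a = c = d = 0).
// Once the stride exceeds n, every equation is fully decoupled.
std::vector<double> pcr_solve(std::vector<double> a, std::vector<double> b,
                              std::vector<double> c, std::vector<double> d) {
    const int n = static_cast<int>(b.size());
    auto at = [n](const std::vector<double>& v, int i, double pad) {
        return (i >= 0 && i < n) ? v[i] : pad;
    };
    for (int s = 1; s < n; s *= 2) {
        std::vector<double> na(n), nb(n), nc(n), nd(n);
        for (int i = 0; i < n; ++i) {                    // one GPU thread per row
            const double k1 = at(a, i, 0.0) / at(b, i - s, 1.0);
            const double k2 = at(c, i, 0.0) / at(b, i + s, 1.0);
            nb[i] = b[i] - at(c, i - s, 0.0) * k1 - at(a, i + s, 0.0) * k2;
            nd[i] = d[i] - at(d, i - s, 0.0) * k1 - at(d, i + s, 0.0) * k2;
            na[i] = -at(a, i - s, 0.0) * k1;
            nc[i] = -at(c, i + s, 0.0) * k2;
        }
        a.swap(na); b.swap(nb); c.swap(nc); d.swap(nd);
    }
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = d[i] / b[i];      // rows are now decoupled
    return x;
}
```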

232 citations


Journal ArticleDOI
TL;DR: Two genetic algorithms are developed, with heuristic principles added to improve performance; the developed algorithms are found to consistently outperform the traditional algorithms.

194 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A general method for analyzing nondeterministic programs that use reducers is presented, and it is shown that for a graph G=(V,E) with diameter D and bounded out-degree, a data-race-free version of the PBFS algorithm attains near-perfect linear speedup if P ≪ (V+E)/(D lg³(V/D)).
Abstract: We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices -- a condition met by many real-world graphs -- PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" -- a "hyperobject" feature of Cilk++ -- the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS is also nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G=(V,E) with diameter D and bounded out-degree, this data-race-free version of PBFS runs in time O((V+E)/P + D lg³(V/D)) on P processors, which means that it attains near-perfect linear speedup if P ≪ (V+E)/(D lg³(V/D)).
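For readers unfamiliar with the structure PBFS parallelizes, the following C++ sketch shows a plain level-synchronous BFS; PBFS replaces the per-level vectors used here with its "bag" reducer and processes each frontier with Cilk++ parallelism. This is only the serial skeleton, not the paper's code.

```cpp
#include <vector>

// Level-synchronous BFS: within one level, the loop over the current
// frontier has no ordering dependence, which is what PBFS exploits.
// dist[v] == -1 marks an unvisited vertex; adj is an adjacency list.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> dist(adj.size(), -1);
    std::vector<int> frontier{src};
    dist[src] = 0;
    int level = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        for (int u : frontier) {            // parallel loop in PBFS
            for (int v : adj[u]) {
                if (dist[v] == -1) {        // the benign race in the parallel version
                    dist[v] = level + 1;
                    next.push_back(v);
                }
            }
        }
        frontier.swap(next);
        ++level;
    }
    return dist;
}
```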

174 citations


Journal ArticleDOI
TL;DR: This work presents a parallel algorithm for k-nearest neighbor graph construction that uses Morton ordering, and shows that this approach has the following advantages over existing methods: faster construction of k-nearest neighbor graphs in practice on multicore machines, less space usage, better cache efficiency, ability to handle large data sets, and ease of parallelization and implementation.
Abstract: We present a parallel algorithm for k-nearest neighbor graph construction that uses Morton ordering. Experiments show that our approach has the following advantages over existing methods: 1) faster construction of k-nearest neighbor graphs in practice on multicore machines, 2) less space usage, 3) better cache efficiency, 4) ability to handle large data sets, and 5) ease of parallelization and implementation. If the point set has a bounded expansion constant, our algorithm requires one comparison-based parallel sort of the points according to Morton order, plus near-linear additional steps to output the k-nearest neighbor graph.
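A small C++ sketch of the standard Morton (Z-order) key computation that underlies such a sort; this is the generic bit-interleaving construction, not the authors' code, and the k-nearest-neighbor search over the sorted order (and the parallel sort itself) is not shown.

```cpp
#include <cstdint>

// Morton (Z-order) key: interleave the bits of 21-bit integer coordinates so
// that an ordinary comparison-based sort groups spatially nearby points.
std::uint64_t morton3d(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    std::uint64_t code = 0;
    for (int i = 0; i < 21; ++i) {
        code |= (std::uint64_t{(x >> i) & 1u}) << (3 * i)
              | (std::uint64_t{(y >> i) & 1u}) << (3 * i + 1)
              | (std::uint64_t{(z >> i) & 1u}) << (3 * i + 2);
    }
    return code;
}
```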

163 citations


Proceedings ArticleDOI
21 Jun 2010
TL;DR: This work introduces LogGOPSim---a fast simulation framework for parallel algorithms at large-scale that utilizes a slightly extended version of the well-known LogGPS model in combination with full MPI message matching semantics and detailed simulation of collective operations.
Abstract: We introduce LogGOPSim---a fast simulation framework for parallel algorithms at large-scale. LogGOPSim utilizes a slightly extended version of the well-known LogGPS model in combination with full MPI message matching semantics and detailed simulation of collective operations. In addition, it enables simulation in the traditional LogP, LogGP, and LogGPS models. Its simple and fast single-queue design computes more than 1 million events per second on a single processor and enables large-scale simulations of more than 8 million processes. LogGOPSim also supports the simulation of full MPI applications by reading and simulating MPI profiling traces. We analyze the accuracy and the performance of the simulation and propose a simple extrapolation scheme for parallel applications. Our scheme extrapolates collective operations with high accuracy by rebuilding the communication pattern. Point-to-point operation patterns can be copied in the extrapolation and thus retain the main characteristics of scalable parallel applications.
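To make the model family concrete, here is a tiny C++ sketch of the classic LogGP cost of a single point-to-point message and of a binomial-tree broadcast built from it; LogGOPSim's extended LogGOPS model adds further parameters beyond these four, and nothing of its simulation machinery is reproduced here.

```cpp
#include <cmath>

// Classic LogGP cost of a k-byte point-to-point message: sender overhead,
// per-byte serialization, wire latency, receiver overhead.  The gap g only
// matters for streams of messages, so it does not appear in this formula.
struct LogGP {
    double L;   // latency
    double o;   // per-message CPU overhead
    double g;   // gap between consecutive messages
    double G;   // gap per byte
};

double p2p_time(const LogGP& m, double k_bytes) {
    return m.o + (k_bytes - 1.0) * m.G + m.L + m.o;
}

// A binomial-tree broadcast over P processes needs about ceil(log2 P)
// such message steps on its critical path.
double bcast_time(const LogGP& m, double k_bytes, int P) {
    return std::ceil(std::log2(static_cast<double>(P))) * p2p_time(m, k_bytes);
}
```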

159 citations


Proceedings ArticleDOI
19 Apr 2010
TL;DR: A new power-aware performance prediction model of hybrid MPI/OpenMP applications is used to derive a novel algorithm for power-efficient execution of realistic applications from the ASC Sequoia and NPB MZ benchmarks.
Abstract: Power-aware execution of parallel programs is now a primary concern in large-scale HPC environments. Prior research in this area has explored models and algorithms based on dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) to achieve power-aware execution of programs written in a single programming model, typically MPI or OpenMP. However, hybrid programming models combining MPI and OpenMP are growing in popularity as emerging large-scale systems have many nodes with several processors per node and multiple cores per processor. In this paper we present and evaluate solutions for power-efficient execution of programs written in this hybrid model targeting large-scale distributed systems with multicore nodes. We use a new power-aware performance prediction model of hybrid MPI/OpenMP applications to derive a novel algorithm for power-efficient execution of realistic applications from the ASC Sequoia and NPB MZ benchmarks. Our new algorithm yields substantial energy savings (4.18% on average and up to 13.8%) with either negligible performance loss or performance gain (up to 7.2%).

153 citations


Journal ArticleDOI
TL;DR: The simulation code PIConGPU presented in this paper is, to the authors' knowledge, the first scalable GPU cluster implementation of the PIC algorithm in plasma physics.
Abstract: The particle-in-cell (PIC) algorithm is one of the most widely used algorithms in computational plasma physics. With the advent of graphical processing units (GPUs), large-scale plasma simulations on inexpensive GPU clusters are in reach. We present an implementation of a fully relativistic plasma PIC algorithm for GPUs based on the NVIDIA CUDA library. It supports a hybrid architecture consisting of single computation nodes interconnected in a standard cluster topology, with each node carrying one or more GPUs. The internode communication is realized using the message-passing interface. The simulation code PIConGPU presented in this paper is, to our knowledge, the first scalable GPU cluster implementation of the PIC algorithm in plasma physics.
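As a rough illustration of the per-particle work a PIC code parallelizes, here is a C++ sketch of a simple electrostatic, non-relativistic particle push on a 1-D grid; the paper's relativistic electromagnetic push and its CUDA implementation are considerably more involved, and only the one-thread-per-particle structure is suggested here.

```cpp
#include <vector>

// Per-particle update: gather the field at the particle position, advance
// velocity, advance position.  E holds the electric field at grid nodes with
// spacing dx; particles are assumed to stay inside the grid.
struct Particle { double x, v; };

void push_particles(std::vector<Particle>& particles,
                    const std::vector<double>& E,
                    double dx, double q_over_m, double dt) {
    for (auto& p : particles) {                 // one GPU thread per particle
        const int cell = static_cast<int>(p.x / dx);
        const double w = p.x / dx - cell;       // linear weight within the cell
        const double Ep = (1.0 - w) * E[cell] + w * E[cell + 1];   // gather
        p.v += q_over_m * Ep * dt;              // velocity update
        p.x += p.v * dt;                        // position update
    }
}
```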

Journal ArticleDOI
TL;DR: The algorithms and functionality of a new module developed to support overset grid assembly associated with performing time-dependent and adaptive moving body calculations of external aerodynamic flows using a multi-solver paradigm are described.

Journal ArticleDOI
TL;DR: This work uses thread and data parallelism to perform fast hierarchy construction, updating, and traversal using tight‐fitting bounding volumes such as oriented bounding boxes (OBB) and rectangular swept spheres (RSS) and describes efficient algorithms to compute a linear bounding volume hierarchy (LBVH) and update them using refitting methods.
Abstract: We present novel parallel algorithms for collision detection and separation distance computation for rigid and deformable models that exploit the computational capabilities of many-core GPUs. Our approach uses thread and data parallelism to perform fast hierarchy construction, updating, and traversal using tight-fitting bounding volumes such as oriented bounding boxes (OBB) and rectangular swept spheres (RSS). We also describe efficient algorithms to compute a linear bounding volume hierarchy (LBVH) and update them using refitting methods. Moreover, we show that tight-fitting bounding volume hierarchies offer improved performance on GPU-like throughput architectures. We use our algorithms to perform discrete and continuous collision detection including self-collisions, as well as separation distance computation between non-overlapping models. In practice, our approach (gProximity) can perform these queries in a few milliseconds on a PC with NVIDIA GTX 285 card on models composed of tens or hundreds of thousands of triangles used in cloth simulation, surgical simulation, virtual prototyping and N-body simulation. Moreover, we observe more than an order of magnitude performance improvement over prior GPU-based algorithms.
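For orientation, a C++ sketch of the serial skeleton of hierarchy-based collision detection: an AABB overlap test and a tandem traversal of two bounding volume hierarchies that collects candidate primitive pairs. The paper uses tighter-fitting OBBs and RSS and runs the traversal with GPU thread and data parallelism; none of that is reproduced here.

```cpp
#include <utility>
#include <vector>

struct AABB { float lo[3], hi[3]; };

// Two axis-aligned boxes overlap iff they overlap on every axis.
bool overlap(const AABB& a, const AABB& b) {
    for (int k = 0; k < 3; ++k)
        if (a.hi[k] < b.lo[k] || b.hi[k] < a.lo[k]) return false;
    return true;
}

struct BVHNode {
    AABB box;
    int left = -1, right = -1;  // child indices; both -1 at a leaf
    int prim = -1;              // primitive index stored at a leaf
};

// Tandem traversal: prune whenever the bounding volumes are disjoint.
void collide(const std::vector<BVHNode>& A, int ia,
             const std::vector<BVHNode>& B, int ib,
             std::vector<std::pair<int, int>>& pairs) {
    if (!overlap(A[ia].box, B[ib].box)) return;
    const bool leafA = A[ia].left < 0;
    const bool leafB = B[ib].left < 0;
    if (leafA && leafB) {
        pairs.emplace_back(A[ia].prim, B[ib].prim);  // exact test done later
    } else if (leafA) {                              // descend into B
        collide(A, ia, B, B[ib].left, pairs);
        collide(A, ia, B, B[ib].right, pairs);
    } else {                                         // descend into A
        collide(A, A[ia].left, B, ib, pairs);
        collide(A, A[ia].right, B, ib, pairs);
    }
}
```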

Journal ArticleDOI
TL;DR: The empirical results show that the resulting elimination-backoff stack performs as well as the simple stack at low loads, and increasingly outperforms all other methods (lock-based and non-blocking) as concurrency increases.

Journal ArticleDOI
TL;DR: An ensemble of DDE (eDDE) algorithms with parallel populations, in which each parameter set and crossover operator is assigned to one of the populations, is presented and compared against the best-performing algorithms from the literature.

Journal ArticleDOI
TL;DR: Model-based clustering using a family of Gaussian mixture models, with parsimonious factor analysis-like covariance structure, is described and an efficient algorithm for its implementation is presented, showing its effectiveness when compared to existing software.

Proceedings ArticleDOI
09 Jan 2010
TL;DR: A novel approach to predict the sequential computation time accurately and efficiently for large-scale parallel applications on non-existing target machines is proposed and a performance prediction framework, called PHANTOM, is implemented, which integrates the above computation-time acquisition approach with a trace-driven network simulator.
Abstract: For designers of large-scale parallel computers, it is greatly desired that performance of parallel applications can be predicted at the design phase. However, this is difficult because the execution time of parallel applications is determined by several factors, including sequential computation time in each process, communication time and their convolution. Despite previous efforts, it remains an open problem to estimate sequential computation time in each process accurately and efficiently for large-scale parallel applications on non-existing target machines. This paper proposes a novel approach to predict the sequential computation time accurately and efficiently. We assume that there is at least one node of the target platform but the whole target system need not be available. We make two main technical contributions. First, we employ deterministic replay techniques to execute any process of a parallel application on a single node at real speed. As a result, we can simply measure the real sequential computation time on a target node for each process one by one. Second, we observe that computation behavior of processes in parallel applications can be clustered into a few groups while processes in each group have similar computation behavior. This observation helps us reduce measurement time significantly because we only need to execute representative parallel processes instead of all of them. We have implemented a performance prediction framework, called PHANTOM, which integrates the above computation-time acquisition approach with a trace-driven network simulator. We validate our approach on several platforms. For ASCI Sweep3D, the error of our approach is less than 5% on 1024 processor cores. Compared to a recent regression-based prediction approach, PHANTOM presents better prediction accuracy across different platforms.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: In this article, the authors describe an approach to parallel graph partitioning that scales to hundreds of processors and produces high solution quality, improving the best known partitionings for many instances from Walshaw's benchmark collection.
Abstract: We describe an approach to parallel graph partitioning that scales to hundreds of processors and produces a high solution quality. For example, for many instances from Walshaw's benchmark collection we improve the best known partitioning. We use the well known framework of multi-level graph partitioning. All components are implemented by scalable parallel algorithms. Quality improvements compared to previous systems are due to better prioritization of edges to be contracted, better approximation algorithms for identifying matchings, better local search heuristics, and perhaps most notably, a parallelization of the FM local search algorithm that works more locally than previous approaches.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: New parallel algorithms for parallel dynamic AMR on forest-of-octrees geometries with arbitrary-order continuous and discontinuous finite/spectral element discretizations for multiscale geophysics problems are presented.
Abstract: Many problems are characterized by dynamics occurring on a wide range of length and time scales. One approach to overcoming the tyranny of scales is adaptive mesh refinement/coarsening (AMR), which dynamically adapts the mesh to resolve features of interest. However, the benefits of AMR are difficult to achieve in practice, particularly on the petascale computers that are essential for difficult problems. Due to the complex dynamic data structures and frequent load balancing, scaling dynamic AMR to hundreds of thousands of cores has long been considered a challenge. Another difficulty is extending parallel AMR techniques to high-order-accurate, complex-geometry-respecting methods that are favored for many classes of problems. Here we present new parallel algorithms for parallel dynamic AMR on forest-of-octrees geometries with arbitrary-order continuous and discontinuous finite/spectral element discretizations. The implementations of these algorithms exhibit excellent weak and strong scaling to over 224,000 Cray XT5 cores for multiscale geophysics problems.

Proceedings Article
01 Jan 2010
TL;DR: In this article, the authors proposed effective parallelization strategies for the ACO metaheuristic on Graphics Processing Units (GPUs) using the Max-Min Ant System (MMAS) algorithm augmented with 3-opt local search.
Abstract: The purpose of this paper is to propose effective parallelization strategies for the Ant Colony Optimization (ACO) metaheuristic on Graphics Processing Units (GPUs). The Max-Min Ant System (MMAS) algorithm augmented with 3-opt local search is used as a framework for the implementation of the parallel ants and multiple ant colonies general parallelization approaches. The four resulting GPU algorithms are extensively evaluated and compared on both speedup and solution quality on a state-of-the-art Fermi GPU architecture. A rigorous effort is made to keep parallel algorithms true to the original MMAS applied to the Traveling Salesman Problem. We report speedups of up to 23.60 with solution quality similar to the original sequential implementation. With the intent of providing a parallelization framework for ACO on GPUs, a comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures and parallelization granularity.
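For reference, the sequential Max-Min Ant System pheromone update that such GPU implementations have to reproduce, sketched in C++; this is the generic textbook rule, not the paper's kernels, and tour construction is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// MMAS pheromone update: evaporate on every edge, deposit only on the edges
// of the best tour found, then clamp every value into [tau_min, tau_max].
void mmas_update(std::vector<std::vector<double>>& tau,
                 const std::vector<int>& best_tour, double best_len,
                 double rho, double tau_min, double tau_max) {
    for (auto& row : tau)
        for (double& t : row) t *= (1.0 - rho);             // evaporation
    const double deposit = 1.0 / best_len;
    for (std::size_t i = 0; i < best_tour.size(); ++i) {    // deposit on best tour
        const int u = best_tour[i];
        const int v = best_tour[(i + 1) % best_tour.size()];
        tau[u][v] += deposit;
        tau[v][u] += deposit;                               // symmetric TSP
    }
    for (auto& row : tau)
        for (double& t : row) t = std::clamp(t, tau_min, tau_max);
}
```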

Journal ArticleDOI
TL;DR: An efficient hybrid MPI/OpenMP parallel implementation of an innovative approach that combines the Fast Fourier Transform and the Multilevel Fast Multipole Algorithm has been successfully used to solve an electromagnetic problem involving 620 million unknowns.
Abstract: MLFMA-FFT Parallel Algorithm for the Solution of Large-Scale Problems in Electromagnetics (Invited Paper). J. M. Taboada and L. Landesa (Universidad de Extremadura) and M. G. Araújo, J. M. Bértolo, F. Obelleiro, and J. L. Rodríguez (Universidade de Vigo). An efficient hybrid MPI/OpenMP parallel implementation of an innovative approach that combines the Fast Fourier Transform (FFT) and the Multilevel Fast Multipole Algorithm (MLFMA) has been successfully used to solve an electromagnetic problem involving 620 million unknowns. The MLFMA-FFT method can deal with extremely large problems due to its high scalability and its reduced computational complexity.

Proceedings ArticleDOI
13 Dec 2010
TL;DR: This paper focuses on document collections, which are characterized by a sparseness that allows effective pruning strategies, and proposes a new parallel algorithm within the MapReduce framework that outperforms the state of the art by a factor of 4.5.
Abstract: Given a collection of objects, the Similarity Self-Join problem requires discovering all pairs of objects whose similarity is above a user-defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real world data shows that our algorithm outperforms the state of the art by a factor of 4.5.

Proceedings ArticleDOI
18 Jul 2010
TL;DR: This paper provides an implementation of the Differential Evolution (DE) algorithm in C-CUDA and demonstrates that the computing time can be significantly reduced using C-CUDA.
Abstract: Several areas of knowledge are being benefited with the reduction of the computing time by using the technology of Graphics Processing Units (GPU) and the Compute Unified Device Architecture (CUDA) platform. In case of Evolutionary algorithms, which are inherently parallel, this technology may be advantageous for running experiments demanding high computing time. In this paper, we provide an implementation of the Differential Evolution (DE) algorithm in C-CUDA. The algorithm was tested on a suite of well-known benchmark optimization problems and the computing time has been compared with the same algorithm implemented in C. Results demonstrate that the computing time can significantly be reduced using C-CUDA. As far as we know, this is the first implementation of DE algorithm in C-CUDA.
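For readers unfamiliar with the algorithm being ported, here is a plain C++ sketch of one generation of classic DE/rand/1/bin; each trial vector is built and evaluated independently of the others, which is what makes the method map naturally onto CUDA threads. This is a generic textbook formulation, not the paper's C-CUDA code.

```cpp
#include <functional>
#include <random>
#include <vector>

using Vec = std::vector<double>;

// One DE/rand/1/bin generation for minimizing f.  Requires a population of
// at least four vectors; F is the mutation factor, CR the crossover rate.
void de_generation(std::vector<Vec>& pop, std::vector<double>& fitness,
                   const std::function<double(const Vec&)>& f,
                   double F, double CR, std::mt19937& rng) {
    const std::size_t NP = pop.size(), D = pop[0].size();
    std::uniform_int_distribution<std::size_t> pick(0, NP - 1), dim(0, D - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    for (std::size_t i = 0; i < NP; ++i) {         // independent per individual
        std::size_t r1, r2, r3;                    // three distinct partners != i
        do { r1 = pick(rng); } while (r1 == i);
        do { r2 = pick(rng); } while (r2 == i || r2 == r1);
        do { r3 = pick(rng); } while (r3 == i || r3 == r1 || r3 == r2);
        Vec trial = pop[i];
        const std::size_t jrand = dim(rng);        // force at least one mutated gene
        for (std::size_t j = 0; j < D; ++j)
            if (j == jrand || unit(rng) < CR)
                trial[j] = pop[r1][j] + F * (pop[r2][j] - pop[r3][j]);
        const double ft = f(trial);
        if (ft <= fitness[i]) { pop[i] = trial; fitness[i] = ft; }  // greedy selection
    }
}
```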

Proceedings ArticleDOI
24 May 2010
TL;DR: The major components of stapl are described and performance results for both algorithms and data structures showing scalability up to tens of thousands of processors are presented.
Abstract: The Standard Template Adaptive Parallel Library (stapl) is a high-productivity parallel programming framework that extends C++ and stl with unified support for shared and distributed memory parallelism. stapl provides distributed data structures (pContainers) and parallel algorithms (pAlgorithms) and a generic methodology for extending them to provide customized functionality. The stapl runtime system provides the abstraction for communication and program execution. In this paper, we describe the major components of stapl and present performance results for both algorithms and data structures showing scalability up to tens of thousands of processors.

Journal ArticleDOI
TL;DR: With the new code, the excellent scalability of the parallelization scheme is demonstrated in large-scale four-component multireference CI (MRCI) benchmark tests on two of the most common computer architectures, and the hardware-dependent aspects with respect to possible speedup limitations are discussed.
Abstract: We present a parallel implementation of a large-scale relativistic double-group configuration interaction (CI) program. It is applicable with a large variety of two- and four-component Hamiltonians. The parallel algorithm is based on a distributed data model in combination with a static load balancing scheme. The excellent scalability of our parallelization scheme is demonstrated in large-scale four-component multireference CI (MRCI) benchmark tests on two of the most common computer architectures, and we also discuss hardware-dependent aspects with respect to possible speedup limitations. With the new code we have been able to calculate accurate spectroscopic properties for the ground state and the first excited state of the BiH molecule using extensive basis sets. We focused, in particular, on an accurate description of the splitting of these two states which is caused by spin-orbit coupling. Our largest parallel MRCI calculation thereby comprised an expansion length of 2.7×10⁹ Slater determinants.

Proceedings ArticleDOI
09 Jan 2010
TL;DR: SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler, is designed for programming models where locality hints are provided to the runtime by the programmer or compiler; it achieves locality awareness by grouping workers into places and thereby obtains speedup over locality-oblivious scheduling.
Abstract: This poster introduces SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler. SLAW features an adaptive task scheduling algorithm combined with a locality-aware scheduling framework. Past work has demonstrated the pros and cons of using fixed scheduling policies, such as work-first and help-first, in different cases without a clear winner. Prior work also assumes the availability and successful execution of a serial version of the parallel program. This assumption can limit the expressiveness of dynamic task parallel languages. The SLAW scheduler supports both work-first and help-first policies simultaneously. It does so by using an adaptive approach that selects a scheduling policy on a per-task basis at runtime. The SLAW scheduler also establishes bounds on the stack usage and the heap space needed to store tasks. The experimental results for the benchmarks studied show that SLAW's adaptive scheduler achieves 0.98x - 9.2x speedup over the help-first scheduler and 0.97x - 4.5x speedup over the work-first scheduler for 64-thread executions, thereby establishing the robustness of using an adaptive approach instead of a fixed policy. In contrast, the help-first policy is 9.2x slower than work-first in the worst case for a fixed help-first policy, and the work-first policy is 3.7x slower than help-first in the worst case for a fixed work-first policy. Further, for large irregular recursive parallel computations, the adaptive scheduler runs with bounded stack usage and achieves performance (and supports data sizes) that cannot be delivered by the use of any single fixed policy. The SLAW scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler, and achieves locality-awareness by grouping workers into places. Locality awareness can lead to improved performance by increasing temporal data reuse within a worker and among workers in the same place. Our experimental results show that locality-aware scheduling can achieve up to 2.6x speedup over locality-oblivious scheduling, for the benchmarks studied.

Journal ArticleDOI
TL;DR: To the authors' knowledge, the multigrid algorithm presented in this work is the only matrix-free multiplicative geometric multigrid implementation for solving finite element equations on octree meshes using thousands of processors.
Abstract: In this article, we present a parallel geometric multigrid algorithm for solving variable-coefficient elliptic partial differential equations on the unit box (with Dirichlet or Neumann boundary conditions) using highly nonuniform, octree-based, conforming finite element discretizations. Our octrees are 2:1 balanced, that is, we allow no more than one octree-level difference between octants that share a face, edge, or vertex. We describe a parallel algorithm whose input is an arbitrary 2:1 balanced fine-grid octree and whose output is a set of coarser 2:1 balanced octrees that are used in the multigrid scheme. Also, we derive matrix-free schemes for the discretized finite element operators and the intergrid transfer operations. The overall scheme is second-order accurate for sufficiently smooth right-hand sides and material properties; its complexity for nearly uniform trees is $\mathcal{O}(\frac{N}{n_p}\log\frac{N}{n_p})+\mathcal{O}(n_p\log n_p)$, where $N$ is the number of octree nodes and $n_p$ is the number of processors. Our implementation uses the Message Passing Interface standard. We present numerical experiments for the Laplace and Navier (linear elasticity) operators that demonstrate the scalability of our method. Our largest run was a highly nonuniform, 8-billion-unknown, elasticity calculation using 32,000 processors on the Teragrid system, “Ranger,” at the Texas Advanced Computing Center. Our implementation is publicly available in the Dendro library, which is built on top of the PETSc library from Argonne National Laboratory.
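The paper's solver operates on octree-based finite element meshes in parallel; as a scalar analogue of the underlying multigrid idea, here is a C++ sketch of one V-cycle for the 1-D Poisson problem with weighted-Jacobi smoothing, full-weighting restriction, and linear interpolation. This is textbook material, not the Dendro implementation.

```cpp
#include <cstddef>
#include <vector>

// One V-cycle for -u'' = f on (0,1) with zero Dirichlet boundaries,
// discretized on n interior points, n = 2^k - 1, mesh width h = 1/(n+1).
using Vec = std::vector<double>;

static void jacobi(Vec& u, const Vec& f, double h, int sweeps) {
    const double w = 2.0 / 3.0;                           // weighted Jacobi
    const std::size_t n = u.size();
    for (int s = 0; s < sweeps; ++s) {
        Vec un = u;
        for (std::size_t i = 0; i < n; ++i) {
            const double ul = (i > 0) ? u[i - 1] : 0.0;
            const double ur = (i + 1 < n) ? u[i + 1] : 0.0;
            un[i] = (1.0 - w) * u[i] + w * 0.5 * (ul + ur + h * h * f[i]);
        }
        u = un;
    }
}

void v_cycle(Vec& u, const Vec& f, double h) {
    const std::size_t n = u.size();
    if (n == 1) { u[0] = 0.5 * h * h * f[0]; return; }    // exact coarsest solve
    jacobi(u, f, h, 3);                                   // pre-smoothing
    Vec r(n);                                             // residual r = f - A u
    for (std::size_t i = 0; i < n; ++i) {
        const double ul = (i > 0) ? u[i - 1] : 0.0;
        const double ur = (i + 1 < n) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - ul - ur) / (h * h);
    }
    const std::size_t nc = (n - 1) / 2;                   // coarse grid, H = 2h
    Vec rc(nc), ec(nc, 0.0);
    for (std::size_t j = 0; j < nc; ++j)                  // full-weighting restriction
        rc[j] = 0.25 * (r[2 * j] + 2.0 * r[2 * j + 1] + r[2 * j + 2]);
    v_cycle(ec, rc, 2.0 * h);                             // coarse-grid correction
    for (std::size_t i = 0; i < n; ++i) {                 // linear interpolation
        if (i % 2 == 1) {
            u[i] += ec[i / 2];
        } else {
            const double lo = (i >= 2) ? ec[i / 2 - 1] : 0.0;
            const double hi = (i / 2 < nc) ? ec[i / 2] : 0.0;
            u[i] += 0.5 * (lo + hi);
        }
    }
    jacobi(u, f, h, 3);                                   // post-smoothing
}
```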

Book ChapterDOI
30 Jun 2010
TL;DR: This paper presents a scalable, parallel algorithm for data clustering in the astronomy simulation domain that matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.
Abstract: Scientists' ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability rather than limitations in the availability of data have become the bottleneck to scientific discovery. MapReduce-style platforms hold the promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprised of 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: A novel parallel algorithm to speed up string matching on GPUs is proposed, along with a new state machine for string matching that is better suited to execution on a GPU.
Abstract: Network intrusion detection systems have been widely used to protect computer systems from network attacks. Due to the ever-increasing number of attacks and network complexity, traditional software approaches on uni-processors have become inadequate for the current high-speed network. In this paper, we propose a novel parallel algorithm to speed up string matching performed on GPUs. We also introduce a new state machine for string matching that is better suited to execution on a GPU. We describe several speedup techniques that take account of the special architectural properties of GPUs. The experimental results demonstrate that the new algorithm on GPUs achieves up to a 4,000 times speedup compared to the AC algorithm on a CPU. Compared to other GPU approaches, the new algorithm is 3 times faster with a significant improvement in memory efficiency. Furthermore, because the new algorithm reduces the complexity of the Aho-Corasick algorithm, it also improves on memory requirements.
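For context, the inner loop that such GPU string matchers parallelize is a table-driven automaton walk; a C++ sketch is below. The dense table layout and the `next`/`is_match` names are illustrative assumptions, not the paper's data structures, and building the automaton is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Table-driven multi-pattern scan: once an Aho-Corasick-style automaton has
// been compiled into a dense transition table, matching is a pure table walk.
// GPU implementations typically give each thread its own (overlapping)
// segment of the input and run this loop per segment.
int count_matches(const std::vector<int>& next,               // next[s*256 + byte]
                  const std::vector<std::uint8_t>& is_match,  // is_match[s] != 0 => accept
                  const std::uint8_t* data, std::size_t len) {
    int state = 0, hits = 0;
    for (std::size_t i = 0; i < len; ++i) {
        state = next[static_cast<std::size_t>(state) * 256 + data[i]];
        if (is_match[state]) ++hits;
    }
    return hits;
}
```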

Journal ArticleDOI
TL;DR: Numerical results demonstrate the validity and potential of the parallel AMR approach, which predicts fine-scale features of complex turbulent non-premixed flames and allows for automatic solution-directed mesh adaptation according to physics-based refinement criteria.