
Showing papers on "Parallel algorithm published in 2009"


Proceedings ArticleDOI
14 Jun 2009
TL;DR: It is argued that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods.
Abstract: The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples. In this paper, we suggest massively parallel methods to help resolve these problems. We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scaling up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, we are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, we develop a simple, inherently parallel algorithm that leads to a 5 to 15-fold speedup over previous methods.

711 citations
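The core of the speedup is that both models reduce to large dense matrix operations, which GPUs execute well. As a rough illustration (not the authors' code; biases are omitted and numpy stands in for GPU BLAS), one contrastive-divergence step for a single RBM layer of a DBN looks like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for an RBM layer.

    Everything reduces to large dense matrix products, which is why
    the training maps well onto GPU BLAS kernels."""
    h0 = sigmoid(v0 @ W)                                   # positive phase
    h_sample = (rng.random(h0.shape) < h0).astype(float)   # stochastic hidden states
    v1 = sigmoid(h_sample @ W.T)                           # reconstruction
    h1 = sigmoid(v1 @ W)                                   # negative phase
    return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

# Batched updates: the larger the batch, the better the GPU utilization.
W = np.random.default_rng(1).normal(0, 0.01, size=(784, 512))
batch = np.random.default_rng(2).random((256, 784))
W = cd1_update(W, batch)
```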


Journal ArticleDOI
TL;DR: A geometry optimizer, called DL-FIND, to be included in atomistic simulation codes, that can optimize structures in Cartesian coordinates, redundant internal coordinates, hybrid-delocalized internal coordinates, and also functions of more variables independent of atomic structures.
Abstract: Geometry optimization, including searching for transition states, accounts for most of the CPU time spent in quantum chemistry, computational surface science, and solid-state physics, and also plays an important role in simulations employing classical force fields. We have implemented a geometry optimizer, called DL-FIND, to be included in atomistic simulation codes. It can optimize structures in Cartesian coordinates, redundant internal coordinates, hybrid-delocalized internal coordinates, and also functions of more variables independent of atomic structures. The implementation of the optimization algorithms is independent of the coordinate transformation used. Steepest descent, conjugate gradient, quasi-Newton, and L-BFGS algorithms as well as damped molecular dynamics are available as minimization methods. The partitioned rational function optimization algorithm, a modified version of the dimer method and the nudged elastic band approach provide capabilities for transition-state search. Penalty function, gradient projection, and Lagrange-Newton methods are implemented for conical intersection optimizations. Various stochastic search methods, including a genetic algorithm, are available for global or local minimization and can be run as parallel algorithms. The code is released under the open-source GNU LGPL license. Some selected applications of DL-FIND are surveyed.

483 citations
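To see why the optimizers can be implemented independently of the coordinate transformation, consider a minimal steepest-descent loop in which an energy-and-gradient callback hides the coordinate system entirely. The `energy_grad` interface below is a hypothetical stand-in, not DL-FIND's actual API:

```python
import numpy as np

def minimize_sd(energy_grad, x0, step=0.05, gtol=1e-4, max_iter=500):
    """Minimal steepest-descent loop of the kind DL-FIND generalizes.

    energy_grad(x) -> (E, g) hides the coordinate system: the optimizer
    never needs to know whether x is Cartesian, internal, or neither."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        e, g = energy_grad(x)
        if np.linalg.norm(g) < gtol:
            break
        x = x - step * g          # follow the negative gradient
    return x, e

# Toy surface: a quadratic well standing in for a real potential.
f = lambda x: (x @ x, 2 * x)
xmin, emin = minimize_sd(f, np.array([1.0, -2.0, 0.5]))
```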


Journal ArticleDOI
TL;DR: Preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems and can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.
Abstract: We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. Both algorithms are combined into a hybrid algorithm that removes existing bottlenecks to GPU construction performance and scalability, leading to significantly decreased build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal cost are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.

414 citations
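The linear ordering behind the first algorithm comes from Morton codes, which interleave the bits of quantized 3D coordinates so that sorting primitives by code groups spatially nearby ones together. A sketch of the standard 30-bit construction (the GPU builder would evaluate this per primitive in parallel and then sort):

```python
def expand_bits(v):
    """Spread the lower 10 bits of v so there are two zero bits
    between each original bit (the standard Morton-code trick)."""
    v = (v * 0x00010001) & 0xFF0000FF
    v = (v * 0x00000101) & 0x0F00F00F
    v = (v * 0x00000011) & 0xC30C30C3
    v = (v * 0x00000005) & 0x49249249
    return v

def morton3d(x, y, z):
    """30-bit Morton code for a point with coordinates in [0, 1).
    Sorting primitives by this key yields the linear ordering used
    for fast hierarchy construction."""
    xi = min(max(int(x * 1024), 0), 1023)
    yi = min(max(int(y * 1024), 0), 1023)
    zi = min(max(int(z * 1024), 0), 1023)
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi)
```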


Journal ArticleDOI
01 Jan 2009
TL;DR: This paper proposes an ant colony optimization (ACO) algorithm to schedule large-scale workflows with various QoS parameters, designs seven new heuristics for the ACO approach, and proposes an adaptive scheme that allows artificial ants to select heuristics based on pheromone values.
Abstract: Grid computing is increasingly considered as a promising next-generation computational platform that supports wide-area parallel and distributed computing. In grid environments, applications are always regarded as workflows. The problem of scheduling workflows in terms of certain quality of service (QoS) requirements is challenging and it significantly influences the performance of grids. By now, there have been some algorithms for grid workflow scheduling, but most of them can only tackle the problems with a single QoS parameter or with small-scale workflows. This paper therefore proposes an ant colony optimization (ACO) algorithm to schedule large-scale workflows with various QoS parameters. This algorithm enables users to specify their QoS preferences as well as define the minimum QoS thresholds for a certain application. The objective of this algorithm is to find a solution that meets all QoS constraints and optimizes the user-preferred QoS parameter. Based on the characteristics of workflow scheduling, we design seven new heuristics for the ACO approach and propose an adaptive scheme that allows artificial ants to select heuristics based on pheromone values. Experiments are conducted on ten workflow applications with up to 120 tasks, and the results demonstrate the effectiveness of the proposed algorithm.

355 citations
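A sketch of what pheromone-guided heuristic selection can look like, with roulette-wheel selection and a simple evaporate-and-deposit update. The function names and update rule are illustrative assumptions, not the paper's exact scheme:

```python
import random

def select_heuristic(pheromone):
    """Roulette-wheel choice of one heuristic, biased by its pheromone
    value: a sketch of the adaptive scheme described above."""
    total = sum(pheromone)
    r = random.uniform(0.0, total)
    acc = 0.0
    for i, tau in enumerate(pheromone):
        acc += tau
        if r <= acc:
            return i
    return len(pheromone) - 1

def reinforce(pheromone, used, quality, rho=0.1):
    """Evaporate all pheromone, then deposit on the heuristic that
    produced a good schedule."""
    for i in range(len(pheromone)):
        pheromone[i] *= (1.0 - rho)
    pheromone[used] += quality

pheromone = [1.0] * 7            # one value per heuristic (seven in the paper)
h = select_heuristic(pheromone)
reinforce(pheromone, h, quality=0.8)
```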


Journal ArticleDOI
25 Oct 2009
TL;DR: It is demonstrated that a practical type and effect system can simplify parallel programming by guaranteeing deterministic semantics with modular, compile-time type checking even in a rich, concurrent object-oriented language such as Java.
Abstract: Today's shared-memory parallel programming models are complex and error-prone. While many parallel programs are intended to be deterministic, unanticipated thread interleavings can lead to subtle bugs and nondeterministic semantics. In this paper, we demonstrate that a practical type and effect system can simplify parallel programming by guaranteeing deterministic semantics with modular, compile-time type checking even in a rich, concurrent object-oriented language such as Java. We describe an object-oriented type and effect system that provides several new capabilities over previous systems for expressing deterministic parallel algorithms. We also describe a language called Deterministic Parallel Java (DPJ) that incorporates the new type system features, and we show that a core subset of DPJ is sound. We describe an experimental validation showing that DPJ can express a wide range of realistic parallel programs; that the new type system features are useful for such programs; and that the parallel programs exhibit good performance gains (coming close to or beating equivalent, nondeterministic multithreaded programs where those are available).

318 citations
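DPJ's guarantee is enforced statically by a Java type and effect system, which a short dynamic sketch cannot reproduce; the Python fragment below only illustrates the property the types certify, namely that tasks writing disjoint regions commute and therefore run deterministically:

```python
from concurrent.futures import ThreadPoolExecutor

def scale_region(data, lo, hi, factor):
    # Writes only to data[lo:hi]; in DPJ this "effect" would be
    # declared in the method signature and checked at compile time.
    for i in range(lo, hi):
        data[i] *= factor

data = list(range(16))
mid = len(data) // 2
with ThreadPoolExecutor() as pool:
    # The two tasks touch disjoint regions, so every interleaving gives
    # the same result: the determinism DPJ guarantees statically.
    pool.submit(scale_region, data, 0, mid, 2)
    pool.submit(scale_region, data, mid, len(data), 3)
```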


Journal ArticleDOI
TL;DR: In this article, the authors present an in-memory relational query coprocessing system, GDB, on the GPU, built on a set of highly optimized data-parallel primitives such as split and sort that are used to implement common relational query processing algorithms.
Abstract: Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multithreaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, a careful design and implementation is required to utilize the GPU for coprocessing database queries. In this article, we present our design, implementation, and evaluation of an in-memory relational query coprocessing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose coprocessing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors—the CPU, the GPU, or both—for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2--27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our coprocessing scheme is similar to, or better than, both the GPU-only and the CPU-only schemes.

258 citations
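One of the primitives named above, split, scatters tuples into partition-contiguous order using a histogram and an exclusive prefix sum, which avoids locks because every element's destination is known before any write happens. A serial numpy sketch of the idea (the GPU version parallelizes both phases):

```python
import numpy as np

def split(keys, values, num_partitions):
    """Scatter (key, value) pairs into partition-contiguous order via a
    histogram plus exclusive prefix sum. Sequential sketch; a GPU split
    computes the histogram and the scatter in parallel."""
    counts = np.bincount(keys, minlength=num_partitions)
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive scan
    out = np.empty_like(values)
    cursor = offsets.copy()
    for k, v in zip(keys, values):
        out[cursor[k]] = v          # destination known up front: no locks
        cursor[k] += 1
    return out, offsets

keys = np.array([2, 0, 1, 2, 0, 1, 1])
vals = np.array([10, 11, 12, 13, 14, 15, 16])
partitioned, offsets = split(keys, vals, 3)   # vals grouped by key
```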


Proceedings ArticleDOI
07 Jul 2009
TL;DR: A massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms, is presented; it uses low precision data and further increases the effective memory bandwidth by packing multiple words into every memory operation.
Abstract: We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and leverage the algorithm’s simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1GB. The coprocessor prototype can process at the rate of 3.4 billion multiply accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.

254 citations
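The word-packing trick is simple to state: with 8-bit operands, four values travel in each 32-bit memory word, quadrupling effective bandwidth. An illustrative sketch (the coprocessor's actual precision and layout may differ):

```python
def pack_u8x4(a, b, c, d):
    """Pack four 8-bit operands into one 32-bit word so a single memory
    transaction carries four values (illustration of the bandwidth trick)."""
    return (a & 0xFF) | ((b & 0xFF) << 8) | ((c & 0xFF) << 16) | ((d & 0xFF) << 24)

def unpack_u8x4(w):
    return (w & 0xFF), (w >> 8) & 0xFF, (w >> 16) & 0xFF, (w >> 24) & 0xFF

word = pack_u8x4(17, 42, 0, 255)
assert unpack_u8x4(word) == (17, 42, 0, 255)
```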


Journal ArticleDOI
TL;DR: A careful adaptation of the algorithm-based fault tolerance (ABFT) technique to the needs of parallel distributed computation yields a strongly scalable fault-tolerance mechanism that can also detect and correct errors on the fly during a computation.

215 citations
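Classic algorithm-based fault tolerance augments the operands of a matrix product with checksum rows and columns that the multiplication preserves, so errors can be detected, and located, by re-summing the result. A minimal sketch of the detection half (the paper's contribution is adapting this idea to parallel distributed computation):

```python
import numpy as np

def abft_matmul(A, B, tol=1e-8):
    """ABFT for C = A @ B: carry a checksum row of A and a checksum
    column of B through the product, then compare against freshly
    computed sums to detect a corrupted entry."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum matrix
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum matrix
    Cf = Ac @ Br                                        # full checksum result
    C = Cf[:-1, :-1]
    row_ok = np.allclose(Cf[:-1, -1], C.sum(axis=1), atol=tol)
    col_ok = np.allclose(Cf[-1, :-1], C.sum(axis=0), atol=tol)
    return C, row_ok and col_ok

A = np.random.default_rng(0).random((4, 3))
B = np.random.default_rng(1).random((3, 5))
C, ok = abft_matmul(A, B)    # ok becomes False if a fault corrupts an entry
```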


Proceedings ArticleDOI
08 Jun 2009
TL;DR: EpiFast runs extremely fast for realistic simulations that involve large populations of millions of individuals with heterogeneous details, dynamic interactions between disease propagation, individual behaviors, and exogenous interventions, and the large number of replicated runs necessary for statistically sound estimates of the stochastic epidemic evolution.
Abstract: Large scale realistic epidemic simulations have recently become an increasingly important application of high-performance computing. We propose a parallel algorithm, EpiFast, based on a novel interpretation of the stochastic disease propagation in a contact network. We implement it using a master-slave computation model which allows scalability on distributed memory systems. EpiFast runs extremely fast for realistic simulations that involve: (i) large populations consisting of millions of individuals and their heterogeneous details, (ii) dynamic interactions between the disease propagation, the individual behaviors, and the exogenous interventions, as well as (iii) large number of replicated runs necessary for statistically sound estimates about the stochastic epidemic evolution. We find that EpiFast runs several orders of magnitude faster than another comparable simulation tool while delivering similar results. EpiFast has been tested on commodity clusters as well as SGI shared memory machines. For a fixed experiment, if given more computing resources, it scales automatically and runs faster. Finally, EpiFast has been used as the major simulation engine in real studies with rather sophisticated settings to evaluate various dynamic interventions and to provide decision support for public health policy makers.

205 citations
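A toy serial kernel of the kind EpiFast distributes across workers: one step of stochastic propagation over a contact network, with per-contact Bernoulli transmission. The parameters and the one-day infectious period are illustrative simplifications, not the paper's model:

```python
import random

def step(contacts, infected, recovered, p_transmit=0.05, rng=random.Random(0)):
    """One day of stochastic disease propagation on a contact network:
    each infectious-susceptible contact transmits independently with
    probability p_transmit."""
    newly = set()
    for u in infected:
        for v in contacts.get(u, ()):
            if v not in infected and v not in recovered:
                if rng.random() < p_transmit:
                    newly.add(v)
    recovered |= infected          # one-day infectious period, for brevity
    return newly, recovered

contacts = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
infected, recovered = {0}, set()
for day in range(10):
    infected, recovered = step(contacts, infected, recovered)
```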


Journal ArticleDOI
TL;DR: This work proposes a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function, and shows that the performance of Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm.
Abstract: We present Glimmer, a new multilevel algorithm for multidimensional scaling designed to exploit modern graphics processing unit (GPU) hardware. We also present GPU-SF, a parallel, force-based subsystem used by Glimmer. Glimmer organizes input into a hierarchy of levels and recursively applies GPU-SF to combine and refine the levels. The multilevel nature of the algorithm makes local minima less likely while the GPU parallelism improves speed of computation. We propose a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function. We demonstrate the benefits of Glimmer in terms of speed, normalized stress, and visual quality against several previous algorithms for a range of synthetic and real benchmark datasets. We also show that the performance of Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm.

195 citations
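A sketch of the two ingredients named in the TL;DR, using the common definition of normalized stress and a moving average as a simple stand-in for the paper's filter:

```python
import numpy as np

def normalized_stress(D_high, D_low):
    """Normalized stress between input distances and embedded distances;
    Glimmer's termination test uses a filtered (smoothed) version of this
    quantity rather than the raw value."""
    diff = D_high - D_low
    return np.sum(diff ** 2) / np.sum(D_high ** 2)

def converged(history, window=5, eps=1e-4):
    """Stop when a moving average of the stress changes by less than eps
    (a simple stand-in for the paper's low-pass filter)."""
    if len(history) < 2 * window:
        return False
    prev = np.mean(history[-2 * window:-window])
    curr = np.mean(history[-window:])
    return abs(prev - curr) < eps
```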


Journal ArticleDOI
TL;DR: In this paper, the authors present theoretical and experimental research on coherent beam combining of fiber amplifiers using the stochastic parallel gradient descent (SPGD) algorithm and analytically demonstrate the feasibility of the approach.
Abstract: We present theoretical and experimental research on coherent beam combining of fiber amplifiers using the stochastic parallel gradient descent (SPGD) algorithm. The feasibility of coherent beam combining using the SPGD algorithm is detailed analytically. Numerical simulation is carried out to explore the scaling potential to higher numbers of laser beams. Experimental investigation of coherent beam combining of two and three W-level fiber amplifiers is demonstrated. Several application fields of coherent beam combining with the SPGD algorithm, i.e., atmospheric distortion compensation, beam steering, and beam shaping, are proposed.
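SPGD itself is compact: perturb all control phases at once with a random plus/minus delta pattern, measure the combining metric twice, and step along the estimated gradient. A generic sketch with a toy combining-efficiency metric; the gain and delta values are arbitrary assumptions:

```python
import numpy as np

def spgd(measure, phases, gain=0.5, delta=0.1, iters=2000,
         rng=np.random.default_rng(0)):
    """Stochastic parallel gradient descent: perturb every phase
    simultaneously, measure the metric twice, and ascend the estimated
    gradient. Generic SPGD; the paper drives fiber-amplifier phases."""
    for _ in range(iters):
        du = delta * rng.choice([-1.0, 1.0], size=phases.shape)
        dJ = measure(phases + du) - measure(phases - du)
        phases = phases + gain * dJ * du
    return phases

# Toy metric: combining efficiency of N beams with phase errors.
def metric(ph):
    return np.abs(np.exp(1j * ph).sum()) ** 2 / ph.size ** 2

phases = spgd(metric, np.random.default_rng(1).uniform(0, 2 * np.pi, 3))
```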

Proceedings ArticleDOI
30 Nov 2009
TL;DR: This paper describes the algorithm design and implementation of GAs on Hadoop, an open source implementation of MapReduce, and demonstrates the convergence and scalability up to 10^5 variable problems.
Abstract: Genetic algorithms (GAs) are increasingly being applied to large scale problems. The traditional MPI-based parallel GAs require detailed knowledge about machine architecture. On the other hand, MapReduce is a powerful abstraction proposed by Google for making scalable and fault tolerant applications. In this paper, we show how genetic algorithms can be modeled into the MapReduce model. We describe the algorithm design and implementation of GAs on Hadoop, an open source implementation of MapReduce. Our experiments demonstrate the convergence and scalability up to 10^5-variable problems. Adding more resources would enable us to solve even larger problems without any changes in the algorithms and implementation since we do not introduce any performance bottlenecks.
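The mapping of a GA onto MapReduce is natural because fitness evaluation is embarrassingly parallel. A toy sketch of the two phases on a OneMax problem; the selection and mutation details are illustrative, not the paper's exact design:

```python
import random

def map_phase(individual):
    """Map: evaluate fitness independently, which is the embarrassingly
    parallel part MapReduce distributes (OneMax toy fitness)."""
    return sum(individual), individual

def reduce_phase(scored, rng=random.Random(0)):
    """Reduce: tournament selection, one-point crossover, and bit-flip
    mutation over the gathered (fitness, individual) pairs. A serial
    stand-in for what runs across Hadoop reducers."""
    def pick():
        a, b = rng.sample(scored, 2)
        return max(a, b)[1]
    nxt = []
    for _ in range(len(scored)):
        p, q = pick(), pick()
        cut = rng.randrange(1, len(p))
        child = p[:cut] + q[cut:]
        i = rng.randrange(len(child))
        child[i] ^= rng.random() < 0.05     # occasional bit-flip mutation
        nxt.append(child)
    return nxt

pop = [[random.randint(0, 1) for _ in range(32)] for _ in range(50)]
for gen in range(20):
    pop = reduce_phase([map_phase(ind) for ind in pop])
```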

Proceedings ArticleDOI
23 May 2009
TL;DR: A new lock-free parallel algorithm for computing betweenness centrality of massive complex networks that achieves better spatial locality compared with previous approaches is presented, and the applicability of this implementation to analyze massive real-world datasets is demonstrated.
Abstract: We present a new lock-free parallel algorithm for computing betweenness centrality of massive complex networks that achieves better spatial locality compared with previous approaches. Betweenness centrality is a key kernel in analyzing the importance of vertices (or edges) in applications ranging from social networks, to power grids, to the influence of jazz musicians, and is also incorporated into the DARPA HPCS SSCA#2 benchmark, which is extensively used to evaluate the performance of emerging high-performance computing architectures for graph analytics. We design an optimized implementation of betweenness centrality for the massively multithreaded Cray XMT system with the Threadstorm processor. For a small-world network of 268 million vertices and 2.147 billion edges, the 16-processor XMT system achieves a TEPS rate (an algorithmic performance count for the number of edges traversed per second) of 160 million per second, which corresponds to more than a 2× performance improvement over the previous parallel implementation. We demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for the large IMDb movie-actor network.
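For reference, the serial kernel being parallelized here is Brandes' algorithm: one BFS per source followed by a backward dependency-accumulation pass. A compact sketch for unweighted graphs (the paper's contribution is making the per-source work lock-free and cache-friendly):

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm: BFS plus dependency accumulation per source."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        sigma = {v: 0 for v in graph}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in graph}; dist[s] = 0
        preds = {v: [] for v in graph}
        order, q = [], deque([s])
        while q:                                      # BFS phase
            v = q.popleft(); order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):                     # accumulation phase
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(betweenness(g))
```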

Journal Article
TL;DR: In this paper, a lock-free parallel algorithm for computing betweenness centrality of massive small-world networks is presented, which achieves TEPS scores of 160 million and 90 million respectively.
Abstract: We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.

Journal ArticleDOI
TL;DR: This paper proposes a parallel, scalable, and memory-efficient MCE algorithm for distributed and/or shared memory high performance computing architectures, whose runtime scales linearly for thousands of processors on real-world application graphs with hundreds of thousands of nodes.

Journal ArticleDOI
TL;DR: A meta-heuristic method based on an evolutionary algorithm that combines classical multi-objective operators with an elitist diversification mechanism, used in cooperation with classical diversification methodologies to improve its efficiency.

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper proposes a novel community detection algorithm, which utilizes a dynamic process by contradicting the network topology and the topology-based propinquity, where the propinquity is a measure of the probability that a pair of nodes is involved in a coherent community structure.
Abstract: Graphs or networks can be used to model complex systems. Detecting community structures from large network data is a classic and challenging task. In this paper, we propose a novel community detection algorithm, which utilizes a dynamic process by contradicting the network topology and the topology-based propinquity, where the propinquity is a measure of the probability for a pair of nodes involved in a coherent community structure. Through several rounds of mutual reinforcement between topology and propinquity, the community structures are expected to naturally emerge. The overlapping vertices shared between communities can also be easily identified by an additional simple postprocessing step. To achieve better efficiency, the propinquity is incrementally calculated. We implement the algorithm on a vertex-oriented bulk synchronous parallel (BSP) model so that the mining load can be distributed on thousands of machines. We obtained interesting experimental results on several real network data sets.
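In its simplest form, topology-based propinquity between two vertices can be read off from shared neighborhood structure; the paper's full measure also counts conjugate edges and is maintained incrementally under the BSP model, but the core idea is roughly:

```python
def propinquity(adj, u, v):
    """Simplest form of topology-based propinquity: the number of common
    neighbors of u and v, plus 1 if they are directly linked. A sketch of
    the core idea only, not the paper's full measure."""
    common = len(adj[u] & adj[v])
    return common + (1 if v in adj[u] else 0)

adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(propinquity(adj, 1, 3))   # 2 common neighbors, no direct edge -> 2
```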

Journal ArticleDOI
TL;DR: On the 41 test instances obtained from QAPLIB, CPTS is shown to meet or exceed the average solution quality of many of the best sequential and parallel approaches from the literature on all but six problems, a record no other leading method surpasses.

Proceedings ArticleDOI
16 Oct 2009
TL;DR: SJMR (Spatial Join with MapReduce), a novel parallel algorithm, is presented to relieve the problem of processing heterogeneous related data sets, which is common in operations like spatial joins.
Abstract: MapReduce is a widely used parallel programming model and computing platform. With MapReduce, it is very easy to develop scalable parallel programs to process data-intensive applications on clusters of commodity machines. However, it does not directly support processing of heterogeneous related data sets, which is common in operations like spatial joins. This paper presents SJMR (Spatial Join with MapReduce), a novel parallel algorithm to relieve the problem. The strategies include a strip-based plane sweeping algorithm, a tile-based spatial partitioning function, and a duplication avoidance technique. We evaluated the performance of the SJMR algorithm in various situations with real world data sets. It demonstrates the applicability of compute-intensive spatial applications with MapReduce on small-scale clusters.
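A tile-based partitioning function maps each spatial object to every tile it overlaps, so objects straddling tile boundaries are replicated across map keys, which is exactly why the duplication-avoidance step is needed. An illustrative sketch (the grid resolution and formula are assumptions, not the paper's exact function):

```python
def tile_partition(rect, grid, extent):
    """Assign a rectangle (xmin, ymin, xmax, ymax) to every tile of a
    grid x grid decomposition of `extent` that it overlaps. Objects that
    span tile boundaries are emitted under several map keys."""
    ex_min, ey_min, ex_max, ey_max = extent
    tw = (ex_max - ex_min) / grid
    th = (ey_max - ey_min) / grid
    x0 = int((rect[0] - ex_min) / tw); x1 = int((rect[2] - ex_min) / tw)
    y0 = int((rect[1] - ey_min) / th); y1 = int((rect[3] - ey_min) / th)
    clamp = lambda i: max(0, min(grid - 1, i))
    return [(tx, ty) for tx in range(clamp(x0), clamp(x1) + 1)
                     for ty in range(clamp(y0), clamp(y1) + 1)]

# A rectangle straddling a tile boundary lands in two tiles (map keys).
print(tile_partition((0.4, 0.1, 0.6, 0.2), grid=2, extent=(0, 0, 1, 1)))
```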

Journal ArticleDOI
TL;DR: A parallel splitting method is proposed for solving systems of coupled monotone inclusions in Hilbert spaces, and its convergence is established under the assumption that solutions exist.
Abstract: A parallel splitting method is proposed for solving systems of coupled monotone inclusions in Hilbert spaces, and its convergence is established under the assumption that solutions exist. Unlike existing alternating algorithms, which are limited to two variables and linear coupling, our parallel method can handle an arbitrary number of variables as well as nonlinear coupling schemes. The breadth and flexibility of the proposed framework is illustrated through applications in the areas of evolution inclusions, variational problems, best approximation, and network flows.
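For concreteness, coupled systems of monotone inclusions of this kind are commonly written in the following generic form; the notation here is ours for illustration, and the paper states the precise assumptions on the operators:

```latex
\text{find}\;\; \bar{x}_1 \in \mathcal{H}_1,\, \ldots,\, \bar{x}_m \in \mathcal{H}_m
\quad\text{such that}\quad
0 \in A_i \bar{x}_i + B_i(\bar{x}_1, \ldots, \bar{x}_m),
\qquad i = 1, \ldots, m,
```

where each A_i is a monotone operator on a Hilbert space H_i and the operators B_i supply the, possibly nonlinear, coupling among the m variables; alternating methods handle only m = 2 with linear coupling.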

Journal ArticleDOI
TL;DR: The improved efficiency on scattering problems discretized with millions of unknowns is demonstrated and the effectiveness of the algorithm is presented by solving very large scattering problems involving a conducting sphere of radius 210 wavelengths and a complicated real-life target with a maximum dimension of 880 wavelengths.
Abstract: We present a novel hierarchical partitioning strategy for the efficient parallelization of the multilevel fast multipole algorithm (MLFMA) on distributed-memory architectures to solve large-scale problems in electromagnetics. Unlike previous parallelization techniques, the tree structure of MLFMA is distributed among processors by partitioning both clusters and samples of fields at each level. Due to the improved load-balancing, the hierarchical strategy offers a higher parallelization efficiency than previous approaches, especially when the number of processors is large. We demonstrate the improved efficiency on scattering problems discretized with millions of unknowns. In addition, we present the effectiveness of our algorithm by solving very large scattering problems involving a conducting sphere of radius 210 wavelengths and a complicated real-life target with a maximum dimension of 880 wavelengths. Both of the objects are discretized with more than 200 million unknowns.

Proceedings ArticleDOI
01 Aug 2009
TL;DR: In this article, a stream compaction algorithm for wide SIMD many-core architectures is presented that is designed to maximize concurrent execution with minimal use of synchronization; it achieves a 3x speedup over previously published algorithms.
Abstract: Stream compaction is a common parallel primitive used to remove unwanted elements in sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3x speedup over previously published algorithms.
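Stream compaction reduces to an exclusive prefix sum over the keep flags: the scan result is exactly the output offset of each surviving element. A serial sketch of the primitive (the paper's contribution is realizing it on wide SIMD hardware with minimal synchronization):

```python
import numpy as np

def compact(values, keep):
    """Stream compaction via an exclusive prefix sum over the keep flags:
    each surviving element scatters to the offset the scan assigns it."""
    flags = keep.astype(np.int64)
    offsets = np.concatenate(([0], np.cumsum(flags)[:-1]))  # exclusive scan
    out = np.empty(flags.sum(), dtype=values.dtype)
    out[offsets[keep]] = values[keep]
    return out

vals = np.array([7, 0, 3, 0, 9, 4])
print(compact(vals, vals != 0))    # [7 3 9 4]
```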

Journal ArticleDOI
TL;DR: In this article, a novel parallel quantum genetic algorithm (NPQGA) is proposed for the stochastic job shop scheduling problem with the objective of minimizing the expected makespan, where the processing times follow independent normal distributions.

Proceedings ArticleDOI
01 Aug 2009
TL;DR: A fast, parallel GPU algorithm for construction of uniform grids for ray tracing is presented that takes full advantage of the parallel architecture of the GPU and constructs grids faster than CPU algorithms running on multiple cores.
Abstract: We present a fast, parallel GPU algorithm for construction of uniform grids for ray tracing, which we implement in CUDA. The algorithm performance does not depend on the primitive distribution, because we reduce the problem to sorting pairs of primitives and cell indices. Our implementation is able to take full advantage of the parallel architecture of the GPU, and its construction speed is faster than that of CPU algorithms running on multiple cores. Its scalability and robustness make it superior to alternative approaches, especially for scenes with complex primitive distributions.
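The reduction to sorting works by emitting one (cell, primitive) pair per overlap, sorting by cell id, and recording each cell's contiguous range, so build time no longer depends on how the primitives cluster. A numpy sketch with the sort standing in for a GPU radix sort:

```python
import numpy as np

def build_grid(cell_ids):
    """Given one cell id per (cell, primitive) overlap, sort by cell id
    and record where each cell's run of primitives starts and ends."""
    order = np.argsort(cell_ids, kind="stable")     # primitive indices, grouped by cell
    sorted_cells = cell_ids[order]
    cells = np.arange(sorted_cells.max() + 1)
    starts = np.searchsorted(sorted_cells, cells, "left")
    ends = np.searchsorted(sorted_cells, cells, "right")
    return order, starts, ends

cell_ids = np.array([3, 0, 3, 1, 0])   # cell overlapped by each primitive
order, starts, ends = build_grid(cell_ids)
# Primitives in cell 3: order[starts[3]:ends[3]]  ->  [0, 2]
```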

Journal ArticleDOI
01 Aug 2009
TL;DR: A novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost is presented.
Abstract: In parallel adaptive applications, the computational structure of the applications changes over time, leading to load imbalances even though the initial load distributions were balanced. To restore balance and to keep communication volume low in further iterations of the applications, dynamic load balancing (repartitioning) of the changed computational structure is required. Repartitioning differs from static load balancing (partitioning) due to the additional requirement of minimizing migration cost to move data from an existing partition to a new partition. In this paper, we present a novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost. The use of a hypergraph-based model allows us to accurately model communication costs rather than approximate them with graph-based models. We show that the new model can be realized using hypergraph partitioning with fixed vertices and describe our parallel multilevel implementation within the Zoltan load balancing toolkit. To the best of our knowledge, this is the first implementation for dynamic load balancing based on hypergraph partitioning. To demonstrate the effectiveness of our approach, we conducted experiments on a Linux cluster with 1024 processors. The results show that, in terms of reducing total cost, our new model compares favorably to the graph-based dynamic load balancing approaches, and multilevel approaches improve the repartitioning quality significantly.

Journal ArticleDOI
TL;DR: Besides proving that BUBBLE-FOS/C converges towards a local optimum, this paper develops a much faster method for the improvement of partitionings, based on a different diffusive process, which is restricted to local areas of the graph and also contains a high degree of parallelism.

Book
23 Jul 2009
TL;DR: Parallel MATLAB for Multicore and Multinode Computers covers more parallel algorithms and parallel programming models than any other parallel programming book due to the succinctness of MATLAB.
Abstract: This is the first book on parallel MATLAB and the first parallel computing book focused on the design, code, debug, and test techniques required to quickly produce well-performing parallel programs. MATLAB is currently the dominant language of technical computing with one million users worldwide, many of whom can benefit from the increased power offered by inexpensive multicore and multinode parallel computers. MATLAB is an ideal environment for learning about parallel computing, allowing the user to focus on parallel algorithms instead of the details of implementation. Parallel MATLAB for Multicore and Multinode Computers covers more parallel algorithms and parallel programming models than any other parallel programming book due to the succinctness of MATLAB. It presents a hands-on approach with numerous example programs; wherever possible, the examples are drawn from widely known and well-documented parallel benchmark codes that are representative of many real applications across the field of technical computing. Audience: Intended for professional scientists and engineers, as well as undergraduate or graduate students, who use MATLAB. It is suitable as either the primary book in a parallel computing class or as a supplementary text in a numerical computing class or a computer science algorithms class. Contents: List of Figures; List of Tables; List of Algorithms; Preface; Acknowledgments; Part I: Fundamentals: Chapter 1: Primer: Notation and Interfaces; Chapter 2: Introduction to pMatlab; Chapter 3: Interacting with Distributed Arrays; Part II: Advanced Techniques: Chapter 4: Parallel Programming Models; Chapter 5: Advanced Distributed Array Programming; Chapter 6: Performance Metrics and Software Architecture; Part III: Case Studies: Chapter 7: Parallel Application Analysis; Chapter 8: Stream; Chapter 9: RandomAccess; Chapter 10: Fast Fourier Transform; Chapter 11: High Performance Linpack; Appendix: Notation for Hierarchical Parallel Multicore Algorithms; Index

Book
01 Jan 2009
TL;DR: This proceedings volume spans topics from multicore programming challenges and peer-to-peer computing to a MapReduce programming model for .NET-based cloud computing.
Abstract: Invited Talks.- Multicore Programming Challenges.- Ibis: A Programming System for Real-World Distributed Computing.- What Is in a Namespace?.- Topic 1: Support Tools and Environments.- Atune-IL: An Instrumentation Language for Auto-tuning Parallel Applications.- Assigning Blame: Mapping Performance to High Level Parallel Programming Abstractions.- A Holistic Approach towards Automated Performance Analysis and Tuning.- Pattern Matching and I/O Replay for POSIX I/O in Parallel Programs.- An Extensible I/O Performance Analysis Framework for Distributed Environments.- Grouping MPI Processes for Partial Checkpoint and Co-migration.- Process Mapping for MPI Collective Communications.- Topic 2: Performance Prediction and Evaluation.- Stochastic Analysis of Hierarchical Publish/Subscribe Systems.- Characterizing and Understanding the Bandwidth Behavior of Workloads on Multi-core Processors.- Hybrid Techniques for Fast Multicore Simulation.- PSINS: An Open Source Event Tracer and Execution Simulator for MPI Applications.- A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors.- Topic 3: Scheduling and Load Balancing.- Dynamic Load Balancing of Matrix-Vector Multiplications on Roadrunner Compute Nodes.- A Unified Framework for Load Distribution and Fault-Tolerance of Application Servers.- On the Feasibility of Dynamically Scheduling DAG Applications on Shared Heterogeneous Systems.- Steady-State for Batches of Identical Task Trees.- A Buffer Space Optimal Solution for Re-establishing the Packet Order in a MPSoC Network Processor.- Using Multicast Transfers in the Replica Migration Problem: Formulation and Scheduling Heuristics.- A New Genetic Algorithm for Scheduling for Large Communication Delays.- Comparison of Access Policies for Replica Placement in Tree Networks.- Scheduling Recurrent Precedence-Constrained Task Graphs on a Symmetric Shared-Memory Multiprocessor.- Energy-Aware Scheduling of Flow Applications on Master-Worker Platforms.- Topic 4: High Performance Architectures and Compilers.- Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs.- Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors.- REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs.- Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation.- Topic 5: Parallel and Distributed Databases.- Unifying Memory and Database Transactions.- A DHT Key-Value Storage System with Carrier Grade Performance.- Selective Replicated Declustering for Arbitrary Queries.- Topic 6: Grid, Cluster, and Cloud Computing.- POGGI: Puzzle-Based Online Games on Grid Infrastructures.- Enabling High Data Throughput in Desktop Grids through Decentralized Data and Metadata Management: The BlobSeer Approach.- MapReduce Programming Model for .NET-Based Cloud Computing.- The Architecture of the XtreemOS Grid Checkpointing Service.- Scalable Transactions for Web Applications in the Cloud.- Provider-Independent Use of the Cloud.- MPI Applications on Grids: A Topology Aware Approach.- Topic 7: Peer-to-Peer Computing.- A Least-Resistance Path in Reasoning about Unstructured Overlay Networks.- SiMPSON: Efficient Similarity Search in Metric Spaces over P2P Structured Overlay Networks.- Uniform Sampling for Directed P2P Networks.- Adaptive Peer Sampling with Newscast.- Exploring the Feasibility of Reputation Models for Improving P2P Routing under Churn.- Selfish Neighbor Selection in Peer-to-Peer Backup and Storage Applications.- Zero-Day Reconciliation of BitTorrent Users with Their ISPs.- Surfing Peer-to-Peer IPTV: Distributed Channel Switching.- Topic 8: Distributed Systems and Algorithms.- Distributed Individual-Based Simulation.- A Self-stabilizing K-Clustering Algorithm Using an Arbitrary Metric.- Active Optimistic Message Logging for Reliable Execution of MPI Applications.- Topic 9: Parallel and Distributed Programming.- A Parallel Numerical Library for UPC.- A Multilevel Parallelization Framework for High-Order Stencil Computations.- Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores.- Parallel Skeletons for Variable-Length Lists in SkeTo Skeleton Library.- Stkm on Sca: A Unified Framework with Components, Workflows and Algorithmic Skeletons.- Grid-Enabling SPMD Applications through Hierarchical Partitioning and a Component-Based Runtime.- Reducing Rollbacks of Transactional Memory Using Ordered Shared Locks.- Topic 10: Parallel Numerical Algorithms.- Wavelet-Based Adaptive Solvers on Multi-core Architectures for the Simulation of Complex Systems.- Localized Parallel Algorithm for Bubble Coalescence in Free Surface Lattice-Boltzmann Method.- Fast Implicit Simulation of Oscillatory Flow in Human Abdominal Bifurcation Using a Schur Complement Preconditioner.- A Parallel Rigid Body Dynamics Algorithm.- Optimized Stencil Computation Using In-Place Calculation on Modern Multicore Systems.- Parallel Implementation of Runge-Kutta Integrators with Low Storage Requirements.- PSPIKE: A Parallel Hybrid Sparse Linear System Solver.- Out-of-Core Computation of the QR Factorization on Multi-core Processors.- Adaptive Parallel Householder Bidiagonalization.- Topic 11: Multicore and Manycore Programming.- Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor.- An Extension of the StarSs Programming Model for Platforms with Multiple GPUs.- StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.- XJava: Exploiting Parallelism with Object-Oriented Stream Programming.- JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA.- Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades.- Searching for Concurrent Design Patterns in Video Games.- Parallelization of a Video Segmentation Algorithm on CUDA-Enabled Graphics Processing Units.- A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform.- High Performance Matrix Multiplication on Many Cores.- Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm.- Efficient Parallel Implementation of Evolutionary Algorithms on GPGPU Cards.- Topic 12: Theory and Algorithms for Parallel Computation.- Implementing Parallel Google Map-Reduce in Eden.- A Lower Bound for Oblivious Dimensional Routing.- Topic 13: High-Performance Networks.- A Case Study of Communication Optimizations on 3D Mesh Interconnects.- Implementing a Change Assimilation Mechanism for Source Routing Interconnects.- Dependability Analysis of a Fault-Tolerant Network Reconfiguring Strategy.- RecTOR: A New and Efficient Method for Dynamic Network Reconfiguration.- NIC-Assisted Cache-Efficient Receive Stack for Message Passing over Ethernet.- A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks.- Hardware Implementation Study of the SCFQ-CA and DRR-CA Scheduling Algorithms.- Topic 14: Mobile and Ubiquitous Computing.- Optimal and Near-Optimal Energy-Efficient Broadcasting in Wireless Networks.

Journal ArticleDOI
Duksu Kim, Jae-Pil Heo, Jaehyuk Huh, John Kim, Sung-Eui Yoon
TL;DR: A novel task decomposition method is proposed that leads to a lock-free parallel algorithm in the main loop of the BVH-based collision detection to create a highly scalable algorithm that achieves more than an order of magnitude improvement in performance using four CPU-cores and two GPUs, compared to using a single CPU-core.
Abstract: We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi-core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self-collision detection. HPCCD takes advantage of hybrid multi-core architectures, using the general-purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock-free parallel algorithm in the main loop of our BVH-based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi-core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU-cores and two GPUs, compared to using a single CPU-core. This improvement results in an interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousands of triangles.