
Showing papers on "Parallel algorithm published in 2009"


Proceedings ArticleDOI
14 Jun 2009
TL;DR: It is argued that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods.
Abstract: The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples. In this paper, we suggest massively parallel methods to help resolve these problems. We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scaling up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, we are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, we develop a simple, inherently parallel algorithm that leads to a 5 to 15-fold speedup over previous methods.

711 citations
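The core of the speedup is that both models reduce to large dense matrix operations, which GPUs execute well. As a rough illustration (not the authors' code; biases are omitted and numpy stands in for GPU BLAS), one contrastive-divergence step for a single RBM layer of a DBN looks like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for an RBM layer.

    Everything reduces to large dense matrix products, which is why
    the training maps well onto GPU BLAS kernels."""
    h0 = sigmoid(v0 @ W)                                   # positive phase
    h_sample = (rng.random(h0.shape) < h0).astype(float)   # stochastic hidden states
    v1 = sigmoid(h_sample @ W.T)                           # reconstruction
    h1 = sigmoid(v1 @ W)                                   # negative phase
    return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

# Batched updates: the larger the batch, the better the GPU utilization.
W = np.random.default_rng(1).normal(0, 0.01, size=(784, 512))
batch = np.random.default_rng(2).random((256, 784))
W = cd1_update(W, batch)
```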


Journal ArticleDOI
TL;DR: A geometry optimizer, called DL-FIND, to be included in atomistic simulation codes, that can optimize structures in Cartesian coordinates, redundant internal coordinates, hybrid-delocalized internal coordinates, and also functions of more variables independent of atomic structures.
Abstract: Geometry optimization, including searching for transition states, accounts for most of the CPU time spent in quantum chemistry, computational surface science, and solid-state physics, and also plays an important role in simulations employing classical force fields. We have implemented a geometry optimizer, called DL-FIND, to be included in atomistic simulation codes. It can optimize structures in Cartesian coordinates, redundant internal coordinates, hybrid-delocalized internal coordinates, and also functions of more variables independent of atomic structures. The implementation of the optimization algorithms is independent of the coordinate transformation used. Steepest descent, conjugate gradient, quasi-Newton, and L-BFGS algorithms as well as damped molecular dynamics are available as minimization methods. The partitioned rational function optimization algorithm, a modified version of the dimer method and the nudged elastic band approach provide capabilities for transition-state search. Penalty function, gradient projection, and Lagrange-Newton methods are implemented for conical intersection optimizations. Various stochastic search methods, including a genetic algorithm, are available for global or local minimization and can be run as parallel algorithms. The code is released under the open-source GNU LGPL license. Some selected applications of DL-FIND are surveyed.

483 citations
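To see why the optimizers can be implemented independently of the coordinate transformation, consider a minimal steepest-descent loop in which an energy-and-gradient callback hides the coordinate system entirely. The `energy_grad` interface below is a hypothetical stand-in, not DL-FIND's actual API:

```python
import numpy as np

def minimize_sd(energy_grad, x0, step=0.05, gtol=1e-4, max_iter=500):
    """Minimal steepest-descent loop of the kind DL-FIND generalizes.

    energy_grad(x) -> (E, g) hides the coordinate system: the optimizer
    never needs to know whether x is Cartesian, internal, or neither."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        e, g = energy_grad(x)
        if np.linalg.norm(g) < gtol:
            break
        x = x - step * g          # follow the negative gradient
    return x, e

# Toy surface: a quadratic well standing in for a real potential.
f = lambda x: (x @ x, 2 * x)
xmin, emin = minimize_sd(f, np.array([1.0, -2.0, 0.5]))
```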


Journal ArticleDOI
TL;DR: Preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems and can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.
Abstract: We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. Both algorithms are combined into a hybrid algorithm that removes existing bottlenecks to GPU construction performance and scalability, leading to significantly decreased build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal cost are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.

414 citations
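The linear ordering behind the first algorithm comes from Morton codes, which interleave the bits of quantized 3D coordinates so that sorting primitives by code groups spatially nearby ones together. A sketch of the standard 30-bit construction (the GPU builder would evaluate this per primitive in parallel and then sort):

```python
def expand_bits(v):
    """Spread the lower 10 bits of v so there are two zero bits
    between each original bit (the standard Morton-code trick)."""
    v = (v * 0x00010001) & 0xFF0000FF
    v = (v * 0x00000101) & 0x0F00F00F
    v = (v * 0x00000011) & 0xC30C30C3
    v = (v * 0x00000005) & 0x49249249
    return v

def morton3d(x, y, z):
    """30-bit Morton code for a point with coordinates in [0, 1).
    Sorting primitives by this key yields the linear ordering used
    for fast hierarchy construction."""
    xi = min(max(int(x * 1024), 0), 1023)
    yi = min(max(int(y * 1024), 0), 1023)
    zi = min(max(int(z * 1024), 0), 1023)
    return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi)
```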


Journal ArticleDOI
01 Jan 2009
TL;DR: This paper proposes an ant colony optimization (ACO) algorithm to schedule large-scale workflows with various QoS parameters, designs seven new heuristics for the ACO approach, and proposes an adaptive scheme that allows artificial ants to select heuristics based on pheromone values.
Abstract: Grid computing is increasingly considered as a promising next-generation computational platform that supports wide-area parallel and distributed computing. In grid environments, applications are always regarded as workflows. The problem of scheduling workflows in terms of certain quality of service (QoS) requirements is challenging and it significantly influences the performance of grids. By now, there have been some algorithms for grid workflow scheduling, but most of them can only tackle the problems with a single QoS parameter or with small-scale workflows. This paper therefore proposes an ant colony optimization (ACO) algorithm to schedule large-scale workflows with various QoS parameters. This algorithm enables users to specify their QoS preferences as well as define the minimum QoS thresholds for a certain application. The objective of this algorithm is to find a solution that meets all QoS constraints and optimizes the user-preferred QoS parameter. Based on the characteristics of workflow scheduling, we design seven new heuristics for the ACO approach and propose an adaptive scheme that allows artificial ants to select heuristics based on pheromone values. Experiments are conducted on ten workflow applications with up to 120 tasks, and the results demonstrate the effectiveness of the proposed algorithm.

355 citations
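A sketch of what pheromone-guided heuristic selection can look like, with roulette-wheel selection and a simple evaporate-and-deposit update. The function names and update rule are illustrative assumptions, not the paper's exact scheme:

```python
import random

def select_heuristic(pheromone):
    """Roulette-wheel choice of one heuristic, biased by its pheromone
    value: a sketch of the adaptive scheme described above."""
    total = sum(pheromone)
    r = random.uniform(0.0, total)
    acc = 0.0
    for i, tau in enumerate(pheromone):
        acc += tau
        if r <= acc:
            return i
    return len(pheromone) - 1

def reinforce(pheromone, used, quality, rho=0.1):
    """Evaporate all pheromone, then deposit on the heuristic that
    produced a good schedule."""
    for i in range(len(pheromone)):
        pheromone[i] *= (1.0 - rho)
    pheromone[used] += quality

pheromone = [1.0] * 7            # one value per heuristic (seven in the paper)
h = select_heuristic(pheromone)
reinforce(pheromone, h, quality=0.8)
```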


Journal ArticleDOI
25 Oct 2009
TL;DR: It is demonstrated that a practical type and effect system can simplify parallel programming by guaranteeing deterministic semantics with modular, compile-time type checking even in a rich, concurrent object-oriented language such as Java.
Abstract: Today's shared-memory parallel programming models are complex and error-prone. While many parallel programs are intended to be deterministic, unanticipated thread interleavings can lead to subtle bugs and nondeterministic semantics. In this paper, we demonstrate that a practical type and effect system can simplify parallel programming by guaranteeing deterministic semantics with modular, compile-time type checking even in a rich, concurrent object-oriented language such as Java. We describe an object-oriented type and effect system that provides several new capabilities over previous systems for expressing deterministic parallel algorithms. We also describe a language called Deterministic Parallel Java (DPJ) that incorporates the new type system features, and we show that a core subset of DPJ is sound. We describe an experimental validation showing that DPJ can express a wide range of realistic parallel programs; that the new type system features are useful for such programs; and that the parallel programs exhibit good performance gains (coming close to or beating equivalent, nondeterministic multithreaded programs where those are available).

318 citations
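DPJ's guarantee is enforced statically by a Java type and effect system, which a short dynamic sketch cannot reproduce; the Python fragment below only illustrates the property the types certify, namely that tasks writing disjoint regions commute and therefore run deterministically:

```python
from concurrent.futures import ThreadPoolExecutor

def scale_region(data, lo, hi, factor):
    # Writes only to data[lo:hi]; in DPJ this "effect" would be
    # declared in the method signature and checked at compile time.
    for i in range(lo, hi):
        data[i] *= factor

data = list(range(16))
mid = len(data) // 2
with ThreadPoolExecutor() as pool:
    # The two tasks touch disjoint regions, so every interleaving gives
    # the same result: the determinism DPJ guarantees statically.
    pool.submit(scale_region, data, 0, mid, 2)
    pool.submit(scale_region, data, mid, len(data), 3)
```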


Journal ArticleDOI
TL;DR: In this article, the authors present an in-memory relational query coprocessing system, GDB, on the GPU, built on a set of highly optimized data-parallel primitives such as split and sort that are used to implement common relational query processing algorithms.
Abstract: Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multithreaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, a careful design and implementation is required to utilize the GPU for coprocessing database queries. In this article, we present our design, implementation, and evaluation of an in-memory relational query coprocessing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose coprocessing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors—the CPU, the GPU, or both—for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2--27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our coprocessing scheme is similar to, or better than, both the GPU-only and the CPU-only schemes.

258 citations
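One of the primitives named above, split, scatters tuples into partition-contiguous order using a histogram and an exclusive prefix sum, which avoids locks because every element's destination is known before any write happens. A serial numpy sketch of the idea (the GPU version parallelizes both phases):

```python
import numpy as np

def split(keys, values, num_partitions):
    """Scatter (key, value) pairs into partition-contiguous order via a
    histogram plus exclusive prefix sum. Sequential sketch; a GPU split
    computes the histogram and the scatter in parallel."""
    counts = np.bincount(keys, minlength=num_partitions)
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive scan
    out = np.empty_like(values)
    cursor = offsets.copy()
    for k, v in zip(keys, values):
        out[cursor[k]] = v          # destination known up front: no locks
        cursor[k] += 1
    return out, offsets

keys = np.array([2, 0, 1, 2, 0, 1, 1])
vals = np.array([10, 11, 12, 13, 14, 15, 16])
partitioned, offsets = split(keys, vals, 3)   # vals grouped by key
```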


Proceedings ArticleDOI
07 Jul 2009
TL;DR: A massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms, is presented; it uses low precision data and further increases the effective memory bandwidth by packing multiple words into every memory operation.
Abstract: We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and leverage the algorithm’s simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1GB. The coprocessor prototype can process at the rate of 3.4 billion multiply accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.

254 citations
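The word-packing trick is simple to state: with 8-bit operands, four values travel in each 32-bit memory word, quadrupling effective bandwidth. An illustrative sketch (the coprocessor's actual precision and layout may differ):

```python
def pack_u8x4(a, b, c, d):
    """Pack four 8-bit operands into one 32-bit word so a single memory
    transaction carries four values (illustration of the bandwidth trick)."""
    return (a & 0xFF) | ((b & 0xFF) << 8) | ((c & 0xFF) << 16) | ((d & 0xFF) << 24)

def unpack_u8x4(w):
    return (w & 0xFF), (w >> 8) & 0xFF, (w >> 16) & 0xFF, (w >> 24) & 0xFF

word = pack_u8x4(17, 42, 0, 255)
assert unpack_u8x4(word) == (17, 42, 0, 255)
```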


Journal ArticleDOI
TL;DR: A careful adaptation of the algorithm-based fault tolerance (ABFT) technique to the needs of parallel distributed computation yields a strongly scalable fault-tolerance mechanism that can also detect and correct errors on the fly during a computation.

215 citations
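Classic algorithm-based fault tolerance augments the operands of a matrix product with checksum rows and columns that the multiplication preserves, so errors can be detected, and located, by re-summing the result. A minimal sketch of the detection half (the paper's contribution is adapting this idea to parallel distributed computation):

```python
import numpy as np

def abft_matmul(A, B, tol=1e-8):
    """ABFT for C = A @ B: carry a checksum row of A and a checksum
    column of B through the product, then compare against freshly
    computed sums to detect a corrupted entry."""
    Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum matrix
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum matrix
    Cf = Ac @ Br                                        # full checksum result
    C = Cf[:-1, :-1]
    row_ok = np.allclose(Cf[:-1, -1], C.sum(axis=1), atol=tol)
    col_ok = np.allclose(Cf[-1, :-1], C.sum(axis=0), atol=tol)
    return C, row_ok and col_ok

A = np.random.default_rng(0).random((4, 3))
B = np.random.default_rng(1).random((3, 5))
C, ok = abft_matmul(A, B)    # ok becomes False if a fault corrupts an entry
```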


Proceedings ArticleDOI
08 Jun 2009
TL;DR: EpiFast runs extremely fast for realistic simulations that involve large populations of millions of individuals with heterogeneous details, dynamic interactions between disease propagation, individual behaviors, and exogenous interventions, and the large number of replicated runs necessary for statistically sound estimates of the stochastic epidemic evolution.
Abstract: Large scale realistic epidemic simulations have recently become an increasingly important application of high-performance computing. We propose a parallel algorithm, EpiFast, based on a novel interpretation of the stochastic disease propagation in a contact network. We implement it using a master-slave computation model which allows scalability on distributed memory systems. EpiFast runs extremely fast for realistic simulations that involve: (i) large populations consisting of millions of individuals and their heterogeneous details, (ii) dynamic interactions between the disease propagation, the individual behaviors, and the exogenous interventions, as well as (iii) large number of replicated runs necessary for statistically sound estimates about the stochastic epidemic evolution. We find that EpiFast runs several orders of magnitude faster than another comparable simulation tool while delivering similar results. EpiFast has been tested on commodity clusters as well as SGI shared memory machines. For a fixed experiment, if given more computing resources, it scales automatically and runs faster. Finally, EpiFast has been used as the major simulation engine in real studies with rather sophisticated settings to evaluate various dynamic interventions and to provide decision support for public health policy makers.

205 citations
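A toy serial kernel of the kind EpiFast distributes across workers: one step of stochastic propagation over a contact network, with per-contact Bernoulli transmission. The parameters and the one-day infectious period are illustrative simplifications, not the paper's model:

```python
import random

def step(contacts, infected, recovered, p_transmit=0.05, rng=random.Random(0)):
    """One day of stochastic disease propagation on a contact network:
    each infectious-susceptible contact transmits independently with
    probability p_transmit."""
    newly = set()
    for u in infected:
        for v in contacts.get(u, ()):
            if v not in infected and v not in recovered:
                if rng.random() < p_transmit:
                    newly.add(v)
    recovered |= infected          # one-day infectious period, for brevity
    return newly, recovered

contacts = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
infected, recovered = {0}, set()
for day in range(10):
    infected, recovered = step(contacts, infected, recovered)
```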


Journal ArticleDOI
TL;DR: This work proposes a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function, and shows that the performance of Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm.
Abstract: We present Glimmer, a new multilevel algorithm for multidimensional scaling designed to exploit modern graphics processing unit (GPU) hardware. We also present GPU-SF, a parallel, force-based subsystem used by Glimmer. Glimmer organizes input into a hierarchy of levels and recursively applies GPU-SF to combine and refine the levels. The multilevel nature of the algorithm makes local minima less likely while the GPU parallelism improves speed of computation. We propose a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function. We demonstrate the benefits of Glimmer in terms of speed, normalized stress, and visual quality against several previous algorithms for a range of synthetic and real benchmark datasets. We also show that the performance of Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm.

195 citations
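A sketch of the two ingredients named in the TL;DR, using the common definition of normalized stress and a moving average as a simple stand-in for the paper's filter:

```python
import numpy as np

def normalized_stress(D_high, D_low):
    """Normalized stress between input distances and embedded distances;
    Glimmer's termination test uses a filtered (smoothed) version of this
    quantity rather than the raw value."""
    diff = D_high - D_low
    return np.sum(diff ** 2) / np.sum(D_high ** 2)

def converged(history, window=5, eps=1e-4):
    """Stop when a moving average of the stress changes by less than eps
    (a simple stand-in for the paper's low-pass filter)."""
    if len(history) < 2 * window:
        return False
    prev = np.mean(history[-2 * window:-window])
    curr = np.mean(history[-window:])
    return abs(prev - curr) < eps
```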


Journal ArticleDOI
TL;DR: In this paper, the authors present theoretical and experimental research on coherent beam combining of fiber amplifiers using the stochastic parallel gradient descent (SPGD) algorithm and analytically demonstrate the feasibility of the approach.
Abstract: We present theoretical and experimental research on coherent beam combining of fiber amplifiers using the stochastic parallel gradient descent (SPGD) algorithm. The feasibility of coherent beam combining using the SPGD algorithm is detailed analytically. Numerical simulation is carried out to explore the scaling potential to higher numbers of laser beams. Experimental investigation of coherent beam combining of two and three W-level fiber amplifiers is demonstrated. Several application fields of coherent beam combining with the SPGD algorithm, i.e., atmospheric distortion compensation, beam steering, and beam shaping, are proposed.
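SPGD itself is compact: perturb all control phases at once with a random plus/minus delta pattern, measure the combining metric twice, and step along the estimated gradient. A generic sketch with a toy combining-efficiency metric; the gain and delta values are arbitrary assumptions:

```python
import numpy as np

def spgd(measure, phases, gain=0.5, delta=0.1, iters=2000,
         rng=np.random.default_rng(0)):
    """Stochastic parallel gradient descent: perturb every phase
    simultaneously, measure the metric twice, and ascend the estimated
    gradient. Generic SPGD; the paper drives fiber-amplifier phases."""
    for _ in range(iters):
        du = delta * rng.choice([-1.0, 1.0], size=phases.shape)
        dJ = measure(phases + du) - measure(phases - du)
        phases = phases + gain * dJ * du
    return phases

# Toy metric: combining efficiency of N beams with phase errors.
def metric(ph):
    return np.abs(np.exp(1j * ph).sum()) ** 2 / ph.size ** 2

phases = spgd(metric, np.random.default_rng(1).uniform(0, 2 * np.pi, 3))
```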

Proceedings ArticleDOI
30 Nov 2009
TL;DR: This paper describes the algorithm design and implementation of GAs on Hadoop, an open source implementation of MapReduce, and demonstrates the convergence and scalability up to 10^5 variable problems.
Abstract: Genetic algorithms (GAs) are increasingly being applied to large scale problems. The traditional MPI-based parallel GAs require detailed knowledge about machine architecture. On the other hand, MapReduce is a powerful abstraction proposed by Google for making scalable and fault tolerant applications. In this paper, we show how genetic algorithms can be modeled into the MapReduce model. We describe the algorithm design and implementation of GAs on Hadoop, an open source implementation of MapReduce. Our experiments demonstrate the convergence and scalability up to 10^5-variable problems. Adding more resources would enable us to solve even larger problems without any changes in the algorithms and implementation since we do not introduce any performance bottlenecks.
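The mapping of a GA onto MapReduce is natural because fitness evaluation is embarrassingly parallel. A toy sketch of the two phases on a OneMax problem; the selection and mutation details are illustrative, not the paper's exact design:

```python
import random

def map_phase(individual):
    """Map: evaluate fitness independently, which is the embarrassingly
    parallel part MapReduce distributes (OneMax toy fitness)."""
    return sum(individual), individual

def reduce_phase(scored, rng=random.Random(0)):
    """Reduce: tournament selection, one-point crossover, and bit-flip
    mutation over the gathered (fitness, individual) pairs. A serial
    stand-in for what runs across Hadoop reducers."""
    def pick():
        a, b = rng.sample(scored, 2)
        return max(a, b)[1]
    nxt = []
    for _ in range(len(scored)):
        p, q = pick(), pick()
        cut = rng.randrange(1, len(p))
        child = p[:cut] + q[cut:]
        i = rng.randrange(len(child))
        child[i] ^= rng.random() < 0.05     # occasional bit-flip mutation
        nxt.append(child)
    return nxt

pop = [[random.randint(0, 1) for _ in range(32)] for _ in range(50)]
for gen in range(20):
    pop = reduce_phase([map_phase(ind) for ind in pop])
```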

Proceedings ArticleDOI
23 May 2009
TL;DR: A new lock-free parallel algorithm for computing betweenness centrality of massive complex networks that achieves better spatial locality compared with previous approaches is presented, and the applicability of this implementation to analyze massive real-world datasets is demonstrated.
Abstract: We present a new lock-free parallel algorithm for computing betweenness centrality of massive complex networks that achieves better spatial locality compared with previous approaches. Betweenness centrality is a key kernel in analyzing the importance of vertices (or edges) in applications ranging from social networks, to power grids, to the influence of jazz musicians, and is also incorporated into the DARPA HPCS SSCA#2 benchmark, which is extensively used to evaluate the performance of emerging high-performance computing architectures for graph analytics. We design an optimized implementation of betweenness centrality for the massively multithreaded Cray XMT system with the Threadstorm processor. For a small-world network of 268 million vertices and 2.147 billion edges, the 16-processor XMT system achieves a TEPS rate (an algorithmic performance count for the number of edges traversed per second) of 160 million per second, which corresponds to more than a 2× performance improvement over the previous parallel implementation. We demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for the large IMDb movie-actor network.
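For reference, the serial kernel being parallelized here is Brandes' algorithm: one BFS per source followed by a backward dependency-accumulation pass. A compact sketch for unweighted graphs (the paper's contribution is making the per-source work lock-free and cache-friendly):

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm: BFS plus dependency accumulation per source."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        sigma = {v: 0 for v in graph}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in graph}; dist[s] = 0
        preds = {v: [] for v in graph}
        order, q = [], deque([s])
        while q:                                      # BFS phase
            v = q.popleft(); order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):                     # accumulation phase
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(betweenness(g))
```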

Journal Article
TL;DR: In this paper, a lock-free parallel algorithm for computing betweenness centrality of massive small-world networks is presented, which achieves TEPS scores of 160 million and 90 million respectively.
Abstract: We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.

Journal ArticleDOI
TL;DR: This paper proposes a parallel, scalable, and memory-efficient MCE algorithm for distributed and/or shared memory high performance computing architectures, whose runtime scales linearly for thousands of processors on real-world application graphs with hundreds of thousands of nodes.

Journal ArticleDOI
TL;DR: A meta-heuristic method based on an evolutionary algorithm that combines classical multi-objective operators with an elitist diversification mechanism, used in cooperation with classical diversification methodologies to improve its efficiency.

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper proposes a novel community detection algorithm, which utilizes a dynamic process by contradicting the network topology and the topology-based propinquity, where the propinquity is a measure of the probability that a pair of nodes is involved in a coherent community structure.
Abstract: Graphs or networks can be used to model complex systems. Detecting community structures from large network data is a classic and challenging task. In this paper, we propose a novel community detection algorithm, which utilizes a dynamic process by contradicting the network topology and the topology-based propinquity, where the propinquity is a measure of the probability for a pair of nodes involved in a coherent community structure. Through several rounds of mutual reinforcement between topology and propinquity, the community structures are expected to naturally emerge. The overlapping vertices shared between communities can also be easily identified by an additional simple postprocessing step. To achieve better efficiency, the propinquity is incrementally calculated. We implement the algorithm on a vertex-oriented bulk synchronous parallel (BSP) model so that the mining load can be distributed on thousands of machines. We obtained interesting experimental results on several real network data sets.
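In its simplest form, topology-based propinquity between two vertices can be read off from shared neighborhood structure; the paper's full measure also counts conjugate edges and is maintained incrementally under the BSP model, but the core idea is roughly:

```python
def propinquity(adj, u, v):
    """Simplest form of topology-based propinquity: the number of common
    neighbors of u and v, plus 1 if they are directly linked. A sketch of
    the core idea only, not the paper's full measure."""
    common = len(adj[u] & adj[v])
    return common + (1 if v in adj[u] else 0)

adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(propinquity(adj, 1, 3))   # 2 common neighbors, no direct edge -> 2
```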

Journal ArticleDOI
TL;DR: On the 41 test instances obtained from QAPLIB, CPTS is shown to meet or exceed the average solution quality of many of the best sequential and parallel approaches from the literature on all but six problems, a record no other leading method surpasses.

Proceedings ArticleDOI
16 Oct 2009
TL;DR: SJMR (Spatial Join with MapReduce), a novel parallel algorithm, is presented to relieve the problem of processing heterogeneous related data sets, which is common in operations like spatial joins.
Abstract: MapReduce is a widely used parallel programming model and computing platform. With MapReduce, it is very easy to develop scalable parallel programs to process data-intensive applications on clusters of commodity machines. However, it does not directly support processing of heterogeneous related data sets, which is common in operations like spatial joins. This paper presents SJMR (Spatial Join with MapReduce), a novel parallel algorithm to relieve the problem. The strategies include a strip-based plane sweeping algorithm, a tile-based spatial partitioning function, and a duplication avoidance technique. We evaluated the performance of the SJMR algorithm in various situations with real world data sets. It demonstrates the applicability of compute-intensive spatial applications with MapReduce on small-scale clusters.
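A tile-based partitioning function maps each spatial object to every tile it overlaps, so objects straddling tile boundaries are replicated across map keys, which is exactly why the duplication-avoidance step is needed. An illustrative sketch (the grid resolution and formula are assumptions, not the paper's exact function):

```python
def tile_partition(rect, grid, extent):
    """Assign a rectangle (xmin, ymin, xmax, ymax) to every tile of a
    grid x grid decomposition of `extent` that it overlaps. Objects that
    span tile boundaries are emitted under several map keys."""
    ex_min, ey_min, ex_max, ey_max = extent
    tw = (ex_max - ex_min) / grid
    th = (ey_max - ey_min) / grid
    x0 = int((rect[0] - ex_min) / tw); x1 = int((rect[2] - ex_min) / tw)
    y0 = int((rect[1] - ey_min) / th); y1 = int((rect[3] - ey_min) / th)
    clamp = lambda i: max(0, min(grid - 1, i))
    return [(tx, ty) for tx in range(clamp(x0), clamp(x1) + 1)
                     for ty in range(clamp(y0), clamp(y1) + 1)]

# A rectangle straddling a tile boundary lands in two tiles (map keys).
print(tile_partition((0.4, 0.1, 0.6, 0.2), grid=2, extent=(0, 0, 1, 1)))
```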

Journal ArticleDOI
TL;DR: A parallel splitting method is proposed for solving systems of coupled monotone inclusions in Hilbert spaces, and its convergence is established under the assumption that solutions exist.
Abstract: A parallel splitting method is proposed for solving systems of coupled monotone inclusions in Hilbert spaces, and its convergence is established under the assumption that solutions exist. Unlike existing alternating algorithms, which are limited to two variables and linear coupling, our parallel method can handle an arbitrary number of variables as well as nonlinear coupling schemes. The breadth and flexibility of the proposed framework is illustrated through applications in the areas of evolution inclusions, variational problems, best approximation, and network flows.
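For concreteness, coupled systems of monotone inclusions of this kind are commonly written in the following generic form; the notation here is ours for illustration, and the paper states the precise assumptions on the operators:

```latex
\text{find}\;\; \bar{x}_1 \in \mathcal{H}_1,\, \ldots,\, \bar{x}_m \in \mathcal{H}_m
\quad\text{such that}\quad
0 \in A_i \bar{x}_i + B_i(\bar{x}_1, \ldots, \bar{x}_m),
\qquad i = 1, \ldots, m,
```

where each A_i is a monotone operator on a Hilbert space H_i and the operators B_i supply the, possibly nonlinear, coupling among the m variables; alternating methods handle only m = 2 with linear coupling.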

Journal ArticleDOI
TL;DR: The improved efficiency on scattering problems discretized with millions of unknowns is demonstrated and the effectiveness of the algorithm is presented by solving very large scattering problems involving a conducting sphere of radius 210 wavelengths and a complicated real-life target with a maximum dimension of 880 wavelengths.
Abstract: We present a novel hierarchical partitioning strategy for the efficient parallelization of the multilevel fast multipole algorithm (MLFMA) on distributed-memory architectures to solve large-scale problems in electromagnetics. Unlike previous parallelization techniques, the tree structure of MLFMA is distributed among processors by partitioning both clusters and samples of fields at each level. Due to the improved load-balancing, the hierarchical strategy offers a higher parallelization efficiency than previous approaches, especially when the number of processors is large. We demonstrate the improved efficiency on scattering problems discretized with millions of unknowns. In addition, we present the effectiveness of our algorithm by solving very large scattering problems involving a conducting sphere of radius 210 wavelengths and a complicated real-life target with a maximum dimension of 880 wavelengths. Both of the objects are discretized with more than 200 million unknowns.

Proceedings ArticleDOI
01 Aug 2009
TL;DR: In this article, a stream compaction algorithm for wide SIMD many-core architectures is presented that is designed to maximize concurrent execution with minimal use of synchronization; it achieves a 3x speedup over previously published algorithms.
Abstract: Stream compaction is a common parallel primitive used to remove unwanted elements in sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3x speedup over previously published algorithms.
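Stream compaction reduces to an exclusive prefix sum over the keep flags: the scan result is exactly the output offset of each surviving element. A serial sketch of the primitive (the paper's contribution is realizing it on wide SIMD hardware with minimal synchronization):

```python
import numpy as np

def compact(values, keep):
    """Stream compaction via an exclusive prefix sum over the keep flags:
    each surviving element scatters to the offset the scan assigns it."""
    flags = keep.astype(np.int64)
    offsets = np.concatenate(([0], np.cumsum(flags)[:-1]))  # exclusive scan
    out = np.empty(flags.sum(), dtype=values.dtype)
    out[offsets[keep]] = values[keep]
    return out

vals = np.array([7, 0, 3, 0, 9, 4])
print(compact(vals, vals != 0))    # [7 3 9 4]
```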

Journal ArticleDOI
TL;DR: In this article, a novel parallel quantum genetic algorithm (NPQGA) is proposed for the stochastic job shop scheduling problem with the objective of minimizing the expected makespan, where the processing times follow independent normal distributions.

Proceedings ArticleDOI
01 Aug 2009
TL;DR: A fast, parallel GPU algorithm for construction of uniform grids for ray tracing is presented that takes full advantage of the parallel architecture of the GPU and constructs grids faster than CPU algorithms running on multiple cores.
Abstract: We present a fast, parallel GPU algorithm for construction of uniform grids for ray tracing, which we implement in CUDA. The algorithm performance does not depend on the primitive distribution, because we reduce the problem to sorting pairs of primitives and cell indices. Our implementation is able to take full advantage of the parallel architecture of the GPU, and its construction speed is faster than that of CPU algorithms running on multiple cores. Its scalability and robustness make it superior to alternative approaches, especially for scenes with complex primitive distributions.
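The reduction to sorting works by emitting one (cell, primitive) pair per overlap, sorting by cell id, and recording each cell's contiguous range, so build time no longer depends on how the primitives cluster. A numpy sketch with the sort standing in for a GPU radix sort:

```python
import numpy as np

def build_grid(cell_ids):
    """Given one cell id per (cell, primitive) overlap, sort by cell id
    and record where each cell's run of primitives starts and ends."""
    order = np.argsort(cell_ids, kind="stable")     # primitive indices, grouped by cell
    sorted_cells = cell_ids[order]
    cells = np.arange(sorted_cells.max() + 1)
    starts = np.searchsorted(sorted_cells, cells, "left")
    ends = np.searchsorted(sorted_cells, cells, "right")
    return order, starts, ends

cell_ids = np.array([3, 0, 3, 1, 0])   # cell overlapped by each primitive
order, starts, ends = build_grid(cell_ids)
# Primitives in cell 3: order[starts[3]:ends[3]]  ->  [0, 2]
```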

Journal ArticleDOI
01 Aug 2009
TL;DR: A novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost is presented.
Abstract: In parallel adaptive applications, the computational structure of the applications changes over time, leading to load imbalances even though the initial load distributions were balanced. To restore balance and to keep communication volume low in further iterations of the applications, dynamic load balancing (repartitioning) of the changed computational structure is required. Repartitioning differs from static load balancing (partitioning) due to the additional requirement of minimizing migration cost to move data from an existing partition to a new partition. In this paper, we present a novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost. The use of a hypergraph-based model allows us to accurately model communication costs rather than approximate them with graph-based models. We show that the new model can be realized using hypergraph partitioning with fixed vertices and describe our parallel multilevel implementation within the Zoltan load balancing toolkit. To the best of our knowledge, this is the first implementation for dynamic load balancing based on hypergraph partitioning. To demonstrate the effectiveness of our approach, we conducted experiments on a Linux cluster with 1024 processors. The results show that, in terms of reducing total cost, our new model compares favorably to the graph-based dynamic load balancing approaches, and multilevel approaches improve the repartitioning quality significantly.

Journal ArticleDOI
TL;DR: Besides proving that BUBBLE-FOS/C converges towards a local optimum, this paper develops a much faster method for the improvement of partitionings, based on a different diffusive process, which is restricted to local areas of the graph and also contains a high degree of parallelism.

Book
23 Jul 2009
TL;DR: Parallel MATLAB for Multicore and Multinode Computers covers more parallel algorithms and parallel programming models than any other parallel programming book due to the succinctness of MATLAB.
Abstract: This is the first book on parallel MATLAB and the first parallel computing book focused on the design, code, debug, and test techniques required to quickly produce well-performing parallel programs. MATLAB is currently the dominant language of technical computing with one million users worldwide, many of whom can benefit from the increased power offered by inexpensive multicore and multinode parallel computers. MATLAB is an ideal environment for learning about parallel computing, allowing the user to focus on parallel algorithms instead of the details of implementation. Parallel MATLAB for Multicore and Multinode Computers covers more parallel algorithms and parallel programming models than any other parallel programming book due to the succinctness of MATLAB. It presents a hands-on approach with numerous example programs; wherever possible, the examples are drawn from widely known and well-documented parallel benchmark codes that are representative of many real applications across the field of technical computing. Audience: Intended for professional scientists and engineers, as well as undergraduate or graduate students, who use MATLAB. It is suitable as either the primary book in a parallel computing class or as a supplementary text in a numerical computing class or a computer science algorithms class. Contents: List of Figures; List of Tables; List of Algorithms; Preface; Acknowledgments; Part I: Fundamentals: Chapter 1: Primer: Notation and Interfaces; Chapter 2: Introduction to pMatlab; Chapter 3: Interacting with Distributed Arrays; Part II: Advanced Techniques: Chapter 4: Parallel Programming Models; Chapter 5: Advanced Distributed Array Programming; Chapter 6: Performance Metrics and Software Architecture; Part III: Case Studies: Chapter 7: Parallel Application Analysis; Chapter 8: Stream; Chapter 9: RandomAccess; Chapter 10: Fast Fourier Transform; Chapter 11: High Performance Linpack; Appendix: Notation for Hierarchical Parallel Multicore Algorithms; Index

Book
01 Jan 2009
TL;DR: This proceedings volume spans topics from multicore programming challenges and peer-to-peer computing to a MapReduce programming model for .NET-based cloud computing.
Abstract: Invited Talks.- Multicore Programming Challenges.- Ibis: A Programming System for Real-World Distributed Computing.- What Is in a Namespace?.- Topic 1: Support Tools and Environments.- Atune-IL: An Instrumentation Language for Auto-tuning Parallel Applications.- Assigning Blame: Mapping Performance to High Level Parallel Programming Abstractions.- A Holistic Approach towards Automated Performance Analysis and Tuning.- Pattern Matching and I/O Replay for POSIX I/O in Parallel Programs.- An Extensible I/O Performance Analysis Framework for Distributed Environments.- Grouping MPI Processes for Partial Checkpoint and Co-migration.- Process Mapping for MPI Collective Communications.- Topic 2: Performance Prediction and Evaluation.- Stochastic Analysis of Hierarchical Publish/Subscribe Systems.- Characterizing and Understanding the Bandwidth Behavior of Workloads on Multi-core Processors.- Hybrid Techniques for Fast Multicore Simulation.- PSINS: An Open Source Event Tracer and Execution Simulator for MPI Applications.- A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors.- Topic 3: Scheduling and Load Balancing.- Dynamic Load Balancing of Matrix-Vector Multiplications on Roadrunner Compute Nodes.- A Unified Framework for Load Distribution and Fault-Tolerance of Application Servers.- On the Feasibility of Dynamically Scheduling DAG Applications on Shared Heterogeneous Systems.- Steady-State for Batches of Identical Task Trees.- A Buffer Space Optimal Solution for Re-establishing the Packet Order in a MPSoC Network Processor.- Using Multicast Transfers in the Replica Migration Problem: Formulation and Scheduling Heuristics.- A New Genetic Algorithm for Scheduling for Large Communication Delays.- Comparison of Access Policies for Replica Placement in Tree Networks.- Scheduling Recurrent Precedence-Constrained Task Graphs on a Symmetric Shared-Memory Multiprocessor.- Energy-Aware Scheduling of Flow Applications on Master-Worker Platforms.- Topic 4: High Performance Architectures and Compilers.- Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs.- Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors.- REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs.- Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation.- Topic 5: Parallel and Distributed Databases.- Unifying Memory and Database Transactions.- A DHT Key-Value Storage System with Carrier Grade Performance.- Selective Replicated Declustering for Arbitrary Queries.- Topic 6: Grid, Cluster, and Cloud Computing.- POGGI: Puzzle-Based Online Games on Grid Infrastructures.- Enabling High Data Throughput in Desktop Grids through Decentralized Data and Metadata Management: The BlobSeer Approach.- MapReduce Programming Model for .NET-Based Cloud Computing.- The Architecture of the XtreemOS Grid Checkpointing Service.- Scalable Transactions for Web Applications in the Cloud.- Provider-Independent Use of the Cloud.- MPI Applications on Grids: A Topology Aware Approach.- Topic 7: Peer-to-Peer Computing.- A Least-Resistance Path in Reasoning about Unstructured Overlay Networks.- SiMPSON: Efficient Similarity Search in Metric Spaces over P2P Structured Overlay Networks.- Uniform Sampling for Directed P2P Networks.- Adaptive Peer Sampling with Newscast.- Exploring the Feasibility of Reputation Models for Improving P2P Routing under Churn.- Selfish Neighbor Selection in Peer-to-Peer Backup and Storage Applications.- Zero-Day Reconciliation of BitTorrent Users with Their ISPs.- Surfing Peer-to-Peer IPTV: Distributed Channel Switching.- Topic 8: Distributed Systems and Algorithms.- Distributed Individual-Based Simulation.- A Self-stabilizing K-Clustering Algorithm Using an Arbitrary Metric.- Active Optimistic Message Logging for Reliable Execution of MPI Applications.- Topic 9: Parallel and Distributed Programming.- A Parallel Numerical Library for UPC.- A Multilevel Parallelization Framework for High-Order Stencil Computations.- Using OpenMP vs. Threading Building Blocks for Medical Imaging on Multi-cores.- Parallel Skeletons for Variable-Length Lists in SkeTo Skeleton Library.- Stkm on Sca: A Unified Framework with Components, Workflows and Algorithmic Skeletons.- Grid-Enabling SPMD Applications through Hierarchical Partitioning and a Component-Based Runtime.- Reducing Rollbacks of Transactional Memory Using Ordered Shared Locks.- Topic 10: Parallel Numerical Algorithms.- Wavelet-Based Adaptive Solvers on Multi-core Architectures for the Simulation of Complex Systems.- Localized Parallel Algorithm for Bubble Coalescence in Free Surface Lattice-Boltzmann Method.- Fast Implicit Simulation of Oscillatory Flow in Human Abdominal Bifurcation Using a Schur Complement Preconditioner.- A Parallel Rigid Body Dynamics Algorithm.- Optimized Stencil Computation Using In-Place Calculation on Modern Multicore Systems.- Parallel Implementation of Runge-Kutta Integrators with Low Storage Requirements.- PSPIKE: A Parallel Hybrid Sparse Linear System Solver.- Out-of-Core Computation of the QR Factorization on Multi-core Processors.- Adaptive Parallel Householder Bidiagonalization.- Topic 11: Multicore and Manycore Programming.- Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor.- An Extension of the StarSs Programming Model for Platforms with Multiple GPUs.- StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.- XJava: Exploiting Parallelism with Object-Oriented Stream Programming.- JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA.- Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades.- Searching for Concurrent Design Patterns in Video Games.- Parallelization of a Video Segmentation Algorithm on CUDA-Enabled Graphics Processing Units.- A Parallel Point Matching Algorithm for Landmark Based Image Registration Using Multicore Platform.- High Performance Matrix Multiplication on Many Cores.- Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm.- Efficient Parallel Implementation of Evolutionary Algorithms on GPGPU Cards.- Topic 12: Theory and Algorithms for Parallel Computation.- Implementing Parallel Google Map-Reduce in Eden.- A Lower Bound for Oblivious Dimensional Routing.- Topic 13: High-Performance Networks.- A Case Study of Communication Optimizations on 3D Mesh Interconnects.- Implementing a Change Assimilation Mechanism for Source Routing Interconnects.- Dependability Analysis of a Fault-Tolerant Network Reconfiguring Strategy.- RecTOR: A New and Efficient Method for Dynamic Network Reconfiguration.- NIC-Assisted Cache-Efficient Receive Stack for Message Passing over Ethernet.- A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks.- Hardware Implementation Study of the SCFQ-CA and DRR-CA Scheduling Algorithms.- Topic 14: Mobile and Ubiquitous Computing.- Optimal and Near-Optimal Energy-Efficient Broadcasting in Wireless Networks.

Journal ArticleDOI
Duksu Kim, Jae-Pil Heo, Jaehyuk Huh, John Kim, Sung-Eui Yoon
TL;DR: A novel task decomposition method is proposed that leads to a lock-free parallel algorithm in the main loop of the BVH-based collision detection to create a highly scalable algorithm that achieves more than an order of magnitude improvement in performance using four CPU-cores and two GPUs, compared to using a single CPU-core.
Abstract: We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi-core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self-collision detection. HPCCD takes advantage of hybrid multi-core architectures, using the general-purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock-free parallel algorithm in the main loop of our BVH-based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi-core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU-cores and two GPUs, compared to using a single CPU-core. This improvement results in an interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousands of triangles.