
Showing papers on "Parallel algorithm published in 2015"


Posted Content
Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, Gang Sun 
TL;DR: A state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning, which achieves excellent results on multiple challenging computer vision benchmarks.
Abstract: We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation approaches, and usage of multi-scale high-resolution images. Our method achieves excellent results on multiple challenging computer vision benchmarks.

363 citations


Book
Richard Cole
06 Sep 2015
TL;DR: This paper provides a general method that trims a factor of O(log n) time for many applications of this technique.
Abstract: Megiddo introduced a technique for using a parallel algorithm for one problem to construct an efficient serial algorithm for a second problem. We give a general method that trims a factor of O(log n) time (or more) for many applications of this technique.

301 citations


Journal ArticleDOI
TL;DR: Numerical tests show that very few sweeps are needed to construct a factorization that is an effective preconditioner, and the amount of parallelism is large irrespective of the ordering of the matrix, and matrix ordering can be used to enhance the accuracy of the factorization rather than to increase parallelism.
Abstract: This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros in the incomplete factors can be computed in parallel and asynchronously, using one or more sweeps that iteratively improve the accuracy of the factorization. Unlike existing parallel algorithms, the amount of parallelism is large irrespective of the ordering of the matrix, and matrix ordering can be used to enhance the accuracy of the factorization rather than to increase parallelism. Numerical tests show that very few sweeps are needed to construct a factorization that is an effective preconditioner.

162 citations
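
The fixed-point sweep described in the abstract above can be illustrated with a small sketch (a toy dense-matrix version under assumed conventions, not the authors' code: unit-diagonal L, a nonzero diagonal in A, and the ILU(0) sparsity pattern). Every nonzero of L and U is recomputed from the current iterate, so all updates within a sweep are independent and could run in parallel or asynchronously.

```python
import numpy as np

def ilu0_sweeps(A, num_sweeps=3):
    # Fixed-point ILU(0) sweeps: each nonzero of L/U is updated from the
    # current iterate only, so the inner loop is embarrassingly parallel.
    # Assumes A has a nonzero diagonal; L carries a unit diagonal implicitly.
    n = A.shape[0]
    pattern = [(i, j) for i in range(n) for j in range(n) if A[i, j] != 0.0]
    L = np.eye(n) + np.tril(A, -1)          # simple initial guess
    U = np.triu(A).astype(float)
    for _ in range(num_sweeps):
        newL, newU = L.copy(), U.copy()
        for i, j in pattern:                # independent updates within a sweep
            s = sum(L[i, k] * U[k, j] for k in range(min(i, j)))
            if i > j:                       # strictly lower part -> L
                newL[i, j] = (A[i, j] - s) / U[j, j]
            else:                           # upper part (incl. diagonal) -> U
                newU[i, j] = A[i, j] - s
        L, U = newL, newU
    return L, U
```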


Book
09 Sep 2015
TL;DR: In this article, the authors present techniques for parallel divide-and-conquer, resulting in improved parallel algorithms for a number of problems including intersection detection, trapezoidal decomposition, and planar point location.
Abstract: We present techniques for parallel divide-and-conquer, resulting in improved parallel algorithms for a number of problems. The problems for which we give improved algorithms include intersection detection, trapezoidal decomposition (hence, polygon triangulation), and planar point location (hence, Voronoi diagram construction). We also give efficient parallel algorithms for fractional cascading, 3-dimensional maxima, 2-set dominance counting, and visibility from a point. All of our algorithms run in O(log n) time with either a linear or sub-linear number of processors in the CREW PRAM model.

162 citations


Journal Article
TL;DR: This book describes how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach and gives some specific examples using multiple programming models.
Abstract: In this book the authors, who are parallel computing experts and industry insiders, describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give some specific examples using multiple programming models. The book begins with two introductory chapters, one on why it is necessary to "Think Parallel" and one presenting background on the hardware trends that have led to the need for explicit parallel programming.

159 citations


Proceedings ArticleDOI
17 May 2015
TL;DR: This work builds Graph SC, a framework that provides a programming paradigm that allows non-cryptography experts to write secure code, brings parallelism to such secure implementations, and meets the need for obliviousness, thereby not leaking any private information.
Abstract: We propose introducing modern parallel programming paradigms to secure computation, enabling their secure execution on large datasets. To address this challenge, we present Graph SC, a framework that (i) provides a programming paradigm that allows non-cryptography experts to write secure code, (ii) brings parallelism to such secure implementations, and (iii) meets the need for obliviousness, thereby not leaking any private information. Using Graph SC, developers can efficiently implement an oblivious version of graph-based algorithms (including sophisticated data mining and machine learning algorithms) that execute in parallel with minimal communication overhead. Importantly, our secure version of graph-based algorithms incurs a small logarithmic overhead in comparison with the non-secure parallel version. We build Graph SC and demonstrate, using several algorithms as examples, that secure computation can be brought into the realm of practicality for big data analysis. Our secure matrix factorization implementation can process 1 million ratings in 13 hours, which is a multiple order-of-magnitude improvement over the only other existing attempt, which requires 3 hours to process 16K ratings.

152 citations


Journal ArticleDOI
TL;DR: This paper presents an architecture, protocol, and parallel algorithms for collaborative 3D mapping in the cloud with low-cost robots, as well as quantitative evaluation of localization accuracy, bandwidth usage, processing speeds, and map storage.
Abstract: This paper presents an architecture, protocol, and parallel algorithms for collaborative 3D mapping in the cloud with low-cost robots. The robots run a dense visual odometry algorithm on a smartphone-class processor. Key-frames from the visual odometry are sent to the cloud for parallel optimization and merging with maps produced by other robots. After optimization the cloud pushes the updated poses of the local key-frames back to the robots. All processes are managed by Rapyuta, a cloud robotics framework that runs in a commercial data center. This paper includes qualitative visualization of collaboratively built maps, as well as quantitative evaluation of localization accuracy, bandwidth usage, processing speeds, and map storage.

133 citations


Journal ArticleDOI
TL;DR: A new constrained tensor factorization framework is proposed in this paper, building upon the Alternating Direction Method of Multipliers (ADMoM).
Abstract: Tensor factorization has proven useful in a wide range of applications, from sensor array processing to communications, speech and audio signal processing, and machine learning. With few recent exceptions, all tensor factorization algorithms were originally developed for centralized, in-memory computation on a single machine, and the few that break away from this mold do not easily incorporate practically important constraints, such as non-negativity. A new constrained tensor factorization framework is proposed in this paper, building upon the Alternating Direction Method of Multipliers (ADMoM). It is shown that this simplifies computations, bypassing the need to solve constrained optimization problems in each iteration, and it naturally leads to distributed algorithms suitable for parallel implementation. This opens the door for many emerging big data-enabled applications. The methodology is exemplified using non-negativity as a baseline constraint, but the proposed framework can incorporate many other types of constraints. Numerical experiments are encouraging, indicating that ADMoM-based non-negative tensor factorization (NTF) has high potential as an alternative to state-of-the-art approaches.

126 citations
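
To make the "no constrained subproblem per iteration" point above concrete, here is a minimal ADMM sketch for a single nonnegative factor update of the form min over X of ||M - X B^T||_F^2 subject to X >= 0; the function and variable names are illustrative, not the authors' ADMoM code, and every step reduces to a closed-form matrix operation.

```python
import numpy as np

def admm_nls(M, B, rho=1.0, iters=50):
    # ADMM for min_X ||M - X @ B.T||_F^2 s.t. X >= 0, via the splitting X = Z.
    # Each step is closed form: a small linear solve, a projection, a dual update.
    r = B.shape[1]
    Ginv = np.linalg.inv(B.T @ B + rho * np.eye(r))   # r x r Gram matrix inverse
    X = np.zeros((M.shape[0], r))
    Z = np.zeros_like(X)                              # auxiliary nonnegative copy
    U = np.zeros_like(X)                              # scaled dual variable
    for _ in range(iters):
        X = (M @ B + rho * (Z - U)) @ Ginv            # unconstrained least-squares step
        Z = np.maximum(X + U, 0.0)                    # projection enforces X >= 0
        U = U + X - Z                                 # dual update
    return Z
```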


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The compressed sparse fiber (CSF), a data structure for sparse tensors, is introduced along with a novel parallel algorithm for tensor-matrix multiplication; it offers similar operation reductions as existing compressed methods while using only a single tensor structure.
Abstract: The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating point operations by storing a compressed tensor for each dimension of the data. These methods are fast but suffer high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state-of-the-art while achieving 81% of the parallel performance on 16 threads.

125 citations
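
For orientation, a sketch of the uncompressed baseline (the paper's "second class" of methods) is shown below, assuming a three-way tensor in COO form and Python/NumPy as a stand-in for the real shared-memory code; the CSF algorithm reduces the floating point work of exactly this loop by factoring out computation that is shared along fibers.

```python
import numpy as np

def mttkrp_coo(indices, values, B, C, num_rows):
    # Mode-1 tensor-times-matrix chain (MTTKRP) over an uncompressed COO tensor:
    # M[i, :] += val * (B[j, :] * C[k, :]) for every nonzero (i, j, k).
    # Parallelizable across nonzeros (with per-row atomics or privatization).
    # CSF instead nests indices per fiber, so nonzeros sharing (i, j) accumulate
    # over k first and are scaled by B[j, :] only once, saving flops.
    rank = B.shape[1]
    M = np.zeros((num_rows, rank))
    for (i, j, k), v in zip(indices, values):
        M[i, :] += v * (B[j, :] * C[k, :])
    return M
```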


Proceedings ArticleDOI
25 May 2015
TL;DR: This work presents a new primitive, masked matrix multiplication, that can be beneficial especially for the enumeration case and provides results from an initial implementation for the counting case along with various optimizations for communication reduction and load balance.
Abstract: Triangle counting and enumeration are important kernels that are used to characterize graphs. They are also used to compute important statistics such as clustering coefficients. We provide a simple exact algorithm that is based on operations on sparse adjacency matrices. By parallelizing the individual sparse matrix operations, we achieve a parallel algorithm for triangle counting. The algorithm is generalizable to triangle enumeration by modifying the semiring that underlies the matrix algebra. We present a new primitive, masked matrix multiplication, that can be beneficial especially for the enumeration case. We provide results from an initial implementation for the counting case along with various optimizations for communication reduction and load balance.

124 citations
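
A compact sketch of the linear-algebraic formulation described above, using SciPy sparse matrices as a stand-in for the distributed machinery; this is the plain masked A·A variant (each triangle is counted six times in an undirected adjacency matrix), whereas the paper's parallel algorithm refines the formulation and distributes the individual sparse matrix operations.

```python
import numpy as np
from scipy.sparse import csr_matrix

def count_triangles(A):
    # (A @ A)[i, j] counts length-2 paths i -> k -> j; masking by A keeps only
    # pairs (i, j) that are themselves edges, i.e. wedges closed into triangles.
    wedges = (A @ A).multiply(A)      # "masked" sparse matrix product
    return int(wedges.sum()) // 6     # each triangle contributes 6 masked entries

# tiny usage example: a single triangle 0-1-2
A = csr_matrix(np.array([[0, 1, 1],
                         [1, 0, 1],
                         [1, 1, 0]]))
print(count_triangles(A))  # -> 1
```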


Proceedings ArticleDOI
07 Apr 2015
TL;DR: This study studies compression techniques for parallel in-memory graph algorithms, and shows that they can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncompressed graphs.
Abstract: We study compression techniques for parallel in-memory graph algorithms, and show that we can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncompressed graphs. We integrate the compression techniques into Ligra, a recent shared-memory graph processing system. This system, which we call Ligra+, is able to represent graphs using about half of the space for the uncompressed graphs on average. Furthermore, Ligra+ is slightly faster than Ligra on average on a 40-core machine with hyper-threading. Our experimental study shows that Ligra+ is able to process graphs using less memory, while performing as well as or faster than Ligra.
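
The space savings come from storing each sorted adjacency list as gaps encoded with variable-length codes. The sketch below shows that idea in Python; the byte layout and the sign handling for the first gap are simplified relative to Ligra+'s actual encoders.

```python
def encode_adjacency(neighbors):
    # Delta + varint encoding of a sorted adjacency list: gaps between
    # consecutive neighbor IDs are small in many real graphs, so most gaps
    # fit in a single byte (7 payload bits, high bit = continuation).
    out, prev = bytearray(), 0
    for v in sorted(neighbors):
        gap, prev = v - prev, v
        while True:
            byte, gap = gap & 0x7F, gap >> 7
            out.append(byte | (0x80 if gap else 0x00))
            if not gap:
                break
    return bytes(out)

def decode_adjacency(data):
    # Inverse of encode_adjacency.
    neighbors, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:               # last byte of this varint
            prev += gap
            neighbors.append(prev)
            gap, shift = 0, 0
    return neighbors
```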

Journal ArticleDOI
TL;DR: It is proposed to construct electron correlation methods that are scalable in both molecule size and aggregated parallel computational power, in the sense that the total elapsed time of a calculation becomes nearly independent of the molecular size when the number of processors grows linearly with the molecular size.
Abstract: We propose to construct electron correlation methods that are scalable in both molecule size and aggregated parallel computational power, in the sense that the total elapsed time of a calculation becomes nearly independent of the molecular size when the number of processors grows linearly with the molecular size. This is shown to be possible by exploiting a combination of local approximations and parallel algorithms. The concept is demonstrated with a linear scaling pair natural orbital local second-order Moller–Plesset perturbation theory (PNO-LMP2) method. In this method, both the wave function manifold and the integrals are transformed incrementally from projected atomic orbitals (PAOs) first to orbital-specific virtuals (OSVs) and finally to pair natural orbitals (PNOs), which allow for minimum domain sizes and fine-grained accuracy control using very few parameters. A parallel algorithm design is discussed, which is efficient for both small and large molecules, and numbers of processors, although tru...

Journal ArticleDOI
TL;DR: This work addresses the ubiquitous case where these QPs are strictly convex and proposes a dual Newton strategy that exploits the block-bandedness similarly to an interior-point method.
Abstract: Quadratic programming problems (QPs) that arise from dynamic optimization problems typically exhibit a very particular structure. We address the ubiquitous case where these QPs are strictly convex and propose a dual Newton strategy that exploits the block-bandedness similarly to an interior-point method. Still, the proposed method features warmstarting capabilities of active-set methods. We give details for an efficient implementation, including tailored numerical linear algebra, step size computation, parallelization, and infeasibility handling. We prove convergence of the algorithm for the considered problem class. A numerical study based on the open-source implementation qpDUNES shows that the algorithm outperforms both well-established general purpose QP solvers as well as state-of-the-art tailored control QP solvers significantly on the considered benchmark problems.

Journal ArticleDOI
TL;DR: This article presents parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann-based simulations on nonuniform grids, and evaluates the performance on two current petascale supercomputers.
Abstract: The lattice Boltzmann method exhibits excellent scalability on current supercomputing systems and has thus increasingly become an alternative method for large-scale non-stationary flow simulations, reaching up to a trillion grid nodes. Additionally, grid refinement can lead to substantial savings in memory and compute time. These savings, however, come at the cost of much more complex data structures and algorithms. In particular, the interface between subdomains with different grid sizes must receive special treatment. In this article, we present parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann-based simulations on non-uniform grids. Additionally, we evaluate the performance of our approach on two current petascale supercomputers. On an IBM Blue Gene/Q system, the largest weak scaling benchmarks with refined grids are executed with almost two million threads, demonstrating not only near-perfect scalability but also an absolute performance of close to a trillion lattice Boltzmann cell updates per second. On an Intel-based system, the strong scaling of a simulation with refined grids and a total of more than 8.5 million cells is demonstrated to reach a performance of less than one millisecond per time step. This enables simulations with complex, non-uniform grids and four million time steps per hour of compute time.

Journal ArticleDOI
TL;DR: A fast parallel SG method, FPSG, for shared memory systems is developed by dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, which is more efficient than state-of-the-art parallel algorithms for matrix factorization.
Abstract: Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, stochastic gradient (SG) method is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SG is difficult to be parallelized for handling web-scale problems. In this article, we develop a fast parallel SG method, FPSG, for shared memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSG is more efficient than state-of-the-art parallel algorithms for matrix factorization.
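
For reference, the underlying SGD updates being parallelized look as follows (a sequential sketch with illustrative parameter names); FPSG's contribution lies in how such updates are scheduled over cache-friendly blocks of the rating matrix so that concurrent threads never touch the same rows of P or Q.

```python
import numpy as np

def sgd_mf(ratings, num_users, num_items, rank=8, lr=0.05, reg=0.05, epochs=10, rng=None):
    # Plain SGD matrix factorization on (user, item, rating) triples.
    # A block-scheduled parallel version assigns threads to blocks of the
    # rating matrix that share no rows or columns, so updates never conflict.
    rng = rng or np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((num_users, rank))
    Q = 0.1 * rng.standard_normal((num_items, rank))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q
```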

Journal ArticleDOI
01 Aug 2015
TL;DR: A parallel scalable algorithm is provided that guarantees a polynomial speedup over sequential algorithms as the number of processors increases, and a parallel algorithm with an accuracy bound is developed for the problem of discovering top-k diversified GPARs.
Abstract: We propose graph-pattern association rules (GPARs) for social media marketing. Extending association rules for item-sets, GPARs help us discover regularities between entities in social graphs, and identify potential customers by exploring social influence. We study the problem of discovering top-k diversified GPARs. While this problem is NP-hard, we develop a parallel algorithm with accuracy bound. We also study the problem of identifying potential customers with GPARs. While it is also NP-hard, we provide a parallel scalable algorithm that guarantees a polynomial speedup over sequential algorithms with the increase of processors. Using real-life and synthetic graphs, we experimentally verify the scalability and effectiveness of the algorithms.

Journal ArticleDOI
TL;DR: An improved variation of CLPSO is proposed, called the parallel comprehensive learning particle swarm optimizer (PCLPSO), which has multiple swarms based on the master-slave paradigm that work cooperatively and concurrently.
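
As a hedged sketch of the building block involved, below is a comprehensive-learning-style velocity/position update for one particle; exemplar construction, the master-slave communication, and the multi-swarm bookkeeping that PCLPSO adds are not shown.

```python
import numpy as np

def pso_step(x, v, exemplar, w=0.7, c=1.5, rng=None):
    # One comprehensive-learning-style PSO update: the particle learns from an
    # exemplar assembled from personal bests rather than a single global best.
    # In a master-slave parallelization, several swarms run such updates
    # concurrently and the master periodically exchanges their best solutions.
    rng = rng or np.random.default_rng()
    r = rng.random(x.shape)                 # per-dimension random coefficients
    v = w * v + c * r * (exemplar - x)
    return x + v, v
```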

Journal ArticleDOI
TL;DR: This work proposes a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors using the CSR format, and proposes an efficient parallel insert method for long rows of the resulting matrix and develops a heuristic-based load balancing strategy.
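
A row-wise (Gustavson-style) CSR SpGEMM sketch in plain Python is shown below; each output row is computed independently, which is the unit of work a GPU framework distributes across threads, with the proposed insert method and load-balancing heuristics handling rows whose accumulators grow long.

```python
def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, num_rows):
    # Row-wise sparse matrix-matrix multiply on CSR arrays: for each row i of A,
    # scale the rows of B selected by A's nonzeros and merge them into row i of C.
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(num_rows):
        acc = {}                                   # sparse accumulator for row i of C
        for t in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[t], a_val[t]
            for s in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[s]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[s]
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```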

Proceedings ArticleDOI
25 May 2015
TL;DR: This paper presents and evaluates a parallel community detection algorithm derived from the state-of-the-art Louvain modularity maximization method, which is able to parallelize graphs with up to 138 billion edges on 8,192 Blue Gene/Q nodes and 1,024 P7-IH nodes.
Abstract: In this paper we present and evaluate a parallel community detection algorithm derived from the state-of-the-art Louvain modularity maximization method. Our algorithm adopts a novel graph mapping and data representation, and relies on an efficient communication runtime, specifically designed for fine-grained applications executed on large-scale supercomputers. We have been able to parallelize graphs with up to 138 billion edges on 8,192 Blue Gene/Q nodes and 1,024 P7-IH nodes. Leveraging the convergence properties of our algorithm and the efficient implementation, we can analyze communities of large-scale graphs in just a few seconds. To the best of our knowledge, this is the first parallel implementation of the Louvain algorithm that scales to these large data and processor configurations.
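
The kernel being parallelized is the Louvain local-moving step; a compact sequential sketch follows, with an assumed data layout (adj maps a vertex to (neighbor, weight) pairs, degree holds weighted degrees, total_weight is the total edge weight). The distributed implementation evaluates many such moves concurrently.

```python
from collections import defaultdict

def local_move_pass(adj, degree, community, total_weight):
    # One Louvain-style local-moving pass: each vertex greedily joins the
    # neighboring community with the largest modularity gain. Returns the
    # number of moves made in this pass.
    comm_degree = defaultdict(float)            # sum of member degrees per community
    for v, c in community.items():
        comm_degree[c] += degree[v]
    moves = 0
    for v, neighbors in adj.items():
        old_c = community[v]
        comm_degree[old_c] -= degree[v]         # temporarily remove v
        links_to = defaultdict(float)           # edge weight from v into each community
        for u, w in neighbors:
            if u != v:
                links_to[community[u]] += w
        best_c, best_gain = old_c, 0.0
        for c, w_in in links_to.items():        # gain is proportional to the
            gain = w_in - degree[v] * comm_degree[c] / (2.0 * total_weight)
            if gain > best_gain:                # modularity change of joining c
                best_c, best_gain = c, gain
        community[v] = best_c
        comm_degree[best_c] += degree[v]
        moves += best_c != old_c
    return moves
```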

Journal ArticleDOI
TL;DR: This paper proposes and analyzes three parallel hybrid extragradient methods for finding a common element of the set of solutions of equilibrium problems involving pseudomonotone bifunctions and theSet of fixed points of nonexpansive mappings in a real Hilbert space based on parallel computation.
Abstract: In this paper we propose and analyze three parallel hybrid extragradient methods for finding a common element of the set of solutions of equilibrium problems involving pseudomonotone bifunctions and the set of fixed points of nonexpansive mappings in a real Hilbert space. Based on parallel computation we can reduce the overall computational effort under widely used conditions on the bifunctions and the nonexpansive mappings. A simple numerical example is given to illustrate the proposed parallel algorithms.
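
For readers unfamiliar with the scheme, the generic two-stage extragradient step for a single bifunction is sketched below; lambda > 0 is a step size, and the paper's parallel hybrid methods execute such steps for all bifunctions simultaneously before combining the results with projection steps that also involve the nonexpansive mappings.

```latex
% Generic extragradient step for one equilibrium bifunction f on a closed
% convex set C (step size \lambda > 0); the parallel hybrid methods perform
% such steps for every bifunction f_i simultaneously.
\begin{aligned}
y_k &= \arg\min_{y \in C}\Bigl\{\lambda\, f(x_k, y) + \tfrac{1}{2}\lVert y - x_k\rVert^2\Bigr\},\\
z_k &= \arg\min_{y \in C}\Bigl\{\lambda\, f(y_k, y) + \tfrac{1}{2}\lVert y - x_k\rVert^2\Bigr\}.
\end{aligned}
```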

Journal ArticleDOI
TL;DR: In this article, a multi-objective optimization method is proposed to model transient stability as an objective function rather than an inequality constraint and consider classic transient stability constrained optimal power flow (TSCOPF) as a tradeoff procedure using Pareto ideology.
Abstract: Stability is an important constraint in power system operation, and transient stability constrained optimal power flow (OPF) has received considerable attention in recent years. In this paper, the defects of the existing models and algorithms for this problem are first analyzed; on that basis, a multi-objective optimization method is proposed. The basic idea of the proposed method is to model transient stability as an objective function rather than an inequality constraint and to treat classic transient stability constrained OPF (TSCOPF) as a tradeoff procedure using Pareto ideology. Second, a master-slave parallel elitist non-dominated sorting genetic algorithm II is used to solve the proposed multi-objective optimization problem; the parallel algorithm shows an excellent acceleration effect and provides a set of Pareto optimal solutions for decision makers to select from. An innovative weight-assigning technique based on fuzzy membership variance is also introduced for a more objective selection of the final solution. Case study results demonstrate that the proposed multi-objective method has many advantages compared with traditional TSCOPF methods.

Journal ArticleDOI
TL;DR: This paper has combined PSO with the gravitational emulation local search (GELS) algorithm to form a new method, PSO–GELS, and experimental results demonstrate the effectiveness of PSO-GELS compared to other algorithms.
Abstract: A grid computing system consists of a group of programs and resources that are spread across machines in the grid. A grid system has a dynamic environment and decentralized distributed resources, so it is important to provide efficient scheduling for applications. Task scheduling is an NP-hard problem; deterministic algorithms are inadequate, and heuristic algorithms such as particle swarm optimization (PSO) are needed to solve the problem. PSO is a simple parallel algorithm that can be applied in different ways to resolve optimization problems. PSO searches the problem space globally and needs to be combined with other methods to search locally as well. In this paper, we propose a hybrid-scheduling algorithm to solve the independent task-scheduling problem in grid computing. We have combined PSO with the gravitational emulation local search (GELS) algorithm to form a new method, PSO-GELS. Our experimental results demonstrate the effectiveness of PSO-GELS compared to other algorithms.

Proceedings ArticleDOI
25 May 2015
TL;DR: In this paper, the label propagation technique was adapted for multilevel graph partitioning, and a highly parallel evolutionary algorithm was applied to the coarsest graph to obtain very high quality.
Abstract: Processing large complex networks like social networks or web graphs has recently attracted considerable interest. To do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners, originally developed for more regular mesh-like networks, do not work well for these networks. This paper addresses this problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable for both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsest graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very large. As an example, our algorithm partitions a web graph with 3.3G edges in 16 seconds using 512 cores of a high-performance cluster while producing a high quality partition -- none of the competing systems can handle this graph on our system.
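
A sketch of size-constrained label propagation, the ingredient that makes the technique usable for both coarsening and refinement; the data layout is assumed (adj maps a vertex to (neighbor, weight) pairs, blocks maps vertices to block ids), and the real system runs this in parallel and applies the evolutionary algorithm at the coarsest level.

```python
from collections import defaultdict

def size_constrained_label_propagation(adj, blocks, max_block_size, rounds=5):
    # A vertex adopts the block that is most common (by edge weight) among its
    # neighbors, but only if that block still has room under the size constraint.
    size = defaultdict(int)
    for v, b in blocks.items():
        size[b] += 1
    for _ in range(rounds):
        changed = False
        for v, neighbors in adj.items():
            votes = defaultdict(float)
            for u, w in neighbors:
                votes[blocks[u]] += w                  # weighted neighbor vote
            for b, _ in sorted(votes.items(), key=lambda kv: -kv[1]):
                if b == blocks[v]:
                    break                              # already in the best feasible block
                if size[b] < max_block_size:           # obey the size constraint
                    size[blocks[v]] -= 1
                    size[b] += 1
                    blocks[v] = b
                    changed = True
                    break
        if not changed:
            break
    return blocks
```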

Journal ArticleDOI
TL;DR: The scaling capability of the Birmingham parallel genetic algorithm is demonstrated through its application to the global optimisation of iridium clusters with 10 to 20 atoms, a catalytically important system with interesting size-specific effects.
Abstract: A new open-source parallel genetic algorithm, the Birmingham parallel genetic algorithm, is introduced for the direct density functional theory global optimisation of metallic nanoparticles. The program utilises a pool genetic algorithm methodology for the efficient use of massively parallel computational resources. The scaling capability of the Birmingham parallel genetic algorithm is demonstrated through its application to the global optimisation of iridium clusters with 10 to 20 atoms, a catalytically important system with interesting size-specific effects. This is the first study of its type on iridium clusters of this size and the parallel algorithm is shown to be capable of scaling beyond previous size restrictions and accurately characterising the structures of these larger system sizes. By globally optimising the system directly at the density functional level of theory, the code captures the cubic structures commonly found in sub-nanometre sized Ir clusters.

Proceedings ArticleDOI
13 Jun 2015
TL;DR: In this paper, the authors use exponential start time clustering to design faster parallel graph algorithms involving distances, and give linear-work parallel algorithms that construct spanners with O(k) stretch and size O(n^(1+1/k)) in unweighted graphs and size O(n^(1+1/k) log k) in weighted graphs, as well as hopsets that yield an O(m polylog n)-work parallel algorithm for approximating shortest paths in undirected graphs.
Abstract: We use exponential start time clustering to design faster parallel graph algorithms involving distances. Previous algorithms usually rely on graph decomposition routines with strict restrictions on the diameters of the decomposed pieces. We weaken these bounds in favor of stronger local probabilistic guarantees. This allows more direct analyses of the overall process, giving: linear-work parallel algorithms that construct spanners with O(k) stretch and size O(n^(1+1/k)) in unweighted graphs and size O(n^(1+1/k) log k) in weighted graphs; and hopsets that lead to the first parallel algorithm for approximating shortest paths in undirected graphs with O(m polylog n) work.
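
The clustering primitive can be pictured as a multi-source search in which every vertex receives an exponentially distributed head start and each vertex joins the source that reaches it first; a small sequential sketch under that reading is below (the parallel algorithms realize this with BFS-like rounds rather than a priority queue).

```python
import heapq
import random

def exponential_start_time_clustering(adj, beta=0.2, seed=0):
    # Every vertex u draws a head start delta_u ~ Exp(beta); vertex v joins the
    # cluster of the source u minimizing dist(u, v) - delta_u, computed here as
    # one Dijkstra-style search with shifted start times (unit-length edges).
    # Smaller beta gives larger head starts and hence fewer, larger clusters.
    rng = random.Random(seed)
    shift = {u: rng.expovariate(beta) for u in adj}
    owner = {}
    heap = [(-shift[u], u, u) for u in adj]           # (start time, source, vertex)
    heapq.heapify(heap)
    while heap:
        d, src, v = heapq.heappop(heap)
        if v in owner:
            continue                                  # already claimed by an earlier arrival
        owner[v] = src
        for w in adj[v]:
            if w not in owner:
                heapq.heappush(heap, (d + 1, src, w))
    return owner                                      # vertex -> cluster center
```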

Journal ArticleDOI
Kun Guo, Wenzhong Guo, Yuzhong Chen, Qirong Qiu, Qishan Zhang
TL;DR: Three strategies, namely, localizing propagation of affinity messages, relaxing self-exemplar constraints, and hierarchical processing, are employed in the algorithm to achieve reasonable time and space complexities in social networks.

Book
01 Jun 2015
TL;DR: This fully revised edition includes the latest enhancements in OpenCL 2.0, including shared virtual memory to increase programming flexibility and reduce data transfers that consume resources, and dynamic parallelism, which reduces processor load and avoids bottlenecks.
Abstract: Heterogeneous Computing with OpenCL 2.0 teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures: multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units (APUs). This fully revised edition includes the latest enhancements in OpenCL 2.0, including shared virtual memory to increase programming flexibility and reduce data transfers that consume resources; dynamic parallelism, which reduces processor load and avoids bottlenecks; and improved imaging support and integration with OpenGL. Designed to work on multiple platforms, OpenCL will help you more effectively program for a heterogeneous future. Written by leaders in the parallel computing and OpenCL communities, this book explores memory spaces, optimization techniques, extensions, debugging and profiling. Multiple case studies and examples illustrate high-performance algorithms, distributing work across heterogeneous systems, and embedded domain-specific languages, and they give you hands-on OpenCL experience to address a range of fundamental parallel algorithms. The book provides updated content covering the latest developments in OpenCL 2.0, including improvements in memory handling, parallelism, and imaging support; explanations of principles and strategies for learning parallel programming with OpenCL, from understanding the abstraction models to thoroughly testing and debugging complete applications; and example code covering image analytics, web plugins, particle simulations, video editing, performance optimization, and more.

Journal ArticleDOI
TL;DR: The scalability performances and the efficiency of the parallel algorithm are shown, and the robustness of the method is tested on complex and strongly connected DFN configurations which would be very difficult to mesh using conventional app...
Abstract: Flows in fractured media have been modeled using many different approaches in order to get reliable and efficient simulations for many critical applications. The common issues to be tackled are the wide range of scales involved in the phenomenon, the complexity of the domain, and the huge computational cost. In the present paper we propose a parallel implementation of the PDE-constrained optimization method presented in [S. Berrone, S. Pieraccini, and S. Scialo, SIAM J. Sci. Comput., 35 (2013), pp. B487--B510; S. Berrone, S. Pieraccini, and S. Scialo, SIAM J. Sci. Comput., 35 (2013), pp. A908--A935; S. Berrone, S. Pieraccini, and S. Scialo, J. Comput. Phys., 256 (2014), pp. 838--853] for dealing with arbitrary discrete fracture networks (DFNs) on nonconforming grids. We show the scalability performances and the efficiency of the parallel algorithm, and we also test the robustness of the method on complex and strongly connected DFN configurations which would be very difficult to mesh using conventional app...

Journal ArticleDOI
TL;DR: The results show that PHPSO, when used to solve the one-dimensional heat conduction equation, outperforms two parallel algorithms as well as HPSO itself, and exhibits strong robustness and high speedup.

Journal ArticleDOI
TL;DR: A parallel CRF algorithm called MapReduce CRF (MRCRF) is proposed in this paper, which contains two parallel sub-algorithms to handle two time-consuming steps of the CRF model and outperforms other competing methods in terms of time efficiency and correctness.
Abstract: Processing large volumes of data has presented a challenging issue, particularly in data-redundant systems. As one of the most recognized models, the conditional random fields (CRF) model has been widely applied in biomedical named entity recognition (Bio-NER). Due to the internally sequential feature, performance improvement of the CRF model is nontrivial, which requires new parallelized solutions. By combining and parallelizing the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and Viterbi algorithms, we propose a parallel CRF algorithm called MapReduce CRF (MRCRF) in this paper, which contains two parallel sub-algorithms to handle two time-consuming steps of the CRF model. The MapReduce L-BFGS (MRLB) algorithm leverages the MapReduce framework to enhance the capability of estimating parameters. Furthermore, the MapReduce Viterbi (MRVtb) algorithm infers the most likely state sequence by extending the Viterbi algorithm with another MapReduce job. Experimental results show that the MRCRF algorithm outperforms other competing methods by exhibiting significant performance improvement in terms of time efficiency as well as preserving a guaranteed level of correctness.
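
As one concrete piece, standard log-space Viterbi decoding for a single sequence is sketched below with assumed NumPy inputs; an MRVtb-style MapReduce job parallelizes exactly this computation by assigning different sequences to different map tasks.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, observations):
    # Viterbi decoding in log space: the most likely state sequence for one
    # observation sequence. log_start: (S,), log_trans: (S, S) indexed as
    # [previous, next], log_emit: (S, V), observations: list of symbol indices.
    log_start, log_trans, log_emit = map(np.asarray, (log_start, log_trans, log_emit))
    num_states, T = len(log_start), len(observations)
    score = np.full((T, num_states), -np.inf)
    back = np.zeros((T, num_states), dtype=int)
    score[0] = log_start + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(num_states):
            cand = score[t - 1] + log_trans[:, s]     # best predecessor of state s
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s, observations[t]]
    path = [int(np.argmax(score[-1]))]                # best final state, then backtrack
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```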