
Showing papers on "Parallel algorithm published in 2019"


Journal ArticleDOI
Tal Ben-Nun1, Torsten Hoefler1
TL;DR: The problem of accelerating DNN training is described from a theoretical perspective, followed by approaches for its parallelization, and potential directions for parallelism in deep learning are extrapolated.
Abstract: Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
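As a hedged illustration of the simplest concurrency scheme the survey covers, synchronous data parallelism, the sketch below averages per-worker gradients before each parameter update. It is a toy NumPy linear-regression example, not code from the survey; the model, shard sizes, and learning rate are all illustrative.

```python
# Minimal sketch of synchronous data-parallel SGD: each worker computes a
# gradient on its shard of the mini-batch, the gradients are averaged
# ("all-reduce"), and the shared parameters are updated once per step.
import numpy as np

rng = np.random.default_rng(0)
n_workers, batch, dim = 4, 64, 8
w = np.zeros(dim)                                  # shared model parameters
X = rng.normal(size=(batch, dim))                  # one global mini-batch
y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=batch)

def local_gradient(w, X_shard, y_shard):
    """Least-squares gradient computed on one worker's shard of the batch."""
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]   # would run in parallel
    w -= 0.1 * np.mean(grads, axis=0)              # average gradients, then update

print(float(np.mean((X @ w - y) ** 2)))            # training loss after 100 steps
```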

433 citations


Journal ArticleDOI
TL;DR: Parallel-PC, a fast and memory-efficient PC algorithm, is developed; it is suitable for personal computers, requires no parallel computing knowledge from end users beyond their competency in using the PC algorithm, and is integrated into a causal inference method for inferring miRNA-mRNA regulatory relationships.
Abstract: Discovering causal relationships from observational data is a crucial problem and it has applications in many research areas. The PC algorithm is the state-of-the-art constraint-based method for causal discovery. However, the runtime of the PC algorithm is, in the worst case, exponential in the number of nodes (variables), and thus it is inefficient when applied to high-dimensional data, e.g., gene expression datasets. On another note, the advancement of computer hardware in the last decade has resulted in the widespread availability of multi-core personal computers. There is a significant motivation for designing a parallelized PC algorithm that is suitable for personal computers and does not require end users’ parallel computing knowledge beyond their competency in using the PC algorithm. In this paper, we develop parallel-PC, a fast and memory-efficient PC algorithm using the parallel computing technique. We apply our method to a range of synthetic and real-world high-dimensional datasets. Experimental results on a dataset from the DREAM 5 challenge show that the original PC algorithm could not produce any results after running for more than 24 hours; meanwhile, our parallel-PC algorithm managed to finish within around 12 hours on a 4-core CPU computer, and in less than six hours on an 8-core CPU computer. Furthermore, we integrate parallel-PC into a causal inference method for inferring miRNA-mRNA regulatory relationships. The experimental results show that parallel-PC helps improve both the efficiency and accuracy of the causal inference algorithm.

104 citations


Proceedings ArticleDOI
06 Jan 2019
TL;DR: A remarkably simple meta-algorithm for the (Δ + 1) coloring problem: sample O(log n) colors for each vertex independently and uniformly at random from the Δ + 1 colors; then find a proper coloring of the graph using only the sampled colors of each vertex.
Abstract: Any graph with maximum degree Δ admits a proper vertex coloring with Δ + 1 colors that can be found via a simple sequential greedy algorithm in linear time and space. But can one find such a coloring via a sublinear algorithm? We answer this fundamental question in the affirmative for several canonical classes of sublinear algorithms including graph streaming, sublinear time, and massively parallel computation (MPC) algorithms. In particular, we design: • A single-pass semi-streaming algorithm in dynamic streams using O(n) space. The only known semi-streaming algorithm prior to our work was a folklore O(log n)-pass algorithm obtained by simulating classical distributed algorithms in the streaming model. • A sublinear-time algorithm in the standard query model that allows neighbor queries and pair queries using [MATH HERE] time. We further show that any algorithm that outputs a valid coloring with sufficiently large constant probability requires [MATH HERE] time. No non-trivial sublinear time algorithms were known prior to our work. • A parallel algorithm in the massively parallel computation (MPC) model using O(n) memory per machine and O(1) MPC rounds. Our number of rounds significantly improves upon the recent O(log log Δ · log* (n))-round algorithm of Parter [ICALP 2018]. At the core of our results is a remarkably simple meta-algorithm for the (Δ + 1) coloring problem: Sample O(log n) colors for each vertex independently and uniformly at random from the Δ + 1 colors; find a proper coloring of the graph using only the sampled colors of each vertex. As our main result, we prove that the sampled set of colors with high probability contains a proper coloring of the input graph. The sublinear algorithms are then obtained by designing efficient algorithms for finding a proper coloring of the graph from the sampled colors in each model. We note that all our upper bound results for (Δ + 1) coloring are either optimal or close to best possible in each model studied. We also establish new lower bounds that rule out the possibility of achieving similar results in these models for the closely related problems of maximal independent set and maximal matching. Collectively, our results highlight a sharp contrast between the complexity of (Δ + 1) coloring vs. maximal independent set and maximal matching in various models of sublinear computation even though all three problems are solvable by a simple greedy algorithm in the classical setting.
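The meta-algorithm quoted in the abstract lends itself to a short sketch: sample a small palette per vertex, then look for a proper coloring inside the sampled lists. The greedy pass below is only an illustration (the paper proves that a proper list-coloring exists w.h.p. and recovers it differently in each model); the palette-size constant, the example graph, and the resample-on-failure behavior are assumptions.

```python
# Sketch of the (Δ + 1)-coloring meta-algorithm from the abstract: sample a
# small palette per vertex from {0, ..., Δ}, then try to color the graph using
# only sampled colors. A failed greedy pass would simply be retried with a
# fresh sample in practice.
import math
import random

def sample_palettes(adj, seed=0):
    """Sample O(log n) colors per vertex from {0, ..., Δ}."""
    random.seed(seed)
    n = len(adj)
    delta = max((len(nbrs) for nbrs in adj), default=0)
    k = max(1, math.ceil(2 * math.log(n + 1)))    # illustrative O(log n) sample size
    return [random.sample(range(delta + 1), min(k, delta + 1)) for _ in range(n)]

def greedy_from_palettes(adj, palettes):
    """Try to properly color the graph using only each vertex's sampled palette."""
    color = [None] * len(adj)
    for v in range(len(adj)):                     # arbitrary vertex order
        used = {color[u] for u in adj[v] if color[u] is not None}
        free = [c for c in palettes[v] if c not in used]
        if not free:
            return None                           # greedy pass failed; resample and retry
        color[v] = free[0]
    return color

# 5-cycle example: Δ = 2, so Δ + 1 = 3 colors suffice.
adj = [[1, 4], [0, 2], [1, 3], [2, 4], [3, 0]]
print(greedy_from_palettes(adj, sample_palettes(adj)))
```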

97 citations


Journal ArticleDOI
TL;DR: This paper develops a fast proximal algorithm and its accelerated variant with an inexact proximal step, shows that the proposed algorithm can be parallelized, and demonstrates that the resulting algorithm achieves nearly linear speedup w.r.t. the number of threads.
Abstract: Low-rank modeling has many important applications in computer vision and machine learning. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better empirical performance. However, the resulting optimization problem is much more challenging. Recent state-of-the-art requires an expensive full SVD in each iteration. In this paper, we show that for many commonly-used nonconvex low-rank regularizers, the singular values obtained from the proximal operator can be automatically thresholded. This allows the proximal operator to be efficiently approximated by the power method. We then develop a fast proximal algorithm and its accelerated variant with inexact proximal step. It can be guaranteed that the squared distance between consecutive iterates converges at a rate of O(1/T), where T is the number of iterations. Furthermore, we show the proposed algorithm can be parallelized, and the resultant algorithm achieves nearly linear speedup w.r.t. the number of threads. Extensive experiments are performed on matrix completion and robust principal component analysis. Significant speedup over the state-of-the-art is observed.
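A minimal sketch of the central idea: approximate the proximal step using only the leading singular triplets, obtained by a block power method instead of a full SVD. The soft-thresholding used below is a stand-in for the paper's nonconvex shrinkage rules, and all matrix sizes and iteration counts are illustrative.

```python
# Hedged sketch: approximate a low-rank proximal step with a block power
# method for the top-k singular triplets, then shrink the singular values.
import numpy as np

def topk_svd_power(A, k, iters=30, seed=0):
    """Approximate top-k singular triplets of A with block power iteration."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.normal(size=(A.shape[1], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(A.T @ (A @ Q))[0]        # iterate on the Gram matrix
    B = A @ Q                                     # project A onto the subspace
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U, s, Vt @ Q.T

def approx_prox(A, k, lam):
    """Approximate prox step: shrink only the leading k singular values."""
    U, s, Vt = topk_svd_power(A, k)
    s_shrunk = np.maximum(s - lam, 0.0)           # placeholder scalar shrinkage rule
    return (U * s_shrunk) @ Vt

A = np.random.default_rng(1).normal(size=(50, 40))
X = approx_prox(A, k=5, lam=1.0)
print(np.linalg.matrix_rank(X))                   # at most 5 by construction
```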

77 citations


Proceedings ArticleDOI
06 Jan 2019
TL;DR: In this paper, the authors give a unified approach that yields better approximation algorithms for matching and vertex cover in all these models, including the streaming model, the distributed communication model, and the massively parallel computation (MPC) model.
Abstract: There is a rapidly growing need for scalable algorithms that solve classical graph problems, such as maximum matching and minimum vertex cover, on massive graphs. For massive inputs, several different computational models have been introduced, including the streaming model, the distributed communication model, and the massively parallel computation (MPC) model that is a common abstraction of MapReduce-style computation. In each model, algorithms are analyzed in terms of resources such as space used or rounds of communication needed, in addition to the more traditional approximation ratio. In this paper, we give a single unified approach that yields better approximation algorithms for matching and vertex cover in all these models. The highlights include: • The first one-pass, significantly-better-than-2 approximation for matching in random arrival streams that uses subquadratic space, namely a (1.5 + ε)-approximation streaming algorithm that uses O(n^1.5) space for constant ε > 0. • The first 2-round, better-than-2 approximation for matching in the MPC model that uses sub-quadratic space per machine, namely a (1.5 + ε)-approximation algorithm with [MATH HERE] memory per machine for constant ε > 0. By building on our unified approach, we further develop parallel algorithms in the MPC model that give a (1 + ε)-approximation to matching and an O(1)-approximation to vertex cover in only O(log log n) MPC rounds and O(n/polylog(n)) memory per machine. These results settle multiple open questions posed by Czumaj et al. [STOC 2018]. We obtain our results by a novel combination of two previously disjoint sets of techniques, namely randomized composable coresets and edge degree constrained subgraphs (EDCS). We significantly extend the power of these techniques and prove several new structural results. For example, we show that an EDCS is a sparse certificate for large matchings and small vertex covers that is quite robust to sampling and composition.

65 citations


Journal ArticleDOI
TL;DR: The concept of granularity of parallelism for GAs on GPU architectures is reexamined, how data layout affects kernel design to maximize memory bandwidth is discussed, and how to organize threads in grids and blocks to expose sufficient parallelism to the GPU is explained.

57 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: The first conditional hardness results for massively parallel algorithms for some central graph problems including (approximating) maximum matching, vertex cover, maximal independent set, and coloring are presented.
Abstract: We present the first conditional hardness results for massively parallel algorithms for some central graph problems including (approximating) maximum matching, vertex cover, maximal independent set, and coloring. In some cases, these hardness results match or get close to the state-of-the-art algorithms. Our hardness results are conditioned on a widely believed conjecture in massively parallel computation about the complexity of the connectivity problem. We also note that it is known that an unconditional variant of such hardness results might be somewhat out of reach for now, as it would lead to considerably improved circuit complexity lower bounds and would concretely imply that NC^1 is a proper subset of P. We obtain our conditional hardness results via a general method that lifts unconditional lower bounds from the well-studied LOCAL model of distributed computing to the massively parallel computation setting.

56 citations


Journal ArticleDOI
TL;DR: A new LSRTM approach that uses the excitation amplitude (EA) imaging condition to suppress crosstalk noise; it avoids frequent memory transfers and is well suited to graphics processing unit (GPU) parallelization.
Abstract: Least-squares reverse time migration (LSRTM) can provide higher quality images than conventional reverse time migration, which is helpful to image simultaneous-source data. However, it still faces the problems of crosstalk noise, long computation times, and large storage requirements. We propose a new LSRTM approach by using the excitation amplitude (EA) imaging condition to suppress the crosstalk noise. Since only the maximum amplitude or limited local maximum amplitudes at each imaging point and the corresponding travel time step(s) need to be saved, the storage problem is naturally resolved. Consequently, the proposed algorithm can avoid frequent memory transfers and is suitable for graphics processing unit (GPU) parallelization. Besides, shared memory with high bandwidth is used to optimize the GPU-based algorithm. In order to further improve the image quality of the EA imaging condition, we adopt shaping regularization as a constraint. The single-source tests with Marmousi and salt models show the feasibility of our algorithm to image complex and subsalt structures, among which a wrong background velocity is used to test its sensitivity to velocity error. The noise-free and noise-included simultaneous-source examples demonstrate the ability of the EA imaging condition to suppress the crosstalk noise. During the implementation of the GPU parallelization, we find that the shared memory cannot always optimize the GPU parallel algorithm and only works well for eighth- or higher-order spatial finite-difference schemes.
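A hedged sketch of the excitation-amplitude bookkeeping described above: during forward modeling, only the maximum absolute amplitude and its time step are retained at each grid point, so the full wavefield history never needs to be stored. The 1-D acoustic finite-difference scheme, source wavelet, and all parameters below are illustrative, not the paper's GPU implementation.

```python
# Toy 1-D second-order acoustic propagation that records, per grid cell, the
# maximum |amplitude| seen so far and the time step at which it occurred
# (the data needed by an excitation-amplitude imaging condition).
import numpy as np

nx, nt, dx, dt, c = 401, 1500, 5.0, 0.0005, 2000.0
r = (c * dt / dx) ** 2                         # Courant number squared (stable here)
src = nx // 2
t = np.arange(nt) * dt
wavelet = (1 - 2 * (np.pi * 25 * (t - 0.04)) ** 2) * np.exp(-(np.pi * 25 * (t - 0.04)) ** 2)

p_prev = np.zeros(nx)
p_curr = np.zeros(nx)
ex_amp = np.zeros(nx)                          # max |amplitude| per cell
ex_time = np.zeros(nx, dtype=int)              # time step of that maximum

for it in range(nt):
    lap = np.zeros(nx)
    lap[1:-1] = p_curr[2:] - 2 * p_curr[1:-1] + p_curr[:-2]
    p_next = 2 * p_curr - p_prev + r * lap
    p_next[src] += wavelet[it]                 # inject the source
    bigger = np.abs(p_next) > ex_amp           # update excitation amplitude/time
    ex_amp[bigger] = np.abs(p_next)[bigger]
    ex_time[bigger] = it
    p_prev, p_curr = p_curr, p_next

print(ex_time[src], float(ex_amp.max()))
```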

55 citations


Proceedings ArticleDOI
27 May 2019
TL;DR: This paper presents a reachability analysis method for feed-forward neural networks (FNN) that employ rectified linear units (ReLUs) as activation functions that relies on three reachable-set computation algorithms, namely exact schemes, lazy-approximate schemes, and mixing schemes.
Abstract: Artificial neural networks (ANN) have displayed considerable utility in a wide range of applications such as image processing, character and pattern recognition, self-driving cars, evolutionary robotics, and non-linear system identification and control. While ANNs are able to carry out complicated tasks efficiently, they are susceptible to unpredictable and errant behavior due to irregularities that emanate from their complex non-linear structure. As a result, there have been reservations about incorporating them into safety-critical systems. In this paper, we present a reachability analysis method for feed-forward neural networks (FNN) that employ rectified linear units (ReLUs) as activation functions. The crux of our approach relies on three reachable-set computation algorithms, namely exact schemes, lazy-approximate schemes, and mixing schemes. The exact scheme computes an exact reachable set for FNN, while the lazy-approximate and mixing schemes generate an over-approximation of the exact reachable set. All schemes are designed efficiently to run on parallel platforms to reduce the computation time and enhance the scalability. Our methods are implemented in a MATLAB® toolbox called NNV, and are evaluated using a set of benchmarks that consist of realistic neural networks with sizes that range from tens to a thousand neurons. Notably, NNV successfully computes and visualizes the exact reachable sets of the real-world ACAS Xu deep neural networks (DNNs), which are a variant of a family of novel airborne collision detection systems known as the ACAS System X, using a representation of tens to hundreds of polyhedra.
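As a much-simplified stand-in for the over-approximating schemes mentioned above, the sketch below propagates axis-aligned interval bounds through affine and ReLU layers. NNV itself works with polyhedral (star-set) representations; interval bound propagation is used here only to illustrate what a sound over-approximation of a ReLU network's reachable set looks like. The random weights and input box are placeholders.

```python
# Interval over-approximation of the output set of an affine + ReLU network:
# every operation maps an input box [lo, hi] to a box that contains the image.
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Exact interval image of the box [lo, hi] under x -> W x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def reach_intervals(layers, lo, hi):
    """layers: list of (W, b); returns an interval over-approximation of the outputs."""
    for W, b in layers[:-1]:
        lo, hi = relu_bounds(*affine_bounds(lo, hi, W, b))
    return affine_bounds(lo, hi, *layers[-1])     # last layer is affine only

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 2)), rng.normal(size=8)),
          (rng.normal(size=(1, 8)), rng.normal(size=1))]
print(reach_intervals(layers, np.array([-0.1, -0.1]), np.array([0.1, 0.1])))
```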

51 citations


Journal ArticleDOI
18 Nov 2019
TL;DR: A universal island-based metaheuristic algorithm (UIMA) was proposed, aiming to solve the spatially constrained berth scheduling problem and minimize the total cost of serving the arriving vessels at the MCT.
Abstract: Marine transportation has been faced with an increasing demand for containerized cargo during the past decade. Marine container terminals (MCTs), as the facilities for connecting seaborne and inland transportation, are expected to handle the increasing number of containers delivered by vessels. Berth scheduling plays an important role for the total throughput of MCTs as well as the overall effectiveness of the MCT operations. This study aims to propose a novel island-based metaheuristic algorithm to solve the berth scheduling problem and minimize the total cost of serving the arriving vessels at the MCT. A universal island-based metaheuristic algorithm (UIMA) was proposed in this study, aiming to solve the spatially constrained berth scheduling problem. The UIMA population was divided into four sub-populations (i.e. islands). Unlike the canonical island-based algorithms that execute the same metaheuristic on each island, four different population-based metaheuristics are adopted within the developed algorithm to search the islands, including the following: evolutionary algorithm (EA), particle swarm optimization (PSO), estimation of distribution algorithm (EDA) and differential evolution (DE). The adopted population-based metaheuristic algorithms rely on different operators, which facilitate the search process for superior solutions on the UIMA islands. The conducted numerical experiments demonstrated that the developed UIMA algorithm returned near-optimal solutions for the small-size problem instances. As for the large-size problem instances, UIMA was found to be superior to the EA, PSO, EDA and DE algorithms, which were executed in isolation, in terms of the obtained objective function values at termination. Furthermore, the developed UIMA algorithm outperformed various single-solution-based metaheuristic algorithms (including variable neighborhood search, tabu search and simulated annealing) in terms of the solution quality. The maximum UIMA computational time did not exceed 306 s. Some of the previous berth scheduling studies modeled uncertain vessel arrival times and/or handling times, while this study assumed the vessel arrival and handling times to be deterministic. The developed UIMA algorithm can be used by the MCT operators as an efficient decision support tool and assist with a cost-effective design of berth schedules within an acceptable computational time. A novel island-based metaheuristic algorithm is designed to solve the spatially constrained berth scheduling problem. The proposed island-based algorithm adopts several types of metaheuristic algorithms to cover different areas of the search space. The considered metaheuristic algorithms rely on different operators. Such a feature is expected to facilitate the search process for superior solutions.
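A hedged sketch of the island idea behind UIMA: each island evolves its own sub-population with a different update rule, and the islands periodically exchange their best individuals. Only two toy strategies (a Gaussian-mutation EA and a DE-style difference update) and a stand-in sphere objective are shown; the real algorithm uses four islands (EA, PSO, EDA, DE) and a berth-scheduling objective.

```python
# Two-island model with heterogeneous update rules and ring migration of the
# best individuals; the objective and all constants are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim, pop, gens, migrate_every = 5, 20, 200, 25
f = lambda x: np.sum(x ** 2, axis=-1)          # stand-in objective (minimize)

def ea_step(P):                                 # Gaussian-mutation hill climbing
    Q = P + rng.normal(scale=0.1, size=P.shape)
    return np.where((f(Q) < f(P))[:, None], Q, P)

def de_step(P):                                 # DE-style difference update
    i, j, k = (rng.permutation(len(P)) for _ in range(3))
    Q = P[i] + 0.8 * (P[j] - P[k])
    return np.where((f(Q) < f(P))[:, None], Q, P)

islands = [rng.uniform(-5, 5, size=(pop, dim)) for _ in range(2)]
steps = [ea_step, de_step]

for g in range(gens):
    islands = [step(P) for step, P in zip(steps, islands)]
    if g % migrate_every == 0:                  # ring migration of best individuals
        bests = [P[np.argmin(f(P))].copy() for P in islands]
        for idx, P in enumerate(islands):
            P[np.argmax(f(P))] = bests[(idx + 1) % len(islands)]

print(min(float(f(P).min()) for P in islands))
```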

48 citations


Proceedings ArticleDOI
16 Jul 2019
TL;DR: A generic method is presented that reduces the task of finding weighted matchings to that of finding short augmenting paths in unweighted graphs; in single-pass streaming with random edge arrivals, it yields a (1/2+c)-approximation algorithm, thus breaking the natural barrier of 1/2.
Abstract: We design a generic method to reduce the task of finding weighted matchings to that of finding short augmenting paths in unweighted graphs. This method enables us to provide efficient implementations for approximating weighted matchings in the massively parallel computation (MPC) model and in the streaming model. For the MPC and the multi-pass streaming model, we show that any algorithm computing a (1 − δ)-approximate unweighted matching in bipartite graphs can be translated into an algorithm that computes a (1 − ε(δ))-approximate maximum weighted matching. Furthermore, this translation incurs only a constant factor (that depends on ε > 0) overhead in the complexity. Instantiating this with the current best MPC algorithm for unweighted matching yields a (1 − ε)-approximation algorithm for maximum weighted matching that uses O_ε(log log n) rounds, O(m/n) machines per round, and O(n·poly(log n)) memory per machine. This improves upon the previous best approximation guarantee of (1/2 − ε) for weighted graphs. In the context of single-pass streaming with random edge arrivals, our techniques yield a (1/2 + c)-approximation algorithm, thus breaking the natural barrier of 1/2.

Proceedings ArticleDOI
25 Jul 2019
TL;DR: To enable efficient and effective RSL-Psc computation on massive route data, novel search-space pruning techniques are developed, and the two proposed parallel algorithms are shown to achieve high efficiency and scalability.
Abstract: With the increasing availability of moving-object tracking data, use of this data for route search and recommendation is increasingly important. To this end, we propose a novel parallel split-and-combine approach to enable route search by locations (RSL-Psc). Given a set of routes, a set of places to visit O, and a threshold θ, we retrieve the route composed of sub-routes that (i) has similarity to O no less than θ and (ii) contains the minimum number of sub-route combinations. The resulting functionality targets a broad range of applications, including route planning and recommendation, ridesharing, and location-based services in general. To enable efficient and effective RSL-Psc computation on massive route data, we develop novel search space pruning techniques and enable use of the parallel processing capabilities of modern processors. Specifically, we develop two parallel algorithms, Fully-Split Parallel Search (FSPS) and Group-Split Parallel Search (GSPS). We divide the route split-and-combine task into ∑_{k=0}^{M} S(|O|, k+1) sub-tasks, where M is the maximum number of combinations and S(⋅) is the Stirling number of the second kind. In each sub-task, we use network expansion and exploit spatial similarity bounds for pruning. The algorithms split candidate routes into sub-routes and combine them to construct new routes. The sub-tasks are independent and are performed in parallel. Extensive experiments with real data offer insight into the performance of the algorithms, indicating that our RSL-Psc problem can generate high-quality results and that the two algorithms are capable of achieving high efficiency and scalability.
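The sub-task count quoted above, ∑_{k=0}^{M} S(|O|, k+1), is easy to compute with the standard recurrence for Stirling numbers of the second kind; the snippet below does exactly that. The example values of |O| and M are illustrative.

```python
# Count the independent sub-tasks described in the abstract using the
# recurrence S(n, k) = k*S(n-1, k) + S(n-1, k-1).
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def num_subtasks(num_places, max_combinations):
    """sum_{k=0}^{M} S(|O|, k+1) for |O| = num_places, M = max_combinations."""
    return sum(stirling2(num_places, k + 1) for k in range(max_combinations + 1))

print(num_subtasks(5, 2))   # e.g. |O| = 5 places, at most M = 2 combinations -> 41
```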

Proceedings ArticleDOI
16 Apr 2019
TL;DR: In this article, the authors propose pFaces, an extensible software ecosystem for accelerating symbolic control techniques; it facilitates designing parallel algorithms and supervises their execution to utilize available computing resources.
Abstract: The correctness of control software in many safety-critical applications such as autonomous vehicles is crucial. One technique to achieve correct control software is called "symbolic control", where complex systems are approximated by finite-state abstractions. Then, using those abstractions, provably-correct digital controllers are algorithmically synthesized for concrete systems, satisfying complex high-level requirements. Unfortunately, the complexity of synthesizing such controllers grows exponentially in the number of state variables. However, if distributed implementations are considered, high-performance computing platforms can be leveraged to mitigate the effects of the state-explosion problem. We propose pFaces, an extensible software ecosystem, to accelerate symbolic control techniques. It facilitates designing parallel algorithms and supervises their execution to utilize available computing resources. To demonstrate its capabilities, novel parallel algorithms are designed for abstraction-based controller synthesis. Then, they are implemented inside pFaces and dispatched, for parallel execution, on different heterogeneous computing platforms, including CPUs, GPUs and Hardware Accelerators (HWAs). Results show a remarkable reduction in computation time of several orders of magnitude as the number of processing elements (PEs) increases, which easily outperforms all existing tools.

Journal ArticleDOI
TL;DR: It is possible to conclude that parallelization has a positive effect on the convergence and diversity of the optimization process for problems with many objectives; however, no single strategy yields the best results for all classes of problems.

Proceedings ArticleDOI
23 Jun 2019
TL;DR: The first algorithms with low adaptivity for submodular maximization with a matroid constraint are obtained, as well as the first parallel algorithm for non-monotone submodular maximization subject to packing constraints.
Abstract: We consider the problem of maximizing the multilinear extension of a submodular function subject to a single matroid constraint or multiple packing constraints with a small number of adaptive rounds of evaluation queries. We obtain the first algorithms with low adaptivity for submodular maximization with a matroid constraint. Our algorithms achieve a 1 − 1/e − ε approximation for monotone functions and a 1/e − ε approximation for non-monotone functions, which nearly matches the best guarantees known in the fully adaptive setting. The number of rounds of adaptivity is O(log^2 n / ε^3), which is an exponential speedup over the existing algorithms. We obtain the first parallel algorithm for non-monotone submodular maximization subject to packing constraints. Our algorithm achieves a 1/e − ε approximation using O(log(n/ε) log(1/ε) log(n+m) / ε^2) parallel rounds, which is again an exponential speedup in parallel time over the existing algorithms. For monotone functions, we obtain a 1 − 1/e − ε approximation in O(log(n/ε) log m / ε^2) parallel rounds. The number of parallel rounds of our algorithm matches that of the state-of-the-art algorithm for solving packing LPs with a linear objective (Mahoney et al., 2016). Our results apply more generally to the problem of maximizing a diminishing returns submodular (DR-submodular) function.
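A hedged illustration of the adaptive-complexity viewpoint: in each round, all marginal-gain queries are issued as one parallel batch and a decreasing threshold decides which elements to add. The sketch below is a simplified cardinality-constrained threshold greedy on a coverage function, not the paper's multilinear-extension algorithms for matroid or packing constraints; the threshold schedule and example data are assumptions.

```python
# Each while-iteration is one "adaptive round": a single parallel batch of
# function evaluations, followed by a threshold-based selection step.
from concurrent.futures import ThreadPoolExecutor

def coverage(sets, chosen):
    """Monotone submodular example: number of points covered by the chosen sets."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

def threshold_greedy(sets, k, eps=0.2):
    ground = list(range(len(sets)))
    S = []
    max_single = max(len(s) for s in sets)
    tau = float(max_single)                        # start at the max singleton value
    with ThreadPoolExecutor() as pool:
        while tau >= eps * max_single / k and len(S) < k:
            base = coverage(sets, S)
            gains = list(pool.map(lambda e: coverage(sets, S + [e]) - base, ground))
            for e, g in zip(ground, gains):
                if e not in S and g >= tau and len(S) < k:
                    S.append(e)
            tau *= (1 - eps)                       # lower the threshold each round
    return S

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}, {7}]
print(threshold_greedy(sets, k=3))
```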

Journal ArticleDOI
TL;DR: In this article, a new algorithm for the fast, shared memory, multi-core computation of augmented contour trees on triangulations is presented, which completely revisits the traditional, sequential contour tree algorithm to re-formulate all the steps of the computation as a set of independent local tasks.
Abstract: This paper presents a new algorithm for the fast, shared memory, multi-core computation of augmented contour trees on triangulations. In contrast to most existing parallel algorithms our technique computes augmented trees, enabling the full extent of contour tree based applications including data segmentation. Our approach completely revisits the traditional, sequential contour tree algorithm to re-formulate all the steps of the computation as a set of independent local tasks. This includes a new computation procedure based on Fibonacci heaps for the join and split trees, two intermediate data structures used to compute the contour tree, whose constructions are efficiently carried out concurrently thanks to the dynamic scheduling of task parallelism. We also introduce a new parallel algorithm for the combination of these two trees into the output global contour tree. Overall, this results in superior time performance in practice, both in sequential and in parallel thanks to the OpenMP task runtime. We report performance numbers that compare our approach to reference sequential and multi-threaded implementations for the computation of augmented merge and contour trees. These experiments demonstrate the run-time efficiency of our approach and its scalability on common workstations. We demonstrate the utility of our approach in data segmentation applications.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper presents an algorithm that, for graphs with diameter D in the wide range [log^ε n, n], takes O(log D) rounds to identify the connected components, takes O(log log n) rounds for all other graphs, and uses an optimal total space of O(m).
Abstract: Identifying the connected components of a graph, apart from being a fundamental problem with countless applications, is a key primitive for many other algorithms. In this paper, we consider this problem in parallel settings. Particularly, we focus on the Massively Parallel Computations (MPC) model, which is the standard theoretical model for modern parallel frameworks such as MapReduce, Hadoop, or Spark. We consider the truly sublinear regime of MPC for graph problems where the space per machine is n^δ for some desirably small constant δ ∊ (0, 1). We present an algorithm that, for graphs with diameter D in the wide range [log^ε n, n], takes O(log D) rounds to identify the connected components, and takes O(log log n) rounds for all other graphs. The algorithm is randomized, succeeds with high probability, does not require prior knowledge of D, and uses an optimal total space of O(m). We complement this by showing a conditional lower-bound based on the widely believed TwoCycle conjecture that Ω(log D) rounds are indeed necessary in this setting. Studying parallel connectivity algorithms received a resurgence of interest after the pioneering work of Andoni et al. [FOCS 2018] who presented an algorithm with O(log D log log n) round-complexity. Our algorithm improves this result for the whole range of values of D and almost settles the problem due to the conditional lower-bound. Additionally, we show that with minimal adjustments, our algorithm can also be implemented in a variant of (CRCW) PRAM in asymptotically the same number of rounds.
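For contrast with the O(log D)-round algorithm described above, the sketch below implements the most naive round-synchronous connectivity rule, every vertex repeatedly adopting the smallest label in its neighborhood, which needs on the order of D rounds. It only illustrates the synchronous, per-round style of computation in models like MPC; the paper's algorithm is far more involved.

```python
# Naive synchronous label propagation: each round, every endpoint of every
# edge adopts the smaller of the two incident labels. Converges in O(D) rounds.
def connected_components(n, edges):
    label = list(range(n))
    changed, rounds = True, 0
    while changed:                       # one synchronous round per iteration
        changed = False
        new_label = label[:]
        for u, v in edges:
            m = min(label[u], label[v])
            if m < new_label[u]:
                new_label[u], changed = m, True
            if m < new_label[v]:
                new_label[v], changed = m, True
        label = new_label
        rounds += 1
    return label, rounds

edges = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 5)]
print(connected_components(8, edges))    # labels identify the three components
```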

Journal ArticleDOI
TL;DR: A parallel association-rule mining algorithm is used to explore the correlations and regularities of oxygen, temperature, phosphate, nitrate and silicate in the ocean, and the relationship between parallel efficiency and the number of CPU cores is analyzed.
Abstract: Given the complexity of ocean data, this paper adopts a parallel association-rule mining algorithm to explore the correlations and regularities of oxygen, temperature, phosphate, nitrate and silicate in the ocean. After the marine data are interpolated, this paper utilizes the parallel FP-growth algorithm to mine the data and then briefly analyzes the resulting frequent itemsets and association rules. The relationship between parallel efficiency and the number of CPU cores is analyzed using datasets of different scales. The experimental results indicate that the acceleration effect is best when each thread handles 200,000–300,000 data records, which yields a performance improvement of more than 1.2 times.
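A hedged sketch of the first, embarrassingly parallel step of parallel FP-growth: counting item frequencies on data partitions in parallel and merging the partial counts to obtain the frequent 1-itemsets. The toy transactions stand in for discretized ocean measurements; the support threshold and worker count are illustrative, and the FP-tree construction itself is omitted.

```python
# Parallel first scan of FP-growth: per-partition item counting, then a merge.
from collections import Counter
from multiprocessing import Pool

def count_partition(transactions):
    c = Counter()
    for t in transactions:
        c.update(t)
    return c

def frequent_items(transactions, min_support, n_workers=4):
    chunk = max(1, len(transactions) // n_workers)
    parts = [transactions[i:i + chunk] for i in range(0, len(transactions), chunk)]
    with Pool(n_workers) as pool:
        partial = pool.map(count_partition, parts)   # counted in parallel
    total = sum(partial, Counter())                  # merge partial counts
    return {item: n for item, n in total.items() if n >= min_support}

if __name__ == "__main__":
    data = [["low_oxygen", "high_nitrate"], ["low_oxygen", "high_phosphate"],
            ["high_nitrate", "high_silicate"], ["low_oxygen", "high_nitrate"]]
    print(frequent_items(data, min_support=2))
```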

Posted ContentDOI
TL;DR: It is shown theoretically and empirically that a class of non-reversible PT methods dominates its reversible counterparts; distinct scaling limits for the non-reversible and reversible schemes are identified, and an iterative scheme approximating the optimal annealing schedule is developed.
Abstract: Parallel tempering (PT) methods are a popular class of Markov chain Monte Carlo schemes used to sample complex high-dimensional probability distributions. They rely on a collection of $N$ interacting auxiliary chains targeting tempered versions of the target distribution to improve the exploration of the state-space. We provide here a new perspective on these highly parallel algorithms and their tuning by identifying and formalizing a sharp divide in the behaviour and performance of reversible versus non-reversible PT schemes. We show theoretically and empirically that a class of non-reversible PT methods dominates its reversible counterparts and identify distinct scaling limits for the non-reversible and reversible schemes, the former being a piecewise-deterministic Markov process and the latter a diffusion. These results are exploited to identify the optimal annealing schedule for non-reversible PT and to develop an iterative scheme approximating this schedule. We provide a wide range of numerical examples supporting our theoretical and methodological contributions. The proposed methodology is applicable to sample from a distribution $\pi$ with a density $L$ with respect to a reference distribution $\pi_0$ and compute the normalizing constant. A typical use case is when $\pi_0$ is a prior distribution, $L$ a likelihood function and $\pi$ the corresponding posterior.
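A hedged sketch of a non-reversible PT scheme of the kind analysed above: adjacent-chain swaps alternate deterministically between even and odd pairs rather than being proposed at random, which is what makes the index process non-reversible. The toy bimodal target, random-walk moves, and temperature ladder are all illustrative choices, not the paper's setup.

```python
# Non-reversible parallel tempering on a 1-D bimodal target: local random-walk
# Metropolis moves within each tempered chain, then deterministic even/odd
# alternation of adjacent-chain swap attempts.
import numpy as np

rng = np.random.default_rng(0)
betas = np.array([1.0, 0.5, 0.25, 0.1])        # illustrative annealing schedule
log_target = lambda x: np.logaddexp(-0.5 * (x - 3) ** 2, -0.5 * (x + 3) ** 2)

x = rng.normal(size=len(betas))                # one state per chain
samples = []
for it in range(20000):
    # local exploration within each tempered chain
    prop = x + rng.normal(scale=1.0, size=len(x))
    accept = np.log(rng.uniform(size=len(x))) < betas * (log_target(prop) - log_target(x))
    x = np.where(accept, prop, x)
    # non-reversible swaps: even pairs on even iterations, odd pairs otherwise
    for i in range(it % 2, len(betas) - 1, 2):
        log_ratio = (betas[i] - betas[i + 1]) * (log_target(x[i + 1]) - log_target(x[i]))
        if np.log(rng.uniform()) < log_ratio:
            x[i], x[i + 1] = x[i + 1], x[i]
    samples.append(x[0])                       # chain 0 targets the distribution of interest

print(float(np.mean(np.array(samples) > 0)))   # ~0.5 for the symmetric bimodal target
```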

Posted ContentDOI
27 May 2019-bioRxiv
TL;DR: This work proposes the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations, and provides a novel blocked approach to compute the score matrix while ensuring high memory locality.
Abstract: Aligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence. Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to three hours for the problem of optimally aligning high coverage long (PacBio/ONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices. Availability The implementation of our algorithm is available at https://github.com/ParBLiSS/PaSGAL. Data sets used for evaluation are accessible using https://alurulab.cc.gatech.edu/PaSGAL.
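For reference, the sequential quadratic-time dynamic program that the parallel algorithm accelerates can be sketched as below: semi-global alignment of a query against paths of a character-labelled DAG (free start and end in the graph, full query consumed), processed in topological order. This is a plain Python sketch with unit edit costs and no SIMD or multithreading; the scoring scheme and the toy graph are assumptions.

```python
# Sequence-to-DAG alignment DP: D[v][j] is the best cost of aligning the first
# j query characters to some path that ends at vertex v.
def seq_to_graph_align(chars, preds, query):
    """chars[v]: label of vertex v; preds[v]: predecessor ids; returns min edit cost."""
    m = len(query)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in chars]
    for v in range(len(chars)):                   # vertices assumed in topological order
        D[v][0] = 1                               # empty query: v itself must be deleted
        for j in range(1, m + 1):
            best_prev_diag = min([j - 1] + [D[u][j - 1] for u in preds[v]])
            best_prev_up = min([j] + [D[u][j] for u in preds[v]])
            sub = 0 if chars[v] == query[j - 1] else 1
            D[v][j] = min(sub + best_prev_diag,   # v aligned to query[j]
                          1 + best_prev_up,       # v deleted (gap in query)
                          1 + D[v][j - 1])        # query[j] inserted (gap in graph)
    return min(D[v][m] for v in range(len(chars)))

# Tiny variation-graph-like example: A -> C -> G with an alternative branch A -> T -> G.
chars = ["A", "C", "T", "G"]
preds = [[], [0], [0], [1, 2]]
print(seq_to_graph_align(chars, preds, "ATG"))    # 0: the path A, T, G matches exactly
```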

Journal ArticleDOI
TL;DR: This paper constructs a multiobjective feature selection model that simultaneously considers the classification error, the number of features and the feature redundancy, and proposes several distributed parallel algorithms based on different encodings and an adaptive strategy.

Journal ArticleDOI
TL;DR: This paper presents a parallel and fully implicit simulator for the black oil model based on the variational inequality (VI) framework, which can be used to enforce important mathematical and physical properties to obtain accurate constraint-preserving solutions.

Proceedings ArticleDOI
17 Jun 2019
TL;DR: In this paper, a parallel batch-dynamic connectivity algorithm is proposed that is work-efficient for small batch sizes and achieves O(log n log(1 + n/Δ)) expected amortized work per edge insertion and deletion and O(log^3 n) depth w.h.p.
Abstract: In this paper, we study batch parallel algorithms for the dynamic connectivity problem, a fundamental problem that has received considerable attention in the sequential setting. The best sequential algorithm for dynamic connectivity is the elegant level-set algorithm of Holm, de Lichtenberg and Thorup (HDT), which achieves O(log^2 n) amortized time per edge insertion or deletion, and O(log n) time per query. We design a parallel batch-dynamic connectivity algorithm that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large. Given a sequence of batched updates, where Δ is the average batch size of all deletions, our algorithm achieves O(log n log(1 + n/Δ)) expected amortized work per edge insertion and deletion and O(log^3 n) depth w.h.p. Our algorithm answers a batch of k connectivity queries in O(k log(1 + n/k)) expected work and O(log n) depth w.h.p. To the best of our knowledge, our algorithm is the first parallel batch-dynamic algorithm for connectivity.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: The experimental results show that the KRaft consensus algorithm achieves a 41% improvement in transaction throughput and a 67% improvement in leader election speed, while satisfying the safety and liveness requirements of the Raft consensus algorithm.
Abstract: With the development of blockchain, more and more blockchain types have emerged: public, consortium and private blockchains. Because nodes are trusted in some consortium and private blockchains, a non-Byzantine-fault-tolerant consensus algorithm, KRaft (Kademlia-Raft), with high throughput and high scalability is proposed. KRaft is a Raft-like consensus algorithm that preserves part of the logic of the Raft consensus algorithm. It optimizes the leader election and consensus process of Raft through the K-bucket node relationships established by the Kademlia protocol, improving leader election speed and throughput. First, the KRaft algorithm uses the K-buckets established by the Kademlia protocol to achieve a stable and efficient leader election process, addressing the candidate split-vote problem and the low voting efficiency caused by a growing number of follower nodes in the Raft algorithm. Second, to address the low efficiency and load imbalance of single-leader log replication in the Raft consensus process, a parallel log replication scheme with multiple candidate nodes that balances the leader node's load is proposed to improve the throughput and scalability of the algorithm. Finally, as a Raft-like consensus algorithm, KRaft satisfies the safety and liveness requirements of the Raft consensus algorithm. The KRaft and Raft consensus algorithms were evaluated with a local cluster simulation. The experimental results show that the KRaft consensus algorithm achieves a 41% improvement in transaction throughput and a 67% improvement in leader election speed.

Proceedings ArticleDOI
06 Jan 2019
TL;DR: In this paper, the authors present massively parallel algorithms for edit distance and longest common subsequence that achieve a 1 + ε approximation factor, constant round complexity, and O(n^2) total running time over all machines.
Abstract: String similarity measures are among the most fundamental problems in computer science. The notable examples are edit distance (ED) and longest common subsequence (LCS). These problems find their applications in various contexts such as computational biology, text processing, compiler optimization, data analysis, image analysis, etc. In this work, we revisit edit distance and longest common subsequence in the parallel setting. We present massively parallel algorithms for both problems that are optimal in the following senses: • The approximation factor of our algorithms is 1 + ε. • The round complexity of our algorithms is constant. • The total running time of our algorithms over all machines is O(n^2). This matches the running time of the best-known solutions for approximating edit distance and longest common subsequence within a 1 + ε factor in the sequential setting. Our result for edit distance substantially improves the massively parallel algorithm of [15] in terms of approximation factor, round complexity, number of machines, and total running time. Our unified approach to tackle both problems is to divide one of the strings into smaller blocks and try to locally predict which intervals of the other string correspond to each block in an optimal solution. Our main technical contribution is a novel parallel algorithm for computing a set of compositions, and recursively decomposing each function into a set of smaller iterative compositions (in terms of memory needed to solve the problem). These two methods together give us a strong tool for approximating combinatorial problems. For instance, LCS can be formulated as a recursive composition of functions and therefore this tool enables us to approximate LCS within a factor of 1 + ε. Indeed, we recursively decompose the problem until we are able to compute the solution on a single machine. Since our methods are quite general, we expect this technique to find its applications in other combinatorial problems as well.

Proceedings ArticleDOI
TL;DR: In this article, a parallel batch-dynamic connectivity algorithm was proposed that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large.
Abstract: In this paper, we study batch parallel algorithms for the dynamic connectivity problem, a fundamental problem that has received considerable attention in the sequential setting. The most well known sequential algorithm for dynamic connectivity is the elegant level-set algorithm of Holm, de Lichtenberg and Thorup (HDT), which achieves $O(\log^2 n)$ amortized time per edge insertion or deletion, and $O(\log n / \log\log n)$ time per query. We design a parallel batch-dynamic connectivity algorithm that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large. Given a sequence of batched updates, where $\Delta$ is the average batch size of all deletions, our algorithm achieves $O(\log n \log(1 + n / \Delta))$ expected amortized work per edge insertion and deletion and $O(\log^3 n)$ depth w.h.p. Our algorithm answers a batch of $k$ connectivity queries in $O(k \log(1 + n/k))$ expected work and $O(\log n)$ depth w.h.p. To the best of our knowledge, our algorithm is the first parallel batch-dynamic algorithm for connectivity.

Journal ArticleDOI
TL;DR: Numerical simulations of the proposed parallel Newton-type method for nonlinear model predictive control of a quadrotor show that the method is highly parallelizable and converges in only a few iterations, even to high accuracy.

Proceedings ArticleDOI
23 Jun 2019
TL;DR: In this article, submodular maximization under a matroid constraint is studied in the adaptive complexity model, and an approximation algorithm with near-optimal adaptivity of O(log(n) log(k)) is proposed.
Abstract: In this paper we study submodular maximization under a matroid constraint in the adaptive complexity model. This model was recently introduced in the context of submodular optimization to quantify the information theoretic complexity of black-box optimization in a parallel computation model. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when each round can execute polynomially-many function evaluations in parallel. Since submodular optimization is regularly applied on large datasets we seek algorithms with low adaptivity to enable speedups via parallelization. Consequently, a recent line of work has been devoted to designing constant factor approximation algorithms for maximizing submodular functions under various constraints in the adaptive complexity model. Despite the burst in work on submodular maximization in the adaptive complexity model, the fundamental problem of maximizing a monotone submodular function under a matroid constraint has remained elusive. In particular, all known techniques fail for this problem and there are no known constant factor approximation algorithms whose adaptivity is sublinear in the rank of the matroid k or in the worst case sublinear in the size of the ground set n. In this paper we present an approximation algorithm for the problem of maximizing a monotone submodular function under a matroid constraint in the adaptive complexity model. The approximation guarantee of the algorithm is arbitrarily close to the optimal 1−1/e and it has near optimal adaptivity of O(log(n)log(k)). This result is obtained using a novel technique of adaptive sequencing which departs from previous techniques for submodular maximization in the adaptive complexity model. In addition to our main result we show how to use this technique to design other approximation algorithms with strong approximation guarantees and polylogarithmic adaptivity.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring.
Abstract: SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a first-cut computational kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and file format, writing code to read in the files that define the problem, and comparing our results with the reference solution took a full day. The kernel consists of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the first time it was compiled. The sequential performance of the GraphBLAS solution is 3x to 5x faster than the MATLAB reference implementation. OpenMP parallelism gives an additional 10x to 15x speedup on a 20-core Intel processor, 17x on an IBM Power8 system, and 20x on a Power9 system, for the largest problems. Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development effort that took one week, primarily because of difficulties in resolving a load-balancing issue in the MPI-based parallel algorithm.
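A hedged SciPy-sparse analogue of the inference loop described above (the actual kernel consists of GraphBLAS calls, not SciPy): each layer is a sparse matrix-matrix product followed by a bias, ReLU, and clipping step, roughly in the spirit of the Sparse DNN Graph Challenge. The bias handling, cap value, and random matrices below are simplified and illustrative.

```python
# Sparse feed-forward inference: one sparse matmul plus elementwise ops per layer.
import numpy as np
import scipy.sparse as sp

def sparse_dnn_forward(Y, weights, bias=-0.05, cap=32.0):
    """Y: sparse feature matrix; weights: list of sparse layer matrices."""
    for W in weights:
        Z = Y @ W                                   # sparse matrix-matrix multiply
        Z.data += bias                              # bias applied to stored entries only
        Z.data = np.clip(Z.data, 0.0, cap)          # ReLU, then cap large activations
        Z.eliminate_zeros()                         # keep the result sparse
        Y = Z
    return Y

Y0 = sp.random(500, 128, density=0.1, format="csr", random_state=0)
layers = [sp.random(128, 128, density=0.1, format="csr", random_state=i + 1)
          for i in range(3)]
print(sparse_dnn_forward(Y0, layers).nnz)
```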

Journal ArticleDOI
TL;DR: This study proposes a novel parallel Branch & Bound algorithm to optimize the energy consumption of robotic cells without deteriorating throughput; the performance of the parallel algorithm scales almost linearly up to 12 processor cores, and the quality of the obtained solutions is better than or comparable to existing works.