
Showing papers on "Parallel algorithm published in 2019"


Journal ArticleDOI
Tal Ben-Nun1, Torsten Hoefler1
TL;DR: The problem of accelerating DNN training is described from a theoretical perspective, followed by approaches for its parallelization, and potential directions for parallelism in deep learning are extrapolated.
Abstract: Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
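As a hedged illustration of the simplest concurrency scheme the survey covers, synchronous data parallelism, the sketch below averages per-worker gradients before each parameter update. It is a toy NumPy linear-regression example, not code from the survey; the model, shard sizes, and learning rate are all illustrative.

```python
# Minimal sketch of synchronous data-parallel SGD: each worker computes a
# gradient on its shard of the mini-batch, the gradients are averaged
# ("all-reduce"), and the shared parameters are updated once per step.
import numpy as np

rng = np.random.default_rng(0)
n_workers, batch, dim = 4, 64, 8
w = np.zeros(dim)                                  # shared model parameters
X = rng.normal(size=(batch, dim))                  # one global mini-batch
y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=batch)

def local_gradient(w, X_shard, y_shard):
    """Least-squares gradient computed on one worker's shard of the batch."""
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]   # would run in parallel
    w -= 0.1 * np.mean(grads, axis=0)              # average gradients, then update

print(float(np.mean((X @ w - y) ** 2)))            # training loss after 100 steps
```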

433 citations


Journal ArticleDOI
TL;DR: Parallel-PC, a fast and memory-efficient PC algorithm, is developed; it is suitable for personal computers, requires no parallel computing knowledge from end users beyond their competency in using the PC algorithm, and is integrated into a causal inference method for inferring miRNA-mRNA regulatory relationships.
Abstract: Discovering causal relationships from observational data is a crucial problem and it has applications in many research areas. The PC algorithm is the state-of-the-art constraint-based method for causal discovery. However, the runtime of the PC algorithm is, in the worst case, exponential in the number of nodes (variables), and thus it is inefficient when applied to high-dimensional data, e.g., gene expression datasets. On another note, the advancement of computer hardware in the last decade has resulted in the widespread availability of multi-core personal computers. There is a significant motivation for designing a parallelized PC algorithm that is suitable for personal computers and does not require end users’ parallel computing knowledge beyond their competency in using the PC algorithm. In this paper, we develop parallel-PC, a fast and memory-efficient PC algorithm using the parallel computing technique. We apply our method to a range of synthetic and real-world high-dimensional datasets. Experimental results on a dataset from the DREAM 5 challenge show that the original PC algorithm could not produce any results after running for more than 24 hours; meanwhile, our parallel-PC algorithm managed to finish within around 12 hours on a 4-core CPU computer, and in less than six hours on an 8-core CPU computer. Furthermore, we integrate parallel-PC into a causal inference method for inferring miRNA-mRNA regulatory relationships. The experimental results show that parallel-PC helps improve both the efficiency and accuracy of the causal inference algorithm.

104 citations


Proceedings ArticleDOI
06 Jan 2019
TL;DR: A remarkably simple meta-algorithm for the (Δ + 1) coloring problem: sample O(log n) colors for each vertex independently and uniformly at random from the Δ + 1 colors; then find a proper coloring of the graph using only the sampled colors of each vertex.
Abstract: Any graph with maximum degree Δ admits a proper vertex coloring with Δ + 1 colors that can be found via a simple sequential greedy algorithm in linear time and space. But can one find such a coloring via a sublinear algorithm? We answer this fundamental question in the affirmative for several canonical classes of sublinear algorithms including graph streaming, sublinear time, and massively parallel computation (MPC) algorithms. In particular, we design: • A single-pass semi-streaming algorithm in dynamic streams using O(n) space. The only known semi-streaming algorithm prior to our work was a folklore O(log n)-pass algorithm obtained by simulating classical distributed algorithms in the streaming model. • A sublinear-time algorithm in the standard query model that allows neighbor queries and pair queries using [MATH HERE] time. We further show that any algorithm that outputs a valid coloring with sufficiently large constant probability requires [MATH HERE] time. No non-trivial sublinear time algorithms were known prior to our work. • A parallel algorithm in the massively parallel computation (MPC) model using O(n) memory per machine and O(1) MPC rounds. Our number of rounds significantly improves upon the recent O(log log Δ · log* (n))-round algorithm of Parter [ICALP 2018]. At the core of our results is a remarkably simple meta-algorithm for the (Δ + 1) coloring problem: Sample O(log n) colors for each vertex independently and uniformly at random from the Δ + 1 colors; find a proper coloring of the graph using only the sampled colors of each vertex. As our main result, we prove that the sampled set of colors with high probability contains a proper coloring of the input graph. The sublinear algorithms are then obtained by designing efficient algorithms for finding a proper coloring of the graph from the sampled colors in each model. We note that all our upper bound results for (Δ + 1) coloring are either optimal or close to best possible in each model studied. We also establish new lower bounds that rule out the possibility of achieving similar results in these models for the closely related problems of maximal independent set and maximal matching. Collectively, our results highlight a sharp contrast between the complexity of (Δ + 1) coloring vs. maximal independent set and maximal matching in various models of sublinear computation even though all three problems are solvable by a simple greedy algorithm in the classical setting.
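The meta-algorithm quoted in the abstract lends itself to a short sketch: sample a small palette per vertex, then look for a proper coloring inside the sampled lists. The greedy pass below is only an illustration (the paper proves that a proper list-coloring exists w.h.p. and recovers it differently in each model); the palette-size constant, the example graph, and the resample-on-failure behavior are assumptions.

```python
# Sketch of the (Δ + 1)-coloring meta-algorithm from the abstract: sample a
# small palette per vertex from {0, ..., Δ}, then try to color the graph using
# only sampled colors. A failed greedy pass would simply be retried with a
# fresh sample in practice.
import math
import random

def sample_palettes(adj, seed=0):
    """Sample O(log n) colors per vertex from {0, ..., Δ}."""
    random.seed(seed)
    n = len(adj)
    delta = max((len(nbrs) for nbrs in adj), default=0)
    k = max(1, math.ceil(2 * math.log(n + 1)))    # illustrative O(log n) sample size
    return [random.sample(range(delta + 1), min(k, delta + 1)) for _ in range(n)]

def greedy_from_palettes(adj, palettes):
    """Try to properly color the graph using only each vertex's sampled palette."""
    color = [None] * len(adj)
    for v in range(len(adj)):                     # arbitrary vertex order
        used = {color[u] for u in adj[v] if color[u] is not None}
        free = [c for c in palettes[v] if c not in used]
        if not free:
            return None                           # greedy pass failed; resample and retry
        color[v] = free[0]
    return color

# 5-cycle example: Δ = 2, so Δ + 1 = 3 colors suffice.
adj = [[1, 4], [0, 2], [1, 3], [2, 4], [3, 0]]
print(greedy_from_palettes(adj, sample_palettes(adj)))
```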

97 citations


Journal ArticleDOI
TL;DR: This paper develops a fast proximal algorithm and its accelerated variant with an inexact proximal step, shows that the proposed algorithm can be parallelized, and demonstrates that the resulting algorithm achieves nearly linear speedup w.r.t. the number of threads.
Abstract: Low-rank modeling has many important applications in computer vision and machine learning. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better empirical performance. However, the resulting optimization problem is much more challenging. Recent state-of-the-art requires an expensive full SVD in each iteration. In this paper, we show that for many commonly-used nonconvex low-rank regularizers, the singular values obtained from the proximal operator can be automatically thresholded. This allows the proximal operator to be efficiently approximated by the power method. We then develop a fast proximal algorithm and its accelerated variant with inexact proximal step. It can be guaranteed that the squared distance between consecutive iterates converges at a rate of O(1/T), where T is the number of iterations. Furthermore, we show the proposed algorithm can be parallelized, and the resultant algorithm achieves nearly linear speedup w.r.t. the number of threads. Extensive experiments are performed on matrix completion and robust principal component analysis. Significant speedup over the state-of-the-art is observed.
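A minimal sketch of the central idea: approximate the proximal step using only the leading singular triplets, obtained by a block power method instead of a full SVD. The soft-thresholding used below is a stand-in for the paper's nonconvex shrinkage rules, and all matrix sizes and iteration counts are illustrative.

```python
# Hedged sketch: approximate a low-rank proximal step with a block power
# method for the top-k singular triplets, then shrink the singular values.
import numpy as np

def topk_svd_power(A, k, iters=30, seed=0):
    """Approximate top-k singular triplets of A with block power iteration."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.normal(size=(A.shape[1], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(A.T @ (A @ Q))[0]        # iterate on the Gram matrix
    B = A @ Q                                     # project A onto the subspace
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U, s, Vt @ Q.T

def approx_prox(A, k, lam):
    """Approximate prox step: shrink only the leading k singular values."""
    U, s, Vt = topk_svd_power(A, k)
    s_shrunk = np.maximum(s - lam, 0.0)           # placeholder scalar shrinkage rule
    return (U * s_shrunk) @ Vt

A = np.random.default_rng(1).normal(size=(50, 40))
X = approx_prox(A, k=5, lam=1.0)
print(np.linalg.matrix_rank(X))                   # at most 5 by construction
```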

77 citations


Proceedings ArticleDOI
06 Jan 2019
TL;DR: In this paper, the authors give a unified approach that yields better approximation algorithms for matching and vertex cover in all these models, including the streaming model, the distributed communication model, and the massively parallel computation (MPC) model.
Abstract: There is a rapidly growing need for scalable algorithms that solve classical graph problems, such as maximum matching and minimum vertex cover, on massive graphs. For massive inputs, several different computational models have been introduced, including the streaming model, the distributed communication model, and the massively parallel computation (MPC) model that is a common abstraction of MapReduce-style computation. In each model, algorithms are analyzed in terms of resources such as space used or rounds of communication needed, in addition to the more traditional approximation ratio. In this paper, we give a single unified approach that yields better approximation algorithms for matching and vertex cover in all these models. The highlights include: • The first one-pass, significantly-better-than-2 approximation for matching in random arrival streams that uses subquadratic space, namely a (1.5 + ε)-approximation streaming algorithm that uses O(n^1.5) space for constant ε > 0. • The first 2-round, better-than-2 approximation for matching in the MPC model that uses sub-quadratic space per machine, namely a (1.5 + ε)-approximation algorithm with [MATH HERE] memory per machine for constant ε > 0. By building on our unified approach, we further develop parallel algorithms in the MPC model that give a (1 + ε)-approximation to matching and an O(1)-approximation to vertex cover in only O(log log n) MPC rounds and O(n/polylog(n)) memory per machine. These results settle multiple open questions posed by Czumaj et al. [STOC 2018]. We obtain our results by a novel combination of two previously disjoint sets of techniques, namely randomized composable coresets and edge degree constrained subgraphs (EDCS). We significantly extend the power of these techniques and prove several new structural results. For example, we show that an EDCS is a sparse certificate for large matchings and small vertex covers that is quite robust to sampling and composition.

65 citations


Journal ArticleDOI
TL;DR: The concept of granularity of parallelism for GAs on GPU architectures is reexamined, how data layout affects kernel design to maximize memory bandwidth is discussed, and how to organize threads in grids and blocks to expose sufficient parallelism to the GPU is explained.

57 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: The first conditional hardness results for massively parallel algorithms for some central graph problems including (approximating) maximum matching, vertex cover, maximal independent set, and coloring are presented.
Abstract: We present the first conditional hardness results for massively parallel algorithms for some central graph problems including (approximating) maximum matching, vertex cover, maximal independent set, and coloring. In some cases, these hardness results match or get close to the state-of-the-art algorithms. Our hardness results are conditioned on a widely believed conjecture in massively parallel computation about the complexity of the connectivity problem. We also note that it is known that an unconditional variant of such hardness results might be somewhat out of reach for now, as it would lead to considerably improved circuit complexity lower bounds and would concretely imply that NC^1 is a proper subset of P. We obtain our conditional hardness results via a general method that lifts unconditional lower bounds from the well-studied LOCAL model of distributed computing to the massively parallel computation setting.

56 citations


Journal ArticleDOI
TL;DR: A new LSRTM approach that uses the excitation amplitude (EA) imaging condition to suppress crosstalk noise; it avoids frequent memory transfers and is well suited to graphics processing unit (GPU) parallelization.
Abstract: Least-squares reverse time migration (LSRTM) can provide higher quality images than conventional reverse time migration, which is helpful to image simultaneous-source data. However, it still faces the problems of crosstalk noise, long computation times, and large storage requirements. We propose a new LSRTM approach by using the excitation amplitude (EA) imaging condition to suppress the crosstalk noise. Since only the maximum amplitude or limited local maximum amplitudes at each imaging point and the corresponding travel time step(s) need to be saved, the storage problem is naturally resolved. Consequently, the proposed algorithm can avoid frequent memory transfers and is suitable for graphics processing unit (GPU) parallelization. Besides, shared memory with high bandwidth is used to optimize the GPU-based algorithm. In order to further improve the image quality of the EA imaging condition, we adopt shaping regularization as a constraint. The single-source tests with Marmousi and salt models show the feasibility of our algorithm to image complex and subsalt structures, among which a wrong background velocity is used to test its sensitivity to velocity error. The noise-free and noise-included simultaneous-source examples demonstrate the ability of the EA imaging condition to suppress the crosstalk noise. During the implementation of the GPU parallelization, we find that the shared memory cannot always optimize the GPU parallel algorithm and only works well for eighth- or higher-order spatial finite-difference schemes.
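A hedged sketch of the excitation-amplitude bookkeeping described above: during forward modeling, only the maximum absolute amplitude and its time step are retained at each grid point, so the full wavefield history never needs to be stored. The 1-D acoustic finite-difference scheme, source wavelet, and all parameters below are illustrative, not the paper's GPU implementation.

```python
# Toy 1-D second-order acoustic propagation that records, per grid cell, the
# maximum |amplitude| seen so far and the time step at which it occurred
# (the data needed by an excitation-amplitude imaging condition).
import numpy as np

nx, nt, dx, dt, c = 401, 1500, 5.0, 0.0005, 2000.0
r = (c * dt / dx) ** 2                         # Courant number squared (stable here)
src = nx // 2
t = np.arange(nt) * dt
wavelet = (1 - 2 * (np.pi * 25 * (t - 0.04)) ** 2) * np.exp(-(np.pi * 25 * (t - 0.04)) ** 2)

p_prev = np.zeros(nx)
p_curr = np.zeros(nx)
ex_amp = np.zeros(nx)                          # max |amplitude| per cell
ex_time = np.zeros(nx, dtype=int)              # time step of that maximum

for it in range(nt):
    lap = np.zeros(nx)
    lap[1:-1] = p_curr[2:] - 2 * p_curr[1:-1] + p_curr[:-2]
    p_next = 2 * p_curr - p_prev + r * lap
    p_next[src] += wavelet[it]                 # inject the source
    bigger = np.abs(p_next) > ex_amp           # update excitation amplitude/time
    ex_amp[bigger] = np.abs(p_next)[bigger]
    ex_time[bigger] = it
    p_prev, p_curr = p_curr, p_next

print(ex_time[src], float(ex_amp.max()))
```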

55 citations


Proceedings ArticleDOI
27 May 2019
TL;DR: This paper presents a reachability analysis method for feed-forward neural networks (FNN) that employ rectified linear units (ReLUs) as activation functions that relies on three reachable-set computation algorithms, namely exact schemes, lazy-approximate schemes, and mixing schemes.
Abstract: Artificial neural networks (ANN) have displayed considerable utility in a wide range of applications such as image processing, character and pattern recognition, self-driving cars, evolutionary robotics, and non-linear system identification and control. While ANNs are able to carry out complicated tasks efficiently, they are susceptible to unpredictable and errant behavior due to irregularities that emanate from their complex non-linear structure. As a result, there have been reservations about incorporating them into safety-critical systems. In this paper, we present a reachability analysis method for feed-forward neural networks (FNN) that employ rectified linear units (ReLUs) as activation functions. The crux of our approach relies on three reachable-set computation algorithms, namely exact schemes, lazy-approximate schemes, and mixing schemes. The exact scheme computes an exact reachable set for FNN, while the lazy-approximate and mixing schemes generate an over-approximation of the exact reachable set. All schemes are designed efficiently to run on parallel platforms to reduce the computation time and enhance the scalability. Our methods are implemented in a MATLAB® toolbox called NNV, and are evaluated using a set of benchmarks that consist of realistic neural networks with sizes that range from tens to a thousand neurons. Notably, NNV successfully computes and visualizes the exact reachable sets of the real-world ACAS Xu deep neural networks (DNNs), which are a variant of a family of novel airborne collision detection systems known as the ACAS System X, using a representation of tens to hundreds of polyhedra.
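As a much-simplified stand-in for the over-approximating schemes mentioned above, the sketch below propagates axis-aligned interval bounds through affine and ReLU layers. NNV itself works with polyhedral (star-set) representations; interval bound propagation is used here only to illustrate what a sound over-approximation of a ReLU network's reachable set looks like. The random weights and input box are placeholders.

```python
# Interval over-approximation of the output set of an affine + ReLU network:
# every operation maps an input box [lo, hi] to a box that contains the image.
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Exact interval image of the box [lo, hi] under x -> W x + b."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = W @ center + b
    r = np.abs(W) @ radius
    return c - r, c + r

def relu_bounds(lo, hi):
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

def reach_intervals(layers, lo, hi):
    """layers: list of (W, b); returns an interval over-approximation of the outputs."""
    for W, b in layers[:-1]:
        lo, hi = relu_bounds(*affine_bounds(lo, hi, W, b))
    return affine_bounds(lo, hi, *layers[-1])     # last layer is affine only

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 2)), rng.normal(size=8)),
          (rng.normal(size=(1, 8)), rng.normal(size=1))]
print(reach_intervals(layers, np.array([-0.1, -0.1]), np.array([0.1, 0.1])))
```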

51 citations


Journal ArticleDOI
18 Nov 2019
TL;DR: A universal island-based metaheuristic algorithm (UIMA) was proposed, aiming to solve the spatially constrained berth scheduling problem and minimize the total cost of serving the arriving vessels at the MCT.
Abstract: Marine transportation has been faced with an increasing demand for containerized cargo during the past decade. Marine container terminals (MCTs), as the facilities for connecting seaborne and inland transportation, are expected to handle the increasing number of containers delivered by vessels. Berth scheduling plays an important role for the total throughput of MCTs as well as the overall effectiveness of the MCT operations. This study aims to propose a novel island-based metaheuristic algorithm to solve the berth scheduling problem and minimize the total cost of serving the arriving vessels at the MCT. A universal island-based metaheuristic algorithm (UIMA) was proposed in this study, aiming to solve the spatially constrained berth scheduling problem. The UIMA population was divided into four sub-populations (i.e. islands). Unlike the canonical island-based algorithms that execute the same metaheuristic on each island, four different population-based metaheuristics are adopted within the developed algorithm to search the islands, including the following: evolutionary algorithm (EA), particle swarm optimization (PSO), estimation of distribution algorithm (EDA) and differential evolution (DE). The adopted population-based metaheuristic algorithms rely on different operators, which facilitate the search process for superior solutions on the UIMA islands. The conducted numerical experiments demonstrated that the developed UIMA algorithm returned near-optimal solutions for the small-size problem instances. As for the large-size problem instances, UIMA was found to be superior to the EA, PSO, EDA and DE algorithms, which were executed in isolation, in terms of the obtained objective function values at termination. Furthermore, the developed UIMA algorithm outperformed various single-solution-based metaheuristic algorithms (including variable neighborhood search, tabu search and simulated annealing) in terms of the solution quality. The maximum UIMA computational time did not exceed 306 s. Some of the previous berth scheduling studies modeled uncertain vessel arrival times and/or handling times, while this study assumed the vessel arrival and handling times to be deterministic. The developed UIMA algorithm can be used by the MCT operators as an efficient decision support tool and assist with a cost-effective design of berth schedules within an acceptable computational time. A novel island-based metaheuristic algorithm is designed to solve the spatially constrained berth scheduling problem. The proposed island-based algorithm adopts several types of metaheuristic algorithms to cover different areas of the search space. The considered metaheuristic algorithms rely on different operators. Such a feature is expected to facilitate the search process for superior solutions.
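A hedged sketch of the island idea behind UIMA: each island evolves its own sub-population with a different update rule, and the islands periodically exchange their best individuals. Only two toy strategies (a Gaussian-mutation EA and a DE-style difference update) and a stand-in sphere objective are shown; the real algorithm uses four islands (EA, PSO, EDA, DE) and a berth-scheduling objective.

```python
# Two-island model with heterogeneous update rules and ring migration of the
# best individuals; the objective and all constants are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim, pop, gens, migrate_every = 5, 20, 200, 25
f = lambda x: np.sum(x ** 2, axis=-1)          # stand-in objective (minimize)

def ea_step(P):                                 # Gaussian-mutation hill climbing
    Q = P + rng.normal(scale=0.1, size=P.shape)
    return np.where((f(Q) < f(P))[:, None], Q, P)

def de_step(P):                                 # DE-style difference update
    i, j, k = (rng.permutation(len(P)) for _ in range(3))
    Q = P[i] + 0.8 * (P[j] - P[k])
    return np.where((f(Q) < f(P))[:, None], Q, P)

islands = [rng.uniform(-5, 5, size=(pop, dim)) for _ in range(2)]
steps = [ea_step, de_step]

for g in range(gens):
    islands = [step(P) for step, P in zip(steps, islands)]
    if g % migrate_every == 0:                  # ring migration of best individuals
        bests = [P[np.argmin(f(P))].copy() for P in islands]
        for idx, P in enumerate(islands):
            P[np.argmax(f(P))] = bests[(idx + 1) % len(islands)]

print(min(float(f(P).min()) for P in islands))
```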

48 citations


Proceedings ArticleDOI
16 Jul 2019
TL;DR: A generic method is presented that reduces the task of finding weighted matchings to that of finding short augmenting paths in unweighted graphs; in single-pass streaming with random edge arrivals, it yields a (1/2+c)-approximation algorithm, thus breaking the natural barrier of 1/2.
Abstract: We design a generic method to reduce the task of finding weighted matchings to that of finding short augmenting paths in unweighted graphs. This method enables us to provide efficient implementations for approximating weighted matchings in the massively parallel computation (MPC) model and in the streaming model. For the MPC and the multi-pass streaming model, we show that any algorithm computing a (1 − δ)-approximate unweighted matching in bipartite graphs can be translated into an algorithm that computes a (1 − ε(δ))-approximate maximum weighted matching. Furthermore, this translation incurs only a constant factor (that depends on ε > 0) overhead in the complexity. Instantiating this with the current best MPC algorithm for unweighted matching yields a (1 − ε)-approximation algorithm for maximum weighted matching that uses O_ε(log log n) rounds, O(m/n) machines per round, and O(n·poly(log n)) memory per machine. This improves upon the previous best approximation guarantee of (1/2 − ε) for weighted graphs. In the context of single-pass streaming with random edge arrivals, our techniques yield a (1/2 + c)-approximation algorithm, thus breaking the natural barrier of 1/2.

Proceedings ArticleDOI
25 Jul 2019
TL;DR: To enable efficient and effective RSL-Psc computation on massive route data, novel search-space pruning techniques are developed, and the two proposed parallel algorithms are shown to achieve high efficiency and scalability.
Abstract: With the increasing availability of moving-object tracking data, use of this data for route search and recommendation is increasingly important. To this end, we propose a novel parallel split-and-combine approach to enable route search by locations (RSL-Psc). Given a set of routes, a set of places to visit O, and a threshold θ, we retrieve the route composed of sub-routes that (i) has similarity to O no less than θ and (ii) contains the minimum number of sub-route combinations. The resulting functionality targets a broad range of applications, including route planning and recommendation, ridesharing, and location-based services in general. To enable efficient and effective RSL-Psc computation on massive route data, we develop novel search space pruning techniques and enable use of the parallel processing capabilities of modern processors. Specifically, we develop two parallel algorithms, Fully-Split Parallel Search (FSPS) and Group-Split Parallel Search (GSPS). We divide the route split-and-combine task into ∑_{k=0}^{M} S(|O|, k+1) sub-tasks, where M is the maximum number of combinations and S(⋅) is the Stirling number of the second kind. In each sub-task, we use network expansion and exploit spatial similarity bounds for pruning. The algorithms split candidate routes into sub-routes and combine them to construct new routes. The sub-tasks are independent and are performed in parallel. Extensive experiments with real data offer insight into the performance of the algorithms, indicating that our RSL-Psc problem can generate high-quality results and that the two algorithms are capable of achieving high efficiency and scalability.
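The sub-task count quoted above, ∑_{k=0}^{M} S(|O|, k+1), is easy to compute with the standard recurrence for Stirling numbers of the second kind; the snippet below does exactly that. The example values of |O| and M are illustrative.

```python
# Count the independent sub-tasks described in the abstract using the
# recurrence S(n, k) = k*S(n-1, k) + S(n-1, k-1).
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def num_subtasks(num_places, max_combinations):
    """sum_{k=0}^{M} S(|O|, k+1) for |O| = num_places, M = max_combinations."""
    return sum(stirling2(num_places, k + 1) for k in range(max_combinations + 1))

print(num_subtasks(5, 2))   # e.g. |O| = 5 places, at most M = 2 combinations -> 41
```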

Proceedings ArticleDOI
16 Apr 2019
TL;DR: In this article, the authors propose pFaces, an extensible software ecosystem for accelerating symbolic control techniques; it facilitates designing parallel algorithms and supervises their execution to utilize available computing resources.
Abstract: The correctness of control software in many safety-critical applications such as autonomous vehicles is crucial. One technique to achieve correct control software is called "symbolic control", where complex systems are approximated by finite-state abstractions. Then, using those abstractions, provably-correct digital controllers are algorithmically synthesized for concrete systems, satisfying complex high-level requirements. Unfortunately, the complexity of synthesizing such controllers grows exponentially in the number of state variables. However, if distributed implementations are considered, high-performance computing platforms can be leveraged to mitigate the effects of the state-explosion problem. We propose pFaces, an extensible software ecosystem, to accelerate symbolic control techniques. It facilitates designing parallel algorithms and supervises their execution to utilize available computing resources. To demonstrate its capabilities, novel parallel algorithms are designed for abstraction-based controller synthesis. Then, they are implemented inside pFaces and dispatched, for parallel execution, on different heterogeneous computing platforms, including CPUs, GPUs and Hardware Accelerators (HWAs). Results show a remarkable reduction in computation time of several orders of magnitude as the number of processing elements (PEs) increases, which easily outperforms all existing tools.

Journal ArticleDOI
TL;DR: It is possible to conclude that parallelization has a positive effect on the convergence and diversity of the optimization process for problems with many objectives; however, no single strategy yields the best results for all classes of problems.

Proceedings ArticleDOI
23 Jun 2019
TL;DR: The first algorithms with low adaptivity for submodular maximization with a matroid constraint are obtained, as well as the first parallel algorithm for non-monotone submodular maximization subject to packing constraints.
Abstract: We consider the problem of maximizing the multilinear extension of a submodular function subject to a single matroid constraint or multiple packing constraints with a small number of adaptive rounds of evaluation queries. We obtain the first algorithms with low adaptivity for submodular maximization with a matroid constraint. Our algorithms achieve a 1 − 1/e − ε approximation for monotone functions and a 1/e − ε approximation for non-monotone functions, which nearly matches the best guarantees known in the fully adaptive setting. The number of rounds of adaptivity is O(log^2 n / ε^3), which is an exponential speedup over the existing algorithms. We obtain the first parallel algorithm for non-monotone submodular maximization subject to packing constraints. Our algorithm achieves a 1/e − ε approximation using O(log(n/ε) log(1/ε) log(n+m) / ε^2) parallel rounds, which is again an exponential speedup in parallel time over the existing algorithms. For monotone functions, we obtain a 1 − 1/e − ε approximation in O(log(n/ε) log m / ε^2) parallel rounds. The number of parallel rounds of our algorithm matches that of the state-of-the-art algorithm for solving packing LPs with a linear objective (Mahoney et al., 2016). Our results apply more generally to the problem of maximizing a diminishing returns submodular (DR-submodular) function.
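A hedged illustration of the adaptive-complexity viewpoint: in each round, all marginal-gain queries are issued as one parallel batch and a decreasing threshold decides which elements to add. The sketch below is a simplified cardinality-constrained threshold greedy on a coverage function, not the paper's multilinear-extension algorithms for matroid or packing constraints; the threshold schedule and example data are assumptions.

```python
# Each while-iteration is one "adaptive round": a single parallel batch of
# function evaluations, followed by a threshold-based selection step.
from concurrent.futures import ThreadPoolExecutor

def coverage(sets, chosen):
    """Monotone submodular example: number of points covered by the chosen sets."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

def threshold_greedy(sets, k, eps=0.2):
    ground = list(range(len(sets)))
    S = []
    max_single = max(len(s) for s in sets)
    tau = float(max_single)                        # start at the max singleton value
    with ThreadPoolExecutor() as pool:
        while tau >= eps * max_single / k and len(S) < k:
            base = coverage(sets, S)
            gains = list(pool.map(lambda e: coverage(sets, S + [e]) - base, ground))
            for e, g in zip(ground, gains):
                if e not in S and g >= tau and len(S) < k:
                    S.append(e)
            tau *= (1 - eps)                       # lower the threshold each round
    return S

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}, {7}]
print(threshold_greedy(sets, k=3))
```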

Journal ArticleDOI
TL;DR: In this article, a new algorithm for the fast, shared memory, multi-core computation of augmented contour trees on triangulations is presented, which completely revisits the traditional, sequential contour tree algorithm to re-formulate all the steps of the computation as a set of independent local tasks.
Abstract: This paper presents a new algorithm for the fast, shared memory, multi-core computation of augmented contour trees on triangulations. In contrast to most existing parallel algorithms our technique computes augmented trees, enabling the full extent of contour tree based applications including data segmentation. Our approach completely revisits the traditional, sequential contour tree algorithm to re-formulate all the steps of the computation as a set of independent local tasks. This includes a new computation procedure based on Fibonacci heaps for the join and split trees, two intermediate data structures used to compute the contour tree, whose constructions are efficiently carried out concurrently thanks to the dynamic scheduling of task parallelism. We also introduce a new parallel algorithm for the combination of these two trees into the output global contour tree. Overall, this results in superior time performance in practice, both in sequential and in parallel thanks to the OpenMP task runtime. We report performance numbers that compare our approach to reference sequential and multi-threaded implementations for the computation of augmented merge and contour trees. These experiments demonstrate the run-time efficiency of our approach and its scalability on common workstations. We demonstrate the utility of our approach in data segmentation applications.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper presents an algorithm that, for graphs with diameter D in the wide range [log^ε n, n], takes O(log D) rounds to identify the connected components, takes O(log log n) rounds for all other graphs, and uses an optimal total space of O(m).
Abstract: Identifying the connected components of a graph, apart from being a fundamental problem with countless applications, is a key primitive for many other algorithms. In this paper, we consider this problem in parallel settings. Particularly, we focus on the Massively Parallel Computations (MPC) model, which is the standard theoretical model for modern parallel frameworks such as MapReduce, Hadoop, or Spark. We consider the truly sublinear regime of MPC for graph problems where the space per machine is n^δ for some desirably small constant δ ∊ (0, 1). We present an algorithm that, for graphs with diameter D in the wide range [log^ε n, n], takes O(log D) rounds to identify the connected components, and takes O(log log n) rounds for all other graphs. The algorithm is randomized, succeeds with high probability, does not require prior knowledge of D, and uses an optimal total space of O(m). We complement this by showing a conditional lower-bound based on the widely believed TwoCycle conjecture that Ω(log D) rounds are indeed necessary in this setting. Studying parallel connectivity algorithms received a resurgence of interest after the pioneering work of Andoni et al. [FOCS 2018] who presented an algorithm with O(log D log log n) round-complexity. Our algorithm improves this result for the whole range of values of D and almost settles the problem due to the conditional lower-bound. Additionally, we show that with minimal adjustments, our algorithm can also be implemented in a variant of (CRCW) PRAM in asymptotically the same number of rounds.
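For contrast with the O(log D)-round algorithm described above, the sketch below implements the most naive round-synchronous connectivity rule, every vertex repeatedly adopting the smallest label in its neighborhood, which needs on the order of D rounds. It only illustrates the synchronous, per-round style of computation in models like MPC; the paper's algorithm is far more involved.

```python
# Naive synchronous label propagation: each round, every endpoint of every
# edge adopts the smaller of the two incident labels. Converges in O(D) rounds.
def connected_components(n, edges):
    label = list(range(n))
    changed, rounds = True, 0
    while changed:                       # one synchronous round per iteration
        changed = False
        new_label = label[:]
        for u, v in edges:
            m = min(label[u], label[v])
            if m < new_label[u]:
                new_label[u], changed = m, True
            if m < new_label[v]:
                new_label[v], changed = m, True
        label = new_label
        rounds += 1
    return label, rounds

edges = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 5)]
print(connected_components(8, edges))    # labels identify the three components
```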

Journal ArticleDOI
TL;DR: A parallel association-rule mining algorithm is used to explore the correlations and regularities of oxygen, temperature, phosphate, nitrate and silicate in the ocean, and the relationship between parallel efficiency and the number of CPU cores is analyzed.
Abstract: Given the complexity of ocean data, this paper adopts a parallel association-rule mining algorithm to explore the correlations and regularities of oxygen, temperature, phosphate, nitrate and silicate in the ocean. After the marine data are interpolated, this paper utilizes the parallel FP-growth algorithm to mine the data and then briefly analyzes the resulting frequent itemsets and association rules. The relationship between parallel efficiency and the number of CPU cores is analyzed using datasets of different scales. The experimental results indicate that the acceleration effect is best when each thread handles 200,000–300,000 data records, which yields a performance improvement of more than 1.2 times.
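A hedged sketch of the first, embarrassingly parallel step of parallel FP-growth: counting item frequencies on data partitions in parallel and merging the partial counts to obtain the frequent 1-itemsets. The toy transactions stand in for discretized ocean measurements; the support threshold and worker count are illustrative, and the FP-tree construction itself is omitted.

```python
# Parallel first scan of FP-growth: per-partition item counting, then a merge.
from collections import Counter
from multiprocessing import Pool

def count_partition(transactions):
    c = Counter()
    for t in transactions:
        c.update(t)
    return c

def frequent_items(transactions, min_support, n_workers=4):
    chunk = max(1, len(transactions) // n_workers)
    parts = [transactions[i:i + chunk] for i in range(0, len(transactions), chunk)]
    with Pool(n_workers) as pool:
        partial = pool.map(count_partition, parts)   # counted in parallel
    total = sum(partial, Counter())                  # merge partial counts
    return {item: n for item, n in total.items() if n >= min_support}

if __name__ == "__main__":
    data = [["low_oxygen", "high_nitrate"], ["low_oxygen", "high_phosphate"],
            ["high_nitrate", "high_silicate"], ["low_oxygen", "high_nitrate"]]
    print(frequent_items(data, min_support=2))
```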

Posted ContentDOI
TL;DR: It is shown theoretically and empirically that a class of non-reversible PT methods dominates its reversible counterparts; distinct scaling limits for the non-reversible and reversible schemes are identified, and an iterative scheme approximating the optimal annealing schedule is developed.
Abstract: Parallel tempering (PT) methods are a popular class of Markov chain Monte Carlo schemes used to sample complex high-dimensional probability distributions. They rely on a collection of $N$ interacting auxiliary chains targeting tempered versions of the target distribution to improve the exploration of the state-space. We provide here a new perspective on these highly parallel algorithms and their tuning by identifying and formalizing a sharp divide in the behaviour and performance of reversible versus non-reversible PT schemes. We show theoretically and empirically that a class of non-reversible PT methods dominates its reversible counterparts and identify distinct scaling limits for the non-reversible and reversible schemes, the former being a piecewise-deterministic Markov process and the latter a diffusion. These results are exploited to identify the optimal annealing schedule for non-reversible PT and to develop an iterative scheme approximating this schedule. We provide a wide range of numerical examples supporting our theoretical and methodological contributions. The proposed methodology is applicable to sample from a distribution $\pi$ with a density $L$ with respect to a reference distribution $\pi_0$ and compute the normalizing constant. A typical use case is when $\pi_0$ is a prior distribution, $L$ a likelihood function and $\pi$ the corresponding posterior.
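A hedged sketch of a non-reversible PT scheme of the kind analysed above: adjacent-chain swaps alternate deterministically between even and odd pairs rather than being proposed at random, which is what makes the index process non-reversible. The toy bimodal target, random-walk moves, and temperature ladder are all illustrative choices, not the paper's setup.

```python
# Non-reversible parallel tempering on a 1-D bimodal target: local random-walk
# Metropolis moves within each tempered chain, then deterministic even/odd
# alternation of adjacent-chain swap attempts.
import numpy as np

rng = np.random.default_rng(0)
betas = np.array([1.0, 0.5, 0.25, 0.1])        # illustrative annealing schedule
log_target = lambda x: np.logaddexp(-0.5 * (x - 3) ** 2, -0.5 * (x + 3) ** 2)

x = rng.normal(size=len(betas))                # one state per chain
samples = []
for it in range(20000):
    # local exploration within each tempered chain
    prop = x + rng.normal(scale=1.0, size=len(x))
    accept = np.log(rng.uniform(size=len(x))) < betas * (log_target(prop) - log_target(x))
    x = np.where(accept, prop, x)
    # non-reversible swaps: even pairs on even iterations, odd pairs otherwise
    for i in range(it % 2, len(betas) - 1, 2):
        log_ratio = (betas[i] - betas[i + 1]) * (log_target(x[i + 1]) - log_target(x[i]))
        if np.log(rng.uniform()) < log_ratio:
            x[i], x[i + 1] = x[i + 1], x[i]
    samples.append(x[0])                       # chain 0 targets the distribution of interest

print(float(np.mean(np.array(samples) > 0)))   # ~0.5 for the symmetric bimodal target
```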

Posted ContentDOI
27 May 2019-bioRxiv
TL;DR: This work proposes the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations, and provides a novel blocked approach to compute the score matrix while ensuring high memory locality.
Abstract: Aligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence. Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to three hours for the problem of optimally aligning high coverage long (PacBio/ONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices. Availability The implementation of our algorithm is available at https://github.com/ParBLiSS/PaSGAL. Data sets used for evaluation are accessible using https://alurulab.cc.gatech.edu/PaSGAL.
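For reference, the sequential quadratic-time dynamic program that the parallel algorithm accelerates can be sketched as below: semi-global alignment of a query against paths of a character-labelled DAG (free start and end in the graph, full query consumed), processed in topological order. This is a plain Python sketch with unit edit costs and no SIMD or multithreading; the scoring scheme and the toy graph are assumptions.

```python
# Sequence-to-DAG alignment DP: D[v][j] is the best cost of aligning the first
# j query characters to some path that ends at vertex v.
def seq_to_graph_align(chars, preds, query):
    """chars[v]: label of vertex v; preds[v]: predecessor ids; returns min edit cost."""
    m = len(query)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in chars]
    for v in range(len(chars)):                   # vertices assumed in topological order
        D[v][0] = 1                               # empty query: v itself must be deleted
        for j in range(1, m + 1):
            best_prev_diag = min([j - 1] + [D[u][j - 1] for u in preds[v]])
            best_prev_up = min([j] + [D[u][j] for u in preds[v]])
            sub = 0 if chars[v] == query[j - 1] else 1
            D[v][j] = min(sub + best_prev_diag,   # v aligned to query[j]
                          1 + best_prev_up,       # v deleted (gap in query)
                          1 + D[v][j - 1])        # query[j] inserted (gap in graph)
    return min(D[v][m] for v in range(len(chars)))

# Tiny variation-graph-like example: A -> C -> G with an alternative branch A -> T -> G.
chars = ["A", "C", "T", "G"]
preds = [[], [0], [0], [1, 2]]
print(seq_to_graph_align(chars, preds, "ATG"))    # 0: the path A, T, G matches exactly
```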

Journal ArticleDOI
TL;DR: This paper constructs a multiobjective feature selection model that simultaneously considers the classification error, the number of features and the feature redundancy, and proposes several distributed parallel algorithms based on different encodings and an adaptive strategy.

Journal ArticleDOI
TL;DR: This paper presents a parallel and fully implicit simulator for the black oil model based on the variational inequality (VI) framework, which can be used to enforce important mathematical and physical properties to obtain accurate constraint-preserving solutions.

Proceedings ArticleDOI
17 Jun 2019
TL;DR: In this paper, a parallel batch-dynamic connectivity algorithm is proposed that is work-efficient for small batch sizes and achieves O(log n log(1 + n/Δ)) expected amortized work per edge insertion and deletion and O(log^3 n) depth w.h.p.
Abstract: In this paper, we study batch parallel algorithms for the dynamic connectivity problem, a fundamental problem that has received considerable attention in the sequential setting. The best sequential algorithm for dynamic connectivity is the elegant level-set algorithm of Holm, de Lichtenberg and Thorup (HDT), which achieves O(log^2 n) amortized time per edge insertion or deletion, and O(log n) time per query. We design a parallel batch-dynamic connectivity algorithm that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large. Given a sequence of batched updates, where Δ is the average batch size of all deletions, our algorithm achieves O(log n log(1 + n/Δ)) expected amortized work per edge insertion and deletion and O(log^3 n) depth w.h.p. Our algorithm answers a batch of k connectivity queries in O(k log(1 + n/k)) expected work and O(log n) depth w.h.p. To the best of our knowledge, our algorithm is the first parallel batch-dynamic algorithm for connectivity.

Proceedings ArticleDOI
01 Dec 2019
TL;DR: The experimental results show that the KRaft consensus algorithm achieves a 41% improvement in transaction throughput and a 67% improvement in leader election speed, while satisfying the safety and liveness requirements of the Raft consensus algorithm.
Abstract: With the development of blockchain, more and more blockchain types have emerged: public, consortium and private blockchains. Because nodes are trusted in some consortium and private blockchains, a non-Byzantine-fault-tolerant consensus algorithm, KRaft (Kademlia-Raft), with high throughput and high scalability is proposed. KRaft is a Raft-like consensus algorithm that preserves part of the logic of the Raft consensus algorithm. It optimizes the leader election and consensus process of Raft through the K-bucket node relationships established by the Kademlia protocol, improving leader election speed and throughput. First, the KRaft algorithm uses the K-buckets established by the Kademlia protocol to achieve a stable and efficient leader election process, addressing the candidate split-vote problem and the low voting efficiency caused by a growing number of follower nodes in the Raft algorithm. Second, to address the low efficiency and load imbalance of single-leader log replication in the Raft consensus process, a parallel log replication scheme with multiple candidate nodes that balances the leader node's load is proposed to improve the throughput and scalability of the algorithm. Finally, as a Raft-like consensus algorithm, KRaft satisfies the safety and liveness requirements of the Raft consensus algorithm. The KRaft and Raft consensus algorithms were evaluated with a local cluster simulation. The experimental results show that the KRaft consensus algorithm achieves a 41% improvement in transaction throughput and a 67% improvement in leader election speed.

Proceedings ArticleDOI
06 Jan 2019
TL;DR: In this paper, the authors present massively parallel algorithms for edit distance and longest common subsequence that achieve a 1 + ε approximation factor, constant round complexity, and O(n^2) total running time over all machines.
Abstract: String similarity measures are among the most fundamental problems in computer science. The notable examples are edit distance (ED) and longest common subsequence (LCS). These problems find their applications in various contexts such as computational biology, text processing, compiler optimization, data analysis, image analysis, etc. In this work, we revisit edit distance and longest common subsequence in the parallel setting. We present massively parallel algorithms for both problems that are optimal in the following senses: • The approximation factor of our algorithms is 1 + ε. • The round complexity of our algorithms is constant. • The total running time of our algorithms over all machines is O(n^2). This matches the running time of the best-known solutions for approximating edit distance and longest common subsequence within a 1 + ε factor in the sequential setting. Our result for edit distance substantially improves the massively parallel algorithm of [15] in terms of approximation factor, round complexity, number of machines, and total running time. Our unified approach to tackle both problems is to divide one of the strings into smaller blocks and try to locally predict which intervals of the other string correspond to each block in an optimal solution. Our main technical contribution is a novel parallel algorithm for computing a set of compositions, and recursively decomposing each function into a set of smaller iterative compositions (in terms of memory needed to solve the problem). These two methods together give us a strong tool for approximating combinatorial problems. For instance, LCS can be formulated as a recursive composition of functions and therefore this tool enables us to approximate LCS within a factor of 1 + ε. Indeed, we recursively decompose the problem until we are able to compute the solution on a single machine. Since our methods are quite general, we expect this technique to find its applications in other combinatorial problems as well.

Proceedings ArticleDOI
TL;DR: In this article, a parallel batch-dynamic connectivity algorithm was proposed that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large.
Abstract: In this paper, we study batch parallel algorithms for the dynamic connectivity problem, a fundamental problem that has received considerable attention in the sequential setting. The most well known sequential algorithm for dynamic connectivity is the elegant level-set algorithm of Holm, de Lichtenberg and Thorup (HDT), which achieves $O(\log^2 n)$ amortized time per edge insertion or deletion, and $O(\log n / \log\log n)$ time per query. We design a parallel batch-dynamic connectivity algorithm that is work-efficient with respect to the HDT algorithm for small batch sizes, and is asymptotically faster when the average batch size is sufficiently large. Given a sequence of batched updates, where $\Delta$ is the average batch size of all deletions, our algorithm achieves $O(\log n \log(1 + n / \Delta))$ expected amortized work per edge insertion and deletion and $O(\log^3 n)$ depth w.h.p. Our algorithm answers a batch of $k$ connectivity queries in $O(k \log(1 + n/k))$ expected work and $O(\log n)$ depth w.h.p. To the best of our knowledge, our algorithm is the first parallel batch-dynamic algorithm for connectivity.

Journal ArticleDOI
TL;DR: Numerical simulations of the proposed parallel Newton-type method for nonlinear model predictive control of a quadrotor show that the method is highly parallelizable and converges in only a few iterations, even to high accuracy.

Proceedings ArticleDOI
23 Jun 2019
TL;DR: In this article, submodular maximization under a matroid constraint is studied in the adaptive complexity model, and an approximation algorithm with near-optimal adaptivity of O(log(n) log(k)) is proposed.
Abstract: In this paper we study submodular maximization under a matroid constraint in the adaptive complexity model. This model was recently introduced in the context of submodular optimization to quantify the information theoretic complexity of black-box optimization in a parallel computation model. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when each round can execute polynomially-many function evaluations in parallel. Since submodular optimization is regularly applied on large datasets we seek algorithms with low adaptivity to enable speedups via parallelization. Consequently, a recent line of work has been devoted to designing constant factor approximation algorithms for maximizing submodular functions under various constraints in the adaptive complexity model. Despite the burst in work on submodular maximization in the adaptive complexity model, the fundamental problem of maximizing a monotone submodular function under a matroid constraint has remained elusive. In particular, all known techniques fail for this problem and there are no known constant factor approximation algorithms whose adaptivity is sublinear in the rank of the matroid k or in the worst case sublinear in the size of the ground set n. In this paper we present an approximation algorithm for the problem of maximizing a monotone submodular function under a matroid constraint in the adaptive complexity model. The approximation guarantee of the algorithm is arbitrarily close to the optimal 1−1/e and it has near optimal adaptivity of O(log(n)log(k)). This result is obtained using a novel technique of adaptive sequencing which departs from previous techniques for submodular maximization in the adaptive complexity model. In addition to our main result we show how to use this technique to design other approximation algorithms with strong approximation guarantees and polylogarithmic adaptivity.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring.
Abstract: SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a first-cut computational kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and file format, writing code to read in the files that define the problem, and comparing our results with the reference solution took a full day. The kernel consists of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the first time it was compiled. The sequential performance of the GraphBLAS solution is 3x to 5x faster than the MATLAB reference implementation. OpenMP parallelism gives an additional 10x to 15x speedup on a 20-core Intel processor, 17x on an IBM Power8 system, and 20x on a Power9 system, for the largest problems. Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development effort that took one week, primarily because of difficulties in resolving a load-balancing issue in the MPI-based parallel algorithm.
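A hedged SciPy-sparse analogue of the inference loop described above (the actual kernel consists of GraphBLAS calls, not SciPy): each layer is a sparse matrix-matrix product followed by a bias, ReLU, and clipping step, roughly in the spirit of the Sparse DNN Graph Challenge. The bias handling, cap value, and random matrices below are simplified and illustrative.

```python
# Sparse feed-forward inference: one sparse matmul plus elementwise ops per layer.
import numpy as np
import scipy.sparse as sp

def sparse_dnn_forward(Y, weights, bias=-0.05, cap=32.0):
    """Y: sparse feature matrix; weights: list of sparse layer matrices."""
    for W in weights:
        Z = Y @ W                                   # sparse matrix-matrix multiply
        Z.data += bias                              # bias applied to stored entries only
        Z.data = np.clip(Z.data, 0.0, cap)          # ReLU, then cap large activations
        Z.eliminate_zeros()                         # keep the result sparse
        Y = Z
    return Y

Y0 = sp.random(500, 128, density=0.1, format="csr", random_state=0)
layers = [sp.random(128, 128, density=0.1, format="csr", random_state=i + 1)
          for i in range(3)]
print(sparse_dnn_forward(Y0, layers).nnz)
```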

Journal ArticleDOI
TL;DR: This study proposes a novel parallel Branch & Bound algorithm to optimize the energy consumption of robotic cells without deteriorating throughput; the performance of the parallel algorithm scales almost linearly up to 12 processor cores, and the quality of the obtained solutions is better than or comparable to existing works.