
Showing papers on "Parallel algorithm published in 2018"


Book
05 Feb 2018
TL;DR: A linear time and space preprocessing algorithm that enables each lowest-common-ancestor query to be answered in $O(1)$ time, matching Harel and Tarjan, with the advantage of being simple and easily parallelizable.
Abstract: We consider the following problem. Suppose a rooted tree T is available for preprocessing. Answer on-line queries requesting the lowest common ancestor for any pair of vertices in T. We present a linear time and space preprocessing algorithm that enables us to answer each query in $O(1)$ time, as in Harel and Tarjan [SIAM J. Comput., 13 (1984), pp. 338–355]. Our algorithm has the advantage of being simple and easily parallelizable. The resulting parallel preprocessing algorithm runs in logarithmic time using an optimal number of processors on an EREW PRAM. Each query is then answered in $O(1)$ time using a single processor.
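
The constant-time query scheme is easiest to see through the standard Euler-tour reduction to range-minimum queries. The sketch below is a simplified variant using a sparse table, which costs O(n log n) preprocessing rather than the paper's linear bound; it is illustrative only, not the paper's scheme.

```python
def preprocess_lca(tree, root):
    """tree: adjacency dict. Build Euler tour + sparse table for RMQ."""
    euler, depth, first = [], [], {}
    def dfs(v, parent, d):
        first[v] = len(euler)
        euler.append(v); depth.append(d)
        for u in tree[v]:
            if u != parent:
                dfs(u, v, d + 1)
                euler.append(v); depth.append(d)
    dfs(root, None, 0)
    n = len(euler)
    # sparse[j][i] = index of the minimum depth in euler[i : i + 2**j]
    sparse = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, row = sparse[-1], []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + (1 << (j - 1))]
            row.append(a if depth[a] <= depth[b] else b)
        sparse.append(row)
        j += 1
    return euler, depth, first, sparse

def lca(euler, depth, first, sparse, u, v):
    l, r = sorted((first[u], first[v]))
    j = (r - l + 1).bit_length() - 1
    a, b = sparse[j][l], sparse[j][r - (1 << j) + 1]
    return euler[a if depth[a] <= depth[b] else b]

# tree = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1]}
# e, d, f, s = preprocess_lca(tree, 0); lca(e, d, f, s, 3, 4)  -> 1
```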

549 citations


Journal ArticleDOI
01 Mar 2018
TL;DR: The experimental results show that PACO-3Opt is more efficient and reliable than the other algorithms and can reach the global optimum.
Abstract: This article presents a parallel cooperative hybrid algorithm for solving the traveling salesman problem (TSP). Although heuristic approaches and hybrid methods obtain good results on the TSP, they cannot reliably avoid getting stuck in local optima, and their processing times are often long. To overcome these deficiencies, we propose the parallel cooperative hybrid algorithm PACO-3Opt, based on ant colony optimization. This method uses the 3-Opt algorithm to escape local minima. PACO-3Opt has multiple colonies and a master–slave paradigm. Each colony runs ACO to generate solutions. After a predefined number of iterations, each colony first runs 3-Opt to improve its solutions and then shares the best tour with the other colonies. This process continues until the termination criterion is met; thus, the algorithm can reach the global optimum. PACO-3Opt was compared with previous algorithms from the literature. The experimental results show that PACO-3Opt is more efficient and reliable than the other algorithms.
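
A minimal sketch of the cooperation pattern described above: several colonies improve tours independently and periodically exchange their best tour. For brevity, the colonies run sequentially here, and a random-restart constructor with 2-opt local search stands in for the paper's ACO + 3-Opt combination; `dist` is an assumed distance matrix.

```python
import random

def tour_len(t, dist):
    return sum(dist[t[i]][t[(i + 1) % len(t)]] for i in range(len(t)))

def two_opt(t, dist):
    # cheap local search standing in for the paper's ACO + 3-Opt phase
    improved = True
    while improved:
        improved = False
        for i in range(1, len(t) - 1):
            for j in range(i + 2, len(t) + 1):
                cand = t[:i] + t[i:j][::-1] + t[j:]
                if tour_len(cand, dist) < tour_len(t, dist):
                    t, improved = cand, True
    return t

def perturb(t):
    i, j = random.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t

def cooperative_search(dist, n_colonies=4, epochs=5):
    n = len(dist)
    colonies = [random.sample(range(n), n) for _ in range(n_colonies)]
    best = min(colonies, key=lambda t: tour_len(t, dist))
    for _ in range(epochs):
        colonies = [two_opt(t, dist) for t in colonies]      # improve locally
        best = min(colonies + [best], key=lambda t: tour_len(t, dist))
        colonies = [perturb(best[:]) for _ in colonies]      # share best tour
    return best, tour_len(best, dist)
```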

118 citations


Proceedings ArticleDOI
23 Apr 2018
TL;DR: This work revisits the iconic algorithm of Chiba and Nishizeki and develops the most efficient parallel algorithm for listing all k-cliques in graphs containing up to tens of millions of edges; it is faster than state-of-the-art algorithms while boasting an excellent degree of parallelism.
Abstract: Motivated by recent studies in the data mining community which require efficiently listing all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for this problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges, as well as all 10-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraph in very large real-world graphs.
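
The core of the Chiba–Nishizeki approach is to orient edges along a vertex ordering and recurse on intersections of out-neighborhoods. A compact sequential sketch (degree order stands in for the core ordering, and the parallelism is omitted):

```python
def list_k_cliques(adj, k):
    """adj: dict mapping vertex -> set of neighbors (undirected graph)."""
    order = sorted(adj, key=lambda v: len(adj[v]))   # degree order as a proxy
    rank = {v: i for i, v in enumerate(order)}
    # orient every edge from lower to higher rank: avoids duplicate listings
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    result = []
    def expand(prefix, cand):
        if len(prefix) == k:
            result.append(tuple(prefix))
            return
        for v in list(cand):
            expand(prefix + [v], cand & out[v])      # shrink the candidate set
    for v in adj:
        expand([v], out[v])
    return result
```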

109 citations


Proceedings ArticleDOI
20 Jun 2018
TL;DR: The result implies that in the realizable case, where there is a true underlying function generating the data, Θ(log n) batches of adaptive samples are necessary and sufficient to approximately “learn to optimize” a monotone submodular function under a cardinality constraint.
Abstract: In this paper we study the adaptive complexity of submodular optimization. Informally, the adaptive complexity of a problem is the minimal number of sequential rounds required to achieve a constant factor approximation when polynomially-many queries can be executed in parallel at each round. Adaptivity is a fundamental concept that is heavily studied in computer science, largely due to the need for parallelizing computation. Somewhat surprisingly, very little is known about adaptivity in submodular optimization. For the canonical problem of maximizing a monotone submodular function under a cardinality constraint, to the best of our knowledge, all that is known to date is that the adaptive complexity is between 1 and Ω(n). Our main result in this paper is a tight characterization showing that the adaptive complexity of maximizing a monotone submodular function under a cardinality constraint is Θ(log n): - We describe an algorithm which requires O(log n) sequential rounds and achieves an approximation that is arbitrarily close to 1/3; - We show that no algorithm can achieve an approximation better than O(1 / log n) with fewer than O(log n / log log n) rounds. Thus, when allowing for parallelization, our algorithm achieves a constant factor approximation exponentially faster than any known existing algorithm for submodular maximization. Importantly, the approximation algorithm is achieved via adaptive sampling and complements a recent line of work on optimization of functions learned from data. In many cases we do not know the functions we optimize and learn them from labeled samples. Recent results show that no algorithm can obtain a constant factor approximation guarantee using polynomially-many labeled samples as in the PAC and PMAC models, drawn from any distribution. Since learning with non-adaptive samples over any distribution results in a sharp impossibility, we consider learning with adaptive samples where the learner obtains poly(n) samples drawn from a distribution of her choice in every round. Our result implies that in the realizable case, where there is a true underlying function generating the data, Θ(log n) batches of adaptive samples are necessary and sufficient to approximately “learn to optimize” a monotone submodular function under a cardinality constraint.

108 citations


Proceedings ArticleDOI
11 Jul 2018
TL;DR: It is shown that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes.
Abstract: There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the studies that do use the Hyperlink graph rely on distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 13 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We provide a publicly-available benchmark suite containing our implementations.

100 citations


Proceedings ArticleDOI
20 Jun 2018
TL;DR: The O(log n) round complexity bound for approximate maximum matching is broken even with slightly sublinear memory per machine, and the improvement is almost exponential: a (2+ε)-approximate maximum matching can be computed in O((log log n)^2) MPC rounds.
Abstract: For over a decade now we have been witnessing the success of massive parallel computation (MPC) frameworks, such as MapReduce, Hadoop, Dryad, or Spark. One of the reasons for their success is the fact that these frameworks are able to accurately capture the nature of large-scale computation. In particular, compared to the classic distributed algorithms or PRAM models, these frameworks allow for much more local computation. The fundamental question that arises in this context is though: can we leverage this additional power to obtain even faster parallel algorithms? A prominent example here is the maximum matching problem—one of the most classic graph problems. It is well known that in the PRAM model one can compute a 2-approximate maximum matching in O(log n) rounds. However, the exact complexity of this problem in the MPC framework is still far from understood. Lattanzi et al. (SPAA 2011) showed that if each machine has n^{1+Ω(1)} memory, this problem can also be solved 2-approximately in a constant number of rounds. These techniques, as well as the approaches developed in the follow-up work, seem though to get stuck in a fundamental way at roughly O(log n) rounds once we enter the (at most) near-linear memory regime. It is thus entirely possible that in this regime, which captures in particular the case of sparse graph computations, the best MPC round complexity matches what one can already get in the PRAM model, without the need to take advantage of the extra local computation power. In this paper, we finally refute that possibility. That is, we break the above O(log n) round complexity bound even in the case of slightly sublinear memory per machine. In fact, our improvement here is almost exponential: we are able to deliver a (2+ε)-approximate maximum matching, for any fixed constant ε > 0, in O((log log n)^2) rounds. To establish our result we need to deviate from the previous work in two important ways that are crucial for exploiting the power of the MPC model, as compared to the PRAM model. Firstly, we use vertex-based graph partitioning, instead of the edge-based approaches that were utilized so far. Secondly, we develop a technique of round compression. This technique enables one to take a (distributed) algorithm that computes an O(1)-approximation of maximum matching in O(log n) independent PRAM phases and implement a super-constant number of these phases in only a constant number of MPC rounds.

90 citations


Proceedings ArticleDOI
21 May 2018
TL;DR: The design of a distributed-memory implementation of the Louvain algorithm for parallel community detection is presented; it begins with an arbitrarily partitioned distributed graph input and employs several heuristics to speed up the computation of the different steps of the Louvain algorithm.
Abstract: In most real-world networks, the nodes/vertices tend to be organized into tightly-knit modules known as communities or clusters, such that nodes within a community are more likely to be "related" to one another than they are to the rest of the network. The goodness of partitioning into communities is typically measured using a well-known measure called modularity. However, modularity optimization is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, iterative heuristic for modularity optimization, called the Louvain method. Owing to its speed and ability to yield high quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection. In this paper, we present the design of a distributed memory implementation of the Louvain algorithm for parallel community detection. Our approach begins with an arbitrarily partitioned distributed graph input, and employs several heuristics to speed up the computation of the different steps of the Louvain algorithm. We evaluate our implementation and its different variants using real-world networks from various application domains (including the internet, biology, and social networks). Our MPI+OpenMP implementation yields about 7x speedup (on 4K processes) for the soc-friendster network (1.8B edges) over a state-of-the-art shared memory multicore implementation (on 64 threads), without compromising output quality. Furthermore, our distributed implementation was able to process a larger graph (uk-2007; 3.3B edges) in 32 seconds on 1K cores (64 nodes) of NERSC Cori, whereas the state-of-the-art shared memory implementation failed to run due to insufficient memory on a single Cori node containing 128 GB of memory.
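
The heart of Louvain, which the paper distributes, is the local-move phase: each vertex greedily joins the neighboring community with the best modularity gain. A serial, unweighted sketch of one pass (the distributed version must additionally reconcile moves made concurrently on different ranks):

```python
from collections import defaultdict

def louvain_local_pass(adj, comm):
    """adj: vertex -> set of neighbors; comm: vertex -> community id.
    One pass of greedy modularity-gain moves; returns True if anything moved."""
    deg = {v: len(adj[v]) for v in adj}
    m = sum(deg.values()) / 2.0                     # number of edges
    sigma = defaultdict(float)                      # total degree per community
    for v in adj:
        sigma[comm[v]] += deg[v]
    moved = False
    for v in adj:
        home = comm[v]
        sigma[home] -= deg[v]                       # take v out of its community
        links = defaultdict(int)                    # edges from v to each community
        for u in adj[v]:
            links[comm[u]] += 1
        # standard gain of inserting v into community c (constant terms dropped)
        def gain(c):
            return links.get(c, 0) / m - sigma[c] * deg[v] / (2.0 * m * m)
        best = max(set(links) | {home}, key=gain)
        comm[v] = best
        sigma[best] += deg[v]
        moved |= best != home
    return moved
```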

83 citations


Journal ArticleDOI
TL;DR: A very general parallel algorithm is proposed that allows large spin–lattice systems to be efficiently simulated on large numbers of processors without degrading the method's mathematical accuracy.

81 citations


Proceedings ArticleDOI
11 Jul 2018
TL;DR: In this paper, the authors propose a new integrated method of exploiting model, batch and domain parallelism for the training of deep neural networks (DNNs) on large distributed-memory computers using minibatch stochastic gradient descent (SGD).
Abstract: We propose a new integrated method of exploiting model, batch and domain parallelism for the training of deep neural networks (DNNs) on large distributed-memory computers using minibatch stochastic gradient descent (SGD). Our goal is to find an efficient parallelization strategy for a fixed batch size using P processes. Our method is inspired by the communication-avoiding algorithms in numerical linear algebra. We see P processes as logically divided into a P_r × P_c grid where the P_r dimension is implicitly responsible for model/domain parallelism and the P_c dimension is implicitly responsible for batch parallelism. In practice, the integrated matrix-based parallel algorithm encapsulates these types of parallelism automatically. We analyze the communication complexity and analytically demonstrate that the lowest communication costs are often achieved neither with pure model nor with pure data parallelism. We also show how the domain parallel approach can help in extending the theoretical scaling limit of the typical batch parallel method.
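
The process organization can be pictured with a small sketch (hypothetical helpers, not the authors' code) mapping a flat MPI-style rank onto a P_r × P_c grid, with the P_r dimension handling model/domain parallelism and the P_c dimension handling batch parallelism:

```python
def grid_position(rank, p_r, p_c):
    """Map a flat rank in [0, p_r * p_c) to (row, col) on the process grid."""
    assert 0 <= rank < p_r * p_c
    return rank // p_c, rank % p_c

def model_group(col, p_r, p_c):
    # ranks along the P_r dimension: they partition the model/domain
    return [r * p_c + col for r in range(p_r)]

def batch_group(row, p_r, p_c):
    # ranks along the P_c dimension: each processes a slice of the minibatch
    return [row * p_c + c for c in range(p_c)]

# e.g. P = 8 processes as a 2 x 4 grid: rank 6 sits at row 1, col 2;
# it splits the model with rank 2 (same column) and the batch with
# ranks 4, 5, 7 (same row).
```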

77 citations


Journal ArticleDOI
TL;DR: A parallel repetitive nearest neighbor algorithm for solving the symmetric traveling salesman problem on OTIS-Hypercube and OTIS-Mesh optoelectronic architectures is presented; it attained near-linear speedup and high efficiency on the two selected optoelectronic architectures.
Abstract: Over the past years, researchers have turned their attention to proposing optoelectronic architectures, including optical transpose interconnection system (OTIS) networks. On the other hand, there have been limited attempts devoted to designing parallel algorithms for applications that could be mapped on such optoelectronic architectures. Thus, exploiting the attractive features of OTIS networks and investigating their performance in solving combinatorial optimization problems becomes a great necessity. In this paper, a parallel repetitive nearest neighbor algorithm for solving the symmetric traveling salesman problem on OTIS-Hypercube and OTIS-Mesh optoelectronic architectures is presented. This algorithm has been evaluated analytically and by simulation on both optoelectronic architectures in terms of number of communication steps, parallel run time, speedup, efficiency, cost and communication cost. The simulation results attained near-linear speedup and high efficiency on the two selected optoelectronic architectures, with OTIS-Hypercube gaining better results in comparison with OTIS-Mesh.
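
Repetitive nearest neighbor simply runs the greedy nearest-neighbor construction from every start city and keeps the best tour; that outer loop is what gets distributed over the OTIS nodes. A shared-memory sketch using Python's multiprocessing in place of the optoelectronic architecture:

```python
from multiprocessing import Pool

DIST = None          # distance matrix, installed once per worker process

def _init(dist):
    global DIST
    DIST = dist

def nn_tour(start):
    """Greedy nearest-neighbor tour from one start city."""
    n = len(DIST)
    tour, seen = [start], {start}
    while len(tour) < n:
        cur = tour[-1]
        nxt = min((c for c in range(n) if c not in seen),
                  key=lambda c: DIST[cur][c])
        tour.append(nxt); seen.add(nxt)
    cost = sum(DIST[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return cost, tour

def repetitive_nn(dist, workers=4):
    # one nearest-neighbor construction per start city, evaluated in parallel
    with Pool(workers, initializer=_init, initargs=(dist,)) as pool:
        return min(pool.map(nn_tour, range(len(dist))))
```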

77 citations


Book
06 Feb 2018
TL;DR: Given a general arithmetic expression, a computation binary tree representation is found in O(log n) time using n/log n processors on a concurrent-read, exclusive-write, parallel random-access machine.
Abstract: Given a general arithmetic expression, we find a computation binary tree representation in O(log n) time using n/log n processors on a concurrent-read, exclusive-write, parallel random-access machine. A new algorithm is introduced for this purpose. Unlike previous serial and parallel solutions, it is not based on using a stack.
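
The abstract contrasts the new parallel method with stack-based approaches. For reference, here is the conventional serial, stack-based (shunting-yard) construction that the paper's algorithm avoids:

```python
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def build_tree(tokens):
    """Shunting-yard construction of a binary expression tree.
    Internal nodes are tuples (op, left, right); leaves are the tokens."""
    out, ops = [], []
    def reduce_top():
        op = ops.pop()
        right, left = out.pop(), out.pop()
        out.append((op, left, right))
    for t in tokens:
        if t == '(':
            ops.append(t)
        elif t == ')':
            while ops[-1] != '(':
                reduce_top()
            ops.pop()
        elif t in PREC:
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[t]:
                reduce_top()
            ops.append(t)
        else:                      # operand
            out.append(t)
    while ops:
        reduce_top()
    return out[0]

# build_tree("a + b * ( c - d )".split())
#   -> ('+', 'a', ('*', 'b', ('-', 'c', 'd')))
```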

Journal ArticleDOI
TL;DR: This pilot study includes several innovative points, and the proposed pSSO is the first parallel algorithm to solve the RAP and the first to parallelize simplified swarm optimization (SSO) with the Taguchi method.
Abstract: In recent years, various smart sensor systems have been integrated into the "Internet of Things" (IoT) with the advancement of sensing technology. Redundancy allocation is the safest, most convenient, and most economical way to increase the reliability of smart sensor systems. To solve the smart sensor system redundancy allocation problem (RAP) in the IoT, a cooperative parallel simplified swarm optimization algorithm (pSSO) is presented in this study. This pilot study includes several innovative points. First, research is conducted on using the RAP in the IoT. Second, the proposed pSSO is the first parallel algorithm to solve the RAP and the first to parallelize simplified swarm optimization (SSO) with the Taguchi method. A simple real-life example regarding shopping and shipping on TAOBAO is given to describe how to model the IoT using the RAP. As proof of the success of the proposed pSSO, detailed computational results from solving a series-parallel redundancy allocation problem with a mix of components are presented. The computational results reflect the efficiency of the proposed pSSO.
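
For context, the update step of simplified swarm optimization (SSO), the algorithm being parallelized, is a per-dimension stochastic choice among the global best, the personal best, the current value, and a random value. A hedged sketch: the thresholds cg, cp, cw are the usual cumulative SSO parameters, a continuous search box is assumed (the RAP itself uses discrete component choices), and this is not the paper's parallel scheme.

```python
import random

def sso_update(x, pbest, gbest, lo, hi, cg=0.4, cp=0.7, cw=0.9):
    """One SSO position update. For each dimension, draw r ~ U(0,1) and copy
    from gbest (r < cg), pbest (r < cp), the current position (r < cw),
    or a fresh uniform random value otherwise."""
    new = []
    for d in range(len(x)):
        r = random.random()
        if r < cg:
            new.append(gbest[d])
        elif r < cp:
            new.append(pbest[d])
        elif r < cw:
            new.append(x[d])
        else:
            new.append(random.uniform(lo[d], hi[d]))
    return new
```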

Journal ArticleDOI
TL;DR: A class of reduced summaries is introduced, characterized by approximate graph pattern matching, that are capable of summarizing entities in terms of their neighborhood similarity up to a certain hop, using small and informative graph patterns.
Abstract: Querying heterogeneous and large-scale knowledge graphs is expensive. This paper studies a graph summarization framework to facilitate knowledge graph search. (1) We introduce a class of reduced summaries. Characterized by approximate graph pattern matching, these summaries are capable of summarizing entities in terms of their neighborhood similarity up to a certain hop, using small and informative graph patterns. (2) We study a diversified graph summarization problem. Given a knowledge graph, it is to discover top-$k$ summaries that maximize a bi-criteria function, characterized by both informativeness and diversity. We show that diversified summarization is feasible for large graphs, by developing both sequential and parallel summarization algorithms. (a) We show that there exists a 2-approximation algorithm to discover diversified summaries. We further develop an anytime sequential algorithm which discovers summaries under resource constraints. (b) We present a new parallel algorithm with quality guarantees. The algorithm is parallel scalable, which ensures its feasibility in distributed graphs. (3) We also develop a summary-based query evaluation scheme, which only refers to a small number of summaries. Using real-world knowledge graphs, we experimentally verify the effectiveness and efficiency of our summarization algorithms, and query processing using summaries.
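
Bi-criteria diversified selection of this kind is typically approached greedily: repeatedly add the candidate whose informativeness-plus-diversity gain is largest. A generic sketch, not the paper's 2-approximation; `info` and `dist` are assumed scoring callbacks and `lam` trades off the two criteria.

```python
def diversified_top_k(candidates, k, info, dist, lam=0.5):
    """Greedy bi-criteria selection: balances each candidate's own
    informativeness against its distance to the summaries already chosen.
    info(x) -> float; dist(x, y) -> float (larger = more different)."""
    chosen = []
    while len(chosen) < k and len(chosen) < len(candidates):
        remaining = [x for x in candidates if x not in chosen]
        best = max(remaining,
                   key=lambda x: (1 - lam) * info(x)
                                 + lam * sum(dist(x, y) for y in chosen))
        chosen.append(best)
    return chosen
```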

Journal ArticleDOI
TL;DR: A 3-D uncertain coverage model is proposed that uses a modified3-D sensing model and an uncertain fusion operator and a distributed parallel cooperative coevolutionary multiobjective large-scale evolutionary algorithm for maritime applications.
Abstract: Effectively monitoring maritime environments has become a vital problem in maritime applications. Traditional methods are not only expensive and time consuming but also restricted in both time and space. More recently, the concept of an industrial wireless sensor network (IWSN) has become a promising alternative for monitoring next-generation intelligent maritime grids, because IWSNs are cost-effective and easy to deploy. This paper focuses on solving the issue of 3-D IWSN deployment in a 3-D engine room space of a very large crude-oil carrier and also considers numerous power facilities. To address this 3-D IWSN deployment problem for maritime applications, a 3-D uncertain coverage model is proposed that uses a modified 3-D sensing model and an uncertain fusion operator. The deployment problem is converted into a multiobjective optimization problem that simultaneously addresses three objectives: coverage, lifetime, and reliability. Our goal is to achieve extensive coverage, long network lifetime, and high reliability. We also propose a distributed parallel cooperative coevolutionary multiobjective large-scale evolutionary algorithm for maritime applications. We verify the effectiveness of this algorithm through experiments by comparing it with five state-of-the-art algorithms. Numerical results demonstrate that the proposed method performs most effectively both in optimization performance and in minimizing the computation time.
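
A hedged sketch of the two modeling ingredients named above: a probabilistic 3-D sensing model with an uncertain band, and a fusion operator combining several sensors. The exponential decay form and the parameters are assumptions, not the paper's exact model.

```python
import numpy as np

def sensing_prob(d, r, r_e, lam=0.5):
    """Stand-in probabilistic sensing model: detection is certain inside
    r - r_e, impossible beyond r + r_e, and decays exponentially across
    the uncertain band in between (decay shape is an assumption)."""
    if d <= r - r_e:
        return 1.0
    if d >= r + r_e:
        return 0.0
    return float(np.exp(-lam * (d - (r - r_e))))

def fused_coverage(point, sensors, r, r_e):
    """Fuse per-sensor detection probabilities: the point is covered unless
    every sensor misses it (a common uncertain-fusion operator)."""
    miss = 1.0
    for s in sensors:
        miss *= 1.0 - sensing_prob(np.linalg.norm(point - s), r, r_e)
    return 1.0 - miss

# e.g. one point against two sensors in 3-D:
# fused_coverage(np.zeros(3), [np.array([1., 0., 0.]),
#                              np.array([0., 2., 0.])], r=2.0, r_e=0.5)
```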

Journal ArticleDOI
TL;DR: The serial implementation of the novel incremental algorithm, which decomposes the graph into biconnected components and proves that processing can be localized within the affected components, is demonstrated to be up to 3.7 times faster than existing serial methods.
Abstract: Betweenness centrality quantifies the importance of nodes in a graph in many applications, including network analysis, community detection and identification of influential users. Typically, graphs in such applications evolve over time. Thus, the computation of betweenness centrality should be performed incrementally. This is challenging because updating even a single edge may trigger the computation of all-pairs shortest paths in the entire graph. Existing approaches cannot scale to large graphs: they either require excessive memory (i.e., quadratic in the size of the input graph) or perform unnecessary computations rendering them prohibitively slow. We propose iCentral, a novel incremental algorithm for computing betweenness centrality in evolving graphs. We decompose the graph into biconnected components and prove that processing can be localized within the affected components. iCentral is the first algorithm to support incremental betweenness centrality computation within a graph component. This is done efficiently, in linear space; consequently, iCentral scales to large graphs. We demonstrate with real datasets that the serial implementation of iCentral is up to 3.7 times faster than existing serial methods. Our parallel implementation, which scales to large graphs, is an order of magnitude faster than the state-of-the-art parallel algorithm, while using an order of magnitude fewer computational resources.
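
The localization step the abstract describes is easy to state: after an edge update, only the biconnected component containing the new edge can see its dependencies change. A networkx sketch of finding that affected component; recombining local scores into global centrality requires the paper's boundary corrections and is not shown.

```python
import networkx as nx

def affected_component(G, u, v):
    """Insert edge (u, v) and return the subgraph induced by the biconnected
    component that contains it; betweenness recomputation can be confined to
    this component (plus iCentral's correction terms, omitted here)."""
    G.add_edge(u, v)
    comp = next(c for c in nx.biconnected_components(G)
                if u in c and v in c)
    return G.subgraph(comp)

G = nx.karate_club_graph()
sub = affected_component(G, 0, 26)
print(len(sub), "of", len(G), "nodes need reprocessing")
```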

Journal ArticleDOI
TL;DR: The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters, and is more suitable for PGAs than the global and grid model, also in terms of costs when executed on a commercial cloud provider.
Abstract: The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store operations. For this reason, the island model is more suitable for PGAs than the global and grid model, also in terms of costs when executed on a commercial cloud provider.
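
The island model the study found best is straightforward to sketch: independent GA populations evolve in parallel and periodically migrate their best individuals to a neighbor island. A minimal serial simulation (fitness, crossover, mutate, and init_pop are assumed problem-specific callbacks; the MapReduce plumbing is omitted):

```python
import random

def island_ga(init_pop, fitness, crossover, mutate,
              n_islands=4, generations=50, migrate_every=10):
    islands = [list(init_pop()) for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        for pop in islands:
            # one generation: binary tournament selection + variation
            nxt = []
            while len(nxt) < len(pop):
                p1 = max(random.sample(pop, 2), key=fitness)
                p2 = max(random.sample(pop, 2), key=fitness)
                nxt.append(mutate(crossover(p1, p2)))
            pop[:] = nxt
        if gen % migrate_every == 0:
            # ring migration: best of island i-1 replaces worst of island i
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                worst = min(pop, key=fitness)
                pop[pop.index(worst)] = bests[(i - 1) % n_islands]
    return max((max(pop, key=fitness) for pop in islands), key=fitness)
```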

Journal ArticleDOI
TL;DR: A novel computational scheme for reducing the cost of a core operation in computing the CP decomposition with the traditional alternating least squares (CP-ALS) based algorithm is proposed and effectively parallelize this computational scheme in the context of CP-ALS in shared and distributed memory environments.
Abstract: CANDECOMP/PARAFAC (CP) decomposition of sparse tensors has been successfully applied to many problems in web search, graph analytics, recommender systems, health care data analytics, and many other domains. In these applications, efficiently computing the CP decomposition of sparse tensors is essential in order to be able to process and analyze data of massive scale. For this purpose, we investigate an efficient computation of the CP decomposition of sparse tensors and its parallelization. We propose a novel computational scheme for reducing the cost of a core operation in computing the CP decomposition with the traditional alternating least squares (CP-ALS) based algorithm. We then effectively parallelize this computational scheme in the context of CP-ALS in shared and distributed memory environments and propose data and task distribution models for better scalability. We implement parallel CP-ALS algorithms and compare our implementations with an efficient tensor factorization library using tensors from...
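
For reference, one CP-ALS factor update has the structure the paper optimizes: an MTTKRP followed by a solve against a Hadamard product of Gram matrices. A dense numpy sketch for a 3-way tensor; the paper's contribution is performing this core operation cheaply for sparse tensors, which this sketch does not capture.

```python
import numpy as np

def cp_als_step(X, B, C):
    """Update the leading factor of a rank-R CP model of 3-way tensor X:
    A <- MTTKRP(X; B, C) @ pinv((B^T B) * (C^T C)),  * = Hadamard product."""
    mttkrp = np.einsum('ijk,jr,kr->ir', X, B, C)   # mode-0 MTTKRP
    gram = (B.T @ B) * (C.T @ C)
    return mttkrp @ np.linalg.pinv(gram)           # pinv for robustness

# one full ALS sweep cycles the update over all three factors:
# A = cp_als_step(X, B, C)
# B = cp_als_step(X.transpose(1, 0, 2), A, C)
# C = cp_als_step(X.transpose(2, 0, 1), A, B)
```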

Journal ArticleDOI
TL;DR: Results show that the designed CGPUGA algorithm provides rules of higher quality compared to the state-of-the-art NIGGAR, MSP-MPSO and MPGA algorithms for diversified association rule mining.

Journal ArticleDOI
01 Oct 2018
TL;DR: This paper develops parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures and develops a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem.
Abstract: Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
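
To make the accumulator question concrete, here is a sketch of the classic Gustavson-style row-by-row SpGEMM with a hash-map accumulator, the baseline data structure such kernels compare against (dense arrays and sorted lists being the usual alternatives):

```python
def spgemm_row(A_row, B_rows):
    """Multiply one sparse row of A against sparse matrix B.
    A_row: list of (k, a_ik); B_rows: dict k -> list of (j, b_kj).
    The dict 'acc' is the accumulator that merges partial products;
    its choice dominates cache behavior and parallel performance."""
    acc = {}
    for k, a in A_row:
        for j, b in B_rows.get(k, ()):
            acc[j] = acc.get(j, 0.0) + a * b
    return sorted(acc.items())

def spgemm(A_rows, B_rows):
    # C = A @ B, computed row by row (rows are independent, hence parallel)
    return {i: spgemm_row(row, B_rows) for i, row in A_rows.items()}
```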

Journal ArticleDOI
TL;DR: A system architecture is presented that enhances traditional MapReduce by incorporating a parallel processing algorithm; a complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm.
Abstract: The growing gap between users and Big Data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity; analyzing such massive volumes of data with conventional tools is computationally inefficient. Moreover, advancements in Big Data applications and data science pose additional challenges, for which High-Performance Computing has become a key enabler and has attracted attention in recent years. However, existing systems are either memoryless or computationally inefficient. There is therefore a need for a system that can efficiently analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that enhances the working of traditional MapReduce by incorporating a parallel processing algorithm. Moreover, a complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm. The proposed architecture optimizes both read and write operations, enhancing the efficiency of Input/Output. To check the efficiency of the proposed algorithms, we have implemented the proposed system using Hadoop and MapReduce. MapReduce is supported by a parallel algorithm that efficiently processes a huge volume of data sets. The system is implemented using MapReduce on top of Hadoop parallel nodes to generate and process graphs in near real-time. Moreover, the system is evaluated in terms of efficiency by considering system throughput and processing time. The results show that the proposed system is more scalable and efficient.

Journal ArticleDOI
TL;DR: It turns out that the derived algorithm is a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.
Abstract: We consider the problem of nonnegative tensor factorization. Our aim is to derive an efficient algorithm that is also suitable for parallel implementation. We adopt the alternating optimization framework and solve each matrix nonnegative least-squares problem via a Nesterov-type algorithm for strongly convex problems. We describe a parallel implementation of the algorithm and measure the attained speedup in a multicore computing environment. It turns out that the derived algorithm is a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.
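
Each subproblem in the alternating scheme is a nonnegative least-squares solve. A Nesterov-type projected-gradient sketch for min ||Ax - b||^2 subject to x >= 0, using the fixed momentum that strong convexity permits; it assumes A^T A is positive definite and is illustrative only, not the paper's exact method.

```python
import numpy as np

def nesterov_nnls(A, b, iters=500):
    """Accelerated projected gradient for min ||Ax - b||^2, x >= 0.
    With a mu-strongly convex, L-smooth objective the momentum can be
    fixed at (sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu))."""
    G = A.T @ A
    g = A.T @ b
    eigs = np.linalg.eigvalsh(G)           # ascending eigenvalues
    L, mu = eigs[-1], max(eigs[0], 1e-12)
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x = np.zeros(A.shape[1])
    y = x.copy()
    for _ in range(iters):
        x_new = np.maximum(y - (G @ y - g) / L, 0.0)   # gradient step + project
        y = x_new + beta * (x_new - x)                 # momentum extrapolation
        x = x_new
    return x
```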

Journal ArticleDOI
TL;DR: This paper describes the parallel algorithm of the NOISEtte code for computational fluid dynamics and aeroacoustics simulations based on a family of higher-accuracy numerical schemes for unstructured hybrid meshes and the multilevel MPI + OpenMP parallelization.
Abstract: This paper describes the parallel algorithm of the NOISEtte code for computational fluid dynamics and aeroacoustics simulations. It is based on a family of higher-accuracy numerical schemes for unstructured hybrid meshes. The multilevel MPI + OpenMP parallelization is described in detail. Performance results are presented for various supercomputers and applications.

Journal ArticleDOI
TL;DR: This paper proposes a fast GPU-based frequent itemset mining method called GMiner, which achieves very fast performance by fully exploiting the computational power of GPUs and is suitable for large-scale data.

Journal ArticleDOI
TL;DR: This paper discusses the serial algorithms of dynamic, quasistatic, and hybrid CAC, along with some programming techniques used in the code, and illustrates the parallel algorithm, quantify the parallel scalability, and discuss some software specifications of PyCAC.
Abstract: We present a novel distributed-memory parallel implementation of the concurrent atomistic-continuum (CAC) method. Written mostly in Fortran 2008 and wrapped with a Python scripting interface, the CAC simulator in PyCAC runs in parallel using Message Passing Interface with a spatial decomposition algorithm. Built upon the underlying Fortran code, the Python interface provides a robust and versatile way for users to build system configurations, run CAC simulations, and analyze results. In this paper, following a brief introduction to the theoretical background of the CAC method, we discuss the serial algorithms of dynamic, quasistatic, and hybrid CAC, along with some programming techniques used in the code. We then illustrate the parallel algorithm, quantify the parallel scalability, and discuss some software specifications of PyCAC; more information can be found in the PyCAC user's manual that is hosted on http://www.pycac.org.

Journal ArticleDOI
TL;DR: A parallel algorithm for network traffic anomaly detection based on Isolation Forest and Spark is proposed, which exploits the big data processing capability of Spark and alleviates the computation bottleneck of a single machine.
Abstract: With the rapid development of large-scale complex networks and proliferation of various social network applications, the amount of network traffic data generated is increasing tremendously, and eff...
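
The abstract is truncated, so the paper's exact scheme is unknown; the sketch below shows only one plausible way to pair Isolation Forest with Spark — scoring partitions of traffic records in parallel, one forest per partition — and is not necessarily the authors' design. `records` is a hypothetical RDD of numeric feature vectors.

```python
from pyspark import SparkContext
from sklearn.ensemble import IsolationForest
import numpy as np

def score_partition(rows):
    # fit a local forest per partition (an assumption; a global model
    # would instead be broadcast to the executors)
    X = np.array(list(rows))
    if X.size == 0:
        return iter([])
    clf = IsolationForest(n_estimators=100, random_state=0).fit(X)
    # lower decision_function values = more anomalous
    return iter(clf.decision_function(X).tolist())

sc = SparkContext(appName="traffic-anomaly-sketch")
records = sc.parallelize([[0.1, 3.2], [0.2, 3.1], [9.5, 0.4]], numSlices=2)
scores = records.mapPartitions(score_partition).collect()
```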

Proceedings ArticleDOI
21 May 2018
TL;DR: Afforest is proposed: an extension of the Shiloach-Vishkin connected components algorithm that approaches optimal work efficiency by processing subgraphs in each iteration, and it is shown that the algorithm exhibits higher memory locality than existing methods.
Abstract: Connected component identification is a fundamental problem in graph analytics, serving as a basis for subsequent computations in a wide range of applications. To determine connectivity, several parallel algorithms, whose complexity is proportional to the number of edges or graph diameter, have been proposed. However, an optimal algorithm may extract graph components by working proportionally to the number of vertices, which can be orders of magnitude lower than the number of edges. We propose Afforest: an extension of the Shiloach-Vishkin connected components algorithm that approaches optimal work efficiency by processing subgraphs in each iteration. We prove the convergence of the algorithm, analyze its work efficiency characteristics, and provide further techniques to speed up processing graphs containing a huge component. Designed with modern parallel architectures in mind, we show that the algorithm exhibits higher memory locality than existing methods. Using both synthetic and real-world graphs, we demonstrate that Afforest achieves speedups of up to 67x over the state-of-the-art on multi-core CPUs (Broadwell, POWER8) and up to 23x on GPUs (Pascal).
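
The Shiloach–Vishkin skeleton that Afforest extends alternates two steps, hooking and shortcutting, both of which parallelize over edges and vertices. A serial sketch of that skeleton, without Afforest's neighborhood sampling and largest-component skipping:

```python
def sv_components(n, edges):
    """Shiloach-Vishkin style connected components on vertices 0..n-1.
    parent[] encodes a forest; roots are component representatives."""
    parent = list(range(n))
    while True:
        changed = False
        # hooking: attach the larger of two distinct roots beneath the smaller
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:          # only roots may be hooked
                    parent[hi] = lo
                    changed = True
        # shortcutting: pointer jumping until every tree is a star
        for i in range(n):
            while parent[i] != parent[parent[i]]:
                parent[i] = parent[parent[i]]
        if not changed:
            return parent

# sv_components(5, [(0, 1), (1, 2), (3, 4)]) -> [0, 0, 0, 3, 3]
```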

Proceedings ArticleDOI
21 May 2018
TL;DR: In this paper, the authors show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation, and also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal.
Abstract: The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.
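
Concretely, the mode-1 MTTKRP of a 3-way tensor computes M[i,r] = sum over j,k of X[i,j,k]·B[j,r]·C[k,r]. The sketch below contrasts the "cast as matrix multiplication" formulation, which materializes a tall Khatri-Rao matrix, with a direct contraction that avoids it — the kind of structural saving the paper formalizes for communication:

```python
import numpy as np
from scipy.linalg import khatri_rao

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((I, J, K))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# naive: matricize X and multiply by the explicit (J*K x R) Khatri-Rao product
M_naive = X.reshape(I, J * K) @ khatri_rao(B, C)

# structured: contract directly, never forming the Khatri-Rao matrix
M_direct = np.einsum('ijk,jr,kr->ir', X, B, C)

assert np.allclose(M_naive, M_direct)
```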

Journal ArticleDOI
TL;DR: SSiGraM (Spark based Single Graph Mining), a Spark based parallel frequent subgraph mining algorithm in a single large graph that outperforms the existing GraMi (Graph Mining) algorithm by an order of magnitude for all datasets and can work with a lower support threshold.
Abstract: Frequent subgraph mining (FSM) plays an important role in graph mining, attracting a great deal of attention in many areas, such as bioinformatics, web data mining and social networks. In this paper, we propose SSiGraM (Spark based Single Graph Mining), a Spark based parallel frequent subgraph mining algorithm for a single large graph. To address the two computational challenges of FSM, we conduct the subgraph extension and support evaluation in parallel across all the distributed cluster worker nodes. In addition, we also employ a heuristic search strategy and three novel optimizations: load balancing, pre-search pruning and top-down pruning in the support evaluation process, which significantly improve the performance. Extensive experiments with four different real-world datasets demonstrate that the proposed algorithm outperforms the existing GraMi (Graph Mining) algorithm by an order of magnitude for all datasets and can work with a lower support threshold.

Journal ArticleDOI
TL;DR: In this paper, a parallel algorithm for multivariate Radial Basis Function Partition of Unity Method (RBF-PUM) interpolation is presented, which makes use of shared-memory parallel processors through the OpenCL standard.
Abstract: We present a parallel algorithm for multivariate Radial Basis Function Partition of Unity Method (RBF-PUM) interpolation. The concurrent nature of the RBF-PUM enables designing parallel algorithms for dealing with a large number of scattered data-points in high space dimensions. To efficiently exploit this concurrency, our algorithm makes use of shared-memory parallel processors through the OpenCL standard. This efficiency is achieved by a parallel space partitioning strategy with linear computational time complexity with respect to the input and evaluation points. The speed of our approach allows for computationally more intensive construction of the interpolant. In fact, the RBF-PUM can be coupled with a cross-validation technique that searches for optimal values of the shape parameters associated with each local RBF interpolant, thus reducing the global interpolation error. The numerical experiments support our claims by illustrating the interpolation errors and the running times of our algorithm.
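
A compact serial numpy sketch of the RBF-PUM structure: local Gaussian RBF interpolants on overlapping patches, blended with normalized compactly supported weights. The paper's OpenCL parallelism and shape-parameter cross-validation are not reproduced, and the kernel choice and patch layout here are assumptions.

```python
import numpy as np

def gauss(r, eps=2.0):
    return np.exp(-(eps * r) ** 2)

def wendland_c2(r):
    # compactly supported weight, nonzero only on [0, 1)
    return np.maximum(1.0 - r, 0.0) ** 4 * (4.0 * r + 1.0)

def rbf_pum(x, f, centers, rho, xe):
    """x: (N,d) data points, f: (N,) values, centers: (M,d) patch centers,
    rho: patch radius, xe: (E,d) evaluation points."""
    num = np.zeros(len(xe))
    den = np.zeros(len(xe))
    for c in centers:
        inside = np.linalg.norm(x - c, axis=1) < rho
        de = np.linalg.norm(xe - c, axis=1)
        emask = de < rho
        if not inside.any() or not emask.any():
            continue
        xi, fi = x[inside], f[inside]
        # local RBF interpolant on this patch (tiny ridge for conditioning)
        A = gauss(np.linalg.norm(xi[:, None] - xi[None, :], axis=2))
        A += 1e-10 * np.eye(len(A))
        coef = np.linalg.solve(A, fi)
        E = gauss(np.linalg.norm(xe[emask][:, None] - xi[None, :], axis=2))
        w = wendland_c2(de[emask] / rho)        # partition-of-unity weight
        num[emask] += w * (E @ coef)
        den[emask] += w
    return num / np.where(den == 0, 1.0, den)  # Shepard normalization
```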

Proceedings ArticleDOI
11 Jul 2018
TL;DR: This paper designs parallel write-efficient geometric algorithms that perform asymptotically fewer writes than standard algorithms for the same problem, and introduces several techniques for obtaining write-efficiency, including DAG tracing, prefix doubling, and α-labeling.
Abstract: In this paper, we design parallel write-efficient geometric algorithms that perform asymptotically fewer writes than standard algorithms for the same problem. This is motivated by emerging non-volatile memory technologies with read performance being close to that of random access memory but writes being significantly more expensive in terms of energy and latency. We design algorithms for planar Delaunay triangulation, k-d trees, and static and dynamic augmented trees. Our algorithms are designed in the recently introduced Asymmetric Nested-Parallel Model, which captures the parallel setting in which there is a small symmetric memory where reads and writes are unit cost as well as a large asymmetric memory where writes are ω times more expensive than reads. In designing these algorithms, we introduce several techniques for obtaining write-efficiency, including DAG tracing, prefix doubling, and α-labeling, which we believe will be useful for designing other parallel write-efficient algorithms.