
Showing papers on "Parallel algorithm published in 2018"


Book
05 Feb 2018
TL;DR: A linear time and space preprocessing algorithm that enables each lowest-common-ancestor query to be answered in $O(1)$ time, matching Harel and Tarjan, with the advantage of being simple and easily parallelizable.
Abstract: We consider the following problem. Suppose a rooted tree T is available for preprocessing. Answer on-line queries requesting the lowest common ancestor for any pair of vertices in T. We present a linear time and space preprocessing algorithm that enables us to answer each query in $O(1)$ time, as in Harel and Tarjan [SIAM J. Comput., 13 (1984), pp. 338–355]. Our algorithm has the advantage of being simple and easily parallelizable. The resulting parallel preprocessing algorithm runs in logarithmic time using an optimal number of processors on an EREW PRAM. Each query is then answered in $O(1)$ time using a single processor.
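
The constant-time query scheme is easiest to see through the standard Euler-tour reduction to range-minimum queries. The sketch below is a simplified variant using a sparse table, which costs O(n log n) preprocessing rather than the paper's linear bound; it is illustrative only, not the paper's scheme.

```python
def preprocess_lca(tree, root):
    """tree: adjacency dict. Build Euler tour + sparse table for RMQ."""
    euler, depth, first = [], [], {}
    def dfs(v, parent, d):
        first[v] = len(euler)
        euler.append(v); depth.append(d)
        for u in tree[v]:
            if u != parent:
                dfs(u, v, d + 1)
                euler.append(v); depth.append(d)
    dfs(root, None, 0)
    n = len(euler)
    # sparse[j][i] = index of the minimum depth in euler[i : i + 2**j]
    sparse = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, row = sparse[-1], []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + (1 << (j - 1))]
            row.append(a if depth[a] <= depth[b] else b)
        sparse.append(row)
        j += 1
    return euler, depth, first, sparse

def lca(euler, depth, first, sparse, u, v):
    l, r = sorted((first[u], first[v]))
    j = (r - l + 1).bit_length() - 1
    a, b = sparse[j][l], sparse[j][r - (1 << j) + 1]
    return euler[a if depth[a] <= depth[b] else b]

# tree = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1]}
# e, d, f, s = preprocess_lca(tree, 0); lca(e, d, f, s, 3, 4)  -> 1
```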

549 citations


Journal ArticleDOI
01 Mar 2018
TL;DR: The experimental results show that PACO-3Opt is more efficient and reliable than the other algorithms and can reach the global optimum.
Abstract: This article presents a parallel cooperative hybrid algorithm for solving the traveling salesman problem (TSP). Although heuristic approaches and hybrid methods obtain good results on the TSP, they cannot reliably avoid getting stuck in local optima, and their processing times are often long. To overcome these deficiencies, we propose the parallel cooperative hybrid algorithm PACO-3Opt, based on ant colony optimization. This method uses the 3-Opt algorithm to escape local minima. PACO-3Opt has multiple colonies and a master–slave paradigm. Each colony runs ACO to generate solutions. After a predefined number of iterations, each colony first runs 3-Opt to improve its solutions and then shares the best tour with the other colonies. This process continues until the termination criterion is met; thus, the algorithm can reach the global optimum. PACO-3Opt was compared with previous algorithms from the literature. The experimental results show that PACO-3Opt is more efficient and reliable than the other algorithms.
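
A minimal sketch of the cooperation pattern described above: several colonies improve tours independently and periodically exchange their best tour. For brevity, the colonies run sequentially here, and a random-restart constructor with 2-opt local search stands in for the paper's ACO + 3-Opt combination; `dist` is an assumed distance matrix.

```python
import random

def tour_len(t, dist):
    return sum(dist[t[i]][t[(i + 1) % len(t)]] for i in range(len(t)))

def two_opt(t, dist):
    # cheap local search standing in for the paper's ACO + 3-Opt phase
    improved = True
    while improved:
        improved = False
        for i in range(1, len(t) - 1):
            for j in range(i + 2, len(t) + 1):
                cand = t[:i] + t[i:j][::-1] + t[j:]
                if tour_len(cand, dist) < tour_len(t, dist):
                    t, improved = cand, True
    return t

def perturb(t):
    i, j = random.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t

def cooperative_search(dist, n_colonies=4, epochs=5):
    n = len(dist)
    colonies = [random.sample(range(n), n) for _ in range(n_colonies)]
    best = min(colonies, key=lambda t: tour_len(t, dist))
    for _ in range(epochs):
        colonies = [two_opt(t, dist) for t in colonies]      # improve locally
        best = min(colonies + [best], key=lambda t: tour_len(t, dist))
        colonies = [perturb(best[:]) for _ in colonies]      # share best tour
    return best, tour_len(best, dist)
```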

118 citations


Proceedings ArticleDOI
23 Apr 2018
TL;DR: This work revisits the iconic algorithm of Chiba and Nishizeki and develops the most efficient parallel algorithm for listing all k-cliques in graphs containing up to tens of millions of edges; it is faster than state-of-the-art algorithms while boasting an excellent degree of parallelism.
Abstract: Motivated by recent studies in the data mining community which require efficiently listing all k-cliques, we revisit the iconic algorithm of Chiba and Nishizeki and develop the most efficient parallel algorithm for this problem. Our theoretical analysis provides the best asymptotic upper bound on the running time of our algorithm for the case when the input graph is sparse. Our experimental evaluation on large real-world graphs shows that our parallel algorithm is faster than state-of-the-art algorithms, while boasting an excellent degree of parallelism. In particular, we are able to list all k-cliques (for any k) in graphs containing up to tens of millions of edges, as well as all 10-cliques in graphs containing billions of edges, within a few minutes and a few hours respectively. Finally, we show how our algorithm can be employed as an effective subroutine for finding the k-clique core decomposition and an approximate k-clique densest subgraph in very large real-world graphs.
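
The core of the Chiba–Nishizeki approach is to orient edges along a vertex ordering and recurse on intersections of out-neighborhoods. A compact sequential sketch (degree order stands in for the core ordering, and the parallelism is omitted):

```python
def list_k_cliques(adj, k):
    """adj: dict mapping vertex -> set of neighbors (undirected graph)."""
    order = sorted(adj, key=lambda v: len(adj[v]))   # degree order as a proxy
    rank = {v: i for i, v in enumerate(order)}
    # orient every edge from lower to higher rank: avoids duplicate listings
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    result = []
    def expand(prefix, cand):
        if len(prefix) == k:
            result.append(tuple(prefix))
            return
        for v in list(cand):
            expand(prefix + [v], cand & out[v])      # shrink the candidate set
    for v in adj:
        expand([v], out[v])
    return result
```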

109 citations


Proceedings ArticleDOI
20 Jun 2018
TL;DR: The result implies that in the realizable case, where there is a true underlying function generating the data, Θ(log n) batches of adaptive samples are necessary and sufficient to approximately “learn to optimize” a monotone submodular function under a cardinality constraint.
Abstract: In this paper we study the adaptive complexity of submodular optimization. Informally, the adaptive complexity of a problem is the minimal number of sequential rounds required to achieve a constant factor approximation when polynomially-many queries can be executed in parallel at each round. Adaptivity is a fundamental concept that is heavily studied in computer science, largely due to the need for parallelizing computation. Somewhat surprisingly, very little is known about adaptivity in submodular optimization. For the canonical problem of maximizing a monotone submodular function under a cardinality constraint, to the best of our knowledge, all that is known to date is that the adaptive complexity is between 1 and Ω(n). Our main result in this paper is a tight characterization showing that the adaptive complexity of maximizing a monotone submodular function under a cardinality constraint is Θ(log n): - We describe an algorithm which requires O(log n) sequential rounds and achieves an approximation that is arbitrarily close to 1/3; - We show that no algorithm can achieve an approximation better than O(1 / log n) with fewer than O(log n / log log n) rounds. Thus, when allowing for parallelization, our algorithm achieves a constant factor approximation exponentially faster than any known existing algorithm for submodular maximization. Importantly, the approximation algorithm is achieved via adaptive sampling and complements a recent line of work on optimization of functions learned from data. In many cases we do not know the functions we optimize and learn them from labeled samples. Recent results show that no algorithm can obtain a constant factor approximation guarantee using polynomially-many labeled samples as in the PAC and PMAC models, drawn from any distribution. Since learning with non-adaptive samples over any distribution results in a sharp impossibility, we consider learning with adaptive samples where the learner obtains poly(n) samples drawn from a distribution of her choice in every round. Our result implies that in the realizable case, where there is a true underlying function generating the data, Θ(log n) batches of adaptive samples are necessary and sufficient to approximately “learn to optimize” a monotone submodular function under a cardinality constraint.

108 citations


Proceedings ArticleDOI
11 Jul 2018
TL;DR: It is shown that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes.
Abstract: There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the studies that do use the Hyperlink graph rely on distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 13 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We provide a publicly-available benchmark suite containing our implementations.

100 citations


Proceedings ArticleDOI
20 Jun 2018
TL;DR: The O(log n) round complexity bound for approximate maximum matching is broken even with slightly sublinear memory per machine, and the improvement is almost exponential: a (2+ε)-approximate maximum matching can be computed in O((log log n)^2) MPC rounds.
Abstract: For over a decade now we have been witnessing the success of massive parallel computation (MPC) frameworks, such as MapReduce, Hadoop, Dryad, or Spark. One of the reasons for their success is the fact that these frameworks are able to accurately capture the nature of large-scale computation. In particular, compared to the classic distributed algorithms or PRAM models, these frameworks allow for much more local computation. The fundamental question that arises in this context is though: can we leverage this additional power to obtain even faster parallel algorithms? A prominent example here is the maximum matching problem—one of the most classic graph problems. It is well known that in the PRAM model one can compute a 2-approximate maximum matching in O(log n) rounds. However, the exact complexity of this problem in the MPC framework is still far from understood. Lattanzi et al. (SPAA 2011) showed that if each machine has n^{1+Ω(1)} memory, this problem can also be solved 2-approximately in a constant number of rounds. These techniques, as well as the approaches developed in the follow-up work, seem though to get stuck in a fundamental way at roughly O(log n) rounds once we enter the (at most) near-linear memory regime. It is thus entirely possible that in this regime, which captures in particular the case of sparse graph computations, the best MPC round complexity matches what one can already get in the PRAM model, without the need to take advantage of the extra local computation power. In this paper, we finally refute that possibility. That is, we break the above O(log n) round complexity bound even in the case of slightly sublinear memory per machine. In fact, our improvement here is almost exponential: we are able to deliver a (2+ε)-approximate maximum matching, for any fixed constant ε > 0, in O((log log n)^2) rounds. To establish our result we need to deviate from the previous work in two important ways that are crucial for exploiting the power of the MPC model, as compared to the PRAM model. Firstly, we use vertex-based graph partitioning, instead of the edge-based approaches that were utilized so far. Secondly, we develop a technique of round compression. This technique enables one to take a (distributed) algorithm that computes an O(1)-approximation of maximum matching in O(log n) independent PRAM phases and implement a super-constant number of these phases in only a constant number of MPC rounds.

90 citations


Proceedings ArticleDOI
21 May 2018
TL;DR: The design of a distributed-memory implementation of the Louvain algorithm for parallel community detection is presented; it begins with an arbitrarily partitioned distributed graph input and employs several heuristics to speed up the computation of the different steps of the Louvain algorithm.
Abstract: In most real-world networks, the nodes/vertices tend to be organized into tightly-knit modules known as communities or clusters, such that nodes within a community are more likely to be "related" to one another than they are to the rest of the network. The goodness of partitioning into communities is typically measured using a well-known measure called modularity. However, modularity optimization is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, iterative heuristic for modularity optimization, called the Louvain method. Owing to its speed and ability to yield high quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection. In this paper, we present the design of a distributed memory implementation of the Louvain algorithm for parallel community detection. Our approach begins with an arbitrarily partitioned distributed graph input, and employs several heuristics to speed up the computation of the different steps of the Louvain algorithm. We evaluate our implementation and its different variants using real-world networks from various application domains (including the internet, biology, and social networks). Our MPI+OpenMP implementation yields about 7x speedup (on 4K processes) for the soc-friendster network (1.8B edges) over a state-of-the-art shared memory multicore implementation (on 64 threads), without compromising output quality. Furthermore, our distributed implementation was able to process a larger graph (uk-2007; 3.3B edges) in 32 seconds on 1K cores (64 nodes) of NERSC Cori, whereas the state-of-the-art shared memory implementation failed to run due to insufficient memory on a single Cori node containing 128 GB of memory.
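
The heart of Louvain, which the paper distributes, is the local-move phase: each vertex greedily joins the neighboring community with the best modularity gain. A serial, unweighted sketch of one pass (the distributed version must additionally reconcile moves made concurrently on different ranks):

```python
from collections import defaultdict

def louvain_local_pass(adj, comm):
    """adj: vertex -> set of neighbors; comm: vertex -> community id.
    One pass of greedy modularity-gain moves; returns True if anything moved."""
    deg = {v: len(adj[v]) for v in adj}
    m = sum(deg.values()) / 2.0                     # number of edges
    sigma = defaultdict(float)                      # total degree per community
    for v in adj:
        sigma[comm[v]] += deg[v]
    moved = False
    for v in adj:
        home = comm[v]
        sigma[home] -= deg[v]                       # take v out of its community
        links = defaultdict(int)                    # edges from v to each community
        for u in adj[v]:
            links[comm[u]] += 1
        # standard gain of inserting v into community c (constant terms dropped)
        def gain(c):
            return links.get(c, 0) / m - sigma[c] * deg[v] / (2.0 * m * m)
        best = max(set(links) | {home}, key=gain)
        comm[v] = best
        sigma[best] += deg[v]
        moved |= best != home
    return moved
```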

83 citations


Journal ArticleDOI
TL;DR: A very general parallel algorithm is proposed that allows large spin–lattice systems to be efficiently simulated on large numbers of processors without degrading the method's mathematical accuracy.

81 citations


Proceedings ArticleDOI
11 Jul 2018
TL;DR: In this paper, the authors propose a new integrated method of exploiting model, batch and domain parallelism for the training of deep neural networks (DNNs) on large distributed-memory computers using minibatch stochastic gradient descent (SGD).
Abstract: We propose a new integrated method of exploiting model, batch and domain parallelism for the training of deep neural networks (DNNs) on large distributed-memory computers using minibatch stochastic gradient descent (SGD). Our goal is to find an efficient parallelization strategy for a fixed batch size using P processes. Our method is inspired by the communication-avoiding algorithms in numerical linear algebra. We see P processes as logically divided into a P_r × P_c grid where the P_r dimension is implicitly responsible for model/domain parallelism and the P_c dimension is implicitly responsible for batch parallelism. In practice, the integrated matrix-based parallel algorithm encapsulates these types of parallelism automatically. We analyze the communication complexity and analytically demonstrate that the lowest communication costs are often achieved neither with pure model nor with pure data parallelism. We also show how the domain parallel approach can help in extending the theoretical scaling limit of the typical batch parallel method.
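
The process organization can be pictured with a small sketch (hypothetical helpers, not the authors' code) mapping a flat MPI-style rank onto a P_r × P_c grid, with the P_r dimension handling model/domain parallelism and the P_c dimension handling batch parallelism:

```python
def grid_position(rank, p_r, p_c):
    """Map a flat rank in [0, p_r * p_c) to (row, col) on the process grid."""
    assert 0 <= rank < p_r * p_c
    return rank // p_c, rank % p_c

def model_group(col, p_r, p_c):
    # ranks along the P_r dimension: they partition the model/domain
    return [r * p_c + col for r in range(p_r)]

def batch_group(row, p_r, p_c):
    # ranks along the P_c dimension: each processes a slice of the minibatch
    return [row * p_c + c for c in range(p_c)]

# e.g. P = 8 processes as a 2 x 4 grid: rank 6 sits at row 1, col 2;
# it splits the model with rank 2 (same column) and the batch with
# ranks 4, 5, 7 (same row).
```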

77 citations


Journal ArticleDOI
TL;DR: A parallel repetitive nearest neighbor algorithm for solving the symmetric traveling salesman problem on OTIS-Hypercube and OTIS-Mesh optoelectronic architectures is presented; it attained near-linear speedup and high efficiency on the two selected optoelectronic architectures.
Abstract: Over the past years, researchers have turned their attention to proposing optoelectronic architectures, including optical transpose interconnection system (OTIS) networks. On the other hand, there have been limited attempts devoted to designing parallel algorithms for applications that could be mapped on such optoelectronic architectures. Thus, exploiting the attractive features of OTIS networks and investigating their performance in solving combinatorial optimization problems becomes a great necessity. In this paper, a parallel repetitive nearest neighbor algorithm for solving the symmetric traveling salesman problem on OTIS-Hypercube and OTIS-Mesh optoelectronic architectures is presented. This algorithm has been evaluated analytically and by simulation on both optoelectronic architectures in terms of number of communication steps, parallel run time, speedup, efficiency, cost and communication cost. The simulation results attained near-linear speedup and high efficiency on the two selected optoelectronic architectures, with OTIS-Hypercube gaining better results in comparison with OTIS-Mesh.
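
Repetitive nearest neighbor simply runs the greedy nearest-neighbor construction from every start city and keeps the best tour; that outer loop is what gets distributed over the OTIS nodes. A shared-memory sketch using Python's multiprocessing in place of the optoelectronic architecture:

```python
from multiprocessing import Pool

DIST = None          # distance matrix, installed once per worker process

def _init(dist):
    global DIST
    DIST = dist

def nn_tour(start):
    """Greedy nearest-neighbor tour from one start city."""
    n = len(DIST)
    tour, seen = [start], {start}
    while len(tour) < n:
        cur = tour[-1]
        nxt = min((c for c in range(n) if c not in seen),
                  key=lambda c: DIST[cur][c])
        tour.append(nxt); seen.add(nxt)
    cost = sum(DIST[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return cost, tour

def repetitive_nn(dist, workers=4):
    # one nearest-neighbor construction per start city, evaluated in parallel
    with Pool(workers, initializer=_init, initargs=(dist,)) as pool:
        return min(pool.map(nn_tour, range(len(dist))))
```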

77 citations


Book
06 Feb 2018
TL;DR: Given a general arithmetic expression, a computation binary tree representation is found in O(log n) time using n/log n processors on a concurrent-read, exclusive-write, parallel random-access machine.
Abstract: Given a general arithmetic expression, we find a computation binary tree representation in O(log n) time using n/log n processors on a concurrent-read, exclusive-write, parallel random-access machine. A new algorithm is introduced for this purpose. Unlike previous serial and parallel solutions, it is not based on using a stack.
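
The abstract contrasts the new parallel method with stack-based approaches. For reference, here is the conventional serial, stack-based (shunting-yard) construction that the paper's algorithm avoids:

```python
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def build_tree(tokens):
    """Shunting-yard construction of a binary expression tree.
    Internal nodes are tuples (op, left, right); leaves are the tokens."""
    out, ops = [], []
    def reduce_top():
        op = ops.pop()
        right, left = out.pop(), out.pop()
        out.append((op, left, right))
    for t in tokens:
        if t == '(':
            ops.append(t)
        elif t == ')':
            while ops[-1] != '(':
                reduce_top()
            ops.pop()
        elif t in PREC:
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[t]:
                reduce_top()
            ops.append(t)
        else:                      # operand
            out.append(t)
    while ops:
        reduce_top()
    return out[0]

# build_tree("a + b * ( c - d )".split())
#   -> ('+', 'a', ('*', 'b', ('-', 'c', 'd')))
```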

Journal ArticleDOI
TL;DR: This pilot study includes several innovative points, and the proposed pSSO is the first parallel algorithm to solve the RAP and the first to parallelize simplified swarm optimization (SSO) with the Taguchi method.
Abstract: In recent years, various smart sensor systems have been integrated into the "Internet of Things" (IoT) with the advancement of sensing technology. Redundancy allocation is the safest, most convenient, and most economical way to increase the reliability of smart sensor systems. To solve the smart sensor system redundancy allocation problem (RAP) in the IoT, a cooperative parallel simplified swarm optimization algorithm (pSSO) is presented in this study. This pilot study includes several innovative points. First, research is conducted on using the RAP in the IoT. Second, the proposed pSSO is the first parallel algorithm to solve the RAP and the first to parallelize simplified swarm optimization (SSO) with the Taguchi method. A simple real-life example regarding shopping and shipping on TAOBAO is given to describe how to model the IoT using the RAP. As proof of the success of the proposed pSSO, detailed computational results from solving a series-parallel redundancy allocation problem with a mix of components are presented. The computational results reflect the efficiency of the proposed pSSO.
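
For context, the update step of simplified swarm optimization (SSO), the algorithm being parallelized, is a per-dimension stochastic choice among the global best, the personal best, the current value, and a random value. A hedged sketch: the thresholds cg, cp, cw are the usual cumulative SSO parameters, a continuous search box is assumed (the RAP itself uses discrete component choices), and this is not the paper's parallel scheme.

```python
import random

def sso_update(x, pbest, gbest, lo, hi, cg=0.4, cp=0.7, cw=0.9):
    """One SSO position update. For each dimension, draw r ~ U(0,1) and copy
    from gbest (r < cg), pbest (r < cp), the current position (r < cw),
    or a fresh uniform random value otherwise."""
    new = []
    for d in range(len(x)):
        r = random.random()
        if r < cg:
            new.append(gbest[d])
        elif r < cp:
            new.append(pbest[d])
        elif r < cw:
            new.append(x[d])
        else:
            new.append(random.uniform(lo[d], hi[d]))
    return new
```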

Journal ArticleDOI
TL;DR: A class of reduced summaries is introduced, characterized by approximate graph pattern matching, that are capable of summarizing entities in terms of their neighborhood similarity up to a certain hop, using small and informative graph patterns.
Abstract: Querying heterogeneous and large-scale knowledge graphs is expensive. This paper studies a graph summarization framework to facilitate knowledge graph search. (1) We introduce a class of reduced summaries. Characterized by approximate graph pattern matching, these summaries are capable of summarizing entities in terms of their neighborhood similarity up to a certain hop, using small and informative graph patterns. (2) We study a diversified graph summarization problem. Given a knowledge graph, it is to discover top-$k$ summaries that maximize a bi-criteria function, characterized by both informativeness and diversity. We show that diversified summarization is feasible for large graphs, by developing both sequential and parallel summarization algorithms. (a) We show that there exists a 2-approximation algorithm to discover diversified summaries. We further develop an anytime sequential algorithm which discovers summaries under resource constraints. (b) We present a new parallel algorithm with quality guarantees. The algorithm is parallel scalable, which ensures its feasibility in distributed graphs. (3) We also develop a summary-based query evaluation scheme, which only refers to a small number of summaries. Using real-world knowledge graphs, we experimentally verify the effectiveness and efficiency of our summarization algorithms, and query processing using summaries.
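
Bi-criteria diversified selection of this kind is typically approached greedily: repeatedly add the candidate whose informativeness-plus-diversity gain is largest. A generic sketch, not the paper's 2-approximation; `info` and `dist` are assumed scoring callbacks and `lam` trades off the two criteria.

```python
def diversified_top_k(candidates, k, info, dist, lam=0.5):
    """Greedy bi-criteria selection: balances each candidate's own
    informativeness against its distance to the summaries already chosen.
    info(x) -> float; dist(x, y) -> float (larger = more different)."""
    chosen = []
    while len(chosen) < k and len(chosen) < len(candidates):
        remaining = [x for x in candidates if x not in chosen]
        best = max(remaining,
                   key=lambda x: (1 - lam) * info(x)
                                 + lam * sum(dist(x, y) for y in chosen))
        chosen.append(best)
    return chosen
```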

Journal ArticleDOI
TL;DR: A 3-D uncertain coverage model is proposed that uses a modified3-D sensing model and an uncertain fusion operator and a distributed parallel cooperative coevolutionary multiobjective large-scale evolutionary algorithm for maritime applications.
Abstract: Effectively monitoring maritime environments has become a vital problem in maritime applications. Traditional methods are not only expensive and time consuming but also restricted in both time and space. More recently, the concept of an industrial wireless sensor network (IWSN) has become a promising alternative for monitoring next-generation intelligent maritime grids, because IWSNs are cost-effective and easy to deploy. This paper focuses on solving the issue of 3-D IWSN deployment in a 3-D engine room space of a very large crude-oil carrier and also considers numerous power facilities. To address this 3-D IWSN deployment problem for maritime applications, a 3-D uncertain coverage model is proposed that uses a modified 3-D sensing model and an uncertain fusion operator. The deployment problem is converted into a multiobjective optimization problem that simultaneously addresses three objectives: coverage, lifetime, and reliability. Our goal is to achieve extensive coverage, long network lifetime, and high reliability. We also propose a distributed parallel cooperative coevolutionary multiobjective large-scale evolutionary algorithm for maritime applications. We verify the effectiveness of this algorithm through experiments by comparing it with five state-of-the-art algorithms. Numerical results demonstrate that the proposed method performs most effectively both in optimization performance and in minimizing the computation time.
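
A hedged sketch of the two modeling ingredients named above: a probabilistic 3-D sensing model with an uncertain band, and a fusion operator combining several sensors. The exponential decay form and the parameters are assumptions, not the paper's exact model.

```python
import numpy as np

def sensing_prob(d, r, r_e, lam=0.5):
    """Stand-in probabilistic sensing model: detection is certain inside
    r - r_e, impossible beyond r + r_e, and decays exponentially across
    the uncertain band in between (decay shape is an assumption)."""
    if d <= r - r_e:
        return 1.0
    if d >= r + r_e:
        return 0.0
    return float(np.exp(-lam * (d - (r - r_e))))

def fused_coverage(point, sensors, r, r_e):
    """Fuse per-sensor detection probabilities: the point is covered unless
    every sensor misses it (a common uncertain-fusion operator)."""
    miss = 1.0
    for s in sensors:
        miss *= 1.0 - sensing_prob(np.linalg.norm(point - s), r, r_e)
    return 1.0 - miss

# e.g. one point against two sensors in 3-D:
# fused_coverage(np.zeros(3), [np.array([1., 0., 0.]),
#                              np.array([0., 2., 0.])], r=2.0, r_e=0.5)
```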

Journal ArticleDOI
TL;DR: The serial implementation of the novel incremental algorithm, which decomposes the graph into biconnected components and proves that processing can be localized within the affected components, is demonstrated to be up to 3.7 times faster than existing serial methods.
Abstract: Betweenness centrality quantifies the importance of nodes in a graph in many applications, including network analysis, community detection and identification of influential users. Typically, graphs in such applications evolve over time. Thus, the computation of betweenness centrality should be performed incrementally. This is challenging because updating even a single edge may trigger the computation of all-pairs shortest paths in the entire graph. Existing approaches cannot scale to large graphs: they either require excessive memory (i.e., quadratic in the size of the input graph) or perform unnecessary computations rendering them prohibitively slow. We propose iCentral, a novel incremental algorithm for computing betweenness centrality in evolving graphs. We decompose the graph into biconnected components and prove that processing can be localized within the affected components. iCentral is the first algorithm to support incremental betweenness centrality computation within a graph component. This is done efficiently, in linear space; consequently, iCentral scales to large graphs. We demonstrate with real datasets that the serial implementation of iCentral is up to 3.7 times faster than existing serial methods. Our parallel implementation, which scales to large graphs, is an order of magnitude faster than the state-of-the-art parallel algorithm, while using an order of magnitude fewer computational resources.
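
The localization step the abstract describes is easy to state: after an edge update, only the biconnected component containing the new edge can see its dependencies change. A networkx sketch of finding that affected component; recombining local scores into global centrality requires the paper's boundary corrections and is not shown.

```python
import networkx as nx

def affected_component(G, u, v):
    """Insert edge (u, v) and return the subgraph induced by the biconnected
    component that contains it; betweenness recomputation can be confined to
    this component (plus iCentral's correction terms, omitted here)."""
    G.add_edge(u, v)
    comp = next(c for c in nx.biconnected_components(G)
                if u in c and v in c)
    return G.subgraph(comp)

G = nx.karate_club_graph()
sub = affected_component(G, 0, 26)
print(len(sub), "of", len(G), "nodes need reprocessing")
```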

Journal ArticleDOI
TL;DR: The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters, and is more suitable for PGAs than the global and grid model, also in terms of costs when executed on a commercial cloud provider.
Abstract: The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model can be most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computation load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of the execution of the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the use of PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of different constraints that need to be considered during the design and the implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store operations. For this reason, the island model is more suitable for PGAs than the global and grid model, also in terms of costs when executed on a commercial cloud provider.
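
The island model the study found best is straightforward to sketch: independent GA populations evolve in parallel and periodically migrate their best individuals to a neighbor island. A minimal serial simulation (fitness, crossover, mutate, and init_pop are assumed problem-specific callbacks; the MapReduce plumbing is omitted):

```python
import random

def island_ga(init_pop, fitness, crossover, mutate,
              n_islands=4, generations=50, migrate_every=10):
    islands = [list(init_pop()) for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        for pop in islands:
            # one generation: binary tournament selection + variation
            nxt = []
            while len(nxt) < len(pop):
                p1 = max(random.sample(pop, 2), key=fitness)
                p2 = max(random.sample(pop, 2), key=fitness)
                nxt.append(mutate(crossover(p1, p2)))
            pop[:] = nxt
        if gen % migrate_every == 0:
            # ring migration: best of island i-1 replaces worst of island i
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                worst = min(pop, key=fitness)
                pop[pop.index(worst)] = bests[(i - 1) % n_islands]
    return max((max(pop, key=fitness) for pop in islands), key=fitness)
```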

Journal ArticleDOI
TL;DR: A novel computational scheme for reducing the cost of a core operation in computing the CP decomposition with the traditional alternating least squares (CP-ALS) based algorithm is proposed and effectively parallelize this computational scheme in the context of CP-ALS in shared and distributed memory environments.
Abstract: CANDECOMP/PARAFAC (CP) decomposition of sparse tensors has been successfully applied to many problems in web search, graph analytics, recommender systems, health care data analytics, and many other domains. In these applications, efficiently computing the CP decomposition of sparse tensors is essential in order to be able to process and analyze data of massive scale. For this purpose, we investigate an efficient computation of the CP decomposition of sparse tensors and its parallelization. We propose a novel computational scheme for reducing the cost of a core operation in computing the CP decomposition with the traditional alternating least squares (CP-ALS) based algorithm. We then effectively parallelize this computational scheme in the context of CP-ALS in shared and distributed memory environments and propose data and task distribution models for better scalability. We implement parallel CP-ALS algorithms and compare our implementations with an efficient tensor factorization library using tensors from...
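
For reference, one CP-ALS factor update has the structure the paper optimizes: an MTTKRP followed by a solve against a Hadamard product of Gram matrices. A dense numpy sketch for a 3-way tensor; the paper's contribution is performing this core operation cheaply for sparse tensors, which this sketch does not capture.

```python
import numpy as np

def cp_als_step(X, B, C):
    """Update the leading factor of a rank-R CP model of 3-way tensor X:
    A <- MTTKRP(X; B, C) @ pinv((B^T B) * (C^T C)),  * = Hadamard product."""
    mttkrp = np.einsum('ijk,jr,kr->ir', X, B, C)   # mode-0 MTTKRP
    gram = (B.T @ B) * (C.T @ C)
    return mttkrp @ np.linalg.pinv(gram)           # pinv for robustness

# one full ALS sweep cycles the update over all three factors:
# A = cp_als_step(X, B, C)
# B = cp_als_step(X.transpose(1, 0, 2), A, C)
# C = cp_als_step(X.transpose(2, 0, 1), A, B)
```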

Journal ArticleDOI
TL;DR: Results show that the designed CGPUGA algorithm provides rules of higher quality compared to the state-of-the-art NIGGAR, MSP-MPSO and MPGA algorithms for diversified association rule mining.

Journal ArticleDOI
01 Oct 2018
TL;DR: This paper develops parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures and develops a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem.
Abstract: Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
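
To make the accumulator question concrete, here is a sketch of the classic Gustavson-style row-by-row SpGEMM with a hash-map accumulator, the baseline data structure such kernels compare against (dense arrays and sorted lists being the usual alternatives):

```python
def spgemm_row(A_row, B_rows):
    """Multiply one sparse row of A against sparse matrix B.
    A_row: list of (k, a_ik); B_rows: dict k -> list of (j, b_kj).
    The dict 'acc' is the accumulator that merges partial products;
    its choice dominates cache behavior and parallel performance."""
    acc = {}
    for k, a in A_row:
        for j, b in B_rows.get(k, ()):
            acc[j] = acc.get(j, 0.0) + a * b
    return sorted(acc.items())

def spgemm(A_rows, B_rows):
    # C = A @ B, computed row by row (rows are independent, hence parallel)
    return {i: spgemm_row(row, B_rows) for i, row in A_rows.items()}
```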

Journal ArticleDOI
TL;DR: A system architecture is presented that enhances traditional MapReduce by incorporating a parallel processing algorithm; a complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm.
Abstract: The growing gap between users and Big Data analytics requires innovative tools that address the challenges posed by big data volume, variety, and velocity; analyzing such massive volumes of data with conventional tools is computationally inefficient. Moreover, advancements in Big Data applications and data science pose additional challenges, for which High-Performance Computing has become a key enabler and has attracted attention in recent years. However, existing systems are either memoryless or computationally inefficient. There is therefore a need for a system that can efficiently analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that enhances the working of traditional MapReduce by incorporating a parallel processing algorithm. Moreover, a complete four-tier architecture is also proposed that efficiently aggregates the data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm. The proposed architecture optimizes both read and write operations, enhancing the efficiency of Input/Output. To check the efficiency of the proposed algorithms, we have implemented the proposed system using Hadoop and MapReduce. MapReduce is supported by a parallel algorithm that efficiently processes a huge volume of data sets. The system is implemented using MapReduce on top of Hadoop parallel nodes to generate and process graphs in near real-time. Moreover, the system is evaluated in terms of efficiency by considering system throughput and processing time. The results show that the proposed system is more scalable and efficient.

Journal ArticleDOI
TL;DR: It turns out that the derived algorithm is a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.
Abstract: We consider the problem of nonnegative tensor factorization. Our aim is to derive an efficient algorithm that is also suitable for parallel implementation. We adopt the alternating optimization framework and solve each matrix nonnegative least-squares problem via a Nesterov-type algorithm for strongly convex problems. We describe a parallel implementation of the algorithm and measure the attained speedup in a multicore computing environment. It turns out that the derived algorithm is a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.
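
Each subproblem in the alternating scheme is a nonnegative least-squares solve. A Nesterov-type projected-gradient sketch for min ||Ax - b||^2 subject to x >= 0, using the fixed momentum that strong convexity permits; it assumes A^T A is positive definite and is illustrative only, not the paper's exact method.

```python
import numpy as np

def nesterov_nnls(A, b, iters=500):
    """Accelerated projected gradient for min ||Ax - b||^2, x >= 0.
    With a mu-strongly convex, L-smooth objective the momentum can be
    fixed at (sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu))."""
    G = A.T @ A
    g = A.T @ b
    eigs = np.linalg.eigvalsh(G)           # ascending eigenvalues
    L, mu = eigs[-1], max(eigs[0], 1e-12)
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x = np.zeros(A.shape[1])
    y = x.copy()
    for _ in range(iters):
        x_new = np.maximum(y - (G @ y - g) / L, 0.0)   # gradient step + project
        y = x_new + beta * (x_new - x)                 # momentum extrapolation
        x = x_new
    return x
```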

Journal ArticleDOI
TL;DR: This paper describes the parallel algorithm of the NOISEtte code for computational fluid dynamics and aeroacoustics simulations based on a family of higher-accuracy numerical schemes for unstructured hybrid meshes and the multilevel MPI + OpenMP parallelization.
Abstract: This paper describes the parallel algorithm of the NOISEtte code for computational fluid dynamics and aeroacoustics simulations. It is based on a family of higher-accuracy numerical schemes for unstructured hybrid meshes. The multilevel MPI + OpenMP parallelization is described in detail. Performance results are presented for various supercomputers and applications.

Journal ArticleDOI
TL;DR: This paper proposes a fast GPU-based frequent itemset mining method called GMiner, which achieves very fast performance by fully exploiting the computational power of GPUs and is suitable for large-scale data.

Journal ArticleDOI
TL;DR: This paper discusses the serial algorithms of dynamic, quasistatic, and hybrid CAC, along with some programming techniques used in the code, and illustrates the parallel algorithm, quantify the parallel scalability, and discuss some software specifications of PyCAC.
Abstract: We present a novel distributed-memory parallel implementation of the concurrent atomistic-continuum (CAC) method. Written mostly in Fortran 2008 and wrapped with a Python scripting interface, the CAC simulator in PyCAC runs in parallel using Message Passing Interface with a spatial decomposition algorithm. Built upon the underlying Fortran code, the Python interface provides a robust and versatile way for users to build system configurations, run CAC simulations, and analyze results. In this paper, following a brief introduction to the theoretical background of the CAC method, we discuss the serial algorithms of dynamic, quasistatic, and hybrid CAC, along with some programming techniques used in the code. We then illustrate the parallel algorithm, quantify the parallel scalability, and discuss some software specifications of PyCAC; more information can be found in the PyCAC user's manual that is hosted on http://www.pycac.org.

Journal ArticleDOI
TL;DR: A parallel algorithm for network traffic anomaly detection based on Isolation Forest and Spark is proposed, which exploits the big data processing capability of Spark and alleviates the computation bottleneck of a single machine.
Abstract: With the rapid development of large-scale complex networks and proliferation of various social network applications, the amount of network traffic data generated is increasing tremendously, and eff...
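
The abstract is truncated, so the paper's exact scheme is unknown; the sketch below shows only one plausible way to pair Isolation Forest with Spark — scoring partitions of traffic records in parallel, one forest per partition — and is not necessarily the authors' design. `records` is a hypothetical RDD of numeric feature vectors.

```python
from pyspark import SparkContext
from sklearn.ensemble import IsolationForest
import numpy as np

def score_partition(rows):
    # fit a local forest per partition (an assumption; a global model
    # would instead be broadcast to the executors)
    X = np.array(list(rows))
    if X.size == 0:
        return iter([])
    clf = IsolationForest(n_estimators=100, random_state=0).fit(X)
    # lower decision_function values = more anomalous
    return iter(clf.decision_function(X).tolist())

sc = SparkContext(appName="traffic-anomaly-sketch")
records = sc.parallelize([[0.1, 3.2], [0.2, 3.1], [9.5, 0.4]], numSlices=2)
scores = records.mapPartitions(score_partition).collect()
```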

Proceedings ArticleDOI
21 May 2018
TL;DR: Afforest is proposed: an extension of the Shiloach-Vishkin connected components algorithm that approaches optimal work efficiency by processing subgraphs in each iteration, and it is shown that the algorithm exhibits higher memory locality than existing methods.
Abstract: Connected component identification is a fundamental problem in graph analytics, serving as a basis for subsequent computations in a wide range of applications. To determine connectivity, several parallel algorithms, whose complexity is proportional to the number of edges or graph diameter, have been proposed. However, an optimal algorithm may extract graph components by working proportionally to the number of vertices, which can be orders of magnitude lower than the number of edges. We propose Afforest: an extension of the Shiloach-Vishkin connected components algorithm that approaches optimal work efficiency by processing subgraphs in each iteration. We prove the convergence of the algorithm, analyze its work efficiency characteristics, and provide further techniques to speed up processing graphs containing a huge component. Designed with modern parallel architectures in mind, we show that the algorithm exhibits higher memory locality than existing methods. Using both synthetic and real-world graphs, we demonstrate that Afforest achieves speedups of up to 67x over the state-of-the-art on multi-core CPUs (Broadwell, POWER8) and up to 23x on GPUs (Pascal).
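
The Shiloach–Vishkin skeleton that Afforest extends alternates two steps, hooking and shortcutting, both of which parallelize over edges and vertices. A serial sketch of that skeleton, without Afforest's neighborhood sampling and largest-component skipping:

```python
def sv_components(n, edges):
    """Shiloach-Vishkin style connected components on vertices 0..n-1.
    parent[] encodes a forest; roots are component representatives."""
    parent = list(range(n))
    while True:
        changed = False
        # hooking: attach the larger of two distinct roots beneath the smaller
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:          # only roots may be hooked
                    parent[hi] = lo
                    changed = True
        # shortcutting: pointer jumping until every tree is a star
        for i in range(n):
            while parent[i] != parent[parent[i]]:
                parent[i] = parent[parent[i]]
        if not changed:
            return parent

# sv_components(5, [(0, 1), (1, 2), (3, 4)]) -> [0, 0, 0, 3, 3]
```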

Proceedings ArticleDOI
21 May 2018
TL;DR: In this paper, the authors show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation, and also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal.
Abstract: The matricized-tensor times Khatri-Rao product (MTTKRP) computation is the typical bottleneck in algorithms for computing a CP decomposition of a tensor. In order to develop high performance sequential and parallel algorithms, we establish communication lower bounds that identify how much data movement is required for this computation in the case of dense tensors. We also present sequential and parallel algorithms that attain the lower bounds and are therefore communication optimal. In particular, we show that the structure of the computation allows for less communication than the straightforward approach of casting the computation as a matrix multiplication operation.
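
Concretely, the mode-1 MTTKRP of a 3-way tensor computes M[i,r] = sum over j,k of X[i,j,k]·B[j,r]·C[k,r]. The sketch below contrasts the "cast as matrix multiplication" formulation, which materializes a tall Khatri-Rao matrix, with a direct contraction that avoids it — the kind of structural saving the paper formalizes for communication:

```python
import numpy as np
from scipy.linalg import khatri_rao

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((I, J, K))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# naive: matricize X and multiply by the explicit (J*K x R) Khatri-Rao product
M_naive = X.reshape(I, J * K) @ khatri_rao(B, C)

# structured: contract directly, never forming the Khatri-Rao matrix
M_direct = np.einsum('ijk,jr,kr->ir', X, B, C)

assert np.allclose(M_naive, M_direct)
```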

Journal ArticleDOI
TL;DR: SSiGraM (Spark based Single Graph Mining), a Spark based parallel frequent subgraph mining algorithm in a single large graph that outperforms the existing GraMi (Graph Mining) algorithm by an order of magnitude for all datasets and can work with a lower support threshold.
Abstract: Frequent subgraph mining (FSM) plays an important role in graph mining, attracting a great deal of attention in many areas, such as bioinformatics, web data mining and social networks. In this paper, we propose SSiGraM (Spark based Single Graph Mining), a Spark based parallel frequent subgraph mining algorithm for a single large graph. To address the two computational challenges of FSM, we conduct the subgraph extension and support evaluation in parallel across all the distributed cluster worker nodes. In addition, we also employ a heuristic search strategy and three novel optimizations: load balancing, pre-search pruning and top-down pruning in the support evaluation process, which significantly improve the performance. Extensive experiments with four different real-world datasets demonstrate that the proposed algorithm outperforms the existing GraMi (Graph Mining) algorithm by an order of magnitude for all datasets and can work with a lower support threshold.

Journal ArticleDOI
TL;DR: In this paper, a parallel algorithm for multivariate Radial Basis Function Partition of Unity Method (RBF-PUM) interpolation is presented, which makes use of shared-memory parallel processors through the OpenCL standard.
Abstract: We present a parallel algorithm for multivariate Radial Basis Function Partition of Unity Method (RBF-PUM) interpolation. The concurrent nature of the RBF-PUM enables designing parallel algorithms for dealing with a large number of scattered data-points in high space dimensions. To efficiently exploit this concurrency, our algorithm makes use of shared-memory parallel processors through the OpenCL standard. This efficiency is achieved by a parallel space partitioning strategy with linear computational time complexity with respect to the input and evaluation points. The speed of our approach allows for computationally more intensive construction of the interpolant. In fact, the RBF-PUM can be coupled with a cross-validation technique that searches for optimal values of the shape parameters associated with each local RBF interpolant, thus reducing the global interpolation error. The numerical experiments support our claims by illustrating the interpolation errors and the running times of our algorithm.
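
A compact serial numpy sketch of the RBF-PUM structure: local Gaussian RBF interpolants on overlapping patches, blended with normalized compactly supported weights. The paper's OpenCL parallelism and shape-parameter cross-validation are not reproduced, and the kernel choice and patch layout here are assumptions.

```python
import numpy as np

def gauss(r, eps=2.0):
    return np.exp(-(eps * r) ** 2)

def wendland_c2(r):
    # compactly supported weight, nonzero only on [0, 1)
    return np.maximum(1.0 - r, 0.0) ** 4 * (4.0 * r + 1.0)

def rbf_pum(x, f, centers, rho, xe):
    """x: (N,d) data points, f: (N,) values, centers: (M,d) patch centers,
    rho: patch radius, xe: (E,d) evaluation points."""
    num = np.zeros(len(xe))
    den = np.zeros(len(xe))
    for c in centers:
        inside = np.linalg.norm(x - c, axis=1) < rho
        de = np.linalg.norm(xe - c, axis=1)
        emask = de < rho
        if not inside.any() or not emask.any():
            continue
        xi, fi = x[inside], f[inside]
        # local RBF interpolant on this patch (tiny ridge for conditioning)
        A = gauss(np.linalg.norm(xi[:, None] - xi[None, :], axis=2))
        A += 1e-10 * np.eye(len(A))
        coef = np.linalg.solve(A, fi)
        E = gauss(np.linalg.norm(xe[emask][:, None] - xi[None, :], axis=2))
        w = wendland_c2(de[emask] / rho)        # partition-of-unity weight
        num[emask] += w * (E @ coef)
        den[emask] += w
    return num / np.where(den == 0, 1.0, den)  # Shepard normalization
```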

Proceedings ArticleDOI
11 Jul 2018
TL;DR: This paper designs parallel write-efficient geometric algorithms that perform asymptotically fewer writes than standard algorithms for the same problem, and introduces several techniques for obtaining write-efficiency, including DAG tracing, prefix doubling, and α-labeling.
Abstract: In this paper, we design parallel write-efficient geometric algorithms that perform asymptotically fewer writes than standard algorithms for the same problem. This is motivated by emerging non-volatile memory technologies with read performance being close to that of random access memory but writes being significantly more expensive in terms of energy and latency. We design algorithms for planar Delaunay triangulation, k-d trees, and static and dynamic augmented trees. Our algorithms are designed in the recently introduced Asymmetric Nested-Parallel Model, which captures the parallel setting in which there is a small symmetric memory where reads and writes are unit cost as well as a large asymmetric memory where writes are ω times more expensive than reads. In designing these algorithms, we introduce several techniques for obtaining write-efficiency, including DAG tracing, prefix doubling, and α-labeling, which we believe will be useful for designing other parallel write-efficient algorithms.