
Showing papers on "Parallel algorithm" published in 2021


Journal ArticleDOI
TL;DR: In this paper, a momentum-incorporated parallel stochastic gradient descent (MPSGD) algorithm is proposed to accelerate the convergence rate by integrating momentum effects into its training process.
Abstract: A recommender system (RS) relying on latent factor analysis usually adopts stochastic gradient descent (SGD) as its learning algorithm. However, owing to its serial mechanism, an SGD algorithm suffers from low efficiency and scalability when handling large-scale industrial problems. To address this issue, this study proposes a momentum-incorporated parallel stochastic gradient descent (MPSGD) algorithm, whose main idea is two-fold: a) implementing parallelization via a novel data-splitting strategy, and b) accelerating convergence by integrating momentum effects into the training process. Based on it, an MPSGD-based latent factor (MLF) model is obtained, which is capable of performing efficient and high-quality recommendations. Experimental results on four high-dimensional and sparse matrices generated by industrial RS indicate that, owing to the MPSGD algorithm, the MLF model outperforms existing state-of-the-art models in both computational efficiency and scalability.
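
A minimal sketch of the per-block update rule at the heart of such a scheme (our illustration, not the paper's code): MPSGD's data-splitting assigns disjoint blocks of the rating matrix to workers so that no two workers touch the same rows of P or columns of Q, and each worker then runs SGD with a momentum buffer on its own block.

```python
import numpy as np

def momentum_sgd_block(block, P, Q, lr=0.01, beta=0.9, reg=0.02, epochs=10):
    """Momentum SGD on one data block of (user, item, rating) triples.

    Illustrative sketch only: in MPSGD, blocks are chosen so that workers
    running this routine concurrently never update the same rows of P or Q.
    """
    vP = np.zeros_like(P)  # momentum buffers for the latent factors
    vQ = np.zeros_like(Q)
    for _ in range(epochs):
        for u, i, r in block:
            err = r - P[u] @ Q[i]
            gP = -err * Q[i] + reg * P[u]   # regularized gradient w.r.t. P[u]
            gQ = -err * P[u] + reg * Q[i]
            vP[u] = beta * vP[u] + lr * gP  # momentum accumulation
            vQ[i] = beta * vQ[i] + lr * gQ
            P[u] -= vP[u]
            Q[i] -= vQ[i]
    return P, Q
```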

108 citations


Journal ArticleDOI
TL;DR: In this article, a distributed framework for physics-informed neural networks (PINNs) is developed, based on two recent extensions, namely conservative PINNs (cPINNs) and extended PINNs (XPINNs), which employ domain decomposition in space and in space-time, respectively.

56 citations


Journal ArticleDOI
TL;DR: This article proposes a novel parallel tracking control optimization algorithm for interconnected systems, in which the working feedback control is treated as a reconstructed dynamic with a virtual control and a new augmented fuzzy interconnected tracking system is built, so that the performance index is valid for optimal control.
Abstract: In this article, a novel parallel tracking control optimization algorithm is proposed for partially unknown fuzzy interconnected systems. In the existing standard optimal tracking control, a bounded or non-asymptotically stable reference trajectory prevents the feedback control from converging to zero, which renders the performance index infinite and invalid. Using the precompensation technique, this article treats the working feedback control as a reconstructed dynamic with a virtual control and builds a new augmented fuzzy interconnected tracking system, so that the performance index is valid for optimal control. Then, combining the integral reinforcement learning (RL) method with decentralized control design, a novel integral RL parallel algorithm is developed to solve the tracking controls for interconnected systems, relaxing the requirement of exact knowledge of the matrices $A_i^k$ and $B_i^k$ during the solving process. Both the convergence and the stability of the designed control optimization scheme are guaranteed by theorems. Finally, the new parallel tracking algorithm is verified on a dual-manipulator coordination system, and simulation results demonstrate its effectiveness.

37 citations


Journal ArticleDOI
TL;DR: A novel competitive co-evolution scheme, named co-evolution of parameterized search (CEPS), is proposed; it is capable of obtaining generalizable PAPs with few training instances and leads to better generalization.
Abstract: Generalization, i.e., the ability of solving problem instances that are not available during the system design and development phase, is a critical goal for intelligent systems. A typical way to achieve good generalization is to learn a model from vast data. In the context of heuristic search, such a paradigm could be implemented as configuring the parameters of a parallel algorithm portfolio (PAP) based on a set of “training” problem instances, which is often referred to as PAP construction. However, compared to the traditional machine learning, PAP construction often suffers from the lack of training instances, and the obtained PAPs may fail to generalize well. This article proposes a novel competitive co-evolution scheme, named co-evolution of parameterized search (CEPS), as a remedy to this challenge. By co-evolving a configuration population and an instance population, CEPS is capable of obtaining generalizable PAPs with few training instances. The advantage of CEPS in improving generalization is analytically shown in this article. Two concrete algorithms, namely, CEPS-TSP and CEPS-VRPSPDTW, are presented for the traveling salesman problem (TSP) and the vehicle routing problem with simultaneous pickup–delivery and time windows (VRPSPDTW), respectively. The experimental results show that CEPS has led to better generalization, and even managed to find new best-known solutions for some instances.
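
The competitive loop can be summarized in a few lines. The sketch below is our reading of the abstract, with all problem-specific operators (score, mutate_config, mutate_instance) left as hypothetical user-supplied functions; the PAP's value on an instance is the best score among its member configurations.

```python
import random

def ceps(configs, instances, score, mutate_config, mutate_instance, rounds=10):
    """Competitive co-evolution skeleton (our sketch of the CEPS idea).

    score(c, inst) -> higher is better; mutate_config and mutate_instance
    are hypothetical variation operators supplied by the user.
    """
    pap = lambda inst: max(score(c, inst) for c in configs)  # portfolio = best member
    for _ in range(rounds):
        # Configuration phase: try to improve the PAP on its hardest instance.
        hard = min(instances, key=pap)
        cand = mutate_config(random.choice(configs))
        weakest = min(configs, key=lambda c: score(c, hard))
        if score(cand, hard) > score(weakest, hard):
            configs[configs.index(weakest)] = cand
        # Instance phase: breed instances on which the current PAP does badly.
        newcomer = mutate_instance(random.choice(instances))
        easiest = max(instances, key=pap)
        if pap(newcomer) < pap(easiest):
            instances[instances.index(easiest)] = newcomer
    return configs
```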

30 citations


Journal ArticleDOI
TL;DR: Algorithms for temporal parallelization of Bayesian smoothers are presented; their advantage is that they reduce the complexity of standard smoothing algorithms from linear to logarithmic in the number of time steps.
Abstract: This article presents algorithms for temporal parallelization of Bayesian smoothers. We define the elements and the operators to pose these problems as solutions to all-prefix-sums operations, for which efficient parallel scan algorithms are available. We present the temporal parallelization of the general Bayesian filtering and smoothing equations, and specialize them to linear/Gaussian models. The advantage of the proposed algorithms is that they reduce the complexity of standard smoothing algorithms from linear to logarithmic in the number of time steps.
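
The core trick is that the filtering and smoothing recursions can be phrased in terms of an associative operator, so an all-prefix-sums (scan) computes every partial composition in logarithmic depth. A toy illustration, with affine maps x -> a*x + b standing in for the paper's filtering elements (the real operators act on Gaussian parameters):

```python
def scan(xs, op):
    """Inclusive prefix scan for an associative op.

    Written sequentially, but each level's loop is embarrassingly parallel,
    which is what gives the O(log n) depth the paper exploits.
    """
    n = len(xs)
    if n == 1:
        return list(xs)
    paired = [op(xs[2 * i], xs[2 * i + 1]) for i in range(n // 2)]  # parallel level
    sub = scan(paired, op)                    # half-size recursive scan
    out = [xs[0]]
    for i in range(1, n):                     # parallel expansion level
        out.append(sub[i // 2] if i % 2 == 1 else op(sub[i // 2 - 1], xs[i]))
    return out

def compose(f, g):
    """Apply f first, then g, for affine maps (a, b): x -> a*x + b."""
    return (g[0] * f[0], g[0] * f[1] + g[1])

steps = [(0.9, 1.0)] * 8          # 8 identical one-step recursions
prefixes = scan(steps, compose)   # prefixes[t]: composed map from x0 to x_{t+1}
```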

28 citations


Proceedings ArticleDOI
17 Feb 2021
TL;DR: TurboTransformers as mentioned in this paper is a transformer serving system for NLP tasks on GPUs, which consists of a computing runtime and a serving framework, which can achieve the state-of-the-art transformer model serving performance on GPU platforms.
Abstract: The transformer is the most important algorithmic innovation in Natural Language Processing (NLP) in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process the sequence-length dimension in parallel, which leads to better accuracy on long sequences. However, efficient deployment for online services in GPU-equipped data centers is not easy. First, the additional computation introduced by transformer structures makes it harder to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length, and this variability of input dimensions poses a severe problem for efficient memory management and serving optimization. To solve these challenges, this paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework. Three innovative features make it stand out from similar works. An efficient parallel algorithm is proposed for GPU-based batch reduction operations, like Softmax and LayerNorm, which are the major hot spots besides BLAS routines. A memory allocation algorithm, which better balances memory footprint and allocation/free efficiency, is designed for variable-length inputs. A serving framework equipped with a new batch scheduler based on dynamic programming achieves optimal throughput on variable-length requests. The system achieves state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into PyTorch code with a few lines of code.
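
As an illustration of the batch-reduction problem the paper optimizes, here is the math of a softmax over a padded batch of variable-length rows in NumPy (our sketch only; TurboTransformers implements this as a parallel GPU routine):

```python
import numpy as np

def masked_softmax(scores, lengths):
    """Softmax over padded variable-length rows; scores: (batch, max_len)."""
    mask = np.arange(scores.shape[1])[None, :] < lengths[:, None]
    s = np.where(mask, scores, -np.inf)    # padding never receives probability
    s = s - s.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.where(mask, np.exp(s), 0.0)
    return e / e.sum(axis=1, keepdims=True)

probs = masked_softmax(np.random.rand(4, 10), np.array([10, 7, 3, 5]))
```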

24 citations


Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this article, the authors present new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN*) based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations.
Abstract: This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN*). Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN*. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters of different scale that arise for both EMST and HDBSCAN*. We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time). We implement our algorithms and propose a memory optimization that requires only a subset of well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13--55.89x, and existing parallel algorithms by at least an order of magnitude.
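
The EMST stage reduces to running Kruskal's algorithm over a small candidate edge set (the bichromatic closest pairs of the well-separated pair decomposition). A minimal union-find Kruskal, assuming the candidate pairs have already been computed:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:  # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def kruskal(n, candidates):
    """MST over candidate edges (weight, u, v); for EMST, the candidates
    come from bichromatic closest pairs of well-separated pairs, which
    provably contain all MST edges."""
    uf = UnionFind(n)
    return [(u, v, w) for w, u, v in sorted(candidates) if uf.union(u, v)]
```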

20 citations


Journal ArticleDOI
TL;DR: The recovery of geodesic distance with the heat method can be reformulated as optimization of its gradients subject to integrability, which can be solved using an efficient first-order method that requires no linear system solving and converges quickly.
Abstract: In this paper, we propose a parallel and scalable approach for geodesic distance computation on triangle meshes. Our key observation is that the recovery of geodesic distance with the heat method [1] can be reformulated as optimization of its gradients subject to integrability, which can be solved using an efficient first-order method that requires no linear system solving and converges quickly. Afterward, the geodesic distance is efficiently recovered by parallel integration of the optimized gradients in breadth-first order. Moreover, we employ a similar breadth-first strategy to derive a parallel Gauss-Seidel solver for the diffusion step in the heat method. To further lower the memory consumption from gradient optimization on faces, we also propose a formulation that optimizes the projected gradients on edges, which reduces the memory footprint by about 50 percent. Our approach is trivially parallelizable, with a low memory footprint that grows linearly with respect to the model size. This makes it particularly suitable for handling large models. Experimental results show that it can efficiently compute geodesic distance on meshes with more than 200 million vertices on a desktop PC with 128 GB RAM, outperforming the original heat method and other state-of-the-art geodesic distance solvers.
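
The final recovery step is simple enough to show directly. In this hedged sketch (our own data layout, with edge_grad[(u, v)] approximating d(v) - d(u) from the optimized gradients), distances are integrated outward in breadth-first order; all vertices at the same BFS depth are independent, which is what makes the step parallel:

```python
from collections import deque

def integrate_bfs(adj, edge_grad, source):
    """Recover per-vertex distances by breadth-first integration of
    optimized gradient differences from the source."""
    dist = {source: 0.0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in dist:  # vertices at one BFS depth are independent
                dist[v] = dist[u] + edge_grad[(u, v)]
                frontier.append(v)
    return dist
```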

20 citations


Journal ArticleDOI
TL;DR: This paper examines semi-analytical solution (SAS) methods as the coarse operators of the Parareal algorithm and compares the performance of the SAS methods with standard numerical time integration methods.
Abstract: With continuing advances in high-performance parallel computing platforms, parallel algorithms have become powerful tools for developing faster-than-real-time power system dynamic simulations. In particular, it has been demonstrated in recent years that parallel-in-time (Parareal) algorithms have the potential to achieve such an ambitious goal. The selection of a fast and reasonably accurate coarse operator of the Parareal algorithm is crucial for its effective utilization and performance. This paper examines semi-analytical solution (SAS) methods as the coarse operators of the Parareal algorithm and compares their performance with standard numerical time integration methods. Two promising time-power-series-based SAS methods were considered, the Adomian decomposition method and the homotopy analysis method, with a windowing approach for improving convergence. Numerical performance case studies on a 10-generator 39-bus system and a 327-generator 2383-bus system were performed for these coarse operators over different disturbances, evaluating the number of Parareal iterations, computational time, and stability of convergence. All the coarse operators tested under different scenarios converged to the same true solution (when convergent), and the SAS methods provided comparable computational speed while exhibiting more stable convergence in many cases.
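
The Parareal correction that the paper builds on is compact enough to state in code. A generic sketch, where g_coarse would be one of the SAS methods (Adomian decomposition or homotopy analysis) and f_fine is the accurate but expensive integrator:

```python
def parareal(f_fine, g_coarse, u0, t, iters=5):
    """Parareal: U[n+1] <- G(U[n]) + F_old[n] - G_old[n].

    f_fine(u, t0, t1) and g_coarse(u, t0, t1) advance a state from t0 to t1.
    The expensive f_fine calls are independent across time windows, hence
    parallel-in-time; only the cheap coarse sweep is sequential.
    """
    N = len(t) - 1
    U = [u0]
    for n in range(N):  # initial coarse sweep
        U.append(g_coarse(U[n], t[n], t[n + 1]))
    for _ in range(iters):
        F = [f_fine(U[n], t[n], t[n + 1]) for n in range(N)]      # parallel step
        G_old = [g_coarse(U[n], t[n], t[n + 1]) for n in range(N)]
        U_new = [u0]
        for n in range(N):  # sequential correction sweep
            U_new.append(g_coarse(U_new[n], t[n], t[n + 1]) + F[n] - G_old[n])
        U = U_new
    return U
```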

19 citations


Journal ArticleDOI
TL;DR: A novel approach for database damage assessment for healthcare systems, inspired by the current behavior of COVID-19 infections, that outperforms other existing algorithms in this domain in terms of both time and memory.
Abstract: In the current Internet of Things era, companies have shifted from paper-based data to electronic formats. Although this shift has increased the efficiency of data processing, it has security drawbacks. Healthcare databases are a precious target for attackers because they facilitate identity theft and cybercrime. This paper presents an approach to database damage assessment for healthcare systems, inspired by the behavior of COVID-19 infections: malicious transactions are viewed as infections and traced from the point of infection onward. The challenge of this research is to discover the infected transactions in minimal time. The proposed parallel algorithm is based on the transaction dependency paradigm, with a time complexity of O((M + NQ + N^3)/L), where M is the total number of transactions under scrutiny, N the number of malicious and affected transactions in the testing list, Q the time for a dependency check, and L the number of threads used. The memory complexity of the algorithm is O(N + KL), where K is the number of transactions in one area handled by one thread. Since the damage assessment time is directly proportional to the denial-of-service time, the proposed algorithm minimizes execution time. It is a novel approach that outperforms other existing algorithms in this domain in terms of both time and memory, running up to four times faster and using 120,000 fewer bytes of memory.
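
A hedged sketch of the traversal at the core of such an assessment (our data layout, not the paper's code): the malicious set is the initial "infection", and each level of the dependency graph is expanded with the frontier split across L worker threads.

```python
from concurrent.futures import ThreadPoolExecutor

def assess_damage(dependents, malicious, n_threads=4):
    """Find all transactions reachable from the malicious set.

    dependents[t] lists transactions that read data written by t
    (the transaction dependency paradigm).
    """
    infected = set(malicious)
    frontier = list(malicious)

    def expand(chunk):
        # Each thread scans its share of the current frontier.
        return [d for t in chunk for d in dependents.get(t, ())
                if d not in infected]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while frontier:
            chunks = [frontier[i::n_threads] for i in range(n_threads)]
            new = set()
            for hits in pool.map(expand, chunks):
                new.update(hits)
            new -= infected     # main thread is the only writer
            infected |= new
            frontier = list(new)
    return infected
```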

18 citations


Journal ArticleDOI
TL;DR: In this article, a parallel algorithm for the coupled-cluster singles and doubles method augmented with a perturbative correction for triple excitations [CCSD(T)] using the resolution-of-the-identity (RI) approximation for two-electron repulsion integrals (ERIs).
Abstract: A parallel algorithm is described for the coupled-cluster singles and doubles method augmented with a perturbative correction for triple excitations [CCSD(T)] using the resolution-of-the-identity (RI) approximation for two-electron repulsion integrals (ERIs). The algorithm bypasses the storage of four-center ERIs by adopting an integral-direct strategy. The CCSD amplitude equations are given in a compact quasi-linear form by factorizing them in terms of amplitude-dressed three-center intermediates. A hybrid MPI/OpenMP parallelization scheme is employed, which uses the OpenMP-based shared memory model for intranode parallelization and the MPI-based distributed memory model for internode parallelization. Parallel efficiency has been optimized for all terms in the CCSD amplitude equations. Two different algorithms have been implemented for the rate-limiting terms in the CCSD amplitude equations that entail $O(N_O^2 N_V^4)$- and $O(N_O^3 N_V^3)$-scaling computational costs, where $N_O$ and $N_V$ denote the number of correlated occupied and virtual orbitals, respectively. One of the algorithms assembles the four-center ERIs, requiring $N_V^4$- and $N_O^2 N_V^2$-scaling memory costs, in a distributed manner on a number of MPI ranks, while the other algorithm completely bypasses the assembly of quartic-memory-scaling ERIs and thus largely reduces the memory demand. It is demonstrated that the former memory-expensive algorithm is faster on a few hundred cores, while the latter memory-economic algorithm shows better strong scaling in the limit of a few thousand cores. The program exhibits near-linear scaling, in particular for the compute-intensive triples correction step, on up to 8000 cores. The performance of the program is demonstrated via calculations involving molecules with 24-51 atoms and up to 1624 atomic basis functions. As the first application, the complete basis set (CBS) limit for the interaction energy of the π-stacked uracil dimer from the S66 data set has been investigated. This work reports the first calculation of the interaction energy at the CCSD(T)/aug-cc-pVQZ level without local orbital approximation. The CBS limit for the CCSD correlation contribution to the interaction energy was found to be -8.01 kcal/mol, which agrees very well with the value -7.99 kcal/mol reported by Schmitz, Hättig, and Tew [Phys. Chem. Chem. Phys. 2014, 16, 22167-22178]. The CBS limit for the total interaction energy was estimated to be -9.64 kcal/mol.
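
The RI idea itself fits in one contraction. In this toy NumPy illustration (dimensions and the random B tensor are placeholders), the quartic-memory four-center ERIs are assembled on demand from three-center factors, which is exactly what lets an integral-direct algorithm avoid storing them:

```python
import numpy as np

n_occ, n_virt, n_aux = 4, 8, 20           # toy sizes, not a real basis
B = np.random.rand(n_aux, n_occ, n_virt)  # three-center intermediates B^P_ia

# (ia|jb) is approximated by sum_P B^P_ia * B^P_jb; in practice it is formed
# blockwise on demand, never materialized as a full quartic tensor on one node.
eri_iajb = np.einsum('Pia,Pjb->iajb', B, B)
```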

Journal ArticleDOI
TL;DR: This study formalizes a random-matrix particle swarm optimization scheduling algorithm (RMPSO), which uses a random integer matrix to represent a particle's position and a feasible task-scheduling scheme in order to optimize the total cost of cloud services, and proposes two parallel RMPSO algorithms.

Journal ArticleDOI
29 Apr 2021-PLOS ONE
TL;DR: In this article, the authors present a hardware-agnostic implementation strategy for lattice Boltzmann simulations, which yields massive performance on homogeneous and heterogeneous many-core platforms.
Abstract: We present a novel, hardware-agnostic implementation strategy for lattice Boltzmann (LB) simulations, which yields massive performance on homogeneous and heterogeneous many-core platforms. Based solely on C++17 Parallel Algorithms, our approach does not rely on any language extensions, external libraries, vendor-specific code annotations, or pre-compilation steps. Thanks in particular to a recently proposed GPU back-end to C++17 Parallel Algorithms, we show that a single code can compile and reach state-of-the-art performance on both many-core CPU and GPU environments for the solution of a given non-trivial fluid dynamics problem. The proposed strategy is tested with six different, commonly used implementation schemes to assess the performance impact of memory access patterns on different platforms. Nine different LB collision models are included in the tests and exhibit good performance, demonstrating the versatility of our parallel approach. This work shows that it is less necessary than ever to draw a distinction between research and production software, as a concise and generic LB implementation yields performance comparable to that achievable in a hardware-specific programming language. The results also highlight the performance gains achieved by modern many-core CPUs and their apparent capability to narrow the gap with the traditionally much faster GPU platforms. All code is made available to the community in the form of the open-source project stlbm, which serves both as a stand-alone simulation software and as a collection of reusable patterns for the acceleration of pre-existing LB codes.

Proceedings ArticleDOI
06 Jul 2021
TL;DR: In this paper, a parallel k-clique listing algorithm with improved work bounds was presented for sparse graphs with low degeneracy or arboricity, where the pruning criterion for a backtracking search was introduced and analyzed.
Abstract: We present a parallel k-clique listing algorithm with improved work bounds (for the same depth) in sparse graphs with low degeneracy or arboricity. We achieve this by introducing and analyzing a new pruning criterion for a backtracking search. Our algorithm has better asymptotic performance, especially for larger cliques (when k is not constant), where we avoid the straightforwardly exponential runtime growth with respect to the clique size. In particular, for cliques that are a constant factor smaller than the graph's degeneracy, the work improvement is an exponential factor in the clique size compared to previous results. Moreover, we present a low-depth approximation to the community degeneracy (which can be arbitrarily smaller than the degeneracy). This approximation enables a low depth clique listing algorithm whose runtime is parameterized by the community degeneracy.
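
For concreteness, here is the standard sequential backbone that such algorithms refine: orient edges by a degeneracy ordering, then backtrack over common out-neighborhoods. This is our baseline sketch; the paper's contribution is a sharper pruning rule and a parallel analysis, both omitted here.

```python
import heapq

def degeneracy_order(adj):
    """Repeatedly peel a minimum-degree vertex (lazy-deletion heap)."""
    deg = {v: len(adj[v]) for v in adj}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    order, removed = [], set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry
        order.append(v)
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order

def list_k_cliques(adj, k):
    """List each k-clique once, as an increasing path in the degeneracy DAG."""
    pos = {v: i for i, v in enumerate(degeneracy_order(adj))}
    out = {v: {u for u in adj[v] if pos[u] > pos[v]} for v in adj}
    found = []

    def grow(clique, cands):
        if len(clique) == k:
            found.append(tuple(clique))
            return
        if len(clique) + len(cands) < k:  # simple pruning; the paper's is sharper
            return
        for v in cands:
            grow(clique + [v], cands & out[v])

    for v in adj:
        grow([v], out[v])
    return found
```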

Journal ArticleDOI
TL;DR: This paper examines a new parallel computation model called bulk synchronous farm (BSF) that focuses on estimating the scalability of compute-intensive iterative algorithms aimed at cluster computing systems and presents a cost metric of the BSF model.

Proceedings ArticleDOI
06 Jul 2021
TL;DR: In this paper, the authors propose a new data type, lazy-batched priority queue (LaB-PQ), which abstracts the semantics of the priority queue needed by the stepping algorithms.
Abstract: The single-source shortest-path (SSSP) problem is a notoriously hard problem in the parallel context. In practice, the Δ-stepping algorithm of Meyer and Sanders has been widely adopted. However, Δ-stepping has no known worst-case bounds for general graphs, and the performance highly relies on the parameter Δ, which requires exhaustive tuning. The parallel SSSP algorithms with provable bounds, such as Radius-stepping, either have no implementations available or are much slower than Δ-stepping in practice. We propose the stepping algorithm framework that generalizes existing algorithms such as Δ-stepping and Radius-stepping. The framework allows for similar analysis and implementations for all stepping algorithms. We also propose a new abstract data type, lazy-batched priority queue (LaB-PQ), that abstracts the semantics of the priority queue needed by the stepping algorithms. We provide two data structures for LaB-PQ, focusing on theoretical and practical efficiency, respectively. Based on the new framework and LaB-PQ, we show two new stepping algorithms, ρ-stepping and Δ^*-stepping, that are simple, with non-trivial worst-case bounds, and fast in practice. We also show improved bounds for a list of existing algorithms such as Radius-stepping. Based on our framework, we implement three algorithms: Bellman-Ford, Δ^*-stepping, and ρ-stepping. We compare the performance with four state-of-the-art implementations. On five social and web graphs, ρ-stepping is 1.3--2.6x faster than all the existing implementations. On two road graphs, our Δ^*-stepping is at least 14% faster than existing ones, while ρ-stepping is also competitive. The almost identical implementations for stepping algorithms also allow for in-depth analyses among the stepping algorithms in practice.
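
A minimal bucket-based Δ-stepping, the baseline the framework generalizes (our simplified version skips the light/heavy edge split; the per-bucket batch operations are what a LaB-PQ abstracts):

```python
from collections import defaultdict

def delta_stepping(graph, source, delta):
    """graph[u] = list of (v, w); returns shortest distances from source."""
    dist = defaultdict(lambda: float('inf'))
    dist[source] = 0.0
    buckets = defaultdict(set)
    buckets[0].add(source)
    while buckets:
        i = min(buckets)            # smallest non-empty bucket
        frontier = buckets.pop(i)
        while frontier:             # re-settle vertices that stay in bucket i
            nxt = set()
            for u in frontier:      # this loop runs in parallel in practice
                for v, w in graph.get(u, ()):
                    nd = dist[u] + w
                    if nd < dist[v]:
                        dist[v] = nd
                        b = int(nd // delta)
                        (nxt if b == i else buckets[b]).add(v)
            frontier = nxt
    return dict(dist)
```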

Journal ArticleDOI
01 May 2021
TL;DR: A parallel and scalable model, referred to as S-DI (Scalable Dunn Index), is proposed to compute the Dunn Index for internal validation of clustering results; it shows good scalability and reliable validation compared to other existing measures when handling large-scale data.
Abstract: Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms have been proposed to build partitionings over huge volumes of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To deal with this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for internal validation of clustering results. Rather than computing the Dunn Index on a single machine in the clustering validation process, the proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is further enhanced by a Sketch-and-Validate sampling technique, which approximates the Dunn Index value using a small representative data sample. Experiments on simulated and real datasets showed good scalability of our proposed measure and reliable validation compared to other existing measures when handling large-scale data.
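
The measure being distributed is itself just a pair of reductions, which is what makes it a good fit for Spark. A single-machine NumPy reference (our sketch):

```python
import numpy as np

def dunn_index(points, labels):
    """Dunn Index = min inter-cluster distance / max intra-cluster diameter."""
    groups = [points[labels == c] for c in np.unique(labels)]
    diameter = max(   # max reduction over within-cluster pairwise distances
        np.linalg.norm(g[:, None] - g[None, :], axis=-1).max()
        for g in groups if len(g) > 1)
    separation = min( # min reduction over between-cluster pairwise distances
        np.linalg.norm(a[:, None] - b[None, :], axis=-1).min()
        for i, a in enumerate(groups) for b in groups[i + 1:])
    return separation / diameter
```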

Journal ArticleDOI
TL;DR: In this article, Wang et al. propose a parallel blockwise knowledge distillation algorithm to accelerate the distillation process of sophisticated DNNs, which leverages local information to conduct independent blockwise distillation and utilizes depthwise separable layers as the efficient replacement block architecture.
Abstract: Deep neural networks (DNNs) have been extremely successful in solving many challenging AI tasks in natural language processing, speech recognition, and computer vision. However, DNNs are typically computation-intensive, memory-demanding, and power-hungry, which significantly limits their usage on platforms with constrained resources. Therefore, a variety of compression techniques (e.g., quantization, pruning, and knowledge distillation) have been proposed to reduce the size and power consumption of DNNs. Blockwise knowledge distillation is one of the compression techniques that can effectively reduce the size of a highly complex DNN, but it is not widely adopted due to its long training time. In this article, we propose a novel parallel blockwise distillation algorithm to accelerate the distillation of sophisticated DNNs. Our algorithm leverages local information to conduct independent blockwise distillation, utilizes depthwise separable layers as the efficient replacement block architecture, and properly addresses limiting factors (e.g., dependency, synchronization, and load balancing) that affect parallelism. Experimental results on an AMD server with four GeForce RTX 2080 Ti GPUs show that our algorithm achieves 3x speedup with 19 percent energy savings on VGG distillation, and 3.5x speedup with 29 percent energy savings on ResNet distillation, both with negligible accuracy loss. The speedup of ResNet distillation can be further improved to 3.87x when using four RTX 6000 GPUs in a distributed cluster.
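
The independence that enables the parallelism can be seen in a short PyTorch-style sketch (ours, with hypothetical block lists): each student block regresses onto its teacher block's output, fed with frozen teacher features, so no student block waits on another.

```python
import torch
import torch.nn as nn

def distill_block(teacher_blocks, student_block, idx, loader, epochs=1):
    """Train one student block against teacher block idx, independently.

    Because the input comes from frozen teacher features (not from other
    student blocks), every block's training can run on its own GPU.
    """
    opt = torch.optim.Adam(student_block.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                for tb in teacher_blocks[:idx]:   # teacher features up to idx
                    x = tb(x)
                target = teacher_blocks[idx](x)   # what the student must match
            loss = mse(student_block(x), target)  # purely local objective
            opt.zero_grad()
            loss.backward()
            opt.step()
```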

Journal ArticleDOI
TL;DR: These studies demonstrate that the proposed CUDA-GPU parallel algorithms are applicable and reliable for the large-scale engineering applications of non-spherical granular systems.

Journal ArticleDOI
TL;DR: A novel extension of Theatre, Parallel Theatre, developed to exploit the computing potential of today's shared-memory multi-core machines, is described, together with the particular control forms developed for untimed and timed parallel systems.

Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this paper, Wang et al. propose an exact algorithm, Ex-DPC, and two approximation algorithms, Approx-DPC and S-Approx-DPC, to enable DPC on large datasets.
Abstract: Clustering multi-dimensional points is a fundamental task in many fields, and density-based clustering supports many applications because it can discover clusters of arbitrary shape. This paper addresses the problem of Density-Peaks Clustering (DPC), a recently proposed density-based clustering framework. Although DPC already has many applications, its straightforward implementation incurs computation quadratic in the number of points in a given dataset, and thereby does not scale to large datasets. To enable DPC on large datasets, we propose efficient algorithms: an exact algorithm, Ex-DPC, and two approximation algorithms, Approx-DPC and S-Approx-DPC. Under a reasonable assumption about a DPC parameter, our algorithms are sub-quadratic, i.e., they break the quadratic barrier. Besides, Approx-DPC does not require any additional parameters and can return the same cluster centers as Ex-DPC, yielding an accurate clustering result. S-Approx-DPC requires an approximation parameter but can further speed up the computation. We further show that all of them can be accelerated by leveraging multicore processing. We conduct extensive experiments using synthetic and real datasets, and our experimental results demonstrate that our algorithms are efficient, scalable, and accurate.
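
The quadratic baseline the paper improves on is short enough to show (our NumPy sketch): every point gets a density rho and a distance delta to its nearest denser point, and cluster centers are the points where both are large.

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """O(n^2) reference for Density-Peaks Clustering center selection."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    rho = (D < d_c).sum(axis=1) - 1   # neighbors within cutoff d_c
    delta = np.empty(len(X))
    for i in range(len(X)):           # this loop is the quadratic bottleneck
        denser = np.where(rho > rho[i])[0]
        delta[i] = D[i, denser].min() if denser.size else D[i].max()
    return np.argsort(rho * delta)[-n_centers:]  # high rho and high delta
```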

Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this article, a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index is proposed, addressing the fact that sequential SCAN variants are prohibitively slow on large graphs and that existing parallel variants do not effectively share work among queries with different SCAN parameter settings.
Abstract: SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share work among queries with different SCAN parameter settings. Since users of SCAN often explore many parameter settings to find good clusterings, it is worthwhile to precompute an index that speeds up queries. This paper presents a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel algorithm improves upon the asymptotic work of the sequential algorithm by using integer sorting. It is also highly parallel, achieving logarithmic span (parallel time) for both index construction and clustering queries. Furthermore, we apply locality-sensitive hashing (LSH) to design a novel approximate SCAN algorithm and prove guarantees for its clustering behavior. We present an experimental evaluation of our algorithms on large real-world graphs. On a 48-core machine with two-way hyper-threading, our parallel index construction achieves 50--151× speedup over the construction of GS*-Index. In fact, even on a single thread, our index construction algorithm is faster than GS*-Index. Our parallel index query implementation achieves 5--32× speedup over GS*-Index queries across a range of SCAN parameter values, and our implementation is always faster than ppSCAN, a state-of-the-art parallel SCAN algorithm. Moreover, our experiments show that applying LSH results in faster index construction while maintaining good clustering quality.
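
The quantity such an index precomputes is the structural similarity of adjacent vertices; stored per vertex in sorted order, it lets an (ε, μ) query avoid recomputation. A direct sketch of the score itself (adjacency sets assumed):

```python
import math

def structural_similarity(adj, u, v):
    """SCAN similarity over closed neighborhoods:
    sigma(u, v) = |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|)."""
    Nu, Nv = adj[u] | {u}, adj[v] | {v}
    return len(Nu & Nv) / math.sqrt(len(Nu) * len(Nv))
```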

Journal ArticleDOI
TL;DR: A comparison among different sequential, parallel, and distributed ARM techniques, and the presentation of a novel ARM algorithm, named Balanced Parallel Association Rule Extractor from SNPs (BPARES), which employs parallel computing and a novel balancing strategy to improve response time.

Journal ArticleDOI
TL;DR: In this article, an approach for solving the incompressible Navier-Stokes equations on a forest of octree grids in a parallel environment is presented; existing approaches are not suitable for such large-scale data structures.

Journal ArticleDOI
TL;DR: A constrained multiperiod multiobjective portfolio model is established that introduces several constraints to reflect the trading restrictions and quantifies future security returns by fuzzy random variables to capture fuzzy and random uncertainties in the financial market.
Abstract: It is agreed that portfolio selection models are of great importance for the financial market. In this article, a constrained multiperiod multiobjective portfolio model is established. The model introduces several constraints to reflect trading restrictions and quantifies future security returns by fuzzy random variables to capture fuzzy and random uncertainties in the financial market. Meanwhile, it considers terminal wealth, conditional value at risk (CVaR), and skewness as the three criteria for decision making. Obviously, the proposed model is computationally challenging, and the situation gets worse when investors are interested in a larger financial market, since the data they need to analyze may constitute typical big data. A novel intelligent hybrid algorithm is then devised to solve the presented model. In this algorithm, the uncertain objectives of the model are approximated by a simulated annealing resilient back propagation (SARPROP) neural network trained on data provided by fuzzy random simulation. An improved imperialist competitive algorithm, named IFMOICA, is designed to search the solution space. The intelligent hybrid algorithm is compared with one obtained by combining NSGA-II, the SARPROP neural network, and fuzzy random simulation. The results demonstrate that the proposed algorithm significantly outperforms the compared one not only in running time but also in the quality of the obtained Pareto frontier. To improve computational efficiency and handle large-scale securities data, the algorithm is parallelized using MPI. The conducted experiments illustrate that the parallel algorithm is scalable and can solve model instances with more than 400 securities in an acceptable time.
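
Of the three criteria, CVaR has the least obvious estimator, so a small sample-based sketch may help (ours; returns are plain numbers, and the quantile-based tail average is one standard way to estimate it):

```python
import numpy as np

def cvar(returns, alpha=0.95):
    """Sample CVaR: expected loss in the worst (1 - alpha) tail."""
    losses = -np.asarray(returns)        # losses are negated returns
    var = np.quantile(losses, alpha)     # Value-at-Risk threshold
    return losses[losses >= var].mean()  # average of the tail losses
```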

Journal ArticleDOI
TL;DR: In this paper, the authors presented a highly parallel algorithm for the numerical simulation of unsteady blood flows in the patient-specific abdominal aorta before and after the aneurysmic repair.

Journal ArticleDOI
TL;DR: In this article, the authors combined coarse-grained strategies, based on multi-populations, with fine-grained strategies, based on a diffusion grid, to efficiently use a large number of processes.
Abstract: Several heuristic optimization algorithms have been applied to solve engineering problems. Most of these algorithms are based on populations that evolve according to different rules and parameters to reach the optimal value of a cost function through an iterative process. Different parallel strategies have been proposed to accelerate these algorithms. In this work, we combined coarse-grained strategies, based on multi-populations, with fine-grained strategies, based on a diffusion grid, to efficiently use a large number of processes and thus drastically decrease computing time. The Chaotic Jaya optimization algorithm is considered in this work due to its good optimization and computational behavior in solving both constrained engineering optimization problems (seven problems) and unconstrained benchmark functions (a set of 18 functions). The experimental results show that the proposed parallel algorithms outperform the state-of-the-art algorithms in terms of optimization behavior, according to the quality of the obtained solutions, and efficiently exploit shared-memory parallel platforms.
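
The per-candidate update that both grain levels parallelize is the parameter-free Jaya rule. One vectorized iteration looks like this (our sketch; the chaotic variant would replace the uniform draws with a chaotic map):

```python
import numpy as np

def jaya_step(pop, fitness, rng):
    """One Jaya iteration for minimization: move toward the best candidate
    and away from the worst. pop: (n, dim); fitness: (n,)."""
    best = pop[np.argmin(fitness)]
    worst = pop[np.argmax(fitness)]
    r1 = rng.random(pop.shape)  # fine-grained parallelism: every row's
    r2 = rng.random(pop.shape)  # update is independent of the others
    return pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop))
```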

Journal ArticleDOI
TL;DR: In this paper, Wang et al. propose a greedy framework to solve the Multiple Influence Maximization (MIM) problem, where multiple kinds of information propagate in a single network with different propagation probabilities, and the goal is to maximize the overall accumulative influence spread of the different information under a seed-budget limit.
Abstract: The Influence Maximization (IM) problem is to select influential users so as to maximize the influence spread, which plays an important role in many real-world applications such as product recommendation, epidemic control, and network monitoring. Nowadays, multiple kinds of information can propagate in online social networks simultaneously, but the current literature seldom discusses this phenomenon. Accordingly, in this article, we propose the Multiple Influence Maximization (MIM) problem, where multiple kinds of information propagate in a single network with different propagation probabilities. The goal of MIM is to maximize the overall accumulative influence spread of the different information under a seed-budget limit. To solve MIM, we first propose a greedy framework that maintains a provable approximation ratio. We further propose parallel algorithms based on semaphores, an inter-thread communication mechanism, which significantly improve the efficiency of our algorithms. We then conduct experiments for our framework using complex social network datasets with 12k, 154k, 317k, and 1.1m nodes; the experimental results show that our greedy framework outperforms other heuristic algorithms greatly in influence spread, and that parallelization reduces running time considerably with acceptable memory overhead.
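
A compact sketch of such a greedy framework (ours): spread is estimated by Monte Carlo simulation of the Independent Cascade model, and each candidate's marginal-gain evaluation in the inner loop is an independent task, which is what a semaphore-based parallelization can distribute across threads.

```python
import random

def ic_spread(graph, seeds, trials=200):
    """Monte Carlo influence spread; graph[u] = list of (v, p) edges."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v, p in graph.get(u, ()):
                    if v not in active and random.random() < p:
                        active.add(v)  # each vertex activates at most once
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

def greedy_im(graph, budget):
    """Greedy seed selection; the gains loop is the parallelizable part."""
    seeds = []
    for _ in range(budget):
        base = ic_spread(graph, seeds)
        gains = {v: ic_spread(graph, seeds + [v]) - base
                 for v in graph if v not in seeds}  # one task per candidate
        seeds.append(max(gains, key=gains.get))
    return seeds
```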


Journal ArticleDOI
TL;DR: The key ideas of the approach are increasing the communities' conductance score, limiting the speaker-listener stages, and executing a strategic update order, yielding a speaker-listener label propagation algorithm with better speedup and semi-deterministic results, without prior training or particular predefined features.
Abstract: Improving the performance of community detection is an NP-hard problem in large social network analysis, where integrating overlapping-community information with modularity maximization increases time complexity and memory usage. This paper presents an online parallel overlapping community detection approach based on a speaker-listener label propagation algorithm, proposing a novel parallel algorithm and applying three new metrics. The approach improves modularity and expands scalability, achieving significant speedup with low time consumption and memory usage through an agent-based parallel implementation on a multi-core architecture. The key ideas of our approach are increasing the communities' conductance score, limiting the speaker-listener stages, and executing a strategic update order, yielding a speaker-listener label propagation algorithm with better speedup and semi-deterministic results, without prior training or particular predefined features. Experimental results on large datasets, compared with state-of-the-art algorithms, show that the proposed method converges quickly, achieves an average 820% speedup over the label propagation algorithm, and significantly improves modularity, finding better overlapping communities in linear time complexity O(m) and with lower memory usage O(n).
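
For reference, here is the plain sequential speaker-listener rule that such an approach parallelizes and constrains (our baseline sketch; the paper adds conductance scoring, stage limiting, and a strategic update order on top):

```python
import random
from collections import Counter, defaultdict

def slpa(adj, rounds=20, threshold=0.1, seed=0):
    """Plain SLPA: overlapping communities from per-node label memories."""
    rng = random.Random(seed)
    memory = {v: Counter([v]) for v in adj}  # each node starts as its own label
    for _ in range(rounds):
        for v in rng.sample(list(adj), len(adj)):  # listeners in random order
            if not adj[v]:
                continue
            heard = Counter()
            for u in adj[v]:  # each neighbor speaks one remembered label
                labels, weights = zip(*memory[u].items())
                heard[rng.choices(labels, weights)[0]] += 1
            memory[v][heard.most_common(1)[0][0]] += 1  # keep the most popular
    communities = defaultdict(set)
    for v, mem in memory.items():
        total = sum(mem.values())
        for label, count in mem.items():
            if count / total >= threshold:
                communities[label].add(v)  # nodes may join several communities
    return list(communities.values())
```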