Showing papers on "Parallel algorithm published in 2022"


Journal ArticleDOI
03 Feb 2022
TL;DR: In this article, a parallel dimension-independent node positioning algorithm based on Poisson disc sampling is presented for use on shared-memory computers, such as modern workstations with multi-core processors.
Abstract: In this paper, we present a novel parallel dimension-independent node positioning algorithm that is capable of generating nodes with variable density, suitable for meshless numerical analysis. A very efficient sequential algorithm based on Poisson disc sampling is parallelized for use on shared-memory computers, such as modern workstations with multi-core processors. The parallel algorithm uses a global spatial indexing method with its data divided into two levels, which allows for an efficient multi-threaded implementation. The addition of bootstrapping enables the algorithm to use any number of parallel threads while remaining as general as its sequential variant. We demonstrate the algorithm performance on six complex 2- and 3-dimensional domains, which are either of non-rectangular shape or have varying nodal spacing or both. We perform a run-time analysis of the algorithm to demonstrate its ability to reach high speedups regardless of the domain and to show how well it scales on the experimental hardware with 16 processor cores. We also analyse the algorithm in terms of the effects of domain shape, quality of point placement, and various parallelization overheads.
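The geometric core of Poisson-disc-style node generation is the spacing test: a candidate node is accepted only if it lies at least h away from every node accepted so far, and a uniform background grid (a flat, single-level stand-in for the paper's two-level spatial index) reduces this test to a constant-size neighbourhood lookup. The sketch below is a minimal sequential 2-D illustration under the assumption of a constant spacing h; the structure and its names are hypothetical and not taken from the paper's code.

```cpp
#include <cmath>
#include <map>
#include <utility>

// Background grid for Poisson-disc-style spacing tests. With cell size
// h / sqrt(2), each cell can hold at most one accepted node, so a candidate
// only needs to inspect a small fixed neighbourhood of cells.
struct SpacingGrid {
    double h;     // required minimum node spacing (assumed constant here)
    double cell;  // edge length of one grid cell
    std::map<std::pair<long, long>, std::pair<double, double>> occupied;

    explicit SpacingGrid(double spacing) : h(spacing), cell(spacing / std::sqrt(2.0)) {}

    std::pair<long, long> cellOf(double x, double y) const {
        return { (long)std::floor(x / cell), (long)std::floor(y / cell) };
    }

    // True if the candidate (x, y) is at least h away from every accepted node.
    bool isFree(double x, double y) const {
        auto c = cellOf(x, y);
        for (long di = -2; di <= 2; ++di)
            for (long dj = -2; dj <= 2; ++dj) {
                auto it = occupied.find({ c.first + di, c.second + dj });
                if (it == occupied.end()) continue;
                double dx = it->second.first - x, dy = it->second.second - y;
                if (dx * dx + dy * dy < h * h) return false;
            }
        return true;
    }

    void accept(double x, double y) { occupied[cellOf(x, y)] = { x, y }; }
};
```

In the parallel algorithm described above, multiple threads would query and update such an index concurrently, which is what the two-level data division and the bootstrapping phase are designed to make safe.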

17 citations


Journal ArticleDOI
TL;DR: In this paper, a heterogeneous parallel algorithm for simulation of turbulent flows and its portable software implementation is presented, based on a family of higher accuracy edge-based reconstruction schemes on unstructured mixed-element meshes.

17 citations


Journal ArticleDOI
TL;DR: Experiments show that the proposed parallel algorithm is efficient, and outperforms state-of-the-art serial algorithms in terms of runtime, memory consumption, and scalability.

9 citations


Proceedings ArticleDOI
25 May 2022
TL;DR: This paper presents work-efficient and round-efficient algorithms for a variety of classic problems, proposes general approaches for achieving these properties, and uses two types of general techniques to enable work-efficiency and high parallelism.
Abstract: Some recent papers showed that many sequential iterative algorithms can be directly parallelized, by identifying the dependences between the input objects. This approach yields many simple and practical parallel algorithms, but there are still challenges to achieving work-efficiency and high parallelism. Work-efficiency means that the number of operations is asymptotically the same as the best sequential solution. This can be hard for certain problems where the number of dependences between objects is asymptotically more than the optimal sequential work, and we cannot even afford the cost to generate them. To achieve high parallelism, we want to process as many objects as possible in parallel. The goal is to achieve O(D) span for a problem with deepest dependence length D. We refer to this property as round-efficiency. This paper presents work-efficient and round-efficient algorithms for a variety of classic problems and proposes general approaches to do so. To efficiently parallelize many sequential iterative algorithms, we propose the phase-parallel framework. The framework assigns a rank to each object and processes the objects based on the order of their ranks. All objects with the same rank can be processed in parallel. To enable work-efficiency and high parallelism, we use two types of general techniques. Type 1 algorithms use range queries to extract all objects with the same rank, to avoid evaluating all the dependences. We discuss activity selection and Dijkstra's algorithm using the Type 1 framework. Type 2 algorithms wake up an object when the last object it depends on is finished. We discuss activity selection, longest increasing subsequence (LIS), greedy maximal independent set (MIS), and many other algorithms using the Type 2 framework. All of our algorithms are (nearly) work-efficient and round-efficient, and some of them (e.g., LIS) are the first to achieve both. Many of them improve the previous best bounds. Moreover, we implement many of them for experimental studies. On inputs with reasonable dependence depth, our algorithms are highly parallelized and significantly outperform their sequential counterparts.
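The phase-parallel idea can be pictured on a plain dependence DAG: in every round, the objects whose dependences have all finished share the same rank and are mutually independent, so the round can run in parallel, and the number of rounds equals the deepest dependence length D. The sketch below shows that round structure with OpenMP in the spirit of the Type 2 ("wake up") approach; it is a generic illustration with made-up names (deps, dependents), not the paper's work-efficient algorithms.

```cpp
#include <omp.h>
#include <vector>

// Phase-parallel processing of a dependence DAG: objects with the same
// "rank" (depth in the DAG) are handled in the same round, in parallel.
// deps[i]       = number of unfinished objects that i depends on
// dependents[i] = objects that depend on i
void phaseParallel(std::vector<int> deps,
                   const std::vector<std::vector<int>>& dependents) {
    int n = (int)deps.size();
    std::vector<int> frontier;
    for (int i = 0; i < n; ++i)
        if (deps[i] == 0) frontier.push_back(i);

    int round = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        // All objects in the frontier share the same rank and are independent.
        #pragma omp parallel
        {
            std::vector<int> local;                  // thread-local output buffer
            #pragma omp for nowait
            for (int k = 0; k < (int)frontier.size(); ++k) {
                int u = frontier[k];
                // ... process object u here ...
                for (int v : dependents[u]) {
                    int remaining;
                    #pragma omp atomic capture
                    remaining = --deps[v];
                    if (remaining == 0) local.push_back(v);  // v wakes up
                }
            }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier.swap(next);
        ++round;                                     // total rounds = deepest rank D
    }
}
```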

9 citations


Journal ArticleDOI
TL;DR: In this article, a component-oriented power flow model for district heating networks (DHNs) is presented, in which the models of the three basic components of a DHN, namely the pipelines, pressure sources, and junctions, are investigated in detail.

7 citations


Journal ArticleDOI
TL;DR: The impact of GPS spoofing attacks on a synchrophasor-assisted power system state estimation algorithm is studied at both the estimation level and the coordination level.

6 citations


Journal ArticleDOI
TL;DR: This work presents a novel parallel algorithmic framework for updating the Single Source Shortest Path in large-scale dynamic networks and implements it on the shared-memory and GPU platforms.
Abstract: The Single Source Shortest Path (SSSP) problem is a classic graph theory problem that arises frequently in various practical scenarios; hence, many parallel algorithms have been developed to solve it. However, these algorithms operate on static graphs, whereas many real-world problems are best modeled as dynamic networks, where the structure of the network changes with time. This gap between the dynamic graph modeling and the assumed static graph model in the conventional SSSP algorithms motivates this work. We present a novel parallel algorithmic framework for updating the SSSP in large-scale dynamic networks and implement it on the shared-memory and GPU platforms. The basic idea is to identify the portion of the network affected by the changes and update the information in a rooted tree data structure that stores the edges of the network that are most relevant to the analysis. Extensive experimental evaluations on real-world and synthetic networks demonstrate that our proposed parallel updating algorithm is scalable and, in most cases, requires significantly less execution time than the state-of-the-art recomputing-from-scratch algorithms.
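For edge insertions, the "identify the affected portion, then propagate" idea has a very compact form: every inserted edge that improves a distance seeds a frontier, and improvements are pushed outwards until they die out. The sketch below is a sequential, insertion-only illustration of that idea with hypothetical type names; handling deletions needs the rooted-tree invalidation step described in the abstract, and the paper's versions run the propagation in parallel on shared-memory CPUs and GPUs.

```cpp
#include <queue>
#include <tuple>
#include <vector>

struct Edge { int to; double w; };

// Update SSSP distances after inserting a batch of edges (u, v, w).
// dist holds correct distances for the old graph; adj is the adjacency list
// of the graph *after* the insertions. Unreachable vertices carry infinity.
void updateAfterInsertions(std::vector<double>& dist,
                           const std::vector<std::vector<Edge>>& adj,
                           const std::vector<std::tuple<int, int, double>>& inserted) {
    std::queue<int> frontier;
    // Step 1: identify vertices directly affected by the new edges.
    for (const auto& [u, v, w] : inserted)
        if (dist[u] + w < dist[v]) { dist[v] = dist[u] + w; frontier.push(v); }

    // Step 2: propagate the improvements until they die out.
    while (!frontier.empty()) {
        int u = frontier.front(); frontier.pop();
        for (const Edge& e : adj[u])
            if (dist[u] + e.w < dist[e.to]) {
                dist[e.to] = dist[u] + e.w;
                frontier.push(e.to);
            }
    }
}
```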

6 citations


Journal ArticleDOI
TL;DR: In this paper, a unified MPI + MPI technique is introduced for extreme parallelism on a large-scale computer cluster, in which MPI persistent nonblocking two-sided communication and direct data connection between processors in the same node via MPI shared memory windows are implemented to minimize communication.
Abstract: In this communication, a novel message passing interface (MPI) parallel algorithm for the nodal discontinuous Galerkin time-domain (NDGTD) method has been developed. A unified MPI + MPI technique has been introduced for extreme parallelism on a large-scale computer cluster. Through data transmission between CPU nodes using MPI persistent nonblocking two-sided communication, and direct data connection between processors in the same node via MPI shared memory windows, a two-layered parallel architecture is implemented to minimize the communication. To further accelerate the solution of multiscale problems, the local time stepping (LTS) technique has been employed in the NDGTD method. A fast time step estimation method has been presented in this communication. With high overlap between the information transmission and the data calculation, the proposed MPI + MPI scheme overcomes the degradation of the parallel efficiency of the pure MPI technique in the scenario of the LTS technique and large-scale CPU core counts. Up to 94% parallel efficiency on 6400 CPU cores is achieved for an average single-core load of about 1700 finite elements, and an 18 times acceleration of the time step estimation is obtained with the fourth-order basis function. Three practical complex examples are given to demonstrate the good performance of the proposed method.
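The persistent nonblocking requests mentioned above are set up once and then only started and completed inside the time-stepping loop, which is what keeps the per-step communication cost low and lets it overlap with computation. The fragment below is a generic ring-exchange sketch of that MPI pattern (buffer sizes and neighbours are placeholders), not the NDGTD solver itself.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1024;                        // placeholder halo size
    std::vector<double> sendBuf(N, rank), recvBuf(N);
    int right = (rank + 1) % size;             // exchange with ring neighbours
    int left  = (rank - 1 + size) % size;

    // Set the communication up once: persistent send/recv requests.
    MPI_Request reqs[2];
    MPI_Send_init(sendBuf.data(), N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvBuf.data(), N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    for (int step = 0; step < 100; ++step) {
        // ... fill sendBuf with this step's boundary data ...
        MPI_Startall(2, reqs);                 // start both transfers
        // ... overlap: update interior elements while messages are in flight ...
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        // ... recvBuf now holds the neighbour's boundary data ...
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}
```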

5 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a new parallel algorithm for computing formal concepts that is composed of two parallel phases; it parallelizes both the computation of the top L recursion levels and the workload distribution, and decouples the worker threads from the main thread.

4 citations


Journal ArticleDOI
TL;DR: In this article, a reduced-communication parallel algorithm for deterministic dynamic mode decomposition of large datasets on distributed memory architectures is presented, which offers notable savings in computational and communication costs.

3 citations


Proceedings ArticleDOI
01 May 2022
TL;DR: This paper designs shared-memory parallel algorithms that obtain the biconnected components of a graph after the insertion or deletion of a batch of edges and shows that these algorithms outperform state-of-the-art parallel static algorithms.
Abstract: Finding the biconnected components of a graph has a large number of applications in many other graph problems, including planarity testing, computing centrality metrics, finding the (weighted) vertex cover, coloring, and the like. Recent years saw the design of efficient algorithms for this problem across sequential and parallel computational models. However, current algorithms do not work in the setting where the underlying graph changes over time in a dynamic manner via the insertion or deletion of edges. Dynamic algorithms in the sequential setting that obtain the biconnected components of a graph upon insertion or deletion of a single edge have been known for over two decades. Parallel algorithms for this problem are not heavily studied. In this paper, we design shared-memory parallel algorithms that obtain the biconnected components of a graph subsequent to the insertion or deletion of a batch of edges. Our algorithms are hence capable of exploiting the parallelism afforded by a batch of updates. We implement our algorithms on an AMD EPYC 7742 CPU having 128 cores. Our experiments on a collection of 10 real-world graphs from multiple classes indicate that our algorithms outperform parallel state-of-the-art static algorithms. (The implementation and an extended version of this paper are available at [5].)

Journal ArticleDOI
TL;DR: The TOC-based AFS algorithm (TOC-PAFS) proposed in this paper effectively reduces the search time and improves the search performance on complex multi-peaked function optimization problems.

Journal ArticleDOI
TL;DR: The work is devoted to developing parallel algorithms for solving the initial-boundary value problem for the time-fractional diffusion equation, based on the Thomas algorithm, the parallel sweep algorithm, and the accelerated over-relaxation method.
Abstract: The work is devoted to developing parallel algorithms for solving the initial-boundary value problem for the time-fractional diffusion equation. After applying a finite-difference scheme to approximate the governing equation, the problem is reduced to solving a system of linear algebraic equations at each subsequent time level. The developed parallel algorithms are based on the Thomas algorithm, the parallel sweep algorithm, and the accelerated over-relaxation method for solving this system. Stability of the approximation scheme is established. The parallel implementations are developed for multicore CPUs using the OpenMP technology. Numerical experiments are performed to compare these methods and to study the performance of the parallel implementations. The parallel sweep method shows the lowest computing time.
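The Thomas algorithm on which the sweep methods build is a single forward-elimination and back-substitution pass over the tridiagonal system produced by the finite-difference scheme. A minimal sequential version is sketched below with generic array names; the parallel sweep algorithm in the paper splits this recurrence across threads and couples the pieces through a small reduced interface system.

```cpp
#include <vector>

// Thomas algorithm: solve a tridiagonal system A x = d, where
// a = sub-diagonal (a[0] unused), b = main diagonal, c = super-diagonal
// (c[n-1] unused). Overwrites d with the solution. O(n) work, but the
// recurrence is inherently sequential, which is what sweep methods address.
void thomasSolve(const std::vector<double>& a, std::vector<double> b,
                 const std::vector<double>& c, std::vector<double>& d) {
    const int n = (int)b.size();
    // Forward elimination.
    for (int i = 1; i < n; ++i) {
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Back substitution.
    d[n - 1] /= b[n - 1];
    for (int i = n - 2; i >= 0; --i)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}
```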

Journal ArticleDOI
TL;DR: In this paper, a parallel contact algorithm based on partial Dirichlet-Neumann boundary conditions is proposed to numerically solve a nonlinear contact problem between rigid and deformable bodies within a fully parallel framework.

Journal ArticleDOI
TL;DR: Based on a cellular automata evacuation model built on triangular meshing, a CPU-based parallel algorithm is applied to enhance the efficiency of the evacuation simulation algorithm, which is analyzed in terms of correctness, speedup, and scalability, as mentioned in this paper.


Proceedings ArticleDOI
19 Feb 2022
TL;DR: This work is the first to parallelise the state-of-the-art simple cycle enumeration algorithms of Johnson and Read-Tarjan, along with their applications to temporal graphs, in a fine-grained manner, and it experimentally demonstrates linear performance scaling.
Abstract: Enumerating simple cycles has important applications in computational biology, network science, and financial crime analysis. In this work, we focus on parallelising the state-of-the-art simple cycle enumeration algorithms by Johnson and Read-Tarjan along with their applications to temporal graphs. To our knowledge, we are the first ones to parallelise these two algorithms in a fine-grained manner. We are also the first to demonstrate experimentally a linear performance scaling. Such a scaling is made possible by our decomposition of long sequential searches into fine-grained tasks, which are then dynamically scheduled across CPU cores, enabling an optimal load balancing. Furthermore, we show that coarse-grained parallel versions of the Johnson and the Read-Tarjan algorithms that exploit edge- or vertex-level parallelism are not scalable. On a cluster of four multi-core CPUs with 256 physical cores, our fine-grained parallel algorithms are, on average, an order of magnitude faster than their coarse-grained parallel counterparts. The performance gap between the fine-grained and the coarse-grained parallel algorithms widens as we use more CPU cores. When using all 256 CPU cores, our parallel algorithms enumerate temporal cycles, on average, 260x faster than the serial algorithm of Kumar and Calders.

Journal ArticleDOI
TL;DR: In this paper, a new accurate parallel algorithm called PAccSumK for computing the summation of floating-point numbers is presented; it is based on the AccSumK algorithm and is designed to compute a result as if computed internally in K-fold working precision.


Journal ArticleDOI
TL;DR: Based on the shared memory feature of MPI-3, an electromagnetic particle simulation parallel algorithm and a dynamic load balancing algorithm are designed in the particle simulation software, and the implementation of the two algorithms improves parallel efficiency in different respects.
Abstract: Simulation of high-power microwave source devices generally uses parallel algorithms to speed up the computation. In recent years, with the upgrade of parallel technology, the parallel efficiency of particle simulation software has been further improved. The existing MPI-2 parallel technology of the particle simulation software CHIPIC realizes access to the local memory space of other processes through message passing. The new MPI-3 standard provides a shared memory feature, which allows data in a shared memory window to be accessed directly by each process and thus reduces information transmission. In this paper, based on the shared memory feature of MPI-3, an electromagnetic particle simulation parallel algorithm and a dynamic load balancing algorithm are designed in the particle simulation software. The implementation of the two algorithms improves parallel efficiency in different respects. The RKA and magnetic isolation oscillator high-power microwave devices are used as test models. The test results show that the electromagnetic particle simulation parallel algorithm based on the shared memory feature of MPI-3 can improve the efficiency of the software by up to 44%, and the dynamic load balancing algorithm based on MPI-3 can also improve the efficiency by up to 38%.
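The MPI-3 feature the paper builds on is the shared-memory window: ranks on the same node split off a node-local communicator, one of them allocates a segment, and every rank obtains a direct pointer into it instead of exchanging messages. The fragment below is a generic illustration of that mechanism using the standard MPI-3 calls; it is not CHIPIC's actual code, and the array size and synchronization are placeholders.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Communicator containing only the ranks that share this node's memory.
    MPI_Comm nodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodeComm);
    int nodeRank;
    MPI_Comm_rank(nodeComm, &nodeRank);

    // Rank 0 on the node allocates the shared array; the other ranks attach.
    const int count = 1000;                      // placeholder array length
    MPI_Aint mySize = (nodeRank == 0) ? count * (MPI_Aint)sizeof(double) : 0;
    double* base = nullptr;
    MPI_Win win;
    MPI_Win_allocate_shared(mySize, sizeof(double), MPI_INFO_NULL,
                            nodeComm, &base, &win);

    // Every rank asks where rank 0's segment lives and uses it directly,
    // with no message passing involved.
    MPI_Aint qSize; int qDisp; double* shared = nullptr;
    MPI_Win_shared_query(win, 0, &qSize, &qDisp, &shared);

    if (nodeRank == 0)
        for (int i = 0; i < count; ++i) shared[i] = i;   // fill the data once
    MPI_Barrier(nodeComm);                               // simple synchronization
    std::printf("rank %d sees shared[42] = %g\n", nodeRank, shared[42]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```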

Journal ArticleDOI
TL;DR: In this paper, a scalable domain-decomposition-based 3D incompressible Navier-Stokes solver for the simulation of unsteady, complex wind flows around urban communities is presented.

Proceedings ArticleDOI
Yan Gu, Ziyang Men, Zheqi Shen, Yihan Sun, Zijin Wan 
21 Aug 2022
TL;DR: In this paper, a parallel algorithm for the longest increasing subsequence (LIS) problem is proposed; sequentially, LIS can be solved in O(n log n) time using dynamic programming.
Abstract: This paper studies parallel algorithms for the longest increasing subsequence (LIS) problem. Let n be the input size and k be the LIS length of the input. Sequentially, LIS is a simple problem that can be solved using dynamic programming (DP) in O(n log n) work. However, parallelizing LIS is a long-standing challenge. We are unaware of any parallel LIS algorithm that has optimal O(n log n) work and non-trivial parallelism (i.e., Õ(k) or o(n) span). This paper proposes a parallel LIS algorithm that costs O(n log k) work, Õ(k) span, and O(n) space, and is much simpler than the previous parallel LIS algorithms. We also generalize the algorithm to a weighted version of LIS, which maximizes the weighted sum for all objects in an increasing subsequence. To achieve a better work bound for the weighted LIS algorithm, we designed parallel algorithms for the van Emde Boas (vEB) tree, which has the same structure as the sequential vEB tree and supports work-efficient parallel batch insertion, deletion, and range queries. We also implemented our parallel LIS algorithms. Our implementation is lightweight, efficient, and scalable. On input size 10^9, our LIS algorithm outperforms a highly-optimized sequential algorithm (with O(n log k) cost) on inputs with k ≤ 3 × 10^5. Our algorithm is also much faster than the best existing parallel implementation by Shen et al. (2022) on all input instances.
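The sequential baseline the parallel algorithm is compared against is the classic "tails" dynamic program: for every length, keep the smallest possible tail of an increasing subsequence of that length and place each new element by binary search. The textbook version is sketched below for reference; it runs in O(n log k) time but is inherently sequential, which is exactly the difficulty the paper's parallel algorithm addresses.

```cpp
#include <algorithm>
#include <vector>

// Length of the longest strictly increasing subsequence in O(n log k),
// where k is the answer. tails[len-1] holds the smallest possible tail value
// of an increasing subsequence of length len seen so far.
int lisLength(const std::vector<long long>& a) {
    std::vector<long long> tails;
    for (long long x : a) {
        auto it = std::lower_bound(tails.begin(), tails.end(), x);
        if (it == tails.end()) tails.push_back(x);   // extend the LIS by one
        else *it = x;                                // improve an existing tail
    }
    return (int)tails.size();
}
```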

Proceedings ArticleDOI
03 Mar 2022
TL;DR: In this article, a comparative computational performance study between the sequential versions of FCM and K-means and their two parallel versions is presented, focusing on the execution time of the parallel and sequential implementations as well as the speedup of the parallel versions with respect to the sequential ones.
Abstract: The classification task is a very popular preprocessing step in different research fields. Its main role is to separate the different components of an object or dataset into homogeneous regions or groups based on the similarity of their properties and features. Among the most popular clustering algorithms we cite fuzzy C-means (FCM) and K-means. In these iterative techniques, a distance metric between each actual dataset point and the estimated centroids is calculated at each iteration. In this paper, we implement four algorithms: sequential FCM, sequential K-means, and their two parallel versions. A comparative computational performance study between the sequential and the parallel versions is presented. This study focuses on the execution time of the parallel and sequential implementations as well as the speedup of the parallel versions with respect to the sequential ones. The experimental tests were conducted on a randomly generated numerical dataset. The parallel versions were implemented on the SIMD architecture of an Nvidia GPU. The influence of the variation of the data size and the number of clusters on the execution time was analyzed and interpreted.
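The step that both K-means and FCM expose to massive parallelism is the per-point assignment (or membership) computation, since every data point can be handled independently. The sketch below shows the K-means assignment step parallelised with OpenMP on the CPU purely to illustrate that data parallelism; the paper's parallel versions target the SIMD architecture of an Nvidia GPU instead.

```cpp
#include <omp.h>
#include <limits>
#include <vector>

// One K-means assignment step: label[i] = index of the closest centroid.
// points is n x d row-major, centroids is k x d row-major. Each point is
// independent, which is exactly the parallelism a GPU or a multicore exploits.
void assignClusters(const std::vector<double>& points, int n,
                    const std::vector<double>& centroids, int k, int d,
                    std::vector<int>& label) {
    label.resize(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int best = 0;
        double bestDist = std::numeric_limits<double>::infinity();
        for (int c = 0; c < k; ++c) {
            double dist = 0.0;
            for (int j = 0; j < d; ++j) {
                double diff = points[i * d + j] - centroids[c * d + j];
                dist += diff * diff;                 // squared Euclidean distance
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        label[i] = best;
    }
}
```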

Journal ArticleDOI
TL;DR: In this paper, a fine-grained parallel algorithm for computing the Morse-Smale complex and a GPU implementation (gMSC) are described; the main challenge is the non-trivial structure of the connections between the saddle critical points, which is not directly amenable to parallel computation.
Abstract: The Morse-Smale complex is a well-studied topological structure that represents the gradient flow behavior between critical points of a scalar function. It supports multi-scale topological analysis and visualization of feature-rich scientific data. Several parallel algorithms have been proposed towards the fast computation of the 3D Morse-Smale complex. Its computation continues to pose significant algorithmic challenges. In particular, the non-trivial structure of the connections between the saddle critical points is not amenable to parallel computation. This paper describes a fine-grained parallel algorithm for computing the Morse-Smale complex and a GPU implementation (gMSC). The algorithm first determines the saddle-saddle reachability via a transformation into a sequence of vector operations, and next computes the paths between saddles by transforming the problem into a sequence of matrix operations. Computational experiments show that the method achieves up to 8.6x speedup over pyms3d and 6x speedup over TTK, the current shared memory implementations. The paper also presents a comprehensive experimental analysis of the different steps of the algorithm and reports on their contribution towards runtime performance. Finally, it introduces a CPU-based data parallel algorithm for simplifying the Morse-Smale complex via iterative critical point pair cancellation.
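The "reachability as vector operations" reformulation can be pictured as repeatedly applying the adjacency structure to a Boolean frontier: each step ORs in everything reachable in one more hop, and each step is a flat, data-parallel pass over the edges, which is what maps well to the GPU. The sketch below shows this principle on a generic directed graph; it is an illustration of the idea only, not the gMSC saddle-saddle kernel.

```cpp
#include <utility>
#include <vector>

// Reachability from a source expressed as repeated Boolean "matrix-vector"
// products over the OR/AND semiring: next = A^T * reach. Every iteration is
// a flat pass over the edge list, which is easy to run data-parallel.
std::vector<char> reachableFrom(int src, int n,
                                const std::vector<std::pair<int, int>>& edges) {
    std::vector<char> reach(n, 0), next(n, 0);
    reach[src] = 1;
    bool changed = true;
    while (changed) {
        next = reach;
        for (const auto& [u, v] : edges)       // one Boolean mat-vec step
            if (reach[u]) next[v] = 1;
        changed = false;
        for (int i = 0; i < n; ++i)
            if (next[i] != reach[i]) { changed = true; break; }
        reach.swap(next);
    }
    return reach;
}
```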

Book ChapterDOI
01 Jan 2022
TL;DR: This article proposes two efficient parallel algorithms for summing n floating-point numbers that improve the reproducibility of the summations compared to those computed by the naive algorithm, regardless of the number of processors used for the computations.
Abstract: Floating-point arithmetic is prone to accuracy problems due to round-off errors. The combination of round-off errors and the out-of-order execution of arithmetic operations due to the scheduling of parallel tasks introduces additional numerical accuracy issues. In this article, we address the problem of improving the numerical accuracy and reproducibility of summation operators. We propose two efficient parallel algorithms for summing n floating-point numbers. The first objective of our algorithms is to obtain an accurate result without increasing the linear complexity of the naive algorithm. The second objective is to improve the reproducibility of the summations compared to those computed by the naive algorithm, regardless of the number of processors used for the computations.
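One simple way to see both the problem and a partial remedy is chunked compensated summation: each thread sums its chunk with Kahan compensation and the partial sums are then combined. The sketch below shows that generic pattern for illustration only; the article's two algorithms are different and come with stronger accuracy and reproducibility guarantees. Note that such code must be compiled without value-unsafe optimizations (e.g., -ffast-math), which would optimize the compensation away.

```cpp
#include <omp.h>
#include <vector>

// Parallel summation with per-chunk Kahan compensation. Each thread keeps a
// running sum and a correction term; the partial results are combined at the
// end. This reduces (but does not eliminate) the effect of round-off and of
// the summation order imposed by the parallel schedule.
double compensatedParallelSum(const std::vector<double>& x) {
    double total = 0.0;
    #pragma omp parallel
    {
        double sum = 0.0, comp = 0.0;            // thread-local Kahan state
        #pragma omp for nowait
        for (long long i = 0; i < (long long)x.size(); ++i) {
            double y = x[i] - comp;
            double t = sum + y;
            comp = (t - sum) - y;                // recover the lost low-order bits
            sum = t;
        }
        #pragma omp critical
        total += sum;                            // combine the partial sums
    }
    return total;
}
```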


Journal ArticleDOI
TL;DR: In this article, a simplified method to calculate the high-order transverse leakage terms is proposed based on the NEFD nodal SN scheme in XYZ geometry, and it is used to solve the discretized system of a small fast reactor and of the ZPPR-10B large fast reactor.

Journal ArticleDOI
TL;DR: An enhanced version of the SCA algorithm, called the ESCA algorithm, is proposed; it behaves outstandingly well in terms of exploration and exploitation behaviors, local optima avoidance, and convergence speed toward the optimum.
Abstract: The sine cosine algorithm’s main idea is the sine- and cosine-based fluctuation outwards from or towards the best solution. The first main contribution of this paper is an enhanced version of the SCA algorithm, called the ESCA algorithm. The supremacy of the proposed algorithm over a set of state-of-the-art algorithms in terms of solution accuracy and convergence speed is demonstrated by experimental tests. When these algorithms are transferred to the business sector, they must meet time requirements dependent on the industrial process. If these temporal requirements are not met, an efficient solution is to speed them up by designing parallel algorithms. The second major contribution of this work is the design of several parallel algorithms for efficiently exploiting current multicore processor architectures. First, one-level synchronous and asynchronous parallel ESCA algorithms are designed. They have two merits: they retain the proposed algorithm’s behavior, and they provide excellent parallel performance by combining coarse-grained parallelism with fine-grained parallelism. Moreover, the parallel scalability of the proposed algorithms is further improved by employing a two-level parallel strategy. Indeed, the experimental results suggest that the one-level parallel ESCA algorithms reduce the computing time, on average, by 87.4% and 90.8%, respectively, using 12 physical processing cores. The two-level parallel algorithms provide extra reductions of the computing time by 91.4%, 93.1%, and 94.5% with 16, 20, and 24 processing cores, including physical and logical cores. A comparison analysis is carried out on 30 unconstrained benchmark functions and three challenging engineering design problems. The experimental outcomes show that the proposed ESCA algorithm behaves outstandingly well in terms of exploration and exploitation behaviors, local optima avoidance, and convergence speed toward the optimum. The overall performance of the proposed algorithm is statistically validated using three non-parametric statistical tests, namely the Friedman, Friedman aligned, and Quade tests.
Keywords: constrained optimization; metaheuristic; heuristic algorithm; OpenMP; parallel algorithms; SCA algorithm; unconstrained optimization
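The computational heart of SCA-type methods, and the loop that the one-level coarse-grained parallelization distributes across cores, is the per-individual position update that oscillates around the best solution found so far. The sketch below shows one iteration of the standard SCA update (following Mirjalili's original formulation, with r1 shrinking over time and r2, r3, r4 drawn per dimension) parallelised over the population with OpenMP; it is an illustration of the pattern, not the authors' ESCA code, and all names are placeholders.

```cpp
#include <omp.h>
#include <cmath>
#include <random>
#include <vector>

// One iteration of the standard sine cosine algorithm (SCA) update, with the
// population loop parallelised in a coarse-grained way. pop is n x d
// row-major, best is the best solution found so far, and t / T is the
// iteration progress. Fitness evaluation is assumed to happen elsewhere.
void scaUpdate(std::vector<double>& pop, int n, int d,
               const std::vector<double>& best, int t, int T, double a = 2.0) {
    const double r1 = a - t * (a / T);                // shrinks exploration over time
    const double twoPi = 8.0 * std::atan(1.0);        // 2 * pi
    #pragma omp parallel
    {
        std::mt19937 gen(12345 + omp_get_thread_num());   // per-thread RNG
        std::uniform_real_distribution<double> U(0.0, 1.0);
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < d; ++j) {
                double r2 = twoPi * U(gen), r3 = 2.0 * U(gen), r4 = U(gen);
                double pull = std::fabs(r3 * best[j] - pop[i * d + j]);
                pop[i * d + j] += (r4 < 0.5) ? r1 * std::sin(r2) * pull
                                             : r1 * std::cos(r2) * pull;
            }
        }
    }
}
```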

Journal ArticleDOI
TL;DR: In this article, a scalable parallel algorithm for 3-D magnetotelluric (MT) finite element modeling in anisotropic media is proposed; it is based on distributed mesh storage, includes multiple parallel granularities, and is implemented through multiple tools.
Abstract: 3-D magnetotelluric (MT) forward modeling has always faced the problems of high memory requirements and long computing times. In this article, we design a scalable parallel algorithm for 3-D MT finite element modeling in anisotropic media. The parallel algorithm is based on distributed mesh storage, includes multiple parallel granularities, and is implemented through multiple tools. The message-passing interface (MPI) is used to exploit process-level parallelism over subdomains, frequencies, and the solution of equations. Thread-level parallelism for merge sorting, element analysis, matrix assembly, and imposing Dirichlet boundary conditions is developed with Open Multi-Processing (OpenMP). We validate the algorithm through several model simulations and study the effects of topography and conductivity anisotropy on apparent resistivities and phase responses. Scalability tests are performed on the Tianhe-2 supercomputer to analyze the parallel performance of the different parallel granularities. Three parallel direct solvers, Supernodal LU (SUPERLU), the MUltifrontal Massively Parallel sparse direct Solver (MUMPS), and the Parallel Sparse matriX package (PASTIX), are compared in solving the sparse systems of equations. As a result, reasonable parallel parameters are suggested for practical applications. The developed parallel algorithm is proven to be efficient and scalable.
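The division of labour described above (MPI across frequencies and subdomains, OpenMP inside each rank for assembly-type loops) can be summarised by a small hybrid skeleton: each rank takes a strided subset of frequencies, and the per-frequency element loop is threaded. The skeleton below is a generic illustration of that structure with placeholder problem sizes, not the authors' finite element code.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nFrequencies = 32;                 // placeholder problem setup
    const int nElements = 100000;

    // MPI level: each rank handles a strided subset of the frequencies.
    for (int f = rank; f < nFrequencies; f += size) {
        std::vector<double> diagonal(nElements, 0.0);
        // OpenMP level: element analysis / matrix assembly within one frequency.
        #pragma omp parallel for
        for (int e = 0; e < nElements; ++e) {
            // ... compute the element contribution for frequency f ...
            diagonal[e] += 1.0 + f;              // stand-in for real assembly
        }
        // ... hand the assembled system to a parallel direct solver here ...
        std::printf("rank %d assembled frequency %d using %d threads\n",
                    rank, f, omp_get_max_threads());
    }

    MPI_Finalize();
    return 0;
}
```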