
Showing papers on "Parallel algorithm" published in 2021


Journal ArticleDOI
TL;DR: In this paper, a momentum-incorporated parallel stochastic gradient descent (MPSGD) algorithm is proposed to accelerate the convergence rate by integrating momentum effects into its training process.
Abstract: A recommender system (RS) relying on latent factor analysis usually adopts stochastic gradient descent (SGD) as its learning algorithm. However, owing to its serial mechanism, an SGD algorithm suffers from low efficiency and scalability when handling large-scale industrial problems. To address this issue, this study proposes a momentum-incorporated parallel stochastic gradient descent (MPSGD) algorithm, whose main idea is two-fold: a) implementing parallelization via a novel data-splitting strategy, and b) accelerating convergence by integrating momentum effects into the training process. Based on it, an MPSGD-based latent factor (MLF) model is obtained, which is capable of performing efficient and high-quality recommendations. Experimental results on four high-dimensional and sparse matrices generated by industrial RS indicate that, owing to the MPSGD algorithm, the MLF model outperforms existing state-of-the-art models in both computational efficiency and scalability.
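
A minimal sketch of the per-block update rule at the heart of such a scheme (our illustration, not the paper's code): MPSGD's data-splitting assigns disjoint blocks of the rating matrix to workers so that no two workers touch the same rows of P or columns of Q, and each worker then runs SGD with a momentum buffer on its own block.

```python
import numpy as np

def momentum_sgd_block(block, P, Q, lr=0.01, beta=0.9, reg=0.02, epochs=10):
    """Momentum SGD on one data block of (user, item, rating) triples.

    Illustrative sketch only: in MPSGD, blocks are chosen so that workers
    running this routine concurrently never update the same rows of P or Q.
    """
    vP = np.zeros_like(P)  # momentum buffers for the latent factors
    vQ = np.zeros_like(Q)
    for _ in range(epochs):
        for u, i, r in block:
            err = r - P[u] @ Q[i]
            gP = -err * Q[i] + reg * P[u]   # regularized gradient w.r.t. P[u]
            gQ = -err * P[u] + reg * Q[i]
            vP[u] = beta * vP[u] + lr * gP  # momentum accumulation
            vQ[i] = beta * vQ[i] + lr * gQ
            P[u] -= vP[u]
            Q[i] -= vQ[i]
    return P, Q
```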

108 citations


Journal ArticleDOI
TL;DR: In this article, a distributed framework for physics-informed neural networks (PINNs) is developed, based on two recent extensions, namely conservative PINNs (cPINNs) and extended PINNs (XPINNs), which employ domain decomposition in space and in space-time, respectively.

56 citations


Journal ArticleDOI
TL;DR: This article proposes a novel parallel tracking control optimization algorithm for interconnected systems, in which the working feedback control is treated as a reconstructed dynamic with a virtual control and a new augmented fuzzy interconnected tracking system is built, so that the performance index is valid for optimal control.
Abstract: In this article, a novel parallel tracking control optimization algorithm is proposed for partially unknown fuzzy interconnected systems. In the existing standard optimal tracking control, a bounded or non-asymptotically stable reference trajectory prevents the feedback control from converging to zero, which renders the performance index infinite and invalid. Using the precompensation technique, this article treats the working feedback control as a reconstructed dynamic with a virtual control and builds a new augmented fuzzy interconnected tracking system, so that the performance index is valid for optimal control. Then, combining the integral reinforcement learning (RL) method with decentralized control design, a novel integral RL parallel algorithm is developed to solve the tracking controls for interconnected systems, relaxing the requirement of exact knowledge of the matrices $A_i^k$ and $B_i^k$ during the solving process. Both the convergence and the stability of the designed control optimization scheme are guaranteed by theorems. Finally, the new parallel tracking algorithm is verified on a dual-manipulator coordination system, and simulation results demonstrate its effectiveness.

37 citations


Journal ArticleDOI
TL;DR: A novel competitive co-evolution scheme, named co-evolution of parameterized search (CEPS), is proposed; it is capable of obtaining generalizable PAPs with few training instances and leads to better generalization.
Abstract: Generalization, i.e., the ability of solving problem instances that are not available during the system design and development phase, is a critical goal for intelligent systems. A typical way to achieve good generalization is to learn a model from vast data. In the context of heuristic search, such a paradigm could be implemented as configuring the parameters of a parallel algorithm portfolio (PAP) based on a set of “training” problem instances, which is often referred to as PAP construction. However, compared to the traditional machine learning, PAP construction often suffers from the lack of training instances, and the obtained PAPs may fail to generalize well. This article proposes a novel competitive co-evolution scheme, named co-evolution of parameterized search (CEPS), as a remedy to this challenge. By co-evolving a configuration population and an instance population, CEPS is capable of obtaining generalizable PAPs with few training instances. The advantage of CEPS in improving generalization is analytically shown in this article. Two concrete algorithms, namely, CEPS-TSP and CEPS-VRPSPDTW, are presented for the traveling salesman problem (TSP) and the vehicle routing problem with simultaneous pickup–delivery and time windows (VRPSPDTW), respectively. The experimental results show that CEPS has led to better generalization, and even managed to find new best-known solutions for some instances.
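
The competitive loop can be summarized in a few lines. The sketch below is our reading of the abstract, with all problem-specific operators (score, mutate_config, mutate_instance) left as hypothetical user-supplied functions; the PAP's value on an instance is the best score among its member configurations.

```python
import random

def ceps(configs, instances, score, mutate_config, mutate_instance, rounds=10):
    """Competitive co-evolution skeleton (our sketch of the CEPS idea).

    score(c, inst) -> higher is better; mutate_config and mutate_instance
    are hypothetical variation operators supplied by the user.
    """
    pap = lambda inst: max(score(c, inst) for c in configs)  # portfolio = best member
    for _ in range(rounds):
        # Configuration phase: try to improve the PAP on its hardest instance.
        hard = min(instances, key=pap)
        cand = mutate_config(random.choice(configs))
        weakest = min(configs, key=lambda c: score(c, hard))
        if score(cand, hard) > score(weakest, hard):
            configs[configs.index(weakest)] = cand
        # Instance phase: breed instances on which the current PAP does badly.
        newcomer = mutate_instance(random.choice(instances))
        easiest = max(instances, key=pap)
        if pap(newcomer) < pap(easiest):
            instances[instances.index(easiest)] = newcomer
    return configs
```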

30 citations


Journal ArticleDOI
TL;DR: Algorithms for temporal parallelization of Bayesian smoothers are presented; their advantage is that they reduce the complexity of standard smoothing algorithms from linear to logarithmic in the number of time steps.
Abstract: This article presents algorithms for temporal parallelization of Bayesian smoothers. We define the elements and the operators to pose these problems as solutions to all-prefix-sums operations, for which efficient parallel scan algorithms are available. We present the temporal parallelization of the general Bayesian filtering and smoothing equations, and specialize them to linear/Gaussian models. The advantage of the proposed algorithms is that they reduce the complexity of standard smoothing algorithms from linear to logarithmic in the number of time steps.
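
The core trick is that the filtering and smoothing recursions can be phrased in terms of an associative operator, so an all-prefix-sums (scan) computes every partial composition in logarithmic depth. A toy illustration, with affine maps x -> a*x + b standing in for the paper's filtering elements (the real operators act on Gaussian parameters):

```python
def scan(xs, op):
    """Inclusive prefix scan for an associative op.

    Written sequentially, but each level's loop is embarrassingly parallel,
    which is what gives the O(log n) depth the paper exploits.
    """
    n = len(xs)
    if n == 1:
        return list(xs)
    paired = [op(xs[2 * i], xs[2 * i + 1]) for i in range(n // 2)]  # parallel level
    sub = scan(paired, op)                    # half-size recursive scan
    out = [xs[0]]
    for i in range(1, n):                     # parallel expansion level
        out.append(sub[i // 2] if i % 2 == 1 else op(sub[i // 2 - 1], xs[i]))
    return out

def compose(f, g):
    """Apply f first, then g, for affine maps (a, b): x -> a*x + b."""
    return (g[0] * f[0], g[0] * f[1] + g[1])

steps = [(0.9, 1.0)] * 8          # 8 identical one-step recursions
prefixes = scan(steps, compose)   # prefixes[t]: composed map from x0 to x_{t+1}
```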

28 citations


Proceedings ArticleDOI
17 Feb 2021
TL;DR: TurboTransformers as mentioned in this paper is a transformer serving system for NLP tasks on GPUs, which consists of a computing runtime and a serving framework, which can achieve the state-of-the-art transformer model serving performance on GPU platforms.
Abstract: The transformer is the most important algorithmic innovation in Natural Language Processing (NLP) in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process the sequence-length dimension in parallel, which leads to better accuracy on long sequences. However, efficient deployment for online services in GPU-equipped data centers is not easy. First, the additional computation introduced by transformer structures makes it harder to meet the latency and throughput constraints of serving. Second, NLP tasks take in sentences of variable length, and this variability of input dimensions poses a severe problem for efficient memory management and serving optimization. To solve these challenges, this paper presents a transformer serving system called TurboTransformers, which consists of a computing runtime and a serving framework. Three innovative features make it stand out from similar works. An efficient parallel algorithm is proposed for GPU-based batch reduction operations, like Softmax and LayerNorm, which are the major hot spots besides BLAS routines. A memory allocation algorithm, which better balances memory footprint and allocation/free efficiency, is designed for variable-length inputs. A serving framework equipped with a new batch scheduler based on dynamic programming achieves optimal throughput on variable-length requests. The system achieves state-of-the-art transformer model serving performance on GPU platforms and can be seamlessly integrated into PyTorch code with a few lines of code.
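
As an illustration of the batch-reduction problem the paper optimizes, here is the math of a softmax over a padded batch of variable-length rows in NumPy (our sketch only; TurboTransformers implements this as a parallel GPU routine):

```python
import numpy as np

def masked_softmax(scores, lengths):
    """Softmax over padded variable-length rows; scores: (batch, max_len)."""
    mask = np.arange(scores.shape[1])[None, :] < lengths[:, None]
    s = np.where(mask, scores, -np.inf)    # padding never receives probability
    s = s - s.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.where(mask, np.exp(s), 0.0)
    return e / e.sum(axis=1, keepdims=True)

probs = masked_softmax(np.random.rand(4, 10), np.array([10, 7, 3, 5]))
```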

24 citations


Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this article, the authors present new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN*) based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations.
Abstract: This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN*). Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN*. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters of different scale that arise for both EMST and HDBSCAN*. We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time). We implement our algorithms and propose a memory optimization that requires only a subset of well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13--55.89x, and existing parallel algorithms by at least an order of magnitude.
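
The EMST stage reduces to running Kruskal's algorithm over a small candidate edge set (the bichromatic closest pairs of the well-separated pair decomposition). A minimal union-find Kruskal, assuming the candidate pairs have already been computed:

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:  # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def kruskal(n, candidates):
    """MST over candidate edges (weight, u, v); for EMST, the candidates
    come from bichromatic closest pairs of well-separated pairs, which
    provably contain all MST edges."""
    uf = UnionFind(n)
    return [(u, v, w) for w, u, v in sorted(candidates) if uf.union(u, v)]
```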

20 citations


Journal ArticleDOI
TL;DR: The recovery of geodesic distance with the heat method can be reformulated as optimization of its gradients subject to integrability, which can be solved using an efficient first-order method that requires no linear system solving and converges quickly.
Abstract: In this paper, we propose a parallel and scalable approach for geodesic distance computation on triangle meshes. Our key observation is that the recovery of geodesic distance with the heat method [1] can be reformulated as optimization of its gradients subject to integrability, which can be solved using an efficient first-order method that requires no linear system solving and converges quickly. Afterward, the geodesic distance is efficiently recovered by parallel integration of the optimized gradients in breadth-first order. Moreover, we employ a similar breadth-first strategy to derive a parallel Gauss-Seidel solver for the diffusion step in the heat method. To further lower the memory consumption from gradient optimization on faces, we also propose a formulation that optimizes the projected gradients on edges, which reduces the memory footprint by about 50 percent. Our approach is trivially parallelizable, with a low memory footprint that grows linearly with respect to the model size. This makes it particularly suitable for handling large models. Experimental results show that it can efficiently compute geodesic distance on meshes with more than 200 million vertices on a desktop PC with 128 GB RAM, outperforming the original heat method and other state-of-the-art geodesic distance solvers.
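
The final recovery step is simple enough to show directly. In this hedged sketch (our own data layout, with edge_grad[(u, v)] approximating d(v) - d(u) from the optimized gradients), distances are integrated outward in breadth-first order; all vertices at the same BFS depth are independent, which is what makes the step parallel:

```python
from collections import deque

def integrate_bfs(adj, edge_grad, source):
    """Recover per-vertex distances by breadth-first integration of
    optimized gradient differences from the source."""
    dist = {source: 0.0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in dist:  # vertices at one BFS depth are independent
                dist[v] = dist[u] + edge_grad[(u, v)]
                frontier.append(v)
    return dist
```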

20 citations


Journal ArticleDOI
TL;DR: This paper examines semi-analytical solution (SAS) methods as the coarse operators of the Parareal algorithm and compares the performance of the SAS methods with standard numerical time integration methods.
Abstract: With continuing advances in high-performance parallel computing platforms, parallel algorithms have become powerful tools for developing faster-than-real-time power system dynamic simulations. In particular, it has been demonstrated in recent years that parallel-in-time (Parareal) algorithms have the potential to achieve such an ambitious goal. The selection of a fast and reasonably accurate coarse operator of the Parareal algorithm is crucial for its effective utilization and performance. This paper examines semi-analytical solution (SAS) methods as the coarse operators of the Parareal algorithm and compares their performance with standard numerical time integration methods. Two promising time-power-series-based SAS methods were considered, the Adomian decomposition method and the homotopy analysis method, with a windowing approach for improving convergence. Numerical performance case studies on a 10-generator 39-bus system and a 327-generator 2383-bus system were performed for these coarse operators over different disturbances, evaluating the number of Parareal iterations, computational time, and stability of convergence. All the coarse operators tested under different scenarios converged to the same true solution (when convergent), and the SAS methods provided comparable computational speed while exhibiting more stable convergence in many cases.
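
The Parareal correction that the paper builds on is compact enough to state in code. A generic sketch, where g_coarse would be one of the SAS methods (Adomian decomposition or homotopy analysis) and f_fine is the accurate but expensive integrator:

```python
def parareal(f_fine, g_coarse, u0, t, iters=5):
    """Parareal: U[n+1] <- G(U[n]) + F_old[n] - G_old[n].

    f_fine(u, t0, t1) and g_coarse(u, t0, t1) advance a state from t0 to t1.
    The expensive f_fine calls are independent across time windows, hence
    parallel-in-time; only the cheap coarse sweep is sequential.
    """
    N = len(t) - 1
    U = [u0]
    for n in range(N):  # initial coarse sweep
        U.append(g_coarse(U[n], t[n], t[n + 1]))
    for _ in range(iters):
        F = [f_fine(U[n], t[n], t[n + 1]) for n in range(N)]      # parallel step
        G_old = [g_coarse(U[n], t[n], t[n + 1]) for n in range(N)]
        U_new = [u0]
        for n in range(N):  # sequential correction sweep
            U_new.append(g_coarse(U_new[n], t[n], t[n + 1]) + F[n] - G_old[n])
        U = U_new
    return U
```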

19 citations


Journal ArticleDOI
TL;DR: A novel approach for database damage assessment for healthcare systems, inspired by the current behavior of COVID-19 infections, that outperforms other existing algorithms in this domain in terms of both time and memory.
Abstract: In the current Internet of Things era, companies have shifted from paper-based data to electronic formats. Although this shift has increased the efficiency of data processing, it has security drawbacks. Healthcare databases are a precious target for attackers because they facilitate identity theft and cybercrime. This paper presents an approach to database damage assessment for healthcare systems, inspired by the behavior of COVID-19 infections: malicious transactions are viewed as infections and traced from the point of infection onward. The challenge of this research is to discover the infected transactions in minimal time. The proposed parallel algorithm is based on the transaction dependency paradigm, with a time complexity of O((M + NQ + N^3)/L), where M is the total number of transactions under scrutiny, N the number of malicious and affected transactions in the testing list, Q the time for a dependency check, and L the number of threads used. The memory complexity of the algorithm is O(N + KL), where K is the number of transactions in one area handled by one thread. Since the damage assessment time is directly proportional to the denial-of-service time, the proposed algorithm minimizes execution time. It is a novel approach that outperforms other existing algorithms in this domain in terms of both time and memory, running up to four times faster and using 120,000 fewer bytes of memory.
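
A hedged sketch of the traversal at the core of such an assessment (our data layout, not the paper's code): the malicious set is the initial "infection", and each level of the dependency graph is expanded with the frontier split across L worker threads.

```python
from concurrent.futures import ThreadPoolExecutor

def assess_damage(dependents, malicious, n_threads=4):
    """Find all transactions reachable from the malicious set.

    dependents[t] lists transactions that read data written by t
    (the transaction dependency paradigm).
    """
    infected = set(malicious)
    frontier = list(malicious)

    def expand(chunk):
        # Each thread scans its share of the current frontier.
        return [d for t in chunk for d in dependents.get(t, ())
                if d not in infected]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        while frontier:
            chunks = [frontier[i::n_threads] for i in range(n_threads)]
            new = set()
            for hits in pool.map(expand, chunks):
                new.update(hits)
            new -= infected     # main thread is the only writer
            infected |= new
            frontier = list(new)
    return infected
```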

18 citations


Journal ArticleDOI
TL;DR: In this article, a parallel algorithm for the coupled-cluster singles and doubles method augmented with a perturbative correction for triple excitations [CCSD(T)] using the resolution-of-the-identity (RI) approximation for two-electron repulsion integrals (ERIs).
Abstract: A parallel algorithm is described for the coupled-cluster singles and doubles method augmented with a perturbative correction for triple excitations [CCSD(T)] using the resolution-of-the-identity (RI) approximation for two-electron repulsion integrals (ERIs). The algorithm bypasses the storage of four-center ERIs by adopting an integral-direct strategy. The CCSD amplitude equations are given in a compact quasi-linear form by factorizing them in terms of amplitude-dressed three-center intermediates. A hybrid MPI/OpenMP parallelization scheme is employed, which uses the OpenMP-based shared memory model for intranode parallelization and the MPI-based distributed memory model for internode parallelization. Parallel efficiency has been optimized for all terms in the CCSD amplitude equations. Two different algorithms have been implemented for the rate-limiting terms in the CCSD amplitude equations that entail $O(N_O^2 N_V^4)$- and $O(N_O^3 N_V^3)$-scaling computational costs, where $N_O$ and $N_V$ denote the number of correlated occupied and virtual orbitals, respectively. One of the algorithms assembles the four-center ERIs, requiring $N_V^4$- and $N_O^2 N_V^2$-scaling memory costs, in a distributed manner on a number of MPI ranks, while the other algorithm completely bypasses the assembly of quartic-memory-scaling ERIs and thus largely reduces the memory demand. It is demonstrated that the former memory-expensive algorithm is faster on a few hundred cores, while the latter memory-economic algorithm shows better strong scaling in the limit of a few thousand cores. The program exhibits near-linear scaling, in particular for the compute-intensive triples correction step, on up to 8000 cores. The performance of the program is demonstrated via calculations involving molecules with 24-51 atoms and up to 1624 atomic basis functions. As the first application, the complete basis set (CBS) limit for the interaction energy of the π-stacked uracil dimer from the S66 data set has been investigated. This work reports the first calculation of the interaction energy at the CCSD(T)/aug-cc-pVQZ level without local orbital approximation. The CBS limit for the CCSD correlation contribution to the interaction energy was found to be -8.01 kcal/mol, which agrees very well with the value -7.99 kcal/mol reported by Schmitz, Hättig, and Tew [Phys. Chem. Chem. Phys. 2014, 16, 22167-22178]. The CBS limit for the total interaction energy was estimated to be -9.64 kcal/mol.
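
The RI idea itself fits in one contraction. In this toy NumPy illustration (dimensions and the random B tensor are placeholders), the quartic-memory four-center ERIs are assembled on demand from three-center factors, which is exactly what lets an integral-direct algorithm avoid storing them:

```python
import numpy as np

n_occ, n_virt, n_aux = 4, 8, 20           # toy sizes, not a real basis
B = np.random.rand(n_aux, n_occ, n_virt)  # three-center intermediates B^P_ia

# (ia|jb) is approximated by sum_P B^P_ia * B^P_jb; in practice it is formed
# blockwise on demand, never materialized as a full quartic tensor on one node.
eri_iajb = np.einsum('Pia,Pjb->iajb', B, B)
```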

Journal ArticleDOI
TL;DR: This study formalizes a random-matrix particle swarm optimization scheduling algorithm (RMPSO), which uses a random integer matrix to represent a particle's position and a feasible task-scheduling scheme in order to optimize the total cost of cloud services, and proposes two parallel RMPSO algorithms.

Journal ArticleDOI
29 Apr 2021-PLOS ONE
TL;DR: In this article, the authors present a hardware-agnostic implementation strategy for lattice Boltzmann simulations, which yields massive performance on homogeneous and heterogeneous many-core platforms.
Abstract: We present a novel, hardware-agnostic implementation strategy for lattice Boltzmann (LB) simulations, which yields massive performance on homogeneous and heterogeneous many-core platforms. Based solely on C++17 Parallel Algorithms, our approach does not rely on any language extensions, external libraries, vendor-specific code annotations, or pre-compilation steps. Thanks in particular to a recently proposed GPU back-end to C++17 Parallel Algorithms, we show that a single code can compile and reach state-of-the-art performance on both many-core CPU and GPU environments for the solution of a given non-trivial fluid dynamics problem. The proposed strategy is tested with six different, commonly used implementation schemes to assess the performance impact of memory access patterns on different platforms. Nine different LB collision models are included in the tests and exhibit good performance, demonstrating the versatility of our parallel approach. This work shows that it is less necessary than ever to draw a distinction between research and production software, as a concise and generic LB implementation yields performance comparable to that achievable in a hardware-specific programming language. The results also highlight the performance gains achieved by modern many-core CPUs and their apparent capability to narrow the gap with the traditionally much faster GPU platforms. All code is made available to the community in the form of the open-source project stlbm, which serves both as a stand-alone simulation software and as a collection of reusable patterns for the acceleration of pre-existing LB codes.

Proceedings ArticleDOI
06 Jul 2021
TL;DR: In this paper, a parallel k-clique listing algorithm with improved work bounds was presented for sparse graphs with low degeneracy or arboricity, where the pruning criterion for a backtracking search was introduced and analyzed.
Abstract: We present a parallel k-clique listing algorithm with improved work bounds (for the same depth) in sparse graphs with low degeneracy or arboricity. We achieve this by introducing and analyzing a new pruning criterion for a backtracking search. Our algorithm has better asymptotic performance, especially for larger cliques (when k is not constant), where we avoid the straightforwardly exponential runtime growth with respect to the clique size. In particular, for cliques that are a constant factor smaller than the graph's degeneracy, the work improvement is an exponential factor in the clique size compared to previous results. Moreover, we present a low-depth approximation to the community degeneracy (which can be arbitrarily smaller than the degeneracy). This approximation enables a low depth clique listing algorithm whose runtime is parameterized by the community degeneracy.
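
For concreteness, here is the standard sequential backbone that such algorithms refine: orient edges by a degeneracy ordering, then backtrack over common out-neighborhoods. This is our baseline sketch; the paper's contribution is a sharper pruning rule and a parallel analysis, both omitted here.

```python
import heapq

def degeneracy_order(adj):
    """Repeatedly peel a minimum-degree vertex (lazy-deletion heap)."""
    deg = {v: len(adj[v]) for v in adj}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    order, removed = [], set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry
        order.append(v)
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order

def list_k_cliques(adj, k):
    """List each k-clique once, as an increasing path in the degeneracy DAG."""
    pos = {v: i for i, v in enumerate(degeneracy_order(adj))}
    out = {v: {u for u in adj[v] if pos[u] > pos[v]} for v in adj}
    found = []

    def grow(clique, cands):
        if len(clique) == k:
            found.append(tuple(clique))
            return
        if len(clique) + len(cands) < k:  # simple pruning; the paper's is sharper
            return
        for v in cands:
            grow(clique + [v], cands & out[v])

    for v in adj:
        grow([v], out[v])
    return found
```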

Journal ArticleDOI
TL;DR: This paper examines a new parallel computation model called bulk synchronous farm (BSF) that focuses on estimating the scalability of compute-intensive iterative algorithms aimed at cluster computing systems and presents a cost metric of the BSF model.

Proceedings ArticleDOI
06 Jul 2021
TL;DR: In this paper, the authors propose a new data type, lazy-batched priority queue (LaB-PQ), which abstracts the semantics of the priority queue needed by the stepping algorithms.
Abstract: The single-source shortest-path (SSSP) problem is a notoriously hard problem in the parallel context. In practice, the Δ-stepping algorithm of Meyer and Sanders has been widely adopted. However, Δ-stepping has no known worst-case bounds for general graphs, and the performance highly relies on the parameter Δ, which requires exhaustive tuning. The parallel SSSP algorithms with provable bounds, such as Radius-stepping, either have no implementations available or are much slower than Δ-stepping in practice. We propose the stepping algorithm framework that generalizes existing algorithms such as Δ-stepping and Radius-stepping. The framework allows for similar analysis and implementations for all stepping algorithms. We also propose a new abstract data type, lazy-batched priority queue (LaB-PQ), that abstracts the semantics of the priority queue needed by the stepping algorithms. We provide two data structures for LaB-PQ, focusing on theoretical and practical efficiency, respectively. Based on the new framework and LaB-PQ, we show two new stepping algorithms, ρ-stepping and Δ^*-stepping, that are simple, with non-trivial worst-case bounds, and fast in practice. We also show improved bounds for a list of existing algorithms such as Radius-stepping. Based on our framework, we implement three algorithms: Bellman-Ford, Δ^*-stepping, and ρ-stepping. We compare the performance with four state-of-the-art implementations. On five social and web graphs, ρ-stepping is 1.3--2.6x faster than all the existing implementations. On two road graphs, our Δ^*-stepping is at least 14% faster than existing ones, while ρ-stepping is also competitive. The almost identical implementations for stepping algorithms also allow for in-depth analyses among the stepping algorithms in practice.
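
A minimal bucket-based Δ-stepping, the baseline the framework generalizes (our simplified version skips the light/heavy edge split; the per-bucket batch operations are what a LaB-PQ abstracts):

```python
from collections import defaultdict

def delta_stepping(graph, source, delta):
    """graph[u] = list of (v, w); returns shortest distances from source."""
    dist = defaultdict(lambda: float('inf'))
    dist[source] = 0.0
    buckets = defaultdict(set)
    buckets[0].add(source)
    while buckets:
        i = min(buckets)            # smallest non-empty bucket
        frontier = buckets.pop(i)
        while frontier:             # re-settle vertices that stay in bucket i
            nxt = set()
            for u in frontier:      # this loop runs in parallel in practice
                for v, w in graph.get(u, ()):
                    nd = dist[u] + w
                    if nd < dist[v]:
                        dist[v] = nd
                        b = int(nd // delta)
                        (nxt if b == i else buckets[b]).add(v)
            frontier = nxt
    return dict(dist)
```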

Journal ArticleDOI
01 May 2021
TL;DR: A parallel and scalable model, referred to as S-DI (Scalable Dunn Index), is proposed to compute the Dunn Index for internal validation of clustering results; it shows good scalability and reliable validation compared to other existing measures when handling large-scale data.
Abstract: Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms have been proposed to build partitionings over huge volumes of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To deal with this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for internal validation of clustering results. Rather than computing the Dunn Index on a single machine in the clustering validation process, the proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is further enhanced by a Sketch-and-Validate sampling technique, which approximates the Dunn Index value using a small representative data sample. Experiments on simulated and real datasets showed good scalability of our proposed measure and reliable validation compared to other existing measures when handling large-scale data.
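
The measure being distributed is itself just a pair of reductions, which is what makes it a good fit for Spark. A single-machine NumPy reference (our sketch):

```python
import numpy as np

def dunn_index(points, labels):
    """Dunn Index = min inter-cluster distance / max intra-cluster diameter."""
    groups = [points[labels == c] for c in np.unique(labels)]
    diameter = max(   # max reduction over within-cluster pairwise distances
        np.linalg.norm(g[:, None] - g[None, :], axis=-1).max()
        for g in groups if len(g) > 1)
    separation = min( # min reduction over between-cluster pairwise distances
        np.linalg.norm(a[:, None] - b[None, :], axis=-1).min()
        for i, a in enumerate(groups) for b in groups[i + 1:])
    return separation / diameter
```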

Journal ArticleDOI
TL;DR: In this article, Wang et al. propose a parallel blockwise knowledge distillation algorithm to accelerate the distillation process of sophisticated DNNs, which leverages local information to conduct independent blockwise distillation and utilizes depthwise separable layers as the efficient replacement block architecture.
Abstract: Deep neural networks (DNNs) have been extremely successful in solving many challenging AI tasks in natural language processing, speech recognition, and computer vision. However, DNNs are typically computation-intensive, memory-demanding, and power-hungry, which significantly limits their usage on platforms with constrained resources. Therefore, a variety of compression techniques (e.g., quantization, pruning, and knowledge distillation) have been proposed to reduce the size and power consumption of DNNs. Blockwise knowledge distillation is one of the compression techniques that can effectively reduce the size of a highly complex DNN, but it is not widely adopted due to its long training time. In this article, we propose a novel parallel blockwise distillation algorithm to accelerate the distillation of sophisticated DNNs. Our algorithm leverages local information to conduct independent blockwise distillation, utilizes depthwise separable layers as the efficient replacement block architecture, and properly addresses limiting factors (e.g., dependency, synchronization, and load balancing) that affect parallelism. Experimental results on an AMD server with four GeForce RTX 2080 Ti GPUs show that our algorithm achieves 3x speedup with 19 percent energy savings on VGG distillation, and 3.5x speedup with 29 percent energy savings on ResNet distillation, both with negligible accuracy loss. The speedup of ResNet distillation can be further improved to 3.87x when using four RTX 6000 GPUs in a distributed cluster.
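
The independence that enables the parallelism can be seen in a short PyTorch-style sketch (ours, with hypothetical block lists): each student block regresses onto its teacher block's output, fed with frozen teacher features, so no student block waits on another.

```python
import torch
import torch.nn as nn

def distill_block(teacher_blocks, student_block, idx, loader, epochs=1):
    """Train one student block against teacher block idx, independently.

    Because the input comes from frozen teacher features (not from other
    student blocks), every block's training can run on its own GPU.
    """
    opt = torch.optim.Adam(student_block.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                for tb in teacher_blocks[:idx]:   # teacher features up to idx
                    x = tb(x)
                target = teacher_blocks[idx](x)   # what the student must match
            loss = mse(student_block(x), target)  # purely local objective
            opt.zero_grad()
            loss.backward()
            opt.step()
```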

Journal ArticleDOI
TL;DR: These studies demonstrate that the proposed CUDA-GPU parallel algorithms are applicable and reliable for the large-scale engineering applications of non-spherical granular systems.

Journal ArticleDOI
TL;DR: A novel extension of Theatre, Parallel Theatre, developed to exploit the computing potential of today's shared-memory multi-core machines, is described, together with the particular control forms developed for untimed and timed parallel systems.

Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this paper, Wang et al. propose an exact algorithm, Ex-DPC, and two approximation algorithms, Approx-DPC and S-Approx-DPC, to enable DPC on large datasets.
Abstract: Clustering multi-dimensional points is a fundamental task in many fields, and density-based clustering supports many applications because it can discover clusters of arbitrary shape. This paper addresses the problem of Density-Peaks Clustering (DPC), a recently proposed density-based clustering framework. Although DPC already has many applications, its straightforward implementation incurs computation quadratic in the number of points in a given dataset, and thereby does not scale to large datasets. To enable DPC on large datasets, we propose efficient algorithms: an exact algorithm, Ex-DPC, and two approximation algorithms, Approx-DPC and S-Approx-DPC. Under a reasonable assumption about a DPC parameter, our algorithms are sub-quadratic, i.e., they break the quadratic barrier. Besides, Approx-DPC does not require any additional parameters and can return the same cluster centers as Ex-DPC, yielding an accurate clustering result. S-Approx-DPC requires an approximation parameter but can further speed up the computation. We further show that all of them can be accelerated by leveraging multicore processing. We conduct extensive experiments using synthetic and real datasets, and our experimental results demonstrate that our algorithms are efficient, scalable, and accurate.
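
The quadratic baseline the paper improves on is short enough to show (our NumPy sketch): every point gets a density rho and a distance delta to its nearest denser point, and cluster centers are the points where both are large.

```python
import numpy as np

def density_peaks(X, d_c, n_centers):
    """O(n^2) reference for Density-Peaks Clustering center selection."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    rho = (D < d_c).sum(axis=1) - 1   # neighbors within cutoff d_c
    delta = np.empty(len(X))
    for i in range(len(X)):           # this loop is the quadratic bottleneck
        denser = np.where(rho > rho[i])[0]
        delta[i] = D[i, denser].min() if denser.size else D[i].max()
    return np.argsort(rho * delta)[-n_centers:]  # high rho and high delta
```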

Proceedings ArticleDOI
09 Jun 2021
TL;DR: In this article, a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index is proposed, addressing the fact that sequential SCAN variants are prohibitively slow on large graphs and that existing parallel variants do not effectively share work among queries with different SCAN parameter settings.
Abstract: SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share work among queries with different SCAN parameter settings. Since users of SCAN often explore many parameter settings to find good clusterings, it is worthwhile to precompute an index that speeds up queries. This paper presents a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel algorithm improves upon the asymptotic work of the sequential algorithm by using integer sorting. It is also highly parallel, achieving logarithmic span (parallel time) for both index construction and clustering queries. Furthermore, we apply locality-sensitive hashing (LSH) to design a novel approximate SCAN algorithm and prove guarantees for its clustering behavior. We present an experimental evaluation of our algorithms on large real-world graphs. On a 48-core machine with two-way hyper-threading, our parallel index construction achieves 50--151× speedup over the construction of GS*-Index. In fact, even on a single thread, our index construction algorithm is faster than GS*-Index. Our parallel index query implementation achieves 5--32× speedup over GS*-Index queries across a range of SCAN parameter values, and our implementation is always faster than ppSCAN, a state-of-the-art parallel SCAN algorithm. Moreover, our experiments show that applying LSH results in faster index construction while maintaining good clustering quality.
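
The quantity such an index precomputes is the structural similarity of adjacent vertices; stored per vertex in sorted order, it lets an (ε, μ) query avoid recomputation. A direct sketch of the score itself (adjacency sets assumed):

```python
import math

def structural_similarity(adj, u, v):
    """SCAN similarity over closed neighborhoods:
    sigma(u, v) = |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|)."""
    Nu, Nv = adj[u] | {u}, adj[v] | {v}
    return len(Nu & Nv) / math.sqrt(len(Nu) * len(Nv))
```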

Journal ArticleDOI
TL;DR: A comparison among different sequential, parallel, and distributed ARM techniques, and the presentation of a novel ARM algorithm, named Balanced Parallel Association Rule Extractor from SNPs (BPARES), which employs parallel computing and a novel balancing strategy to improve response time.

Journal ArticleDOI
TL;DR: In this article, an approach for solving the incompressible Navier-Stokes equations on a forest of octree grids in a parallel environment is presented; existing approaches are not suitable for such large-scale data structures.

Journal ArticleDOI
TL;DR: A constrained multiperiod multiobjective portfolio model is established that introduces several constraints to reflect the trading restrictions and quantifies future security returns by fuzzy random variables to capture fuzzy and random uncertainties in the financial market.
Abstract: It is agreed that portfolio selection models are of great importance for the financial market. In this article, a constrained multiperiod multiobjective portfolio model is established. The model introduces several constraints to reflect trading restrictions and quantifies future security returns by fuzzy random variables to capture fuzzy and random uncertainties in the financial market. Meanwhile, it considers terminal wealth, conditional value at risk (CVaR), and skewness as the three criteria for decision making. Obviously, the proposed model is computationally challenging, and the situation gets worse when investors are interested in a larger financial market, since the data they need to analyze may constitute typical big data. A novel intelligent hybrid algorithm is then devised to solve the presented model. In this algorithm, the uncertain objectives of the model are approximated by a simulated annealing resilient back propagation (SARPROP) neural network trained on data provided by fuzzy random simulation. An improved imperialist competitive algorithm, named IFMOICA, is designed to search the solution space. The intelligent hybrid algorithm is compared with one obtained by combining NSGA-II, the SARPROP neural network, and fuzzy random simulation. The results demonstrate that the proposed algorithm significantly outperforms the compared one not only in running time but also in the quality of the obtained Pareto frontier. To improve computational efficiency and handle large-scale securities data, the algorithm is parallelized using MPI. The conducted experiments illustrate that the parallel algorithm is scalable and can solve model instances with more than 400 securities in an acceptable time.
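
Of the three criteria, CVaR has the least obvious estimator, so a small sample-based sketch may help (ours; returns are plain numbers, and the quantile-based tail average is one standard way to estimate it):

```python
import numpy as np

def cvar(returns, alpha=0.95):
    """Sample CVaR: expected loss in the worst (1 - alpha) tail."""
    losses = -np.asarray(returns)        # losses are negated returns
    var = np.quantile(losses, alpha)     # Value-at-Risk threshold
    return losses[losses >= var].mean()  # average of the tail losses
```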

Journal ArticleDOI
TL;DR: In this paper, the authors presented a highly parallel algorithm for the numerical simulation of unsteady blood flows in the patient-specific abdominal aorta before and after the aneurysmic repair.

Journal ArticleDOI
TL;DR: In this article, the authors combined coarse-grained strategies, based on multi-populations, with fine-grained strategies, based on a diffusion grid, to efficiently use a large number of processes.
Abstract: Several heuristic optimization algorithms have been applied to solve engineering problems. Most of these algorithms are based on populations that evolve according to different rules and parameters to reach the optimal value of a cost function through an iterative process. Different parallel strategies have been proposed to accelerate these algorithms. In this work, we combined coarse-grained strategies, based on multi-populations, with fine-grained strategies, based on a diffusion grid, to efficiently use a large number of processes and thus drastically decrease computing time. The Chaotic Jaya optimization algorithm is considered in this work due to its good optimization and computational behavior in solving both constrained engineering optimization problems (seven problems) and unconstrained benchmark functions (a set of 18 functions). The experimental results show that the proposed parallel algorithms outperform the state-of-the-art algorithms in terms of optimization behavior, according to the quality of the obtained solutions, and efficiently exploit shared-memory parallel platforms.
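
The per-candidate update that both grain levels parallelize is the parameter-free Jaya rule. One vectorized iteration looks like this (our sketch; the chaotic variant would replace the uniform draws with a chaotic map):

```python
import numpy as np

def jaya_step(pop, fitness, rng):
    """One Jaya iteration for minimization: move toward the best candidate
    and away from the worst. pop: (n, dim); fitness: (n,)."""
    best = pop[np.argmin(fitness)]
    worst = pop[np.argmax(fitness)]
    r1 = rng.random(pop.shape)  # fine-grained parallelism: every row's
    r2 = rng.random(pop.shape)  # update is independent of the others
    return pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop))
```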

Journal ArticleDOI
TL;DR: In this paper, Wang et al. propose a greedy framework to solve the Multiple Influence Maximization (MIM) problem, where multiple kinds of information propagate in a single network with different propagation probabilities, and the goal is to maximize the overall accumulative influence spread of the different information under a seed-budget limit.
Abstract: The Influence Maximization (IM) problem is to select influential users so as to maximize the influence spread, which plays an important role in many real-world applications such as product recommendation, epidemic control, and network monitoring. Nowadays, multiple kinds of information can propagate in online social networks simultaneously, but the current literature seldom discusses this phenomenon. Accordingly, in this article, we propose the Multiple Influence Maximization (MIM) problem, where multiple kinds of information propagate in a single network with different propagation probabilities. The goal of MIM is to maximize the overall accumulative influence spread of the different information under a seed-budget limit. To solve MIM, we first propose a greedy framework that maintains a provable approximation ratio. We further propose parallel algorithms based on semaphores, an inter-thread communication mechanism, which significantly improve the efficiency of our algorithms. We then conduct experiments for our framework using complex social network datasets with 12k, 154k, 317k, and 1.1m nodes; the experimental results show that our greedy framework outperforms other heuristic algorithms greatly in influence spread, and that parallelization reduces running time considerably with acceptable memory overhead.
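
A compact sketch of such a greedy framework (ours): spread is estimated by Monte Carlo simulation of the Independent Cascade model, and each candidate's marginal-gain evaluation in the inner loop is an independent task, which is what a semaphore-based parallelization can distribute across threads.

```python
import random

def ic_spread(graph, seeds, trials=200):
    """Monte Carlo influence spread; graph[u] = list of (v, p) edges."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v, p in graph.get(u, ()):
                    if v not in active and random.random() < p:
                        active.add(v)  # each vertex activates at most once
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

def greedy_im(graph, budget):
    """Greedy seed selection; the gains loop is the parallelizable part."""
    seeds = []
    for _ in range(budget):
        base = ic_spread(graph, seeds)
        gains = {v: ic_spread(graph, seeds + [v]) - base
                 for v in graph if v not in seeds}  # one task per candidate
        seeds.append(max(gains, key=gains.get))
    return seeds
```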


Journal ArticleDOI
TL;DR: The key ideas of the approach are increasing the communities' conductance score, limiting the speaker-listener stages, and executing a strategic update order, yielding a speaker-listener label propagation algorithm with better speedup and semi-deterministic results, without prior training or particular predefined features.
Abstract: Improving the performance of community detection is an NP-hard problem in large social network analysis, where integrating overlapping-community information with modularity maximization increases time complexity and memory usage. This paper presents an online parallel overlapping community detection approach based on a speaker-listener label propagation algorithm, proposing a novel parallel algorithm and applying three new metrics. The approach improves modularity and expands scalability, achieving significant speedup with low time consumption and memory usage through an agent-based parallel implementation on a multi-core architecture. The key ideas of our approach are increasing the communities' conductance score, limiting the speaker-listener stages, and executing a strategic update order, yielding a speaker-listener label propagation algorithm with better speedup and semi-deterministic results, without prior training or particular predefined features. Experimental results on large datasets, compared with state-of-the-art algorithms, show that the proposed method converges quickly, achieves an average 820% speedup over the label propagation algorithm, and significantly improves modularity, finding better overlapping communities in linear time complexity O(m) and with lower memory usage O(n).
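
For reference, here is the plain sequential speaker-listener rule that such an approach parallelizes and constrains (our baseline sketch; the paper adds conductance scoring, stage limiting, and a strategic update order on top):

```python
import random
from collections import Counter, defaultdict

def slpa(adj, rounds=20, threshold=0.1, seed=0):
    """Plain SLPA: overlapping communities from per-node label memories."""
    rng = random.Random(seed)
    memory = {v: Counter([v]) for v in adj}  # each node starts as its own label
    for _ in range(rounds):
        for v in rng.sample(list(adj), len(adj)):  # listeners in random order
            if not adj[v]:
                continue
            heard = Counter()
            for u in adj[v]:  # each neighbor speaks one remembered label
                labels, weights = zip(*memory[u].items())
                heard[rng.choices(labels, weights)[0]] += 1
            memory[v][heard.most_common(1)[0][0]] += 1  # keep the most popular
    communities = defaultdict(set)
    for v, mem in memory.items():
        total = sum(mem.values())
        for label, count in mem.items():
            if count / total >= threshold:
                communities[label].add(v)  # nodes may join several communities
    return list(communities.values())
```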