
Showing papers on "Degree of parallelism" published in 2020


Proceedings ArticleDOI
01 Feb 2020
TL;DR: This work proposes a lightweight reconfigurable sparse-computation accelerator (Alrescha) that, compared to a GPU, achieves an average speedup of 15.6x for scientific sparse problems and 8x for graph algorithms while consuming 14x less energy.
Abstract: Sparse problems that dominate a wide range of applications fail to effectively benefit from high memory bandwidth and concurrent computations in modern high-performance computer systems. Therefore, hardware accelerators have been proposed to capture a high degree of parallelism in sparse problems. However, the unexplored challenge for sparse problems is the limited opportunity for parallelism because of data dependencies, a common computation pattern in scientific sparse problems. Our key insight is to extract parallelism by mathematically transforming the computations into equivalent forms. The transformation breaks down the sparse kernels into a majority of independent parts and a minority of data-dependent ones and reorders these parts to gain performance. To implement the key insight, we propose a lightweight reconfigurable sparse-computation accelerator (Alrescha). To efficiently run the data-dependent and parallel parts and to enable fast switching between them, Alrescha makes two contributions. First, it implements a compute engine with a fixed compute unit for the parallel parts and a lightweight reconfigurable engine for the execution of the data-dependent parts. Second, Alrescha benefits from a locally-dense storage format, with the right order of non-zero values to yield the order of computations dictated by the transformation. The combination of the lightweight reconfigurable hardware and the storage format enables uninterrupted streaming from memory. Our simulation results show that compared to GPU, Alrescha achieves an average speedup of 15.6x for scientific sparse problems, and 8x for graph algorithms. Moreover, compared to GPU, Alrescha consumes 14x less energy.

34 citations
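The key insight above — splitting a data-dependent sparse kernel into a majority of independent parts and a minority of dependent ones — is not spelled out in code in the abstract. As a rough illustration only, the sketch below uses classic level-set scheduling for a sparse lower-triangular solve to separate rows that can be processed in parallel from rows that must wait on earlier results; it is not Alrescha's mathematical transformation or its locally-dense storage format.

```python
import numpy as np
from scipy.sparse import csr_matrix

def level_sets(L):
    """Group the rows of a sparse lower-triangular matrix into levels.
    All rows in one level depend only on rows from earlier levels, so each
    level can be solved fully in parallel (the 'independent parts'); the
    sequence of levels carries the remaining data dependencies."""
    L = csr_matrix(L)
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        deps = [j for j in L.indices[L.indptr[i]:L.indptr[i + 1]] if j < i]
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    return [np.flatnonzero(level == l) for l in range(level.max() + 1)]
```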


Book ChapterDOI
17 Jun 2020
TL;DR: The study focuses on the quantum version of the k-NN algorithm, which makes it possible to understand the fundamentals of transcribing classical machine learning algorithms into their quantum versions.
Abstract: Artificial intelligence algorithms, developed for traditional computing, based on Von Neumann’s architecture, are slow and expensive in terms of computational resources. Quantum mechanics has opened up a new world of possibilities within this field, since, thanks to the basic properties of a quantum computer, a great degree of parallelism can be achieved in the execution of the quantum version of machine learning algorithms. In this paper, a study has been carried out on these properties and on the design of their quantum computing versions. More specifically, the study focuses on the quantum version of the k-NN algorithm, which makes it possible to understand the fundamentals of transcribing classical machine learning algorithms into their quantum versions.

23 citations
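For context on the "fundamentals" referred to above, here is the classical k-NN routine that a quantum version would re-express (quantum variants typically replace the distance computation with amplitude-encoded states evaluated in superposition). This is only the classical baseline, not the paper's quantum circuit.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classical k-nearest-neighbours classification by majority vote."""
    d = np.linalg.norm(X_train - x_query, axis=1)   # distances to every training point
    nearest = np.argsort(d)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```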


Journal ArticleDOI
TL;DR: This work implements a novel linear algebra library on top of the task-based runtime OmpSs-2: an auto-tunable library for linear algebra operations based on the LASs library that improves execution time over other reference libraries.

20 citations


Journal ArticleDOI
TL;DR: A dynamic load balancing scheme is presented that increases the efficiency of complex coupled simulations with non-trivial domain decompositions and also attenuates imbalances in parallel computations on heterogeneous computing hardware.

15 citations


Journal ArticleDOI
TL;DR: In this paper, a novel algorithm for online approximate string matching (OASM) is proposed that filters shadow hits on the fly according to general-purpose priority rules that greedily assign priorities to overlapping hits; an FPGA implementation is presented and compared with a serial software version.
Abstract: Among the basic cognitive skills of the biological brain in humans and other mammals, a fundamental one is the ability to recognize inexact patterns in a sequence of objects or events. Accelerating inexact string matching procedures is of utmost importance when dealing with practical applications where huge amounts of data must be processed in real time, as usual in bioinformatics or cybersecurity. Inexact matching procedures can yield multiple shadow hits, which must be filtered, according to some criterion, to obtain a concise and meaningful list of occurrences. The filtering procedures are often computationally demanding and are performed offline in a post-processing phase. This paper introduces a novel algorithm for online approximate string matching (OASM) able to filter shadow hits on the fly, according to general purpose priority rules that greedily assign priorities to overlapping hits. A field-programmable gate array (FPGA) hardware implementation of OASM is proposed and compared with a serial software version. Even when implemented on entry-level FPGAs, the proposed procedure can reach a high degree of parallelism and superior performance in time compared to the software implementation, while keeping low the usage of logic elements. This makes the developed architecture very competitive in terms of both performance and cost of the overall computing system.

15 citations
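The on-the-fly filtering of shadow hits can be pictured with the small sketch below, which greedily keeps the highest-scoring hit and drops any overlapping one. The score-based rule and the (start, end, score) representation are assumptions for illustration; the paper's general-purpose priority rules and FPGA datapath are more elaborate.

```python
def filter_overlapping_hits(hits):
    """hits: list of (start, end, score) with half-open [start, end) spans.
    Greedily accept hits in decreasing score order, discarding any hit that
    overlaps an already accepted one."""
    kept = []
    for h in sorted(hits, key=lambda h: -h[2]):
        if all(h[1] <= k[0] or h[0] >= k[1] for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h[0])

# e.g. filter_overlapping_hits([(0, 5, 2.0), (3, 8, 3.5), (9, 12, 1.0)])
# keeps (3, 8, 3.5) and (9, 12, 1.0) and drops the shadow hit (0, 5, 2.0).
```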


Journal ArticleDOI
TL;DR: It is shown that an analog deep neural network based on the proposed vector-matrix multiplier can achieve an inference accuracy comparable to digital solutions with an energy efficiency of 26.4 TOPs/J, a layer latency close to 100 µs and an intrinsically high degree of parallelism.
Abstract: We propose a CMOS Analog Vector-Matrix Multiplier for Deep Neural Networks, implemented in a standard single-poly 180 nm CMOS technology. The learning weights are stored in analog floating-gate memory cells embedded in current mirrors implementing the multiplication operations. We experimentally verify the analog storage capability of designed single-poly floating-gate cells, the accuracy of the multiplying function of proposed tunable current mirrors, and the effective number of bits of the analog operation. We perform system-level simulations to show that an analog deep neural network based on the proposed vector-matrix multiplier can achieve an inference accuracy comparable to digital solutions with an energy efficiency of 26.4 TOPs/J, a layer latency close to 100 µs and an intrinsically high degree of parallelism. Our proposed design also has a cost advantage, considering that it can be implemented in a standard single-poly CMOS process flow.

13 citations
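To make the idea of an analog vector-matrix multiply with a limited effective number of bits concrete, here is a toy behavioral model: weights are quantized to an assumed number of effective bits and the output is perturbed by read noise. All parameters (bit count, noise level) are illustrative assumptions, not measurements from the paper's 180 nm design.

```python
import numpy as np

def analog_vmm(x, W, effective_bits=6, noise_std=0.01, rng=None):
    """Toy behavioural model of an analog vector-matrix multiplier: the weight
    matrix is quantized to a limited number of effective bits (floating-gate
    storage) and the in-array multiply-accumulate result is perturbed by
    Gaussian read noise."""
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** (effective_bits - 1) - 1
    w_max = np.abs(W).max() or 1.0
    Wq = np.round(W / w_max * levels) / levels * w_max   # quantized stored weights
    y = Wq.T @ x                                         # all products accumulate "in parallel"
    return y + rng.normal(0.0, noise_std * (np.abs(y).max() + 1e-12), size=y.shape)
```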


Proceedings ArticleDOI
09 Mar 2020
TL;DR: This work presents a runtime manager for firm real-time applications that generates mapping segments from pre-computed partial solutions and aims to minimize overall energy consumption without deadline violations.
Abstract: Modern embedded computing platforms consist of a high amount of heterogeneous resources, which allows executing multiple applications on a single device. The number of running applications on the system varies with time and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varying degrees of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches on the overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.

12 citations
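A greatly simplified way to see the adaptation step is choosing, per application, the lowest-energy configuration (degree of parallelism) that still meets the deadline. The sketch below assumes per-configuration execution-time and energy estimates are available and ignores the paper's composition of partial mappings and runtime switching between segments.

```python
def pick_configuration(configs, deadline):
    """configs: list of dicts like {'cores': 4, 'exec_time': 8.2, 'energy': 3.1}
    for one application. Return the lowest-energy configuration whose execution
    time still meets the deadline."""
    feasible = [c for c in configs if c['exec_time'] <= deadline]
    if not feasible:
        raise ValueError("no configuration meets the deadline")
    return min(feasible, key=lambda c: c['energy'])

# e.g. pick_configuration([{'cores': 1, 'exec_time': 20, 'energy': 2.0},
#                          {'cores': 4, 'exec_time': 6,  'energy': 3.5},
#                          {'cores': 8, 'exec_time': 4,  'energy': 5.0}], deadline=10)
# returns the 4-core configuration.
```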


Journal ArticleDOI
TL;DR: In this article, the authors describe the algorithms implemented in FDPS to make efficient use of accelerator hardware such as GPGPUs; they construct a detailed performance model and find that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth.
Abstract: In this paper, we describe the algorithms we implemented in FDPS to make efficient use of accelerator hardware such as GPGPUs. We have developed FDPS to make it possible for many researchers to develop their own high-performance parallel particle-based simulation programs without spending a large amount of time on parallelization and performance tuning. The basic idea of FDPS is to provide a high-performance implementation of parallel algorithms for particle-based simulations in a "generic" form, so that researchers can define their own particle data structures and interparticle interaction functions and supply them to FDPS. FDPS compiled with user-supplied data types and interaction functions provides all necessary functions for parallelization, and using those functions researchers can write their programs as though they were writing a simple non-parallel program. It has been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the available degree of parallelism of the interaction function and that of the hardware. We have modified the interface of the user-provided interaction function so that accelerators are used more efficiently. We also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS, and the achieved performance is around 27% of the theoretical peak. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth.

12 citations


Proceedings ArticleDOI
29 Jun 2020
TL;DR: This work proposes a fuzzy logic-based fairness control mechanism that characterizes the degree of flow intensity of a workload and assigns priorities to the workloads, and observes that the proposed mechanism improves the fairness, weighted speedup, and harmonic speedup of the SSD by 29.84%, 11.24%, and 24.90% on average over the state of the art.
Abstract: Modern NVMe SSDs are widely deployed in diverse domains due to characteristics like high performance, robustness, and energy efficiency. It has been observed that the impact of interference among the concurrently running workloads on their overall response time differs significantly in these devices, which leads to unfairness. Workload intensity is a dominant factor influencing the interference. Prior works use a threshold value to characterize a workload as high-intensity or low-intensity; this type of characterization has drawbacks due to lack of information about the degree of low- or high-intensity. A data cache in an SSD controller - usually based on DRAMs - plays a crucial role in improving device throughput and lifetime. However, the degree of parallelism is limited at this level compared to the SSD back-end consisting of several channels, chips, and planes. Therefore, the impact of interference can be more pronounced at the data cache level. No prior work has addressed the fairness issue at the data cache level to the best of our knowledge. In this work, we address this issue by proposing a fuzzy logic-based fairness control mechanism. A fuzzy fairness controller characterizes the degree of flow intensity (i.e., the rate at which requests are generated) of a workload and assigns priorities to the workloads. We implement the proposed mechanism in the MQSim framework and observe that our technique improves the fairness, weighted speedup, and harmonic speedup of the SSD by 29.84%, 11.24%, and 24.90% on average over the state of the art, respectively. The peak gains in fairness, weighted speedup, and harmonic speedup are 2.02x, 29.44%, and 56.30%, respectively.

11 citations
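To illustrate what "characterizing the degree of flow intensity" with fuzzy logic might look like, the sketch below fuzzifies a flow's arrival rate into low/medium/high intensity and defuzzifies it into a cache priority. The membership breakpoints and priority weights are invented for illustration; they are not the rules used in the paper's MQSim implementation.

```python
def tri(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def flow_priority(arrival_kiops):
    """Fuzzify a flow's request arrival rate (thousand IOPS) into low/medium/high
    intensity and defuzzify into a cache priority in [0, 1]; less intense flows
    get higher priority so they are not starved by intense ones."""
    mu_low  = max(0.0, min(1.0, (200 - arrival_kiops) / 200))   # falling shoulder
    mu_med  = tri(arrival_kiops, 100, 300, 500)
    mu_high = max(0.0, min(1.0, (arrival_kiops - 400) / 400))   # rising shoulder
    w = mu_low + mu_med + mu_high + 1e-12
    return (0.9 * mu_low + 0.5 * mu_med + 0.1 * mu_high) / w    # weighted-average defuzzification

# e.g. flow_priority(50) ~ 0.9 (light flow), flow_priority(700) = 0.1 (intense flow)
```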


Proceedings ArticleDOI
01 Nov 2020
TL;DR: GPU-TRIDENT is an accurate and scalable technique for modeling error propagation in GPU programs; it is two orders of magnitude faster than FI-based approaches and nearly as accurate in determining the SDC rate of GPU programs.
Abstract: Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT, to analytically predict Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and memory architectures different from those of CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is 2 orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.

9 citations


Journal ArticleDOI
TL;DR: This work proposes the Runtime Balance Clustering Algorithm (RBCA), which employs a backtracking approach to make the runtime of each cluster more balanced, and the Dependency Balance Clustering Algorithm (DBCA), which defines a dependency correlation to measure the similarity between tasks in terms of data dependencies.
Abstract: Distributed computing, such as Cloud, provides traditional workflow applications with a completely new deployment architecture offering high performance and scalability. However, when executing the workflow tasks in a distributed computing environment, significant scheduling overheads are generated. Task clustering is a key technology to optimize process execution. Unreasonable task clustering can lead to the problems of runtime and dependency imbalance, which reduces the degree of parallelism during task execution. In order to solve the problem of runtime imbalance, we propose the Runtime Balance Clustering Algorithm (RBCA), which employs a backtracking approach to make the runtime of each cluster more balanced. In addition, to address the problem of dependency imbalance, we also propose the Dependency Balance Clustering Algorithm (DBCA), which defines a dependency correlation to measure the similarity between tasks in terms of data dependencies. The tasks with high dependency correlation are clustered together so as to avoid dependency imbalance as much as possible. We conducted extensive experiments on the WorkflowSim platform and compared our algorithms with existing task clustering algorithms. The results show that RBCA and DBCA significantly reduce the execution time of the whole workflow.
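As a rough picture of the runtime-balancing objective, the sketch below uses a greedy longest-processing-time heuristic to place tasks into clusters with balanced total runtimes. RBCA itself relies on backtracking (and DBCA on a dependency-correlation measure), so this is only a simplified stand-in for the balancing goal, not the paper's algorithm.

```python
def balance_clusters(task_runtimes, num_clusters):
    """task_runtimes: dict mapping task id -> estimated runtime.
    Greedy LPT: assign each task, longest first, to the cluster with the
    smallest accumulated runtime, keeping cluster runtimes balanced."""
    clusters = [[] for _ in range(num_clusters)]
    loads = [0.0] * num_clusters
    for task, t in sorted(task_runtimes.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))
        clusters[i].append(task)
        loads[i] += t
    return clusters, loads

# e.g. balance_clusters({'t1': 8, 't2': 7, 't3': 3, 't4': 2}, 2)
# yields two clusters with total runtimes 10 and 10.
```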

Journal ArticleDOI
TL;DR: In this article, the authors proposed a new, perfectly parallel approach to simulate cosmic structure formation, based on the spatial COmoving Lagrangian Acceleration (sCOLA) framework.
Abstract: Existing cosmological simulation methods lack a high degree of parallelism due to the long-range nature of the gravitational force, which limits the size of simulations that can be run at high resolution. To solve this problem, we propose a new, perfectly parallel approach to simulate cosmic structure formation, based on the spatial COmoving Lagrangian Acceleration (sCOLA) framework. Building upon a hybrid analytical and numerical description of particles' trajectories, our algorithm allows an efficient tiling of a cosmological volume, where the dynamics within each tile is computed independently. As a consequence, the degree of parallelism is equal to the number of tiles. We optimise the accuracy of sCOLA by the use of a buffer region around tiles, and of appropriate Dirichlet boundary conditions around sCOLA boxes. As a result, we show that cosmological simulations at the degree of accuracy required for the analysis of the next generation of surveys can be run in drastically reduced wall-clock times and with very low memory requirements. The perfect scalability of our algorithm unlocks profoundly new possibilities of computing larger and higher-resolution cosmological simulations, taking advantage of a variety of hardware architectures.
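The tiling idea — evolve each tile independently, surround it with a buffer region, and crop the buffer afterwards — can be sketched as below on a toy 2D field with periodic boundaries. The "dynamics" here is just a diffusion stencil standing in for the sCOLA evolution, and the tile and buffer sizes are arbitrary choices.

```python
import numpy as np
from multiprocessing import Pool

def evolve_tile(args):
    """Evolve one padded tile independently; the buffer of width `pad` absorbs
    boundary effects during `pad` local update steps and is then cropped."""
    tile, pad = args
    out = tile.copy()
    for _ in range(pad):                        # toy local dynamics (diffusion steps)
        out[1:-1, 1:-1] = 0.25 * (out[:-2, 1:-1] + out[2:, 1:-1] +
                                  out[1:-1, :-2] + out[1:-1, 2:])
    return out[pad:-pad, pad:-pad]

def run_tiled(field, tiles_per_side=4, pad=8):
    """Split a periodic 2D field into tiles, evolve each tile in its own process
    (degree of parallelism = number of tiles), and stitch the results back."""
    n = field.shape[0] // tiles_per_side
    padded = np.pad(field, pad, mode='wrap')
    jobs = [(padded[i*n:i*n + n + 2*pad, j*n:j*n + n + 2*pad], pad)
            for i in range(tiles_per_side) for j in range(tiles_per_side)]
    with Pool() as pool:
        results = pool.map(evolve_tile, jobs)
    out = np.empty_like(field)
    k = 0
    for i in range(tiles_per_side):
        for j in range(tiles_per_side):
            out[i*n:(i+1)*n, j*n:(j+1)*n] = results[k]
            k += 1
    return out
```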

Posted Content
TL;DR: A novel parallel tip-decomposition algorithm -- REfine CoarsE-grained Independent Tasks (RECEIPT) that relaxes the peeling order restrictions by partitioning the vertices into multiple independent subsets that can be concurrently peeled to simultaneously achieve a high degree of parallelism and dramatic reduction in synchronizations.
Abstract: Tip decomposition is a crucial kernel for mining dense subgraphs in bipartite networks, with applications in spam detection, analysis of affiliation networks etc. It creates a hierarchy of vertex-induced subgraphs with varying densities determined by the participation of vertices in butterflies (2,2-bicliques). To build the hierarchy, existing algorithms iteratively follow a delete-update(peeling) process: deleting vertices with the minimum number of butterflies and correspondingly updating the butterfly count of their 2-hop neighbors. The need to explore 2-hop neighborhood renders tip-decomposition computationally very expensive. Furthermore, the inherent sequentiality in peeling only minimum butterfly vertices makes derived parallel algorithms prone to heavy synchronization. In this paper, we propose a novel parallel tip-decomposition algorithm -- REfine CoarsE-grained Independent Tasks (RECEIPT) that relaxes the peeling order restrictions by partitioning the vertices into multiple independent subsets that can be concurrently peeled. This enables RECEIPT to simultaneously achieve a high degree of parallelism and dramatic reduction in synchronizations. Further, RECEIPT employs a hybrid peeling strategy along with other optimizations that drastically reduce the amount of wedge exploration and execution time. We perform detailed experimental evaluation of RECEIPT on a shared-memory multicore server. It can process some of the largest publicly available bipartite datasets orders of magnitude faster than the state-of-the-art algorithms -- achieving up to 1100x and 64x reduction in the number of thread synchronizations and traversed wedges, respectively. Using 36 threads, RECEIPT can provide up to 17.1x self-relative speedup. Our implementation of RECEIPT is available at this https URL.
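The delete-update peeling process described above (the baseline, not RECEIPT's relaxed, partitioned version) can be written down compactly for small bipartite graphs. For clarity, this sketch recomputes butterfly counts over all remaining vertices instead of updating only the 2-hop neighbourhood, which a real implementation would do incrementally.

```python
def butterfly_counts(adj, vertices):
    """adj: dict mapping each vertex of one partition to the set of its
    neighbours in the other partition. The number of butterflies (2,2-bicliques)
    containing u is the sum over other vertices w of C(|N(u) ∩ N(w)|, 2)."""
    counts = {}
    for u in vertices:
        c = 0
        for w in vertices:
            if w != u:
                s = len(adj[u] & adj[w])
                c += s * (s - 1) // 2
        counts[u] = c
    return counts

def tip_decomposition(adj):
    """Baseline sequential peeling: repeatedly delete a vertex with the minimum
    butterfly count and update the counts of the surviving vertices."""
    remaining = set(adj)
    tip, current = {}, 0
    while remaining:
        counts = butterfly_counts(adj, remaining)
        u = min(remaining, key=counts.get)
        current = max(current, counts[u])
        tip[u] = current
        remaining.remove(u)
    return tip
```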

Journal ArticleDOI
TL;DR: In this paper, a parallel speedup model that accounts for the variations on the average data-access delay is proposed to describe the limiting effect of the memory wall on parallel speedups in homogeneous shared-memory architectures.
Abstract: After Amdahl’s trailblazing work, many other authors proposed analytical speedup models but none have considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but data-access delays are assumed to be constant. Nevertheless, such delays can vary, for example, according to the number of cores used and the ratio between processor and memory frequencies. Given the large number of possible configurations of operating frequency and number of cores that current architectures can offer, suitable speedup models to describe such variations among these configurations are quite desirable for off-line or on-line scheduling decisions. This work proposes a new parallel speedup model that accounts for the variations on the average data-access delay to describe the limiting effect of the memory wall on parallel speedups in homogeneous shared-memory architectures. Analytical results indicate that the proposed modeling can capture the desired behavior while experimental hardware results validate the former. Additionally, we show that when accounting for parameters that reflect the intrinsic characteristics of the applications, such as the degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, our experiments show that conventional machine-learning modeling, besides being black-boxed, needs about one order of magnitude more measurements to reach the same level of accuracy achieved by the proposed model.
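The abstract does not reproduce the model itself. As an illustrative form only (assumed here, not taken from the paper), an Amdahl-style speedup with a core- and frequency-dependent average data-access delay could be written as:

```latex
% t_c: total computation time, \varphi: parallelizable fraction,
% N_m: number of memory accesses, \bar{d}(p,f): average data-access delay
% for p cores at processor-to-memory frequency ratio f.
S(p,f) \;=\; \frac{t_c + N_m\,\bar{d}(1,f)}
                  {(1-\varphi)\,t_c + \dfrac{\varphi\,t_c}{p} + N_m\,\bar{d}(p,f)}
```

The point being modelled is that the delay term grows with p (memory contention) and with the core-to-memory frequency ratio, so the speedup saturates earlier than a constant-delay model would predict.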

Proceedings ArticleDOI
20 Apr 2020
TL;DR: This paper is the first to establish best practices for implementing transcoding platforms for interactive streaming videos encoded using a modern video codec by designing and implementing CONTRAST, a Container-based Distributed Transcoding Framework for Interactive Video Streaming.
Abstract: Interactive video streaming applications are becoming increasingly popular. To maintain the Quality of Experience (QoE) of an end user, interactive streaming platforms need to transcode a video stream, i.e., adapt the quality of the video content, to match the network conditions between the platform and the user as well as the device capabilities of the end user. Modern video codecs such as High Efficiency Video Coding (HEVC) require significant computational resources for transcoding operations. Consequently, there is a need for systems that can perform transcoding quickly at runtime to sustain the real-time performance required for interactive streaming while at the same time using just the right amount of computational resources for the transcoding operations. This paper addresses this need by designing and implementing CONTRAST, a Container-based Distributed Transcoding Framework for Interactive Video Streaming. For any given stream and transcoding resolution, CONTRAST exploits a profiling technique to automatically determine the degree of parallelism, i.e., the number of processing cores, demanded by the transcoding process to sustain the stream’s frame rate. It then launches Docker containers configured with the required number of cores to perform the transcoding. Experiments using a set of realistic video streams show that CONTRAST is able to sustain the frame rate requirements for interactive streams in a more resource efficient manner compared to baseline techniques that do not consider the degree of parallelism. To the best of our knowledge, our paper is the first to establish best practices for implementing transcoding platforms for interactive streaming videos encoded using a modern video codec.
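The profiling-then-launch flow can be approximated as below: pick the smallest profiled core count that sustains the stream's frame rate, then start a Docker container capped at that many CPUs. The container image name and transcoder arguments are placeholders, not CONTRAST's actual interface.

```python
import subprocess

def cores_for_stream(profile, target_fps):
    """profile: dict {core_count: measured_fps} from a profiling run of this
    stream and resolution. Return the smallest core count that sustains the rate."""
    for cores in sorted(profile):
        if profile[cores] >= target_fps:
            return cores
    raise RuntimeError("no profiled configuration sustains the target frame rate")

def launch_transcoder(stream_url, resolution, cores, image="hevc-transcoder:latest"):
    """Start a detached container limited to `cores` CPUs via Docker's --cpus flag."""
    cmd = ["docker", "run", "-d", f"--cpus={cores}", image,
           "--input", stream_url, "--resolution", resolution]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

# e.g. cores_for_stream({1: 18, 2: 31, 4: 62}, target_fps=30) returns 2.
```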

Journal ArticleDOI
TL;DR: This work constructs rigorous cost models to analyze the throughput dynamics of Storm workflows and forms a budget-constrained topology mapping problem to maximize Storm workflow throughput in clouds and designs a heuristic solution that takes into consideration not only the selection of virtual machine type but also the degree of parallelism for each task in the topology.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work argues that determining the optimal or near-optimal DOP for query execution is a fundamental and challenging task that benefits both query performance and cost-benefit tradeoffs and presents promising preliminary results on how ML techniques can be applied to automate DOP tuning.
Abstract: Determining the degree of parallelism (DOP) for query execution is of great importance to both performance and resource provisioning. However, recent work that applies machine learning (ML) to query optimization and query performance prediction in relational database management systems (RDBMSs) has ignored the effect of intra-query parallelism. In this work, we argue that determining the optimal or near-optimal DOP for query execution is a fundamental and challenging task that benefits both query performance and cost-benefit tradeoffs. We then present promising preliminary results on how ML techniques can be applied to automate DOP tuning. We conclude with a list of challenges we encountered, as well as future directions for our work.
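One simple way ML could be applied to DOP tuning, sketched under assumptions (hypothetical plan features and timings, a random-forest regressor; the paper reports only preliminary results, not this exact setup): learn elapsed time as a function of plan features plus DOP, then pick the DOP with the lowest predicted time.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: [estimated_rows, operator_count, dop] -> elapsed seconds.
X_train = np.array([[1e6, 12, 1], [1e6, 12, 4], [1e6, 12, 16],
                    [5e7, 30, 1], [5e7, 30, 8], [5e7, 30, 32]])
y_train = np.array([40.0, 14.0, 9.0, 900.0, 160.0, 75.0])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

def tune_dop(plan_features, candidate_dops=(1, 2, 4, 8, 16, 32, 64)):
    """Predict elapsed time for each candidate DOP and return the fastest one."""
    rows = np.array([list(plan_features) + [d] for d in candidate_dops])
    preds = dict(zip(candidate_dops, model.predict(rows)))
    return min(preds, key=preds.get), preds
```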

Book ChapterDOI
14 Sep 2020
TL;DR: This paper addresses the problem of optimizing UDFs in data-intensive workflows and presents an approach to constructing a cost model that determines the degree of parallelism for parallelizable UDFs.
Abstract: Optimizing Data Processing Pipelines (DPPs) is challenging in the context of both data warehouse architectures and data science architectures. Few approaches to this problem have been proposed so far. The most challenging issue is to build a cost model of the whole DPP, especially if user defined functions (UDFs) are used. In this paper, we address the problem of optimizing UDFs in data-intensive workflows and present our approach to constructing a cost model that determines the degree of parallelism for parallelizable UDFs.
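A minimal cost model of this shape (the coefficients and terms here are assumptions for illustration, not the paper's calibrated model) charges a fixed setup cost, per-row UDF work divided by the degree of parallelism, and a merge overhead that grows with it, then picks the p that minimizes the total.

```python
def udf_cost(p, n_rows, t_row, t_setup=2.0, t_merge_per_worker=0.5):
    """Modelled runtime of a parallelizable UDF executed with degree of parallelism p."""
    return t_setup + (n_rows * t_row) / p + t_merge_per_worker * p

def best_dop(n_rows, t_row, max_p=64):
    """Degree of parallelism minimizing the modelled runtime."""
    return min(range(1, max_p + 1), key=lambda p: udf_cost(p, n_rows, t_row))

# e.g. best_dop(1_000_000, t_row=1e-4) balances ~100 s of UDF work against merge overhead.
```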

Proceedings ArticleDOI
12 Nov 2020
TL;DR: A fully asynchronous, hybrid CPU–GPU in situ architecture that emphasizes interactivity is proposed; it minimizes visual latencies and achieves frame rates between 6 and 60 frames per second, depending on simulation data size and degree of parallelism.
Abstract: Live in situ visualization of numerical simulations – interactive visualization while the simulation is running – can enable new modes of interaction, including computational steering. Designing easy-to-use distributed in situ architectures, with viewing latency low enough, and frame rate high enough, for interactive use, is challenging. Here, we propose a fully asynchronous, hybrid CPU–GPU in situ architecture that emphasizes interactivity. We also present a transparent implementation of this architecture embedded into the OpenFPM simulation framework. The benchmarks show that our architecture minimizes visual latencies, and achieves frame rates between 6 and 60 frames/second – depending on simulation data size and degree of parallelism – by changing only a few lines of an existing simulation code.

Journal ArticleDOI
TL;DR: This paper leverages low-level (yet simple) APIs to integrate on-demand malleability across all Level-3 BLAS routines, and demonstrates the performance benefits of this approach by means of a higher-level dense matrix operation: the LU factorization with partial pivoting and look-ahead.
Abstract: Malleability is a property of certain applications (or tasks) that, given an external request or autonomously, can accommodate a dynamic modification of the degree of parallelism being exploited at runtime. Malleability improves resource usage (core occupation) on modern multicore architectures for applications that exhibit irregular and divergent execution paths and heavily depend on the underlying library performance to attain high performance. The integration of malleability within high-performance instances of the Basic Linear Algebra Subprograms (BLAS) is nonexistent, and, in addition, it is difficult to attain given the rigidity of current application programming interfaces (APIs). In this paper, we overcome these issues presenting the integration of a malleability mechanism within BLIS, a high-performance and portable framework to implement BLAS-like operations. For this purpose, we leverage low-level (yet simple) APIs to integrate on-demand malleability across all Level-3 BLAS routines, and we demonstrate the performance benefits of this approach by means of a higher-level dense matrix operation: the LU factorization with partial pivoting and look-ahead.

Posted ContentDOI
13 May 2020-bioRxiv
TL;DR: This work examines parallel evolution as the degree of covariance between replicate populations, providing a justification for the use of dimensionality reduction and finds evidence suggesting that temporal patterns of parallelism are comparatively easier to detect and that these patterns may reflect the evolutionary dynamics of microbial populations.
Abstract: Parallel evolution is consistently observed across the tree of life. However, the degree of parallelism between replicate populations in evolution experiments is rarely quantified at the gene level. Here we examine parallel evolution as the degree of covariance between replicate populations, providing a justification for the use of dimensionality reduction. We examine the extent that signals of gene-level covariance can be inferred in microbial evolve-and-resequence evolution experiments, finding that deviations from parallelism are difficult to quantify at a given point in time. However, this low statistical signal means that covariance between replicate populations is unlikely to interfere with the ability to detect divergent evolutionary trajectories for populations in different environments. Finally, we find evidence suggesting that temporal patterns of parallelism are comparatively easier to detect and that these patterns may reflect the evolutionary dynamics of microbial populations.

Proceedings ArticleDOI
01 Oct 2020
TL;DR: This interactive demonstration allows visitors to gradually increase the degree of parallelism in Kvazaar and see the benefits of parallelization in live HEVC encoding.
Abstract: This paper presents a demonstration setup for distributed real-time HEVC encoding on a multi-computer system. The demonstrated multi-level parallelization scheme is implemented in the practical Kvazaar open-source HEVC encoder. It allows Kvazaar to exploit parallelism at three levels: 1) Single Instruction Multiple Data (SIMD) optimized coding tools at the data level; 2) Wavefront Parallel Processing (WPP) and Overlapped Wavefront (OWF) parallelization strategies at the thread level; and 3) distributed slice encoding on multi-computer systems at the process level. This interactive demonstration allows visitors to gradually increase the degree of parallelism in Kvazaar and see the benefits of parallelization in live HEVC encoding. Exploiting all three levels of parallelism on a three-laptop setup speeds up Kvazaar by almost 21× over a non-parallelized single-core implementation of Kvazaar.

Journal ArticleDOI
13 Nov 2020
TL;DR: This paper presents a pragmatic variant of the actor model in which messages can be grouped into units that are executed in a serializable manner, whilst still retaining a high degree of parallelism.
Abstract: A major challenge in writing applications that execute across hosts, such as distributed online services, is to reconcile (a) parallelism (i.e., allowing components to execute independently on disjoint tasks), and (b) cooperation (i.e., allowing components to work together on common tasks). A good compromise between the two is vital to scalability, a core concern in distributed networked applications. The actor model of computation is a widely promoted programming model for distributed applications, as actors can execute in individual threads (parallelism) across different hosts and interact via asynchronous message passing (collaboration). However, this makes it hard for programmers to reason about combinations of messages as opposed to individual messages, which is essential in many scenarios. This paper presents a pragmatic variant of the actor model in which messages can be grouped into units that are executed in a serializable manner, whilst still retaining a high degree of parallelism. In short, our model is based on an orchestration of actors along a directed acyclic graph that supports efficient decentralized synchronization among actors based on their actual interaction. We present the implementation of this model, based on a dynamic DAG-inducing referencing discipline, in the actor-based programming language AEON. We argue serializability and the absence of deadlocks in our model, and demonstrate its scalability and usability through extensive evaluation and case studies of wide-ranging applications.

Book ChapterDOI
25 May 2020
TL;DR: A two-level aging mechanism is developed and its effect is analyzed in the context of 6 dynamic scheduling algorithms for heterogeneous systems; the results show a speedup of the average total makespan in 9 out of 12 conducted experiments when aging is used, at the cost of additional waiting time for the applications/jobs with higher priority.
Abstract: The high degree of parallelism of today’s computing systems often requires executing applications and their tasks in parallel due to a limited scaling capability of individual applications. In such scenarios, considering the differing importance of applications while scheduling tasks is done by assigning priorities to the tasks. However, priorities may lead to starvation in highly utilized systems. A solution is offered by aging mechanisms that raise the priority of long-waiting tasks. As modern systems are often dynamic in nature, we developed a two-level aging mechanism and analyzed its effect in the context of 6 dynamic scheduling algorithms for heterogeneous systems. In the context of task scheduling, aging refers to a method that increases the priority of a task over its lifetime. We used a task-based runtime system to evaluate the mechanism on a real system in two scenarios. The results show a speedup of the average total makespan in 9 out of 12 conducted experiments when aging is used, at the cost of additional waiting time for the applications/jobs with higher priority. However, the job/application with the highest priority still finishes first in all cases. Considering the scheduling algorithms, Minimum Completion Time, Sufferage, and Relative Cost benefit from the aging mechanism in both experiments. Additionally, no algorithm significantly dominates all other algorithms when total makespans are compared.
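A single-level simplification of priority aging (the paper's mechanism is two-level and embedded in a task-based runtime; the rate constant here is arbitrary) looks like this: the effective priority of a waiting task grows linearly with its waiting time, so low-priority tasks eventually get scheduled instead of starving.

```python
import time

class AgingQueue:
    """Ready queue whose entries gain priority as they wait, preventing starvation."""
    def __init__(self, age_rate=0.1):
        self.age_rate = age_rate
        self.entries = []                     # (enqueue_time, base_priority, task)

    def push(self, task, base_priority):
        self.entries.append((time.monotonic(), base_priority, task))

    def pop(self):
        now = time.monotonic()
        best = max(self.entries,
                   key=lambda e: e[1] + self.age_rate * (now - e[0]))
        self.entries.remove(best)
        return best[2]
```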

Journal ArticleDOI
TL;DR: This work proposes a new, perfectly parallel approach to simulate cosmic structure formation, which is based on the spatial COmoving Lagrangian Acceleration (sCOLA) framework, and allows for an efficient tiling of a cosmological volume, where the dynamics within each tile is computed independently.
Abstract: Existing cosmological simulation methods lack a high degree of parallelism due to the long-range nature of the gravitational force, which limits the size of simulations that can be run at high resolution. To solve this problem, we propose a new, perfectly parallel approach to simulate cosmic structure formation, which is based on the spatial COmoving Lagrangian Acceleration (sCOLA) framework. Building upon a hybrid analytical and numerical description of particles' trajectories, our algorithm allows for an efficient tiling of a cosmological volume, where the dynamics within each tile is computed independently. As a consequence, the degree of parallelism is equal to the number of tiles. We optimised the accuracy of sCOLA through the use of a buffer region around tiles and of appropriate Dirichlet boundary conditions around sCOLA boxes. As a result, we show that cosmological simulations at the degree of accuracy required for the analysis of the next generation of surveys can be run in drastically reduced wall-clock times and with very low memory requirements. The perfect scalability of our algorithm unlocks profoundly new possibilities for computing larger cosmological simulations at high resolution, taking advantage of a variety of hardware architectures.

Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work presents a low-latency, high-throughput data pipeline that performs operations such as data cleansing, filtering, aggregation, and stream joins across pipeline stages for event-driven machine learning applications.
Abstract: Developing event-driven applications for network management systems (NMS) requires data-wrangling operations to be performed in a data pipeline with low latency and high throughput. Data stream processing engines (SPEs) enable the pipeline to extract features from complex data in near real time. We present a low-latency, high-throughput data pipeline that performs operations such as data cleansing, filtering, aggregation, and stream joins across pipeline stages for event-driven machine learning applications. When the data stream has an imbalance in the arrival of records, a standalone system handles it ineffectively, which reduces the performance of the data pipeline. In this work, the system powered by SPEs handles the imbalance by scaling out resources and the degree of parallelism so that performance is not degraded.

Proceedings ArticleDOI
Sunwoo Lee, Qiao Kang, Ankit Agrawal, Alok Choudhary, Wei-keng Liao
10 Dec 2020
TL;DR: In this paper, the authors propose a hierarchical parallel training strategy for local SGD with data parallelism, in which workers are organized into groups that each train a local model and periodically average the model parameters across groups.
Abstract: Synchronous Stochastic Gradient Descent (SGD) with data parallelism, the most popular parallel training strategy for deep learning, suffers from expensive gradient communications. Local SGD with periodic model averaging is a promising alternative to synchronous SGD. The algorithm allows each worker to locally update its own model, and periodically averages the model parameters across all the workers. While this algorithm enjoys less frequent communications, the convergence rate is strongly affected by the number of workers. In order to scale up the local SGD training without losing accuracy, the number of workers should be sufficiently small so that the model converges reasonably fast. In this paper, we discuss how to exploit the degree of parallelism in local SGD while maintaining model accuracy. Our training strategy employs multiple groups of processes and each group trains a local model based on data parallelism. The local models are periodically averaged across all the groups. Based on this hierarchical parallelism, we design a model averaging algorithm that has a cheaper communication cost than allreduce-based approach. We also propose a practical metric for finding the maximum number of workers that does not cause a significant accuracy loss. Our experimental results demonstrate that our proposed training strategy provides a significantly improved scalability while achieving a comparable model accuracy to synchronous SGD.
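The hierarchical scheme can be emulated in a few lines as a toy serial simulation with a user-supplied gradient function. The group sizes, step counts, and the simple mean-based averaging below are assumptions; the paper's actual contribution includes an averaging algorithm cheaper than allreduce and a metric for the maximum worker count, neither of which is shown here.

```python
import numpy as np

def hierarchical_local_sgd(grad_fn, w0, num_groups, workers_per_group,
                           rounds=50, local_steps=10, lr=0.01, avg_every=5):
    """Each group keeps its own model. Inside a group, workers' gradients are
    averaged every step (data parallelism); group models are averaged across
    groups only every `avg_every` rounds (hierarchical model averaging)."""
    models = [w0.copy() for _ in range(num_groups)]
    for r in range(rounds):
        for g in range(num_groups):
            for _ in range(local_steps):
                grads = [grad_fn(models[g], worker=(g, k))
                         for k in range(workers_per_group)]
                models[g] -= lr * np.mean(grads, axis=0)
        if (r + 1) % avg_every == 0:
            avg = np.mean(models, axis=0)
            models = [avg.copy() for _ in range(num_groups)]
    return np.mean(models, axis=0)

# grad_fn(w, worker=(g, k)) should return the stochastic gradient computed on
# worker k of group g's shard of the training data.
```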

Proceedings Article
01 Jan 2020
TL;DR: In this paper, the authors propose a parallel tip decomposition algorithm called RECEIPT, which relaxes the peeling order restrictions by partitioning the vertices into multiple independent subsets that can be concurrently peeled.
Abstract: Tip decomposition is a crucial kernel for mining dense subgraphs in bipartite networks, with applications in spam detection, analysis of affiliation networks etc. It creates a hierarchy of vertex-induced subgraphs with varying densities determined by the participation of vertices in butterflies (2,2-bicliques). To build the hierarchy, existing algorithms iteratively follow a delete-update(peeling) process: deleting vertices with the minimum number of butterflies and correspondingly updating the butterfly count of their 2-hop neighbors. The need to explore 2-hop neighborhood renders tip-decomposition computationally very expensive. Furthermore, the inherent sequentiality in peeling only minimum butterfly vertices makes derived parallel algorithms prone to heavy synchronization. In this paper, we propose a novel parallel tip-decomposition algorithm -- REfine CoarsE-grained Independent Tasks (RECEIPT) that relaxes the peeling order restrictions by partitioning the vertices into multiple independent subsets that can be concurrently peeled. This enables RECEIPT to simultaneously achieve a high degree of parallelism and dramatic reduction in synchronizations. Further, RECEIPT employs a hybrid peeling strategy along with other optimizations that drastically reduce the amount of wedge exploration and execution time. We perform detailed experimental evaluation of RECEIPT on a shared-memory multicore server. It can process some of the largest publicly available bipartite datasets orders of magnitude faster than the state-of-the-art algorithms -- achieving up to 1100x and 64x reduction in the number of thread synchronizations and traversed wedges, respectively. Using 36 threads, RECEIPT can provide up to 17.1x self-relative speedup. Our implementation of RECEIPT is available at this https URL.

Journal ArticleDOI
TL;DR: In this paper, the authors describe the lessons learnt while optimizing the widely used computational astrophysics codes P-Gadget3, Flash, and Echo, and present results for the visualization and analysis tools VisIt and yt.

Posted Content
TL;DR: This work presents an algorithm that is fast enough to speed up several matrix operations; it increases the degree of parallelism of an underlying matrix multiplication H·X, where H is an orthogonal matrix represented by a product of Householder matrices.
Abstract: Various Neural Networks employ time-consuming matrix operations like matrix inversion. Many such matrix operations are faster to compute given the Singular Value Decomposition (SVD). Previous work allows using the SVD in Neural Networks without computing it. In theory, the techniques can speed up matrix operations, however, in practice, they are not fast enough. We present an algorithm that is fast enough to speed up several matrix operations. The algorithm increases the degree of parallelism of an underlying matrix multiplication $H\cdot X$ where $H$ is an orthogonal matrix represented by a product of Householder matrices. Code is available at this http URL .
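For reference, the baseline computation the paper accelerates applies the Householder reflectors one after another; each step is a rank-1 update that is already parallel across the columns of X. The paper's contribution is to restructure this sequence for a higher degree of parallelism, which this sketch does not attempt.

```python
import numpy as np

def apply_householder_product(vs, X):
    """Compute (H_1 H_2 ... H_k) @ X where H_i = I - 2 v_i v_i^T / (v_i^T v_i).
    Reflectors are applied right-to-left; each application is a rank-1 update."""
    Y = np.array(X, dtype=float, copy=True)
    for v in reversed(vs):
        v = np.asarray(v, dtype=float).reshape(-1, 1)
        Y -= (2.0 / float(v.T @ v)) * (v @ (v.T @ Y))
    return Y
```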