Topic

Degree of parallelism

About: Degree of parallelism is a research topic. Over its lifetime, 1,515 publications have been published within this topic, receiving 25,546 citations.


Papers
Journal ArticleDOI
TL;DR: In this paper, it is shown that the inertia matrix associated with any open- or closed-loop mechanism is positive definite by finding a simple mathematical expression for the quadratic form expressing the kinetic energy in an associated state space.
Abstract: In this paper, advantage is taken of the problem structure in multibody dynamics simulation when the mechanical system is modeled using a minimal set of generalized coordinates. It is shown that the inertia matrix associated with any open- or closed-loop mechanism is positive definite by finding a simple mathematical expression for the quadratic form expressing the kinetic energy in an associated state space. Based on this result, an algorithm that efficiently solves for second time derivatives of the generalized coordinates is presented. Significant speed-ups accrue due to both the no fill-in factorization of the composite inertia matrix technique and the degree of parallelism attainable with the new algorithm.
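To make the positive-definiteness argument concrete, the sketch below restates it in generic notation (M for the composite inertia matrix, q for the generalized coordinates, tau for the generalized forces); the symbols are illustrative and not taken from the paper. Because the kinetic energy is a strictly positive quadratic form in the generalized velocities, the inertia matrix admits a Cholesky factorization without pivoting, which is what makes an efficient, fill-in-free solve for the second time derivatives possible.

```latex
% Hedged sketch of the argument; notation (M, q, \tau, c) is generic, not the paper's.
\begin{align*}
  T(q,\dot q) &= \tfrac{1}{2}\,\dot q^{\mathsf T} M(q)\,\dot q > 0
    \quad \text{for all } \dot q \neq 0
    \quad\Longrightarrow\quad M(q) \succ 0, \\
  M(q)\,\ddot q &= \tau - c(q,\dot q)
    \qquad \text{(equations of motion in generalized coordinates)}, \\
  M(q) &= L L^{\mathsf T}, \qquad
  \ddot q = L^{-\mathsf T}\bigl(L^{-1}(\tau - c(q,\dot q))\bigr)
    \qquad \text{(Cholesky solve, no pivoting required).}
\end{align*}
```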

20 citations

Journal ArticleDOI
TL;DR: This work implements a novel linear algebra library of auto-tunable codes, built on top of the task-based runtime OmpSs-2 and based on the LASs library, and shows improvements in execution time over other reference libraries.
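As a rough illustration of what a task-based linear algebra routine looks like under OmpSs-2's directive model, the sketch below expresses a blocked vector update as one task per block. The block size, function names, and the kernel itself are assumptions made for this example and are not taken from the LASs library; the `#pragma oss` dependence syntax follows the OmpSs-2 programming model, and a compiler without OmpSs-2 support will simply ignore the pragmas and run the code sequentially.

```c
/* Hedged sketch: a blocked vector update expressed as OmpSs-2 tasks.
 * Block size, names, and the kernel are illustrative assumptions;
 * they are not taken from the LASs library. Built without an OmpSs-2
 * toolchain, the pragmas are ignored and the code runs sequentially. */
#include <stdio.h>

#define N  (1 << 16)
#define BS (1 << 12)   /* block size: the tunable degree-of-parallelism knob */

static void axpy_block(double a, const double *x, double *y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

void axpy_tasks(double a, const double *x, double *y)
{
    for (int b = 0; b < N; b += BS) {
        /* One task per block; in/inout clauses declare data dependences. */
        #pragma oss task in(x[b;BS]) inout(y[b;BS])
        axpy_block(a, &x[b], &y[b], BS);
    }
    #pragma oss taskwait   /* wait for all block tasks before using y */
}

int main(void)
{
    static double x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }
    axpy_tasks(0.5, x, y);
    printf("y[0] = %.1f\n", y[0]);   /* expect 2.5 */
    return 0;
}
```

The block size plays the role of the auto-tunable parameter: smaller blocks expose a higher degree of parallelism at the cost of more task-management overhead.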

20 citations

Journal ArticleDOI
01 Nov 2013
TL;DR: A GPU-based simulation kernel (gDES) supporting DES is presented, along with three algorithms for high efficiency, including one that increases the degree of parallelism while keeping the number of synchronizations unchanged.
Abstract: The graphic processing unit (GPU) can perform some large-scale simulations in an economical way. However, harnessing the power of a GPU for discrete event simulation (DES) is difficult because of the mismatch between the GPU's synchronous execution mode and DES's asynchronous time advance mechanism. In this paper, we present a GPU-based simulation kernel (gDES) to support DES and propose three algorithms to support high efficiency. Since both limited parallelism and redundant synchronization affect the performance of DES based on a GPU, we propose a breadth-expansion conservative time window algorithm to increase the degree of parallelism while keeping the number of synchronizations unchanged. By using the expansion method, it can import as many 'safe' events as possible. The irregular and dynamic requirement for storing the events leads to uneven and sparse memory usage, thereby causing waste of memory and unnecessary overhead. A memory management algorithm is proposed to store events in a balanced and compact way by using a lightweight stochastic method. When events processed by threads in a warp have different types, the performance of gDES decreases rapidly because of branch divergence. An event redistribution algorithm is proposed that reassigns events of the same type to neighboring threads to reduce the probability of branch divergence. We analyze the superiority of the proposed algorithms and gDES with a series of experiments. Compared to a CPU-based simulator on a multicore platform, gDES can achieve up to 11×, 5×, and 8× speedup in PHOLD, QUEUING NETWORK, and epidemic simulation, respectively.
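The conservative-time-window idea above can be sketched on the CPU side in a few lines: among the pending events, every event whose timestamp lies within a lookahead-bounded window of the minimum timestamp is 'safe' and can be executed in parallel before the next synchronization. The structure, names, and lookahead value below are illustrative assumptions, not the gDES implementation.

```c
/* Hedged sketch of a conservative time window for parallel DES.
 * All names (struct event, select_safe_events, LOOKAHEAD) are illustrative;
 * this only shows the windowing idea: every pending event with timestamp
 * < t_min + LOOKAHEAD cannot be affected by any other pending event, so
 * the whole batch is "safe" to execute in parallel before the next
 * synchronization. */
#include <stdio.h>
#include <stddef.h>
#include <float.h>

#define LOOKAHEAD 1.0   /* minimum delay between an event and any event it schedules */

struct event {
    double timestamp;
    int    type;        /* grouping by type is what event redistribution exploits */
};

/* Copy the safe events (those inside the window) into `safe`,
 * returning how many were selected. */
size_t select_safe_events(const struct event *pending, size_t n,
                          struct event *safe)
{
    double t_min = DBL_MAX;
    for (size_t i = 0; i < n; ++i)
        if (pending[i].timestamp < t_min)
            t_min = pending[i].timestamp;

    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (pending[i].timestamp < t_min + LOOKAHEAD)
            safe[count++] = pending[i];   /* these can run concurrently */
    return count;
}

int main(void)
{
    struct event pending[4] = { {0.2, 0}, {0.9, 1}, {1.6, 0}, {0.5, 1} };
    struct event safe[4];
    size_t n = select_safe_events(pending, 4, safe);
    /* t_min = 0.2 and LOOKAHEAD = 1.0, so 0.2, 0.9 and 0.5 are safe: n == 3 */
    printf("safe events: %zu\n", n);
    return 0;
}
```

Sorting the selected batch by event type before dispatching it to GPU threads is, in spirit, what the event-redistribution step does to keep threads within a warp on the same branch.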

20 citations

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This paper proposes a protocol to reconfigure the degree of parallelism in parallel SMR on-the-fly; experiments show the gains due to reconfiguration and shed some light on the behavior of parallel and reconfigurable SMR.
Abstract: State Machine Replication (SMR) is a well-known technique to implement fault-tolerant systems. In SMR, servers are replicated and client requests are deterministically executed in the same order by all replicas. To improve performance in multi-processor systems, some approaches have proposed to parallelize the execution of non-conflicting requests. Such approaches perform remarkably well in workloads dominated by non-conflicting requests. Conflicting requests introduce expensive synchronization and result in considerable performance loss. Current approaches to parallel SMR define the degree of parallelism statically. However, it is often difficult to predict the best degree of parallelism for a workload and workloads experience variations that change their best degree of parallelism. This paper proposes a protocol to reconfigure the degree of parallelism in parallel SMR on-the-fly. Experiments show the gains due to reconfiguration and shed some light on the behavior of parallel and reconfigurable SMR.
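As a minimal sketch of the knob being reconfigured, the fragment below hashes non-conflicting requests onto one of `dop` worker queues and lets `dop`, the degree of parallelism, be changed at a batch boundary. It is not the paper's protocol, which must additionally keep all replicas in agreement about when the change takes effect; all names here are illustrative assumptions, and the serialization of conflicting requests is omitted.

```c
/* Hedged sketch: what "degree of parallelism" controls in a parallel SMR
 * executor. Non-conflicting requests are hashed onto one of `dop` worker
 * queues; conflicting requests would be serialized (omitted here). The
 * reconfiguration protocol in the paper coordinates changes to `dop`
 * across replicas; this sketch only shows the local knob being changed
 * at a batch boundary. All names are illustrative assumptions. */
#include <stdio.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_WORKERS 64

static atomic_int dop = 4;   /* current degree of parallelism */

/* Map a non-conflicting request (identified by the key it touches)
 * to one of the currently active workers. */
int pick_worker(uint64_t key)
{
    int d = atomic_load(&dop);
    return (int)(key % (uint64_t)d);
}

/* Called between batches, once every replica has agreed on the new value,
 * so all replicas keep mapping requests to workers the same way. */
void reconfigure(int new_dop)
{
    if (new_dop >= 1 && new_dop <= MAX_WORKERS)
        atomic_store(&dop, new_dop);
}

int main(void)
{
    printf("request 42 -> worker %d\n", pick_worker(42));  /* dop = 4 */
    reconfigure(8);                    /* workload changed: widen the pool */
    printf("request 42 -> worker %d\n", pick_worker(42));  /* dop = 8 */
    return 0;
}
```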

20 citations

Book ChapterDOI
25 Aug 2014
TL;DR: This paper presents a performance assessment of a massively parallel and portable Lattice Boltzmann code based on the Open Computing Language (OpenCL) and the Message Passing Interface (MPI), along with techniques for moving data between accelerators that minimize communication-latency overheads.
Abstract: High performance computing increasingly relies on heterogeneous systems, based on multi-core CPUs tightly coupled to accelerators: GPUs or many-core systems. Programming heterogeneous systems raises new issues: reaching high sustained performance means that one must exploit parallelism at several levels; at the same time, the lack of a standard programming environment has an impact on code portability. This paper presents a performance assessment of a massively parallel and portable Lattice Boltzmann code, based on the Open Computing Language (OpenCL) and the Message Passing Interface (MPI). Exactly the same code runs on standard clusters of multi-core CPUs, as well as on hybrid clusters including accelerators. We consider a state-of-the-art Lattice Boltzmann model that accurately reproduces the thermo-hydrodynamics of a fluid in 2 dimensions. This algorithm has a regular structure suitable for accelerator architectures with a large degree of parallelism, but it is not straightforward to obtain a large fraction of the theoretically available performance. In this work we focus on portability of the code across several heterogeneous architectures while preserving performance, and also on techniques to move data between accelerators that minimize the overheads of communication latencies. We describe the organization of the code and present and analyze performance and scalability results on a cluster of nodes based on NVIDIA K20 GPUs and Intel Xeon Phi accelerators.
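The data-movement technique discussed above boils down to exchanging halo strips between neighbouring MPI ranks whose subdomains live on different accelerators. The fragment below sketches one such exchange with standard OpenCL and MPI calls; buffer offsets, the use of blocking transfers, and the function name are illustrative assumptions, and error handling is omitted. In a tuned code the boundary transfers would be overlapped with the bulk kernel to hide communication latency.

```c
/* Hedged sketch of a halo exchange between accelerators: copy boundary
 * strips from the device to the host, swap them with the two neighbouring
 * MPI ranks, and push the received strips back into the ghost regions.
 * The OpenCL and MPI calls are standard API functions; buffer layout,
 * offsets, and blocking transfers are illustrative, and error checking
 * is omitted for brevity. */
#include <CL/cl.h>
#include <mpi.h>
#include <stdlib.h>

void halo_exchange(cl_command_queue queue, cl_mem field, size_t halo_bytes,
                   size_t send_l_off, size_t send_r_off,  /* innermost boundary strips */
                   size_t recv_l_off, size_t recv_r_off,  /* ghost strips */
                   int left_rank, int right_rank, MPI_Comm comm)
{
    char *send_l = malloc(halo_bytes), *send_r = malloc(halo_bytes);
    char *recv_l = malloc(halo_bytes), *recv_r = malloc(halo_bytes);

    /* Device -> host: read the two boundary strips (blocking here for
     * simplicity; overlapping these transfers with the bulk kernel is
     * what actually hides the communication latency). */
    clEnqueueReadBuffer(queue, field, CL_TRUE, send_l_off, halo_bytes, send_l, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, field, CL_TRUE, send_r_off, halo_bytes, send_r, 0, NULL, NULL);

    /* Exchange strips with the two neighbours along the decomposed dimension. */
    MPI_Sendrecv(send_l, (int)halo_bytes, MPI_BYTE, left_rank,  0,
                 recv_r, (int)halo_bytes, MPI_BYTE, right_rank, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(send_r, (int)halo_bytes, MPI_BYTE, right_rank, 1,
                 recv_l, (int)halo_bytes, MPI_BYTE, left_rank,  1,
                 comm, MPI_STATUS_IGNORE);

    /* Host -> device: write the received halos into the ghost regions. */
    clEnqueueWriteBuffer(queue, field, CL_TRUE, recv_l_off, halo_bytes, recv_l, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, field, CL_TRUE, recv_r_off, halo_bytes, recv_r, 0, NULL, NULL);

    free(send_l); free(send_r); free(recv_l); free(recv_r);
}
```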

20 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations (85% related)
Scheduling (computing): 78.6K papers, 1.3M citations (83% related)
Network packet: 159.7K papers, 2.2M citations (80% related)
Web service: 57.6K papers, 989K citations (80% related)
Quality of service: 77.1K papers, 996.6K citations (79% related)
Performance Metrics
Number of papers in the topic in previous years:
2022: 1
2021: 47
2020: 48
2019: 52
2018: 70
2017: 75