
Showing papers by "Francesco Quaglia published in 2018"


Proceedings Article•DOI•
14 May 2018
TL;DR: This article presents an innovative share-everything PDES system that provides fully non-blocking coordination of the threads when accessing shared data structures and fully speculative processing capabilities---Time Warp style processing---of the events.
Abstract: The share-everything PDES (Parallel Discrete Event Simulation) paradigm is based on fully sharing the ability to process any individual event across concurrent threads, rather than binding Logical Processes (LPs) and their events to threads. It allows concentrating, at any time, the computing power---the CPU-cores on board a shared-memory machine---towards the unprocessed events that stand closest to the current commit horizon of the simulation run. This fruitfully biases the delivery of the computing power towards the hot portion of the model execution trajectory. In this article we present an innovative share-everything PDES system that provides (1) fully non-blocking coordination of the threads when accessing shared data structures and (2) fully speculative processing capabilities---Time Warp style processing---of the events. As we show via an experimental study, our proposal can cope with hard workloads where both classical Time Warp systems---based on LPs-to-threads binding---and previous share-everything proposals---unable to exploit fully speculative processing of the events---tend to fail in delivering adequate performance.

20 citations


Posted Content•
TL;DR: This article presents a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling its metadata; the design is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.
Abstract: Common implementations of core memory allocation components, like the Linux buddy system, handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach clearly does not scale to large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or replicating the core allocators---the bottom-most ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to the scalability of memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling its metadata. Conflict detection relies on conventional atomic machine instructions in the Read-Modify-Write (RMW) class. Furthermore, beyond improving scalability and performance, our approach avoids wasting clock cycles on spin-lock operations by threads that could in principle carry out their memory allocation/release in full concurrency. It is thus resilient to performance degradation---in the face of concurrent accesses---independently of the current level of fragmentation of the handled memory blocks.

5 citations


Proceedings Article•DOI•
30 Mar 2018
TL;DR: In this article, the authors consider the problem of maximizing the performance of multi-threaded applications under a power cap by dynamically tuning the thread-level parallelism and the power state of CPU-cores in combination.
Abstract: Energy consumption has become a core concern in computing systems. In this context, power capping is an approach that aims at ensuring that the power consumption of a system does not exceed a predefined threshold. Although various power capping techniques exist in the literature, they do not fit well the nature of multi-threaded workloads with shared data accesses and non-minimal thread-level concurrency. For these workloads, scalability may be limited by thread contention on hardware resources and/or data, to the point that performance may even decrease while increasing the thread-level parallelism, indicating a scarce ability to exploit the actual computing power available in highly parallel hardware. In this paper, we consider the problem of maximizing the performance of multi-threaded applications under a power cap by dynamically tuning the thread-level parallelism and the power state of the CPU-cores in combination. Based on experimental observations, we design a technique that adaptively identifies, in linear time within a bi-dimensional space, the optimal parallelism and power-state setting. We evaluated the proposed technique with different benchmark applications, using different methods for synchronizing threads when accessing shared data, and we compared it with other state-of-the-art power capping techniques.

4 citations


Proceedings Article•DOI•
14 May 2018
TL;DR: An innovative Time Warp architecture designed to efficiently run parallel simulations under a power cap is presented, which considers power usage as a foundational design principle, as opposed to classical power-unaware Time Warp design.
Abstract: Controlling power usage has become a core objective in modern computing platforms. In this article we present an innovative Time Warp architecture designed to efficiently run parallel simulations under a power cap. Our architectural organization considers power usage as a foundational design principle, as opposed to classical power-unaware Time Warp design. We provide early experimental results showing the potential of our proposal.

3 citations


Proceedings Article•DOI•
14 May 2018
TL;DR: The design of a middleware layer that allows ECS to be ported to distributed-memory clusters of machines is presented, retaining the enriched ECS programming model while enabling deployments of PDES models on convenient (Cloud-based) infrastructures.
Abstract: Over the years, Parallel Discrete Event Simulation (PDES) has been enriched with programming facilities to bypass state disjointness across the concurrent Logical Processes (LPs). New supports have been proposed, offering the programmer alternatives to message passing for coding complex relations among LPs. Along this path we find Event & Cross-State (ECS), which allows writing event handlers that perform in-place accesses to the state of any LP, by simply relying on pointers. This programming model has been shipped with a runtime support enabling concurrent speculative execution of LPs, limited to shared-memory machines. In this paper, we present the design of a middleware layer that allows ECS to be ported to distributed-memory clusters of machines. A core application of our middleware is to let ECS-coded models be hosted on top of (low-cost) resources from the Cloud. Overall, ECS-coded models no longer demand powerful shared-memory machines to execute in reasonable time. Thanks to our solution, we indeed retain the possibility to rely on the enriched ECS programming model while still enabling deployments of PDES models on convenient (Cloud-based) infrastructures. An experimental assessment of our proposal is also provided.

2 citations


Proceedings Article•DOI•
01 Sep 2018
TL;DR: In this paper, the authors present a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict is materialized while handling its metadata.
Abstract: Common implementations of core memory allocation components handle concurrent allocation/release requests by synchronizing threads via spin-locks. This approach does not scale to large thread counts, a problem that has been addressed in the literature by introducing layered allocation services or replicating the core allocators—the bottom-most ones within the layered architecture. Both these solutions tend to reduce the pressure of actual concurrent accesses to each individual core allocator. In this article we explore an alternative approach to the scalability of memory allocation/release, which can still be combined with those literature proposals. We present a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling its metadata. Beyond improving scalability and performance, it is resilient to performance degradation in the face of concurrent accesses, independently of the current level of fragmentation of the handled memory blocks.

2 citations


Proceedings Article•DOI•
09 Dec 2018
TL;DR: This assessment illustrates the effects of the various tuning parameters related to the Share-Everything paradigm when the simulation models have a variable granularity, opening the way to a deeper understanding of this innovative paradigm.
Abstract: Modern advancements in computing architectures have been accompanied by new emergent paradigms to run Parallel Discrete Event Simulation models efficiently. Indeed, many new paradigms to effectively use the available underlying hardware have been proposed in the literature. Among these, the Share-Everything paradigm tackles massively-parallel shared-memory machines, in order to support speculative simulation by taking into account the limits and benefits related to this family of architectures. Previous results have shown how this paradigm outperforms traditional speculative strategies (such as data-separated Time Warp systems) whenever the granularity of the executed events is small. In this paper, we show the performance implications of this simulation-engine organization when the simulation models have a variable granularity. To this end, we have selected a traffic model tailored to smart-city simulation. Our assessment illustrates the effects of the various tuning parameters related to the approach, opening the way to a deeper understanding of this innovative paradigm.

Proceedings Article•DOI•
01 Dec 2018
TL;DR: An analytical model is presented that predicts the abort probability of transactions handled via read-validation schemes, which may lead to aborting doomed transactions early, thus saving CPU time and improving performance.
Abstract: Concurrency control protocols based on read-validation schemes allow transactions that are doomed to abort to keep running until a subsequent validation check reveals them as invalid. These late aborts do not favor the reduction of wasted computation and can penalize performance. To counteract this problem, we present an analytical model that predicts the abort probability of transactions handled via read-validation schemes. Our goal is to determine the suited points, along a transaction's lifetime, at which to carry out a validation check. This may lead to aborting doomed transactions early, thus saving CPU time. We show how to exploit the abort probability predictions returned by the model in combination with a threshold-based scheme to trigger read-validations. We also show how this approach can definitely improve performance, leading to up to 14% better turnaround, as demonstrated by experiments carried out with a port of the TPC-C benchmark to Software Transactional Memory.