scispace - formally typeset
Search or ask a question

Showing papers by "Francesco Quaglia published in 2016"


Proceedings ArticleDOI
16 May 2016
TL;DR: A lightweight operating system support for Linux is presented, referred to as multi-view address space, explicitly targeting accuracy of per-thread working-set estimation in truly multi-thread applications with non-partitioned accesses to virtual pages, and an associated thread/data migration policy is presented.
Abstract: A common approach to improve memory access in NUMA machines exploits operating system (OS) page protection mechanisms to induce faults to determine which pages are accessed by what thread, so as to move the thread and its working-set of pages to the same NUMA node. However, existing proposals do not fully fit the requirements of truly multi-thread applications with non-partitioned accesses to virtual pages. In fact, these proposals exploit (induced) faults on a same page-table for all the threads of a same process to determine the access pattern. Hence, the fault by one thread (and the consequent re-opening of the access to the corresponding page) would mask those by other threads on the same page. This may lead to inaccuracy in the estimation of the working-set of individual threads. We overcome this drawback by presenting a lightweight operating system support for Linux, referred to as multi-view address space, explicitly targeting accuracy of per-thread working-set estimation in truly multi-thread applications with non-partitioned accesses, and an associated thread/data migration policy. Our solution is fully transparent to user-space code. It is embedded in a Linux/x86_64 module that installs any required modification to the original kernel image by solely relying on dynamic patching. A motivated case study in the context of HPC is also presented for an assessment of our proposal.

15 citations


Proceedings ArticleDOI
15 May 2016
TL;DR: This is the first study where the problem of clustering Time Warp simulation objects is addressed for the case of in-place cross-object state accesses by the application code, and where dynamic granulation of multiple objects in a larger one is supported in a fully transparent manner.
Abstract: A recent trend has shown the relevance of PDES paradigms where simulation objects are no longer seen as fully disjoint entities only interacting via events' scheduling Particularly, mutual cross-state access (as a form of state sharing) can represent an approach enabling the simplification of the programmer's job In this article, we present a multi-core oriented Time Warp platform supporting so called granular objects, where cross-state access is transparently enabled jointly with the dynamic clustering (granulation) of objects into groups depending on the volume of mutual state accesses along phases of the model execution Each group represents an island where activities are sequentially dispatched in timestamp order Concurrency is still preserved by enabling the optimistic execution of the different islands Granulated objects do not pay synchronization costs due to mutual causal inconsistencies Also, the underlying Time Warp platform does not pay memory management (eg memory access tracing) overheads to determine that mutual accesses are taking place within a group Overall, the platform transparently (and dynamically) determines a well-suited granulation of the overall model state, and a corresponding level of concurrency, depending on the actual state access pattern by the simulation code As far as we know, this is the first study where the problem of clustering Time Warp simulation objects is addressed for the case of in-place cross-object state accesses by the application code, and where dynamic granulation of multiple objects in a larger one is supported in a fully transparent manner We integrated our proposal in the open source ROOT-Sim platform

13 citations


Journal ArticleDOI
TL;DR: This article focuses on speculative PDES systems that run on top of multi-core machines, where simulation objects can concurrently process their events with no guarantee of causal consistency and actual violations of causality rules are recovered through rollback/recovery schemes.
Abstract: Parallelizing (compute-intensive) discrete event simulation (DES) applications is a classical approach for speeding up their execution and for making very large/complex simulation models tractable. This has been historically achieved via parallel DES (PDES) techniques, which are based on partitioning the simulation model into distinct simulation objects (somehow resembling objects in classical object-oriented programming), whose states are disjoint, which are executed concurrently and rely on explicit event-exchange (or event-scheduling) primitives as the means to support mutual dependencies and notification of their state updates. With this approach, the application developer is necessarily forced to reason about state separation across the objects, thus being not allowed to rely on shared information, such as global variables, within the application code. This implicitly leads to the shift of the user-exposed programming model to one where sequential-style global variable accesses within the application code are not allowed. In this article we remove this limitation by providing support for managing global variables in the context of DES code developed in ANSI-C, which gets automatically parallelized. Particularly, we focus on speculative (also termed optimistic) PDES systems that run on top of multi-core machines, where simulation objects can concurrently process their events with no guarantee of causal consistency and actual violations of causality rules are recovered through rollback/recovery schemes. In compliance with the nature of speculative processing, in our proposal global variables are transparently mapped to multi-versions, so as to avoid any form of safety predicate verification upon their updates. Consistency is ensured via the introduction of a new rollback/recovery scheme based on detecting global variables' reads on non-correct versions. At the same time, efficiency in the execution is guaranteed by managing multi-version variables' lists via non-blocking algorithms. Furthermore, the whole approach is fully transparent, being it based on automatized instrumentation of the application software (particularly ELF objects). Hence the programmer is exposed to the classical (and easy to code) sequential-style programming scheme while accessing any global variable. An experimental assessment of our proposal, based on a suite of case study applications, run on top of an off-the-shelf Linux machine equipped with 32 CPU-cores and 64 GB of RAM, is also presented.

11 citations


Proceedings ArticleDOI
21 Sep 2016
TL;DR: This article presents a lock-free event pool which also provides amortized O(1) time complexity for both insertions and extractions, and can sustain highly concurrent accesses, while not leading to noticeable performance degradation when scaling up the thread count.
Abstract: The large diffusion of highly-parallel shared-memory multi-core machines has led Parallel Discrete Event Simulation (PDES) platforms to a shift towards a share-everything model. This model is based on loose coupling between simulation objects and threads, lasting (as an extreme) no more than the lifetime of individual events. Concurrent threads can therefore CPU-dispatch events destined to any object at any point in time, thus fully sharing the workload of events to be processed on a fine grain basis. This demands for efficient mechanisms to share the overall pool of pending events by enabling parallelism in insertion and extraction operations. In this article we present a lock-free event pool which also provides amortized O(1) time complexity for both insertions and extractions. It can sustain highly concurrent accesses, while not leading to noticeable performance degradation when scaling up the thread count. Experimental results demonstrate that our solution stands as a core facility capable of further raising up the pragmatical impact of such an emerging share-everything PDES paradigm.

10 citations


Proceedings ArticleDOI
22 Aug 2016
TL;DR: The design and implementation of a concurrent non-blocking pending events' set data structure, which can be seen as a variant of a classical calendar queue is presented, showing excellent scalability of the proposal on a machine equipped with 32 CPU-cores.
Abstract: The large diffusion of shared-memory multi-core machines has impacted the way Parallel Discrete Event Simulation (PDES) engines are built. While they were originally conceived as data-partitioned platforms, where each thread is in charge of managing a subset of simulation objects, nowadays the trend is to shift towards share-everything settings. In this scenario, any thread can (in principle) take care of CPU-dispatching pending events bound to whichever simulation object, which helps to fully share the load across the available CPU-cores. Hence, a fundamental aspect to be tackled is to provide an efficient globally-shared pending events' set from which multiple worker threads can concurrently extract events to be processed, and into which they can concurrently insert new produced events to be processed in the future. To cope with this aspect, we present the design and implementation of a concurrent non-blocking pending events' set data structure, which can be seen as a variant of a classical calendar queue. Early experimental data collected with a synthetic stress test are reported, showing excellent scalability of our proposal on a machine equipped with 32 CPU-cores.

9 citations


Journal ArticleDOI
TL;DR: It is shown that GMU achieves almost linear scalability and that it introduces reduced overhead, with respect to solutions ensuring non-serializable semantics, in a wide range of workloads.
Abstract: In this article we introduce GMU, a genuine partial replication protocol for transactional systems, which exploits an innovative, highly scalable, distributed multiversioning scheme. Unlike existing multiversion-based solutions, GMU does not rely on any global logical clock, which may represent a contention point and a major impairment to system scalability. Also, GMU never aborts read-only transactions and spares them from undergoing distributed validation schemes. This makes GMU particularly efficient in presence of read-intensive workloads, as typical of a wide range of real-world applications. GMU guarantees the Extended Update Serializability (EUS) isolation level. This consistency criterion is particularly attractive as it is sufficiently strong to ensure correctness even for very demanding applications (such as TPC-C), but is also weak enough to allow efficient and scalable implementations, such as GMU. Further, unlike several relaxed consistency models proposed in literature, EUS shows simple and intuitive semantics, thus being an attractive consistency model for ordinary programmers. We integrated GMU in a popular open source in-memory transactional data grid, namely Infinispan. On the basis of a wide experimental study performed on heterogeneous platforms and using industry standard benchmarks (namely TPC-C and YCSB), we show that GMU achieves almost linear scalability and that it introduces reduced overhead, with respect to solutions ensuring non-serializable semantics, in a wide range of workloads.

9 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: An adaptive transaction scheduling policy relying on a Markov Chain-based model of STM systems, which schedules transactions depending on throughput predictions by the model as a function of the current system state and is periodically re-instated at run-time to adapt it to dynamic variations of the workload.
Abstract: Software Transactional Memory (STM) may suffer from performance degradation due to excessive conflicts among concurrent transactions. An approach to cope with this issue consists in putting in place smart scheduling policies which temporarily suspend the execution of some transaction in order to reduce the actual conflict rate. In this paper, we present an adaptive transaction scheduling policyrelying on a Markov Chain-based model of STM systems. The policy is adaptive in a twofold sense: (i) it schedules transactions depending on throughput predictions by the model as a function of the current system state, (ii) its underlying Markov Chain-based model is periodically re-instantiated at run-time to adapt it to dynamic variations of the workload. We also present an implementation of our adaptive transaction scheduler which has been integrated within the open source TinySTM package. The accuracy of our performance model in predicting the system throughput and the advantages of the adaptive scheduling policy over state-of-the-art approaches have been assessed via an experimental study based on the STAMP benchmark suite.

8 citations


Proceedings ArticleDOI
11 Dec 2016
TL;DR: This article study programmability and performance aspects of the last-generation ROOT-Sim speculative PDES environment for multi/many-core shared-memory architectures and introduces programming guidelines for systematic exploitation of these facilities in agent-based simulations.
Abstract: Agent-based modeling and simulation is a versatile and promising methodology to capture complex interactions among entities and their surrounding environment. A great advantage is its ability to model phenomena at a macro scale by exploiting simpler descriptions at a micro level. It has been proven effective in many fields, and it is rapidly becoming a de-facto standard in the study of population dynamics. In this article we study programmability and performance aspects of the last-generation ROOT-Sim speculative PDES environment for multi/many-core shared-memory architectures. ROOT-Sim transparently offers a programming model where interactions can be based on both explicit message passing and in-place state accesses. We introduce programming guidelines for systematic exploitation of these facilities in agent-based simulations, and we study the effects on performance of an innovative load-sharing policy targeting these types of dependencies. An experimental assessment with synthetic and real-world applications is provided, to assess the validity of our proposal.

4 citations


Book ChapterDOI
07 Jul 2016
TL;DR: This article exploits the Hardware Transactional Memory support, as offered by Intel Haswell CPUs, to process events as in-memory transactions, which are possibly committed only after their causal consistency is verified, and exploits an innovative software-based reversibility technique.
Abstract: Speculative parallel discrete event simulation requires a support for reversing processed events, also called state recovery, when causal inconsistencies are revealed. In this article we present an approach where state recovery relies on a mix of hardware- and software-based techniques. We exploit the Hardware Transactional Memory (HTM) support, as offered by Intel Haswell CPUs, to process events as in-memory transactions, which are possibly committed only after their causal consistency is verified. At the same time, we exploit an innovative software-based reversibility technique, fully relying on transparent software instrumentation targeting x86/ELF objects, which enables undoing side effects by events with no actual backward re-computation. Each thread within our speculative processing engine dynamically (on a per-event basis) selects which recovery mode to rely on (hardware vs software) depending on varying runtime dynamics. The latter are captured by a lightweight analytic model indicating to what extent the HTM support (not paying any instrumentation cost) is efficient, and after what level of events’ parallelism it starts degrading its performance, e.g., due to excessive data conflicts while manipulating causality meta-data within HTM-based transactions. We released our implementation as open source software and provide experimental results for an assessment of its effectiveness.

3 citations


Proceedings ArticleDOI
01 Sep 2016
TL;DR: This article proposes a memory access tracing approach based on static x86 binary instrumentation that selectively instruments a subset of those instructions that are the most (or fully) representative of the actual memory access pattern.
Abstract: Memory access tracing is aprogram analysis technique with many different applications, ranging from architectural simulation to (on-line) data placement optimization and security enforcement. In this article we propose a memory access tracing approach based on static x86 binary instrumentation. Unlike non-selective schemes, whichinstrument all the memory access instructions, our proposal selectively instruments a subset of those instructions that are the most (or fully) representative of the actual memory access pattern. The selection of the memory access instructions to be instrumented is based on a new method, which clusters instructions on the basis of their compile/link-time observable address expressions and selects representatives of these clusters. This allows for reducing the runtime cost for running instrumented code, while still enabling high accuracy in the determination of memory accesses. The trade-off between overhead and precision of the tracing process is user-tunable, so that it can be set depending on the final objective of memory access tracing (say on-line vs off-line exploitation). Additionally, our approach can track memory access at different granularity (e.g., virtual-pages or cache line-sized buffers), thus having applications in a variety of different contexts. The effectiveness of our proposal is demonstrated via experiments with applications taken from the PARSEC benchmark suite.

2 citations


Book ChapterDOI
24 Aug 2016
TL;DR: This paper targets last-generation Parallel Discrete Event Simulation (PDES) platforms for multicore systems and discusses a programming model to support both implicit and explicit interactions across concurrent Logical Processes (LPs).
Abstract: Execution parallelism in agent-Based Simulation (ABS) allows to deal with complex/large-scale models. This raises the need for runtime environments able to fully exploit hardware parallelism, while jointly offering ABS-suited programming abstractions. In this paper, we target last-generation Parallel Discrete Event Simulation (PDES) platforms for multicore systems. We discuss a programming model to support both implicit (in-place access) and explicit (message passing) interactions across concurrent Logical Processes (LPs). We discuss different load-sharing policies combining event rate and implicit/explicit LPs’ interactions. We present a performance study conducted on a synthetic test case, representative of a class of agent-based models.