
Showing papers by "Francesco Quaglia published in 2015"


Proceedings ArticleDOI
31 Jan 2015
TL;DR: Several hybrid/gray box techniques are explored that exploit AM and ML in synergy to get the best of the two worlds, targeting two complex and widely adopted middleware systems.
Abstract: Classical approaches to performance prediction rely on two, typically antithetic, techniques: Machine Learning (ML) and Analytical Modeling (AM). ML takes a black box approach, whose accuracy strongly depends on the representativeness of the dataset used during the initial training phase. Specifically, it can achieve very good accuracy in areas of the features' space that have been sufficiently explored during the training process. Conversely, AM techniques require no or minimal training, hence exhibiting the potential for supporting prompt instantiation of the performance model of the target system. However, in order to ensure their tractability, they typically rely on a set of simplifying assumptions. Consequently, AM's accuracy can be seriously challenged in scenarios (e.g., workload conditions) in which such assumptions are not matched. In this paper we explore several hybrid/gray box techniques that exploit AM and ML in synergy in order to get the best of the two worlds. We evaluate the proposed techniques in case studies targeting two complex and widely adopted middleware systems: a NoSQL distributed key-value store and a Total Order Broadcast (TOB) service.
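
As a rough, hedged illustration of how AM and ML can be combined (not necessarily one of the specific techniques evaluated in the paper), the C sketch below lets a simple M/M/1-style analytical baseline produce a response-time prediction and uses a 1-nearest-neighbor learner over observed residuals to correct it; all names and the choice of baseline are assumptions made purely for illustration.

```c
/* Hypothetical sketch of a hybrid AM+ML predictor: an M/M/1-style
 * analytical baseline corrected by a 1-nearest-neighbor model trained
 * on observed residuals (observed - analytical). Illustrative only. */
#include <math.h>

#define MAX_SAMPLES 1024

typedef struct { double arrival_rate, service_rate; } features_t;

static features_t train_x[MAX_SAMPLES];
static double     train_residual[MAX_SAMPLES];
static int        n_samples = 0;

/* Analytical model: M/M/1 response time R = 1 / (mu - lambda). */
static double am_predict(features_t f) {
    if (f.arrival_rate >= f.service_rate) return INFINITY; /* unstable */
    return 1.0 / (f.service_rate - f.arrival_rate);
}

/* Record one observation so the ML part learns where AM is wrong. */
void learn(features_t f, double observed) {
    if (n_samples < MAX_SAMPLES) {
        train_x[n_samples] = f;
        train_residual[n_samples] = observed - am_predict(f);
        n_samples++;
    }
}

/* ML correction: residual of the nearest training sample (1-NN). */
static double ml_residual(features_t f) {
    double best_d = INFINITY, best_r = 0.0;
    for (int i = 0; i < n_samples; i++) {
        double dx = f.arrival_rate - train_x[i].arrival_rate;
        double dy = f.service_rate - train_x[i].service_rate;
        double d = dx * dx + dy * dy;
        if (d < best_d) { best_d = d; best_r = train_residual[i]; }
    }
    return best_r;
}

/* Gray-box prediction: analytical baseline plus learned correction. */
double hybrid_predict(features_t f) {
    return am_predict(f) + ml_residual(f);
}
```

The appeal of this kind of scheme is that the ML component only has to learn where the analytical model is wrong, which typically requires far fewer training samples than learning the full performance function from scratch.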

62 citations


Journal ArticleDOI
TL;DR: The design and implementation of an autonomic state manager (ASM) is presented, tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the executable and linkable format, and developed for execution on x86_64 architectures.
Abstract: We present the design and implementation of an autonomic state manager (ASM) tailored for integration within optimistic parallel discrete event simulation (PDES) environments based on the C programming language and the executable and linkable format (ELF), and developed for execution on x86_64 architectures. With ASM, the state of any logical process (LP), namely the individual (concurrent) simulation unit being part of the simulation model, is allowed to be scattered on dynamically allocated memory chunks managed via standard API (e.g., malloc / free ). Also, the application programmer is not required to provide any serialization/deserialization module in order to take a checkpoint of the LP state, or to restore it in case a causality error occurs during the optimistic run, or to provide indications on which portions of the state are updated by event processing, so to allow incremental checkpointing. All these tasks are handled by ASM in a fully transparent manner via (A) runtime identification (with chunk-level granularity) of the memory map associated with the LP state, and (B) runtime tracking of the memory updates occurring within chunks belonging to the dynamic memory map. The co-existence of the incremental and non-incremental log/restore modes is achieved via dual versions of the same application code, transparently generated by ASM via compile/link time facilities. Also, the dynamic selection of the best suited log/restore mode is actuated by ASM on the basis of an innovative modeling/optimization approach which takes into account stability of each operating mode with respect to variations of the model/environmental execution parameters.
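
A minimal sketch of the chunk-level bookkeeping such a state manager needs is shown below, assuming hypothetical lp_malloc/lp_track_write hooks; the real ASM installs equivalent logic transparently via compile/link-time instrumentation rather than asking the programmer to call these functions.

```c
/* Hedged sketch of chunk-level tracking of an LP's dynamic memory map:
 * allocations are registered per LP and a dirty flag supports
 * incremental checkpointing. Function names are illustrative. */
#include <stdlib.h>
#include <string.h>

typedef struct chunk {
    void   *addr;
    size_t  size;
    int     dirty;               /* set when a write hits this chunk */
    struct chunk *next;
} chunk_t;

typedef struct { chunk_t *chunks; } lp_state_map_t;

/* Allocation hook: register the new chunk in the LP's memory map. */
void *lp_malloc(lp_state_map_t *lp, size_t size) {
    chunk_t *c = malloc(sizeof(*c));
    c->addr  = malloc(size);
    c->size  = size;
    c->dirty = 1;                /* a fresh chunk enters the next log */
    c->next  = lp->chunks;
    lp->chunks = c;
    return c->addr;
}

/* Write-tracking hook: mark the chunk containing addr as dirty. */
void lp_track_write(lp_state_map_t *lp, void *addr) {
    for (chunk_t *c = lp->chunks; c != NULL; c = c->next) {
        char *base = c->addr;
        if ((char *)addr >= base && (char *)addr < base + c->size) {
            c->dirty = 1;
            return;
        }
    }
}

/* Incremental checkpoint: copy only dirty chunks, then clear the flags.
 * (Restore metadata, lp_free and error handling are omitted.) */
size_t lp_log_incremental(lp_state_map_t *lp, void *out_buf) {
    char *out = out_buf;
    for (chunk_t *c = lp->chunks; c != NULL; c = c->next) {
        if (!c->dirty) continue;
        memcpy(out, c->addr, c->size);
        out += c->size;
        c->dirty = 0;
    }
    return (size_t)(out - (char *)out_buf);
}
```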

32 citations


Proceedings ArticleDOI
21 Jul 2015
TL;DR: It is shown that adopting WRTO makes it possible to design a strictly DAP TM with invisible and wait-free read-only transactions, while preserving strong progressiveness for write transactions and an isolation level known in literature as Extended Update Serializability.
Abstract: Disjoint-Access Parallelism (DAP) is considered one of the most desirable properties to maximize the scalability of Transactional Memory (TM). This paper investigates the possibility and inherent cost of implementing a DAP TM that ensures two properties that are regarded as important to maximize efficiency in read-dominated workloads, namely having invisible and wait-free read-only transactions. We first prove that relaxing Real-Time Order (RTO) is necessary to implement such a TM. This result motivates us to introduce Witnessable Real-Time Order (WRTO), a weaker variant of RTO that demands enforcing RTO only between directly conflicting transactions. Then we show that adopting WRTO makes it possible to design a strictly DAP TM with invisible and wait-free read-only transactions, while preserving strong progressiveness for write transactions and an isolation level known in the literature as Extended Update Serializability. Finally, we shed light on the inherent inefficiency of DAP TM implementations that have invisible and wait-free read-only transactions, by establishing lower bounds on the time and space complexity of such TMs.

22 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a flexible simulation framework offering skeleton simulation models that can be easily specialized in order to capture the dynamics of diverse data grid systems, such as those related to the specific (distributed) protocol used to provide data consistency and/or transactional guarantees.
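
To give an idea of what specializing a skeleton model could look like, the sketch below defines a hypothetical set of protocol hooks that a concrete data-grid configuration would fill in; these names are illustrative and are not the framework's actual API.

```c
/* Hypothetical sketch of how a skeleton data-grid model might expose
 * protocol-specific hooks for specialization; names are illustrative. */
typedef struct dg_protocol_hooks {
    /* cost (in simulated time) of a local read/write on a data item */
    double (*local_access_delay)(int item, int is_write);
    /* number of remote nodes contacted to commit a transaction */
    int    (*nodes_to_contact)(int written_items);
    /* whether a commit attempt aborts under the modeled concurrency control */
    int    (*commit_aborts)(double contention_level);
} dg_protocol_hooks_t;

/* A specialization capturing, say, a two-phase-commit-based protocol
 * would fill in these hooks, while the skeleton drives transaction
 * arrivals and the CPU/network dynamics common to all configurations. */
```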

14 citations


Proceedings ArticleDOI
16 Dec 2015
TL;DR: An innovative runtime support for speculative parallel processing of discrete event simulation models on multi-core architectures, which exploits Hardware-Transactional-Memory (HTM) facilities for the purpose of state recoverability.
Abstract: This article presents an innovative runtime support for speculative parallel processing of discrete event simulation models on multi-core architectures, which exploits Hardware-Transactional-Memory (HTM) facilities for the purpose of state recoverability. In this proposal, the speculative updates on the state of the simulation model are executed as concurrent HTM-based transactions that are also in charge of detecting whether the update is consistent with the advancement of logical-time along model execution. Our proposal is fully transparent to the application code. Hence, our HTM-based run-time support can host conventionally developed discrete event models relying on the concept of event-handlers to be dispatched by an underlying simulation engine. Experimental data show that our proposal provides 75% to 92% of the ideal speedup on an Intel Haswell based platform (equipped with 4 physical cores and HTM support) for discrete event models with event granularity ranging between 2 and 12 microseconds. The data also show that these same models cannot be executed efficiently on top of a last generation parallel discrete event simulation platform employing software-based recoverability.
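
The core idea can be sketched with Intel RTM intrinsics as shown below: the event handler's state updates run inside a hardware transaction, which is explicitly aborted if the speculation turns out to be inconsistent with logical time. This is a simplified sketch (no fallback path, no engine integration), not the paper's actual runtime support.

```c
/* Minimal sketch of speculative event processing with Intel RTM (HTM)
 * intrinsics: the state update runs inside a hardware transaction and
 * is aborted explicitly if the event is found to be inconsistent with
 * the advancement of logical time. Compile with -mrtm. */
#include <immintrin.h>

#define ABORT_INCONSISTENT 0x01

extern void process_event(void *lp_state, void *event);   /* model code   */
extern int  still_consistent(void *event);                /* engine check */

int speculative_process(void *lp_state, void *event) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        process_event(lp_state, event);      /* speculative state update */
        if (!still_consistent(event))
            _xabort(ABORT_INCONSISTENT);     /* discard the whole update */
        _xend();                             /* commit atomically        */
        return 1;
    }
    /* Transaction aborted (conflict, capacity, or explicit abort): the
     * state update left no trace; the caller may retry or fall back. */
    return 0;
}
```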

13 citations


Journal ArticleDOI
TL;DR: The article presents a low-overhead constant-time implementation of the well-known Lowest-Timestamp-First algorithm for the identification of the next LP to be CPU-dispatched, suited for contexts where the optimistic simulation system conforms to the best practice of keeping separate event lists for the hosted LPs.
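
For reference, the baseline Lowest-Timestamp-First policy over per-LP event lists is the simple O(n) scan sketched below; the article's contribution is a constant-time replacement for exactly this selection step, which is not reproduced here.

```c
/* Baseline (non-constant-time) Lowest-Timestamp-First selection over
 * per-LP event lists: scan all LPs and dispatch the one whose next
 * pending event has the minimum timestamp. */
#include <stddef.h>
#include <float.h>

typedef struct event { double timestamp; struct event *next; } event_t;
typedef struct lp    { event_t *next_event; /* head of this LP's list */ } lp_t;

int ltf_select(lp_t *lps, int n_lps) {
    int    best    = -1;
    double best_ts = DBL_MAX;
    for (int i = 0; i < n_lps; i++) {
        event_t *e = lps[i].next_event;
        if (e != NULL && e->timestamp < best_ts) {
            best_ts = e->timestamp;
            best    = i;
        }
    }
    return best;   /* index of the LP to CPU-dispatch, or -1 if idle */
}
```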

10 citations


Proceedings ArticleDOI
10 Jun 2015
TL;DR: This article provides an innovative Linux-based architecture allowing per simulation-object management of memory segments made up by disjoint sets of pages, and supporting both static and dynamic binding of the memory pages reserved for an individual object to the different NUMA nodes.
Abstract: It is well known that Time Warp may suffer from large usage of memory, which may hamper the efficiency of the memory hierarchy. To cope with this issue, several approaches have been devised, mostly based on the reduction of the amount of used virtual memory, e.g., by the avoidance of checkpointing and the exploitation of reverse computing. In this article we present an orthogonal solution aimed at optimizing the latency for memory access operations when running Time Warp systems on Non-Uniform Memory Access (NUMA) multi-processor/multi-core computing systems. In more detail, we provide an innovative Linux-based architecture allowing per simulation-object management of memory segments made up of disjoint sets of pages, and supporting both static and dynamic binding of the memory pages reserved for an individual object to the different NUMA nodes, depending on what worker thread is in charge of running that simulation object along a given wall-clock-time window. Our proposal not only manages the virtual pages used for the live state image of the simulation object, but also copes with the memory pages hosting the simulation object's event buffers and any recoverability data. Further, the architecture allows memory access optimization for data (messages) exchanged across the different simulation objects running on the NUMA machine. Our proposal is fully transparent to the application code, thus operating in a seamless manner. Also, a free software release of our NUMA memory manager for Time Warp has been made available within the open source ROOT-Sim simulation platform. Experimental data for an assessment of our innovative proposal are also provided in this article.
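
A heavily simplified user-space sketch of per-object NUMA placement using libnuma is shown below (link with -lnuma); the architecture described in the article instead manages disjoint sets of pages per simulation object inside a custom Linux-based memory manager, which the sketch does not attempt to reproduce.

```c
/* Hedged sketch: allocate a simulation object's memory segment on the
 * NUMA node of its current worker thread, and express a new preferred
 * node when the object is rebound to a worker on another node. */
#include <numa.h>
#include <stdlib.h>

void *alloc_object_segment(size_t size, int worker_numa_node) {
    if (numa_available() < 0)
        return malloc(size);                  /* no NUMA support */
    return numa_alloc_onnode(size, worker_numa_node);
}

/* Note: numa_tonode_memory() is only guaranteed to take effect for
 * pages not yet touched; actually migrating live pages would require
 * mbind() with MPOL_MF_MOVE or move_pages(). */
void rebind_object_segment(void *segment, size_t size, int new_node) {
    if (numa_available() >= 0)
        numa_tonode_memory(segment, size, new_node);
}
```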

9 citations


Book ChapterDOI
24 Aug 2015
TL;DR: RAMSES offers parallel execution capabilities based on speculative event processing and an innovative software reversibility technique that copes with state restore in case the run slides along a non-consistent speculative path.
Abstract: This paper presents RAMSES, a framework for easily specifying agent-based discrete event models entailing both environment and agent entities. RAMSES offers parallel execution capabilities based on speculative event processing and an innovative software reversibility technique that copes with state restore in case the run slides along a non-consistent speculative path. Reversibility in RAMSES relies on transparent static software instrumentation, thus allowing the model developer to concentrate on the actual forward-execution logic of the simulation events occurring in the system. An experimental assessment of RAMSES is also presented, which is aimed at determining its run-time effectiveness and its potential for simplifying the development of agent-based models when compared to other (general purpose) speculative frameworks for parallel discrete event simulation.
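
One simple way to obtain reversibility through instrumentation is write logging: every instrumented store saves the previous value so a mis-speculated event can be undone by replaying the log backwards, as in the hedged sketch below. RAMSES' actual reversibility technique is generated by its static instrumentation pass and may differ from this scheme.

```c
/* Hedged sketch of write-logging reversibility: instrumented stores
 * save the old value; reverse_event() undoes them in reverse order.
 * Bounds checking and per-event log segmentation are omitted. */
#include <stdint.h>

#define UNDO_LOG_SIZE 4096

typedef struct { uint64_t *addr; uint64_t old_value; } undo_entry_t;

static undo_entry_t undo_log[UNDO_LOG_SIZE];
static int undo_top = 0;

/* Emitted by instrumentation in place of every 64-bit store. */
static inline void reversible_store(uint64_t *addr, uint64_t value) {
    undo_log[undo_top].addr      = addr;
    undo_log[undo_top].old_value = *addr;
    undo_top++;
    *addr = value;
}

/* Undo every side effect of the current (mis-speculated) event. */
void reverse_event(void) {
    while (undo_top > 0) {
        undo_top--;
        *undo_log[undo_top].addr = undo_log[undo_top].old_value;
    }
}
```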

6 citations


Book ChapterDOI
TL;DR: This chapter overviews a set of recent techniques aimed at building “application-specific” performance models that can be exploited to dynamically tune the level of concurrency to the best-suited value.
Abstract: Synchronization transparency offered by Software Transactional Memory (STM) must not come at the expense of run-time efficiency, thus demanding from the STM-designer the inclusion of mechanisms properly oriented to performance and other quality indexes. Particularly, one core issue to cope with in STM is related to exploiting parallelism while also avoiding thrashing phenomena due to excessive transaction rollbacks, caused by excessively high levels of contention on logical resources, namely concurrently accessed data portions. A means to address run-time efficiency consists in dynamically determining the best-suited level of concurrency (number of threads) to be employed for running the application (or specific application phases) on top of the STM layer. For too low levels of concurrency, parallelism can be hampered. Conversely, over-dimensioning the concurrency level may give rise to the aforementioned thrashing phenomena caused by excessive data contention, which also has negative repercussions on energy efficiency. In this chapter we overview a set of recent techniques aimed at building “application-specific” performance models that can be exploited to dynamically tune the level of concurrency to the best-suited value. Although they share some base concepts while modeling the system performance versus the degree of concurrency, these techniques rely on disparate methods, such as machine learning or analytic methods (or combinations of the two), and achieve different tradeoffs in terms of the relation between the precision of the performance model and the latency for model instantiation. Implications of the different tradeoffs in real-life scenarios are also discussed.
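
Regardless of how the performance model is built (machine learning, analytic methods, or a combination of the two), the tuning step reduces to picking the concurrency level that maximizes predicted throughput. The sketch below uses a deliberately toy parametric curve whose parameters (alpha, beta) stand in for whatever a real technique would estimate at run time.

```c
/* Illustrative sketch of concurrency self-tuning: given some fitted
 * performance model thr(k), periodically pick the thread count k that
 * maximizes predicted throughput. The model below is a placeholder,
 * far simpler than the techniques surveyed in the chapter. */
#include <math.h>

/* Toy model: useful work grows with k, but an exponentially growing
 * abort probability erodes it at high contention. alpha and beta are
 * parameters a real technique would estimate from measurements. */
static double predicted_throughput(int k, double alpha, double beta) {
    double p_abort = 1.0 - exp(-alpha * (k - 1));   /* grows with k   */
    return (double)k * (1.0 - p_abort) / beta;      /* committed tx/s */
}

int best_concurrency_level(int max_threads, double alpha, double beta) {
    int    best_k   = 1;
    double best_thr = predicted_throughput(1, alpha, beta);
    for (int k = 2; k <= max_threads; k++) {
        double thr = predicted_throughput(k, alpha, beta);
        if (thr > best_thr) { best_thr = thr; best_k = k; }
    }
    return best_k;   /* number of threads to activate for the next phase */
}
```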

5 citations


Proceedings ArticleDOI
10 Jun 2015
TL;DR: This article presents the design and implementation of a time-sharing Time Warp platform, to be run on multi-core machines, where the platform-level software is allowed to take back control on a periodical basis (with fine grain period), and to possibly preempt any ongoing event processing activity in favor of dispatching any other event that is revealed to have higher priority.
Abstract: The order according to which the different tasks are carried out within a Time Warp platform has a direct impact on performance, given that event processing is speculative, thus being subject to the possibility of being rolled back. It is typically recognized that not-yet-executed events having lower timestamps should be given higher CPU-schedule priority, since this contributes to keeping the amount of rollbacks low. However, common Time Warp platforms usually execute events as atomic actions. Hence control is bounced back to the underlying simulation platform only at the end of the current event processing routine. In other words, CPU-scheduling of events resembles classical batch-multitasking scheduling, which is recognized not to promptly react to variations of the priority of pending tasks (e.g., those associated with the injection of new events in the system). In this article we present the design and implementation of a time-sharing Time Warp platform, to be run on multi-core machines, where the platform-level software is allowed to take back control on a periodical basis (with fine grain period), and to possibly preempt any ongoing event processing activity in favor of dispatching (along the same thread) any other event that is revealed to have higher priority. Our proposal is based on an ad-hoc kernel module for Linux, which implements a fine grain timer-interrupt mechanism with lightweight management, which is fully integrated with the modern top/bottom-half timer-interrupt Linux architecture, and which does not induce any bias in terms of relative CPU-usage planning across Time Warp vs. non-Time Warp threads running on the machine. Our time-sharing architecture has been integrated within the open source ROOT-Sim optimistic simulation package, and we also report some experimental data for an assessment of our proposal.
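
A coarse user-space approximation of the time-sharing idea can be built with a standard interval timer, as sketched below: a periodic signal asks the engine to re-evaluate event priorities at instrumented points. The paper's actual mechanism is a dedicated Linux kernel module providing fine-grain, low-overhead timer interrupts and genuine preemption of the ongoing event handler, which this sketch does not reproduce.

```c
/* Hedged user-space sketch: a periodic SIGALRM sets a flag, and
 * instrumented points inside event processing check whether a
 * higher-priority event should be dispatched instead. */
#include <signal.h>
#include <sys/time.h>

static volatile sig_atomic_t reschedule_requested = 0;

static void on_tick(int signo) {
    (void)signo;
    reschedule_requested = 1;    /* async-signal-safe: just set a flag */
}

void start_timesharing(long period_usec) {
    struct sigaction sa;
    sa.sa_handler = on_tick;
    sa.sa_flags   = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval it;
    it.it_interval.tv_sec  = 0;
    it.it_interval.tv_usec = period_usec;   /* periodic tick    */
    it.it_value            = it.it_interval; /* first expiration */
    setitimer(ITIMER_REAL, &it, NULL);
}

/* Hypothetical engine hooks, called at instrumented points inside
 * event processing: yield if a higher-priority event showed up. */
extern int  higher_priority_event_pending(void);
extern void yield_to_engine(void);

void maybe_preempt(void) {
    if (reschedule_requested) {
        reschedule_requested = 0;
        if (higher_priority_event_pending())
            yield_to_engine();
    }
}
```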

2 citations