Showing papers presented at "Virtual Execution Environments in 2014"
••
01 Mar 2014
TL;DR: Ginseng is presented, the first market-driven cloud system that allocates memory efficiently to selfish cloud clients and achieves a 6.2× to 15.8× improvement in aggregate client satisfaction when compared with state-of-the-art approaches for cloud memory allocation.
Abstract: Physical memory is the scarcest resource in today's cloud computing platforms. Cloud providers would like to maximize their clients' satisfaction by renting precious physical memory to those clients who value it the most. But real-world cloud clients are selfish: they will only tell their providers the truth about how much they value memory when it is in their own best interest to do so. How can real-world cloud providers allocate memory efficiently to those (selfish) clients who value it the most? We present Ginseng, the first market-driven cloud system that allocates memory efficiently to selfish cloud clients. Ginseng incentivizes selfish clients to bid their true value for the memory they need when they need it. Ginseng continuously collects client bids, finds an efficient memory allocation, and re-allocates physical memory to the clients that value it the most. Ginseng achieves a 6.2× to 15.8× improvement (83% to 100% of the optimum) in aggregate client satisfaction when compared with state-of-the-art approaches for cloud memory allocation.
75 citations
••
01 Mar 2014
TL;DR: A new platform for secure static binary instrumentation (PSI) is developed that overcomes these drawbacks of DBI techniques, while retaining the security, robustness and ease-of-use features.
Abstract: Program instrumentation techniques form the basis of many recent software security defenses, including defenses against common exploits and security policy enforcement. As compared to source-code instrumentation, binary instrumentation is easier to use and more broadly applicable due to the ready availability of binary code. Two key features needed for security instrumentations are (a) it should be applied to all application code, including code contained in various system and application libraries, and (b) it should be non-bypassable. So far, dynamic binary instrumentation (DBI) techniques have provided these features, whereas static binary instrumentation (SBI) techniques have lacked them. These features, combined with ease of use, have made DBI the de facto choice for security instrumentations. However, DBI techniques can incur high overheads in several common usage scenarios, such as application startups, system calls, and many real-world applications. We therefore develop a new platform for secure static binary instrumentation (PSI) that overcomes these drawbacks of DBI techniques, while retaining the security, robustness and ease-of-use features. We illustrate the versatility of PSI by developing several instrumentation applications: basic block counting, shadow stack defense against control-flow hijack and return-oriented programming attacks, and system call and library policy enforcement. While being competitive with the best DBI tools on the CPU-intensive SPEC 2006 benchmark, PSI provides an order of magnitude reduction in overheads on a collection of real-world applications.
69 citations
••
01 Mar 2014
TL;DR: This work develops a real-time kernel data structure monitoring (RTKDSM) system that leverages the rich OS analysis capabilities of Volatility, an open source computer forensics framework, to significantly simplify and automate analysis of VM execution states.
Abstract: Virtual Machine Introspection (VMI) provides the ability to monitor virtual machines (VM) in an agentless fashion by gathering VM execution states from the hypervisor and analyzing those states to extract information about a running operating system (OS) without installing an agent inside the VM. VMI's main challenge lies in the difficulty in converting low-level byte string values into high-level semantic states of the monitored VM's OS. In this work, we tackle this challenge by developing a real-time kernel data structure monitoring (RTKDSM) system that leverages the rich OS analysis capabilities of Volatility, an open source computer forensics framework, to significantly simplify and automate analysis of VM execution states. The RTKDSM system is designed as an extensible software framework that is meant to be extended to perform application-specific VM state analysis. In addition, the RTKDSM system is able to perform real-time monitoring of any changes made to the extracted OS states of guest VMs. This real-time monitoring capability is especially important for VMI-based security applications. To minimize the performance overhead associated with real-time kernel data structure monitoring, the RTKDSM system has incorporated several optimizations whose effectiveness is reported in this paper.
53 citations
••
01 Mar 2014
TL;DR: A lightweight page Classification-based Memory Deduplication approach named CMD is proposed to reduce futile page comparison overhead while still detecting page sharing opportunities efficiently; the experimental results show that CMD can efficiently reduce page comparisons.
Abstract: Limited main memory size is considered one of the major bottlenecks in virtualization environments. Content-Based Page Sharing (CBPS) is an efficient memory deduplication technique for reducing server memory requirements, in which pages with the same content are detected and shared as a single copy. As the widely used implementation of CBPS, Kernel Samepage Merging (KSM) maintains all tracked memory pages in two global comparison trees (a stable tree and an unstable tree). To detect page sharing opportunities, each tracked page needs to be compared with pages already in these two large global trees. However, since the vast majority of the compared pages differ in content from the tracked page, this induces massive numbers of futile comparisons and thus heavy overhead. In this paper, we propose a lightweight page Classification-based Memory Deduplication approach named CMD that reduces futile page comparison overhead while still detecting page sharing opportunities efficiently. The main innovation of CMD is that pages are grouped into different classifications based on page access characteristics. Pages with similar access characteristics are considered more likely to have the same content, so they are grouped into the same classification. In CMD, the large global comparison trees are divided into multiple small trees, with dedicated local trees in each page classification. Page comparisons are performed only within the same classification, and pages from different classifications are never compared (since such comparisons would probably be futile). The experimental results show that CMD efficiently reduces page comparisons (by about 68.5%) while detecting nearly the same (more than 98%) or even more page sharing opportunities.
45 citations
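The per-classification comparison idea in the CMD abstract can be illustrated with a small counting sketch. This is not the authors' implementation: plain lists stand in for KSM's stable and unstable trees, and the class labels "A" and "B" are hypothetical stand-ins for whatever access-characteristic classification the paper uses. The sketch only shows why restricting comparisons to one class can reduce the number of futile comparisons.

```python
from collections import defaultdict

def count_comparisons_global(pages):
    """Compare each page against one global set of distinct pages.

    `pages` is a list of (class_label, content) pairs; the label is ignored here.
    Returns (number of comparisons, number of pages that could be shared).
    """
    distinct = []  # a list stands in for the global comparison trees
    comparisons = shared = 0
    for _label, content in pages:
        for seen in distinct:
            comparisons += 1
            if seen == content:
                shared += 1
                break
        else:
            distinct.append(content)
    return comparisons, shared

def count_comparisons_classified(pages):
    """Compare each page only against distinct pages of the same class."""
    distinct_by_class = defaultdict(list)  # one small structure per class
    comparisons = shared = 0
    for label, content in pages:
        bucket = distinct_by_class[label]
        for seen in bucket:
            comparisons += 1
            if seen == content:
                shared += 1
                break
        else:
            bucket.append(content)
    return comparisons, shared

# Hypothetical workload: identical pages happen to fall into the same class.
pages = [("A", b"x"), ("B", b"y"), ("A", b"x"),
         ("B", b"z"), ("B", b"y"), ("A", b"w")]
print(count_comparisons_global(pages))
print(count_comparisons_classified(pages))
```

In this toy workload both strategies find the same sharing opportunities, but the classified variant performs fewer comparisons. If identical pages landed in different classes, the classified variant would miss them, which is the trade-off the abstract quantifies.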
••
01 Mar 2014
TL;DR: This paper tries to see how far one can push a naive implementation while remaining portable and not requiring expertise in compilers and runtime systems.
Abstract: Dynamic languages have been gaining popularity to the point that their performance is starting to matter. The effort required to develop a production-quality, high-performance runtime is, however, staggering and the expertise required to do so is often out of reach of the community maintaining a particular language. Many domain specific languages remain stuck with naive implementations, as they are easy to write and simple to maintain for domain scientists. In this paper, we try to see how far one can push a naive implementation while remaining portable and not requiring expertise in compilers and runtime systems. We choose the R language, a dynamic language used in statistics, as the target of our experiment and adopt the simplest possible implementation strategy, one based on evaluation of abstract syntax trees. We build our interpreter on top of a Java virtual machine and use only facilities available to all Java programmers. We compare our results to other implementations of R.
42 citations
••
01 Mar 2014
TL;DR: This paper formulates the multi-tier application migration problem, and presents a new communication-impact-driven coordinated approach, as well as a system called COMMA that realizes this approach, which is highly effective in minimizing migration's impact on multi-tier applications' performance.
Abstract: Multi-tier applications are widely deployed in today's virtualized cloud computing environments. At the same time, management operations in these virtualized environments, such as load balancing, hardware maintenance, workload consolidation, etc., often make use of live virtual machine (VM) migration to control the placement of VMs. Although existing solutions are able to migrate a single VM efficiently, little attention has been devoted to migrating related VMs in multi-tier applications. Ignoring the relatedness of VMs during migration can lead to serious application performance degradation. This paper formulates the multi-tier application migration problem, and presents a new communication-impact-driven coordinated approach, as well as a system called COMMA that realizes this approach. Through extensive testbed experiments, numerical analyses, and a demonstration of COMMA on Amazon EC2, we show that this approach is highly effective in minimizing migration's impact on multi-tier applications' performance.
35 citations
••
01 Mar 2014
TL;DR: By expanding and contracting the data store size based on the free memory available, Mortar improves average response time of a web application by up to 35% compared to a fixed size memcached deployment, and improves overall video streaming performance by 45% through prefetching.
Abstract: Data center servers are typically overprovisioned, leaving spare memory and CPU capacity idle to handle unpredictable workload bursts by the virtual machines running on them. While this allows for fast hotspot mitigation, it is also wasteful. Unfortunately, making use of spare capacity without impacting active applications is particularly difficult for memory since it typically must be allocated in coarse chunks over long timescales. In this work we propose repurposing the poorly utilized memory in a data center to hold a volatile data store that is managed by the hypervisor. We present two uses for our Mortar framework: as a cache for prefetching disk blocks, and as an application-level distributed cache that follows the memcached protocol. Both prototypes use the framework to ask the hypervisor to store useful, but recoverable data within its free memory pool. This allows the hypervisor to control eviction policies and prioritize access to the cache. We demonstrate the benefits of our prototypes using realistic web applications and disk benchmarks, as well as memory traces gathered from live servers in our university's IT department. By expanding and contracting the data store size based on the free memory available, Mortar improves average response time of a web application by up to 35% compared to a fixed size memcached deployment, and improves overall video streaming performance by 45% through prefetching.
27 citations
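The idea of a data store that expands and contracts with available free memory, as described in the Mortar abstract, can be sketched as a cache whose capacity is adjusted from outside. This is only an illustration: the class name `ElasticCache`, the entry-count capacity, and the least-recently-used eviction policy are assumptions for the sketch. The abstract says only that the hypervisor controls eviction policies, not which policy is used.

```python
from collections import OrderedDict

class ElasticCache:
    """A cache of recoverable data whose capacity can shrink or grow at any time."""

    def __init__(self, capacity):
        self.capacity = max(0, capacity)  # capacity counted in entries (assumption)
        self._items = OrderedDict()       # oldest (least recently used) entry first

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        self._evict()

    def get(self, key):
        """Return the cached value, or None if it was never stored or was evicted."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def resize(self, capacity):
        """Called when the amount of free memory changes; evicts immediately if shrinking."""
        self.capacity = max(0, capacity)
        self._evict()

    def _evict(self):
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def __len__(self):
        return len(self._items)

cache = ElasticCache(capacity=3)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    cache.put(key, value)
cache.get("a")    # touch "a" so that "b" becomes the least recently used entry
cache.resize(2)   # free memory shrank: one entry must go
```

Because the stored data is recoverable, a miss after a shrink costs only a refetch from the original source, which is what makes it safe to reclaim the memory on short notice.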
••
01 Mar 2014
TL;DR: Embedded shadow page tables (ESPT) is proposed, which embeds a shadow page table into the address space of a cross-ISA dynamic binary translation (DBT) and uses hardware memory management unit in the CPU to translate memory addresses, instead of software translation in a current DBT emulator like QEMU.
Abstract: Cross-ISA system-mode emulation has many important applications. For example, cross-ISA system-mode emulation helps computer architects and OS developers trace and debug kernel execution flow efficiently by emulating a slower platform (such as ARM) on a more powerful platform (such as an x86 machine). Cross-ISA system-mode emulation also enables workload consolidation in data centers with platforms of different instruction-set architectures (ISAs). However, system-mode emulation is much slower. One major overhead in system-mode emulation is the multi-level memory address translation that maps guest virtual address to host physical address. Shadow page tables (SPT) have been used to reduce such overheads, but primarily for same-ISA virtualization. In this paper we propose a novel approach called embedded shadow page tables (ESPT). ESPT embeds a shadow page table into the address space of a cross-ISA dynamic binary translation (DBT) and uses hardware memory management unit in the CPU to translate memory addresses, instead of software translation in a current DBT emulator like QEMU. We also use the larger address space on modern 64-bit CPUs to accommodate our DBT emulator so that it will not interfere with the guest operating system. We incorporate our new scheme into QEMU, a popular, retargetable cross-ISA system emulator. SPEC CINT2006 benchmark results indicate that our technique achieves an average speedup of 1.51 times in system mode when emulating ARM on x86, and a 1.59 times speedup for emulating IA32 on x86_64.
25 citations
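The multi-level translation that the ESPT abstract identifies as an overhead, and the way a shadow table collapses it into a single lookup, can be sketched with dictionaries. This is a conceptual illustration only: the dictionaries `guest_pt` and `host_map`, the single-level table layout, and the 4 KiB page size are assumptions for the sketch. ESPT itself relies on the hardware MMU rather than software lookups like these.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch

def translate_two_level(guest_pt, host_map, gva):
    """Translate a guest virtual address with two successive lookups."""
    vpn, offset = divmod(gva, PAGE_SIZE)
    gpn = guest_pt[vpn]   # guest virtual page -> guest physical page
    hpn = host_map[gpn]   # guest physical page -> host page
    return hpn * PAGE_SIZE + offset

def build_shadow_table(guest_pt, host_map):
    """Precompose both mappings so each translation needs a single lookup."""
    return {vpn: host_map[gpn] for vpn, gpn in guest_pt.items() if gpn in host_map}

def translate_shadow(shadow, gva):
    """Translate a guest virtual address through the composed shadow table."""
    vpn, offset = divmod(gva, PAGE_SIZE)
    return shadow[vpn] * PAGE_SIZE + offset

# Hypothetical mappings: guest page 1 -> guest frame 7 -> host frame 9.
guest_pt = {0: 5, 1: 7}
host_map = {5: 42, 7: 9}
shadow = build_shadow_table(guest_pt, host_map)
gva = 1 * PAGE_SIZE + 12
print(translate_two_level(guest_pt, host_map, gva), translate_shadow(shadow, gva))
```

The shadow table must be rebuilt or patched whenever the guest changes its page table, which is the bookkeeping cost that any shadow page table scheme has to manage.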
••
01 Mar 2014
TL;DR: Tesseract is presented, a system that directly and transparently addresses the double-paging problem and can significantly reduce the costs of double-paging, and is evaluated on a synthetic benchmark designed to highlight its effects.
Abstract: Double-paging is an often-cited, if unsubstantiated, problem in multi-level scheduling of memory between virtual machines (VMs) and the hypervisor. This problem occurs when both a virtualized guest and the hypervisor overcommit their respective physical address-spaces. When the guest pages out memory previously swapped out by the hypervisor, it initiates an expensive sequence of steps causing the contents to be read in from the hypervisor swapfile only to be written out again, significantly lengthening the time to complete the guest I/O request. As a result, performance rapidly drops. We present Tesseract, a system that directly and transparently addresses the double-paging problem. Tesseract tracks when guest and hypervisor I/O operations are redundant and modifies these I/Os to create indirections to existing disk blocks containing the page contents. Although our focus is on reconciling I/Os between the guest disks and hypervisor swap, our technique is general and can reconcile, or deduplicate, I/Os for guest pages read or written by the VM. Deduplication of disk blocks for file contents accessed in a common manner is well-understood. One challenge that our approach faces is that the locality of guest I/Os (reflecting the guest's notion of disk layout) often differs from that of the blocks in the hypervisor swap. This loss of locality through indirection results in significant performance loss on subsequent guest reads. We propose two alternatives to recovering this lost locality, each based on the idea of asynchronously reorganizing the indirected blocks in persistent storage. We evaluate our system and show that it can significantly reduce the costs of double-paging. We focus our experiments on a synthetic benchmark designed to highlight its effects. In our experiments we observe Tesseract can improve our benchmark's throughput by as much as 200% when using traditional disks and by as much as 30% when using SSD. At the same time worst case application responsiveness can be improved by a factor of 5.
23 citations
••
01 Mar 2014
TL;DR: The design of the Quest-V separation kernel is discussed, which partitions services of different criticalities into separate virtual machines, or sandboxes, each of which encapsulates a subset of machine physical resources that it manages without requiring intervention of a hypervisor.
Abstract: Multi- and many-core processors are becoming increasingly popular in embedded systems. Many of these processors now feature hardware virtualization capabilities, such as the ARM Cortex A15, and x86 processors with Intel VT-x or AMD-V support. Hardware virtualization offers opportunities to partition physical resources, including processor cores, memory and I/O devices amongst guest virtual machines. Mixed criticality systems and services can then co-exist on the same platform in separate virtual machines. However, traditional virtual machine systems are too expensive because of the costs of trapping into hypervisors to multiplex and manage machine physical resources on behalf of separate guests. For example, hypervisors are needed to schedule separate VMs on physical processor cores. In this paper, we discuss the design of the Quest-V separation kernel, which partitions services of different criticalities in separate virtual machines, or sandboxes. Each sandbox encapsulates a subset of machine physical resources that it manages without requiring intervention of a hypervisor. Moreover, a hypervisor is not needed for normal operation, except to bootstrap the system and establish communication channels between sandboxes.
20 citations
••
01 Mar 2014
TL;DR: This paper proposes a novel technique that enables deoptimization for dynamic language runtimes implemented on top of typed, stack-based virtual machines, and implements this technique in a JavaScript language implementation, MCJS, running on top of the Mono runtime (CLR).
Abstract: We are interested in implementing dynamic language runtimes on top of language-level virtual machines. Type specialization is a critical optimization for dynamic language runtimes: generic code that handles any type of data is replaced with specialized code for particular types observed during execution. However, types can change, and the runtime must recover whenever unexpected types are encountered. The state-of-the-art recovery mechanism is called deoptimization. Deoptimization is a well-known technique for dynamic language runtimes implemented in low-level languages like C. However, no dynamic language runtime implemented on top of a virtual machine such as the Common Language Runtime (CLR) or the Java Virtual Machine (JVM) uses deoptimization, because the implementation thereof used in low-level languages is not possible. In this paper we propose a novel technique that enables deoptimization for dynamic language runtimes implemented on top of typed, stack-based virtual machines. Our technique does not require any changes to the underlying virtual machine. We implement our proposed technique in a JavaScript language implementation, MCJS, running on top of the Mono runtime (CLR). We evaluate our implementation against the current state-of-the-art recovery mechanism for virtual machine-based runtimes, as implemented both in MCJS and in IronJS. We show that deoptimization provides significant performance benefits, even for runtimes running on top of a virtual machine.
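The relationship between type specialization and recovery described in this abstract can be illustrated with a toy call site. This is not the paper's technique: the names `AddSite`, `specialized_int_add`, and `generic_add` are hypothetical, the use of an exception as the bail-out signal is a choice made for this sketch, and real deoptimization resumes execution in the middle of a function with its live state, whereas this sketch simply reruns one operation on the generic path.

```python
class Deoptimize(Exception):
    """Signals that a specialized fast path saw a type it was not compiled for."""

def generic_add(a, b):
    """Slow path: handles every operand type this toy language supports."""
    if isinstance(a, str) or isinstance(b, str):
        return str(a) + str(b)
    return a + b

def specialized_int_add(a, b):
    """Fast path: valid only while both operands are plain integers."""
    if type(a) is not int or type(b) is not int:  # type guard
        raise Deoptimize()
    return a + b

class AddSite:
    """One call site that starts specialized and falls back when a guard fails."""

    def __init__(self):
        self.impl = specialized_int_add
        self.deopt_count = 0

    def __call__(self, a, b):
        try:
            return self.impl(a, b)
        except Deoptimize:
            self.deopt_count += 1
            self.impl = generic_add  # recover: switch this site to generic code
            return self.impl(a, b)

site = AddSite()
results = [site(1, 2), site("a", 1), site(2, 3)]
print(results, site.deopt_count)
```

The first call stays on the specialized path, the second fails the guard and triggers the fallback, and the third runs on the generic path without any further guard failure.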
••
01 Mar 2014
TL;DR: This paper presents DBILL, a cross-ISA and retargetable dynamic binary instrumentation framework that builds on both QEMU and LLVM, and enables LLVM-based static instrumentation tools to become DBI ready, and deployable to different target architectures.
Abstract: Dynamic Binary Instrumentation (DBI) is a core technology for building debugging and profiling tools for application executables. Most state-of-the-art DBI systems have focused on the same instruction set architecture (ISA) where the guest binary and the host binary have the same ISA. It is uncommon to have a cross-ISA DBI system, such as a system that instruments ARM executables to run on x86 machines. We believe cross-ISA DBI systems are increasingly important, since ARM executables could be more productively analyzed on x86 based machines such as commonly available PCs and servers. In this paper, we present DBILL, a cross-ISA and retargetable dynamic binary instrumentation framework that builds on both QEMU and LLVM. The DBILL framework enables LLVM-based static instrumentation tools to become DBI ready, and deployable to different target architectures. Using address sanitizer and memory sanitizer as implementation examples, we show DBILL is an efficient, versatile and easy to use cross-ISA retargetable DBI framework.
••
01 Mar 2014
TL;DR: The challenges the authors faced in creating Stackdb are described, the solutions they devised are presented, and Stackdb is evaluated through its application to a security-focused, whole-system case study.
Abstract: Virtual machine introspection (VMI) allows users to debug software that executes within a virtual machine. To support rich, whole-system analyses, a VMI tool must inspect and control systems at multiple levels of the software stack. Traditional debuggers enable inspection and control, but they limit users to treating a whole system as just one kind of target: e.g., just a kernel, or just a process, but not both. We created Stackdb, a debugging library with VMI support that allows one to monitor and control a whole system through multiple, coordinated targets. A target corresponds to a particular level of the system's software stack; multiple targets allow a user to observe a VM guest at several levels of abstraction simultaneously. For example, with Stackdb, one can observe a PHP script running in a Linux process in a Xen VM via three coordinated targets at the language, process, and kernel levels. Within Stackdb, higher-level targets are components that utilize lower-level targets; a key contribution of Stackdb is its API that supports multi-level and flexible "stacks" of targets. This paper describes the challenges we faced in creating Stackdb, presents the solutions we devised, and evaluates Stackdb through its application to a security-focused, whole-system case study.
••
01 Mar 2014
TL;DR: The proposed scheme enables virtual CPUs to be dynamically performance-asymmetric based on their hosted workloads and introduces a guest extension that manipulates the scheduling policy of an operating system in favor of the hypervisor-level scheme so that interactive performance can be further improved.
Abstract: This paper presents virtual asymmetric multiprocessor, a new scheme of virtual desktop scheduling on multi-core processors for user-interactive performance. The proposed scheme enables virtual CPUs to be dynamically performance-asymmetric based on their hosted workloads. To enhance user experience on consolidated desktops, our scheme provides interactive workloads with fast virtual CPUs, which have more computing power than those hosting background workloads in the same virtual machine. To this end, we devise a hypervisor extension that transparently classifies background tasks from potentially interactive workloads. In addition, we introduce a guest extension that manipulates the scheduling policy of an operating system in favor of our hypervisor-level scheme so that interactive performance can be further improved. Our evaluation shows that the proposed scheme significantly improves interactive performance of application launch, Web browsing, and video playback applications when CPU-intensive workloads highly disturb the interactive workloads.
••
01 Mar 2014
TL;DR: This work shows that it is possible to implement an efficient network switch for virtual machines as an unprivileged userspace component running in the host system, including the driver for the upstream network adapter; the userspace implementation compares favorably to the existing in-kernel implementation with respect to throughput and latency.
Abstract: Efficient and secure networking between virtual machines is crucial in a time where a large share of the services on the Internet and in private datacenters run in virtual machines. To achieve this efficiency, virtualization solutions, such as Qemu/KVM, move toward a monolithic system architecture in which all performance critical functionality is implemented directly in the hypervisor in privileged mode. This is an attack surface in the hypervisor that can be used from compromised VMs to take over the virtual machine host and all VMs running on it. We show that it is possible to implement an efficient network switch for virtual machines as an unprivileged userspace component running in the host system, including the driver for the upstream network adapter. Our network switch relies on functionality already present in the KVM hypervisor and requires no changes to Linux, the host operating system, and the guest. Our userspace implementation compares favorably to the existing in-kernel implementation with respect to throughput and latency. We reduced per-packet overhead by using a run-to-completion model and are able to outperform the unmodified system for VM-to-VM traffic by a large margin when packet rates are high.
••
01 Mar 2014
TL;DR: In this article, the authors show that the dynamic overhead of work-stealing is dominated by introspection of the victim's stack when a steal takes place and exploit the idea of a low overhead return barrier to reduce the dynamic overheads.
Abstract: This paper addresses the problem of efficiently supporting parallelism within a managed runtime. A popular approach for exploiting software parallelism on parallel hardware is task parallelism, where the programmer explicitly identifies potential parallelism and the runtime then schedules the work. Work-stealing is a promising scheduling strategy that a runtime may use to keep otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. Recent work identified sequential overheads of work-stealing, those that occur even when no stealing takes place, as a significant source of overhead. That work was able to reduce sequential overheads to just 15%. In this work, we turn to dynamic overheads, those that occur each time a steal takes place. We show that the dynamic overhead is dominated by introspection of the victim's stack when a steal takes place. We exploit the idea of a low overhead return barrier to reduce the dynamic overhead by approximately half, resulting in total performance improvements of as much as 20%. Because, unlike prior work, we attack the overheads directly due to stealing and therefore attack the overheads that grow as parallelism grows, we improve the scalability of work-stealing applications. This result is complementary to recent work addressing the sequential overheads of work-stealing. This work therefore substantially relieves work-stealing of the increasing pressure due to increasing intra-node hardware parallelism.
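The work-stealing strategy that this abstract builds on can be sketched as a single-threaded simulation. This is a generic illustration of work-stealing with per-worker double-ended queues, not the paper's runtime: the `Worker` class and `run` function are hypothetical, there is no real concurrency, and the return barrier that the paper uses to cut the cost of a steal is not modeled.

```python
from collections import deque

class Worker:
    """A worker with its own double-ended queue of tasks."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)

    def pop_local(self):
        """The owner takes the newest task from its own end of the queue."""
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        """A thief takes the oldest task from the opposite end of the victim's queue."""
        return victim.tasks.popleft() if victim.tasks else None

def run(workers):
    """Round-robin simulation; returns (list of (worker name, result), number of steals)."""
    results, steals = [], 0
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.pop_local()
            if task is None:  # idle worker: look for a victim with work
                for victim in workers:
                    if victim is not w:
                        task = w.steal_from(victim)
                        if task is not None:
                            steals += 1
                            break
            if task is not None:
                results.append((w.name, task()))
    return results, steals

busy, idle = Worker("busy"), Worker("idle")
for i in range(4):
    busy.push(lambda i=i: i * i)  # bind i now so each task squares its own value
results, steals = run([busy, idle])
print(results, steals)
```

Each steal here is a cheap queue operation. The abstract's point is that in a real managed runtime a steal also requires inspecting the victim's stack, and that inspection is what dominates the dynamic overhead.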
••
IBM1
TL;DR: This paper proposes two approaches to reduce the duplication of Java strings in a single Java VM (JVM) and across JVMs: sharing string objects by using a read-only memory-mapped file, and selectively unifying string objects created at runtime in the web applications.
Abstract: Increasing the memory efficiency of physical servers is a significant concern for increasing the number of virtual machines (VMs) they can host. When similar web application services run in each guest VM, many string data with the same values are created in every guest VM. These duplications of string data are redundant from the viewpoint of memory efficiency in the host OS. This paper proposes two approaches to reduce the duplication of Java strings in a single Java VM (JVM) and across JVMs. The first approach is to share string objects across JVMs by using a read-only memory-mapped file. The other approach is to selectively unify string objects created at runtime in the web applications. This paper evaluates our approach by using the Apache DayTrader and the DaCapo benchmark suite. Our prototype implementation achieved a 7% to 12% reduction in the total size of the objects allocated over the lifetime of the programs. In addition, we observed that the performance of DayTrader was maintained even under a situation of high-density guest VMs in a KVM host machine.
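The second approach in this abstract, unifying equal string objects created at runtime, can be illustrated with a small pool that maps every string value to one canonical object. This is a sketch of the general idea only: the paper targets Java strings inside JVMs, the `StringPool` class is hypothetical, and neither the selection policy for which strings to unify nor the memory-mapped file used for cross-JVM sharing is shown.

```python
class StringPool:
    """Maps each distinct string value to a single canonical object."""

    def __init__(self):
        self._pool = {}

    def unify(self, s):
        """Return the canonical object for the value of s, registering s if it is new."""
        return self._pool.setdefault(s, s)

    def __len__(self):
        return len(self._pool)

pool = StringPool()
# Two strings built at runtime with the same value.
a = "".join(["user", "-", "id"])
b = "".join(["user", "-", "id"])
canonical_a = pool.unify(a)
canonical_b = pool.unify(b)
print(canonical_a is canonical_b, len(pool))
```

Once callers keep only the canonical object, the duplicate copies become unreachable and can be reclaimed, which is where the memory saving would come from.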
••
01 Mar 2014
TL;DR: This paper proposes a layered architecture, called MuscalietJS, that splits the responsibilities of a JavaScript engine between a high-level, JavaScript-specific component and a low-level, language-agnostic .NET VM, and proposes a two-pronged approach to make up for the performance loss due to layering.
Abstract: Layered JavaScript engines, in which the JavaScript runtime is built on top of another managed runtime, provide better extensibility and portability compared to traditional monolithic engines. In this paper, we revisit the design of layered JavaScript engines and propose a layered architecture, called MuscalietJS, that splits the responsibilities of a JavaScript engine between a high-level, JavaScript-specific component and a low-level, language-agnostic .NET VM. To make up for the performance loss due to layering, we propose a two-pronged approach: high-level JavaScript optimizations and exploitation of low-level VM features that produce very efficient code for hot functions. We demonstrate the validity of the MuscalietJS design through a comprehensive evaluation using both the Sunspider benchmarks and a set of web workloads. We demonstrate that our approach outperforms other layered engines such as IronJS and Rhino while providing extensibility, adaptability and portability.
••
01 Mar 2014
TL;DR: It is argued that the field of computer systems research should foster repetition of results, independent reproduction, and rigorous evaluation, and some baby steps taken by several computer conferences are outlined.
Abstract: Computer systems research spans sub-disciplines that include embedded systems, programming languages, networking, and operating systems. In this talk my contention is that a number of structural factors inhibit quality systems research. Symptoms of the problem include unrepeatable and unreproduced results as well as results that are either devoid of meaning or that measure the wrong thing. I will illustrate the impact of these issues on our research output with examples from the development and empirical evaluation of the Schism real-time garbage collection algorithm that is shipped with the FijiVM, a Java virtual machine for embedded and mobile devices. I will argue that our field should foster repetition of results, independent reproduction, and rigorous evaluation. I will outline some baby steps taken by several computer conferences. In particular I will focus on the introduction of Artifact Evaluation Committees or AECs to ECOOP, OOPSLA, PLDI and soon POPL. The goal of the AECs is to encourage authors to package the software artifacts that they used to support the claims made in their paper and to submit these artifacts for evaluation. AECs were carefully designed to provide positive feedback to the authors that take the time to create repeatable research.