
Showing papers presented at "Virtual Execution Environments in 2010"


Proceedings ArticleDOI
17 Mar 2010
TL;DR: This paper adapts the virtual machine scheduler to be more soft-real-time friendly, improving two aspects of the VMM scheduler: managing scheduling latency as a first-class resource and managing shared caches.
Abstract: Virtualization technology enables server consolidation and has given an impetus to low-cost green data centers. However, current hypervisors do not provide adequate support for real-time applications, and this has limited the adoption of virtualization in some domains. Soft real-time applications, such as media-based ones, are impeded by components of virtualization including low-performance virtualization I/O, increased scheduling latency, and shared-cache contention. The virtual machine scheduler is central to all these issues. The goal in this paper is to adapt the virtual machine scheduler to be more soft-real-time friendly. We improve two aspects of the VMM scheduler -- managing scheduling latency as a first-class resource and managing shared caches. We use enterprise IP telephony as an illustrative soft real-time workload and design a scheduler S that incorporates the knowledge of soft real-time applications in all aspects of the scheduler to support responsiveness. For this we first define a laxity value that can be interpreted as the target scheduling latency that the workload desires. The load balancer is also designed to minimize the latency for real-time tasks. For cache management, we take cache affinity into account for real-time tasks and load-balance accordingly to prevent cache thrashing. We measured cache misses and demonstrated that cache management is essential for soft real-time tasks. Although our scheduler S employs a different design philosophy, interestingly enough it can be implemented with simple modifications to the Xen hypervisor's credit scheduler. Our experiments demonstrate that the Xen scheduler with our modifications can support soft real-time guests well, without penalizing non-real-time domains.

144 citations
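The laxity mechanism lends itself to a compact illustration. The following is a minimal sketch, not the paper's Xen credit-scheduler modification: the names (VCpu, pick_next) and the millisecond values are invented, and the real scheduler must also handle credits, cache affinity, and load balancing.

```python
# Illustrative model of laxity-aware dispatch: each soft real-time vCPU
# advertises a laxity -- the scheduling latency it can tolerate -- and the
# scheduler runs the runnable vCPU whose deadline (last run + laxity)
# expires soonest. Best-effort vCPUs sort last.
import time

class VCpu:
    def __init__(self, name, laxity_ms=None):
        self.name = name
        self.laxity_ms = laxity_ms            # None => best-effort domain
        self.last_scheduled = time.monotonic()

    def deadline(self):
        if self.laxity_ms is None:            # best-effort: no latency target
            return float("inf")
        return self.last_scheduled + self.laxity_ms / 1000.0

def pick_next(runnable):
    """Return the runnable vCPU with the earliest laxity deadline."""
    return min(runnable, key=lambda v: v.deadline())

voip = VCpu("voip-domain", laxity_ms=5)       # soft real-time guest
batch = VCpu("batch-domain")                  # non-real-time guest
assert pick_next([batch, voip]) is voip       # latency-sensitive guest wins
```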


Proceedings ArticleDOI
17 Mar 2010
TL;DR: The results indicate that asymmetry support can be implemented with low overheads, and resulting performance improvements can be significant, reaching up to 36% in the authors' experiments.
Abstract: Asymmetric multicore processors (AMP) consist of cores exposing the same instruction-set architecture (ISA) but varying in size, frequency, power consumption and performance. AMPs were shown to be more power efficient than conventional symmetric multicore processors, and it is therefore likely that future multicore systems will include cores of different types. AMPs derive their efficiency from core specialization: instruction streams can be assigned to run on the cores best suited to their demands for architectural resources. System efficiency is improved as a result. To perform effective matching of threads to cores, the thread scheduler must be asymmetry-aware; and while asymmetry-aware schedulers for operating systems are a well studied topic, asymmetry-awareness in hypervisors has not been addressed. A hypervisor must be asymmetry-aware to enable proper functioning of asymmetry-aware guest operating systems; otherwise they will be ineffective in virtual environments. Furthermore, a hypervisor must ensure that asymmetric cores are shared among multiple guests in a fair fashion or in accordance with their priorities. This work is the first to implement the simple changes to the hypervisor scheduler required to make it asymmetry-aware, and to evaluate the benefits and overheads of these mechanisms. Our evaluation was performed using the open-source Xen hypervisor on a real multicore system where asymmetry was emulated via CPU frequency scaling. We compared the asymmetry-aware hypervisor to default Xen. Our results indicate that asymmetry support can be implemented with low overheads, and the resulting performance improvements can be significant, reaching up to 36% in our experiments. Most performance improvements derive from the fact that an asymmetry-aware hypervisor ensures that the fast cores do not go idle before slow cores, and from the fact that it maps virtual cores to physical cores for asymmetry-aware guests according to the guest's expectations. Other benefits of asymmetry awareness are fairer sharing of computing resources among VMs and more stable execution times.

80 citations
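The two mechanisms the abstract credits with most of the speedup can be shown as a toy assignment policy. This is an illustrative sketch, not Xen's actual scheduler; the function and guest names are invented.

```python
# (1) Fast cores are filled before slow ones, so they never idle while
# slow cores have work; (2) a guest whose OS expects a fast vCPU gets it
# mapped to a fast physical core when one is available.

def assign(vcpus, cores):
    """vcpus: list of (vcpu_name, wants_fast); cores: list of (id, is_fast)."""
    fast = [cid for cid, is_fast in cores if is_fast]
    slow = [cid for cid, is_fast in cores if not is_fast]
    # Serve asymmetry-aware vCPUs that expect a fast core first.
    ordered = sorted(vcpus, key=lambda v: not v[1])
    mapping = {}
    for name, _ in ordered:
        pool = fast if fast else slow         # drain fast cores first
        if pool:
            mapping[name] = pool.pop(0)
    return mapping

print(assign([("aware-guest/vcpu0", True), ("legacy-guest/vcpu0", False)],
             [(0, True), (1, False)]))
# {'aware-guest/vcpu0': 0, 'legacy-guest/vcpu0': 1}
```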


Proceedings ArticleDOI
17 Mar 2010
TL;DR: This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level MREs; VMKit's performance is comparable to that of the well-established open source MREs Cacao, Apache Harmony, and Mono.
Abstract: Managed Runtime Environments (MREs), such as the JVM and the CLI, form an attractive environment for program execution, by providing portability and safety, via the use of a bytecode language and automatic memory management, as well as good performance, via just-in-time (JIT) compilation. Nevertheless, developing a fully featured MRE, including e.g. a garbage collector and JIT compiler, is a herculean task. As a result, new languages cannot easily take advantage of the benefits of MREs, and it is difficult to experiment with extensions of existing MRE-based languages. This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level MREs. We have successfully used VMKit to build two MREs: a Java Virtual Machine and a Common Language Runtime. We provide an extensive study of the lessons learned in developing this infrastructure, and assess the ease of implementing new MREs or MRE extensions and the resulting performance. In particular, it took one of the authors only one month to develop a Common Language Runtime using VMKit. VMKit furthermore has performance comparable to the well established open source MREs Cacao, Apache Harmony and Mono, and is 1.2 to 3 times slower than JikesRVM on most of the Dacapo benchmarks.

53 citations
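The division of labor VMKit proposes can be sketched as an interface. Python stands in here for VMKit's actual C++/LLVM implementation, and every name below is illustrative:

```python
# The substrate owns the expensive, reusable services (GC, JIT, threads);
# a new MRE supplies only the bytecode-specific front end.

class Substrate:
    """Shared services every MRE reuses."""
    def allocate(self, size): ...             # garbage-collected allocation
    def compile(self, ir): ...                # hand common IR to the shared JIT

class MRE:
    """What a new runtime (a JVM, a CLR, ...) must provide."""
    def __init__(self, substrate):
        self.substrate = substrate
    def load_class(self, bytecode): raise NotImplementedError
    def to_ir(self, method): raise NotImplementedError  # bytecode -> common IR

class TinyCLR(MRE):                           # hypothetical CLI front end
    def load_class(self, bytecode): return {"methods": []}
    def to_ir(self, method): return []

clr = TinyCLR(Substrate())                    # GC and JIT come for free
```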


Proceedings ArticleDOI
17 Mar 2010
TL;DR: A virtual machine monitor system called Neon is described that transparently labels derived data using byte-level "tints" and tracks these labels end to end across commodity applications, operating systems and networks to explore the viability and utility of transparent information flow tracking within conventional networked systems.
Abstract: Modern organizations face increasingly complex information management requirements. A combination of commercial needs, legal liability and regulatory imperatives has created a patchwork of mandated policies. Among these, personally identifying customer records must be carefully access-controlled, sensitive files must be encrypted on mobile computers to guard against physical theft, and intellectual property must be protected from both exposure and "poisoning." However, enforcing such policies can be quite difficult in practice since users routinely share data over networks and derive new files from these inputs--incidentally laundering any policy restrictions. In this paper, we describe a virtual machine monitor system called Neon that transparently labels derived data using byte-level "tints" and tracks these labels end to end across commodity applications, operating systems and networks. Our goal with Neon is to explore the viability and utility of transparent information flow tracking within conventional networked systems when used in the manner in which they were intended. We demonstrate that this mechanism allows the enforcement of a variety of data management policies, including data-dependent confinement, mandatory I/O encryption, and intellectual property management.

48 citations
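Byte-level tint propagation can be modeled in a few lines. This is a conceptual model of the labeling rule only, not Neon's VMM-level shadow-memory implementation; the helper names are invented.

```python
# Every byte carries a set of policy labels ("tints"); a copied byte keeps
# its tints, and a byte derived from several inputs takes their union.

def tinted(data, label):
    return [(b, {label}) for b in data]       # pair each byte with labels

def concat(a, b):
    return a + b                              # copying preserves tints

def xor(a, b):
    # Derived bytes are tinted with both inputs' label sets.
    return [(x ^ y, lx | ly) for (x, lx), (y, ly) in zip(a, b)]

secret = tinted(b"salary", "confidential")
public = tinted(b"header", "public")
packet = concat(public, xor(secret, public))  # deriving doesn't launder labels
assert any("confidential" in labels for _, labels in packet)
# An egress policy check would refuse to send any byte still tinted
# "confidential" toward an unauthorized host.
```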


Proceedings ArticleDOI
17 Mar 2010
TL;DR: Crosscut uses replay itself to transform logs into a more efficient, secure, and usable form for replay-based applications, and shows how to retarget the abstraction level of the log to enable more convenient use during replay debugging.
Abstract: Deterministic record-replay has many useful applications, ranging from fault tolerance and forensics to reproducing and diagnosing bugs. When choosing a record-replay solution, the system administrator must choose a priori how comprehensively to record the execution and at what abstraction level to record it. Unfortunately, these choices may not match well with how the recording is eventually used. A recording may contain too little information to support the end use of replay, or it may contain more sensitive information than is allowed to be shown to the end user of replay. Similarly, fixing the abstraction level at the time of recording often leads to a semantic mismatch with the end use of replay. This paper describes how to remedy these problems by adding customizable replay stages to create special-purpose logs for the end users of replay. Our system, called Crosscut, allows replay logs to be "sliced" along time and abstraction boundaries. Using this approach, users can create slices that include only the processes, applications, or components of interest, excluding parts that handle sensitive data. Users can also retarget the abstraction level of the replay log to higher-level platforms, such as Perl or Valgrind. Execution can then be augmented with additional analysis code at replay time, without disturbing the replayed components in the slice. Crosscut thus uses replay itself to transform logs into a more efficient, secure, and usable form for replay-based applications. Our current Crosscut prototype builds on VMware Workstation's record-replay capabilities, and supports a variety of different replay environments. We show how Crosscut can create slices of only the parts of the computation of interest and thereby avoid leaking sensitive information, and we show how to retarget the abstraction level of the log to enable more convenient use during replay debugging.

34 citations
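The slicing operation itself is easy to picture. Below is a hypothetical event-log format; the real Crosscut slices VMware Workstation recordings, not Python dictionaries:

```python
# Slice a replay log along time and process boundaries: keep only the
# components of interest, dropping the parts that handle sensitive data
# before the derived log reaches the end user.

log = [
    {"t": 1, "proc": "sshd",    "event": "read(fd=3)"},
    {"t": 2, "proc": "browser", "event": "recv(socket)"},
    {"t": 3, "proc": "sshd",    "event": "write(fd=4)"},
]

def slice_log(log, procs, t0, t1):
    return [e for e in log if e["proc"] in procs and t0 <= e["t"] <= t1]

print(slice_log(log, procs={"sshd"}, t0=0, t1=2))
# [{'t': 1, 'proc': 'sshd', 'event': 'read(fd=3)'}]
```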


Proceedings ArticleDOI
Lei Ye, Gen Lu, Sushanth Kumar, Chris Gniady, John H. Hartman
17 Mar 2010
TL;DR: This paper proposes three mechanisms that address the isolation between the VMM and VMs and increase the burstiness of hard disk accesses, improving the energy efficiency of the hard disk.
Abstract: Current trends toward increasing storage capacity and virtualization of resources, combined with the need for energy efficiency, pose a challenging task for system designers. Previous studies have suggested many approaches to reduce hard disk energy dissipation in native OS environments; however, those mechanisms do not perform well in virtual machine environments because a virtual machine (VM) and the virtual machine monitor (VMM) that runs it have different semantic contexts. This paper explores the disk I/O activities between the VMM and VMs using trace-driven simulation to understand the I/O behavior of the VM system. Subsequently, this paper proposes three mechanisms to address the isolation between the VMM and VMs, and to increase the burstiness of hard disk accesses to improve the energy efficiency of the hard disk. Compared to standard shutdown mechanisms, with eight VMs the proposed mechanisms reduce disk spin-ups, increase the disk sleep time, and reduce energy consumption by 14.8% with only a 0.5% increase in execution time. We implemented the proposed mechanisms in Xen and validated our simulation results.

19 citations
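Why burstiness saves energy can be seen with a toy timeout spin-down model. All constants below are invented for illustration and are not the paper's measurements:

```python
# Batching the same I/O into bursts lengthens idle gaps past the
# spin-down timeout, so the disk sleeps longer and spins up less often.

SPINDOWN_TIMEOUT = 10                # seconds idle before sleeping (assumed)
SPINUP_COST = 5.0                    # joules per spin-up (assumed)
IDLE_POWER, SLEEP_POWER = 0.8, 0.1   # watts (assumed)

def energy(access_times, horizon):
    joules, spinups, t = 0.0, 0, 0.0
    for a in list(access_times) + [horizon]:
        gap = a - t
        if gap > SPINDOWN_TIMEOUT:   # disk slept during this gap
            joules += SPINDOWN_TIMEOUT * IDLE_POWER
            joules += (gap - SPINDOWN_TIMEOUT) * SLEEP_POWER + SPINUP_COST
            spinups += 1
        else:
            joules += gap * IDLE_POWER
        t = a
    return joules, spinups

scattered = range(0, 120, 8)         # an access every 8 s: disk never sleeps
batched = [0, 1, 2, 60, 61, 62]      # the same accesses, in two bursts
print(energy(scattered, 120))        # (96.0, 0)
print(energy(batched, 120))          # (38.8, 2) -- far less energy
```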


Proceedings ArticleDOI
Rei Odaira, Kazunori Ogata, Kiyokuni Kawachiya, Tamiya Onodera, Toshio Nakatani
17 Mar 2010
TL;DR: Two novel approaches are proposed to track the allocation site of every object in Java with only a 1.0% slow-down, and the usefulness of these low-overhead trackers is demonstrated by an allocation-site-aware memory leak detector and allocation-site-based pretenuring in generational GC.
Abstract: Tracking the allocation site of every object at runtime is useful for reliable, optimized Java. To be used in production environments, the tracking must be accurate with minimal speed loss. Previous approaches suffer from performance degradation due to the additional field added to each object or track the allocation sites only probabilistically. We propose two novel approaches to track the allocation sites of every object in Java with only a 1.0% slow-down on average. Our first approach, the Allocation-Site-as-a-Hash-code (ASH) Tracker, encodes the allocation site ID of an object into the hash code field of its header by regarding the ID as part of the hash code. ASH Tracker avoids an excessive increase in hash code collisions by dynamically shrinking the bit-length of the ID as more and more objects are allocated at that site. For those Java VMs without the hash code field, our second approach, the Allocation-Site-via-a-Class-pointer (ASC) Tracker, makes the class pointer field in an object header refer to the allocation site structure of the object, which in turn points to the actual class structure. ASC Tracker mitigates the indirection overhead by constant-class-field duplication and allocation-site equality checks. While a previous approach of adding a 4-byte field caused up to 14.4% and an average 5% slowdown, both ASH and ASC Trackers incur at most a 2.0% and an average 1.0% loss. We demonstrate the usefulness of our low-overhead trackers by an allocation-site-aware memory leak detector and allocation-site-based pretenuring in generational GC. Our pretenuring achieved on average 1.8% and up to 11.8% speedups in SPECjvm2008.

19 citations
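The ASH encoding can be sketched in a few lines. The field widths and the shrink rule below are invented; the paper's tracker lives in the object header of a production JVM:

```python
# Split the hash code field into a random part and an allocation-site ID
# part; as a site allocates more objects, its ID field shrinks so that
# more random bits keep hash collisions rare.
import random

HASH_BITS = 22                                # assumed hash-field width

def ash_hash(site_id, site_alloc_count):
    id_bits = max(0, 10 - site_alloc_count.bit_length())  # shrink over time
    rnd = random.getrandbits(HASH_BITS - id_bits)
    mask = (1 << id_bits) - 1
    return (rnd << id_bits) | (site_id & mask), id_bits

h, id_bits = ash_hash(site_id=0b1011, site_alloc_count=16)
# The low id_bits of every hash code from this site identify the site.
assert h & ((1 << id_bits) - 1) == 0b1011 & ((1 << id_bits) - 1)
```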


Proceedings ArticleDOI
17 Mar 2010
TL;DR: This paper illustrates the drawbacks of the current reactive mechanism of online profiling during selective compilation and proposes a novel strategy that achieves similar performance benefits with an online profiling approach by using early determination of loop iteration bounds to predict future method hotness.
Abstract: Application profiling is a popular technique to improve program performance based on its behavior. Offline profiling, although beneficial for several applications, fails in cases where prior program runs may not be feasible, or if changes in input cause the profile to not match the behavior of the actual program run. Managed languages, like Java and C#, provide a unique opportunity to overcome the drawbacks of offline profiling by generating the profile information online during the current program run. Indeed, online profiling is extensively used in current VMs, especially during selective compilation to improve program startup performance, as well as during other feedback-directed optimizations. In this paper we illustrate the drawbacks of the current reactive mechanism of online profiling during selective compilation. Current VM profiling mechanisms are slow -- thereby delaying associated transformations -- and estimate future behavior based on the program's immediate past -- leading to potential misspeculation that limits the benefits of compilation. We show that these drawbacks produce an average performance loss of over 14.5% on our set of benchmark programs, over an ideal offline approach that accurately compiles the hot methods early. We then propose and evaluate the potential of a novel strategy to achieve similar performance benefits with an online profiling approach. Our new online profiling strategy uses early determination of loop iteration bounds to predict future method hotness. We explore and present promising results on the potential, feasibility, and other issues involved for the successful implementation of this approach.

19 citations
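The predictive idea reduces to a one-line decision. The threshold below is invented; real VMs combine invocation and back-edge counters:

```python
# Reactive profiling compiles a method only after ~N loop back-edges have
# already executed in the interpreter; the proposed predictor reads the
# loop's iteration bound as soon as it is determined and decides at
# iteration 0, so compilation can start before the work happens.

COMPILE_THRESHOLD = 10_000           # assumed "hot" back-edge count

def should_compile_early(loop_bound, back_edges_per_iter=1):
    predicted = loop_bound * back_edges_per_iter
    return predicted >= COMPILE_THRESHOLD

assert should_compile_early(loop_bound=50_000)      # queue the JIT now
assert not should_compile_early(loop_bound=100)     # stay interpreted
```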


Proceedings ArticleDOI
17 Mar 2010
TL;DR: This paper describes the design and implementation of a strict compiler-runtime interface and the XIR language, showing a significant reduction in backend complexity with XIR and an overall reduction in compiler-runtime interface complexity, while still generating code of comparable quality with only a minor impact on compilation time.
Abstract: Intense research on virtual machines has highlighted the need for flexible software architectures that allow quick evaluation of new design and implementation techniques. The interface between the compiler and runtime system is a principal factor in the flexibility of both components and is critical to enabling rapid pursuit of new optimizations and features. Although many virtual machines have demonstrated modularity for many components, significant dependencies often remain between the compiler and the runtime system components such as the object model and memory management system. This paper addresses this challenge with a carefully designed strict compiler-runtime interface and the XIR language. Instead of the compiler backend lowering object operations to machine operations using hard-wired runtime-specific logic, XIR allows the runtime system to implement this logic, simultaneously simplifying and separating the backend from runtime-system details. In this paper we describe the design and implementation of this compiler-runtime interface and the XIR language in the C1X dynamic compiler, a port of the HotSpot™ Client compiler. Our results show a significant reduction in backend complexity with XIR and an overall reduction in the compiler-runtime interface complexity while still generating comparable quality code with only minor impact on compilation time.

16 citations
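The inversion XIR performs can be shown schematically. Python stands in here for the real Java interface between C1X and the runtime, and the operation encoding is invented:

```python
# Instead of the backend hard-wiring one object model, the runtime
# supplies the lowering of each object operation to machine-level ops.

class DirectRuntime:
    """A runtime with a direct object layout: one load per getfield."""
    def lower_getfield(self, obj, offset):
        return [("load", obj, offset)]

class HandleRuntime(DirectRuntime):
    """A hypothetical handle-based runtime swaps in its own lowering
    without any change to the compiler backend."""
    def lower_getfield(self, obj, offset):
        return [("load", obj, 0), ("load", "tmp0", offset)]

def backend_lower(op, runtime):
    kind, obj, offset = op
    assert kind == "getfield"
    return runtime.lower_getfield(obj, offset)   # runtime-owned logic

print(backend_lower(("getfield", "r1", 16), DirectRuntime()))
print(backend_lower(("getfield", "r1", 16), HandleRuntime()))
```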


Proceedings ArticleDOI
Michiaki Tatsubori, Akihiko Tozawa, Toyotaro Suzumura, Scott Trent, Tamiya Onodera
17 Mar 2010
TL;DR: Results show that the acceleration of dynamic scripting language processing does matter in a realistic Web application server environment and that further improvements of dynamic compilers would provide little additional return unless other major overheads such as heavy memory copy between the language runtime and Web server frontend are reduced.
Abstract: Programmers who develop Web applications often use dynamic scripting languages such as Perl, PHP, Python, and Ruby. For general-purpose scripting language usage, interpreter-based implementations are efficient and popular, but server-side usage for Web application development implies an opportunity to significantly enhance Web server throughput. This paper summarizes a study of the optimization of PHP script processing. We developed a PHP processor, P9, by adapting an existing production-quality just-in-time (JIT) compiler for a Java virtual machine, for which optimization technologies have been well established, especially for server-side applications. This paper describes and contrasts microbenchmarks and SPECweb2005 benchmark results for a well-tuned configuration of a traditional PHP interpreter and our JIT compiler-based implementation, P9. Experimental results with the microbenchmarks show a 2.5-9.5x advantage with P9, and the SPECweb2005 measurements show about 20-30% improvements. These results show that the acceleration of dynamic scripting language processing does matter in a realistic Web application server environment. CPU usage profiling shows our simple JIT compiler introduction reduces the PHP core runtime overhead from 45% to 13% for a SPECweb2005 scenario, implying that further improvements of dynamic compilers would provide little additional return unless other major overheads such as heavy memory copy between the language runtime and Web server frontend are reduced.

15 citations
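The closing claim is Amdahl's law applied to the abstract's own profile numbers, which a two-line calculation makes concrete:

```python
# After the JIT, the PHP core runtime is only 13% of CPU time (down from
# 45%), so even a perfect compiler that eliminated that remaining 13%
# could speed up the whole scenario by at most:
upper_bound = 1 / (1 - 0.13)
print(f"{upper_bound:.2f}x")   # ~1.15x -- the memory-copy and front-end
                               # overheads now dominate, as the paper argues.
```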


Proceedings ArticleDOI
17 Mar 2010
TL;DR: This paper enumerates all the aspects involved in a path selection design, evaluates a comprehensive set of approaches for each aspect, and proposes a path selection strategy that reduces memory demands by 20% while improving performance by 5-20% compared to an industrial-strength DBT.
Abstract: Dynamic binary translators (DBTs) provide powerful platforms for building dynamic program monitoring and adaptation tools. DBTs, however, have high memory demands because they cache translated code and auxiliary code to a software code cache and must also maintain data structures to support the code cache. The high memory demands make it difficult for memory-constrained embedded systems to take advantage of DBT-based tools. Previous research on DBT memory management focused on the translated code and auxiliary code only. However, we found that data structures are comparable to the code cache in size. We show that the translated code size, auxiliary code size and the data structure size interact in a complex manner, depending on the path selection (trace selection and link formation) strategy. Therefore, holistic memory efficiency (comprising translated code, auxiliary code and data structures) cannot be improved by focusing on the code cache only. In this paper, we use path selection for improving holistic memory efficiency which in turn impacts performance in memory-constrained environments. Although there has been previous research on path selection, such research only considered performance in memory-unconstrained environments. The challenge for holistic memory efficiency is that the path selection strategy results in complex interactions between the memory demand components. Also, individual aspects of path selection and the holistic memory efficiency may impact performance in complex ways. We explore these interactions to motivate path selection targeting holistic memory demand. We enumerate all the aspects involved in a path selection design and evaluate a comprehensive set of approaches for each aspect. Finally, we propose a path selection strategy that reduces memory demands by 20% and at the same time improves performance by 5-20% compared to an industrial-strength DBT.
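The three-way interaction the paper describes can be made concrete with a crude accounting model. Every constant below is invented for illustration; the paper measures a real industrial-strength DBT:

```python
# Longer traces mean fewer trace heads, hence fewer exit stubs (auxiliary
# code) and fewer lookup/link entries (data structures), but more
# duplicated tail code -- so no single component can be tuned in isolation.

def memory_demand(num_traces, avg_trace_len, dup_factor,
                  insn_bytes=4, stub_bytes=32, map_entry_bytes=48):
    code = int(num_traces * avg_trace_len * dup_factor) * insn_bytes
    aux = num_traces * 2 * stub_bytes       # ~2 exit stubs per trace
    meta = num_traces * map_entry_bytes     # code-cache map, link records
    return {"code": code, "aux": aux, "meta": meta,
            "total": code + aux + meta}

# Same program under two path-selection strategies:
print(memory_demand(num_traces=10_000, avg_trace_len=12, dup_factor=1.1))
print(memory_demand(num_traces=4_000, avg_trace_len=30, dup_factor=1.4))
```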

Proceedings ArticleDOI
Goh Kondoh, Hideaki Komatsu
17 Mar 2010
TL;DR: This paper describes the design and implementation of a novel dynamic binary translation technique specialized for embedded systems; experiments showed that the specialized code was up to 39% faster than the unspecialized code.
Abstract: This paper describes the design and implementation of a novel dynamic binary translation technique specialized for embedded systems. Virtual platforms have been widely used to develop embedded software and dynamic binary translation is essential to boost their speed in simulations. However, unlike application simulation, the code generated for systems simulation is still slow because the simulator must replicate all of the functions of the target hardware. Embedded systems, which focus on providing one or a few functions, utilize only a small portion of the processor's features most of the time. For example, they may use a Memory Management Unit (MMU) in a processor to map physical memory to effective addresses, but they may not need paged memory support as in an OS. We can exploit this to specialize the dynamically translated code for more performance. We built a specialization framework on top of a functional simulator with a dynamic binary translator. Using the framework, we implemented three specializers for an MMU, bi-endianness, and register banks. Experiments with the EEMBC1.1 benchmark showed that the speed of the specialized code was up to 39% faster than the unspecialized code.
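The MMU specialization is the easiest of the three to sketch. This is a model of the idea with invented names, not the actual simulator framework; note the guard that falls back to the general path if the assumed configuration ever changes:

```python
# If the guest maps memory one-to-one, translated code can load directly
# and skip the page-table walk, as long as a cheap guard still holds.

class Simulator:
    def __init__(self):
        self.mmu_identity = True             # observed guest configuration
        self.memory = {}

    def translate(self, vaddr):
        return vaddr                         # full page walk would go here

    def full_mmu_load(self, vaddr):          # general (slow) path
        return self.memory.get(self.translate(vaddr), 0)

    def specialized_load(self, vaddr):       # emitted when MMU is identity
        if not self.mmu_identity:            # guard: assumption broken,
            return self.full_mmu_load(vaddr) # fall back (and retranslate)
        return self.memory.get(vaddr, 0)     # page walk elided

sim = Simulator()
sim.memory[0x1000] = 42
assert sim.specialized_load(0x1000) == sim.full_mmu_load(0x1000)
```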

Proceedings Article
17 Mar 2010
TL;DR: The accepted papers span a wide range of virtualization, broadly construed, and the program committee is confident they will make for an interesting conference.
Abstract: It is our pleasure to welcome you to the 6th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments (VEE'10). As the leading conference for presentation of research results on all aspects of virtualization, VEE brings together researchers representing a diverse set of interests. As with previous VEEs, this year's program represents an exciting mix of papers ranging from hardware virtualization to virtual machines for programming languages. In selecting papers, the program committee placed a high priority on choosing work that would be broadly interesting and applicable. This year's conference has upheld a number of other traditions: as in the previous three years, the conference was co-chaired by researchers from both the operating systems and programming languages communities. The program committee also represents a mix of research backgrounds, and features strong representation from industrial research organizations. VEE is again co-located with the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). VEE'10 had 50 submissions, 15 of which were chosen for presentation at the conference. The accepted papers span a wide range of virtualization, broadly construed, and we are confident they will make for an interesting conference.

Proceedings ArticleDOI
17 Mar 2010
TL;DR: The authors implement a working prototype, Vicover, which optimizes the core dump of a crashed virtual machine in Xen to minimize the MTTR of core dump and recovery as a whole.
Abstract: Crash dump, or core dump, is the typical way to save a memory image on system crash for future offline debugging and analysis. However, for typical server machines with likely abundant memory, the time of core dump can significantly increase the mean time to repair (MTTR) by delaying the reboot-based recovery, while not dumping the failure context for analysis would risk recurring crashes on the same problems. In this paper, we propose several optimization techniques for core dump in virtualized environments, in order to shorten the MTTR of consolidated virtual machines during crashes. First, we parallelize the process of crash dump and the process of rebooting the crashed VM, by dynamically reclaiming and allocating memory between the crashed VM and the newly spawned VM. Second, we use the virtual machine management layer to introspect the critical data structures of the crashed VM to filter out the dump of unused memory. Finally, we implement disk I/O rate control between core dump and the newly spawned VM according to a user-tuned rate control policy to balance the time of crash dump and quality of service in the recovery VM. We have implemented a working prototype, Vicover, that optimizes core dump on system crash of a virtual machine in Xen, to minimize the MTTR of core dump and recovery as a whole. In our experiment on a virtualized TPC-W server, Vicover shortens the downtime caused by crash dump by around 5X.
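Two of the three optimizations can be sketched compactly. The page-frame and rate-control details below are invented; Vicover itself works inside Xen's management layer:

```python
# (1) Introspect the crashed kernel's free lists so unused pages are never
# written; (2) throttle dump bandwidth so the respawned VM keeps most of
# the disk's I/O capacity.

def pages_to_dump(all_pfns, free_pfns):
    return [p for p in all_pfns if p not in free_pfns]

def dump(pfns, write_page, budget_pages_per_sec, throttle):
    for i, pfn in enumerate(pfns, 1):
        write_page(pfn)
        if i % budget_pages_per_sec == 0:
            throttle()                        # e.g. time.sleep(1)

used = pages_to_dump(range(1024), free_pfns=set(range(512, 1024)))
print(f"dumping {len(used)} of 1024 pages")   # half the memory is skipped
dump(used, write_page=lambda p: None,         # stand-ins for disk writes
     budget_pages_per_sec=128, throttle=lambda: None)
```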

Proceedings ArticleDOI
17 Mar 2010
TL;DR: This paper analyzes how to adapt Valgrind to a non-POSIX environment, describes the port to the Fiasco.OC microkernel, and analyzes bug classes that are indigenous to capability systems, showing how Valgrind's flexibility can be leveraged to create custom debugging tools that detect these errors.
Abstract: Not all operating systems are created equal. In contrast to traditional monolithic kernels, there is a class of systems called microkernels, more prevalent in embedded systems like cellphones, chip cards, and real-time controllers. These kernels offer an abstraction very different from the classical POSIX interface. The resulting unfamiliarity for programmers complicates development and debugging. Valgrind is a well-known debugging tool that virtualizes execution to perform dynamic binary analysis. However, it assumes it is running on a POSIX-like kernel and closely interacts with the system to control execution. In this paper we analyze how to adapt Valgrind to a non-POSIX environment and describe our port to the Fiasco.OC microkernel. Additionally, we analyze bug classes that are indigenous to capability systems and show how Valgrind's flexibility can be leveraged to create custom debugging tools detecting these errors.
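One capability-specific bug class such a custom tool could target is invoking a capability slot after it has been unmapped, the analogue of a use-after-free. The sketch below is a model in the spirit of the paper, with an invented API; the real tool is built on Valgrind's instrumentation on Fiasco.OC:

```python
# Shadow state for capability slots: map/unmap events maintain a live
# set, and any invocation through a dead slot is reported, much as
# Memcheck reports a use-after-free.

class CapTracker:
    def __init__(self):
        self.live = set()

    def on_map(self, slot):
        self.live.add(slot)

    def on_unmap(self, slot):
        self.live.discard(slot)

    def on_invoke(self, slot):
        if slot not in self.live:
            print(f"error: invocation through unmapped cap slot {slot:#x}")

t = CapTracker()
t.on_map(0x40)
t.on_unmap(0x40)
t.on_invoke(0x40)    # flagged: the slot no longer holds a capability
```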