
Showing papers in "Operating Systems Review in 2009"


Journal ArticleDOI
TL;DR: Across a range of VM workloads, post-copy improves several metrics including pages transferred, total migration time, and network overhead, and is facilitated with adaptive prepaging techniques that minimize the number of page faults across the network.
Abstract: We present the design, implementation, and evaluation of post-copy based live migration for virtual machines (VMs) across a Gigabit LAN. Post-copy migration defers the transfer of a VM's memory contents until after its processor state has been sent to the target host. This deferral is in contrast to the traditional pre-copy approach, which first copies the memory state over multiple iterations followed by a final transfer of the processor state. The post-copy strategy can provide a "win-win" by reducing total migration time while maintaining the liveness of the VM during migration. We compare post-copy extensively against the traditional pre-copy approach on the Xen Hypervisor. Using a range of VM workloads we show that post-copy improves several metrics including pages transferred, total migration time, and network overhead. We facilitate the use of post-copy with adaptive prepaging techniques to minimize the number of page faults across the network. We propose different prepaging strategies and quantitatively compare their effectiveness in reducing network-bound page faults. Finally, we eliminate the transfer of free memory pages in both pre-copy and post-copy through a dynamic self-ballooning (DSB) mechanism. DSB periodically reclaims free pages from a VM and significantly speeds up migration with negligible performance impact on VM workload.
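
To make the prepaging idea concrete, here is a minimal sketch in Python (our illustration, not the authors' code) of one possible "bubble" strategy: on each network-bound page fault, the source pushes the faulting page first and then unsent neighbours expanding outward, so pages near the fault arrive before the guest touches them.

    # Hypothetical "bubble" prepaging around a fault address.
    def bubble_prepage(fault_page, sent, num_pages, window=8):
        """Return the pages to push for one fault, nearest-first."""
        batch = []
        for dist in range(window + 1):
            for page in (fault_page - dist, fault_page + dist):
                if 0 <= page < num_pages and page not in sent and page not in batch:
                    batch.append(page)
        return batch

    # Example: a VM with 32 pages; the guest faults on pages 10 and then 11.
    sent = set()
    for fault in (10, 11):
        if fault in sent:
            continue  # already pushed by an earlier bubble: no network fault
        batch = bubble_prepage(fault, sent, num_pages=32)
        sent.update(batch)
        print(f"fault on page {fault}: push {batch}")

The second access hits a page the first bubble already pushed, which is exactly the effect prepaging aims for.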

358 citations


Journal ArticleDOI
TL;DR: Describes how fos's design is well suited to attack the scalability challenge of future multicores and discusses how traditional application-operating system interfaces can be redesigned to improve scalability.
Abstract: The next decade will afford us computer chips with 100s to 1,000s of cores on a single piece of silicon. Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. If multicore trends continue, the number of cores that an operating system will be managing will continue to double every 18 months. The traditional evolutionary approach of redesigning OS subsystems when there is insufficient parallelism will cease to work because the rate of increasing parallelism will far outpace the rate at which OS designers will be capable of redesigning subsystems. The fundamental design of operating systems and operating system data structures must be rethought to put scalability as the prime design constraint. This work begins by documenting the scalability problems of contemporary operating systems. These studies are used to motivate the design of a factored operating system (fos). fos is a new operating system targeting manycore systems with scalability as the primary design constraint, where space sharing replaces time sharing to increase scalability. We describe fos, which is built in a message-passing manner out of a collection of Internet-inspired services. Each operating system service is factored into a set of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but instead of providing high-level Internet services, these servers provide traditional kernel services and replace traditional kernel data structures in a factored, spatially distributed manner. fos replaces time sharing with space sharing. In other words, fos's servers are bound to distinct processing cores and by doing so do not fight with end-user applications for implicit resources such as TLBs and caches. We describe how fos's design is well suited to attack the scalability challenge of future multicores and discuss how traditional application-operating system interfaces can be redesigned to improve scalability.
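
As a toy illustration of the factored structure (ours, not fos code), the sketch below builds one "service" as a fleet of servers that own disjoint shards of a namespace and interact with clients only through messages; in fos each server would additionally be pinned to its own core.

    # A toy "factored" file service: a system call becomes a message to a
    # fleet of servers, with no shared data structures between them.
    from queue import Queue
    from threading import Thread

    class FileServer(Thread):
        """One member of the fleet; owns a disjoint shard of the namespace."""
        def __init__(self, inbox):
            super().__init__(daemon=True)
            self.inbox, self.files = inbox, {}
        def run(self):
            while True:
                op, name, payload, reply = self.inbox.get()
                if op == "write":
                    self.files[name] = payload
                    reply.put("ok")
                elif op == "read":
                    reply.put(self.files.get(name))

    # The namespace is space-shared across servers by hashing, not time-shared.
    fleet = [FileServer(Queue()) for _ in range(4)]
    for s in fleet:
        s.start()

    def call(op, name, payload=None):
        reply = Queue()
        fleet[hash(name) % len(fleet)].inbox.put((op, name, payload, reply))
        return reply.get()

    call("write", "/etc/motd", b"hello")
    print(call("read", "/etc/motd"))   # b'hello'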

342 citations


Journal ArticleDOI
TL;DR: This work proposes a Heterogeneity-Aware Signature-Supported scheduling algorithm that matches threads to cores using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline, and is comparatively simple and scalable.
Abstract: Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported (HASS) scheduling algorithm that does the matching using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline. The resulting algorithm does not rely on dynamic profiling, and is comparatively simple and scalable. We implemented HASS in OpenSolaris, and achieved average workload speedups of up to 13%, matching the best static assignment, achievable only by an oracle. We have also implemented a dynamic IPC-driven algorithm proposed earlier that relies on online profiling. We found that the complexity, load imbalance and associated performance degradation resulting from dynamic profiling are significant challenges to using this algorithm successfully. As a result it failed to deliver the expected performance gains and to outperform HASS.
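
The flavor of signature-driven assignment can be shown in a few lines (a hypothetical sketch with made-up benchmark numbers, not the HASS implementation): each thread's offline signature, reduced here to last-level cache misses per 1K instructions, yields an estimated benefit from a fast core, and the fast cores go to the threads that would benefit most.

    # Illustrative signature-based matching of threads to fast/slow cores.
    def estimated_gain(mpki, fast_ghz=3.0, slow_ghz=1.5):
        # Memory-bound threads (high MPKI) scale poorly with core frequency.
        memory_fraction = min(mpki / 20.0, 1.0)
        cpu_fraction = 1.0 - memory_fraction
        return cpu_fraction * (fast_ghz / slow_ghz - 1.0)

    signatures = {"gcc": 2.1, "mcf": 18.7, "bzip2": 4.0, "lbm": 15.2}
    fast_cores = 2  # remaining threads run on slow cores

    ranked = sorted(signatures, key=lambda t: estimated_gain(signatures[t]),
                    reverse=True)
    assignment = {t: ("fast" if i < fast_cores else "slow")
                  for i, t in enumerate(ranked)}
    print(assignment)  # CPU-bound threads land on the fast cores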

256 citations


Journal ArticleDOI
TL;DR: COTSon opens up a new dimension in the speed/accuracy space, allowing simulation of a cluster of nodes several orders of magnitude faster with minimal accuracy loss, abandoning "always-on" cycle-based simulation in favor of statistical sampling approaches that can trade accuracy for speed.
Abstract: Simulation has historically been the primary technique used for evaluating the performance of new proposals in computer architecture. Speed and complexity considerations have traditionally limited its applicability to single-thread processors running application-level code. This is no longer sufficient to model modern multicore systems running the complex workloads of commercial interest today. COTSon is a simulator framework jointly developed by HP Labs and AMD. The goal of COTSon is to provide fast and accurate evaluation of current and future computing systems, covering the full software stack and complete hardware models. It targets cluster-level systems composed of hundreds of commodity multicore nodes and their associated devices connected through a standard communication network. COTSon adopts a functional-directed philosophy, where fast functional emulators and timing models cooperate to improve the simulation accuracy at a speed sufficient to simulate the full stack of applications, middleware and OSs. This paper describes the changes in simulation philosophy we embraced in COTSon to address these new challenges. We base functional emulation on established, fast and validated tools that support commodity OSs and complex multitier applications. Through a robust interface between the functional and timing domain, we can leverage other existing simulators for individual sub-components, such as disks or networks. We abandon the idea of "always-on" cycle-based simulation in favor of statistical sampling approaches that can trade accuracy for speed. COTSon opens up a new dimension in the speed/accuracy space, allowing simulation of a cluster of nodes several orders of magnitude faster with a minimal accuracy loss.
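
The sampling philosophy can be sketched as follows (an illustration under our own toy assumptions, not COTSon code): run mostly in a fast functional mode, periodically drop into a detailed timing model for a short sample, and extrapolate the sampled CPI over the fast-forwarded instructions.

    # Toy statistical-sampling simulation loop.
    import random

    def detailed_sample(n_instr):
        """Stand-in for the cycle-accurate model: returns simulated cycles."""
        return n_instr * random.uniform(0.9, 1.4)   # pretend CPI in [0.9, 1.4]

    total_instr = 1_000_000_000
    chunk, sample_len = 50_000_000, 1_000_000

    cycles, done = 0.0, 0
    while done < total_instr:
        done += chunk                           # functional mode: no timing
        sampled = detailed_sample(sample_len)   # short detailed sample
        done += sample_len
        # The sample's CPI stands in for the whole fast-forwarded chunk.
        cycles += sampled + sampled / sample_len * chunk

    print(f"estimated CPI = {cycles / done:.2f}")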

206 citations


Journal ArticleDOI
Micah Dowty1, Jeremy Sugerman1
TL;DR: This paper describes in detail the specific GPU virtualization architecture developed for VMware's hosted products (VMware Workstation and VMware Fusion) and finds that taking advantage of hardware acceleration significantly closes the gap between pure emulation and native, but that different implementations and host graphics stacks show distinct variation.
Abstract: Modern graphics co-processors (GPUs) can produce high fidelity images several orders of magnitude faster than general purpose CPUs, and this performance expectation is rapidly becoming ubiquitous in personal computers. Despite this, GPU virtualization is a nascent field of research. This paper introduces a taxonomy of strategies for GPU virtualization and describes in detail the specific GPU virtualization architecture developed for VMware's hosted products (VMware Workstation and VMware Fusion). We analyze the performance of our GPU virtualization with a combination of applications and microbenchmarks. We also compare against software rendering, the GPU virtualization in Parallels Desktop 3.0, and the native GPU. We find that taking advantage of hardware acceleration significantly closes the gap between pure emulation and native, but that different implementations and host graphics stacks show distinct variation. The microbenchmarks show that our architecture amplifies the overheads in the traditional graphics API bottlenecks: draw calls, downloading buffers, and batch sizes. Our virtual GPU architecture runs modern graphics-intensive games and applications at interactive frame rates while preserving virtual machine portability. The applications we tested achieve from 86% to 12% of native rates and 43 to 18 frames per second with VMware Fusion 2.0.

179 citations


Journal ArticleDOI
TL;DR: The rationale for the design of the SmartFrog framework, details of its design, plus a description of the further research that is in progress are covered.
Abstract: SmartFrog is a framework for creating configuration-driven systems. It has been designed with the express purpose of making the design, deployment and management of distributed component-based systems simpler and more robust. Over the last decade it has been the focus for ongoing research into aspects of configuration management and large-scale distributed systems, providing a platform for experimentation. The paper covers the rationale for the design of the framework, details of its design, plus a description of the further research that is in progress.

106 citations


Journal ArticleDOI
TL;DR: An overview of the joint research at HP Labs and University of Michigan in the past few years, where control theory was applied to automated resource and service level management in data centers, and the key benefits of a control-theoretic approach for systems research are highlighted.
Abstract: Feedback mechanisms can help today's increasingly complex computer systems adapt to changes in workloads or operating conditions. Control theory offers a principled way for designing feedback loops to deal with unpredictable changes, uncertainties, and disturbances in systems. We provide an overview of the joint research at HP Labs and University of Michigan in the past few years, where control theory was applied to automated resource and service level management in data centers. We highlight the key benefits of a control-theoretic approach for systems research, and present specific examples from our experience of designing adaptive resource control systems where this approach worked well. In addition, we outline the main limitations of this approach, and discuss the lessons learned from our experience.
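
A toy example of the pattern (our own sketch, with a made-up plant model): an integral controller adjusts a VM's CPU share so that measured response time tracks a target, with no precise model of the workload required.

    # Minimal feedback loop in the control-theoretic style described above.
    def plant(cpu_share, load=1.0):
        """Hypothetical system: response time falls as the CPU share grows."""
        return load / max(cpu_share, 0.05)

    target = 2.0          # desired response time (seconds)
    share, gain = 0.2, 0.02
    for step in range(25):
        measured = plant(share)
        error = measured - target
        share = min(1.0, max(0.05, share + gain * error))  # integral action
        if step % 5 == 0:
            print(f"step {step:2d}: share={share:.2f} resp={measured:.2f}s")

The controller converges toward the share that meets the target even though it never knows the plant's equation, which is the robustness argument the authors make for feedback over open-loop tuning.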

106 citations


Journal ArticleDOI
TL;DR: MEmory Balancer (MEB) is introduced, which dynamically monitors the memory usage of each virtual machine, accurately predicts its memory needs, and periodically reallocates host memory; with MEB, overall system throughput can be significantly improved.
Abstract: Virtualization essentially enables multiple operating systems and applications to run on one physical computer by multiplexing hardware resources. A key motivation for applying virtualization is to improve hardware resource utilization while maintaining reasonable quality of service. However, such a goal cannot be achieved without efficient resource management. Though most physical resources, such as processor cores and I/O devices, are shared among virtual machines using time slicing and can be scheduled flexibly based on priority, allocating an appropriate amount of main memory to virtual machines is more challenging. Different applications have different memory requirements. Even a single application shows varied working set sizes during its execution. An optimal memory management strategy under a virtualized environment thus needs to dynamically adjust memory allocation for each virtual machine, which further requires a prediction model that forecasts its host physical memory needs on the fly. This paper introduces MEmory Balancer (MEB) which dynamically monitors the memory usage of each virtual machine, accurately predicts its memory needs, and periodically reallocates host memory. MEB uses two effective memory predictors which, respectively, estimate the amount of memory available for reclaiming without a notable performance drop, and additional memory required for reducing the virtual machine paging penalty. Our experimental results show that our prediction schemes yield high accuracy and low overhead. Furthermore, the overall system throughput can be significantly improved with MEB.
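
A simplified sketch of the balancing step (our assumptions, not MEB's code): each VM's predictors report how much memory it could give up without a notable slowdown and how much extra it needs to stop paging, and the balancer moves pages from donors to takers through their balloon targets.

    # Toy memory rebalancing across VMs (all sizes in MB).
    vms = {  # current allocation, reclaimable surplus, additional need
        "vm1": {"alloc": 2048, "reclaimable": 600, "need": 0},
        "vm2": {"alloc": 1024, "reclaimable": 0,   "need": 400},
        "vm3": {"alloc": 1536, "reclaimable": 300, "need": 0},
    }

    def rebalance(vms):
        pool = sum(v["reclaimable"] for v in vms.values())
        for v in vms.values():                   # inflate donors' balloons
            v["alloc"] -= v["reclaimable"]
        total_need = sum(v["need"] for v in vms.values()) or 1
        for v in vms.values():                   # deflate takers' balloons
            v["alloc"] += min(v["need"], pool * v["need"] // total_need)

    rebalance(vms)
    print({name: v["alloc"] for name, v in vms.items()})
    # vm2 grows by its predicted need; vm1/vm3 shrink by their surplus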

104 citations


Journal ArticleDOI
TL;DR: This paper shows how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system.
Abstract: Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.
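
Case (1) can be illustrated with a greedy marginal-gain allocator (hypothetical miss-rate curves; in the paper the curves come from the hardware monitoring units): give cache ways one at a time to whichever application's curve predicts the largest drop in misses from one more way.

    # Cache partition sizing from per-application miss-rate curves.
    miss_curve = {   # misses per 1K instructions vs. allocated ways (0..8)
        "appA": [40, 22, 15, 12, 11, 10, 10, 10, 10],
        "appB": [30, 28, 26, 24, 20, 14, 9, 6, 5],
    }
    total_ways = 8
    ways = {app: 0 for app in miss_curve}

    for _ in range(total_ways):
        # Marginal benefit of one extra way for each application.
        best = max(miss_curve, key=lambda a: miss_curve[a][ways[a]]
                                             - miss_curve[a][ways[a] + 1])
        ways[best] += 1

    print(ways)  # {'appA': 3, 'appB': 5} given the curves above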

82 citations


Journal ArticleDOI
TL;DR: This paper explores the design space for the configuration and software stack of hybrid clusters of AMPs and GPPs through an implementation of the popular MapReduce programming model, and finds that in a cluster with resource-constrained and well-provisioned AMP accelerators, a streaming approach achieves 50.5% and 73.1% better performance, respectively, than the non-streaming approach.
Abstract: Asymmetric multi-core processors (AMPs), with general-purpose and specialized cores packaged on the same chip, are emerging as a leading paradigm for high-end computing. A large body of existing research explores the use of standalone AMPs in computationally challenging and data-intensive applications. AMPs are rapidly being deployed as high-performance accelerators on clusters. In these settings, scheduling, communication and I/O are managed by general-purpose processors (GPPs), while computation is off-loaded to AMPs. Design space exploration for the configuration and software stack of hybrid clusters of AMPs and GPPs is an open problem. In this paper, we explore this design space in an implementation of the popular MapReduce programming model. Our contributions are: an exploration of various design alternatives for hybrid asymmetric clusters of AMPs and GPPs; the adoption of a streaming approach to supporting MapReduce computations on clusters with asymmetric components; and adaptive schedulers that take into account individual component capabilities in asymmetric clusters. Throughout our design, we remove I/O bottlenecks using double-buffering and asynchronous I/O. We present an evaluation of the design choices through experiments on a real cluster with MapReduce workloads of varying degrees of computation intensity. We find that in a cluster with resource-constrained and well-provisioned AMP accelerators, a streaming approach achieves 50.5% and 73.1% better performance than the non-streaming approach, respectively, and scales almost linearly with an increasing number of compute nodes. We also show that our dynamic scheduling mechanisms effectively adapt the parameters of the scheduling policies between applications with different computation density.
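
The double-buffering idea is easy to show in miniature (our sketch, not the paper's system): a prefetch thread keeps the next input chunk in flight while the worker runs the map function over the current one, so computation overlaps I/O.

    # Toy double-buffered streaming map over input chunks.
    import threading, queue

    def read_chunks(chunks, out_q):
        for chunk in chunks:               # stand-in for asynchronous reads
            out_q.put(chunk)
        out_q.put(None)                    # end-of-stream marker

    def map_words(chunk):
        counts = {}
        for word in chunk.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    chunks = ["a b a", "b b c", "c a"]
    q = queue.Queue(maxsize=2)             # two buffers in flight
    threading.Thread(target=read_chunks, args=(chunks, q), daemon=True).start()

    totals = {}
    while (chunk := q.get()) is not None:  # compute overlaps with prefetch
        for word, n in map_words(chunk).items():
            totals[word] = totals.get(word, 0) + n
    print(totals)  # {'a': 3, 'b': 3, 'c': 2}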

67 citations


Journal ArticleDOI
TL;DR: FlexDCP is a framework that allows the Operating System to guarantee a QoS for each application running in a chip multiprocessor and offers more flexibility to the OS, as it can optimize different QoS metrics like per-application performance or global performance metrics such as fairness, weighted speedup or throughput.
Abstract: Current multicore architectures offer high throughput by increasing hardware resource utilization. As the number of cores in a multicore system increases, providing Quality of Service (QoS) to applications in addition to throughput is becoming an important problem. In this work, we present FlexDCP, a framework that allows the Operating System (OS) to guarantee a QoS for each application running in a chip multiprocessor. FlexDCP directly estimates the performance of applications for different cache configurations instead of using indirect measures of performance like the number of misses. This information allows the OS to convert QoS requirements into resource assignments. Consequently, it offers more flexibility to the OS, as it can optimize different QoS metrics like per-application performance or global performance metrics such as fairness, weighted speedup or throughput. Our results show that FlexDCP is able to force applications in a workload to run at a certain percentage of their maximum performance in 94% of the cases considered, falling on average 1.48% short of the objective in the remaining cases. When optimizing a global QoS metric like fairness, FlexDCP consistently outperforms traditional eviction policies like LRU, pseudo-LRU and previous dynamic cache partitioning proposals for two-, four- and eight-core configurations. In an eight-core architecture FlexDCP obtains a fairness improvement of 10.1% over Fair, the best policy in the literature optimizing fairness.
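
A toy of the FlexDCP decision step (illustrative numbers only): given direct per-partition performance estimates, the OS simply picks the partition that optimizes the chosen QoS metric, here fairness measured as the worst relative performance across applications.

    # Choosing a cache partition from direct performance estimates.
    ipc_estimate = {  # predicted IPC of (appA, appB) per (A ways, B ways)
        (2, 6): (0.8, 1.9), (4, 4): (1.1, 1.5), (6, 2): (1.4, 0.9),
    }
    ipc_alone = {"appA": 1.6, "appB": 2.0}   # baseline with the whole cache

    def fairness(part):
        a, b = ipc_estimate[part]
        return min(a / ipc_alone["appA"], b / ipc_alone["appB"])

    best = max(ipc_estimate, key=fairness)
    print(best, round(fairness(best), 2))    # (4, 4) 0.69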

Journal ArticleDOI
TL;DR: In this article, the authors describe a lightweight software mechanism for migrating virtual machines with direct hardware access, based on shadow drivers, an agent in the guest OS kernel that efficiently captures and restores the state of a device driver.
Abstract: Virtual machine migration greatly aids management by allowing flexible provisioning of resources and decommissioning of hardware for maintenance. However, efforts to improve network performance by granting virtual machines direct access to hardware currently prevent migration. This occurs because (1) the VMM cannot migrate the state of the device, and (2) the source and destination machines may have different network devices, requiring different drivers to run in the migrated virtual machine. In this paper, we describe a lightweight software mechanism for migrating virtual machines with direct hardware access. We base our solution on shadow drivers, an agent in the guest OS kernel that efficiently captures and restores the state of a device driver. On the source machine, the shadow driver monitors the state of the driver and device. After migration, the shadow driver uses this information to configure a driver for the corresponding device on the destination machine. We implement shadow driver migration for Linux network drivers running on the Xen hypervisor. Shadow driver migration requires a migration downtime similar to the driver initialization time, short enough to avoid disrupting active TCP connections. We find that the performance overhead, compared to direct hardware access, is negligible and is much lower than using a virtual NIC.
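
A schematic sketch of the shadow-driver idea (ours, not the authors' Linux/Xen implementation): while the VM runs, the shadow logs the driver's externally visible configuration; after migration it replays that log onto whatever driver the destination NIC requires.

    # Capture-and-replay shadow driver, in miniature.
    class ShadowDriver:
        def __init__(self):
            self.log = []                 # captured configuration state
        def intercept(self, call, *args):
            self.log.append((call, args))  # e.g. set_mac, up, add_multicast
        def restore(self, new_driver):
            for call, args in self.log:   # replay onto the destination driver
                getattr(new_driver, call)(*args)

    class E1000Driver:                    # destination device's driver (toy)
        def set_mac(self, mac): print("mac <-", mac)
        def up(self): print("interface up")

    shadow = ShadowDriver()
    shadow.intercept("set_mac", "52:54:00:12:34:56")  # recorded on the source
    shadow.intercept("up")
    shadow.restore(E1000Driver())         # replayed on the destination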

Journal ArticleDOI
TL;DR: This work has developed very efficient technology to detect approximate duplication of large directory hierarchies, which can be caused by unnecessary mirroring of repositories by uncoordinated employees or departments.
Abstract: In order to catch and reduce waste in the exponentially increasing demand for disk storage, we have developed very efficient technology to detect approximate duplication of large directory hierarchies. Such duplication can be caused, for example, by unnecessary mirroring of repositories by uncoordinated employees or departments. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level. For example, one could coordinate and consolidate multiple copies in one location.
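
One simple way to realize this, sketched below under our own assumptions (the paper's actual technique is more refined), is to fingerprint every file by a content hash and score a pair of trees by the Jaccard similarity of their fingerprint sets.

    # Approximate duplicate detection between two directory hierarchies.
    import hashlib, os

    def tree_fingerprints(root):
        prints = set()
        for dirpath, _, files in os.walk(root):
            for name in files:
                with open(os.path.join(dirpath, name), "rb") as f:
                    prints.add(hashlib.sha1(f.read()).hexdigest())
        return prints

    def similarity(a, b):
        fa, fb = tree_fingerprints(a), tree_fingerprints(b)
        return len(fa & fb) / max(len(fa | fb), 1)

    # Trees with similarity near 1.0 are candidates for consolidation, e.g.:
    # print(similarity("/repos/copy1", "/repos/copy2"))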

Journal ArticleDOI
TL;DR: This paper introduces DataSeries, an on-disk format, run-time library and set of tools for storing and analyzing structured serial data, and identifies six key properties of a system to store and analyze this type of data.
Abstract: Structured serial data is used in many scientific fields; such data sets consist of a series of records, and are typically written once, read many times, chronologically ordered, and read sequentially. In this paper we introduce DataSeries, an on-disk format, run-time library and set of tools for storing and analyzing structured serial data. We identify six key properties of a system to store and analyze this type of data, and describe how DataSeries was designed to provide these properties. We quantify the benefits of DataSeries through several experiments. In particular, we demonstrate that DataSeries exceeds the performance of common trace formats by at least a factor of two.
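
To make "structured serial data" concrete, here is a toy fixed-schema record file in the write-once, read-sequentially style DataSeries targets (an illustration, not the DataSeries format itself).

    # Fixed-schema binary records: timestamp, value, flags.
    import struct

    RECORD = struct.Struct("<qdi")

    def write_series(path, records):
        with open(path, "wb") as f:
            for rec in records:
                f.write(RECORD.pack(*rec))

    def read_series(path):
        with open(path, "rb") as f:
            while chunk := f.read(RECORD.size):
                yield RECORD.unpack(chunk)

    write_series("trace.bin", [(1, 0.5, 0), (2, 0.7, 1)])
    print(list(read_series("trace.bin")))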

Journal ArticleDOI
TL;DR: This work addresses the software costs of switching threads between cores in a multicore processor, and describes the implementation of core switching in the Linux kernel, as well as software changes that can decrease switching costs.
Abstract: We address the software costs of switching threads between cores in a multicore processor. Fast core switching enables a variety of potential improvements, such as thread migration for thermal management, fine-grained load balancing, and exploiting asymmetric multicores, where performance asymmetry creates opportunities for more efficient resource utilization. Successful exploitation of these opportunities demands low core-switching costs. We describe our implementation of core switching in the Linux kernel, as well as software changes that can decrease switching costs. We use detailed simulations to evaluate several alternative implementations. We also explore how some simple architectural variations can reduce switching costs. We evaluate system efficiency using both real (but symmetric) hardware, and simulated asymmetric hardware, using both microbenchmarks and realistic applications.
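
A user-level taste of the cost being studied (Linux-only, using the kernel's affinity interface rather than the authors' in-kernel mechanism): bounce the current thread between two cores and time each forced migration.

    # Rough microbenchmark of thread migration between cores.
    import os, time

    def migrate_cost(core_a=0, core_b=1, rounds=100):
        t0 = time.perf_counter_ns()
        for i in range(rounds):
            os.sched_setaffinity(0, {core_a if i % 2 else core_b})
        return (time.perf_counter_ns() - t0) / rounds

    if hasattr(os, "sched_setaffinity") and (os.cpu_count() or 1) >= 2:
        print(f"~{migrate_cost():.0f} ns per forced migration (very rough)")

This measures only the user-visible system-call path; the cache and TLB refill costs the paper simulates are on top of this.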

Journal ArticleDOI
TL;DR: Describes the VPIO model, presents preliminary results from using it to support two commodity network cards within the Palacios VMM, and argues that an appropriate model for an I/O device could be produced by the hardware vendor as part of the design, implementation, and testing process.
Abstract: A commodity I/O device has no support for virtualization. A VMM can assign such a device to a single guest with direct, fast, but insecure access by the guest's native device driver. Alternatively, the VMM can build virtual devices on top of the physical device, allowing it to be multiplexed across VMs, but with lower performance. We propose a technique that provides an intermediate option. In virtual passthrough I/O (VPIO), the guest interacts directly with the physical device most of the time, achieving high performance, as in passthrough I/O. Additionally, the guest/device interactions drive a model that in turn identifies (1) when the physical device can be handed off to another VM, and (2) if the guest programs the device to behave illegitimately. In this paper, we describe the VPIO model, and present preliminary results in using it to support two commodity network cards within the Palacios VMM we are building. We believe that an appropriate model for an I/O device could be produced by the hardware vendor as part of the design, implementation, and testing process.

Journal ArticleDOI
TL;DR: Describes an architecture for reducing and containing the privileged code of the Xen Hypervisor, and a Trusted Virtual Platform architecture aimed at supporting the strong enforcement of integrity and security policy controls over a virtual entity.
Abstract: This paper introduces our work around combining machine virtualization technology with Trusted Computing Group technology. We first describe our architecture for reducing and containing the privileged code of the Xen Hypervisor. Secondly, we describe our Trusted Virtual Platform architecture. This is aimed at supporting the strong enforcement of integrity and security policy controls over a virtual entity, where a virtual entity can be either a full guest operating system or a virtual appliance running on a virtualized platform. The architecture includes a virtualization-specific integrity measurement and reporting framework, designed to reflect all the dependencies of the virtual environment of a guest operating system. The work is a core enabling component of our research around converged devices -- client platforms such as notebooks or desktop PCs that can safely host multiple virtual operating systems and virtual appliances concurrently and report accurately on the trustworthiness of the individually executing entities.
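
The measurement framework's core operation is the classic extend step, in which each loaded component is hashed into a register so the history cannot be silently rewritten; a minimal sketch of that pattern (the TPM PCR idiom; details here are illustrative):

    # Hash-chained integrity measurements, as in a TPM PCR extend.
    import hashlib

    def extend(register, measurement):
        """new register = H(old register || H(component))"""
        m = hashlib.sha256(measurement).digest()
        return hashlib.sha256(register + m).digest()

    pcr = bytes(32)                        # register starts at zero
    for component in (b"hypervisor", b"guest kernel", b"virtual appliance"):
        pcr = extend(pcr, component)
    print(pcr.hex())  # attests to exactly this sequence of loaded components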

Journal ArticleDOI
TL;DR: A model of computer systems research is developed to help prospective authors understand the often obscure workings of conference program committees, and it is argued that paper merit is likely to be Zipf-distributed, making it inherently difficult for program committees to distinguish between most papers.
Abstract: This paper develops a model of computer systems research to help prospective authors understand the often obscure workings of conference program committees. We present data to show that the variability between reviewers is often the dominant factor as to whether a paper is accepted. We argue that paper merit is likely to be Zipf-distributed, making it inherently difficult for program committees to distinguish between most papers. We use game theory to show that with noisy reviews and Zipf-distributed merit, authors have an incentive to submit papers too early and too often. These factors make conference reviewing, and systems research as a whole, less efficient and less effective. We describe some recent changes in conference design to address these issues, and we suggest some further potential improvements.
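
A small Monte Carlo rendering of the argument (our own toy parameters): if merit is Zipf-distributed and each review adds Gaussian noise, papers near the acceptance threshold are decided almost at random while the very best are safe.

    # Simulating noisy reviews over Zipf-distributed merit.
    import random

    papers, accept = 200, 40
    merit = [1.0 / rank for rank in range(1, papers + 1)]   # Zipf merit

    def review(m, noise=0.3):
        return m + random.gauss(0, noise)

    trials, accepted = 1000, [0] * papers
    for _ in range(trials):
        order = sorted(range(papers), key=lambda i: -review(merit[i]))
        for i in order[:accept]:
            accepted[i] += 1

    # Top papers are always in; threshold papers are near coin flips.
    for i in (0, accept - 1, accept, 2 * accept):
        print(f"paper ranked {i+1:3d} by merit: accepted {accepted[i]/trials:.0%}")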

Journal ArticleDOI
TL;DR: It is argued that hardware should take a more active role in the management of its computation resources, and hardware techniques to virtualize the cores of a multicore processor are proposed, allowing hardware to flexibly reassign the virtual processors that are exposed to a single operating system to any subset of the physical cores.
Abstract: As the computing industry enters the multicore era, exponential growth in the number of transistors on a chip continues to present challenges and opportunities for computer architects and system designers. We examine one emerging issue in particular: that of dynamic heterogeneity, which can arise, even among physically homogeneous cores, from changing reliability, power, or thermal conditions, different cache and TLB contents, or changing resource configurations. This heterogeneity results in a constantly varying pool of hardware resources, which greatly complicates software's traditional task of assigning computation to cores. In part to address dynamic heterogeneity, we argue that hardware should take a more active role in the management of its computation resources. We propose hardware techniques to virtualize the cores of a multicore processor, allowing hardware to flexibly reassign the virtual processors that are exposed, even to a single operating system, to any subset of the physical cores. We show that multicore virtualization operates with minimal overhead, and that it enables several novel resource management applications for improving both performance and reliability.

Journal ArticleDOI
TL;DR: It is proposed that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance.
Abstract: Single-ISA heterogeneous multicore architectures promise to deliver plenty of cores with varying complexity, speed and performance in the near future. Virtualization enables multiple operating systems to run concurrently as distinct, independent guest domains, thereby reducing core idle time and maximizing throughput. This paper seeks to identify a heuristic that can aid in intelligently scheduling these virtualized workloads to maximize performance while reducing power consumption. We propose that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance. We test and validate our hypothesis and further propose a metric, the Combined Usage of a domain, to assist in future energy-efficient scheduling. Our preliminary findings show that the Combined Usage metric can be used as a starting point to gauge the sensitivity of a guest domain to variations in the controlling domain's frequency.

Journal ArticleDOI
TL;DR: First, the ability of workload management algorithms to handle workloads that include unexpectedly long-running queries is evaluated; second, a new and more accurate method for predicting the resource usage of queries before runtime is described.
Abstract: We explore how to manage database workloads that contain a mixture of OLTP-like queries that run for milliseconds as well as business intelligence queries and maintenance tasks that last for hours. As data warehouses grow in size to petabytes and complex analytic queries play a greater role in day-to-day business operations, factors such as inaccurate cardinality estimates, data skew, and resource contention all make it notoriously difficult to predict how such queries will behave before they start executing. However, traditional workload management assumes that accurate expectations for the resource requirements and performance characteristics of a workload are available at compile-time, and relies on such information in order to make critical workload management decisions. In this paper, we describe our approach to dealing with inaccurate predictions. First, we evaluate the ability of workload management algorithms to handle workloads that include unexpectedly long-running queries. Second, we describe a new and more accurate method for predicting the resource usage of queries before runtime. We have carried out an extensive set of experiments, and report on a few of our results.

Journal ArticleDOI
TL;DR: Memory Buddies, a memory sharing-aware placement system for virtual machines, includes a memory fingerprinting system to efficiently determine the sharing potential among a set of VMs and compute more efficient placements, and makes use of live migration to optimize VM placement as workloads change.
Abstract: Many data center virtualization solutions, such as VMware ESX, employ content-based page sharing to consolidate the resources of multiple servers. Page sharing identifies virtual machine memory pages with identical content and consolidates them into a single shared page. This technique, implemented at the host level, applies only between VMs placed on a given physical host. In a multiserver data center, opportunities for sharing may be lost because the VMs holding identical pages are resident on different hosts. In order to obtain the full benefit of content-based page sharing it is necessary to place virtual machines such that VMs with similar memory content are located on the same hosts. In this paper we present Memory Buddies, a memory sharing-aware placement system for virtual machines. This system includes a memory fingerprinting system to efficiently determine the sharing potential among a set of VMs, and compute more efficient placements. In addition it makes use of live migration to optimize VM placement as workloads change. We have implemented a prototype Memory Buddies system with VMware ESX Server and present experimental results on our testbed, as well as an analysis of an extensive memory trace study. Evaluation of our prototype using a mix of enterprise and e-commerce applications demonstrates an increase of data center capacity (i.e. number of VMs supported) of 17%, while imposing low overhead and scaling to as many as a thousand servers.
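
A sketch of the fingerprinting step (illustrative, not the Memory Buddies code): hash each VM's pages, estimate pairwise sharing as the overlap of two VMs' hash sets, and co-locate the pair with the most pages in common.

    # Page-hash fingerprints and a sharing-aware placement pick.
    import hashlib
    from itertools import combinations

    def fingerprint(pages):
        return {hashlib.sha1(p).digest() for p in pages}

    vm_pages = {
        "vmA": [b"zero" * 1024, b"libc page", b"app A data"],
        "vmB": [b"zero" * 1024, b"libc page", b"app B data"],
        "vmC": [b"app C data1", b"app C data2", b"app C data3"],
    }
    prints = {vm: fingerprint(p) for vm, p in vm_pages.items()}

    best = max(combinations(prints, 2),
               key=lambda pair: len(prints[pair[0]] & prints[pair[1]]))
    print("co-locate", best)   # ('vmA', 'vmB'): two shareable pages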

Journal ArticleDOI
John R. Douceur1
TL;DR: In this article, the author proposes an alternative approach, based on rankings rather than ratings, to evaluating submitted conference papers within the computer-science community.
Abstract: Within the computer-science community, submitted conference papers are typically evaluated by means of rating, in two respects: First, individual reviewers are asked to provide their evaluations of papers by assigning a rating to each paper's overall quality. Second, program committees collectively rate each paper as being either worthy or unworthy of acceptance, according to the aggregate judgment of the committee members. This paper proposes an alternative approach to these two processes, based on rankings rather than ratings. It also presents experiences from employing rankings in PC discussions of a major CS conference.
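
One concrete aggregation rule for rankings (our example; the paper discusses the approach in general) is the Borda count, where a paper earns a point for every paper a reviewer ranked below it.

    # Aggregating per-reviewer rankings with a Borda count.
    def borda(rankings):
        scores = {}
        for ranking in rankings:           # each ranking is best-to-worst
            n = len(ranking)
            for pos, paper in enumerate(ranking):
                scores[paper] = scores.get(paper, 0) + (n - 1 - pos)
        return sorted(scores, key=scores.get, reverse=True)

    reviewer_rankings = [
        ["P3", "P1", "P2", "P4"],
        ["P1", "P3", "P4", "P2"],
        ["P3", "P4", "P1", "P2"],
    ]
    print(borda(reviewer_rankings))   # ['P3', 'P1', 'P4', 'P2']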

Journal ArticleDOI
TL;DR: The experimental results show that auto-tuning is promising and worth integrating into operating systems and compilers, so that manycore applications can be tuned more effectively at run-time.
Abstract: Due to stagnating processor clock rates, parallelism will be the source for future performance improvements. Despite the growing complexity, users now demand better performance of general-purpose parallel software without sacrificing portability and maintainability. This is difficult to achieve, and a coordinated approach is needed for the entire software hierarchy, including operating systems, compilers, and applications. In particular, performance optimization becomes more difficult because of the growing number of targets to optimize for. With mass markets for multicore systems, the diversity of multicore architectures has increased as well. Their characteristics often differ slightly, e.g., in the number of executable threads, the cache sizes and architectures, or the memory access times and bandwidth. Many parallel applications are optimized at design-time to achieve peak performance on a particular machine, but perform poorly on others. This is unacceptable for applications used in everyday life. Auto-tuners [1] have great potential to tackle this problem effectively. Instead of being hard-wired in the code, the performance-relevant parameters of a multicore application are made configurable. An auto-tuner is used on the target platform where the program is executed to systematically find an optimal configuration; this is typically not known beforehand and may be counter-intuitive. When the program is migrated to another machine, auto-tuning is repeated, thus preserving portability. In this paper, we present our novel contributions to make auto-tuning work for general-purpose parallel programs, not just scientific numerical programs. Our experimental results show that auto-tuning is promising and worth integrating into operating systems and compilers, so that manycore applications can be tuned more effectively at run-time.
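
The essence of the loop an auto-tuner runs on each new machine, as a toy (the tunable and the cost model below are made up): measure, move to the best neighbouring configuration, and stop when no neighbour is faster.

    # Hill-climbing auto-tuner over one parameter (thread count).
    def run_kernel(num_threads):
        """Stand-in benchmark; a real tuner would time the actual program."""
        overhead = 0.002 * num_threads         # synchronization cost
        compute = 1.0 / min(num_threads, 8)    # scales up to 8 hardware threads
        return compute + overhead

    def autotune(lo=1, hi=64):
        best = lo
        while True:
            neighbours = [n for n in (best // 2, best - 1, best + 1, best * 2)
                          if lo <= n <= hi]
            candidate = min(neighbours + [best], key=run_kernel)
            if candidate == best:
                return best
            best = candidate

    print("tuned thread count:", autotune())   # 8 under this cost model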

Journal ArticleDOI
TL;DR: This paper addresses the problems associated with managing the cryptographic keys upon which such services rely by ensuring that keys remain within the trusted computing base.
Abstract: Virtualization brings flexibility to the data center and enables separations allowing for better security properties. For these security properties to be fully utilized, virtual machines need to be able to connect to secure services such as networking and storage. This paper addresses the problems associated with managing the cryptographic keys upon which such services rely by ensuring that keys remain within the trusted computing base. Here we describe a general architecture for managing keys tied to the underlying virtualized systems, with a specific example given for secure storage.

Journal ArticleDOI
TL;DR: A novel method of 'cooperative' performance data collection is disclosed that effectively enables time-based sampling of virtualized workloads combined with hardware event counting, is applicable to unmodified, commercially available virtual machines, and has competitive precision and overhead.
Abstract: This article addresses a problem of performance monitoring inside virtual machines (VMs). It advocates focused monitoring of particular virtualized programs, explains the need for and the importance of such an approach to performance monitoring in virtualized execution environments, and emphasizes its benefits for virtual machine manufacturers, virtual machine users (mostly, software developers) and hardware (processor) manufacturers. The article defines the problem of in-VM performance monitoring as the ability to employ modern methods and hardware performance monitoring capabilities inside virtual machines to an extent comparable with what is being done in real environments. Unfortunately, there are numerous reasons preventing us from achieving such an ambitious goal, one of those reasons being the lack of support from virtualization engines; that is why a novel method of 'cooperative' performance data collection is disclosed. The method implies collection of performance data at physical hardware and simultaneous tracking of software states inside a virtual machine. Each statistically visible execution point of the virtualized software may then be associated with information on real hardware events. The method effectively enables time-based sampling of virtualized workloads combined with hardware event counting, is applicable to unmodified, commercially available virtual machines, and has competitive precision and overhead. The practical significance and value of the method are further illustrated by studying a parallel workload and uncovering virtualization-specific performance issues of multithreaded programs.

Journal ArticleDOI
John Wilkes1
TL;DR: A short retrospective on the Storage Systems Program's decade-long journey to automate the management of enterprise storage systems by means of a technique the authors initially called attribute-managed storage, which resulted in a specification language they called Rome.
Abstract: Starting in 1994/5, the Storage Systems Program at HP Labs embarked on a decade-long journey to automate the management of enterprise storage systems by means of a technique we initially called attribute-managed storage. The key idea was to provide declarative specifications of workloads and their needs, and of storage devices and their capabilities, and to automate the mapping of one to the other. One of many outcomes of the project was a specification language we called Rome -- hence the title of this paper, which offers a short retrospective on the approach and some of the lessons we learned along the way.

Journal ArticleDOI
TL;DR: This position paper argues that the primary benefits of off-loading can be captured with alternative mechanisms that eliminate off-loading's negative effects, and articulates such a mechanism with initial results that demonstrate promise.
Abstract: Large-scale multi-core chips open up the possibility of implementing heterogeneous cores on a single chip, where some cores can be customized to execute common code patterns. The operating system is an example of a common code pattern that is constantly executing on every processor. It is therefore a prime candidate for core customization. Recent work has begun to explore this possibility, where some fraction of system calls and other OS functionality is off-loaded to a separate special-purpose core. Studies have shown that this can improve overall system performance and power consumption. However, our explorations in this arena reveal that the primary benefits of off-loading can be captured with alternative mechanisms that eliminate the negative effects of off-loading. This position paper articulates this alternative mechanism with initial results that demonstrate promise.

Journal ArticleDOI
TL;DR: Presents OASES, OS and architectural support for an efficient and robust shadow memory implementation for multicores, and shows that the overheads of runtime monitoring tasks are significantly reduced in comparison to previous software implementations.
Abstract: Runtime monitoring support serves as a foundation for the important tasks of providing security, performing debugging, and improving performance of applications. Often runtime monitoring requires the maintenance of information associated with each of the application's original memory locations, which is held in corresponding shadow memory locations. Unfortunately, existing robust shadow memory implementations are inefficient. In this paper, we present OASES: OS and Architectural Support for Efficient Shadow memory implementation for multicores that is also robust. A combination of operating system support (in the form of coupled allocation of memory pages used by the application and associated shadow memory pages) and architectural support (in the form of ISA support and exposed cache events) is proposed. Our page allocation policy enables fast translation of original addresses into corresponding shadow memory addresses, thus allowing implicit addressing of shadow memory. By exposing cache events to the software, we ensure in software that the shadow memory instructions execute atomically with their corresponding original memory instructions. Our experiments show that the overheads of runtime monitoring tasks are significantly reduced in comparison to previous software implementations.
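
The heart of implicit shadow addressing can be shown abstractly (OASES achieves this with coupled page allocation and ISA support; the sketch below is plain arithmetic): couple each application address with a shadow address at a fixed offset, so translation is a single add.

    # Implicit shadow-memory addressing via a constant offset.
    PAGE = 4096
    SHADOW_BASE = 1 << 40     # hypothetical region reserved for shadow pages

    def shadow_addr(app_addr):
        return SHADOW_BASE + app_addr      # implicit, constant-time mapping

    memory, shadow = {}, {}

    def monitored_store(addr, value, taint):
        memory[addr] = value               # original memory instruction...
        shadow[shadow_addr(addr)] = taint  # ...and its atomic shadow update

    monitored_store(0x7f0000, 42, taint="from network")
    print(shadow[shadow_addr(0x7f0000)])   # 'from network'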

Journal ArticleDOI
TL;DR: The CHART System is described as a set of five interacting services and protocol improvements which act together to make TCP/IP robust under conditions of loss and latency; the test regime and performance results are also detailed.
Abstract: TCP/IP is known to have poor performance under conditions of moderate to high packet loss (5%-20%) and end-to-end latency (20-200 ms). The CHART system, under development by HP and its partners under contract to the US Defense Advanced Research Projects Agency, is a careful re-engineering of Internet Layer 3 and Layer 4 protocols to improve TCP/IP performance in these cases. The CHART system has just completed the second phase of a three-phase, 42-month development cycle. The goal for the 42-month program was a 10x improvement in the performance of TCP/IP under conditions of loss and delay. In independent tests for DARPA at Science Applications International Corporation, the CHART System demonstrated a 20x performance improvement over TCP/IP, exceeding the goals for the program by a factor of two. Fairness to legacy TCP and UDP flows was further demonstrated in DARPA testing. We describe the CHART System as a set of five interacting services and protocol improvements which act together to make TCP/IP robust under conditions of loss and latency, and we detail the test regime and performance results.
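
The regime matters because of the well-known Mathis et al. bound, throughput <= MSS / (RTT * sqrt(loss)); evaluating it at the paper's loss and latency ranges shows why stock TCP collapses there (numbers come from the formula, not from the CHART tests).

    # TCP throughput upper bound at the paper's loss/latency corner points.
    from math import sqrt

    MSS = 1460 * 8                      # bits per segment
    for rtt in (0.020, 0.200):          # 20 ms and 200 ms
        for loss in (0.05, 0.20):       # 5% and 20% packet loss
            bps = MSS / (rtt * sqrt(loss))
            print(f"RTT {rtt*1000:4.0f} ms, loss {loss:4.0%}: "
                  f"<= {bps/1e6:5.2f} Mbit/s")

Even the best case (20 ms, 5% loss) caps below 3 Mbit/s, which is the gap the CHART re-engineering targets.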