
Showing papers in "Operating Systems Review in 2009"


Journal ArticleDOI
TL;DR: Across a range of VM workloads, post-copy improves several metrics including pages transferred, total migration time, and network overhead, and is facilitated with adaptive prepaging techniques that minimize the number of page faults across the network.
Abstract: We present the design, implementation, and evaluation of post-copy based live migration for virtual machines (VMs) across a Gigabit LAN. Post-copy migration defers the transfer of a VM's memory contents until after its processor state has been sent to the target host. This deferral is in contrast to the traditional pre-copy approach, which first copies the memory state over multiple iterations followed by a final transfer of the processor state. The post-copy strategy can provide a "win-win" by reducing total migration time while maintaining the liveness of the VM during migration. We compare post-copy extensively against the traditional pre-copy approach on the Xen Hypervisor. Using a range of VM workloads we show that post-copy improves several metrics including pages transferred, total migration time, and network overhead. We facilitate the use of post-copy with adaptive prepaging techniques to minimize the number of page faults across the network. We propose different prepaging strategies and quantitatively compare their effectiveness in reducing network-bound page faults. Finally, we eliminate the transfer of free memory pages in both pre-copy and post-copy through a dynamic self-ballooning (DSB) mechanism. DSB periodically reclaims free pages from a VM and significantly speeds up migration with negligible performance impact on VM workload.
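
To make the prepaging idea concrete, here is a minimal sketch in Python (our illustration, not the authors' code) of one possible "bubble" strategy: on each network-bound page fault, the source pushes the faulting page first and then unsent neighbours expanding outward, so pages near the fault arrive before the guest touches them.

    # Hypothetical "bubble" prepaging around a fault address.
    def bubble_prepage(fault_page, sent, num_pages, window=8):
        """Return the pages to push for one fault, nearest-first."""
        batch = []
        for dist in range(window + 1):
            for page in (fault_page - dist, fault_page + dist):
                if 0 <= page < num_pages and page not in sent and page not in batch:
                    batch.append(page)
        return batch

    # Example: a VM with 32 pages; the guest faults on pages 10 and then 11.
    sent = set()
    for fault in (10, 11):
        if fault in sent:
            continue  # already pushed by an earlier bubble: no network fault
        batch = bubble_prepage(fault, sent, num_pages=32)
        sent.update(batch)
        print(f"fault on page {fault}: push {batch}")

The second access hits a page the first bubble already pushed, which is exactly the effect prepaging aims for.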

358 citations


Journal ArticleDOI
TL;DR: Describes how fos's design is well suited to attack the scalability challenge of future multicores and discusses how traditional application-operating system interfaces can be redesigned to improve scalability.
Abstract: The next decade will afford us computer chips with 100s to 1,000s of cores on a single piece of silicon. Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. If multicore trends continue, the number of cores that an operating system will be managing will continue to double every 18 months. The traditional evolutionary approach of redesigning OS subsystems when there is insufficient parallelism will cease to work because the rate of increasing parallelism will far outpace the rate at which OS designers will be capable of redesigning subsystems. The fundamental design of operating systems and operating system data structures must be rethought to put scalability as the prime design constraint. This work begins by documenting the scalability problems of contemporary operating systems. These studies are used to motivate the design of a factored operating system (fos). fos is a new operating system targeting manycore systems with scalability as the primary design constraint, where space sharing replaces time sharing to increase scalability. We describe fos, which is built in a message-passing manner out of a collection of Internet-inspired services. Each operating system service is factored into a set of communicating servers which in aggregate implement a system service. These servers are designed much in the way that distributed Internet services are designed, but instead of providing high-level Internet services, these servers provide traditional kernel services and replace traditional kernel data structures in a factored, spatially distributed manner. fos replaces time sharing with space sharing. In other words, fos's servers are bound to distinct processing cores and by doing so do not fight with end-user applications for implicit resources such as TLBs and caches. We describe how fos's design is well suited to attack the scalability challenge of future multicores and discuss how traditional application-operating system interfaces can be redesigned to improve scalability.
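
As a toy illustration of the factored structure (ours, not fos code), the sketch below builds one "service" as a fleet of servers that own disjoint shards of a namespace and interact with clients only through messages; in fos each server would additionally be pinned to its own core.

    # A toy "factored" file service: a system call becomes a message to a
    # fleet of servers, with no shared data structures between them.
    from queue import Queue
    from threading import Thread

    class FileServer(Thread):
        """One member of the fleet; owns a disjoint shard of the namespace."""
        def __init__(self, inbox):
            super().__init__(daemon=True)
            self.inbox, self.files = inbox, {}
        def run(self):
            while True:
                op, name, payload, reply = self.inbox.get()
                if op == "write":
                    self.files[name] = payload
                    reply.put("ok")
                elif op == "read":
                    reply.put(self.files.get(name))

    # The namespace is space-shared across servers by hashing, not time-shared.
    fleet = [FileServer(Queue()) for _ in range(4)]
    for s in fleet:
        s.start()

    def call(op, name, payload=None):
        reply = Queue()
        fleet[hash(name) % len(fleet)].inbox.put((op, name, payload, reply))
        return reply.get()

    call("write", "/etc/motd", b"hello")
    print(call("read", "/etc/motd"))   # b'hello'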

342 citations


Journal ArticleDOI
TL;DR: This work proposes a Heterogeneity-Aware Signature-Supported scheduling algorithm that matches threads to cores using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline, and is comparatively simple and scalable.
Abstract: Future heterogeneous single-ISA multicore processors will have an edge in potential performance per watt over comparable homogeneous processors. To fully tap into that potential, the OS scheduler needs to be heterogeneity-aware, so it can match jobs to cores according to characteristics of both. We propose a Heterogeneity-Aware Signature-Supported (HASS) scheduling algorithm that does the matching using per-thread architectural signatures, which are compact summaries of threads' architectural properties collected offline. The resulting algorithm does not rely on dynamic profiling, and is comparatively simple and scalable. We implemented HASS in OpenSolaris, and achieved average workload speedups of up to 13%, matching the best static assignment, achievable only by an oracle. We have also implemented a dynamic IPC-driven algorithm proposed earlier that relies on online profiling. We found that the complexity, load imbalance and associated performance degradation resulting from dynamic profiling are significant challenges to using this algorithm successfully. As a result it failed to deliver the expected performance gains and to outperform HASS.
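
The flavor of signature-driven assignment can be shown in a few lines (a hypothetical sketch with made-up benchmark numbers, not the HASS implementation): each thread's offline signature, reduced here to last-level cache misses per 1K instructions, yields an estimated benefit from a fast core, and the fast cores go to the threads that would benefit most.

    # Illustrative signature-based matching of threads to fast/slow cores.
    def estimated_gain(mpki, fast_ghz=3.0, slow_ghz=1.5):
        # Memory-bound threads (high MPKI) scale poorly with core frequency.
        memory_fraction = min(mpki / 20.0, 1.0)
        cpu_fraction = 1.0 - memory_fraction
        return cpu_fraction * (fast_ghz / slow_ghz - 1.0)

    signatures = {"gcc": 2.1, "mcf": 18.7, "bzip2": 4.0, "lbm": 15.2}
    fast_cores = 2  # remaining threads run on slow cores

    ranked = sorted(signatures, key=lambda t: estimated_gain(signatures[t]),
                    reverse=True)
    assignment = {t: ("fast" if i < fast_cores else "slow")
                  for i, t in enumerate(ranked)}
    print(assignment)  # CPU-bound threads land on the fast cores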

256 citations


Journal ArticleDOI
TL;DR: COTSon opens up a new dimension in the speed/accuracy space, allowing simulation of a cluster of nodes several orders of magnitude faster with minimal accuracy loss, abandoning "always-on" cycle-based simulation in favor of statistical sampling approaches that can trade accuracy for speed.
Abstract: Simulation has historically been the primary technique used for evaluating the performance of new proposals in computer architecture. Speed and complexity considerations have traditionally limited its applicability to single-thread processors running application-level code. This is no longer sufficient to model modern multicore systems running the complex workloads of commercial interest today. COTSon is a simulator framework jointly developed by HP Labs and AMD. The goal of COTSon is to provide fast and accurate evaluation of current and future computing systems, covering the full software stack and complete hardware models. It targets cluster-level systems composed of hundreds of commodity multicore nodes and their associated devices connected through a standard communication network. COTSon adopts a functional-directed philosophy, where fast functional emulators and timing models cooperate to improve the simulation accuracy at a speed sufficient to simulate the full stack of applications, middleware and OSs. This paper describes the changes in simulation philosophy we embraced in COTSon to address these new challenges. We base functional emulation on established, fast and validated tools that support commodity OSs and complex multitier applications. Through a robust interface between the functional and timing domain, we can leverage other existing simulators for individual sub-components, such as disks or networks. We abandon the idea of "always-on" cycle-based simulation in favor of statistical sampling approaches that can trade accuracy for speed. COTSon opens up a new dimension in the speed/accuracy space, allowing simulation of a cluster of nodes several orders of magnitude faster with a minimal accuracy loss.
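
The sampling philosophy can be sketched as follows (an illustration under our own toy assumptions, not COTSon code): run mostly in a fast functional mode, periodically drop into a detailed timing model for a short sample, and extrapolate the sampled CPI over the fast-forwarded instructions.

    # Toy statistical-sampling simulation loop.
    import random

    def detailed_sample(n_instr):
        """Stand-in for the cycle-accurate model: returns simulated cycles."""
        return n_instr * random.uniform(0.9, 1.4)   # pretend CPI in [0.9, 1.4]

    total_instr = 1_000_000_000
    chunk, sample_len = 50_000_000, 1_000_000

    cycles, done = 0.0, 0
    while done < total_instr:
        done += chunk                           # functional mode: no timing
        sampled = detailed_sample(sample_len)   # short detailed sample
        done += sample_len
        # The sample's CPI stands in for the whole fast-forwarded chunk.
        cycles += sampled + sampled / sample_len * chunk

    print(f"estimated CPI = {cycles / done:.2f}")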

206 citations


Journal ArticleDOI
Micah Dowty1, Jeremy Sugerman1
TL;DR: This paper describes in detail the specific GPU virtualization architecture developed for VMware's hosted products (VMware Workstation and VMware Fusion) and finds that taking advantage of hardware acceleration significantly closes the gap between pure emulation and native, but that different implementations and host graphics stacks show distinct variation.
Abstract: Modern graphics co-processors (GPUs) can produce high fidelity images several orders of magnitude faster than general purpose CPUs, and this performance expectation is rapidly becoming ubiquitous in personal computers. Despite this, GPU virtualization is a nascent field of research. This paper introduces a taxonomy of strategies for GPU virtualization and describes in detail the specific GPU virtualization architecture developed for VMware's hosted products (VMware Workstation and VMware Fusion). We analyze the performance of our GPU virtualization with a combination of applications and microbenchmarks. We also compare against software rendering, the GPU virtualization in Parallels Desktop 3.0, and the native GPU. We find that taking advantage of hardware acceleration significantly closes the gap between pure emulation and native, but that different implementations and host graphics stacks show distinct variation. The microbenchmarks show that our architecture amplifies the overheads in the traditional graphics API bottlenecks: draw calls, downloading buffers, and batch sizes. Our virtual GPU architecture runs modern graphics-intensive games and applications at interactive frame rates while preserving virtual machine portability. The applications we tested achieve from 86% to 12% of native rates and 43 to 18 frames per second with VMware Fusion 2.0.

179 citations


Journal ArticleDOI
TL;DR: The rationale for the design of the SmartFrog framework, details of its design, plus a description of the further research that is in progress are covered.
Abstract: SmartFrog is a framework for creating configuration-driven systems. It has been designed with the express purpose of making the design, deployment and management of distributed component-based systems simpler and more robust. Over the last decade it has been the focus for ongoing research into aspects of configuration management and large-scale distributed systems, providing a platform for experimentation. The paper covers the rationale for the design of the framework, details of its design, plus a description of the further research that is in progress.

106 citations


Journal ArticleDOI
TL;DR: An overview of the joint research at HP Labs and University of Michigan in the past few years, where control theory was applied to automated resource and service level management in data centers, and the key benefits of a control-theoretic approach for systems research are highlighted.
Abstract: Feedback mechanisms can help today's increasingly complex computer systems adapt to changes in workloads or operating conditions. Control theory offers a principled way for designing feedback loops to deal with unpredictable changes, uncertainties, and disturbances in systems. We provide an overview of the joint research at HP Labs and University of Michigan in the past few years, where control theory was applied to automated resource and service level management in data centers. We highlight the key benefits of a control-theoretic approach for systems research, and present specific examples from our experience of designing adaptive resource control systems where this approach worked well. In addition, we outline the main limitations of this approach, and discuss the lessons learned from our experience.
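
A toy example of the pattern (our own sketch, with a made-up plant model): an integral controller adjusts a VM's CPU share so that measured response time tracks a target, with no precise model of the workload required.

    # Minimal feedback loop in the control-theoretic style described above.
    def plant(cpu_share, load=1.0):
        """Hypothetical system: response time falls as the CPU share grows."""
        return load / max(cpu_share, 0.05)

    target = 2.0          # desired response time (seconds)
    share, gain = 0.2, 0.02
    for step in range(25):
        measured = plant(share)
        error = measured - target
        share = min(1.0, max(0.05, share + gain * error))  # integral action
        if step % 5 == 0:
            print(f"step {step:2d}: share={share:.2f} resp={measured:.2f}s")

The controller converges toward the share that meets the target even though it never knows the plant's equation, which is the robustness argument the authors make for feedback over open-loop tuning.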

106 citations


Journal ArticleDOI
TL;DR: MEmory Balancer (MEB) is introduced, which dynamically monitors the memory usage of each virtual machine, accurately predicts its memory needs, and periodically reallocates host memory; with MEB, overall system throughput can be significantly improved.
Abstract: Virtualization essentially enables multiple operating systems and applications to run on one physical computer by multiplexing hardware resources. A key motivation for applying virtualization is to improve hardware resource utilization while maintaining reasonable quality of service. However, such a goal cannot be achieved without efficient resource management. Though most physical resources, such as processor cores and I/O devices, are shared among virtual machines using time slicing and can be scheduled flexibly based on priority, allocating an appropriate amount of main memory to virtual machines is more challenging. Different applications have different memory requirements. Even a single application shows varied working set sizes during its execution. An optimal memory management strategy under a virtualized environment thus needs to dynamically adjust memory allocation for each virtual machine, which further requires a prediction model that forecasts its host physical memory needs on the fly. This paper introduces MEmory Balancer (MEB) which dynamically monitors the memory usage of each virtual machine, accurately predicts its memory needs, and periodically reallocates host memory. MEB uses two effective memory predictors which, respectively, estimate the amount of memory available for reclaiming without a notable performance drop, and additional memory required for reducing the virtual machine paging penalty. Our experimental results show that our prediction schemes yield high accuracy and low overhead. Furthermore, the overall system throughput can be significantly improved with MEB.
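
A simplified sketch of the balancing step (our assumptions, not MEB's code): each VM's predictors report how much memory it could give up without a notable slowdown and how much extra it needs to stop paging, and the balancer moves pages from donors to takers through their balloon targets.

    # Toy memory rebalancing across VMs (all sizes in MB).
    vms = {  # current allocation, reclaimable surplus, additional need
        "vm1": {"alloc": 2048, "reclaimable": 600, "need": 0},
        "vm2": {"alloc": 1024, "reclaimable": 0,   "need": 400},
        "vm3": {"alloc": 1536, "reclaimable": 300, "need": 0},
    }

    def rebalance(vms):
        pool = sum(v["reclaimable"] for v in vms.values())
        for v in vms.values():                   # inflate donors' balloons
            v["alloc"] -= v["reclaimable"]
        total_need = sum(v["need"] for v in vms.values()) or 1
        for v in vms.values():                   # deflate takers' balloons
            v["alloc"] += min(v["need"], pool * v["need"] // total_need)

    rebalance(vms)
    print({name: v["alloc"] for name, v in vms.items()})
    # vm2 grows by its predicted need; vm1/vm3 shrink by their surplus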

104 citations


Journal ArticleDOI
TL;DR: This paper shows how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system.
Abstract: Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27% improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.
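
Case (1) can be illustrated with a greedy marginal-gain allocator (hypothetical miss-rate curves; in the paper the curves come from the hardware monitoring units): give cache ways one at a time to whichever application's curve predicts the largest drop in misses from one more way.

    # Cache partition sizing from per-application miss-rate curves.
    miss_curve = {   # misses per 1K instructions vs. allocated ways (0..8)
        "appA": [40, 22, 15, 12, 11, 10, 10, 10, 10],
        "appB": [30, 28, 26, 24, 20, 14, 9, 6, 5],
    }
    total_ways = 8
    ways = {app: 0 for app in miss_curve}

    for _ in range(total_ways):
        # Marginal benefit of one extra way for each application.
        best = max(miss_curve, key=lambda a: miss_curve[a][ways[a]]
                                             - miss_curve[a][ways[a] + 1])
        ways[best] += 1

    print(ways)  # {'appA': 3, 'appB': 5} given the curves above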

82 citations


Journal ArticleDOI
TL;DR: This paper explores the design space for the configuration and software stack of hybrid clusters of AMPs and GPPs through an implementation of the popular MapReduce programming model, and finds that in a cluster with resource-constrained and well-provisioned AMP accelerators, a streaming approach achieves 50.5% and 73.1% better performance, respectively, than the non-streaming approach.
Abstract: Asymmetric multi-core processors (AMPs), with general-purpose and specialized cores packaged on the same chip, are emerging as a leading paradigm for high-end computing. A large body of existing research explores the use of standalone AMPs in computationally challenging and data-intensive applications. AMPs are rapidly being deployed as high-performance accelerators on clusters. In these settings, scheduling, communication and I/O are managed by general-purpose processors (GPPs), while computation is off-loaded to AMPs. Design space exploration for the configuration and software stack of hybrid clusters of AMPs and GPPs is an open problem. In this paper, we explore this design space in an implementation of the popular MapReduce programming model. Our contributions are: an exploration of various design alternatives for hybrid asymmetric clusters of AMPs and GPPs; the adoption of a streaming approach to supporting MapReduce computations on clusters with asymmetric components; and adaptive schedulers that take into account individual component capabilities in asymmetric clusters. Throughout our design, we remove I/O bottlenecks using double-buffering and asynchronous I/O. We present an evaluation of the design choices through experiments on a real cluster with MapReduce workloads of varying degrees of computation intensity. We find that in a cluster with resource-constrained and well-provisioned AMP accelerators, a streaming approach achieves 50.5% and 73.1% better performance than the non-streaming approach, respectively, and scales almost linearly with an increasing number of compute nodes. We also show that our dynamic scheduling mechanisms effectively adapt the parameters of the scheduling policies between applications with different computation density.
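
The double-buffering idea is easy to show in miniature (our sketch, not the paper's system): a prefetch thread keeps the next input chunk in flight while the worker runs the map function over the current one, so computation overlaps I/O.

    # Toy double-buffered streaming map over input chunks.
    import threading, queue

    def read_chunks(chunks, out_q):
        for chunk in chunks:               # stand-in for asynchronous reads
            out_q.put(chunk)
        out_q.put(None)                    # end-of-stream marker

    def map_words(chunk):
        counts = {}
        for word in chunk.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    chunks = ["a b a", "b b c", "c a"]
    q = queue.Queue(maxsize=2)             # two buffers in flight
    threading.Thread(target=read_chunks, args=(chunks, q), daemon=True).start()

    totals = {}
    while (chunk := q.get()) is not None:  # compute overlaps with prefetch
        for word, n in map_words(chunk).items():
            totals[word] = totals.get(word, 0) + n
    print(totals)  # {'a': 3, 'b': 3, 'c': 2}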

67 citations


Journal ArticleDOI
TL;DR: FlexDCP is a framework that allows the Operating System to guarantee a QoS for each application running in a chip multiprocessor and offers more flexibility to the OS, as it can optimize different QoS metrics like per-application performance or global performance metrics such as fairness, weighted speedup or throughput.
Abstract: Current multicore architectures offer high throughput by increasing hardware resource utilization. As the number of cores in a multicore system increases, providing Quality of Service (QoS) to applications in addition to throughput is becoming an important problem. In this work, we present FlexDCP, a framework that allows the Operating System (OS) to guarantee a QoS for each application running in a chip multiprocessor. FlexDCP directly estimates the performance of applications for different cache configurations instead of using indirect measures of performance like the number of misses. This information allows the OS to convert QoS requirements into resource assignments. Consequently, it offers more flexibility to the OS, as it can optimize different QoS metrics like per-application performance or global performance metrics such as fairness, weighted speedup or throughput. Our results show that FlexDCP is able to force applications in a workload to run at a certain percentage of their maximum performance in 94% of the cases considered, falling on average 1.48% short of the objective in the remaining cases. When optimizing a global QoS metric like fairness, FlexDCP consistently outperforms traditional eviction policies like LRU, pseudo-LRU and previous dynamic cache partitioning proposals for two-, four- and eight-core configurations. In an eight-core architecture FlexDCP obtains a fairness improvement of 10.1% over Fair, the best policy in the literature optimizing fairness.
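
A toy of the FlexDCP decision step (illustrative numbers only): given direct per-partition performance estimates, the OS simply picks the partition that optimizes the chosen QoS metric, here fairness measured as the worst relative performance across applications.

    # Choosing a cache partition from direct performance estimates.
    ipc_estimate = {  # predicted IPC of (appA, appB) per (A ways, B ways)
        (2, 6): (0.8, 1.9), (4, 4): (1.1, 1.5), (6, 2): (1.4, 0.9),
    }
    ipc_alone = {"appA": 1.6, "appB": 2.0}   # baseline with the whole cache

    def fairness(part):
        a, b = ipc_estimate[part]
        return min(a / ipc_alone["appA"], b / ipc_alone["appB"])

    best = max(ipc_estimate, key=fairness)
    print(best, round(fairness(best), 2))    # (4, 4) 0.69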

Journal ArticleDOI
TL;DR: In this article, the authors describe a lightweight software mechanism for migrating virtual machines with direct hardware access, based on shadow drivers, an agent in the guest OS kernel that efficiently captures and restores the state of a device driver.
Abstract: Virtual machine migration greatly aids management by allowing flexible provisioning of resources and decommissioning of hardware for maintenance. However, efforts to improve network performance by granting virtual machines direct access to hardware currently prevent migration. This occurs because (1) the VMM cannot migrate the state of the device, and (2) the source and destination machines may have different network devices, requiring different drivers to run in the migrated virtual machine. In this paper, we describe a lightweight software mechanism for migrating virtual machines with direct hardware access. We base our solution on shadow drivers, an agent in the guest OS kernel that efficiently captures and restores the state of a device driver. On the source machine, the shadow driver monitors the state of the driver and device. After migration, the shadow driver uses this information to configure a driver for the corresponding device on the destination machine. We implement shadow driver migration for Linux network drivers running on the Xen hypervisor. Shadow driver migration requires a migration downtime similar to the driver initialization time, short enough to avoid disrupting active TCP connections. We find that the performance overhead, compared to direct hardware access, is negligible and is much lower than using a virtual NIC.
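
A schematic sketch of the shadow-driver idea (ours, not the authors' Linux/Xen implementation): while the VM runs, the shadow logs the driver's externally visible configuration; after migration it replays that log onto whatever driver the destination NIC requires.

    # Capture-and-replay shadow driver, in miniature.
    class ShadowDriver:
        def __init__(self):
            self.log = []                 # captured configuration state
        def intercept(self, call, *args):
            self.log.append((call, args))  # e.g. set_mac, up, add_multicast
        def restore(self, new_driver):
            for call, args in self.log:   # replay onto the destination driver
                getattr(new_driver, call)(*args)

    class E1000Driver:                    # destination device's driver (toy)
        def set_mac(self, mac): print("mac <-", mac)
        def up(self): print("interface up")

    shadow = ShadowDriver()
    shadow.intercept("set_mac", "52:54:00:12:34:56")  # recorded on the source
    shadow.intercept("up")
    shadow.restore(E1000Driver())         # replayed on the destination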

Journal ArticleDOI
TL;DR: This work has developed very efficient technology to detect approximate duplication of large directory hierarchies, which can be caused by unnecessary mirroring of repositories by uncoordinated employees or departments.
Abstract: In order to catch and reduce waste in the exponentially increasing demand for disk storage, we have developed very efficient technology to detect approximate duplication of large directory hierarchies. Such duplication can be caused, for example, by unnecessary mirroring of repositories by uncoordinated employees or departments. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level. For example, one could coordinate and consolidate multiple copies in one location.
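
One simple way to realize this, sketched below under our own assumptions (the paper's actual technique is more refined), is to fingerprint every file by a content hash and score a pair of trees by the Jaccard similarity of their fingerprint sets.

    # Approximate duplicate detection between two directory hierarchies.
    import hashlib, os

    def tree_fingerprints(root):
        prints = set()
        for dirpath, _, files in os.walk(root):
            for name in files:
                with open(os.path.join(dirpath, name), "rb") as f:
                    prints.add(hashlib.sha1(f.read()).hexdigest())
        return prints

    def similarity(a, b):
        fa, fb = tree_fingerprints(a), tree_fingerprints(b)
        return len(fa & fb) / max(len(fa | fb), 1)

    # Trees with similarity near 1.0 are candidates for consolidation, e.g.:
    # print(similarity("/repos/copy1", "/repos/copy2"))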

Journal ArticleDOI
TL;DR: This paper introduces DataSeries, an on-disk format, run-time library and set of tools for storing and analyzing structured serial data, and identifies six key properties of a system to store and analyze this type of data.
Abstract: Structured serial data is used in many scientific fields; such data sets consist of a series of records, and are typically written once, read many times, chronologically ordered, and read sequentially. In this paper we introduce DataSeries, an on-disk format, run-time library and set of tools for storing and analyzing structured serial data. We identify six key properties of a system to store and analyze this type of data, and describe how DataSeries was designed to provide these properties. We quantify the benefits of DataSeries through several experiments. In particular, we demonstrate that DataSeries exceeds the performance of common trace formats by at least a factor of two.
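
To make "structured serial data" concrete, here is a toy fixed-schema record file in the write-once, read-sequentially style DataSeries targets (an illustration, not the DataSeries format itself).

    # Fixed-schema binary records: timestamp, value, flags.
    import struct

    RECORD = struct.Struct("<qdi")

    def write_series(path, records):
        with open(path, "wb") as f:
            for rec in records:
                f.write(RECORD.pack(*rec))

    def read_series(path):
        with open(path, "rb") as f:
            while chunk := f.read(RECORD.size):
                yield RECORD.unpack(chunk)

    write_series("trace.bin", [(1, 0.5, 0), (2, 0.7, 1)])
    print(list(read_series("trace.bin")))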

Journal ArticleDOI
TL;DR: This work addresses the software costs of switching threads between cores in a multicore processor, and describes the implementation of core switching in the Linux kernel, as well as software changes that can decrease switching costs.
Abstract: We address the software costs of switching threads between cores in a multicore processor. Fast core switching enables a variety of potential improvements, such as thread migration for thermal management, fine-grained load balancing, and exploiting asymmetric multicores, where performance asymmetry creates opportunities for more efficient resource utilization. Successful exploitation of these opportunities demands low core-switching costs. We describe our implementation of core switching in the Linux kernel, as well as software changes that can decrease switching costs. We use detailed simulations to evaluate several alternative implementations. We also explore how some simple architectural variations can reduce switching costs. We evaluate system efficiency using both real (but symmetric) hardware, and simulated asymmetric hardware, using both microbenchmarks and realistic applications.
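
A user-level taste of the cost being studied (Linux-only, using the kernel's affinity interface rather than the authors' in-kernel mechanism): bounce the current thread between two cores and time each forced migration.

    # Rough microbenchmark of thread migration between cores.
    import os, time

    def migrate_cost(core_a=0, core_b=1, rounds=100):
        t0 = time.perf_counter_ns()
        for i in range(rounds):
            os.sched_setaffinity(0, {core_a if i % 2 else core_b})
        return (time.perf_counter_ns() - t0) / rounds

    if hasattr(os, "sched_setaffinity") and (os.cpu_count() or 1) >= 2:
        print(f"~{migrate_cost():.0f} ns per forced migration (very rough)")

This measures only the user-visible system-call path; the cache and TLB refill costs the paper simulates are on top of this.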

Journal ArticleDOI
TL;DR: Describes the VPIO model, presents preliminary results from using it to support two commodity network cards within the Palacios VMM, and argues that an appropriate model for an I/O device could be produced by the hardware vendor as part of the design, implementation, and testing process.
Abstract: A commodity I/O device has no support for virtualization. A VMM can assign such a device to a single guest with direct, fast, but insecure access by the guest's native device driver. Alternatively, the VMM can build virtual devices on top of the physical device, allowing it to be multiplexed across VMs, but with lower performance. We propose a technique that provides an intermediate option. In virtual passthrough I/O (VPIO), the guest interacts directly with the physical device most of the time, achieving high performance, as in passthrough I/O. Additionally, the guest/device interactions drive a model that in turn identifies (1) when the physical device can be handed off to another VM, and (2) if the guest programs the device to behave illegitimately. In this paper, we describe the VPIO model, and present preliminary results in using it to support two commodity network cards within the Palacios VMM we are building. We believe that an appropriate model for an I/O device could be produced by the hardware vendor as part of the design, implementation, and testing process.

Journal ArticleDOI
TL;DR: Describes an architecture for reducing and containing the privileged code of the Xen Hypervisor, and a Trusted Virtual Platform architecture aimed at supporting the strong enforcement of integrity and security policy controls over a virtual entity.
Abstract: This paper introduces our work around combining machine virtualization technology with Trusted Computing Group technology. We first describe our architecture for reducing and containing the privileged code of the Xen Hypervisor. Secondly, we describe our Trusted Virtual Platform architecture. This is aimed at supporting the strong enforcement of integrity and security policy controls over a virtual entity, where a virtual entity can be either a full guest operating system or a virtual appliance running on a virtualized platform. The architecture includes a virtualization-specific integrity measurement and reporting framework, designed to reflect all the dependencies of the virtual environment of a guest operating system. The work is a core enabling component of our research around converged devices -- client platforms such as notebooks or desktop PCs that can safely host multiple virtual operating systems and virtual appliances concurrently and report accurately on the trustworthiness of the individually executing entities.
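
The measurement framework's core operation is the classic extend step, in which each loaded component is hashed into a register so the history cannot be silently rewritten; a minimal sketch of that pattern (the TPM PCR idiom; details here are illustrative):

    # Hash-chained integrity measurements, as in a TPM PCR extend.
    import hashlib

    def extend(register, measurement):
        """new register = H(old register || H(component))"""
        m = hashlib.sha256(measurement).digest()
        return hashlib.sha256(register + m).digest()

    pcr = bytes(32)                        # register starts at zero
    for component in (b"hypervisor", b"guest kernel", b"virtual appliance"):
        pcr = extend(pcr, component)
    print(pcr.hex())  # attests to exactly this sequence of loaded components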

Journal ArticleDOI
TL;DR: A model of computer systems research is developed to help prospective authors understand the often obscure workings of conference program committees, and it is argued that paper merit is likely to be Zipf-distributed, making it inherently difficult for program committees to distinguish between most papers.
Abstract: This paper develops a model of computer systems research to help prospective authors understand the often obscure workings of conference program committees. We present data to show that the variability between reviewers is often the dominant factor as to whether a paper is accepted. We argue that paper merit is likely to be Zipf-distributed, making it inherently difficult for program committees to distinguish between most papers. We use game theory to show that with noisy reviews and Zipf-distributed merit, authors have an incentive to submit papers too early and too often. These factors make conference reviewing, and systems research as a whole, less efficient and less effective. We describe some recent changes in conference design to address these issues, and we suggest some further potential improvements.
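
A small Monte Carlo rendering of the argument (our own toy parameters): if merit is Zipf-distributed and each review adds Gaussian noise, papers near the acceptance threshold are decided almost at random while the very best are safe.

    # Simulating noisy reviews over Zipf-distributed merit.
    import random

    papers, accept = 200, 40
    merit = [1.0 / rank for rank in range(1, papers + 1)]   # Zipf merit

    def review(m, noise=0.3):
        return m + random.gauss(0, noise)

    trials, accepted = 1000, [0] * papers
    for _ in range(trials):
        order = sorted(range(papers), key=lambda i: -review(merit[i]))
        for i in order[:accept]:
            accepted[i] += 1

    # Top papers are always in; threshold papers are near coin flips.
    for i in (0, accept - 1, accept, 2 * accept):
        print(f"paper ranked {i+1:3d} by merit: accepted {accepted[i]/trials:.0%}")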

Journal ArticleDOI
TL;DR: It is argued that hardware should take a more active role in the management of its computation resources, and hardware techniques to virtualize the cores of a multicore processor are proposed, allowing hardware to flexibly reassign the virtual processors that are exposed to a single operating system to any subset of the physical cores.
Abstract: As the computing industry enters the multicore era, exponential growth in the number of transistors on a chip continues to present challenges and opportunities for computer architects and system designers. We examine one emerging issue in particular: that of dynamic heterogeneity, which can arise, even among physically homogeneous cores, from changing reliability, power, or thermal conditions, different cache and TLB contents, or changing resource configurations. This heterogeneity results in a constantly varying pool of hardware resources, which greatly complicates software's traditional task of assigning computation to cores. In part to address dynamic heterogeneity, we argue that hardware should take a more active role in the management of its computation resources. We propose hardware techniques to virtualize the cores of a multicore processor, allowing hardware to flexibly reassign the virtual processors that are exposed, even to a single operating system, to any subset of the physical cores. We show that multicore virtualization operates with minimal overhead, and that it enables several novel resource management applications for improving both performance and reliability.

Journal ArticleDOI
TL;DR: It is proposed that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance.
Abstract: Single-ISA heterogeneous multicore architectures promise to deliver plenty of cores with varying complexity, speed and performance in the near future. Virtualization enables multiple operating systems to run concurrently as distinct, independent guest domains, thereby reducing core idle time and maximizing throughput. This paper seeks to identify a heuristic that can aid in intelligently scheduling these virtualized workloads to maximize performance while reducing power consumption. We propose that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance. We test and validate our hypothesis and further propose a metric, the Combined Usage of a domain, to assist in future energy-efficient scheduling. Our preliminary findings show that the Combined Usage metric can be used as a starting point to gauge the sensitivity of a guest domain to variations in the controlling domain's frequency.

Journal ArticleDOI
TL;DR: First, the ability of workload management algorithms to handle workloads that include unexpectedly long-running queries is evaluated; second, a new and more accurate method for predicting the resource usage of queries before runtime is described.
Abstract: We explore how to manage database workloads that contain a mixture of OLTP-like queries that run for milliseconds as well as business intelligence queries and maintenance tasks that last for hours. As data warehouses grow in size to petabytes and complex analytic queries play a greater role in day-to-day business operations, factors such as inaccurate cardinality estimates, data skew, and resource contention all make it notoriously difficult to predict how such queries will behave before they start executing. However, traditional workload management assumes that accurate expectations for the resource requirements and performance characteristics of a workload are available at compile-time, and relies on such information in order to make critical workload management decisions. In this paper, we describe our approach to dealing with inaccurate predictions. First, we evaluate the ability of workload management algorithms to handle workloads that include unexpectedly long-running queries. Second, we describe a new and more accurate method for predicting the resource usage of queries before runtime. We have carried out an extensive set of experiments, and report on a few of our results.

Journal ArticleDOI
TL;DR: Memory Buddies, a memory sharing-aware placement system for virtual machines, includes a memory fingerprinting system to efficiently determine the sharing potential among a set of VMs and compute more efficient placements, and makes use of live migration to optimize VM placement as workloads change.
Abstract: Many data center virtualization solutions, such as VMware ESX, employ content-based page sharing to consolidate the resources of multiple servers. Page sharing identifies virtual machine memory pages with identical content and consolidates them into a single shared page. This technique, implemented at the host level, applies only between VMs placed on a given physical host. In a multiserver data center, opportunities for sharing may be lost because the VMs holding identical pages are resident on different hosts. In order to obtain the full benefit of content-based page sharing it is necessary to place virtual machines such that VMs with similar memory content are located on the same hosts. In this paper we present Memory Buddies, a memory sharing-aware placement system for virtual machines. This system includes a memory fingerprinting system to efficiently determine the sharing potential among a set of VMs, and compute more efficient placements. In addition it makes use of live migration to optimize VM placement as workloads change. We have implemented a prototype Memory Buddies system with VMware ESX Server and present experimental results on our testbed, as well as an analysis of an extensive memory trace study. Evaluation of our prototype using a mix of enterprise and e-commerce applications demonstrates an increase of data center capacity (i.e. number of VMs supported) of 17%, while imposing low overhead and scaling to as many as a thousand servers.
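
A sketch of the fingerprinting step (illustrative, not the Memory Buddies code): hash each VM's pages, estimate pairwise sharing as the overlap of two VMs' hash sets, and co-locate the pair with the most pages in common.

    # Page-hash fingerprints and a sharing-aware placement pick.
    import hashlib
    from itertools import combinations

    def fingerprint(pages):
        return {hashlib.sha1(p).digest() for p in pages}

    vm_pages = {
        "vmA": [b"zero" * 1024, b"libc page", b"app A data"],
        "vmB": [b"zero" * 1024, b"libc page", b"app B data"],
        "vmC": [b"app C data1", b"app C data2", b"app C data3"],
    }
    prints = {vm: fingerprint(p) for vm, p in vm_pages.items()}

    best = max(combinations(prints, 2),
               key=lambda pair: len(prints[pair[0]] & prints[pair[1]]))
    print("co-locate", best)   # ('vmA', 'vmB'): two shareable pages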

Journal ArticleDOI
John R. Douceur1
TL;DR: In this article, the author proposes an alternative approach, based on rankings rather than ratings, to evaluating submitted conference papers within the computer-science community.
Abstract: Within the computer-science community, submitted conference papers are typically evaluated by means of rating, in two respects: First, individual reviewers are asked to provide their evaluations of papers by assigning a rating to each paper's overall quality. Second, program committees collectively rate each paper as being either worthy or unworthy of acceptance, according to the aggregate judgment of the committee members. This paper proposes an alternative approach to these two processes, based on rankings rather than ratings. It also presents experiences from employing rankings in PC discussions of a major CS conference.
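
One concrete aggregation rule for rankings (our example; the paper discusses the approach in general) is the Borda count, where a paper earns a point for every paper a reviewer ranked below it.

    # Aggregating per-reviewer rankings with a Borda count.
    def borda(rankings):
        scores = {}
        for ranking in rankings:           # each ranking is best-to-worst
            n = len(ranking)
            for pos, paper in enumerate(ranking):
                scores[paper] = scores.get(paper, 0) + (n - 1 - pos)
        return sorted(scores, key=scores.get, reverse=True)

    reviewer_rankings = [
        ["P3", "P1", "P2", "P4"],
        ["P1", "P3", "P4", "P2"],
        ["P3", "P4", "P1", "P2"],
    ]
    print(borda(reviewer_rankings))   # ['P3', 'P1', 'P4', 'P2']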

Journal ArticleDOI
TL;DR: The experimental results show that auto-tuning is promising and worth integrating into operating systems and compilers, so that manycore applications can be tuned more effectively at run-time.
Abstract: Due to stagnating processor clock rates, parallelism will be the source for future performance improvements. Despite the growing complexity, users now demand better performance of general-purpose parallel software without sacrificing portability and maintainability. This is difficult to achieve, and a coordinated approach is needed for the entire software hierarchy, including operating systems, compilers, and applications. In particular, performance optimization becomes more difficult because of the growing number of targets to optimize for. With mass markets for multicore systems, the diversity of multicore architectures has increased as well. Their characteristics often differ slightly, e.g., in the number of executable threads, the cache sizes and architectures, or the memory access times and bandwidth. Many parallel applications are optimized at design-time to achieve peak performance on a particular machine, but perform poorly on others. This is unacceptable for applications used in everyday life. Auto-tuners [1] have great potential to tackle this problem effectively. Instead of being hard-wired in the code, the performance-relevant parameters of a multicore application are made configurable. An auto-tuner is used on the target platform where the program is executed to systematically find an optimal configuration; this is typically not known beforehand and may be counter-intuitive. When the program is migrated to another machine, auto-tuning is repeated, thus preserving portability. In this paper, we present our novel contributions to make auto-tuning work for general-purpose parallel programs, not just scientific numerical programs. Our experimental results show that auto-tuning is promising and worth integrating into operating systems and compilers, so that manycore applications can be tuned more effectively at run-time.
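
The essence of the loop an auto-tuner runs on each new machine, as a toy (the tunable and the cost model below are made up): measure, move to the best neighbouring configuration, and stop when no neighbour is faster.

    # Hill-climbing auto-tuner over one parameter (thread count).
    def run_kernel(num_threads):
        """Stand-in benchmark; a real tuner would time the actual program."""
        overhead = 0.002 * num_threads         # synchronization cost
        compute = 1.0 / min(num_threads, 8)    # scales up to 8 hardware threads
        return compute + overhead

    def autotune(lo=1, hi=64):
        best = lo
        while True:
            neighbours = [n for n in (best // 2, best - 1, best + 1, best * 2)
                          if lo <= n <= hi]
            candidate = min(neighbours + [best], key=run_kernel)
            if candidate == best:
                return best
            best = candidate

    print("tuned thread count:", autotune())   # 8 under this cost model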

Journal ArticleDOI
TL;DR: This paper addresses the problems associated with managing the cryptographic keys upon which such services rely by ensuring that keys remain within the trusted computing base.
Abstract: Virtualization brings flexibility to the data center and enables separations allowing for better security properties. For these security properties to be fully utilized, virtual machines need to be able to connect to secure services such as networking and storage. This paper addresses the problems associated with managing the cryptographic keys upon which such services rely by ensuring that keys remain within the trusted computing base. Here we describe a general architecture for managing keys tied to the underlying virtualized systems, with a specific example given for secure storage.

Journal ArticleDOI
TL;DR: A novel method of 'cooperative' performance data collection is disclosed that effectively enables time-based sampling of virtualized workloads combined with hardware event counting, is applicable to unmodified, commercially available virtual machines, and has competitive precision and overhead.
Abstract: This article addresses a problem of performance monitoring inside virtual machines (VMs). It advocates focused monitoring of particular virtualized programs, explains the need for and the importance of such an approach to performance monitoring in virtualized execution environments, and emphasizes its benefits for virtual machine manufacturers, virtual machine users (mostly, software developers) and hardware (processor) manufacturers. The article defines the problem of in-VM performance monitoring as the ability to employ modern methods and hardware performance monitoring capabilities inside virtual machines to an extent comparable with what is being done in real environments. Unfortunately, there are numerous reasons preventing us from achieving such an ambitious goal, one of those reasons being the lack of support from virtualization engines; that is why a novel method of 'cooperative' performance data collection is disclosed. The method implies collection of performance data at physical hardware and simultaneous tracking of software states inside a virtual machine. Each statistically visible execution point of the virtualized software may then be associated with information on real hardware events. The method effectively enables time-based sampling of virtualized workloads combined with hardware event counting, is applicable to unmodified, commercially available virtual machines, and has competitive precision and overhead. The practical significance and value of the method are further illustrated by studying a parallel workload and uncovering virtualization-specific performance issues of multithreaded programs.

Journal ArticleDOI
John Wilkes1
TL;DR: A short retrospective on the Storage Systems Program's decade-long journey to automate the management of enterprise storage systems by means of a technique the authors initially called attribute-managed storage, which resulted in a specification language they called Rome.
Abstract: Starting in 1994/5, the Storage Systems Program at HP Labs embarked on a decade-long journey to automate the management of enterprise storage systems by means of a technique we initially called attribute-managed storage. The key idea was to provide declarative specifications of workloads and their needs, and of storage devices and their capabilities, and to automate the mapping of one to the other. One of many outcomes of the project was a specification language we called Rome -- hence the title of this paper, which offers a short retrospective on the approach and some of the lessons we learned along the way.

Journal ArticleDOI
TL;DR: This position paper argues that the primary benefits of off-loading can be captured with alternative mechanisms that eliminate off-loading's negative effects, and articulates such a mechanism with initial results that demonstrate promise.
Abstract: Large-scale multi-core chips open up the possibility of implementing heterogeneous cores on a single chip, where some cores can be customized to execute common code patterns. The operating system is an example of a common code pattern that is constantly executing on every processor. It is therefore a prime candidate for core customization. Recent work has begun to explore this possibility, where some fraction of system calls and other OS functionality is off-loaded to a separate special-purpose core. Studies have shown that this can improve overall system performance and power consumption. However, our explorations in this arena reveal that the primary benefits of off-loading can be captured with alternative mechanisms that eliminate the negative effects of off-loading. This position paper articulates this alternative mechanism with initial results that demonstrate promise.

Journal ArticleDOI
TL;DR: Presents OASES, OS and architectural support for an efficient and robust shadow memory implementation for multicores, and shows that the overheads of runtime monitoring tasks are significantly reduced in comparison to previous software implementations.
Abstract: Runtime monitoring support serves as a foundation for the important tasks of providing security, performing debugging, and improving performance of applications. Often runtime monitoring requires the maintenance of information associated with each of the application's original memory locations, which is held in corresponding shadow memory locations. Unfortunately, existing robust shadow memory implementations are inefficient. In this paper, we present OASES: OS and Architectural Support for Efficient Shadow memory implementation for multicores that is also robust. A combination of operating system support (in the form of coupled allocation of memory pages used by the application and associated shadow memory pages) and architectural support (in the form of ISA support and exposed cache events) is proposed. Our page allocation policy enables fast translation of original addresses into corresponding shadow memory addresses, thus allowing implicit addressing of shadow memory. By exposing cache events to the software, we ensure in software that the shadow memory instructions execute atomically with their corresponding original memory instructions. Our experiments show that the overheads of runtime monitoring tasks are significantly reduced in comparison to previous software implementations.
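
The heart of implicit shadow addressing can be shown abstractly (OASES achieves this with coupled page allocation and ISA support; the sketch below is plain arithmetic): couple each application address with a shadow address at a fixed offset, so translation is a single add.

    # Implicit shadow-memory addressing via a constant offset.
    PAGE = 4096
    SHADOW_BASE = 1 << 40     # hypothetical region reserved for shadow pages

    def shadow_addr(app_addr):
        return SHADOW_BASE + app_addr      # implicit, constant-time mapping

    memory, shadow = {}, {}

    def monitored_store(addr, value, taint):
        memory[addr] = value               # original memory instruction...
        shadow[shadow_addr(addr)] = taint  # ...and its atomic shadow update

    monitored_store(0x7f0000, 42, taint="from network")
    print(shadow[shadow_addr(0x7f0000)])   # 'from network'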

Journal ArticleDOI
TL;DR: The CHART System is described as a set of five interacting services and protocol improvements which act together to make TCP/IP robust under conditions of loss and latency; the test regime and performance results are also detailed.
Abstract: TCP/IP is known to have poor performance under conditions of moderate to high packet loss (5%-20%) and end-to-end latency (20-200 ms). The CHART system, under development by HP and its partners under contract to the US Defense Advanced Research Projects Agency, is a careful re-engineering of Internet Layer 3 and Layer 4 protocols to improve TCP/IP performance in these cases. The CHART system has just completed the second phase of a three-phase, 42-month development cycle. The goal for the 42-month program was a 10x improvement in the performance of TCP/IP under conditions of loss and delay. In independent tests for DARPA at Science Applications International Corporation, the CHART System demonstrated a 20x performance improvement over TCP/IP, exceeding the goals for the program by a factor of two. Fairness to legacy TCP and UDP flows was further demonstrated in DARPA testing. We describe the CHART System as a set of five interacting services and protocol improvements which act together to make TCP/IP robust under conditions of loss and latency, and we detail the test regime and performance results.
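
The regime matters because of the well-known Mathis et al. bound, throughput <= MSS / (RTT * sqrt(loss)); evaluating it at the paper's loss and latency ranges shows why stock TCP collapses there (numbers come from the formula, not from the CHART tests).

    # TCP throughput upper bound at the paper's loss/latency corner points.
    from math import sqrt

    MSS = 1460 * 8                      # bits per segment
    for rtt in (0.020, 0.200):          # 20 ms and 200 ms
        for loss in (0.05, 0.20):       # 5% and 20% packet loss
            bps = MSS / (rtt * sqrt(loss))
            print(f"RTT {rtt*1000:4.0f} ms, loss {loss:4.0%}: "
                  f"<= {bps/1e6:5.2f} Mbit/s")

Even the best case (20 ms, 5% loss) caps below 3 Mbit/s, which is the gap the CHART re-engineering targets.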