
Showing papers in "Operating Systems Review in 2011"


Journal ArticleDOI
TL;DR: TaintEraser is a new tool that tracks the movement of sensitive user data as it flows through off-the-shelf applications while precisely scrubbing user-defined sensitive data that would otherwise have been exposed to restricted output channels.
Abstract: We present TaintEraser, a new tool that tracks the movement of sensitive user data as it flows through off-the-shelf applications. TaintEraser uses application-level dynamic taint analysis to let users run applications in their own environment while preventing unwanted information exposure. It is made possible by techniques we developed for accurate and efficient tainting: (1) Semantic-aware instruction-level tainting is critical to track taint accurately, without explosion or loss. (2) Function summaries provide an interface to handle taint propagation within the kernel and reduce the overhead of instruction-level tracking. (3) On-demand instrumentation enables fast loading of large applications. Together, these techniques let us analyze large, multi-threaded, networked applications in near real-time. In tests on Internet Explorer, Yahoo! Messenger, and Windows Notepad, TaintEraser generated no false positives and instrumented fewer than 5% of the executed instructions while precisely scrubbing user-defined sensitive data that would otherwise have been exposed to restricted output channels. Our research provides the first evidence that it is viable to track taint accurately and efficiently for real, interactive applications running on commodity hardware.
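
As a rough illustration of the byte-level taint propagation and scrubbing the abstract describes, here is a hypothetical Python sketch. The class, its methods, and the scrubbing policy are invented for this example and are not TaintEraser's actual interface; a function summary is modeled as a single call that propagates taint for a whole `memcpy` instead of tracing each move.

```python
class TaintTracker:
    def __init__(self):
        self.taint = {}  # location -> set of taint labels

    def set_taint(self, loc, label):
        self.taint[loc] = {label}

    def get(self, loc):
        return self.taint.get(loc, set())

    # semantic-aware propagation: a move copies the source's taint to dst
    def mov(self, dst, src):
        t = self.get(src)
        if t:
            self.taint[dst] = set(t)
        else:
            self.taint.pop(dst, None)

    # an arithmetic op unions the taints of its operands
    def add(self, dst, src):
        t = self.get(dst) | self.get(src)
        if t:
            self.taint[dst] = t

    # function summary, e.g. for memcpy: propagate taint in one step
    # instead of instrumenting every instruction inside the routine
    def summary_memcpy(self, dst_base, src_base, n):
        for i in range(n):
            self.mov((dst_base, i), (src_base, i))

    # scrub tainted bytes before they reach a restricted output channel
    def scrub_if_tainted(self, loc, value):
        return "\x00" * len(value) if self.get(loc) else value

tr = TaintTracker()
tr.set_taint(("buf", 0), "password")
tr.summary_memcpy("out", "buf", 4)
scrubbed = tr.scrub_if_tainted(("out", 0), "hunt")
```

The real tool tracks taint at instruction granularity inside a binary-instrumentation framework; this sketch only shows the propagation and scrubbing logic.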

175 citations


Journal ArticleDOI
TL;DR: This paper describes two communication libraries available on the Single-Chip Cloud Computer: RCCE, a light-weight, minimal library for writing message-passing parallel applications, and Rckmb, which provides the data link layer for network services such as TCP/IP; both use the SCC's non-cache-coherent shared memory to transfer data between cores without going off-chip.
Abstract: Many-core chips are changing the way high-performance computing systems are built and programmed. As it is becoming increasingly difficult to maintain cache coherence across many cores, manufacturers are exploring designs that do not feature any cache coherence between cores. Communications on such chips are naturally implemented using message passing, which makes them resemble clusters, but with an important difference. Special hardware can be provided that supports very fast on-chip communications, reducing latency and increasing bandwidth. We present one such chip, the Single-Chip Cloud Computer (SCC). This is an experimental processor, created by Intel Labs. We describe two communication libraries available on SCC: RCCE and Rckmb. RCCE is a light-weight, minimal library for writing message passing parallel applications. Rckmb provides the data link layer for running network services such as TCP/IP. Both utilize SCC's non-cache-coherent shared memory for transferring data between cores without needing to go off-chip. In this paper we describe the design and implementation of RCCE and Rckmb. To compare their performance, we consider simple benchmarks run with RCCE, and MPI over TCP/IP.
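
The communication style the abstract describes (one-sided puts and gets into an on-chip message-passing buffer, synchronized by flags) can be mimicked in a few lines. This single-threaded Python model is illustrative only: dicts stand in for the SCC's shared memory, and the function names are loosely patterned after RCCE, not its actual API.

```python
MPB = {}    # (core, offset) -> bytes; stands in for on-chip MPB SRAM
FLAGS = {}  # (core, name) -> 0/1; synchronization flags also live in the MPB

def rcce_put(dest_core, offset, data):
    MPB[(dest_core, offset)] = data        # one-sided write into receiver's MPB

def rcce_get(core, offset):
    return MPB[(core, offset)]

def rcce_send(data, dest):
    rcce_put(dest, 0, data)                # payload moves on-chip, not off-chip
    FLAGS[(dest, "ready")] = 1             # raise the flag for the receiver

def rcce_recv(core):
    # a real receiver would spin on the flag; this single-threaded demo
    # only checks that it has been raised
    assert FLAGS.get((core, "ready")) == 1
    FLAGS[(core, "ready")] = 0
    return rcce_get(core, 0)

rcce_send(b"hello core 1", dest=1)
msg = rcce_recv(1)
```

Because there is no cache coherence between SCC cores, the real library must also manage explicit buffer flushes, which this sketch omits.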

93 citations


Journal ArticleDOI
TL;DR: A novel approach to intrusion detection of virtual server environments which utilizes only information available from the perspective of the virtual machine monitor (VMM), showing that by working entirely at the VMM-level, this approach is able to capture enough information to characterize normal executions and identify the presence of abnormal malicious behavior.
Abstract: As virtualization technology gains in popularity, so do attempts to compromise the security and integrity of virtualized computing resources. Anti-virus software and firewall programs are typically deployed in the guest virtual machine to detect malicious software. These security measures are effective in detecting known malware, but do little to protect against new variants of intrusions. Intrusion detection systems (IDSs) can be used to detect malicious behavior. Most intrusion detection systems for virtual execution environments track behavior at the application or operating system level, using virtualization as a means to isolate themselves from a compromised virtual machine. In this paper, we present a novel approach to intrusion detection of virtual server environments which utilizes only information available from the perspective of the virtual machine monitor (VMM). Such an IDS can harness the ability of the VMM to isolate and manage several virtual machines (VMs), making it possible to provide monitoring of intrusions at a common level across VMs. It also offers unique advantages over recent advances in intrusion detection for virtual machine environments. By working purely at the VMM-level, the IDS does not depend on structures or abstractions visible to the OS (e.g., file systems), which are susceptible to attacks and can be modified by malware to contain corrupted information (e.g., the Windows registry). In addition, being situated within the VMM provides ease of deployment as the IDS is not tied to a specific OS and can be deployed transparently below different operating systems. Due to the semantic gap between the information available to the VMM and the actual application behavior, we employ the power of data mining techniques to extract useful nuggets of knowledge from the raw, low-level architectural data.
We show in this paper that by working entirely at the VMM-level, we are able to capture enough information to characterize normal executions and identify the presence of abnormal malicious behavior. Our experiments on over 300 real-world malware and exploits illustrate that there is sufficient information embedded within the VMM-level data to allow accurate detection of malicious attacks, with an acceptable false alarm rate.
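
To make the idea concrete, here is an illustrative sketch (not the paper's actual model) of anomaly detection over VMM-visible event counts: "normal" executions are summarized as vectors of architectural event counts, and a run is flagged when it sits far from every normal profile. The feature names and threshold are made up for the demo.

```python
import math

def distance(a, b):
    # Euclidean distance between two event-count vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class AnomalyDetector:
    def __init__(self, threshold):
        self.normal = []          # profiles of known-good executions
        self.threshold = threshold

    def train(self, event_vector):
        self.normal.append(event_vector)

    def is_anomalous(self, event_vector):
        # flag the run if it is far from every normal profile
        return min(distance(event_vector, n) for n in self.normal) > self.threshold

det = AnomalyDetector(threshold=10.0)
det.train([120, 4, 30])    # e.g. [page faults, I/O exits, CR3 switches]
det.train([118, 5, 28])
ok = det.is_anomalous([119, 4, 29])      # near the normal profiles
bad = det.is_anomalous([500, 90, 300])   # far outside the normal region
```

The paper's detector is built with real data-mining techniques over much richer low-level data; this nearest-profile rule only conveys the shape of the approach.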

88 citations


Journal ArticleDOI
TL;DR: This paper outlines a monitoring and prediction framework for heterogeneity, along with software support to take advantage of this information, and shows that the proposed techniques can provide significant performance and power-efficiency advantages on heterogeneous platforms.
Abstract: Almost all hardware platforms to date have been homogeneous with one or more identical processors managed by the operating system (OS). However, recently, it has been recognized that power constraints and the need for domain-specific high performance computing may lead architects towards building heterogeneous architectures and platforms in the near future. In this paper, we consider the three types of heterogeneous core architectures: (a) Virtual asymmetric cores: multiple processors that have identical core micro-architectures and ISA but each running at a different frequency point or perhaps having a different cache size, (b) Physically asymmetric cores: heterogeneous cores, each with a fundamentally different microarchitecture (in-order vs. out-of-order for instance) running at similar or different frequencies, with identical ISA and (c) Hybrid cores: multiple cores, where some cores have tightly-coupled hardware accelerators or special functional units. We show case studies that highlight why existing OS and hardware interaction in such heterogeneous architectures is inefficient and causes loss in application performance, throughput efficiency and lack of quality of service. We then discuss hardware and software support needed to address these challenges in heterogeneous platforms and establish efficient heterogeneous environments for platforms in the next decade. In particular, we will outline a monitoring and prediction framework for heterogeneity along with software support to take advantage of this information. Based on measurements on real platforms, we will show that these proposed techniques can provide significant advantage in terms of performance and power efficiency in heterogeneous platforms.
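
A hedged sketch of the kind of monitoring/prediction loop the paper argues for: sample per-thread behavior, predict each thread's benefit on each core type, and bias scheduling accordingly. The IPC model, the power costs, and all names below are invented for illustration and are not the paper's framework.

```python
CORES = {"big": {"freq_ghz": 3.0}, "small": {"freq_ghz": 1.5}}
POWER = {"big": 1.5, "small": 1.0}   # assumed relative power cost per core type

def predict_ipc(thread, core):
    # toy model: memory-bound threads gain little from a faster core,
    # because they mostly wait on memory regardless of frequency
    compute_fraction = 1.0 - thread["mem_intensity"]
    return compute_fraction * CORES[core]["freq_ghz"] + thread["mem_intensity"] * 0.5

def best_core(thread):
    # pick the core type maximizing predicted performance per watt
    return max(CORES, key=lambda c: predict_ipc(thread, c) / POWER[c])

cpu_bound = {"mem_intensity": 0.1}
mem_bound = {"mem_intensity": 0.9}
choice_cpu = best_core(cpu_bound)    # compute-heavy work favors the big core
choice_mem = best_core(mem_bound)    # memory-bound work favors the small core
```

The point of such a framework is that these predictions are fed by hardware monitoring rather than hard-coded constants.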

51 citations


Journal ArticleDOI
TL;DR: HipG is a distributed framework that facilitates programming parallel graph algorithms by composing the parallel application automatically from user-defined pieces of sequential work on graph nodes; it provides a unified interface for executing methods on local and non-local graph nodes and an abstraction of exclusive execution.
Abstract: Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates programming parallel graph algorithms by composing the parallel application automatically from the user-defined pieces of sequential work on graph nodes. To make the user code high-level, the framework provides a unified interface to executing methods on local and non-local graph nodes and an abstraction of exclusive execution. The graph computations are managed by logical objects called synchronizers, which we used, for example, to implement distributed divide-and-conquer decomposition into strongly connected components. The code written in HipG is independent of a particular graph representation, to the point that the graph can be created on-the-fly, i.e. by the algorithm that computes on this graph, which we used to implement a distributed model checker. HipG programs are in general short and elegant; they achieve good portability, memory utilization, and performance.
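
A toy sketch of HipG's programming style, with all names invented: the user writes a sequential method on a graph node, and the framework composes the traversal, transparently invoking the method on local or non-local nodes. Here a single-process recursive visit stands in for the distributed execution.

```python
class Node:
    def __init__(self, nid):
        self.id = nid
        self.neighbors = []
        self.visited = False

    def visit(self, graph, out):
        # user-defined sequential work on one graph node
        if self.visited:
            return
        self.visited = True
        out.append(self.id)
        for n in self.neighbors:
            # in HipG this call may target a node on another machine;
            # the framework makes local and non-local calls look the same
            graph[n].visit(graph, out)

graph = {i: Node(i) for i in range(4)}
graph[0].neighbors = [1, 2]
graph[1].neighbors = [3]
order = []
graph[0].visit(graph, order)   # depth-first order: [0, 1, 3, 2]
```

HipG's synchronizers, exclusive-execution abstraction, and distribution logic are all elided here; only the "sequential work per node" programming model is shown.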

32 citations


Journal ArticleDOI
TL;DR: RFS can deliver a good user experience under undependable network conditions, allowing mobile users to seamlessly, and safely, use the cloud for data storage.
Abstract: Due to the increasing number of applications (and their data) being placed on mobile devices, access to dependable storage is becoming a key issue in mobile system design -- and cloud storage is becoming an attractive solution. However, this introduces a number of new issues related to unpredictable wireless network connectivity and data privacy over the network. In this article we present RFS, a wireless-friendly network file system for mobile devices and the cloud. RFS provides device-aware cache management and client-driven data security and privacy protection. We implement the RFS client in the Linux kernel and the RFS server with Amazon S3 cloud storage, and we employ two new optimizations: server prepush (a server-side data pre-fetching mechanism) and client reintegration (synchronizing a mobile device's cache with the cloud). The empirical results over wired, WiFi and 3G networks show that RFS achieves good performance compared to Coda and FS-Cache, and it visibly reduces network activity. Further, the privacy overhead is acceptable when RFS is run over wireless networks. We present a case study of booting Android over RFS, thereby demonstrating the ability of RFS to host a full mobile system. Overall, RFS can deliver a good user experience under undependable network conditions, allowing mobile users to seamlessly, and safely, use the cloud for data storage.
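
The "client-driven data security" idea can be sketched as: encrypt on the device, so the cloud store only ever holds ciphertext. The sketch below is illustrative only; it uses a toy XOR stream cipher (NOT real cryptography, and not RFS's actual scheme), and the `rfs_*` names are invented.

```python
import hashlib

def keystream(key, n):
    # derive n pseudo-random bytes from the key (toy construction)
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(key, data):
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

decrypt = encrypt  # an XOR stream cipher is its own inverse

cloud = {}  # stands in for the S3-backed RFS server

def rfs_write(key, path, data):
    cloud[path] = encrypt(key, data)   # only ciphertext leaves the client

def rfs_read(key, path):
    return decrypt(key, cloud[path])

k = b"device-secret"
rfs_write(k, "/notes.txt", b"private data")
plain = rfs_read(k, "/notes.txt")
```

The key never leaves the device, so the cloud provider cannot read the data; RFS additionally has to integrate this with caching and reintegration, which the sketch omits.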

27 citations


Journal ArticleDOI
TL;DR: This work proposes a software router architecture that parallelizes router functionality both across multiple servers and across multiple cores within a single server, and demonstrates a 40Gbps parallel router prototype.
Abstract: We revisit the problem of scaling software routers, motivated by recent advances in server technology that enable high-speed parallel processing, a feature router workloads appear ideally suited to exploit. We propose a software router architecture that parallelizes router functionality both across multiple servers and across multiple cores within a single server. By carefully exploiting parallelism at every opportunity, we demonstrate a 40Gbps parallel router prototype; this router capacity can be linearly scaled through the use of additional servers. Our prototype router is fully programmable using the familiar Click/Linux environment and is built entirely from off-the-shelf, general-purpose server hardware. We also describe some of the lessons learned while supporting field deployments of RouteBricks-based software routers.
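
The core parallelization idea can be sketched as flow-hash dispatch: spread packets across servers, and across cores within a server, by hashing the flow so each flow stays on one worker and per-flow packet order is preserved. This is a generic illustration, not RouteBricks' actual code.

```python
def flow_hash(pkt):
    # same (src, dst) pair always hashes to the same value within a run
    return hash((pkt["src"], pkt["dst"]))

def dispatch(packets, n_servers, cores_per_server):
    # queues[server][core] mimics the two levels of parallelism
    queues = [[[] for _ in range(cores_per_server)] for _ in range(n_servers)]
    for pkt in packets:
        h = flow_hash(pkt)
        s = h % n_servers                      # first level: pick a server
        c = (h // n_servers) % cores_per_server  # second level: pick a core
        queues[s][c].append(pkt)
    return queues

pkts = [{"src": "10.0.0.1", "dst": "10.0.0.9", "seq": i} for i in range(3)]
pkts += [{"src": "10.0.0.2", "dst": "10.0.0.9", "seq": i} for i in range(2)]
q = dispatch(pkts, n_servers=2, cores_per_server=4)
# all packets of one flow land in one queue, preserving per-flow order
flat = [p["seq"] for srv in q for core in srv for p in core
        if p["src"] == "10.0.0.1"]
```

Keeping a flow on one core avoids cross-core synchronization and packet reordering, which is what lets capacity scale roughly linearly with added servers.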

21 citations


Journal ArticleDOI
TL;DR: This paper analyzes the software challenges to the operating system and the application layer software on a heterogeneous system with functional asymmetry, where the ISA of the small and big cores overlaps, and proposes solutions.
Abstract: Heterogeneous processors that mix big high performance cores with small low power cores promise excellent single-threaded performance coupled with high multi-threaded throughput and higher performance-per-watt. A significant portion of the commercial multicore heterogeneous processors are likely to have a common instruction set architecture (ISA). However, due to limited design resources and goals, each core is likely to contain ISA extensions not yet implemented in the other core. Therefore, such heterogeneous processors will have inherent functional asymmetry at the ISA level and face significant software challenges. This paper analyzes the software challenges to the operating system and the application layer software on a heterogeneous system with functional asymmetry, where the ISA of the small and big cores overlaps. We look at the widely deployed Intel® Architecture and propose solutions to the software challenges that arise when a heterogeneous processor is designed around it. We broadly categorize functional asymmetries into those that can be exposed to application software and those that should be handled by system software. While one can argue that new software written should be heterogeneity-aware, it is important that we find ways in which legacy software can extract the best performance from heterogeneous multicore systems.
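
One system-software strategy for ISA-level functional asymmetry is fault-and-migrate: if a thread hits an instruction the small core lacks, the OS migrates it to a core that implements the extension instead of killing it. The sketch below is an invented illustration of that strategy, not the paper's proposal; the instruction names and ISA sets are made up.

```python
BIG_ISA = {"add", "mul", "avx_fma"}
SMALL_ISA = {"add", "mul"}          # lacks the avx_fma extension

def run(thread, core_isa, migrate):
    for ins in thread["code"]:
        if ins not in core_isa:
            # models an undefined-instruction fault trapping to the OS
            return migrate(thread, ins)
    return "completed on small core"

def migrate_to_big(thread, ins):
    # system software resumes the thread on a core that implements `ins`
    assert ins in BIG_ISA
    return "migrated to big core at '%s'" % ins

legacy = {"code": ["add", "avx_fma", "mul"]}
result = run(legacy, SMALL_ISA, migrate_to_big)
```

This keeps legacy binaries working on the small cores for the common case while still exploiting the big cores' extensions when they are actually used.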

16 citations


Journal ArticleDOI
TL;DR: This paper highlights techniques and features of the Log-Based Architectures project that reduce the slowdown to just 2%--51% for sequential programs and 28%--51% for parallel programs.
Abstract: While application performance and power-efficiency are both important, application correctness is even more important. In other words, if the application is misbehaving, it is little consolation that it is doing so quickly or power-efficiently. In the Log-Based Architectures (LBA) project, we are focusing on a challenging source of application misbehavior: software bugs, including obscure bugs that only cause problems during security attacks. To help detect and fix software bugs, we have been exploring techniques for accelerating dynamic program monitoring tools, which we call "lifeguards". Lifeguards are typically written today using dynamic binary instrumentation frameworks such as Valgrind or Pin. Due to the overheads of binary instrumentation, lifeguards that require instruction-grain information typically experience 30X-100X slowdowns, and hence it is only practical to use them during explicit debug cycles. The goal in the LBA project is to reduce these overheads to the point where lifeguards can run continuously on deployed code. To accomplish this, we propose hardware mechanisms to create a dynamic log of instruction-level events in the monitored application and stream this information to one or more software lifeguards running on separate cores on the same multicore processor. In this paper, we highlight techniques and features of LBA that reduce the slowdown to just 2%--51% for sequential programs and 28%--51% for parallel programs.
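
A toy sketch of a lifeguard consuming an instruction-level event log, in the spirit of LBA (the event format and the checker are invented for this example): the monitored program streams malloc/free/load records, and the lifeguard flags loads from unallocated memory.

```python
def lifeguard(log):
    # the log would be streamed from the monitored core's hardware log;
    # here it is just a list of (event, address) records
    allocated, errors = set(), []
    for event, addr in log:
        if event == "malloc":
            allocated.add(addr)
        elif event == "free":
            allocated.discard(addr)
        elif event == "load" and addr not in allocated:
            errors.append(addr)   # use-after-free or wild read
    return errors

log = [("malloc", 0x1000), ("load", 0x1000),
       ("free", 0x1000), ("load", 0x1000)]   # the second load is a bug
found = lifeguard(log)
```

The LBA point is that this checker runs on a separate core fed by a hardware-generated log, so the monitored program pays far less than binary-instrumentation overheads.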

15 citations


Journal ArticleDOI
TL;DR: Building on concrete examples from past work on APIs, performance points are shown to be an effective way to better exploit asymmetries and gain the power/performance improvements promised by heterogeneous multicore systems.
Abstract: Trends indicate a rapid increase in the number of cores on chip, exhibiting various types of performance and functional asymmetries present in hardware to gain scalability with balanced power vs. performance requirements. This poses new challenges in platform resource management, which are further exacerbated by the need for runtime power budgeting and by the increased dynamics in workload behavior observed in consolidated datacenter and cloud computing systems. This paper considers the implications of these challenges for the virtualization layer of abstraction, which is the base layer for resource management in such heterogeneous multicore platforms. Specifically, while existing and upcoming management methods routinely leverage system-level information available to the hypervisor about current and global platform state, we argue that for future systems there will be an increased necessity for additional information about applications and their needs. This 'end-to-end' argument leads us to propose 'performance points' as a general interface between the virtualization system and higher layers like the guest operating systems that run application workloads. Building on concrete examples from past work on APIs with which applications can inform systems of phase or workload changes and conversely, with which systems can indicate to applications desired changes in power consumption, performance points are shown to be an effective way to better exploit asymmetries and gain the power/performance improvements promised by heterogeneous multicore systems.
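
A hedged sketch of what a 'performance points' interface might look like: each guest exports a small table of (performance level, power cost) operating points, and the virtualization layer picks one per guest under a power budget. The greedy policy, class names, and numbers below are all invented for illustration; the paper defines the interface, not this policy.

```python
class Guest:
    def __init__(self, name, points):
        self.name = name
        self.points = points  # list of (performance, watts), from the guest

def pick_points(guests, power_budget):
    # greedy: give each guest the highest point that still fits the budget
    chosen, remaining = {}, power_budget
    for g in guests:
        feasible = [(perf, w) for perf, w in g.points if w <= remaining]
        perf, w = max(feasible) if feasible else (0, 0)
        chosen[g.name] = perf
        remaining -= w
    return chosen

web = Guest("web", [(1, 5), (2, 10), (3, 20)])
db = Guest("db", [(1, 8), (2, 15)])
allocation = pick_points([web, db], power_budget=25)
```

The key property is the direction of information flow: guests describe what performance levels mean to them, so the hypervisor can budget power without guessing at application needs.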

10 citations


Journal ArticleDOI
Shoumeng Yan1, Xiaocheng Zhou1, Ying Gao1, Hu Chen1, Gansha Wu1, Sai Luo1, Bratin Saha1 
TL;DR: This paper describes the approaches, experiences, and results in optimizing MYO on a heterogeneous platform consisting of a CPU and an Aubrey Isle accelerator, and demonstrates that users need not sacrifice performance for programmability.
Abstract: The client computing platform is moving towards a heterogeneous architecture that combines scalar-oriented CPU cores and throughput-oriented accelerator cores. Recognizing that existing programming models for such heterogeneous platforms are still difficult for most programmers, we advocate a shared virtual memory programming model to improve programmability. In this paper, we focus on performance, and demonstrate that users need not sacrifice performance for programmability. We describe our approaches, experiences, and results in optimizing MYO on a heterogeneous platform consisting of a CPU and an Aubrey Isle accelerator. Our efforts involve the whole system software stack including the OS, runtime, and application.

Journal ArticleDOI
TL;DR: This paper presents the architecture and motivation for a cluster-based, many-core computing architecture for energy-efficient, data-intensive computing, and explains how the longer-term implications of FAWN lead to a tightly integrated stacked chip-and-memory architecture for future FAWN development.
Abstract: This paper presents the architecture and motivation for a cluster-based, many-core computing architecture for energy-efficient, data-intensive computing. FAWN, a Fast Array of Wimpy Nodes, consists of a large number of slower but efficient nodes coupled with low-power storage. We present the computing trends that motivate a FAWN-like approach, for CPU, memory, and storage. We follow with a set of microbenchmarks to explore under what workloads these FAWN nodes perform well (or perform poorly), and briefly examine scenarios in which both code and algorithms may need to be re-designed or optimized to perform well on an efficient platform. We conclude with an outline of the longer-term implications of FAWN that lead us to select a tightly integrated stacked chip-and-memory architecture for future FAWN development.
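
The FAWN argument reduces to a per-joule comparison: many slow-but-efficient nodes can beat a few fast nodes on queries per joule for I/O-bound workloads. The numbers below are invented for illustration and are not the paper's measurements.

```python
def queries_per_joule(qps_per_node, watts_per_node):
    # energy efficiency: sustained queries per second divided by power draw
    return qps_per_node / watts_per_node

# hypothetical nodes: one conventional server vs. one wimpy node
brawny = queries_per_joule(qps_per_node=50000, watts_per_node=250)  # 200 q/J
wimpy = queries_per_joule(qps_per_node=10000, watts_per_node=20)    # 500 q/J
```

Since a cluster's throughput scales with node count but the energy bill scales with total watts, the higher queries-per-joule node wins at fixed power, provided the workload parallelizes.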

Journal ArticleDOI
TL;DR: This paper presents an online algorithm for affinity driven distributed scheduling of multi-place parallel computations that uses a low time and message complexity mechanism for ensuring affinity and a randomized work-stealing mechanism within places for load balancing.
Abstract: With the advent of many-core architectures and strong need for Petascale (and Exascale) performance in scientific domains and industry analytics, efficient scheduling of parallel computations for higher productivity and performance has become very important. Further, movement of massive amounts (Terabytes to Petabytes) of data is very expensive, which necessitates affinity driven computations. Therefore, distributed scheduling of parallel computations on multiple places needs to optimize multiple performance objectives: follow affinity maximally and ensure efficient space, time and message complexity. Simultaneous consideration of these objectives makes distributed scheduling a particularly challenging problem. In addition, parallel computations have data dependent execution patterns which require online scheduling to effectively optimize the computation orchestration as it unfolds. This paper presents an online algorithm for affinity driven distributed scheduling of multi-place parallel computations. To optimize multiple performance objectives simultaneously, our algorithm uses a low time and message complexity mechanism for ensuring affinity and a randomized work-stealing mechanism within places for load balancing. Theoretical analysis of the expected and probabilistic lower and upper bounds on time and message complexity of this algorithm has been provided. On multi-core clusters such as Blue Gene/P (MPP architecture) and Intel multicore cluster, we demonstrate performance close to the custom MPI+Pthreads code. Further, strong, weak and data (increasing input data size) scalability have been demonstrated on multi-core clusters. Using well known benchmarks, we demonstrate 16% to 30% performance gain as compared to Cilk [6] on multi-core Intel Xeon 5570 (NUMA) architecture. Detailed experimental analysis illustrates efficient space (main memory) utilization as well.
To the best of our knowledge, this is the first time a multi-objective, affinity-driven distributed scheduling algorithm has been designed, theoretically analyzed and experimentally evaluated in a multi-place setup for multi-core cluster architectures.
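
The intra-place randomized work stealing that the abstract combines with affinity-driven placement is a standard Cilk-style mechanism; the demo below sketches that mechanism only (it is not the paper's algorithm). Tasks stay in the place their data lives in, and idle workers steal only from workers in the same place.

```python
import random

def schedule(tasks, places, workers_per_place, seed=0):
    rng = random.Random(seed)
    # deques[place][worker]: each task is pushed in its affinity place
    deques = [[[] for _ in range(workers_per_place)] for _ in range(places)]
    for t in tasks:
        deques[t["place"]][rng.randrange(workers_per_place)].append(t)
    executed = []
    for p in range(places):               # each place drains independently
        while any(deques[p]):
            for w in range(workers_per_place):
                if deques[p][w]:          # run local work (LIFO, cache-friendly)
                    executed.append(deques[p][w].pop()["id"])
                else:                     # idle: steal within the place (FIFO)
                    victims = [v for v in range(workers_per_place) if deques[p][v]]
                    if victims:
                        victim = rng.choice(victims)
                        executed.append(deques[p][victim].pop(0)["id"])
    return executed

tasks = [{"id": i, "place": i % 2} for i in range(6)]
done = schedule(tasks, places=2, workers_per_place=2)
```

Because steals never cross place boundaries, affinity is preserved while load still balances among the workers that share the data.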

Journal ArticleDOI
TL;DR: A simple and correct specification of an OS kernel in Z is proposed which simplifies the understanding and verification of operating system components.
Abstract: One of the mini challenges in software verification related to the Grand Challenge proposed by Tony Hoare concerns the formal specification and verification of an operating system kernel. This paper proposes a simple and correct specification of an OS kernel in Z which simplifies the understanding and verification of operating system components. Our current specification comprises process management, interprocess communication and a POSIX-compliant file system.

Journal ArticleDOI
Petros Maniatis1, Byung-Gon Chun1
TL;DR: The benefits of using "small," generic trusted primitives to increase the fault-tolerance of replicated systems and archival storage, and to improve the security of email SPAM and click-fraud prevention systems are described.
Abstract: Secure, fault-tolerant distributed systems are difficult to build, to validate, and to operate. Conservative design for such systems dictates that their security and fault tolerance depend on a very small number of assumptions taken on faith; such assumptions are typically called the "trusted computing base" (TCB) of a system. However, a rich trade-off exists between larger TCBs and more secure, more fault-tolerant, or more efficient systems. In our recent work, we have explored this trade-off by defining "small," generic trusted primitives--for example, an attested, monotonically sequenced FIFO buffer of a few hundred machine words guaranteed to hold appended words until eviction--and showing how such primitives can improve the performance, fault tolerance, and security of systems using them. In this article, we review our efforts in generating simple trusted primitives such as an attested circular buffer (called Attested Append-only Memory), and an attested human activity detector. We describe the benefits of using these primitives to increase the fault-tolerance of replicated systems and archival storage, and to improve the security of email spam and click-fraud prevention systems. Finally, we share some lessons we have learned from this endeavor.
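
A minimal sketch of an attested append-only log in the spirit of Attested Append-only Memory. The attestation scheme here (an HMAC over a monotonic sequence number) is a software stand-in for illustration, not the paper's hardware design; the class and method names are invented.

```python
import hmac
import hashlib

class AttestedLog:
    def __init__(self, key):
        self.key = key        # secret held inside the trusted primitive
        self.entries = []
        self.seq = 0

    def append(self, value):
        self.seq += 1         # monotonic: no rollback, no reuse of slots
        tag = hmac.new(self.key, b"%d|%s" % (self.seq, value),
                       hashlib.sha256).digest()
        self.entries.append((self.seq, value, tag))
        return self.seq, tag

    def verify(self, seq, value, tag):
        # anyone holding an attestation can check it was really appended
        good = hmac.new(self.key, b"%d|%s" % (seq, value),
                        hashlib.sha256).digest()
        return hmac.compare_digest(good, tag)

log = AttestedLog(key=b"tcb-secret")
s, t = log.append(b"state-hash-1")
valid = log.verify(s, b"state-hash-1", t)   # True
forged = log.verify(s, b"forged", t)        # False
```

The value of such a primitive is that replicas can no longer equivocate: once a statement is appended at a sequence number, the attestation pins it there.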

Journal ArticleDOI
TL;DR: The paper describes a mini-kernel project in the context of a Concurrent Programming course to implement Java monitors and interrupt handling on an FPGA board developed initially at EPFL for Computer Architecture courses.
Abstract: The paper describes a mini-kernel project in the context of a Concurrent Programming course. The goal of the project is to implement Java monitors and interrupt handling. The platform for the project is an FPGA board developed initially at EPFL for Computer Architecture courses.

Journal ArticleDOI
TL;DR: Bothnia achieves an average speedup of 3.6x compared to using the GPU as a device, primarily due to Bothnia's support for creation of shared virtual address space between heterogeneous threads of the same application spread on both IA CPU and GMA cores.
Abstract: In this paper, we introduce Bothnia, an extension to the Intel production graphics driver to support a shared virtual memory heterogeneous multithreading programming model. With Bothnia, the Intel graphics device driver can support both the traditional 3D graphics rendering software stack and a new class of heterogeneous multithreaded applications, which can use both IA (Intel Architecture) CPU cores and Intel integrated Graphics and Media Accelerator (GMA) cores in the same virtual address space. We describe the necessary architectural supports in both IA CPU and the GMA cores and present a reference Bothnia implementation. For a set of GPU accelerated media applications on a PC platform with Intel Core 2 Duo CPU and the Intel integrated GMA X3000 running under the Windows XP operating system, Bothnia achieves an average speedup of 3.6x compared to using the GPU as a device, primarily due to Bothnia's support for creation of shared virtual address space between heterogeneous threads of the same application spread on both IA CPU and GMA cores.

Journal ArticleDOI
TL;DR: This work proposes, implements and tests several distributed data structures, namely, two different types of counters, a queue, a stack and a linked list, and determines for each data structure the preferred mutual exclusion lock to use as the underlying locking mechanism.
Abstract: Distributed mutual exclusion locks are the de facto mechanism for concurrency control on distributed data structures. A process accesses the data structure only while holding the lock, and hence the process is guaranteed exclusive access. The popularity of this approach is largely due to the apparently simple programming model of such locks and the availability of efficient implementations. We study the relation between classical types of distributed locking mechanisms and several distributed data structures which use locking for synchronization. Our objectives are: to determine which of the two classical locking techniques -- token-based locking or permission-based locking -- is more efficient; our strategy to achieve this objective is to implement several locks and to compare their performance. To propose, implement and test several distributed data structures, namely, two different types of counters, a queue, a stack and a linked list; and for each of the data structures to determine the preferred mutual exclusion lock to be used as the underlying locking mechanism. To determine which of the two proposed counters is better, whether used as a stand-alone data structure or as a building block for implementing other higher-level data structures. Our testing environment consists of 20 Intel Xeon 2.4 GHz machines running the Windows XP OS with 2GB of RAM and using JRE version 1.4.2_08. All the machines were located inside the same LAN and were connected using a 20-port Cisco switch.
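
The message-cost difference between the two lock families can be sketched with a toy single-process model (protocol details heavily simplified and invented for illustration): a token-based lock passes one token toward the requester, while a permission-based lock needs a request and a reply from every peer on each acquire.

```python
class TokenLock:
    def __init__(self, n):
        self.n = n
        self.holder = 0                       # process 0 starts with the token

    def acquire(self, pid):
        hops = (pid - self.holder) % self.n   # token travels around the ring
        self.holder = pid
        return hops                           # messages for this acquire

class PermissionLock:
    def __init__(self, n):
        self.n = n

    def acquire(self, pid):
        # one request plus one reply for each of the n-1 peers, every time
        return 2 * (self.n - 1)

token, perm = TokenLock(5), PermissionLock(5)
t1, t2 = token.acquire(2), token.acquire(2)   # 2 hops, then 0 (token cached)
p1, p2 = perm.acquire(2), perm.acquire(2)     # always 8 messages
```

The model hints at why access locality matters in such comparisons: a token lock is nearly free for a process that re-acquires, while a permission lock pays the full peer round-trip every time.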

Journal ArticleDOI
Arun Raghunath1, John Keys1, Mona Vij1
TL;DR: Direct Data Flows is proposed, an SoC focused system architecture where the OS can configure fixed-function hardware modules to communicate data directly with each other, allowing the general purpose CPU to be opportunistically brought into lower power states, reducing overall power consumption.
Abstract: Reducing power consumption of Mobile Internet Devices (MID) and smartphones is critical as battery life is a key feature for mobility. Most vendors use System-On-Chip designs integrating more and more fixed-function hardware modules in a bid to reduce power consumption. On the other hand the explosion of new applications has increased the demand for PC-like processing capabilities on these devices. They are best supported by general purpose CPUs and Operating Systems which consume more power. Traditional system architectures focus on a data transfer model with the CPU as one of the endpoints. Consequently there are numerous usage scenarios where the general purpose CPU just acts as an intermediary between hardware modules, transferring data from a hardware module to memory and vice-versa. We propose Direct Data Flows, an SoC focused system architecture where the OS can configure fixed-function hardware modules to communicate data directly with each other. This eliminates unnecessary data hops and reduces CPU interrupts allowing the general purpose CPU to be opportunistically brought into lower power states, reducing overall power consumption. We have created a prototype Direct Data Flow setup for network file downloads which demonstrates up to 65% energy savings for typical file sizes.