
Showing papers in "ACM Transactions on Computer Systems in 2014"


Journal ArticleDOI
TL;DR: TaintDroid is an efficient, system-wide dynamic taint tracking and analysis system capable of simultaneously tracking multiple sources of sensitive data by leveraging Android's virtualized execution environment.
Abstract: Today’s smartphone operating systems frequently fail to provide users with visibility into how third-party applications collect and share their private data. We address these shortcomings with TaintDroid, an efficient, system-wide dynamic taint tracking and analysis system capable of simultaneously tracking multiple sources of sensitive data. TaintDroid enables realtime analysis by leveraging Android’s virtualized execution environment. TaintDroid incurs only 32% performance overhead on a CPU-bound microbenchmark and imposes negligible overhead on interactive third-party applications. Using TaintDroid to monitor the behavior of 30 popular third-party Android applications, in our 2010 study we found 20 applications potentially misused users’ private information; so did a similar fraction of the tested applications in our 2012 study. Monitoring the flow of privacy-sensitive data with TaintDroid provides valuable input for smartphone users and security service firms seeking to identify misbehaving applications.

2,983 citations
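
For readers unfamiliar with dynamic taint tracking, the mechanism behind TaintDroid can be pictured as follows: sensitive sources attach taint labels to values, every operation propagates the union of its inputs' labels, and sinks such as the network interface check whether any label survives. The C sketch below only illustrates that idea; TaintDroid itself propagates tags inside Android's Dalvik interpreter and IPC layer rather than in application code, and all names here are invented.

    /* Minimal sketch of variable-level dynamic taint tracking in the spirit of
     * TaintDroid. The single-word taint representation and all names are
     * illustrative, not TaintDroid's implementation. */
    #include <stdint.h>
    #include <stdio.h>

    #define TAINT_LOCATION (1u << 0)   /* hypothetical source labels */
    #define TAINT_IMEI     (1u << 1)

    typedef struct {
        int32_t  value;
        uint32_t taint;   /* bitvector: one bit per sensitive source */
    } tval;

    /* Source: reading a sensitive API attaches a taint label. */
    static tval read_device_id(void) {
        tval v = { .value = 12345, .taint = TAINT_IMEI };
        return v;
    }

    /* Propagation: a result carries the union of its inputs' labels. */
    static tval tadd(tval a, tval b) {
        tval r = { .value = a.value + b.value, .taint = a.taint | b.taint };
        return r;
    }

    /* Sink: before data leaves the device, check whether any taint survives. */
    static void network_send(tval v) {
        if (v.taint)
            printf("ALERT: tainted data (labels 0x%x) reaches the network\n",
                   (unsigned)v.taint);
        else
            printf("sent %d\n", v.value);
    }

    int main(void) {
        tval id = read_device_id();
        tval plain = { .value = 7, .taint = 0 };
        network_send(tadd(id, plain));   /* the taint survives the computation */
        return 0;
    }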


Journal ArticleDOI
TL;DR: An in-depth coverage of the comprehensive machine-checked formal verification of seL4, a general-purpose operating system microkernel, and the experience in maintaining this evolving formally verified code base.
Abstract: We present an in-depth coverage of the comprehensive machine-checked formal verification of seL4, a general-purpose operating system microkernel. We discuss the kernel design we used to make its verification tractable. We then describe the functional correctness proof of the kernel's C implementation and we cover further steps that transform this result into a comprehensive formal verification of the kernel: a formally verified IPC fastpath, a proof that the binary code of the kernel correctly implements the C semantics, a proof of correct access-control enforcement, a proof of information-flow noninterference, a sound worst-case execution time analysis of the binary, and an automatic initialiser for user-level systems that connects kernel-level access-control enforcement with reasoning about system behaviour. We summarise these results and show how they integrate to form a coherent overall analysis, backed by machine-checked, end-to-end theorems. The seL4 microkernel is currently not just the only general-purpose operating system kernel that is fully formally verified to this degree. It is also the only example of formal proof of this scale that is kept current as the requirements, design and implementation of the system evolve over almost a decade. We report on our experience in maintaining this evolving formally verified code base.

314 citations
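
As a point of reference, the functional-correctness result at the heart of this verification is a refinement statement: every behaviour of the C implementation is also a behaviour of the abstract specification, so properties proved of the specification carry down to the code. The rendering below is schematic only, not the exact Isabelle/HOL formulation used in the proofs.

    % Schematic form of functional correctness as refinement:
    % the implementation's behaviours are contained in the abstract spec's.
    \[
      \mathcal{M}_{C} \sqsubseteq \mathcal{M}_{A}
      \iff
      \mathrm{traces}(\mathcal{M}_{C}) \subseteq \mathrm{traces}(\mathcal{M}_{A})
    \]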


Journal ArticleDOI
TL;DR: Simulations show that reduced-precision writes in multi-level phase-change memory cells can be 1.7× faster on average and using failed blocks can improve array lifetime by 23% on average with quality loss under 10%.
Abstract: Memories today expose an all-or-nothing correctness model that incurs significant costs in performance, energy, area, and design complexity. But not all applications need high-precision storage for all of their data structures all of the time. This article proposes mechanisms that enable applications to store data approximately and shows that doing so can improve the performance, lifetime, or density of solid-state memories. We propose two mechanisms. The first allows errors in multilevel cells by reducing the number of programming pulses used to write them. The second mechanism mitigates wear-out failures and extends memory endurance by mapping approximate data onto blocks that have exhausted their hardware error correction resources. Simulations show that reduced-precision writes in multilevel phase-change memory cells can be 1.7× faster on average and using failed blocks can improve array lifetime by 23% on average with quality loss under 10%.

236 citations
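
A rough sketch of how software might drive the two mechanisms, reduced-precision multilevel-cell writes and reuse of blocks with exhausted error correction, is given below. The interface, pulse budgets, and block layout are hypothetical; the paper's mechanisms live in the memory controller and cell-programming circuitry rather than behind a C API.

    /* Hypothetical sketch of an approximate-storage interface in the spirit of
     * the two mechanisms above. Names and numbers are illustrative only. */
    #include <stdbool.h>
    #include <stddef.h>

    enum precision { PRECISE, APPROX };

    struct block {
        bool ecc_exhausted;   /* block has run out of error-correction resources */
    };

    /* Stub for the device-level write; pulse_budget caps iterative programming. */
    static void program_cells(struct block *b, const void *data, size_t len,
                              int pulse_budget) {
        (void)b; (void)data; (void)len; (void)pulse_budget;
    }

    /* Approximate data may be placed on failed blocks; precise data may not. */
    static struct block *alloc_block(enum precision p, struct block *pool, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (p == APPROX || !pool[i].ecc_exhausted)
                return &pool[i];
        return NULL;
    }

    /* Fewer programming pulses => faster but less precise multilevel-cell writes. */
    static void store(struct block *b, const void *data, size_t len, enum precision p) {
        int pulses = (p == APPROX) ? 2 : 8;   /* illustrative budgets */
        program_cells(b, data, len, pulses);
    }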


Journal ArticleDOI
TL;DR: This article advocates for extending standard operating system services and abstractions to GPUs in order to facilitate program development and enable harmonious integration of GPUs in computing systems.
Abstract: As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. This article advocates for extending standard operating system services and abstractions to GPUs in order to facilitate program development and enable harmonious integration of GPUs in computing systems. As an example, we describe the design and implementation of GPUfs, a software layer which provides operating system support for accessing host files directly from GPU programs. GPUfs provides a POSIX-like API, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the host CPU's buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of the GPUfs approach. For example, a self-contained GPU program that searches for a set of strings throughout the Linux kernel source tree runs over seven times faster than on an eight-core CPU.

79 citations
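
The core of the design above is a file page cache kept in GPU memory, with misses serviced by the host. The plain-C sketch below captures only that lookup-or-fetch logic; the structures, the direct-mapped cache, and the RPC stub are invented for illustration, and the real GPUfs interface is a set of CUDA-side, gopen/gread-style calls coordinated with the host buffer cache.

    /* Plain-C sketch of a GPU-resident file page cache with host fallback, the
     * central idea of GPUfs. All names and data structures are invented. */
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE   4096
    #define CACHE_PAGES 1024

    struct gpu_page {
        int      fd;
        uint64_t file_offset;
        int      valid;
        char     data[PAGE_SIZE];
    };

    static struct gpu_page cache[CACHE_PAGES];   /* would live in GPU memory */

    /* Stand-in for asking the host CPU to read one file page (in GPUfs this is
     * an RPC over shared memory handled by a host-side daemon). */
    static void host_read_page(int fd, uint64_t off, char *dst) {
        (void)fd; (void)off;
        memset(dst, 0, PAGE_SIZE);
    }

    static const char *get_page(int fd, uint64_t off) {
        struct gpu_page *p = &cache[(off / PAGE_SIZE) % CACHE_PAGES];
        if (!(p->valid && p->fd == fd && p->file_offset == off)) {
            host_read_page(fd, off, p->data);    /* miss: fetch through the host */
            p->fd = fd;
            p->file_offset = off;
            p->valid = 1;
        }
        return p->data;                          /* hit: data stays on the GPU */
    }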


Journal ArticleDOI
TL;DR: This article proposes six optimizations that enable an OS to fully exploit the performance characteristics of fast storage devices and demonstrates that the overheads from the traditional storage-stack design are significant and cannot easily be overcome without modifying the hardware interface and adding new capabilities to the operating system.
Abstract: Fast storage devices are an emerging solution to satisfy data-intensive applications. They provide high transaction rates for DBMS, low response times for Web servers, instant on-demand paging for applications with large memory footprints, and many similar advantages for performance-hungry applications. In spite of the benefits promised by fast hardware, modern operating systems are not yet structured to take advantage of the hardware’s full potential. The software overhead caused by an OS, negligible in the past, adversely impacts application performance, lessening the advantage of using such hardware. Our analysis demonstrates that the overheads from the traditional storage-stack design are significant and cannot easily be overcome without modifying the hardware interface and adding new capabilities to the operating system. In this article, we propose six optimizations that enable an OS to fully exploit the performance characteristics of fast storage devices. With the support of new hardware interfaces, our optimizations minimize per-request latency by streamlining the I/O path and amortize per-request latency by maximizing parallelism inside the device. We demonstrate the impact on application performance through well-known storage benchmarks run against a Linux kernel with a customized SSD. We find that eliminating context switches in the I/O path decreases the software overhead of an I/O request from 20 microseconds to 5 microseconds and a new request merge scheme called Temporal Merge enables the OS to achieve 87% to 100% of peak device performance, regardless of request access patterns or types. Although the performance improvement by these optimizations on a standard SATA-based SSD is marginal (because of its limited interface and relatively high response times), our sensitivity analysis suggests that future SSDs with lower response times will benefit from these changes. The effectiveness of our optimizations encourages discussion between the OS community and storage vendors about future device interfaces for fast storage devices.

43 citations
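
One of the article's themes, removing context switches from the I/O path, can be illustrated by replacing interrupt-driven completion with polling: on microsecond-latency devices, briefly spinning on a completion flag is cheaper than sleeping and being rescheduled. The device layout, field names, and busy-wait policy below are invented for the sketch, not the paper's hardware interface.

    /* Sketch of synchronous completion polling on a fast storage device, in the
     * spirit of the "eliminate context switches" optimization above. */
    #include <stdint.h>

    struct completion_slot {
        volatile uint32_t done;     /* written by the device when the I/O finishes */
        volatile uint32_t status;
    };

    /* Interrupt-driven completion costs two context switches per request
     * (sleep on submit, wake in the IRQ path). Polling avoids both. */
    static int wait_by_polling(struct completion_slot *slot) {
        while (!slot->done)
            ;                        /* busy-wait; no sleep, no context switch */
        return (int)slot->status;
    }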


Journal ArticleDOI
TL;DR: SAGE combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a runtime system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user.
Abstract: Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing abundance of information. For particular domains, such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to provide transparent approximation, and we show that larger benefits can be achieved by constructing the approximation techniques to fit the underlying hardware. Our target platform is the GPU because of its high performance capabilities and difficult programming challenges that can be alleviated with proper automation. Our approach—SAGE—combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a runtime system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit the GPU microarchitecture: selective discarding of atomic operations, data packing, and thread fusion. Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to the accurate execution on an NVIDIA GTX 560 GPU.

39 citations
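
The runtime half of the system can be pictured as a calibration loop that keeps the most aggressive kernel variant whose measured output quality still meets the user's target. The sketch below is a toy version of that selection logic; the variant table, quality metric, and policy are invented here, not SAGE's actual implementation.

    /* Toy sketch of SAGE-style runtime kernel selection: choose the fastest
     * approximate variant whose measured quality stays above the target. */
    #include <stddef.h>

    struct variant {
        const char *name;
        double speedup;                     /* relative to the exact kernel */
        double (*run_and_measure)(void);    /* returns output quality in [0,1] */
    };

    static const struct variant *select_variant(const struct variant *v, size_t n,
                                                double quality_target) {
        const struct variant *best = &v[0];      /* variant 0 = exact kernel */
        for (size_t i = 1; i < n; i++) {
            double q = v[i].run_and_measure();   /* periodic calibration run */
            if (q >= quality_target && v[i].speedup > best->speedup)
                best = &v[i];
        }
        return best;
    }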


Journal ArticleDOI
TL;DR: This work designs and evaluates twelve design points along the Xeon-to-Atom spectrum, and finds that a mix of three processor architectures achieves a 12× reduction in response time violations relative to equal-power homogeneous systems.
Abstract: Specialization of datacenter resources brings performance and energy improvements in response to the growing scale and diversity of cloud applications. Yet heterogeneous hardware adds complexity and volatility to latency-sensitive applications. A resource allocation mechanism that leverages architectural principles can overcome both of these obstacles. We integrate research in heterogeneous architectures with recent advances in multi-agent systems. Embedding architectural insight into proxies that bid on behalf of applications, a market effectively allocates hardware to applications with diverse preferences and valuations. Exploring a space of heterogeneous datacenter configurations, which mix server-class Xeon and mobile-class Atom processors, we find an optimal heterogeneous balance that improves both welfare and energy-efficiency. We further design and evaluate twelve design points along the Xeon-to-Atom spectrum, and find that a mix of three processor architectures achieves a 12× reduction in response time violations relative to equal-power homogeneous systems.

24 citations
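
The allocation mechanism can be caricatured as a repeated auction: per-application proxies translate latency and throughput needs into bids for each server class, and the market clears by granting servers to the highest bids. The greedy clearing rule and diminishing-value heuristic below are toy stand-ins for the paper's mechanism, included only to make the structure concrete.

    /* Toy stand-in for market-based allocation of heterogeneous servers. The
     * bid model and clearing rule are illustrative only. */
    #include <stddef.h>

    enum server_class { XEON, ATOM, NCLASSES };

    struct proxy {
        const char *app;
        double bid[NCLASSES];   /* value the proxy places on one server of each class */
    };

    /* Grant `count` servers of class `c` to the proxies with the highest bids. */
    static void clear_market(struct proxy *p, size_t n, enum server_class c,
                             int count, int *granted /* per-proxy grant counts */) {
        for (int s = 0; s < count; s++) {
            size_t best = 0;
            for (size_t i = 1; i < n; i++)
                if (p[i].bid[c] > p[best].bid[c])
                    best = i;
            granted[best]++;
            p[best].bid[c] *= 0.9;   /* crude diminishing marginal value */
        }
    }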


Journal ArticleDOI
TL;DR: It is found that Linux has more than doubled in size during this period, but the number of faults per line of code has been decreasing, and the fault rate of drivers is now below that of other directories, such as arch.
Abstract: In August 2011, Linux entered its third decade. Ten years before, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to 7 times more of certain kinds of faults than other directories. This result inspired numerous efforts on improving the reliability of driver code. Today, Linux is used in a wider range of environments, provides a wider range of services, and has adopted a new development and release model. What has been the impact of these changes on code quality? To answer this question, we have transported Chou et al.’s experiments to all versions of Linux 2.6, released between 2003 and 2011. We find that Linux has more than doubled in size during this period, but the number of faults per line of code has been decreasing. Moreover, the fault rate of drivers is now below that of other directories, such as arch. These results can guide further development and research efforts for the decade to come. To allow updating these results as Linux evolves, we define our experimental protocol and make our checkers available.

18 citations


Journal ArticleDOI
TL;DR: Measurements indicate that careful data movement and optimization of the partition function can allow range partitioning to approach the throughput and energy consumption of hash or radix partitioning; the article also describes a hardware range partitioner that further improves efficiency.
Abstract: Data partitioning is a critical operation for manipulating large datasets because it subdivides tasks into pieces that are more amenable to efficient processing. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. This article measures the performance and energy of state-of-the-art software partitioners, and describes and evaluates a hardware range partitioner that further improves efficiency. The software implementation is broken into two phases, allowing separate analysis of the partition function computation and data shuffling costs. Although range partitioning is commonly thought to be more expensive than simpler strategies such as hash partitioning, our measurements indicate that careful data movement and optimization of the partition function can allow it to approach the throughput and energy consumption of hash or radix partitioning. For further acceleration, we describe a hardware range partitioner, or HARP, a streaming framework that offers a seamless execution environment for this and other streaming accelerators, and a detailed analysis of a 32nm physical design that matches the throughput of four to eight software threads while consuming just 6.9% of the area and 4.3% of the power of a Xeon core in the same technology generation.

11 citations
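
For context, the software operation measured and accelerated above reduces to two phases per record: compute the partition index from the key (for range partitioning, a search over splitter values) and then shuffle the record into that partition's buffer. A minimal single-threaded C version, with buffer management and parallelism omitted, follows.

    /* Minimal single-threaded range partitioning: phase 1 computes each key's
     * partition via binary search over splitters, phase 2 copies the key into
     * that partition's output buffer. Caller must size the output buffers. */
    #include <stddef.h>
    #include <stdint.h>

    /* Returns the index of the first splitter greater than key. */
    static size_t partition_of(uint64_t key, const uint64_t *splitters, size_t nsplit) {
        size_t lo = 0, hi = nsplit;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (key < splitters[mid]) hi = mid; else lo = mid + 1;
        }
        return lo;                     /* partition index in [0, nsplit] */
    }

    static void range_partition(const uint64_t *keys, size_t n,
                                const uint64_t *splitters, size_t nsplit,
                                uint64_t **out, size_t *out_len) {
        for (size_t i = 0; i < n; i++) {
            size_t p = partition_of(keys[i], splitters, nsplit);   /* phase 1 */
            out[p][out_len[p]++] = keys[i];                        /* phase 2: shuffle */
        }
    }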