Showing papers in "ACM Transactions on Computer Systems in 2008"

Journal Article•DOI•

Bigtable: A Distributed Storage System for Structured Data

Fay W. Chang¹, Jeffrey Dean¹, Sanjay Ghemawat¹, Wilson C. Hsieh¹, Deborah A. Wallach¹, Michael Burrows¹, Tushar Deepak Chandra¹, Andrew Fikes¹, Robert E. Gruber¹•Institutions (1)

Google¹

01 Jun 2008-ACM Transactions on Computer Systems

TL;DR: The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.

Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

3,259 citations

Journal Article•DOI•

A generic component model for building systems software

Geoff Coulson¹, Gordon S. Blair¹, Paul Grace¹, François Taïani¹, Ackbar Joolia¹, Kevin Lee¹, Jó Ueyama¹, Thirunavukkarasu Sivaharan¹•Institutions (1)

Lancaster University¹

10 Mar 2008-ACM Transactions on Computer Systems

TL;DR: This article argues for the benefits and feasibility of a generic yet tailorable approach to component-based systems-building that offers a uniform programming model that is applicable in a wide range of systems-oriented target domains and deployment environments.

Abstract: Component-based software structuring principles are now commonplace at the application level; but componentization is far less established when it comes to building low-level systems software. Although there have been pioneering efforts in applying componentization to systems-building, these efforts have tended to target specific application domains (e.g., embedded systems, operating systems, communications systems, programmable networking environments, or middleware platforms). They also tend to be targeted at specific deployment environments (e.g., standard personal computer (PC) environments, network processors, or microcontrollers). The disadvantage of this narrow targeting is that it fails to maximize the genericity and abstraction potential of the component approach. In this article, we argue for the benefits and feasibility of a generic yet tailorable approach to component-based systems-building that offers a uniform programming model that is applicable in a wide range of systems-oriented target domains and deployment environments. The component model, called OpenCom, is supported by a reflective runtime architecture that is itself built from components. After describing OpenCom and evaluating its performance and overhead characteristics, we present and evaluate two case studies of systems we have built using OpenCom technology, thus illustrating its benefits and its general applicability.

407 citations

Journal Article•DOI•

RaWMS - Random Walk Based Lightweight Membership Service for Wireless Ad Hoc Networks

Ziv Bar-Yossef¹, Roy Friedman¹, Gabriel Kliot¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 Jun 2008-ACM Transactions on Computer Systems

TL;DR: RaWMS provides each node with a partial uniformly chosen view of network nodes and is compared both analytically and by simulations with a number of other known methods such as flooding and gossip-based techniques.

Abstract: This article presents RaWMS, a novel lightweight random membership service for ad hoc networks. The service provides each node with a partial uniformly chosen view of network nodes. Such a membership service is useful, for example, in data dissemination algorithms, lookup and discovery services, peer sampling services, and complete membership construction. The design of RaWMS is based on a novel reverse random walk (RW) sampling technique. The article includes a formal analysis of both the reverse RW sampling technique and RaWMS and verifies it through a detailed simulation study. In addition, RaWMS is compared both analytically and by simulations with a number of other known methods such as flooding and gossip-based techniques.

87 citations

Journal Article•DOI•

Adaptive work-stealing with parallelism feedback

Kunal Agrawal¹, Charles E. Leiserson¹, Yuxiong He², Wen-Jing Hsu²•Institutions (2)

Massachusetts Institute of Technology¹, Nanyang Technological University²

22 Sep 2008-ACM Transactions on Computer Systems

TL;DR: This work presents a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors, and introduces a new technique called trim analysis, which is used to prove that the adaptive thread scheduler performs poorly on no more than a small number of time steps.

Abstract: Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority. More precisely, suppose that a job has work T1 and span T∞. On a machine with P processors, A-STEAL completes the job in an expected duration of O(T1/P˜ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum, and P˜ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all time steps except the O(T∞ + L lg P) time steps that have the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P˜

56 citations
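
The trimmed availability defined in the A-STEAL abstract above can be illustrated with a minimal sketch. The function name `trimmed_availability` and the parameter `r` are hypothetical: `r` stands in for a concrete number of trimmed time steps, whereas the abstract only gives that count asymptotically as O(T∞ + L lg P).

```python
def trimmed_availability(availability, r):
    """Average processor availability after discarding the r time steps
    with the highest availability.

    availability: per-time-step processor availability (list of numbers).
    r: number of highest-availability steps to trim (an assumed concrete
       value; the abstract states it only as O(T_inf + L lg P)).
    """
    if r < 0:
        raise ValueError("r must be non-negative")
    # Sort ascending and keep everything except the r largest values.
    keep = max(len(availability) - r, 0)
    kept = sorted(availability)[:keep]
    if not kept:
        raise ValueError("no time steps remain after trimming")
    return sum(kept) / len(kept)


# An adversarial job scheduler offers 64 processors on two steps only.
steps = [1, 1, 1, 1, 64, 64]
print(trimmed_availability(steps, 0))  # plain average over all steps
print(trimmed_availability(steps, 2))  # average after trimming the two spikes
```

Trimming removes the few steps on which the adversary floods the job with processors it cannot use, so the remaining average reflects the availability the job could realistically exploit.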

Journal Article•DOI•

High-bandwidth data dissemination for large-scale distributed systems

Dejan Kostic¹, Alex C. Snoeren², Amin Vahdat², Ryan Braud², Charles Killian², James W. Anderson², Jeannie Albrecht³, Adolfo Francisco Rodriguez⁴, Erik Vandekieft⁴•Institutions (4)

École Polytechnique Fédérale de Lausanne¹, University of California, San Diego², Williams College³, IBM⁴

10 Mar 2008-ACM Transactions on Computer Systems

TL;DR: Bullet is presented, a data dissemination mesh that takes advantage of the computational and storage capabilities of end hosts to create a distribution structure where a node receives data in parallel from multiple peers, and reduces the need to perform expensive bandwidth probing.

Abstract: This article focuses on the multireceiver data dissemination problem. Initially, IP multicast formed the basis for efficiently supporting such distribution. More recently, overlay networks have emerged to support point-to-multipoint communication. Both techniques focus on constructing trees rooted at the source to distribute content among all interested receivers. We argue, however, that trees have two fundamental limitations for data dissemination. First, since all data comes from a single parent, participants must often continuously probe in search of a parent with an acceptable level of bandwidth. Second, due to packet losses and failures, available bandwidth is monotonically decreasing down the tree. To address these limitations, we present Bullet, a data dissemination mesh that takes advantage of the computational and storage capabilities of end hosts to create a distribution structure where a node receives data in parallel from multiple peers. For the mesh to deliver improved bandwidth and reliability, we need to solve several key problems: (i) disseminating disjoint data over the mesh, (ii) locating missing content, (iii) finding who to peer with (peering strategy), (iv) retrieving data at the right rate from all peers (flow control), and (v) recovering from failures and adapting to dynamically changing network conditions. Additionally, the system should be self-adjusting and should have few user-adjustable parameter settings. We describe our approach to addressing all of these problems in a working, deployed system across the Internet. Bullet outperforms state-of-the-art systems, including BitTorrent, by 25-70% and exhibits strong performance and reliability in a range of deployment settings. In addition, we find that, relative to tree-based solutions, Bullet reduces the need to perform expensive bandwidth probing.

50 citations

Journal Article•DOI•

Vigilante: End-to-end containment of Internet worm epidemics

Manuel Costa¹, Jon Crowcroft¹, Miguel Castro², Antony Rowstron², Lidong Zhou², Lintao Zhang², Paul Barham²•Institutions (2)

University of Cambridge¹, Microsoft²

19 Dec 2008-ACM Transactions on Computer Systems

TL;DR: Vigilante is proposed, a new end-to-end architecture to contain worms automatically that addresses limitations of network-level techniques to automate worm containment and does not require any changes to hardware, compilers, operating systems, or the source code of vulnerable programs.

Abstract: Worm containment must be automatic because worms can spread too fast for humans to respond. Recent work proposed network-level techniques to automate worm containment; these techniques have limitations because there is no information about the vulnerabilities exploited by worms at the network level. We propose Vigilante, a new end-to-end architecture to contain worms automatically that addresses these limitations. In Vigilante, hosts detect worms by instrumenting vulnerable programs to analyze infection attempts. We introduce dynamic data-flow analysis: a broad-coverage host-based algorithm that can detect unknown worms by tracking the flow of data from network messages and disallowing unsafe uses of this data. We also show how to integrate other host-based detection mechanisms into the Vigilante architecture. Upon detection, hosts generate self-certifying alerts (SCAs), a new type of security alert that can be inexpensively verified by any vulnerable host. Using SCAs, hosts can cooperate to contain an outbreak, without having to trust each other. Vigilante broadcasts SCAs over an overlay network that propagates alerts rapidly and resiliently. Hosts receiving an SCA protect themselves by generating filters with vulnerability condition slicing: an algorithm that performs dynamic analysis of the vulnerable program to identify control-flow conditions that lead to successful attacks. These filters block the worm attack and all its polymorphic mutations that follow the execution path identified by the SCA. Our results show that Vigilante can contain fast-spreading worms that exploit unknown vulnerabilities, and that Vigilante's filters introduce a negligible performance overhead. Vigilante does not require any changes to hardware, compilers, operating systems, or the source code of vulnerable programs; therefore, it can be used to protect current software binaries.

46 citations

Journal Article•DOI•

Rethink the sync

Edmund B. Nightingale¹, Kaushik Veeraraghavan¹, Peter M. Chen¹, Jason Flinn¹•Institutions (1)

University of Michigan¹

22 Sep 2008-ACM Transactions on Computer Systems

TL;DR: Xsyncfs is an externally synchronous file system for Linux that provides the same durability and ordering guarantees as a synchronously mounted ext3 file system.

Abstract: We introduce external synchrony, a new model for local file I/O that provides the reliability and simplicity of synchronous I/O, yet also closely approximates the performance of asynchronous I/O. An external observer cannot distinguish the output of a computer with an externally synchronous file system from the output of a computer with a synchronous file system. No application modification is required to use an externally synchronous file system. In fact, application developers can program to the simpler synchronous I/O abstraction and still receive excellent performance. We have implemented an externally synchronous file system for Linux, called xsyncfs. Xsyncfs provides the same durability and ordering guarantees as those provided by a synchronously mounted ext3 file system. Yet even for I/O-intensive benchmarks, xsyncfs performance is within 7% of ext3 mounted asynchronously. Compared to ext3 mounted synchronously, xsyncfs is up to two orders of magnitude faster.

41 citations

Journal Article•DOI•

Probabilistic quorum systems in wireless Ad Hoc networks

Roy Friedman¹, Gabriel Kliot², Chen Avin³•Institutions (3)

Technion – Israel Institute of Technology¹, Microsoft², Ben-Gurion University of the Negev³

30 Sep 2008-ACM Transactions on Computer Systems

TL;DR: This paper presents the first detailed study of asymmetric probabilistic bi-quorum systems and shows that one of the strategies, based on random walks, exhibits the smallest communication overhead.

Abstract: Quorums are a basic construct in solving many fundamental distributed computing problems. One of the known ways of making quorums scalable and efficient is by weakening their intersection guarantee to being probabilistic. This article explores several access strategies for implementing probabilistic quorums in ad hoc networks. In particular, we present the first detailed study of asymmetric probabilistic biquorum systems, which allow mixing different access strategies and different quorum sizes while guaranteeing the desired intersection probability. We show the advantages of asymmetric probabilistic biquorum systems in ad hoc networks. Such an asymmetric construction is also useful for other types of networks with nonuniform access costs (e.g., peer-to-peer networks). The article includes a formal analysis of these approaches backed up by an extensive simulation-based study. The study explores the impact of various parameters such as network size, network density, mobility speed, and churn. In particular, we show that one of the strategies that uses random walks exhibits the smallest communication overhead, thus being very attractive for ad hoc networks.
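
As a rough illustration of a probabilistic intersection guarantee with two different quorum sizes, the sketch below computes the chance that two quorums overlap. It assumes, purely for illustration, that both quorums are drawn independently and uniformly at random from `n` nodes; that is a textbook simplification, not any of the access strategies studied in the article, and the function name is hypothetical.

```python
import math


def intersection_probability(n, q1, q2):
    """Probability that two independently, uniformly chosen quorums of
    sizes q1 and q2 out of n nodes share at least one node.

    Assumption (not from the article): uniform, independent selection.
    """
    if not (0 <= q1 <= n and 0 <= q2 <= n):
        raise ValueError("quorum sizes must lie between 0 and n")
    # The second quorum misses the first only if all of its q2 members
    # fall among the n - q1 nodes outside the first quorum.
    miss = math.comb(n - q1, q2) / math.comb(n, q2)
    return 1.0 - miss


# Asymmetric sizes: a small quorum paired with a larger one.
print(intersection_probability(100, 5, 40))
# If q1 + q2 > n the quorums must overlap, so the probability is 1.
print(intersection_probability(10, 6, 5))
```

Under this assumption, shrinking one quorum can be compensated by enlarging the other while the intersection probability stays at the desired level.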

35 citations

Journal Article•DOI•

The SMesh wireless mesh network

Yair Amir¹, Claudiu Danilov¹, Raluca Musăloiu-Elefteri¹, Nilo Rivera¹•Institutions (1)

Johns Hopkins University¹

30 Sep 2008-ACM Transactions on Computer Systems

TL;DR: This work presents the architecture and protocols of SMesh, the first transparent wireless mesh system that offers seamless, fast handoff, supporting real-time applications such as interactive VoIP, and provides a hybrid routing protocol that optimizes routes over wireless and wired links in a multihomed environment.

Abstract: Wireless mesh networks extend the connectivity range of mobile devices by using multiple access points, some of them connected to the Internet, to create a mesh topology and forward packets over multiple wireless hops. However, the quality of service provided by the mesh is impaired by the delays and disconnections caused by handoffs, as clients move within the area covered by multiple access points. We present the architecture and protocols of SMesh, the first transparent wireless mesh system that offers seamless, fast handoff, supporting real-time applications such as interactive VoIP. The handoff and routing logic is handled solely by the access points, and therefore connectivity is attainable by any 802.11 device. In SMesh, the entire mesh network is seen by the mobile clients as a single, omnipresent access point, giving the mobile clients the illusion that they are stationary. We use multicast for access point coordination and, during handoff transitions, we use more than one access point to handle the moving client. SMesh provides a hybrid routing protocol that optimizes routes over wireless and wired links in a multihomed environment. Experimental results on a fully deployed mesh network demonstrate the effectiveness of the SMesh architecture and its intra-domain and inter-domain handoff protocols.

34 citations

Journal Article•DOI•

A stateless approach to connection-oriented protocols

Alan Shieh¹, Andrew C. Myers¹, Emin Gün Sirer¹•Institutions (1)

Cornell University¹

22 Sep 2008-ACM Transactions on Computer Systems

TL;DR: This article introduces Trickles, a novel TCP-like transport protocol and a new interface to replace sockets that together enable all state to be kept on one endpoint, allowing the other endpoint, typically the server, to operate without any per-connection state.

Abstract: Traditional operating system interfaces and network protocol implementations force some system state to be kept on both sides of a connection. This state ties the connection to its endpoints, impedes transparent failover, permits denial-of-service attacks, and limits scalability. This article introduces a novel TCP-like transport protocol and a new interface to replace sockets that together enable all state to be kept on one endpoint, allowing the other endpoint, typically the server, to operate without any per-connection state. Called Trickles, this approach enables servers to scale well with increasing numbers of clients, consume fewer resources, and better resist denial-of-service attacks. Measurements on a full implementation in Linux indicate that Trickles achieves performance comparable to TCP/IP, interacts well with other flows, and scales well. Trickles also enables qualitatively different kinds of networked services. Services can be geographically replicated and contacted through an anycast primitive for improved availability and performance. Widely-deployed practices that currently have client-observable side effects, such as periodic server reboots, connection redirection, and failover, can be made transparent, and perform well, under Trickles. The protocol is secure against tampering and replay attacks, and the client interface is backward-compatible, requiring no changes to sockets-based client applications.

17 citations

Journal Article•DOI•

Incrementally parallelizing database transactions with thread-level speculation

Christopher B. Colohan¹, Anastassia Ailamaki¹, J. Gregory Steffan², Todd C. Mowry¹•Institutions (2)

Carnegie Mellon University¹, University of Toronto²

10 Mar 2008-ACM Transactions on Computer Systems

TL;DR: Incrementally parallelizing transactions with speculative threads can dramatically improve performance: on a simulated four-processor chip multiprocessor, response time improves by 44-66% for three of the five TPC-C transactions, assuming the availability of idle processors.

Abstract: With the advent of chip multiprocessors, exploiting intratransaction parallelism in database systems is an attractive way of improving transaction performance. However, exploiting intratransaction parallelism is difficult for two reasons: first, significant changes are required to avoid races or conflicts within the DBMS; and second, adding threads to transactions requires a high level of sophistication from transaction programmers. In this article we show how dividing a transaction into speculative threads solves both problems: it minimizes the changes required to the DBMS, and the details of parallelization are hidden from the transaction programmer. Our technique requires a limited number of small, localized changes to a subset of the low-level data structures in the DBMS. Through this method of incrementally parallelizing transactions, we can dramatically improve performance: on a simulated four-processor chip multiprocessor, we improve the response time by 44-66% for three of the five TPC-C transactions, assuming the availability of idle processors.

Journal Article•DOI•

Improving peer-to-peer performance through server-side scheduling

Yi Qiao¹, Fabián E. Bustamante¹, Peter A. Dinda¹, Stefan Birrer¹, Dong Lu²•Institutions (2)

Northwestern University¹, Ask.com²

19 Dec 2008-ACM Transactions on Computer Systems

TL;DR: This work introduces two new estimators that enable predictive SRPT scheduling policies that closely approach the performance of ideal SRPT, the algorithm known to be optimal for minimizing mean response time.

Abstract: We show how to significantly improve the mean response time seen by both uploaders and downloaders in peer-to-peer data-sharing systems. Our work is motivated by the observation that response times are largely determined by the performance of the peers serving the requested objects, that is, by the peers in their capacity as servers. With this in mind, we take a close look at this server side of peers, characterizing its workload by collecting and examining an extensive set of traces. Using trace-driven simulation, we demonstrate the promise and potential problems with scheduling policies based on shortest-remaining-processing-time (SRPT), the algorithm known to be optimal for minimizing mean response time. The key challenge to using SRPT in this context is determining request service times. In addressing this challenge, we introduce two new estimators that enable predictive SRPT scheduling policies that closely approach the performance of ideal SRPT. We evaluate our approach through extensive single-server and system-level simulation coupled with real Internet deployment and experimentation.
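
The SRPT policy named in the abstract above can be sketched as a small single-server simulation. This is an illustrative model only: it assumes exact service times are known in advance (ideal SRPT), which the abstract identifies as the hard part in practice, and it does not implement the article's estimators. The function name and job format are hypothetical.

```python
import heapq


def srpt_mean_response_time(jobs):
    """Mean response time under preemptive shortest-remaining-processing-time
    scheduling on one server.

    jobs: list of (arrival_time, service_time) pairs; service times are
    assumed to be known exactly (ideal SRPT).
    """
    if not jobs:
        raise ValueError("need at least one job")
    pending = sorted(jobs)          # jobs ordered by arrival time
    ready = []                      # min-heap of (remaining_time, arrival_time)
    now, i, total = 0.0, 0, 0.0
    while i < len(pending) or ready:
        if not ready:
            # Server idle: jump ahead to the next arrival.
            now = max(now, pending[i][0])
        while i < len(pending) and pending[i][0] <= now:
            arrival, service = pending[i]
            heapq.heappush(ready, (service, arrival))
            i += 1
        remaining, arrival = heapq.heappop(ready)
        # Run the shortest job until it finishes or a new job arrives.
        next_arrival = pending[i][0] if i < len(pending) else float("inf")
        run = min(remaining, next_arrival - now)
        now += run
        remaining -= run
        if remaining > 0:
            heapq.heappush(ready, (remaining, arrival))  # preempted
        else:
            total += now - arrival                        # job completed
    return total / len(jobs)


# A long job arrives first; a short job arrives one time unit later.
print(srpt_mean_response_time([(0, 10), (1, 2)]))
```

In this example the short request preempts the long one, so it finishes after 2 time units instead of waiting behind 9 units of remaining work, which is the effect SRPT relies on to lower mean response time.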

Proceedings Article•DOI•

Domain-specific languages as key tools for ULSSIS engineering

Jan Heering, Marjan Mernik¹•Institutions (1)

University of Maribor¹

10 May 2008-ACM Transactions on Computer Systems

TL;DR: In this article, the authors discuss the potential of domain-specific languages and domain-specific modeling languages for ULSSIS engineering, some of the scaling challenges involved, and the possibilities for raising expressiveness beyond current levels.

Abstract: We briefly discuss the potential of domain-specific languages and domain-specific modeling languages for ULSSIS engineering, some of the scaling challenges involved, and the possibilities for raising expressiveness beyond current levels.
