
Showing papers in "ACM Transactions on Computer Systems in 2000"


Journal ArticleDOI
TL;DR: On conventional PC hardware, the Click IP router achieves a maximum loss-free forwarding rate of 333,000 64-byte packets per second, demonstrating that Click's modular and flexible architecture is compatible with good performance.
Abstract: Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queuing, scheduling, and interfacing with network devices. A router configuration is a directed graph with elements at the vertices; packets flow along the edges of the graph. Several features make individual elements more powerful and complex configurations easier to write, including pull connections, which model packet flow driven by transmitting hardware devices, and flow-based router context, which helps an element locate other interesting elements. Click configurations are modular and easy to extend. A standards-compliant Click IP router has 16 elements on its forwarding path; some of its elements are also useful in Ethernet switches and IP tunnelling configurations. Extending the IP router to support dropping policies, fairness among flows, or Differentiated Services simply requires adding a couple of elements at the right place. On conventional PC hardware, the Click IP router achieves a maximum loss-free forwarding rate of 333,000 64-byte packets per second, demonstrating that Click's modular and flexible architecture is compatible with good performance.
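The element-graph idea can be sketched as a toy model. The classes and element names below (Classifier, Counter, Discard) are illustrative Python stand-ins, not real Click elements, and only the push side of packet flow is modeled:

```python
# Toy model of Click's architecture: elements at the vertices of a
# configuration graph, packets flowing along the edges. Illustrative only.

class Element:
    def __init__(self):
        self.next = None              # downstream edge in the graph
    def connect(self, downstream):
        self.next = downstream
        return downstream             # allows chained connect() calls
    def push(self, packet):           # push connection: upstream drives flow
        raise NotImplementedError

class Classifier(Element):
    """Forward IP packets downstream; drop everything else."""
    def push(self, packet):
        if packet.get("proto") == "ip":
            self.next.push(packet)

class Counter(Element):
    def __init__(self):
        super().__init__()
        self.count = 0
    def push(self, packet):
        self.count += 1
        self.next.push(packet)

class Discard(Element):
    def push(self, packet):
        pass                          # sink: silently drop the packet

# Assemble a tiny configuration graph: Classifier -> Counter -> Discard
cls, ctr, sink = Classifier(), Counter(), Discard()
cls.connect(ctr).connect(sink)

for pkt in [{"proto": "ip"}, {"proto": "arp"}, {"proto": "ip"}]:
    cls.push(pkt)

print(ctr.count)   # only the two IP packets reach the counter
```

Extending such a configuration means inserting another element on an edge, which mirrors the abstract's point that adding dropping policies or scheduling only requires a couple of elements at the right place.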

2,595 citations


Journal ArticleDOI
TL;DR: IO-Lite as discussed by the authors is a unified I/O buffering and caching system for general-purpose operating systems, which allows applications, the interprocess communication system, the file system, and the file cache to safely and concurrently share a single physical copy of the data.
Abstract: This article presents the design, implementation, and evaluation of IO-Lite, a unified I/O buffering and caching system for general-purpose operating systems. IO-Lite unifies all buffering and caching in the system, to the extent permitted by the hardware. In particular, it allows applications, the interprocess communication system, the file system, the file cache, and the network subsystem to safely and concurrently share a single physical copy of the data. Protection and security are maintained through a combination of access control and read-only sharing. IO-Lite eliminates all copying and multiple buffering of I/O data, and enables various cross-subsystem optimizations. Experiments with a Web server show performance improvements between 40 and 80% on real workloads as a result of IO-Lite.
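The central idea, one physical copy shared read-only across subsystems, can be illustrated with Python's buffer protocol. This is only an analogy to IO-Lite's immutable buffer aggregates, not its actual API:

```python
# Illustration of IO-Lite's core idea via Python's buffer protocol:
# several "subsystems" share one physical buffer through zero-copy,
# read-only views instead of each holding a private copy.

data = bytearray(b"GET /index.html HTTP/1.0\r\n")   # one physical buffer

# The "file cache" and the "network subsystem" each get a read-only view.
file_cache_view = memoryview(data).toreadonly()
net_view = memoryview(data).toreadonly()

# Both views alias the same storage: no bytes were copied.
assert file_cache_view.obj is data and net_view.obj is data

# Read-only sharing preserves protection: writing through a view fails.
try:
    net_view[0:3] = b"PUT"
except TypeError:
    mutated = False
else:
    mutated = True

print(mutated)   # False: the shared copy cannot be modified via a view
```

The copy elimination is what drives the reported 40-80% Web-server gains: the same bytes serve the file cache, IPC, and the network path.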

199 citations


Journal ArticleDOI
TL;DR: It is shown that a disk-based file system using soft updates achieves memory-based file system performance while providing stronger integrity and security guarantees than most disk-based file systems.
Abstract: Metadata updates, such as file creation and block allocation, have consistently been identified as a source of performance, integrity, security, and availability problems for file systems. Soft updates is an implementation technique for low-cost sequencing of fine-grained updates to write-back cache blocks. Using soft updates to track and enforce metadata update dependencies, a file system can safely use delayed writes for almost all file operations. This article describes soft updates, their incorporation into the 4.4BSD fast file system, and the resulting effects on the system. We show that a disk-based file system using soft updates achieves memory-based file system performance while providing stronger integrity and security guarantees than most disk-based file systems. For workloads that frequently perform updates on metadata (such as creating and deleting files), this improves performance by more than a factor of two and up to a factor of 20 when compared to the conventional synchronous write approach and by 4-19% when compared to an aggressive write-ahead logging approach. In addition, soft updates can improve file system availability by relegating crash-recovery assistance (e.g., the fsck utility) to an optional and background role, reducing file system recovery time to less than one second.
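The dependency-tracking idea can be sketched in a few lines. The scenario and block names are illustrative: on file creation, the new inode must reach disk before the directory entry that points at it, or a crash could leave a dangling reference:

```python
# Toy sketch of soft updates: cached blocks carry dependency edges, and the
# write-back loop flushes a block only once its dependencies are on disk.

class Block:
    def __init__(self, name):
        self.name = name
        self.depends_on = []   # blocks that must be written first
        self.on_disk = False

def flush_all(blocks):
    """Write back dirty blocks, respecting metadata update dependencies."""
    order = []
    while any(not b.on_disk for b in blocks):
        for b in blocks:
            if not b.on_disk and all(d.on_disk for d in b.depends_on):
                b.on_disk = True          # simulated disk write
                order.append(b.name)
    return order

# File creation: initialize the inode on disk before the directory block
# that references it, even though both writes are delayed.
inode = Block("inode")
directory = Block("directory")
directory.depends_on.append(inode)

order = flush_all([directory, inode])
print(order)   # ['inode', 'directory']
```

Because ordering is enforced at write-back time rather than with synchronous writes, almost all operations can use cheap delayed writes, which is where the factor-of-2-to-20 speedups come from.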

169 citations


Journal ArticleDOI
TL;DR: Smart Packets improves the management of large complex networks by moving management decision points closer to the node being managed, targeting specific aspects of the node for information rather than exhaustive collection via polling, and abstracting the management concepts to language constructs, allowing nimble network control.
Abstract: This article introduces Smart Packets and describes the Smart Packets architecture, the packet formats, the language and its design goals, and security considerations. Smart Packets is an Active Networks project focusing on applying active networks technology to network management and monitoring. Messages in active networks are programs that are executed at nodes on the path to one or more target hosts. Smart Packets programs are written in a tightly encoded, safe language specifically designed to support network management and avoid dangerous constructs and accesses. Smart Packets improves the management of large complex networks by (1) moving management decision points closer to the node being managed, (2) targeting specific aspects of the node for information rather than exhaustive collection via polling, and (3) abstracting the management concepts to language constructs, allowing nimble network control.
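The execution model can be sketched as a packet carrying a tiny restricted program that a node runs against its own local state. The instruction set, node state, and names below are invented for illustration and do not reflect the actual Smart Packets language:

```python
# Sketch of the Smart Packets model: a management packet carries a small
# program that a node executes locally, returning only the requested
# information instead of bulk polled data. All names are illustrative.

def run_smart_packet(program, node_state):
    """Execute a deliberately minimal, 'safe' program at one node:
    no loops, no I/O, access only to local node state."""
    results = {}
    for op, arg in program:
        if op == "GET":                       # fetch one local variable
            results[arg] = node_state.get(arg)
        elif op == "TEST_GT":                 # report a threshold crossing,
            key, threshold = arg              # not the raw time series
            results[key] = node_state.get(key, 0) > threshold
    return results

node = {"if_errors": 42, "queue_len": 7}
program = [("GET", "if_errors"), ("TEST_GT", ("queue_len", 10))]

reply = run_smart_packet(program, node)
print(reply)   # {'if_errors': 42, 'queue_len': False}
```

The `TEST_GT` step illustrates point (2) of the abstract: the decision happens at the managed node, so only a boolean crosses the network rather than everything a poller would have to collect.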

139 citations


Journal ArticleDOI
TL;DR: This paper proposes and evaluates soft timers, a new operating system facility that allows the efficient scheduling of software events at a granularity down to tens of microseconds, and shows that this technique can improve the throughput of a Web server by up to 25%.
Abstract: This paper proposes and evaluates soft timers, a new operating system facility that allows the efficient scheduling of software events at a granularity down to tens of microseconds. Soft timers can be used to avoid interrupts and reduce context switches associated with network processing, without sacrificing low communication delays. More specifically, soft timers enable transport protocols like TCP to efficiently perform rate-based clocking of packet transmissions. Experiments indicate that soft timers allow a server to employ rate-based clocking with little CPU overhead (2-6%) at high aggregate bandwidths. Soft timers can also be used to perform network polling, which eliminates network interrupts and increases the memory access locality of the network subsystem without sacrificing delay. Experiments show that this technique can improve the throughput of a Web server by up to 25%.
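The mechanism can be sketched as a pending-event queue that is checked opportunistically at "trigger states" the kernel reaches anyway (system-call return, exception return, and so on) instead of taking a hardware interrupt per timer. The class and the mock timeline below are illustrative:

```python
# Sketch of the soft-timer idea: expired events are run when execution
# happens to pass a trigger state, avoiding per-timer hardware interrupts.

import heapq

class SoftTimers:
    def __init__(self):
        self.events = []                      # min-heap of (deadline, callback)

    def schedule(self, deadline, callback):
        heapq.heappush(self.events, (deadline, callback))

    def on_trigger_state(self, now):
        """Called cheaply at each trigger state; fires any due events."""
        fired = []
        while self.events and self.events[0][0] <= now:
            _, cb = heapq.heappop(self.events)
            fired.append(cb())
        return fired

timers = SoftTimers()
timers.schedule(10, lambda: "send packet A")   # e.g. rate-based TCP clocking
timers.schedule(25, lambda: "send packet B")

log = []
for now in (5, 12, 30):      # trigger states occur at these (mock) times
    log += timers.on_trigger_state(now)

print(log)   # ['send packet A', 'send packet B']
```

The trade-off visible even in the toy: events fire slightly after their deadline (at time 12 and 30, not 10 and 25), which is why the paper argues trigger states occur often enough to keep that slack down to tens of microseconds.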

102 citations


Journal ArticleDOI
TL;DR: This article presents the observations demonstrating that operations on “narrow-width” quantities are common not only in multimedia codes, but also in more general workloads, and proposes two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations.
Abstract: The large address space needs of many current applications have pushed processor designs toward 64-bit word widths. Although full 64-bit addresses and operations are indeed sometimes needed, arithmetic operations on much smaller quantities are still more common. In fact, another instruction set trend has been the introduction of instructions geared toward subword operations on 16-bit quantities. For example, most major processors now include instruction set support for multimedia operations allowing parallel execution of several subword operations in the same ALU. This article presents our observations demonstrating that operations on “narrow-width” quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or less. Based on this data, we propose two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations. The first, power-oriented optimization reduces processor power consumption by using operand-value-based clock gating to turn off portions of arithmetic units that will be unused by narrow-width operations. This optimization results in a 45%-60% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. Applying this optimization to SPECfp95 benchmarks results in slightly smaller power reductions, but still seems warranted. These reductions in integer unit power consumption equate to a 5%-10% full-chip power savings. Our second, performance-oriented optimization improves processor performance by packing together narrow-width operations so that they share a single arithmetic unit. Conceptually similar to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench. Overall, these optimizations highlight an increasing opportunity for value-based optimizations to improve both power and performance in current microprocessors.
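The packing optimization can be illustrated in software: detect that both operands of two independent additions fit in 16 bits, then perform both adds with one 32-bit addition. This SWAR-style sketch is an analogy to the proposed hardware, which would also gate the carry chain between lanes; the toy below simply assumes no carry crosses the lane boundary:

```python
# Sketch of the narrow-width idea: recognize operands that fit in 16 bits
# and pack two such additions into one 32-bit "ALU" operation, a software
# analogue of the paper's dynamic MMX-like packing. Illustrative only.

def is_narrow(x, bits=16):
    """True if x fits in an unsigned field of the given width."""
    return 0 <= x < (1 << bits)

def packed_add(a1, b1, a2, b2):
    """Perform two independent 16-bit adds in a single 32-bit addition.
    NOTE: real hardware must gate the carry out of the low lane; this toy
    just assumes a1 + b1 stays within 16 bits."""
    assert all(is_narrow(v) for v in (a1, b1, a2, b2))
    packed_a = (a2 << 16) | a1        # two 16-bit lanes in one 32-bit word
    packed_b = (b2 << 16) | b1
    s = packed_a + packed_b           # the single shared ALU operation
    return s & 0xFFFF, (s >> 16) & 0xFFFF

print(packed_add(3, 4, 100, 200))    # (7, 300)
```

The same `is_narrow` test is what drives the power optimization: when an operand's upper bits are provably zero, the corresponding slice of the arithmetic unit can be clock-gated off.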

80 citations


Journal ArticleDOI
TL;DR: This paper presents a system called Cellular Disco, which effectively turns a large-scale shared-memory multiprocessor into a virtual cluster that supports fault containment and heterogeneity, while avoiding operating system scalability bottlenecks and can manage the CPU and memory resources of the machine significantly better than the hardware partitioning approach.
Abstract: Despite the fact that large-scale shared-memory multiprocessors have been commercially available for several years, system software that fully utilizes all their features is still not available, mostly due to the complexity and cost of making the required changes to the operating system. A recently proposed approach, called Disco, substantially reduces this development cost by using a virtual machine monitor that leverages the existing operating system technology. In this paper we present a system called Cellular Disco that extends the Disco work to provide all the advantages of the hardware partitioning and scalable operating system approaches. We argue that Cellular Disco can achieve these benefits at only a small fraction of the development cost of modifying the operating system. Cellular Disco effectively turns a large-scale shared-memory multiprocessor into a virtual cluster that supports fault containment and heterogeneity, while avoiding operating system scalability bottlenecks. Yet at the same time, Cellular Disco preserves the benefits of a shared-memory multiprocessor by implementing dynamic, fine-grained resource sharing, and by allowing users to overcommit resources such as processors and memory. This hybrid approach requires a scalable resource manager that makes local decisions with limited information while still providing good global performance and fault containment. In this paper we describe our experience with a Cellular Disco prototype on a 32-processor SGI Origin 2000 system. We show that the execution time penalty for this approach is low, typically within 10% of the best available commercial operating system for most workloads, and that it can manage the CPU and memory resources of the machine significantly better than the hardware partitioning approach.

66 citations


Journal ArticleDOI
TL;DR: Simulations show that the block access times of the hint-based cooperative caching system are as good as those of the existing algorithms, while reducing manager load, block lookup traffic, and replacement traffic by more than a factor of seven.
Abstract: This article presents the design, implementation, and measurement of a hint-based cooperative caching file system. Hints allow clients to make decisions based on local state, enabling a loosely coordinated system that is simple to implement. The resulting performance is comparable to that of existing tightly coordinated algorithms that use global state, but with less overhead. Simulations show that the block access times of our system are as good as those of the existing algorithms, while reducing manager load by more than a factor of seven, block lookup traffic by nearly a factor of two-thirds, and replacement traffic by a factor of five. To verify our simulation results in a real system with real users, we implemented a prototype and measured its performance for one week. Although the simulation and prototype environments were very different, the prototype system mirrored the simulation results by exhibiting reduced overhead and high hint accuracy. Furthermore, hint-based cooperative caching reduced the average block access time to almost half that of NFS.

65 citations


Journal ArticleDOI
TL;DR: This paper describes the motivation, design and performance of Porcupine, a scalable mail server that is designed to be easy to manage by emphasizing dynamic load balancing, automatic configuration, and graceful degradation in the presence of failures.
Abstract: This paper describes the motivation, design and performance of Porcupine, a scalable mail server. The goal of Porcupine is to provide a highly available and scalable electronic mail service using a large cluster of commodity PCs. We designed Porcupine to be easy to manage by emphasizing dynamic load balancing, automatic configuration, and graceful degradation in the presence of failures. Key to the system's manageability, availability, and performance is that sessions, data, and underlying services are distributed homogeneously and dynamically across nodes in a cluster.

55 citations


Journal ArticleDOI
TL;DR: This work provides a simpler definition of consistency model for Java, in which it is clearly distinguish the consistency model that is promised to the programmer from that which should be implemented in the JVM, and precisely defines their discrepancy.
Abstract: The Java Language Specification (JLS) [Gosling et al. 1996] provides an operational definition for the consistency of shared variables. The definition, which remains unchanged in the JLS 2nd edition (currently under peer review), relies on a specific abstract machine as its underlying model and is very complicated. Several subsequent works have tried to simplify and formalize it. However, these revised definitions are also operational, and thus have failed to highlight the intuition behind the original specification. In this work we provide a complete nonoperational specification for Java and for the JVM, excluding synchronized operations. We provide a simpler definition, in which we clearly distinguish the consistency model that is promised to the programmer from that which should be implemented in the JVM. This distinction, which was implicit in the original definition, is crucial for building the JVM. We find that the programmer model is strictly weaker than that of the JVM, and precisely define their discrepancy. Moreover, our definition is independent of any specific (or even abstract) machine, and can thus be used to verify JVM implementations and compiler optimizations on any platform. Finally, we show the precise range of consistency relaxations obtainable for the Java memory model when a certain compiler optimization—called prescient stores in JLS—is applicable.

34 citations


Journal ArticleDOI
TL;DR: This study is the first to comprehensively explore the DSMP design space, and it is shown that applications execute up to 85% faster on a DSMP as compared to an all-software DSM, and that all-hardware DSMs hold a significant performance advantage over DSMPs on challenging applications.
Abstract: Parallel workstations, each comprising tens of processors based on shared memory, promise cost-effective scalable multiprocessing. This article explores the coupling of such small- to medium-scale shared-memory multiprocessors through software over a local area network to synthesize larger shared-memory systems. We call these systems Distributed Shared-memory MultiProcessors (DSMPs). This article introduces the design of a shared-memory system that uses multiple granularities of sharing, called MGS, and presents a prototype implementation of MGS on the MIT Alewife multiprocessor. Multigrain shared memory enables the collaboration of hardware and software shared memory, thus synthesizing a single transparent shared-memory address space across a cluster of multiprocessors. The system leverages the efficient support for fine-grain cache-line sharing within multiprocessor nodes as often as possible, and resorts to coarse-grain page-level sharing across nodes only when absolutely necessary. Using our prototype implementation of MGS, an in-depth study of several shared-memory applications is conducted to understand the behavior of DSMPs. Our study is the first to comprehensively explore the DSMP design space, and the first to compare the performance of DSMPs against all-software and all-hardware DSMs on a single experimental platform. Keeping the total number of processors fixed, we show that applications execute up to 85% faster on a DSMP as compared to an all-software DSM. We also show that all-hardware DSMs hold a significant performance advantage over DSMPs on challenging applications, between 159% and 1014%. However, program transformations to improve data locality for these applications allow DSMPs to almost match the performance of an all-hardware multiprocessor of the same size.

Journal ArticleDOI
TL;DR: The design and use of the tape mechanism is described, a new high-level abstraction of accesses to shared data for software DSMs, and it is shown that Tapeworm eliminates 85% of remote misses, reduces message traffic, and improves performance by an average of 29% for the application suite.
Abstract: We describe the design and use of the tape mechanism, a new high-level abstraction of accesses to shared data for software DSMs. Tapes consolidate and generalize a number of recent protocol optimizations, including update-based locks and record-replay barriers. Tapes are usually created by “recording” shared accesses. The resulting recordings can be used to anticipate future accesses by tailoring data movement to application semantics. Tapes-based mechanisms are layered on top of existing shared-memory protocols, and are largely independent of the underlying memory model. Tapes can also be used to emulate the data-movement semantics of several update-based protocol implementations, without altering the underlying protocol implementation. We have used tapes to create the Tapeworm synchronization library. Tapeworm implements sophisticated record-replay mechanisms across barriers, augments locks with data-movement semantics, and allows the use of producer-consumer segments, which move entire modified segments when any portion of the segment is accessed. We show that Tapeworm eliminates 85% of remote misses, reduces message traffic by 63%, and improves performance by an average of 29% for our application suite.
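The record-and-replay idea can be sketched in miniature: record which shared pages an interval (say, one barrier epoch) touches, then replay that tape on the next pass to move the data ahead of the accesses. The class and page numbers below are illustrative, not Tapeworm's actual interface:

```python
# Sketch of the tape mechanism: "record" the shared pages touched during one
# interval, then "replay" the recording later to anticipate the same access
# pattern and prefetch the data. Illustrative only.

class Tape:
    def __init__(self):
        self.accesses = []
        self.recording = False

    def record(self, page):
        if self.recording:
            self.accesses.append(page)

    def replay(self, prefetch):
        for page in self.accesses:     # tailor data movement to the recording
            prefetch(page)

tape = Tape()

tape.recording = True
for page in (7, 3, 9):      # first pass: shared accesses get recorded
    tape.record(page)
tape.recording = False

prefetched = []
tape.replay(prefetched.append)   # later pass: move the data before the misses

print(prefetched)   # [7, 3, 9]
```

Layering this above the coherence protocol, as the abstract notes, is what lets one mechanism emulate several update-based optimizations: the tape changes when data moves, not how the underlying protocol keeps it consistent.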