
Showing papers in "ACM Transactions on Computer Systems in 1998"


Journal ArticleDOI
TL;DR: The Paxon parliament's protocol provides a new way of implementing the state machine approach to the design of distributed systems.
Abstract: Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament's protocol provides a new way of implementing the state machine approach to the design of distributed systems.
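
The protocol beneath the metaphor is the synod algorithm. As a rough sketch (names and structure ours, not the paper's), the acceptor side of single-decree Paxos can be written as:

class Acceptor:
    def __init__(self):
        self.promised = -1     # highest ballot number promised so far
        self.accepted = None   # (ballot, value) last accepted, if any

    def on_prepare(self, ballot):
        # Phase 1: promise to ignore lower-numbered ballots.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", ballot, self.accepted)
        return ("nack", self.promised)

    def on_accept(self, ballot, value):
        # Phase 2: accept unless a higher ballot has since been promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("accepted", ballot, value)
        return ("nack", self.promised)

a = Acceptor()
assert a.on_prepare(1)[0] == "promise"
assert a.on_accept(1, "decree")[0] == "accepted"
assert a.on_prepare(0)[0] == "nack"    # stale ballots are refused

A proposer that collects promises from a majority must adopt the value of the highest-ballot acceptance reported in those promises; this is the rule that keeps the parliamentary record consistent despite absent legislators and forgetful messengers.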

2,965 citations


Journal ArticleDOI
TL;DR: Coyote, a system that supports the construction of highly modular and configurable versions of communication-oriented abstractions, is described; it extends protocol objects and hierarchical composition with support for finer-grain microprotocol objects and a nonhierarchical composition scheme for use within a single layer of a protocol stack.
Abstract: Communication-oriented abstractions such as atomic multicast, group RPC, and protocols for location-independent mobile computing can simplify the development of complex applications built on distributed systems. This article describes Coyote, a system that supports the construction of highly modular and configurable versions of such abstractions. Coyote extends the notion of protocol objects and hierarchical composition found in existing systems with support for finer-grain microprotocol objects and a nonhierarchical composition scheme for use within a single layer of a protocol stack. A customized service is constructed by selecting microprotocols based on their semantic guarantees and configuring them together with a standard runtime system to form a composite protocol implementing the service. This composite protocol is then composed hierarchically with other protocols to form a complete network subsystem. The overall approach is described and illustrated with examples of services that have been constructed using Coyote, including atomic multicast, group RPC, membership, and mobile computing protocols. A prototype implementation based on extending x-kernel version 3.2 running on Mach 3.0 with support for microprotocols is also presented, together with performance results from a suite of microprotocols from which over 60 variants of group RPC can be constructed.
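
As a hedged illustration of the composition idea (the class and event names below are invented, not Coyote's API), microprotocols can be modeled as handlers bound to shared events inside one composite protocol:

class CompositeProtocol:
    def __init__(self):
        self.handlers = {}   # event name -> microprotocol handlers

    def bind(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def raise_event(self, event, msg):
        # Nonhierarchical composition: every microprotocol bound to the
        # event sees the message within the same protocol layer.
        for handler in self.handlers.get(event, []):
            handler(msg)

def total_order(msg):
    # One semantic guarantee, packaged as a microprotocol.
    msg["seq"] = total_order.next_seq
    total_order.next_seq += 1
total_order.next_seq = 0

def reliable_delivery(msg):
    msg["needs_ack"] = True

rpc = CompositeProtocol()
rpc.bind("msg_send", total_order)        # pick guarantees a la carte
rpc.bind("msg_send", reliable_delivery)
rpc.raise_event("msg_send", {"payload": "request"})

Selecting a different set of handlers yields a different variant of the service, which is how over 60 variants of group RPC can arise from one suite of microprotocols.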

167 citations


Journal ArticleDOI
TL;DR: The article gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture.
Abstract: Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for 10 parallel applications illustrate the trade-offs made in the design of Orca and show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the article compares the performance of Orca with that of a page-based DSM (TreadMarks) and another object-based DSM (CRL). It also analyzes the communication overhead of the DSMs for several applications. All performance measurements are done on a 32-node Pentium Pro cluster with Myrinet and Fast Ethernet networks. The results show that Orca programs send fewer messages and less data than the TreadMarks and CRL programs and obtain better speedups.
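
To make the coherence design concrete, here is a small sketch (hypothetical API, not Orca code) of write-update with function shipping: a write ships the operation to every replica rather than shipping the data:

class ReplicatedObject:
    def __init__(self, state, replicas):
        self.state = state
        self.replicas = replicas   # every copy of the object

    def write(self, op, *args):
        # Function shipping over totally ordered group communication:
        # all replicas apply the same operations in the same order, so
        # the copies stay coherent without invalidations.
        for replica in self.replicas:
            op(replica.state, *args)

    def read(self, op):
        return op(self.state)      # reads are purely local

replicas = []
for _ in range(3):
    replicas.append(ReplicatedObject({"n": 0}, replicas))
replicas[0].write(lambda s: s.update(n=s["n"] + 1))
print(replicas[2].read(lambda s: s["n"]))   # 1 on every replica

The placement decision the article analyzes sits above this: objects with a low read/write ratio keep a single copy, so the loop degenerates to shipping the operation to one site.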

145 citations


Journal ArticleDOI
TL;DR: The Totem multiple-ring protocol provides reliable totally ordered delivery of messages across multiple local-area networks interconnected by gateways and achieves a latency that increases logarithmically with system size by exploiting process group locality and selective forwarding of messages through the gateways.
Abstract: The Totem multiple-ring protocol provides reliable totally ordered delivery of messages across multiple local-area networks interconnected by gateways. This consistent message order is maintained in the presence of network partitioning and remerging, and of processor failure and recovery. The protocol provides accurate topology change information as part of the global total order of messages. It addresses the issue of scalability and achieves a latency that increases logarithmically with system size by exploiting process group locality and selective forwarding of messages through the gateways. Pseudocode for the protocol and an evaluation of its performance are given.
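
The scalability argument rests on gateways forwarding selectively. A minimal sketch of that filter (names invented; the real protocol also maintains the global total order and topology information):

class Ring:
    def __init__(self, name):
        self.name, self.delivered = name, []

    def deliver(self, msg):
        # In the real protocol this goes through the ring's total order.
        self.delivered.append(msg)

class Gateway:
    def __init__(self, remote_ring, remote_groups):
        self.remote_ring = remote_ring
        self.remote_groups = remote_groups   # groups with members beyond

    def on_local_message(self, msg):
        # Selective forwarding: exploit process group locality by
        # dropping messages whose group has no members past this gateway.
        if msg["group"] in self.remote_groups:
            self.remote_ring.deliver(msg)

gw = Gateway(Ring("B"), remote_groups={"replicas"})
gw.on_local_message({"group": "replicas", "data": 1})    # forwarded
gw.on_local_message({"group": "local-only", "data": 2})  # dropped

Because most traffic stays on its local ring, a message's latency grows with the number of gateway hops it crosses rather than with the total number of processors.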

88 citations


Journal ArticleDOI
TL;DR: The article gives a detailed performance analysis of the approach to extending the OS and establishes that Ufo introduces acceptable overhead for common applications even though intercepting individual system calls incurs a high cost.
Abstract: In this article we show how to extend a wide range of functionality of standard operating systems completely at the user level. Our approach works by intercepting selected system calls at the user level, using tracing facilities such as the /proc file system provided by many Unix operating systems. The behavior of some intercepted system calls is then modified to implement new functionality. This approach does not require any relinking or recompilation of existing applications. In fact, the extensions can even be dynamically “installed” into already running processes. The extensions work completely at the user level and install without system administrator assistance. Individual users can choose what extensions to run, in effect creating a personalized operating system view for themselves. We used this approach to implement a global file system, called Ufo, which allows users to treat remote files exactly as if they were local. Currently, Ufo supports file access through the FTP and HTTP protocols and allows new protocols to be plugged in. While several other projects have implemented global file system abstractions, they all require either changes to the operating system or modifications to standard libraries. The article gives a detailed performance analysis of our approach to extending the OS and establishes that Ufo introduces acceptable overhead for common applications even though intercepting individual system calls incurs a high cost.
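
Ufo itself intercepts system calls of unmodified binaries through /proc; as a loose user-level analogy only (library-level interposition, not the paper's mechanism, with an invented naming scheme), the name-mapping idea looks like:

import os
import urllib.request

CACHE = "/tmp/ufo-cache"   # hypothetical local cache directory

def ufo_open(path, mode="r"):
    # Map a "global" name such as /http/host/file onto a cached local
    # copy, then behave exactly like a local open().
    if path.startswith("/http/"):
        url = "http://" + path[len("/http/"):]
        local = os.path.join(CACHE, url.replace("/", "_"))
        if not os.path.exists(local):          # fetch on first access
            os.makedirs(CACHE, exist_ok=True)
            urllib.request.urlretrieve(url, local)
        return open(local, mode)
    return open(path, mode)                    # local files unchanged

The real system performs this redirection at the system-call boundary, which is why existing applications need no relinking or recompilation.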

85 citations


Journal ArticleDOI
TL;DR: The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses, and can improve the speed of some parallel applications by as much as a factor of two.
Abstract: The large latency of memory accesses in large-scale shared-memory multiprocessors is a key obstacle to achieving high processor utilization. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing instructions to move data close to the processor before the data are actually needed. To minimize the burden on the programmer, compiler support is needed to automatically insert prefetch instructions into the code. A key challenge when inserting prefetches is ensuring that the overheads of prefetching do not outweigh the benefits. While previous studies have demonstrated the effectiveness of hand-inserted prefetching in multiprocessor applications, the benefit of compiler-inserted prefetching in practice has remained an open question. This article proposes and evaluates a new compiler algorithm for inserting prefetches into multiprocessor code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. We have implemented our algorithm in the SUIF (Stanford University Intermediate Format) optimizing compiler. The results of our detailed architectural simulations demonstrate that compiler-inserted prefetching can improve the speed of some parallel applications by as much as a factor of two.
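
The selectivity the algorithm aims for can be illustrated with a toy code generator (cache parameters assumed, not taken from the article): with several array elements per cache line, only the leading reference of each line misses, so the loop is strip-mined and one prefetch is issued per line, a few lines ahead:

LINE = 8     # array elements per cache line (assumed)
AHEAD = 4    # lines of lookahead to cover the miss latency (assumed)

def emit_prefetching_loop(n):
    # Emit a prefetch only for references predicted to miss: the first
    # element of each cache line, AHEAD lines before it is needed.
    body = []
    for i in range(0, n, LINE):
        if i + AHEAD * LINE < n:
            body.append(f"prefetch(&a[{i + AHEAD * LINE}])")
        for j in range(i, min(i + LINE, n)):
            body.append(f"use(a[{j}])")   # original loop body
    return body

print("\n".join(emit_prefetching_loop(64)))

Issuing one prefetch per line instead of one per iteration is the kind of overhead reduction that keeps prefetching profitable; sparse-matrix codes need a different miss predictor, since their access patterns are not analyzable from loop bounds alone.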

72 citations


Journal ArticleDOI
TL;DR: The characteristics of the value prediction concept are examined from two perspectives: the related phenomena that are reflected in the nature of computer programs and the significance of these phenomena to boosting instruction-level parallelism of superscalar microprocessors that support speculative execution.
Abstract: This article presents an experimental and analytical study of value prediction and its impact on speculative execution in superscalar microprocessors. Value prediction is a new paradigm that suggests predicting outcome values of operations (at run-time) and using these predicted values to trigger the execution of true-data-dependent operations speculatively. As a result, stalls to memory locations can be reduced and the amount of instruction-level parallelism can be extended beyond the limits of the program's dataflow graph. This article examines the characteristics of the value prediction concept from two perspectives: (1) the related phenomena that are reflected in the nature of computer programs and (2) the significance of these phenomena to boosting instruction-level parallelism of superscalar microprocessors that support speculative execution. In order to better understand these characteristics, our work combines both analytical and experimental studies.
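
The simplest instance of the paradigm is a last-value predictor, sketched below (our illustration, not the article's experimental setup): predict that an instruction will produce the value it produced last time, and measure how often that holds on a trace:

def last_value_hit_rate(trace):
    # trace: iterable of (pc, outcome_value) pairs
    table, hits, total = {}, 0, 0
    for pc, value in trace:
        if table.get(pc) == value:   # correct prediction: dependent
            hits += 1                # operations could have launched
        table[pc] = value            # speculatively
        total += 1
    return hits / total if total else 0.0

# A loop-invariant load is perfectly predictable; a changing result is
# not (a stride predictor would catch the second pattern).
trace = [(0x400, 7)] * 10 + [(0x404, i) for i in range(10)]
print(last_value_hit_rate(trace))    # 0.45 = (9 + 0) / 20

The "related phenomena" the article studies are precisely such regularities: how often real programs recompute the values they computed before.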

67 citations


Journal ArticleDOI
TL;DR: The performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions.
Abstract: Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, this article proposes a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. This article describes two different implementations of informing memory operations. One is based on a cache-outcome condition code, and the other is based on low-overhead traps. We find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and we look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.
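
In software terms, an informing load returns both the value and the cache outcome, so a handler runs only on misses. The cache model and handler below are illustrative, not the article's implementations:

class Cache:
    def __init__(self, lines=64, line_size=64):
        self.lines, self.line_size = lines, line_size
        self.tags = [None] * lines   # direct-mapped, tags only

    def informing_load(self, addr, memory):
        # Software analogue of a load fused with a branch-and-link that
        # is taken only on a miss (or a cache-outcome condition code).
        index = (addr // self.line_size) % self.lines
        tag = addr // (self.line_size * self.lines)
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return memory[addr], hit

miss_counts = {}
def on_miss(addr):
    # A cheap user-level reaction, e.g. miss profiling or a coherence
    # check under fine-grained access control.
    miss_counts[addr] = miss_counts.get(addr, 0) + 1

cache = Cache()
memory = {0: 42, 4096 * 64: 7}
for addr in (0, 0, 4096 * 64, 0):    # the third load evicts line 0
    value, hit = cache.informing_load(addr, memory)
    if not hit:
        on_miss(addr)
print(miss_counts)                   # {0: 2, 262144: 1}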

35 citations


Journal ArticleDOI
TL;DR: There are several situations in which the models do not precisely predict the actual runtime behavior of an algorithm implementation, and a comparison between the models is provided in order to determine the model that induces the most efficient algorithms.
Abstract: In recent years, a large number of parallel computation models have been proposed to replace the PRAM as the parallel computation model presented to the algorithm designer. Although the theoretical justifications for these models are mostly sound, and many algorithmic results were obtained through these models, little experimentation has been conducted to validate the effectiveness of these models for developing cost-effective algorithms and applications on existing hardware platforms. In this article a first attempt is made to give a detailed experimental account of the preciseness of these models. To achieve this, three models (BSP, E-BSP, and BPRAM) were selected and validated on five parallel platforms (Cray T3E, Thinking Machines CM-5, Intel Paragon, MasPar MP-1, and Parsytec GCel). The work described in this article consists of three parts. First, the predictive capabilities of the models are investigated. Unlike previous experimental work, which mostly demonstrated a close match between the measured and predicted execution times, this article shows that there are several situations in which the models do not precisely predict the actual runtime behavior of an algorithm implementation. Second, a comparison between the models is provided in order to determine the model that induces the most efficient algorithms. Lastly, the performance achieved by the model-derived algorithms is compared with the performance attained by machine-specific algorithms in order to examine the effectiveness of deriving fast algorithms through the formalisms of the models.
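
As a worked example of what "predicting runtime" means here, the BSP cost of a program is the sum over supersteps of w + g*h + l, where w is local work, h the largest number of words any processor sends or receives, and g and l are the machine's bandwidth and synchronization parameters. The numbers below are invented for illustration:

def bsp_cost(supersteps, g, l):
    # supersteps: list of (w, h) pairs; cost in units of one local op
    return sum(w + g * h + l for w, h in supersteps)

# Predicted time of a two-superstep algorithm on a hypothetical machine
# with g = 4.0 and l = 1500:
print(bsp_cost([(10_000, 200), (5_000, 50)], g=4.0, l=1_500))   # 19000

The article's experiments compare such predictions against measured runtimes on the five platforms, exposing the situations where the abstraction hides a cost the hardware does not.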

32 citations


Journal ArticleDOI
TL;DR: This article deals with a decay-usage scheduling policy in multiprocessors modeled after widely used systems, and shows how for a fixed set of jobs with large processing demands, the steady-state shares can be obtained given the base priorities, and conversely, how to set the base priorities given the required shares.
Abstract: Decay-usage scheduling is a priority-aging time-sharing scheduling policy capable of dealing with a workload of both interactive and batch jobs by decreasing the priority of a job when it acquires CPU time, and by increasing its priority when it does not use the CPU. In this article we deal with a decay-usage scheduling policy in multiprocessors modeled after widely used systems. The priority of a job consists of a base priority and a time-dependent component based on processor usage. Because the priorities in our model are time dependent, a queuing-theoretic analysis—for instance, for the mean job response time—seems impossible. Still, it turns out that as a consequence of the scheduling policy, the shares of the available CPU time obtained by jobs converge, and a deterministic analysis for these shares is feasible: We show how for a fixed set of jobs with large processing demands, the steady-state shares can be obtained given the base priorities, and conversely, how to set the base priorities given the required shares. In addition, we analyze the relation between the values of the scheduler parameters and the level of control it can exercise over the steady-state share ratios, and we deal with the rate of convergence. We validate the model by simulations and by measurements of actual systems.
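
The convergence of shares can be seen in a small simulation (decay factor and tick rate invented; real systems differ): each tick the job with the lowest priority value runs, usage accumulates, and usage decays periodically:

DECAY = 0.5            # decay factor applied once per second (assumed)
TICKS_PER_SEC = 100

def steady_state_shares(base_priorities, seconds=50):
    usage = [0.0] * len(base_priorities)
    ticks = [0] * len(base_priorities)
    for _ in range(seconds):
        for _ in range(TICKS_PER_SEC):
            # Priority = base + usage component; lowest value runs.
            j = min(range(len(usage)),
                    key=lambda i: base_priorities[i] + usage[i])
            usage[j] += 1.0
            ticks[j] += 1
        usage = [u * DECAY for u in usage]   # periodic decay
    total = sum(ticks)
    return [t / total for t in ticks]

print(steady_state_shares([0, 0]))    # ~[0.5, 0.5]
print(steady_state_shares([0, 20]))   # the lower base wins a larger share

This is the deterministic analysis in miniature: given the bases, the shares converge, and the inverse problem of choosing bases to hit target shares is what the article solves analytically.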

27 citations


Journal ArticleDOI
TL;DR: Reconfiguration Graph Grammars (RGG) are defined and applied to the definition of processor array reconfiguration algorithms, and the resulting algorithms for dynamic (run-time) reconfiguration are efficient and can be implemented distributively.
Abstract: Reconfiguration for fault tolerance is a widely studied field, but this work applies graph grammars to this discipline for the first time. Reconfiguration Graph Grammars (RGG) are defined and applied to the definition of processor array reconfiguration algorithms. The nodes of a graph are associated with the processors of a processor array, and the edges are associated with those interprocessor communication lines that are active. The resulting algorithms for dynamic (run-time) reconfiguration are efficient and can be implemented distributively.
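
As a hedged illustration of a reconfiguration rewrite rule (our encoding, not the paper's formalism): when a processor fails, the rule removes its node and rewires the live neighbors to a designated spare, using only information local to the fault:

def apply_fault_rule(adj, faulty, spare):
    # adj: node -> set of neighbors over active communication lines.
    # Left-hand side of the rule: a faulty node with live neighbors.
    neighbors = adj.pop(faulty)
    neighbors.discard(spare)
    for n in neighbors:
        adj[n].discard(faulty)
        # Right-hand side: each former edge to the faulty node is
        # replaced by an edge to the spare.
        adj[n].add(spare)
        adj[spare].add(n)
    return adj

array = {"p0": {"p1"}, "p1": {"p0", "p2"}, "p2": {"p1"}, "spare": set()}
print(apply_fault_rule(array, "p1", "spare"))
# e.g. {'p0': {'spare'}, 'p2': {'spare'}, 'spare': {'p0', 'p2'}}

Because each rule application touches only the fault's neighborhood, such rules lend themselves to the distributed, run-time implementation the abstract claims.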