Showing papers in "ACM Transactions on Computer Systems in 1999"


Journal Article
TL;DR: This article introduces the protocol, provides a theoretical analysis of its behavior, reviews experimental results, and discusses some candidate applications, confirming that bimodal multicast is reliable and scalable and that the protocol provides remarkably stable delivery throughput.
Abstract: There are many methods for making a multicast protocol “reliable.” At one end of the spectrum, a reliable multicast protocol might offer atomicity guarantees, such as all-or-nothing delivery, delivery ordering, and perhaps additional properties such as virtually synchronous addressing. At the other are protocols that use local repair to overcome transient packet loss in the network, offering “best effort” reliability. Yet none of this prior work has treated stability of multicast delivery as a basic reliability property, such as might be needed in an internet radio, television, or conferencing application. This article looks at reliability with a new goal: development of a multicast protocol which is reliable in a sense that can be rigorously quantified and includes throughput stability guarantees. We characterize this new protocol as a “bimodal multicast” in reference to its reliability model, which corresponds to a family of bimodal probability distributions. Here, we introduce the protocol, provide a theoretical analysis of its behavior, review experimental results, and discuss some candidate applications. These confirm that bimodal multicast is reliable and scalable, and that the protocol provides remarkably stable delivery throughput.

693 citations


Journal Article
TL;DR: The main technique, controlled prefix expansion, transforms a set of prefixes into an equivalent set with fewer prefix lengths; dynamic-programming optimizations and local transformations of data structures that improve cache behavior are also used.
Abstract: Internet (IP) address lookup is a major bottleneck in high-performance routers. IP address lookup is challenging because it requires a longest matching prefix lookup. It is compounded by increasing routing table sizes, increased traffic, higher-speed links, and the migration to 128-bit IPv6 addresses. We describe how IP lookups and updates can be made faster using a set of transformation techniques. Our main technique, controlled prefix expansion, transforms a set of prefixes into an equivalent set with fewer prefix lengths. In addition, we use optimization techniques based on dynamic programming, and local transformations of data structures to improve cache behavior. When applied to trie search, our techniques provide a range of algorithms (Expanded Tries) whose performance can be tuned. For example, using a processor with 1MB of L2 cache, search of the MaeEast database containing 38000 prefixes can be done in 3 L2 cache accesses. On a 300MHz Pentium II which takes 4 cycles for accessing the first word of the L2 cacheline, this algorithm has a worst-case search time of 180 nsec., a worst-case insert/delete time of 2.5 msec., and an average insert/delete time of 4 usec. Expanded tries provide faster search and faster insert/delete times than earlier lookup algorithms. When applied to Binary Search on Levels, our techniques improve worst-case search times by nearly a factor of 2 (using twice as much storage) for the MaeEast database. Our approach to algorithm design is based on measurements using the VTune tool on a Pentium to obtain dynamic clock cycle counts. Our techniques also apply to similar address lookup problems in other network protocols.
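The idea of controlled prefix expansion can be illustrated with a minimal sketch. It is not the authors' implementation: prefixes are modeled as bit strings in a dictionary, and the names `expand_prefixes` and `lookup` are hypothetical. Each prefix is padded out to the nearest allowed length, and when two expansions collide the entry that came from the longer original prefix wins, which preserves longest-match semantics.

```python
def expand_prefixes(prefixes, allowed_lengths):
    """Expand bit-string prefixes so only the allowed lengths remain.

    prefixes: dict mapping a bit string (e.g. "101") to a next hop.
    allowed_lengths: permitted prefix lengths; the largest must be at
    least as long as the longest prefix (an assumption of this sketch).
    """
    allowed = sorted(allowed_lengths)
    expanded = {}
    origin_len = {}  # length of the original prefix behind each entry
    for prefix, hop in prefixes.items():
        # Smallest allowed length that can hold this prefix.
        target = next(n for n in allowed if n >= len(prefix))
        pad = target - len(prefix)
        for i in range(2 ** pad):
            suffix = format(i, "b").zfill(pad) if pad else ""
            key = prefix + suffix
            # On collision, the longer original prefix takes priority.
            if key not in expanded or origin_len[key] < len(prefix):
                expanded[key] = hop
                origin_len[key] = len(prefix)
    return expanded


def lookup(expanded, allowed_lengths, address_bits):
    """Longest-match lookup that probes only the allowed lengths."""
    for length in sorted(allowed_lengths, reverse=True):
        key = address_bits[:length]
        if len(key) == length and key in expanded:
            return expanded[key]
    return None


# Three prefix lengths (1, 2, 3) collapse to two (2, 3).
table = expand_prefixes({"1": "A", "10": "C", "101": "B"}, [2, 3])
print(table)                          # entries of length 2 and 3 only
print(lookup(table, [2, 3], "1011"))  # matches the 3-bit entry
```

Fewer distinct lengths mean fewer probes per lookup, at the cost of a larger table; the abstract's dynamic-programming step, which chooses the lengths, is not modeled here.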

514 citations


Journal Article
TL;DR: An implementation of the tools needed to support RecPlay, a combination of record/replay with automatic on-the-fly data race detection, which makes it possible to limit the record phase to the more efficient recording of the synchronization operations, while deferring the time-consuming data race detection to the replay phase.
Abstract: This article presents a practical solution for the cyclic debugging of nondeterministic parallel programs. The solution consists of a combination of record/replay with automatic on-the-fly data race detection. This combination enables us to limit the record phase to the more efficient recording of the synchronization operations, while deferring the time-consuming data race detection to the replay phase. As the record phase is highly efficient, there is no need to switch it off, thereby eliminating the possibility of Heisenbugs because tracing can be left on all the time. This article describes an implementation of the tools needed to support RecPlay.

330 citations


Journal Article
TL;DR: It is found that temporal and spatial reuse have balanced roles within a loop nest and that most reuse across nests and the entire program is temporal, which goes against the commonly held assumption that spatial reuse dominates.
Abstract: This article analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than at the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are internest capacity misses, and they correspond to potential reuse between nearby loop nests. In addition, we find that temporal and spatial reuse have balanced roles within a loop nest and that most reuse across nests and the entire program is temporal. These results are consistent with high hit rates (80% or more hits), but go against the commonly held assumption that spatial reuse dominates. Our locality measurements reveal important differences between loop nests and programs, refute some popular assertions, and provide new insights for the compiler writer and the architect.

59 citations


Journal Article
TL;DR: Experimental results indicate that the use of optimistic synchronization in this context can significantly reduce the memory consumption and improve the overall performance.
Abstract: This article presents our experience using optimistic synchronization to implement fine-grain atomic operations in the context of a parallelizing compiler for irregular, object-based computations. Our experience shows that the synchronization requirements of these programs differ significantly from those of traditional parallel computations, which use loop nests to access dense matrices using affine access functions. In addition to coarse-grain barrier synchronization, our irregular computations require synchronization primitives that support efficient fine-grain atomic operations. The standard implementation mechanism for atomic operations uses mutual exclusion locks. But the overhead of acquiring and releasing locks can reduce the performance. Locks can also consume significant amounts of memory. Optimistic synchronization primitives such as load-linked/store-conditional are an attractive alternative. They require no additional memory and eliminate the use of heavyweight blocking synchronization constructs. We evaluate the effectiveness of optimistic synchronization by comparing experimental results from two versions of a parallelizing compiler for irregular, object-based computations. One version generates code that uses mutual exclusion locks to make operations execute atomically. The other version uses optimistic synchronization. We used this compiler to automatically parallelize three irregular, object-based benchmark applications of interest to the scientific and engineering computation community. The presented experimental results indicate that the use of optimistic synchronization in this context can significantly reduce the memory consumption and improve the overall performance.
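The retry pattern behind optimistic synchronization can be sketched as follows. This is an illustration only, not the compiler's generated code: Python has no load-linked/store-conditional instruction, so the hypothetical `Cell` class emulates one with a version counter, and its internal lock merely stands in for the atomicity that hardware would provide.

```python
import threading


class Cell:
    """Software stand-in for a word supporting load-linked/store-conditional.

    The private lock only emulates hardware atomicity of each primitive;
    callers never hold it across their own computation.
    """

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._hw = threading.Lock()

    def load_linked(self):
        with self._hw:
            return self._value, self._version

    def store_conditional(self, version, new_value):
        with self._hw:
            if version != self._version:
                return False  # another write happened since load_linked
            self._value = new_value
            self._version += 1
            return True


def atomic_add(cell, delta):
    """Optimistic update: compute, try to commit, retry on conflict."""
    retries = 0
    while True:
        old, version = cell.load_linked()
        if cell.store_conditional(version, old + delta):
            return retries
        retries += 1


counter = Cell(0)
workers = [
    threading.Thread(target=lambda: [atomic_add(counter, 1) for _ in range(1000)])
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
final_value, _ = counter.load_linked()
print(final_value)
```

The point of the pattern is that no per-object lock is stored alongside the data and no thread blocks while another computes; a conflicting update only costs a retry.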

47 citations


Journal Article
TL;DR: It is argued that quasi FIFO is adequate for most applications, and an architectural framework for transparently embedding the authors' protocol at the network level by striping IP packets across multiple physical interfaces is developed.
Abstract: Link-striping algorithms are often used to overcome transmission bottlenecks in computer networks. Traditional striping algorithms suffer from two major disadvantages. They provide inadequate load sharing in the presence of variable-length packets, and may result in non-FIFO delivery of data. We describe a new family of link-striping algorithms that solves both problems. Our scheme applies to any layer that can provide multiple FIFO channels. We deal with variable-sized packets by showing how fair-queuing algorithms can be transformed into load-sharing algorithms. Our transformation results in practical load-sharing protocols, and shows a theoretical connection between two seemingly different problems. The same transformation can be applied to obtain load-sharing protocols for links with different capacities. We deal with the FIFO requirement for two separate cases. If a sequence number can be added to each packet, we show how to speed up packet processing by letting the receiver simulate the sender algorithm. If no header can be added, we show how to provide quasi FIFO delivery. Quasi FIFO is FIFO except during occasional periods of loss of synchronization. We argue that quasi FIFO is adequate for most applications. We also describe a simple technique for speedy restoration of synchronization in the event of loss. We develop an architectural framework for transparently embedding our protocol at the network level by striping IP packets across multiple physical interfaces. The resulting stripe protocol has been implemented within the NetBSD kernel. Our measurements and simulations show that the protocol offers scalable throughput even when striping is done over dissimilar links, and that the protocol resynchronizes quickly after packet loss. Measurements show performance improvements over conventional round-robin striping schemes and striping schemes that do not resequence packets. Some aspects of our solution have been implemented in Cisco's router operating system (IOS 11.3) in the context of Multilink PPP striping.
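A rough sketch of how a fair-queuing discipline might be turned into a load-sharing one, and of how a receiver can restore order by simulating the sender. The abstract does not name the fair-queuing algorithm, so the byte-credit round robin used here is an assumption, as are the names `Striper`, `stripe`, and `resequence`. The sketch also assumes lossless FIFO channels; the quasi-FIFO resynchronization after loss is not modeled.

```python
from collections import deque


class Striper:
    """Channel chooser run identically by sender and receiver."""

    def __init__(self, quanta):
        self.quanta = list(quanta)        # bytes of credit per round, all > 0
        self.credit = [0] * len(quanta)
        self.current = 0
        self.credit[0] = self.quanta[0]

    def next_channel(self):
        # Skip channels whose byte credit is used up, topping up each
        # channel's credit as the round-robin pointer reaches it.
        while self.credit[self.current] <= 0:
            self.current = (self.current + 1) % len(self.quanta)
            self.credit[self.current] += self.quanta[self.current]
        return self.current

    def account(self, length):
        # Charge the packet to the channel that carried it.
        self.credit[self.current] -= length


def stripe(packets, quanta):
    """Sender side: spread variable-length packets over FIFO channels."""
    chooser = Striper(quanta)
    channels = [deque() for _ in quanta]
    for pkt in packets:
        channels[chooser.next_channel()].append(pkt)
        chooser.account(len(pkt))
    return channels


def resequence(channels, quanta):
    """Receiver side: replay the sender's choices to recover FIFO order."""
    chooser = Striper(quanta)
    remaining = sum(len(c) for c in channels)
    out = []
    while remaining:
        pkt = channels[chooser.next_channel()].popleft()
        chooser.account(len(pkt))
        out.append(pkt)
        remaining -= 1
    return out


sizes = [300, 100, 700, 50, 400, 200, 900, 150]
sent = [bytes([i]) * n for i, n in enumerate(sizes)]
links = stripe(sent, [500, 500])
bytes_per_link = [sum(len(p) for p in link) for link in links]
received = resequence(links, [500, 500])
print(bytes_per_link, received == sent)
```

Because the receiver sees the same packet lengths in the same order, it makes the same channel choices as the sender, so no sequence number is needed on a lossless path. Unequal quanta would model links of different capacities.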

44 citations


Journal Article
TL;DR: This article presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments, and performs a theoretical analysis which provides a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.
Abstract: This article presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment. We have implemented dynamic feedback in the context of a parallelizing compiler for object-based programs. The generated code uses dynamic feedback to automatically choose the best synchronization optimization policy. Our experimental results show that the synchronization optimization policy has a significant impact on the overall performance of the computation, that the best policy varies from program to program, that the compiler is unable to statically choose the best policy, and that dynamic feedback enables the generated code to exhibit performance that is comparable to that of code that has been manually tuned to use the best policy. We have also performed a theoretical analysis which provides, under certain assumptions, a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.
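The alternation of sampling and production phases can be sketched as below. This is a simplified illustration, not the compiler-generated code: each version is modeled as a function that processes one work item and returns its measured overhead, whereas a real system would measure overhead with timers or counters. The names `dynamic_feedback`, `locks_version`, and `optimistic_version` are hypothetical.

```python
def dynamic_feedback(versions, work_items, sample_size, production_size):
    """Alternate sampling and production phases over a stream of work.

    versions: dict mapping a version name to a function(item) that
    processes the item and returns its overhead (an assumption of this
    sketch). Returns a log of (phase, version name) per item processed.
    """
    log = []
    items = iter(work_items)
    try:
        while True:
            # Sampling phase: run every version briefly and record overhead.
            overhead = {}
            for name, run in versions.items():
                total = 0.0
                for _ in range(sample_size):
                    total += run(next(items))
                    log.append(("sample", name))
                overhead[name] = total
            # Production phase: use the version that sampled cheapest.
            best = min(overhead, key=overhead.get)
            for _ in range(production_size):
                versions[best](next(items))
                log.append(("production", best))
    except StopIteration:
        return log  # work exhausted


def locks_version(item):
    # Pretend the environment changes at item 10, making this policy cheap.
    return 5.0 if item < 10 else 1.0


def optimistic_version(item):
    return 2.0


log = dynamic_feedback(
    {"locks": locks_version, "optimistic": optimistic_version},
    range(20),
    sample_size=1,
    production_size=4,
)
print(log[2], log[14])
```

Periodic resampling is what lets the choice flip once the environment changes; the lengths of the two phases trade sampling cost against how quickly the computation adapts.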

22 citations


Journal Article
TL;DR: The design and implementation of a new language for parallel programming, Ace, which integrates support for customizable protocols with minimal extensions to C, is described, and compiler optimizations that improve the performance of such software shared-memory systems are discussed.
Abstract: Customizing the protocols that manage accesses to different data structures within an application can improve the performance of software shared-memory programs substantially. Existing systems for using customizable protocols are hard to use directly because the mechanisms they provide for manipulating protocols are low-level ones. This article is an in-depth study of the issues involved in providing language support for application-specific protocols. We describe the design and implementation of a new language for parallel programming, Ace, that integrates support for customizable protocols with minimal extensions to C. Ace applications are developed using a shared-memory model with a default sequentially consistent protocol. Performance can then be optimized, with minor modifications to the application, by experimenting with different protocol libraries. The design of Ace was driven by a detailed study of the use of customizable protocols. We delineate the issues that arise when programming with customizable protocols and present novel abstractions that allow for their easy use. We describe the design and implementation of a runtime system and compiler for Ace and discuss compiler optimizations that improve the performance of such software shared-memory systems. We study the communication patterns of a set of benchmark applications and consider the use of customizable protocols to optimize their performance. We evaluate the performance of our system through experiments on a Thinking Machines CM-5 and a Cray T3E. We also present measurements that demonstrate that Ace has good performance compared to that of a modern distributed shared-memory system.

18 citations


Journal Article
TL;DR: An efficient server-based algorithm for garbage collecting persistent object stores in a client-server environment that works with standard implementation techniques such as Two-Phase Locking and Write-Ahead-Logging and supports client-server performance optimizations such as client caching and flexible management of client buffers.
Abstract: We describe an efficient server-based algorithm for garbage collecting persistent object stores in a client-server environment. The algorithm is incremental and runs concurrently with client transactions. Unlike previous algorithms, it does not hold any transactional locks on data and does not require callbacks to clients. It is fault-tolerant, but performs very little logging. The algorithm has been designed to be integrated into existing systems, and therefore it works with standard implementation techniques such as Two-Phase Locking and Write-Ahead-Logging. In addition, it supports client-server performance optimizations such as client caching and flexible management of client buffers. We describe an implementation of the algorithm in the EXODUS storage manager and present the results of a performance study of the implementation.

15 citations