
Showing papers in "ACM Transactions on Computer Systems in 1997"


Journal ArticleDOI
TL;DR: A new tool, called Eraser, is described for dynamically detecting data races in lock-based multithreaded programs; it uses binary rewriting techniques to monitor every shared-memory reference and verify that consistent locking behavior is observed.
Abstract: Multithreaded programming is difficult and error prone. It is easy to make a mistake in synchronization that produces a data race, yet it can be extremely hard to locate this mistake during debugging. This article describes a new tool, called Eraser, for dynamically detecting data races in lock-based multithreaded programs. Eraser uses binary rewriting techniques to monitor every shared-memory reference and verify that consistent locking behavior is observed. We present several case studies, including undergraduate coursework and a multithreaded Web search engine, that demonstrate the effectiveness of this approach.
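
The core of Eraser's dynamic detection is lockset refinement: every shared location must be consistently protected by at least one common lock. The Python sketch below illustrates only that refinement; the monitor API is invented for illustration, and it omits Eraser's state machine for initialization and read-shared data as well as the binary-rewriting instrumentation.

```python
# Minimal sketch of lockset refinement (hypothetical monitor API; a real tool
# intercepts loads and stores via binary rewriting).

class LocksetChecker:
    def __init__(self):
        self.held = {}        # thread id -> set of locks currently held
        self.candidates = {}  # variable -> locks that protected every access so far

    def acquire(self, tid, lock):
        self.held.setdefault(tid, set()).add(lock)

    def release(self, tid, lock):
        self.held.get(tid, set()).discard(lock)

    def access(self, tid, var):
        locks_now = self.held.get(tid, set())
        if var not in self.candidates:
            self.candidates[var] = set(locks_now)   # first access: all held locks are candidates
        else:
            self.candidates[var] &= locks_now       # refine by intersection on every later access
        if not self.candidates[var]:
            print(f"possible data race on {var!r}: no lock protects every access")

checker = LocksetChecker()
checker.acquire("T1", "mu"); checker.access("T1", "x"); checker.release("T1", "mu")
checker.access("T2", "x")   # T2 touches x with no lock held -> candidate lockset becomes empty
```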

1,553 citations


Journal ArticleDOI
TL;DR: This article uses virtual machines to run multiple commodity operating systems on a scalable multiprocessor, transparently shares major data structures among the virtual machines to reduce the memory overheads of running multiple operating systems, and uses the distributed-system support of modern operating systems to export a partial single system image to the users.
Abstract: In this article we examine the problem of extending modern operating systems to run efficiently on large-scale shared-memory multiprocessors without a large implementation effort. Our approach brings back an idea popular in the 1970s: virtual machine monitors. We use virtual machines to run multiple commodity operating systems on a scalable multiprocessor. This solution addresses many of the challenges facing the system software for these machines. We demonstrate our approach with a prototype called Disco that runs multiple copies of Silicon Graphics' IRIX operating system on a multiprocessor. Our experience shows that the overheads of the monitor are small and that the approach provides scalability as well as the ability to deal with the nonuniform memory access time of these systems. To reduce the memory overheads associated with running multiple operating systems, virtual machines transparently share major data structures such as the program code and the file system buffer cache. We use the distributed-system support of modern operating systems to export a partial single system image to the users. The overall solution achieves most of the benefits of operating systems customized for scalable multiprocessors, yet it can be achieved with a significantly smaller implementation effort.
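
One concrete mechanism behind the memory savings mentioned above is transparent sharing of identical machine pages (for example, pages read from the same disk block) among virtual machines, with copy-on-write when a guest modifies its copy. The sketch below is an illustrative model of that idea, not Disco's code; the class and method names are invented.

```python
# Illustrative sketch of copy-on-write sharing of machine pages between VMs
# that read the same disk block (invented names, toy bookkeeping only).

class MachineMemory:
    def __init__(self):
        self.block_to_page = {}   # disk block -> shared machine page
        self.refcount = {}        # machine page -> number of VMs mapping it
        self.next_page = 0

    def map_disk_block(self, vm, block):
        """Return a machine page for `block`, reusing an existing copy when possible."""
        page = self.block_to_page.get(block)
        if page is None:
            page = self.next_page; self.next_page += 1   # read the block from disk once
            self.block_to_page[block] = page
        self.refcount[page] = self.refcount.get(page, 0) + 1
        return page   # mapped read-only; a write by any VM triggers copy_on_write()

    def copy_on_write(self, vm, page):
        """Give the writing VM a private copy so the other VMs keep the shared page."""
        private = self.next_page; self.next_page += 1
        self.refcount[page] -= 1
        return private
```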

603 citations


Journal ArticleDOI
TL;DR: The measurements indicate that the distribution of lifetimes for a UNIX process is Pareto (heavy-tailed), with a consistent functional form over a variety of workloads, and it is shown how to apply this distribution to derive a preemptive migration policy that requires no hand-tuned parameters.
Abstract: We consider policies for CPU load balancing in networks of workstations. We address the question of whether preemptive migration (migrating active processes) is necessary, or whether remote execution (migrating processes only at the time of birth) is sufficient for load balancing. We show that resolving this issue is strongly tied to understanding the process lifetime distribution. Our measurements indicate that the distribution of lifetimes for a UNIX process is Pareto (heavy-tailed), with a consistent functional form over a variety of workloads. We show how to apply this distribution to derive a preemptive migration policy that requires no hand-tuned parameters. We used a trace-driven simulation to show that our preemptive migration strategy is far more effective than remote execution, even when the memory transfer cost is high.
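
The policy follows from the heavy-tailed lifetime distribution: if Pr[lifetime > t] falls off roughly as 1/t, a process's expected remaining lifetime grows with its age, so only sufficiently old processes are worth the cost of migrating. The sketch below works through that arithmetic under an assumed Pareto shape parameter of 1; the article derives the precise criterion more carefully.

```python
# Worked illustration of the heavy-tail argument (assumed Pareto shape of 1).

def prob_lives_another(age, extra):
    """With Pr[lifetime > t] = k/t, a process that has already run `age` seconds
    survives `extra` more seconds with probability age / (age + extra)."""
    return age / (age + extra)

def worth_migrating(age, migration_cost):
    """Migrate only processes old enough that they are likely to outlive the cost
    of moving them; young processes are overwhelmingly short-lived."""
    return prob_lives_another(age, migration_cost) >= 0.5   # equivalent to age >= migration_cost

print(prob_lives_another(1.0, 1.0))   # 0.5: a 1-second-old process has even odds of running 1 more second
print(worth_migrating(0.1, 2.0))      # False: almost certainly finishes before migration pays off
print(worth_migrating(5.0, 2.0))      # True
```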

424 citations


Journal ArticleDOI
TL;DR: The authors modified an interrupt-driven networking implementation to schedule network interrupt handling as carefully as process execution; the modification eliminates receive livelock without degrading other aspects of system performance, using polling when the system is heavily loaded while retaining interrupts under lighter load.
Abstract: Most operating systems use interface interrupts to schedule network tasks. Interrupt-driven systems can provide low overhead and good latency at low offered load, but degrade significantly at higher arrival rates unless care is taken to prevent several pathologies. These are various forms of receive livelock, in which the system spends all of its time processing interrupts, to the exclusion of other necessary tasks. Under extreme conditions, no packets are delivered to the user application or the output of the system. To avoid livelock and related problems, an operating system must schedule network interrupt handling as carefully as it schedules process execution. We modified an interrupt-driven networking implementation to do so; this modification eliminates receive livelock without degrading other aspects of system performance. Our modifications include the use of polling when the system is heavily loaded, while retaining the use of interrupts under lighter load. We present measurements demonstrating the success of our approach.
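
The scheme the abstract outlines switches between interrupts and polling: the receive interrupt handler masks further interrupts and defers work to a polling routine that processes a bounded batch of packets per pass, re-enabling interrupts once the queue drains. The sketch below shows that control structure with a toy in-memory NIC; the function names are illustrative, not a real driver API.

```python
# Schematic sketch of the interrupt/polling hybrid, with a toy in-memory "NIC".

from collections import deque

class ToyNIC:
    def __init__(self):
        self.rx_queue = deque()
        self.interrupts_enabled = True

    def disable_rx_interrupts(self): self.interrupts_enabled = False
    def enable_rx_interrupts(self):  self.interrupts_enabled = True

BUDGET = 8   # max packets per poll pass, so packet processing cannot starve other work

def rx_interrupt(nic, poll_list):
    """The interrupt handler does almost nothing: it masks further receive
    interrupts and hands the device to the polling loop."""
    nic.disable_rx_interrupts()
    poll_list.append(nic)

def poll(nic):
    """Called from the scheduler like any other task, so packet processing
    competes fairly with application work instead of preempting it forever."""
    for _ in range(BUDGET):
        if not nic.rx_queue:
            nic.enable_rx_interrupts()   # queue drained: return to cheap interrupt-driven mode
            return True                  # done; remove the device from the poll list
        packet = nic.rx_queue.popleft()
        deliver_to_protocol_stack(packet)
    return False                         # budget exhausted: stay in polling mode

def deliver_to_protocol_stack(packet):
    pass  # placeholder for protocol processing
```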

405 citations


Journal ArticleDOI
David Kotz
TL;DR: In this article, the authors proposed a disk-directed I/O technique, which allows the disk servers to determine the flow of data for maximum performance, and demonstrated that this technique provided consistent high performance that was largely independent of data distribution.
Abstract: Many scientific applications that run on today's multiprocessors, such as weather forecasting and seismic analysis, are bottlenecked by their file-I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file system software often fails to provide the available bandwidth to the application. Although libraries and enhanced file system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file server software. We propose a new technique, disk-directed I/O, to allow the disk servers to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible both for simple reads and writes and for an out-of-core application. Indeed, our disk-directed I/O technique provided consistent high performance that was largely independent of data distribution and obtained up to 93% of peak disk bandwidth. It was as much as 18 times faster than either a typical parallel file system or a two-phase-I/O library.
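
In disk-directed I/O, the compute nodes describe an entire collective transfer and each I/O server schedules its own disk accesses, typically in physical-block order. A minimal sketch of that server-side scheduling, with invented helper names and a toy disk, follows.

```python
# Illustrative sketch (hypothetical names, not a real parallel file system API) of
# the disk-directed idea: the server, not the clients, orders the disk accesses.

def disk_directed_read(requested_blocks, owned_blocks, physical_location, read_block, send):
    """Service the part of a collective request that this I/O server owns,
    sweeping the disk in physical order rather than client-arrival order."""
    mine = [b for b in requested_blocks if b in owned_blocks]
    for block in sorted(mine, key=physical_location):
        send(block, read_block(block))   # ship each block straight to the node that needs it

# Toy usage: blocks 0-7 striped over two servers; this server owns the even ones.
disk = {b: f"data{b}" for b in range(8)}
disk_directed_read(
    requested_blocks=range(8),
    owned_blocks={0, 2, 4, 6},
    physical_location=lambda b: b,          # identity layout for the toy disk
    read_block=disk.__getitem__,
    send=lambda b, d: print("block", b, "->", d),
)
```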

344 citations


Journal ArticleDOI
TL;DR: This article identifies the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP and TLP, and shows that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor.
Abstract: To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. We examine two alternative on-chip parallel architectures for the next generation of processors. We compare SMT and small-scale, on-chip multiprocessors in their ability to exploit both ILP and TLP. First, we identify the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP. Then, we show that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor. The use of TLP is especially advantageous when per-thread ILP is limited. The ease of adding additional thread contexts on an SMT (relative to adding additional processors on an MP) allows simultaneous multithreading to expose more parallelism, further increasing functional unit utilization and attaining a 52% average speedup (versus a four-processor, single-chip multiprocessor with comparable execution resources). This study also addresses an often-cited concern regarding the use of thread-level parallelism or multithreading: interference in the memory system and branch prediction hardware. We find that multiple threads cause interthread interference in the caches and place greater demands on the memory system, thus increasing average memory latencies. By exploiting thread-level parallelism, however, SMT hides these additional latencies, so that they have only a small impact on total program performance. We also find that for parallel applications, the additional threads have minimal effects on branch prediction.
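
The argument for SMT is that issue slots form a shared pool rather than being statically partitioned per thread or per core. The toy calculation below, which is not a processor model, shows how a wide dynamically shared core can stay fully issued in a cycle where one thread has high ILP and the others little, while statically partitioned cores of the same total width cannot.

```python
# Toy illustration of dynamically shared issue slots (SMT-style) versus statically
# partitioned slots (MP-style). Slot counts are illustrative only.

def smt_issue(ready_per_thread, slots=8):
    """Fill the issue slots each cycle from whichever threads have ready instructions."""
    issued = 0
    for ready in ready_per_thread:
        take = min(ready, slots - issued)
        issued += take
        if issued == slots:
            break
    return issued

def partitioned_issue(ready_per_thread, slots_per_core=2):
    """Each thread is pinned to its own narrow core and cannot borrow idle slots."""
    return sum(min(ready, slots_per_core) for ready in ready_per_thread)

ready = [6, 1, 0, 1]             # one high-ILP thread, little ILP elsewhere this cycle
print(smt_issue(ready))          # 8 -> the wide shared core is fully used
print(partitioned_issue(ready))  # 4 -> three cores sit mostly idle while one saturates
```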

251 citations


Journal ArticleDOI
TL;DR: It is formally shown that lock-free shared objects often incur less overhead than object implementations based on wait-free algorithms or lock-based schemes.
Abstract: This article considers the use of lock-free shared objects within hard real-time systems. As the name suggests, lock-free shared objects are distinguished by the fact that they are accessed without locking. As such, they do not give rise to priority inversions, a key advantage over conventional, lock-based object-sharing approaches. Despite this advantage, it is not immediately apparent that lock-free shared objects can be employed if tasks must adhere to strict timing constraints. In particular, lock-free object implementations permit concurrent operations to interfere with each other, and repeated interferences can cause a given operation to take an arbitrarily long time to complete. The main contribution of this article is to show that such interferences can be bounded by judicious scheduling. This work pertains to periodic, hard real-time tasks that share lock-free objects on a uniprocessor. In the first part of the article, scheduling conditions are derived for such tasks, for both static and dynamic priority schemes. Based on these conditions, it is formally shown that lock-free shared objects often incur less overhead than object implementations based on wait-free algorithms or lock-based schemes. In the last part of the article, this conclusion is validated experimentally through work involving a real-time desktop videoconferencing system.
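
The lock-free pattern being analyzed is the optimistic retry loop built on an atomic compare-and-swap: no lock is ever held, so a preempted task never blocks others, and on a uniprocessor the number of failed retries an operation can suffer is tied to how often it is preempted, which is what the scheduling analysis bounds. The sketch below shows the pattern; the compare-and-swap is emulated with a Python lock purely so the example runs, standing in for the hardware primitive.

```python
# Sketch of a lock-free retry loop. The "atomic word" is emulated; real lock-free
# objects use the CPU's compare-and-swap or load-linked/store-conditional.

import threading

class EmulatedCAS:
    """Stand-in for an atomic word; the internal lock only emulates atomicity."""
    def __init__(self, value):
        self._value, self._guard = value, threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(counter):
    """No lock is held across the update, so a preempted task cannot block others
    (no priority inversion). A failed attempt simply retries; on a uniprocessor
    retries only occur when the task is preempted mid-operation, which keeps the
    worst-case cost analyzable for hard real-time scheduling."""
    while True:
        old = counter.load()
        if counter.compare_and_swap(old, old + 1):
            return old + 1

shared = EmulatedCAS(0)
print(lock_free_increment(shared))   # 1
```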

150 citations


Journal ArticleDOI
TL;DR: It is found that while it is possible to avoid pathological performance problems using previously proposed kernel mechanisms, a modest additional widening of the kernel/user interface can make scheduler-conscious synchronization algorithms significantly simpler and faster, with performance on dedicated machines comparable to that of scheduler-oblivious algorithms.
Abstract: Efficient synchronization is important for achieving good performance in parallel programs, especially on large-scale multiprocessors. Most synchronization algorithms have been designed to run on a dedicated machine, with one application process per processor, and can suffer serious performance degradation in the presence of multiprogramming. Problems arise when running processes block or, worse, busy-wait for action on the part of a process that the scheduler has chosen not to run. We show that these problems are particularly severe for scalable synchronization algorithms based on distributed data structures. We then describe and evaluate a set of algorithms that perform well in the presence of multiprogramming while maintaining good performance on dedicated machines. We consider both large and small machines, with a particular focus on scalability, and examine mutual-exclusion locks, reader-writer locks, and barriers. Our algorithms vary in the degree of support required from the kernel scheduler. We find that while it is possible to avoid pathological performance problems using previously proposed kernel mechanisms, a modest additional widening of the kernel/user interface can make scheduler-conscious synchronization algorithms significantly simpler and faster, with performance on dedicated machines comparable to that of scheduler-oblivious algorithms.
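
The "modest widening of the kernel/user interface" amounts to letting user-level synchronization code ask whether another process is currently descheduled, so waiters can yield instead of spinning on a preempted lock holder. The sketch below is a hypothetical illustration of that idea only: kernel_says_preempted() stands in for such an interface, and the details differ from the algorithms in the article.

```python
# Hypothetical sketch of a scheduler-conscious spin-then-yield lock. The kernel
# query and the lock structure are invented for illustration.

import threading, time

def kernel_says_preempted(thread_id, running_threads):
    """Pretend kernel export: is this thread currently descheduled?"""
    return thread_id not in running_threads

class SchedulerConsciousLock:
    def __init__(self):
        self._flag = threading.Lock()   # emulates an atomic test-and-set word
        self.holder = None

    def acquire(self, my_id, running_threads):
        while not self._flag.acquire(blocking=False):
            if kernel_says_preempted(self.holder, running_threads):
                time.sleep(0)           # yield instead of spinning on a preempted holder
        self.holder = my_id

    def release(self):
        self.holder = None
        self._flag.release()
```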

99 citations


Journal ArticleDOI
TL;DR: This work has implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor, and demonstrates that the flexibility of HFS comes with little processing or I/O overhead.
Abstract: The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. As an extreme example, HFS allows a file's structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application's access pattern. In contrast, most parallel file systems support a single file structure and a small set of policies. We have implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns, HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.
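
The "simple building blocks composed in potentially complex ways" can be pictured as small per-file objects, each contributing one structural property or policy, stacked differently for each application. The composition sketch below uses invented class names to illustrate that principle; it is not HFS's actual object library.

```python
# Illustrative composition sketch (invented building blocks, not HFS's own).

class Extent:
    """Lowest-level block of storage on one disk."""
    def __init__(self, disk): self.disk = disk
    def write(self, offset, data): print(f"disk {self.disk}: write {len(data)} bytes at {offset}")

class Striped:
    """Distribute writes round-robin across sub-objects for bandwidth."""
    def __init__(self, parts, stripe=4096): self.parts, self.stripe = parts, stripe
    def write(self, offset, data):
        self.parts[(offset // self.stripe) % len(self.parts)].write(offset, data)

class Replicated:
    """Send every write to all replicas for availability."""
    def __init__(self, replicas): self.replicas = replicas
    def write(self, offset, data):
        for r in self.replicas: r.write(offset, data)

# One application might build a plain striped file, another a striped-and-replicated one.
plain  = Striped([Extent(0), Extent(1)])
sturdy = Replicated([Striped([Extent(0), Extent(1)]), Striped([Extent(2), Extent(3)])])
sturdy.write(8192, b"x" * 100)
```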

48 citations


Journal ArticleDOI
Birgit Pfitzmann, Michael Waidner
TL;DR: This article shows how untraceable electronic cash can be made loss tolerant, i.e., how the monetary value of the lost data can be recovered and how security against fraud and preservation of privacy are ensured.
Abstract: Untraceable electronic cash means prepaid digital payment systems, usually with offline payments, that protect user privacy. Such systems have recently been given considerable attention by both theory and development projects. However, in most current schemes, loss of a user device containing electronic cash implies a loss of money, just as with real cash. In comparison with credit schemes, this is considered a serious shortcoming. This article shows how untraceable electronic cash can be made loss tolerant, i.e., how the monetary value of the lost data can be recovered. Security against fraud and preservation of privacy are ensured; strong loss tolerance means that not even denial of recovery is possible. In particular, systems based on electronic coins are treated. We present general design principles and options and their instantiation in one concrete payment system. The measures are practical.

30 citations


Journal ArticleDOI
TL;DR: This article presents an I/O architecture that addresses these problems and supports high-speed network I/O on distributed-memory systems, describes its implementation, presents performance results, and uses application examples to validate the main features of the I/O architecture.
Abstract: Distributed-memory systems have traditionally had great difficulty performing network I/O at rates proportional to their computational power. The problem is that the network interface has to support network I/O for a supercomputer, using computational and memory bandwidth resources similar to those of a workstation. As a result, the network interface becomes a bottleneck. In this article we present an I/O architecture that addresses these problems and supports high-speed network I/O on distributed-memory systems. The key to good performance is to partition the work appropriately between the system and the network interface. Some communication tasks are performed on the distributed-memory parallel system, since it is more powerful and less likely to become a bottleneck than the network interface. Tasks that do not parallelize well are performed on the network interface, and hardware support is provided for the most time-critical operations. This architecture has been implemented for the iWarp distributed-memory system and has been used by a number of applications. We describe this implementation, present performance results, and use application examples to validate the main features of the I/O architecture.

Journal ArticleDOI
TL;DR: This work describes three approaches to the problem of what fraction of disks and other shared devices must be reserved to play an audio/video document without dropouts: one is fast but imprecise, one is precise but slow, and one is both precise and fast.
Abstract: What fraction of disks and other shared devices must be reserved to play an audio/video document without dropouts? In general, this question cannot be answered precisely. For documents with complex and irregular structure, such as those arising in audio/video editing, it is difficult even to give a good estimate. We describe three approaches to this problem. The first, based on long-term average properties of segments, is fast but imprecise: it underreserves in some cases and overreserves in others. The second approach models individual disk and network operations. It is precise but slow. The third approach, a hybrid, is both precise and fast.
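
The first approach can be illustrated with simple arithmetic: reserve the segment's long-term average data rate as a fraction of the device's sustained bandwidth. The numbers and names in the sketch below are illustrative only; as the abstract notes, this estimate can be off in either direction for bursty or irregular documents.

```python
# Sketch of the fast-but-imprecise approach: reserve by long-term average rate.

def average_rate_reservation(segment_bytes, duration_s, disk_bandwidth_Bps):
    """Fraction of one disk to reserve if the document consumed data perfectly smoothly."""
    return (segment_bytes / duration_s) / disk_bandwidth_Bps

# A 10-minute segment of 450 MB against a disk that sustains 5 MB/s:
frac = average_rate_reservation(450e6, 600, 5e6)
print(f"reserve {frac:.0%} of the disk")   # 15% -- but bursty or irregular documents
                                           # may need more (underreservation) or less
```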

Journal ArticleDOI
TL;DR: It is proved that the algorithms developed here are optimally adaptive in the sense that any further flexibility in communication will result in deadlock, and that the Turn Model is actually a member of this new class of algorithms that does not perform as well as other algorithms within the new class.
Abstract: In circuit-switched routing, the path between a source and its destination is established by incrementally reserving all required links before the data transmission can begin. If the routing algorithm is not carefully designed, deadlocks can occur in reserving these links. Deadlock-free algorithms based on dimension-ordered routing, such as the E-cube, exist. However, E-cube does not provide any flexibility in choosing a path from a source to its destination and can thus result in long latencies under heavy or uneven traffic. Adaptive, minimum-distance routing algorithms, such as the Turn Model and the UP Preference algorithms, have previously been reported. In this article, we present a new class of adaptive, provably deadlock-free, minimum-distance routing algorithms. We prove that the algorithms developed here are optimally adaptive in the sense that any further flexibility in communication will result in deadlock. We show that the Turn Model is actually a member of our new class of algorithms that does not perform as well as other algorithms within the new class. It creates artificial hotspots in routing the traffic and allows fewer total paths. We present an analytical comparison of the flexibility and balance in routing provided by various algorithms and a comparison based on uniform and nonuniform traffic simulations. The Extended UP Preference algorithm developed in this article is shown to have improved performance with respect to existing algorithms. The methodology and the algorithms developed here can be used to develop routing for other schemes such as wormhole routing, and for other recursively defined networks such as k-ary n-cubes.
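
The inflexibility of E-cube that motivates this work is easy to see on a hypercube: the next hop is always the lowest-order dimension in which the current node and destination differ, so every source-destination pair has exactly one path. The sketch below contrasts that with the full set of minimal next hops an adaptive algorithm may choose among; the deadlock-avoiding restrictions that the article derives are omitted.

```python
# Sketch contrasting deterministic E-cube routing on a hypercube with the set of
# minimal next hops available to an adaptive router (restrictions omitted).

def ecube_next_hop(current, dest):
    """E-cube: always correct the lowest-order differing dimension, so there is
    exactly one path and no flexibility under contention."""
    diff = current ^ dest
    if diff == 0:
        return None
    lowest = diff & -diff            # isolate the lowest set bit
    return current ^ lowest

def minimal_candidates(current, dest):
    """All neighbors one hop closer to the destination; an adaptive router picks
    among these, pruned by its deadlock-avoidance rules."""
    diff = current ^ dest
    return [current ^ (1 << d) for d in range(diff.bit_length()) if diff >> d & 1]

print(ecube_next_hop(0b000, 0b101))        # 1 -- the only allowed hop
print(minimal_candidates(0b000, 0b101))    # [1, 4] -- two profitable hops to choose from
```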