Author

Kevin B. Theobald

Bio: Kevin B. Theobald is an academic researcher from the University of Delaware. The author has contributed to research on topics including multithreading and execution models. The author has an h-index of 14 and has co-authored 33 publications receiving 1,421 citations. Previous affiliations of Kevin B. Theobald include Intel and McGill University.

Papers
Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper proposes Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both policies require only 2 bits per cache block and integrate easily into existing LRU approximations found in modern processors.
Abstract: Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server, and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10%, respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9%, respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
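To make the 2-bit mechanism concrete, the sketch below models a single cache set under SRRIP's hit-priority behavior as the abstract describes it: blocks are inserted with a "long" re-reference prediction (RRPV 2), promoted to "near-immediate" (RRPV 0) on a hit, and the victim is any block whose RRPV has aged to "distant" (RRPV 3). The class and method names (SrripSet, access, findVictim) and the 4-way example are illustrative choices, not the paper's code.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

class SrripSet {
    static constexpr uint64_t kInvalid = ~0ULL;
    static constexpr uint8_t kDistant = 3;   // 2^2 - 1 with 2-bit RRPVs
    static constexpr uint8_t kLong = 2;      // 2^2 - 2

public:
    explicit SrripSet(int ways) : tags_(ways, kInvalid), rrpv_(ways, kDistant) {}

    // Returns true on a hit, false on a miss (after filling the block).
    bool access(uint64_t tag) {
        for (size_t w = 0; w < tags_.size(); ++w) {
            if (tags_[w] == tag) {
                rrpv_[w] = 0;                // hit: predict near-immediate re-reference
                return true;
            }
        }
        size_t victim = findVictim();        // miss: evict a block predicted "distant"
        tags_[victim] = tag;
        rrpv_[victim] = kLong;               // insert with a long (not distant) prediction
        return false;
    }

private:
    size_t findVictim() {
        for (;;) {
            for (size_t w = 0; w < rrpv_.size(); ++w)
                if (rrpv_[w] == kDistant) return w;
            for (uint8_t& r : rrpv_) ++r;    // age every block until one is distant
        }
    }

    std::vector<uint64_t> tags_;
    std::vector<uint8_t> rrpv_;
};

int main() {
    SrripSet set(4);
    // Lines 0 and 1 are reused; 100-103 are a one-shot scan. The scan lines
    // enter at RRPV 2 and age to 3 before the reused lines do, so 0 and 1
    // survive the burst (under LRU this burst would have evicted them).
    for (uint64_t t : {0, 1, 0, 1, 100, 101, 102, 103, 0, 1})
        std::cout << "line " << t << (set.access(t) ? ": hit\n" : ": miss\n");
}
```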

715 citations

Proceedings ArticleDOI
27 Jun 1995
TL;DR: The design of EARTH (Efficient Architecture for Running THreads), which attempts to address the above issues, is described, and it is demonstrated that multithreading support can be efficiently implemented (with little emulation overhead) in a multiprocessor without a major impact on uniprocessor performance.
Abstract: Multithreaded node architectures have been proposed for future multiprocessor systems. However, some open issues remain: can efficient multithreading support be provided in a multiprocessor machine such that it is capable of tolerating synchronization and communication latencies, with little intrusion on the performance of sequentially-executed code? Also, how much (quantitatively) does such non-intrusive multithreading support contribute to the scalable parallel performance in the presence of increasing interprocessor communication and synchronization demands? In this paper, we describe the design of EARTH (Efficient Architecture for Running THreads), which attempts to address the above issues. Each processor in EARTH has an off-the-shelf RISC processor for executing threads, and an ASIC Synchronization Unit (SU) supporting dataflow-like thread synchronization, scheduling, and remote memory requests. In preparation for an implementation of the SU, we have emulated a basic EARTH model on MANNA 2.0, an existing multiprocessor whose hardware configuration closely matches EARTH. This EARTH-MANNA emulation testbed has been fully functional, enabling us to experiment with large-scale benchmarks with impressive speed. With this platform, we demonstrate that multithreading support can be efficiently implemented (with little emulation overhead) in a multiprocessor without a major impact on uniprocessor performance. Also, we give our first quantitative indications of how much the basic multithreading support can help in tolerating increasing communication/synchronization demands.
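Roughly, the execution model pairs ordinary sequential threads with fibers that become runnable only after all of their inputs have arrived, so remote loads can be issued split-phase instead of stalling the processor. The toy scheduler below sketches that synchronization-slot idea in plain C++; the SyncUnit, makeFiber, and dataSync names are illustrative stand-ins, not the EARTH API.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Fiber {
    int sync_count;                 // arrivals still needed before the fiber may run
    std::function<void()> body;     // work to run once every input has arrived
};

// Stand-in for the role EARTH's Synchronization Unit (SU) plays: it counts
// arriving values and schedules a fiber the moment its inputs are complete.
class SyncUnit {
public:
    int makeFiber(int arrivals, std::function<void()> body) {
        fibers_.push_back({arrivals, std::move(body)});
        return static_cast<int>(fibers_.size()) - 1;
    }
    // A "data sync": an arriving value (for example, a remote-load reply)
    // decrements the consumer's slot; at zero, the fiber becomes runnable.
    void dataSync(int fiber_id) {
        if (--fibers_[fiber_id].sync_count == 0) ready_.push(fiber_id);
    }
    void run() {
        while (!ready_.empty()) {
            int id = ready_.front();
            ready_.pop();
            fibers_[id].body();
        }
    }
private:
    std::vector<Fiber> fibers_;
    std::queue<int> ready_;
};

int main() {
    SyncUnit su;
    int a = 0, b = 0;
    // The consumer fiber waits for two values before it executes.
    int consumer = su.makeFiber(2, [&] { std::printf("a + b = %d\n", a + b); });
    // Split-phase "remote loads": issue the requests, keep computing, and let
    // each reply signal the consumer instead of stalling the issuing thread.
    a = 40; su.dataSync(consumer);
    b = 2;  su.dataSync(consumer);
    su.run();                       // prints: a + b = 42
}
```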

89 citations

Proceedings ArticleDOI
01 May 1996
TL;DR: Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone, and allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
Abstract: Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach, the Polling Watchdog, where both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program. We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
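The mechanism is easy to picture as a countdown attached to the oldest unhandled message: explicit polls normally drain the network queue, and only if a message sits unhandled past a threshold does the watchdog raise an interrupt. The following toy model sketches that behavior in software; the threshold value, the Watchdog struct, and its methods are illustrative assumptions, not the MANNA hardware interface.

```cpp
#include <cstdio>
#include <queue>

struct Watchdog {
    int threshold;                 // cycles a message may wait before an interrupt fires
    int waiting = 0;               // how long the oldest unhandled message has waited
    std::queue<int> network;       // pending incoming messages

    void deliver(int msg) { network.push(msg); }

    void handle(int msg) { std::printf("handled message %d\n", msg); }

    // Called from the application's explicit poll points.
    void poll() {
        while (!network.empty()) { handle(network.front()); network.pop(); }
        waiting = 0;
    }

    // Advance one cycle; returns true if the watchdog had to raise an interrupt.
    bool tick() {
        if (network.empty()) return false;
        if (++waiting < threshold) return false;
        std::printf("watchdog interrupt after %d cycles\n", waiting);
        poll();                    // the interrupt handler drains the queue
        return true;
    }
};

int main() {
    Watchdog wd{3};                // interrupt only after 3 cycles without a poll
    wd.deliver(1);
    wd.tick();
    wd.poll();                     // the program polls soon enough: no interrupt
    wd.deliver(2);
    for (int i = 0; i < 5; ++i)    // a long compute phase with no explicit polls:
        wd.tick();                 // the watchdog steps in exactly once
}
```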

82 citations

Journal ArticleDOI
10 Dec 1992
TL;DR: A new study of instruction-level parallelism is reported, which examines aspects not covered in previous studies, including the effects of various memory reuse policies and long-latency operations, and the results achieved when large benchmarks are allowed to run to completion.
Abstract: There have been many recent studies of the “limits on instruction parallelism” in application programs. This paper reports a new study of instruction-level parallelism which examines aspects not covered in previous studies, including the effects of various memory reuse policies and long-latency operations, and the results achieved when large benchmarks are allowed to run to completion. We also define and study program smoothability, which quantifies the extent to which deferring program operations from periods of peak parallelism increases execution time. The results show a high degree of smoothability, suggesting that processor utilization can be quite high when the number of pro…
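As a rough illustration of what smoothability measures, the sketch below schedules a small dependence graph twice, once with unbounded issue width and once with the width capped so operations are deferred out of the parallelism peak, and reports the slowdown. The example DAG, the greedy list scheduler, and the ratio printed at the end are simplified stand-ins for the paper's methodology, not its actual metric.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy list scheduler: deps[i] lists the operations op i depends on; every
// operation takes one cycle, and at most `width` operations issue per cycle.
int makespan(const std::vector<std::vector<int>>& deps, int width) {
    const int n = static_cast<int>(deps.size());
    std::vector<int> remaining(n);              // unfinished predecessors per op
    std::vector<std::vector<int>> succs(n);
    for (int i = 0; i < n; ++i) {
        remaining[i] = static_cast<int>(deps[i].size());
        for (int d : deps[i]) succs[d].push_back(i);
    }
    std::vector<int> ready;
    for (int i = 0; i < n; ++i)
        if (remaining[i] == 0) ready.push_back(i);

    int cycle = 0, done = 0;
    while (done < n) {
        ++cycle;
        int issue = std::min<int>(width, static_cast<int>(ready.size()));
        std::vector<int> issued(ready.begin(), ready.begin() + issue);
        ready.erase(ready.begin(), ready.begin() + issue);
        for (int op : issued) {
            ++done;
            for (int s : succs[op])             // wake ops whose inputs are now ready
                if (--remaining[s] == 0) ready.push_back(s);
        }
    }
    return cycle;
}

int main() {
    // Op 0 fans out to six independent ops, which all feed op 7.
    std::vector<std::vector<int>> deps = {
        {}, {0}, {0}, {0}, {0}, {0}, {0}, {1, 2, 3, 4, 5, 6}};
    int unbounded = makespan(deps, 8);          // peak parallelism fully exploited
    int capped = makespan(deps, 2);             // peak deferred across more cycles
    std::printf("unbounded width: %d cycles, width 2: %d cycles, slowdown %.2fx\n",
                unbounded, capped, static_cast<double>(capped) / unbounded);
}
```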

71 citations


Cited by
Journal ArticleDOI
Alan R. Jones

1,349 citations

Book
01 Jan 2006
TL;DR: The author discusses the history and present situation of operating systems, as well as some of the techniques used to design and implement these systems.
Abstract: Table of Contents
Chapter 1 Introduction: 1.1 What Is an Operating System?; 1.2 History of Operating Systems; 1.3 Operating System Concepts; 1.4 System Calls; 1.5 Operating System Structure; 1.6 Outline of the Rest of This Book; 1.7 Summary.
Chapter 2 Processes: 2.1 Introduction to Processes; 2.2 Interprocess Communication; 2.3 Classical IPC Problems; 2.4 Scheduling; 2.5 Overview of Processes in MINIX 3; 2.6 Implementation of Processes in MINIX 3; 2.7 The System Task in MINIX 3; 2.8 The Clock Task in MINIX 3; 2.9 Summary.
Chapter 3 Input/Output: 3.1 Principles of I/O Hardware; 3.2 Principles of I/O Software; 3.3 Deadlocks; 3.4 Overview of I/O in MINIX 3; 3.5 Block Devices in MINIX 3; 3.6 RAM Disks; 3.7 Disks; 3.8 Terminals; 3.9 Summary.
Chapter 4 Memory Management: 4.1 Basic Memory Management; 4.2 Swapping; 4.3 Virtual Memory; 4.4 Page Replacement Algorithms; 4.5 Design Issues for Paging Systems; 4.6 Segmentation; 4.7 Overview of the MINIX 3 Process Manager; 4.8 Implementation of the MINIX 3 Process Manager; 4.9 Summary.
Chapter 5 File Systems: 5.1 Files; 5.2 Directories; 5.3 File System Implementation; 5.4 Security; 5.5 Protection Mechanisms; 5.6 Overview of the MINIX 3 File System; 5.7 Implementation of the MINIX 3 File System; 5.8 Summary.
Chapter 6 Reading List and Bibliography: 6.1 Suggestions for Further Reading; 6.2 Alphabetical Bibliography.
Appendix A: Installing MINIX 3. Appendix B: MINIX 3 Source Code Listing. Appendix C: Index to Files. Index.

572 citations

Proceedings ArticleDOI
02 Dec 1996
TL;DR: It is shown that simple microarchitectural enhancements enabling value prediction in a modern microprocessor implementation based on the PowerPC 620 can effectively exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.5%-23% by exceeding the dataflow limit.
Abstract: For decades, the serialization constraints induced by true data dependences have been regarded as an absolute limit, the dataflow limit, on the parallel execution of serial programs. This paper proposes a new technique, value prediction, for exceeding that limit that allows data dependent instructions to issue and execute in parallel without violating program semantics. This technique is built on the concept of value locality, which describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Value prediction consists of predicting entire 32- and 64-bit register values based on previously-seen values. We find that such register values being written by machine instructions are frequently predictable. Furthermore, we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 that enable value prediction can effectively exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.5%-23% (depending on machine model) by exceeding the dataflow limit.
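A minimal way to see the idea is a last-value predictor: a table indexed by instruction address that guesses an instruction will produce the same value it produced last time, gated by a small confidence counter. The sketch below is an illustrative model in that spirit; the table size, the confidence scheme, and the class names are assumptions, not the configuration evaluated in the paper.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>

class LastValuePredictor {
public:
    // Return a predicted result for the instruction at `pc`, if confident.
    std::optional<uint64_t> predict(uint64_t pc) const {
        const Entry& e = table_[index(pc)];
        if (e.valid && e.confidence >= 2) return e.value;
        return std::nullopt;
    }
    // Train the table with the actual result once the instruction executes.
    void update(uint64_t pc, uint64_t actual) {
        Entry& e = table_[index(pc)];
        if (e.valid && e.value == actual) {
            if (e.confidence < 3) ++e.confidence;   // saturating confidence counter
        } else {
            e = Entry{true, actual, 0};             // new value: start over, unconfident
        }
    }
private:
    struct Entry {
        bool valid = false;
        uint64_t value = 0;
        uint8_t confidence = 0;
    };
    static constexpr size_t kEntries = 1024;
    static size_t index(uint64_t pc) { return (pc >> 2) % kEntries; }
    Entry table_[kEntries];
};

int main() {
    LastValuePredictor vp;
    const uint64_t pc = 0x400100;
    // A load that keeps returning the same value becomes predictable, letting
    // dependent instructions issue speculatively before the load completes;
    // the final access shows a misprediction, which real hardware must squash.
    for (uint64_t value : {7ULL, 7ULL, 7ULL, 7ULL, 9ULL}) {
        if (auto guess = vp.predict(pc))
            std::printf("predicted %llu, actual %llu\n",
                        static_cast<unsigned long long>(*guess),
                        static_cast<unsigned long long>(value));
        else
            std::printf("no prediction, actual %llu\n",
                        static_cast<unsigned long long>(value));
        vp.update(pc, value);
    }
}
```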

526 citations

Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Abstract: This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average 25% fewer L1 data cache misses which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
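The core scheduling decision can be sketched as a scoring problem: each wavefront carries a lost-locality score fed by the detector, and the scheduler issues only from the highest-scoring wavefronts until a capacity-derived budget is exhausted, throttling the rest. The model below is greatly simplified; the scoring values, the budget, and the function names are illustrative assumptions rather than the paper's exact cutoff logic.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Wavefront {
    int id;
    int lost_locality_score;       // bumped whenever the locality detector sees the
                                   // wavefront miss on data it recently held in L1
};

// Decide which wavefronts may issue this cycle: favor those with the most
// intra-wavefront locality at stake, and stop adding wavefronts once the
// cumulative score exceeds a capacity-derived budget.
std::vector<int> schedulable(std::vector<Wavefront> wfs, int budget) {
    std::sort(wfs.begin(), wfs.end(), [](const Wavefront& a, const Wavefront& b) {
        return a.lost_locality_score > b.lost_locality_score;
    });
    std::vector<int> allowed;
    int used = 0;
    for (const Wavefront& w : wfs) {
        allowed.push_back(w.id);               // always allow at least one wavefront
        used += w.lost_locality_score;
        if (used > budget) break;              // the rest are throttled this cycle
    }
    return allowed;
}

int main() {
    // Wavefront 2 keeps losing intra-wavefront locality, so the scheduler shrinks
    // the actively issuing set instead of letting all four thrash the shared L1.
    std::vector<Wavefront> wfs = {{0, 1}, {1, 2}, {2, 9}, {3, 1}};
    for (int id : schedulable(wfs, 10))
        std::printf("wavefront %d may issue this cycle\n", id);
}
```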

408 citations

Proceedings ArticleDOI
19 Sep 2012
TL;DR: There is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.
Abstract: Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

348 citations