Author

Kevin B. Theobald

Bio: Kevin B. Theobald is an academic researcher from the University of Delaware. The author has contributed to research on topics including multithreading and execution models. The author has an h-index of 14 and has co-authored 33 publications receiving 1,421 citations. Previous affiliations of Kevin B. Theobald include Intel and McGill University.

Papers
Proceedings ArticleDOI
19 Jun 2010
TL;DR: This paper proposes Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both policies require only 2 bits per cache block and integrate easily into existing LRU approximations found in modern processors.
Abstract: Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server, and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10%, respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9%, respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
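To make the 2-bit mechanism concrete, the sketch below models a single cache set under SRRIP's hit-priority behavior as the abstract describes it: blocks are inserted with a "long" re-reference prediction (RRPV 2), promoted to "near-immediate" (RRPV 0) on a hit, and the victim is any block whose RRPV has aged to "distant" (RRPV 3). The class and method names (SrripSet, access, findVictim) and the 4-way example are illustrative choices, not the paper's code.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

class SrripSet {
    static constexpr uint64_t kInvalid = ~0ULL;
    static constexpr uint8_t kDistant = 3;   // 2^2 - 1 with 2-bit RRPVs
    static constexpr uint8_t kLong = 2;      // 2^2 - 2

public:
    explicit SrripSet(int ways) : tags_(ways, kInvalid), rrpv_(ways, kDistant) {}

    // Returns true on a hit, false on a miss (after filling the block).
    bool access(uint64_t tag) {
        for (size_t w = 0; w < tags_.size(); ++w) {
            if (tags_[w] == tag) {
                rrpv_[w] = 0;                // hit: predict near-immediate re-reference
                return true;
            }
        }
        size_t victim = findVictim();        // miss: evict a block predicted "distant"
        tags_[victim] = tag;
        rrpv_[victim] = kLong;               // insert with a long (not distant) prediction
        return false;
    }

private:
    size_t findVictim() {
        for (;;) {
            for (size_t w = 0; w < rrpv_.size(); ++w)
                if (rrpv_[w] == kDistant) return w;
            for (uint8_t& r : rrpv_) ++r;    // age every block until one is distant
        }
    }

    std::vector<uint64_t> tags_;
    std::vector<uint8_t> rrpv_;
};

int main() {
    SrripSet set(4);
    // Lines 0 and 1 are reused; 100-103 are a one-shot scan. The scan lines
    // enter at RRPV 2 and age to 3 before the reused lines do, so 0 and 1
    // survive the burst (under LRU this burst would have evicted them).
    for (uint64_t t : {0, 1, 0, 1, 100, 101, 102, 103, 0, 1})
        std::cout << "line " << t << (set.access(t) ? ": hit\n" : ": miss\n");
}
```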

715 citations

Proceedings ArticleDOI
27 Jun 1995
TL;DR: The design of EARTH (Efficient Architecture for Running THreads), which attempts to address the above issues, is described, and it is demonstrated that multithreading support can be efficiently implemented (with little emulation overhead) in a multiprocessor without a major impact on uniprocessor performance.
Abstract: Multithreaded node architectures have been proposed for future multiprocessor systems. However, some open issues remain: can efficient multithreading support be provided in a multiprocessor machine such that it is capable of tolerating synchronization and communication latencies, with little intrusion on the performance of sequentially-executed code? Also, how much (quantitatively) does such non-intrusive multithreading support contribute to the scalable parallel performance in the presence of increasing interprocessor communication and synchronization demands? In this paper, we describe the design of EARTH (Efficient Architecture for Running THreads), which attempts to address the above issues. Each processor in EARTH has an off-the-shelf RISC processor for executing threads, and an ASIC Synchronization Unit (SU) supporting dataflow-like thread synchronization, scheduling, and remote memory requests. In preparation for an implementation of the SU, we have emulated a basic EARTH model on MANNA 2.0, an existing multiprocessor whose hardware configuration closely matches EARTH. This EARTH-MANNA emulation testbed has been fully functional, enabling us to experiment with large-scale benchmarks with impressive speed. With this platform, we demonstrate that multithreading support can be efficiently implemented (with little emulation overhead) in a multiprocessor without a major impact on uniprocessor performance. Also, we give our first quantitative indications of how much the basic multithreading support can help in tolerating increasing communication/synchronization demands.
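Roughly, the execution model pairs ordinary sequential threads with fibers that become runnable only after all of their inputs have arrived, so remote loads can be issued split-phase instead of stalling the processor. The toy scheduler below sketches that synchronization-slot idea in plain C++; the SyncUnit, makeFiber, and dataSync names are illustrative stand-ins, not the EARTH API.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Fiber {
    int sync_count;                 // arrivals still needed before the fiber may run
    std::function<void()> body;     // work to run once every input has arrived
};

// Stand-in for the role EARTH's Synchronization Unit (SU) plays: it counts
// arriving values and schedules a fiber the moment its inputs are complete.
class SyncUnit {
public:
    int makeFiber(int arrivals, std::function<void()> body) {
        fibers_.push_back({arrivals, std::move(body)});
        return static_cast<int>(fibers_.size()) - 1;
    }
    // A "data sync": an arriving value (for example, a remote-load reply)
    // decrements the consumer's slot; at zero, the fiber becomes runnable.
    void dataSync(int fiber_id) {
        if (--fibers_[fiber_id].sync_count == 0) ready_.push(fiber_id);
    }
    void run() {
        while (!ready_.empty()) {
            int id = ready_.front();
            ready_.pop();
            fibers_[id].body();
        }
    }
private:
    std::vector<Fiber> fibers_;
    std::queue<int> ready_;
};

int main() {
    SyncUnit su;
    int a = 0, b = 0;
    // The consumer fiber waits for two values before it executes.
    int consumer = su.makeFiber(2, [&] { std::printf("a + b = %d\n", a + b); });
    // Split-phase "remote loads": issue the requests, keep computing, and let
    // each reply signal the consumer instead of stalling the issuing thread.
    a = 40; su.dataSync(consumer);
    b = 2;  su.dataSync(consumer);
    su.run();                       // prints: a + b = 42
}
```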

89 citations

Proceedings ArticleDOI
01 May 1996
TL;DR: Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone, and allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
Abstract: Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach, the Polling Watchdog, where both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program. We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
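The mechanism is easy to picture as a countdown attached to the oldest unhandled message: explicit polls normally drain the network queue, and only if a message sits unhandled past a threshold does the watchdog raise an interrupt. The following toy model sketches that behavior in software; the threshold value, the Watchdog struct, and its methods are illustrative assumptions, not the MANNA hardware interface.

```cpp
#include <cstdio>
#include <queue>

struct Watchdog {
    int threshold;                 // cycles a message may wait before an interrupt fires
    int waiting = 0;               // how long the oldest unhandled message has waited
    std::queue<int> network;       // pending incoming messages

    void deliver(int msg) { network.push(msg); }

    void handle(int msg) { std::printf("handled message %d\n", msg); }

    // Called from the application's explicit poll points.
    void poll() {
        while (!network.empty()) { handle(network.front()); network.pop(); }
        waiting = 0;
    }

    // Advance one cycle; returns true if the watchdog had to raise an interrupt.
    bool tick() {
        if (network.empty()) return false;
        if (++waiting < threshold) return false;
        std::printf("watchdog interrupt after %d cycles\n", waiting);
        poll();                    // the interrupt handler drains the queue
        return true;
    }
};

int main() {
    Watchdog wd{3};                // interrupt only after 3 cycles without a poll
    wd.deliver(1);
    wd.tick();
    wd.poll();                     // the program polls soon enough: no interrupt
    wd.deliver(2);
    for (int i = 0; i < 5; ++i)    // a long compute phase with no explicit polls:
        wd.tick();                 // the watchdog steps in exactly once
}
```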

82 citations

Journal ArticleDOI
10 Dec 1992
TL;DR: A new study of instruction-level parallelism is reported, which examines aspects not covered in previous studies, including the effects of various memory reuse policies and long-latency operations, and the results achieved when large benchmarks are allowed to run to completion.
Abstract: There have been many recent studies of the “limits on instruction parallelism” in application programs. This paper reports a new study of instruction-level parallelism which examines aspects not covered in previous studies, including the effects of various memory reuse policies and long-latency operations, and the results achieved when large benchmarks are allowed to run to completion. We also define and study program smoothability, which quantifies the extent to which deferring program operations from periods of peak parallelism increases execution time. The results show a high degree of smoothability, suggesting that processor utilization can be quite high when the number of pro…
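As a rough illustration of what smoothability measures, the sketch below schedules a small dependence graph twice, once with unbounded issue width and once with the width capped so operations are deferred out of the parallelism peak, and reports the slowdown. The example DAG, the greedy list scheduler, and the ratio printed at the end are simplified stand-ins for the paper's methodology, not its actual metric.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy list scheduler: deps[i] lists the operations op i depends on; every
// operation takes one cycle, and at most `width` operations issue per cycle.
int makespan(const std::vector<std::vector<int>>& deps, int width) {
    const int n = static_cast<int>(deps.size());
    std::vector<int> remaining(n);              // unfinished predecessors per op
    std::vector<std::vector<int>> succs(n);
    for (int i = 0; i < n; ++i) {
        remaining[i] = static_cast<int>(deps[i].size());
        for (int d : deps[i]) succs[d].push_back(i);
    }
    std::vector<int> ready;
    for (int i = 0; i < n; ++i)
        if (remaining[i] == 0) ready.push_back(i);

    int cycle = 0, done = 0;
    while (done < n) {
        ++cycle;
        int issue = std::min<int>(width, static_cast<int>(ready.size()));
        std::vector<int> issued(ready.begin(), ready.begin() + issue);
        ready.erase(ready.begin(), ready.begin() + issue);
        for (int op : issued) {
            ++done;
            for (int s : succs[op])             // wake ops whose inputs are now ready
                if (--remaining[s] == 0) ready.push_back(s);
        }
    }
    return cycle;
}

int main() {
    // Op 0 fans out to six independent ops, which all feed op 7.
    std::vector<std::vector<int>> deps = {
        {}, {0}, {0}, {0}, {0}, {0}, {0}, {1, 2, 3, 4, 5, 6}};
    int unbounded = makespan(deps, 8);          // peak parallelism fully exploited
    int capped = makespan(deps, 2);             // peak deferred across more cycles
    std::printf("unbounded width: %d cycles, width 2: %d cycles, slowdown %.2fx\n",
                unbounded, capped, static_cast<double>(capped) / unbounded);
}
```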

71 citations


Cited by
Journal ArticleDOI
Alan R. Jones

1,349 citations

Book
01 Jan 2006
TL;DR: The author discusses the history and present situation of operating systems, as well as some of the techniques used to design and implement these systems.
Abstract: Table of Contents
Chapter 1 Introduction: 1.1 What Is an Operating System?; 1.2 History of Operating Systems; 1.3 Operating System Concepts; 1.4 System Calls; 1.5 Operating System Structure; 1.6 Outline of the Rest of This Book; 1.7 Summary.
Chapter 2 Processes: 2.1 Introduction to Processes; 2.2 Interprocess Communication; 2.3 Classical IPC Problems; 2.4 Scheduling; 2.5 Overview of Processes in MINIX 3; 2.6 Implementation of Processes in MINIX 3; 2.7 The System Task in MINIX 3; 2.8 The Clock Task in MINIX 3; 2.9 Summary.
Chapter 3 Input/Output: 3.1 Principles of I/O Hardware; 3.2 Principles of I/O Software; 3.3 Deadlocks; 3.4 Overview of I/O in MINIX 3; 3.5 Block Devices in MINIX 3; 3.6 RAM Disks; 3.7 Disks; 3.8 Terminals; 3.9 Summary.
Chapter 4 Memory Management: 4.1 Basic Memory Management; 4.2 Swapping; 4.3 Virtual Memory; 4.4 Page Replacement Algorithms; 4.5 Design Issues for Paging Systems; 4.6 Segmentation; 4.7 Overview of the MINIX 3 Process Manager; 4.8 Implementation of the MINIX 3 Process Manager; 4.9 Summary.
Chapter 5 File Systems: 5.1 Files; 5.2 Directories; 5.3 File System Implementation; 5.4 Security; 5.5 Protection Mechanisms; 5.6 Overview of the MINIX 3 File System; 5.7 Implementation of the MINIX 3 File System; 5.8 Summary.
Chapter 6 Reading List and Bibliography: 6.1 Suggestions for Further Reading; 6.2 Alphabetical Bibliography.
Appendix A: Installing MINIX 3. Appendix B: MINIX 3 Source Code Listing. Appendix C: Index to Files. Index.

572 citations

Proceedings ArticleDOI
02 Dec 1996
TL;DR: It is shown that simple microarchitectural enhancements enabling value prediction in a modern microprocessor implementation based on the PowerPC 620 can effectively exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.5%-23% by exceeding the dataflow limit.
Abstract: For decades, the serialization constraints induced by true data dependences have been regarded as an absolute limit, the dataflow limit, on the parallel execution of serial programs. This paper proposes a new technique, value prediction, for exceeding that limit that allows data dependent instructions to issue and execute in parallel without violating program semantics. This technique is built on the concept of value locality, which describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Value prediction consists of predicting entire 32- and 64-bit register values based on previously-seen values. We find that such register values being written by machine instructions are frequently predictable. Furthermore, we show that simple microarchitectural enhancements to a modern microprocessor implementation based on the PowerPC 620 that enable value prediction can effectively exploit value locality to collapse true dependences, reduce average result latency, and provide performance gains of 4.5%-23% (depending on machine model) by exceeding the dataflow limit.
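A minimal way to see the idea is a last-value predictor: a table indexed by instruction address that guesses an instruction will produce the same value it produced last time, gated by a small confidence counter. The sketch below is an illustrative model in that spirit; the table size, the confidence scheme, and the class names are assumptions, not the configuration evaluated in the paper.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>

class LastValuePredictor {
public:
    // Return a predicted result for the instruction at `pc`, if confident.
    std::optional<uint64_t> predict(uint64_t pc) const {
        const Entry& e = table_[index(pc)];
        if (e.valid && e.confidence >= 2) return e.value;
        return std::nullopt;
    }
    // Train the table with the actual result once the instruction executes.
    void update(uint64_t pc, uint64_t actual) {
        Entry& e = table_[index(pc)];
        if (e.valid && e.value == actual) {
            if (e.confidence < 3) ++e.confidence;   // saturating confidence counter
        } else {
            e = Entry{true, actual, 0};             // new value: start over, unconfident
        }
    }
private:
    struct Entry {
        bool valid = false;
        uint64_t value = 0;
        uint8_t confidence = 0;
    };
    static constexpr size_t kEntries = 1024;
    static size_t index(uint64_t pc) { return (pc >> 2) % kEntries; }
    Entry table_[kEntries];
};

int main() {
    LastValuePredictor vp;
    const uint64_t pc = 0x400100;
    // A load that keeps returning the same value becomes predictable, letting
    // dependent instructions issue speculatively before the load completes;
    // the final access shows a misprediction, which real hardware must squash.
    for (uint64_t value : {7ULL, 7ULL, 7ULL, 7ULL, 9ULL}) {
        if (auto guess = vp.predict(pc))
            std::printf("predicted %llu, actual %llu\n",
                        static_cast<unsigned long long>(*guess),
                        static_cast<unsigned long long>(value));
        else
            std::printf("no prediction, actual %llu\n",
                        static_cast<unsigned long long>(value));
        vp.update(pc, value);
    }
}
```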

526 citations

Proceedings ArticleDOI
01 Dec 2012
TL;DR: This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Abstract: This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average 25% fewer L1 data cache misses which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
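The core scheduling decision can be sketched as a scoring problem: each wavefront carries a lost-locality score fed by the detector, and the scheduler issues only from the highest-scoring wavefronts until a capacity-derived budget is exhausted, throttling the rest. The model below is greatly simplified; the scoring values, the budget, and the function names are illustrative assumptions rather than the paper's exact cutoff logic.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Wavefront {
    int id;
    int lost_locality_score;       // bumped whenever the locality detector sees the
                                   // wavefront miss on data it recently held in L1
};

// Decide which wavefronts may issue this cycle: favor those with the most
// intra-wavefront locality at stake, and stop adding wavefronts once the
// cumulative score exceeds a capacity-derived budget.
std::vector<int> schedulable(std::vector<Wavefront> wfs, int budget) {
    std::sort(wfs.begin(), wfs.end(), [](const Wavefront& a, const Wavefront& b) {
        return a.lost_locality_score > b.lost_locality_score;
    });
    std::vector<int> allowed;
    int used = 0;
    for (const Wavefront& w : wfs) {
        allowed.push_back(w.id);               // always allow at least one wavefront
        used += w.lost_locality_score;
        if (used > budget) break;              // the rest are throttled this cycle
    }
    return allowed;
}

int main() {
    // Wavefront 2 keeps losing intra-wavefront locality, so the scheduler shrinks
    // the actively issuing set instead of letting all four thrash the shared L1.
    std::vector<Wavefront> wfs = {{0, 1}, {1, 2}, {2, 9}, {3, 1}};
    for (int id : schedulable(wfs, 10))
        std::printf("wavefront %d may issue this cycle\n", id);
}
```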

408 citations

Proceedings ArticleDOI
19 Sep 2012
TL;DR: There is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.
Abstract: Cache compression is a promising technique to increase on-chip cache capacity and to decrease on-chip and off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that can effectively compress common in-cache data patterns, and has minimal effect on cache access latency.

348 citations