Author

Vijay S. Pai

Other affiliations: Rice University
Bio: Vijay S. Pai is an academic researcher from Purdue University. The author has contributed to research on topics including shared memory and caches. The author has an h-index of 25 and has co-authored 70 publications receiving 2,286 citations. Previous affiliations of Vijay S. Pai include Rice University.


Papers
Journal ArticleDOI
TL;DR: The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems, and plans to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, Rabbit fast simulation mode, and ports to Linux platforms.
Abstract: The early 1990s saw several announcements of commercial shared-memory systems using processors that aggressively exploited instruction-level parallelism (ILP), including the MIPS R10000, Hewlett-Packard PA8000, and Intel Pentium Pro. These processors could potentially reduce memory read stalls by overlapping read latency with other operations, possibly changing the nature of performance bottlenecks in the system. The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems. In particular, current simple processor-based approximations cannot model significant performance effects for applications exhibiting parallel read misses. Further, recent shared-memory designs such as aggressive implementations of sequential consistency use the aggressive ILP-enhancing features of modern processors that simple processor-based simulators do not model. As microprocessor systems become more complex, the availability of shared infrastructure source code is likely to become increasingly crucial. The authors plan to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, Rabbit fast simulation mode, and ports to Linux platforms.
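To make the effect concrete, here is a back-of-the-envelope sketch (not Rsim's actual model; the miss latencies are assumed for illustration) of why parallel read misses matter: a simple in-order, blocking-read approximation charges each miss latency serially, while an ILP processor with non-blocking reads can overlap independent misses.

// Illustrative comparison of serialized vs. overlapped read-miss stall time.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> miss_latencies = {100, 100, 100};   // three independent read misses (assumed)

    int serial_stall  = 0;                                // simple-processor approximation
    int overlap_stall = 0;                                // idealized full overlap
    for (int l : miss_latencies) {
        serial_stall  += l;
        overlap_stall  = std::max(overlap_stall, l);
    }
    std::printf("serialized stall: %d cycles, overlapped stall: %d cycles\n",
                serial_stall, overlap_stall);             // 300 vs 100
    return 0;
}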

204 citations

Proceedings ArticleDOI
01 Jun 1997
TL;DR: It is found that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC.
Abstract: This paper studies techniques to improve the performance of memory consistency models for shared-memory multiprocessors with ILP processors. The first part of this paper extends earlier work by studying the impact of current hardware optimizations for memory consistency implementations (hardware-controlled non-binding prefetching and speculative load execution) on the performance of the processor consistency (PC) memory model. We find that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC. Nevertheless, release consistency (RC) provides significant benefits over PC in some cases, because PC suffers from the negative effects of premature store prefetches and insufficient memory queue sizes. The second part of the paper proposes and evaluates a new technique, speculative retirement, to improve the performance of SC. Speculative retirement alleviates the impact of the store-to-load constraint of SC by allowing loads and subsequent instructions to speculatively commit or retire, even while a previous store is outstanding. Speculative retirement needs additional hardware support (in the form of a history buffer) to recover from possible consistency violations due to such speculative retires. With a 64-element history buffer, speculative retirement reduces the execution time gap between SC and PC to within 11% for all our applications on our base architecture; a significant, though reduced, gap still remains between SC and RC. The third part of our paper evaluates the interactions of the various techniques with larger instruction window sizes. When increasing instruction window size, initially, the previous best implementations of all models generally improve in performance due to increased load and store overlap. With further increases, the performance of PC and RC stabilizes while that of SC often degrades.
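The store-to-load ordering constraint that PC relaxes can be illustrated with the classic store-buffering litmus test. The sketch below is not from the paper; it uses C++11 atomics rather than the paper's hardware mechanisms, but it shows the same constraint: under sequential consistency (seq_cst) the outcome r1 == 0 and r2 == 0 is forbidden, whereas a model that lets a load bypass the thread's own pending store permits it.

// Store-buffering litmus test: SC forbids r1 == 0 && r2 == 0.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t0() {
    x.store(1, std::memory_order_seq_cst);      // store ...
    r1 = y.load(std::memory_order_seq_cst);     // ... then load: SC keeps this order
}

void t1() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // With seq_cst, at least one of r1, r2 must be 1.
    // Replacing seq_cst with memory_order_relaxed models the weaker ordering.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}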

147 citations

20 Oct 1997
TL;DR: RSIM is execution-driven and models state-of-the-art ILP processors, an aggressive memory system, and a multiprocessor coherence protocol and interconnect, including contention at all resources; it is also used successfully in both undergraduate and graduate computer architecture courses at Rice University.
Abstract: This paper describes RSIM, the Rice Simulator for ILP Multiprocessors, Version 1.0. RSIM simulates shared-memory multiprocessors (and uniprocessors) built from processors that aggressively exploit instruction-level parallelism (ILP). RSIM is execution-driven and models state-of-the-art ILP processors, an aggressive memory system, and a multiprocessor coherence protocol and interconnect, including contention at all resources. Although originally designed as a research tool, RSIM is also being used successfully in both undergraduate and graduate computer architecture courses at Rice University. RSIM version 1.0 is publicly available.

145 citations

20 Aug 1997
TL;DR: Significant parts of the RSIM memory and network system are based on code from RPPT, a project led by Prof. and involving several graduate students.
Abstract: Acknowledgments. We thank other past and present members of the RSIM group for their contributions. Hazim Abdel-Shafi contributed parts of the memory system simulator code and documentation. Murthy Durbhakula provided valuable support in setting up the RSIM distribution. Jonathan Hall worked on initial versions of the memory system simulator. Tracy Harton supported our development effort over the last two years. We are also grateful to the Rice Parallel Processing Testbed (RPPT) group. Significant parts of the RSIM memory and network system are based on code from RPPT, a project led by Prof. and involving several graduate students.

142 citations

Proceedings ArticleDOI
11 Sep 2010
TL;DR: A sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations, and using O(1) data structures that may be made thread-private to reduce overhead in analysis mode.
Abstract: Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.
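The sketch below illustrates the basic idea behind sampled reuse-distance measurement; it is a simplified illustration, not the paper's data structures (the Sampler class and its fields are hypothetical). Once an address is sampled, every distinct address seen before the sampled address is touched again is recorded, and the set size at the reuse point is the reuse distance of that access pair.

// Sampled reuse-distance measurement: watch one address at a time and count
// the distinct addresses accessed until it is reused.
#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_set>
#include <vector>

struct Sampler {
    std::optional<uint64_t> watched;          // currently sampled address
    std::unordered_set<uint64_t> seen;        // distinct addresses since sampling
    std::vector<size_t> distances;            // collected reuse distances

    void access(uint64_t addr, bool take_sample = false) {
        if (watched) {
            if (addr == *watched) {           // reuse of the sampled address
                distances.push_back(seen.size());
                watched.reset();
                seen.clear();
            } else {
                seen.insert(addr);            // O(1) expected work per access
            }
        } else if (take_sample) {
            watched = addr;                   // start watching this address
        }
    }
};

int main() {
    Sampler s;
    uint64_t trace[] = {1, 2, 3, 2, 1};       // reuse distance of address 1 is 2 ({2, 3})
    for (size_t i = 0; i < 5; ++i)
        s.access(trace[i], /*take_sample=*/(i == 0));
    std::printf("measured distance: %zu\n", s.distances.at(0));
    return 0;
}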

119 citations


Cited by
Book
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions. The book:
* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case studies
* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs
Table of Contents: 1. Introduction; 2. Parallel Programs; 3. Programming for Performance; 4. Workload-Driven Evaluation; 5. Shared Memory Multiprocessors; 6. Snoop-based Multiprocessor Design; 7. Scalable Multiprocessors; 8. Directory-based Cache Coherence; 9. Hardware-Software Tradeoffs; 10. Interconnection Network Design; 11. Latency Tolerance; 12. Future Directions; Appendix A: Parallel Benchmark Suites

1,571 citations

Journal ArticleDOI
TL;DR: While most consistency models emphasize the system optimizations they support, this work also describes an alternative, programmer-centric view of relaxed consistency models that characterizes them in terms of program behavior rather than system optimizations.
Abstract: The memory consistency model of a system affects performance, programmability, and portability. We aim to describe memory consistency models in a way that most computer professionals would understand. This is important if the performance-enhancing features being incorporated by system designers are to be correctly and widely used by programmers. Our focus is consistency models proposed for hardware-based shared memory systems. Most of these models emphasize the system optimizations they support, and we retain this system-centric emphasis. We also describe an alternative, programmer-centric view of relaxed consistency models that describes them in terms of program behavior, not system optimizations.

1,213 citations

Proceedings ArticleDOI
01 May 1990
TL;DR: A new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models is introduced and is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization.
Abstract: Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such architectural optimizations can cause memory accesses to be executed in an order different from what the programmer expects. The set of allowable memory access orderings forms the memory consistency model or event ordering model for an architecture.This paper introduces a new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models. A framework for classifying shared accesses and reasoning about event ordering is developed. The release consistency model is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization. Possible performance gains from the less strict constraints of the release consistency model are explored. Finally, practical implementation issues are discussed, concentrating on issues relevant to scalable architectures.
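The acquire/release classification of shared accesses that release consistency builds on can be sketched with C++11 atomics (a minimal illustration, not the paper's hardware framework): the releasing store makes the earlier data write visible to any thread whose acquiring load observes the flag.

// Acquire/release synchronization: the consumer is guaranteed to see data == 42.
#include <atomic>
#include <cstdio>
#include <thread>

int data = 0;
std::atomic<bool> flag{false};

void producer() {
    data = 42;                                        // ordinary shared write
    flag.store(true, std::memory_order_release);      // "release" synchronization write
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {}  // "acquire" synchronization read
    std::printf("%d\n", data);                        // prints 42
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
    return 0;
}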

1,169 citations

Book
05 Mar 2012
TL;DR: Computer Networking: A Top-Down Approach Featuring the Internet explains the engineering problems that are inherent in communicating digital information from point to point, presents the mathematics that determine the best path, shows some code that implements those algorithms, and illustrates the logic using excellent conceptual diagrams.
Abstract: Certain data-communication protocols hog the spotlight, but all of them have a lot in common. Computer Networking: A Top-Down Approach Featuring the Internet explains the engineering problems that are inherent in communicating digital information from point to point. The top-down approach mentioned in the subtitle means that the book starts at the top of the protocol stack--at the application layer--and works its way down through the other layers, until it reaches bare wire. The authors, for the most part, shun the well-known seven-layer Open Systems Interconnection (OSI) protocol stack in favor of their own five-layer (application, transport, network, link, and physical) model. It's an effective approach that helps clear away some of the hand waving traditionally associated with the more obtuse layers in the OSI model. The approach is definitely theoretical--don't look here for instructions on configuring Windows 2000 or a Cisco router--but it's relevant to reality, and should help anyone who needs to understand networking as a programmer, system architect, or even administration guru. The treatment of the network layer, at which routing takes place, is typical of the overall style. In discussing routing, authors James Kurose and Keith Ross explain (by way of lots of clear, definition-packed text) what routing protocols need to do: find the best route to a destination. Then they present the mathematics that determine the best path, show some code that implements those algorithms, and illustrate the logic by using excellent conceptual diagrams. Real-life implementations of the algorithms--including Internet Protocol (both IPv4 and IPv6) and several popular IP routing protocols--help you to make the transition from pure theory to networking technologies. --David Wall
Topics covered: The theory behind data networks, with thorough discussion of the problems that are posed at each level (the application layer gets plenty of attention). For each layer, there's academic coverage of networking problems and solutions, followed by discussion of real technologies. Special sections deal with network security and transmission of digital multimedia.
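As a small illustration of the shortest-path computation behind link-state routing protocols (this sketch is not the book's code), Dijkstra's algorithm over a weighted adjacency list computes the least-cost distance from a source router to every other node.

// Dijkstra's algorithm: least-cost paths from a source node over weighted links.
#include <cstdio>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> dijkstra(const std::vector<std::vector<std::pair<int,int>>>& adj, int src) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> dist(adj.size(), INF);
    using Item = std::pair<int,int>;                       // (distance, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d > dist[u]) continue;                         // skip stale queue entries
        for (auto [v, w] : adj[u])
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                pq.push({dist[v], v});
            }
    }
    return dist;
}

int main() {
    // Hypothetical 4-router topology: links 0-1 (cost 1), 1-2 (2), 0-2 (5), 2-3 (1).
    std::vector<std::vector<std::pair<int,int>>> adj(4);
    auto link = [&](int a, int b, int w) { adj[a].push_back({b, w}); adj[b].push_back({a, w}); };
    link(0, 1, 1); link(1, 2, 2); link(0, 2, 5); link(2, 3, 1);
    for (int d : dijkstra(adj, 0)) std::printf("%d ", d);  // prints: 0 1 3 4
    std::printf("\n");
    return 0;
}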

1,079 citations

Proceedings ArticleDOI
12 Nov 2011
TL;DR: Interval simulation provides a balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system.
Abstract: Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for sufficient exploration of large multi-core systems within a limited simulation time budget. By bringing together accurate high-abstraction analytical models with fast parallel simulation, architects can trade off accuracy with simulation speed to allow for longer application runs, covering a larger portion of the hardware design space. Interval simulation provides this balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. Validations against real hardware show average absolute errors within 25% for a variety of multi-threaded workloads; more than twice as accurate on average as one-IPC simulation. Further, we demonstrate scalable simulation speed of up to 2.0 MIPS when simulating a 16-core system on an 8-core SMP machine.
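A toy sketch of the abstraction gap described above (not the simulator's actual model; all counts, latencies, and the base CPI are assumed): a one-IPC estimate charges one cycle per instruction, while an interval-style estimate advances at the core's base rate between miss events and adds a penalty for each long-latency miss.

// Toy comparison of one-IPC and interval-style cycle estimates.
#include <cstdio>

int main() {
    const long instructions = 1'000'000;
    const long misses       = 5'000;      // assumed long-latency cache misses
    const long miss_penalty = 200;        // assumed cycles per miss
    const double base_cpi   = 0.5;        // assumed superscalar steady-state CPI

    long one_ipc_cycles  = instructions;                                  // 1 instruction per cycle
    long interval_cycles = static_cast<long>(instructions * base_cpi)     // smooth intervals
                         + misses * miss_penalty;                         // plus miss intervals

    std::printf("one-IPC estimate:  %ld cycles\n", one_ipc_cycles);
    std::printf("interval estimate: %ld cycles\n", interval_cycles);
    return 0;
}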

818 citations