Author

Vijay S. Pai

Other affiliations: Rice University
Bio: Vijay S. Pai is an academic researcher from Purdue University. The author has contributed to research on topics including shared memory and caches. The author has an h-index of 25 and has co-authored 70 publications receiving 2,286 citations. Previous affiliations of Vijay S. Pai include Rice University.


Papers
Journal ArticleDOI
TL;DR: The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems, and plans to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, Rabbit fast simulation mode, and ports to Linux platforms.
Abstract: The early 1990s saw several announcements of commercial shared-memory systems using processors that aggressively exploited instruction-level parallelism (ILP), including the MIPS R10000, Hewlett-Packard PA8000, and Intel Pentium Pro. These processors could potentially reduce memory read stalls by overlapping read latency with other operations, possibly changing the nature of performance bottlenecks in the system. The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems. In particular, current simple processor-based approximations cannot model significant performance effects for applications exhibiting parallel read misses. Further, recent shared-memory designs such as aggressive implementations of sequential consistency use the aggressive ILP-enhancing features of modern processors that simple processor-based simulators do not model. As microprocessor systems become more complex, the availability of shared infrastructure source code is likely to become increasingly crucial. The authors plan to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, Rabbit fast simulation mode, and ports to Linux platforms.
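To make the effect concrete, here is a back-of-the-envelope sketch (not Rsim's actual model; the miss latencies are assumed for illustration) of why parallel read misses matter: a simple in-order, blocking-read approximation charges each miss latency serially, while an ILP processor with non-blocking reads can overlap independent misses.

// Illustrative comparison of serialized vs. overlapped read-miss stall time.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> miss_latencies = {100, 100, 100};   // three independent read misses (assumed)

    int serial_stall  = 0;                                // simple-processor approximation
    int overlap_stall = 0;                                // idealized full overlap
    for (int l : miss_latencies) {
        serial_stall  += l;
        overlap_stall  = std::max(overlap_stall, l);
    }
    std::printf("serialized stall: %d cycles, overlapped stall: %d cycles\n",
                serial_stall, overlap_stall);             // 300 vs 100
    return 0;
}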

204 citations

Proceedings ArticleDOI
01 Jun 1997
TL;DR: It is found that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC.
Abstract: This paper studies techniques to improve the performance of memory consistency models for shared-memory multiprocessors with ILP processors. The first part of this paper extends earlier work by studying the impact of current hardware optimizations for memory consistency implementations (hardware-controlled non-binding prefetching and speculative load execution) on the performance of the processor consistency (PC) memory model. We find that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC. Nevertheless, release consistency (RC) provides significant benefits over PC in some cases, because PC suffers from the negative effects of premature store prefetches and insufficient memory queue sizes. The second part of the paper proposes and evaluates a new technique, speculative retirement, to improve the performance of SC. Speculative retirement alleviates the impact of the store-to-load constraint of SC by allowing loads and subsequent instructions to speculatively commit or retire, even while a previous store is outstanding. Speculative retirement needs additional hardware support (in the form of a history buffer) to recover from possible consistency violations due to such speculative retires. With a 64-element history buffer, speculative retirement reduces the execution time gap between SC and PC to within 11% for all our applications on our base architecture; a significant, though reduced, gap still remains between SC and RC. The third part of our paper evaluates the interactions of the various techniques with larger instruction window sizes. When increasing instruction window size, initially, the previous best implementations of all models generally improve in performance due to increased load and store overlap. With further increases, the performance of PC and RC stabilizes while that of SC often degrades.
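The store-to-load ordering constraint that PC relaxes can be illustrated with the classic store-buffering litmus test. The sketch below is not from the paper; it uses C++11 atomics rather than the paper's hardware mechanisms, but it shows the same constraint: under sequential consistency (seq_cst) the outcome r1 == 0 and r2 == 0 is forbidden, whereas a model that lets a load bypass the thread's own pending store permits it.

// Store-buffering litmus test: SC forbids r1 == 0 && r2 == 0.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t0() {
    x.store(1, std::memory_order_seq_cst);      // store ...
    r1 = y.load(std::memory_order_seq_cst);     // ... then load: SC keeps this order
}

void t1() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    // With seq_cst, at least one of r1, r2 must be 1.
    // Replacing seq_cst with memory_order_relaxed models the weaker ordering.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}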

147 citations

20 Oct 1997
TL;DR: RSIM is execution-driven and models state-of-the-art ILP processors, an aggressive memory system, and a multiprocessor coherence protocol and interconnect, including contention at all resources; it is also used successfully in both undergraduate and graduate computer architecture courses at Rice University.
Abstract: This paper describes RSIM, the Rice Simulator for ILP Multiprocessors, Version 1.0. RSIM simulates shared-memory multiprocessors (and uniprocessors) built from processors that aggressively exploit instruction-level parallelism (ILP). RSIM is execution-driven and models state-of-the-art ILP processors, an aggressive memory system, and a multiprocessor coherence protocol and interconnect, including contention at all resources. Although originally designed as a research tool, RSIM is also being used successfully in both undergraduate and graduate computer architecture courses at Rice University. RSIM version 1.0 is publicly available.

145 citations

20 Aug 1997
TL;DR: Significant parts of the RSIM memory and network system are based on code from RPPT, a project led by Prof. and involving several graduate students.
Abstract: Acknowledgments. We thank other past and present members of the RSIM group for their contributions. Hazim Abdel-Shafi contributed parts of the memory system simulator code and documentation. Murthy Durbhakula provided valuable support in setting up the RSIM distribution. Jonathan Hall worked on initial versions of the memory system simulator. Tracy Harton supported our development effort over the last two years. We are also grateful to the Rice Parallel Processing Testbed (RPPT) group. Significant parts of the RSIM memory and network system are based on code from RPPT, a project led by Prof. and involving several graduate students.

142 citations

Proceedings ArticleDOI
11 Sep 2010
TL;DR: A sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations, and using O(1) data structures that may be made thread-private to reduce overhead in analysis mode.
Abstract: Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.
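The sketch below illustrates the basic idea behind sampled reuse-distance measurement; it is a simplified illustration, not the paper's data structures (the Sampler class and its fields are hypothetical). Once an address is sampled, every distinct address seen before the sampled address is touched again is recorded, and the set size at the reuse point is the reuse distance of that access pair.

// Sampled reuse-distance measurement: watch one address at a time and count
// the distinct addresses accessed until it is reused.
#include <cstdint>
#include <cstdio>
#include <optional>
#include <unordered_set>
#include <vector>

struct Sampler {
    std::optional<uint64_t> watched;          // currently sampled address
    std::unordered_set<uint64_t> seen;        // distinct addresses since sampling
    std::vector<size_t> distances;            // collected reuse distances

    void access(uint64_t addr, bool take_sample = false) {
        if (watched) {
            if (addr == *watched) {           // reuse of the sampled address
                distances.push_back(seen.size());
                watched.reset();
                seen.clear();
            } else {
                seen.insert(addr);            // O(1) expected work per access
            }
        } else if (take_sample) {
            watched = addr;                   // start watching this address
        }
    }
};

int main() {
    Sampler s;
    uint64_t trace[] = {1, 2, 3, 2, 1};       // reuse distance of address 1 is 2 ({2, 3})
    for (size_t i = 0; i < 5; ++i)
        s.access(trace[i], /*take_sample=*/(i == 0));
    std::printf("measured distance: %zu\n", s.distances.at(0));
    return 0;
}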

119 citations


Cited by
Book
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions. The book:
* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case studies
* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs
Table of Contents: 1. Introduction; 2. Parallel Programs; 3. Programming for Performance; 4. Workload-Driven Evaluation; 5. Shared Memory Multiprocessors; 6. Snoop-based Multiprocessor Design; 7. Scalable Multiprocessors; 8. Directory-based Cache Coherence; 9. Hardware-Software Tradeoffs; 10. Interconnection Network Design; 11. Latency Tolerance; 12. Future Directions; Appendix A: Parallel Benchmark Suites

1,571 citations

Journal ArticleDOI
TL;DR: While most consistency models emphasize the system optimizations they support, this work also describes an alternative, programmer-centric view of relaxed consistency models that characterizes them in terms of program behavior rather than system optimizations.
Abstract: The memory consistency model of a system affects performance, programmability, and portability. We aim to describe memory consistency models in a way that most computer professionals would understand. This is important if the performance-enhancing features being incorporated by system designers are to be correctly and widely used by programmers. Our focus is consistency models proposed for hardware-based shared memory systems. Most of these models emphasize the system optimizations they support, and we retain this system-centric emphasis. We also describe an alternative, programmer-centric view of relaxed consistency models that describes them in terms of program behavior, not system optimizations.

1,213 citations

Proceedings ArticleDOI
01 May 1990
TL;DR: A new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models is introduced and is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization.
Abstract: Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such architectural optimizations can cause memory accesses to be executed in an order different from what the programmer expects. The set of allowable memory access orderings forms the memory consistency model or event ordering model for an architecture.This paper introduces a new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models. A framework for classifying shared accesses and reasoning about event ordering is developed. The release consistency model is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization. Possible performance gains from the less strict constraints of the release consistency model are explored. Finally, practical implementation issues are discussed, concentrating on issues relevant to scalable architectures.
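The acquire/release classification of shared accesses that release consistency builds on can be sketched with C++11 atomics (a minimal illustration, not the paper's hardware framework): the releasing store makes the earlier data write visible to any thread whose acquiring load observes the flag.

// Acquire/release synchronization: the consumer is guaranteed to see data == 42.
#include <atomic>
#include <cstdio>
#include <thread>

int data = 0;
std::atomic<bool> flag{false};

void producer() {
    data = 42;                                        // ordinary shared write
    flag.store(true, std::memory_order_release);      // "release" synchronization write
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {}  // "acquire" synchronization read
    std::printf("%d\n", data);                        // prints 42
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
    return 0;
}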

1,169 citations

Book
05 Mar 2012
TL;DR: Computer Networking: A Top-Down Approach Featuring the Internet explains the engineering problems that are inherent in communicating digital information from point to point, presents the mathematics that determine the best path, shows some code that implements those algorithms, and illustrates the logic using excellent conceptual diagrams.
Abstract: Certain data-communication protocols hog the spotlight, but all of them have a lot in common. Computer Networking: A Top-Down Approach Featuring the Internet explains the engineering problems that are inherent in communicating digital information from point to point. The top-down approach mentioned in the subtitle means that the book starts at the top of the protocol stack--at the application layer--and works its way down through the other layers, until it reaches bare wire. The authors, for the most part, shun the well-known seven-layer Open Systems Interconnection (OSI) protocol stack in favor of their own five-layer (application, transport, network, link, and physical) model. It's an effective approach that helps clear away some of the hand waving traditionally associated with the more obtuse layers in the OSI model. The approach is definitely theoretical--don't look here for instructions on configuring Windows 2000 or a Cisco router--but it's relevant to reality, and should help anyone who needs to understand networking as a programmer, system architect, or even administration guru. The treatment of the network layer, at which routing takes place, is typical of the overall style. In discussing routing, authors James Kurose and Keith Ross explain (by way of lots of clear, definition-packed text) what routing protocols need to do: find the best route to a destination. Then they present the mathematics that determine the best path, show some code that implements those algorithms, and illustrate the logic by using excellent conceptual diagrams. Real-life implementations of the algorithms--including Internet Protocol (both IPv4 and IPv6) and several popular IP routing protocols--help you to make the transition from pure theory to networking technologies. --David Wall
Topics covered: The theory behind data networks, with thorough discussion of the problems that are posed at each level (the application layer gets plenty of attention). For each layer, there's academic coverage of networking problems and solutions, followed by discussion of real technologies. Special sections deal with network security and transmission of digital multimedia.
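As a small illustration of the shortest-path computation behind link-state routing protocols (this sketch is not the book's code), Dijkstra's algorithm over a weighted adjacency list computes the least-cost distance from a source router to every other node.

// Dijkstra's algorithm: least-cost paths from a source node over weighted links.
#include <cstdio>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> dijkstra(const std::vector<std::vector<std::pair<int,int>>>& adj, int src) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> dist(adj.size(), INF);
    using Item = std::pair<int,int>;                       // (distance, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d > dist[u]) continue;                         // skip stale queue entries
        for (auto [v, w] : adj[u])
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w;
                pq.push({dist[v], v});
            }
    }
    return dist;
}

int main() {
    // Hypothetical 4-router topology: links 0-1 (cost 1), 1-2 (2), 0-2 (5), 2-3 (1).
    std::vector<std::vector<std::pair<int,int>>> adj(4);
    auto link = [&](int a, int b, int w) { adj[a].push_back({b, w}); adj[b].push_back({a, w}); };
    link(0, 1, 1); link(1, 2, 2); link(0, 2, 5); link(2, 3, 1);
    for (int d : dijkstra(adj, 0)) std::printf("%d ", d);  // prints: 0 1 3 4
    std::printf("\n");
    return 0;
}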

1,079 citations

Proceedings ArticleDOI
12 Nov 2011
TL;DR: Interval simulation provides a balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system.
Abstract: Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for sufficient exploration of large multi-core systems within a limited simulation time budget. By bringing together accurate high-abstraction analytical models with fast parallel simulation, architects can trade off accuracy with simulation speed to allow for longer application runs, covering a larger portion of the hardware design space. Interval simulation provides this balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. Validations against real hardware show average absolute errors within 25% for a variety of multi-threaded workloads; more than twice as accurate on average as one-IPC simulation. Further, we demonstrate scalable simulation speed of up to 2.0 MIPS when simulating a 16-core system on an 8-core SMP machine.
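A toy sketch of the abstraction gap described above (not the simulator's actual model; all counts, latencies, and the base CPI are assumed): a one-IPC estimate charges one cycle per instruction, while an interval-style estimate advances at the core's base rate between miss events and adds a penalty for each long-latency miss.

// Toy comparison of one-IPC and interval-style cycle estimates.
#include <cstdio>

int main() {
    const long instructions = 1'000'000;
    const long misses       = 5'000;      // assumed long-latency cache misses
    const long miss_penalty = 200;        // assumed cycles per miss
    const double base_cpi   = 0.5;        // assumed superscalar steady-state CPI

    long one_ipc_cycles  = instructions;                                  // 1 instruction per cycle
    long interval_cycles = static_cast<long>(instructions * base_cpi)     // smooth intervals
                         + misses * miss_penalty;                         // plus miss intervals

    std::printf("one-IPC estimate:  %ld cycles\n", one_ipc_cycles);
    std::printf("interval estimate: %ld cycles\n", interval_cycles);
    return 0;
}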

818 citations