scispace - formally typeset
Search or ask a question
Author

Michel Dubois

Bio: Michel Dubois is an academic researcher from University of Southern California. The author has contributed to research in topics: Cache & Shared memory. The author has an hindex of 30, co-authored 114 publications receiving 2678 citations.


Papers
More filters
Journal ArticleDOI
TL;DR: Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25% respectively.
Abstract: To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of pre-fetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25% respectively. >

167 citations

Proceedings ArticleDOI
16 Aug 1993
TL;DR: This work proposes to adapt the number of pref etched blocks according to a dynamic measure of prefetching effectiveness, and shows significant reductions of the read penalty and of the overall execution time.
Abstract: To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware tech niques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the exe cution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of pref etched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show significant reductions of the read penalty and of the overall execution time.

149 citations

Proceedings ArticleDOI
01 May 1993
TL;DR: A new classification of misses in shared-memory multiprocessors based on interprocessor communication is introduced, which identifies the set of essential misses, i.e., the smallest set of misses necessary for correct execution.
Abstract: In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvements.

121 citations

Journal ArticleDOI
TL;DR: This article presents a comprehensive survey of various approaches for the verification of cache coherence protocols based on state enumeration, (symbolic model checking, and symbolic state models), and discusses the efficiency and the limitations of each technique in terms of memory and computation time.
Abstract: In this article we present a comprehensive survey of various approaches for the verification of cache coherence protocols based on state enumeration, (symbolic model checking, and symbolic state models. Since these techniques search the state space of the protocol exhaustively, the amount of memory required to manipulate that state information and the verification time grow very fast with the number of processors and the complexity of the protocol mechanisms. To be successful for systems of arbitrary complexity, a verification technique must solve this so-called state space explosion problem. The emphasis of our discussion is onthe underlying theory in each method of handling the state space exposion problem, and formulationg and checking the safety properties (e.g., data consistency) and the liveness properties (absence of deadlock and livelock). We compare the efficiency and discuss the limitations of each technique in terms of memory and computation time. Also, we discuss issues of generality, applicability, automaticity, and amenity for existing tools in each class of methods. No method is truly superior because each method has its own strengths and weaknesses. Finally, refinements that can further reduce the verification time and/or the memory requirement are also discussed.

111 citations

Journal ArticleDOI
01 Apr 1982
TL;DR: An in-depth analysis of the effects of cache coherency in multiprocessors is presented and a novel analytical model for the program behavior of a multitasked system is introduced.
Abstract: In many commercial multiprocessor systems, each processor accesses the memory through a private cache. One problem that could limit the extensibility of the system and its performance is the enforcement of cache coherence. A mechanism must exist which prevents the existence of several different copies of the same data block in different private caches. In this paper, we present an indepth analysis of the effect of cache coherency in multiprocessors. A novel analytical model for the program behavior of a multitasked system is introduced. The model includes the behavior of each process and the interactions between processes with regard to the sharing of data blocks. An approximation is developed to derive the main effects of the cache coherency contributing to degradations in system performance.

109 citations


Cited by
More filters
Proceedings ArticleDOI
01 May 1995
TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which re redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

4,002 citations

Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, mulhcache consistency, the effect of input /output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: design issues. Specific aspects of cache memories tha t are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, mulhcache consistency, the effect of input /output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations

Book
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions. * synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development * presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs * describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies * illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs Table of Contents 1 Introduction 2 Parallel Programs 3 Programming for Performance 4 Workload-Driven Evaluation 5 Shared Memory Multiprocessors 6 Snoop-based Multiprocessor Design 7 Scalable Multiprocessors 8 Directory-based Cache Coherence 9 Hardware-Software Tradeoffs 10 Interconnection Network Design 11 Latency Tolerance 12 Future Directions APPENDIX A Parallel Benchmark Suites

1,571 citations

Journal ArticleDOI
TL;DR: The properties of direct networks are reviewed, and the operation and characteristics of wormhole routing are discussed in detail, along with a technique that allows multiple virtual channels to share the same physical channel.
Abstract: Several research contributions and commercial ventures related to wormhole routing, a switching technique used in direct networks, are discussed. The properties of direct networks are reviewed, and the operation and characteristics of wormhole routing are discussed in detail. By its nature, wormhole routing is particularly susceptible to deadlock situations, in which two or more packets may block one another indefinitely. Several approaches to deadlock-free. routing, along with a technique that allows multiple virtual channels to share the same physical channel, are described. In addition, several open issues related to wormhole routing are discussed. >

1,307 citations