scispace - formally typeset
Search or ask a question
Author

Evan Torrie

Bio: Evan Torrie is an academic researcher from Stanford University. The author has contributed to research in topics: Cache pollution & Cache. The author has an hindex of 3, co-authored 4 publications receiving 3970 citations.

Papers
More filters
Proceedings ArticleDOI
01 May 1995
TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which re redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

4,002 citations

Proceedings ArticleDOI
27 Jun 1995
TL;DR: It is found that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications, and it is shown that when long cache lines are used, many of these applications suffer from false sharing and low cache line utilization.
Abstract: Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from the SPEC, NAS, PERFECT, and RICEPS benchmark suites, we measure statistics such as speedups, memory costs, causes of cache misses, cache line utilization, and data traffic. This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512 byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggestmethods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers, as well as an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.

22 citations

Journal ArticleDOI
TL;DR: It is found that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications, and it is shown that when long cache lines are used, many of these applications suffer from false sharing and low cache line utilization.
Abstract: Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from standard benchmark suites, we measure statistics such as speedups, synchronization and load imbalance, causes of cache misses, cache line utilization, data traffic, and memory costs. This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512 byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers, as well as an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.

20 citations

Journal ArticleDOI
TL;DR: It is found that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications, and it is shown that when long cache lines are used, many of these applications suffer from false sharing and low cache line utilization.
Abstract: Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) impact memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from the SPEC, NAS, PERFECT, and RICEPS benchmark suites, we measure statistics such as speedups, memory costs, causes of cache misses, cache line utilization, and data traffic. This exploration allows us to draw several conclusions. First, we find that larger granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512 byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers, as well as an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.

Cited by
More filters
Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

3,514 citations

Book
01 Jan 2004
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
Abstract: One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years. This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation. ·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't. ·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network. ·A set of examples and exercises in every chapter help the reader to fully understand all the implications of every design decision. Table of Contents Chapter 1 Introduction to Interconnection Networks 1.1 Three Questions About Interconnection Networks 1.2 Uses of Interconnection Networks 1.3 Network Basics 1.4 History 1.5 Organization of this Book Chapter 2 A Simple Interconnection Network 2.1 Network Specifications and Constraints 2.2 Topology 2.3 Routing 2.4 Flow Control 2.5 Router Design 2.6 Performance Analysis 2.7 Exercises Chapter 3 Topology Basics 3.1 Nomenclature 3.2 Traffic Patterns 3.3 Performance 3.4 Packaging Cost 3.5 Case Study: The SGI Origin 2000 3.6 Bibliographic Notes 3.7 Exercises Chapter 4 Butterfly Networks 4.1 The Structure of Butterfly Networks 4.2 Isomorphic Butterflies 4.3 Performance and Packaging Cost 4.4 Path Diversity and Extra Stages 4.5 Case Study: The BBN Butterfly 4.6 Bibliographic Notes 4.7 Exercises Chapter 5 Torus Networks 5.1 The Structure of Torus Networks 5.2 Performance 5.3 Building Mesh and Torus Networks 5.4 Express Cubes 5.5 Case Study: The MIT J-Machine 5.6 Bibliographic Notes 5.7 Exercises Chapter 6 Non-Blocking Networks 6.1 Non-Blocking vs. Non-Interfering Networks 6.2 Crossbar Networks 6.3 Clos Networks 6.4 Benes Networks 6.5 Sorting Networks 6.6 Case Study: The Velio VC2002 (Zeus) Grooming Switch 6.7 Bibliographic Notes 6.8 Exercises Chapter 7 Slicing and Dicing 7.1 Concentrators and Distributors 7.2 Slicing and Dicing 7.3 Slicing Multistage Networks 7.4 Case Study: Bit Slicing in the Tiny Tera 7.5 Bibliographic Notes 7.6 Exercises Chapter 8 Routing Basics 8.1 A Routing Example 8.2 Taxonomy of Routing Algorithms 8.3 The Routing Relation 8.4 Deterministic Routing 8.5 Case Study: Dimension-Order Routing in the Cray T3D 8.6 Bibliographic Notes 8.7 Exercises Chapter 9 Oblivious Routing 9.1 Valiant's Randomized Routing Algorithm 9.2 Minimal Oblivious Routing 9.3 Load-Balanced Oblivious Routing 9.4 Analysis of Oblivious Routing 9.5 Case Study: Oblivious Routing in the Avici Terabit Switch Router(TSR) 9.6 Bibliographic Notes 9.7 Exercises Chapter 10 Adaptive Routing 10.1 Adaptive Routing Basics 10.2 Minimal Adaptive Routing 10.3 Fully Adaptive Routing 10.4 Load-Balanced Adaptive Routing 10.5 Search-Based Routing 10.6 Case Study: Adaptive Routing in the Thinking Machines CM-5 10.7 Bibliographic Notes 10.8 Exercises Chapter 11 Routing Mechanics 11.1 Table-Based Routing 11.2 Algorithmic Routing 11.3 Case Study: Oblivious Source Routing in the IBM Vulcan Network 11.4 Bibliographic Notes 11.5 Exercises Chapter 12 Flow Control Basics 12.1 Resources and Allocation Units 12.2 Bufferless Flow Control 12.3 Circuit Switching 12.4 Bibliographic Notes 12.5 Exercises Chapter 13 Buffered Flow Control 13.1 Packet-Buffer Flow Control 13.2 Flit-Buffer Flow Control 13.3 Buffer Management and Backpressure 13.4 Flit-Reservation Flow Control 13.5 Bibliographic Notes 13.6 Exercises Chapter 14 Deadlock and Livelock 14.1 Deadlock 14.2 Deadlock Avoidance 14.3 Adaptive Routing 14.4 Deadlock Recovery 14.5 Livelock 14.6 Case Study: Deadlock Avoidance in the Cray T3E 14.7 Bibliographic Notes 14.8 Exercises Chapter 15 Quality of Service 15.1 Service Classes and Service Contracts 15.2 Burstiness and Network Delays 15.3 Implementation of Guaranteed Services 15.4 Implementation of Best-Effort Services 15.5 Separation of Resources 15.6 Case Study: ATM Service Classes 15.7 Case Study: Virtual Networks in the Avici TSR 15.8 Bibliographic Notes 15.9 Exercises Chapter 16 Router Architecture 16.1 Basic Router Architecture 16.2 Stalls 16.3 Closing the Loop with Credits 16.4 Reallocating a Channel 16.5 Speculation and Lookahead 16.6 Flit and Credit Encoding 16.7 Case Study: The Alpha 21364 Router 16.8 Bibliographic Notes 16.9 Exercises Chapter 17 Router Datapath Components 17.1 Input Buffer Organization 17.2 Switches 17.3 Output Organization 17.4 Case Study: The Datapath of the IBM Colony Router 17.5 Bibliographic Notes 17.6 Exercises Chapter 18 Arbitration 18.1 Arbitration Timing 18.2 Fairness 18.3 Fixed Priority Arbiter 18.4 Variable Priority Iterative Arbiters 18.5 Matrix Arbiter 18.6 Queuing Arbiter 18.7 Exercises Chapter 19 Allocation 19.1 Representations 19.2 Exact Algorithms 19.3 Separable Allocators 19.4 Wavefront Allocator 19.5 Incremental vs. Batch Allocation 19.6 Multistage Allocation 19.7 Performance of Allocators 19.8 Case Study: The Tiny Tera Allocator 19.9 Bibliographic Notes 19.10 Exercises Chapter 20 Network Interfaces 20.1 Processor-Network Interface 20.2 Shared-Memory Interface 20.3 Line-Fabric Interface 20.4 Case Study: The MIT M-Machine Network Interface 20.5 Bibliographic Notes 20.6 Exercises Chapter 21 Error Control 411 21.1 Know Thy Enemy: Failure Modes and Fault Models 21.2 The Error Control Process: Detection, Containment, and Recovery 21.3 Link Level Error Control 21.4 Router Error Control 21.5 Network-Level Error Control 21.6 End-to-end Error Control 21.7 Bibliographic Notes 21.8 Exercises Chapter 22 Buses 22.1 Bus Basics 22.2 Bus Arbitration 22.3 High Performance Bus Protocol 22.4 From Buses to Networks 22.5 Case Study: The PCI Bus 22.6 Bibliographic Notes 22.7 Exercises Chapter 23 Performance Analysis 23.1 Measures of Interconnection Network Performance 23.2 Analysis 23.3 Validation 23.4 Case Study: Efficiency and Loss in the BBN Monarch Network 23.5 Bibliographic Notes 23.6 Exercises Chapter 24 Simulation 24.1 Levels of Detail 24.2 Network Workloads 24.3 Simulation Measurements 24.4 Simulator Design 24.5 Bibliographic Notes 24.6 Exercises Chapter 25 Simulation Examples 495 25.1 Routing 25.2 Flow Control Performance 25.3 Fault Tolerance Appendix A Nomenclature Appendix B Glossary Appendix C Network Simulator

3,233 citations

Proceedings ArticleDOI
04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

2,697 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taking into account configuring clusters with 4 cores gives thebest EDA2P and EDAP.
Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area2 product (EDA2P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account configuring clusters with 4 cores gives the best EDA2P and EDAP.

2,487 citations

18 Dec 2006
TL;DR: The parallel landscape is frame with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS each development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. 1 The Landscape of Parallel Computing Research: A View From Berkeley • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

2,262 citations