Author
Lawrence Snyder
Bio: Lawrence Snyder is an academic researcher from University of Washington. The author has contributed to research in topics: Parallel programming model & Compiler. The author has an hindex of 22, co-authored 51 publications receiving 1428 citations.
Papers published on a yearly basis
Papers
More filters
••
01 Apr 1991TL;DR: This paper presents the complete design of the router together with (simulated) performance figures, and shows that, the Chaos t-outer is competitive with the simple and fast obli~ious routers for random loads and greatly superior for loads with hot spots.
Abstract: The Chaos router is an adaptive, randomized message router for multlicomputers. Aclaptive routers are superior to oblivious routers, the state-of-the-art, because they can by-pass congestion and faults. unlike other adaptive routers, however, the Chaos router has reduced the complexity along the critical path of the routing decision by using randomization to eliminate livelock protection, The foundational theory for Chaotic routing, proving that, this approach is sound, has been previously de~-eloped [1 1]. In this paper we present, the complete design of the router together with (simulated) performance figures. The results show that, the Chaos t-outer is competitive with the simple and fast obli~ious routers for random loads and greatly superior for loads with hot spots.
121 citations
••
01 Jul 1998TL;DR: ZPL is a high level language that offers competitive performance and portability, as well as programming conveniences lacking in low level approaches, and simplifies the task of programming for parallel computers-without sacrificing efficiency.
Abstract: Message passing programs are efficient, but fall short on convenience and portability. ZPL is a high level language that offers competitive performance and portability, as well as programming conveniences lacking in low level approaches. ZPL runs on a variety of parallel and sequential computers. We describe the problems with message passing and describe how ZPL simplifies the task of programming for parallel computers-without sacrificing efficiency.
84 citations
••
TL;DR: ZPL is described and ZPt's machine-independent performance model is described, the programming benefits of ZPL's region-based constructs are described, and the compilation benefits of the language's high-level semantics are summarized.
Abstract: The goal of producing architecture-independent parallel programs is complicated by the competing need for high performance. The ZPL programming language achieves both goals by building upon an abstract parallel machine and by providing programming constructs that allow the programmer to "see" this underlying machine. This paper describes ZPL and provides a comprehensive evaluation of the language with respect to its goals of performance, portability, and programming convenience. In particular, we describe ZPt's machine-independent performance model, describe the programming benefits of ZPL's region-based constructs, summarize the compilation benefits of the language's high-level semantics, and summarize empirical evidence that ZPL has achieved both high performance and portability on diverse machines such as the IBM SP-2, Cray T3E, and SGI Power Challenge.
81 citations
••
[...]
TL;DR: The Chaos router, a randomizing, nonminimal adaptive packet router is introduced, it is shown to be deadlock free and probabilistically livelock free, and performance results are presented for a variety of work loads.
Abstract: The Chaos router, a randomizing, nonminimal adaptive packet router is introduced. Adaptive routers allow messages to dynamically select paths, depending on network traffic, and bypass congested nodes. This flexibility contrasts with oblivious packet routers where the path of a packet is statically determined at the source node. A key advancement of the Chaos router over previous nonminimal routers is the use of randomization to eliminate the need for livelock protection. This simplifies adaptive routing to be of approximately the same complexity along the critical decision path as an oblivious router. The primary cost is that the Chaos router is probabilistically livelock free rather than being deterministically livelock free, but evidence is presented implying that these are equivalent in practice. The principal advantage is excellent performance for nonuniform traffic patterns. The Chaos router is described, it is shown to be deadlock free and probabilistically livelock free, and performance results are presented for a variety of work loads. >
70 citations
••
01 May 1990TL;DR: In this paper the router is described, it is argued that the chaos router is deadlock free and probabilistically live-lock and starvation free, and simulation results are presented showing that the Chaos Router performs well.
Abstract: We present the chaos router, an asynchronous adaptive router, which under certain circumstances can send messages farther from their destinations. The chaos router greatly simplifies the routing logic by removing the livelock protection of previous schemes. Through an effective use of randomness, whose sources include that due to the adaptively processed load, the natural timing differences of selftimed circuitry and explicitly injected randomization, the chaos router avoids long message routes with high probability. In this paper the router is described, it is argued that the chaos router is deadlock free and probabilistically live-lock and starvation free, and simulation results are presented showing that the chaos router performs well.
62 citations
Cited by
More filters
•
01 Jan 2004
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
Abstract: One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years.
This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't.
·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network.
·A set of examples and exercises in every chapter help the reader to fully understand all the implications of every design decision.
Table of Contents
Chapter 1 Introduction to Interconnection Networks
1.1 Three Questions About Interconnection Networks
1.2 Uses of Interconnection Networks
1.3 Network Basics
1.4 History
1.5 Organization of this Book
Chapter 2 A Simple Interconnection Network
2.1 Network Specifications and Constraints
2.2 Topology
2.3 Routing
2.4 Flow Control
2.5 Router Design
2.6 Performance Analysis
2.7 Exercises
Chapter 3 Topology Basics
3.1 Nomenclature
3.2 Traffic Patterns
3.3 Performance
3.4 Packaging Cost
3.5 Case Study: The SGI Origin 2000
3.6 Bibliographic Notes
3.7 Exercises
Chapter 4 Butterfly Networks
4.1 The Structure of Butterfly Networks
4.2 Isomorphic Butterflies
4.3 Performance and Packaging Cost
4.4 Path Diversity and Extra Stages
4.5 Case Study: The BBN Butterfly
4.6 Bibliographic Notes
4.7 Exercises
Chapter 5 Torus Networks
5.1 The Structure of Torus Networks
5.2 Performance
5.3 Building Mesh and Torus Networks
5.4 Express Cubes
5.5 Case Study: The MIT J-Machine
5.6 Bibliographic Notes
5.7 Exercises
Chapter 6 Non-Blocking Networks
6.1 Non-Blocking vs. Non-Interfering Networks
6.2 Crossbar Networks
6.3 Clos Networks
6.4 Benes Networks
6.5 Sorting Networks
6.6 Case Study: The Velio VC2002 (Zeus) Grooming Switch
6.7 Bibliographic Notes
6.8 Exercises
Chapter 7 Slicing and Dicing
7.1 Concentrators and Distributors
7.2 Slicing and Dicing
7.3 Slicing Multistage Networks
7.4 Case Study: Bit Slicing in the Tiny Tera
7.5 Bibliographic Notes
7.6 Exercises
Chapter 8 Routing Basics
8.1 A Routing Example
8.2 Taxonomy of Routing Algorithms
8.3 The Routing Relation
8.4 Deterministic Routing
8.5 Case Study: Dimension-Order Routing in the Cray T3D
8.6 Bibliographic Notes
8.7 Exercises
Chapter 9 Oblivious Routing
9.1 Valiant's Randomized Routing Algorithm
9.2 Minimal Oblivious Routing
9.3 Load-Balanced Oblivious Routing
9.4 Analysis of Oblivious Routing
9.5 Case Study: Oblivious Routing in the
Avici Terabit Switch Router(TSR)
9.6 Bibliographic Notes
9.7 Exercises
Chapter 10 Adaptive Routing
10.1 Adaptive Routing Basics
10.2 Minimal Adaptive Routing
10.3 Fully Adaptive Routing
10.4 Load-Balanced Adaptive Routing
10.5 Search-Based Routing
10.6 Case Study: Adaptive Routing in the
Thinking Machines CM-5
10.7 Bibliographic Notes
10.8 Exercises
Chapter 11 Routing Mechanics
11.1 Table-Based Routing
11.2 Algorithmic Routing
11.3 Case Study: Oblivious Source Routing in the
IBM Vulcan Network
11.4 Bibliographic Notes
11.5 Exercises
Chapter 12 Flow Control Basics
12.1 Resources and Allocation Units
12.2 Bufferless Flow Control
12.3 Circuit Switching
12.4 Bibliographic Notes
12.5 Exercises
Chapter 13 Buffered Flow Control
13.1 Packet-Buffer Flow Control
13.2 Flit-Buffer Flow Control
13.3 Buffer Management and Backpressure
13.4 Flit-Reservation Flow Control
13.5 Bibliographic Notes
13.6 Exercises
Chapter 14 Deadlock and Livelock
14.1 Deadlock
14.2 Deadlock Avoidance
14.3 Adaptive Routing
14.4 Deadlock Recovery
14.5 Livelock
14.6 Case Study: Deadlock Avoidance in the Cray T3E
14.7 Bibliographic Notes
14.8 Exercises
Chapter 15 Quality of Service
15.1 Service Classes and Service Contracts
15.2 Burstiness and Network Delays
15.3 Implementation of Guaranteed Services
15.4 Implementation of Best-Effort Services
15.5 Separation of Resources
15.6 Case Study: ATM Service Classes
15.7 Case Study: Virtual Networks in the Avici TSR
15.8 Bibliographic Notes
15.9 Exercises
Chapter 16 Router Architecture
16.1 Basic Router Architecture
16.2 Stalls
16.3 Closing the Loop with Credits
16.4 Reallocating a Channel
16.5 Speculation and Lookahead
16.6 Flit and Credit Encoding
16.7 Case Study: The Alpha 21364 Router
16.8 Bibliographic Notes
16.9 Exercises
Chapter 17 Router Datapath Components
17.1 Input Buffer Organization
17.2 Switches
17.3 Output Organization
17.4 Case Study: The Datapath of the IBM Colony
Router
17.5 Bibliographic Notes
17.6 Exercises
Chapter 18 Arbitration
18.1 Arbitration Timing
18.2 Fairness
18.3 Fixed Priority Arbiter
18.4 Variable Priority Iterative Arbiters
18.5 Matrix Arbiter
18.6 Queuing Arbiter
18.7 Exercises
Chapter 19 Allocation
19.1 Representations
19.2 Exact Algorithms
19.3 Separable Allocators
19.4 Wavefront Allocator
19.5 Incremental vs. Batch Allocation
19.6 Multistage Allocation
19.7 Performance of Allocators
19.8 Case Study: The Tiny Tera Allocator
19.9 Bibliographic Notes
19.10 Exercises
Chapter 20 Network Interfaces
20.1 Processor-Network Interface
20.2 Shared-Memory Interface
20.3 Line-Fabric Interface
20.4 Case Study: The MIT M-Machine Network Interface
20.5 Bibliographic Notes
20.6 Exercises
Chapter 21 Error Control 411
21.1 Know Thy Enemy: Failure Modes and Fault Models
21.2 The Error Control Process: Detection, Containment,
and Recovery
21.3 Link Level Error Control
21.4 Router Error Control
21.5 Network-Level Error Control
21.6 End-to-end Error Control
21.7 Bibliographic Notes
21.8 Exercises
Chapter 22 Buses
22.1 Bus Basics
22.2 Bus Arbitration
22.3 High Performance Bus Protocol
22.4 From Buses to Networks
22.5 Case Study: The PCI Bus
22.6 Bibliographic Notes
22.7 Exercises
Chapter 23 Performance Analysis
23.1 Measures of Interconnection Network Performance
23.2 Analysis
23.3 Validation
23.4 Case Study: Efficiency and Loss in the
BBN Monarch Network
23.5 Bibliographic Notes
23.6 Exercises
Chapter 24 Simulation
24.1 Levels of Detail
24.2 Network Workloads
24.3 Simulation Measurements
24.4 Simulator Design
24.5 Bibliographic Notes
24.6 Exercises
Chapter 25 Simulation Examples 495
25.1 Routing
25.2 Flow Control Performance
25.3 Fault Tolerance
Appendix A Nomenclature
Appendix B Glossary
Appendix C Network Simulator
3,233 citations
18 Dec 2006
TL;DR: The parallel landscape is frame with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS each development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. 1 The Landscape of Parallel Computing Research: A View From Berkeley • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.
2,262 citations
••
[...]
TL;DR: The major variations of the B-tree are discussed, especially the B+-tree, contrasting the merits and costs of each implementation and illustrating a general purpose access method that uses a B- tree.
Abstract: B-trees have become, de facto, a standard for file organization. File indexes of users, dedicated database systems, and general-purpose access methods have all been proposed and nnplemented using B-trees This paper reviews B-trees and shows why they have been so successful It discusses the major variations of the B-tree, especially the B+-tree, contrasting the relatwe merits and costs of each implementatmn. It illustrates a general purpose access method whmh uses a B-tree.
2,032 citations
••
TL;DR: AntNet is a distributed, mobile agents based Monte Carlo system that was inspired by recent work on the ant colony metaphor for solving optimization problems, and showed superior performance under all the experimental conditions with respect to its competitors.
Abstract: This paper introduces AntNet, a novel approach to the adaptive learning of routing tables in communications networks. AntNet is a distributed, mobile agents based Monte Carlo system that was inspired by recent work on the ant colony metaphor for solving optimization problems. AntNet's agents concurrently explore the network and exchange collected information. The communication among the agents is indirect and asynchronous, mediated by the network itself. This form of communication is typical of social insects and is called stigmergy. We compare our algorithm with six state-of-the-art routing algorithms coming from the telecommunications and machine learning fields. The algorithms' performance is evaluated over a set of realistic testbeds. We run many experiments over real and artificial IP datagram networks with increasing number of nodes and under several paradigmatic spatial and temporal traffic distributions. Results are very encouraging. AntNet showed superior performance under all the experimental conditions with respect to its competitors. We analyze the main characteristics of the algorithm and try to explain the reasons for its superiority.
1,712 citations
•
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies
* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs
Table of Contents
1 Introduction
2 Parallel Programs
3 Programming for Performance
4 Workload-Driven Evaluation
5 Shared Memory Multiprocessors
6 Snoop-based Multiprocessor Design
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs
10 Interconnection Network Design
11 Latency Tolerance
12 Future Directions
APPENDIX A Parallel Benchmark Suites
1,571 citations