One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years.

This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation.

·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't.

·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network.

·A set of examples and exercises in every chapter helps the reader to fully understand the implications of every design decision.

Table of Contents


Chapter 1 Introduction to Interconnection Networks 
1.1 Three Questions About Interconnection Networks 
1.2 Uses of Interconnection Networks 
1.3 Network Basics 
1.4 History 
1.5 Organization of this Book 

Chapter 2 A Simple Interconnection Network 
2.1 Network Specifications and Constraints 
2.2 Topology 
2.3 Routing 
2.4 Flow Control 
2.5 Router Design 
2.6 Performance Analysis 
2.7 Exercises 

Chapter 3 Topology Basics 
3.1 Nomenclature 
3.2 Traffic Patterns 
3.3 Performance 
3.4 Packaging Cost 
3.5 Case Study: The SGI Origin 2000 
3.6 Bibliographic Notes 
3.7 Exercises 

Chapter 4 Butterfly Networks 
4.1 The Structure of Butterfly Networks 
4.2 Isomorphic Butterflies 
4.3 Performance and Packaging Cost 
4.4 Path Diversity and Extra Stages 
4.5 Case Study: The BBN Butterfly 
4.6 Bibliographic Notes 
4.7 Exercises 

Chapter 5 Torus Networks 
5.1 The Structure of Torus Networks 
5.2 Performance 
5.3 Building Mesh and Torus Networks 
5.4 Express Cubes 
5.5 Case Study: The MIT J-Machine 
5.6 Bibliographic Notes 
5.7 Exercises 

Chapter 6 Non-Blocking Networks 
6.1 Non-Blocking vs. Non-Interfering Networks 
6.2 Crossbar Networks 
6.3 Clos Networks 
6.4 Benes Networks 
6.5 Sorting Networks 
6.6 Case Study: The Velio VC2002 (Zeus) Grooming Switch 
6.7 Bibliographic Notes 
6.8 Exercises 

Chapter 7 Slicing and Dicing 
7.1 Concentrators and Distributors 
7.2 Slicing and Dicing 
7.3 Slicing Multistage Networks 
7.4 Case Study: Bit Slicing in the Tiny Tera 
7.5 Bibliographic Notes 
7.6 Exercises 

Chapter 8 Routing Basics 
8.1 A Routing Example 
8.2 Taxonomy of Routing Algorithms 
8.3 The Routing Relation 
8.4 Deterministic Routing 
8.5 Case Study: Dimension-Order Routing in the Cray T3D 
8.6 Bibliographic Notes 
8.7 Exercises 

Chapter 9 Oblivious Routing 
9.1 Valiant's Randomized Routing Algorithm 
9.2 Minimal Oblivious Routing 
9.3 Load-Balanced Oblivious Routing 
9.4 Analysis of Oblivious Routing 
9.5 Case Study: Oblivious Routing in the Avici Terabit Switch Router (TSR) 
9.6 Bibliographic Notes 
9.7 Exercises 

Chapter 10 Adaptive Routing 
10.1 Adaptive Routing Basics 
10.2 Minimal Adaptive Routing 
10.3 Fully Adaptive Routing 
10.4 Load-Balanced Adaptive Routing 
10.5 Search-Based Routing 
10.6 Case Study: Adaptive Routing in the Thinking Machines CM-5 
10.7 Bibliographic Notes 
10.8 Exercises 

Chapter 11 Routing Mechanics 
11.1 Table-Based Routing 
11.2 Algorithmic Routing 
11.3 Case Study: Oblivious Source Routing in the IBM Vulcan Network 
11.4 Bibliographic Notes 
11.5 Exercises 

Chapter 12 Flow Control Basics 
12.1 Resources and Allocation Units 
12.2 Bufferless Flow Control 
12.3 Circuit Switching 
12.4 Bibliographic Notes 
12.5 Exercises 

Chapter 13 Buffered Flow Control 
13.1 Packet-Buffer Flow Control 
13.2 Flit-Buffer Flow Control 
13.3 Buffer Management and Backpressure 
13.4 Flit-Reservation Flow Control 
13.5 Bibliographic Notes 
13.6 Exercises 

Chapter 14 Deadlock and Livelock 
14.1 Deadlock 
14.2 Deadlock Avoidance 
14.3 Adaptive Routing 
14.4 Deadlock Recovery 
14.5 Livelock 
14.6 Case Study: Deadlock Avoidance in the Cray T3E 
14.7 Bibliographic Notes 
14.8 Exercises 

Chapter 15 Quality of Service 
15.1 Service Classes and Service Contracts 
15.2 Burstiness and Network Delays 
15.3 Implementation of Guaranteed Services 
15.4 Implementation of Best-Effort Services 
15.5 Separation of Resources 
15.6 Case Study: ATM Service Classes 
15.7 Case Study: Virtual Networks in the Avici TSR 
15.8 Bibliographic Notes 
15.9 Exercises 

Chapter 16 Router Architecture 
16.1 Basic Router Architecture 
16.2 Stalls 
16.3 Closing the Loop with Credits 
16.4 Reallocating a Channel 
16.5 Speculation and Lookahead 
16.6 Flit and Credit Encoding 
16.7 Case Study: The Alpha 21364 Router 
16.8 Bibliographic Notes 
16.9 Exercises 

Chapter 17 Router Datapath Components 
17.1 Input Buffer Organization 
17.2 Switches 
17.3 Output Organization 
17.4 Case Study: The Datapath of the IBM Colony Router 
17.5 Bibliographic Notes 
17.6 Exercises 

Chapter 18 Arbitration 
18.1 Arbitration Timing 
18.2 Fairness 
18.3 Fixed Priority Arbiter 
18.4 Variable Priority Iterative Arbiters 
18.5 Matrix Arbiter 
18.6 Queuing Arbiter 
18.7 Exercises 

Chapter 19 Allocation 
19.1 Representations
19.2 Exact Algorithms
19.3 Separable Allocators 
19.4 Wavefront Allocator 
19.5 Incremental vs. Batch Allocation 
19.6 Multistage Allocation 
19.7 Performance of Allocators 
19.8 Case Study: The Tiny Tera Allocator 
19.9 Bibliographic Notes 
19.10 Exercises

Chapter 20 Network Interfaces 
20.1 Processor-Network Interface 
20.2 Shared-Memory Interface 
20.3 Line-Fabric Interface 
20.4 Case Study: The MIT M-Machine Network Interface 
20.5 Bibliographic Notes 
20.6 Exercises 

Chapter 21 Error Control 
21.1 Know Thy Enemy: Failure Modes and Fault Models 
21.2 The Error Control Process: Detection, Containment, and Recovery 
21.3 Link Level Error Control 
21.4 Router Error Control 
21.5 Network-Level Error Control 
21.6 End-to-end Error Control 
21.7 Bibliographic Notes 
21.8 Exercises 

Chapter 22 Buses 
22.1 Bus Basics 
22.2 Bus Arbitration 
22.3 High Performance Bus Protocol 
22.4 From Buses to Networks 
22.5 Case Study: The PCI Bus 
22.6 Bibliographic Notes 
22.7 Exercises 

Chapter 23 Performance Analysis 
23.1 Measures of Interconnection Network Performance 
23.2 Analysis 
23.3 Validation
23.4 Case Study: Efficiency and Loss in the BBN Monarch Network 
23.5 Bibliographic Notes 
23.6 Exercises 

Chapter 24 Simulation 
24.1 Levels of Detail 
24.2 Network Workloads 
24.3 Simulation Measurements 
24.4 Simulator Design 
24.5 Bibliographic Notes 
24.6 Exercises 

Chapter 25 Simulation Examples 
25.1 Routing
25.2 Flow Control Performance 
25.3 Fault Tolerance 

Appendix A Nomenclature 
Appendix B Glossary 
Appendix C Network Simulator

Principles and Practices of Interconnection Networks

Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

Efficient Processing of Deep Neural Networks: A Tutorial and Survey

Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW

Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following:
• The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
• The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
• Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.)
• “Autotuners” should play a larger role than conventional compilers in translating parallel programs.
• To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
• To be successful, programming models should be independent of the number of processors.
• To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
• Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
• Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.
• To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost.
Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

The Landscape of Parallel Computing Research: A View from Berkeley

Theano is a Python library that allows users to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers, especially in the machine learning community, and has shown steady performance improvements. Theano has been actively and continuously developed since 2008; multiple frameworks have been built on top of it, and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.

Theano: A Python framework for fast computation of mathematical expressions

We examine the current performance and future demands of interconnects to and on silicon chips. We compare electrical and optical interconnects and project the requirements for optoelectronic and optical devices if optics is to solve the major problems of interconnects for future high-performance silicon chips. Optics has potential benefits in interconnect density, energy, and timing. The necessity of low interconnect energy imposes low limits especially on the energy of the optical output devices, with a ~ 10 fJ/bit device energy target emerging. Some optical modulators and radical laser approaches may meet this requirement. Low (e.g., a few femtofarads or less) photodetector capacitance is important. Very compact wavelength splitters are essential for connecting the information to fibers. Dense waveguides are necessary on-chip or on boards for guided wave optical approaches, especially if very high clock rates or dense wavelength-division multiplexing (WDM) is to be avoided. Free-space optics potentially can handle the necessary bandwidths even without fast clocks or WDM. With such technology, however, optics may enable the continued scaling of interconnect capacity required by future chips.

Device Requirements for Optical Interconnects to Silicon Chips

This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latching-window masking, which inhibit soft errors in combinational logic. We quantify the SER due to high-energy neutrons in SRAM cells, latches, and logic circuits for feature sizes from 600 nm to 50 nm and clock periods from 16 to 6 fan-out-of-4 inverter delays. Our model predicts that the SER per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011 and at that point will be comparable to the SER per chip of unprotected memory elements. Our result emphasizes that computer system designers must address the risks of soft errors in logic circuits for future designs.

Modeling the effect of technology trends on the soft error rate of combinational logic

Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs, especially in mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to a multiplier array, where they are extensively reused; product accumulation is performed in a novel accumulator array. On contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.
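The idea of skipping work on zero-valued weights and activations can be illustrated with a toy example. The sketch below is only a one-dimensional illustration under simplifying assumptions, not the SCNN dataflow or its hardware encoding: the function names (`compress`, `sparse_conv1d`, `dense_conv1d`) and the (index, value) pair format are hypothetical choices made for this example.

```python
def compress(values):
    """Keep only the non-zero entries as (index, value) pairs (hypothetical format)."""
    return [(i, v) for i, v in enumerate(values) if v != 0]


def sparse_conv1d(activations, weights):
    """1-D 'valid' convolution that multiplies only non-zero operand pairs.

    Every non-zero weight meets every non-zero activation once, and each
    product is scattered to the output position it contributes to.
    Returns the output list and the number of multiplications performed.
    """
    out_len = len(activations) - len(weights) + 1
    out = [0] * out_len
    nz_weights = compress(weights)
    nz_activations = compress(activations)
    multiplies = 0
    for k, w in nz_weights:
        for i, a in nz_activations:
            multiplies += 1
            o = i - k  # output index that the product w * a belongs to
            if 0 <= o < out_len:
                out[o] += w * a
    return out, multiplies


def dense_conv1d(activations, weights):
    """Reference dense computation: out[o] = sum_k weights[k] * activations[o + k]."""
    out_len = len(activations) - len(weights) + 1
    return [
        sum(weights[k] * activations[o + k] for k in range(len(weights)))
        for o in range(out_len)
    ]


activations = [0, 2, 0, 0, 3, 0]   # zeros as a ReLU might produce
weights = [1, 0, -1]               # zeros as pruning might produce
sparse_out, sparse_multiplies = sparse_conv1d(activations, weights)
dense_out = dense_conv1d(activations, weights)
dense_multiplies = len(dense_out) * len(weights)
```

In this small case the sparse version performs 4 multiplications where the dense reference performs 12, and both produce the same output. Some products fall outside the valid output range and are discarded, which is a cost of this simple scatter approach.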

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy--while using less silicon area--by 13%, and comes within 13% of an ideal minimal hit latency solution.
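The notion that hit time becomes a function of a line's physical location can be made concrete with a toy latency model. The sketch below rests on assumptions that do not come from the paper: the bank grid, the processor placed at corner (0, 0), and the cycle counts `bank_cycles` and `hop_cycles` are all hypothetical values chosen only to show the shape of the effect.

```python
def bank_latency(row, col, bank_cycles=3, hop_cycles=1):
    """Access latency of the bank at (row, col), in cycles.

    Assumes the processor sits at grid position (0, 0) and that each hop
    along a row or column adds a fixed wire delay (hypothetical model).
    """
    return bank_cycles + hop_cycles * (row + col)


def latency_map(rows, cols, bank_cycles=3, hop_cycles=1):
    """Latency of every bank in a rows x cols grid of cache banks."""
    return [
        [bank_latency(r, c, bank_cycles, hop_cycles) for c in range(cols)]
        for r in range(rows)
    ]


grid = latency_map(4, 4)
nearest = grid[0][0]     # bank adjacent to the processor
farthest = grid[3][3]    # bank in the opposite corner

# A uniform-latency cache must assume the slowest bank for every access.
uniform_latency = max(max(row) for row in grid)

# Mean latency if accesses were spread evenly over all banks.
mean_latency = sum(sum(row) for row in grid) / 16
```

Under these assumed numbers the nearest bank is three times faster than the farthest, which is why policies that migrate frequently used lines toward the processor can lower the average hit time below the uniform worst case.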

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. In this paper, we describe technology-driven models for wire capacitance, wire delay, and microarchitectural component delay. Using the results of these models, we measure the simulated performance (estimating both clock rate and IPC) of an aggressive out-of-order microarchitecture as it is scaled from a 250nm technology to a 35nm technology. We perform this analysis for three clock scaling targets and two microarchitecture scaling strategies: pipeline scaling and capacity scaling. We find that no scaling strategy permits annual performance improvements of better than 12.5%, which is far worse than the annual 50-60% to which we have grown accustomed.
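To see how far apart the quoted annual rates are, it helps to compound them. The sketch below is plain compound-growth arithmetic applied to the 12.5% and 50-60% figures from the abstract; the function names and the ten-year horizon are choices made for this illustration and do not come from the paper.

```python
import math


def doubling_time_years(annual_rate):
    """Years needed for performance to double at a fixed annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


def growth_over(years, annual_rate):
    """Total performance multiple after compounding for the given number of years."""
    return (1 + annual_rate) ** years


slow_rate = 0.125              # upper bound reported in the abstract
fast_rates = (0.50, 0.60)      # historical range quoted in the abstract

slow_doubling = doubling_time_years(slow_rate)
fast_doublings = [doubling_time_years(r) for r in fast_rates]

# Illustrative ten-year horizon (an assumption of this example).
slow_decade = growth_over(10, slow_rate)
fast_decade = [growth_over(10, r) for r in fast_rates]
```

At 12.5% per year performance doubles roughly every 5.9 years, compared with about 1.5 to 1.7 years at 50-60% per year.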

Clock rate versus IPC: the end of the road for conventional microarchitectures

This article discusses the capabilities of state-of-the-art GPU-based high-throughput computing systems and considers the challenges to scaling single-chip parallel-computing systems, highlighting high-impact areas that the computing research community can address. Nvidia Research is investigating an architecture for a heterogeneous high-performance computing system that seeks to address these challenges.

Stephen W. Keckler

Papers

Modeling the effect of technology trends on the soft error rate of combinational logic

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Clock rate versus IPC: the end of the road for conventional microarchitectures

GPUs and the Future of Parallel Computing