Author
Richard Lethin
Other affiliations: Massachusetts Institute of Technology, Fujitsu
Bio: Richard Lethin is an academic researcher from Yale University. The author has contributed to research in topics: Compiler & Tensor. The author has an hindex of 19, co-authored 108 publications receiving 1969 citations. Previous affiliations of Richard Lethin include Massachusetts Institute of Technology & Fujitsu.
Topics: Compiler, Tensor, Data structure, Source code, Code generation
Papers published on a yearly basis
Papers
More filters
•
21 Oct 1998
TL;DR: In this paper, an optimizing object code translation system and method perform dynamic compilation and translation of a target object code on a source operating system while performing optimization, which is dynamically executed in real time.
Abstract: An optimizing object code translation system and method perform dynamic compilation and translation of a target object code on a source operating system while performing optimization. Compilation and optimization of the target code is dynamically executed in real time. A compiler performs analysis and optimizations that improve emulation relative to template-based translation and interpretation such that a host processor which processes larger order instructions, such as 32-bit instructions, may emulate a target processor which processes smaller order instructions, such as 16-bit and 8-bit instructions. The optimizing object code translator does not require knowledge of a static program flow graph or memory locations of target instructions prior to run time. In addition, the optimizing object code translator does not require knowledge of the location of all join points into the target object code prior to execution. During program execution, a translator records branch operations. The logging of information identifies instructions and instruction join points. When a number of times a branch operation is executed exceeds a threshold, the destination of the branch becomes a seed for compilation and code portions between seeds are defined as segments. A segment may be incomplete allowing for modification or replacement to account for a new flow of program control during real time program execution.
463 citations
••
TL;DR: The message-driven processor (MDP), a 36-b, 1.1-million transistor, VLSI microcomputer, specialized to operate efficiently in a multicomputer, is described and incorporates primitive mechanisms for communication, synchronization, and naming which support most proposed parallel programming models.
Abstract: The message-driven processor (MDP), a 36-b, 1.1-million transistor, VLSI microcomputer, specialized to operate efficiently in a multicomputer, is described. The MDP chip includes a processor, a 4096-word by 36-b memory, and a network port. An on-chip memory controller with error checking and correction (ECC) permits local memory to be expanded to one million words by adding external DRAM chips. The MDP incorporates primitive mechanisms for communication, synchronization, and naming which support most proposed parallel programming models. The MDP system architecture, instruction set architecture, network architecture, implementation, and software are discussed. >
221 citations
•
08 Mar 2010
TL;DR: In this paper, two custom computing apparatuses are used to resolve the satisfiability of a logical formula and provide an example, for the sole purpose of complying with the Abstract requirement rules, with the explicit understanding that it will not be used to interpret or limit the scope or the meaning of the claims.
Abstract: Methods, apparatus, and computer software product for making a decision based on the semantics of formal logic are provided. In an exemplary embodiment, two custom computing apparatuses are used to resolve the satisfiability of a logical formula and provide an example. In this embodiment, the two custom computing apparatuses operate in concert to explore the space of possible satisfying examples. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
126 citations
••
10 Feb 2014
TL;DR: Drawing from reports and more recent experience, this ASCAC subcommittee has identified the top ten computing technology advancements that are critical to making a capable, economically viable, exascale system.
Abstract: Exascale computing systems are essential for the scientific fields that will transform the 21st century global economy, including energy, biotechnology, nanotechnology, and materials science. Progress in these fields is predicated on the ability to perform advanced scientific and engineering simulations, and analyze the deluge of data. On July 29, 2013, ASCAC was charged by Patricia Dehmer, the Acting Director of the Office of Science, to assemble a subcommittee to provide advice on exascale computing. This subcommittee was directed to return a list of no more than ten technical approaches (hardware and software) that will enable the development of a system that achieves the Department's goals for exascale computing. Numerous reports over the past few years have documented the technical challenges and the non¬-viability of simply scaling existing computer designs to reach exascale. The technical challenges revolve around energy consumption, memory performance, resilience, extreme concurrency, and big data. Drawing from these reports and more recent experience, this ASCAC subcommittee has identified the top ten computing technology advancements that are critical to making a capable, economically viable, exascale system.
105 citations
••
23 Feb 2013TL;DR: An initial evaluation of Runnemede is presented that shows the design process for the on-chip network, demonstrates 2-4x improvements in memory energy from explicit control of on- chip memory, and illustrates the impact of hardware-software co-design on the energy consumption of a synthetic aperture radar algorithm on the architecture.
Abstract: DARPA's Ubiquitous High-Performance Computing (UHPC) program asked researchers to develop computing systems capable of achieving energy efficiencies of 50 GOPS/Watt, assuming 2018-era fabrication technologies. This paper describes Runnemede, the research architecture developed by the Intel-led UHPC team. Runnemede is being developed through a co-design process that considers the hardware, the runtime/OS, and applications simultaneously. Near-threshold voltage operation, fine-grained power and clock management, and separate execution units for runtime and application code are used to reduce energy consumption. Memory energy is minimized through application-managed on-chip memory and direct physical addressing. A hierarchical on-chip network reduces communication energy, and a codelet-based execution model supports extreme parallelism and fine-grained tasks. We present an initial evaluation of Runnemede that shows the design process for our on-chip network, demonstrates 2-4x improvements in memory energy from explicit control of on-chip memory, and illustrates the impact of hardware-software co-design on the energy consumption of a synthetic aperture radar algorithm on our architecture.
101 citations
Cited by
More filters
•
01 Jan 2004
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
Abstract: One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years.
This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't.
·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network.
·A set of examples and exercises in every chapter help the reader to fully understand all the implications of every design decision.
Table of Contents
Chapter 1 Introduction to Interconnection Networks
1.1 Three Questions About Interconnection Networks
1.2 Uses of Interconnection Networks
1.3 Network Basics
1.4 History
1.5 Organization of this Book
Chapter 2 A Simple Interconnection Network
2.1 Network Specifications and Constraints
2.2 Topology
2.3 Routing
2.4 Flow Control
2.5 Router Design
2.6 Performance Analysis
2.7 Exercises
Chapter 3 Topology Basics
3.1 Nomenclature
3.2 Traffic Patterns
3.3 Performance
3.4 Packaging Cost
3.5 Case Study: The SGI Origin 2000
3.6 Bibliographic Notes
3.7 Exercises
Chapter 4 Butterfly Networks
4.1 The Structure of Butterfly Networks
4.2 Isomorphic Butterflies
4.3 Performance and Packaging Cost
4.4 Path Diversity and Extra Stages
4.5 Case Study: The BBN Butterfly
4.6 Bibliographic Notes
4.7 Exercises
Chapter 5 Torus Networks
5.1 The Structure of Torus Networks
5.2 Performance
5.3 Building Mesh and Torus Networks
5.4 Express Cubes
5.5 Case Study: The MIT J-Machine
5.6 Bibliographic Notes
5.7 Exercises
Chapter 6 Non-Blocking Networks
6.1 Non-Blocking vs. Non-Interfering Networks
6.2 Crossbar Networks
6.3 Clos Networks
6.4 Benes Networks
6.5 Sorting Networks
6.6 Case Study: The Velio VC2002 (Zeus) Grooming Switch
6.7 Bibliographic Notes
6.8 Exercises
Chapter 7 Slicing and Dicing
7.1 Concentrators and Distributors
7.2 Slicing and Dicing
7.3 Slicing Multistage Networks
7.4 Case Study: Bit Slicing in the Tiny Tera
7.5 Bibliographic Notes
7.6 Exercises
Chapter 8 Routing Basics
8.1 A Routing Example
8.2 Taxonomy of Routing Algorithms
8.3 The Routing Relation
8.4 Deterministic Routing
8.5 Case Study: Dimension-Order Routing in the Cray T3D
8.6 Bibliographic Notes
8.7 Exercises
Chapter 9 Oblivious Routing
9.1 Valiant's Randomized Routing Algorithm
9.2 Minimal Oblivious Routing
9.3 Load-Balanced Oblivious Routing
9.4 Analysis of Oblivious Routing
9.5 Case Study: Oblivious Routing in the
Avici Terabit Switch Router(TSR)
9.6 Bibliographic Notes
9.7 Exercises
Chapter 10 Adaptive Routing
10.1 Adaptive Routing Basics
10.2 Minimal Adaptive Routing
10.3 Fully Adaptive Routing
10.4 Load-Balanced Adaptive Routing
10.5 Search-Based Routing
10.6 Case Study: Adaptive Routing in the
Thinking Machines CM-5
10.7 Bibliographic Notes
10.8 Exercises
Chapter 11 Routing Mechanics
11.1 Table-Based Routing
11.2 Algorithmic Routing
11.3 Case Study: Oblivious Source Routing in the
IBM Vulcan Network
11.4 Bibliographic Notes
11.5 Exercises
Chapter 12 Flow Control Basics
12.1 Resources and Allocation Units
12.2 Bufferless Flow Control
12.3 Circuit Switching
12.4 Bibliographic Notes
12.5 Exercises
Chapter 13 Buffered Flow Control
13.1 Packet-Buffer Flow Control
13.2 Flit-Buffer Flow Control
13.3 Buffer Management and Backpressure
13.4 Flit-Reservation Flow Control
13.5 Bibliographic Notes
13.6 Exercises
Chapter 14 Deadlock and Livelock
14.1 Deadlock
14.2 Deadlock Avoidance
14.3 Adaptive Routing
14.4 Deadlock Recovery
14.5 Livelock
14.6 Case Study: Deadlock Avoidance in the Cray T3E
14.7 Bibliographic Notes
14.8 Exercises
Chapter 15 Quality of Service
15.1 Service Classes and Service Contracts
15.2 Burstiness and Network Delays
15.3 Implementation of Guaranteed Services
15.4 Implementation of Best-Effort Services
15.5 Separation of Resources
15.6 Case Study: ATM Service Classes
15.7 Case Study: Virtual Networks in the Avici TSR
15.8 Bibliographic Notes
15.9 Exercises
Chapter 16 Router Architecture
16.1 Basic Router Architecture
16.2 Stalls
16.3 Closing the Loop with Credits
16.4 Reallocating a Channel
16.5 Speculation and Lookahead
16.6 Flit and Credit Encoding
16.7 Case Study: The Alpha 21364 Router
16.8 Bibliographic Notes
16.9 Exercises
Chapter 17 Router Datapath Components
17.1 Input Buffer Organization
17.2 Switches
17.3 Output Organization
17.4 Case Study: The Datapath of the IBM Colony
Router
17.5 Bibliographic Notes
17.6 Exercises
Chapter 18 Arbitration
18.1 Arbitration Timing
18.2 Fairness
18.3 Fixed Priority Arbiter
18.4 Variable Priority Iterative Arbiters
18.5 Matrix Arbiter
18.6 Queuing Arbiter
18.7 Exercises
Chapter 19 Allocation
19.1 Representations
19.2 Exact Algorithms
19.3 Separable Allocators
19.4 Wavefront Allocator
19.5 Incremental vs. Batch Allocation
19.6 Multistage Allocation
19.7 Performance of Allocators
19.8 Case Study: The Tiny Tera Allocator
19.9 Bibliographic Notes
19.10 Exercises
Chapter 20 Network Interfaces
20.1 Processor-Network Interface
20.2 Shared-Memory Interface
20.3 Line-Fabric Interface
20.4 Case Study: The MIT M-Machine Network Interface
20.5 Bibliographic Notes
20.6 Exercises
Chapter 21 Error Control 411
21.1 Know Thy Enemy: Failure Modes and Fault Models
21.2 The Error Control Process: Detection, Containment,
and Recovery
21.3 Link Level Error Control
21.4 Router Error Control
21.5 Network-Level Error Control
21.6 End-to-end Error Control
21.7 Bibliographic Notes
21.8 Exercises
Chapter 22 Buses
22.1 Bus Basics
22.2 Bus Arbitration
22.3 High Performance Bus Protocol
22.4 From Buses to Networks
22.5 Case Study: The PCI Bus
22.6 Bibliographic Notes
22.7 Exercises
Chapter 23 Performance Analysis
23.1 Measures of Interconnection Network Performance
23.2 Analysis
23.3 Validation
23.4 Case Study: Efficiency and Loss in the
BBN Monarch Network
23.5 Bibliographic Notes
23.6 Exercises
Chapter 24 Simulation
24.1 Levels of Detail
24.2 Network Workloads
24.3 Simulation Measurements
24.4 Simulator Design
24.5 Bibliographic Notes
24.6 Exercises
Chapter 25 Simulation Examples 495
25.1 Routing
25.2 Flow Control Performance
25.3 Fault Tolerance
Appendix A Nomenclature
Appendix B Glossary
Appendix C Network Simulator
3,233 citations
•
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies
* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs
Table of Contents
1 Introduction
2 Parallel Programs
3 Programming for Performance
4 Workload-Driven Evaluation
5 Shared Memory Multiprocessors
6 Snoop-based Multiprocessor Design
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs
10 Interconnection Network Design
11 Latency Tolerance
12 Future Directions
APPENDIX A Parallel Benchmark Suites
1,571 citations