Garp: a MIPS processor with a reconfigurable coprocessor

doi:10.1109/FPGA.1997.624600

Home
/
Papers
/
Garp: a MIPS processor with a reconfigurable coprocessor

Proceedings Article•DOI•

Garp: a MIPS processor with a reconfigurable coprocessor

Jay Hauser¹, John Wawrzynek¹•Institutions (1)

University of California, Berkeley¹

16 Apr 1997-pp 12-21

TL;DR: Novel aspects of the Garp Architecture are presented, as well as a prototype software environment and preliminary performance results, which suggest that a Garp of similar technology could achieve speedups ranging from a factor of 2 to as high as a factors of 24 for some useful applications.

read less

Abstract: Typical reconfigurable machines exhibit shortcomings that make them less than ideal for general-purpose computing. The Garp Architecture combines reconfigurable hardware with a standard MIPS processor on the same die to retain the better features of both. Novel aspects of the architecture are presented, as well as a prototype software environment and preliminary performance results. Compared to an UltraSPARC, a Garp of similar technology could achieve speedups ranging from a factor of 2 to as high as a factor of 24 for some useful applications.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

The Landscape of Parallel Computing Research: A View from Berkeley

[...]

Krste Asanovic, Ras Bodik, Bryan Catanzaro, Joseph Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Plishker, John Shalf, Samuel Williams, Katherine Yelick - Show less +7 more

18 Dec 2006

TL;DR: The parallel landscape is frame with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS each development dollar.

...read moreread less

Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. 1 The Landscape of Parallel Computing Research: A View From Berkeley • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

...read moreread less

2,262 citations

Journal Article•DOI•

Reconfigurable computing: a survey of systems and software

[...]

Katherine Compton¹, Scott Hauck²•Institutions (2)

Northwestern University¹, University of Washington²

01 Jun 2002-ACM Computing Surveys

TL;DR: The hardware aspects of reconfigurable computing machines, from single chip architectures to multi-chip systems, including internal structures and external coupling are explored, and the software that targets these machines is focused on.

...read moreread less

Abstract: Due to its potential to greatly accelerate a wide variety of applications, reconfigurable computing has become a subject of a great deal of research. Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution. In this survey, we explore the hardware aspects of reconfigurable computing machines, from single chip architectures to multi-chip systems, including internal structures and external coupling. We also focus on the software that targets these machines, such as compilation tools that map high-level algorithms directly to the reconfigurable substrate. Finally, we consider the issues involved in run-time reconfigurable systems, which reuse the configurable hardware during program execution.

...read moreread less

1,666 citations

Journal Article•DOI•

Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks

[...]

Yu-Hsin Chen¹, Joel Emer², Vivienne Sze¹•Institutions (2)

Massachusetts Institute of Technology¹, Nvidia²

18 Jun 2016

TL;DR: A novel dataflow, called row-stationary (RS), is presented, that minimizes data movement energy consumption on a spatial architecture and can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine local storage, direct inter-PE communication and spatial parallelism.

...read moreread less

Abstract: Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy.In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.

...read moreread less

1,126 citations

Journal Article•DOI•

MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications

[...]

H. Singh¹, Ming-Hau Lee¹, Guangming Lu¹, Fadi J. Kurdahi¹, Nader Bagherzadeh¹, E.M. Chaves Filho² - Show less +2 more•Institutions (2)

University of California, Irvine¹, Federal University of Rio de Janeiro²

01 May 2000-IEEE Transactions on Computers

TL;DR: The MorphoSys architecture is described, including the reconfigurable processor array, the control processor, and data and configuration memories, and the suitability of MorphoSy for the target application domain is illustrated with examples such as video compression, data encryption and target recognition.

...read moreread less

Abstract: This paper introduces MorphoSys, a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with general-purpose processors for word-level, computation-intensive applications. MorphoSys is a coarse-grain, integrated, and reconfigurable system-on-chip, targeted at high-throughput and data-parallel applications. It is comprised of a reconfigurable array of processing cells, a modified RISC processor core, and an efficient memory interface unit. This paper describes the MorphoSys architecture, including the reconfigurable processor array, the control processor, and data and configuration memories. The suitability of MorphoSys for the target application domain is then illustrated with examples such as video compression, data encryption and target recognition. Performance evaluation of these applications indicates improvements of up to an order of magnitude (or more) on MorphoSys, in comparison with other systems.

...read moreread less

895 citations

Cites background from "Garp: a MIPS processor with a recon..."

...Hence, many researchers have proposed other models of reconfigurable systems targeting different applications....
[...]

Proceedings Article•DOI•

A decade of reconfigurable computing: a visionary retrospective

[...]

Reiner W. Hartenstein¹•Institutions (1)

Kaiserslautern University of Technology¹

13 Mar 2001

TL;DR: The paper surveys a decade of R&D on coarse grain reconfigurable hardware and related CAD, points out why this emerging discipline is heading toward a dichotomy of computing science, and advocates the introduction of a new soft machine paradigm to replace CAD by compilation.

...read moreread less

Abstract: The paper surveys a decade of R&D on coarse grain reconfigurable hardware and related CAD, points out why this emerging discipline is heading toward a dichotomy of computing science, and advocates the introduction of a new soft machine paradigm to replace CAD by compilation.

...read moreread less

661 citations

Cites background or methods from "Garp: a MIPS processor with a recon..."

...Garp tools [16] use a SUIF-based C compiler [43] to generate code for the MIPS host with embedded RA configuration code to accelerate (only non-nested) loops....
[...]
...The Garp Architecture [16] resembles an FPGA and comes with a MIPS-II-like host and, for acceleration of specific loops or subroutines, a 32 by 24 RA of LUT-based 2 bit PEs....
[...]
...The Garp Architecture [16] resembles an FPGA andcomes with a MIPS-II-like host and, for acceleration ofspecific loops or subroutines, a 32 by 24 RA of LUT-based 2bit PEs....
[...]
...Garp 1997 [16] 2-D mesh 2 bit global & semi-global lines heuristic routing loop acceleration...
[...]

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Book•

Applied Cryptography: Protocols, Algorithms, and Source Code in C

[...]

Bruce Schneier, Phil Sutherland

10 Nov 1993

TL;DR: This document describes the construction of protocols and their use in the real world, as well as some examples of protocols used in the virtual world.

...read moreread less

Abstract: CRYPTOGRAPHIC PROTOCOLS. Protocol Building Blocks. Basic Protocols. Intermediate Protocols. Advanced Protocols. Esoteric Protocols. CRYPTOGRAPHIC TECHNIQUES. Key Length. Key Management. Algorithm Types and Modes. Using Algorithms. CRYPTOGRAPHIC ALGORITHMS. Data Encryption Standard (DES). Other Block Ciphers. Other Stream Ciphers and Real Random-Sequence Generators. Public-Key Algorithms. Special Algorithms for Protocols. THE REAL WORLD. Example Implementations. Politics. SOURCE CODE.source Code. References.

...read moreread less

3,432 citations

"Garp: a MIPS processor with a recon..." refers methods in this paper

...Data Encryption Standard DES One of the most important encryption methods over the last 20 years has been the Data Encryption Standard, or DES [12]....
[...]
...One of the most important encryption methods over the last 20 years has been the Data Encryption Standard, or DES [12]....
[...]
...Each 64 bits is run through an “obfuscation loop” 16 times; and it is in this loop that DES spends most of its time....
[...]
...Software implementations of DES invariably implement the S-boxes as table lookups requiring a read from memory for each S-box evaluation....
[...]
...DES is a good application for reconfigurablehardware because normal processors have trouble implementing it efficiently....
[...]

Book•

Digital Halftoning

[...]

Robert Ulichney

27 Jun 1987

1,077 citations

"Garp: a MIPS processor with a recon..." refers methods in this paper

...To dither to this palette we employed Floyd-Steinberg error diffusion [13], which is essentially the standard algorithm for this task....
[...]

Proceedings Article•DOI•

A high-performance microarchitecture with hardware-programmable functional units

[...]

Rahul Razdan¹, Michael D. Smith¹•Institutions (1)

Harvard University¹

30 Nov 1994

TL;DR: A novel way to incorporate hardware-programmable resources into a processor microarchitecture to improve the performance of general-purpose applications through a coupling of compile-time analysis routines and hardware synthesis tools is explored.

...read moreread less

Abstract: This paper explores a novel way to incorporate hardware-programmable resources into a processor microarchitecture to improve the performance of general-purpose applications. Through a coupling of compile-time analysis routines and hardware synthesis tools, we automatically configure a given set of the hardware-programmable functional units (PFUs) and thus augment the base instruction set architecture so that it better meets the instruction set needs of each application. We refer to this new class of general-purpose computers as PRogrammable Instruction Set Computers (PRISC). Although similar in concept, the PRISC approach differs from dynamically programmable microcode because in PRISC we define entirely-new primitive datapath operations. In this paper, we concentrate on the microarchitectural design of the simplest form of PRISC—a RISC microprocessor with a single PFU that only evaluates combinational functions. We briefly discuss the operating system and the programming language compilation techniques that are needed to successfully build PRISC and, we present performance results from a proof-of-concept study. With the inclusion of a single 32-bit-wide PFU whose hardware cost is less than that of a 1 kilobyte SRAM, our study shows a 22% improvement in processor performance on the SPECint92 benchmarks.

...read moreread less

475 citations

"Garp: a MIPS processor with a recon..." refers background in this paper

...To address some of these concerns, various researchers have proposed building a machine that tightly couples reconfigurable hardware with a conventional microprocessor [2, 8, 9]....
[...]

Dissertation•

Reconfigurable Architectures for General-Purpose Computing

[...]

André DeHon, Thomas F. Knight

01 Jan 1996

TL;DR: MATRIX is developed, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs, and it is shown that MATRIX yields 10-20$\times the computational density of conventional processors.

...read moreread less

Abstract: General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We define RP-space, a restricted domain of the general-purpose architectural space focussed on reconfigurable computing architectures. Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) flexible interconnect which supports task dependent dataflow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the efficiencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space and their efficiency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories to FPGAs is one such design point. For typical logic applications and finite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources significantly narrows the range of efficiency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs. Our focus MATRIX design point is based on an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network. With today's silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20$\times$ the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and efficient general-purpose computing structures. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

...read moreread less

435 citations

"Garp: a MIPS processor with a recon..." refers background in this paper

...In recent years, reconfigurable hardware—usually in the guise of field-programmable gate arrays (FPGAs)—has been touted as a new and better means of performing computation [1, 2, 3]....
[...]

Journal Article•DOI•

Processor reconfiguration through instruction-set metamorphosis

[...]

Peter Athanas, Harvey F. Silverman¹•Institutions (1)

Brown University¹

01 Mar 1993-IEEE Computer

TL;DR: The processor reconfiguration through instruction-set metamorphosis (PRISM) general-purpose architecture, which speeds up computationally intensive tasks by augmenting the core processor's functionality with new operations, is described.

...read moreread less

Abstract: The processor reconfiguration through instruction-set metamorphosis (PRISM) general-purpose architecture, which speeds up computationally intensive tasks by augmenting the core processor's functionality with new operations, is described. The PRISM approach adapts the configuration and fundamental operations of a core processing system to the computationally intensive portions of a targeted application. PRISM-1, an initial prototype system, is described, and experimental results that demonstrate the benefits of the PRISM concept are presented. >

...read moreread less

415 citations

"Garp: a MIPS processor with a recon..." refers background in this paper

...In recent years, reconfigurable hardware—usually in the guise of field-programmable gate arrays (FPGAs)—has been touted as a new and better means of performing computation [1, 2, 3]....
[...]