Author

Greg Stitt

Bio: Greg Stitt is an academic researcher from the University of Florida. The author has contributed to research in topics: Reconfigurable computing & Software. The author has an h-index of 26 and has co-authored 104 publications receiving 2,278 citations. Previous affiliations of Greg Stitt include University of California, Riverside & George Washington University.


Papers
Proceedings ArticleDOI
22 Feb 2012
TL;DR: This paper analyzes an important domain of applications, referred to as sliding-window applications, when executing on FPGAs, GPUs, and multicores, and presents optimization strategies and use cases where each device is most effective.
Abstract: With the emergence of accelerator devices such as multicores, graphics-processing units (GPUs), and field-programmable gate arrays (FPGAs), application designers are confronted with the problem of searching a huge design space that has been shown to have widely varying performance and energy metrics for different accelerators, different application domains, and different use cases. To address this problem, numerous studies have evaluated specific applications across different accelerators. In this paper, we analyze an important domain of applications, referred to as sliding-window applications, when executing on FPGAs, GPUs, and multicores. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that FPGAs can achieve speedups of up to 11x and 57x compared to GPUs and multicores, respectively, while also using orders of magnitude less energy.
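The sliding-window pattern underlying this study applies one computation to every window position of an input. As a minimal, hypothetical illustration (not code from the paper), the C sketch below computes a 3x3 windowed sum over an image; the window size, names, and data layout are assumptions chosen for the example.

```c
#include <stddef.h>

#define WIN 3  /* assumed 3x3 window; the paper evaluates a range of window sizes */

/* Naive sliding-window sum: for each output position, accumulate the WINxWIN
 * neighborhood of the input. FPGA implementations typically buffer WIN rows
 * on chip so that each input pixel is fetched from external memory only once. */
void sliding_window_sum(const float *in, float *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r + WIN <= rows; r++) {
        for (size_t c = 0; c + WIN <= cols; c++) {
            float acc = 0.0f;
            for (size_t wr = 0; wr < WIN; wr++)
                for (size_t wc = 0; wc < WIN; wc++)
                    acc += in[(r + wr) * cols + (c + wc)];
            out[r * cols + c] = acc;
        }
    }
}
```

The heavy data reuse between neighboring windows is what typical device-specific optimizations (line buffers on FPGAs, shared memory on GPUs, cache blocking on multicores) exploit.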

253 citations

Proceedings ArticleDOI
02 Jun 2003
TL;DR: This work describes the system architecture and initial on-chip tools, including profiler, decompiler, synthesis, and placement-and-routing tools for a simplified configurable logic fabric, able to perform dynamic partitioning of real benchmarks, and shows speedups averaging 2.6 for five benchmarks taken from Powerstone, NetBench, and the authors' own benchmarks.
Abstract: Partitioning an application among software running on a microprocessor and hardware co-processors in on-chip configurable logic has been shown to improve performance and energy consumption in embedded systems. Meanwhile, dynamic software optimization methods have shown the usefulness and feasibility of runtime program optimization, but those optimizations do not achieve as much as partitioning. We introduce a first approach to dynamic hardware/software partitioning. We describe our system architecture and initial on-chip tools, including profiler, decompiler, synthesis, and placement and routing tools for a simplified configurable logic fabric, able to perform dynamic partitioning of real benchmarks. We show speedups averaging 2.6 for five benchmarks taken from Powerstone, NetBench, and our own benchmarks.

177 citations

Journal ArticleDOI
07 Jun 2004
TL;DR: This work developed a custom FPGA fabric specifically designed to enable lean place and route tools, and developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing.
Abstract: We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so using a special compiler, a warp processor achieves these improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is that of dynamically reimplementing code regions on an FPGA, requiring partitioning, decompilation, synthesis, placement, and routing tools, all having to execute with minimal computation time and data memory so as to coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools use acceptably small amounts of computation and memory, far less than those required by traditional tools. Our work illustrates the feasibility and potential of warp processing, and we can foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.
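The warp-processing flow described above is easiest to picture as a runtime loop: profile execution, detect a hot region, synthesize it for the on-chip FPGA, and redirect the software to the hardware version. The C sketch below is only a conceptual software analogy of that flow; the threshold, the function names, and the function-pointer swap are illustrative assumptions, and a real warp processor works on the binary and configures an FPGA fabric rather than calling another C function.

```c
#include <stdint.h>
#include <stdio.h>

/* Software version of a critical kernel. */
static uint32_t checksum_sw(const uint8_t *buf, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Stand-in for the dynamically synthesized circuit; in a warp processor this
 * would be an invocation of the configured FPGA hardware. */
static uint32_t checksum_hw(const uint8_t *buf, uint32_t n)
{
    return checksum_sw(buf, n);  /* same result, assumed faster in hardware */
}

int main(void)
{
    uint32_t (*checksum)(const uint8_t *, uint32_t) = checksum_sw;
    const uint32_t HOT_THRESHOLD = 1000;  /* assumed profiling threshold */
    uint32_t exec_count = 0;
    uint8_t data[256] = {0};

    for (int iter = 0; iter < 5000; iter++) {
        (void)checksum(data, sizeof data);
        /* The on-chip profiler detects a critical region; once "synthesis"
         * completes, calls to the software region are redirected to hardware. */
        if (++exec_count == HOT_THRESHOLD)
            checksum = checksum_hw;
    }
    printf("kernel now runs in %s\n",
           checksum == checksum_hw ? "hardware" : "software");
    return 0;
}
```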

147 citations

Proceedings ArticleDOI
11 May 2014
TL;DR: This paper introduces an FPGA-optimized SMVM architecture and a novel sparse matrix encoding that explicitly exposes parallelism across rows, while keeping the hardware complexity and on-chip memory usage low.
Abstract: Sparse matrix-vector multiplication (SMVM) is a crucial primitive used in a variety of scientific and commercial applications. Despite having significant parallelism, SMVM is a challenging kernel to optimize due to its irregular memory access characteristics. Numerous studies have proposed the use of FPGAs to accelerate SMVM implementations. However, most prior approaches focus on parallelizing multiply-accumulate operations within a single row of the matrix (which limits parallelism if rows are small) and/or make inefficient use of the memory system when fetching matrix and vector elements. In this paper, we introduce an FPGA-optimized SMVM architecture and a novel sparse matrix encoding that explicitly exposes parallelism across rows, while keeping the hardware complexity and on-chip memory usage low. This system compares favorably with prior FPGA SMVM implementations. For the over 700 University of Florida sparse matrices we evaluated, it also performs within about two thirds of CPU SMVM performance on average, even though it has 2.4× lower DRAM memory bandwidth, and within almost one third of GPU SMVM performance on average, even at 9x lower memory bandwidth. Additionally, it consumes only 25 W, for power efficiencies 2.6x and 2.3x higher than CPU and GPU, respectively, based on maximum device power.
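For readers unfamiliar with the kernel itself, a baseline sparse matrix-vector multiply over the common compressed sparse row (CSR) format looks like the sketch below. This is generic reference code, not the paper's encoding; the paper's contribution is precisely that its FPGA-oriented format exposes parallelism across rows, which this plain row-by-row CSR loop does not.

```c
#include <stddef.h>

/* y = A*x for a sparse matrix A in CSR form:
 *   row_ptr[r] .. row_ptr[r+1]-1 index the nonzeros of row r,
 *   col_idx[k] and val[k] give each nonzero's column and value.
 * Parallelism within one row is limited when rows have few nonzeros, which is
 * the bottleneck an across-row encoding is designed to avoid. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t r = 0; r < n_rows; r++) {
        double acc = 0.0;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            acc += val[k] * x[col_idx[k]];
        y[r] = acc;
    }
}
```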

109 citations

Journal ArticleDOI
TL;DR: These experiments, representing the most comprehensive hardware/software partitioning study published to date, found that moving critical code to hardware resulted in average speedups of 3 to 5 and average energy savings of 35% to 70%, with average hardware requirements of only 5000 to 10,000 gates.
Abstract: We present results of extensive hardware/software partitioning experiments on numerous benchmarks. We describe our loop-oriented partitioning methodology for moving critical code from software to hardware. Our benchmarks included programs from PowerStone, MediaBench, and NetBench. Our experiments included estimated results for partitioning using an 8051 8-bit microcontroller or a 32-bit MIPS microprocessor for the software, and using on-chip configurable logic or custom application-specific integrated circuit hardware for the hardware. Additional experiments involved actual measurements taken from several physical implementations of hardware/software partitionings on real single-chip microprocessor/configurable-logic devices. We also estimated results assuming voltage-scalable processors. We provide performance, energy, and size data for all of the experiments. We found that the benchmarks spent an average of 80% of their execution time in only 3% of their code, amounting to only about 200 bytes of critical code. For various experiments, we found that moving critical code to hardware resulted in average speedups of 3 to 5 and average energy savings of 35% to 70%, with average hardware requirements of only 5000 to 10,000 gates. To our knowledge, these experiments represent the most comprehensive hardware/software partitioning study published to date.
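The reported numbers are consistent with a simple Amdahl's-law bound. Treating the measured 80% of execution time spent in critical code as the accelerable fraction f = 0.8 (an illustrative assumption, not a calculation from the paper), the overall speedup from accelerating that code by a factor s is

```latex
S(s) = \frac{1}{(1 - f) + f/s}, \qquad
S(6) = \frac{1}{0.2 + 0.8/6} = 3, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - 0.8} = 5 .
```

so the observed average speedups of 3 to 5 span the range between a modest hardware speedup of the critical code and the theoretical ceiling for f = 0.8.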

106 citations


Cited by
Journal ArticleDOI
18 Jun 2016
TL;DR: In this paper, the authors proposed an energy efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.
Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
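As a rough sketch of the kind of computation EIE accelerates, a weight-shared sparse matrix-vector product with zero-activation skipping can be written as below. The 16-entry shared codebook, the byte-per-index storage, and the column-major (CSC-like) layout are assumptions chosen for illustration, not EIE's actual on-chip format.

```c
#include <stddef.h>
#include <stdint.h>

/* Weight-shared sparse y += W*x: each nonzero stores only a 4-bit index into a
 * shared codebook of 16 real weights. Columns correspond to input activations;
 * columns whose activation is zero (e.g., after ReLU) are skipped entirely. */
void weight_shared_spmv(size_t n_cols,
                        const size_t *col_ptr,    /* nonzeros of column j: col_ptr[j]..col_ptr[j+1]-1 */
                        const uint16_t *row_idx,  /* output row of each nonzero */
                        const uint8_t *w_idx,     /* codebook index per nonzero (low 4 bits used) */
                        const float codebook[16], /* shared weight values */
                        const float *x, float *y)
{
    for (size_t j = 0; j < n_cols; j++) {
        float a = x[j];
        if (a == 0.0f)            /* skip zero activations */
            continue;
        for (size_t k = col_ptr[j]; k < col_ptr[j + 1]; k++)
            y[row_idx[k]] += codebook[w_idx[k] & 0x0F] * a;
    }
}
```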

2,445 citations

Journal ArticleDOI
TL;DR: The authors deployed a reconfigurable fabric across a bed of 1,632 servers with FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.
Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers.

835 citations

Journal ArticleDOI
14 Jun 2014
TL;DR: The requirements and architecture of the fabric are described, the critical engineering challenges and solutions needed to make the system robust in the presence of failures are detailed, and the performance, power, and resilience of the system when ranking candidate documents are measured.
Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables. In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% for a fixed latency distribution or, while maintaining equivalent throughput, reduces the tail latency by 29%.

712 citations

Proceedings ArticleDOI
22 Feb 2017
TL;DR: The authors propose the Efficient Speech Recognition Engine (ESE), along with a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization).
Abstract: Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large models are both computation-intensive and memory-intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) for a data center. To speed up prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model across multiple PEs for parallelism and schedules the complicated LSTM data flow. Finally, we design the hardware architecture, named the Efficient Speech Recognition Engine (ESE), which works directly on the sparse LSTM model. Implemented on a Xilinx KU060 FPGA running at 200 MHz, ESE has a performance of 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of only 41 W. Evaluated on the LSTM speech recognition benchmark, ESE is 43x and 3x faster than Core i7-5930K CPU and Pascal Titan X GPU implementations, respectively. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU, respectively.
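A load-balance-aware pruning step can be read as magnitude pruning applied per block of weights rather than globally, so that every processing element keeps roughly the same number of nonzeros. The sketch below illustrates that reading under stated assumptions (per-block threshold selection, equal block sizes, and all names are illustrative, not the paper's algorithm).

```c
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
    float fa = *(const float *)a, fb = *(const float *)b;
    return (fa < fb) - (fa > fb);  /* sort descending */
}

/* Prune each PE's block of weights independently so that each block keeps at
 * most keep_per_block nonzeros, balancing work across processing elements. */
void balanced_prune(float *w, size_t n_blocks, size_t block_len, size_t keep_per_block)
{
    float *mags = malloc(block_len * sizeof *mags);
    if (mags == NULL)
        return;
    for (size_t b = 0; b < n_blocks; b++) {
        float *blk = w + b * block_len;
        for (size_t i = 0; i < block_len; i++)
            mags[i] = fabsf(blk[i]);
        qsort(mags, block_len, sizeof *mags, cmp_desc);
        /* Keep only weights strictly larger in magnitude than the cutoff. */
        float cutoff = (keep_per_block < block_len) ? mags[keep_per_block] : -1.0f;
        for (size_t i = 0; i < block_len; i++)
            if (fabsf(blk[i]) <= cutoff)
                blk[i] = 0.0f;
    }
    free(mags);
}
```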

537 citations

Proceedings ArticleDOI
27 Feb 2011
TL;DR: A new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design and produces hardware solutions of comparable quality to a commercial high-level synthesis tool.
Abstract: In this paper, we introduce a new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface. Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool.
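Because LegUp's input is standard C, a candidate for this flow is just an ordinary compute-bound C function. The dot-product kernel below is an illustrative example of the kind of loop such a tool might map to a hardware accelerator; it is not taken from LegUp's benchmarks, and any tool-specific pragmas or interfaces are deliberately omitted.

```c
#include <stdint.h>

/* An ordinary C kernel: a high-level synthesis flow can compile a loop like
 * this to a custom accelerator, while the rest of the program runs on the
 * soft processor and invokes it over the system bus. */
int32_t dot_product(const int16_t *a, const int16_t *b, uint32_t n)
{
    int32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```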

531 citations