Home
/
Authors
/
Gu-Yeon Wei

Author

Gu-Yeon Wei

Bio: Gu-Yeon Wei is an academic researcher from Harvard University. The author has contributed to research in topics: Deep learning & Voltage regulator. The author has an hindex of 46, co-authored 198 publications receiving 8154 citations. Previous affiliations of Gu-Yeon Wei include Stanford University.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003
2002
2000
1999
1996

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

System level analysis of fast, per-core DVFS using on-chip switching regulators

[...]

Wonyoung Kim¹, Meeta S. Gupta¹, Gu-Yeon Wei¹, David Brooks¹•Institutions (1)

Harvard University¹

24 Oct 2008

TL;DR: It is concluded that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.

...read moreread less

Abstract: Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in digital systems, but the effectiveness of DVFS is hampered by slow voltage transitions that occur on the order of tens of microseconds. In addition, the recent trend towards chip-multiprocessors (CMP) executing multi-threaded workloads with heterogeneous behavior motivates the need for per-core DVFS control mechanisms. Voltage regulators that are integrated onto the same chip as the microprocessor core provide the benefit of both nanosecond-scale voltage switching and per-core voltage control. We show that these characteristics provide significant energy-saving opportunities compared to traditional off-chip regulators. However, the implementation of on-chip regulators presents many challenges including regulator efficiency and output voltage transient characteristics, which are significantly impacted by the system-level application of the regulator. In this paper, we describe and model these costs, and perform a comprehensive analysis of a CMP system with on-chip integrated regulators. We conclude that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.

...read moreread less

758 citations

Journal Article•DOI•

Minerva: enabling low-power, highly-accurate deep neural network accelerators

[...]

Brandon Reagen¹, Paul N. Whatmough¹, Robert Adolf¹, Saketh Rama¹, Hyunkwang Lee¹, Sae Kyu Lee¹, José Miguel Hernández-Lobato¹, Gu-Yeon Wei¹, David Brooks¹ - Show less +5 more•Institutions (1)

Harvard University¹

18 Jun 2016

TL;DR: Minerva as mentioned in this paper proposes a co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators, and shows that fine-grained, heterogeneous dataatype optimization reduces power by 1.5× and aggressive, inline predication and pruning of small activity values further reduces power.

...read moreread less

Abstract: The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.

...read moreread less

540 citations

Proceedings Article•DOI•

Process Variation Tolerant 3T1D-Based Cache Architectures

[...]

Xiaoyao Liang¹, Ramon Canal¹, Gu-Yeon Wei¹, David Brooks²•Institutions (2)

Harvard University¹, Polytechnic University of Catalonia²

01 Dec 2007

TL;DR: A range of cache refresh and placement schemes that are sensitive to retention time are proposed, and it is shown that most of the retention time variations can be masked by the microarchitecture when using these schemes.

...read moreread less

Abstract: Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of a L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to reten- tion time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tol- erate large process variations with little or even no impact on per- formance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new mem- ory architectures a promising choice for on-chip variation-tolerant cache structures required for next generation microprocessors.

...read moreread less

359 citations

Proceedings Article•DOI•

Profiling a warehouse-scale computer

[...]

Svilen Kanev¹, Juan Pablo Darago², Kim Hazelwood³, Parthasarathy Ranganathan⁴, Tipp Moseley⁴, Gu-Yeon Wei¹, David Brooks¹ - Show less +3 more•Institutions (4)

Harvard University¹, University of Buenos Aires², Yahoo!³, Google⁴

13 Jun 2015

TL;DR: A detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications finds that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss.

...read moreread less

Abstract: With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.

...read moreread less

328 citations

Journal Article•DOI•

Aladdin: a Pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures

[...]

Yakun Sophia Shao¹, Brandon Reagen¹, Gu-Yeon Wei¹, David Brooks¹•Institutions (1)

Harvard University¹

14 Jun 2014

TL;DR: Aladdin is presented, a pre-RTL, power-performance accelerator modeling framework and its application to system-on-chip (SoC) simulation and provides researchers an approach to model the power and performance of accelerators in an SoC environment.

...read moreread less

Abstract: Hardware specialization, in the form of accelerators that provide custom datapath and control for specific algorithms and applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerator analysis relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but are also slow and tedious to use, making large design space exploration infeasible. To overcome this problem, we present Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrate its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9%, 4.9%, and 6.6% with respect to RTL implementations. Integrated with architecture-level core and memory hierarchy simulators, Aladdin provides researchers an approach to model the power and performance of accelerators in an SoC environment

...read moreread less

269 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44

Collapse

Cited by

PDF

Open Access

More filters

Posted Content•

In-Datacenter Performance Analysis of a Tensor Processing Unit

[...]

Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Khaitan Harshit, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon - Show less +71 more

16 Apr 2017-arXiv: Hardware Architecture

TL;DR: This paper evaluates a custom ASIC-called a Tensor Processing Unit (TPU)-deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the samedatacenters.

...read moreread less

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

...read moreread less

3,067 citations

Proceedings Article•DOI•

In-Datacenter Performance Analysis of a Tensor Processing Unit

[...]

Norman P. Jouppi¹, Cliff Young¹, Nishant Patil¹, David A. Patterson¹, Gaurav Agrawal¹, Raminder Bajwa¹, Sarah Bates¹, Suresh Bhatia¹, Nan Boden¹, Albert T. Borchers¹, Rick Boyle¹, Pierre-luc Cantin¹, Clifford Chao¹, Christopher Aaron Clark¹, Jeremy Coriell¹, Michael J. Daley¹, Matt Dau¹, Jeffrey Dean¹, Ben Gelb¹, Tara Vazir Ghaemmaghami¹, Rajendra Gottipati¹, William John Gulland¹, Robert Hagmann¹, C. Richard Ho¹, Doug Hogberg¹, John Hu¹, Robert Hundt¹, D. Hurt¹, Julian Ibarz¹, Aaron Jaffey¹, Alek Jaworski¹, Alexander Kaplan¹, Khaitan Harshit¹, Daniel Killebrew¹, Andy Koch¹, Naveen Kumar¹, Steve Lacy¹, James Laudon¹, James Law¹, Diemthu Le¹, Chris Leary¹, Zhuyuan Liu¹, Kyle Lucke¹, Alan Lundin¹, Gordon MacKean¹, Adriana Maggiore¹, Maire Mahony¹, Kieran Miller¹, Rahul Nagarajan¹, Ravi Narayanaswami¹, Ray Ni¹, Kathy Nix¹, Thomas Norrie¹, Mark Omernick¹, Narayana Penukonda¹, Andrew Everett Phelps¹, Jonathan Ross¹, Matt Ross¹, Amir Salek¹, Emad Samadiani¹, Chris Severn¹, Gregory Sizikov¹, Matthew Snelham¹, Jed Souter¹, Dan Steinberg¹, Andy Swing¹, Mercedes Tan¹, Gregory Michael Thorson¹, Bo Tian¹, Horia Toma¹, Erick Tuttle¹, Vijay K. Vasudevan¹, Richard Walter¹, Walter Wang¹, Eric Wilcox¹, Doe Hyun Yoon¹ - Show less +72 more•Institutions (1)

Google¹

24 Jun 2017

TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).

...read moreread less

Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU) --- deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the CPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

...read moreread less

2,679 citations

Journal Article•DOI•

EIE: efficient inference engine on compressed deep neural network

[...]

Song Han¹, Xingyu Liu¹, Huizi Mao¹, Jing Pu¹, Ardavan Pedram¹, Mark Horowitz¹, William J. Dally¹ - Show less +3 more•Institutions (1)

Stanford University¹

18 Jun 2016

TL;DR: In this paper, the authors proposed an energy efficient inference engine (EIE) that performs inference on a compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.

...read moreread less

Abstract: State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power.Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×104 frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.

...read moreread less

2,445 citations

Journal Article•DOI•

Efficient Processing of Deep Neural Networks: A Tutorial and Survey

[...]

Vivienne Sze¹, Yu-Hsin Chen¹, Tien-Ju Yang¹, Joel Emer¹•Institutions (1)

Massachusetts Institute of Technology¹

20 Nov 2017

TL;DR: In this paper, the authors provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs, and discuss various hardware platforms and architectures that support DNN, and highlight key trends in reducing the computation cost of deep neural networks either solely via hardware design changes or via joint hardware and DNN algorithm changes.

...read moreread less

Abstract: Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

...read moreread less

2,391 citations

Обнаружение транспортных средств на изображениях загородных шоссе на основе метода Single shot multibox Detector

[...]

Р Ю Чуйков, Д А Юдин

01 Jan 2017

1,687 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse