Home
/
Authors
/
Zhiru Zhang

Author

Zhiru Zhang

Other affiliations: Xilinx, Nvidia, Altera ...read more

Bio: Zhiru Zhang is an academic researcher from Cornell University. The author has contributed to research in topics: High-level synthesis & Computer science. The author has an hindex of 28, co-authored 122 publications receiving 3561 citations. Previous affiliations of Zhiru Zhang include Xilinx & Nvidia.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003

Papers

PDF

Open Access

More filters

Journal Article•DOI•

High-Level Synthesis for FPGAs: From Prototyping to Deployment

[...]

Jason Cong, Bin Liu, Stephen Neuendorffer¹, Juanjo Noguera¹, Kees Vissers¹, Zhiru Zhang - Show less +2 more•Institutions (1)

Xilinx¹

01 Apr 2011-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.

...read moreread less

Abstract: Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS msystem-on-chip design complexityethodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.

...read moreread less

728 citations

Proceedings Article•DOI•

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

[...]

Ritchie Zhao¹, Weinan Song¹, Wentao Zhang, Tianwei Xing², Jeng-Hau Lin³, Mani Srivastava², Rajesh Gupta³, Zhiru Zhang¹ - Show less +4 more•Institutions (3)

Cornell University¹, University of California, Los Angeles², University of California, San Diego³

22 Feb 2017

TL;DR: The design of a BNN accelerator is presented that is synthesized from C++ to FPGA-targeted Verilog and outperforms existing FPGAs-based CNN accelerators in GOPS as well as energy and resource efficiency.

...read moreread less

Abstract: Convolutional neural networks (CNN) are the current stateof-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into the FPGA acceleration of CNN workloads has achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks give GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.

...read moreread less

379 citations

Proceedings Article•DOI•

Application-specific instruction generation for configurable processor architectures

[...]

Jason Cong¹, Yiping Fan¹, Guoling Han¹, Zhiru Zhang¹•Institutions (1)

University of California, Los Angeles¹

22 Feb 2004

TL;DR: A set of algorithms, including pattern generation, pattern selection, and application mapping, are proposed to efficiently utilize the instruction set extensibility of the target configurable processor.

...read moreread less

Abstract: Designing an application-specific embedded system in nanometer technologies has become more difficult than ever due to the rapid increase in design complexity and manufacturing cost. Efficiency and flexibility must be carefully balanced to meet different application requirements. The recently emerged configurable and extensible processor architectures offer a favorable tradeoff between efficiency and flexibility, and a promising way to minimize certain important metrics (e.g., execution time, code size, etc.) of the embedded processors. This paper addresses the problem of generating the application-specific instructions to improve the execution speed for configurable processors. A set of algorithms, including pattern generation, pattern selection, and application mapping, are proposed to efficiently utilize the instruction set extensibility of the target configurable processor. Applications of our approach to several real-life benchmarks on the Altera Nios processor show encouraging performance speedup (2.75X on average and up to 3.73X in some cases).

...read moreread less

255 citations

Proceedings Article•DOI•

Reverse engineering convolutional neural networks through side-channel information leaks

[...]

Weizhe Hua¹, Zhiru Zhang¹, G. Edward Suh¹•Institutions (1)

Cornell University¹

24 Jun 2018

TL;DR: This study shows that even with data encryption, the adversary can infer the underlying network structure by exploiting the memory and timing side-channels, and reveals the importance of hiding off-chip memory access pattern to truly protect confidential CNN models.

...read moreread less

Abstract: A convolutional neural network (CNN) model represents a crucial piece of intellectual property in many applications. Revealing its structure or weights would leak confidential information. In this paper we present novel reverse-engineering attacks on CNNs running on a hardware accelerator, where an adversary can feed inputs to the accelerator and observe the resulting off-chip memory accesses. Our study shows that even with data encryption, the adversary can infer the underlying network structure by exploiting the memory and timing side-channels. We further identify the information leakage on the values of weights when a CNN accelerator performs dynamic zero pruning for off-chip memory accesses. Overall, this work reveals the importance of hiding off-chip memory access pattern to truly protect confidential CNN models.

...read moreread less

177 citations

Proceedings Article•DOI•

An efficient and versatile scheduling algorithm based on SDC formulation

[...]

Jason Cong¹, Zhiru Zhang¹•Institutions (1)

University of California, Los Angeles¹

24 Jul 2006

TL;DR: A new scheduler is described that converts a rich set of scheduling constraints into a system of difference constraints (SDC) and performs a variety of powerful optimizations under a unified mathematical programming framework and effectively optimize longest path latency, expected overall latency, and the slack distribution.

...read moreread less

Abstract: Scheduling plays a central role in the behavioral synthesis process, which automatically compiles high-level specifications into optimized hardware implementations. However, most of the existing behavior-level scheduling heuristics either have a limited efficiency in a specific class of applications or lack general support of various design constraints. In this paper we describe a new scheduler that converts a rich set of scheduling constraints into a system of difference constraints (SDC) and performs a variety of powerful optimizations under a unified mathematical programming framework. In particular, we show that our SDC-based scheduling algorithm can efficiently support resource constraints, frequency constraints, latency constraints, and relative timing constraints, and effectively optimize longest path latency, expected overall latency, and the slack distribution. Experiments demonstrate that our proposed technique provides efficient solutions for a broader range of applications with higher quality of results (in terms of system performance) when compared to the state-of-the-art scheduling heuristics.

...read moreread less

171 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•

“Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告

[...]

杉山拓海

12 Sep 2017-Computers & Graphics

3,940 citations

Book•

IEEE transactions on computer-aided design of integrated circuits and systems : a publication of the IEEE Circuits and Systems Society

[...]

Ieee Circuits

01 Jan 1982

729 citations

Journal Article•DOI•

High-Level Synthesis for FPGAs: From Prototyping to Deployment

[...]

Jason Cong, Bin Liu, Stephen Neuendorffer¹, Juanjo Noguera¹, Kees Vissers¹, Zhiru Zhang - Show less +2 more•Institutions (1)

Xilinx¹

01 Apr 2011-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

...read moreread less

728 citations

Proceedings Article•DOI•

LegUp: high-level synthesis for FPGA-based processor/accelerator systems

[...]

Andrew Canis¹, Jongsok Choi¹, Mark Aldham¹, Victor Zhang¹, Ahmed Kammoona¹, Jason H. Anderson¹, Stephen J. Brown¹, Tomasz Czajkowski² - Show less +4 more•Institutions (2)

University of Toronto¹, Altera²

27 Feb 2011

TL;DR: A new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design and produces hardware solutions of comparable quality to a commercial high- level synthesis tool.

...read moreread less

Abstract: In this paper, we introduce a new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool

...read moreread less

531 citations

Proceedings Article•DOI•

A configurable cloud-scale DNN processor for real-time AI

[...]

Jeremy Fowers¹, Kalin Ovtcharov¹, Michael K. Papamichael¹, Todd Massengill¹, Ming Liu¹, Lo Daniel¹, Shlomi Alkalay¹, Michael Haselman¹, Logan Adams¹, Mahdi Ghandi¹, Stephen F. Heil¹, Prerak Patel¹, Adam Sapek¹, Gabriel Weisz¹, Lisa Woods¹, Sitaram Lanka¹, Steven K. Reinhardt¹, Adrian M. Caulfield¹, Eric S. Chung¹, Doug Burger¹ - Show less +16 more•Institutions (1)

Microsoft¹

02 Jun 2018

TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, and achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.5 teraflops.

...read moreread less

Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models—aka ""real-time AI"". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.

...read moreread less

498 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse