Home
/
Authors
/
Kees Vissers

Author

Kees Vissers

Other affiliations: University of Kassel

Bio: Kees Vissers is an academic researcher from Xilinx. The author has contributed to research in topics: Field-programmable gate array & Artificial neural network. The author has an hindex of 18, co-authored 45 publications receiving 2176 citations. Previous affiliations of Kees Vissers include University of Kassel.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

[...]

Yaman Umuroglu¹, Nicholas J. Fraser², Giulio Gambardella³, Michaela Blott³, Philip H. W. Leong², Magnus Jahre¹, Kees Vissers³ - Show less +3 more•Institutions (3)

Norwegian University of Science and Technology¹, University of Sydney², Xilinx³

22 Feb 2017

TL;DR: FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture that implements fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements is presented.

...read moreread less

Abstract: Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.

...read moreread less

811 citations

Journal Article•DOI•

High-Level Synthesis for FPGAs: From Prototyping to Deployment

[...]

Jason Cong, Bin Liu, Stephen Neuendorffer¹, Juanjo Noguera¹, Kees Vissers¹, Zhiru Zhang - Show less +2 more•Institutions (1)

Xilinx¹

01 Apr 2011-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx are used as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains.

...read moreread less

Abstract: Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond register transfer level. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS msystem-on-chip design complexityethodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to hand-coded design.

...read moreread less

728 citations

Journal Article•DOI•

FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks

[...]

Michaela Blott¹, Thomas B. Preußer¹, Nicholas J. Fraser¹, Giulio Gambardella¹, Kenneth O'Brien¹, Yaman Umuroglu¹, Miriam Leeser², Kees Vissers¹ - Show less +4 more•Institutions (2)

Xilinx¹, Northeastern University²

15 Dec 2018-ACM Transactions on Reconfigurable Technology and Systems

TL;DR: The second generation of the FINN framework is described, an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs that optimizes for given platforms, design targets, and a specific precision.

...read moreread less

Abstract: Convolutional Neural Networks have rapidly become the most successful machine-learning algorithm, enabling ubiquitous machine vision and intelligent decisions on even embedded computing systems. While the underlying arithmetic is structurally simple, compute and memory requirements are challenging. One of the promising opportunities is leveraging reduced-precision representations for inputs, activations, and model parameters. The resulting scalability in performance, power efficiency, and storage footprint provides interesting design compromises in exchange for a small reduction in accuracy. FPGAs are ideal for exploiting low-precision inference engines leveraging custom precisions to achieve the required numerical accuracy for a given application. In this article, we describe the second generation of the FINN framework, an end-to-end tool that enables design-space exploration and automates the creation of fully customized inference engines on FPGAs. Given a neural network description, the tool optimizes for given platforms, design targets, and a specific precision. We introduce formalizations of resource cost functions and performance predictions and elaborate on the optimization algorithms. Finally, we evaluate a selection of reduced precision neural networks ranging from CIFAR-10 classifiers to YOLO-based object detection on a range of platforms including PYNQ and AWS F1, demonstrating new unprecedented measured throughput at 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.

...read moreread less

204 citations

Proceedings Article•DOI•

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

[...]

Yaman Umuroglu¹, Nicholas J. Fraser², Giulio Gambardella³, Michaela Blott³, Philip H. W. Leong², Magnus Jahre¹, Kees Vissers³ - Show less +3 more•Institutions (3)

Norwegian University of Science and Technology¹, University of Sydney², Xilinx³

01 Dec 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture, with fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements.

...read moreread less

Abstract: Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 {\mu}s latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 {\mu}s latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.

...read moreread less

176 citations

Proceedings Article•DOI•

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

[...]

Murad Qasaimeh¹, Kristof Denolf², Jack Lo², Kees Vissers², Joseph Zambreno¹, Phillip H. Jones¹ - Show less +2 more•Institutions (2)

Iowa State University¹, Xilinx²

02 Jun 2019

TL;DR: A comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels is conducted and rationales for why a given underlying hardware architecture innately performs well or poorly based on the characteristics of arange of vision kernel categories are discussed.

...read moreread less

Abstract: Developing high performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs), and their associated vendor optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid with determining which embedded platform is most suitable for their application, we conduct a comprehensive benchmark of the run-time performance and energy efficiency of a wide range of vision kernels. We discuss rationales for why a given underlying hardware architecture innately performs well or poorly based on the characteristics of a range of vision kernel categories. Specifically, our study is performed for three commonly used HW accelerators for embedded vision applications: ARM57 CPU, Jetson TX2 GPU and ZCU102 FPGA, using their vendor optimized vision libraries: OpenCV, VisionWorks and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1–3.2× compared to the others for simple kernels. While for more complicated kernels and complete vision pipelines, the FPGA outperforms the others with energy/frame reduction ratios of 1.2–22.3×. It is also observed that the FPGA performs increasingly better as a vision application's pipeline complexity grows.

...read moreread less

98 citations

1
2
3
4
…
5
6
7
8
9
10

Collapse

Cited by

PDF

Open Access

More filters

The Transmission Control Protocol.

[...]

Aleksander Malinowski, Bogdan M. Wilamowski

01 Jan 2005

1,360 citations

Proceedings Article•DOI•

FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search

[...]

Bichen Wu¹, Kurt Keutzer¹, Xiaoliang Dai², Peizhao Zhang³, Yanghan Wang³, Fei Sun³, Yiming Wu³, Yuandong Tian³, Peter Vajda³, Yangqing Jia³ - Show less +6 more•Institutions (3)

University of California, Berkeley¹, Princeton University², Facebook³

15 Jun 2019

TL;DR: This work proposes a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods.

...read moreread less

Abstract: Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. Due to this, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices. However, existing approaches are too resource demanding for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS surpass state-of-the-art models both designed manually and generated automatically. FBNet-B achieves 74.1% top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3 with similar accuracy. Despite higher accuracy and lower latency than MnasNet, we estimate FBNet-B's search cost is 420x smaller than MnasNet's, at only 216 GPU-hours. Searched for different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher accuracy than MobileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized FBNet, the iPhone-X-optimized model achieves a 1.4x speedup on an iPhone X. FBNet models are open-sourced at https://github. com/facebookresearch/mobile-vision.

...read moreread less

1,201 citations

Proceedings Article•DOI•

TVM: an automated end-to-end optimizing compiler for deep learning

[...]

Tianqi Chen¹, Thierry Moreau¹, Ziheng Jiang¹, Lianmin Zheng², Eddie Yan¹, Meghan Cowan¹, Haichen Shen¹, Leyuan Wang³, Yuwei Hu⁴, Luis Ceze¹, Carlos Guestrin¹, Arvind Krishnamurthy¹ - Show less +8 more•Institutions (4)

University of Washington¹, Shanghai Jiao Tong University², University of California, Davis³, Cornell University⁴

08 Oct 2018

TL;DR: TVM as discussed by the authors is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends, such as mobile phones, embedded devices, and accelerators.

...read moreread less

Abstract: There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms - such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) - requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

...read moreread less

991 citations

Journal Article•DOI•

A reconfigurable fabric for accelerating large-scale datacenter services

[...]

Andrew Putnam¹, Adrian M. Caulfield¹, Eric S. Chung¹, Derek Chiou², Kypros Constantinides³, John Demme⁴, Hadi Esmaeilzadeh⁵, Jeremy Fowers¹, Gopi Prashanth Gopal¹, Jan Gray¹, Michael Haselman¹, Scott Hauck⁶, Stephen F. Heil¹, Amir Hormati⁷, Joo-Young Kim¹, Sitaram Lanka¹, James R. Larus⁸, Eric C. Peterson¹, Simon Pope¹, Aaron L. Smith¹, Jason Thong¹, Phillip Yi Xiao¹, Doug Burger¹ - Show less +19 more•Institutions (8)

Microsoft¹, University of Texas at Austin², Amazon.com³, Columbia University⁴, Georgia Institute of Technology⁵, University of Washington⁶, Google⁷, École Polytechnique Fédérale de Lausanne⁸

28 Oct 2016-Communications of The ACM

TL;DR: The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.

...read moreread less

Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost It is challenging to improve all of these factors simultaneously To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA) Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth networkWe describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput In other words, the reconfigurable fabric enables the same throughput using only half the number of servers

...read moreread less

835 citations

Book•

IEEE transactions on computer-aided design of integrated circuits and systems : a publication of the IEEE Circuits and Systems Society

[...]

Ieee Circuits

01 Jan 1982

729 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse