
Automatic heterogeneous quantization of deep neural networks for low-latency
inference on the edge for particle detectors
Claudionor N. Coelho Jr.
Palo Alto Networks (California, USA)
Aki Kuusela, Shan Li, and Hao Zhuang
Google LLC (California, USA)
Thea Aarrestad,
Vladimir Loncar,
Maurizio Pierini, Adrian Alan Pol, and Sioni Summers
European Organization for Nuclear Research (CERN) (Geneva, Switzerland)
Jennifer Ngadiuba
California Institute of Technology (Caltech) (California, USA)
(Dated: June 22, 2021)
Although the quest for more accurate solutions is pushing deep learning research towards larger
and more complex algorithms, edge devices demand efficient inference and therefore reduction in
model size, latency and energy consumption. One technique to limit model size is quantization,
which implies using fewer bits to represent weights and biases. Such an approach usually results
in a decline in performance. Here, we introduce a method for designing optimally heterogeneously
quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond
inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic
quantization procedure, sampling from a wide range of quantizers, model energy consumption and
size are minimized while high accuracy is maintained. This is crucial for the event selection procedure
in proton–proton collisions at the CERN Large Hadron Collider, where resources are strictly limited
and a latency of O(1) µs is required. Nanosecond inference and a resource consumption reduced by
a factor of 50 when implemented on field-programmable gate array hardware are achieved.
FIG. I. An ultra-compressed deep neural network for particle identification on a Xilinx FPGA.
E-mail: thea.aarrestad@cern.ch
Also at Institute of Physics Belgrade, Serbia.
arXiv:2006.10159v3 [physics.ins-det] 21 Jun 2021

I. INTRODUCTION
With edge computing, real-time inference of deep neural
networks (DNNs) on custom hardware has become increas-
ingly relevant. Smartphone companies are incorporating
Artificial Intelligence (AI) chips in their design for on-
device inference to improve user experience and tighten
data security, and the autonomous vehicle industry is
turning to application-specific integrated circuits (ASICs)
to keep the latency low. While the typical acceptable
latency for real-time inference in applications like those
above is O(1) ms [1, 2], other applications may require
sub-microsecond inference. For instance, high-frequency
trading machine learning (ML) algorithms are running
on field-programmable gate arrays (FPGAs) to make de-
cisions within nanoseconds [3]. At the extreme end of the inference
spectrum, combining both low latency (as in high-frequency
trading) and limited area (as in smartphone applications),
is the processing of data from proton-proton collisions at
the Large Hadron Collider (LHC) at CERN [4]. In the
particle detectors around the LHC ring, tens of terabytes
of data per second are produced from collisions occurring
every 25 ns. This extremely large data rate is reduced by
a real-time event filter processing system, the trigger,
which decides whether each discrete collision event should
be kept for further analysis or be discarded. Data is
buffered close to the detector while the processing occurs,
with a maximum latency of O(1) µs to make the trigger
decision. High selection accuracy in the trigger is crucial
in order to keep only the most interesting events while
keeping the output bandwidth low, reducing the event
rate from 40 MHz to 100 kHz. In 2027 the LHC will
be upgraded from its current state, capable of producing
up to one billion proton-proton collisions per second, to
the so-called High-Luminosity LHC (HL-LHC) [5]. This
will involve increasing the number of proton collisions
occurring every second by a factor of five to seven, ulti-
mately resulting in a total amount of accumulated data
one order of magnitude higher than what is possible with
the current collider. With this extreme increase, ML so-
lutions are being explored as fast approximations of the
algorithms currently in use to minimize the latency and
maximize the precision of tasks that can be performed.
Hardware used for real-time inference in particle detec-
tors usually has limited computational capacity due to
size constraints. Incorporating resource-intensive models
without a loss in performance poses a great challenge. In
recent years, many developments have aimed at providing efficient
inference from the algorithmic point of view. This
includes compact network design [6–10], weight and filter
pruning [11, 12] or quantization. In post-training quantization [13–17]
the pre-trained model parameters are
translated into lower precision equivalents. However, this
process is, by definition, lossy and sacrifices model per-
formance. Therefore, solutions for quantization-aware
training have been suggested [18–27]. In these, a fixed
numerical representation is adopted for the whole model,
and the model training is performed enforcing this con-
straint during weight optimization. More recently, it has been
argued [28–31] that some layers may be more accommodating
of aggressive quantization, whereas others may require
more expensive arithmetic. This suggests that per-layer
heterogeneous quantization is the optimal way to achieve
higher accuracy at low resource cost, although it might
require further specialization of the hardware resources.
In this paper, we introduce a novel workflow for find-
ing the optimal heterogeneous quantization per layer and
per parameter type for a given model, and deploying
that model on FPGA hardware. Through minimal code
changes, the model footprint is minimized while retain-
ing high accuracy, and then translated into low-latency
firmware. This paper makes the following contributions:
We have implemented a range of quantization meth-
ods in a common library, which provide a broad
base from which optimal quantizations can easily
be sampled;
We introduce a novel method for finding the optimal
heterogeneous quantization for a given model, re-
sulting in minimum area or minimum power DNNs
while maintaining high accuracy;
We have made these methods available online
in easy-to-use libraries, called QKeras and AutoQKeras¹
(a brief usage sketch is given below), where simple drop-in replacement of
Keras [32] layers makes it straightforward for users
to transform Keras models to their equivalent
deep heterogeneously quantized versions, which are
trained quantization-aware. Using AutoQKeras, a
user can trade off accuracy against model size reduction
(e.g., area or energy);
We have added support for quantized QKeras mod-
els in the hls4ml library [13], which converts these
pre-trained quantized models into highly-parallel
FPGA firmware for ultra low-latency inference.
To demonstrate the significant practical advantages of
these tools for high-energy physics and other edge-inference applications:
We conduct an experiment consisting of classify-
ing events in an extreme environment, namely the
triggering of proton-proton collisions at the CERN
LHC, where resources are limited and a maximum
latency of O(1) µs is imposed;
We show that inference within 60 ns and a reduction
of the model resource consumption by a factor of 50
can be achieved through automatic heterogeneous
quantization, while maintaining similar accuracy
(within 3% of the floating point model accuracy);
We show that the original floating point model ac-
curacy can be maintained for homogeneously quan-
tized DNNs down to a bit width of six while re-
ducing resource consumption by up to 75% through
quantization-aware training with QKeras.
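As an illustration of the drop-in replacement mentioned above, the sketch below shows how the first fully-connected block of a Keras model could be rewritten with QKeras quantizers. The ⟨4, 0⟩ bit widths and the layer size are illustrative choices for this sketch, not the optimized configurations reported later in the paper:

from tensorflow.keras.layers import Input
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

inputs = Input((16,))
# QDense replaces Dense; weights and biases are quantized during training.
x = QDense(64,
           kernel_quantizer=quantized_bits(4, 0, alpha=1),
           bias_quantizer=quantized_bits(4, 0, alpha=1))(inputs)
# QActivation replaces Activation with a quantized non-linearity.
x = QActivation(quantized_relu(4, 0))(x)

Training then proceeds exactly as for the corresponding Keras model, with the quantization applied in the forward pass so that the network is trained quantization-aware.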
The proposed pipeline provides a novel, automatic end-to-
end flow for deploying ultra low latency, low-area DNNs
¹ https://github.com/google/qkeras

on chip. This will be crucial for the deployment of ML
models on FPGAs in particle detectors and other fields
with extreme inference and low-power requirements.
The remainder of the paper is organized as follows.
In Section II we discuss previous work related to model
quantization and model compression with a focus on particle
detectors. Section III reviews related frameworks for deploying
DNNs on FPGAs, and Section IV introduces the particle
identification task used as a benchmark. In Section V we
introduce QKeras, the novel library for training ultra low-latency,
optimally heterogeneously quantized DNNs. Section VI describes the
procedure of automatic quantization for optimizing model
size and accuracy simultaneously. Finally, in Section VII
we deploy these optimally quantized QKeras models on
FPGAs and evaluate their performance.
II. MOTIVATION
The hardware triggering system in a particle detector at
the CERN LHC is one of the most extreme environments
in which one can imagine deploying DNNs. Latency is restricted to
O(1) µs, governed by the frequency of particle collisions
and the amount of on-detector buffers. The system con-
sists of a limited amount of FPGA resources, all of which
are located in underground caverns 50-100 meters below
the ground surface, working on thousands of different
tasks in parallel. Due to the high number of tasks being
performed, limited cooling capabilities, limited space in
the cavern, and the limited number of processors, algo-
rithms must be kept as resource-economic as possible. In
order to minimize the latency and maximize the precision
of tasks that can be performed in the hardware trigger,
ML solutions are being explored as fast approximations
of the algorithms currently in use. To simplify the im-
plementation of these, a general library for converting
pre-trained ML models into FPGA or application-specific
integrated circuit (ASIC) firmware has been developed,
hls4ml [13]. The package comprises a library of
optimized C++ code for common network layers, which can
be synthesized through a high-level synthesis (HLS) tool.
Converters are provided for multiple model formats, like
TensorFlow [33], Keras [32], PyTorch [34] and ONNX [35].
Although other libraries for the translation of ML mod-
els to FPGA firmware exist, as summarized in Refs. [36–39],
hls4ml targets extreme low-latency inference in order
to stay within the strict constraints of O(1) µs imposed
by the hardware trigger systems. In addition, the unique
aspect of hls4ml is the support for multiple HLS-vendor
backends like Xilinx Vivado HLS, Intel Quartus HLS [40]
and Mentor Catapult HLS [41], all of which are in use at
the LHC experiments. The Vivado HLS backend is the
most advanced and therefore the one used in this paper.
The hls4ml conversion process maps the user-provided
neural network model into a given vendor-specific ab-
straction (like Vivado HLS), with easy-to-use handles
to tune performance. The hls4ml NN architecture is
introduced in Ref. [13]. A model-specific, layer-unrolled ar-
chitecture is used in order to produce ultra low latency,
resource efficient inference engines for particle physics.
Computation for each NN layer is carried out in distinct
hardware elements of the target device, which allows for
high computational throughput through the layer pipeline,
as well as fine-grained configuration of each layer (includ-
ing quantization). A simple handle, named “Reuse Factor”,
enables users to control the parallelization of the com-
putation, again at a per-layer level. In the fully parallel
model, using a Reuse Factor of 1, each individual mul-
tiplication of the NN layers is carried out on different
resources (whether FPGA DSPs or LUTs). With a Reuse
Factor greater than 1, multiplication elements are reused
sequentially to reduce the resource cost, at the expense
of latency and throughput. This simple handle enables
rapid design space exploration as well as configurability to
target specific constraints in available resources, latency,
and throughput.
In addition, the data access at the NN input and output,
as well as the data movement between NN layers, can be
configured to be fully parallel or fully serial. The former
option is used to target ultra low latency, high throughput
inference in the real-time processing of particle physics
experiments, while the latter can be used to fit larger NN
models within the available FPGA resources when ultra
low latency is not as much of a constraint.
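As a concrete illustration of these handles, the sketch below builds a per-layer hls4ml configuration for a Keras model and adjusts the Reuse Factor of a single layer. The layer name 'fc1' is a placeholder, and the exact configuration keys may differ between hls4ml versions:

import hls4ml

# 'model' is a pre-trained Keras model.
# granularity='name' creates one configuration entry per named layer.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Fully parallel by default: each multiplication gets its own resource.
config['Model']['ReuseFactor'] = 1

# Reuse each multiplier of the layer 'fc1' four times, trading latency
# and throughput for a smaller resource footprint.
config['LayerName']['fc1']['ReuseFactor'] = 4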
The hls4ml library is implemented as a Python pack-
age to facilitate ease of use for non-experts, as well as
consistency with other popular Deep Learning libraries.
The first step in the conversion into FPGA firmware
consists of translating a given model into an internal rep-
resentation of the network graph. During this conversion,
user-specified optimization configurations are attached to
the model, such as the choice of quantization and paralleli-
sation. The internal representation is written out into an
HLS project, assigning the appropriate layers of the target
NN and the user configuration. This HLS project can
then be synthesized with the FPGA vendor tools, generat-
ing an IP core that can be used in the target application.
Many commonly used NN layers are supported: Dense;
Convolution; BatchNormalization; and several Activation
layers. In addition, domain specific layers can be easily
added, one example being compressed distance-weighted
graph networks [42].
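The conversion and synthesis flow described above can be driven entirely from Python. The following is a minimal sketch, assuming 'model' is a Keras model and 'config' is an hls4ml configuration dictionary such as the one built in the previous sketch; the output directory and FPGA part number are placeholders:

import hls4ml

# Translate the Keras model into an HLS project with the chosen configuration.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    io_type='io_parallel',          # fully parallel data movement between layers
    output_dir='hls_prj',           # placeholder project directory
    part='xcvu9p-flgb2104-2-e')     # placeholder Xilinx FPGA part

# Run the vendor HLS tool (here Vivado HLS) to synthesize the project into an IP core.
hls_model.build(csim=False, synth=True)

# Summarize the latency and resource usage estimates from the generated reports.
hls4ml.report.read_vivado_report('hls_prj')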
In hls4ml, the precision used to represent weights,
biases, activations, and other components is configurable
through post-training quantization, replacing the floating
point values by lower precision fixed-point ones. This
allows compression of the model size, but to some extent
sacrifices accuracy. Recently, support for binary and
ternary precision DNNs [43] trained quantization-aware
has been included in the library. This greatly reduces
the model size, but requiring such an extremely low
precision for each parameter type sacrifices accuracy and
generalization.
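For example, the fixed-point precision can be set globally or per layer (and per parameter type) in the configuration dictionary, using the Vivado HLS ap_fixed<total bits, integer bits> notation. The layer name below is illustrative and the exact key structure varies between hls4ml versions:

import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model']['Precision'] = 'ap_fixed<16,6>'   # model-wide default: 16 bits, 6 integer
# Per-layer, per-parameter overrides for a layer named 'fc1':
config['LayerName']['fc1']['Precision']['weight'] = 'ap_fixed<14,6>'
config['LayerName']['fc1']['Precision']['bias'] = 'ap_fixed<14,6>'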
As demonstrated in Refs. [28–31], mixed-precision quan-
tization, i.e. keeping some layers at higher precision and
some at lower precision, is a promising approach to achieve
smaller models with high accuracy. However, finding the
optimal heterogeneous quantization per layer and per

parameter type, hereby referred to as quantization config-
uration, is extremely challenging, with the search space
increasing exponentially with the number of layers in a
model [30]. A solution for finding the mixed quantization
configuration that yields the best generalization and accuracy
using the Hessian spectrum is proposed in Ref. [30]. For ML
applications in hardware triggering systems, the resources
at one's disposal, as well as the minimum tolerable
model accuracy, are usually known. Finding the best
model for a given task is, therefore, a fine compromise
between the desired model compression and accuracy with
respect to the floating point based model. Both factors
must be considered when tuning quantization. The goal
of this work is hence to provide a method for finding the
optimal mixed-precision configuration for a given model,
accounting for both the desired model size and accuracy
when optimizing the architecture, and to transform these
into highly parallel firmware for ultra low-latency infer-
ence on chip.
III. RELATED WORK
Closely related to the work presented here are the
FINN [44] and FINN-R [45] frameworks from Xilinx Research
Labs, which aim to deploy quantized neural networks
on Xilinx FPGAs. The same group has also
developed a library for quantization-aware training, Brevitas [46],
based on PyTorch model formats. The LogicNets design flow [47],
also from Xilinx Research Labs,
allows for the training of quantized DNNs that map to
highly efficient Xilinx FPGA implementations. A comparison
between the approach presented here and LogicNets
is provided in Section VII. The FP-DNN [48] framework
takes TensorFlow [33]-described DNNs as input and
maps them onto FPGAs. The open-source alternative
DNNWeaver [49] automatically generates accelerator Verilog
code using optimized templates. Other frameworks
focusing on the mapping of convolutional architectures
onto efficient hardware designs include Snowflake [50],
fpgaConvNet [51–53] and Ref. [54]. For other work on
FPGA DNN inference, we refer to the recent surveys in
Refs. [36–39, 55]. TensorFlow Lite [56] is a set of tools
for on-device inference with low latency and small binary
sizes, targeting mobile, embedded and internet of things
(IoT) devices. Currently, TensorFlow Lite supports de-
ployment on Android and iOS devices, embedded Linux,
and microcontrollers.
Our approach differs from those above in its emphasis
on being a multi-backend tool, embracing a fully on-
chip design to target the microsecond latency imposed in
physics experiments. The hls4ml library is completely
open-source, and aims to provide domain scientists with
easy-to-use software for deploying highly efficient ML
algorithms on hardware.
In HAQ [28], a hardware-aware automated framework
for quantization is introduced. The automated procedure
consists of computing the curvature of the weight
space of a layer, assuming a low curvature will require a
lower bit-precision for the weights. Our approach differs
from HAQ by combining reduced bit-precision with filter
or neuron unit tuning, where the number of filters or
neurons can be automatically tuned during the scan. In
this case, the problem becomes highly non-linear, and we
therefore take advantage of an AutoML-type approach.
A Bayesian optimization or randomized search is per-
formed to find a solution that encompasses the precision
used for the weights and activations, and the number of
units or filters of the layer.
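A sketch of how such a search can be launched with AutoQKeras is shown below. The argument names follow the AutoQKeras examples but may differ slightly between QKeras versions, and the quantizer choices, per-layer limits, goal specification and trial budget are illustrative only:

from qkeras.autoqkeras import AutoQKeras

# Candidate quantizers that the search may assign to kernels, biases and activations.
quantization_config = {
    "kernel": {"binary": 1, "ternary": 2,
               "quantized_bits(4,0,1)": 4, "quantized_bits(8,0,1)": 8},
    "bias": {"quantized_bits(4,0,1)": 4, "quantized_bits(8,0,1)": 8},
    "activation": {"quantized_relu(3,1)": 3, "quantized_relu(4,2)": 4,
                   "quantized_relu(8,4)": 8},
    "linear": {"quantized_bits(8,0,1)": 8},
}

# Maximum bit widths allowed per layer type: [kernel, bias, activation].
limit = {"Dense": [8, 8, 8], "Activation": [8]}

autoqk = AutoQKeras(
    model,                                   # pre-trained Keras model
    metrics=["acc"],
    goal={"type": "bits",                    # forgiving factor: trade bits vs accuracy
          "params": {"delta_p": 8.0, "delta_n": 8.0, "rate": 2.0, "stress": 1.0,
                     "input_bits": 8, "output_bits": 8, "ref_bits": 8,
                     "config": {"default": ["parameters", "activations"]}}},
    quantization_config=quantization_config,
    limit=limit,
    tune_filters="layer",                    # also tune the number of units per layer
    tune_filters_exceptions="^output",       # except layers whose name matches this pattern
    mode="bayesian",                         # or "random" for a randomized search
    max_trials=20,
)
autoqk.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
best_model = autoqk.get_best_model()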
IV. PARTICLE IDENTIFICATION IN THE
HARDWARE TRIGGER
A crucial task performed by the trigger system that
could be greatly improved by an ML algorithm, both in
terms of latency and accuracy, is the identification and
classification of particles coming from each proton-proton
collision. In this paper, we analyze the publicly available
dataset introduced in Refs. [13, 57]. Here, a dataset [58]
for the discrimination of jets, collimated sprays of particles
stemming from the decay and/or hadronization
of five different particles was presented. It consists of
quark (q), gluon (g), W boson, Z boson, and top (t)
jets, each represented by 16 physics-motivated high-level
features. In Ref. [13], this data set was used to train
a DNN for deployment on a Xilinx FPGA. This model
was compressed through post-training quantization in
order to further reduce the FPGA resource consump-
tion and provides a baseline to measure the benefits of
quantization-aware training with heterogeneous quantiza-
tion, over post-training quantization.
Adopting the same architecture as in Ref. [13], we use a
fully-connected neural network consisting of three hidden
layers (64, 32, and 32 nodes, respectively) with ReLU
activation functions, shown in Fig. II. The output layer
has five nodes, yielding a probability for each of the five
classes through a Softmax activation function. The model
definition in TensorFlow Keras is given in Listing 1.
As in Ref. [13], the weights of this network are homogeneously
quantized post-training to a fixed-point precision
yielding the best compromise between accuracy, latency,
and resource consumption. This is found to be a fixed-point
precision, or bit width, of 14 bits with 6 integer bits,
in the following referred to as ⟨14, 6⟩. We refer to this
configuration as the baseline full model (BF). We then
train a second pruned version of the BF model, hereby
referred to as baseline pruned (BP). This model has 70%
of its weights set to zero through an iterative process
where small weights are removed using the TensorFlow
Pruning API [59], following what was done in Ref. [13].
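This pruning procedure can be reproduced with the TensorFlow Model Optimization toolkit, the current home of the TensorFlow Pruning API. The sketch below ramps the sparsity of a Keras model up to the 70% target; the schedule parameters and training settings are illustrative and not necessarily those used in Ref. [13]:

import tensorflow_model_optimization as tfmot

# Gradually zero out the smallest-magnitude weights until 70% of them are removed.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.70,
    begin_step=2000, end_step=10000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer='adam',
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])
pruned_model.fit(x_train, y_train, epochs=30,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers so the sparse model can be passed to hls4ml.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)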
Pruning in this way reduces the model size and resource consumption
significantly, as all zero-multiplications are excluded during
the firmware implementation. We then create one hetero-
geneously quantized version of the BP model, where each
layer is quantized independently post-training to yield

FIG. II. Model architecture for the fully-connected NN architecture under study. The numbers in brackets are the precisions
used for each layer, quoted as ⟨B, I⟩, where B is the precision in bits and I the number of integer bits. When different precision
is used for weights and biases, the quantization is listed as w and b, respectively. These have been obtained using the per-layer,
per-parameter type automatic quantization procedure described in Section VI.
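For orientation, assuming the Vivado HLS ap_fixed convention used by hls4ml, in which the integer field I includes the sign bit, a value quantized as ⟨B, I⟩ has B − I fractional bits, giving

    resolution = 2^-(B-I)   and   range = [-2^(I-1), 2^(I-1)).

For example, the ⟨14, 6⟩ precision used for the baseline model corresponds to a resolution of 2^-8 ≈ 0.004 over a range of [-32, 32).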
TABLE I. Per-layer quantization for the different baseline models (quantized post-training). When different precision is used for
weights and biases, the quantization is listed as w and b, respectively.

Model  | Dense            | ReLU   | Dense  | ReLU   | Dense  | ReLU   | Dense            | Softmax
BF/BP  | ⟨14,6⟩           | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩           | ⟨14,6⟩
BH     | w:⟨8,3⟩ b:⟨4,2⟩  | ⟨13,7⟩ | ⟨7,2⟩  | ⟨10,5⟩ | ⟨5,2⟩  | ⟨8,4⟩  | w:⟨7,3⟩ b:⟨4,1⟩  | ⟨16,6⟩
Listing 1. TensorFlow Keras model definition.
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import Model

inputs = Input((16,))
x = Dense(64)(inputs)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(32)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(32)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(5)(x)
outputs = Activation('softmax')(x)
model = Model(inputs, outputs)
the highest accuracy possible at the lowest resource cost.
We start with an initial configuration of the model quan-
tization using a wide bit-width, then iteratively reduce
the bit-width until reaching a threshold in accuracy loss
relative to the initial floating-point model, evaluated on
the training set. We iterate over the model in layer order,
finding the appropriate precision for weights, biases, and
output of a given layer, before moving to the next. We
apply a stricter accuracy threshold for earlier layers,
since each round of precision reduction only degrades the
accuracy. In this case we restrict to a 1% accuracy differ-
ence in the first layer, loosening to 2% for the final layer.
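A compact sketch of this greedy scan is given below. The helper names and the way the accuracy is evaluated are illustrative placeholders; in the paper the precisions of weights, biases, and layer outputs are tuned separately, which is simplified here to a single bit width per layer:

# Greedy per-layer reduction of post-training quantization bit widths.
# 'evaluate' is assumed to return the accuracy of the model quantized
# according to 'config'; it is a placeholder for the user's evaluation code.
def reduce_precision(layers, evaluate, baseline_acc, start_bits=16,
                     first_tol=0.01, last_tol=0.02):
    config = {layer: start_bits for layer in layers}
    n = len(layers)
    for i, layer in enumerate(layers):  # scan layers in network order
        # Threshold loosens from 1% (first layer) to 2% (last layer).
        tol = first_tol + (last_tol - first_tol) * i / max(n - 1, 1)
        while config[layer] > 1:
            config[layer] -= 1          # try one bit fewer for this layer
            if baseline_acc - evaluate(config) > tol:
                config[layer] += 1      # too much accuracy loss: revert and stop
                break
    return config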
The resulting model is referred to as the baseline heterogeneous
(BH) model. A summary of the per-layer quantizations for
the baselines is given in Table I. From Ref. [13], we know
that a post-training quantization of this model results in
a degradation in model accuracy. The smaller the model
footprint is made through post-training quantization, the
larger the accuracy degradation becomes. To overcome
this, we develop a novel library that, through minimal
code changes, allows us to create deep heterogeneously
quantized versions of Keras models, trained quantization-
aware. In addition, as the amount of available resources
on chip is known in advance, we want to find the optimal
model for a given use-case allowing a trade-off between
