
Automatic heterogeneous quantization of deep neural networks for low-latency
inference on the edge for particle detectors
Claudionor N. Coelho Jr.
Palo Alto Networks (California, USA)
Aki Kuusela, Shan Li, and Hao Zhuang
Google LLC (California, USA)
Thea Aarrestad,
Vladimir Loncar,
Maurizio Pierini, Adrian Alan Pol, and Sioni Summers
European Organization for Nuclear Research (CERN) (Geneva, Switzerland)
Jennifer Ngadiuba
California Institute of Technology (Caltech) (California, USA)
(Dated: June 22, 2021)
Although the quest for more accurate solutions is pushing deep learning research towards larger
and more complex algorithms, edge devices demand efficient inference and therefore reduction in
model size, latency and energy consumption. One technique to limit model size is quantization,
which implies using fewer bits to represent weights and biases. Such an approach usually results
in a decline in performance. Here, we introduce a method for designing optimally heterogeneously
quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond
inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic
quantization procedure, sampling from a wide range of quantizers, model energy consumption and
size are minimized while high accuracy is maintained. This is crucial for the event selection procedure
in proton–proton collisions at the CERN Large Hadron Collider, where resources are strictly limited
and a latency of O(1) µs is required. Nanosecond inference and a resource consumption reduced by
a factor of 50 when implemented on field-programmable gate array hardware are achieved.
FIG. I. An ultra-compressed deep neural network for particle identification on a Xilinx FPGA.
E-mail: thea.aarrestad@cern.ch
Also at Institute of Physics Belgrade, Serbia.
arXiv:2006.10159v3 [physics.ins-det] 21 Jun 2021

I. INTRODUCTION
With edge computing, real-time inference of deep neural
networks (DNNs) on custom hardware has become increas-
ingly relevant. Smartphone companies are incorporating
Artificial Intelligence (AI) chips in their design for on-
device inference to improve user experience and tighten
data security, and the autonomous vehicle industry is
turning to application-specific integrated circuits (ASICs)
to keep the latency low. While the typical acceptable
latency for real-time inference in applications like those
above is O(1) ms [1, 2], other applications may require
sub-microsecond inference. For instance, high-frequency
trading machine learning (ML) algorithms are running
on field-programmable gate arrays (FPGAs) to make de-
cisions within nanoseconds [3]. At the extreme end of the inference
spectrum, combining both low latency (as in high-frequency
trading) and limited area (as in smartphone applications),
is the processing of data from proton-proton collisions at
the Large Hadron Collider (LHC) at CERN [4]. In the
particle detectors around the LHC ring, tens of terabytes
of data per second are produced from collisions occurring
every 25 ns. This extremely large data rate is reduced by
a real-time event filter processing system, the trigger,
which decides whether each discrete collision event should
be kept for further analysis or be discarded. Data is
buffered close to the detector while the processing occurs,
with a maximum latency of O(1) µs to make the trigger
decision. High selection accuracy in the trigger is crucial
in order to keep only the most interesting events while
keeping the output bandwidth low, reducing the event
rate from 40 MHz to 100 kHz. In 2027 the LHC will
be upgraded from its current state, capable of producing
up to one billion proton-proton collisions per second, to
the so-called High-Luminosity LHC (HL-LHC) [5]. This
will involve increasing the number of proton collisions
occurring every second by a factor of five to seven, ulti-
mately resulting in a total amount of accumulated data
one order of magnitude higher than what is possible with
the current collider. With this extreme increase, ML so-
lutions are being explored as fast approximations of the
algorithms currently in use to minimize the latency and
maximize the precision of tasks that can be performed.
Hardware used for real-time inference in particle detec-
tors usually has limited computational capacity due to
size constraints. Incorporating resource-intensive models
without a loss in performance poses a great challenge. In
recent years, many developments have aimed at providing efficient
inference from the algorithmic point of view. This
includes compact network design [6–10], weight and filter
pruning [11, 12] or quantization. In post-training quantization [13–17]
the pre-trained model parameters are
translated into lower precision equivalents. However, this
process is, by definition, lossy and sacrifices model per-
formance. Therefore, solutions for quantization-aware
training have been suggested [18–27]. In these, a fixed
numerical representation is adopted for the whole model,
and the model training is performed enforcing this con-
straint during weight optimization. More recently, it has been
argued [28–31] that some layers may be more accommodating
of aggressive quantization, whereas others may require
more expensive arithmetic. This suggests that per-layer
heterogeneous quantization is the optimal way to achieve
higher accuracy at low resource cost, although it might
require further specialization of the hardware resources.
In this paper, we introduce a novel workflow for find-
ing the optimal heterogeneous quantization per layer and
per parameter type for a given model, and deploying
that model on FPGA hardware. Through minimal code
changes, the model footprint is minimized while retain-
ing high accuracy, and then translated into low-latency
firmware. This paper makes the following contributions:
We have implemented a range of quantization meth-
ods in a common library, which provide a broad
base from which optimal quantizations can easily
be sampled;
We introduce a novel method for finding the optimal
heterogeneous quantization for a given model, re-
sulting in minimum area or minimum power DNNs
while maintaining high accuracy;
We have made these methods available online
in easy-to-use libraries, called QKeras and AutoQKeras¹
(a brief usage sketch is given below), where simple drop-in replacement of
Keras [32] layers makes it straightforward for users
to transform Keras models to their equivalent
deep heterogeneously quantized versions, which are
trained quantization-aware. Using AutoQKeras, a
user can trade off accuracy against model size reduction
(e.g., area or energy);
We have added support for quantized QKeras mod-
els in the hls4ml library [13], which converts these
pre-trained quantized models into highly-parallel
FPGA firmware for ultra low-latency inference.
To demonstrate the significant practical advantages of
these tools for high-energy physics and other edge-inference applications:
We conduct an experiment consisting of classify-
ing events in an extreme environment, namely the
triggering of proton-proton collisions at the CERN
LHC, where resources are limited and a maximum
latency of O(1) µs is imposed;
We show that inference within 60 ns and a reduction
of the model resource consumption by a factor of 50
can be achieved through automatic heterogeneous
quantization, while maintaining similar accuracy
(within 3% of the floating point model accuracy);
We show that the original floating point model ac-
curacy can be maintained for homogeneously quan-
tized DNNs down to a bit width of six while re-
ducing resource consumption by up to 75% through
quantization-aware training with QKeras.
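As an illustration of the drop-in replacement mentioned above, the sketch below shows how the first fully-connected block of a Keras model could be rewritten with QKeras quantizers. The ⟨4, 0⟩ bit widths and the layer size are illustrative choices for this sketch, not the optimized configurations reported later in the paper:

from tensorflow.keras.layers import Input
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

inputs = Input((16,))
# QDense replaces Dense; weights and biases are quantized during training.
x = QDense(64,
           kernel_quantizer=quantized_bits(4, 0, alpha=1),
           bias_quantizer=quantized_bits(4, 0, alpha=1))(inputs)
# QActivation replaces Activation with a quantized non-linearity.
x = QActivation(quantized_relu(4, 0))(x)

Training then proceeds exactly as for the corresponding Keras model, with the quantization applied in the forward pass so that the network is trained quantization-aware.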
The proposed pipeline provides a novel, automatic end-to-
end flow for deploying ultra low latency, low-area DNNs
¹ https://github.com/google/qkeras

on chip. This will be crucial for the deployment of ML
models on FPGAs in particle detectors and other fields
with extreme inference and low-power requirements.
The remainder of the paper is organized as follows.
In Section II we discuss previous work related to model
quantization and model compression with a focus on particle
detectors. Section III reviews related frameworks for deploying
DNNs on FPGAs, and Section IV introduces the particle
identification task used as a benchmark. In Section V we
introduce QKeras, the novel library for training ultra low-latency,
optimally heterogeneously quantized DNNs. Section VI describes the
procedure of automatic quantization for optimizing model
size and accuracy simultaneously. Finally, in Section VII
we deploy these optimally quantized QKeras models on
FPGAs and evaluate their performance.
II. MOTIVATION
The hardware triggering system in a particle detector at
the CERN LHC is one of the most extreme environments
in which one can imagine deploying DNNs. Latency is restricted to
O(1) µs, governed by the frequency of particle collisions
and the amount of on-detector buffers. The system con-
sists of a limited amount of FPGA resources, all of which
are located in underground caverns 50-100 meters below
the ground surface, working on thousands of different
tasks in parallel. Due to the high number of tasks being
performed, limited cooling capabilities, limited space in
the cavern, and the limited number of processors, algo-
rithms must be kept as resource-economic as possible. In
order to minimize the latency and maximize the precision
of tasks that can be performed in the hardware trigger,
ML solutions are being explored as fast approximations
of the algorithms currently in use. To simplify the im-
plementation of these, a general library for converting
pre-trained ML models into FPGA or application-specific
integrated circuit (ASIC) firmware has been developed,
hls4ml [13]. The package comprises a library of
optimized C++ code for common network layers, which can
be synthesized through a high-level synthesis (HLS) tool.
Converters are provided for multiple model formats, like
TensorFlow [33], Keras [32], PyTorch [34] and ONNX [35].
Although other libraries for the translation of ML mod-
els to FPGA firmware exist, as summarized in Refs. [36–39],
hls4ml targets extreme low-latency inference in order
to stay within the strict constraints of O(1) µs imposed
by the hardware trigger systems. In addition, the unique
aspect of hls4ml is the support for multiple HLS-vendor
backends like Xilinx Vivado HLS, Intel Quartus HLS [40]
and Mentor Catapult HLS [41], all of which are in use at
the LHC experiments. The Vivado HLS backend is the
most advanced and therefore the one used in this paper.
The hls4ml conversion process maps the user-provided
neural network model into a given vendor-specific ab-
straction (like Vivado HLS), with easy-to-use handles
to tune performance. The hls4ml NN architecture is
introduced in Ref. [13]. A model-specific, layer-unrolled ar-
chitecture is used in order to produce ultra low latency,
resource efficient inference engines for particle physics.
Computation for each NN layer is carried out in distinct
hardware elements of the target device, which allows for
high computational throughput through the layer pipeline,
as well as fine-grained configuration of each layer (includ-
ing quantization). A simple handle, named “Reuse Factor”,
enables users to control the parallelization of the com-
putation, again at a per-layer level. In the fully parallel
model, using a Reuse Factor of 1, each individual mul-
tiplication of the NN layers is carried out on different
resources (whether FPGA DSPs or LUTs). With a Reuse
Factor greater than 1, multiplication elements are reused
sequentially to reduce the resource cost, at the expense
of latency and throughput. This simple handle enables
rapid design space exploration as well as configurability to
target specific constraints in available resources, latency,
and throughput.
In addition, the data access at the NN input and output,
as well as the data movement between NN layers, can be
configured to be fully parallel or fully serial. The former
option is used to target ultra low latency, high throughput
inference in the real-time processing of particle physics
experiments, while the latter can be used to fit larger NN
models within the available FPGA resources when ultra
low latency is not as much of a constraint.
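As a concrete illustration of these handles, the sketch below builds a per-layer hls4ml configuration for a Keras model and adjusts the Reuse Factor of a single layer. The layer name 'fc1' is a placeholder, and the exact configuration keys may differ between hls4ml versions:

import hls4ml

# 'model' is a pre-trained Keras model.
# granularity='name' creates one configuration entry per named layer.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Fully parallel by default: each multiplication gets its own resource.
config['Model']['ReuseFactor'] = 1

# Reuse each multiplier of the layer 'fc1' four times, trading latency
# and throughput for a smaller resource footprint.
config['LayerName']['fc1']['ReuseFactor'] = 4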
The hls4ml library is implemented as a Python pack-
age to facilitate ease of use for non-experts, as well as
consistency with other popular Deep Learning libraries.
The first step in the conversion into FPGA firmware
consists of translating a given model into an internal rep-
resentation of the network graph. During this conversion,
user-specified optimization configurations are attached to
the model, such as the choice of quantization and paralleli-
sation. The internal representation is written out into an
HLS project, assigning the appropriate layers of the target
NN and the user configuration. This HLS project can
then be synthesized with the FPGA vendor tools, generat-
ing an IP core that can be used in the target application.
Many commonly used NN layers are supported: Dense;
Convolution; BatchNormalization; and several Activation
layers. In addition, domain specific layers can be easily
added, one example being compressed distance-weighted
graph networks [42].
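The conversion and synthesis flow described above can be driven entirely from Python. The following is a minimal sketch, assuming 'model' is a Keras model and 'config' is an hls4ml configuration dictionary such as the one built in the previous sketch; the output directory and FPGA part number are placeholders:

import hls4ml

# Translate the Keras model into an HLS project with the chosen configuration.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    io_type='io_parallel',          # fully parallel data movement between layers
    output_dir='hls_prj',           # placeholder project directory
    part='xcvu9p-flgb2104-2-e')     # placeholder Xilinx FPGA part

# Run the vendor HLS tool (here Vivado HLS) to synthesize the project into an IP core.
hls_model.build(csim=False, synth=True)

# Summarize the latency and resource usage estimates from the generated reports.
hls4ml.report.read_vivado_report('hls_prj')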
In hls4ml, the precision used to represent weights,
biases, activations, and other components is configurable
through post-training quantization, replacing the floating
point values by lower precision fixed-point ones. This
allows compression of the model size, but to some extent
sacrifices accuracy. Recently, support for binary and
ternary precision DNNs [43] trained quantization-aware
has been included in the library. This greatly reduces
the model size, but requiring such an extremely low
precision for each parameter type sacrifices accuracy and
generalization.
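For example, the fixed-point precision can be set globally or per layer (and per parameter type) in the configuration dictionary, using the Vivado HLS ap_fixed<total bits, integer bits> notation. The layer name below is illustrative and the exact key structure varies between hls4ml versions:

import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model']['Precision'] = 'ap_fixed<16,6>'   # model-wide default: 16 bits, 6 integer
# Per-layer, per-parameter overrides for a layer named 'fc1':
config['LayerName']['fc1']['Precision']['weight'] = 'ap_fixed<14,6>'
config['LayerName']['fc1']['Precision']['bias'] = 'ap_fixed<14,6>'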
As demonstrated in Refs. [28–31], mixed-precision quan-
tization, i.e. keeping some layers at higher precision and
some at lower precision, is a promising approach to achieve
smaller models with high accuracy. However, finding the
optimal heterogeneous quantization per layer and per

parameter type, hereby referred to as quantization config-
uration, is extremely challenging, with the search space
increasing exponentially with the number of layers in a
model [30]. A solution for finding the mixed quantization
configuration that yields the best generalization and accuracy
using the Hessian spectrum is proposed in Ref. [30]. For ML
applications in hardware triggering systems, the resources
at one's disposal, as well as the minimum tolerable
model accuracy, are usually known. Finding the best
model for a given task is, therefore, a fine compromise
between the desired model compression and accuracy with
respect to the floating point based model. Both factors
must be considered when tuning quantization. The goal
of this work is hence to provide a method for finding the
optimal mixed-precision configuration for a given model,
accounting for both the desired model size and accuracy
when optimizing the architecture, and to transform these
into highly parallel firmware for ultra low-latency infer-
ence on chip.
III. RELATED WORK
Closely related to the work presented here are the
FINN [44] and FINN-R [45] frameworks from Xilinx Research
Labs, which aim to deploy quantized neural networks
on Xilinx FPGAs. The same group has also
developed a library for quantization-aware training, Brevitas [46],
based on PyTorch model formats. The LogicNets design flow [47],
also from Xilinx Research Labs,
allows for the training of quantized DNNs that map to
highly efficient Xilinx FPGA implementations. A comparison
between the approach presented here and LogicNets
is provided in Section VII. The FP-DNN [48] framework
takes TensorFlow [33]-described DNNs as input and
maps them onto FPGAs. The open-source alternative
DNNWeaver [49] automatically generates accelerator Verilog
code using optimized templates. Other frameworks
focusing on the mapping of convolutional architectures
onto efficient hardware designs include Snowflake [50],
fpgaConvNet [51–53] and Ref. [54]. For other work on
FPGA DNN inference, we refer to the recent surveys in
Refs. [36–39, 55]. TensorFlow Lite [56] is a set of tools
for on-device inference with low latency and small binary
sizes, targeting mobile, embedded and internet of things
(IoT) devices. Currently, TensorFlow Lite supports de-
ployment on Android and iOS devices, embedded Linux,
and microcontrollers.
Our approach differs from those above in its emphasis
on being a multi-backend tool, embracing a fully on-
chip design to target the microsecond latency imposed in
physics experiments. The hls4ml library is completely
open-source, and aims to provide domain scientists with
easy-to-use software for deploying highly efficient ML
algorithms on hardware.
In HAQ [28], a hardware-aware automated framework
for quantization is introduced. The automated procedure
consists of computing the curvature of the weight
space of a layer, assuming a low curvature will require a
lower bit-precision for the weights. Our approach differs
from HAQ by combining reduced bit-precision with filter
or neuron unit tuning, where the number of filters or
neurons can be automatically tuned during the scan. In
this case, the problem becomes highly non-linear, and we
therefore take advantage of an AutoML-type approach.
A Bayesian optimization or randomized search is per-
formed to find a solution that encompasses the precision
used for the weights and activations, and the number of
units or filters of the layer.
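A sketch of how such a search can be launched with AutoQKeras is shown below. The argument names follow the AutoQKeras examples but may differ slightly between QKeras versions, and the quantizer choices, per-layer limits, goal specification and trial budget are illustrative only:

from qkeras.autoqkeras import AutoQKeras

# Candidate quantizers that the search may assign to kernels, biases and activations.
quantization_config = {
    "kernel": {"binary": 1, "ternary": 2,
               "quantized_bits(4,0,1)": 4, "quantized_bits(8,0,1)": 8},
    "bias": {"quantized_bits(4,0,1)": 4, "quantized_bits(8,0,1)": 8},
    "activation": {"quantized_relu(3,1)": 3, "quantized_relu(4,2)": 4,
                   "quantized_relu(8,4)": 8},
    "linear": {"quantized_bits(8,0,1)": 8},
}

# Maximum bit widths allowed per layer type: [kernel, bias, activation].
limit = {"Dense": [8, 8, 8], "Activation": [8]}

autoqk = AutoQKeras(
    model,                                   # pre-trained Keras model
    metrics=["acc"],
    goal={"type": "bits",                    # forgiving factor: trade bits vs accuracy
          "params": {"delta_p": 8.0, "delta_n": 8.0, "rate": 2.0, "stress": 1.0,
                     "input_bits": 8, "output_bits": 8, "ref_bits": 8,
                     "config": {"default": ["parameters", "activations"]}}},
    quantization_config=quantization_config,
    limit=limit,
    tune_filters="layer",                    # also tune the number of units per layer
    tune_filters_exceptions="^output",       # except layers whose name matches this pattern
    mode="bayesian",                         # or "random" for a randomized search
    max_trials=20,
)
autoqk.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
best_model = autoqk.get_best_model()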
IV. PARTICLE IDENTIFICATION IN THE
HARDWARE TRIGGER
A crucial task performed by the trigger system that
could be greatly improved by an ML algorithm, both in
terms of latency and accuracy, is the identification and
classification of particles coming from each proton-proton
collision. In this paper, we analyze the publicly available
dataset introduced in Refs. [13, 57]. Here, a dataset [58]
for the discrimination of jets, collimated sprays of particles
stemming from the decay and/or hadronization
of five different particles was presented. It consists of
quark (q), gluon (g), W boson, Z boson, and top (t)
jets, each represented by 16 physics-motivated high-level
features. In Ref. [13], this data set was used to train
a DNN for deployment on a Xilinx FPGA. This model
was compressed through post-training quantization in
order to further reduce the FPGA resource consump-
tion and provides a baseline to measure the benefits of
quantization-aware training with heterogeneous quantiza-
tion, over post-training quantization.
Adopting the same architecture as in Ref. [13], we use a
fully-connected neural network consisting of three hidden
layers (64, 32, and 32 nodes, respectively) with ReLU
activation functions, shown in Fig. II. The output layer
has five nodes, yielding a probability for each of the five
classes through a Softmax activation function. The model
definition in TensorFlow Keras is given in Listing 1.
As in Ref. [13], the weights of this network are homogeneously
quantized post-training to a fixed-point precision
yielding the best compromise between accuracy, latency,
and resource consumption. This is found to be a fixed-point
precision, or bit width, of 14 bits with 6 integer bits,
in the following referred to as ⟨14, 6⟩. We refer to this
configuration as the baseline full model (BF). We then
train a second pruned version of the BF model, hereby
referred to as baseline pruned (BP). This model has 70%
of its weights set to zero through an iterative process
where small weights are removed using the TensorFlow
Pruning API [59], following what was done in Ref. [13].
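This pruning procedure can be reproduced with the TensorFlow Model Optimization toolkit, the current home of the TensorFlow Pruning API. The sketch below ramps the sparsity of a Keras model up to the 70% target; the schedule parameters and training settings are illustrative and not necessarily those used in Ref. [13]:

import tensorflow_model_optimization as tfmot

# Gradually zero out the smallest-magnitude weights until 70% of them are removed.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.70,
    begin_step=2000, end_step=10000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer='adam',
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])
pruned_model.fit(x_train, y_train, epochs=30,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers so the sparse model can be passed to hls4ml.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)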
Pruning in this way reduces the model size and resource consumption
significantly, as all zero-multiplications are excluded during
the firmware implementation. We then create one hetero-
geneously quantized version of the BP model, where each
layer is quantized independently post-training to yield

FIG. II. Model architecture for the fully-connected NN architecture under study. The numbers in brackets are the precisions
used for each layer, quoted as ⟨B, I⟩, where B is the precision in bits and I the number of integer bits. When different precision
is used for weights and biases, the quantization is listed as w and b, respectively. These have been obtained using the per-layer,
per-parameter type automatic quantization procedure described in Section VI.
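For orientation, assuming the Vivado HLS ap_fixed convention used by hls4ml, in which the integer field I includes the sign bit, a value quantized as ⟨B, I⟩ has B − I fractional bits, giving

    resolution = 2^-(B-I)   and   range = [-2^(I-1), 2^(I-1)).

For example, the ⟨14, 6⟩ precision used for the baseline model corresponds to a resolution of 2^-8 ≈ 0.004 over a range of [-32, 32).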
TABLE I. Per-layer quantization for the different baseline models (quantized post-training). When different precision is used for
weights and biases, the quantization is listed as w and b, respectively.

Model  | Dense            | ReLU   | Dense  | ReLU   | Dense  | ReLU   | Dense            | Softmax
BF/BP  | ⟨14,6⟩           | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩ | ⟨14,6⟩           | ⟨14,6⟩
BH     | w:⟨8,3⟩ b:⟨4,2⟩  | ⟨13,7⟩ | ⟨7,2⟩  | ⟨10,5⟩ | ⟨5,2⟩  | ⟨8,4⟩  | w:⟨7,3⟩ b:⟨4,1⟩  | ⟨16,6⟩
Listing 1. TensorFlow Keras model definition.
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import Model

inputs = Input((16,))
x = Dense(64)(inputs)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(32)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(32)(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Dense(5)(x)
outputs = Activation('softmax')(x)
model = Model(inputs, outputs)
the highest accuracy possible at the lowest resource cost.
We start with an initial configuration of the model quan-
tization using a wide bit-width, then iteratively reduce
the bit-width until reaching a threshold in accuracy loss
relative to the initial floating-point model, evaluated on
the training set. We iterate over the model in layer order,
finding the appropriate precision for weights, biases, and
output of a given layer, before moving to the next. We
apply a stricter accuracy threshold for earlier layers,
since each round of precision reduction only degrades the
accuracy. In this case we restrict to a 1% accuracy differ-
ence in the first layer, loosening to 2% for the final layer.
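A compact sketch of this greedy scan is given below. The helper names and the way the accuracy is evaluated are illustrative placeholders; in the paper the precisions of weights, biases, and layer outputs are tuned separately, which is simplified here to a single bit width per layer:

# Greedy per-layer reduction of post-training quantization bit widths.
# 'evaluate' is assumed to return the accuracy of the model quantized
# according to 'config'; it is a placeholder for the user's evaluation code.
def reduce_precision(layers, evaluate, baseline_acc, start_bits=16,
                     first_tol=0.01, last_tol=0.02):
    config = {layer: start_bits for layer in layers}
    n = len(layers)
    for i, layer in enumerate(layers):  # scan layers in network order
        # Threshold loosens from 1% (first layer) to 2% (last layer).
        tol = first_tol + (last_tol - first_tol) * i / max(n - 1, 1)
        while config[layer] > 1:
            config[layer] -= 1          # try one bit fewer for this layer
            if baseline_acc - evaluate(config) > tol:
                config[layer] += 1      # too much accuracy loss: revert and stop
                break
    return config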
The resulting model is referred to as the baseline heterogeneous
(BH) model. A summary of the per-layer quantizations for
the baselines is given in Table I. From Ref. [13], we know
that a post-training quantization of this model results in
a degradation in model accuracy. The smaller the model
footprint is made through post-training quantization, the
larger the accuracy degradation becomes. To overcome
this, we develop a novel library that, through minimal
code changes, allows us to create deep heterogeneously
quantized versions of Keras models, trained quantization-
aware. In addition, as the amount of available resources
on chip is known in advance, we want to find the optimal
model for a given use-case allowing a trade-off between
