A reconfigurable neural network ASIC for detector
front-end data compression at the HL-LHC
Giuseppe Di Guglielmo, Farah Fahim, Christian Herwig, Manuel Blanco Valentin, Javier Duarte, Cristian Gingu,
Philip Harris, James Hirschauer, Martin Kwok, Vladimir Loncar, Yingyi Luo, Llovizna Miranda, Jennifer
Ngadiuba, Daniel Noonan, Seda Ogrenci-Memik, Maurizio Pierini, Sioni Summers, Nhan Tran
Abstract—Despite advances in the programmable logic ca-
pabilities of modern trigger systems, a significant bottleneck
remains in the amount of data to be transported from the
detector to off-detector logic where trigger decisions are made.
We demonstrate that a neural network autoencoder model can
be implemented in a radiation-tolerant ASIC to perform lossy
data compression, alleviating the data transmission problem while
preserving critical information of the detector energy profile. For
our application, we consider the high-granularity calorimeter
from the CMS experiment at the CERN Large Hadron Collider.
The advantage of the machine learning approach is in the
flexibility and configurability of the algorithm. By changing the
neural network weights, a unique data compression algorithm
can be deployed for each sensor in different detector regions,
and changing detector or collider conditions. To meet area,
performance, and power constraints, we perform quantization-
aware training to create an optimized neural network hardware
implementation. The design is achieved through the use of
high-level synthesis tools and the hls4ml framework, and was
processed through synthesis and physical layout flows based on an LP CMOS 65 nm technology node. The flow anticipates 200 Mrad of ionizing radiation to select gates, and reports a total area of 3.6 mm² and a power consumption of 95 mW. The simulated energy consumption per inference is 2.4 nJ. This is the first radiation-tolerant on-detector ASIC implementation of a neural network that has been designed for particle physics applications.
Index Terms—ASIC, artificial intelligence, autoencoder, LHC,
machine learning, SEE mitigation, high-level synthesis, hardware
accelerator
Manuscript received March 26, 2021. (Corresponding e-mail: farah@fnal.gov)
Farah Fahim, Cristian Gingu, Christian Herwig, James Hirschauer, Llovizna Miranda, and Nhan Tran are with Fermi National Accelerator Laboratory, Batavia, IL, USA, and are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics.
Manuel Blanco Valentin, Farah Fahim, Yingyi Luo, Seda Ogrenci-Memik, and Nhan Tran are with Northwestern University, Evanston, IL, USA.
Giuseppe Di Guglielmo is with Columbia University, New York, NY, USA.
Javier Duarte is with UC San Diego, La Jolla, CA, USA, and is supported by the DOE, Office of Science, Office of High Energy Physics Early Career Research program under Award No. DE-SC0021187.
Philip Harris is with Massachusetts Institute of Technology, Cambridge, MA, USA, and is supported by a Massachusetts Institute of Technology University grant.
Martin Kwok is with Brown University, Providence, RI, USA.
Jennifer Ngadiuba is with California Institute of Technology, Pasadena, CA, USA.
Daniel Noonan is with Florida Institute of Technology, Melbourne, FL, USA.
Vladimir Loncar, Maurizio Pierini, and Sioni Summers are with CERN, Geneva, Switzerland, and are supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772369).
Vladimir Loncar is also with the Institute of Physics Belgrade, Serbia.
I. INTRODUCTION
BREAKTHROUGHS in the precision and speed of sensing
instrumentation drive advances in scientific
methodologies and theories. Thus, a common paradigm across
many scientific disciplines in physics has been to increase
the resolution of the sensing equipment in order to increase
either the robustness or the sensitivity of the experiment itself.
This demand for increasingly higher sensitivity in experiments,
along with advances in the design of state-of-the-art sensing
systems, has resulted in rapidly growing big data pipelines
such that transmission of acquired data to the processing
unit via conventional methods is no longer feasible. Data
transmission is commonly much less efficient than data pro-
cessing. Therefore, placing data compression and processing
as close as possible to data creation while maintaining physics
performance is a crucial task in modern physics experiments.
At the CERN Large Hadron Collider (LHC) and its high
luminosity upgrade (HL-LHC), extreme collision rates present
extreme challenges for data processing and transmission at
multiple stages in detector readout and trigger systems. As
the initial stage in the data chain, the on-detector (front-
end) electronics that read out detector sensors must oper-
ate with low latency and low power dissipation in a high
radiation environment, necessitating the use of application-
specific integrated circuits (ASICs). In order to mitigate the
initial bottleneck of moving data from front-end ASICs to
off-detector (back-end) systems based on field-programmable
gate arrays (FPGAs), front-end ASICs must provide edge
computing resources to efficiently use limited bandwidth
through real-time processing and identification of interesting
data. Front-end data compression algorithms have historically
relied on zero-suppression, threshold-based selection, sorting
or summing of data.
Artificial intelligence (AI), and more specifically machine
learning (ML), has recently been demonstrated to be a pow-
erful tool for data compression, processing, and analysis
in physics [1–4] and many other domains. While progress
has been made towards generic real-time processing through
inference including boosted decision trees and neural networks
(NNs) using FPGAs in off-detector electronics [5, 6], ML
methods have not yet been used to address the significant
bottleneck in the transport of data from front-end ASICs to
back-end FPGAs.
The high-granularity endcap calorimeter (HGCAL) [7] cur-
rently under construction by the CMS experiment [8] for
eventual use at HL-LHC provides an excellent example of the
big data challenges facing high energy physics. As an imaging
calorimeter, the HGCAL includes over 6 million readout
channels, providing an unprecedented level of segmentation
for calorimetry at high-energy colliders. In order to provide
input to the real-time event filtering (trigger) system of CMS,
the HGCAL transmits a stream of trigger data at a frequency
of 40 MHz resulting in massive data rates. At data creation,
two ASICs are used to digitize and encode trigger data before
transmission to back-end FPGAs for further processing.
In this paper, we explore the application of ML algorithms
to the task of processing large amounts of data with low
latency and low power in a high radiation environment in order
to maximize efficient use of limited bandwidth. We focus on an
ASIC implementation of an autoencoder algorithm that uses
a configurable NN to efficiently compress and encode data
automatically before transmission. Subsequent stages of data
processing can either decode the data or continue analyzing
the encoded data. In our ASIC implementation, the NN
architecture is fixed, but exceptional flexibility in application is
preserved by making the NN weights programmable. We apply
our methodology to the specific front-end data transmission
challenge of the CMS HGCAL, showing that the advantage of
our approach lies in the flexibility and configurability of the al-
gorithm, which allows us to generate unique data compression
algorithms depending on HGCAL sensor geometry, sensor
location on the detector and the corresponding occupancy and
signal patterns, changing accelerator conditions, or changing
detector conditions.
The remainder of this paper is organized as follows. In Sec-
tion II, we introduce the HGCAL challenge in greater detail
and outline our conceptual approach. Then, in Section III, we
elaborate on the design and training of the autoencoder NN for
the specific case of the CMS HGCAL detector. In Section IV,
we present the digital implementation of the trained NN in
the ASIC. Finally, we summarize our work and discuss future
directions in Section V.
II. SYSTEM CONSTRAINTS AND CONCEPT
The HGCAL is a major upgrade of the CMS endcap
calorimeter planned for the HL-LHC and provides a fitting
demonstrator for the ASIC ML accelerator technology. The
HGCAL is described in detail in Ref. [7]; relevant implemen-
tation details that have changed since publication of Ref. [7]
are updated in this article.
This “imaging calorimeter,” which includes over 600 m² of active silicon and over 6 million readout channels, is composed of 50 layers of active shower-sampling media interleaved with dense absorber. The active medium of the 28-layer front electromagnetic compartment is silicon, while the 22-layer rear hadronic compartment includes both silicon and plastic scintillator. Silicon layers are tiled with 8-inch hexagonal sensor modules, with each module including 48 logical trigger cells (TCs) arranged in three 4×4 matrices as shown in Fig. 1. While
the NN can be configured for both the silicon and scintillator
geometries, the silicon geometry is used throughout this paper
to illustrate the concepts.
Fig. 1. Simplified version of the internal flow of the autoencoder compression task, which takes the module energy deposits, normalizes them to the sum of the energy in the module, and then performs shape encoding.
To provide input to the CMS trigger system, data must be
transmitted from the on-sensor analog-to-digital ASICs to the
all-FPGA back-end detector electronics system at the nom-
inal HL-LHC collision rate of 40 MHz. Because bandwidth
constraints prohibit transmission of data for all 48 TCs at
40 MHz, a front-end concentrator ASIC (ECON-T) is being
developed to compress a single sensor’s information before
transmission to the back-end trigger electronics. Each sensor
module produces 7 bits of floating-point charge data for each
of the 48 TCs at 40 MHz. Thus, the lossy compression task
of the ECON-T ASIC is to aggregate the 48 7-bit signals
from a sensor and compress the data into a range spanning
48–144 bits while maximally preserving the energy pattern
of the sensor. The range of the output bits depends on the
location of the sensor module in the detector and the number
of links available for a given ECON-T ASIC to transmit the
data. The number of links allocated will roughly correspond
to the average sensor occupancy, which varies by two orders
of magnitude over all sensor locations. The exact task depends
on handling of data framing and TC address information as
well as on whether the ECON-T algorithm operates with fixed
or variable latency. The ECON-T design provides the user
a choice among three expert algorithms for TC compression
including TC threshold application, sorting and selection of
highest energy TC, and aggregation of adjacent TCs. Unused
algorithms are clock-gated to conserve power.
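As a quick consistency check of these figures, the per-module bandwidth and compression factors follow directly (a minimal sketch in Python, using only the numbers quoted above and ignoring framing and header bits):

```python
# Back-of-envelope bandwidth check for one sensor module, using only
# the figures quoted in the text (framing/header bits ignored).
N_TC = 48         # trigger cells per sensor module
BITS_IN = 7       # floating-point charge bits per TC
RATE_HZ = 40e6    # HL-LHC bunch-crossing rate

bits_per_bx = N_TC * BITS_IN                           # 336 bits per crossing
print(f"raw: {bits_per_bx * RATE_HZ / 1e9:.2f} Gb/s")  # 13.44 Gb/s

for out_bits in (48, 80, 112, 144):
    print(f"{out_bits:3d} output bits -> "
          f"{out_bits * RATE_HZ / 1e9:4.2f} Gb/s, "
          f"compression factor {bits_per_bx / out_bits:.1f}x")
```

The task is therefore a 2.3x to 7x lossy compression, with the achievable ratio set by the number of output links at each sensor location.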
The ECON-T ASIC is being developed for the LPCMOS
(Low Power CMOS) 65 nm feature size technology and is
under active development for CMS. Because it is located on-
detector in a high radiation environment, its design also re-
quires tolerance to single event effects. The allocated footprint
for the fabrication of this chip is approximately 5×5 mm². It
is expected that sufficient area will be available for potential
inclusion of our NN compression logic with an approximate
area of 4 mm². The constraints for the compression algorithms
are that they should accept new input data at 40 MHz and
complete processing in 50 ns. The power budget of the task is
less than 100 mW.
Our contribution is an NN to perform the ECON-T com-
pression task. It is a central tenet of our design that the
compression algorithm is reconfigurable. Because we are
implementing a design for an ASIC, the architecture of the NN
will be fixed, but the weights of the NN need to be configurable
such that the algorithm itself is adaptable. This has several
advantages. Through reconfiguration, we will be able to:
• enable more computationally complex compression algorithms, which could improve overall physics performance or allow more flexible algorithms;
• customize the compression algorithm of each sensor based on its location within the detector;
• adapt the compression algorithm for changing detector and collider conditions (for example, if the detector loses a channel or has a noisy channel, it can be accounted for; or, if the collider has more pileup than expected, the algorithm can be adjusted to deal with new or unexpected conditions without catastrophic failure).
For our compression algorithm, we choose to utilize an
autoencoder architecture. It provides a generic and flexible
compression solution, consisting of two NNs: an encoder and a
decoder. The encoding network maps inputs to an intermediate
latent representation with lower dimensionality than the space
of inputs, after which the decoding network aims to recover
the original signal. In the HGCAL application, the encod-
ing NN would compress HGCAL data on the ASIC before
transmission to the calorimeter trigger FPGAs for subsequent
decoding. Ultimately, in a final realistic system, we do not
anticipate using a full autoencoder architecture because FPGA
resources on the back-end FPGA system will not be sufficient
to do a full decoding for every sensor. However, in the absence
of understanding how best to use the latent representation
later in the processing chain, we optimize performance for an
autoencoder because it is a reasonable proxy for the encoder
NN encapsulating the salient sensor features such that the
image can be decoded from the latent representation. Finally,
in Fig. 1, the compression task is split into two parts: an
overall normalization over the entire sensor to preserve the
total energy in the sensor and the NN shape encoder, which
encodes the energy pattern across the sensor.
For the automated design tool flow, it is very important to
have a rapid co-design loop between the NN algorithm training
and the implementation in hardware in order to understand
whether the algorithm is meeting system constraints for power,
area, and performance simultaneously. To achieve this, we use
hls4ml [5] which translates NNs trained in common open-
source ML software frameworks into RTL using high-level
synthesis (HLS) tools [9]. The efficacy of this approach will
be described in greater detail in the following section.
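As an illustration of this co-design loop, the FPGA-oriented entry point of hls4ml looks roughly as follows (a sketch assuming a trained Keras/QKeras model named model; the ASIC-oriented Catapult extension actually used in this work is described in Sec. IV):

```python
import hls4ml

# Derive a per-layer HLS configuration (fixed-point precision,
# parallelism) from the trained model, then emit an HLS project.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='econ_t_encoder_prj',
)
hls_model.compile()                 # C simulation of the generated code
y_hls = hls_model.predict(x_test)   # quick check against the Keras model
```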
III. ALGORITHM DESIGN AND PERFORMANCE
Our task is to design an algorithm that will reproduce the
energy pattern in the sensor while simultaneously adhering to
hardware constraints, i.e., fitting in the available area within
the ECON-T ASIC chip while complying with system latency
and power constraints. Because we are training the algorithm
based on a single sensor’s energy pattern, we will not be
able to optimize for multi-sensor physics performance, such as
particle energy resolution. Ultimately, the physics performance
may determine the final system optimization; however, it is
beyond the scope of this study. Therefore, our target is to
design an algorithm that reproduces the original sensor energy
pattern as accurately as possible through the autoencoder
compression-and-decompression bottleneck.
There are a number of elements needed to design our
compression algorithm: a sample of events for training and
validation, a preprocessing and normalization block, an op-
timized NN architecture, and metrics for evaluating the NN
performance, both for determining the training loss and the
final network evaluation. An essential aspect of the train-
ing procedure is quantization-aware training (QAT), i.e., we
approximate bit-accurate reduced precision for all of the
NN calculations during training. QAT is known to be much
more performant than post-training quantization (PTQ), where
the training is done using 32-bit floating-point operations,
which are then truncated post-training to fixed-point or integer
representations. In a previous study of the QKeras [10]
tool, QAT was performed for an LHC trigger task. It was
found that with PTQ, the minimum bit width possible without
loss of performance was 14 bits while with QAT, the same
performance could be achieved with 6-bit weights. Thus, PTQ
would lead to a more than 4-fold increase in the area of an ASIC
design based on the bit operations hardware design metric [11].
Therefore, we use QKeras for training the NNs in this study.
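For concreteness, a QKeras layer quantized in this way can be declared as below (a sketch; the 6-bit weight and bias precision matches the study quoted above, while the layer width is a placeholder):

```python
from qkeras import QDense, quantized_bits

# Quantization-aware training: weights and biases are fake-quantized
# to 6 bits in the forward pass, so the optimizer learns parameters
# that already tolerate the rounding applied in hardware.
layer = QDense(
    16,  # placeholder width
    kernel_quantizer=quantized_bits(bits=6, integer=0, alpha=1),
    bias_quantizer=quantized_bits(bits=6, integer=0, alpha=1),
)
```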
Training Sample: Test energy patterns in the sensors are
simulated using top-quark-pair events overlaid with 200 simul-
taneous collisions per bunch crossing in the CMS software
framework [12]. These simulated events create a sample of
typical energy patterns in the HGCAL sensors, which we use
as a realistic proxy for the sensor data.
Preprocessing: The compression task is factorized into
normalization and shape extraction components, as illustrated
in Fig. 1. The first stage of ECON-T processing for all
algorithms is to expand the 7-bit floating-point TC data to the
inherent 22-bit fixed-point TC data. The sum of all 48 TCs is computed and used to normalize the charge distribution
across the full sensor (and the sum of TC charge is included
in the final data payload to allow subsequent interpretation of
normalized TC data). The normalized NN inputs are truncated
to 8 bits to allow a more compact NN implementation, while
ensuring that any omitted cells constitute less than 1% of the
total energy recorded within a module.
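A sketch of this normalization step (in Python for clarity; the expansion from 7-bit floating point to 22-bit fixed point is assumed to have already happened, and the clipping of the single-cell edge case is our own guard):

```python
import numpy as np

def normalize_module(tc_charges_22b):
    """Normalize the 48 expanded (22-bit) TC charges to the module sum
    and truncate the fractions to 8 bits for the NN input."""
    total = int(tc_charges_22b.sum())   # transmitted with the payload
    if total == 0:
        return np.zeros_like(tc_charges_22b, dtype=np.uint8), total
    frac = tc_charges_22b / total
    # 8-bit truncation; the clip guards the frac == 1.0 edge case.
    inputs_8b = np.minimum(np.floor(frac * 256), 255).astype(np.uint8)
    return inputs_8b, total
```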
NN Architecture: The encoding NN architecture con-
sists of successive layers that sequentially process the input
data. Convolutional layers are used to extract spatial features
from images through the application of filters: matrices of
configurable parameters. While convolutional layers use rel-
atively few parameters, convolution requires many multiply-
accumulate operations (MACs). Conversely, a fully-connected
layer multiplies a vector of input elements by a matrix of
configurable weights, generally requiring more configurable
parameters and fewer MACs than a convolution. The impact of
the choice of precision for all internal parameters (constrained
by the available area on chip) is accounted for by training
inherently quantized models with the QKeras package [10].
Because the HGCAL sensor data compression task takes as
input an image data representation, we consider a convolu-
tional NN layer as a natural approach. Typical convolutions
rely on the input being in a Cartesian representation, though
other shapes can be explored in future work. Here, we map the
hexagonal sensor shape to a more typical Cartesian arrange-
ment as illustrated in Fig. 2, which simplifies the training and
hardware implementation.
Fig. 2. Mapping the hexagonal sensor geometry to potential Cartesian representations for convolutional layer operations.
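One possible realization of this remapping (the index table below is a placeholder; the real assignment depends on the hexagonal TC numbering scheme):

```python
import numpy as np

# Hypothetical lookup table: for each of the three 4x4 sub-arrays and
# each (row, col) position, the index (0-47) of the corresponding TC.
TC_INDEX = np.arange(48).reshape(3, 4, 4)   # placeholder ordering

def to_cartesian(tc_values):
    """Rearrange the 48 TC values into a (4, 4, 3) image for the CNN."""
    image = np.empty((4, 4, 3), dtype=tc_values.dtype)
    for ch in range(3):
        image[:, :, ch] = tc_values[TC_INDEX[ch]]
    return image
```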
Training and Performance Metrics: The performance of
the autoencoder is based on how well the original image
is reproduced after encoding and decoding. We quantify the
difference between raw and decoded HGCAL data with the
energy mover’s distance (EMD) [13]. Given a particular nor-
malized energy distribution, one may physically relocate some
energy fraction dE by a spatial distance dx, leading to a new
distribution with an associated rearrangement cost dEdx. This
notion can be extended to define an “optimal transport” be-
tween two energy distributions A and B, as a re-mapping that
minimizes this total rearrangement cost, EMD(A, B). While
the performance of the autoencoder is assessed with EMD,
taking into account the complete hexagonal sensor geometry,
this metric is not used directly in the algorithm training,
as it involves nondifferentiable and computationally intensive
operations. Models are instead trained with a modified χ² loss function incorporating cell-to-cell distances, as a fast approximation of EMD. Specifically, individual TCs are re-summed into all physical groups of approximately 2×2 and 3×3 “super-cells” based on the full hexagonal cells, with corresponding χ² values computed for the coarsened images. The total loss sums each such χ² together, resulting in a comparatively lenient penalty when mis-reconstruction occurs only on small spatial scales. Including these additional χ² terms in the training procedure is found to yield significant improvements to the autoencoder performance, as measured
with EMD. To ensure an unbiased NN optimization, the data
are randomly partitioned into separate samples for training
(80%) and validation (20%), with training termination set by
the loss observed in the validation sample.
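A simplified version of this loss, written against the 4×4 Cartesian mapping and showing only the native and 2×2 terms (the true loss also includes 3×3 groups and respects the hexagonal geometry):

```python
import tensorflow as tf

def telescoped_loss(y_true, y_pred):
    """Squared-difference loss at native granularity plus a coarsened
    2x2 "super-cell" term; inputs are (batch, 4, 4, 3) images."""
    loss = tf.reduce_sum(tf.square(y_true - y_pred), axis=[1, 2, 3])
    # Overlapping 2x2 sums: average pool times the window size.
    pool = lambda x: 4.0 * tf.nn.avg_pool2d(x, ksize=2, strides=1,
                                            padding='VALID')
    loss += tf.reduce_sum(tf.square(pool(y_true) - pool(y_pred)),
                          axis=[1, 2, 3])
    return tf.reduce_mean(loss)
```

Because the coarsened terms compare summed energies, energy moved between neighboring cells within a super-cell incurs only the native-granularity penalty, mimicking the distance weighting of EMD.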
Baseline Encoder Model
A simple encoding NN with a single convolutional and
dense layer architecture is investigated. Normalized inputs from hexagonal sensors are arranged into three 4×4 arrays to form a regular geometry. The convolution layer consists of eight 3×3×3 kernel matrices, giving an 8×4×4 output
after convolution. It was found that more than eight kernel
matrices brought negligible performance improvement. These
128 values are flattened and fed through a dense layer to yield
16 9-bit output values. ReLU activations [14, 15] are applied
before and after the dense layer. This leads to a total of 2,288
weight parameters (dominated by the 2,064 parameters used
to configure the dense layer), each of which is specified with 6-bit precision.

Fig. 3. The autoencoder neural network architecture and data flow for the baseline encoder model.

A single inference requires a total of 4,448
MACs, with similar requirements from the convolution (2,400)
and dense layers (2,048). The size and complexity of this
baseline model are constrained by area, on-chip memory and
interfaces, and power, which impose additional optimization
considerations. The encoder architecture with the reconfigurable weights is illustrated in Fig. 3.
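Under these constraints, the baseline encoder can be written down directly in QKeras (a sketch: layer types, sizes, and the 6-bit weight precision follow the text, and the parameter count reproduces the quoted 2,288; the intermediate activation precision is an assumption):

```python
from tensorflow.keras.layers import Flatten, Input
from tensorflow.keras.models import Model
from qkeras import (QActivation, QConv2D, QDense,
                    quantized_bits, quantized_relu)

q6 = quantized_bits(bits=6, integer=0, alpha=1)    # 6-bit weights/biases

x_in = Input(shape=(4, 4, 3))                      # three 4x4 TC arrays
x = QConv2D(8, (3, 3), padding='same',
            kernel_quantizer=q6, bias_quantizer=q6)(x_in)   # 224 params
x = QActivation(quantized_relu(8))(x)   # ReLU before the dense layer
x = Flatten()(x)                        # 8*4*4 = 128 values
x = QDense(16, kernel_quantizer=q6, bias_quantizer=q6)(x)   # 2,064 params
x_out = QActivation(quantized_relu(9))(x)   # 16 9-bit encoder outputs

encoder = Model(x_in, x_out)
encoder.summary()   # 2,288 trainable parameters in total
```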
Optimization Considerations and Comparisons
While the presence of a single convolutional layer is critical
for good physics performance of the algorithm (approximated
by the EMD between input and decoded images), adding more
filters or additional convolutional layers only weakly improves
physics performance, at the expense of significantly increased
area. Changes in the number and size of the dense layers yield
more dramatic differences.
Figure 4 shows a sweep over the number of dense layer
outputs, where remaining aspects of the design are fixed
based on hardware constraints: the precisions of the outputs and weights are varied coherently to ensure that the total number of output bits and weight bits are fixed. Architec-
tures featuring many outputs with lower relative precision
consistently outperform their counterparts. The autoencoder
is robust across a variety of conditions and performs well in
the high-occupancy regime, which poses the greatest challenge
for trigger reconstruction.
Reconfigurability: Figure 4 also demonstrates how the
same NN encoder can be re-optimized and configured for new
data-taking conditions, by comparing sensors in detector re-
gions requiring low- and high-throughput. The maximum data
throughput of 144 bits from 16 9-bit outputs can be reduced
through fully configurable selective truncation. Expected use
cases include transmission of (48, 80, 112, 144) bits from
16 (3, 5, 7, 9)-bit outputs, though the network can also be configured to transmit fewer than 16 outputs, or a mix of precisions.

Fig. 4. Median EMD for decoded HGCAL images from the validation dataset, as a function of sensor occupancy for six NN configurations (6, 10, or 16 NN outputs at a sensor output bandwidth of 64 or 160 bits). Vertical lines (suppressed for the 160-bit configurations) denote 68% EMD intervals. Occupancy is defined as the number of TCs with signals exceeding one minimum ionizing particle divided by cosh η, where η is the pseudorapidity of the TC. (Results are shown for a version of the NN with a maximum of 10 bits for each of 16 outputs, rather than 9 bits as described in the text.)

TABLE I
AREA BREAKDOWN FOR PIPELINED IMPLEMENTATIONS. THE RESULTS ARE FROM CATAPULT HLS ESTIMATIONS; AREAS ARE IN µm².

Initiation Interval   Total Area   Register Area   MUX Area
1                     1,138,242    925             0
2                     891,195      5,228           12,989
4                     765,877      8,503           16,089
8                     699,988      8,509           16,252
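The selective truncation can be pictured as keeping only the most-significant bits of each 9-bit output (a conceptual sketch; on chip this is a configurable register setting, not software):

```python
def truncate_output(out_9b, keep_bits):
    """Keep the `keep_bits` most-significant bits of a 9-bit value."""
    return out_9b >> (9 - keep_bits)

# 16 outputs at (3, 5, 7, 9) kept bits -> (48, 80, 112, 144)-bit payloads
for k in (3, 5, 7, 9):
    print(f"{k}-bit outputs -> {16 * k:3d}-bit payload")
```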
IV. IMPLEMENTATION METHODOLOGY AND RESULTS
In this section, we detail the implementation of the trained
NN described in Sec. III in the ECON-T ASIC. We discuss
the design and verification flow, the architectural and design
exploration, steps required for deployment in a radiation
environment, design performance metrics, and finally the
implementation results.
Algorithm to Accelerator Development
For our design flow, we adopted the hls4ml framework [5]
to automate the mapping of ML models onto reconfigurable
logic. For this work, we extended hls4ml to our ASIC flow.
Traditionally, hardware designers utilize hardware description
languages (HDLs) and a level of abstraction known as the
Register Transfer Level (RTL). In recent years, HLS has
become an alternative for generating hardware modules from
code written in programming languages such as C/C++. HLS
comes with significant benefits: it raises the level of abstraction
and reduces the simulation time; it simplifies the verification
phases; and finally, it makes the exploration and evaluation
of design alternatives easier. The original flow of hls4ml
generates state-of-the-art synthesizable C++ code and HLS
directives from the ML-model specifications. The generated
code is then fed into the Vivado HLS tool to generate an
accelerator in HDL RTL code for deployment on Xilinx
FPGAs [16]. We extended hls4ml to support Mentor’s
Catapult HLS [17] tool and target our specific 65 nm LP
CMOS technology for ASIC fabrication. We integrated the
HLS-generated code with a SystemVerilog RTL IP of the
programmable I²C peripheral¹. We finally created a component
database and layout to be incorporated into the ECON-T ASIC
top-level assembly using a digital implementation flow. The
standard flow was modified to accommodate automatic triple
modular redundancy implementation for HLS-generated RTL
integrated with other SystemVerilog modules.
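Triple modular redundancy (TMR) protects state by tripling registers and majority-voting the three copies; bitwise, the voter reduces to the classic two-of-three expression (a conceptual sketch of the voting logic only, not the ASIC netlist):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-of-three majority vote across three register copies;
    a single-event upset corrupting one copy is outvoted."""
    return (a & b) | (b & c) | (a & c)

assert tmr_vote(0b1011, 0b1011, 0b0011) == 0b1011   # one corrupted copy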
We complemented our design flow with a robust validation
and verification methodology across the various refinement
steps. We validated the C++ HLS code against the QKeras
trained model to guarantee the model’s functional correct-
ness. Earlier in the design flow, we also performed dynamic
and static verification of the synthesizable specifications: we
checked design rules with static analysis of the C++ HLS
code (Mentor CDesignChecker [19]), measured coverage met-
rics (Mentor CCov [19]), and finally, ran simulation-based
equivalence checking. For the HLS-generated RTL code, we
followed a more traditional simulation-based verification to
ensure optimized power, area, and speed.
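In practice, such a cross-check can be as simple as a bit-exact comparison of predictions (a sketch with hypothetical variable and file names; the real flow compares the QKeras model, the C++ HLS code, and the generated RTL on the same stimuli):

```python
import numpy as np

# Hypothetical arrays: QKeras predictions vs. outputs captured from the
# C++ HLS (or RTL) simulation on the same validation stimuli.
y_qkeras = encoder.predict(x_val)
y_hls = np.load('hls_csim_outputs.npy')   # hypothetical dump file

# With quantization modeled consistently on both sides, fixed-point
# designs should agree bit-exactly, hence a zero-tolerance comparison.
assert np.array_equal(y_qkeras, y_hls), "HLS output diverges from model"
```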
Architectural Exploration
hls4ml coupled with the industry standard Catapult HLS
(ver. 10.6) tool allowed us to explore the cost and performance
trade-offs of various micro-architectural hardware implemen-
tations for our ML model. We decided on a pipelined imple-
mentation for our accelerator to increase concurrent execution
as an early design decision. A pipelined design can process
new inputs every N clock cycles, where N is the initiation
interval (II) of the design. Table I shows the area breakdown
for different II values (1, 2, 4, 8). It is noticeable that although
the area is higher for II=1, the required resources are mostly
functional logic to implement a highly parallel datapath, i.e., there are no multiplexers. A higher II value implies less design
parallelism and more functional-resource reuse. This choice
reduces the overall area, but the resource breakdown shows
an increase in control logic (MUX) and registers. An II of 1
was ultimately selected so that new inputs may be processed in sync with a single clock operating at the 40 MHz LHC bunch-crossing frequency.
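With II = 1 and the 40 MHz clock, one inference completes every 25 ns, which ties together the power and energy figures reported in the abstract (a quick arithmetic check):

```python
CLOCK_HZ = 40e6      # one new input per cycle at II = 1
POWER_W = 95e-3      # reported total power

period_s = 1.0 / CLOCK_HZ                                   # 25 ns per inference
print(f"{POWER_W * period_s * 1e9:.2f} nJ per inference")   # ~2.4 nJ
```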
We used a fixed-point representation (ac_fixed [20]) for
the input, intermediate, and output parameters of our ML
model designed with hls4ml. This choice provided us with
a high degree of flexibility for exploring the area and accuracy
trade-off of the ML-model hardware implementations obtained
with HLS. The RTL schematics for the encoder block are
shown in Fig. 5. The basic structure of the convolution and
dense layers can be seen at the schematic level. The top right
and bottom right diagrams are zoomed-in portions of this schematic, depicting the output and MACs.
¹ The authors use controller/peripheral in place of master/slave when referring to such I²C devices or processes [18].
REFERENCES

V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. 27th Int. Conf. on Machine Learning (ICML), 2010.

X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. 14th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2011.

S. Chatrchyan et al. (CMS Collaboration), “The CMS experiment at the CERN LHC,” J. Instrum., vol. 3, S08004, 2008.

G. Carleo et al., “Machine learning and the physical sciences,” Rev. Mod. Phys., vol. 91, 045002, 2019.

R. E. Lyons and W. Vanderkulk, “The use of triple-modular redundancy to improve computer reliability,” IBM J. Res. Develop., vol. 6, no. 2, pp. 200–209, 1962.