A reconfigurable neural network ASIC for detector
front-end data compression at the HL-LHC
Giuseppe Di Guglielmo, Farah Fahim, Christian Herwig, Manuel Blanco Valentin, Javier Duarte, Cristian Gingu,
Philip Harris, James Hirschauer, Martin Kwok, Vladimir Loncar, Yingyi Luo, Llovizna Miranda, Jennifer
Ngadiuba, Daniel Noonan, Seda Ogrenci-Memik, Maurizio Pierini, Sioni Summers, Nhan Tran
Abstract—Despite advances in the programmable logic ca-
pabilities of modern trigger systems, a significant bottleneck
remains in the amount of data to be transported from the
detector to off-detector logic where trigger decisions are made.
We demonstrate that a neural network autoencoder model can
be implemented in a radiation-tolerant ASIC to perform lossy
data compression, alleviating the data transmission problem while
preserving critical information of the detector energy profile. For
our application, we consider the high-granularity calorimeter
from the CMS experiment at the CERN Large Hadron Collider.
The advantage of the machine learning approach is in the
flexibility and configurability of the algorithm. By changing the
neural network weights, a unique data compression algorithm
can be deployed for each sensor in different detector regions,
and changing detector or collider conditions. To meet area,
performance, and power constraints, we perform quantization-
aware training to create an optimized neural network hardware
implementation. The design is achieved through the use of
high-level synthesis tools and the hls4ml framework, and was
processed through synthesis and physical layout flows based on an LP CMOS 65 nm technology node. The flow anticipates 200 Mrad of ionizing radiation to select gates, and reports a total area of 3.6 mm² and a power consumption of 95 mW. The simulated energy consumption per inference is 2.4 nJ. This is the first radiation-tolerant on-detector ASIC implementation of a neural network that has been designed for particle physics applications.
Index Terms—ASIC, artificial intelligence, autoencoder, LHC,
machine learning, SEE mitigation, high-level synthesis, hardware
accelerator
Manuscript received March 26, 2021. (Corresponding e-mail: farah@fnal.gov)
Farah Fahim, Cristian Gingu, Christian Herwig, James Hirschauer, Llovizna Miranda, and Nhan Tran are with Fermi National Accelerator Laboratory, Batavia, IL, USA, and are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics.
Manuel Blanco Valentin, Farah Fahim, Yingyi Luo, Seda Ogrenci-Memik, and Nhan Tran are with Northwestern University, Evanston, IL, USA.
Giuseppe Di Guglielmo is with Columbia University, New York, NY, USA.
Javier Duarte is with UC San Diego, La Jolla, CA, USA, and is supported by the DOE, Office of Science, Office of High Energy Physics Early Career Research program under Award No. DE-SC0021187.
Philip Harris is with Massachusetts Institute of Technology, Cambridge, MA, USA, and is supported by a Massachusetts Institute of Technology University grant.
Martin Kwok is with Brown University, Providence, RI, USA.
Jennifer Ngadiuba is with California Institute of Technology, Pasadena, CA, USA.
Daniel Noonan is with Florida Institute of Technology, Melbourne, FL, USA.
Vladimir Loncar, Maurizio Pierini, and Sioni Summers are with CERN, Geneva, Switzerland, and are supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 772369).
Vladimir Loncar is also with the Institute of Physics Belgrade, Serbia.
I. INTRODUCTION
BREAKTHROUGHS in the precision and speed of sensing
instrumentation drive advances in scientific
methodologies and theories. Thus, a common paradigm across
many scientific disciplines in physics has been to increase
the resolution of the sensing equipment in order to increase
either the robustness or the sensitivity of the experiment itself.
This demand for increasingly higher sensitivity in experiments,
along with advances in the design of state-of-the-art sensing
systems, has resulted in rapidly growing big data pipelines
such that transmission of acquired data to the processing
unit via conventional methods is no longer feasible. Data
transmission is commonly much less efficient than data pro-
cessing. Therefore, placing data compression and processing
as close as possible to data creation while maintaining physics
performance is a crucial task in modern physics experiments.
At the CERN Large Hadron Collider (LHC) and its high
luminosity upgrade (HL-LHC), extreme collision rates present
extreme challenges for data processing and transmission at
multiple stages in detector readout and trigger systems. As
the initial stage in the data chain, the on-detector (front-
end) electronics that read out detector sensors must oper-
ate with low latency and low power dissipation in a high
radiation environment, necessitating the use of application-
specific integrated circuits (ASICs). In order to mitigate the
initial bottleneck of moving data from front-end ASICs to
off-detector (back-end) systems based on field-programmable
gate arrays (FPGAs), front-end ASICs must provide edge
computing resources to efficiently use limited bandwidth
through real-time processing and identification of interesting
data. Front-end data compression algorithms have historically
relied on zero-suppression, threshold-based selection, sorting
or summing of data.
Artificial intelligence (AI), and more specifically machine
learning (ML), has recently been demonstrated to be a pow-
erful tool for data compression, processing, and analysis
in physics [1–4] and many other domains. While progress
has been made towards generic real-time processing through
inference including boosted decision trees and neural networks
(NNs) using FPGAs in off-detector electronics [5, 6], ML
methods have not yet been used to address the significant
bottleneck in the transport of data from front-end ASICs to
back-end FPGAs.
The high-granularity endcap calorimeter (HGCAL) [7] cur-
rently under construction by the CMS experiment [8] for
eventual use at HL-LHC provides an excellent example of the
big data challenges facing high energy physics. As an imaging
calorimeter, the HGCAL includes over 6 million readout
channels, providing an unprecedented level of segmentation
for calorimetry at high-energy colliders. In order to provide
input to the real-time event filtering (trigger) system of CMS,
the HGCAL transmits a stream of trigger data at a frequency
of 40 MHz resulting in massive data rates. At data creation,
two ASICs are used to digitize and encode trigger data before
transmission to back-end FPGAs for further processing.
In this paper, we explore the application of ML algorithms
to the task of processing large amounts of data with low
latency and low power in a high radiation environment in order
to maximize efficient use of limited bandwidth. We focus on an
ASIC implementation of an autoencoder algorithm that uses
a configurable NN to efficiently compress and encode data
automatically before transmission. Subsequent stages of data
processing can either decode the data or continue analyzing
the encoded data. In our ASIC implementation, the NN
architecture is fixed, but exceptional flexibility in application is
preserved by making the NN weights programmable. We apply
our methodology to the specific front-end data transmission
challenge of the CMS HGCAL, showing that the advantage of
our approach lies in the flexibility and configurability of the al-
gorithm, which allows us to generate unique data compression
algorithms depending on HGCAL sensor geometry, sensor
location on the detector and the corresponding occupancy and
signal patterns, changing accelerator conditions, or changing
detector conditions.
The remainder of this paper is organized as follows. In Sec-
tion II, we introduce the HGCAL challenge in greater detail
and outline our conceptual approach. Then, in Section III, we
elaborate on the design and training of the autoencoder NN for
the specific case of the CMS HGCAL detector. In Section IV,
we present the digital implementation of the trained NN in
the ASIC. Finally, we summarize our work and discuss future
directions in Section V.
II. SYSTEM CONSTRAINTS AND CONCEPT
The HGCAL is a major upgrade of the CMS endcap
calorimeter planned for the HL-LHC and provides a fitting
demonstrator for the ASIC ML accelerator technology. The
HGCAL is described in detail in Ref. [7]; relevant implemen-
tation details that have changed since publication of Ref. [7]
are updated in this article.
This “imaging calorimeter,” which includes over 600 m² of active silicon and over 6 million readout channels, is composed of 50 layers of active shower-sampling media interleaved with dense absorber. The active medium of the 28-layer front electromagnetic compartment is silicon, while the 22-layer rear hadronic compartment includes both silicon and plastic scintillator. Silicon layers are tiled with 8-inch hexagonal sensor modules, with each module including 48 logical trigger cells (TCs) arranged in three 4×4 matrices as shown in Fig. 1. While
the NN can be configured for both the silicon and scintillator
geometries, the silicon geometry is used throughout this paper
to illustrate the concepts.
Fig. 1. Simplified version of the internal flow of the autoencoder compression task, which takes the module energy deposits, normalizes them to the sum of the energy in the module, and then performs shape encoding.
To provide input to the CMS trigger system, data must be
transmitted from the on-sensor analog-to-digital ASICs to the
all-FPGA back-end detector electronics system at the nom-
inal HL-LHC collision rate of 40 MHz. Because bandwidth
constraints prohibit transmission of data for all 48 TCs at
40 MHz, a front-end concentrator ASIC (ECON-T) is being
developed to compress a single sensor’s information before
transmission to the back-end trigger electronics. Each sensor
module produces 7 bits of floating-point charge data for each
of the 48 TCs at 40 MHz. Thus, the lossy compression task
of the ECON-T ASIC is to aggregate the 48 7-bit signals
from a sensor and compress the data into a range spanning
48–144 bits while maximally preserving the energy pattern
of the sensor. The range of the output bits depends on the
location of the sensor module in the detector and the number
of links available for a given ECON-T ASIC to transmit the
data. The number of links allocated will roughly correspond
to the average sensor occupancy, which varies by two orders
of magnitude over all sensor locations. The exact task depends
on handling of data framing and TC address information as
well as on whether the ECON-T algorithm operates with fixed
or variable latency. The ECON-T design provides the user
a choice among three expert algorithms for TC compression
including TC threshold application, sorting and selection of
highest energy TC, and aggregation of adjacent TCs. Unused
algorithms are clock-gated to conserve power.
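As a quick consistency check of these figures, the per-module bandwidth and compression factors follow directly (a minimal sketch in Python, using only the numbers quoted above and ignoring framing and header bits):

```python
# Back-of-envelope bandwidth check for one sensor module, using only
# the figures quoted in the text (framing/header bits ignored).
N_TC = 48         # trigger cells per sensor module
BITS_IN = 7       # floating-point charge bits per TC
RATE_HZ = 40e6    # HL-LHC bunch-crossing rate

bits_per_bx = N_TC * BITS_IN                           # 336 bits per crossing
print(f"raw: {bits_per_bx * RATE_HZ / 1e9:.2f} Gb/s")  # 13.44 Gb/s

for out_bits in (48, 80, 112, 144):
    print(f"{out_bits:3d} output bits -> "
          f"{out_bits * RATE_HZ / 1e9:4.2f} Gb/s, "
          f"compression factor {bits_per_bx / out_bits:.1f}x")
```

The task is therefore a 2.3x to 7x lossy compression, with the achievable ratio set by the number of output links at each sensor location.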
The ECON-T ASIC is being developed for the LPCMOS
(Low Power CMOS) 65 nm feature size technology and is
under active development for CMS. Because it is located on-
detector in a high radiation environment, its design also re-
quires tolerance to single event effects. The allocated footprint
for the fabrication of this chip is approximately 5×5 mm². It
is expected that sufficient area will be available for potential
inclusion of our NN compression logic with an approximate
area of 4 mm². The constraints for the compression algorithms
are that they should accept new input data at 40 MHz and
complete processing in 50 ns. The power budget of the task is
less than 100 mW.
Our contribution is an NN to perform the ECON-T com-
pression task. It is a central tenet of our design that the
compression algorithm is reconfigurable. Because we are
implementing a design for an ASIC, the architecture of the NN
will be fixed, but the weights of the NN need to be configurable
such that the algorithm itself is adaptable. This has several
advantages. Through reconfiguration, we will be able to:
• enable more computationally complex compression algorithms, which could improve overall physics performance or allow more flexible algorithms;
• customize the compression algorithm of each sensor based on its location within the detector;
• adapt the compression algorithm for changing detector and collider conditions (for example, if the detector loses a channel or has a noisy channel, it can be accounted for; or, if the collider has more pileup than expected, the algorithm can be adjusted to deal with new or unexpected conditions without catastrophic failure).
For our compression algorithm, we choose to utilize an
autoencoder architecture. It provides a generic and flexible
compression solution, consisting of two NNs: an encoder and a
decoder. The encoding network maps inputs to an intermediate
latent representation with lower dimensionality than the space
of inputs, after which the decoding network aims to recover
the original signal. In the HGCAL application, the encod-
ing NN would compress HGCAL data on the ASIC before
transmission to the calorimeter trigger FPGAs for subsequent
decoding. Ultimately, in a final realistic system, we do not
anticipate using a full autoencoder architecture because FPGA
resources on the back-end FPGA system will not be sufficient
to do a full decoding for every sensor. However, in the absence
of understanding how best to use the latent representation
later in the processing chain, we optimize performance for an
autoencoder because it is a reasonable proxy for the encoder
NN encapsulating the salient sensor features such that the
image can be decoded from the latent representation. Finally,
in Fig. 1, the compression task is split into two parts: an
overall normalization over the entire sensor to preserve the
total energy in the sensor and the NN shape encoder, which
encodes the energy pattern across the sensor.
For the automated design tool flow, it is very important to
have a rapid co-design loop between the NN algorithm training
and the implementation in hardware in order to understand
whether the algorithm is meeting system constraints for power,
area, and performance simultaneously. To achieve this, we use
hls4ml [5] which translates NNs trained in common open-
source ML software frameworks into RTL using high-level
synthesis (HLS) tools [9]. The efficacy of this approach will
be described in greater detail in the following section.
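As an illustration of this co-design loop, the FPGA-oriented entry point of hls4ml looks roughly as follows (a sketch assuming a trained Keras/QKeras model named model; the ASIC-oriented Catapult extension actually used in this work is described in Sec. IV):

```python
import hls4ml

# Derive a per-layer HLS configuration (fixed-point precision,
# parallelism) from the trained model, then emit an HLS project.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='econ_t_encoder_prj',
)
hls_model.compile()                 # C simulation of the generated code
y_hls = hls_model.predict(x_test)   # quick check against the Keras model
```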
III. ALGORITHM DESIGN AND PERFORMANCE
Our task is to design an algorithm that will reproduce the
energy pattern in the sensor while simultaneously adhering to
hardware constraints, i.e., fitting in the available area within
the ECON-T ASIC chip while complying with system latency
and power constraints. Because we are training the algorithm
based on a single sensor’s energy pattern, we will not be
able to optimize for multi-sensor physics performance, such as
particle energy resolution. Ultimately, the physics performance
may determine the final system optimization; however, it is
beyond the scope of this study. Therefore, our target is to
design an algorithm that reproduces the original sensor energy
pattern as accurately as possible through the autoencoder
compression-and-decompression bottleneck.
There are a number of elements needed to design our
compression algorithm: a sample of events for training and
validation, a preprocessing and normalization block, an op-
timized NN architecture, and metrics for evaluating the NN
performance, both for determining the training loss and the
final network evaluation. An essential aspect of the train-
ing procedure is quantization-aware training (QAT), i.e., we
approximate bit-accurate reduced precision for all of the
NN calculations during training. QAT is known to be much
more performant than post-training quantization (PTQ), where
the training is done using 32-bit floating-point operations,
which are then truncated post-training to fixed-point or integer
representations. In a previous study of the QKeras [10]
tool, QAT was performed for an LHC trigger task. It was
found that with PTQ, the minimum bit width possible without
loss of performance was 14 bits while with QAT, the same
performance could be achieved with 6-bit weights. Thus, PTQ
would lead to a more than 4-fold increase in the area of an ASIC
design based on the bit operations hardware design metric [11].
Therefore, we use QKeras for training the NNs in this study.
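For concreteness, a QKeras layer quantized in this way can be declared as below (a sketch; the 6-bit weight and bias precision matches the study quoted above, while the layer width is a placeholder):

```python
from qkeras import QDense, quantized_bits

# Quantization-aware training: weights and biases are fake-quantized
# to 6 bits in the forward pass, so the optimizer learns parameters
# that already tolerate the rounding applied in hardware.
layer = QDense(
    16,  # placeholder width
    kernel_quantizer=quantized_bits(bits=6, integer=0, alpha=1),
    bias_quantizer=quantized_bits(bits=6, integer=0, alpha=1),
)
```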
Training Sample: Test energy patterns in the sensors are
simulated using top-quark-pair events overlaid with 200 simul-
taneous collisions per bunch crossing in the CMS software
framework [12]. These simulated events create a sample of
typical energy patterns in the HGCAL sensors, which we use
as a realistic proxy for the sensor data.
Preprocessing: The compression task is factorized into
normalization and shape extraction components, as illustrated
in Fig. 1. The first stage of ECON-T processing for all
algorithms is to expand the 7-bit floating-point TC data to the
inherent 22-bit fixed-point TC data. The sum of all 48 TCs is computed and used to normalize the charge distribution
across the full sensor (and the sum of TC charge is included
in the final data payload to allow subsequent interpretation of
normalized TC data). The normalized NN inputs are truncated
to 8 bits to allow a more compact NN implementation, while
ensuring that any omitted cells constitute less than 1% of the
total energy recorded within a module.
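A sketch of this normalization step (in Python for clarity; the expansion from 7-bit floating point to 22-bit fixed point is assumed to have already happened, and the clipping of the single-cell edge case is our own guard):

```python
import numpy as np

def normalize_module(tc_charges_22b):
    """Normalize the 48 expanded (22-bit) TC charges to the module sum
    and truncate the fractions to 8 bits for the NN input."""
    total = int(tc_charges_22b.sum())   # transmitted with the payload
    if total == 0:
        return np.zeros_like(tc_charges_22b, dtype=np.uint8), total
    frac = tc_charges_22b / total
    # 8-bit truncation; the clip guards the frac == 1.0 edge case.
    inputs_8b = np.minimum(np.floor(frac * 256), 255).astype(np.uint8)
    return inputs_8b, total
```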
NN Architecture: The encoding NN architecture con-
sists of successive layers that sequentially process the input
data. Convolutional layers are used to extract spatial features
from images through the application of filters: matrices of
configurable parameters. While convolutional layers use rel-
atively few parameters, convolution requires many multiply-
accumulate operations (MACs). Conversely, a fully-connected
layer multiplies a vector of input elements by a matrix of
configurable weights, generally requiring more configurable
parameters and fewer MACs than a convolution. The impact of
the choice of precision for all internal parameters (constrained
by the available area on chip) is accounted for by training
inherently quantized models with the QKeras package [10].
Because the HGCAL sensor data compression task takes as
input an image data representation, we consider a convolu-
tional NN layer as a natural approach. Typical convolutions
rely on the input being in a Cartesian representation, though
other shapes can be explored in future work. Here, we map the
hexagonal sensor shape to a more typical Cartesian arrange-
ment as illustrated in Fig. 2, which simplifies the training and
hardware implementation.
Fig. 2. Mapping the hexagonal sensor geometry to potential Cartesian representations for convolutional layer operations.
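One possible realization of this remapping (the index table below is a placeholder; the real assignment depends on the hexagonal TC numbering scheme):

```python
import numpy as np

# Hypothetical lookup table: for each of the three 4x4 sub-arrays and
# each (row, col) position, the index (0-47) of the corresponding TC.
TC_INDEX = np.arange(48).reshape(3, 4, 4)   # placeholder ordering

def to_cartesian(tc_values):
    """Rearrange the 48 TC values into a (4, 4, 3) image for the CNN."""
    image = np.empty((4, 4, 3), dtype=tc_values.dtype)
    for ch in range(3):
        image[:, :, ch] = tc_values[TC_INDEX[ch]]
    return image
```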
Training and Performance Metrics: The performance of
the autoencoder is based on how well the original image
is reproduced after encoding and decoding. We quantify the
difference between raw and decoded HGCAL data with the
energy mover’s distance (EMD) [13]. Given a particular nor-
malized energy distribution, one may physically relocate some
energy fraction dE by a spatial distance dx, leading to a new
distribution with an associated rearrangement cost dEdx. This
notion can be extended to define an “optimal transport” be-
tween two energy distributions A and B, as a re-mapping that
minimizes this total rearrangement cost, EMD(A, B). While
the performance of the autoencoder is assessed with EMD,
taking into account the complete hexagonal sensor geometry,
this metric is not used directly in the algorithm training,
as it involves nondifferentiable and computationally intensive
operations. Models are instead trained with a modified χ² loss function incorporating cell-to-cell distances, as a fast approximation of EMD. Specifically, individual TCs are re-summed into all physical groups of approximately 2×2 and 3×3 “super-cells” based on the full hexagonal cells, with corresponding χ² values computed for the coarsened images. The total loss sums each such χ² together, resulting in a comparatively lenient penalty when mis-reconstruction occurs only on small spatial scales. Including these additional χ² terms in the training procedure is found to yield significant improvements to the autoencoder performance, as measured
with EMD. To ensure an unbiased NN optimization, the data
are randomly partitioned into separate samples for training
(80%) and validation (20%), with training termination set by
the loss observed in the validation sample.
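A simplified version of this loss, written against the 4×4 Cartesian mapping and showing only the native and 2×2 terms (the true loss also includes 3×3 groups and respects the hexagonal geometry):

```python
import tensorflow as tf

def telescoped_loss(y_true, y_pred):
    """Squared-difference loss at native granularity plus a coarsened
    2x2 "super-cell" term; inputs are (batch, 4, 4, 3) images."""
    loss = tf.reduce_sum(tf.square(y_true - y_pred), axis=[1, 2, 3])
    # Overlapping 2x2 sums: average pool times the window size.
    pool = lambda x: 4.0 * tf.nn.avg_pool2d(x, ksize=2, strides=1,
                                            padding='VALID')
    loss += tf.reduce_sum(tf.square(pool(y_true) - pool(y_pred)),
                          axis=[1, 2, 3])
    return tf.reduce_mean(loss)
```

Because the coarsened terms compare summed energies, energy moved between neighboring cells within a super-cell incurs only the native-granularity penalty, mimicking the distance weighting of EMD.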
Baseline Encoder Model
A simple encoding NN with a single convolutional and
dense layer architecture is investigated. Normalized inputs from hexagonal sensors are arranged into three 4×4 arrays to form a regular geometry. The convolution layer consists of eight 3×3×3 kernel matrices, giving an 8×4×4 output
after convolution. It was found that more than eight kernel
matrices brought negligible performance improvement. These
128 values are flattened and fed through a dense layer to yield
16 9-bit output values. ReLU activations [14, 15] are applied
before and after the dense layer. This leads to a total of 2,288
weight parameters (dominated by the 2,064 parameters used
to configure the dense layer), each of which is specified with 6-bit precision.

Fig. 3. The autoencoder neural network architecture and data flow for the baseline encoder model.

A single inference requires a total of 4,448
MACs, with similar requirements from the convolution (2,400)
and dense layers (2,048). The size and complexity of this
baseline model are constrained by area, on-chip memory and
interfaces, and power, which impose additional optimization
considerations. The encoder architecture with the reconfigurable weights is illustrated in Fig. 3.
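Under these constraints, the baseline encoder can be written down directly in QKeras (a sketch: layer types, sizes, and the 6-bit weight precision follow the text, and the parameter count reproduces the quoted 2,288; the intermediate activation precision is an assumption):

```python
from tensorflow.keras.layers import Flatten, Input
from tensorflow.keras.models import Model
from qkeras import (QActivation, QConv2D, QDense,
                    quantized_bits, quantized_relu)

q6 = quantized_bits(bits=6, integer=0, alpha=1)    # 6-bit weights/biases

x_in = Input(shape=(4, 4, 3))                      # three 4x4 TC arrays
x = QConv2D(8, (3, 3), padding='same',
            kernel_quantizer=q6, bias_quantizer=q6)(x_in)   # 224 params
x = QActivation(quantized_relu(8))(x)   # ReLU before the dense layer
x = Flatten()(x)                        # 8*4*4 = 128 values
x = QDense(16, kernel_quantizer=q6, bias_quantizer=q6)(x)   # 2,064 params
x_out = QActivation(quantized_relu(9))(x)   # 16 9-bit encoder outputs

encoder = Model(x_in, x_out)
encoder.summary()   # 2,288 trainable parameters in total
```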
Optimization Considerations and Comparisons
While the presence of a single convolutional layer is critical
for good physics performance of the algorithm (approximated
by the EMD between input and decoded images), adding more
filters or additional convolutional layers only weakly improves
physics performance, at the expense of significantly increased
area. Changes in the number and size of the dense layers yield
more dramatic differences.
Figure 4 shows a sweep over the number of dense layer
outputs, where remaining aspects of the design are fixed
based on hardware constraints: the precisions of the outputs and weights are varied coherently to ensure that the total number of output bits and weight bits are fixed. Architec-
tures featuring many outputs with lower relative precision
consistently outperform their counterparts. The autoencoder
is robust across a variety of conditions and performs well in
the high-occupancy regime, which poses the greatest challenge
for trigger reconstruction.
Reconfigurability: Figure 4 also demonstrates how the
same NN encoder can be re-optimized and configured for new
data-taking conditions, by comparing sensors in detector re-
gions requiring low- and high-throughput. The maximum data
throughput of 144 bits from 16 9-bit outputs can be reduced
through fully configurable selective truncation. Expected use
cases include transmission of (48, 80, 112, 144) bits from
16 (3, 5, 7, 9)-bit outputs, though the network can also be configured to transmit fewer than 16 outputs, or a mix of precisions.

Fig. 4. Median EMD for decoded HGCAL images from the validation dataset, as a function of sensor occupancy for six NN configurations (6, 10, or 16 NN outputs at a sensor output bandwidth of 64 or 160 bits). Vertical lines (suppressed for the 160-bit configurations) denote 68% EMD intervals. Occupancy is defined as the number of TCs with signals exceeding one minimum ionizing particle divided by cosh η, where η is the pseudorapidity of the TC. (Results are shown for a version of the NN with a maximum of 10 bits for each of 16 outputs, rather than 9 bits as described in the text.)

TABLE I
AREA BREAKDOWN FOR PIPELINED IMPLEMENTATIONS. THE RESULTS ARE FROM CATAPULT HLS ESTIMATIONS; AREAS ARE IN µm².

Initiation Interval   Total Area   Register Area   MUX Area
1                     1,138,242    925             0
2                     891,195      5,228           12,989
4                     765,877      8,503           16,089
8                     699,988      8,509           16,252
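The selective truncation can be pictured as keeping only the most-significant bits of each 9-bit output (a conceptual sketch; on chip this is a configurable register setting, not software):

```python
def truncate_output(out_9b, keep_bits):
    """Keep the `keep_bits` most-significant bits of a 9-bit value."""
    return out_9b >> (9 - keep_bits)

# 16 outputs at (3, 5, 7, 9) kept bits -> (48, 80, 112, 144)-bit payloads
for k in (3, 5, 7, 9):
    print(f"{k}-bit outputs -> {16 * k:3d}-bit payload")
```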
IV. IMPLEMENTATION METHODOLOGY AND RESULTS
In this section, we detail the implementation of the trained
NN described in Sec. III in the ECON-T ASIC. We discuss
the design and verification flow, the architectural and design
exploration, steps required for deployment in a radiation
environment, design performance metrics, and finally the
implementation results.
Algorithm to Accelerator Development
For our design flow, we adopted the hls4ml framework [5]
to automate the mapping of ML models onto reconfigurable
logic. For this work, we extended hls4ml to our ASIC flow.
Traditionally, hardware designers utilize hardware description
languages (HDLs) and a level of abstraction known as the
Register Transfer Level (RTL). In recent years, HLS has
become an alternative for generating hardware modules from
code written in programming languages such as C/C++. HLS
comes with significant benefits: it raises the level of abstraction
and reduces the simulation time; it simplifies the verification
phases; and finally, it makes the exploration and evaluation
of design alternatives easier. The original flow of hls4ml
generates state-of-the-art synthesizable C++ code and HLS
directives from the ML-model specifications. The generated
code is then fed into the Vivado HLS tool to generate an
accelerator in HDL RTL code for deployment on Xilinx
FPGAs [16]. We extended hls4ml to support Mentor’s
Catapult HLS [17] tool and target our specific 65 nm LP
CMOS technology for ASIC fabrication. We integrated the
HLS-generated code with a SystemVerilog RTL IP of the
programmable I²C peripheral¹. We finally created a component
database and layout to be incorporated into the ECON-T ASIC
top-level assembly using a digital implementation flow. The
standard flow was modified to accommodate automatic triple
modular redundancy implementation for HLS-generated RTL
integrated with other SystemVerilog modules.
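Triple modular redundancy (TMR) protects state by tripling registers and majority-voting the three copies; bitwise, the voter reduces to the classic two-of-three expression (a conceptual sketch of the voting logic only, not the ASIC netlist):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-of-three majority vote across three register copies;
    a single-event upset corrupting one copy is outvoted."""
    return (a & b) | (b & c) | (a & c)

assert tmr_vote(0b1011, 0b1011, 0b0011) == 0b1011   # one corrupted copy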
We complemented our design flow with a robust validation
and verification methodology across the various refinement
steps. We validated the C++ HLS code against the QKeras
trained model to guarantee the model’s functional correct-
ness. Earlier in the design flow, we also performed dynamic
and static verification of the synthesizable specifications: we
checked design rules with static analysis of the C++ HLS
code (Mentor CDesignChecker [19]), measured coverage met-
rics (Mentor CCov [19]), and finally, ran simulation-based
equivalence checking. For the HLS-generated RTL code, we
followed a more traditional simulation-based verification to
ensure optimized power, area, and speed.
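In practice, such a cross-check can be as simple as a bit-exact comparison of predictions (a sketch with hypothetical variable and file names; the real flow compares the QKeras model, the C++ HLS code, and the generated RTL on the same stimuli):

```python
import numpy as np

# Hypothetical arrays: QKeras predictions vs. outputs captured from the
# C++ HLS (or RTL) simulation on the same validation stimuli.
y_qkeras = encoder.predict(x_val)
y_hls = np.load('hls_csim_outputs.npy')   # hypothetical dump file

# With quantization modeled consistently on both sides, fixed-point
# designs should agree bit-exactly, hence a zero-tolerance comparison.
assert np.array_equal(y_qkeras, y_hls), "HLS output diverges from model"
```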
Architectural Exploration
hls4ml coupled with the industry standard Catapult HLS
(ver. 10.6) tool allowed us to explore the cost and performance
trade-offs of various micro-architectural hardware implemen-
tations for our ML model. We decided on a pipelined imple-
mentation for our accelerator to increase concurrent execution
as an early design decision. A pipelined design can process
new inputs every N clock cycles, where N is the initiation
interval (II) of the design. Table I shows the area breakdown
for different II values (1, 2, 4, 8). It is noticeable that although
the area is higher for II=1, the required resources are mostly
functional logic to implement a highly parallel datapath, i.e., there are no multiplexers. A higher II value implies less design
parallelism and more functional-resource reuse. This choice
reduces the overall area, but the resource breakdown shows
an increase in control logic (MUX) and registers. An II of 1
was ultimately selected so that new inputs may be processed in sync with a single clock operating at the 40 MHz LHC bunch-crossing frequency.
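With II = 1 and the 40 MHz clock, one inference completes every 25 ns, which ties together the power and energy figures reported in the abstract (a quick arithmetic check):

```python
CLOCK_HZ = 40e6      # one new input per cycle at II = 1
POWER_W = 95e-3      # reported total power

period_s = 1.0 / CLOCK_HZ                                   # 25 ns per inference
print(f"{POWER_W * period_s * 1e9:.2f} nJ per inference")   # ~2.4 nJ
```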
We used a fixed-point representation (ac_fixed [20]) for
the input, intermediate, and output parameters of our ML
model designed with hls4ml. This choice provided us with
a high degree of flexibility for exploring the area and accuracy
trade-off of the ML-model hardware implementations obtained
with HLS. The RTL schematics for the encoder block are
shown in Fig. 5. The basic structure of the convolution and
dense layers can be seen at the schematic level. The top right
and bottom right diagrams are zoomed-in portions of this schematic, depicting the output and MACs.
¹ The authors use controller/peripheral in place of master/slave when referring to such I²C devices or processes [18].
REFERENCES

V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. 27th Int. Conf. on Machine Learning (ICML), 2010.

X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. 14th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2011.

S. Chatrchyan et al. (CMS Collaboration), “The CMS experiment at the CERN LHC,” J. Instrum., vol. 3, S08004, 2008.

G. Carleo et al., “Machine learning and the physical sciences,” Rev. Mod. Phys., vol. 91, 045002, 2019.

R. E. Lyons and W. Vanderkulk, “The use of triple-modular redundancy to improve computer reliability,” IBM J. Res. Develop., vol. 6, no. 2, pp. 200–209, 1962.