In this paper, a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip is introduced.
Abstract:
Although the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference and therefore reduction in model size, latency and energy consumption. One technique to limit model size is quantization, which implies using fewer bits to represent weights and biases. Such an approach usually results in a decline in performance. Here, we introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter-type automatic quantization procedure, sampling from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton–proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of $\mathcal{O}(1)\,\upmu\mathrm{s}$
is required. Nanosecond inference and a 50-fold reduction in resource consumption are achieved when the models are implemented on field-programmable gate array (FPGA) hardware. With edge computing on custom hardware, real-time inference with deep neural networks can reach the nanosecond timescale. An important application in this regime is event processing at particle collision detectors such as those at the Large Hadron Collider (LHC). To ensure high performance as well as reduced resource consumption, a method is developed, and made available as an extension of the Keras library, that automatically designs the optimal quantization of the different layers in a deep neural network.
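The Keras extension referred to above is the QKeras library (part of the QKeras+hls4ml workflow discussed later on this page). As a rough sketch of what a heterogeneously quantized model can look like in QKeras, the snippet below defines a small fully connected network with different bit widths per layer and per parameter type; the architecture and bit widths are illustrative placeholders, not the models studied in the paper.

```python
# Illustrative sketch: a small fully connected model with per-layer,
# per-parameter-type quantizers using QKeras. Layer sizes and bit widths
# are placeholders chosen for the example, not values from the paper.
from tensorflow.keras.layers import Input, Activation
from tensorflow.keras.models import Model
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

inputs = Input(shape=(16,))
# First layer: 6-bit weights, 8-bit biases
x = QDense(64,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(8, 3))(inputs)
x = QActivation(quantized_relu(6))(x)
# A narrower 3-bit layer deeper in the network
x = QDense(32,
           kernel_quantizer=quantized_bits(3, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 2))(x)
x = QActivation(quantized_relu(3))(x)
x = QDense(5,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 2))(x)
outputs = Activation("softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

Each layer can thus carry its own quantizer for weights, biases and activations, which is what makes the heterogeneous, per-layer optimization described in the abstract possible.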
Q1. What are the contributions in "Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors"?
The authors introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip.
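To make the contribution concrete, the following is a deliberately simplified illustration of the underlying idea: choosing per-layer bit widths under a size budget by brute-force search. It is not the paper's AutoQKeras implementation, which performs a guided hyperparameter search with an energy model; the function names and the budget value below are hypothetical.

```python
# Simplified illustration of automatic heterogeneous quantization:
# search over per-layer bit widths and keep the most accurate
# configuration that fits a total bit budget. NOT the paper's method.
import itertools

CANDIDATE_BITS = [2, 4, 6, 8]   # quantizer bit widths sampled per layer
SIZE_BUDGET = 3000              # assumed total weight-bit budget

def model_bits(bit_choice, layer_params):
    """Total number of weight bits for a given per-layer bit choice."""
    return sum(b * n for b, n in zip(bit_choice, layer_params))

def search(layer_params, evaluate):
    """Exhaustive search; `evaluate` trains and scores a quantized model."""
    best = None
    for bit_choice in itertools.product(CANDIDATE_BITS,
                                        repeat=len(layer_params)):
        if model_bits(bit_choice, layer_params) > SIZE_BUDGET:
            continue  # violates the size/energy budget
        accuracy = evaluate(bit_choice)
        if best is None or accuracy > best[1]:
            best = (bit_choice, accuracy)
    return best
```

In the paper this search is replaced by a much more efficient, automated procedure that also folds model energy into the objective, but the trade-off being optimized is the same.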
Q2. What have the authors stated for future works in "Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors"?
Taking a pre-trained model and making it suitable for hardware implementation on the edge, both in terms of latency and size, is one of the bottlenecks for bringing ML applications into extremely constrained computing environments (e.g., a detector at a particle collider), and the workflow presented here will allow for a streamlined and simple process, ultimately resulting in a great improvement in the quality of physics data collected in the future. The generality and flexibility of the QKeras+hls4ml workflow opens up a wide array of possible future work. In addition, while the energy estimator provides a good baseline for relative energy consumption, as demonstrated, the authors hope to extend the library to provide more device-specific absolute energy estimates. The authors also plan to explore using a combination of block energy and the curvature of the weight space, as done in HAQ, when quantizing a network one block at a time.
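As a usage-level sketch of the QKeras+hls4ml workflow mentioned above, a trained quantized Keras model can be converted into an FPGA firmware project roughly as follows. This assumes a recent hls4ml release; the FPGA part number, clock settings and output directory are placeholders rather than the configuration used in the paper.

```python
# Minimal sketch of converting a (Q)Keras model into an FPGA project
# with hls4ml. The part number and output directory are placeholders.
import hls4ml

# Per-layer precision/configuration derived from the Keras model
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls_project',
    part='xcvu9p-flga2104-2-e',  # example FPGA part, not from the paper
)

hls_model.compile()              # build the bit-accurate C simulation library
# y_hls = hls_model.predict(X)   # emulate the firmware output in software
# hls_model.build(csim=False)    # run HLS synthesis (requires Vivado/Vitis HLS)
```

The generated project can then be synthesized with the FPGA vendor tools to obtain latency and resource estimates of the kind reported in the paper.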