Activation Density based Mixed-Precision Quantization for Energy Efficient Neural Networks
Karina Vasquez, Yeshwanth Venkatesha, Abhiroop Bhattacharjee, Abhishek Moitra, Priyadarshini Panda
pp. 1360-1365
TLDR
In this paper, the authors propose an energy-efficient mixed-precision quantization method based on activation density, which calculates the optimal bit-width/precision for each layer during training.
Abstract
As neural networks gain widespread adoption in embedded devices, there is a growing need for model compression techniques to facilitate seamless deployment in resource-constrained environments. Quantization is one of the go-to methods yielding state-of-the-art model compression. Most quantization approaches take a fully trained model, apply different heuristics to determine the optimal bit-precision for different layers of the network, and then retrain the network to regain any drop in accuracy. Based on Activation Density (the proportion of non-zero activations in a layer), we propose a novel in-training quantization method. Our method calculates the optimal bit-width/precision for each layer during training, yielding an energy-efficient mixed-precision model with competitive accuracy. Since we train lower-precision models progressively during training, our approach yields the final quantized model at lower training complexity and also eliminates the need for retraining. We run experiments on benchmark datasets such as CIFAR-10, CIFAR-100, and TinyImageNet on VGG19/ResNet18 architectures and report the corresponding accuracy and energy estimates. We achieve up to 4.5× benefit in terms of estimated multiply-and-accumulate (MAC) reduction while reducing the training complexity by 50% in our experiments. To further evaluate the energy benefits of our proposed method, we develop a mixed-precision scalable Processing-In-Memory (PIM) hardware accelerator platform. The hardware platform incorporates shift-add functionality for handling multi-bit precision neural network models. Evaluating the quantized models obtained with our proposed method on the PIM platform yields about 5× energy reduction compared to baseline 16-bit models. Additionally, we find that integrating activation density based quantization with activation density based pruning (both conducted during training) yields up to ~198× and ~44× energy reductions for the VGG19 and ResNet18 architectures, respectively, on the PIM platform compared to baseline 16-bit precision, unpruned models.
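Concretely, activation density (AD) is the fraction of non-zero (post-ReLU) activations a layer produces, and each layer's bit-width is chosen as a function of its AD during training. A minimal Python/NumPy sketch of the idea; the linear density-to-bits mapping and the min_bits/max_bits bounds are illustrative assumptions, not the paper's exact rule:

```python
import numpy as np

def activation_density(act):
    """AD = (# non-zero post-ReLU activations) / (total activations in the layer)."""
    return np.count_nonzero(act) / act.size

def bits_for_layer(ad, min_bits=2, max_bits=16):
    # Hypothetical mapping: sparser layers (low AD) get fewer bits.
    # The paper selects per-layer precision from AD during training;
    # this linear rule is only an illustrative stand-in.
    return int(np.clip(np.round(ad * max_bits), min_bits, max_bits))

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

# Example: a post-ReLU activation map with ~50% zeros maps to a mid-range precision.
act = np.maximum(np.random.randn(64, 32, 32), 0)
bits = bits_for_layer(activation_density(act))
w = np.random.randn(128, 64)
print(bits, quantize(w, bits))
```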
Citations
Journal Article
Look-up-Table Based Processing-in-Memory Architecture With Programmable Precision-Scaling for Deep Learning Applications
TL;DR: A Look-Up Table (LUT) based PIM architecture is proposed to accelerate CNN/DNN workloads on the DRAM memory platform; it replaces logic-based processing with pre-calculated results stored inside the LUTs in order to perform complex computations.
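The underlying trick is to replace arithmetic logic with memory reads: products of low-bit-width operands are precomputed and stored, so a "multiplication" becomes a table lookup. A toy Python illustration of the principle (4-bit unsigned operands assumed to keep the table small; the actual DRAM-based architecture is far more involved):

```python
import numpy as np

BITS = 4
# Precompute all 16x16 products of 4-bit unsigned operands: the "LUT".
LUT = np.fromfunction(lambda a, b: a * b, (2**BITS, 2**BITS), dtype=np.int32)

def lut_multiply(a, b):
    """Multiply via memory lookup instead of combinational logic."""
    return LUT[a, b]

print(lut_multiply(7, 12))  # 84, read straight out of the table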
Journal Article
Quantized Sparse Training: A Unified Trainable Framework for Joint Pruning and Quantization in DNNs
TL;DR: A novel compression framework is proposed that prunes and quantizes networks jointly in a unified training process based on the straight-through estimator, achieving a 135 KB model size in the case of VGG16 without any accuracy degradation.
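The straight-through estimator (STE) makes the non-differentiable quantization step trainable by quantizing in the forward pass while passing gradients through unchanged in the backward pass. A minimal PyTorch sketch of that pattern combined with a pruning mask (the mask threshold and bit-width are illustrative, not the framework's actual settings):

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Forward: uniform symmetric quantization; backward: identity (straight-through)."""

    @staticmethod
    def forward(ctx, w, num_bits):
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # pass the gradient through; no grad for num_bits

# Joint pruning + quantization in a layer's forward pass:
w = torch.randn(256, 256, requires_grad=True)
mask = (w.abs() > 0.5).float()          # illustrative magnitude-based pruning mask
w_q = QuantizeSTE.apply(w * mask, 4)    # quantize the surviving weights to 4 bits
loss = w_q.pow(2).sum()                 # stand-in for a real training loss
loss.backward()                         # gradients reach w via the STE
```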
Journal Article
Impact of Mixed Precision Techniques on Training and Inference Efficiency of Deep Neural Networks
Proceedings Article
Heterogeneous Multi-Functional Look-Up-Table-based Processing-in-Memory Architecture for Deep Learning Acceleration
TL;DR: A multi-functional look-up-table (LUT)-based reconfigurable PIM architecture is proposed to achieve energy efficiency and flexibility simultaneously in hardware accelerators.
Proceedings Article
Coarse-Grained High-speed Reconfigurable Array-based Approximate Accelerator for Deep Learning Applications
TL;DR: A reconfigurable DNN/CNN accelerator is proposed, comprising nine processing elements that can perform both convolution and arithmetic operations through run-time reconfiguration with minimal overhead.
References
Proceedings Article
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
TL;DR: Deep Compression proposes a three-stage pipeline of pruning, trained quantization, and Huffman coding that reduces the storage requirement of neural networks by 35× to 49× without affecting their accuracy.
Proceedings Article
Learning both weights and connections for efficient neural networks
TL;DR: The authors propose a three-step method that learns only the important connections, reducing the storage and computation required by neural networks by an order of magnitude without affecting their accuracy.
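In that method, importance is judged by weight magnitude: after normal training, connections with small weights are removed and the network is retrained with the surviving connections. A short NumPy sketch of the prune step under that assumption (the sparsity target is illustrative):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.sort(np.abs(w), axis=None)[int(w.size * sparsity)]
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.randn(512, 512)       # a trained layer's weights (random stand-in)
w_pruned, mask = magnitude_prune(w)
print(1.0 - mask.mean())            # achieved sparsity, ~0.9
```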
Proceedings Article
Optimal Brain Damage
TL;DR: A class of practical and nearly optimal schemes for adapting the size of a neural network by using second-derivative information to make a tradeoff between network complexity and training set error is derived.
Book Chapter
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
TL;DR: The Binary-Weight-Network version of AlexNet is compared with recent network binarization methods, BinaryConnect and BinaryNets, and outperforms these methods by large margins on ImageNet (more than 16% in top-1 accuracy).
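In a Binary-Weight-Network, each real-valued weight tensor W is approximated as αB with B = sign(W) and scaling factor α = mean(|W|). A minimal NumPy sketch (per-tensor α for brevity; the paper computes α per filter):

```python
import numpy as np

def binarize_weights(w):
    """Approximate w ~= alpha * b with b in {-1, +1} and alpha = mean(|w|)."""
    alpha = np.abs(w).mean()
    b = np.where(w >= 0, 1.0, -1.0)  # sign(), mapping 0 to +1
    return alpha, b

w = np.random.randn(64, 3, 3, 3)      # a conv filter bank (random stand-in)
alpha, b = binarize_weights(w)
print(np.mean((w - alpha * b) ** 2))  # reconstruction error of the approximation
```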
Proceedings Article
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.
Jonathan Frankle, Michael Carbin
TL;DR: This work finds that dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations, and articulates the "lottery ticket hypothesis".
Related Papers
Training High-Performance and Large-Scale Deep Neural Networks with Full 8-bit Integers
An Analytical Method to Determine Minimum Per-Layer Precision of Deep Neural Networks
Charbel Sakr, Naresh R. Shanbhag