Value-Aware Quantization for Training and Inference of Neural Networks
Eunhyeok Park, Sungjoo Yoo, Peter Vajda
pp. 608-624
TL;DR: Value-aware quantization applies aggressively reduced precision to the majority of data while separately handling a small amount of large values in high precision, which reduces total quantization errors under very low precision.
Abstract:
We propose a novel value-aware quantization which applies aggressively reduced precision to the majority of data while separately handling a small amount of large values in high precision, which reduces total quantization errors under very low precision. We present new techniques to apply the proposed quantization to training and inference. The experiments show that our method with 3-bit activations (with 2% of large values kept in high precision) can give the same training accuracy as full-precision training while offering significant (41.6% and 53.7%) reductions in the memory cost of activations in ResNet-152 and Inception-v3 compared with the state-of-the-art method. Our experiments also show that deep networks such as Inception-v3, ResNet-101 and DenseNet-121 can be quantized for inference with 4-bit weights and activations (with 1% 16-bit data) within 1% top-1 accuracy drop.
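The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `value_aware_quantize`, the sort-based selection of large values, and the symmetric uniform quantizer are all assumptions made for illustration. The sketch keeps the top `large_ratio` fraction of values (by magnitude) unchanged and quantizes the rest to `bits` bits, with the quantization range sized to the small values only.

```python
import random


def value_aware_quantize(values, bits=3, large_ratio=0.02):
    """Quantize most values to `bits` bits; keep the largest ones unchanged.

    Illustrative only: a symmetric uniform quantizer is assumed here.
    """
    n = len(values)
    n_large = int(round(n * large_ratio))
    # Indices ordered by magnitude, largest first; the first n_large are "large".
    order = sorted(range(n), key=lambda i: abs(values[i]), reverse=True)
    large_idx = set(order[:n_large])
    # The quantization range is set by the small values only.
    max_small = max(
        (abs(values[i]) for i in range(n) if i not in large_idx), default=0.0
    )
    levels = 2 ** (bits - 1) - 1  # 3 positive levels for signed 3-bit
    scale = max_small / levels if max_small > 0 else 1.0
    out = []
    for i, v in enumerate(values):
        if i in large_idx:
            out.append(v)  # high-precision path: stored as-is
        else:
            q = max(-levels, min(levels, round(v / scale)))
            out.append(q * scale)  # low-precision path
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic data (an assumption): mostly small values plus 2% large outliers.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(980)]
data += [random.choice([-1, 1]) * random.uniform(10.0, 30.0) for _ in range(20)]

quantized = value_aware_quantize(data, bits=3, large_ratio=0.02)
plain = value_aware_quantize(data, bits=3, large_ratio=0.0)  # no large-value path

print("value-aware MSE:", mse(data, quantized))
print("plain 3-bit MSE:", mse(data, plain))
```

With `large_ratio=0.0` the few large values stretch the quantization range, so most small values collapse onto very few levels. Excluding them from the range is what lets the low-precision path keep a finer step size.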
Citations
Proceedings Article
HAWQ: Hessian AWare Quantization of Neural Networks With Mixed-Precision
TL;DR: Hessian AWare Quantization (HAWQ), a novel second-order quantization method that allows for the automatic selection of the relative quantization precision of each layer, based on the layer's Hessian spectrum, is introduced.
Proceedings Article
ZeroQ: A Novel Zero Shot Quantization Framework
TL;DR: ZeroQ enables mixed-precision quantization without any access to the training or validation data, and it can finish the entire quantization process in less than 30s, which is very low computational overhead.
Proceedings Article
Energy-efficient neural network accelerator based on outlier-aware low-precision computation
TL;DR: The outlier-aware accelerator (OLAccel) performs dense and low-precision computations for a majority of data (weights and activations) while efficiently handling a small number of sparse and high-precision outliers (e.g., amounting to 3% of total data).
Posted Content
Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model.
Aishwarya Bhandare, Vamsi Sripathi, Deepthi Karkada, Vivek Menon, Sun Choi, Kushal Datta, Vikram A. Saletore
TL;DR: This work quantizes a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel Cascade Lake processors to improve inference performance while maintaining less than 0.5% drop in accuracy.
Posted Content
HAWQV3: Dyadic Neural Network Quantization
Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer
TL;DR: This work presents HAWQV3, a novel dyadic quantization framework, and shows that mixed-precision INT4/8 quantization can be used to achieve higher speed ups, as compared to INT8 inference, with minimal impact on accuracy.
References
Report
Building a large annotated corpus of English: the penn treebank
TL;DR: As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Book Chapter
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
TL;DR: The Binary-Weight-Network version of AlexNet is compared with recent network binarization methods, BinaryConnect and BinaryNets, and outperforms these methods by large margins on ImageNet, more than 16% in top-1 accuracy.
Posted Content
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Khaitan Harshit, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Posted Content
Federated Learning: Strategies for Improving Communication Efficiency
Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon
TL;DR: Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
Posted Content
Recurrent Neural Network Regularization
TL;DR: This paper shows how to correctly apply dropout to LSTMs, and shows that it substantially reduces overfitting on a variety of tasks.