Open Access Book Chapter DOI

Value-Aware Quantization for Training and Inference of Neural Networks

TL;DR
In this paper, value-aware quantization is proposed to apply aggressively reduced precision to the majority of data while separately handling a small number of large values in high precision, which reduces total quantization errors under very low precision.
Abstract
We propose a novel value-aware quantization which applies aggressively reduced precision to the majority of data while separately handling a small number of large values in high precision, which reduces total quantization errors under very low precision. We present new techniques to apply the proposed quantization to training and inference. The experiments show that our method with 3-bit activations (with 2% of large ones) gives the same training accuracy as full-precision training while offering significant (41.6% and 53.7%) reductions in the memory cost of activations in ResNet-152 and Inception-v3 compared with the state-of-the-art method. Our experiments also show that deep networks such as Inception-v3, ResNet-101 and DenseNet-121 can be quantized for inference with 4-bit weights and activations (with 1% 16-bit data) within a 1% top-1 accuracy drop.
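The core idea can be illustrated with a short sketch. The snippet below is a minimal NumPy illustration, not the authors' implementation: it keeps roughly the top 1-2% largest-magnitude values in full precision and applies uniform low-bit quantization only to the remaining values. The threshold choice, symmetric rounding scheme, and function name are assumptions made for illustration.

```python
import numpy as np

def value_aware_quantize(x, bits=4, outlier_ratio=0.01):
    """Illustrative value-aware quantization (not the paper's exact algorithm).

    Keeps the `outlier_ratio` fraction of largest-magnitude entries in full
    precision and uniformly quantizes the remaining entries to `bits` bits.
    """
    flat = x.ravel()
    k = max(1, int(np.ceil(outlier_ratio * flat.size)))
    # Threshold separating "large" values (kept in high precision) from the rest.
    thresh = np.partition(np.abs(flat), flat.size - k)[flat.size - k]
    outlier_mask = np.abs(x) >= thresh

    # Uniform symmetric quantization of the low-precision majority.
    levels = 2 ** (bits - 1) - 1
    small = np.where(outlier_mask, 0.0, x)
    scale = np.max(np.abs(small)) / levels if np.any(~outlier_mask) else 1.0
    scale = scale if scale > 0 else 1.0
    q_small = np.round(small / scale) * scale

    # Reassemble: quantized majority plus full-precision outliers.
    return np.where(outlier_mask, x, q_small), outlier_mask

# Example: quantize a synthetic activation tensor to 3 bits with 2% outliers.
acts = np.random.randn(1024).astype(np.float32) * 5
q_acts, mask = value_aware_quantize(acts, bits=3, outlier_ratio=0.02)
print("mean abs error:", np.abs(acts - q_acts).mean())
```

Because only the small outlier set needs high-precision storage, the memory cost is dominated by the low-bit majority, which is where the reported activation-memory savings come from.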



Citations
Proceedings Article DOI

HAWQ: Hessian AWare Quantization of Neural Networks With Mixed-Precision

TL;DR: Hessian AWare Quantization (HAWQ), a novel second-order quantization method, is introduced; it automatically selects the relative quantization precision of each layer based on the layer's Hessian spectrum.
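The sketch below is a hypothetical illustration of that idea, not HAWQ's implementation: it estimates each layer's top Hessian eigenvalue with power iteration on an explicit per-layer Hessian matrix (HAWQ itself works with Hessian-vector products on the full network) and assigns more bits to layers with larger eigenvalues. The `layer_hessians` input and the even-split bit assignment are assumptions for illustration.

```python
import numpy as np

def top_eigenvalue(hessian, iters=50):
    """Power iteration for the largest eigenvalue of a symmetric matrix."""
    v = np.random.randn(hessian.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = hessian @ v
        v /= np.linalg.norm(v)
    return float(v @ hessian @ v)

def assign_bits(layer_hessians, bit_choices=(8, 4, 2)):
    """Hypothetical rule: layers with larger top eigenvalues (more sensitive
    to weight perturbation) receive higher bit-widths."""
    eigs = {name: top_eigenvalue(h) for name, h in layer_hessians.items()}
    ranked = sorted(eigs, key=eigs.get, reverse=True)
    chunk = max(1, int(np.ceil(len(ranked) / len(bit_choices))))
    return {name: bit_choices[min(i // chunk, len(bit_choices) - 1)]
            for i, name in enumerate(ranked)}

# Example with random symmetric positive semidefinite "Hessians" for three layers.
rng = np.random.default_rng(0)
hs = {f"layer{i}": (lambda a: a @ a.T)(rng.standard_normal((16, 16))) for i in range(3)}
print(assign_bits(hs))
```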
Proceedings Article DOI

ZeroQ: A Novel Zero Shot Quantization Framework

TL;DR: ZeroQ enables mixed-precision quantization without any access to the training or validation data and can finish the entire quantization process in less than 30 seconds, incurring very low computational overhead.
Proceedings Article DOI

Energy-efficient neural network accelerator based on outlier-aware low-precision computation

TL;DR: The outlier-aware accelerator (OLAccel) performs dense and low-precision computations for the majority of data (weights and activations) while efficiently handling a small number of sparse and high-precision outliers (e.g., amounting to 3% of total data).
Posted Content

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model.

TL;DR: This work quantizes a trained Transformer machine language translation model, leveraging INT8/VNNI instructions in the latest Intel Cascade Lake processors to improve inference performance while keeping the accuracy drop below 0.5%.
Posted Content

HAWQV3: Dyadic Neural Network Quantization

TL;DR: This work presents HAWQV3, a novel dyadic quantization framework, and shows that mixed-precision INT4/8 quantization achieves higher speedups than INT8 inference with minimal impact on accuracy.
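"Dyadic" here means every requantization scale is expressed as an integer numerator over a power-of-two denominator, so inference needs only integer multiplies and bit shifts. The helper below is a hypothetical illustration of that representation, not HAWQV3 code; the 16-bit shift and the example scale are arbitrary assumptions.

```python
def to_dyadic(scale, shift_bits=16):
    """Approximate a real scale as b / 2**shift_bits with integer b."""
    return int(round(scale * (1 << shift_bits))), shift_bits

def dyadic_rescale(acc, b, c):
    """Apply the dyadic scale to an integer accumulator using only integer ops."""
    return (acc * b) >> c

b, c = to_dyadic(0.0123)           # e.g. a requantization scale s_w * s_x / s_y
print(dyadic_rescale(1000, b, c))  # ~= round(1000 * 0.0123)
```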
References
Report DOI

Building a Large Annotated Corpus of English: The Penn Treebank

TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
Book Chapter DOI

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

TL;DR: The Binary-Weight-Network version of AlexNet is compared with recent network binarization methods, BinaryConnect and BinaryNets, and outperforms these methods by large margins on ImageNet, by more than 16% in top-1 accuracy.
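In Binary-Weight-Networks, real-valued weights are approximated by a sign tensor times a single scaling factor equal to the mean absolute weight. A minimal NumPy sketch of that approximation (applied here to a whole tensor; XNOR-Net computes the scale per filter):

```python
import numpy as np

def binarize_weights(w):
    """Approximate w ~= alpha * sign(w) with alpha = mean(|w|)."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w), alpha

w = np.random.randn(64, 3, 3, 3)          # e.g. a conv filter bank
w_bin, alpha = binarize_weights(w)
print("mean approximation error:", np.abs(w - w_bin).mean())
```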
Posted Content

In-Datacenter Performance Analysis of a Tensor Processing Unit

TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 to accelerate the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Posted Content

Federated Learning: Strategies for Improving Communication Efficiency

TL;DR: Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
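A minimal sketch of a "sketched update" in the sense above, assuming a simple pipeline of random subsampling followed by uniform quantization; the random-rotation step is omitted for brevity, and all function and parameter names are illustrative rather than taken from the paper.

```python
import numpy as np

def sketch_update(update, keep_ratio=0.1, bits=8, seed=0):
    """Illustrative sketched update: random subsampling + uniform quantization."""
    rng = np.random.default_rng(seed)
    mask = rng.random(update.shape) < keep_ratio       # random subsampling mask
    vals = update[mask]
    lo, hi = vals.min(), vals.max()
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    q = np.round((vals - lo) / scale).astype(np.uint8)  # uniform b-bit quantization
    return q, lo, scale, mask

def reconstruct(q, lo, scale, mask, keep_ratio=0.1):
    out = np.zeros(mask.shape, dtype=np.float64)
    # Divide by keep_ratio so the reconstruction is unbiased w.r.t. the random mask.
    out[mask] = (q.astype(np.float64) * scale + lo) / keep_ratio
    return out

# Example: compress and reconstruct a synthetic model update.
upd = np.random.randn(10_000)
q, lo, scale, mask = sketch_update(upd, keep_ratio=0.1, bits=8)
print("reconstruction error:", np.abs(upd - reconstruct(q, lo, scale, mask)).mean())
```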
Posted Content

Recurrent Neural Network Regularization

TL;DR: This paper shows how to correctly apply dropout to LSTMs, and shows that it substantially reduces overfitting on a variety of tasks.