Posted Content

PACT: Parameterized Clipping Activation for Quantized Neural Networks

TL;DR: It is shown, for the first time, that both weights and activations can be quantized to 4-bits of precision while still achieving accuracy comparable to full precision networks across a range of popular models and datasets.
Abstract: Deep learning algorithms achieve high classification accuracy at the expense of significant computation cost. To address this cost, a number of quantization schemes have been proposed - but most of these techniques focused on quantizing weights, which are relatively smaller in size compared to activations. This paper proposes a novel quantization scheme for activations during training that enables neural networks to work well with ultra-low-precision weights and activations without any significant accuracy degradation. This technique, PArameterized Clipping acTivation (PACT), uses an activation clipping parameter $\alpha$ that is optimized during training to find the right quantization scale. PACT allows quantizing activations to arbitrary bit precisions, while achieving much better accuracy relative to published state-of-the-art quantization schemes. We show, for the first time, that both weights and activations can be quantized to 4 bits of precision while still achieving accuracy comparable to full-precision networks across a range of popular models and datasets. We also show that exploiting these reduced-precision computational units in hardware can enable a super-linear improvement in inferencing performance, due to a significant reduction in the area of accelerator compute engines coupled with the ability to retain the quantized model and activation data in on-chip memories.
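
As a concrete illustration of the technique the abstract describes, here is a minimal PyTorch sketch of a PACT activation layer, assuming the straight-through estimator for the rounding step and one learnable clipping scalar per layer; the class and argument names are ours, not from the paper's code:

```python
import torch
import torch.nn as nn

class PACT(nn.Module):
    """y = clip(x, 0, alpha), followed by k-bit uniform quantization.

    alpha is a learnable scalar trained jointly with the weights; the
    non-differentiable round() is bypassed with a straight-through estimator.
    """
    def __init__(self, bits=4, alpha_init=10.0):
        super().__init__()
        self.bits = bits
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x):
        # Clip to [0, alpha]; written so that gradients reach alpha
        # whenever x exceeds the clipping level.
        y = torch.clamp(x, min=0.0) - torch.clamp(x - self.alpha, min=0.0)
        # Uniform quantization of the clipped range to 2^bits - 1 levels.
        scale = (2 ** self.bits - 1) / self.alpha
        y_q = torch.round(y * scale) / scale
        # Straight-through estimator: forward uses y_q, backward sees y.
        return y + (y_q - y).detach()
```

Because alpha is a trained parameter, each layer finds its own clipping level, which is what lets the quantization grid track the actual activation range.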


Citations
Book ChapterDOI
31 Mar 2019
TL;DR: A Single Path One-Shot model is proposed to construct a simplified supernet, where all architectures are single paths, so that the weight co-adaptation problem is alleviated.
Abstract: We revisit the one-shot Neural Architecture Search (NAS) paradigm and analyze its advantages over existing NAS approaches. The existing one-shot method, however, is hard to train and not yet effective on large-scale datasets like ImageNet. This work proposes a Single Path One-Shot model to address the challenges in training. Our central idea is to construct a simplified supernet, where all architectures are single paths, so that the weight co-adaptation problem is alleviated. Training is performed by uniform path sampling. All architectures (and their weights) are trained fully and equally.
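
A small PyTorch sketch of the single-path training idea from this abstract, assuming uniform sampling over per-layer candidate blocks; all class and function names are illustrative:

```python
import random
import torch.nn as nn

class ChoiceLayer(nn.Module):
    """One supernet layer holding several candidate blocks (single-path choices)."""
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)

    def forward(self, x, choice):
        return self.candidates[choice](x)

class SinglePathSupernet(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def sample_path(self):
        # Uniform path sampling: pick one candidate per layer.
        return [random.randrange(len(l.candidates)) for l in self.layers]

    def forward(self, x, path):
        for layer, choice in zip(self.layers, path):
            x = layer(x, choice)
        return x

# Each training step activates exactly one sampled path, e.g.:
#   path = supernet.sample_path(); loss = criterion(supernet(x, path), y)
# so candidate weights are trained fully and equally rather than co-adapting.
```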

610 citations


Cites methods from "PACT: Parameterized Clipping Activa..."

  • ...We use PACT [5] as the quantization algorithm....


  • ...Following [36, 5, 27], we only search and quantize the res-blocks, excluding the first convolutional layer and the last fully-connected layer....


Proceedings ArticleDOI
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han
15 Jun 2019
TL;DR: Wang et al. introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback into account in the design loop.
Abstract: Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore the vast design space, trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. There is plenty of specialized hardware for neural networks, but little research has been done on specializing neural network optimization for a particular hardware architecture. Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with fixed-bitwidth (8-bit) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpret the implications of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.
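
To make the search loop concrete, here is a toy, self-contained sketch of hardware-aware mixed-precision search in the spirit of HAQ; `toy_latency` and `toy_accuracy` are invented stand-ins for the paper's hardware simulator and fine-tuned accuracy, and random sampling replaces the RL (DDPG) agent the paper actually uses:

```python
import random

LAYER_SIZES = [1.0, 2.0, 4.0, 2.0]   # relative compute per layer (made up)
LATENCY_BUDGET = 20.0                # hardware constraint (made up)

def toy_latency(bits):
    # Stand-in for direct simulator feedback: latency grows with bitwidth.
    return sum(s * b for s, b in zip(LAYER_SIZES, bits))

def toy_accuracy(bits):
    # Stand-in for fine-tuned accuracy: diminishing returns above 6 bits.
    return sum(min(b, 6) for b in bits) / (6 * len(LAYER_SIZES))

best = None
for _ in range(1000):
    # Propose a bitwidth (1-8) per layer; the real method learns this policy.
    bits = [random.randint(1, 8) for _ in LAYER_SIZES]
    # Enforce the hardware constraint by lowering the widest layer.
    while toy_latency(bits) > LATENCY_BUDGET:
        bits[bits.index(max(bits))] -= 1
    acc = toy_accuracy(bits)
    if best is None or acc > best[0]:
        best = (acc, bits)

print("best mixed-precision policy:", best)
```

The key design point the abstract emphasizes survives even in this toy: the constraint is checked against (simulated) hardware feedback, not proxies like FLOPs or model size.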

467 citations



Cites methods from "PACT: Parameterized Clipping Activa..."

  • Quoted comparison table (Top-1/Top-5 accuracy in % at matched latency): PACT [3] (4 bits): 62.44/84.19 @ 45.45 ms; 61.39/83.72 @ 52.15 ms; 62.44/84.19 @ 57.49 ms; 61.39/83.72 @ 74.46 ms. Ours (flexible bitwidth): 67.40/87.90 @ 45.51 ms; 66.99/87.33 @ 52.12 ms; 65.33/86.60 @ 57.40 ms; 67.01/87.46 @ 73.97 ms. "...our edge device and Xilinx VU9P [29] as our cloud device...."


  • ...In order to demonstrate the effectiveness of our framework on different hardware architectures, we further compare our framework with PACT [3] under the latency constraints on the BitFusion [26] architecture (Table 4)....


  • ...Conventional quantization methods use the same number of bits for all layers [3, 15], but as different layers have different redundancy and behave differently on the hardware (computation bounded or memory bounded), it is necessary to use mixed precision for different layers (as shown in Figure 1)....


  • ...As for comparison, we adopt PACT [3] as our baseline, which uses the same number of bits for all layers except the first; since the first layer extracts low-level features, has fewer parameters, and is very sensitive to errors, they use 8 bits for both its weights and activations....


  • ...Similar to the latency-constrained experiments, we compare our framework with PACT [3] that uses fixed number of bits without hardware feedback....


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Differentiable Soft Quantization (DSQ) is proposed to bridge the gap between full-precision and low-bit networks; it evolves automatically during training to gradually approximate standard quantization.
Abstract: Hardware-friendly network quantization (e.g., binary/uniform quantization) can efficiently accelerate the inference and meanwhile reduce memory consumption of the deep neural networks, which is crucial for model deployment on resource-limited devices like mobile phones. However, due to the discreteness of low-bit quantization, existing quantization methods often face the unstable training process and severe performance degradation. To address this problem, in this paper we propose Differentiable Soft Quantization (DSQ) to bridge the gap between the full-precision and low-bit networks. DSQ can automatically evolve during training to gradually approximate the standard quantization. Owing to its differentiable property, DSQ can help pursue the accurate gradients in backward propagation, and reduce the quantization loss in forward process with an appropriate clipping range. Extensive experiments over several popular network structures show that training low-bit neural networks with DSQ can consistently outperform state-of-the-art quantization methods. Besides, our first efficient implementation for deploying 2 to 4-bit DSQ on devices with ARM architecture achieves up to 1.7× speed up, compared with the open-source 8-bit high-performance inference framework NCNN [31].
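
A hedged sketch of the soft quantization function this abstract describes: inside each quantization cell a scaled tanh approximates the hard rounding step, and as the temperature parameter shrinks the curve approaches standard uniform quantization. This is simplified from the paper, and the parameterization below (`alpha` as a temperature) is ours:

```python
import math
import torch

def dsq(x, l, u, bits=2, alpha=0.2):
    """Differentiable soft quantization of x over the clipping range [l, u]."""
    levels = 2 ** bits - 1
    delta = (u - l) / levels                      # width of one quantization cell
    x = torch.clamp(x, l, u)
    i = torch.floor((x - l) / delta).clamp(max=levels - 1)
    m = l + (i + 0.5) * delta                     # cell midpoint
    k = 1.0 / alpha                               # tanh sharpness
    s = 1.0 / math.tanh(0.5 * k * delta)          # rescale so cell edges map to +/-1
    phi = s * torch.tanh(k * (x - m))             # soft "round" within the cell
    return m + 0.5 * delta * phi                  # smooth, fully differentiable
```

As alpha shrinks toward 0 this converges to ordinary rounding, which is why training can start smooth (accurate gradients) and gradually approach hard quantization.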

363 citations

Journal ArticleDOI
TL;DR: A comprehensive survey of algorithms proposed for binary neural networks, mainly categorized into native solutions that directly conduct binarization and optimized ones using techniques such as minimizing the quantization error, improving the network loss function, and reducing the gradient error.

346 citations


Cites methods from "PACT: Parameterized Clipping Activa..."

  • ...can also adopt a more flexible binarization function and learn its parameters while minimizing the quantization error. To achieve this goal, Choi et al. proposed PArameterized Clipping acTivation (PACT) [74] with a learnable upper bound for the activation function. The optimized upper bound of each layer is able to ensure that the quantization range of each layer is aligned with the original distribution...


  • [Fragment of the survey's method-overview table, listing PArameterized Clipping acTivation [74] among binarization techniques such as High-Order Residual Quantization [70], ABC-Net [71], Two-Step Quantization [72], Binary Weight Networks via Hashing [73], LQ-Nets [61], Wide Reduced-Precision Networks [75], XNOR-Net++ [76], Learning Symmetric Quantization [77], BBG [78], and Real-to-Bin [79].]


  • [Fragment of the survey's ImageNet results table (bit-width W/A, topology, Top-1/Top-5 accuracy in %). PACT [74] rows: 1/32 ResNet-18 65.8/86.7; 1/2 ResNet-18 62.9/84.7; 1/2 ResNet-50 67.8/87.9. Neighboring rows cover ABC-Net [71], TSQ [72], BWNH [73], and LQ-Nets [61].]


  • ...ions are binarized. Thus eliminating the influence of activation binarization is usually much more important when designing a binary network, which becomes the main motivation for studies like [85] and [74]....


References
Proceedings ArticleDOI
27 Jun 2016
TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; an ensemble of these residual nets won 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
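
A minimal sketch of the identity-shortcut block the abstract describes (the "basic" two-convolution variant with matching channel counts; real ResNets also use strided and projection shortcuts):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the layers learn F(x) and the output is F(x) + x,
    so the block only has to model a residual relative to its input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut
```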

123,388 citations

Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
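
The update the abstract summarizes, written out for a single scalar parameter with the paper's default hyper-parameters:

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # 1st-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2nd-moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

The bias-correction terms counteract the zero initialization of m and v, which matters most in the first few steps.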

111,197 citations


"PACT: Parameterized Clipping Activa..." refers methods in this paper

  • ...We used ADAM with epsilon 10^-5 and learning rate starting from 10^-4 and scaled by 0.2 at epochs 56 and 64....


  • ...• Quantization using Wide Reduced-Precision Networks (WRPN, Mishra et al. (2017)): A scheme to increase the number of filter maps to increase robustness for activation quantization....


  • ...• Fine-grained Quantization (FGQ, Mellempudi et al. (2017)): A direct quantization scheme (i.e., little re-training needed) based on fine-grained grouping (i.e., within a small subset of filter maps)....


  • ...(2017)), FGQ (Mellempudi et al. (2017)), WEP (Park et al. (2017)), LPBN (Graham (2017)), and HWGQ (Cai et al....


  • ...For comparisons, we include accuracy results reported in the following prior work: DoReFa (Zhou et al. (2016)), BalancedQ (Zhou et al. (2017)), WRPN (Mishra et al. (2017)), FGQ (Mellempudi et al. (2017)), WEP (Park et al. (2017)), LPBN (Graham (2017)), and HWGQ (Cai et al. (2017))....


Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network consisting of five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art performance on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
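
A compact PyTorch sketch of the architecture the abstract describes, for a 227×227 input: five convolutional layers (some followed by max-pooling), three fully-connected layers with dropout, and a final 1000-way classifier. Local response normalization and the original two-GPU channel grouping are omitted for brevity:

```python
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),   # logits; softmax is applied in the loss
)
```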

73,978 citations

Proceedings Article
Sergey Ioffe, Christian Szegedy
06 Jul 2015
TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
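
The per-mini-batch normalization the abstract describes, as a short training-mode sketch for a batch of feature vectors of shape (N, D); inference uses running statistics instead, which are omitted here:

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch to zero mean / unit variance,
    then apply a learned affine transform (gamma, beta of shape (D,)) so the
    layer can still represent the identity."""
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta
```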

30,843 citations


"PACT: Parameterized Clipping Activa..." refers background in this paper

  • ...Graham (2017) recommends that normalized activation, in the process of batch normalization (Ioffe & Szegedy (2015), BatchNorm), is a good candidate for quantization....


Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

30,811 citations