
Lei Wang

Researcher at University of Science and Technology Beijing

Publications -  8
Citations -  298

Lei Wang is an academic researcher at the University of Science and Technology Beijing. The author has contributed to research topics including computer science and complex networks, has an h-index of 4, and has co-authored 6 publications receiving 32 citations.

Papers
Journal ArticleDOI

Pruning and quantization for deep neural network acceleration: A survey

TL;DR: Provides a survey on two types of network compression, pruning and quantization, which compares current techniques, analyzes their strengths and weaknesses, provides guidance for compressing networks, and discusses possible future compression techniques.
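As a rough illustration of the two compression styles the survey covers, the following is a minimal numpy sketch of unstructured magnitude pruning and uniform symmetric quantization; the function names, the 50% sparsity target, and the 8-bit setting are illustrative choices, not the survey's own benchmark configuration.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

def uniform_quantize(weights, bits=8):
    """Linearly quantize to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale  # dequantized values, as used at inference

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w, sparsity=0.5)       # half the weights become zero
quantized = uniform_quantize(pruned, bits=8)    # remaining weights snap to the grid
```

In practice the two techniques compose in exactly this order: pruning fixes the sparsity pattern, and quantization then only needs to represent the surviving weights.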
Posted Content

Pruning and Quantization for Deep Neural Network Acceleration: A Survey

TL;DR: In this article, the authors provide a survey on two types of network compression, pruning and quantization. They compare current techniques, analyze their strengths and weaknesses, present compressed-network accuracy results on a number of frameworks, and provide practical guidance for compressing networks.
Posted Content

Dynamic Runtime Feature Map Pruning

TL;DR: This work analyzes the parameter sparsity of six popular convolutional neural networks and introduces dynamic runtime pruning of feature maps, showing that 10% of dynamic feature map execution can be removed without loss of accuracy.
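A minimal sketch of the idea behind runtime feature-map pruning: at inference time, feature maps whose activations carry little energy are zeroed so the downstream work on those channels can be skipped. The energy metric, the threshold, and the function name here are hypothetical stand-ins, not the paper's exact criterion.

```python
import numpy as np

def prune_feature_maps(fmaps, threshold=1e-2):
    """Zero feature maps whose mean absolute activation falls below
    `threshold`, so later layers can skip those channels at runtime.
    fmaps: (channels, H, W) activations after ReLU."""
    energy = np.mean(np.abs(fmaps), axis=(1, 2))   # per-channel activation energy
    keep = energy >= threshold                      # channels worth computing
    return fmaps * keep[:, None, None], keep

rng = np.random.default_rng(1)
fmaps = np.maximum(rng.normal(size=(8, 4, 4)), 0.0)  # ReLU-style activations
fmaps[2] = 0.0                                        # a dead channel
pruned, keep = prune_feature_maps(fmaps)
```

Unlike weight pruning, the decision here depends on the input, so the kept-channel mask changes from one inference to the next.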
Patent

Variable translation-lookaside buffer (TLB) indexing

TL;DR: A processor includes a translation lookaside buffer (TLB) comprising a plurality of ways, wherein each way is associated with a respective page size, and a processing core, communicatively coupled to the TLB, that can execute an instruction associated with a virtual memory page mapped to a first physical memory page.
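A toy software model of the indexing scheme the patent describes: each way is tied to one page size, and a lookup derives the virtual page number per way from that way's page size, so pages of different sizes can coexist in one TLB. The class name, the two page sizes, and the dictionary-backed ways are illustrative simplifications of what is really set-associative hardware.

```python
PAGE_SIZES = {0: 4096, 1: 2 * 1024 * 1024}  # way -> page size (4 KiB, 2 MiB)

class VariableTLB:
    def __init__(self):
        self.ways = {w: {} for w in PAGE_SIZES}  # way -> {vpn: pfn}

    def insert(self, way, vaddr, pfn):
        vpn = vaddr // PAGE_SIZES[way]           # index by this way's page size
        self.ways[way][vpn] = pfn

    def translate(self, vaddr):
        # Probe every way, computing the VPN with that way's page size.
        for way, size in PAGE_SIZES.items():
            vpn = vaddr // size
            if vpn in self.ways[way]:
                return self.ways[way][vpn] * size + vaddr % size
        return None  # TLB miss

tlb = VariableTLB()
tlb.insert(0, 0x1000, pfn=7)      # map a 4 KiB page
tlb.insert(1, 0x200000, pfn=3)    # map a 2 MiB page in the other way
```

The key point is that the same virtual address yields a different index in each way, which is why each way must be probed with its own page-size mask.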
Proceedings ArticleDOI

Heterogeneous Edge CNN Hardware Accelerator

TL;DR: In this article, the authors describe a programmable and scalable Convolutional Neural Network (CNN) hardware accelerator optimized for mobile and edge inference computing, which comprises four heterogeneous engines: an input engine, a filter engine, a post-processing engine, and an output engine.
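The four-engine split can be sketched as a software pipeline, with each stage standing in for one engine; the per-stage responsibilities below (padding, 3x3 multiply-accumulate, ReLU, write-back) are plausible assumptions for illustration, not the accelerator's documented datapath.

```python
import numpy as np

def input_engine(image):
    """Stage input activations (here: pad for a 3x3 convolution)."""
    return np.pad(image, 1)

def filter_engine(padded, kernel):
    """Multiply-accumulate over 3x3 windows (the convolution itself)."""
    h, w = padded.shape[0] - 2, padded.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def post_processing_engine(fmap):
    """Apply the activation function (ReLU)."""
    return np.maximum(fmap, 0.0)

def output_engine(fmap):
    """Write results back (here: just return a contiguous copy)."""
    return np.ascontiguousarray(fmap)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.zeros((3, 3)); kernel[1, 1] = 1.0  # identity kernel for checking
result = output_engine(post_processing_engine(
    filter_engine(input_engine(image), kernel)))
```

In hardware the four engines would run concurrently on streaming tiles; chaining plain functions only captures the division of labor, not the parallelism.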