To prune, or not to prune: exploring the efficacy of pruning for model compression

Open Access Posted Content

Michael H. Zhu, +1 more

- 05 Oct 2017 -

arXiv: Machine Learning

TLDR

In this article, the authors investigate two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.

Abstract:

Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep networks at the cost of only a marginal loss in accuracy and achieve a sizable reduction in model size. This hints at the possibility that the baseline models in these experiments are perhaps severely over-parameterized at the outset and a viable alternative for model compression might be to simply reduce the number of hidden units while maintaining the model's dense connection structure, exposing a similar trade-off in model size and accuracy. We investigate these two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning and can be seamlessly incorporated within the training process. We compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint. Across a broad range of neural network architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find large-sparse models to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
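
The gradual pruning idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the schedule, so the code below makes two assumptions: the sparsity target follows a cubic ramp from an initial to a final value, and the weights removed at each update are those with the smallest magnitude. All function names, parameter names, and default values are hypothetical and chosen only for illustration.

```python
import random


def target_sparsity(step, s_init=0.0, s_final=0.9, start=0, n_updates=10, every=100):
    """Sparsity target at a training step under an assumed cubic ramp."""
    if step < start:
        return s_init
    end = start + n_updates * every
    if step >= end:
        return s_final
    progress = (step - start) / (n_updates * every)
    # Prunes quickly at first, then more slowly as the target approaches s_final.
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude fraction of weights."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices sorted by ascending absolute value; the first n_prune are masked out.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    for step in range(0, 1001, 100):
        s = target_sparsity(step)
        mask = magnitude_mask(weights, s)
        # In real training, gradient updates would run between mask updates;
        # they are omitted here to keep the sketch short.
        weights = [w * m for w, m in zip(weights, mask)]
        nonzero = sum(1 for w in weights if w != 0.0)
        print(step, round(s, 3), nonzero)
```

With the default final sparsity of 0.9, the loop ends with one tenth of the weights remaining, which corresponds to the 10x reduction in non-zero parameters quoted in the abstract. Because already-pruned weights have zero magnitude, they sort first and stay pruned at every later update.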

Citations

Proceedings Article

BADGE: Speeding Up BERT Inference after Deployment via Block-wise Bypasses and Divergence-based Early Exiting

Wei Zhu, +4 more

TL;DR: This paper proposed a novel divergence-based early exiting (DGE) mechanism, which obtains early exiting signals by comparing the predicted distributions of two adjacent layers' exits, which can alleviate the conflicts in jointly training multiple intermediate classifiers and thus improve the overall performances of multi-exit language models.

Journal Article

Artificial Neural Network evaluation of Poincaré constant for Voronoi polygons

Beatrice Crippa, +2 more

- 21 Jun 2022 -

arXiv.org

TL;DR: A method, based on Artificial Neural Networks, that learns the dependence of the constant in the Poincaré inequality on polygonal elements of Voronoi meshes, on some geometrical metrics of the element.

Journal Article

Energy-Efficient Approximate Edge Inference Systems

Arnab Raha, +1 more

- 31 Mar 2023 -

ACM Transactions in Embedded Computing S...

TL;DR: In this paper, an approximate edge inference system (AxIS) is proposed to perform joint approximations between different subsystems in a deep neural network (DNN)-based inference system, leading to significant energy benefits compared to approximating individual subsystems.

Journal Article

Dodging the Sparse Double Descent

Victor Quétu, +1 more

- 02 Mar 2023 -

arXiv.org

TL;DR: This article proposes a learning framework that avoids the sparse double descent phenomenon and improves generalization, introduces an entropy measure that gives more insight into its emergence, and provides a comprehensive quantitative analysis of factors such as re-initialization methods, model width and depth, and dataset noise.

Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?

Brett W. Larsen, +2 more

TL;DR: This article showed that the role of retraining in IMP is to find a network with new small weights to prune, which limits the fraction of weights that can be pruned at each iteration of IMP.

References

Posted Content

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, +4 more

- 02 Dec 2015 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work is exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization.

Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew Howard, +7 more

- 17 Apr 2017 -

arXiv: Computer Vision and Pattern Recog...

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.

Proceedings Article

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, +3 more

TL;DR: Deep Compression is a three-stage pipeline of pruning, quantization, and Huffman coding that reduces the storage requirement of neural networks by 35x to 49x without affecting their accuracy.

Posted Content

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, +30 more

- 26 Sep 2016 -

arXiv: Computation and Language

TL;DR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.

Proceedings Article

Learning both weights and connections for efficient neural networks

Song Han, +3 more

TL;DR: In this paper, the authors proposed a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections using a three-step method.

Related Papers (5)

Learning both weights and connections for efficient neural networks

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Deep Residual Learning for Image Recognition

Optimal Brain Damage

Learning Multiple Layers of Features from Tiny Images