Open Access Posted Content

To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael H. Zhu, +1 more
- 05 Oct 2017 - 
TLDR
In this article, the authors investigate two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.
Abstract: 
Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep networks at the cost of only a marginal loss in accuracy and achieve a sizable reduction in model size. This hints at the possibility that the baseline models in these experiments are perhaps severely over-parameterized at the outset and a viable alternative for model compression might be to simply reduce the number of hidden units while maintaining the model's dense connection structure, exposing a similar trade-off in model size and accuracy. We investigate these two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning and can be seamlessly incorporated within the training process. We compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint. Across a broad range of neural network architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find large-sparse models to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
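To make the gradual pruning idea concrete, the sketch below ramps a sparsity target from an initial to a final value over the course of training and masks the smallest-magnitude weights at each pruning step. It is a minimal NumPy illustration: the cubic ramp and the hyper-parameter names (begin_step, end_step, final_sparsity) are assumptions not stated on this page, and the paper's actual schedule and its integration with training may differ.

```python
import numpy as np

def sparsity_at_step(step, begin_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic ramp of the sparsity target between begin_step and end_step (assumed form)."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > threshold

# Toy "training" loop: prune every 100 steps while the sparsity target ramps up.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
for step in range(0, 10001, 100):
    s = sparsity_at_step(step, begin_step=1000, end_step=9000, final_sparsity=0.9)
    w *= magnitude_mask(w, s)   # real training would keep updating the surviving weights
print(f"final sparsity: {np.mean(w == 0):.2%}")
```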


Citations
Proceedings ArticleDOI

Object Detection Edge Performance Optimization on FPGA-Based Heterogeneous Multiprocessor Systems

TL;DR: A workflow is proposed with a series of optimization approaches, such as model pruning, model quantization, and multi-threaded design, for implementing an object detection task based on YOLOv4-CSP on an FPGA-based heterogeneous multiprocessor system.
Journal ArticleDOI

Structured Compression of Convolutional Neural Networks for Specialized Tasks

Freddy Gabbay, +2 more
- 08 Oct 2022 - 
TL;DR: The S-VELCRO compression algorithm is introduced, which exploits value locality to trim filters in CNN models used for specialized tasks; it achieves a compression-saving ratio between 6% and 30% with no degradation in accuracy for ResNet-18, MobileNet-V2, and GoogLeNet.
Posted Content

Simon Says: Evaluating and Mitigating Bias in Pruned Neural Networks with Knowledge Distillation

TL;DR: This paper proposes two metrics, Combined Error Variance (CEV) and Symmetric Distance Error (SDE), to quantitatively evaluate how well pruned models avoid induced bias, and demonstrates that knowledge distillation can mitigate induced bias in pruned neural networks, even with unbalanced datasets.
Journal ArticleDOI

Exploring TensorRT to Improve Real-Time Inference for Deep Learning

Kecheng Yang
TL;DR: TensorRT, as examined in this paper, improves inference efficiency metrics (inference time, inference throughput, and GPU memory utilization) without compromising inference accuracy, which is checked via inference output validation.
Proceedings ArticleDOI

Convolutional Neural Network Accelerator for Compression Based on Simon k-means

TL;DR: This paper proposes a clustering-based pre-processing algorithm named Simon k-means to quantize trained weights and speed up inference, as well as a new encoding method for the quantized weights that significantly reduces the model's storage size.
References
Posted Content

Rethinking the Inception Architecture for Computer Vision

TL;DR: This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy, and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
Proceedings Article

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

TL;DR: Deep Compression as mentioned in this paper proposes a three-stage pipeline: pruning, quantization, and Huffman coding to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
Posted Content

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

TL;DR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Proceedings Article

Learning both weights and connections for efficient neural networks

TL;DR: In this paper, the authors proposed a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy by learning only the important connections using a three-step method.
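As a rough companion to the three-step method summarized above (train the dense network, prune low-magnitude connections, retrain the survivors), here is a hedged PyTorch sketch; the layer sizes, the 90% sparsity level, and the stand-in training loop are illustrative assumptions, not details from the cited paper.

```python
import torch
import torch.nn as nn

def magnitude_mask(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Boolean mask keeping the largest-magnitude weights of `layer`."""
    flat = layer.weight.detach().abs().flatten()
    k = int(sparsity * flat.numel())          # number of weights to drop
    if k == 0:
        return torch.ones_like(layer.weight, dtype=torch.bool)
    threshold = flat.kthvalue(k).values       # k-th smallest magnitude
    return layer.weight.detach().abs() > threshold

# Step 1: ordinary dense training would happen here (omitted).
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# Step 2: prune the lowest-magnitude 90% of connections in each Linear layer.
masks = {m: magnitude_mask(m, sparsity=0.9)
         for m in model.modules() if isinstance(m, nn.Linear)}

# Step 3: retrain the surviving connections, re-applying the masks after
# every optimizer step so pruned weights stay at zero.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))   # stand-in batch
for _ in range(100):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()
    with torch.no_grad():
        for m, mask in masks.items():
            m.weight.mul_(mask.to(m.weight.dtype))
```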