Open Access Proceedings Article

Deep learning and the information bottleneck principle

TLDR
It is argued that the optimal architecture, i.e., the number of layers and the features/connections at each layer, is related to the bifurcation points of the information bottleneck tradeoff, namely the relevant compression of the input layer with respect to the output layer.
Abstract
Deep Neural Networks (DNNs) are analyzed via the theoretical framework of the information bottleneck (IB) principle. We first show that any DNN can be quantified by the mutual information between the layers and the input and output variables. Using this representation we can calculate the optimal information-theoretic limits of the DNN and obtain finite-sample generalization bounds. The advantage of getting closer to the theoretical limit is quantifiable both by the generalization bound and by the network's simplicity. We argue that the optimal architecture, i.e., the number of layers and the features/connections at each layer, is related to the bifurcation points of the information bottleneck tradeoff, namely the relevant compression of the input layer with respect to the output layer. The hierarchical representations of the layered network naturally correspond to the structural phase transitions along the information curve. We believe that this new insight can lead to new optimality bounds and deep learning algorithms.
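As a rough illustration of the quantities the abstract refers to (not the authors' own derivation), the sketch below estimates the information-plane coordinates I(X;T) and I(T;Y) for a single hidden layer with a simple plug-in (binning) estimator. The function names, the binning scheme, and the use of discrete sample ids for X are assumptions made here for illustration only.

import numpy as np

def mutual_information(x_ids, y_ids):
    # Plug-in (histogram) estimate of I(X;Y) in bits for two aligned
    # arrays of discrete symbol ids.
    joint = np.zeros((x_ids.max() + 1, y_ids.max() + 1))
    np.add.at(joint, (x_ids, y_ids), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def layer_plane_point(x_ids, labels, activations, n_bins=30):
    # Discretize a hidden layer T by binning its activations, then return
    # the information-plane coordinates (I(X;T), I(T;Y)).
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])              # (n_samples, n_units)
    _, t_ids = np.unique(binned, axis=0, return_inverse=True)   # one id per activation pattern
    return mutual_information(x_ids, t_ids), mutual_information(t_ids, labels)

Such estimates make the IB tradeoff concrete for a trained network; the binning resolution strongly affects the numbers, so they should be read as qualitative coordinates rather than exact limits.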


Citations
Posted Content

Opening the Black Box of Deep Neural Networks via Information

TL;DR: This work demonstrates the effectiveness of the Information-Plane visualization of DNNs and shows that the training time is dramatically reduced when adding more hidden layers, and the main advantage of the hidden layers is computational.
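The Information-Plane visualization referenced above is, in essence, a scatter of (I(X;T), I(T;Y)) per layer per training epoch. A minimal matplotlib sketch of such a plot follows, using placeholder data in place of real mutual-information estimates; the array shapes, names, and synthetic data are assumptions for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical precomputed coordinates: info[e, l] = (I(X;T_l), I(T_l;Y))
# for epoch e and layer l; random placeholder data stands in for real estimates.
n_epochs, n_layers = 50, 5
rng = np.random.default_rng(0)
info = np.cumsum(rng.random((n_epochs, n_layers, 2)) * 0.05, axis=0)

colors = plt.cm.viridis(np.linspace(0, 1, n_epochs))
for e in range(n_epochs):
    plt.plot(info[e, :, 0], info[e, :, 1], "-o", color=colors[e], markersize=3, alpha=0.6)
plt.xlabel("I(X;T) [bits]")
plt.ylabel("I(T;Y) [bits]")
plt.title("Information plane: one curve per epoch, one marker per layer")
plt.show()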
Posted Content

When Does Label Smoothing Help?

TL;DR: It is shown empirically that, in addition to improving generalization, label smoothing improves model calibration, which can significantly improve beam search, and that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective.
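For concreteness, label smoothing itself is a one-line transformation of the targets; a minimal NumPy sketch (names and the epsilon value chosen here for illustration) is:

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Mix hard one-hot targets with a uniform distribution over the K classes:
    # y_smooth = (1 - eps) * one_hot + eps / K.
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

targets = np.eye(3)[[0, 2, 1]]            # three one-hot examples over 3 classes
print(smooth_labels(targets))             # true class -> ~0.933, others -> ~0.033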
Proceedings Article

Mutual Information Neural Estimation.

TL;DR: A Mutual Information Neural Estimator (MINE) is presented that is linearly scalable in dimensionality as well as in sample size, trainable through back-prop, and strongly consistent, and applied to improve adversarially trained generative models.
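A minimal PyTorch sketch of the Donsker-Varadhan lower bound that MINE maximizes is given below; the critic architecture, the in-batch shuffle used to approximate the product of marginals, and all names are assumptions made here for illustration (the published method additionally corrects the gradient bias of the second term).

import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    # Small critic T_theta(x, y) used inside the Donsker-Varadhan bound.
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def mine_lower_bound(critic, x, y):
    # I(X;Y) >= E_p(x,y)[T(x,y)] - log E_p(x)p(y)[exp T(x,y')];
    # the product of marginals is approximated by shuffling y within the batch.
    joint_term = critic(x, y).mean()
    y_shuffled = y[torch.randperm(y.size(0))]
    marginal_term = torch.logsumexp(critic(x, y_shuffled), dim=0) - math.log(y.size(0))
    return joint_term - marginal_term

Maximizing this bound with respect to the critic parameters by ordinary backprop yields the mutual-information estimate.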
Journal Article

A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI

TL;DR: A review of the interpretability approaches suggested by different research works is provided and they are categorized, in the hope that insight into interpretability will grow with more consideration of medical practice, and that initiatives to push forward data-based, mathematically grounded, and technically grounded medical education are encouraged.
Posted Content

Deep Variational Information Bottleneck

TL;DR: It is shown that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.
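A minimal sketch of the VIB training objective, assuming a Gaussian encoder with outputs mu and logvar and a classifier head that maps a sampled z to logits (all names and the beta value are illustrative):

import torch
import torch.nn.functional as F

def sample_z(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps keeps gradients flowing
    # through the stochastic encoder p(z|x) = N(mu(x), diag(sigma(x)^2)).
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def vib_loss(logits, labels, mu, logvar, beta=1e-3):
    # Variational IB objective: prediction term E[-log q(y|z)] plus a
    # beta-weighted KL(N(mu, sigma^2) || N(0, I)) compression term.
    ce = F.cross_entropy(logits, labels)
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return ce + beta * kl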
References

The information bottleneck method

TL;DR: The variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.
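For reference, the self-consistent equations behind the IB method can be iterated directly for small discrete problems. The NumPy sketch below alternates the updates of p(t|x), p(t), and p(y|t) at a fixed tradeoff beta; the initialization, clipping constants, and function names are choices made here for illustration.

import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=200, seed=0):
    # Alternate the IB self-consistent updates for a discrete joint p(x, y).
    rng = np.random.default_rng(seed)
    n_x = p_xy.shape[0]
    p_x = p_xy.sum(axis=1)                                    # p(x)
    p_y_given_x = p_xy / p_x[:, None]                         # p(y|x)
    p_t_given_x = rng.random((n_x, n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                               # p(t)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x / p_t[:, None]
        pyx = np.maximum(p_y_given_x, 1e-12)
        pyt = np.maximum(p_y_given_t, 1e-12)
        kl = (pyx[:, None, :] * np.log(pyx[:, None, :] / pyt[None, :, :])).sum(axis=2)
        p_t_given_x = p_t[None, :] * np.exp(-beta * kl)       # p(t|x) proportional to p(t) exp(-beta * KL)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x, p_t, p_y_given_t

Sweeping beta from small to large traces out the information curve whose bifurcation points the main paper associates with layer structure.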
Journal Article

Deterministic annealing for clustering, compression, classification, regression, and related optimization problems

TL;DR: The deterministic annealing approach to clustering and its extensions have demonstrated substantial performance improvement over standard supervised and unsupervised learning methods in a variety of important applications, including compression, estimation, pattern recognition and classification, and statistical regression.
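A much-simplified sketch of the deterministic annealing idea (random center initialization instead of the paper's mass-constrained splitting schedule; all parameter values are illustrative): soft Gibbs assignments at temperature T, centroid re-estimation, and gradual cooling.

import numpy as np

def deterministic_annealing(points, n_centers, t_start=10.0, t_end=0.01, cooling=0.9):
    # Soft clustering at temperature T: association probabilities follow a Gibbs
    # distribution over squared distances, and assignments harden as T decreases.
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), n_centers, replace=False)].copy()
    temp = t_start
    while temp > t_end:
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, k)
        logits = -d2 / temp
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)                    # soft assignments
        centers = (p.T @ points) / p.sum(axis=0)[:, None]    # re-estimate centroids
        temp *= cooling
    return centers, p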
Journal Article

Successive refinement of information

TL;DR: It is shown that the necessary and sufficient condition for achieving optimal successive refinement is that the solutions of the rate-distortion problem can be written as a Markov chain, and that all finite-alphabet signals with Hamming distortion satisfy this requirement.
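For reference, the Markov-chain condition mentioned in that TL;DR can be restated as follows (a standard formulation, with D_2 <= D_1 so that \hat{X}_2 is the finer reproduction):

% A source X is successively refinable from distortion D_1 to D_2 (D_2 <= D_1)
% if and only if there exist rate-distortion-achieving reproductions
% \hat{X}_1 (coarse) and \hat{X}_2 (fine), i.e.
%   R(D_i) = I(X; \hat{X}_i), \qquad \mathbb{E}\, d(X, \hat{X}_i) \le D_i, \quad i = 1, 2,
% such that they form a Markov chain with the source:
\[
  \hat{X}_1 \;\leftrightarrow\; \hat{X}_2 \;\leftrightarrow\; X .
\]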
Journal Article

Statistical mechanics and phase transitions in clustering.

TL;DR: In this paper, a new approach to clustering based on statistical physics is presented, where the problem is formulated as fuzzy clustering and the association probability distribution is obtained by maximizing the entropy at a given average variance.