Deep learning and the information bottleneck principle
Citations
1,159 citations
Cites background from "Deep learning and the information bottleneck principle"
...Moreover, they suggested that optimized DNN layers should approach the Information Bottleneck (IB) bound [Tishby et al. (1999)] of the optimal achievable representations of the input X ....
[...]
...This leads to the Information Bottleneck (IB) tradeoff [Tishby et al. (1999)], which provides a computational framework for finding approximate minimal sufficient statistics, or the optimal tradeoff between compression of X and prediction of Y ....
[...]
..., exponential families), Tishby et al. (1999) relaxed this optimization problem by first allowing the map to be stochastic, defined as an encoder P(T|X), and then by allowing the map to capture as much as possible of I(X;Y), not necessarily all of it....
[...]
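For reference, the tradeoff quoted in the excerpts above is the IB variational problem of Tishby et al. (1999): the stochastic encoder P(T|X) is chosen to minimize

  I(X;T) − β I(T;Y),  subject to the Markov chain Y → X → T,

where the multiplier β ≥ 0 sets the exchange rate between compressing X (small I(X;T)) and preserving relevant information about Y (large I(T;Y)); sweeping β traces out the optimal compression-prediction curve.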
971 citations
Cites background from "Deep learning and the information bottleneck principle"
...This relates to the information bottleneck theory [23, 24, 25] that explains generalization in terms of compression....
[...]
820 citations
Cites background from "Deep learning and the information bottleneck principle"
...IB was recently covered in the context of deep learning (Tishby & Zaslavsky, 2015), and as such can be seen as a process to construct an approximation of the minimally sufficient statistics of the data....
[...]
810 citations
Cites methods from "Deep learning and the information bottleneck principle"
...[97], [98]....
[...]
...From a methods table: Optimization (CNN with separable model [141]); Others (information theoretic: Information Bottleneck [97], [98])...
[...]
757 citations
References
73,978 citations
45,034 citations
"Deep learning and the information b..." refers background in this paper
...If we denote by Ŷ the predicted variable, the DPI implies I(X;Y) ≥ I(Y;Ŷ), with equality if and only if Ŷ is a sufficient statistic....
[...]
...An immediate consequence of the DPI is that information about Y that is lost in one layer cannot be recovered in higher layers....
[...]
...The theoretical IB limit and the limitations imposed by the DPI on the flow of information between the layers give a general picture as to where each layer of a trained network can be on the information plane....
[...]
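Spelled out, if T_1, …, T_k denote the successive hidden-layer representations (notation chosen here for illustration), a feedforward network induces the Markov chain Y → X → T_1 → … → T_k → Ŷ, and the DPI gives the cascade

  I(X;Y) ≥ I(T_1;Y) ≥ … ≥ I(T_k;Y) ≥ I(Ŷ;Y),

which formalizes the two excerpts above: relevant information can only decrease from layer to layer, so each layer sits at a point on the information plane bounded by the IB curve.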
...We thus assume the Markov chain Y → X → X̂ and minimize the mutual information I(X;X̂) to obtain the simplest statistics (due to the data processing inequality (DPI) [5]), under a constraint on I(X̂;Y)....
[...]
...The information theoretic interpretation of minimal sufficient statistics [5] suggests a principled way of doing that: find a maximally compressed mapping of the input variable that preserves as much as possible the information on the output variable....
[...]
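The constrained problem quoted above, minimizing I(X;X̂) under a constraint on I(X̂;Y), is solved in Tishby et al. (1999) by alternating self-consistent updates of the encoder, the cluster marginal, and the decoder. The following is a minimal numpy sketch of those updates for discrete X and Y, not code from any of the papers; the function name information_bottleneck, the toy joint p_xy, the cluster count n_t, and the 1e-12 smoothing constants are illustrative assumptions.

import numpy as np

def information_bottleneck(p_xy, n_t, beta, n_iter=200, seed=0):
    # p_xy: joint distribution over (x, y), shape (n_x, n_y), entries summing to 1.
    # Returns the learned stochastic encoder p(t|x), shape (n_t, n_x).
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)                      # marginal p(x)
    p_y_x = p_xy / p_x[:, None]                 # conditional p(y|x)

    p_t_x = rng.random((n_t, n_x))              # random initial encoder p(t|x)
    p_t_x /= p_t_x.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        p_t = p_t_x @ p_x                                        # p(t) = sum_x p(t|x) p(x)
        p_y_t = (p_t_x * p_x) @ p_y_x / (p_t[:, None] + 1e-12)   # decoder p(y|t)
        # KL[p(y|x) || p(y|t)] for every pair (t, x)
        ratio = p_y_x[None, :, :] / (p_y_t[:, None, :] + 1e-12)
        kl = np.sum(p_y_x[None, :, :] * np.log(ratio + 1e-12), axis=2)
        # self-consistent encoder update: p(t|x) proportional to p(t) exp(-beta * KL)
        p_t_x = p_t[:, None] * np.exp(-beta * kl)
        p_t_x /= p_t_x.sum(axis=0, keepdims=True)
    return p_t_x

# Toy joint over four inputs and two labels; two clusters suffice at moderate beta.
p_xy = np.array([[0.24, 0.01], [0.20, 0.05], [0.05, 0.20], [0.01, 0.24]])
encoder = information_bottleneck(p_xy / p_xy.sum(), n_t=2, beta=10.0)

Larger beta pushes the encoder toward sufficiency (retaining I(X;Y)); smaller beta pushes it toward stronger compression of X.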
16,717 citations
11,201 citations
"Deep learning and the information b..." refers background in this paper
...Their performance currently surpasses that of most competing algorithms, and DL wins top machine learning competitions on real data challenges [1], [2], [3]....
[...]
7,767 citations
"Deep learning and the information b..." refers background in this paper
...While there are many different variants of DNNs [9], here we consider the rather general supervised learning setting of feedforward networks in which multiple hidden layers separate the input and output layers of the network (see figure 1)....
[...]