Author

Daniel A. Abolafia

Bio: Daniel A. Abolafia is an academic researcher at Google. The author has contributed to research on artificial neural networks and stochastic gradient descent, has an h-index of 6, and has co-authored 8 publications receiving 715 citations.

Papers

Proceedings Article•

Sensitivity and Generalization in Neural Networks: an Empirical Study

Roman Novak¹, Yasaman Bahri¹, Daniel A. Abolafia¹, Jeffrey Pennington¹, Jascha Sohl-Dickstein¹

Google¹

15 Feb 2018

TL;DR: In this article, the authors investigate the tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations, and demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.

Abstract: In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization (such as full-batch training or using random labels) correspond to lower robustness, while factors associated with good generalization (such as data augmentation and ReLU non-linearities) give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.

242 citations
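
The sensitivity metric named in the abstract above, the norm of the input-output Jacobian, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tiny fixed-weight ReLU network `toy_network`, the helper name `jacobian_frobenius_norm`, and the use of central finite differences (the paper's own experiments are not reproduced here and may compute the Jacobian differently, for example by automatic differentiation).

```python
import math


def jacobian_frobenius_norm(f, x, eps=1e-5):
    """Estimate the Frobenius norm of df/dx at x by central finite differences."""
    total = 0.0
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        # Column i of the Jacobian: change in every output per unit change in x[i].
        for a, b in zip(f(x_plus), f(x_minus)):
            total += ((a - b) / (2 * eps)) ** 2
    return math.sqrt(total)


def toy_network(x):
    """Hypothetical 2-2-2 ReLU network with fixed weights (illustration only)."""
    w1 = [[1.0, -2.0], [0.5, 1.0]]
    w2 = [[1.0, 1.0], [-1.0, 2.0]]
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


# At x = (1, 1) only the second hidden unit is active, so the Jacobian is
# [[0.5, 1.0], [1.0, 2.0]] and its Frobenius norm is sqrt(6.25) = 2.5.
sensitivity = jacobian_frobenius_norm(toy_network, [1.0, 1.0])
```

A larger value means the output moves more under small input perturbations at that point; the abstract relates lower values near the training data to better generalization.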

Proceedings Article•

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

Roman Novak¹, Lechao Xiao², Jaehoon Lee¹, Yasaman Bahri¹, Greg Yang³, Jiri Hron⁴, Daniel A. Abolafia¹, Jeffrey Pennington¹, Jascha Sohl-Dickstein¹

Google¹, University of Pennsylvania², Microsoft³, University of Cambridge⁴

01 Jan 2019

TL;DR: This work derives an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and introduces a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible.

Abstract: There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit - a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally, that while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.

241 citations

Posted Content•

Sensitivity and Generalization in Neural Networks: an Empirical Study

Roman Novak¹, Yasaman Bahri¹, Daniel A. Abolafia¹, Jeffrey Pennington¹, Jascha Sohl-Dickstein¹

Google¹

23 Feb 2018-arXiv: Machine Learning

TL;DR: It is found that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization.

Abstract: In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization (such as full-batch training or using random labels) correspond to lower robustness, while factors associated with good generalization (such as data augmentation and ReLU non-linearities) give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.

136 citations

Posted Content•

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

Roman Novak¹, Lechao Xiao², Jaehoon Lee¹, Yasaman Bahri¹, Greg Yang³, Jiri Hron⁴, Daniel A. Abolafia¹, Jeffrey Pennington¹, Jascha Sohl-Dickstein¹

Google¹, University of Pennsylvania², Microsoft³, University of Cambridge⁴

11 Oct 2018-arXiv: Machine Learning

TL;DR: This work derives an equivalence between multi-layer convolutional neural networks (CNNs), both with and without pooling layers, and Gaussian processes (GPs), analogous to the previously identified equivalence for wide fully connected neural networks (FCNs), and achieves state-of-the-art results on CIFAR10 for GPs without trainable kernels.

110 citations

Posted Content•

Neural Program Synthesis with Priority Queue Training

Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le

10 Jan 2018-arXiv: Artificial Intelligence

TL;DR: By adding a program length penalty to the reward function, this work is able to synthesize short, human readable programs in a simple but expressive Turing complete programming language called BF.

Abstract: We consider the task of program synthesis in the presence of a reward function over the output of programs, where the goal is to find programs with maximal rewards. We employ an iterative optimization scheme, where we train an RNN on a dataset of K best programs from a priority queue of the generated programs so far. Then, we synthesize new programs and add them to the priority queue by sampling from the RNN. We benchmark our algorithm, called priority queue training (or PQT), against genetic algorithm and reinforcement learning baselines on a simple but expressive Turing complete programming language called BF. Our experimental results show that our simple PQT algorithm significantly outperforms the baselines. By adding a program length penalty to the reward function, we are able to synthesize short, human readable programs.

45 citations
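
The iterative scheme described in the abstract above (sample programs from a model, keep the K best in a priority queue, retrain the model on the queue) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: a smoothed per-position token-frequency model stands in for the RNN, programs are fixed-length strings over a two-letter alphabet rather than BF code, and the reward simply counts matches with a hypothetical `TARGET` string. All names are invented for the example.

```python
import heapq
import random

ALPHABET = "ab"      # hypothetical token set (the paper uses BF tokens)
TARGET = "ababab"    # hypothetical task: reward counts matches with this string
LENGTH = len(TARGET)


def reward(program):
    """Toy reward: number of positions that agree with TARGET."""
    return sum(p == t for p, t in zip(program, TARGET))


def sample(probs, rng):
    """Draw one program, one token per position, from the current model."""
    return "".join(
        rng.choices(ALPHABET, weights=[probs[i][c] for c in ALPHABET])[0]
        for i in range(LENGTH)
    )


def fit(programs, smoothing=0.5):
    """Stand-in for RNN training: smoothed per-position token frequencies."""
    probs = []
    for i in range(LENGTH):
        counts = {c: smoothing for c in ALPHABET}
        for program in programs:
            counts[program[i]] += 1
        total = sum(counts.values())
        probs.append({c: counts[c] / total for c in ALPHABET})
    return probs


def priority_queue_training(iterations=50, batch=20, k=5, seed=0):
    rng = random.Random(seed)
    probs = [{c: 1 / len(ALPHABET) for c in ALPHABET} for _ in range(LENGTH)]
    queue = []    # min-heap of (reward, program); holds at most k unique programs
    seen = set()  # programs already scored, so duplicates are skipped
    for _ in range(iterations):
        for _ in range(batch):
            program = sample(probs, rng)
            if program in seen:
                continue
            seen.add(program)
            heapq.heappush(queue, (reward(program), program))
            if len(queue) > k:
                heapq.heappop(queue)  # evict the lowest-reward entry
        probs = fit([program for _, program in queue])  # retrain on the K best
    return max(queue), queue
```

Calling `priority_queue_training()` returns the best `(reward, program)` pair found together with the final queue. A length penalty of the kind mentioned in the abstract would be subtracted inside `reward`; it is omitted here because the toy programs all have the same length.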

Cited by

Book Chapter•DOI•

Convergence of probability measures

Richard F. Bass

01 Jan 2011

TL;DR: Weak-convergence methods in metric spaces were studied in this article, with applications sufficient to show their power and utility, and the results of the first three chapters are used in Chapter 4 to derive a variety of limit theorems for dependent sequences of random variables.

Abstract: The author's preface gives an outline: "This book is about weak-convergence methods in metric spaces, with applications sufficient to show their power and utility. The Introduction motivates the definitions and indicates how the theory will yield solutions to problems arising outside it. Chapter 1 sets out the basic general theorems, which are then specialized in Chapter 2 to the space C[0, 1] of continuous functions on the unit interval and in Chapter 3 to the space D[0, 1] of functions with discontinuities of the first kind. The results of the first three chapters are used in Chapter 4 to derive a variety of limit theorems for dependent sequences of random variables." The book develops and expands on Donsker's 1951 and 1952 papers on the invariance principle and empirical distributions. The basic random variables remain real-valued although, of course, measures on C[0, 1] and D[0, 1] are vitally used. Within this framework, there are various possibilities for a different and apparently better treatment of the material. More of the general theory of weak convergence of probabilities on separable metric spaces would be useful. Metrizability of the convergence is not brought up until late in the Appendix. The close relation of the Prokhorov metric and a metric for convergence in probability is (hence) not mentioned (see V. Strassen, Ann. Math. Statist. 36 (1965), 423-439; the reviewer, ibid. 39 (1968), 1563-1572). This relation would illuminate and organize such results as Theorems 4.1, 4.2 and 4.4 which give isolated, ad hoc connections between weak convergence of measures and nearness in probability. In the middle of p. 16, it should be noted that C*(S) consists of signed measures which need only be finitely additive if S is not compact. On p. 239, where the author twice speaks of separable subsets having nonmeasurable cardinal, he means "discrete" rather than "separable." 
Theorem 1.4 is Ulam's theorem that a Borel probability on a complete separable metric space is tight. Theorem 1 of Appendix 3 weakens completeness to topological completeness. After mentioning that probabilities on the rationals are tight, the author says it is an

3,554 citations

Journal Article•DOI•

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Jaehoon Lee¹, Lechao Xiao¹, Samuel S. Schoenholz¹, Yasaman Bahri¹, Roman Novak¹, Jascha Sohl-Dickstein¹, Jeffrey Pennington¹

Google¹

18 Feb 2019-arXiv: Machine Learning

TL;DR: In this article, the authors show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.

Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.

738 citations
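
The linear model mentioned in the abstract above is the first-order Taylor expansion of the network output in its parameters around their initial values. A minimal sketch, assuming a hypothetical two-parameter model `v * tanh(w * x)` with a hand-written gradient (neither comes from the paper), shows the construction and how closely it tracks the original model for a small parameter change:

```python
import math


def model(theta, x):
    """Hypothetical two-parameter network: v * tanh(w * x)."""
    w, v = theta
    return v * math.tanh(w * x)


def grad_theta(theta, x):
    """Analytic gradient of model(theta, x) with respect to (w, v)."""
    w, v = theta
    t = math.tanh(w * x)
    return [v * (1.0 - t * t) * x, t]


def linearized(theta, theta0, x):
    """First-order Taylor expansion of the model in theta around theta0."""
    g = grad_theta(theta0, x)
    return model(theta0, x) + sum(
        gi * (ti - t0i) for gi, ti, t0i in zip(g, theta, theta0)
    )


theta0 = [0.5, 1.0]                # "initial parameters"
theta = [0.5 + 1e-3, 1.0 - 1e-3]   # parameters after a small update
gap = abs(model(theta, 2.0) - linearized(theta, theta0, 2.0))
```

For this toy model the gap is second order in the size of the parameter change. The abstract's claim is about a different regime: as network width grows, the linearized model stays accurate over the whole course of training.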

Proceedings Article•

Similarity of Neural Network Representations Revisited

Simon Kornblith¹, Mohammad Norouzi¹, Honglak Lee², Geoffrey E. Hinton¹

Google¹, University of Michigan²

24 May 2019

TL;DR: In this article, the authors introduce a similarity index that measures the relationship between representational similarity matrices and that, unlike CCA, can measure meaningful similarities between representations of higher dimension than the number of data points. The index is equivalent to centered kernel alignment (CKA) and is closely connected to CCA.

Abstract: Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network representations based on canonical correlation analysis (CCA). We show that CCA belongs to a family of statistics for measuring multivariate similarity, but that neither CCA nor any other statistic that is invariant to invertible linear transformation can measure meaningful similarities between representations of higher dimension than the number of data points. We introduce a similarity index that measures the relationship between representational similarity matrices and does not suffer from this limitation. This similarity index is equivalent to centered kernel alignment (CKA) and is also closely connected to CCA. Unlike CCA, CKA can reliably identify correspondences between representations in networks trained from different initializations.

584 citations
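
The similarity index in the abstract above can be illustrated in its linear-kernel form, which for column-centered activation matrices X and Y (rows are examples) is ||YᵀX||_F² / (||XᵀX||_F · ||YᵀY||_F). The sketch below is a plain-Python illustration of that formula with hypothetical helper names; it covers only the linear variant and is not taken from the authors' code.

```python
import math


def center_columns(X):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of Y^T X, summed over all pairs of feature columns."""
    total = 0.0
    for y_col in zip(*Y):
        for x_col in zip(*X):
            total += sum(a * b for a, b in zip(y_col, x_col)) ** 2
    return total


def linear_cka(X, Y):
    """Linear CKA between two representations of the same examples (rows)."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator


X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Y is X rotated by 90 degrees and scaled by 3: same geometry, different basis.
Y = [[-3.0 * b, 3.0 * a] for a, b in X]
similarity = linear_cka(X, Y)
```

Because the index is invariant to orthogonal transformations and isotropic scaling, `similarity` equals 1 up to floating-point error, even though no coordinate of `Y` matches the corresponding coordinate of `X`.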

Posted Content•

On the Spectral Bias of Neural Networks

Nasim Rahaman¹, Aristide Baratin², Devansh Arpit³, Felix Draxler¹, Min Lin⁴, Fred A. Hamprecht¹, Yoshua Bengio², Aaron Courville²

Heidelberg University¹, Université de Montréal², Salesforce.com³, National University of Singapore⁴

22 Jun 2018-arXiv: Machine Learning

TL;DR: This work shows that deep ReLU networks are biased towards low frequency functions, and studies the robustness of the frequency components with respect to parameter perturbation, to develop the intuition that the parameters must be finely tuned to express high frequency functions.

Abstract: Neural networks are known to be a class of highly expressive functions able to fit even random input-output mappings with 100% accuracy. In this work, we present properties of neural networks that complement this aspect of expressivity. By using tools from Fourier analysis, we show that deep ReLU networks are biased towards low frequency functions, meaning that they cannot have local fluctuations without affecting their global behavior. Intuitively, this property is in line with the observation that over-parameterized networks find simple patterns that generalize across data samples. We also investigate how the shape of the data manifold affects expressivity by showing evidence that learning high frequencies gets easier with increasing manifold complexity, and present a theoretical understanding of this behavior. Finally, we study the robustness of the frequency components with respect to parameter perturbation, to develop the intuition that the parameters must be finely tuned to express high frequency functions.

486 citations

Posted Content•

Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting

Jun Shu¹, Qi Xie¹, Lixuan Yi¹, Qian Zhao¹, Sanping Zhou¹, Zongben Xu¹, Deyu Meng¹

Xi'an Jiaotong University¹

20 Feb 2019-arXiv: Learning

TL;DR: Synthetic and real experiments substantiate the capability of the method for achieving proper weighting functions in class imbalance and noisy label cases, fully complying with the common settings in traditional methods, and more complicated scenarios beyond conventional cases.

Abstract: Current deep neural networks (DNNs) can easily overfit to biased training data with corrupted labels or class imbalance. Sample re-weighting strategy is commonly used to alleviate this issue by designing a weighting function mapping from training loss to sample weight, and then iterating between weight recalculating and classifier updating. Current approaches, however, need manually pre-specify the weighting function as well as its additional hyper-parameters. It makes them fairly hard to be generally applied in practice due to the significant variation of proper weighting schemes relying on the investigated problem and training data. To address this issue, we propose a method capable of adaptively learning an explicit weighting function directly from data. The weighting function is an MLP with one hidden layer, constituting a universal approximator to almost any continuous functions, making the method able to fit a wide range of weighting functions including those assumed in conventional research. Guided by a small amount of unbiased meta-data, the parameters of the weighting function can be finely updated simultaneously with the learning process of the classifiers. Synthetic and real experiments substantiate the capability of our method for achieving proper weighting functions in class imbalance and noisy label cases, fully complying with the common settings in traditional methods, and more complicated scenarios beyond conventional cases. This naturally leads to its better accuracy than other state-of-the-art methods.

331 citations
