Top 15 papers published by Jeffrey Pennington from Google in 2020

Journal Article•DOI•

[...]

Yasaman Bahri¹, Jonathan Kadmon², Jeffrey Pennington¹, Samuel S. Schoenholz¹, Jascha Sohl-Dickstein¹, Surya Ganguli², Surya Ganguli¹ - Show less +3 more•Institutions (2)

Google¹, Stanford University²

16 Mar 2020-Annual Review of Condensed Matter Physics

TL;DR: The recent striking success of deep neural networks in machine learning raises profound questions about the theoretical principles underlying their success.

...read moreread less

Abstract: The recent striking success of deep neural networks in machine learning raises profound questions about the theoretical principles underlying their success. For example, what can such deep networks...

...read moreread less

175 citations

Proceedings Article•

Finite Versus Infinite Neural Networks: an Empirical Study

[...]

Jaehoon Lee¹, Samuel S. Schoenholz¹, Jeffrey Pennington¹, Ben Adlam¹, Lechao Xiao¹, Roman Novak¹, Jascha Sohl-Dickstein¹ - Show less +3 more•Institutions (1)

Google¹

31 Jul 2020

TL;DR: Improved best practices for using NNGP and NT kernels for prediction are developed, including a novel ensembling technique that achieves state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class the authors consider.

...read moreread less

Abstract: We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

...read moreread less

125 citations

Proceedings Article•

Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition

[...]

Ben Adlam¹, Jeffrey Pennington¹•Institutions (1)

Google¹

01 Jan 2020

TL;DR: This work describes an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels, and compute the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyzes the strikingly rich phenomenology that arises.

...read moreread less

Abstract: Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning models that simultaneously attain low bias and variance in the heavily overparameterized regime. A primary obstacle in explaining this behavior is that deep learning algorithms typically involve multiple sources of randomness whose individual contributions are not visible in the total variance. To enable fine-grained analysis, we describe an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels. Moreover, we compute the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyze the strikingly rich phenomenology that arises. We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior and can diverge at the interpolation boundary, even in the absence of label noise. The divergence is caused by the \emph{interaction} between sampling and initialization and can therefore be eliminated by marginalizing over samples (i.e. bagging) \emph{or} over the initial parameters (i.e. ensemble learning).

...read moreread less

51 citations

Posted Content•

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

[...]

Ben Adlam¹, Jeffrey Pennington¹•Institutions (1)

Google¹

15 Aug 2020-arXiv: Machine Learning

TL;DR: This work provides a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent.

...read moreread less

Abstract: Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a \emph{double descent} curve, in which increasing a model's capacity causes its test error to first decrease, then increase to a maximum near the interpolation threshold, and then decrease again in the overparameterized regime. Recent efforts to explain this phenomenon theoretically have focused on simple settings, such as linear regression or kernel regression with unstructured random features, which we argue are too coarse to reveal important nuances of actual neural networks. We provide a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent. Our results reveal that the test error has non-monotonic behavior deep in the overparameterized regime and can even exhibit additional peaks and descents when the number of parameters scales quadratically with the dataset size.

...read moreread less

50 citations

Journal Article•DOI•

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

[...]

Jaehoon Lee¹, Lechao Xiao¹, Samuel S. Schoenholz¹, Yasaman Bahri¹, Roman Novak¹, Jascha Sohl-Dickstein¹, Jeffrey Pennington¹ - Show less +3 more•Institutions (1)

Google¹

21 Dec 2020

TL;DR: This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.

...read moreread less

Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.

...read moreread less

49 citations

Posted Content•

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

[...]

Wei Hu¹, Lechao Xiao², Jeffrey Pennington²•Institutions (2)

Princeton University¹, Google²

16 Jan 2020-arXiv: Learning

TL;DR: The results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.

...read moreread less

Abstract: The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.

...read moreread less

46 citations

Proceedings Article•

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

[...]

Wei Hu¹, Lechao Xiao², Jeffrey Pennington²•Institutions (2)

Princeton University¹, Google²

30 Apr 2020

TL;DR: In this paper, the authors show that for deep linear networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth of the network.

...read moreread less

Abstract: The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.

...read moreread less

41 citations

Proceedings Article•

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

[...]

Ben Adlam¹, Jeffrey Pennington¹•Institutions (1)

Google¹

12 Jul 2020

TL;DR: In this article, the authors provide a high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent.

...read moreread less

Abstract: Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a \emph{double descent} curve, in which increasing a model's capacity causes its test error to first decrease, then increase to a maximum near the interpolation threshold, and then decrease again in the overparameterized regime. Recent efforts to explain this phenomenon theoretically have focused on simple settings, such as linear regression or kernel regression with unstructured random features, which we argue are too coarse to reveal important nuances of actual neural networks. We provide a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent. Our results reveal that the test error has non-monotonic behavior deep in the overparameterized regime and can even exhibit additional peaks and descents when the number of parameters scales quadratically with the dataset size.

...read moreread less

36 citations

Posted Content•

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

[...]

Wei Hu¹, Lechao Xiao², Ben Adlam², Jeffrey Pennington²•Institutions (2)

Princeton University¹, Google²

25 Jun 2020-arXiv: Learning

TL;DR: It is formally proved that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs.

...read moreread less

Abstract: Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes In this work, we show that these common perceptions can be completely false in the early phase of learning In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training

...read moreread less

32 citations

Proceedings Article•

Disentangling Trainability and Generalization in Deep Neural Networks

[...]

Lechao Xiao¹, Jeffrey Pennington¹, Samuel S. Schoenholz¹•Institutions (1)

Google¹

12 Jul 2020

TL;DR: In this article, the authors provide necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs).

...read moreread less

Abstract: A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. These theoretical results are corroborated experimentally on CIFAR10 for a variety of network architectures and we include a colab notebook that reproduces the essential results of the paper.

...read moreread less

21 citations

Posted Content•

Finite Versus Infinite Neural Networks: an Empirical Study

[...]

Jaehoon Lee¹, Samuel S. Schoenholz¹, Jeffrey Pennington¹, Ben Adlam¹, Lechao Xiao¹, Roman Novak¹, Jascha Sohl-Dickstein¹ - Show less +3 more•Institutions (1)

Google¹

31 Jul 2020-arXiv: Learning

TL;DR: In this article, the authors perform a large-scale empirical study of the correspondence between wide neural networks and kernel methods and show that kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks, and neural network Gaussian process kernels frequently outperform neural tangent (NT) kernels.

...read moreread less

Abstract: We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

...read moreread less

Posted Content•

Exact posterior distributions of wide Bayesian neural networks

[...]

Jiri Hron¹, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein - Show less +1 more•Institutions (1)

University of Cambridge¹

18 Jun 2020-arXiv: Machine Learning

TL;DR: This work provides the missing theoretical proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior and shows how to generate exact samples from a finite BNN on a small dataset via rejection sampling.

...read moreread less

Abstract: Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large However, many BNN applications are concerned with the BNN function space posterior While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying exactness of BNN posterior approximations We provide the missing theoretical proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior For empirical validation, we show how to generate exact samples from a finite BNN on a small dataset via rejection sampling

...read moreread less

Posted Content•

Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition

[...]

Ben Adlam¹, Jeffrey Pennington¹•Institutions (1)

Google¹

04 Nov 2020-arXiv: Machine Learning

TL;DR: In this article, an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels is presented, and a high-dimensional asymptotic behavior of this decomposition for random feature kernel regression is computed.

...read moreread less

Abstract: Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning models that simultaneously attain low bias and variance in the heavily overparameterized regime. A primary obstacle in explaining this behavior is that deep learning algorithms typically involve multiple sources of randomness whose individual contributions are not visible in the total variance. To enable fine-grained analysis, we describe an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels. Moreover, we compute the high-dimensional asymptotic behavior of this decomposition for random feature kernel regression, and analyze the strikingly rich phenomenology that arises. We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior and can diverge at the interpolation boundary, even in the absence of label noise. The divergence is caused by the \emph{interaction} between sampling and initialization and can therefore be eliminated by marginalizing over samples (i.e. bagging) \emph{or} over the initial parameters (i.e. ensemble learning).

...read moreread less

Proceedings Article•

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

[...]

Wei Hu¹, Lechao Xiao², Ben Adlam², Jeffrey Pennington²•Institutions (2)

Princeton University¹, Google²

25 Jun 2020

TL;DR: In this paper, the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel is bound, and it is shown that the early-time learning dynamics of a two-layer fully connected neural network can be mimicked by training a simple linear model on the inputs.

...read moreread less

Abstract: Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically. Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training.

...read moreread less

Posted Content•

Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit

[...]

Ben Adlam¹, Jaehoon Lee¹, Lechao Xiao¹, Jeffrey Pennington¹, Jasper Snoek¹ - Show less +1 more•Institutions (1)

Google¹

14 Oct 2020-arXiv: Machine Learning

TL;DR: This work uses the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior, leveraging recent theoretical advances that characterize the function-space prior of an ensemble of infinitely-wide NNs as a Gaussian process.

...read moreread less

Abstract: Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an ensemble of infinitely-wide NNs as a Gaussian process, termed the neural network Gaussian process (NNGP). We use the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior. This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue. We also examine the calibration of previous approaches to classification with the NNGP, which treat classification problems as regression to the one-hot labels. In this case the Bayesian posterior is exact, and we compare several heuristics to generate a categorical distribution over classes. We find these methods are well calibrated under distributional shift. Finally, we consider an infinite-width final layer in conjunction with a pre-trained embedding. This replicates the important practical use case of transfer learning and allows scaling to significantly larger datasets. As well as achieving competitive predictive accuracy, this approach is better calibrated than its finite width analogue.

...read moreread less

Showing papers by "Jeffrey Pennington published in 2020"