
Showing papers by "Jeffrey Pennington published in 2021"


Journal Article
TL;DR: A theory of early learning for models trained with softmax-cross-entropy loss is developed and it is shown that the learning dynamics depend crucially on the inverse temperature $\beta$ as well as the magnitude of the logits at initialization, $\|\beta \mathbf{z}\|_2$.
Abstract: The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse temperature β as well as the magnitude of the logits at initialization, ||βz||₂. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on β is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of β as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal β to be sensitive to the architecture, our results suggest that tuning β over the range 10⁻² to 10¹ improves performance over all architectures studied. We find that smaller β may lead to better peak performance at the cost of learning stability.

9 citations
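To make the role of the inverse temperature in the paper above concrete, here is a minimal JAX sketch of a temperature-scaled softmax-cross-entropy loss. The function and variable names are illustrative assumptions, not the authors' code, and the loop simply evaluates the loss and the initial logit magnitude ||βz||₂ over the β range the abstract mentions.

```python
# Minimal sketch (not the authors' code) of a temperature-scaled
# softmax-cross-entropy loss: the inverse temperature beta rescales
# the logits z before the softmax.
import jax
import jax.numpy as jnp


def softmax_xent(logits, labels, beta=1.0):
    """Cross-entropy of softmax(beta * logits) against integer labels."""
    log_probs = jax.nn.log_softmax(beta * logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(log_probs, labels[:, None], axis=-1))


# Toy example: loss and initial logit magnitude ||beta * z||_2 at a few betas.
key = jax.random.PRNGKey(0)
z = jax.random.normal(key, (8, 10))       # batch of 8 examples, 10 classes
y = jnp.zeros(8, dtype=jnp.int32)         # dummy integer labels
for beta in (1e-2, 1.0, 1e1):             # spans the range 10^-2 to 10^1 studied
    print(beta, softmax_xent(z, y, beta), jnp.linalg.norm(beta * z))
```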


Proceedings Article
Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek
03 May 2021
TL;DR: In this article, the function-space prior of an infinitely wide neural network is modeled as a Gaussian process, termed the neural network Gaussian process (NNGP), and a softmax link function is used for multi-class classification.
Abstract: Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the best calibrated models on OOD data. Inspired by this, we leverage recent theoretical advances that characterize the function-space prior of an infinitely-wide NN as a Gaussian process, termed the neural network Gaussian process (NNGP). We use the NNGP with a softmax link function to build a probabilistic model for multi-class classification and marginalize over the latent Gaussian outputs to sample from the posterior. This gives us a better understanding of the implicit prior NNs place on function space and allows a direct comparison of the calibration of the NNGP and its finite-width analogue. We also examine the calibration of previous approaches to classification with the NNGP, which treat classification problems as regression to the one-hot labels. In this case the Bayesian posterior is exact, and we compare several heuristics to generate a categorical distribution over classes. We find these methods are well calibrated under distributional shift. Finally, we consider an infinite-width final layer in conjunction with a pre-trained embedding. This replicates the important practical use case of transfer learning and allows scaling to significantly larger datasets. As well as achieving competitive predictive accuracy, this approach is better calibrated than its finite-width analogue.

3 citations
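The sketch below illustrates the "regression to one-hot labels" setting described in the abstract above, where the Bayesian posterior is exact, followed by one heuristic for producing a categorical distribution: sample latent outputs from the Gaussian posterior, push them through a softmax, and average. It is not the paper's implementation; an RBF kernel is assumed as a stand-in for the NNGP kernel, which in practice would come from a library such as Neural Tangents, and the data are random toy inputs.

```python
# Minimal sketch (illustrative, not the paper's implementation) of GP-based
# multi-class prediction: exact GP regression to one-hot labels, then a
# Monte Carlo average of softmax over posterior samples gives a categorical
# predictive distribution. An RBF kernel stands in for the NNGP kernel.
import jax
import jax.numpy as jnp


def rbf_kernel(x1, x2, lengthscale=1.0):
    """RBF kernel as a stand-in for an NNGP kernel (an assumption of this sketch)."""
    sq_dists = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq_dists / lengthscale**2)


def gp_posterior(k_tt, k_ts, k_ss, y_onehot, noise=1e-3):
    """Exact GP posterior for regression to one-hot labels.

    The posterior covariance is shared across classes because the prior is
    i.i.d. over output dimensions.
    """
    a = jnp.linalg.solve(k_tt + noise * jnp.eye(k_tt.shape[0]), k_ts)
    mean = a.T @ y_onehot                 # (n_test, n_classes)
    cov = k_ss - k_ts.T @ a               # (n_test, n_test)
    return mean, cov


def categorical_predictive(mean, cov, key, num_samples=256):
    """Sample latent outputs from the posterior, push through softmax, average."""
    n_test, n_classes = mean.shape
    chol = jnp.linalg.cholesky(cov + 1e-6 * jnp.eye(n_test))
    eps = jax.random.normal(key, (num_samples, n_test, n_classes))
    samples = mean[None] + jnp.einsum('ij,sjc->sic', chol, eps)
    return jnp.mean(jax.nn.softmax(samples, axis=-1), axis=0)


# Toy usage with random data: 20 training points, 3 classes, 4 test points.
x_train = jax.random.normal(jax.random.PRNGKey(0), (20, 5))
y_train = jax.nn.one_hot(jnp.arange(20) % 3, 3)
x_test = jax.random.normal(jax.random.PRNGKey(1), (4, 5))
mean, cov = gp_posterior(rbf_kernel(x_train, x_train),
                         rbf_kernel(x_train, x_test),
                         rbf_kernel(x_test, x_test), y_train)
probs = categorical_predictive(mean, cov, jax.random.PRNGKey(2))
print(probs)  # (4, 3) predictive class probabilities
```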


Posted Content
TL;DR: The authors examined the high-dimensional asymptotics of random feature regression under covariate shift and presented a precise characterization of the limiting test error, bias, and variance in this setting.
Abstract: A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same. Despite the prevalence of covariate shift in real-world applications, a theoretical understanding in the context of modern machine learning has remained lacking. In this work, we examine the exact high-dimensional asymptotics of random feature regression under covariate shift and present a precise characterization of the limiting test error, bias, and variance in this setting. Our results motivate a natural partial order over covariate shifts that provides a sufficient condition for determining when the shift will harm (or even help) test performance. We find that overparameterized models exhibit enhanced robustness to covariate shift, providing one of the first theoretical explanations for this intriguing phenomenon. Additionally, our analysis reveals an exact linear relationship between in-distribution and out-of-distribution generalization performance, offering an explanation for this surprising recent empirical observation.
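As a concrete illustration of the setting studied above, here is a small JAX sketch (not the paper's exact setup, and no asymptotic theory) of random feature ridge regression under covariate shift: the training and OOD test covariates have different covariances while the conditional label function is shared. The dimensions, teacher, and anisotropic shift are arbitrary choices made for illustration.

```python
# Minimal sketch of random feature ridge regression evaluated in- and
# out-of-distribution, where only the covariate distribution shifts.
import jax
import jax.numpy as jnp


def random_features(x, w):
    """ReLU random features phi(x) = relu(x W^T) / sqrt(num_features)."""
    return jax.nn.relu(x @ w.T) / jnp.sqrt(w.shape[0])


key = jax.random.PRNGKey(0)
k_train, k_id, k_ood, k_w, k_beta = jax.random.split(key, 5)
d, n_train, n_test, n_features = 50, 200, 500, 400

# A linear teacher shared by train and test, so P(y|x) is unchanged
# (covariate shift only).
beta_star = jax.random.normal(k_beta, (d,)) / jnp.sqrt(d)

# Train and in-distribution test covariates ~ N(0, I); OOD covariates get an
# anisotropic covariance as a simple stand-in for a covariate shift.
shift_scale = jnp.sqrt(jnp.linspace(0.5, 2.0, d))
x_train = jax.random.normal(k_train, (n_train, d))
x_id = jax.random.normal(k_id, (n_test, d))
x_ood = jax.random.normal(k_ood, (n_test, d)) * shift_scale
y_train, y_id, y_ood = x_train @ beta_star, x_id @ beta_star, x_ood @ beta_star

# Random feature ridge regression fit on the training distribution.
w = jax.random.normal(k_w, (n_features, d)) / jnp.sqrt(d)
phi_train = random_features(x_train, w)
ridge = 1e-3
coef = jnp.linalg.solve(phi_train.T @ phi_train + ridge * jnp.eye(n_features),
                        phi_train.T @ y_train)

mse = lambda x, y: jnp.mean((random_features(x, w) @ coef - y) ** 2)
print('in-distribution test error:', mse(x_id, y_id))
print('out-of-distribution test error:', mse(x_ood, y_ood))
```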