Author

Jeffrey Pennington

Bio: Jeffrey Pennington is an academic researcher from Google. The author has contributed to research in topics: Artificial neural network & Deep learning. The author has an h-index of 32 and has co-authored 75 publications receiving 28,787 citations. Previous affiliations of Jeffrey Pennington include the University of Southern California and Princeton University.


Papers
Posted Content
TL;DR: This work identifies large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize, and finds that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance.
Abstract: A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. These theoretical results are corroborated experimentally on CIFAR10 for a variety of network architectures and we include a colab notebook that reproduces the essential results of the paper.
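
The NTK spectral analysis described above can be sketched with the open-source neural-tangents library; the snippet below is an illustrative assumption (the architecture, initialization scales, and batch size are placeholders), not the paper's colab notebook.

# Minimal sketch (not the paper's colab): compute the infinite-width NTK of a
# small fully connected network and inspect its spectrum. Assumes the
# neural-tangents and JAX packages are installed; all hyperparameters are
# illustrative placeholders.
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# Infinitely wide 3-layer FCN; W_std and b_std play the role of the
# initialization hyperparameters whose choice governs trainability.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(),
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(),
    stax.Dense(1, W_std=1.5, b_std=0.05),
)

key = random.PRNGKey(0)
x = random.normal(key, (128, 32))          # stand-in for a batch of inputs

ntk = kernel_fn(x, x, 'ntk')               # analytic NTK on this batch
eigvals = jnp.linalg.eigvalsh(ntk)

# The conditioning of the NTK is one diagnostic of trainability discussed in
# the text: a tiny smallest eigenvalue relative to the largest signals slow
# or failed training of the corresponding modes.
print('largest eigenvalue :', float(eigvals[-1]))
print('smallest eigenvalue:', float(eigvals[0]))
print('condition number   :', float(eigvals[-1] / eigvals[0]))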

30 citations

Journal ArticleDOI
TL;DR: In this paper, it was shown that the natural functions for describing the multi-Regge limit of six-gluon scattering in planar N=4 super Yang-Mills theory are the single-valued harmonic polylogarithmic functions introduced by Brown.
Abstract: We argue that the natural functions for describing the multi-Regge limit of six-gluon scattering in planar N=4 super Yang-Mills theory are the single-valued harmonic polylogarithmic functions introduced by Brown. These functions depend on a single complex variable and its conjugate, (w,w*). Using these functions, and formulas due to Fadin, Lipatov and Prygarin, we determine the six-gluon MHV remainder function in the leading-logarithmic approximation (LLA) in this limit through ten loops, and the next-to-LLA (NLLA) terms through nine loops. In separate work, we have determined the symbol of the four-loop remainder function for general kinematics, up to 113 constants. Taking its multi-Regge limit and matching to our four-loop LLA and NLLA results, we fix all but one of the constants that survive in this limit. The multi-Regge limit factorizes in the variables (ν,n), which are related to (w,w*) by a Fourier-Mellin transform. We can transform the single-valued harmonic polylogarithms to functions of (ν,n) that incorporate harmonic sums, systematically through transcendental weight six. Combining this information with the four-loop results, we determine the eigenvalues of the BFKL kernel in the adjoint representation to NNLLA accuracy, and the MHV product of impact factors to NNNLLA accuracy, up to constants representing beyond-the-symbol terms and the one symbol-level constant. Remarkably, only derivatives of the polygamma function enter these results. Finally, the LLA approximation to the six-gluon NMHV amplitude is evaluated through ten loops.
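
For orientation, a standard textbook example of the single-valued functions invoked here (not taken from the paper itself): the simplest functions of (w,w*) with no branch cuts in the w-plane are

\[
\mathcal{L}_0(w) = \log|w|^2 = \log w + \log \bar{w},
\]

and, at transcendental weight two, the Bloch-Wigner dilogarithm

\[
D(w) = \operatorname{Im}\,\mathrm{Li}_2(w) + \arg(1-w)\,\log|w|,
\]

whose discontinuities at w=0 and w=1 cancel between the two terms. Brown's single-valued harmonic polylogarithms extend this construction to arbitrary weight, which is what makes them natural for the multi-Regge limit, where the remainder function must be single-valued in w.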

29 citations

Journal ArticleDOI
TL;DR: The three-loop remainder function, which describes the scattering of six gluons in the maximally-helicity-violating configuration in planar N=4 super-Yang-Mills theory, is presented as a function of the three dual conformal cross ratios.
Abstract: We present the three-loop remainder function, which describes the scattering of six gluons in the maximally-helicity-violating configuration in planar N=4 super-Yang-Mills theory, as a function of the three dual conformal cross ratios. The result can be expressed in terms of multiple Goncharov polylogarithms. We also employ a more restricted class of "hexagon functions" which have the correct branch cuts and certain other restrictions on their symbols. We classify all the hexagon functions through transcendental weight five, using the coproduct for their Hopf algebra iteratively, which amounts to a set of first-order differential equations. The three-loop remainder function is a particular weight-six hexagon function, whose symbol was determined previously. The differential equations can be integrated numerically for generic values of the cross ratios, or analytically in certain kinematic limits, including the near-collinear and multi-Regge limits. These limits allow us to impose constraints from the operator product expansion and multi-Regge factorization directly at the function level, and thereby to fix uniquely a set of Riemann-zeta-valued constants that could not be fixed at the level of the symbol. The near-collinear limits agree precisely with recent predictions by Basso, Sever and Vieira based on integrability. The multi-Regge limits agree with the factorization formula of Fadin and Lipatov, and determine three constants entering the impact factor at this order. We plot the three-loop remainder function for various slices of the Euclidean region of positive cross ratios, and compare it to the two-loop one. For large ranges of the cross ratios, the ratio of the three-loop to the two-loop remainder function is relatively constant, and close to -7.
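
Schematically (stated here as general background on such iterated integrals rather than quoted from the paper), the first-order differential equations arise from the {n-1,1} component of the coproduct: a weight-n hexagon function F obeys

\[
dF \;=\; \sum_{s\,\in\,\mathcal{S}} F^{s}\, d\log s,
\]

where the sum runs over the symbol alphabet \mathcal{S}, taken for hexagon functions to be the nine letters {u, v, w, 1-u, 1-v, 1-w, y_u, y_v, y_w}, and each F^{s} is a weight-(n-1) hexagon function. Integrating these equations numerically, starting from known lower-weight functions, is what allows the remainder function to be evaluated at generic values of the cross ratios.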

29 citations

Posted Content
TL;DR: This work combines several different techniques for randomized estimation and shows that it is possible to construct unbiased estimators to answer a broad class of questions about the spectra of such implicit matrices, even in the presence of noise.
Abstract: Many important problems are characterized by the eigenvalues of a large matrix. For example, the difficulty of many optimization problems, such as those arising from the fitting of large models in statistics and machine learning, can be investigated via the spectrum of the Hessian of the empirical loss function. Network data can be understood via the eigenstructure of a graph Laplacian matrix using spectral graph theory. Quantum simulations and other many-body problems are often characterized via the eigenvalues of the solution space, as are various dynamical systems. However, naive eigenvalue estimation is computationally expensive even when the matrix can be represented; in many of these situations the matrix is so large as to only be available implicitly via products with vectors. Even worse, one may only have noisy estimates of such matrix-vector products. In this work, we combine several different techniques for randomized estimation and show that it is possible to construct unbiased estimators to answer a broad class of questions about the spectra of such implicit matrices, even in the presence of noise. We validate these methods on large-scale problems in which graph theory and random matrix theory provide ground truth.
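
As an illustration of the matrix-free setting (a generic Hutchinson-style sketch with hypothetical names, not the paper's specific unbiased estimators for the noisy case), the spectral moments tr(A^k)/n can be estimated using only matrix-vector products:

# Generic sketch of randomized moment estimation from matrix-vector products
# only (Hutchinson-style); this illustrates the setting of the abstract, not
# the paper's estimators for the noisy case.
import numpy as np

def estimate_spectral_moments(matvec, n, num_moments=5, num_probes=20, rng=None):
    """Estimate tr(A^k)/n for k = 1..num_moments, where A is an n x n symmetric
    matrix available only through matvec(v) = A @ v."""
    rng = np.random.default_rng(rng)
    moments = np.zeros(num_moments)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        v = z
        for k in range(num_moments):
            v = matvec(v)                     # v = A^{k+1} z
            moments[k] += z @ v               # unbiased estimate of tr(A^{k+1})
    return moments / (num_probes * n)

# Example with an explicit matrix standing in for an implicit one.
n = 500
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(n, n)))
spectrum = np.linspace(0.1, 2.0, n)
A = Q @ np.diag(spectrum) @ Q.T               # known spectrum for checking
est = estimate_spectral_moments(lambda v: A @ v, n, rng=1)
true = [np.mean(spectrum ** k) for k in range(1, 6)]
print('estimated moments:', np.round(est, 3))
print('true moments     :', np.round(true, 3))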

28 citations

Journal ArticleDOI
TL;DR: The four-loop remainder function for six-gluon scattering with maximal helicity violation in planar N=4 super-Yang-Mills theory is presented in this paper as an analytic function of three dual-conformal cross ratios.
Abstract: We present the four-loop remainder function for six-gluon scattering with maximal helicity violation in planar N=4 super-Yang-Mills theory, as an analytic function of three dual-conformal cross ratios. The function is constructed entirely from its analytic properties, without ever inspecting any multi-loop integrand. We employ the same approach used at three loops, writing an ansatz in terms of hexagon functions, and fixing coefficients in the ansatz using the multi-Regge limit and the operator product expansion in the near-collinear limit. We express the result in terms of multiple polylogarithms, and in terms of the coproduct for the associated Hopf algebra. From the remainder function, we extract the BFKL eigenvalue at next-to-next-to-leading logarithmic accuracy (NNLLA), and the impact factor at NNNLLA. We plot the remainder function along various lines and on one surface, studying ratios of successive loop orders. As seen previously through three loops, these ratios are surprisingly constant over large regions in the space of cross ratios, and they are not far from the value expected at asymptotically large orders of perturbation theory.
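
In schematic form (our paraphrase of the bootstrap strategy summarized above, not an equation quoted from the paper), the construction writes

\[
R_6^{(4)}(u,v,w) \;=\; \sum_i c_i\, h_i(u,v,w),
\]

where the h_i form a basis of weight-eight hexagon functions (the weight is 2L at L loops) and the rational coefficients c_i are fixed by matching the near-collinear operator product expansion and the multi-Regge limit, with no multi-loop integrand ever evaluated.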

28 citations


Cited by
Book
18 Nov 2016
TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Posted Content
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
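
The "one additional output layer" recipe can be sketched with the Hugging Face transformers library (an assumption for illustration; the authors' original release is a TensorFlow codebase, and the dataset, labels, and hyperparameters below are placeholders):

# Minimal sketch of fine-tuning a pre-trained BERT encoder with a single
# classification head on top; data and hyperparameters are placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# BertForSequenceClassification adds one classification layer on top of the
# pre-trained bidirectional encoder; all other weights are reused.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

batch = tokenizer(['a premise sentence', 'another example'],
                  padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # cross-entropy loss from the new head
outputs.loss.backward()
optimizer.step()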

29,480 citations

Proceedings ArticleDOI
11 Oct 2018
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

24,672 citations

Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations

Proceedings Article
28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
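
To make the few-shot protocol concrete (a toy illustration; the demonstrations below are invented, and GPT-3 itself is accessed through an API rather than this stub), the task specification and examples are supplied purely as text in the prompt, with no gradient updates:

# Few-shot prompting: the "training" signal is the text of the prompt itself.
demonstrations = [
    ("Unscramble the word: 'pplae'", "apple"),
    ("Unscramble the word: 'nanaab'", "banana"),
]
query = "Unscramble the word: 'rgaep'"

prompt = ""
for question, answer in demonstrations:
    prompt += f"Q: {question}\nA: {answer}\n\n"
prompt += f"Q: {query}\nA:"

print(prompt)  # the model is asked to continue this text with "grape"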

10,132 citations