Author

Jeffrey Pennington

Bio: Jeffrey Pennington is an academic researcher from Google. The author has contributed to research in topics: Artificial neural network & Deep learning. The author has an h-index of 32 and has co-authored 75 publications receiving 28,787 citations. Previous affiliations of Jeffrey Pennington include University of Southern California & Princeton University.


Papers
Journal Article
TL;DR: In this article, the authors study the properties of the string equations and their physical solutions in the (2,4k) model and show that the localized D-branes of the minimal string theories are directly related to the solitons of the KdV hierarchy.
Abstract: We study the Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, focusing on the fully non-perturbative string equations which define the partition function of the model. The equations admit a parameter, Gamma, which in the spacetime interpretation controls the number of background D-branes, or R-R flux units, depending upon which weak coupling regime is taken. We study the properties of the string equations (often focusing on the (2,4) model in particular) and their physical solutions. The solutions are the potential for an associated Schrödinger problem whose wavefunction is that of an extended D-brane probe. We perform a numerical study of the spectrum of this system for varying Gamma and establish that when Gamma is a positive integer the equations' solutions have special properties consistent with the spacetime interpretation. We also show that a natural solution-generating transformation (that changes Gamma by an integer) is the Bäcklund transformation of the KdV hierarchy specialized to (scale invariant) solitons at zero velocity. Our results suggest that the localized D-branes of the minimal string theories are directly related to the solitons of the KdV hierarchy. Further, we observe an interesting transition when Gamma=-1.
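For orientation, the displays below sketch the general shape of the objects referred to: a non-perturbative string equation of Dalley–Johnson–Morris–Watterstam type for the potential u(z), and the associated Schrödinger problem whose wavefunction describes the extended D-brane probe. The precise normalizations and conventions are assumptions based on standard usage in this literature, not quoted from the paper:

\[ u\,\mathcal{R}^{2} \;-\; \frac{\nu^{2}}{2}\,\mathcal{R}\,\mathcal{R}'' \;+\; \frac{\nu^{2}}{4}\,\left(\mathcal{R}'\right)^{2} \;=\; \nu^{2}\,\Gamma^{2}, \qquad \mathcal{R}[u] \equiv \sum_{k\geq 1} t_{k}\,R_{k}[u], \]

\[ \left(-\nu^{2}\,\partial_{z}^{2} \;+\; u(z)\right)\psi(z) \;=\; \lambda\,\psi(z), \]

where the $R_{k}[u]$ are the Gelfand–Dikii (KdV) polynomials, $\nu$ plays the role of the string coupling, and $\Gamma$ is the parameter counting background D-branes or R-R flux units.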

14 citations

Journal Article
TL;DR: In this paper, Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, with background ZZ D-branes or R-R fluxes, is formulated non-perturbatively.
Abstract: Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, with background ZZ D-branes or R-R fluxes, can be formulated non-perturbatively. The branes and fluxes have a description as threshold bound states in an associated one-dimensional quantum mechanics which has a supersymmetric structure, familiar from studies of the generalized KdV system. The relevant bound-state wavefunctions in this problem have unusual asymptotics (they are not normalizable in general, and break supersymmetry) which are consistent with the underlying description in terms of open and closed string sectors. The overall organization of the physics is very pleasing: the physics of the closed strings in the background of branes or fluxes is captured by the generalized KdV system and non-perturbative string equations obtained by reduction of that system (the hierarchy of equations found by Dalley, Johnson, Morris and Watterstam). Meanwhile, the bound-state wavefunctions, which describe the physics of the ZZ D-brane (or flux) background in interaction with probe FZZT D-branes, are captured by the generalized mKdV system, and non-perturbative string equations obtained by reduction of that system (the Painlevé II hierarchy found by Periwal and Shevitz in this context).
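For reference, the first member of the Painlevé II hierarchy mentioned here is the classical Painlevé II equation, shown below in its conventional normalization (with constant parameter $\alpha$); this is included as a standard point of comparison rather than as the specific string equation derived in the paper:

\[ w''(z) \;=\; 2\,w(z)^{3} \;+\; z\,w(z) \;+\; \alpha. \]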

10 citations

Journal Article
TL;DR: In this paper, a generating function for the coefficients of the leading logarithmic BFKL Green's function in transverse-momentum space, order by order in alpha_s, in terms of single-valued harmonic polylogarithms was introduced.
Abstract: We introduce a generating function for the coefficients of the leading logarithmic BFKL Green's function in transverse-momentum space, order by order in alpha_s, in terms of single-valued harmonic polylogarithms. As an application, we exhibit fully analytic azimuthal-angle and transverse-momentum distributions for Mueller-Navelet jet cross sections at each order in alpha_s. We also provide a generating function for the total cross section valid to any number of loops.
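As background for the order-by-order expansion described here, the leading-logarithmic BFKL Green's function is conventionally built from the eigenvalue function of the LL BFKL kernel, shown below in its textbook form; this is included as an assumed point of reference, not as the paper's generating function:

\[ \chi(n,\nu) \;=\; 2\,\psi(1) \;-\; \psi\!\left(\frac{1+|n|}{2} + i\nu\right) \;-\; \psi\!\left(\frac{1+|n|}{2} - i\nu\right), \]

with the Green's function obtained by summing over conformal spins $n$ and integrating over $\nu$ with weight $e^{\bar{\alpha}_{s}\,\chi(n,\nu)\,Y}$, where $Y$ is the rapidity separation; expanding that exponential in $\bar{\alpha}_{s}$ yields the coefficients whose generating function the paper constructs.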

9 citations

14 May 2022
TL;DR: By analyzing homogenized SGD, the authors provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of the solution of a Volterra integral equation, as well as the exact value of the limiting excess risk in the case of quadratic losses trained by SGD.
Abstract: We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalence of SGD -- for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD when the number of samples $n$ and number of features $d$ are polynomially related ($d^{c} < n < d^{1/c}$ for some $c > 0$). By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation. Further we provide the exact value of the limiting excess risk in the case of quadratic losses when trained by SGD. The analysis is formulated for data matrices and target vectors that satisfy a family of resolvent conditions, which can roughly be viewed as a weak (non-quantitative) form of delocalization of sample-side singular vectors of the data. Several motivating applications are provided including sample covariance matrices with independent samples and random features with non-generative model targets.
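The following is a minimal, self-contained sketch of the kind of setup analyzed above: plain SGD on a random least squares problem with $\ell^2$-regularization, tracking a quadratic risk statistic along the iterates. It is not the paper's homogenized SDE or its Volterra-equation solution; the Gaussian data model, planted target, dimensions, and step size are all assumptions chosen for illustration.

import numpy as np

# Minimal sketch (not the paper's code): one-sample SGD on a random
# l2-regularized least squares problem, recording the full-batch quadratic
# risk along the trajectory. All problem sizes and constants are assumed.
rng = np.random.default_rng(0)
n, d = 4000, 1000                 # samples and features, polynomially related
delta = 1e-2                      # l2-regularization strength (assumed)
gamma = 0.2                       # constant step size (assumed)

A = rng.standard_normal((n, d)) / np.sqrt(d)    # data matrix, rows ~ unit norm
x_star = rng.standard_normal(d) / np.sqrt(d)    # planted signal
b = A @ x_star + 0.1 * rng.standard_normal(n)   # noisy targets

x = np.zeros(d)
risk_trace = []
for t in range(10 * n):
    i = rng.integers(n)                              # draw a single sample
    grad = (A[i] @ x - b[i]) * A[i] + delta * x      # stochastic gradient
    x = x - gamma * grad
    if t % n == 0:
        # full-batch quadratic risk, a proxy for the statistics the theory tracks
        risk_trace.append(0.5 * np.mean((A @ x - b) ** 2) + 0.5 * delta * x @ x)

print(risk_trace)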

9 citations

Journal Article
TL;DR: A theory of early learning for models trained with softmax-cross-entropy loss is developed and it is shown that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $||\beta{\bf z}||_{2}$.
Abstract: The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature β as well as the magnitude of the logits at initialization, $||\beta\mathbf{z}||_{2}$. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on β is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of β as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal β to be sensitive to the architecture, our results suggest that tuning β over the range $10^{-2}$ to $10^{1}$ improves performance over all architectures studied. We find that smaller β may lead to better peak performance at the cost of learning stability.
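As a concrete illustration of the hyperparameter under study, the sketch below applies an inverse temperature β to the logits before a softmax-cross-entropy loss; the toy logits and label are assumptions for illustration only.

import numpy as np

def softmax_cross_entropy(logits, label, beta=1.0):
    # Scale the logits by the inverse temperature beta, then compute the
    # (numerically stabilized) cross-entropy of the softmax distribution.
    z = beta * np.asarray(logits, dtype=float)
    z = z - z.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = [2.0, -1.0, 0.5]          # toy logits; ||beta*z||_2 sets the initial scale
for beta in [1e-2, 1e-1, 1.0, 10.0]:
    print(f"beta={beta:g}  loss={softmax_cross_entropy(logits, label=0, beta=beta):.4f}")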

9 citations


Cited by
Book
18 Nov 2016
TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Posted Content
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
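As an illustration of the fine-tuning recipe described above (one additional output layer on top of the pre-trained encoder), the sketch below uses the Hugging Face transformers library; this tooling choice, the checkpoint name, and the toy input are assumptions, not part of the original paper.

# Sketch: BertForSequenceClassification adds a single classification head
# on top of the pre-trained BERT encoder; fine-tuning updates both.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("A toy example sentence.", return_tensors="pt")
labels = torch.tensor([1])                     # toy label for illustration
outputs = model(**inputs, labels=labels)       # forward pass returns the loss
outputs.loss.backward()                        # gradients flow through head and encoder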

29,480 citations

Proceedings Article
11 Oct 2018
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

24,672 citations

Journal Article
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations

Proceedings Article
28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
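To make the "few-shot via text interaction" protocol concrete, the sketch below assembles task demonstrations and a query into a single prompt string, with no gradient updates; the word-unscrambling examples are made up for illustration and are not taken from the paper.

# Sketch of a few-shot prompt: demonstrations are concatenated ahead of the
# query, and the resulting string is what gets sent to the language model.
demos = [
    ("Unscramble the letters: 'elppa'", "apple"),
    ("Unscramble the letters: 'ananab'", "banana"),
]
query = "Unscramble the letters: 'yrrehc'"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
prompt += f"\nQ: {query}\nA:"
print(prompt)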

10,132 citations