Author

Jeffrey Pennington

Bio: Jeffrey Pennington is an academic researcher from Google. The author has contributed to research in topics: Artificial neural network & Deep learning. The author has an h-index of 32 and has co-authored 75 publications receiving 28,787 citations. Previous affiliations of Jeffrey Pennington include University of Southern California & Princeton University.


Papers
Journal Article
TL;DR: In this article, the authors study the properties of the string equations and their physical solutions in the (2,4k) model and show that the localized D-branes of the minimal string theories are directly related to the solitons of the KdV hierarchy.
Abstract: We study the Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, focusing on the fully non-perturbative string equations which define the partition function of the model. The equations admit a parameter, Gamma, which in the spacetime interpretation controls the number of background D-branes, or R-R flux units, depending upon which weak coupling regime is taken. We study the properties of the string equations (often focusing on the (2,4) model in particular) and their physical solutions. The solutions are the potential for an associated Schrödinger problem whose wavefunction is that of an extended D-brane probe. We perform a numerical study of the spectrum of this system for varying Gamma and establish that when Gamma is a positive integer the equations' solutions have special properties consistent with the spacetime interpretation. We also show that a natural solution-generating transformation (that changes Gamma by an integer) is the Bäcklund transformation of the KdV hierarchy specialized to (scale invariant) solitons at zero velocity. Our results suggest that the localized D-branes of the minimal string theories are directly related to the solitons of the KdV hierarchy. Further, we observe an interesting transition when Gamma=-1.
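For orientation, the displays below sketch the general shape of the objects referred to: a non-perturbative string equation of Dalley–Johnson–Morris–Watterstam type for the potential u(z), and the associated Schrödinger problem whose wavefunction describes the extended D-brane probe. The precise normalizations and conventions are assumptions based on standard usage in this literature, not quoted from the paper:

\[ u\,\mathcal{R}^{2} \;-\; \frac{\nu^{2}}{2}\,\mathcal{R}\,\mathcal{R}'' \;+\; \frac{\nu^{2}}{4}\,\left(\mathcal{R}'\right)^{2} \;=\; \nu^{2}\,\Gamma^{2}, \qquad \mathcal{R}[u] \equiv \sum_{k\geq 1} t_{k}\,R_{k}[u], \]

\[ \left(-\nu^{2}\,\partial_{z}^{2} \;+\; u(z)\right)\psi(z) \;=\; \lambda\,\psi(z), \]

where the $R_{k}[u]$ are the Gelfand–Dikii (KdV) polynomials, $\nu$ plays the role of the string coupling, and $\Gamma$ is the parameter counting background D-branes or R-R flux units.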

14 citations

Journal Article
TL;DR: In this paper, Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, with background ZZ D-branes or R-R fluxes, is formulated non-perturbatively.
Abstract: Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, with background ZZ D-branes or R-R fluxes, can be formulated non-perturbatively. The branes and fluxes have a description as threshold bound states in an associated one-dimensional quantum mechanics which has a supersymmetric structure, familiar from studies of the generalized KdV system. The relevant bound-state wavefunctions in this problem have unusual asymptotics (they are not normalizable in general, and break supersymmetry) which are consistent with the underlying description in terms of open and closed string sectors. The overall organization of the physics is very pleasing: the physics of the closed strings in the background of branes or fluxes is captured by the generalized KdV system and non-perturbative string equations obtained by reduction of that system (the hierarchy of equations found by Dalley, Johnson, Morris and Watterstam). Meanwhile, the bound-state wavefunctions, which describe the physics of the ZZ D-brane (or flux) background in interaction with probe FZZT D-branes, are captured by the generalized mKdV system, and non-perturbative string equations obtained by reduction of that system (the Painlevé II hierarchy found by Periwal and Shevitz in this context).
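For reference, the first member of the Painlevé II hierarchy mentioned here is the classical Painlevé II equation, shown below in its conventional normalization (with constant parameter $\alpha$); this is included as a standard point of comparison rather than as the specific string equation derived in the paper:

\[ w''(z) \;=\; 2\,w(z)^{3} \;+\; z\,w(z) \;+\; \alpha. \]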

10 citations

Journal Article
TL;DR: In this paper, a generating function for the coefficients of the leading logarithmic BFKL Green's function in transverse-momentum space, order by order in alpha_s, in terms of single-valued harmonic polylogarithms was introduced.
Abstract: We introduce a generating function for the coefficients of the leading logarithmic BFKL Green's function in transverse-momentum space, order by order in alpha_s, in terms of single-valued harmonic polylogarithms. As an application, we exhibit fully analytic azimuthal-angle and transverse-momentum distributions for Mueller-Navelet jet cross sections at each order in alpha_s. We also provide a generating function for the total cross section valid to any number of loops.
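As background for the order-by-order expansion described here, the leading-logarithmic BFKL Green's function is conventionally built from the eigenvalue function of the LL BFKL kernel, shown below in its textbook form; this is included as an assumed point of reference, not as the paper's generating function:

\[ \chi(n,\nu) \;=\; 2\,\psi(1) \;-\; \psi\!\left(\frac{1+|n|}{2} + i\nu\right) \;-\; \psi\!\left(\frac{1+|n|}{2} - i\nu\right), \]

with the Green's function obtained by summing over conformal spins $n$ and integrating over $\nu$ with weight $e^{\bar{\alpha}_{s}\,\chi(n,\nu)\,Y}$, where $Y$ is the rapidity separation; expanding that exponential in $\bar{\alpha}_{s}$ yields the coefficients whose generating function the paper constructs.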

9 citations

14 May 2022
TL;DR: By analyzing homogenized SGD, the authors provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of the solution of a Volterra integral equation, as well as the exact value of the limiting excess risk in the case of quadratic losses trained by SGD.
Abstract: We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalence of SGD -- for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD when the number of samples $n$ and number of features $d$ are polynomially related ($d^{c} < n < d^{1/c}$ for some $c > 0$). By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation. Further we provide the exact value of the limiting excess risk in the case of quadratic losses when trained by SGD. The analysis is formulated for data matrices and target vectors that satisfy a family of resolvent conditions, which can roughly be viewed as a weak (non-quantitative) form of delocalization of sample-side singular vectors of the data. Several motivating applications are provided including sample covariance matrices with independent samples and random features with non-generative model targets.
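The following is a minimal, self-contained sketch of the kind of setup analyzed above: plain SGD on a random least squares problem with $\ell^2$-regularization, tracking a quadratic risk statistic along the iterates. It is not the paper's homogenized SDE or its Volterra-equation solution; the Gaussian data model, planted target, dimensions, and step size are all assumptions chosen for illustration.

import numpy as np

# Minimal sketch (not the paper's code): one-sample SGD on a random
# l2-regularized least squares problem, recording the full-batch quadratic
# risk along the trajectory. All problem sizes and constants are assumed.
rng = np.random.default_rng(0)
n, d = 4000, 1000                 # samples and features, polynomially related
delta = 1e-2                      # l2-regularization strength (assumed)
gamma = 0.2                       # constant step size (assumed)

A = rng.standard_normal((n, d)) / np.sqrt(d)    # data matrix, rows ~ unit norm
x_star = rng.standard_normal(d) / np.sqrt(d)    # planted signal
b = A @ x_star + 0.1 * rng.standard_normal(n)   # noisy targets

x = np.zeros(d)
risk_trace = []
for t in range(10 * n):
    i = rng.integers(n)                              # draw a single sample
    grad = (A[i] @ x - b[i]) * A[i] + delta * x      # stochastic gradient
    x = x - gamma * grad
    if t % n == 0:
        # full-batch quadratic risk, a proxy for the statistics the theory tracks
        risk_trace.append(0.5 * np.mean((A @ x - b) ** 2) + 0.5 * delta * x @ x)

print(risk_trace)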

9 citations

Journal Article
TL;DR: A theory of early learning for models trained with softmax-cross-entropy loss is developed and it is shown that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $||\beta{\bf z}||_{2}$.
Abstract: The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature β as well as the magnitude of the logits at initialization, $||\beta\mathbf{z}||_{2}$. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on β is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of β as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal β to be sensitive to the architecture, our results suggest that tuning β over the range $10^{-2}$ to $10^{1}$ improves performance over all architectures studied. We find that smaller β may lead to better peak performance at the cost of learning stability.
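As a concrete illustration of the hyperparameter under study, the sketch below applies an inverse temperature β to the logits before a softmax-cross-entropy loss; the toy logits and label are assumptions for illustration only.

import numpy as np

def softmax_cross_entropy(logits, label, beta=1.0):
    # Scale the logits by the inverse temperature beta, then compute the
    # (numerically stabilized) cross-entropy of the softmax distribution.
    z = beta * np.asarray(logits, dtype=float)
    z = z - z.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = [2.0, -1.0, 0.5]          # toy logits; ||beta*z||_2 sets the initial scale
for beta in [1e-2, 1e-1, 1.0, 10.0]:
    print(f"beta={beta:g}  loss={softmax_cross_entropy(logits, label=0, beta=beta):.4f}")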

9 citations


Cited by
Book
18 Nov 2016
TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Posted Content
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
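As an illustration of the fine-tuning recipe described above (one additional output layer on top of the pre-trained encoder), the sketch below uses the Hugging Face transformers library; this tooling choice, the checkpoint name, and the toy input are assumptions, not part of the original paper.

# Sketch: BertForSequenceClassification adds a single classification head
# on top of the pre-trained BERT encoder; fine-tuning updates both.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("A toy example sentence.", return_tensors="pt")
labels = torch.tensor([1])                     # toy label for illustration
outputs = model(**inputs, labels=labels)       # forward pass returns the loss
outputs.loss.backward()                        # gradients flow through head and encoder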

29,480 citations

Proceedings Article
11 Oct 2018
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

24,672 citations

Journal Article
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations

Proceedings Article
28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
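To make the "few-shot via text interaction" protocol concrete, the sketch below assembles task demonstrations and a query into a single prompt string, with no gradient updates; the word-unscrambling examples are made up for illustration and are not taken from the paper.

# Sketch of a few-shot prompt: demonstrations are concatenated ahead of the
# query, and the resulting string is what gets sent to the language model.
demos = [
    ("Unscramble the letters: 'elppa'", "apple"),
    ("Unscramble the letters: 'ananab'", "banana"),
]
query = "Unscramble the letters: 'yrrehc'"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
prompt += f"\nQ: {query}\nA:"
print(prompt)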

10,132 citations