Author

Jeffrey Pennington

Bio: Jeffrey Pennington is an academic researcher from Google. The author has contributed to research in topics: Artificial neural network & Deep learning. The author has an h-index of 32 and has co-authored 75 publications receiving 28,787 citations. Previous affiliations of Jeffrey Pennington include the University of Southern California and Princeton University.


Papers
Posted Content
TL;DR: This work identifies large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize, and finds that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance.
Abstract: A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. These theoretical results are corroborated experimentally on CIFAR10 for a variety of network architectures and we include a colab notebook that reproduces the essential results of the paper.
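
The NTK spectral analysis described above can be sketched with the open-source neural-tangents library; the snippet below is an illustrative assumption (the architecture, initialization scales, and batch size are placeholders), not the paper's colab notebook.

# Minimal sketch (not the paper's colab): compute the infinite-width NTK of a
# small fully connected network and inspect its spectrum. Assumes the
# neural-tangents and JAX packages are installed; all hyperparameters are
# illustrative placeholders.
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# Infinitely wide 3-layer FCN; W_std and b_std play the role of the
# initialization hyperparameters whose choice governs trainability.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(),
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(),
    stax.Dense(1, W_std=1.5, b_std=0.05),
)

key = random.PRNGKey(0)
x = random.normal(key, (128, 32))          # stand-in for a batch of inputs

ntk = kernel_fn(x, x, 'ntk')               # analytic NTK on this batch
eigvals = jnp.linalg.eigvalsh(ntk)

# The conditioning of the NTK is one diagnostic of trainability discussed in
# the text: a tiny smallest eigenvalue relative to the largest signals slow
# or failed training of the corresponding modes.
print('largest eigenvalue :', float(eigvals[-1]))
print('smallest eigenvalue:', float(eigvals[0]))
print('condition number   :', float(eigvals[-1] / eigvals[0]))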

30 citations

Journal ArticleDOI
TL;DR: In this paper, it was shown that the natural functions for describing the multi-Regge limit of six-gluon scattering in planar N=4 super Yang-Mills theory are the single-valued harmonic polylogarithmic functions introduced by Brown.
Abstract: We argue that the natural functions for describing the multi-Regge limit of six-gluon scattering in planar N=4 super Yang-Mills theory are the single-valued harmonic polylogarithmic functions introduced by Brown. These functions depend on a single complex variable and its conjugate, (w,w*). Using these functions, and formulas due to Fadin, Lipatov and Prygarin, we determine the six-gluon MHV remainder function in the leading-logarithmic approximation (LLA) in this limit through ten loops, and the next-to-LLA (NLLA) terms through nine loops. In separate work, we have determined the symbol of the four-loop remainder function for general kinematics, up to 113 constants. Taking its multi-Regge limit and matching to our four-loop LLA and NLLA results, we fix all but one of the constants that survive in this limit. The multi-Regge limit factorizes in the variables (ν,n), which are related to (w,w*) by a Fourier-Mellin transform. We can transform the single-valued harmonic polylogarithms to functions of (ν,n) that incorporate harmonic sums, systematically through transcendental weight six. Combining this information with the four-loop results, we determine the eigenvalues of the BFKL kernel in the adjoint representation to NNLLA accuracy, and the MHV product of impact factors to NNNLLA accuracy, up to constants representing beyond-the-symbol terms and the one symbol-level constant. Remarkably, only derivatives of the polygamma function enter these results. Finally, the LLA approximation to the six-gluon NMHV amplitude is evaluated through ten loops.
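
For orientation, a standard textbook example of the single-valued functions invoked here (not taken from the paper itself): the simplest functions of (w,w*) with no branch cuts in the w-plane are

\[
\mathcal{L}_0(w) = \log|w|^2 = \log w + \log \bar{w},
\]

and, at transcendental weight two, the Bloch-Wigner dilogarithm

\[
D(w) = \operatorname{Im}\,\mathrm{Li}_2(w) + \arg(1-w)\,\log|w|,
\]

whose discontinuities at w=0 and w=1 cancel between the two terms. Brown's single-valued harmonic polylogarithms extend this construction to arbitrary weight, which is what makes them natural for the multi-Regge limit, where the remainder function must be single-valued in w.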

29 citations

Journal ArticleDOI
TL;DR: The three-loop remainder function, which describes the scattering of six gluons in the maximally-helicity-violating configuration in planar N=4 super-Yang-Mills theory, is presented as a function of the three dual conformal cross ratios.
Abstract: We present the three-loop remainder function, which describes the scattering of six gluons in the maximally-helicity-violating configuration in planar N=4 super-Yang-Mills theory, as a function of the three dual conformal cross ratios. The result can be expressed in terms of multiple Goncharov polylogarithms. We also employ a more restricted class of "hexagon functions" which have the correct branch cuts and certain other restrictions on their symbols. We classify all the hexagon functions through transcendental weight five, using the coproduct for their Hopf algebra iteratively, which amounts to a set of first-order differential equations. The three-loop remainder function is a particular weight-six hexagon function, whose symbol was determined previously. The differential equations can be integrated numerically for generic values of the cross ratios, or analytically in certain kinematic limits, including the near-collinear and multi-Regge limits. These limits allow us to impose constraints from the operator product expansion and multi-Regge factorization directly at the function level, and thereby to fix uniquely a set of Riemann-zeta-valued constants that could not be fixed at the level of the symbol. The near-collinear limits agree precisely with recent predictions by Basso, Sever and Vieira based on integrability. The multi-Regge limits agree with the factorization formula of Fadin and Lipatov, and determine three constants entering the impact factor at this order. We plot the three-loop remainder function for various slices of the Euclidean region of positive cross ratios, and compare it to the two-loop one. For large ranges of the cross ratios, the ratio of the three-loop to the two-loop remainder function is relatively constant, and close to -7.
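
Schematically (stated here as general background on such iterated integrals rather than quoted from the paper), the first-order differential equations arise from the {n-1,1} component of the coproduct: a weight-n hexagon function F obeys

\[
dF \;=\; \sum_{s\,\in\,\mathcal{S}} F^{s}\, d\log s,
\]

where the sum runs over the symbol alphabet \mathcal{S}, taken for hexagon functions to be the nine letters {u, v, w, 1-u, 1-v, 1-w, y_u, y_v, y_w}, and each F^{s} is a weight-(n-1) hexagon function. Integrating these equations numerically, starting from known lower-weight functions, is what allows the remainder function to be evaluated at generic values of the cross ratios.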

29 citations

Posted Content
TL;DR: This work combines several different techniques for randomized estimation and shows that it is possible to construct unbiased estimators to answer a broad class of questions about the spectra of such implicit matrices, even in the presence of noise.
Abstract: Many important problems are characterized by the eigenvalues of a large matrix. For example, the difficulty of many optimization problems, such as those arising from the fitting of large models in statistics and machine learning, can be investigated via the spectrum of the Hessian of the empirical loss function. Network data can be understood via the eigenstructure of a graph Laplacian matrix using spectral graph theory. Quantum simulations and other many-body problems are often characterized via the eigenvalues of the solution space, as are various dynamical systems. However, naive eigenvalue estimation is computationally expensive even when the matrix can be represented; in many of these situations the matrix is so large as to only be available implicitly via products with vectors. Even worse, one may only have noisy estimates of such matrix-vector products. In this work, we combine several different techniques for randomized estimation and show that it is possible to construct unbiased estimators to answer a broad class of questions about the spectra of such implicit matrices, even in the presence of noise. We validate these methods on large-scale problems in which graph theory and random matrix theory provide ground truth.
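
As an illustration of the matrix-free setting (a generic Hutchinson-style sketch with hypothetical names, not the paper's specific unbiased estimators for the noisy case), the spectral moments tr(A^k)/n can be estimated using only matrix-vector products:

# Generic sketch of randomized moment estimation from matrix-vector products
# only (Hutchinson-style); this illustrates the setting of the abstract, not
# the paper's estimators for the noisy case.
import numpy as np

def estimate_spectral_moments(matvec, n, num_moments=5, num_probes=20, rng=None):
    """Estimate tr(A^k)/n for k = 1..num_moments, where A is an n x n symmetric
    matrix available only through matvec(v) = A @ v."""
    rng = np.random.default_rng(rng)
    moments = np.zeros(num_moments)
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        v = z
        for k in range(num_moments):
            v = matvec(v)                     # v = A^{k+1} z
            moments[k] += z @ v               # unbiased estimate of tr(A^{k+1})
    return moments / (num_probes * n)

# Example with an explicit matrix standing in for an implicit one.
n = 500
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(n, n)))
spectrum = np.linspace(0.1, 2.0, n)
A = Q @ np.diag(spectrum) @ Q.T               # known spectrum for checking
est = estimate_spectral_moments(lambda v: A @ v, n, rng=1)
true = [np.mean(spectrum ** k) for k in range(1, 6)]
print('estimated moments:', np.round(est, 3))
print('true moments     :', np.round(true, 3))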

28 citations

Journal ArticleDOI
TL;DR: The four-loop remainder function for six-gluon scattering with maximal helicity violation in planar N=4 super-Yang-Mills theory is presented in this paper as an analytic function of three dual-conformal cross ratios.
Abstract: We present the four-loop remainder function for six-gluon scattering with maximal helicity violation in planar N=4 super-Yang-Mills theory, as an analytic function of three dual-conformal cross ratios. The function is constructed entirely from its analytic properties, without ever inspecting any multi-loop integrand. We employ the same approach used at three loops, writing an ansatz in terms of hexagon functions, and fixing coefficients in the ansatz using the multi-Regge limit and the operator product expansion in the near-collinear limit. We express the result in terms of multiple polylogarithms, and in terms of the coproduct for the associated Hopf algebra. From the remainder function, we extract the BFKL eigenvalue at next-to-next-to-leading logarithmic accuracy (NNLLA), and the impact factor at NNNLLA. We plot the remainder function along various lines and on one surface, studying ratios of successive loop orders. As seen previously through three loops, these ratios are surprisingly constant over large regions in the space of cross ratios, and they are not far from the value expected at asymptotically large orders of perturbation theory.
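
In schematic form (our paraphrase of the bootstrap strategy summarized above, not an equation quoted from the paper), the construction writes

\[
R_6^{(4)}(u,v,w) \;=\; \sum_i c_i\, h_i(u,v,w),
\]

where the h_i form a basis of weight-eight hexagon functions (the weight is 2L at L loops) and the rational coefficients c_i are fixed by matching the near-collinear operator product expansion and the multi-Regge limit, with no multi-loop integrand ever evaluated.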

28 citations


Cited by
Book
18 Nov 2016
TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Posted Content
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
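
The "one additional output layer" recipe can be sketched with the Hugging Face transformers library (an assumption for illustration; the authors' original release is a TensorFlow codebase, and the dataset, labels, and hyperparameters below are placeholders):

# Minimal sketch of fine-tuning a pre-trained BERT encoder with a single
# classification head on top; data and hyperparameters are placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# BertForSequenceClassification adds one classification layer on top of the
# pre-trained bidirectional encoder; all other weights are reused.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

batch = tokenizer(['a premise sentence', 'another example'],
                  padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # cross-entropy loss from the new head
outputs.loss.backward()
optimizer.step()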

29,480 citations

Proceedings ArticleDOI
11 Oct 2018
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

24,672 citations

Journal ArticleDOI
TL;DR: Recent work in the area of unsupervised feature learning and deep learning is reviewed, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks.
Abstract: The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.

11,201 citations

Proceedings Article
28 May 2020
TL;DR: GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
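
To make the few-shot protocol concrete (a toy illustration; the demonstrations below are invented, and GPT-3 itself is accessed through an API rather than this stub), the task specification and examples are supplied purely as text in the prompt, with no gradient updates:

# Few-shot prompting: the "training" signal is the text of the prompt itself.
demonstrations = [
    ("Unscramble the word: 'pplae'", "apple"),
    ("Unscramble the word: 'nanaab'", "banana"),
]
query = "Unscramble the word: 'rgaep'"

prompt = ""
for question, answer in demonstrations:
    prompt += f"Q: {question}\nA: {answer}\n\n"
prompt += f"Q: {query}\nA:"

print(prompt)  # the model is asked to continue this text with "grape"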

10,132 citations