Author

Jacob Devlin

Bio: Jacob Devlin is an academic researcher at Google. He has contributed to research on topics including machine translation and artificial neural networks. He has an h-index of 23 and has co-authored 35 publications receiving 31,122 citations. Previous affiliations of Jacob Devlin include Carnegie Mellon University and BBN Technologies.

Papers
Posted Content
Jacob Devlin
TL;DR: This work proposes a simple but powerful network architecture that uses an RNN (GRU/LSTM) layer at the bottom, followed by a series of stacked fully-connected layers applied at every timestep; it achieves accuracy similar to a deep recurrent model at a small fraction of the training and decoding cost.
Abstract: Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is a significant increase in training and decoding cost compared to phrase-based systems. Here, we focus on efficient decoding, with a goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT), while achieving CPU decoding speed/throughput close to that of a phrasal decoder. We approach this problem from two angles: First, we describe several techniques for speeding up an NMT beam search decoder, which obtain a 4.4x speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at the bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 BLEU on WMT English-French NewsTest2014, while decoding at 100 words/sec on single-threaded CPU. We believe this is the best published accuracy/speed trade-off of an NMT system.
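To make the architecture concrete, here is a minimal PyTorch sketch of the idea the abstract describes: a single recurrent (GRU/LSTM) layer at the bottom, with a stack of fully-connected layers applied position-wise above it. The layer sizes, ReLU nonlinearity, and GRU choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RNNPlusStackedFC(nn.Module):
    """Sketch of the decoder body described above: one recurrent layer
    at the bottom, then stacked fully-connected layers applied
    independently at every timestep (no recurrence above the RNN)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_fc_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # or nn.LSTM
        # nn.Linear acts on the last dimension, so this stack is applied
        # to each timestep independently and can be parallelized.
        self.fc_stack = nn.Sequential(
            *[layer for _ in range(num_fc_layers)
              for layer in (nn.Linear(hidden_dim, hidden_dim), nn.ReLU())]
        )
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)   # (batch, time, embed_dim)
        h, _ = self.rnn(x)       # single recurrent layer at the bottom
        h = self.fc_stack(h)     # per-timestep fully-connected stack
        return self.out(h)       # logits over the vocabulary

model = RNNPlusStackedFC(vocab_size=32000)
logits = model(torch.randint(0, 32000, (2, 7)))   # (2, 7, 32000)
```

Because the upper layers have no recurrence, most of the per-timestep work reduces to dense matrix products, which is what makes fast CPU decoding plausible.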

9 citations

Posted Content
TL;DR: In this article, it is shown that strong reading comprehension models pre-trained on large unlabeled data can generalize to unseen entities, and a simple adaptive pre-training strategy is proposed to address the domain shift problem associated with linking unseen entities in a new domain.
Abstract: We present the zero-shot entity linking task, where mentions must be linked to unseen entities without in-domain labeled data. The goal is to enable robust transfer to highly specialized domains, and so no metadata or alias tables are assumed. In this setting, entities are only identified by text descriptions, and models must rely strictly on language understanding to resolve the new entities. First, we show that strong reading comprehension models pre-trained on large unlabeled data can be used to generalize to unseen entities. Second, we propose a simple and effective adaptive pre-training strategy, which we term domain-adaptive pre-training (DAP), to address the domain shift problem associated with linking unseen entities in a new domain. We present experiments on a new dataset that we construct for this task and show that DAP improves over strong pre-training baselines, including BERT. The data and code are available at this https URL.
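As a rough sketch of the DAP recipe, expressed with the Hugging Face Trainer API for concreteness (the paper predates this exact tooling, and `target_domain_dataset` is a hypothetical placeholder for tokenized in-domain text): take a generically pre-trained masked language model and continue masked-LM pre-training on raw target-domain text only, before fine-tuning on the linking task.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from a generically pre-trained encoder (e.g. BERT)...
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical placeholder: a tokenized dataset of raw text from the
# *target* domain only (e.g. the unseen entities' descriptions).
target_domain_dataset = ...

# ...and continue masked-LM pre-training on that in-domain text.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dap-checkpoint", num_train_epochs=1),
    train_dataset=target_domain_dataset,
    data_collator=collator,
)
trainer.train()
# The domain-adapted encoder is then fine-tuned on the source-domain
# entity-linking data as usual.
```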

8 citations

Posted Content
TL;DR: This article proposes new features based on neural networks to model various non-local translation phenomena, and augments the architecture of the neural network with tensor layers that capture important higher-order interactions among the network units.
Abstract: We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features.
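One common way to realize the "tensor layer" idea is a bilinear form per output unit, which multiplies pairs of input units together and so captures second-order interactions that a plain linear layer cannot. A minimal PyTorch sketch follows; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class TensorLayer(nn.Module):
    """Illustrative tensor layer: each output unit k computes a bilinear
    form x^T W_k x over the input (pairwise interactions among units),
    plus an ordinary linear term."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.bilinear = nn.Bilinear(in_dim, in_dim, out_dim)  # x^T W_k x + b
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # standard term

    def forward(self, x):
        return torch.tanh(self.bilinear(x, x) + self.linear(x))

x = torch.randn(8, 100)        # a batch of hidden activations
layer = TensorLayer(100, 50)
print(layer(x).shape)          # torch.Size([8, 50])
```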

4 citations

Proceedings ArticleDOI
01 Sep 2015
TL;DR: A neural network architecture in which the embedding layer feeds into multiple hidden layers placed “next to” one another, so that each can be pre-computed independently; it achieves a 10x speedup at runtime and a significant reduction in perplexity compared to a standard multilayer network.
Abstract: In the last several years, neural network models have significantly improved accuracy in a number of NLP tasks. However, one serious drawback that has impeded their adoption in production systems is the slow runtime speed of neural network models compared to alternate models, such as maximum entropy classifiers. In Devlin et al. (2014), the authors presented a simple technique for speeding up feed-forward embedding-based neural network models, where the dot product between each word embedding and part of the first hidden layer are pre-computed offline. However, this technique cannot be used for hidden layers beyond the first. In this paper, we explore a neural network architecture where the embedding layer feeds into multiple hidden layers that are placed “next to” one another so that each can be pre-computed independently. On a large scale language modeling task, this architecture achieves a 10x speedup at runtime and a significant reduction in perplexity when compared to a standard multilayer network.
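The pre-computation trick is easy to see in code. In a feed-forward model, each context position contributes `embedding @ W_i` to the first hidden layer; since that product depends only on the word identity and the position, it can be tabulated offline, and runtime reduces to table lookups and a sum. Below is a NumPy sketch of the Devlin et al. (2014) trick with illustrative sizes; this paper's contribution is arranging multiple hidden layers side by side so that each can be tabulated this way, not just the first.

```python
import numpy as np

V, D, H, CONTEXT = 10000, 128, 256, 4   # vocab, embed, hidden, context size

rng = np.random.default_rng(0)
E = rng.standard_normal((V, D))          # embedding table
W = rng.standard_normal((CONTEXT, D, H)) # per-position first-layer weights
b = rng.standard_normal(H)

# Offline: fold the embedding lookup and the per-position matrix product
# into one table per context position. Entry [i, w] = E[w] @ W[i].
precomputed = np.einsum('vd,cdh->cvh', E, W)   # (CONTEXT, V, H)

def hidden_fast(context_word_ids):
    """Runtime: the first hidden layer is CONTEXT table lookups and a
    sum -- no matrix-vector products."""
    acc = b.copy()
    for i, w in enumerate(context_word_ids):
        acc += precomputed[i, w]
    return np.tanh(acc)

def hidden_slow(context_word_ids):
    """Reference: the straightforward computation."""
    acc = b.copy()
    for i, w in enumerate(context_word_ids):
        acc += E[w] @ W[i]
    return np.tanh(acc)

ctx = [3, 17, 42, 999]
assert np.allclose(hidden_fast(ctx), hidden_slow(ctx))
```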

4 citations

Proceedings Article
01 Nov 2011
TL;DR: This paper describes a method of cross-adaptation-based system combination that requires only the final output from each alternate system, and presents a novel procedure called non-linear Expected-BLEU optimization that can be used to optimize arbitrary nonlinear parameters for any decoding feature.
Abstract: Cross-adaptation (CA) based methods of machine translation (MT) system combination work by adapting the decoding step of a baseline system using information from alternate systems. Generally, the required information is very deep, such as a full decoding forest. In this paper, we describe a method of cross-adaptation based system combination which only requires the final output from each alternate system. This is achieved by adding a discriminatively weighted n-gram confidence feature to our decoder. In order to optimize the confidence weight of each system, we present a novel procedure called non-linear Expected-BLEU optimization that can be used to optimize arbitrary nonlinear parameters for any decoding feature. We also describe a method for explicitly creating an adapted system that is dissimilar from each particular input system, which we have found to be useful in combination. Although our new method does not outperform a state-of-the-art confusion network (CN) based combination system on its own, we obtain statistically significant gains of 0.21-0.45 BLEU when the CA output is used as an additional system in CN combination.
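An illustrative (not the paper's exact) form of such an n-gram confidence feature: score each hypothesis n-gram by the summed, discriminatively learned weight of the alternate systems whose output contains it.

```python
def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_confidence(hypothesis, system_outputs, system_weights, n=2):
    """Illustrative confidence feature: each hypothesis n-gram earns the
    summed weight of the systems whose output contains that n-gram."""
    score = 0.0
    for gram in ngrams(hypothesis, n):
        for output, weight in zip(system_outputs, system_weights):
            if gram in set(ngrams(output, n)):
                score += weight
    return score

hyp = "the cat sat on the mat".split()
outputs = ["the cat sat on a mat".split(), "a cat is on the mat".split()]
print(ngram_confidence(hyp, outputs, system_weights=[0.6, 0.4]))
```

In the paper's setting, the per-system weights would be tuned by the non-linear Expected-BLEU procedure rather than fixed by hand as here.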

3 citations


Cited by
Posted Content
TL;DR: It is found that BERT was significantly undertrained and, when trained more carefully, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

13,994 citations

Posted Content
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
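The core input transformation is simple: split the image into fixed-size patches, flatten each patch, and linearly project it to the model dimension, yielding a token sequence a standard Transformer encoder can consume. A minimal PyTorch sketch using ViT-Base-like sizes (16x16 patches, width 768):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of ViT's input pipeline: image -> sequence of patch tokens."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is exactly a per-patch
        # linear projection of the flattened patch.
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                 # (B, 3, 224, 224)
        x = self.proj(images)                  # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, dim) token sequence

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 196, 768])
```

In the full model, a learned class token and position embeddings are added to this sequence before the Transformer encoder.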

12,690 citations

Journal ArticleDOI
15 Jul 2021-Nature
TL;DR: AlphaFold, as presented in this paper, uses a novel deep learning architecture to predict protein structures with accuracy competitive with experimental structures in the majority of cases, even when no homologous structure is available.
Abstract: Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1–4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’ [8]—has been an important open research problem for more than 50 years [9]. Despite recent progress [10–14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

10,601 citations

Posted Content
TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
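The text-to-text format itself is easy to illustrate: every task is cast as mapping an input string, marked with a short task prefix, to an output string, so one model and one training objective cover translation, classification, and summarization alike. The strings below follow the style of the paper's published examples but are illustrative rather than quoted.

```python
# Every task becomes "input text -> output text"; a short prefix on the
# input tells the model which task to perform.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews to survey "
     "the damage after severe weather in mississippi ...",
     "six people hospitalized after a storm in attala county."),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```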

6,953 citations

Proceedings ArticleDOI
01 Oct 2020
TL;DR: Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
Abstract: Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at https://github.com/huggingface/transformers.
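A minimal usage example of the unified API described above, using stable, documented entry points of the library:

```python
from transformers import AutoModel, AutoTokenizer, pipeline

# One line loads a pretrained model and tokenizer behind a task pipeline.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art NLP accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# The same pattern at a lower level: any supported architecture loads
# through the unified Auto* classes.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)   # outputs.last_hidden_state: (1, seq_len, 768)
```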

4,798 citations