Author

Jacob Devlin

Bio: Jacob Devlin is an academic researcher at Google. He has contributed to research on topics including machine translation and artificial neural networks. He has an h-index of 23 and has co-authored 35 publications receiving 31,122 citations. Previous affiliations of Jacob Devlin include Carnegie Mellon University and BBN Technologies.

Papers
Posted Content
Jacob Devlin
TL;DR: This work proposes a simple but powerful network architecture that uses an RNN (GRU/LSTM) layer at the bottom, followed by a series of stacked fully-connected layers applied at every timestep; it achieves accuracy similar to a deep recurrent model at a small fraction of the training and decoding cost.
Abstract: Attentional sequence-to-sequence models have become the new standard for machine translation, but one challenge of such models is a significant increase in training and decoding cost compared to phrase-based systems. Here, we focus on efficient decoding, with a goal of achieving accuracy close to the state-of-the-art in neural machine translation (NMT), while achieving CPU decoding speed/throughput close to that of a phrasal decoder. We approach this problem from two angles: First, we describe several techniques for speeding up an NMT beam search decoder, which obtain a 4.4x speedup over a very efficient baseline decoder without changing the decoder output. Second, we propose a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at the bottom, followed by a series of stacked fully-connected layers applied at every timestep. This architecture achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost. By combining these techniques, our best system achieves a very competitive accuracy of 38.3 BLEU on WMT English-French NewsTest2014, while decoding at 100 words/sec on single-threaded CPU. We believe this is the best published accuracy/speed trade-off of an NMT system.
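To make the architecture concrete, here is a minimal PyTorch sketch of the idea the abstract describes: a single recurrent (GRU/LSTM) layer at the bottom, with a stack of fully-connected layers applied position-wise above it. The layer sizes, ReLU nonlinearity, and GRU choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RNNPlusStackedFC(nn.Module):
    """Sketch of the decoder body described above: one recurrent layer
    at the bottom, then stacked fully-connected layers applied
    independently at every timestep (no recurrence above the RNN)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_fc_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # or nn.LSTM
        # nn.Linear acts on the last dimension, so this stack is applied
        # to each timestep independently and can be parallelized.
        self.fc_stack = nn.Sequential(
            *[layer for _ in range(num_fc_layers)
              for layer in (nn.Linear(hidden_dim, hidden_dim), nn.ReLU())]
        )
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)   # (batch, time, embed_dim)
        h, _ = self.rnn(x)       # single recurrent layer at the bottom
        h = self.fc_stack(h)     # per-timestep fully-connected stack
        return self.out(h)       # logits over the vocabulary

model = RNNPlusStackedFC(vocab_size=32000)
logits = model(torch.randint(0, 32000, (2, 7)))   # (2, 7, 32000)
```

Because the upper layers have no recurrence, most of the per-timestep work reduces to dense matrix products, which is what makes fast CPU decoding plausible.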

9 citations

Posted Content
TL;DR: In this article, it is shown that strong reading comprehension models pre-trained on large unlabeled data can generalize to unseen entities, and a simple adaptive pre-training strategy is proposed to address the domain shift problem associated with linking unseen entities in a new domain.
Abstract: We present the zero-shot entity linking task, where mentions must be linked to unseen entities without in-domain labeled data. The goal is to enable robust transfer to highly specialized domains, and so no metadata or alias tables are assumed. In this setting, entities are only identified by text descriptions, and models must rely strictly on language understanding to resolve the new entities. First, we show that strong reading comprehension models pre-trained on large unlabeled data can be used to generalize to unseen entities. Second, we propose a simple and effective adaptive pre-training strategy, which we term domain-adaptive pre-training (DAP), to address the domain shift problem associated with linking unseen entities in a new domain. We present experiments on a new dataset that we construct for this task and show that DAP improves over strong pre-training baselines, including BERT. The data and code are available at this https URL.
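As a rough sketch of the DAP recipe, expressed with the Hugging Face Trainer API for concreteness (the paper predates this exact tooling, and `target_domain_dataset` is a hypothetical placeholder for tokenized in-domain text): take a generically pre-trained masked language model and continue masked-LM pre-training on raw target-domain text only, before fine-tuning on the linking task.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Start from a generically pre-trained encoder (e.g. BERT)...
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical placeholder: a tokenized dataset of raw text from the
# *target* domain only (e.g. the unseen entities' descriptions).
target_domain_dataset = ...

# ...and continue masked-LM pre-training on that in-domain text.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dap-checkpoint", num_train_epochs=1),
    train_dataset=target_domain_dataset,
    data_collator=collator,
)
trainer.train()
# The domain-adapted encoder is then fine-tuned on the source-domain
# entity-linking data as usual.
```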

8 citations

Posted Content
TL;DR: This article proposes new features based on neural networks to model various non-local translation phenomena, and augments the architecture of the neural network with tensor layers that capture important higher-order interactions among the network units.
Abstract: We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features.
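One common way to realize the "tensor layer" idea is a bilinear form per output unit, which multiplies pairs of input units together and so captures second-order interactions that a plain linear layer cannot. A minimal PyTorch sketch follows; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class TensorLayer(nn.Module):
    """Illustrative tensor layer: each output unit k computes a bilinear
    form x^T W_k x over the input (pairwise interactions among units),
    plus an ordinary linear term."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.bilinear = nn.Bilinear(in_dim, in_dim, out_dim)  # x^T W_k x + b
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # standard term

    def forward(self, x):
        return torch.tanh(self.bilinear(x, x) + self.linear(x))

x = torch.randn(8, 100)        # a batch of hidden activations
layer = TensorLayer(100, 50)
print(layer(x).shape)          # torch.Size([8, 50])
```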

4 citations

Proceedings ArticleDOI
01 Sep 2015
TL;DR: A neural network architecture in which the embedding layer feeds into multiple hidden layers placed “next to” one another, so that each can be pre-computed independently; it achieves a 10x speedup at runtime and a significant reduction in perplexity compared to a standard multilayer network.
Abstract: In the last several years, neural network models have significantly improved accuracy in a number of NLP tasks. However, one serious drawback that has impeded their adoption in production systems is the slow runtime speed of neural network models compared to alternate models, such as maximum entropy classifiers. In Devlin et al. (2014), the authors presented a simple technique for speeding up feed-forward embedding-based neural network models, where the dot product between each word embedding and part of the first hidden layer are pre-computed offline. However, this technique cannot be used for hidden layers beyond the first. In this paper, we explore a neural network architecture where the embedding layer feeds into multiple hidden layers that are placed “next to” one another so that each can be pre-computed independently. On a large scale language modeling task, this architecture achieves a 10x speedup at runtime and a significant reduction in perplexity when compared to a standard multilayer network.
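The pre-computation trick is easy to see in code. In a feed-forward model, each context position contributes `embedding @ W_i` to the first hidden layer; since that product depends only on the word identity and the position, it can be tabulated offline, and runtime reduces to table lookups and a sum. Below is a NumPy sketch of the Devlin et al. (2014) trick with illustrative sizes; this paper's contribution is arranging multiple hidden layers side by side so that each can be tabulated this way, not just the first.

```python
import numpy as np

V, D, H, CONTEXT = 10000, 128, 256, 4   # vocab, embed, hidden, context size

rng = np.random.default_rng(0)
E = rng.standard_normal((V, D))          # embedding table
W = rng.standard_normal((CONTEXT, D, H)) # per-position first-layer weights
b = rng.standard_normal(H)

# Offline: fold the embedding lookup and the per-position matrix product
# into one table per context position. Entry [i, w] = E[w] @ W[i].
precomputed = np.einsum('vd,cdh->cvh', E, W)   # (CONTEXT, V, H)

def hidden_fast(context_word_ids):
    """Runtime: the first hidden layer is CONTEXT table lookups and a
    sum -- no matrix-vector products."""
    acc = b.copy()
    for i, w in enumerate(context_word_ids):
        acc += precomputed[i, w]
    return np.tanh(acc)

def hidden_slow(context_word_ids):
    """Reference: the straightforward computation."""
    acc = b.copy()
    for i, w in enumerate(context_word_ids):
        acc += E[w] @ W[i]
    return np.tanh(acc)

ctx = [3, 17, 42, 999]
assert np.allclose(hidden_fast(ctx), hidden_slow(ctx))
```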

4 citations

Proceedings Article
01 Nov 2011
TL;DR: This paper describes a method of cross-adaptation-based system combination that requires only the final output from each alternate system, and presents a novel procedure called non-linear Expected-BLEU optimization that can be used to optimize arbitrary nonlinear parameters for any decoding feature.
Abstract: Cross-adaptation (CA) based methods of machine translation (MT) system combination work by adapting the decoding step of a baseline system using information from alternate systems. Generally, the required information is very deep, such as a full decoding forest. In this paper, we describe a method of cross-adaptation based system combination which only requires the final output from each alternate system. This is achieved by adding a discriminatively weighted n-gram confidence feature to our decoder. In order to optimize the confidence weight of each system, we present a novel procedure called non-linear Expected-BLEU optimization that can be used to optimize arbitrary nonlinear parameters for any decoding feature. We also describe a method for explicitly creating an adapted system that is dissimilar from each particular input system, which we have found to be useful in combination. Although our new method does not outperform a state-of-the-art confusion network (CN) based combination system on its own, we obtain statistically significant gains of 0.21-0.45 BLEU when the CA output is used as an additional system in CN combination.
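An illustrative (not the paper's exact) form of such an n-gram confidence feature: score each hypothesis n-gram by the summed, discriminatively learned weight of the alternate systems whose output contains it.

```python
def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_confidence(hypothesis, system_outputs, system_weights, n=2):
    """Illustrative confidence feature: each hypothesis n-gram earns the
    summed weight of the systems whose output contains that n-gram."""
    score = 0.0
    for gram in ngrams(hypothesis, n):
        for output, weight in zip(system_outputs, system_weights):
            if gram in set(ngrams(output, n)):
                score += weight
    return score

hyp = "the cat sat on the mat".split()
outputs = ["the cat sat on a mat".split(), "a cat is on the mat".split()]
print(ngram_confidence(hyp, outputs, system_weights=[0.6, 0.4]))
```

In the paper's setting, the per-system weights would be tuned by the non-linear Expected-BLEU procedure rather than fixed by hand as here.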

3 citations


Cited by
Posted Content
TL;DR: It is found that BERT was significantly undertrained and, when trained more carefully, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

13,994 citations

Posted Content
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
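The core input transformation is simple: split the image into fixed-size patches, flatten each patch, and linearly project it to the model dimension, yielding a token sequence a standard Transformer encoder can consume. A minimal PyTorch sketch using ViT-Base-like sizes (16x16 patches, width 768):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of ViT's input pipeline: image -> sequence of patch tokens."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is exactly a per-patch
        # linear projection of the flattened patch.
        self.proj = nn.Conv2d(in_chans, dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                 # (B, 3, 224, 224)
        x = self.proj(images)                  # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, dim) token sequence

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 196, 768])
```

In the full model, a learned class token and position embeddings are added to this sequence before the Transformer encoder.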

12,690 citations

Journal ArticleDOI
15 Jul 2021-Nature
TL;DR: AlphaFold, as presented in this paper, uses a novel deep learning architecture to predict protein structures with accuracy competitive with experimental structures in the majority of cases, even when no homologous structure is available.
Abstract: Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1–4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’ [8]—has been an important open research problem for more than 50 years [9]. Despite recent progress [10–14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

10,601 citations

Posted Content
TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
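The text-to-text format itself is easy to illustrate: every task is cast as mapping an input string, marked with a short task prefix, to an output string, so one model and one training objective cover translation, classification, and summarization alike. The strings below follow the style of the paper's published examples but are illustrative rather than quoted.

```python
# Every task becomes "input text -> output text"; a short prefix on the
# input tells the model which task to perform.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews to survey "
     "the damage after severe weather in mississippi ...",
     "six people hospitalized after a storm in attala county."),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```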

6,953 citations

Proceedings ArticleDOI
01 Oct 2020
TL;DR: Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
Abstract: Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at https://github.com/huggingface/transformers.
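A minimal usage example of the unified API described above, using stable, documented entry points of the library:

```python
from transformers import AutoModel, AutoTokenizer, pipeline

# One line loads a pretrained model and tokenizer behind a task pipeline.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art NLP accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# The same pattern at a lower level: any supported architecture loads
# through the unified Auto* classes.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)   # outputs.last_hidden_state: (1, seq_len, 768)
```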

4,798 citations