BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

doi:10.18653/V1/N19-1423

Home
/
Papers
/
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Proceedings Article•DOI•

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin¹, Ming-Wei Chang¹, Kenton Lee¹, Kristina Toutanova¹•Institutions (1)

Google¹

11 Oct 2018-pp 4171-4186

TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

read less

Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

...read moreread less

Citations

PDF

Open Access

More filters

Posted Content•

RoBERTa: A Robustly Optimized BERT Pretraining Approach

[...]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov - Show less +6 more

26 Jul 2019-arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

...read moreread less

Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

...read moreread less

13,994 citations

Posted Content•

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[...]

Alexey Dosovitskiy¹, Lucas Beyer¹, Alexander Kolesnikov¹, Dirk Weissenborn², Xiaohua Zhai¹, Thomas Unterthiner¹, Mostafa Dehghani¹, Matthias Minderer¹, Georg Heigold², Sylvain Gelly¹, Jakob Uszkoreit¹, Neil Houlsby¹ - Show less +8 more•Institutions (2)

Google¹, German Research Centre for Artificial Intelligence²

22 Oct 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

...read moreread less

Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

...read moreread less

12,690 citations

Journal Article•DOI•

Highly accurate protein structure prediction with AlphaFold

[...]

John M. Jumper, Richard O. Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russell Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, R. D. Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger¹, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David L. Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, Demis Hassabis - Show less +30 more•Institutions (1)

Seoul National University¹

15 Jul 2021-Nature

TL;DR: For example, AlphaFold as mentioned in this paper predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture. But the accuracy is limited by the fact that no homologous structure is available.

...read moreread less

Abstract: Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1–4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10–14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm. AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture.

...read moreread less

10,601 citations

Posted Content•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

[...]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

23 Oct 2019-arXiv: Learning

TL;DR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

...read moreread less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

6,953 citations

Proceedings Article•DOI•

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

[...]

Michael Lewis¹, Yinhan Liu¹, Naman Goyal¹, Marjan Ghazvininejad¹, Abdelrahman Mohamed¹, Omer Levy², Veselin Stoyanov¹, Luke Zettlemoyer¹ - Show less +4 more•Institutions (2)

Facebook¹, University of Washington²

01 Jul 2020

TL;DR: BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.

...read moreread less

Abstract: We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 3.5 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also replicate other pretraining schemes within the BART framework, to understand their effect on end-task performance.

...read moreread less

4,505 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

SQuAD: 100,000+ Questions for Machine Comprehension of Text

[...]

Pranav Rajpurkar¹, Jian Zhang¹, Konstantin Lopyrev¹, Percy Liang¹•Institutions (1)

Stanford University¹

16 Jun 2016

TL;DR: The Stanford Question Answering Dataset (SQuAD) as mentioned in this paper is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.

...read moreread less

Abstract: We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at this https URL

...read moreread less

3,667 citations

Proceedings Article•DOI•

Introduction to the CoNLL-2003 shared task: language-independent named entity recognition

[...]

Erik Tjong Kim Sang¹, Fien De Meulder¹•Institutions (1)

University of Antwerp¹

31 May 2003

TL;DR: The CoNLL-2003 shared task on NER as mentioned in this paper was the first NER task with language-independent named entity recognition (NER) data sets and evaluation method, and a general overview of the systems that participated in the task and their performance.

...read moreread less

Abstract: We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.

...read moreread less

3,554 citations

Journal Article•DOI•

Class-based n -gram models of natural language

[...]

Peter Fitzhugh Brown¹, Peter Vincent Desouza¹, Robert Leroy Mercer¹, Vincent J. Della Pietra¹, Jenifer C. Lai¹ - Show less +1 more•Institutions (1)

IBM¹

01 Dec 1992-Computational Linguistics

TL;DR: This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based groupings or semanticallybased groupings, depending on the nature of the underlying statistics.

...read moreread less

Abstract: We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.

...read moreread less

3,336 citations

Proceedings Article•DOI•

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

[...]

Adina Williams¹, Nikita Nangia¹, Samuel R. Bowman¹•Institutions (1)

New York University¹

01 Jun 2018

TL;DR: The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

...read moreread less

Abstract: This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding. At 433k examples, this resource is one of the largest corpora available for natural language inference (a.k.a. recognizing textual entailment), improving upon available resources in both its coverage and difficulty. MultiNLI accomplishes this by offering data from ten distinct genres of written and spoken English, making it possible to evaluate systems on nearly the full complexity of the language, while supplying an explicit setting for evaluating cross-genre domain adaptation. In addition, an evaluation using existing machine learning models designed for the Stanford NLI corpus shows that it represents a substantially more difficult task than does that corpus, despite the two showing similar levels of inter-annotator agreement.

...read moreread less

3,148 citations

Proceedings Article•DOI•

A large annotated corpus for learning natural language inference

[...]

Samuel R. Bowman¹, Gabor Angeli¹, Christopher Potts¹, Christopher D. Manning¹•Institutions (1)

Stanford University¹

21 Aug 2015

TL;DR: The Stanford Natural Language Inference (SNLI) corpus as discussed by the authors is a large-scale collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning.

...read moreread less

Abstract: Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

...read moreread less

3,100 citations