Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
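
The text-to-text framing means one model with a plain string-in, string-out interface covers translation, summarization, classification, and more, with the task signaled by a short prefix on the input. A minimal sketch of that interface, assuming the public "t5-small" checkpoint and the Hugging Face transformers library (neither is part of this page):

```python
# Every task is cast as text in -> text out; a prefix names the task.
# Assumes: pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same weights handle different tasks, switched only by the prefix.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has "
    "emerged as a powerful technique in natural language processing.",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```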


Citations
Proceedings Article
03 May 2021
TL;DR: This paper showed that models with limited capacity primarily learn to exploit biases in the dataset, and leveraged the errors of such limited-capacity models to train a more robust model in a product of experts, bypassing the need to hand-craft a biased model.
Abstract: State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended underlying task. Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We consider cases where the bias issues may not be explicitly identified, and show a method for training models that learn to ignore these problematic correlations. Our approach relies on the observation that models with limited capacity primarily learn to exploit biases in the dataset. We can leverage the errors of such limited capacity models to train a more robust model in a product of experts, thus bypassing the need to hand-craft a biased model. We show the effectiveness of this method to retain improvements in out-of-distribution settings even if no particular bias is targeted by the biased model.

21 citations
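
The training recipe behind this abstract can be made concrete: the weak, limited-capacity model's log-probabilities are added to the main model's in a product of experts, so examples the weak model already classifies correctly (by exploiting a bias) contribute little gradient to the main model. A minimal sketch of that loss, with illustrative tensor shapes and no claim to match the authors' code:

```python
import torch.nn.functional as F

def product_of_experts_loss(main_logits, weak_logits, labels):
    """Cross-entropy over the renormalized product of two experts.

    main_logits, weak_logits: [batch, num_classes]; labels: [batch].
    The weak (biased) model is frozen, hence the detach().
    """
    combined = (F.log_softmax(main_logits, dim=-1)
                + F.log_softmax(weak_logits.detach(), dim=-1))
    return F.nll_loss(F.log_softmax(combined, dim=-1), labels)
```

At test time only the main model is used; the weak expert exists solely to reshape the training signal.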

Posted Content
TL;DR: This paper surveyed two decades of tremendous development in Biomedical Question Answering (BQA), classifying existing work into five distinctive approaches: classic, information retrieval, machine reading comprehension, knowledge base, and question entailment.
Abstract: Automatic Question Answering (QA) has been successfully applied in various domains such as search engines and chatbots. Biomedical QA (BQA), as an emerging QA task, enables innovative applications to effectively perceive, access and understand complex biomedical knowledge. There have been tremendous developments of BQA in the past two decades, which we classify into 5 distinctive approaches: classic, information retrieval, machine reading comprehension, knowledge base and question entailment approaches. In this survey, we introduce available datasets and representative methods of each BQA approach in detail. Despite the developments, BQA systems are still immature and rarely used in real-life settings. We identify and characterize several key challenges in BQA that might lead to this issue, and discuss some potential future directions to explore.

21 citations

Posted Content · DOI
25 May 2021 · bioRxiv
TL;DR: This article proposed ProteinBERT, a deep language model specifically designed for proteins, whose pretraining combines masked language modeling with a novel Gene Ontology (GO) annotation prediction task.
Abstract: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme consists of masked language modeling combined with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to very large sequence lengths. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.

21 citations
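
The two-part pretraining objective described in the abstract pairs a local, per-residue masked-token loss with a global, per-protein multi-label GO prediction loss. A minimal sketch of such an objective, where every module name and dimension is an illustrative assumption (the real architecture is in the repository linked above):

```python
import torch.nn as nn
import torch.nn.functional as F

class DualTaskHead(nn.Module):
    def __init__(self, hidden=128, vocab=26, n_go_terms=8000):
        super().__init__()
        self.token_head = nn.Linear(hidden, vocab)    # local: per residue
        self.go_head = nn.Linear(hidden, n_go_terms)  # global: per protein

    def forward(self, local_repr, global_repr, masked_targets, go_targets):
        # local_repr: [batch, seq_len, hidden]; global_repr: [batch, hidden]
        mlm_loss = F.cross_entropy(
            self.token_head(local_repr).transpose(1, 2),  # [B, vocab, seq]
            masked_targets, ignore_index=-100)  # -100 marks unmasked positions
        go_loss = F.binary_cross_entropy_with_logits(
            self.go_head(global_repr), go_targets.float())
        return mlm_loss + go_loss
```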

Posted Content
TL;DR: This paper investigates multiple ways to automatically generate rationales using pre-trained language models, neural knowledge models, and distant supervision from related tasks, and trains generative models capable of composing explanatory rationales for unseen instances.
Abstract: The black-box nature of neural models has motivated a line of research that aims to generate natural language rationales to explain why a model made certain predictions. Such rationale generation models, to date, have been trained on dataset-specific crowdsourced rationales, but this approach is costly and is not generalizable to new tasks and domains. In this paper, we investigate the extent to which neural models can reason about natural language rationales that explain model predictions, relying only on distant supervision with no additional annotation cost for human-written rationales. We investigate multiple ways to automatically generate rationales using pre-trained language models, neural knowledge models, and distant supervision from related tasks, and train generative models capable of composing explanatory rationales for unseen instances. We demonstrate our approach on the defeasible inference task, a nonmonotonic reasoning task in which an inference may be strengthened or weakened when new information (an update) is introduced. Our model shows promise at generating post-hoc rationales explaining why an inference is more or less likely given the additional information; however, it mostly generates trivial rationales, reflecting the fundamental limitations of neural language models. Conversely, the more realistic setup of jointly predicting the update or its type and generating the rationale is more challenging, suggesting an important future direction.

21 citations
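
One of the rationale sources named in the abstract is an off-the-shelf pre-trained language model. A minimal sketch of that idea for a defeasible inference example, where the prompt template and the choice of GPT-2 are illustrative assumptions rather than the authors' setup:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A defeasible inference instance: an update that weakens the hypothesis.
prompt = (
    "Premise: A man is eating at a restaurant.\n"
    "Hypothesis: He is having dinner.\n"
    "Update: It is 9 a.m.\n"
    "The update weakens the hypothesis because"
)
result = generator(prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```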

Proceedings Article · DOI
Yi Liao, Xin Jiang, Qun Liu
24 Apr 2020
TL;DR: This article proposed a probabilistic masking scheme for the masked language model, called the probabilistically masked language model (PMLM), and showed that its uniform-prior variant (u-PMLM) is equivalent to an autoregressive permutated language model.
Abstract: Masked language models and autoregressive language models are two types of language models. While pretrained masked language models such as BERT dominate natural language understanding (NLU) tasks, autoregressive language models such as GPT are especially capable in natural language generation (NLG). In this paper, we propose a probabilistic masking scheme for the masked language model, which we call the probabilistically masked language model (PMLM). We implement a specific PMLM with a uniform prior distribution on the masking ratio, named u-PMLM. We prove that u-PMLM is equivalent to an autoregressive permutated language model. One main advantage of the model is that it supports text generation in arbitrary order with surprisingly good quality, which could potentially enable new applications beyond traditional unidirectional generation. In addition, the pretrained u-PMLM outperforms BERT on a set of downstream NLU tasks.

20 citations
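
The key move in u-PMLM is that the masking ratio itself is random: each training sequence draws a ratio from a uniform prior and masks that fraction of its tokens, instead of BERT's fixed 15%. A minimal sketch of that masking step (the mask token id and the -100 ignore-label convention are assumptions borrowed from common masked-LM tooling):

```python
import torch

def u_pmlm_mask(input_ids, mask_token_id):
    """Mask a Uniform(0, 1)-distributed fraction of each sequence."""
    batch, seq_len = input_ids.shape
    ratios = torch.rand(batch, 1)               # r ~ U(0, 1), one per sequence
    mask = torch.rand(batch, seq_len) < ratios  # ~r * seq_len positions masked
    labels = torch.where(mask, input_ids,
                         torch.full_like(input_ids, -100))
    masked = torch.where(mask, torch.full_like(input_ids, mask_token_id),
                         input_ids)
    return masked, labels  # train to predict labels at masked positions
```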

Trending Questions
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.