Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020 - Journal of Machine Learning Research - Vol. 21, Iss. 140, pp. 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
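As a rough illustration of what "text-to-text format" means in practice, the sketch below casts examples from three different tasks into plain (input string, target string) pairs, so that a single sequence-to-sequence model could in principle train on all of them. The function name, task names, dictionary keys, and exact prefix strings are assumptions made for this example and are not taken from the abstract.

```python
from typing import Dict, Tuple


def to_text_to_text(task: str, example: Dict[str, str]) -> Tuple[str, str]:
    """Cast one labeled example into an (input text, target text) pair.

    Task names, dictionary keys, and prefix strings are hypothetical.
    """
    if task == "translation_en_de":
        return "translate English to German: " + example["en"], example["de"]
    if task == "summarization":
        return "summarize: " + example["document"], example["summary"]
    if task == "sentiment":
        # Classification targets are emitted as literal words, not class ids.
        return "sentiment: " + example["sentence"], example["label"]
    raise ValueError(f"unknown task: {task}")


pair = to_text_to_text(
    "sentiment", {"sentence": "A delightful film.", "label": "positive"}
)
print(pair)
```

Because every task shares the same string-in, string-out interface, the model, loss, and decoding procedure can stay the same across tasks; only the text changes.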

Citations

Proceedings Article

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Ruiqi Zhong¹, Dhruba Ghosh, Dan Klein¹, Jacob Steinhardt¹

University of California, Berkeley¹

01 Aug 2021

TL;DR: The authors found that BERT-Large is worse than BERT-Mini on at least 1-4% of instances across MNLI, SST-2, and QQP, even though its overall accuracy is 2-10% higher.

Abstract: Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? Some work suggests larger models have higher out-of-distribution robustness, while other work suggests they have lower accuracy on rare subgroups. To understand these differences, we investigate these models at the level of individual instances. However, one major challenge is that individual predictions are highly sensitive to noise in the randomness in training. We develop statistically rigorous methods to address this, and after accounting for pretraining and finetuning noise, we find that our BERT-Large is worse than BERT-Mini on at least 1-4% of instances across MNLI, SST-2, and QQP, compared to the overall accuracy improvement of 2-10%. We also find that finetuning noise increases with model size and that instance-level accuracy has momentum: improvement from BERT-Mini to BERT-Medium correlates with improvement from BERT-Medium to BERT-Large. Our findings suggest that instance-level predictions provide a rich source of information; we therefore recommend that researchers supplement model weights with model predictions.
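To make the idea of an instance-level comparison concrete, here is a minimal sketch that averages each instance's correctness over several training seeds and then counts the instances on which the larger model scores lower. It is a naive point estimate, not the statistically rigorous method the abstract refers to; with few seeds it will tend to overstate the fraction because of the training noise the authors describe. The function names and the data layout are assumptions made for this example.

```python
from typing import List


def per_instance_accuracy(correct: List[List[bool]]) -> List[float]:
    """correct[s][i] is True when the run with seed s got instance i right."""
    n_seeds = len(correct)
    n_instances = len(correct[0])
    return [sum(run[i] for run in correct) / n_seeds for i in range(n_instances)]


def fraction_where_larger_is_worse(
    small: List[List[bool]], large: List[List[bool]]
) -> float:
    """Share of instances whose seed-averaged accuracy drops with model size."""
    small_acc = per_instance_accuracy(small)
    large_acc = per_instance_accuracy(large)
    worse = sum(1 for s, l in zip(small_acc, large_acc) if l < s)
    return worse / len(small_acc)


# Toy data (hypothetical): two seeds per model, three instances.
small_runs = [[True, True, False], [True, True, False]]
large_runs = [[True, False, True], [True, True, True]]
print(fraction_where_larger_is_worse(small_runs, large_runs))
```

In the toy data the larger model has higher average accuracy, yet it is less reliable on the second instance, which is the kind of pattern that aggregate accuracy hides.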

4 citations

Proceedings Article

Designing Templates for Eliciting Commonsense Knowledge from Pretrained Sequence-to-Sequence Models

Jheng-Hong Yang¹, Sheng-Chieh Lin², Rodrigo Nogueira³, Ming-Feng Tsai⁴, Chuan-Ju Wang², Jimmy Lin¹

University of Waterloo¹, Academia Sinica², State University of Campinas³, Anschutz Medical Campus⁴

01 Dec 2020

TL;DR: This work explores a template-based approach to extract implicit knowledge for commonsense reasoning on multiple-choice question answering tasks using the text-to-text transfer transformer (T5) model, and initiates further research to find generic natural language templates that can effectively leverage stored knowledge in pretrained models.

Abstract: While internalized “implicit knowledge” in pretrained transformers has led to fruitful progress in many natural language understanding tasks, how to most effectively elicit such knowledge remains an open question. Based on the text-to-text transfer transformer (T5) model, this work explores a template-based approach to extract implicit knowledge for commonsense reasoning on multiple-choice (MC) question answering tasks. Experiments on three representative MC datasets show the surprisingly good performance of our simple template, coupled with a logit normalization technique for disambiguation. Furthermore, we verify that our proposed template can be easily extended to other MC tasks with contexts such as supporting facts in open-book question answering settings. Starting from the MC task, this work initiates further research to find generic natural language templates that can effectively leverage stored knowledge in pretrained models.

4 citations

Posted Content

Finetuning Pretrained Transformers into Variational Autoencoders

Seongmin Park, Jihwa Lee

05 Aug 2021 - arXiv: Computation and Language

TL;DR: This article proposed a simple two-phase training scheme to convert a sequence-to-sequence Transformer into a VAE with just finetuning, which is competitive with massively pretrained Transformer-based VAEs in some internal metrics while falling short on others.

Abstract: Text variational autoencoders (VAEs) are notorious for posterior collapse, a phenomenon where the model's decoder learns to ignore signals from the encoder. Because posterior collapse is known to be exacerbated by expressive decoders, Transformers have seen limited adoption as components of text VAEs. Existing studies that incorporate Transformers into text VAEs (Li et al., 2020; Fang et al., 2021) mitigate posterior collapse using massive pretraining, a technique unavailable to most of the research community without extensive computing resources. We present a simple two-phase training scheme to convert a sequence-to-sequence Transformer into a VAE with just finetuning. The resulting language model is competitive with massively pretrained Transformer-based VAEs in some internal metrics while falling short on others. To facilitate training, we comprehensively explore the impact of common posterior collapse alleviation techniques in the literature. We release our code for reproducibility.

4 citations

Posted Content

FeTaQA: Free-form Table Question Answering

Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryscinski, Nick Schoelkopf, Riley Kong, Xiangru Tang, Murori Mutuma, Ben Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, Dragomir R. Radev

01 Apr 2021 - arXiv: Computation and Language

TL;DR: FeTaQA is a table question answering dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs, in which answers are human-generated explanations involving entities and their high-level relations.

Abstract: Existing table question answering datasets contain abundant factual questions that primarily evaluate the query and schema comprehension capability of a system, but they fail to include questions that require complex reasoning and integration of information due to the constraint of the associated short-form answers. To address these issues and to demonstrate the full challenge of table question answering, we introduce FeTaQA, a new dataset with 10K Wikipedia-based {table, question, free-form answer, supporting table cells} pairs. FeTaQA yields a more challenging table question answering setting because it requires generating free-form text answers after retrieval, inference, and integration of multiple discontinuous facts from a structured knowledge source. Unlike datasets of generative QA over text in which answers are prevalent with copies of short text spans from the source, answers in our dataset are human-generated explanations involving entities and their high-level relations. We provide two benchmark methods for the proposed task: a pipeline method based on semantic-parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and show that FeTaQA poses a challenge for both methods.

4 citations

Posted Content

Probing Across Time: What Does RoBERTa Know and When?

Leo Z. Liu¹, Yizhong Wang¹, Jungo Kasai¹, Hannaneh Hajishirzi¹, Noah A. Smith¹

University of Washington¹

16 Apr 2021 - arXiv: Computation and Language

TL;DR: Using RoBERTa as a case study, the authors investigated when during pre-training a language model acquires linguistic abstractions, factual and commonsense knowledge, and reasoning abilities, and found that linguistic knowledge is acquired fast, stably, and robustly across domains.

Abstract: Models of language trained on very large corpora have been demonstrated useful for NLP. As fixed artifacts, they have become the object of intense study, with many researchers "probing" the extent to which linguistic abstractions, factual and commonsense knowledge, and reasoning abilities they acquire and readily demonstrate. Building on this line of work, we consider a new question: for types of knowledge a language model learns, when during (pre)training are they acquired? We plot probing performance across iterations, using RoBERTa as a case study. Among our findings: linguistic knowledge is acquired fast, stably, and robustly across domains. Facts and commonsense are slower and more domain-sensitive. Reasoning abilities are, in general, not stably acquired. As new datasets, pretraining protocols, and probes emerge, we believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.

4 citations