Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
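
As a rough illustration of the text-to-text format described in the abstract, the sketch below casts a few tasks as (input text, target text) string pairs. The helper name `to_text_to_text`, the task names, the prefix strings, and the example sentences are all illustrative assumptions, not the released code.

```python
from typing import Tuple


def to_text_to_text(task: str, **fields: str) -> Tuple[str, str]:
    """Cast one labeled example to an (input_text, target_text) pair.

    Hypothetical helper: the prefixes below are illustrative only.
    """
    if task == "translation":
        return f"translate English to German: {fields['source']}", fields["target"]
    if task == "summarization":
        return f"summarize: {fields['document']}", fields["summary"]
    if task == "classification":
        # The class label is emitted as literal text, not as a class index.
        return f"classify sentiment: {fields['sentence']}", fields["label"]
    raise ValueError(f"unknown task: {task}")


pairs = [
    to_text_to_text("translation", source="That is good.", target="Das ist gut."),
    to_text_to_text("summarization", document="A long report ...", summary="A short report."),
    to_text_to_text("classification", sentence="A charming film.", label="positive"),
]

for input_text, target_text in pairs:
    print(f"{input_text!r} -> {target_text!r}")
```

Because every task shares the same string-in, string-out interface, a single model, loss, and decoding procedure can in principle serve all of them.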

Citations

Proceedings Article

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Phillip Rust, Jonas Pfeiffer¹, Ivan Vulić², Sebastian Ruder³, Iryna Gurevych⁴

Technische Universität Darmstadt¹, University of Cambridge², Google³, University of Paderborn⁴

01 Aug 2021

TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to monolingual task performance, and find that while pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in downstream performance.

Abstract: In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.

29 citations

Proceedings Article

Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

01 Jan 2022

TL;DR: This paper uses the generative nature of language models to construct an artificial development set and, based on entropy statistics of the candidate permutations on this set, identifies performant prompts, yielding a 13% relative improvement for GPT-family models across eleven established text classification tasks.

Abstract: When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are “fantastic” and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.
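
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `label_entropy`, `rank_permutations`, and `toy_predict` are hypothetical, the probing inputs are a fixed list rather than text generated by the language model, and the sketch assumes that an ordering whose predicted labels have higher entropy over the probing set is preferable because it is less biased toward a single label.

```python
import math
from collections import Counter
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label) demonstration used in the prompt


def label_entropy(predictions: Sequence[str]) -> float:
    """Shannon entropy (bits) of the empirical predicted-label distribution."""
    counts = Counter(predictions)
    total = len(predictions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_permutations(
    examples: Sequence[Example],
    probe_inputs: Sequence[str],
    predict: Callable[[Sequence[Example], str], str],
) -> List[Tuple[float, Tuple[Example, ...]]]:
    """Score every ordering of the demonstrations; highest entropy first."""
    scored = []
    for order in permutations(examples):
        preds = [predict(order, text) for text in probe_inputs]
        scored.append((label_entropy(preds), order))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def toy_predict(order: Sequence[Example], text: str) -> str:
    """Stand-in for a language model (assumption): an ordering that ends on a
    negative demonstration collapses to always predicting "negative"."""
    if order[-1][1] == "negative":
        return "negative"
    return "positive" if "good" in text else "negative"


examples = [("great film", "positive"), ("dull plot", "negative")]
probes = ["good acting", "bad pacing", "good score", "bad ending"]

ranked = rank_permutations(examples, probes, toy_predict)
best_entropy, best_order = ranked[0]
print(best_entropy, best_order)
```

Enumerating all orderings grows factorially with the number of demonstrations, so a practical version would score only a sample of candidate permutations.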

29 citations

Proceedings Article

Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models

Tongshuang Wu¹, Marco Tulio Ribeiro², Jeffrey Heer¹, Daniel S. Weld³

University of Washington¹, Microsoft², Allen Institute for Artificial Intelligence³

01 Aug 2021

TL;DR: Polyjuice is a general-purpose counterfactual generator that allows for control over perturbation types and locations, trained by finetuning GPT-2 on multiple datasets of paired sentences.

Abstract: While counterfactual examples are useful for analysis and training of NLP models, current generation methods either rely on manual labor to create very few counterfactuals, or only instantiate limited types of perturbations such as paraphrases or word substitutions. We present Polyjuice, a general-purpose counterfactual generator that allows for control over perturbation types and locations, trained by finetuning GPT-2 on multiple datasets of paired sentences. We show that Polyjuice produces diverse sets of realistic counterfactuals, which in turn are useful in various distinct applications: improving training and evaluation on three different tasks (with around 70% less annotation effort than manual generation), augmenting state-of-the-art explanation techniques, and supporting systematic counterfactual error analysis by revealing behaviors easily missed by human experts.

29 citations

Posted Content

Explaining NLP Models via Minimal Contrastive Editing (MiCE)

Alexis Ross¹, Ana Marasović¹, Matthew E. Peters¹

Allen Institute for Artificial Intelligence¹

27 Dec 2020-arXiv: Computation and Language

TL;DR: The authors presented Minimal Contrastive Editing (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case.

Abstract: Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present Minimal Contrastive Editing (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks--binary sentiment classification, topic classification, and multiple-choice question answering--show that MiCE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MiCE edits can be used for two use cases in NLP system development--debugging incorrect model outputs and uncovering dataset artifacts--and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.

29 citations

Proceedings Article

Efficient Automatic Punctuation Restoration Using Bidirectional Transformers with Robust Inference.

Maury Courtland, Adam Faulkner, Gayle McElvain

01 Jul 2020

TL;DR: This work proposes a solution for automatic punctuation that is both cost efficient and easy to train, and modifies the typical framing of the task by predicting punctuation for sequences rather than individual tokens, which makes for more efficient training and inference.

Abstract: Though people rarely speak in complete sentences, punctuation confers many benefits to the readers of transcribed speech. Unfortunately, most ASR systems do not produce punctuated output. To address this, we propose a solution for automatic punctuation that is both cost efficient and easy to train. Our solution benefits from the recent trend in fine-tuning transformer-based language models. We also modify the typical framing of this task by predicting punctuation for sequences rather than individual tokens, which makes for more efficient training and inference. Finally, we find that aggregating predictions across multiple context windows improves accuracy even further. Our best model achieves a new state of the art on benchmark data (TED Talks) with a combined F1 of 83.9, representing a 48.7% relative improvement (15.3 absolute) over the previous state of the art.
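
One way to picture the aggregation across context windows is sketched below. The abstract does not say how predictions are combined, so the plain averaging used here is an assumption, and `CLASSES`, `aggregate_windows`, and `toy_predict_window` are hypothetical names. The toy predictor stands in for a fine-tuned transformer.

```python
from typing import Callable, List, Sequence

# Hypothetical label set: the punctuation mark that follows a token ("" = none).
CLASSES = ["", ",", ".", "?"]

WindowPredictor = Callable[[Sequence[str]], List[List[float]]]


def aggregate_windows(
    tokens: Sequence[str],
    window: int,
    stride: int,
    predict_window: WindowPredictor,
) -> List[str]:
    """Average per-token class probabilities over all overlapping windows."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in 1..window so every token is covered")
    n = len(tokens)
    sums = [[0.0] * len(CLASSES) for _ in range(n)]
    counts = [0] * n
    start = 0
    while True:
        end = min(start + window, n)
        for offset, dist in enumerate(predict_window(tokens[start:end])):
            for k, p in enumerate(dist):
                sums[start + offset][k] += p
            counts[start + offset] += 1
        if end == n:
            break
        start += stride
    labels = []
    for i in range(n):
        avg = [s / counts[i] for s in sums[i]]
        labels.append(CLASSES[avg.index(max(avg))])
    return labels


def toy_predict_window(chunk: Sequence[str]) -> List[List[float]]:
    """Stand-in model (assumption): uninformative on the last token of a
    window, where right-hand context is missing."""
    out = []
    for i, tok in enumerate(chunk):
        if i == len(chunk) - 1:
            out.append([0.25, 0.25, 0.25, 0.25])
        elif tok == "thanks":
            out.append([0.1, 0.1, 0.7, 0.1])
        else:
            out.append([0.7, 0.1, 0.1, 0.1])
    return out


tokens = ["thanks", "see", "you", "soon"]
print(aggregate_windows(tokens, window=2, stride=1, predict_window=toy_predict_window))
```

With overlapping windows, a token that sits at the edge of one window usually appears with fuller context in another, so averaging lets the better-informed prediction dominate.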

29 citations