Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
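As an illustration of the text-to-text framing, the sketch below casts every task as a string-in, string-out call to a pre-trained model. The Hugging Face transformers library and the public "t5-small" checkpoint are assumptions for illustration rather than part of this abstract; the task prefixes follow the conventions described in the paper.

```python
# Minimal sketch of the text-to-text framework: every NLP problem is posed as
# "feed in a string, read out a string", with a textual prefix naming the task.
# Assumes the Hugging Face `transformers` package and the public "t5-small" checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def text_to_text(prompt: str, max_new_tokens: int = 64) -> str:
    """Run one task expressed in the text-to-text format."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Different problems differ only in the prefix and in how the output string is read.
print(text_to_text("translate English to German: The house is wonderful."))
print(text_to_text("summarize: Transfer learning, where a model is first pre-trained on a data-rich task ..."))
print(text_to_text("cola sentence: The course is jumping well."))  # outputs "acceptable"/"unacceptable"
```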


Citations
Proceedings ArticleDOI
01 Jan 2022
TL;DR: Lee-Thorp et al. presented this paper at the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022).
Abstract: James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

34 citations

Proceedings Article
03 May 2021
TL;DR: This article proposed a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER.
Abstract: We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER. Contrary to previous work, our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers, and can be applied to any unstructured text corpus. Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time.

33 citations
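As a rough illustration of the multi-hop retrieval loop described in the entry above, the sketch below augments the query with the passages retrieved so far at each hop. The encode function and the toy in-memory corpus are placeholder assumptions; a real system would use trained dense encoders and an approximate nearest-neighbour index.

```python
# Illustrative sketch of multi-hop dense retrieval over unstructured text:
# at each hop, the query is the question concatenated with the evidence retrieved so far.
# `encode` and the tiny corpus are hypothetical stand-ins, not the paper's trained models.
import zlib
import numpy as np

def encode(text: str) -> np.ndarray:
    """Placeholder dense encoder returning a unit-norm vector."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

corpus = ["passage about topic A ...", "passage about topic B ...", "passage about topic C ..."]
corpus_vecs = np.stack([encode(p) for p in corpus])

def multi_hop_retrieve(question: str, hops: int = 2, k: int = 1) -> list:
    retrieved = []
    query = question
    for _ in range(hops):
        scores = corpus_vecs @ encode(query)            # maximum inner-product search
        top = [corpus[i] for i in np.argsort(-scores)[:k]]
        retrieved.extend(top)
        query = question + " " + " ".join(retrieved)    # augment the query for the next hop
    return retrieved

print(multi_hop_retrieve("Which award did the director of the film win?"))
```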

Posted Content
TL;DR: The authors adapt Pattern-Exploiting Training (PET) for finetuning generative language models on text generation tasks, and show that their proposed variant of PET gives consistent improvements over a strong baseline in few-shot settings.
Abstract: Providing pretrained language models with simple task descriptions or prompts in natural language yields impressive few-shot results for a wide range of text classification tasks when combined with gradient-based learning from examples. In this paper, we show that the underlying idea can also be applied to text generation tasks: We adapt Pattern-Exploiting Training (PET), a recently proposed few-shot approach, for finetuning generative language models on text generation tasks. On several text summarization and headline generation datasets, our proposed variant of PET gives consistent improvements over a strong baseline in few-shot settings.

33 citations
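A minimal sketch of the underlying idea from the entry above, wrapping a natural-language task description (a "pattern") around each input before feeding it to a generative model, is shown below. The pattern wording and the checkpoint are illustrative assumptions; the paper's variant of PET additionally fine-tunes the model on a handful of labeled examples.

```python
# Sketch of pattern-based few-shot generation: a textual pattern carrying the task
# description is wrapped around the input of a generative model. The pattern wording
# and the "t5-small" checkpoint are illustrative assumptions, not the paper's exact setup.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def apply_pattern(document: str) -> str:
    """Embed the task description in the input, in the spirit of Pattern-Exploiting Training."""
    return f"Summarize the following article in one sentence: {document}"

def generate_summary(document: str) -> str:
    inputs = tokenizer(apply_pattern(document), return_tensors="pt", truncation=True)
    ids = model.generate(**inputs, max_new_tokens=48, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# In the few-shot setting, the same pattern is also applied to the small set of labeled
# (document, summary) pairs used for fine-tuning before inference.
print(generate_summary("The city council approved the new transit plan on Tuesday, ..."))
```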

Posted Content
TL;DR: This article proposed a decoding algorithm that reduces the probability of a model producing problematic text given only a textual description of the undesired behavior; the algorithm does not rely on manually curated word lists, training data, or changes to the model's parameters.
Abstract: When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models often require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we investigate whether pretrained language models at least know when they exhibit some undesirable bias or produce toxic content. Based on our findings, we propose a decoding algorithm that reduces the probability of a model producing problematic text given only a textual description of the undesired behavior. This algorithm does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. While our approach does by no means eliminate the issue of language models generating biased text, we believe it to be an important step in this direction.

33 citations
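The decoding idea from the entry above can be sketched over plain probability vectors: compare the model's next-token distribution for the original input with the distribution obtained after prefixing a textual description of the undesired behavior, and scale down tokens that the prefix makes more likely. The exponential penalty below is an illustrative choice in the spirit of the approach, not necessarily the paper's exact formula.

```python
# Sketch of description-guided debiased decoding: tokens whose probability increases
# when the input is prefixed with a description of the undesired behavior
# (e.g. "The following text is rude:") are scaled down before sampling.
import numpy as np

def debias_step(p_plain: np.ndarray, p_biased: np.ndarray, decay: float = 50.0) -> np.ndarray:
    """Rescale one next-token distribution.

    p_plain  -- probabilities for the original input
    p_biased -- probabilities for the input prefixed with the behavior description
    """
    delta = p_biased - p_plain
    scale = np.where(delta > 0, np.exp(-decay * delta), 1.0)  # penalize promoted tokens
    p_new = p_plain * scale
    return p_new / p_new.sum()                                # renormalize

# Toy 4-token vocabulary: token 2 is strongly promoted by the behavior description,
# so its probability drops in the debiased distribution.
p_plain = np.array([0.40, 0.30, 0.20, 0.10])
p_biased = np.array([0.20, 0.20, 0.55, 0.05])
print(debias_step(p_plain, p_biased))
```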

Proceedings ArticleDOI
Wenhui Wang1, Hangbo Bao1, Shaohan Huang1, Li Dong1, Furu Wei1 
01 Aug 2021
TL;DR: Wang et al. defined multi-head self-attention relations as the scaled dot-product between pairs of query, key, and value vectors within each self-attention module and employed this relational knowledge to train the student model.
Abstract: We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by only using self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as scaled dot-product between the pairs of query, key, and value vectors within each self-attention module. Then we employ the above relational knowledge to train the student model. Besides its simplicity and unified principle, more favorably, there is no restriction in terms of the number of student's attention heads, while most previous work has to guarantee the same head number between teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by Transformer. In addition, we thoroughly examine the layer selection strategy for teacher models, rather than just relying on the last layer as in MiniLM. We conduct extensive experiments on compressing both monolingual and multilingual pretrained models. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT, RoBERTa and XLM-R) outperform the state-of-the-art.

33 citations
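The central quantity in the entry above, a self-attention relation, can be sketched as the row-wise softmax of the scaled dot-product of the query (or key, or value) matrix with itself; teacher and student relations are then aligned with a KL-divergence loss. The toy tensors and shapes below are illustrative assumptions; because the loss compares seq_len x seq_len matrices, teacher and student need not share head dimensions or head counts.

```python
# Sketch of self-attention relation distillation: query-query, key-key and value-value
# relation matrices of a teacher layer are matched by the student with a KL loss.
# All tensors below are random toy data; shapes are illustrative.
import numpy as np

def relation(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax of the scaled dot-product of a (seq_len, dim) matrix with itself."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def relation_distillation_loss(teacher: dict, student: dict) -> float:
    """teacher/student map 'q', 'k', 'v' to (seq_len, dim) matrices for one relation head.

    The relations are seq_len x seq_len, so teacher and student dims (and head counts)
    do not have to match -- the property highlighted in the abstract above."""
    return sum(kl(relation(teacher[n]), relation(student[n])) for n in ("q", "k", "v"))

rng = np.random.default_rng(0)
teacher = {n: rng.standard_normal((8, 64)) for n in ("q", "k", "v")}
student = {n: rng.standard_normal((8, 32)) for n in ("q", "k", "v")}
print(relation_distillation_loss(teacher, student))
```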

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not explicitly discuss the limitations of transfer learning with a unified text-to-text transformer.