Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
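As an illustration of the text-to-text framing, the sketch below casts every task as a string-in, string-out call to a pre-trained model. The Hugging Face transformers library and the public "t5-small" checkpoint are assumptions for illustration rather than part of this abstract; the task prefixes follow the conventions described in the paper.

```python
# Minimal sketch of the text-to-text framework: every NLP problem is posed as
# "feed in a string, read out a string", with a textual prefix naming the task.
# Assumes the Hugging Face `transformers` package and the public "t5-small" checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def text_to_text(prompt: str, max_new_tokens: int = 64) -> str:
    """Run one task expressed in the text-to-text format."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Different problems differ only in the prefix and in how the output string is read.
print(text_to_text("translate English to German: The house is wonderful."))
print(text_to_text("summarize: Transfer learning, where a model is first pre-trained on a data-rich task ..."))
print(text_to_text("cola sentence: The course is jumping well."))  # outputs "acceptable"/"unacceptable"
```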


Citations
Proceedings ArticleDOI
01 Jan 2022
TL;DR: Lee-Thorp et al. presented this paper at the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022).
Abstract: James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

34 citations

Proceedings Article
03 May 2021
TL;DR: This article proposed a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER.
Abstract: We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER. Contrary to previous work, our method does not require access to any corpus-specific information, such as inter-document hyperlinks or human-annotated entity markers, and can be applied to any unstructured text corpus. Our system also yields a much better efficiency-accuracy trade-off, matching the best published accuracy on HotpotQA while being 10 times faster at inference time.

33 citations
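As a rough illustration of the multi-hop retrieval loop described in the entry above, the sketch below augments the query with the passages retrieved so far at each hop. The encode function and the toy in-memory corpus are placeholder assumptions; a real system would use trained dense encoders and an approximate nearest-neighbour index.

```python
# Illustrative sketch of multi-hop dense retrieval over unstructured text:
# at each hop, the query is the question concatenated with the evidence retrieved so far.
# `encode` and the tiny corpus are hypothetical stand-ins, not the paper's trained models.
import zlib
import numpy as np

def encode(text: str) -> np.ndarray:
    """Placeholder dense encoder returning a unit-norm vector."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

corpus = ["passage about topic A ...", "passage about topic B ...", "passage about topic C ..."]
corpus_vecs = np.stack([encode(p) for p in corpus])

def multi_hop_retrieve(question: str, hops: int = 2, k: int = 1) -> list:
    retrieved = []
    query = question
    for _ in range(hops):
        scores = corpus_vecs @ encode(query)            # maximum inner-product search
        top = [corpus[i] for i in np.argsort(-scores)[:k]]
        retrieved.extend(top)
        query = question + " " + " ".join(retrieved)    # augment the query for the next hop
    return retrieved

print(multi_hop_retrieve("Which award did the director of the film win?"))
```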

Posted Content
TL;DR: The authors adapt Pattern-Exploiting Training (PET) for finetuning generative language models on text generation tasks, and show that their proposed variant of PET gives consistent improvements over a strong baseline in few-shot settings.
Abstract: Providing pretrained language models with simple task descriptions or prompts in natural language yields impressive few-shot results for a wide range of text classification tasks when combined with gradient-based learning from examples. In this paper, we show that the underlying idea can also be applied to text generation tasks: We adapt Pattern-Exploiting Training (PET), a recently proposed few-shot approach, for finetuning generative language models on text generation tasks. On several text summarization and headline generation datasets, our proposed variant of PET gives consistent improvements over a strong baseline in few-shot settings.

33 citations
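A minimal sketch of the underlying idea from the entry above, wrapping a natural-language task description (a "pattern") around each input before feeding it to a generative model, is shown below. The pattern wording and the checkpoint are illustrative assumptions; the paper's variant of PET additionally fine-tunes the model on a handful of labeled examples.

```python
# Sketch of pattern-based few-shot generation: a textual pattern carrying the task
# description is wrapped around the input of a generative model. The pattern wording
# and the "t5-small" checkpoint are illustrative assumptions, not the paper's exact setup.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def apply_pattern(document: str) -> str:
    """Embed the task description in the input, in the spirit of Pattern-Exploiting Training."""
    return f"Summarize the following article in one sentence: {document}"

def generate_summary(document: str) -> str:
    inputs = tokenizer(apply_pattern(document), return_tensors="pt", truncation=True)
    ids = model.generate(**inputs, max_new_tokens=48, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# In the few-shot setting, the same pattern is also applied to the small set of labeled
# (document, summary) pairs used for fine-tuning before inference.
print(generate_summary("The city council approved the new transit plan on Tuesday, ..."))
```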

Posted Content
TL;DR: This article proposed a decoding algorithm that reduces the probability of a model producing problematic text given only a textual description of the undesired behavior; the algorithm does not rely on manually curated word lists, training data, or changes to the model's parameters.
Abstract: When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models often require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we investigate whether pretrained language models at least know when they exhibit some undesirable bias or produce toxic content. Based on our findings, we propose a decoding algorithm that reduces the probability of a model producing problematic text given only a textual description of the undesired behavior. This algorithm does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. While our approach does by no means eliminate the issue of language models generating biased text, we believe it to be an important step in this direction.

33 citations
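The decoding idea from the entry above can be sketched over plain probability vectors: compare the model's next-token distribution for the original input with the distribution obtained after prefixing a textual description of the undesired behavior, and scale down tokens that the prefix makes more likely. The exponential penalty below is an illustrative choice in the spirit of the approach, not necessarily the paper's exact formula.

```python
# Sketch of description-guided debiased decoding: tokens whose probability increases
# when the input is prefixed with a description of the undesired behavior
# (e.g. "The following text is rude:") are scaled down before sampling.
import numpy as np

def debias_step(p_plain: np.ndarray, p_biased: np.ndarray, decay: float = 50.0) -> np.ndarray:
    """Rescale one next-token distribution.

    p_plain  -- probabilities for the original input
    p_biased -- probabilities for the input prefixed with the behavior description
    """
    delta = p_biased - p_plain
    scale = np.where(delta > 0, np.exp(-decay * delta), 1.0)  # penalize promoted tokens
    p_new = p_plain * scale
    return p_new / p_new.sum()                                # renormalize

# Toy 4-token vocabulary: token 2 is strongly promoted by the behavior description,
# so its probability drops in the debiased distribution.
p_plain = np.array([0.40, 0.30, 0.20, 0.10])
p_biased = np.array([0.20, 0.20, 0.55, 0.05])
print(debias_step(p_plain, p_biased))
```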

Proceedings ArticleDOI
Wenhui Wang1, Hangbo Bao1, Shaohan Huang1, Li Dong1, Furu Wei1 
01 Aug 2021
TL;DR: Wang et al. defined multi-head self-attention relations as the scaled dot-product between pairs of query, key, and value vectors within each self-attention module and employed this relational knowledge to train the student model.
Abstract: We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by only using self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as scaled dot-product between the pairs of query, key, and value vectors within each self-attention module. Then we employ the above relational knowledge to train the student model. Besides its simplicity and unified principle, more favorably, there is no restriction in terms of the number of student's attention heads, while most previous work has to guarantee the same head number between teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by Transformer. In addition, we thoroughly examine the layer selection strategy for teacher models, rather than just relying on the last layer as in MiniLM. We conduct extensive experiments on compressing both monolingual and multilingual pretrained models. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT, RoBERTa and XLM-R) outperform the state-of-the-art.

33 citations
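The central quantity in the entry above, a self-attention relation, can be sketched as the row-wise softmax of the scaled dot-product of the query (or key, or value) matrix with itself; teacher and student relations are then aligned with a KL-divergence loss. The toy tensors and shapes below are illustrative assumptions; because the loss compares seq_len x seq_len matrices, teacher and student need not share head dimensions or head counts.

```python
# Sketch of self-attention relation distillation: query-query, key-key and value-value
# relation matrices of a teacher layer are matched by the student with a KL loss.
# All tensors below are random toy data; shapes are illustrative.
import numpy as np

def relation(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax of the scaled dot-product of a (seq_len, dim) matrix with itself."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def relation_distillation_loss(teacher: dict, student: dict) -> float:
    """teacher/student map 'q', 'k', 'v' to (seq_len, dim) matrices for one relation head.

    The relations are seq_len x seq_len, so teacher and student dims (and head counts)
    do not have to match -- the property highlighted in the abstract above."""
    return sum(kl(relation(teacher[n]), relation(student[n])) for n in ("q", "k", "v"))

rng = np.random.default_rng(0)
teacher = {n: rng.standard_normal((8, 64)) for n in ("q", "k", "v")}
student = {n: rng.standard_normal((8, 32)) for n in ("q", "k", "v")}
print(relation_distillation_loss(teacher, student))
```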

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not explicitly discuss the limitations of transfer learning with a unified text-to-text transformer.