Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills

[...]

Ori Yoran¹, Alon Talmor¹, Jonathan Berant¹•Institutions (1)

Tel Aviv University¹

15 Jul 2021-arXiv: Computation and Language

TL;DR: The authors propose to leverage semi-structured tables, and automatically generate at scale question-paragraph pairs, where answering the question requires reasoning over multiple facts in the paragraph, such as number comparison, conjunction, and fact composition.

...read moreread less

Abstract: Models pre-trained with a language modeling objective possess ample world knowledge and language skills, but are known to struggle in tasks that require reasoning. In this work, we propose to leverage semi-structured tables, and automatically generate at scale question-paragraph pairs, where answering the question requires reasoning over multiple facts in the paragraph. We add a pre-training step over this synthetic data, which includes examples that require 16 different reasoning skills such as number comparison, conjunction, and fact composition. To improve data efficiency, we propose sampling strategies that focus training on reasoning skills the model is currently lacking. We evaluate our approach on three reading comprehension datasets that are focused on reasoning, and show that our model, PReasM, substantially outperforms T5, a popular pre-trained encoder-decoder model. Moreover, sampling examples based on current model errors leads to faster training and higher overall performance.

...read moreread less

6 citations

Proceedings Article•DOI•

Promoting Graph Awareness in Linearized Graph-to-Text Generation

[...]

Alexander Miserlis Hoyle¹, Ana Marasovi, Noah A. Smith²•Institutions (2)

University of Maryland, College Park¹, University of Washington²

01 Aug 2021

6 citations

Posted Content•

Reference Knowledgeable Network for Machine Reading Comprehension.

[...]

Yilin Zhao, Zhuosheng Zhang, Hai Zhao

07 Dec 2020-arXiv: Computation and Language

TL;DR: A novel reference-based knowledge enhancement model called RekNet, which simulates human reading strategies to refine critical information from the passage and quote explicit knowledge in necessity, obtaining consistent and remarkable performance improvement with observable statistical significance level over strong baselines.

...read moreread less

Abstract: Multi-choice Machine Reading Comprehension (MRC) as a challenge requires model to select the most appropriate answer from a set of candidates given passage and question. Most of the existing researches focus on the modeling of the task datasets without explicitly referring to external fine-grained knowledge sources, which is supposed to greatly make up the deficiency of the given passage. Thus we propose a novel reference-based knowledge enhancement model called Reference Knowledgeable Network (RekNet), which refines critical information from the passage and quote explicit knowledge in necessity. In detail, RekNet refines fine-grained critical information and defines it as Reference Span, then quotes explicit knowledge quadruples by the co-occurrence information of Reference Span and candidates. The proposed RekNet is evaluated on three multi-choice MRC benchmarks: RACE, DREAM and Cosmos QA, which shows consistent and remarkable performance improvement with observable statistical significance level over strong baselines.

...read moreread less

6 citations

Proceedings Article•DOI•

Robust Transfer Learning with Pretrained Language Models through Adapters

[...]

Wenjuan Han¹, Bo Pang², Ying Nian Wu²•Institutions (2)

ShanghaiTech University¹, University of California, Los Angeles²

01 Aug 2021

TL;DR: This paper proposed a simple yet effective adapter-based approach to mitigate adversarial attack by inserting small bottleneck layers (i.e., adapter) within each layer of a pretrained model, then fix the pretrained layers and train the adapter layers on the downstream task data.

...read moreread less

Abstract: Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific pretraining is often not robust. In particular, the performance considerably varies as the random seed changes or the number of pretraining and/or fine-tuning iterations varies, and the fine-tuned model is vulnerable to adversarial attack. We propose a simple yet effective adapter-based approach to mitigate these issues. Specifically, we insert small bottleneck layers (i.e., adapter) within each layer of a pretrained model, then fix the pretrained layers and train the adapter layers on the downstream task data, with (1) task-specific unsupervised pretraining and then (2) task-specific supervised training (e.g., classification, sequence labeling). Our experiments demonstrate that such a training scheme leads to improved stability and adversarial robustness in transfer learning to various downstream tasks.

...read moreread less

6 citations

Posted Content•

Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

[...]

Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, Yi-Hsuan Yang¹•Institutions (1)

Academia Sinica¹

07 Jan 2021-arXiv: Sound

TL;DR: In this paper, a new Transformer decoder architecture was proposed that uses different feed-forward heads to model tokens of different types, such as note types and metric types, for music generation.

...read moreread less

Abstract: To apply neural sequence models such as the Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types. And, we propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length of the token sequences. We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs. And, we employ it to learn to compose expressive Pop piano music of full-song length (involving up to 10K individual tokens per song), both conditionally and unconditionally. Our experiment shows that, compared to state-of-the-art models, the proposed model converges 5--10 times faster at training (i.e., within a day on a single GPU with 11 GB memory), and with comparable quality in the generated music.

...read moreread less

6 citations