Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
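
As an illustration of the text-to-text format described in the abstract, the sketch below casts three different tasks as plain string-to-string pairs. The task prefixes, field names, and the helper `to_text_to_text` are assumptions chosen for this example and are not taken from the listing; the point is only that translation, summarization, and classification all reduce to the same kind of input/target pair.

```python
# Minimal sketch: cast heterogeneous NLP tasks into (input text, target text) pairs.
# The prefixes and dictionary keys are illustrative assumptions.

def to_text_to_text(task, example):
    """Return an (input_text, target_text) pair of plain strings."""
    if task == "translation":
        return ("translate English to German: " + example["source"], example["target"])
    if task == "summarization":
        return ("summarize: " + example["document"], example["summary"])
    if task == "classification":
        # The label is emitted as a word, not as a class index.
        return ("classify sentiment: " + example["sentence"], example["label"])
    raise ValueError(f"unknown task: {task}")


pairs = [
    to_text_to_text("translation", {"source": "That is good.", "target": "Das ist gut."}),
    to_text_to_text("summarization", {"document": "A long news story.", "summary": "A short summary."}),
    to_text_to_text("classification", {"sentence": "I loved this film.", "label": "positive"}),
]

for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
```

Because every task shares this interface, a single model, loss, and decoding procedure can in principle be reused across all of them.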

Citations

Proceedings Article

Training Question Answering Models From Synthetic Data

Raul Puri¹, Ryan Spring², Mohammad Shoeybi¹, Md. Mostofa Ali Patwary³, Bryan Catanzaro¹

Nvidia¹, Rice University², University of Bergen³

22 Feb 2020

TL;DR: The authors synthesize questions and answers from a synthetic text corpus generated by an 8.3 billion parameter GPT-2 model and achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set.

Abstract: Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic text corpus generated by an 8.3 billion parameter GPT-2 model and achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain on EM score compared to prior work using synthetic data.

48 citations

Proceedings Article

Improving AMR Parsing with Sequence-to-Sequence Pre-training

Dongqin Xu, Junhui Li¹, Muhua Zhu², Min Zhang¹, Guodong Zhou³

Soochow University (Suzhou)¹, Alibaba Group², The Chinese University of Hong Kong³

01 Nov 2020

TL;DR: This paper proposes a seq2seq pre-training approach to build pre-trained models in both a single and a joint way on three relevant tasks, i.e., machine translation, syntactic parsing, and AMR parsing itself, and extends the vanilla fine-tuning method to a multi-task learning fine-tuning method that optimizes for the performance of AMR parsing while endeavoring to preserve the response of pre-trained models.

Abstract: In the literature, the research on abstract meaning representation (AMR) parsing is much restricted by the size of human-curated datasets, which are critical to building an AMR parser with good performance. To alleviate such data size restriction, pre-trained models have been drawing more and more attention in AMR parsing. However, previous pre-trained models, like BERT, are implemented for general purposes and may not work as expected for the specific task of AMR parsing. In this paper, we focus on sequence-to-sequence (seq2seq) AMR parsing and propose a seq2seq pre-training approach to build pre-trained models in both a single and a joint way on three relevant tasks, i.e., machine translation, syntactic parsing, and AMR parsing itself. Moreover, we extend the vanilla fine-tuning method to a multi-task learning fine-tuning method that optimizes for the performance of AMR parsing while endeavoring to preserve the response of pre-trained models. Extensive experimental results on two English benchmark datasets show that both the single and joint pre-trained models significantly improve the performance (e.g., from 71.5 to 80.2 on AMR 2.0), which reaches the state of the art. The result is very encouraging since we achieve this with seq2seq models rather than complex models. We make our code and model available at https://github.com/xdqkid/S2S-AMR-Parser.

47 citations

Proceedings Article

End-to-end synthetic data generation for domain adaptation of question answering systems

Siamak Shakeri¹, Cicero Nogueira dos Santos², Henghui Zhu², Patrick Ng², Feng Nan², Zhiguo Wang², Ramesh Nallapati², Bing Xiang¹

Google¹, Amazon.com²

01 Nov 2020

TL;DR: The model comprises a single transformer-based encoder-decoder network that is trained end-to-end to generate both answers and questions; experimental results indicate significant improvements in the domain adaptation of QA models, outperforming current state-of-the-art methods.

Abstract: We propose an end-to-end approach for synthetic QA data generation. Our model comprises a single transformer-based encoder-decoder network that is trained end-to-end to generate both answers and questions. In a nutshell, we feed a passage to the encoder and ask the decoder to generate a question and an answer token-by-token. The likelihood produced in the generation process is used as a filtering score, which avoids the need for a separate filtering model. Our generator is trained by fine-tuning a pretrained LM using maximum likelihood estimation. The experimental results indicate significant improvements in the domain adaptation of QA models outperforming current state-of-the-art methods.
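
The likelihood-based filtering mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the per-token probabilities are made-up numbers standing in for a generator's output, and both the summed log-likelihood score and the top-k selection rule are assumptions (the abstract does not say whether the score is length-normalized or thresholded).

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum of log-probabilities the generator assigned to each generated token."""
    return sum(math.log(p) for p in token_probs)

def filter_by_likelihood(candidates, keep_top_k):
    """Keep the keep_top_k candidates with the highest generation log-likelihood."""
    scored = [(sequence_log_likelihood(c["token_probs"]), c) for c in candidates]
    # Sort on the score only, so the candidate dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:keep_top_k]]


# Hypothetical synthetic QA pairs with hypothetical per-token probabilities.
candidates = [
    {"question": "Who wrote the report?", "answer": "the committee", "token_probs": [0.9, 0.8, 0.9]},
    {"question": "What report the wrote?", "answer": "in 1999", "token_probs": [0.5, 0.4, 0.6]},
]

kept = filter_by_likelihood(candidates, keep_top_k=1)
print([c["question"] for c in kept])
```

Reusing the generator's own likelihood as the score is what removes the need for a separately trained filtering model.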

47 citations

Posted Content

Pre-trained Summarization Distillation

Sam Shleifer, Alexander M. Rush

24 Oct 2020-arXiv: Computation and Language

TL;DR: This work compares three approaches (knowledge distillation, pseudo-labeling, and "shrink and fine-tune" (SFT)) for distillation of Pegasus and BART, the current and former state of the art, pre-trained summarization models, and finds that SFT outperforms knowledge distillation and pseudo-labeling on the CNN/DailyMail dataset, but under-performs pseudo-labeling on the more abstractive XSUM dataset.

Abstract: Recent state-of-the-art approaches to summarization utilize large pre-trained Transformer models. Distilling these models to smaller student models has become critically important for practical use; however, there are many different distillation methods proposed by the NLP literature. Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation. Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model. A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning. We compare these three approaches for distillation of Pegasus and BART, the current and former state of the art, pre-trained summarization models, and find that SFT outperforms knowledge distillation and pseudo-labeling on the CNN/DailyMail dataset, but under-performs pseudo-labeling on the more abstractive XSUM dataset. PyTorch code and checkpoints of different sizes are available through Hugging Face transformers.
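
The "shrink" half of shrink-and-fine-tune can be sketched as copying a subset of teacher layers into a shallower student before ordinary fine-tuning. The sketch below is only an illustration under stated assumptions: layers are modeled as plain dictionaries rather than real Transformer modules, the helpers `pick_layers_to_copy` and `shrink` are hypothetical names, and the evenly spaced selection rule is an assumption, since the abstract does not say which layers are copied.

```python
import copy

def pick_layers_to_copy(num_teacher_layers, num_student_layers):
    """Choose evenly spaced teacher layer indices, always keeping the first and last."""
    if not 1 <= num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    if num_student_layers == 1:
        return [0]
    last = num_teacher_layers - 1
    # Integer arithmetic keeps the selection deterministic.
    return [(i * last) // (num_student_layers - 1) for i in range(num_student_layers)]

def shrink(teacher_layers, num_student_layers):
    """Build a student by deep-copying the selected teacher layers (no distillation loss)."""
    indices = pick_layers_to_copy(len(teacher_layers), num_student_layers)
    return [copy.deepcopy(teacher_layers[i]) for i in indices]


# Stand-in for a 12-layer teacher; real layers would hold weight tensors.
teacher = [{"name": f"layer_{i}", "weights": [float(i)] * 4} for i in range(12)]
student = shrink(teacher, 3)
print([layer["name"] for layer in student])
```

After this step the student would simply be fine-tuned on the summarization data; the copies are independent of the teacher, so updating the student leaves the teacher unchanged.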

47 citations

Posted Content

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Zhuohan Li¹, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez

University of California, Berkeley¹

26 Feb 2020-arXiv: Computation and Language

TL;DR: It is shown that large models are more robust to compression techniques such as quantization and pruning than small models, and one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

Abstract: Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

47 citations