Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

On Losses for Modern Language Models

[...]

Stéphane Aroca-Ouellette¹, Frank Rudzicz²•Institutions (2)

University of Toronto¹, St. Michael's Hospital²

04 Oct 2020

TL;DR: It is shown that NSP is detrimental to training due to its context splitting and shallow semantic signal, and it is demonstrated that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task.

...read moreread less

Abstract: BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP's effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks -- sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant -- that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERT\textsubscript{Base} on the GLUE benchmark using fewer than a quarter of the training tokens.

...read moreread less

24 citations

Proceedings Article•DOI•

Teaching Machine Comprehension with Compositional Explanations.

[...]

Qinyuan Ye¹, Xiao Huang, Elizabeth Boschee², Xiang Ren³•Institutions (3)

University of Southern California¹, BBN Technologies², Zhejiang University³

01 Nov 2020

TL;DR: This paper extracts structured variables and rules from explanations and compose neural module teachers that annotate instances for training downstream MRC models, and uses learnable neural modules and soft logic to handle linguistic variation and overcome sparse coverage.

...read moreread less

Abstract: Advances in machine reading comprehension (MRC) rely heavily on the collection of large scale human-annotated examples in the form of (question, paragraph, answer) triples. In contrast, humans are typically able to generalize with only a few examples, relying on deeper underlying world knowledge, linguistic sophistication, and/or simply superior deductive powers. In this paper, we focus on “teaching” machines reading comprehension, using a small number of semi-structured explanations that explicitly inform machines why answer spans are correct. We extract structured variables and rules from explanations and compose neural module teachers that annotate instances for training downstream MRC models. We use learnable neural modules and soft logic to handle linguistic variation and overcome sparse coverage; the modules are jointly optimized with the MRC model to improve final performance. On the SQuAD dataset, our proposed method achieves 70.14% F1 score with supervision from 26 explanations, comparable to plain supervised learning using 1,100 labeled instances, yielding a 12x speed up.

...read moreread less

24 citations

Posted Content•

MT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs.

[...]

Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang Xian-Ling Mao, Heyan Huang, Furu Wei - Show less +2 more

18 Apr 2021-arXiv: Computation and Language

TL;DR: This article proposed a partially non-autoregressive objective for text-to-text pre-training and showed that the proposed mT6 improves cross-lingual transferability over mT5.

...read moreread less

Abstract: Multilingual T5 (mT5) pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on eight multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.

...read moreread less

24 citations

Proceedings Article•DOI•

Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning.

[...]

Zhaojiang Lin¹, Andrea Madotto¹, Pascale Fung¹•Institutions (1)

Hong Kong University of Science and Technology¹

01 Apr 2020

TL;DR: This paper proposed an effective way to fine-tune multiple down-stream generation tasks simultaneously using a single, large pretrained model, which can maintain or even improve the performance of fine-tuning the whole model.

...read moreread less

Abstract: Fine-tuning pre-trained generative language models to down-stream language generation tasks has shown promising results. However, this comes with the cost of having a single, large model for each task, which is not ideal in low-memory/power scenarios (e.g., mobile). In this paper, we propose an effective way to fine-tune multiple down-stream generation tasks simultaneously using a single, large pretrained model. The experiments on five diverse language generation tasks show that by just using an additional 2-3% parameters for each task, our model can maintain or even improve the performance of fine-tuning the whole model.

...read moreread less

24 citations

Posted Content•

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.

[...]

Samyam Rajbhandari¹, Olatunji Ruwase¹, Jeff Rasley¹, Shaden Smith¹, Yuxiong He¹ - Show less +1 more•Institutions (1)

Microsoft¹

16 Apr 2021-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: The ZeRO-Infinity project as discussed by the authors leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring.

...read moreread less

Abstract: In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily though system innovations that allow large models to fit in the aggregate GPU memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training models at that scale requires complex combinations of parallelism techniques that puts a big burden on the data scientists to refactor their model. In this paper we present ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale on limited resources without requiring model code refactoring. At the same time it achieves excellent training throughput and scalability, unencumbered by the limited CPU or NVMe bandwidth. ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters. It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible. In terms of training throughput and scalability, it sustains over 25 petaflops on 512 NVIDIA V100 GPUs(40% of peak), while also demonstrating super linear scalability. An open source implementation of ZeRO-Infinity is available through DeepSpeed, a deep learning optimization library that makes distributed training easy, efficient, and effective.

...read moreread less

24 citations