Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
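
As a rough illustration of the text-to-text format the abstract describes, the following Python sketch casts examples from different tasks as plain (input text, target text) pairs. The prefix strings, the `TASK_PREFIXES` table, and the `to_text_to_text` helper are hypothetical names chosen for this sketch; the abstract does not specify them, and this is not the authors' released preprocessing code.

```python
from typing import Dict, Tuple

# Hypothetical task prefixes for illustration only; the exact strings used by
# the released models are not given in the abstract above.
TASK_PREFIXES: Dict[str, str] = {
    "translation": "translate English to German: ",
    "summarization": "summarize: ",
    "classification": "classify sentiment: ",
}


def to_text_to_text(task: str, source: str, target: str) -> Tuple[str, str]:
    """Cast one labeled example as an (input text, target text) pair."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    # Both sides are plain strings, so a single sequence-to-sequence model
    # and a single training loss can be shared across all tasks.
    return TASK_PREFIXES[task] + source.strip(), target.strip()


examples = [
    to_text_to_text("translation", "That is good.", "Das ist gut."),
    to_text_to_text("classification", "A delightful film.", "positive"),
]
for model_input, model_target in examples:
    print(repr(model_input), "->", repr(model_target))
```

Under this framing, even a classification label is emitted as a short string, which is what lets one model cover summarization, question answering, and classification without task-specific output layers.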

Citations

Posted Content

Deep Learning for Text Style Transfer: A Survey

Di Jin¹, Zhijing Jin², Zhiting Hu³, Olga Vechtomova⁴, Rada Mihalcea⁵

Massachusetts Institute of Technology¹, Max Planck Society², University of California, San Diego³, University of Waterloo⁴, University of Michigan⁵

01 Nov 2020-arXiv: Computation and Language

TL;DR: This article presents a systematic survey of research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017, and discusses the task formulation, datasets, evaluation, and methodologies for parallel and non-parallel data.

Abstract: Text style transfer (TST) is an important task in natural language generation (NLG), which aims to control certain attributes in the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing (NLP), and has recently regained significant attention thanks to the promising performance brought by deep neural models. In this paper, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of TST. Our curated paper list is at this https URL

101 citations

Proceedings Article

SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization

Haoming Jiang¹, Pengcheng He², Weizhu Chen², Xiaodong Liu³, Jianfeng Gao², Tuo Zhao¹

Georgia Institute of Technology¹, Microsoft², University of Electronic Science and Technology of China³

01 Apr 2020

TL;DR: This article proposed a new learning framework for robust and efficient fine-tuning for pre-trained models to attain better generalization performance, which contains two important ingredients: 1. Smoothness-inducing regularization, which effectively manages the complexity of the model; 2. Bregman proximal point optimization, which is an instance of trust-region methods and can prevent aggressive updating.

Abstract: Transfer learning has fundamentally changed the landscape of natural language processing (NLP). Many state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely high complexity of pre-trained models, aggressive fine-tuning often causes the fine-tuned model to overfit the training data of downstream tasks and fail to generalize to unseen data. To address such an issue in a principled manner, we propose a new learning framework for robust and efficient fine-tuning for pre-trained models to attain better generalization performance. The proposed framework contains two important ingredients: 1. Smoothness-inducing regularization, which effectively manages the complexity of the model; 2. Bregman proximal point optimization, which is an instance of trust-region methods and can prevent aggressive updating. Our experiments show that the proposed framework achieves new state-of-the-art performance on a number of NLP tasks including GLUE, SNLI, SciTail and ANLI. Moreover, it also outperforms the state-of-the-art T5 model, which is the largest pre-trained model containing 11 billion parameters, on GLUE.

101 citations

Posted Content

mT5: A massively multilingual pre-trained text-to-text transformer

Linting Xue¹, Noah Constant¹, Adam Roberts², Mihir Kale¹, Rami Al-Rfou¹, Aditya Siddhant¹, Aditya Barua¹, Colin Raffel¹

Google¹, University of Chester²

22 Oct 2020-arXiv: Computation and Language

TL;DR: This article proposed a multilingual variant of T5, mT5, which was pre-trained on a new Common Crawl-based dataset covering 101 languages and achieved state-of-the-art performance on many multilingual benchmarks.

Abstract: The recent "Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent "accidental translation" in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.

99 citations

Proceedings Article

Byte Pair Encoding is Suboptimal for Language Model Pretraining

Kaj Bostrom¹, Greg Durrett¹

University of Texas at Austin¹

07 Apr 2020

TL;DR: This paper analyzed differences between byte-pair encoding (BPE) and unigram language model tokenization for pretraining transformer language models and found that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE's greedy construction procedure.

Abstract: The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
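
To make "BPE's greedy construction procedure" concrete, here is a minimal Python sketch of the textbook BPE merge-learning loop: at each step it merges the single most frequent adjacent symbol pair and never revisits that choice. It is a simplified illustration, not the code used in the paper; the function names and the toy word counts are assumptions made for this example, and it does not implement the unigram LM alternative.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def most_frequent_pair(words: Dict[Word, int]) -> Tuple[str, str]:
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Sorting first makes tie-breaking deterministic (smallest pair wins).
    return max(sorted(pairs), key=lambda p: pairs[p])


def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with one merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Greedily learn up to `num_merges` merge rules from word frequencies."""
    words: Dict[Word, int] = {tuple(w): c for w, c in word_counts.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        if all(len(symbols) < 2 for symbols in words):
            break  # nothing left to merge
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


# Toy corpus (hypothetical counts chosen for illustration).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(toy_counts, 2))
```

On this toy corpus the first two merges build the suffix "est" purely because its character pairs are the most frequent, which is the kind of locally driven, frequency-only decision the abstract contrasts with unigram LM tokenization.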

98 citations

Proceedings Article

Continual Lifelong Learning in Natural Language Processing: A Survey

Magdalena Biesialska¹, Katarzyna Biesialska¹, Marta R. Costa-jussà¹

Polytechnic University of Catalonia¹

17 Dec 2020-arXiv: Computation and Language

TL;DR: This work looks at the problem of continual learning (CL) through the lens of various NLP tasks and discusses major challenges in CL and current methods applied in neural network models.

Abstract: Continual learning (CL) aims to enable information systems to learn from a continuous data stream across time. However, it is difficult for existing deep learning architectures to learn a new task without largely forgetting previously acquired knowledge. Furthermore, CL is particularly challenging for language learning, as natural language is ambiguous: it is discrete, compositional, and its meaning is context-dependent. In this work, we look at the problem of CL through the lens of various NLP tasks. Our survey discusses major challenges in CL and current methods applied in neural network models. We also provide a critical review of the existing CL evaluation methods and datasets in NLP. Finally, we present our outlook on future research directions.

96 citations