Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
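A minimal sketch of the text-to-text framing described above, using the publicly released T5 checkpoints through the Hugging Face `transformers` library as an illustrative stand-in; the task prefixes ("translate English to German:", "summarize:", "cola sentence:") follow the conventions reported in the paper, but the exact inputs here are made up.

```python
# Every task is cast as text in, text out: the task is signaled by a prefix
# and the answer is generated as a string.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = {
    "translation": "translate English to German: The house is wonderful.",
    "summarization": "summarize: Transfer learning, where a model is first "
                     "pre-trained on a data-rich task, has emerged as a "
                     "powerful technique in NLP.",
    "acceptability (CoLA)": "cola sentence: The book was read by me quickly gave.",
}

for task, text in examples.items():
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=40)
    print(task, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because classification labels are also emitted as text (e.g. "acceptable" / "not acceptable"), a single model, loss, and decoding procedure covers every task.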


Citations
Posted Content
TL;DR: This paper developed a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision, using an open-domain QA framework and question generation model trained on original task data.
Abstract: Deep NLP models have been shown to learn spurious correlations, leaving them brittle to input perturbations. Recent work has shown that counterfactual or contrastive data -- i.e. minimally perturbed inputs -- can reveal these weaknesses, and that data augmentation using counterfactuals can help ameliorate them. Proposed techniques for generating counterfactuals rely on human annotations, perturbations based on simple heuristics, and meaning representation frameworks. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in a model's robustness to local perturbations.
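An illustrative sketch of a Retrieve-Generate-Filter style pipeline for counterfactual QA data, as described in the abstract. The callables passed in (`retriever`, `question_generator`, `qa_model`) and the helper `is_minimal_edit` are hypothetical placeholders, not the authors' released code.

```python
def is_minimal_edit(q1, q2, max_new_tokens=5):
    # Crude proxy for "minimally perturbed": few tokens of the new question
    # are absent from the original one.
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    return len(t2 - t1) <= max_new_tokens

def make_counterfactuals(original_question, original_answer, retriever,
                         question_generator, qa_model, max_candidates=20):
    counterfactuals = []
    # 1) Retrieve: passages related to the original question whose answer
    #    candidates differ from the original answer.
    for passage, candidate_answer in retriever(original_question)[:max_candidates]:
        if candidate_answer == original_answer:
            continue
        # 2) Generate: a new question conditioned on the retrieved passage
        #    and the alternative answer.
        new_question = question_generator(passage, candidate_answer)
        # 3) Filter: keep fluent, answerable, minimally perturbed pairs; a QA
        #    model verifies the candidate answer, which also labels the data.
        predicted = qa_model(new_question, passage)
        if predicted == candidate_answer and is_minimal_edit(original_question, new_question):
            counterfactuals.append((new_question, candidate_answer, passage))
    return counterfactuals
```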

1 citation

Posted Content
TL;DR: In this article, the authors show that existing software-based energy measurements are inaccurate because they ignore hardware differences and how resource utilization affects energy consumption, and they argue for an estimation model that accounts for hardware variability and the non-linear relationship between resource utilization and energy use.
Abstract: Accurate and reliable measurement of energy consumption is critical for making well-informed design choices when choosing and training large scale NLP models. In this work, we show that existing software-based energy measurements are not accurate because they do not take into account hardware differences and how resource utilization affects energy consumption. We conduct energy measurement experiments with four different models for a question answering task. We quantify the error of existing software-based energy measurements by using a hardware power meter that provides highly accurate energy measurements. Our key takeaway is the need for a more accurate energy estimation model that takes into account hardware variabilities and the non-linear relationship between resource utilization and energy consumption. We release the code and data at this https URL.
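A small sketch of how the gap between software-reported and hardware-metered energy could be quantified, in the spirit of the comparison described above. The power readings below are made-up placeholders, not the paper's measurements; energy is obtained by integrating sampled power over time.

```python
import numpy as np

def energy_joules(timestamps_s, power_watts):
    """Integrate sampled power (W) over time (s) to get energy (J)."""
    return np.trapz(power_watts, timestamps_s)

# Hypothetical 1 Hz power samples during one fine-tuning run (10 minutes).
t = np.arange(0, 600, 1.0)
software_power = 180 + 20 * np.sin(t / 60.0)   # e.g. GPU-reported draw only
meter_power = software_power + 45              # wall-plug meter also sees CPU, RAM, fans

e_soft = energy_joules(t, software_power)
e_meter = energy_joules(t, meter_power)
print(f"software estimate: {e_soft / 1e3:.1f} kJ")
print(f"hardware meter:    {e_meter / 1e3:.1f} kJ")
print(f"relative error:    {abs(e_meter - e_soft) / e_meter:.1%}")
```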

1 citation

Posted Content
TL;DR: The authors empirically evaluate several factors that affect performance on biomedical language applications, such as the sub-word vocabulary, model size, pre-training corpus, and domain transfer. They show consistent benchmark improvements with a larger BioMegatron model trained on a larger domain corpus, contributing to the understanding of domain-specific language model applications.
Abstract: There has been an influx of biomedical domain-specific language models, showing language models pre-trained on biomedical text perform better on biomedical domain benchmarks than those trained on general domain text corpora such as Wikipedia and Books. Yet, most works do not study the factors affecting each domain language application deeply. Additionally, the study of model size on domain-specific models has been mostly missing. We empirically study and evaluate several factors that can affect performance on domain language applications, such as the sub-word vocabulary set, model size, pre-training corpus, and domain transfer. We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus, contributing to our understanding of domain language model applications. We demonstrate noticeable improvements over the previous state-of-the-art (SOTA) on standard biomedical NLP benchmarks of named entity recognition, relation extraction, and question answering. Model checkpoints and code are available at [this https URL] and [this https URL].
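One of the factors studied above is the sub-word vocabulary: a general-domain vocabulary tends to shatter biomedical terms into many pieces, while a domain vocabulary keeps them intact. The sketch below illustrates that comparison; the checkpoint names are assumptions (any general-domain vs. biomedical-vocabulary tokenizer pair would do), not the models released with this paper.

```python
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-uncased")
# Assumed biomedical-vocabulary checkpoint; substitute whichever domain
# tokenizer is available locally.
biomedical = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

for term in ["dexamethasone", "thrombocytopenia", "acetylcholinesterase"]:
    print(term)
    print("  general-domain pieces :", general.tokenize(term))
    print("  biomedical pieces     :", biomedical.tokenize(term))
```

Fewer fragments per domain term generally means shorter sequences and less strain on the model to reassemble meaning from arbitrary sub-words, which is one reason the vocabulary choice matters for domain benchmarks.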

1 citation

Posted Content
TL;DR: LiST as mentioned in this paper combines self-training, with meta-learning to re-weight noisy pseudo-prompt labels, and lightweight adapter tuning over a frozen PLM encoder, which not only improves few-shot performance on target domains but also reduces the model memory footprint.
Abstract: We present a new method LiST for efficient fine-tuning of large pre-trained language models (PLMs) in few-shot learning settings. LiST significantly improves over recent methods that adopt prompt fine-tuning using two key techniques. The first one is the use of self-training to leverage large amounts of unlabeled data for prompt-tuning to significantly boost the model performance in few-shot settings. We use self-training in conjunction with meta-learning for re-weighting noisy pseudo-prompt labels. However, traditional self-training is expensive as it requires updating all the model parameters repetitively. Therefore, we use a second technique for light-weight fine-tuning where we introduce a small number of task-specific adapter parameters that are fine-tuned during self-training while keeping the PLM encoder frozen. This also significantly reduces the overall model footprint across several tasks that can now share a common PLM encoder as backbone for inference. Combining the above techniques, LiST not only improves the model performance for few-shot learning on target domains but also reduces the model memory footprint. We present a comprehensive study on six NLU tasks to validate the effectiveness of LiST. The results show that LiST improves by 35% over classic fine-tuning methods and 6% over prompt-tuning with 96% reduction in number of trainable parameters when fine-tuned with no more than 30 labeled examples from each target domain.
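A minimal PyTorch sketch of the lightweight-tuning idea described above: small bottleneck adapters are inserted and trained while the pre-trained encoder stays frozen. This is an illustrative reconstruction under stated assumptions, not the authors' LiST implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_size, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def add_adapters_and_freeze(encoder, hidden_size):
    # Freeze every pre-trained parameter ...
    for p in encoder.parameters():
        p.requires_grad = False
    # ... and attach one trainable adapter per encoder layer (assumes the
    # encoder exposes a `layers` list; adjust for the model actually used).
    adapters = nn.ModuleList(Adapter(hidden_size) for _ in encoder.layers)
    return adapters
```

Because only the adapters (and a task head) are updated, many tasks can share one frozen encoder at inference time, which is where the reported reduction in trainable parameters and memory footprint comes from.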

1 citation

Proceedings Article
01 Nov 2021
TL;DR: CoCoNet as mentioned in this paper enhances the standard copying mechanism by keeping track of the copying history and explicitly encourages the model to copy the input word that is relevant to the previously copied one.
Abstract: The copying mechanism has had considerable success in abstractive summarization, facilitating models to directly copy words from the input text to the output summary. Existing works mostly employ encoder-decoder attention, which applies copying at each time step independently of the former ones. However, this may sometimes lead to incomplete copying. In this paper, we propose a novel copying scheme named Correlational Copying Network (CoCoNet) that enhances the standard copying mechanism by keeping track of the copying history. It thereby takes advantage of prior copying distributions and, at each time step, explicitly encourages the model to copy the input word that is relevant to the previously copied one. In addition, we strengthen CoCoNet through pre-training with suitable corpora that simulate the copying behaviors. Experimental results show that CoCoNet can copy more accurately and achieves new state-of-the-art performances on summarization benchmarks, including CNN/DailyMail for news summarization and SAMSum for dialogue summarization. The code and checkpoint will be publicly available.
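A toy sketch of the correlational-copying idea described above: the copy distribution at step t is biased by the copy distribution at step t-1 through a source-side relevance (correlation) matrix, so copying tends to continue coherent spans. The shapes and the simple mixing rule are illustrative assumptions, not the paper's exact equations.

```python
import torch

def correlational_copy(attn_t, prev_copy, correlation, mix=0.5):
    """
    attn_t:      (src_len,) standard encoder-decoder copy attention at step t
    prev_copy:   (src_len,) copy distribution used at step t-1
    correlation: (src_len, src_len) learned relevance between source positions
    Returns a copy distribution that also respects the copying history.
    """
    # Propagate the previous step's copy mass through source-side correlations.
    history_bias = prev_copy @ correlation
    scores = (1 - mix) * attn_t + mix * history_bias
    return scores / scores.sum()

# Toy example: after copying source position 2, the model is nudged toward
# position 3 when the correlation matrix favors the next source token.
src_len = 5
attn = torch.full((src_len,), 1.0 / src_len)
prev = torch.zeros(src_len)
prev[2] = 1.0
corr = torch.eye(src_len).roll(1, dims=1)
print(correlational_copy(attn, prev, corr))
```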

1 citation

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.