Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks.

[...]

Tongwen Huang¹, Qingyun She, Junlin Zhang•Institutions (1)

Tencent¹

13 Sep 2020-arXiv: Computation and Language

TL;DR: Replacing the BERT base with RoBERTa as base classifier, BoostingBERT achieves new state-of-the-art results in several NLP Tasks and uses knowledge distillation within the "teacher-student" framework to reduce the computational overhead and model storage of BoostingberT while keeping its performance for practical application.

...read moreread less

Abstract: As a pre-trained Transformer model, BERT (Bidirectional Encoder Representations from Transformers) has achieved ground-breaking performance on multiple NLP tasks. On the other hand, Boosting is a popular ensemble learning technique which combines many base classifiers and has been demonstrated to yield better generalization performance in many machine learning tasks. Some works have indicated that ensemble of BERT can further improve the application performance. However, current ensemble approaches focus on bagging or stacking and there has not been much effort on exploring the boosting. In this work, we proposed a novel Boosting BERT model to integrate multi-class boosting into the BERT. Our proposed model uses the pre-trained Transformer as the base classifier to choose harder training sets to fine-tune and gains the benefits of both the pre-training language knowledge and boosting ensemble in NLP tasks. We evaluate the proposed model on the GLUE dataset and 3 popular Chinese NLU benchmarks. Experimental results demonstrate that our proposed model significantly outperforms BERT on all datasets and proves its effectiveness in many NLP tasks. Replacing the BERT base with RoBERTa as base classifier, BoostingBERT achieves new state-of-the-art results in several NLP Tasks. We also use knowledge distillation within the "teacher-student" framework to reduce the computational overhead and model storage of BoostingBERT while keeping its performance for practical application.

...read moreread less

7 citations

Proceedings Article•

Few-Shot Text Generation with Natural Language Instructions.

[...]

Timo Schick¹, Hinrich Schütze¹•Institutions (1)

Ludwig Maximilian University of Munich¹

01 Nov 2021

TL;DR: GenPET as discussed by the authors is a method for text generation that is based on pattern-exploiting training, a recent approach for combining textual instructions with supervised learning that only works for classification tasks.

...read moreread less

Abstract: Providing pretrained language models with simple task descriptions in natural language enables them to solve some tasks in a fully unsupervised fashion. Moreover, when combined with regular learning from examples, this idea yields impressive few-shot results for a wide range of text classification tasks. It is also a promising direction to improve data efficiency in generative settings, but there are several challenges to using a combination of task descriptions and example-based learning for text generation. In particular, it is crucial to find task descriptions that are easy to understand for the pretrained model and to ensure that it actually makes good use of them; furthermore, effective measures against overfitting have to be implemented. In this paper, we show how these challenges can be tackled: We introduce GenPET, a method for text generation that is based on pattern-exploiting training, a recent approach for combining textual instructions with supervised learning that only works for classification tasks. On several summarization and headline generation datasets, GenPET gives consistent improvements over strong baselines in few-shot settings.

...read moreread less

7 citations

Posted Content•

Making Transformers Solve Compositional Tasks.

[...]

Santiago Ontañón, Joshua Ainslie, Vaclav Cvicek, Zachary Fisher¹•Institutions (1)

Google¹

09 Aug 2021-arXiv: Artificial Intelligence

TL;DR: The authors explored the design space of Transformer models and showed that the inductive biases given to the model by several design decisions significantly impact compositional generalization, and identified Transformer configurations that generalize compositionally significantly better than previously reported in the literature in a diverse set of compositional tasks.

...read moreread less

Abstract: Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper we explore the design space of Transformer models showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. Through this exploration, we identified Transformer configurations that generalize compositionally significantly better than previously reported in the literature in a diverse set of compositional tasks, and that achieve state-of-the-art results in a semantic parsing compositional generalization benchmark (COGS), and a string edit operation composition benchmark (PCFG).

...read moreread less

7 citations

Proceedings Article•

The Power of Scale for Parameter-Efficient Prompt Tuning

[...]

Brian Lester¹, Rami Al-Rfou¹, Noah Constant¹•Institutions (1)

Google¹

18 Apr 2021

TL;DR: This article proposed a simple yet effective mechanism for learning soft prompts to condition frozen language models to perform specific downstream tasks, which can be learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples.

...read moreread less

Abstract: In this work, we explore “prompt tuning,” a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3’s few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method “closes the gap” and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed “prefix tuning” of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient “prompt ensembling.” We release code and model checkpoints to reproduce our experiments.

...read moreread less

7 citations

Posted Content•

CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing

[...]

Firoj Alam¹, Hassan Sajjad¹, Muhammad Imran¹, Ferda Ofli¹•Institutions (1)

Qatar Airways¹

14 Apr 2020-arXiv: Social and Information Networks

TL;DR: In this paper, the authors proposed a consolidated dataset of crisis-related data and benchmarks for both binary and multiclass classification tasks using several deep learning architec-crures including, CNN, fastText, and transformers.

...read moreread less

Abstract: Time-critical analysis of social media streams is important for humanitarian organizations for planing rapid response during disasters. The \textit{crisis informatics} research community has developed several techniques and systems for processing and classifying big crisis-related data posted on social media. However, due to the dispersed nature of the datasets used in the literature (e.g., for training models), it is not possible to compare the results and measure the progress made towards building better models for crisis informatics tasks. In this work, we attempt to bridge this gap by combining various existing crisis-related datasets. We consolidate eight human-annotated datasets and provide 166.1k and 141.5k tweets for \textit{informativeness} and \textit{humanitarian} classification tasks, respectively. We believe that the consolidated dataset will help train more sophisticated models. Moreover, we provide benchmarks for both binary and multiclass classification tasks using several deep learning architecrures including, CNN, fastText, and transformers. We make the dataset and scripts available at: this https URL

...read moreread less

7 citations