Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
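To make the text-to-text framing concrete, the sketch below uses the Hugging Face transformers library and the publicly released t5-small checkpoint (both assumptions beyond what the abstract states): every task is expressed as a plain input string with a task prefix, and the answer is produced as a plain output string.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task becomes "input text -> output text"; the task is named by a prefix.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "cola sentence: The books was on the table.",  # grammaticality -> "acceptable"/"unacceptable"
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Because classification, QA, and generation all share this one interface, the same model, loss, and decoding procedure serve every task.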


Citations
Proceedings ArticleDOI
01 Dec 2020
TL;DR: This work successfully utilizes non-dialog data from unrelated NLP tasks to train dialog state trackers, mitigating the data sparsity issue inherent to DST.
Abstract: Dialog state tracking (DST) suffers from severe data sparsity. While many natural language processing (NLP) tasks benefit from transfer learning and multi-task learning, in dialog these methods are limited by the amount of available data and by the specificity of dialog applications. In this work, we successfully utilize non-dialog data from unrelated NLP tasks to train dialog state trackers. This opens the door to the abundance of unrelated NLP corpora to mitigate the data sparsity issue inherent to DST.
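As a rough illustration of how non-dialog corpora can be folded into DST training (a sketch of the general idea, not the authors' exact recipe; the prefixes, slot names, and values below are hypothetical), both dialog-state and unrelated NLP examples can be cast into the same input/target text format and interleaved during training:

import random

# Hypothetical formats; prefixes, slot names, and values are placeholders.
dst_data = [
    ("track state | User: I need a cheap hotel downtown. | slot: hotel-pricerange", "cheap"),
]
aux_data = [  # an unrelated QA example cast into the same (input, target) shape
    ("answer | question: Where is the hotel? context: The hotel is downtown.", "downtown"),
]

def mixed_examples(dst, aux, aux_ratio=0.5, n=10):
    # Interleave dialog and non-dialog examples so one seq2seq tracker sees both.
    for _ in range(n):
        pool = aux if random.random() < aux_ratio else dst
        yield random.choice(pool)

for source, target in mixed_examples(dst_data, aux_data):
    print(source, "->", target)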

2 citations

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This paper proposed a data augmentation method based on a stochastic noise generator, which learns to perturb the word embeddings of the input questions and context without changing their semantics.
Abstract: QA models based on pretrained language models have achieved remarkable performance on various benchmark datasets. However, QA models do not generalize well to unseen data that falls outside the training distribution, due to distributional shift. Data augmentation (DA) techniques that drop or replace words have been shown to be effective in regularizing the model against overfitting to the training data. Yet they may adversely affect QA tasks, since they incur semantic changes that can lead to wrong answers. To tackle this problem, we propose a simple yet effective DA method based on a stochastic noise generator, which learns to perturb the word embeddings of the input questions and context without changing their semantics. We validate the performance of QA models trained with our word embedding perturbation on a single source dataset, evaluating on five different target domains. The results show that our method significantly outperforms the baseline DA methods. Notably, the model trained with our method outperforms the model trained with more than 240K artificially generated QA pairs.
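A minimal PyTorch sketch of the general idea (illustrative, not the authors' exact generator; the module structure and hidden size are assumptions): a small network predicts a per-token noise scale and adds Gaussian noise to the word embeddings before the QA encoder consumes them.

import torch
import torch.nn as nn

class EmbeddingPerturber(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Predicts a per-token noise scale from the embedding itself.
        self.scale = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Softplus())

    def forward(self, embeddings):            # (batch, seq_len, hidden)
        sigma = self.scale(embeddings)        # learned, token-dependent std-dev
        noise = torch.randn_like(embeddings) * sigma
        return embeddings + noise             # perturbed embeddings, same shape

perturber = EmbeddingPerturber(hidden_size=768)
fake_embeddings = torch.randn(2, 16, 768)     # stand-in for QA model embeddings
augmented = perturber(fake_embeddings)

Because the perturbation operates in embedding space rather than on the tokens themselves, the surface text and its answer span stay unchanged while the model sees a slightly different input each pass.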

2 citations

Proceedings ArticleDOI
12 Jun 2021
TL;DR: This article found that transformer-based LMs can predict psychometric properties consistently well in certain categories but consistently poorly in others, thus providing new insights into fundamental similarities and differences between human and LM reasoning.
Abstract: Transformer-based language models (LMs) continue to advance state-of-the-art performance on NLP benchmark tasks, including tasks designed to mimic human-inspired “commonsense” competencies. To better understand the degree to which LMs can be said to have certain linguistic reasoning skills, researchers are beginning to adapt the tools and concepts of the field of psychometrics. But to what extent can the benefits flow in the other direction? I.e., can LMs be of use in predicting what the psychometric properties of test items will be when those items are given to human participants? We gather responses from numerous human participants and LMs (transformer- and non-transformer-based) on a broad diagnostic test of linguistic competencies. We then use the responses to calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately. We then determine how well these two sets of predictions match. We find cases in which transformer-based LMs predict psychometric properties consistently well in certain categories but consistently poorly in others, thus providing new insights into fundamental similarities and differences between human and LM reasoning.
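To make the setup concrete, here is a minimal sketch (assuming a binary correct/incorrect response matrix; not the paper's exact pipeline) that computes two standard psychometric properties, item difficulty and item-total discrimination, separately for humans and LMs and then compares the difficulty estimates.

import numpy as np

def item_properties(responses):               # responses: (n_subjects, n_items) of 0/1
    difficulty = responses.mean(axis=0)       # proportion correct per item
    total = responses.sum(axis=1)
    # Point-biserial-style discrimination: correlation of each item with the total score.
    discrimination = np.array([
        np.corrcoef(responses[:, j], total)[0, 1] for j in range(responses.shape[1])
    ])
    return difficulty, discrimination

rng = np.random.default_rng(0)
human = rng.integers(0, 2, size=(50, 20))     # stand-in response matrices
lm = rng.integers(0, 2, size=(10, 20))

(h_diff, _), (m_diff, _) = item_properties(human), item_properties(lm)
print("difficulty agreement:", np.corrcoef(h_diff, m_diff)[0, 1])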

2 citations

Proceedings ArticleDOI
20 May 2021
TL;DR: In this paper, the authors investigated deep learning models for Natural Language Processing (NLP), focusing on the Transformer and Reformer models and evaluating their performance on a range of NLP tasks.
Abstract: This paper investigates different deep learning models for various tasks of Natural Language Processing. Recent ongoing research centres on Transformer models and their variants, such as the Reformer. Recurrent Neural Network models were efficient only up to a fixed window size and were unable to capture long-term dependencies in long sequences. To overcome this limitation, the attention mechanism was introduced and incorporated into the Transformer model. However, dot-product attention in Transformers has a complexity of O(n²), where n is the sequence length, and this computation becomes infeasible for long sequences. In addition, the residual layers consume a lot of memory because activations must be stored for back-propagation. To overcome these memory and efficiency limitations and to let Transformers learn over longer sequences, the Reformer model was introduced. Our research evaluates the performance of these two models on various Natural Language Processing tasks.
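The quadratic cost mentioned above comes from materializing the full n-by-n score matrix, which is what the Reformer's bucketed attention avoids. A small NumPy sketch (illustrative only) makes this explicit:

import numpy as np

def dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n, n) matrix: quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 1024, 64
Q = K = V = np.random.randn(n, d)
out = dot_product_attention(Q, K, V)
print(out.shape)                               # (1024, 64); the scores alone took n*n floats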

2 citations

Posted Content
TL;DR: Zhang et al. as discussed by the authors proposed a method to enhance pre-trained models with heterogeneous user information, called User-aware Pre-training for Recommendation (UPRec), which leverages the user attributes andstructured social graphs to construct self-supervised objectives in the pre-training stage.
Abstract: Existing sequential recommendation methods rely on large amounts of training data and usually suffer from the data sparsity problem. To tackle this, the pre-training mechanism has been widely adopted: it leverages large-scale data for self-supervised learning and transfers the pre-trained parameters to downstream tasks. However, previous pre-trained models for recommendation focus on leveraging universal sequence patterns from user behaviour sequences and item information, while ignoring the heterogeneous user information that captures personalized interests and has been shown to contribute to personalized recommendation. In this paper, we propose a method to enhance pre-trained models with heterogeneous user information, called User-aware Pre-training for Recommendation (UPRec). Specifically, UPRec leverages user attributes and structured social graphs to construct self-supervised objectives in the pre-training stage and proposes two user-aware pre-training tasks. Comprehensive experimental results on several real-world large-scale recommendation datasets demonstrate that UPRec can effectively integrate user information into pre-trained models and thus provide more appropriate recommendations for users.
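A rough sketch of what a user-aware pre-training objective can look like (illustrative only; UPRec's actual encoder and objectives are not specified in this abstract, and the GRU encoder, heads, and sizes below are assumptions): a sequence loss over behaviour data is combined with a user-attribute prediction loss so that the pre-trained representation also encodes user information.

import torch
import torch.nn as nn

class UserAwarePretrainer(nn.Module):
    def __init__(self, n_items, n_attr_classes, hidden=128):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)   # stand-in encoder
        self.next_item_head = nn.Linear(hidden, n_items)          # sequence objective
        self.attr_head = nn.Linear(hidden, n_attr_classes)        # user-attribute objective

    def forward(self, item_seq, next_item, user_attr):
        h, _ = self.encoder(self.item_emb(item_seq))
        user_repr = h[:, -1]                                       # last hidden state as user representation
        seq_loss = nn.functional.cross_entropy(self.next_item_head(user_repr), next_item)
        attr_loss = nn.functional.cross_entropy(self.attr_head(user_repr), user_attr)
        return seq_loss + attr_loss                                # joint pre-training loss

model = UserAwarePretrainer(n_items=1000, n_attr_classes=5)
loss = model(torch.randint(0, 1000, (4, 10)),   # behaviour sequences
             torch.randint(0, 1000, (4,)),      # next-item targets
             torch.randint(0, 5, (4,)))         # user attribute labels
loss.backward()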

2 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.