Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
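
The text-to-text idea in the abstract can be illustrated with a minimal Python sketch: every task is cast as a pair of plain strings, an input text and a target text, so that one model interface can serve translation, summarization, and classification alike. The prefix strings, the task names, and the helper `to_text_to_text` are illustrative assumptions; the abstract does not specify how tasks are marked, and this is not the authors' released code.

```python
# Hypothetical task prefixes; the abstract does not specify any.
TASK_PREFIXES = {
    "translation": "translate English to German: ",
    "summarization": "summarize: ",
    "sentiment": "sentiment: ",
}


def to_text_to_text(task, text, target):
    """Return an (input_text, target_text) pair of plain strings."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    # Labels and scores are rendered as text so every task shares one output type.
    return TASK_PREFIXES[task] + text, str(target)


examples = [
    to_text_to_text("translation", "That is good.", "Das ist gut."),
    to_text_to_text("summarization", "A long news article ...", "A short summary."),
    to_text_to_text("sentiment", "A delightful film.", "positive"),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

Because the target is always a string, a classification label such as "positive" is produced the same way as a translated sentence, which is the sense in which the framework is unified.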

Citations

Posted Content

K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce

Song Xu¹, Haoran Li², Peng Yuan, Yujia Wang, Youzheng Wu³, Xiaodong He², Ying Liu⁴, Bowen Zhou¹

IBM¹, Chinese Academy of Sciences², The Chinese University of Hong Kong³, Renmin University of China⁴

14 Apr 2021-arXiv: Computation and Language

TL;DR: This paper proposed K-PLUG, a knowledge-injected pre-trained language model based on the encoder-decoder transformer that can be transferred to both natural language understanding and generation tasks.

Abstract: Existing pre-trained language models (PLMs) have demonstrated the effectiveness of self-supervised learning for a broad range of natural language processing (NLP) tasks. However, most of them are not explicitly aware of domain-specific knowledge, which is essential for downstream tasks in many domains, such as tasks in e-commerce scenarios. In this paper, we propose K-PLUG, a knowledge-injected pre-trained language model based on the encoder-decoder transformer that can be transferred to both natural language understanding and generation tasks. We verify our method in a diverse range of e-commerce scenarios that require domain-specific knowledge. Specifically, we propose five knowledge-aware self-supervised pre-training objectives to formulate the learning of domain-specific knowledge, including e-commerce domain-specific knowledge-bases, aspects of product entities, categories of product entities, and unique selling propositions of product entities. K-PLUG achieves new state-of-the-art results on a suite of domain-specific NLP tasks, including product knowledge base completion, abstractive product summarization, and multi-turn dialogue, and significantly outperforms baselines across the board, which demonstrates that the proposed method effectively learns a diverse set of domain-specific knowledge for both language understanding and generation tasks.

3 citations

Proceedings Article

Could you give me a hint? Generating inference graphs for defeasible reasoning

Aman Madaan¹, Dheeraj Rajagopal¹, Niket Tandon¹, Yiming Yang¹, Eduard Hovy¹

Carnegie Mellon University¹

01 Aug 2021

TL;DR: The authors automatically generate inference graphs for defeasible reasoning through transfer learning from another NLP task that shares the kind of reasoning that inference graphs support, and find that human accuracy on the defeasible inference task improves by 20% when consulting the generated graphs.

Abstract: Defeasible reasoning is the mode of reasoning where conclusions can be overturned by taking into account new evidence. A commonly used method in cognitive science and logic literature is to handcraft argumentation supporting inference graphs. While humans find inference graphs very useful for reasoning, constructing them at scale is difficult. In this paper, we automatically generate such inference graphs through transfer learning from another NLP task that shares the kind of reasoning that inference graphs support. Through automated metrics and human evaluation, we find that our method generates meaningful graphs for the defeasible inference task. Human accuracy on this task improves by 20% by consulting the generated graphs. Our findings open up exciting new research avenues for cases where machine reasoning can help human reasoning. (A dataset of 230,000 influence graphs for each defeasible query is located at: this https URL.)

3 citations

Proceedings Article

Interpreting text classifiers by learning context-sensitive influence of words

Sawan Kumar, Kalpit Dixit, Kashif Shah

01 Jun 2021

TL;DR: This work proposes MOXIE (MOdeling conteXt-sensitive InfluencE of words), which aims to enable a richer interface for a user to interact with the model being interpreted and to produce testable predictions for importance scores, counterfactuals, and learned biases.

Abstract: Many existing approaches for interpreting text classification models focus on providing importance scores for parts of the input text, such as words, but without a way to test or improve the interpretation method itself. This has the effect of compounding the problem of understanding or building trust in the model, with the interpretation method itself adding to the opacity of the model. Further, importance scores on individual examples are usually not enough to provide a sufficient picture of model behavior. To address these concerns, we propose MOXIE (MOdeling conteXt-sensitive InfluencE of words) with an aim to enable a richer interface for a user to interact with the model being interpreted and to produce testable predictions. In particular, we aim to make predictions for importance scores, counterfactuals and learned biases with MOXIE. In addition, with a global learning objective, MOXIE provides a clear path for testing and improving itself. We evaluate the reliability and efficiency of MOXIE on the task of sentiment analysis.

3 citations

Proceedings Article

BERTGen: Multi-task Generation through BERT

Faidon Mitzalis, Ozan Caglayan¹, Pranava Madhyastha¹, Lucia Specia¹

Imperial College London¹

01 Aug 2021

TL;DR: BERTGen extends BERT by fusing the multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively, and achieves competitive performance for zero-shot language generation.

Abstract: We present BERTGen, a novel, generative, decoder-only model which extends BERT by fusing multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively. BERTGen is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multi-task setting. With a comprehensive set of evaluations, we show that BERTGen outperforms many strong baselines across the tasks explored. We also show BERTGen’s ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts. Finally, we conduct ablation studies which demonstrate that BERTGen substantially benefits from multi-tasking and effectively transfers relevant inductive biases from the pre-trained models.

3 citations

Posted Content

Generative Pre-training for Paraphrase Generation by Representing and Predicting Spans in Exemplars

Tien-Cuong Bui¹, Van-Duc Le¹, Hai-Thien To¹, Sang Kyun Cha¹

Seoul National University¹

29 Nov 2020-arXiv: Computation and Language

TL;DR: A novel approach to paraphrasing sentences, extended from the GPT-2 model, is presented. It develops a template masking technique, named first-order masking, to mask out irrelevant words in exemplars using POS taggers, and introduces a second technique, referred to as second-order masking, which uses a Bernoulli distribution to control the visibility of the first-order-masked template's tokens.

Abstract: Paraphrase generation is a long-standing problem and serves an essential role in many natural language processing problems. Despite some encouraging results, recent methods either confront the problem of favoring generic utterances or need to retrain the model from scratch for each new dataset. This paper presents a novel approach to paraphrasing sentences, extended from the GPT-2 model. We develop a template masking technique, named first-order masking, to mask out irrelevant words in exemplars utilizing POS taggers. The paraphrasing task is thereby changed to predicting spans in masked templates. Our proposed approach outperforms competitive baselines, especially in the semantic preservation aspect. To prevent the model from being biased towards a given template, we introduce a technique, referred to as second-order masking, which utilizes a Bernoulli distribution to control the visibility of the first-order-masked template's tokens. Moreover, this technique allows the model to provide various paraphrased sentences in testing by adjusting the second-order-masking level. For scale-up objectives, we compare the performance of two alternative template-selection methods, which shows that they were equivalent in preserving semantic information.
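
The two masking stages described in the abstract can be sketched in Python as follows. This is only one possible reading, not the authors' implementation: the mask symbol, the set of POS tags that stay visible, the hand-written tags, and the function names are all hypothetical, and the sketch assumes that second-order masking hides each still-visible template token with an independent Bernoulli draw.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def first_order_mask(tokens, pos_tags, keep_tags):
    """Mask every token whose POS tag is not in keep_tags (assumed rule)."""
    return [tok if tag in keep_tags else MASK for tok, tag in zip(tokens, pos_tags)]


def second_order_mask(template, hide_prob, rng):
    """Hide each still-visible token with an independent Bernoulli(hide_prob) draw."""
    return [
        MASK if tok != MASK and rng.random() < hide_prob else tok
        for tok in template
    ]


tokens = ["how", "can", "i", "learn", "python", "quickly"]
pos_tags = ["ADV", "AUX", "PRON", "VERB", "NOUN", "ADV"]  # hand-written, not from a tagger
template = first_order_mask(tokens, pos_tags, keep_tags={"ADV", "AUX", "PRON"})

rng = random.Random(0)  # seeded so repeated runs give the same draws
print(template)
print(second_order_mask(template, hide_prob=0.5, rng=rng))
```

Under this reading, `hide_prob` plays the role of the second-order-masking level: 0.0 leaves the template fully visible, while 1.0 hides every token, so intermediate values trade template fidelity for variety.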

3 citations