Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•

Pre-training Text-to-Text Transformers for Concept-centric Common Sense

[...]

Wangchunshu Zhou¹, Dong-Ho Lee², Ravi Kiran Selvam², Seyeon Lee², Xiang Ren² - Show less +1 more•Institutions (2)

Beihang University¹, University of Southern California²

03 May 2021

TL;DR: The authors propose generative and contrastive objectives as intermediate self-supervised pre-training tasks between general pretraining and downstream task-specific fine-tuning to augment pre-trained language models with commonsense knowledge.

...read moreread less

Abstract: Pretrained language models (PTLM) have achieved impressive results in a range of natural language understanding (NLU) and generation (NLG) tasks that require a syntactic and semantic understanding of the text. However, current pre-training objectives such as masked token prediction (for BERT-style PTLMs) and masked span infilling (for T5-style PTLMs) do not explicitly model the relational and compositional commonsense knowledge about everyday concepts, which is crucial to many downstream tasks requiring commonsense reasoning. To augment PTLMs with common sense, we propose generative and contrastive objectives as intermediate self-supervised pre-training tasks between general pre-training and downstream task-specific fine-tuning. We also propose a joint training framework to unify generative and contrastive objectives so that these objectives can be more effective. Our proposed objectives can pack more commonsense knowledge into the parameters of a pre-trained text-to-text transformer without relying on external knowledge bases, yielding better performance on both NLU and NLG tasks. We apply our method on a pre-trained T5 model in an intermediate task transfer learning fashion to train a concept-aware language model (CALM) and experiment with five commonsense benchmarks (four NLU tasks and one NLG task). Experimental results show that CALM outperforms baseline methods by a consistent margin.

...read moreread less

5 citations

Proceedings Article•DOI•

EmailSum: Abstractive Email Thread Summarization

[...]

Shiyue Zhang¹, Asli Celikyilmaz², Jianfeng Gao³, Mohit Bansal⁴•Institutions (4)

Tsinghua University¹, Facebook², Microsoft³, University of North Carolina at Chapel Hill⁴

01 Aug 2021

TL;DR: The authors developed an abstractive email thread summarization (EmailSum) dataset, which contains human-annotated short (<30 words) and long (<100 words) summaries of 2,549 email threads (each containing 3 to 10 emails) over a wide variety of topics.

...read moreread less

Abstract: Recent years have brought about an interest in the challenging task of summarizing conversation threads (meetings, online discussions, etc.). Such summaries help analysis of the long text to quickly catch up with the decisions made and thus improve our work or communication efficiency. To spur research in thread summarization, we have developed an abstractive Email Thread Summarization (EmailSum) dataset, which contains human-annotated short (<30 words) and long (<100 words) summaries of 2,549 email threads (each containing 3 to 10 emails) over a wide variety of topics. We perform a comprehensive empirical study to explore different summarization techniques (including extractive and abstractive methods, single-document and hierarchical models, as well as transfer and semisupervised learning) and conduct human evaluations on both short and long summary generation tasks. Our results reveal the key challenges of current abstractive summarization models in this task, such as understanding the sender’s intent and identifying the roles of sender and receiver. Furthermore, we find that widely used automatic evaluation metrics (ROUGE, BERTScore) are weakly correlated with human judgments on this email thread summarization task. Hence, we emphasize the importance of human evaluation and the development of better metrics by the community.

...read moreread less

5 citations

Posted Content•

Simulated Chats for Building Dialog Systems: Learning to Generate Conversations from Instructions

[...]

Biswesh Mohapatra, Gaurav Pandey¹, Danish Contractor², Sachindra Joshi•Institutions (2)

IBM¹, Indian Institute of Technology Delhi²

20 Oct 2020-arXiv: Computation and Language

TL;DR: In this article, a data creation strategy that uses the pre-trained language model, GPT2, to simulate the interaction between crowd workers by creating a user bot and an agent bot is presented.

...read moreread less

Abstract: Popular dialog datasets such as MultiWOZ are created by providing crowd workers an instruction, expressed in natural language, that describes the task to be accomplished. Crowd workers play the role of a user and an agent to generate dialogs to accomplish tasks involving booking restaurant tables, calling a taxi etc. In this paper, we present a data creation strategy that uses the pre-trained language model, GPT2, to simulate the interaction between crowd workers by creating a user bot and an agent bot. We train the simulators using a smaller percentage of actual crowd-generated conversations and their corresponding instructions. We demonstrate that by using the simulated data, we achieve significant improvements in low-resource settings on two publicly available datasets - the MultiWOZ dataset and the Persona chat dataset.

...read moreread less

5 citations

Proceedings Article•

How to Select One Among All? An Extensive Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding.

[...]

Tianda Li¹, Ahmad Rashid², Aref Jafari³, Pranav Sharma, Ali Ghodsi³, Mehdi Rezagholizadeh² - Show less +2 more•Institutions (3)

Queen's University¹, Huawei², University of Waterloo³

13 Sep 2021

TL;DR: Combined-KD as discussed by the authors proposes a framework to assess adversarial robustness of multiple KD algorithms and achieves state-of-the-art results on the GLUE benchmark, out-ofdomain generalization, and adversarial-robustness compared to competitive methods.

...read moreread less

Abstract: Knowledge Distillation (KD) is a model compression algorithm that helps transfer the knowledge in a large neural network into a smaller one. Even though KD has shown promise on a wide range of Natural Language Processing (NLP) applications, little is understood about how one KD algorithm compares to another and whether these approaches can be complimentary to each other. In this work, we evaluate various KD algorithms on in-domain, out-of-domain and adversarial testing. We propose a framework to assess adversarial robustness of multiple KD algorithms. Moreover, we introduce a new KD algorithm, Combined-KD, which takes advantage of two promising approaches (better training scheme and more efficient data augmentation). Our extensive experimental results show that Combined-KD achieves state-of-the-art results on the GLUE benchmark, out-of-domain generalization, and adversarial robustness compared to competitive methods.

...read moreread less

5 citations

Proceedings Article•DOI•

ANA at SemEval-2020 Task 4: MUlti-task learNIng for cOmmonsense reasoNing (UNION)

[...]

Anandh Konar, Chenyang Huang¹, Amine Trabelsi¹, Osmar R. Zaïane¹•Institutions (1)

University of Alberta¹

01 Dec 2020

TL;DR: In this paper, a unified end-to-end framework for commonsense reasoning was proposed to learn more dynamics under the scope of commonsense Reasoning, which can generate more meaningful explanations.

...read moreread less

Abstract: In this paper, we describe our mUlti-task learNIng for cOmmonsense reasoNing (UNION) system submitted for Task C of the SemEval2020 Task 4, which is to generate a reason explaining why a given false statement is non-sensical. However, we found in the early experiments that simple adaptations such as fine-tuning GPT2 often yield dull and non-informative generations (e.g. simple negations). In order to generate more meaningful explanations, we propose UNION, a unified end-to-end framework, to utilize several existing commonsense datasets so that it allows a model to learn more dynamics under the scope of commonsense reasoning. In order to perform model selection efficiently, accurately, and promptly, we also propose a couple of auxiliary automatic evaluation metrics so that we can extensively compare the models from different perspectives. Our submitted system not only results in a good performance in the proposed metrics but also outperforms its competitors with the highest achieved score of 2.10 for human evaluation while remaining a BLEU score of 15.7. Our code is made publicly available.

...read moreread less

5 citations