Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

Effective Sequence-to-Sequence Dialogue State Tracking

[...]

Jeffrey Zhao¹, Mahdis Mahdieh¹, Ye Zhang¹, Yuan Cao¹, Yonghui Wu¹ - Show less +1 more•Institutions (1)

Google¹

31 Aug 2021-arXiv: Computation and Language

TL;DR: This paper showed that the choice of pre-training objective makes a significant difference to the state tracking quality and that masked span prediction is more effective than auto-regressive language modeling for text summarization.

...read moreread less

Abstract: Sequence-to-sequence models have been applied to a wide variety of NLP tasks, but how to properly use them for dialogue state tracking has not been systematically investigated. In this paper, we study this problem from the perspectives of pre-training objectives as well as the formats of context representations. We demonstrate that the choice of pre-training objective makes a significant difference to the state tracking quality. In particular, we find that masked span prediction is more effective than auto-regressive language modeling. We also explore using Pegasus, a span prediction-based pre-training objective for text summarization, for the state tracking model. We found that pre-training for the seemingly distant summarization task works surprisingly well for dialogue state tracking. In addition, we found that while recurrent state context representation works also reasonably well, the model may have a hard time recovering from earlier mistakes. We conducted experiments on the MultiWOZ 2.1-2.4, WOZ 2.0, and DSTC2 datasets with consistent observations.

...read moreread less

Unsupervised Multiple Choices Question Answering: Start Learning from Basic Knowledge

[...]

Chi-Liang Liu, Hung-yi Lee

01 Nov 2021

TL;DR: In this paper, the authors study the possibility of unsupervised multiple choices question answering (MCQA) from very basic knowledge, the MCQA model knows that some choices have higher probabilities of being correct than others.

...read moreread less

Abstract: In this paper, we study the possibility of unsupervised Multiple Choices Question Answering (MCQA). From very basic knowledge, the MCQA model knows that some choices have higher probabilities of being correct than others. The information, though very noisy, guides the training of an MCQA model. The proposed method is shown to outperform the baseline approaches on RACE and is even comparable with some supervised learning approaches on MC500.

...read moreread less

Posted Content•

XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.

[...]

Subhabrata Mukherjee, Ahmed Hassan Awadallah, Jianfeng Gao¹•Institutions (1)

Microsoft¹

08 Jun 2021-arXiv: Computation and Language

TL;DR: This article developed a new task-agnostic distillation framework XtremeDistilTransformers that leverages the advantage of task-specific methods for learning a small universal model that can be applied to arbitrary tasks and languages.

...read moreread less

Abstract: While deep and large pre-trained models are the state-of-the-art for various natural language processing tasks, their huge size poses significant challenges for practical uses in resource constrained settings. Recent works in knowledge distillation propose task-agnostic as well as task-specific methods to compress these models, with task-specific ones often yielding higher compression rate. In this work, we develop a new task-agnostic distillation framework XtremeDistilTransformers that leverages the advantage of task-specific methods for learning a small universal model that can be applied to arbitrary tasks and languages. To this end, we study the transferability of several source tasks, augmentation resources and model architecture for distillation. We evaluate our model performance on multiple tasks, including the General Language Understanding Evaluation (GLUE) benchmark, SQuAD question answering dataset and a massive multi-lingual NER dataset with 41 languages. We release three distilled task-agnostic checkpoints with 13MM, 22MM and 33MM parameters obtaining SOTA performance in several tasks.

...read moreread less

Proceedings Article•

Memory and Knowledge Augmented Language Models for Inferring Salience in Long-Form Stories

[...]

David Wilmot¹, Frank Keller•Institutions (1)

University of Edinburgh¹

26 Aug 2021

TL;DR: This paper improved the standard transformer language model by incorporating an external knowledgebase (derived from Retrieval Augmented Generation) and adding a memory mechanism to enhance performance on longer works, using a novel approach to derive salience annotation using chapter-aligned summaries from the Shmoop corpus for classic literary works.

...read moreread less

Abstract: Measuring event salience is essential in the understanding of stories. This paper takes a recent unsupervised method for salience detection derived from Barthes Cardinal Functions and theories of surprise and applies it to longer narrative forms. We improve the standard transformer language model by incorporating an external knowledgebase (derived from Retrieval Augmented Generation) and adding a memory mechanism to enhance performance on longer works. We use a novel approach to derive salience annotation using chapter-aligned summaries from the Shmoop corpus for classic literary works. Our evaluation against this data demonstrates that our salience detection model improves performance over and above a non-knowledgebase and memory augmented language model, both of which are crucial to this improvement.

...read moreread less

Proceedings Article•

Evidence-based Fact-Checking of Health-related Claims

[...]

Mourad Sarrouti¹, Asma Ben Abacha¹, Yassine Mrabet¹, Dina Demner-Fushman¹•Institutions (1)

National Institutes of Health¹

01 Nov 2021

TL;DR: Healthver as discussed by the authors is a dataset for evidence-based fact-checking of health-related claims that allows to study the validity of real-world claims by evaluating their truthfulness against scientific articles.

...read moreread less

Abstract: The task of verifying the truthfulness of claims in textual documents, or fact-checking, has received significant attention in recent years. Many existing evidence-based factchecking datasets contain synthetic claims and the models trained on these data might not be able to verify real-world claims. Particularly few studies addressed evidence-based fact-checking of health-related claims that require medical expertise or evidence from the scientific literature. In this paper, we introduce HEALTHVER, a new dataset for evidence-based fact-checking of health-related claims that allows to study the validity of real-world claims by evaluating their truthfulness against scientific articles. Using a three-step data creation method, we first retrieved real-world claims from snippets returned by a search engine for questions about COVID-19. Then we automatically retrieved and re-ranked relevant scientific papers using a T5 relevance-based model. Finally, the relations between each evidence statement and the associated claim were manually annotated as SUPPORT, REFUTE and NEUTRAL. To validate the created dataset of 14,330 evidence-claim pairs, we developed baseline models based on pretrained language models. Our experiments showed that training deep learning models on real-world medical claims greatly improves performance compared to models trained on synthetic and open-domain claims. Our results and manual analysis suggest that HEALTHVER provides a realistic and challenging dataset for future efforts on evidence-based fact-checking of health-related claims. The dataset, source code, and a leaderboard are available at https://github.com/sarrouti/healthver.

...read moreread less