Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and systematically compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
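
As a concrete illustration of the text-to-text format, the short sketch below casts several tasks as plain-text inputs and outputs using the publicly released T5 checkpoints through the Hugging Face transformers library. The `t5-small` checkpoint and the generation settings are choices made here for brevity, not prescriptions from the paper; the task prefixes follow the conventions described in it.

```python
# Minimal sketch: casting different NLP tasks as text-to-text with T5.
# Assumes the Hugging Face `transformers` package and the public "t5-small"
# checkpoint; task prefixes follow the conventions described in the paper.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    "translate English to German: The house is wonderful.",        # translation
    "summarize: state authorities dispatched emergency crews "
    "tuesday to survey the damage after an onslaught of storms.",  # summarization
    "cola sentence: The course is jumping well.",                  # acceptability classification
]

for text in examples:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Each task is distinguished only by its text prefix, so the same model, loss, and decoding procedure serve translation, summarization, and classification alike.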


Citations
Posted Content
TL;DR: The authors propose to learn the optimal masking strategy for the intermediate pre-training stage by using supervision from the downstream task itself, and then deploying the learned policy during intermediate pre-training.
Abstract: Closed-book question-answering (QA) is a challenging task that requires a model to directly answer questions without access to external knowledge. It has been shown that directly fine-tuning pre-trained language models with (question, answer) examples yields surprisingly competitive performance, which is further improved upon through adding an intermediate pre-training stage between general pre-training and fine-tuning. Prior work used a heuristic during this intermediate stage, whereby named entities and dates are masked, and the model is trained to recover these tokens. In this paper, we aim to learn the optimal masking strategy for the intermediate pre-training stage. We first train our masking policy to extract spans that are likely to be tested, using supervision from the downstream task itself, then deploy the learned policy during intermediate pre-training. Thus, our policy packs task-relevant knowledge into the parameters of a language model. Our approach is particularly effective on TriviaQA, outperforming strong heuristics when used to pre-train BART.
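
To make the masking idea concrete, here is a minimal, self-contained sketch of masking the spans a policy scores highest before building an intermediate pre-training example. The `policy_score` callable and the toy capitalization heuristic are hypothetical stand-ins for the learned policy described in the abstract, not the authors' implementation, and the sentinel-token target only approximates the T5/BART style.

```python
# Illustrative sketch: mask the candidate spans a (learned) policy scores as
# most "answer-like", producing an (input, target) pair for intermediate
# pre-training. `policy_score` is a hypothetical stand-in for the learned policy.
from typing import Callable, List, Tuple

def mask_top_spans(
    tokens: List[str],
    candidate_spans: List[Tuple[int, int]],      # (start, end) indices, end exclusive
    policy_score: Callable[[List[str], Tuple[int, int]], float],
    num_masks: int = 1,
) -> Tuple[str, str]:
    """Return (masked input, target) for one intermediate pre-training example."""
    ranked = sorted(candidate_spans, key=lambda s: policy_score(tokens, s), reverse=True)
    chosen = sorted(ranked[:num_masks])          # keep document order
    out, targets, prev = [], [], 0
    for i, (start, end) in enumerate(chosen):
        out.extend(tokens[prev:start])
        out.append(f"<extra_id_{i}>")
        targets.append(f"<extra_id_{i}> " + " ".join(tokens[start:end]))
        prev = end
    out.extend(tokens[prev:])
    return " ".join(out), " ".join(targets)

# Toy policy that simply prefers capitalized spans (a rough entity heuristic).
toy_policy = lambda toks, span: sum(t[0].isupper() for t in toks[span[0]:span[1]])
print(mask_top_spans("Marie Curie won the Nobel Prize in 1903 .".split(),
                     [(0, 2), (4, 6), (7, 8)], toy_policy, num_masks=1))
```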

5 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This work presents Morph Call, a suite of 46 probing tasks for four Indo-European languages of different morphology (Russian, French, English and German), and proposes a new type of probing task based on the detection of guided sentence perturbations.
Abstract: The outstanding performance of transformer-based language models on a great variety of NLP and NLU tasks has stimulated interest in exploration of their inner workings. Recent research has been primarily focused on higher-level and complex linguistic phenomena such as syntax, semantics, world knowledge and common-sense. The majority of the studies is anglocentric, and little remains known regarding other languages, specifically their morphosyntactic properties. To this end, our work presents Morph Call, a suite of 46 probing tasks for four Indo-European languages of different morphology: Russian, French, English and German. We propose a new type of probing tasks based on detection of guided sentence perturbations. We use a combination of neuron-, layer- and representation-level introspection techniques to analyze the morphosyntactic content of four multilingual transformers, including their understudied distilled versions. Besides, we examine how fine-tuning on POS-tagging task affects the probing performance.
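
For readers unfamiliar with probing, the sketch below shows a generic representation-level probe: a linear classifier trained on frozen, layer-wise features to predict a (morphosyntactic) label. The random feature arrays and the scikit-learn probe are illustrative assumptions; this is not the Morph Call codebase.

```python
# Illustrative sketch of representation-level probing: fit a linear probe on
# frozen, per-layer sentence representations and report its test accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(features: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """Fit a logistic-regression probe on one layer's features; return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Toy usage: 200 sentences, 768-dim representations per layer, binary label
# (e.g., "was the sentence perturbed?"), scored layer by layer.
rng = np.random.default_rng(0)
layer_feats = [rng.normal(size=(200, 768)) for _ in range(12)]
labels = rng.integers(0, 2, size=200)
for i, feats in enumerate(layer_feats):
    print(f"layer {i:2d}: probe accuracy = {probe_layer(feats, labels):.3f}")
```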

5 citations

Posted Content
TL;DR: This article proposed a unified plug-and-play model for task-oriented dialogue, called PPTOD, which uses a multi-task pre-training strategy that allows the model to learn the primary task completion skills from heterogeneous dialog corpora.
Abstract: Pre-trained language models have been recently shown to benefit task-oriented dialogue (TOD) systems. Despite their success, existing methods often formulate this task as a cascaded generation problem which can lead to error accumulation across different sub-tasks and greater data annotation overhead. In this study, we present PPTOD, a unified plug-and-play model for task-oriented dialogue. In addition, we introduce a new dialogue multi-task pre-training strategy that allows the model to learn the primary TOD task completion skills from heterogeneous dialog corpora. We extensively test our model on three benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification. Experimental results show that PPTOD achieves new state of the art on all evaluated tasks in both high-resource and low-resource scenarios. Furthermore, comparisons against previous SOTA methods show that the responses generated by PPTOD are more factually correct and semantically coherent as judged by human annotators.
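
A rough sketch of the plug-and-play idea is given below: one seq2seq model is shared across dialogue sub-tasks, with each sub-task selected purely by a text prompt prepended to the dialogue history. The prompt strings, the `t5-small` base checkpoint, and the helper function are hypothetical illustrations of the formulation, not PPTOD's released prompts or code.

```python
# Illustrative sketch: one text-to-text model serves several task-oriented
# dialogue sub-tasks, selected by a prepended prompt. Prompts are hypothetical.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")   # assumed base checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

TASK_PREFIXES = {
    "dst": "translate dialogue to belief state: ",
    "intent": "translate dialogue to user intent: ",
    "response": "translate dialogue to system response: ",
}

def run_subtask(task: str, dialogue_history: str) -> str:
    """Run one dialogue sub-task by prepending its prompt to the dialogue context."""
    text = TASK_PREFIXES[task] + dialogue_history
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

history = "user: i need a cheap italian restaurant in the centre ."
for task in TASK_PREFIXES:
    print(task, "->", run_subtask(task, history))
```

Because the sub-tasks differ only in their prompts, heterogeneous corpora that annotate only some sub-tasks can still contribute training signal to the same model.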

5 citations

Posted Content
TL;DR: The authors propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking.
Abstract: We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages, from which the task-relevant information can be easily extracted. Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction (CoNLL04, ADE, NYT, and ACE2005 datasets), relation classification (FewRel and TACRED), and semantic role labeling (CoNLL-2005 and CoNLL-2012). We accomplish this while using the same architecture and hyperparameters for all tasks and even when training a single model to solve all tasks at the same time (multi-task learning). Finally, we show that our framework can also significantly improve the performance in a low-resource regime, thanks to better use of label semantics.
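
The sketch below illustrates the augmented-natural-language idea for joint entity and relation extraction: the model's target is the input sentence with inline annotations, which a lightweight parser converts back into structured predictions. The bracket-and-pipe markup and the regular expression are approximations chosen here for illustration and may differ from TANL's exact output format.

```python
# Illustrative sketch: decode entities and relations from an "augmented
# natural language" target string (bracketed, pipe-separated annotations).
import re
from typing import List, Tuple

SPAN = re.compile(r"\[ (?P<text>[^|\]]+?) \| (?P<type>[^|\]]+?)"
                  r"(?: \| (?P<rel>[^=\]]+?) = (?P<head>[^\]]+?))? \]")

def decode(augmented: str) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Recover (entities, relations) from an augmented output string."""
    entities, relations = [], []
    for m in SPAN.finditer(augmented):
        entities.append((m["text"].strip(), m["type"].strip()))
        if m["rel"]:
            relations.append((m["text"].strip(), m["rel"].strip(), m["head"].strip()))
    return entities, relations

target = ("[ Tolkien | person | author = The Hobbit ] wrote "
          "[ The Hobbit | book ] in 1937 .")
print(decode(target))
# ([('Tolkien', 'person'), ('The Hobbit', 'book')],
#  [('Tolkien', 'author', 'The Hobbit')])
```

Because the target stays close to natural language, the same seq2seq architecture can be reused across tasks, with only the annotation schema changing.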

5 citations

Proceedings ArticleDOI
Gabriele Pergola, Elena Kochkina, Lin Gui, Maria Liakata, Yulan He
16 Feb 2021
TL;DR: This article proposed biomedical entity-aware masking (BEM), which encourages masked language models to learn entity-centric knowledge based on the pivotal entities characterizing the domain at hand, and employ those entities to drive the LM fine-tuning.
Abstract: Biomedical question-answering (QA) has gained increased attention for its capability to provide users with high-quality information from a vast scientific literature. Although an increasing number of biomedical QA datasets has been recently made available, those resources are still rather limited and expensive to produce. Transfer learning via pre-trained language models (LMs) has been shown as a promising approach to leverage existing general-purpose knowledge. However, fine-tuning these large models can be costly and time consuming, often yielding limited benefits when adapting to specific themes of specialised domains, such as the COVID-19 literature. To bootstrap further their domain adaptation, we propose a simple yet unexplored approach, which we call biomedical entity-aware masking (BEM). We encourage masked language models to learn entity-centric knowledge based on the pivotal entities characterizing the domain at hand, and employ those entities to drive the LM fine-tuning. The resulting strategy is a downstream process applicable to a wide variety of masked LMs, not requiring additional memory or components in the neural architectures. Experimental results show performance on par with state-of-the-art models on several biomedical QA datasets.
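
A minimal sketch of entity-aware masking is shown below: given a list of pivotal domain entities, their mentions are masked (rather than random tokens) when building masked-LM fine-tuning examples. The single-token matching, masking rate, and example entity list are simplifying assumptions, not the authors' BEM implementation.

```python
# Illustrative sketch: mask tokens that mention pivotal domain entities when
# constructing masked-LM fine-tuning examples. Entity list and rate are assumed.
import random
from typing import List, Set

def entity_aware_mask(tokens: List[str],
                      entities: Set[str],
                      mask_token: str = "[MASK]",
                      mask_prob: float = 0.5,
                      seed: int = 0) -> List[str]:
    """Mask each token that belongs to a pivotal entity with probability mask_prob."""
    rng = random.Random(seed)
    lowered = {e.lower() for e in entities}
    return [mask_token if t.lower() in lowered and rng.random() < mask_prob else t
            for t in tokens]

sentence = "Remdesivir inhibits SARS-CoV-2 replication in human lung cells".split()
pivotal = {"Remdesivir", "SARS-CoV-2"}
print(" ".join(entity_aware_mask(sentence, pivotal, mask_prob=1.0)))
# [MASK] inhibits [MASK] replication in human lung cells
```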

5 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.