Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020 - Journal of Machine Learning Research - Vol. 21, Iss. 140, pp. 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
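As a rough illustration of what "text-to-text format" means here, the sketch below casts labeled examples from different tasks into plain (input text, target text) pairs. The helper name `to_text_to_text`, the "name: value" field layout, and the sample sentences are assumptions made for this example, not the released code; the only idea taken from the paper is that a short task prefix is prepended to the input and that labels are emitted as words.

```python
from typing import Dict, Tuple


def to_text_to_text(task: str, fields: Dict[str, str], target: str) -> Tuple[str, str]:
    """Cast one labeled example into an (input text, target text) pair.

    Hypothetical helper: the task name becomes a plain-text prefix and each
    input field is written as "name: value", so every task shares one format.
    """
    parts = [f"{name}: {value}" for name, value in fields.items()]
    source = f"{task} " + " ".join(parts)
    return source, target


# Classification: the label is a word, not a class index.
cls_pair = to_text_to_text("cola", {"sentence": "The book was read by me."}, "acceptable")

# Sentence-pair task: both inputs are flattened into a single string.
pair_pair = to_text_to_text(
    "mnli",
    {"hypothesis": "A man sleeps.", "premise": "A man is running."},
    "contradiction",
)

print(cls_pair)
print(pair_pair)
```

Because every task reduces to string-in, string-out, the same model, loss, and decoding procedure can in principle be reused across tasks, which is what makes the systematic comparison described in the abstract possible.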

Citations

Posted Content•

jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models.

Yada Pruksachatkun¹, Philip Yeres¹, Haokun Liu¹, Jason Phang¹, Phu Mon Htut¹, Alex Wang¹, Ian Tenney², Samuel R. Bowman¹

New York University¹, Google²

04 Mar 2020 - arXiv: Computation and Language

TL;DR: This paper introduces jiant, an open source toolkit for conducting multitask and transfer learning experiments on English NLU tasks, and demonstrates that jiant reproduces published performance on a variety of tasks and models.

Abstract: We introduce jiant, an open source toolkit for conducting multitask and transfer learning experiments on English NLU tasks. jiant enables modular and configuration-driven experimentation with state-of-the-art models and implements a broad set of tasks for probing, transfer learning, and multitask training experiments. jiant implements over 50 NLU tasks, including all GLUE and SuperGLUE benchmark tasks. We demonstrate that jiant reproduces published performance on a variety of tasks and models, including BERT and RoBERTa. jiant is available at this https URL.

24 citations

Proceedings Article•

Structured Prediction as Translation between Augmented Natural Languages

Giovanni Paolini¹, Ben Athiwaratkun¹, Jason Krone¹, Jie Ma², Alessandro Achille¹, Rishita Anubhai³, Cicero Nogueira dos Santos¹, Bing Xiang¹, Stefano Soatto⁴

Amazon.com¹, Cornell University², Stanford University³, University of California, Los Angeles⁴

03 May 2021

TL;DR: This paper proposes a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking.

Abstract: We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages, from which the task-relevant information can be easily extracted. Our approach can match or outperform task-specific models on all tasks, and in particular achieves new state-of-the-art results on joint entity and relation extraction (CoNLL04, ADE, NYT, and ACE2005 datasets), relation classification (FewRel and TACRED), and semantic role labeling (CoNLL-2005 and CoNLL-2012). We accomplish this while using the same architecture and hyperparameters for all tasks, and even when training a single model to solve all tasks at the same time (multi-task learning). Finally, we show that our framework can also significantly improve the performance in a low-resource regime, thanks to better use of label semantics.

24 citations

Posted Content•

Noisy Self-Knowledge Distillation for Text Summarization

Yang Liu¹, Sheng Shen², Mirella Lapata³

Microsoft¹, University of California, Berkeley², University of Edinburgh³

15 Sep 2020 - arXiv: Computation and Language

TL;DR: This paper applies self-knowledge distillation to text summarization, arguing that it can alleviate problems with maximum-likelihood training on single reference and noisy datasets, and introduces multiple noise signals for both teacher and student models.

Abstract: In this paper we apply self-knowledge distillation to text summarization, which we argue can alleviate problems with maximum-likelihood training on single reference and noisy datasets. Instead of relying on one-hot annotation labels, our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training. Furthermore, to better model uncertainty during training, we introduce multiple noise signals for both teacher and student models. We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers, achieving state-of-the-art results.
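To make the contrast between one-hot labels and teacher-smoothed labels concrete, here is a minimal sketch for a single decoding step. It assumes the soft target is a linear blend of the gold one-hot vector and the teacher's predicted distribution, weighted by a hypothetical coefficient `lam`; the abstract does not spell out the exact loss, and the noise signals it mentions are not modeled here.

```python
import math
from typing import List


def smoothed_target(gold_index: int, teacher_probs: List[float], lam: float) -> List[float]:
    """Blend the one-hot gold label with the teacher's distribution.

    lam = 0 recovers plain one-hot (maximum-likelihood) training;
    lam = 1 trains the student purely on the teacher's output.
    """
    return [
        (1.0 - lam) * (1.0 if i == gold_index else 0.0) + lam * p
        for i, p in enumerate(teacher_probs)
    ]


def cross_entropy(target: List[float], student_probs: List[float]) -> float:
    """Cross-entropy of the student's prediction against a (soft) target."""
    return -sum(t * math.log(q) for t, q in zip(target, student_probs) if t > 0.0)


# Toy three-word vocabulary; all probabilities are made up for illustration.
teacher = [0.6, 0.3, 0.1]
student = [0.5, 0.4, 0.1]
target = smoothed_target(gold_index=0, teacher_probs=teacher, lam=0.5)
loss = cross_entropy(target, student)
print(target, loss)
```

With a soft target, the student is no longer pushed to put all probability mass on the single reference token, which is the regularizing effect the abstract attributes to the teacher's guidance.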

24 citations

Posted Content•

Constrained Language Models Yield Few-Shot Semantic Parsers

Richard Shin¹, Christopher H. Lin², Sam Thomson², Charles Chen², Subhro Roy², Emmanouil Antonios Platanios², Adam Pauls¹, Dan Klein¹, Jason Eisner³, Benjamin Van Durme³

University of California, Berkeley¹, Microsoft², Johns Hopkins University³

18 Apr 2021 - arXiv: Computation and Language

TL;DR: This article explores the use of large pre-trained language models as few-shot semantic parsers that generate a structured meaning representation given a natural language input, and demonstrates good performance on multiple tasks.

Abstract: We explore the use of large pretrained language models as few-shot semantic parsers. The goal in semantic parsing is to generate a structured meaning representation given a natural language input. However, language models are trained to generate natural language. To bridge the gap, we use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation. With a small amount of data and very little code to convert into English-like representations, we provide a blueprint for rapidly bootstrapping semantic parsers and demonstrate good performance on multiple tasks.

23 citations

Proceedings Article•

Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity

Nina Poerner¹, Ulli Waltinger², Hinrich Schütze¹

Ludwig Maximilian University of Munich¹, Siemens²

06 Jul 2020

TL;DR: This work addresses the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings, and applies, extends and evaluates different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction and generalized Canonical Correlation Analysis.

Abstract: We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015) and cross-view auto-encoders (Bollegala and Bao, 2018). Our sentence meta-embeddings set a new unsupervised State of The Art (SoTA) on the STS Benchmark and on the STS12-STS16 datasets, with gains of between 3.7% and 6.4% Pearson's r over single-source systems.

23 citations