Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020 - Journal of Machine Learning Research - Vol. 21, Iss. 140, pp. 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
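
As a rough illustration of the text-to-text format described in the abstract, the Python sketch below casts three kinds of task as plain (input text, target text) pairs, so that one model with one training objective could in principle handle all of them. The function names and the exact prefix strings are assumptions chosen for illustration, modeled on the kind of task prefixes the paper uses; this is not the authors' released code.

```python
from typing import List, Tuple

# Each helper returns an (input_text, target_text) pair.
# Prefix strings are illustrative assumptions, not a fixed specification.

def translation_example(source: str, target: str) -> Tuple[str, str]:
    """Translation: the target is simply the translated sentence."""
    return (f"translate English to German: {source}", target)

def summarization_example(document: str, summary: str) -> Tuple[str, str]:
    """Summarization: the target is the reference summary."""
    return (f"summarize: {document}", summary)

def classification_example(
    sentence: str, label: int, label_names: List[str]
) -> Tuple[str, str]:
    """Classification: the class index is replaced by its name as text."""
    return (f"cola sentence: {sentence}", label_names[label])

examples = [
    translation_example("That is good.", "Das ist gut."),
    summarization_example("A long news article ...", "A short summary."),
    classification_example(
        "The course is jumping well.", 0, ["unacceptable", "acceptable"]
    ),
]

for input_text, target_text in examples:
    print(repr(input_text), "->", repr(target_text))
```

Because every task reduces to the same string-to-string shape, a classification label is predicted by generating its name as text rather than by a task-specific output layer.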

Citations

Posted Content

XLM-T: A Multilingual Language Model Toolkit for Twitter.

Francesco Barbieri, Luis Espinosa Anke¹, Jose Camacho-Collados¹

Cardiff University¹

25 Apr 2021 - arXiv: Computation and Language

TL;DR: This article introduced XLM-T, a framework for using and evaluating multilingual language models in Twitter, which can be extended to additional tasks, as well as integrated with recent efforts also aimed at the homogenization of Twitter-specific datasets.

Abstract: Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have almost exclusively focused on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a framework for using and evaluating multilingual language models in Twitter. This framework features two main assets: (1) a strong multilingual baseline consisting of an XLM-R (Conneau et al. 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code to subsequently fine-tune on a target task; and (2) a set of unified sentiment analysis Twitter datasets in eight different languages. This is a modular framework that can easily be extended to additional tasks, as well as integrated with recent efforts also aimed at the homogenization of Twitter-specific datasets (Barbieri et al. 2020).

8 citations

Proceedings Article

General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference

Jingfei Du¹, Myle Ott¹, Haoran Li¹, Xing Zhou¹, Veselin Stoyanov¹

Facebook¹

29 Apr 2020

TL;DR: This work aims to reduce the inference cost in a setting where many different predictions are made on a single piece of text, and shows that through binary quantization, it can reduce the size of the extracted representations by a factor of 16 to store them for later use.

Abstract: The state of the art on many NLP tasks is currently achieved by large pre-trained language models, which require a considerable amount of computation. We aim to reduce the inference cost in a setting where many different predictions are made on a single piece of text. In that case, computational cost during inference can be amortized over the different predictions (tasks) using a shared text encoder. We compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks. We also compare ways of extracting fixed- and limited-size representations from this encoder, including pooling features extracted from multiple layers or positions. Our best approach compares favorably to knowledge distillation, achieving higher accuracy and lower computational cost once the system is handling around 7 tasks. Further, we show that through binary quantization, we can reduce the size of the extracted representations by a factor of 16 to store them for later use. The resulting method offers a compelling solution for using large-scale pre-trained models at a fraction of the computational cost when multiple tasks are performed on the same text.
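
One way to read the reported factor of 16 is sketched below in Python: if each embedding dimension is stored as a 16-bit float, keeping only one bit per dimension shrinks the stored vector sixteenfold. Both the 16-bit starting precision and the sign-based binarization rule are assumptions made for this illustration; the abstract does not state either, and the helper names are hypothetical.

```python
import random
import struct
from typing import List

def pack_float16(values: List[float]) -> bytes:
    """Store each dimension as a 16-bit float (2 bytes per dimension)."""
    return struct.pack(f"<{len(values)}e", *values)

def binarize(values: List[float]) -> bytes:
    """Keep one bit per dimension (1 if the value is positive, else 0).

    Bits are packed most-significant first; if the dimension count is not
    a multiple of 8, the result is left-padded with zero bits.
    """
    bits = 0
    for v in values:
        bits = (bits << 1) | (1 if v > 0 else 0)
    n_bytes = (len(values) + 7) // 8
    return bits.to_bytes(n_bytes, "big")

random.seed(0)
embedding = [random.gauss(0.0, 1.0) for _ in range(1024)]  # toy 1024-d vector

full = pack_float16(embedding)
packed = binarize(embedding)
ratio = len(full) / len(packed)
print(len(full), len(packed), ratio)
```

Under these assumptions a 1024-dimensional vector takes 2048 bytes as 16-bit floats and 128 bytes after binarization.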

8 citations

Posted Content

An Investigation of the (In)effectiveness of Counterfactually Augmented Data.

Nitish Joshi¹, He He¹

New York University¹

01 Jul 2021 - arXiv: Computation and Language

TL;DR: The authors show that the lack of perturbation diversity in current CAD datasets limits its effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbations of examples.

Abstract: While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD) -- data generated by minimally perturbing examples to flip the ground-truth label -- to identify robust features that are invariant under distribution shift. However, empirical results using CAD for OOD generalization have been mixed. To explain this discrepancy, we draw insights from a linear Gaussian model and demonstrate the pitfalls of CAD. Specifically, we show that (a) while CAD is effective at identifying robust features, it may prevent the model from learning unperturbed robust features, and (b) CAD may exacerbate existing spurious correlations in the data. Our results show that the lack of perturbation diversity in current CAD datasets limits its effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbation of examples.

8 citations

Posted Content

Label Verbalization and Entailment for Effective Zero- and Few-Shot Relation Extraction

Oscar Sainz, Oier Lopez de Lacalle¹, Gorka Labaka, Ander Barrena¹, Eneko Agirre¹

University of the Basque Country¹

08 Sep 2021 - arXiv: Computation and Language

TL;DR: This article reformulated relation extraction as an entailment task, with simple, hand-made, verbalizations of relations produced in less than 15 min per relation, achieving state-of-the-art performance on TACRED.

Abstract: Relation extraction systems require large amounts of labeled examples which are costly to annotate. In this work we reformulate relation extraction as an entailment task, with simple, hand-made, verbalizations of relations produced in less than 15 min per relation. The system relies on a pretrained textual entailment engine which is run as-is (no training examples, zero-shot) or further fine-tuned on labeled examples (few-shot or fully trained). In our experiments on TACRED we attain 63% F1 zero-shot, 69% with 16 examples per relation (17% points better than the best supervised system on the same conditions), and only 4 points short to the state-of-the-art (which uses 20 times more training data). We also show that the performance can be improved significantly with larger entailment models, up to 12 points in zero-shot, allowing to report the best results to date on TACRED when fully trained. The analysis shows that our few-shot systems are specially effective when discriminating between relations, and that the performance difference in low data regimes comes mainly from identifying no-relation cases.
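
The entailment reformulation described in this abstract can be sketched roughly as follows in Python: each relation gets a few hand-written verbalization templates, the input sentence serves as the premise, each filled-in template serves as a hypothesis, and the relation whose hypothesis is most strongly entailed is returned, falling back to no-relation below a threshold. The templates, the threshold value, and especially `toy_entailment_score` (a word-overlap stand-in for a pretrained entailment model) are assumptions for illustration and not the authors' system.

```python
from typing import Callable, Dict, List

# Hypothetical verbalizations; a real system would cover every relation.
TEMPLATES: Dict[str, List[str]] = {
    "per:city_of_birth": ["{subj} was born in {obj}."],
    "org:founded_by": ["{obj} founded {subj}.", "{subj} was founded by {obj}."],
}

def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis words found in the premise.

    A real setup would call a pretrained textual entailment model here.
    """
    p = set(premise.lower().rstrip(".").split())
    h = set(hypothesis.lower().rstrip(".").split())
    return len(p & h) / len(h)

def predict_relation(
    sentence: str,
    subj: str,
    obj: str,
    score: Callable[[str, str], float] = toy_entailment_score,
    threshold: float = 0.8,
) -> str:
    """Return the best-entailed relation, or "no_relation" if none clears the threshold."""
    best_relation, best_score = "no_relation", threshold
    for relation, templates in TEMPLATES.items():
        for template in templates:
            hypothesis = template.format(subj=subj, obj=obj)
            s = score(sentence, hypothesis)
            if s > best_score:
                best_relation, best_score = relation, s
    return best_relation

print(predict_relation("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw"))
print(predict_relation("Marie Curie visited Paris.", "Marie Curie", "Paris"))
```

In this sketch the zero-shot setting corresponds to using the scorer as-is, while the few-shot setting would fine-tune the underlying entailment model on a handful of labeled examples.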

8 citations

Posted Content

ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning.

Yujia Qin¹, Yankai Lin², Ryuichi Takanobu¹, Zhiyuan Liu¹, Peng Li³, Heng Ji, Minlie Huang¹, Maosong Sun¹, Jie Zhou²

Tsinghua University¹, Tencent², Dalian University of Technology³

30 Dec 2020 - arXiv: Computation and Language

TL;DR: This article proposed a contrastive learning framework to obtain a deep understanding of the entities and their relations in text, which can improve relation extraction, entity typing, and question answering under low-resource settings.

Abstract: Pre-trained Language Models (PLMs) have shown superior performance on various downstream Natural Language Processing (NLP) tasks. However, conventional pre-training objectives do not explicitly model relational facts in text, which are crucial for textual understanding. To address this issue, we propose a novel contrastive learning framework ERICA to obtain a deep understanding of the entities and their relations in text. Specifically, we define two novel pre-training tasks to better understand entities and relations: (1) the entity discrimination task to distinguish which tail entity can be inferred by the given head entity and relation; (2) the relation discrimination task to distinguish whether two relations are close or not semantically, which involves complex relational reasoning. Experimental results demonstrate that ERICA can improve typical PLMs (BERT and RoBERTa) on several language understanding tasks, including relation extraction, entity typing and question answering, especially under low-resource settings.

8 citations