Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020, Journal of Machine Learning Research, Vol. 21, Iss. 140, pp. 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
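
To make the "text-to-text format" concrete, here is a minimal sketch of how different tasks could be cast as an (input string, target string) pair, so that one sequence-to-sequence model can be trained on all of them. The function name `to_text_to_text`, the dictionary keys, and the exact prefix strings are assumptions chosen for illustration; they are not taken from the released code.

```python
def to_text_to_text(task, example):
    """Cast one labeled example into an (input_text, target_text) pair.

    The prefixes and field names below are hypothetical; the point is only
    that every task, including classification, ends up as string -> string.
    """
    if task == "translation":
        source = (
            f"translate {example['src_lang']} to {example['tgt_lang']}: "
            f"{example['text']}"
        )
        return source, example["translation"]
    if task == "summarization":
        return "summarize: " + example["document"], example["summary"]
    if task == "classification":
        # The class label is emitted as a word, not as a class index.
        return f"{example['task_name']}: {example['text']}", example["label_name"]
    raise ValueError(f"unknown task: {task}")


pair = to_text_to_text(
    "classification",
    {"task_name": "sentiment", "text": "A delightful film.", "label_name": "positive"},
)
print(pair)
```

Because the target is always text, the same model, loss, and decoding procedure can be reused across tasks; only the string construction differs.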

Citations

Proceedings Article

CommonsenseQA 2.0: Exposing the Limits of AI through Gamification

Alon Talmor¹, Ori Yoran¹, Ronan Le Bras², Chandra Bhagavatula², Yoav Goldberg³, Yejin Choi⁴, Jonathan Berant¹

Tel Aviv University¹, Allen Institute for Artificial Intelligence², Bar-Ilan University³, University of Washington⁴

07 Jun 2021

TL;DR: This work proposes gamification as a framework for data construction and creates CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrates its difficulty for models that are orders-of-magnitude larger than the AI used in the game itself.

Abstract: Constructing benchmarks that test the abilities of modern natural language understanding models is difficult – pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI, while using specific phrases for extra points. The game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale. Using our method we create CommonsenseQA 2.0, which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders-of-magnitude larger than the AI used in the game itself. Our best baseline, the T5-based UNICORN with 11B parameters achieves an accuracy of 70.2%, substantially higher than GPT-3 (52.9%) in a few-shot inference setup. Both score well below human performance which is at 94.1%.

62 citations

Book Chapter

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

Vishvak Murahari¹, Dhruv Batra¹, Devi Parikh¹, Abhishek Das¹

Georgia Institute of Technology¹

23 Aug 2020

TL;DR: This article adapts the ViLBERT model for multi-turn visually-grounded conversations; the model is pretrained on the Conceptual Captions and Visual Question Answering datasets and finetuned on VisDial.

Abstract: Prior work in visual dialog has focused on training deep neural models on VisDial in isolation. Instead, we present an approach to leverage pretraining on related vision-language datasets before transferring to visual dialog. We adapt the recently proposed ViLBERT model for multi-turn visually-grounded conversations. Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial. Our best single model outperforms prior published work by 1% absolute on NDCG and MRR.

62 citations

Posted Content

ViViT: A Video Vision Transformer

Anurag Arnab¹, Mostafa Dehghani¹, Georg Heigold², Chen Sun³, Mario Lucic¹, Cordelia Schmid¹

Google¹, German Research Centre for Artificial Intelligence², Brown University³

29 Mar 2021, arXiv: Computer Vision and Pattern Recognition

TL;DR: This article proposes pure-transformer models for video classification that extract spatio-temporal tokens from the input video and encode them with a series of transformer layers.

Abstract: We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.
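
A rough back-of-the-envelope sketch of why factorising space and time helps with long token sequences. The abstract does not give the token shape or the exact factorisation, so everything here is an assumption: tokens are taken to be non-overlapping blocks of the clip, and "factorised" is read as attending first among tokens that share a time index and then among tokens that share a spatial position. The function names and the example clip size are hypothetical.

```python
def token_grid(frames, height, width, t, h, w):
    """Count non-overlapping t x h x w blocks along each axis of a video clip."""
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the block size")
    return frames // t, height // h, width // w


def attention_pairs(n_t, n_s):
    """Compare query-key pair counts for joint vs. factorised attention.

    n_t: number of temporal positions, n_s: number of spatial positions.
    Joint attention lets every token attend to every other token.
    The factorised count assumes one spatial pass (within each time index)
    followed by one temporal pass (within each spatial position).
    """
    joint = (n_t * n_s) ** 2
    factorised = n_t * n_s**2 + n_s * n_t**2
    return joint, factorised


# Hypothetical clip: 32 frames of 224 x 224 pixels, split into 2 x 16 x 16 blocks.
n_t, n_h, n_w = token_grid(32, 224, 224, 2, 16, 16)
joint, factorised = attention_pairs(n_t, n_h * n_w)
print(n_t * n_h * n_w, joint, factorised)  # 3136 9834496 664832
```

Under these assumptions the factorised scheme compares roughly fifteen times fewer token pairs than joint attention for this clip size; the actual savings depend on the variant and on the block shape.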

62 citations

Proceedings Article

Universal Natural Language Processing with Limited Annotations: Try Few-shot Textual Entailment as a Start

Wenpeng Yin¹, Nazneen Fatema Rajani¹, Dragomir R. Radev², Richard Socher¹, Caiming Xiong¹

Salesforce.com¹, Yale University²

01 Nov 2020

TL;DR: This work introduces Universal Few-shot textual Entailment (UFO-Entail), demonstrates that the framework enables a pretrained entailment model to work well on new entailment domains in a few-shot setting, and shows its effectiveness as a unified solver for several downstream NLP tasks such as question answering and coreference resolution when the end-task annotations are limited.

Abstract: A standard way to address different NLP problems is by first constructing a problem-specific dataset, then building a model to fit this dataset. To build the ultimate artificial intelligence, we desire a single machine that can handle diverse new problems, for which task-specific annotations are limited. We bring up textual entailment as a unified solver for such NLP problems. However, current research on textual entailment has not spilled much ink on the following questions: (i) How well does a pretrained textual entailment system generalize across domains with only a handful of domain-specific examples? and (ii) When is it worth transforming an NLP task into textual entailment? We argue that the transformation is unnecessary if we can obtain rich annotations for this task. Textual entailment matters particularly when the target NLP task has insufficient annotations. Universal NLP can probably be achieved through different routes. In this work, we introduce Universal Few-shot textual Entailment (UFO-Entail). We demonstrate that this framework enables a pretrained entailment model to work well on new entailment domains in a few-shot setting, and show its effectiveness as a unified solver for several downstream NLP tasks such as question answering and coreference resolution when the end-task annotations are limited.

61 citations

Proceedings Article

Improve Transformer Models with Better Relative Position Embeddings

Zhiheng Huang¹, Davis Liang¹, Peng Xu², Bing Xiang³

Amazon.com¹, Henan Normal University², Google³

01 Nov 2020

TL;DR: This paper proposes new methods to encourage increased interaction between query, key and relative position embeddings in the self-attention mechanism and demonstrates empirically that the relative embedding method can be reasonably generalized to and is robust in the inductive perspective.

Abstract: The transformer model has demonstrated superior results on NLP tasks including machine translation and question answering. In this paper, we argue that the position information is not fully utilized in existing work. For example, the initial proposal of a sinusoid embedding is fixed and not learnable. In this paper, we first review the absolute position embeddings and existing relative position embedding methods. We then propose new methods to encourage increased interaction between query, key and relative position embeddings in the self-attention mechanism. Our most promising approach is a generalization of the absolute position embedding. Our method results in increased accuracy compared to previous approaches in absolute and relative position embeddings on the SQuAD1.1 dataset. In addition, we address the inductive property of whether a position embedding can be robust enough to handle long sequences. We demonstrate empirically that our relative embedding method can be reasonably generalized to and is robust in the inductive perspective. Finally, we show that our proposed method can be effectively and efficiently adopted as a near drop-in replacement for improving the accuracy of large models with little computational overhead.
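
For readers unfamiliar with the distinction the abstract draws, the sketch below shows the clipped relative-distance indexing commonly used by existing relative position embedding schemes. It is not one of the paper's proposed methods, which the abstract does not specify; the function name and the `max_distance` parameter are assumptions for illustration.

```python
def relative_position_index(seq_len, max_distance):
    """Map every (query i, key j) pair to a row of a relative embedding table.

    The signed offset j - i is clipped to [-max_distance, max_distance] and
    shifted to be non-negative, so the table needs only 2 * max_distance + 1
    rows regardless of sequence length.
    """
    return [
        [max(-max_distance, min(max_distance, j - i)) + max_distance
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in relative_position_index(4, 2):
    print(row)
# [2, 3, 4, 4]
# [1, 2, 3, 4]
# [0, 1, 2, 3]
# [0, 0, 1, 2]
```

Because the index depends only on the offset between two positions and not on their absolute values, the same table can be applied to sequences longer than those seen in training, which is the kind of length robustness the abstract calls the inductive property.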

61 citations