Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
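
As a rough illustration of what a text-to-text format means in practice, the sketch below casts three kinds of tasks into (input text, target text) pairs that a single sequence-to-sequence model could train on. The function name, the dictionary field names, the example sentences, and the exact prefix strings are assumptions made for this illustration; the abstract above does not specify them.

```python
# Illustrative sketch only: the prefixes, field names, and examples below are
# assumptions for demonstration, not details stated in the abstract.
from typing import Dict, Tuple


def to_text_to_text(task: str, example: Dict[str, str]) -> Tuple[str, str]:
    """Cast one labeled example into an (input text, target text) pair."""
    if task == "translation":
        return "translate English to German: " + example["source"], example["target"]
    if task == "summarization":
        return "summarize: " + example["document"], example["summary"]
    if task == "classification":
        # The class label is emitted as a literal string, not a class index.
        return "classify: " + example["sentence"], example["label"]
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    pairs = [
        to_text_to_text("translation", {"source": "That is good.", "target": "Das ist gut."}),
        to_text_to_text("summarization", {"document": "A long news story.", "summary": "A short one."}),
        to_text_to_text("classification", {"sentence": "The movie was great.", "label": "positive"}),
    ]
    for source, target in pairs:
        print(repr(source), "->", repr(target))
```

Because every task shares the same string-in, string-out interface, one model, one loss, and one decoding procedure can in principle serve all of them, which is the property the abstract relies on for its comparison across tasks.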

Citations

Posted Content•

Domain-robust VQA with diverse datasets and methods but no target labels

Mingda Zhang¹, Tristan Maidment¹, Ahmad Diab¹, Adriana Kovashka¹, Rebecca Hwa¹

University of Pittsburgh¹

29 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors quantify domain shifts between popular VQA datasets, in both visual and textual space, and test the robustness of different families of visual question answering methods (two-stream, transformer and neuro-symbolic) to these shifts.

Abstract: The observation that computer vision methods overfit to dataset specifics has inspired diverse attempts to make object recognition models robust to domain shifts. However, similar work on domain-robust visual question answering methods is very limited. Domain adaptation for VQA differs from adaptation for object recognition due to additional complexity: VQA models handle multimodal inputs, methods contain multiple steps with diverse modules resulting in complex optimization, and answer spaces in different datasets are vastly different. To tackle these challenges, we first quantify domain shifts between popular VQA datasets, in both visual and textual space. To disentangle shifts between datasets arising from different modalities, we also construct synthetic shifts in the image and question domains separately. Second, we test the robustness of different families of VQA methods (classic two-stream, transformer, and neuro-symbolic methods) to these shifts. Third, we test the applicability of existing domain adaptation methods and devise a new one to bridge VQA domain gaps, adjusted to specific VQA models. To emulate the setting of real-world generalization, we focus on unsupervised domain adaptation and the open-ended classification task formulation.

Posted Content•

Understanding How Encoder-Decoder Architectures Attend

Kyle Aitken¹, Vinay Ramasesh², Yuan Cao³, Niru Maheswaranathan³

University of Washington¹, University of California, Berkeley², Google³

28 Oct 2021-arXiv: Learning

TL;DR: This paper decomposes hidden states over a sequence into temporal (independent of input) and input-driven components, and shows that, depending on the task requirements, networks rely more heavily on either the temporal or the input-driven components.

Abstract: Encoder-decoder networks with attention have proven to be a powerful way to solve many sequence-to-sequence tasks. In these networks, attention aligns encoder and decoder states and is often used for visualizing network behavior. However, the mechanisms used by networks to generate appropriate attention matrices are still mysterious. Moreover, how these mechanisms vary depending on the particular architecture used for the encoder and decoder (recurrent, feed-forward, etc.) are also not well understood. In this work, we investigate how encoder-decoder networks solve different sequence-to-sequence tasks. We introduce a way of decomposing hidden states over a sequence into temporal (independent of input) and input-driven (independent of sequence position) components. This reveals how attention matrices are formed: depending on the task requirements, networks rely more heavily on either the temporal or input-driven components. These findings hold across both recurrent and feed-forward architectures despite their differences in forming the temporal components. Overall, our results provide new insight into the inner workings of attention-based encoder-decoder networks.

Posted Content•

Towards Efficient NLP: A Standard Evaluation and A Strong Baseline.

Xiangyang Liu, Tianxiang Sun, Junliang He, Lingling Wu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu

13 Oct 2021-arXiv: Computation and Language

TL;DR: The ELUE (Efficient Language Understanding Evaluation) benchmark depicts the Pareto front for various language understanding tasks, so that it can tell whether, and by how much, a method achieves a Pareto improvement.

Abstract: Supersized pre-trained language models have pushed the accuracy of various NLP tasks to a new state-of-the-art (SOTA). Rather than pursuing the reachless SOTA accuracy, most works are pursuing improvement on other dimensions such as efficiency, leading to "Pareto SOTA". Different from accuracy, the metric for efficiency varies across different studies, making them hard to be fairly compared. To that end, this work presents ELUE (Efficient Language Understanding Evaluation), a standard evaluation, and a public leaderboard for efficient NLP models. ELUE is dedicated to depicting the Pareto Front for various language understanding tasks, such that it can tell whether and how much a method achieves Pareto improvement. Along with the benchmark, we also pre-train and release a strong baseline, ElasticBERT, whose elasticity is both static and dynamic. ElasticBERT is static in that it allows reducing model layers on demand. ElasticBERT is dynamic in that it selectively executes parts of model layers conditioned on the input. We demonstrate the ElasticBERT, despite its simplicity, outperforms or performs on par with SOTA compressed and early exiting models. The ELUE benchmark is publicly available at this http URL
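
To make the Pareto-front idea in this abstract concrete, here is a minimal sketch that treats each model as an (accuracy, cost) point, keeps the points that no other point dominates, and checks whether a new candidate would join that front. The `Model` type, the function names, and all numbers are hypothetical; this is a generic two-objective dominance check, not ELUE's actual scoring procedure.

```python
# Minimal sketch of a two-objective Pareto front. All names and numbers are
# hypothetical; this is not the ELUE benchmark's own scoring code.
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # e.g. FLOPs or latency; lower is better


def dominates(a: Model, b: Model) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.accuracy >= b.accuracy and a.cost <= b.cost
            and (a.accuracy > b.accuracy or a.cost < b.cost))


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep the models that no other model dominates, sorted by cost."""
    front = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(front, key=lambda m: m.cost)


def would_join_front(candidate: Model, models: List[Model]) -> bool:
    """True if no existing model dominates the candidate."""
    return not any(dominates(o, candidate) for o in models)


if __name__ == "__main__":
    existing = [Model("a", 0.80, 10.0), Model("b", 0.85, 20.0), Model("c", 0.78, 15.0)]
    print([m.name for m in pareto_front(existing)])
    print(would_join_front(Model("d", 0.83, 12.0), existing))
```

Reporting a front rather than a single accuracy number lets methods that trade a little accuracy for a large efficiency gain be compared on equal terms, which is the motivation the abstract gives for "Pareto SOTA".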

Proceedings Article•DOI•

Towards Efficient NLP: A Standard Evaluation and A Strong Baseline

01 Jan 2022

TL;DR: Liu et al. published this work in the Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Abstract: Xiangyang Liu, Tianxiang Sun, Junliang He, Jiawen Wu, Lingling Wu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

Posted Content•

CIDER: Commonsense Inference for Dialogue Explanation and Reasoning

Deepanway Ghosal¹, Pengfei Hong¹, Siqi Shen², Navonil Majumder¹, Rada Mihalcea², Soujanya Poria¹

Singapore University of Technology and Design¹, University of Michigan²

01 Jun 2021-arXiv: Computation and Language

TL;DR: The CIDER dataset contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference; the triplets are categorized by the type of commonsense knowledge present.

Abstract: Commonsense inference to understand and explain human language is a fundamental research problem in natural language processing. Explaining human conversations poses a great challenge as it requires contextual understanding, planning, inference, and several aspects of reasoning including causal, temporal, and commonsense reasoning. In this work, we introduce CIDER -- a manually curated dataset that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference. Extracting such rich explanations from conversations can be conducive to improving several downstream applications. The annotated triplets are categorized by the type of commonsense knowledge present (e.g., causal, conditional, temporal). We set up three different tasks conditioned on the annotated dataset: Dialogue-level Natural Language Inference, Span Extraction, and Multi-choice Span Selection. Baseline results obtained with transformer-based models reveal that the tasks are difficult, paving the way for promising future research. The dataset and the baseline implementations are publicly available at this https URL.
