Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
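As a rough illustration of what "converting every problem into a text-to-text format" can mean in practice, the sketch below casts three task types into (source text, target text) pairs. The prefix strings, function names, and label names are illustrative assumptions, not taken from this abstract or from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One training pair: the model reads `source` and must generate `target`."""
    source: str
    target: str


def cast_translation(src_lang: str, tgt_lang: str, text: str, translation: str) -> TextToTextExample:
    # The task is named in a plain-text prefix (assumed wording).
    return TextToTextExample(f"translate {src_lang} to {tgt_lang}: {text}", translation)


def cast_summarization(document: str, summary: str) -> TextToTextExample:
    # Summarization already maps text to text; only a prefix is added.
    return TextToTextExample(f"summarize: {document}", summary)


def cast_classification(task: str, sentence: str, label_id: int, label_names: list[str]) -> TextToTextExample:
    # The class label is emitted as a word, not as an integer index.
    return TextToTextExample(f"{task} sentence: {sentence}", label_names[label_id])


if __name__ == "__main__":
    examples = [
        cast_translation("English", "German", "That is good.", "Das ist gut."),
        cast_summarization("A long news article ...", "A short summary."),
        cast_classification("cola", "The course is jumping well.", 0,
                            ["unacceptable", "acceptable"]),
    ]
    for example in examples:
        print(f"{example.source!r} -> {example.target!r}")
```

Because every task shares the same input and output type, a single model, loss, and decoding procedure can in principle be reused across all of them.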

Citations

Posted Content•

TextSETTR: Few-Shot Text Style Extraction and Tunable Targeted Restyling

Parker Riley¹, Noah Constant², Mandy Guo², Girish Kumar³, David C. Uthus², Zarana Parekh²

University of Rochester¹, Google², Stanford University³

08 Oct 2020-arXiv: Computation and Language

TL;DR: The authors adapt T5 (Raffel et al., 2020) to extract a style vector from text and use it to condition the decoder to perform style transfer, recasting transfers as "targeted restyling" vector operations that adjust specific attributes of the input while preserving others.

Abstract: We present a novel approach to the problem of text style transfer. Unlike previous approaches requiring style-labeled training data, our method makes use of readily-available unlabeled text by relying on the implicit connection in style between adjacent sentences, and uses labeled data only at inference time. We adapt T5 (Raffel et al., 2020), a strong pretrained text-to-text model, to extract a style vector from text and use it to condition the decoder to perform style transfer. As our label-free training results in a style vector space encoding many facets of style, we recast transfers as "targeted restyling" vector operations that adjust specific attributes of the input while preserving others. We demonstrate that training on unlabeled Amazon reviews data results in a model that is competitive on sentiment transfer, even compared to models trained fully on labeled data. Furthermore, applying our novel method to a diverse corpus of unlabeled web text results in a single model capable of transferring along multiple dimensions of style (dialect, emotiveness, formality, politeness, sentiment) despite no additional training and using only a handful of exemplars at inference time.
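One way to picture a "targeted restyling" vector operation is to shift the input's style vector along the direction from a few source-style exemplars to a few target-style exemplars. The sketch below is an assumption made for illustration: the abstract does not give the formula, and the function names, the averaging of exemplars, and the `scale` parameter are all hypothetical.

```python
from typing import Sequence

Vector = list[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one exemplar vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all exemplar vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def targeted_restyle(
    input_style: Sequence[float],
    source_exemplars: Sequence[Sequence[float]],
    target_exemplars: Sequence[Sequence[float]],
    scale: float = 1.0,
) -> Vector:
    """Shift the input style along the (target - source) exemplar direction.

    Attributes on which the two exemplar sets agree cancel out in the
    difference, so only the attribute that distinguishes them is adjusted.
    """
    source_mean = mean_vector(source_exemplars)
    target_mean = mean_vector(target_exemplars)
    if len(input_style) != len(source_mean) or len(input_style) != len(target_mean):
        raise ValueError("input and exemplar vectors must have the same length")
    return [
        x + scale * (t - s)
        for x, s, t in zip(input_style, source_mean, target_mean)
    ]


if __name__ == "__main__":
    # Toy 2-D style space: the exemplar sets differ only in the first dimension.
    restyled = targeted_restyle(
        input_style=[1.0, 2.0],
        source_exemplars=[[0.0, 0.0], [0.0, 2.0]],
        target_exemplars=[[1.0, 1.0], [3.0, 1.0]],
        scale=0.5,
    )
    print(restyled)
```

In this toy case the second dimension is left unchanged because both exemplar sets have the same mean there, which mirrors the idea of adjusting one attribute while preserving the others.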

1 citation

Posted Content•

It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning

Alexey Tikhonov¹, Max Ryabinin²

Yandex¹, National Research University – Higher School of Economics²

22 Jun 2021-arXiv: Computation and Language

TL;DR: This paper designs a simple approach to commonsense reasoning that trains a linear classifier with weights of multi-head attention as features and can be applied to other languages in a zero-shot manner.

Abstract: Commonsense reasoning is one of the key problems in natural language processing, but the relative scarcity of labeled data holds back the progress for languages other than English. Pretrained cross-lingual models are a source of powerful language-agnostic representations, yet their inherent reasoning capabilities are still actively studied. In this work, we design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features. To evaluate this approach, we create a multilingual Winograd Schema corpus by processing several datasets from prior work within a standardized pipeline and measure cross-lingual generalization ability in terms of out-of-sample performance. The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning, even when applied to other languages in a zero-shot manner. Also, we demonstrate that most of the performance is given by the same small subset of attention heads for all studied languages, which provides evidence of universal reasoning capabilities in multilingual encoders.

1 citation

Posted Content•

DeLighT: Deep and Light-weight Transformer

Sachin Mehta¹, Marjan Ghazvininejad², Srinivasan Iyer², Luke Zettlemoyer¹, Hannaneh Hajishirzi¹

University of Washington¹, Facebook²

03 Aug 2020-arXiv: Learning

TL;DR: DeLighT is a deep and light-weight transformer that is 2.5 to 4 times deeper than standard transformer models, yet matches or improves on baseline Transformers for machine translation and language modeling with 2 to 3 times fewer parameters on average.

Abstract: We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within each Transformer block using the DeLighT transformation, a deep and light-weight transformation, and (2) across blocks using block-wise scaling, which allows for shallower and narrower DeLighT blocks near the input and wider and deeper DeLighT blocks near the output. Overall, DeLighT networks are 2.5 to 4 times deeper than standard transformer models and yet have fewer parameters and operations. Experiments on benchmark machine translation and language modeling tasks show that DeLighT matches or improves the performance of baseline Transformers with 2 to 3 times fewer parameters on average. Our source code is available at: this https URL
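Block-wise scaling can be pictured as a per-block schedule that grows from the input side to the output side. The abstract only states that blocks near the input are shallower and narrower and blocks near the output are wider and deeper; the linear interpolation, the function name, and the parameter names below are assumptions made for this sketch.

```python
def blockwise_schedule(num_blocks: int, min_value: int, max_value: int) -> list[int]:
    """Return one size (e.g. a depth or a width) per block, growing with block index.

    Block 0 is closest to the input and gets `min_value`; the last block is
    closest to the output and gets `max_value`. Values in between are linearly
    interpolated and rounded to integers (an assumed rule, for illustration).
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    if num_blocks == 1:
        # With a single block there is nothing to interpolate.
        return [min_value]
    step = (max_value - min_value) / (num_blocks - 1)
    return [round(min_value + step * block) for block in range(num_blocks)]


if __name__ == "__main__":
    # Hypothetical configuration: 4 blocks whose depth grows from 4 to 10 layers.
    print(blockwise_schedule(num_blocks=4, min_value=4, max_value=10))
```

A uniform model would instead give every block the same size; the schedule above spends fewer parameters near the input and more near the output.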

1 citation

Posted Content•

Semantically Distributed Robust Optimization for Vision-and-Language Inference.

Tejas Gokhale, Abhishek Chaudhary, Pratyay Banerjee, Chitta Baral, Yezhou Yang¹

Arizona State University¹

14 Oct 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: The authors present SDRO, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting, along with an ensembling technique to leverage these transformations during inference, and demonstrate performance improvements as well as robustness to adversarial attacks.

Abstract: Analysis of vision-and-language models has revealed their brittleness under linguistic phenomena such as paraphrasing, negation, textual entailment, and word substitutions with synonyms or antonyms. While data augmentation techniques have been designed to mitigate these failure modes, methods that can integrate this knowledge into the training pipeline remain under-explored. In this paper, we present SDRO, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting, along with an ensembling technique to leverage these transformations during inference. Experiments on benchmark datasets with images (NLVR²) and video (VIOLIN) demonstrate performance improvements as well as robustness to adversarial attacks. Experiments on binary VQA explore the generalizability of this method to other V&L tasks.

1 citation

Proceedings Article•

ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

Alexander R. Fabbri¹, Faiaz Rahman, Imad Rizvi, Borui Wang², Haoran Li³, Yashar Mehdad⁴, Dragomir R. Radev¹

Yale University¹, Stanford University², University of California, Los Angeles³, Facebook⁴

01 Aug 2021

TL;DR: The authors design annotation protocols motivated by an issues-viewpoints-assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads.

Abstract: While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues–viewpoints–assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and filter noisy input, showing comparable or improved results according to automatic and human evaluations.

1 citation