Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
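
As a rough illustration of what a text-to-text format means in practice, the sketch below casts three kinds of labeled examples as (input text, target text) pairs. The function name, the dictionary keys, and the exact prefix strings are hypothetical choices made for this example; the abstract does not specify them.

```python
from typing import Dict, Tuple


def to_text_to_text(task: str, example: Dict[str, str]) -> Tuple[str, str]:
    """Cast one labeled example as an (input text, target text) pair.

    The prefixes below are illustrative assumptions, not a fixed specification.
    """
    if task == "translation":
        return ("translate English to German: " + example["source"], example["target"])
    if task == "summarization":
        return ("summarize: " + example["document"], example["summary"])
    if task == "classification":
        # The class label is emitted as a literal string, not as a class index.
        return ("classify sentiment: " + example["sentence"], example["label"])
    raise ValueError(f"unknown task: {task}")


pair = to_text_to_text(
    "classification",
    {"sentence": "A delightful film.", "label": "positive"},
)
print(pair)
```

Because every task reduces to string-in, string-out, a single model and a single training objective can be shared across all of them.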

Citations

Proceedings Article

Towards Emotion-Aware Agents For Negotiation Dialogues

Kushal Chawla¹, Rene Clever², Jaysa Ramirez³, Gale M. Lucas¹, Jonathan Gratch¹

University of Southern California¹, Lehman College², Rollins College³

28 Sep 2021

TL;DR: This article explored the prediction of two important subjective goals in a negotiation, outcome satisfaction and partner perception, and analyzed the extent to which emotion attributes extracted from the negotiation help in the prediction, above and beyond the individual difference variables.

Abstract: Negotiation is a complex social interaction that encapsulates emotional encounters in human decision-making. Virtual agents that can negotiate with humans are useful in pedagogy and conversational AI. To advance the development of such agents, we explore the prediction of two important subjective goals in a negotiation – outcome satisfaction and partner perception. Specifically, we analyze the extent to which emotion attributes extracted from the negotiation help in the prediction, above and beyond the individual difference variables. We focus on a recent dataset in chat-based negotiations, grounded in a realistic camping scenario. We study three degrees of emotion dimensions – emoticons, lexical, and contextual by leveraging affective lexicons and a state-of-the-art deep learning architecture. Our insights will be helpful in designing adaptive negotiation agents that interact through realistic communication interfaces.

1 citation

Posted Content

Bandits Don't Follow Rules: Balancing Multi-Facet Machine Translation with Multi-Armed Bandits

Julia Kreutzer¹, David Vilar², Artem Sokolov¹

Google¹, Amazon.com²

13 Oct 2021-arXiv: Computation and Language

TL;DR: This paper proposes a multi-armed bandit that dynamically chooses between different facets of the training data in the way most beneficial for the MT system, and evaluates it on three multi-facet applications: balancing translationese and natural training data, data from multiple domains, and data from multiple language pairs.

Abstract: Training data for machine translation (MT) is often sourced from a multitude of large corpora that are multi-faceted in nature, e.g. containing contents from multiple domains or different levels of quality or complexity. Naturally, these facets do not occur with equal frequency, nor are they equally important for the test scenario at hand. In this work, we propose to optimize this balance jointly with MT model parameters to relieve system developers from manual schedule design. A multi-armed bandit is trained to dynamically choose between facets in a way that is most beneficial for the MT system. We evaluate it on three different multi-facet applications: balancing translationese and natural training data, or data from multiple domains or multiple language pairs. We find that bandit learning leads to competitive MT systems across tasks, and our analysis provides insights into its learned strategies and the underlying data sets.
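
To make the idea of a bandit choosing between facets concrete, here is a minimal sketch that uses the classic EXP3 algorithm. EXP3 is an assumption of this example (the abstract does not name the bandit algorithm), and `reward_fn` is a hypothetical stand-in for whatever training signal the MT system would report after a batch drawn from the chosen facet.

```python
import math
import random
from typing import Callable, List


def exp3_select_facets(
    num_facets: int,
    reward_fn: Callable[[int], float],
    steps: int,
    gamma: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Run EXP3 and return the final sampling distribution over facets.

    reward_fn(arm) must return a reward in [0, 1] for training on that facet.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_facets

    def distribution() -> List[float]:
        total = sum(weights)
        # Mix the weight-based distribution with uniform exploration.
        return [(1 - gamma) * w / total + gamma / num_facets for w in weights]

    for _ in range(steps):
        probs = distribution()
        arm = rng.choices(range(num_facets), weights=probs)[0]
        reward = reward_fn(arm)
        # Importance-weighted estimate keeps the update unbiased.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_facets)
        # Rescale so the weights stay in a safe floating-point range.
        largest = max(weights)
        weights = [w / largest for w in weights]

    return distribution()


# Toy reward: only facet 2 ever helps the (imaginary) MT system.
probs = exp3_select_facets(3, lambda arm: 1.0 if arm == 2 else 0.0, steps=500)
print(probs)
```

In this toy setting the sampler concentrates on facet 2 while the `gamma` term keeps a small amount of exploration on the other facets.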

1 citation

Proceedings Article

Learning Algebraic Recombination for Compositional Generalization

Chenyao Liu¹, Shengnan An, Zeqi Lin², Qian Liu³, Bei Chen², Jian-Guang Lou², Lijie Wen¹, Nanning Zheng, Dongmei Zhang²

Tsinghua University¹, Microsoft², Beihang University³

01 Aug 2021

TL;DR: LeAR models the semantic parsing task as a homomorphism between a latent syntactic algebra and a semantic algebra, thus encouraging algebraic recombination, and jointly learns two modules: a Composer for producing latent syntax and an Interpreter for assigning semantic operations.

Abstract: Neural sequence models exhibit limited compositional generalization ability in semantic parsing tasks. Compositional generalization requires algebraic recombination, i.e., dynamically recombining structured expressions in a recursive manner. However, most previous studies mainly concentrate on recombining lexical units, which is an important but not sufficient part of algebraic recombination. In this paper, we propose LeAR, an end-to-end neural model to learn algebraic recombination for compositional generalization. The key insight is to model the semantic parsing task as a homomorphism between a latent syntactic algebra and a semantic algebra, thus encouraging algebraic recombination. Specifically, we learn two modules jointly: a Composer for producing latent syntax, and an Interpreter for assigning semantic operations. Experiments on two realistic and comprehensive compositional generalization benchmarks demonstrate the effectiveness of our model. The source code is publicly available at this https URL.

1 citation

Posted Content

Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

Shengjie Luo¹, Shanda Li¹, Tianle Cai², Di He³, Dinglan Peng⁴, Shuxin Zheng³, Guolin Ke³, Liwei Wang¹, Tie-Yan Liu³

Peking University¹, Princeton University², Microsoft³, University of Science and Technology of China⁴

23 Jun 2021-arXiv: Learning

TL;DR: This paper proposes a way to accelerate attention calculation for Transformers with relative positional encoding (RPE) on top of kernelized attention, using the Fast Fourier Transform; the learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.

Abstract: The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since in many state-of-the-art models, relative positional encoding is used as default, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.
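
The Toeplitz observation can be illustrated in isolation. The sketch below multiplies an n-by-n Toeplitz matrix (entry (i, j) depends only on i - j, as a relative position bias does) by a vector in O(n log n) by embedding it in a circulant matrix and using an FFT. It shows only this primitive, not the paper's kernelized attention; the function names and the pure-Python radix-2 FFT are choices made here so that the example has no dependencies.

```python
import cmath
from typing import List, Sequence


def fft(values: Sequence[complex], invert: bool = False) -> List[complex]:
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def toeplitz_matvec(diag: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute y = T x where T[i][j] = diag[(n - 1) + (i - j)], using FFTs."""
    n = len(x)
    assert len(diag) == 2 * n - 1
    m = 1
    while m < 2 * n - 1:  # circulant size: a power of two >= 2n - 1
        m *= 2
    col = [0.0] * m
    for k in range(n):          # offsets 0 .. n-1 go at the front
        col[k] = diag[n - 1 + k]
    for k in range(1, n):       # offsets -1 .. -(n-1) wrap around to the end
        col[m - k] = diag[n - 1 - k]
    padded = list(x) + [0.0] * (m - n)
    product = [a * b for a, b in zip(fft(col), fft(padded))]
    y = fft(product, invert=True)
    return [v.real / m for v in y[:n]]


def toeplitz_matvec_naive(diag: Sequence[float], x: Sequence[float]) -> List[float]:
    """Reference O(n^2) implementation used to check the FFT version."""
    n = len(x)
    return [sum(diag[n - 1 + i - j] * x[j] for j in range(n)) for i in range(n)]


diag = [0.5, -1.0, 2.0, 3.0, 1.5, 0.0, -2.0]  # offsets -3 .. 3 for n = 4
x = [1.0, 2.0, 3.0, 4.0]
fast = toeplitz_matvec(diag, x)
slow = toeplitz_matvec_naive(diag, x)
print(fast, slow)
```

The two results agree up to floating-point error; the FFT route costs O(n log n) instead of the O(n^2) of the direct product.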

1 citation

Posted Content

FastSeq: Make Sequence Generation Faster

Yu Yan¹, Fei Hu, Jiusheng Chen¹, Nikhil Bhendawade¹, Ting Ye, Yeyun Gong¹, Nan Duan¹, Desheng Cui, Bingyu Chi, Ruofei Zhang¹

Microsoft¹

08 Jun 2021-arXiv: Computation and Language

TL;DR: FastSeq accelerates sequence generation without accuracy loss by combining an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O.

Abstract: Transformer-based models have made tremendous impacts in natural language generation. However, the inference speed is a bottleneck due to the large model size and the intensive computation involved in the auto-regressive decoding process. We develop the FastSeq framework to accelerate sequence generation without accuracy loss. The proposed optimization techniques include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations are general enough to be applicable to Transformer-based models (e.g., T5, GPT2, and UniLM). Our benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain. Additionally, FastSeq is easy to use with a simple one-line code change. The source code is available at this https URL.
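
For context on what detecting repeated n-grams involves during decoding, the sketch below shows the plain dictionary-based check: given the tokens generated so far, it returns the tokens that would complete an n-gram already present. This is a baseline illustration with a hypothetical function name, not FastSeq's optimized algorithm, which the abstract does not describe.

```python
from typing import Dict, List, Set, Tuple


def banned_next_tokens(generated: List[int], n: int) -> Set[int]:
    """Return tokens that would repeat an n-gram already seen in `generated`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(generated) < n:
        return set()  # no complete n-gram exists yet

    # Map each (n-1)-token prefix to the set of tokens that have followed it.
    followers: Dict[Tuple[int, ...], Set[int]] = {}
    for start in range(len(generated) - n + 1):
        prefix = tuple(generated[start:start + n - 1])
        followers.setdefault(prefix, set()).add(generated[start + n - 1])

    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    current_prefix = tuple(generated[len(generated) - (n - 1):])
    return followers.get(current_prefix, set())


banned = banned_next_tokens([7, 8, 9, 7, 8], n=3)
print(banned)
```

A decoder would typically mask the returned token ids before choosing the next token; rebuilding the dictionary at every step, as done here, is the kind of repeated work an optimized implementation would aim to avoid.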

1 citation