Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

A Novel Estimator of Mutual Information for Learning to Disentangle Textual Representations

[...]

Pierre Colombo¹, Pablo Piantanida², Chloé Clavel³•Institutions (3)

IBM¹, École Centrale Paris², Télécom ParisTech³

01 Aug 2021

TL;DR: This paper introduced a variational upper bound to the mutual information between an attribute and the latent code of an encoder, which aims at controlling the approximation error via the Renyi divergence, leading to both better disentangled representations and in particular, a precise control of the desirable degree of disentanglement than state-of-the-art methods proposed for textual data.

...read moreread less

Abstract: Learning disentangled representations of textual data is essential for many natural language tasks such as fair classification, style transfer and sentence generation, among others. The existent dominant approaches in the context of text data either rely on training an adversary (discriminator) that aims at making attribute values difficult to be inferred from the latent code or rely on minimising variational bounds of the mutual information between latent code and the value attribute. However, the available methods suffer of the impossibility to provide a fine-grained control of the degree (or force) of disentanglement. In contrast to adversarial methods, which are remarkably simple, although the adversary seems to be performing perfectly well during the training phase, after it is completed a fair amount of information about the undesired attribute still remains. This paper introduces a novel variational upper bound to the mutual information between an attribute and the latent code of an encoder. Our bound aims at controlling the approximation error via the Renyi’s divergence, leading to both better disentangled representations and in particular, a precise control of the desirable degree of disentanglement than state-of-the-art methods proposed for textual data. Furthermore, it does not suffer from the degeneracy of other losses in multi-class scenarios. We show the superiority of this method on fair classification and on textual style transfer tasks. Additionally, we provide new insights illustrating various trade-offs in style transfer when attempting to learn disentangled representations and quality of the generated sentence.

...read moreread less

3 citations

Proceedings Article•DOI•

Attention Temperature Matters in Abstractive Summarization Distillation

[...]

01 Jan 2022

TL;DR: The authors distill large pre-trained sequence-to-sequence Transformer models into smaller ones for faster inference and with minimal performance loss, by manipulating attention temperatures in Transformers to make pseudo labels easier to learn for student models.

...read moreread less

Abstract: Recent progress of abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference and with minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find simply manipulating attention temperatures in Transformers can make pseudo labels easier to learn for student models. Our experiments on three summarization datasets show our proposed method consistently improves vanilla pseudo-labeling based methods. Further empirical analysis shows that both pseudo labels and summaries produced by our students are shorter and more abstractive.

...read moreread less

3 citations

Posted Content•

Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?

[...]

Tim Isbister, Fredrik Carlsson, Magnus Sahlgren

21 Apr 2021-arXiv: Computation and Language

TL;DR: The authors demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages, with the exception of Finnish, which they assume is due to inferior translation quality.

...read moreread less

Abstract: Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development, and explores the idea of simply translating the data into English, thereby enabling the use of pretrained, and large-scale, English language models. We demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages. The exception to this is Finnish, which we assume is due to inferior translation quality. Our results suggest that machine translation is a mature technology, which raises a serious counter-argument for training native language models for low-resource languages. This paper therefore strives to make a provocative but important point. As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is from an empirical and environmental stand-point more effective to translate data from low-resource languages into English, than to build language models for such languages.

...read moreread less

3 citations

Proceedings Article•DOI•

Stage-wise Fine-tuning for Graph-to-Text Generation

[...]

Qingyun Wang¹, Semih Yavuz¹, Xi Victoria Lin¹, Heng Ji², Nazneen Fatema Rajani¹ - Show less +1 more•Institutions (2)

Salesforce.com¹, University of Illinois at Urbana–Champaign²

01 Aug 2021

TL;DR: This paper proposed a tree-level embedding method to capture the inter-dependency structures of the input graph, which improved the performance of all text generation metrics for the English WebNLG 2017 dataset.

...read moreread less

Abstract: Graph-to-text generation has benefited from pre-trained language models (PLMs) in achieving better performance than structured graph encoders. However, they fail to fully utilize the structure information of the input graph. In this paper, we aim to further improve the performance of the pre-trained language model by proposing a structured graph-to-text model with a two-step fine-tuning mechanism which first fine-tunes model on Wikipedia before adapting to the graph-to-text generation. In addition to using the traditional token and position embeddings to encode the knowledge graph (KG), we propose a novel tree-level embedding method to capture the inter-dependency structures of the input graph. This new approach has significantly improved the performance of all text generation metrics for the English WebNLG 2017 dataset.

...read moreread less

3 citations

Posted Content•

Deep Learning in Memristive Nanowire Networks

[...]

Jack D. Kendall, Ross D. Pantone, Juan C. Nino¹•Institutions (1)

University of Florida¹

03 Mar 2020-arXiv: Emerging Technologies

TL;DR: This work represents, to the authors' knowledge, the first randomized nanowire architecture capable of reproducing the backpropagation algorithm and shows that the MN3 is capable of performing composition, gradient propagation, and weight updates, which together allow it to function as a deep neural network.

...read moreread less

Abstract: Analog crossbar architectures for accelerating neural network training and inference have made tremendous progress over the past several years. These architectures are ideal for dense layers with fewer than roughly a thousand neurons. However, for large sparse layers, crossbar architectures are highly inefficient. A new hardware architecture, dubbed the MN3 (Memristive Nanowire Neural Network), was recently described as an efficient architecture for simulating very wide, sparse neural network layers, on the order of millions of neurons per layer. The MN3 utilizes a high-density memristive nanowire mesh to efficiently connect large numbers of silicon neurons with modifiable weights. Here, in order to explore the MN3's ability to function as a deep neural network, we describe one algorithm for training deep MN3 models and benchmark simulations of the architecture on two deep learning tasks. We utilize a simple piecewise linear memristor model, since we seek to demonstrate that training is, in principle, possible for randomized nanowire architectures. In future work, we intend on utilizing more realistic memristor models, and we will adapt the presented algorithm appropriately. We show that the MN3 is capable of performing composition, gradient propagation, and weight updates, which together allow it to function as a deep neural network. We show that a simulated multilayer perceptron (MLP), built from MN3 networks, can obtain a 1.61% error rate on the popular MNIST dataset, comparable to equivalently sized software-based network. This work represents, to the authors' knowledge, the first randomized nanowire architecture capable of reproducing the backpropagation algorithm.

...read moreread less

3 citations