Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Citations

Posted Content

Contextualizing Variation in Text Style Transfer Datasets

Stephanie Schoch¹, Wanyu Du¹, Yangfeng Ji¹

University of Virginia¹

17 Aug 2021-arXiv: Computation and Language

TL;DR: The authors conduct several empirical analyses of existing text style datasets and propose a categorization of stylistic and dataset properties to consider when utilizing or comparing text style datasets, taking into account how a style is realized in a particular dataset.

Abstract: Text style transfer involves rewriting the content of a source sentence in a target style. Despite there being a number of style tasks with available data, there has been limited systematic discussion of how text style datasets relate to each other. This understanding, however, is likely to have implications for selecting multiple data sources for model training. While it is prudent to consider inherent stylistic properties when determining these relationships, we also must consider how a style is realized in a particular dataset. In this paper, we conduct several empirical analyses of existing text style datasets. Based on our results, we propose a categorization of stylistic and dataset properties to consider when utilizing or comparing text style datasets.

Posted Content

Exceeding the Limits of Visual-Linguistic Multi-Task Learning.

Cameron R. Wolfe, Keld T. Lundgaard

27 Jul 2021-arXiv: Artificial Intelligence

TL;DR: This article proposes a large-scale multi-task learning (MTL) approach that solves 1000 unique classification tasks sharing similarly structured input data comprising both text and images.

Abstract: By leveraging large amounts of product data collected across hundreds of live e-commerce websites, we construct 1000 unique classification tasks that share similarly-structured input data, comprised of both text and images. These classification tasks focus on learning the product hierarchy of different e-commerce websites, causing many of them to be correlated. Adopting a multi-modal transformer model, we solve these tasks in unison using multi-task learning (MTL). Extensive experiments are presented over an initial 100-task dataset to reveal best practices for "large-scale MTL" (i.e., MTL with more than 100 tasks). From these experiments, a final, unified methodology is derived, which is composed of both best practices and new proposals such as DyPa, a simple heuristic for automatically allocating task-specific parameters to tasks that could benefit from extra capacity. Using our large-scale MTL methodology, we successfully train a single model across all 1000 tasks in our dataset while using minimal task specific parameters, thereby showing that it is possible to extend several orders of magnitude beyond current efforts in MTL.

Posted Content

RoBERTuito: a pre-trained language model for social media text in Spanish.

Juan Manuel Pérez, Damián Ariel Furman, Laura Alonso Alemany, Franco M. Luque

18 Nov 2021-arXiv: Computation and Language

TL;DR: This article presents RoBERTuito, a pre-trained language model for user-generated content in Spanish, trained on 500 million Spanish tweets, and shows that it outperforms other pre-trained language models for Spanish.

Abstract: Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for Natural Language Understanding tasks. Recently, some works have been geared towards pre-training specially crafted models for particular domains, such as scientific papers, medical documents, and others. In this work, we present RoBERTuito, a pre-trained language model for user-generated content in Spanish. We trained RoBERTuito on 500 million tweets in Spanish. Experiments on a benchmark of 4 tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models for Spanish. In order to help further research, we make RoBERTuito publicly available at the HuggingFace model hub.

Posted Content

DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution

Keshav Santhanam¹, Siddharth Krishna, Ryota Tomioka, Tim Harris, Matei Zaharia

Stanford University¹

09 Nov 2021-arXiv: Learning

TL;DR: This paper proposes DistIR, an intermediate representation for distributed DNN computation that is tailored for efficient analyses such as simulation, which enables automatically identifying the top-performing strategies without having to execute on physical hardware.

Abstract: The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data, tensor-model, pipeline parallelism, and hybrid combinations thereof. Each of these strategies offers its own trade-offs and exhibits optimal performance across different models and hardware topologies. Selecting the best set of strategies for a given setup is challenging because the search space grows combinatorially, and debugging and testing on clusters is expensive. In this work we propose DistIR, an expressive intermediate representation for distributed DNN computation that is tailored for efficient analyses, such as simulation. This enables automatically identifying the top-performing strategies without having to execute on physical hardware. Unlike prior work, DistIR can naturally express many distribution strategies including pipeline parallelism with arbitrary schedules. Our evaluation on MLP training and GPT-2 inference models demonstrates how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations, reducing optimization time by an order of magnitude for certain regimes.

DOI

Benchmarking down-scaled (not so large) pre-trained language models

Matthias Aßenmacher, Patrick Schulze, Christian Heumann

11 May 2021

TL;DR: The authors compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size, and find that additional compute should be mainly allocated to an increased model size, while training for more steps is inefficient.

Abstract: Large Transformer-based language models are pre-trained on corpora of varying sizes, for a different number of steps and with different batch sizes. At the same time, more fundamental components, such as the pre-training objective or architectural hyperparameters, are modified. In total, it is therefore difficult to ascribe changes in performance to specific factors. Since searching the hyperparameter space over the full systems is too costly, we pre-train down-scaled versions of several popular Transformer-based architectures on a common pre-training corpus and benchmark them on a subset of the GLUE tasks (Wang et al., 2018). Specifically, we systematically compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size. In our experiments MLM + NSP (BERT-style) consistently outperforms MLM (RoBERTa-style) as well as the standard LM objective. Furthermore, we find that additional compute should be mainly allocated to an increased model size, while training for more steps is inefficient. Based on these observations, as a final step we attempt to scale up several systems using compound scaling (Tan and Le, 2019) adapted to Transformer-based language models.
