Open Access Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TLDR
This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
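The unifying idea is that every task is expressed as feeding the model input text and training it to produce target text, with a task prefix identifying the task. Below is a minimal sketch of that text-to-text framing using the Hugging Face Transformers port of the public "t5-small" checkpoint (an assumption; the paper's reference implementation uses Mesh TensorFlow, and the exact prefixes follow the formats described in the paper).

```python
# Minimal sketch: casting different NLP tasks into one text-to-text format.
# Assumes the Hugging Face Transformers port of the released "t5-small" checkpoint,
# not the paper's original codebase.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as "input text -> output text" via a task prefix.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "cola sentence: The course is jumping well.",  # acceptability judgment, answered as text
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because classification labels, spans, and translations are all emitted as strings, the same architecture, loss, and decoding procedure serve every task.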

Citations
Posted Content

Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

TL;DR: The proposed method is shown to significantly outperform baselines trained on English data only, and a new state of the art is reported on four datasets: MLQA, XQuAD, SQuAD-it, and PIAF (fr).
Posted Content

RoFormer: Enhanced Transformer with Rotary Position Embedding.

TL;DR: The authors propose rotary position embedding (RoPE), which encodes absolute position information with a rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation. RoPE has valuable properties such as the flexibility to extend to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
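A minimal NumPy sketch of the rotary idea summarized above, assuming the standard pairwise-rotation formulation; the function and variable names are illustrative, not the RoFormer authors' code.

```python
# Minimal sketch of rotary position embedding (RoPE): rotate each 2-D slice of a
# query/key vector by an angle proportional to its token position, so that
# dot products between rotated queries and keys depend on relative position.
import numpy as np

def apply_rope(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, dim) with even dim; returns position-rotated vectors."""
    seq_len, dim = x.shape
    # Frequencies follow the conventional schedule theta_i = 10000^(-2i/dim).
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # split into 2-D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Attention scores between rotated queries and keys encode relative distance.
q = apply_rope(np.random.randn(8, 64))
k = apply_rope(np.random.randn(8, 64))
scores = q @ k.T
```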
Proceedings ArticleDOI

Modeling Graph Structure via Relative Position for Text Generation from Knowledge Graphs

TL;DR: This paper presents Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation that learns to weight node-node relations differently for different attention heads, thus virtually learning differently connected views of the input graph.
Proceedings ArticleDOI

On the importance of pre-training data volume for compact language models

TL;DR: This paper studies the impact of pre-training data volume on compact language models and shows that well-performing models can be obtained with as little as 100 MB of text, and that, past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
Proceedings Article

Mixed-Lingual Pre-training for Cross-lingual Summarization

TL;DR: This work proposes a solution based on mixed-lingual pre-training that leverages massive monolingual data to enhance its language modeling, and has no task-specific components, which saves memory and increases optimization efficiency.
Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.