Open Access · Journal Article
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
TLDR
This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
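The text-to-text casting described in the abstract can be sketched as follows: every task is expressed as "input string in, target string out", with a task prefix prepended to the input so a single model can serve all tasks. The prefix strings below follow the convention used in the paper, but the helper function and its exact wording are illustrative, not the authors' code.

```python
# Minimal sketch of the unified text-to-text format: each task becomes a
# (prefixed input, target text) pair that one sequence-to-sequence model
# can be trained on. Prefix strings are illustrative.

def to_text_to_text(task: str, text: str) -> str:
    """Prepend a task prefix so every problem is plain text-to-text."""
    prefixes = {
        "translate_en_de": "translate English to German: ",
        "summarize": "summarize: ",
        "cola": "cola sentence: ",
    }
    return prefixes[task] + text

# Inputs for three different tasks, all in the same format; the model
# emits the answer as text (a translation, a summary, or a label word).
inp = to_text_to_text("summarize", "Transfer learning has emerged as ...")
```

Because targets are also plain text (e.g. the label "acceptable" for a classification task), the same training objective, decoding procedure, and hyperparameters apply across tasks, which is what enables the paper's controlled comparisons.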
Citations
Posted Content
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
TL;DR: The proposed method significantly outperforms baselines trained on English data only, and a new state of the art is reported on four datasets: MLQA, XQuAD, SQuAD-it, and PIAF (fr).
Posted Content
RoFormer: Enhanced Transformer with Rotary Position Embedding.
TL;DR: The authors propose rotary position embedding (RoPE), which encodes absolute position information with a rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation. RoPE has valuable properties such as the flexibility to extend to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
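The key mechanism in the summary above is that rotating query/key feature pairs by a position-dependent angle makes their inner product depend only on the relative distance between positions. A minimal sketch, using a hypothetical `rope` helper and the standard pairing of the first and second halves of the feature vector (the paper interleaves adjacent dimensions, but the relative-position property is the same):

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles (RoPE sketch).

    x: flat list of floats with even length d. Pair i = (x[i], x[half+i])
    is rotated by pos * base**(-2i/d), so dot(rope(q, m), rope(k, n))
    depends only on the relative offset m - n.
    """
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[i], x[half + i]
        out[i] = x1 * c - x2 * s
        out[half + i] = x1 * s + x2 * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))
```

Shifting both positions by the same amount leaves the attention score unchanged, e.g. `dot(rope(q, 3), rope(k, 5))` equals `dot(rope(q, 10), rope(k, 12))`, which is the relative-position property the TL;DR refers to.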
Proceedings ArticleDOI
Modeling Graph Structure via Relative Position for Text Generation from Knowledge Graphs
TL;DR: This paper proposes Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation that learns to weight node-node relations differently for different attention heads, thus virtually learning differently connected views of the input graph.
Proceedings ArticleDOI
On the importance of pre-training data volume for compact language models
TL;DR: This paper studies the impact of pre-training data volume on compact language models, showing that well-performing models can be obtained with as little as 100 MB of text, and that below critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
Proceedings Article
Mixed-Lingual Pre-training for Cross-lingual Summarization
TL;DR: This work proposes a solution based on mixed-lingual pre-training that leverages massive monolingual data to enhance its language modeling; the approach has no task-specific components, which saves memory and increases optimization efficiency.