Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation.

[...]

Zeng Wei, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, Chen Li, Ziyan Gong, Yifan Yao, Xinjing Huang, Jun Wang, Jianfeng Yu, Guo Qi, Yu Yue, Yan Zhang, Wang Jin, Hengtao Tao, Dasen Yan, Zexuan Yi, Fang Peng, Fangqing Jiang, Han Zhang, Lingfeng Deng, Yehong Zhang, Zhe Lin, Chao Zhang, Shaojie Zhang, Mingyue Guo, Gu Shanzhi, Gaojun Fan, Wang Yaowei, Xuefeng Jin, Qun Liu, Tian Yonghong - Show less +34 more

26 Apr 2021-arXiv: Computation and Language

TL;DR: The experimental results demonstrate the superior capabilities of PanGu-α in performing various tasks under few-shot or zero-shot settings and investigate the effect of model scales on the few- shot performances across a broad range of Chinese NLP tasks.

...read moreread less

Abstract: Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions parameters such as GPT-3 have demonstrated strong performances on natural language understanding and generation with \textit{few-shot in-context} learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under the MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and rematerialization. To enhance the generalization ability of PanGu-$\alpha$, we collect 1.1TB high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$\alpha$ in various scenarios including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$\alpha$ in performing various tasks under few-shot or zero-shot settings.

...read moreread less

105 citations

Proceedings Article•DOI•

A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining

[...]

Chenguang Zhu¹, Ruochen Xu¹, Michael Zeng¹, Xuedong Huang¹•Institutions (1)

Microsoft¹

01 Nov 2020

TL;DR: A novel abstractive summary network that adapts to the meeting scenario is proposed with a hierarchical structure to accommodate long meeting transcripts and a role vector to depict the difference among speakers.

...read moreread less

Abstract: With the abundance of automatic meeting transcripts, meeting summarization is of great interest to both participants and other parties. Traditional methods of summarizing meetings depend on complex multi-step pipelines that make joint optimization intractable. Meanwhile, there are a handful of deep neural models for text summarization and dialogue systems. However, the semantic structure and styles of meeting transcripts are quite different from articles and conversations. In this paper, we propose a novel abstractive summary network that adapts to the meeting scenario. We design a hierarchical structure to accommodate long meeting transcripts and a role vector to depict the difference among speakers. Furthermore, due to the inadequacy of meeting summary data, we pretrain the model on large-scale news summary data. Empirical results show that our model outperforms previous approaches in both automatic metrics and human evaluation. For example, on ICSI dataset, the ROUGE-1 score increases from 34.66% to 46.28%.

...read moreread less

104 citations

Proceedings Article•DOI•

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

[...]

Ronghang Hu¹, Amanpreet Singh², Trevor Darrell¹, Marcus Rohrbach²•Institutions (2)

University of California, Berkeley¹, Facebook²

14 Jun 2020

TL;DR: Li et al. as mentioned in this paper propose a multimodal transformer architecture accompanied by a rich representation for text in images, which naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter-and intra-modality context.

...read moreread less

Abstract: Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between a pair of two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra- modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.

...read moreread less

104 citations

Posted Content•

Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?

[...]

Peter Shaw¹, Ming-Wei Chang¹, Panupong Pasupat¹, Kristina Toutanova¹•Institutions (1)

Google¹

24 Oct 2020-arXiv: Computation and Language

TL;DR: NQG-T5 is proposed, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model that outperforms existing approaches across several compositional generalization challenges on non-synthetic data, while also being competitive with the state of theart on standard evaluations.

...read moreread less

Abstract: Sequence-to-sequence models excel at handling natural language variation, but have been shown to struggle with out-of-distribution compositional generalization. This has motivated new specialized architectures with stronger compositional biases, but most of these approaches have only been evaluated on synthetically-generated datasets, which are not representative of natural language variation. In this work we ask: can we develop a semantic parsing approach that handles both natural language variation and compositional generalization? To better assess this capability, we propose new train and test splits of non-synthetic datasets. We demonstrate that strong existing approaches do not perform well across a broad set of evaluations. We also propose NQG-T5, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model. It outperforms existing approaches across several compositional generalization challenges on non-synthetic data, while also being competitive with the state-of-the-art on standard evaluations. While still far from solving this problem, our study highlights the importance of diverse evaluations and the open challenge of handling both compositional generalization and natural language variation in semantic parsing.

...read moreread less

103 citations

Posted Content•

PARADE: Passage Representation Aggregation for Document Reranking

[...]

Canjia Li, Andrew Yates¹, Sean MacAvaney, Ben He, Yingfei Sun - Show less +1 more•Institutions (1)

Max Planck Society¹

20 Aug 2020-arXiv: Information Retrieval

TL;DR: An end-to-end Transformer-based model that considers document-level context for document reranking and leverages passage-level relevance representations to predict a document relevance score, overcoming the limitations of previous approaches.

...read moreread less

Abstract: We present PARADE, an end-to-end Transformer-based model that considers document-level context for document reranking. PARADE leverages passage-level relevance representations to predict a document relevance score, overcoming the limitations of previous approaches that perform inference on passages independently. Experiments on two ad-hoc retrieval benchmarks demonstrate PARADE's effectiveness over such methods. We conduct extensive analyses on PARADE's efficiency, highlighting several strategies for improving it. When combined with knowledge distillation, a PARADE model with 72\% fewer parameters achieves effectiveness competitive with previous approaches using BERT-Base. Our code is available at \url{this https URL}.

...read moreread less

103 citations