Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
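
As a concrete illustration of the text-to-text format described above, the sketch below casts two different tasks (translation and sentiment classification) as plain string-in, string-out problems. It assumes the Hugging Face transformers library and the publicly released t5-small checkpoint; the task prefixes follow the paper's convention.

# Minimal sketch of the text-to-text framing, assuming the Hugging Face
# "transformers" library and the public t5-small checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as input text -> output text via a task prefix.
prompts = [
    "translate English to German: The house is wonderful.",
    "sst2 sentence: this movie was a complete waste of time",
]
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))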


Citations
Posted Content
TL;DR: CoTexT is a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL) using self-supervision.
Abstract: We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarization/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpora, including both "bimodal" and "unimodal" data. Here, bimodal data is the combination of text and corresponding code snippets, whereas unimodal data is code snippets alone. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on six different programming languages and Code Refinement on both the small and medium sizes featured in the CodeXGLUE dataset. We then conduct extensive experiments to investigate CoTexT on other tasks within the CodeXGLUE dataset, including Code Generation and Defect Detection. We consistently achieve state-of-the-art results on these tasks, demonstrating the versatility of our models.
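
To make the bimodal/unimodal distinction concrete, the sketch below shows one plausible way such pre-training examples could be assembled; the field names and formatting are illustrative assumptions, not CoTexT's actual data pipeline.

# Hypothetical sketch of assembling "bimodal" vs "unimodal" pre-training text.
# Field names and formatting are illustrative assumptions, not CoTexT's code.
def make_example(record, bimodal=True):
    code = record["code"]                      # a code snippet
    doc = record.get("docstring", "")          # its natural-language description
    if bimodal and doc:
        return doc + " " + code                # text paired with code
    return code                                # code only

record = {"docstring": "Return the square of x.", "code": "def square(x): return x * x"}
print(make_example(record, bimodal=True))
print(make_example(record, bimodal=False))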

2 citations

Posted Content
TL;DR: In this paper, the authors investigate several key factors in sparse expert models and propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
Abstract: Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost, and they have thus become a trend in model scaling. Still, it remains unclear how MoE layers bring quality gains by leveraging parameters with sparse activation. In this work, we investigate several key factors in sparse expert models. We observe that load imbalance may not be a significant problem affecting model quality, contrary to the perspectives of recent studies, while the number of sparsely activated experts $k$ and the expert capacity $C$ in top-$k$ routing can make a significant difference in this context. Furthermore, we propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing. This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective when training larger models. We push the model scale to over $1$ trillion parameters and implement it on only $480$ NVIDIA V100-32GB GPUs, compared with recent SOTAs trained on $2048$ TPU cores. The proposed giant model achieves substantial speedup in convergence over the same-size baseline.
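
The expert prototyping idea (split the experts into k prototype groups and apply top-1 routing inside each group) can be sketched in a few lines of PyTorch. This is a plausible reading of the abstract's description, not the authors' implementation; the gating and output-combination details are assumptions.

import torch

# Illustrative sketch of expert prototyping: split E experts into k prototype
# groups, do top-1 routing inside each group, then sum the group outputs.
# Shapes and the combination rule are assumptions based on the abstract.
def prototype_routing(x, experts, gates, k):
    # x: [tokens, d_model]; experts: list of E small modules; gates: [d_model, E]
    logits = x @ gates                          # [tokens, E] gate scores
    E = logits.shape[1]
    groups = torch.chunk(torch.arange(E), k)    # k prototype groups of experts
    out = torch.zeros_like(x)
    for g in groups:
        probs = torch.softmax(logits[:, g], dim=-1)
        best = probs.argmax(dim=-1)             # top-1 expert per token, per group
        for j, e_idx in enumerate(g):
            mask = best == j
            if mask.any():
                out[mask] += probs[mask, j:j + 1] * experts[int(e_idx)](x[mask])
    return out

d, E, k = 8, 4, 2
experts = [torch.nn.Linear(d, d) for _ in range(E)]
gates = torch.randn(d, E)
x = torch.randn(5, d)
print(prototype_routing(x, experts, gates, k).shape)  # torch.Size([5, 8])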

2 citations

Posted Content
Shaobo Cui, Xintong Bao, Xinxing Zu, Yangyang Guo, Zhongzhou Zhao, Ji Zhang, Haiqing Chen
TL;DR: This paper proposed a model named OneStop that generates QA pairs from documents in a one-stop approach, in which questions and their corresponding answer spans are extracted simultaneously and the processes of question generation and answer extraction mutually affect each other.
Abstract: Large-scale question-answer (QA) pairs are critical for advancing research areas like machine reading comprehension and question answering. Constructing QA pairs from documents requires determining how to ask a question and what the corresponding answer is. Existing methods for QA pair generation usually follow a pipeline approach: they first choose the most likely candidate answer span and then generate an answer-specific question. This pipeline approach, however, is undesirable for mining the most appropriate QA pairs from documents, since it ignores the connection between question generation and answer extraction, which may lead to incompatible QA pairs, i.e., a selected answer span that is inappropriate for question generation. Human annotators, by contrast, take the whole QA pair into account and consider the compatibility between question and answer. Motivated by this observation, instead of the conventional pipeline approach, we propose a model named OneStop that generates QA pairs from documents in a one-stop approach. Specifically, questions and their corresponding answer spans are extracted simultaneously, and the processes of question generation and answer extraction mutually affect each other. Additionally, OneStop is much more efficient to train and deploy in industrial scenarios, since it involves only one model to solve the complex QA generation task. We conduct comprehensive experiments on three large-scale machine reading comprehension datasets: SQuAD, NewsQA, and DuReader. The experimental results demonstrate that our OneStop model significantly outperforms the baselines regarding the quality of generated questions, the quality of generated question-answer pairs, and model efficiency.
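
One plausible way to realize the one-stop formulation is to serialize the answer span and the question into a single target sequence, so that one seq2seq model produces both jointly. The tag format below is an assumption made for illustration, not the OneStop paper's actual format.

# Hypothetical serialization of a joint (answer span, question) target for a
# single seq2seq model; the tag format is an assumption, not OneStop's format.
def make_onestop_target(answer_text, question):
    return f"answer: {answer_text} question: {question}"

def parse_onestop_output(output):
    # Recover the QA pair from one decoded sequence.
    answer, _, question = output.partition(" question: ")
    return answer.removeprefix("answer: ").strip(), question.strip()

target = make_onestop_target("1889", "When was the Eiffel Tower completed?")
print(target)
print(parse_onestop_output(target))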

2 citations

Posted Content
TL;DR: This article proposed TaCL (Token-aware Contrastive Learning), a continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations.
Abstract: Masked language models (MLMs) such as BERT and RoBERTa have revolutionized the field of Natural Language Understanding in the past few years. However, existing pre-trained MLMs often output an anisotropic distribution of token representations that occupies a narrow subset of the entire representation space. Such token representations are not ideal, especially for tasks that demand discriminative semantic meanings of distinct tokens. In this work, we propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations. TaCL is fully unsupervised and requires no additional data. We extensively test our approach on a wide range of English and Chinese benchmarks. The results show that TaCL brings consistent and notable improvements over the original BERT model. Furthermore, we conduct a detailed ablation study and careful analysis to reveal the merits and inner workings of our approach.
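
A token-aware contrastive objective of this kind can be pictured as an InfoNCE-style loss in which each student token representation is pulled toward the corresponding teacher token and pushed away from the other tokens in the sequence. The sketch below is a generic restatement of that idea, not TaCL's exact loss.

import torch
import torch.nn.functional as F

# Generic token-level contrastive loss: for each position i, the teacher's
# representation of token i is the positive; all other positions are negatives.
# This illustrates the idea only and is not TaCL's exact objective.
def token_contrastive_loss(student, teacher, temperature=0.05):
    # student, teacher: [seq_len, hidden]
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature              # [seq_len, seq_len] similarities
    labels = torch.arange(s.size(0))            # positive pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

student = torch.randn(16, 768, requires_grad=True)   # continually trained model
teacher = torch.randn(16, 768)                        # e.g. a frozen copy of BERT
print(token_contrastive_loss(student, teacher).item())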

2 citations

Posted Content
TL;DR: The authors proposed an intermediate training strategy to enhance pre-trained language models' performance on the text auto-completion task and quickly adapt them to specific domains, such as email and academic writing.
Abstract: Language models such as GPT-2 have performed well at constructing syntactically sound sentences for the text auto-completion task. However, such models often require considerable training effort to adapt to specific writing domains (e.g., medical). In this paper, we propose an intermediate training strategy to enhance pre-trained language models' performance on the text auto-completion task and quickly adapt them to specific domains. Our strategy includes a novel self-supervised training objective called Next Phrase Prediction (NPP), which encourages a language model to complete a partial query with enriched phrases and eventually improves the model's text auto-completion performance. Preliminary experiments show that our approach outperforms the baselines in auto-completion for the email and academic writing domains.
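
The Next Phrase Prediction objective can be illustrated by turning ordinary sentences into (partial query, next phrase) training pairs. The fixed word-count split used below is a simplifying assumption for illustration; the paper's actual phrase segmentation may differ.

# Illustrative construction of (partial query, next phrase) pairs for a
# next-phrase-prediction style objective. The split point here is a naive
# word-count rule; the paper's phrase segmentation may differ.
def make_npp_pairs(sentence, prefix_len=4, phrase_len=3):
    words = sentence.split()
    pairs = []
    for start in range(prefix_len, len(words) - phrase_len + 1, phrase_len):
        prefix = " ".join(words[:start])
        phrase = " ".join(words[start:start + phrase_len])
        pairs.append((prefix, phrase))
    return pairs

sentence = "we propose an intermediate training strategy for domain specific auto completion"
for prefix, phrase in make_npp_pairs(sentence):
    print(f"input: {prefix!r} -> target: {phrase!r}")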

2 citations
