Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020, Journal of Machine Learning Research, Vol. 21, Iss. 140, pp. 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
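
To illustrate the text-to-text format the abstract describes, here is a minimal Python sketch, assuming a simple task-prefix convention. The helper name `to_text_to_text`, the prefix strings, and the example sentences are all hypothetical and are not taken from the paper's released code; the sketch only shows that summarization, classification, and question answering can share one (input string, target string) shape.

```python
from typing import List, Tuple


def to_text_to_text(task_prefix: str, text: str, target: object) -> Tuple[str, str]:
    """Cast one labeled example into an (input text, target text) pair.

    Hypothetical convention: the task name is prepended to the input,
    and the label is rendered as a plain string.
    """
    return f"{task_prefix}: {text}", str(target)


# Three different task types, one shared string-to-string format.
pairs: List[Tuple[str, str]] = [
    to_text_to_text(
        "summarize",
        "The committee met on Tuesday and approved the budget.",
        "Committee approves budget.",
    ),
    to_text_to_text("classify sentiment", "A delightful film.", "positive"),
    to_text_to_text(
        "answer question",
        "question: Who wrote Hamlet? context: Hamlet is a play by William Shakespeare.",
        "William Shakespeare",
    ),
]

for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

Because every example ends up as a pair of strings, a single sequence-to-sequence model and a single training loss can in principle be used across all of the tasks.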

Citations

Posted Content

What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

Boseop Kim, HyoungSeok Kim, Sang Woo Lee¹, Gichang Lee, Dong-Hyun Kwak¹, Dong Hyeon Jeon, Sunghyun Park², Sungju Kim, Seonhoon Kim³, Dongpil Seo, Heungsub Lee, Minyoung Jeong, Sungjae Lee⁴, Minsub Kim, Suk Hyun Ko, Seokhun Kim, Taeyong Park³, Jinuk Kim, Soyoung Kang, Na-Hyeon Ryu, Kang Min Yoo¹, Minsuk Chang⁵, Soobin Suh, Sookyo In, Jin-Seong Park⁶, Kyungduk Kim⁷, Hiun Kim, Jisu Jeong¹, Yong Goo Yeo, Donghoon Ham, Dongju Park, Min Young Lee⁸, Jae-Wook Kang⁹, Inho Kang¹, Jung-Woo Ha¹, Woo-Myoung Park⁷, Nako Sung¹

Naver Corporation¹, Amazon.com², Seoul National University³, Dong-eui University⁴, KAIST⁵, Hanyang University⁶, Samsung⁷, Yonsei University⁸, Chonbuk National University⁹

10 Sep 2021, arXiv: Computation and Language

TL;DR: HyperCLOVA is a Korean variant of 82B GPT-3 trained on a Korean-centric corpus of 560B tokens; it shows state-of-the-art zero-shot and few-shot learning performance on various downstream tasks in Korean.

Abstract: GPT-3 shows remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billion scale data. Here we address some remaining issues less reported by the GPT-3 paper, such as a non-English LM, the performances of different sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, a Korean variant of 82B GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean-specific tokenization, HyperCLOVA with our training configuration shows state-of-the-art in-context zero-shot and few-shot learning performances on various downstream tasks in Korean. Also, we show the performance benefits of prompt-based learning and demonstrate how it can be integrated into the prompt engineering pipeline. Then we discuss the possibility of materializing the No Code AI paradigm by providing AI prototyping capabilities to non-experts of ML by introducing HyperCLOVA studio, an interactive prompt engineering interface. Lastly, we demonstrate the potential of our methods with three successful in-house applications.

6 citations

Proceedings Article

Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation.

Minki Kang¹, Moonsu Han, Sung Ju Hwang¹

KAIST¹

01 Nov 2020

TL;DR: This article proposed a method to automatically generate domain- and task-adaptive maskings of the given text for self-supervised pre-training, so that the language model can be effectively adapted to a particular target task (e.g. question answering).

Abstract: We propose a method to automatically generate domain- and task-adaptive maskings of the given text for self-supervised pre-training, such that we can effectively adapt the language model to a particular target task (e.g. question answering). Specifically, we present a novel reinforcement learning-based framework which learns the masking policy, such that using the generated masks for further pre-training of the target language model helps improve task performance on unseen texts. We use off-policy actor-critic with entropy regularization and experience replay for reinforcement learning, and propose a Transformer-based policy network that can consider the relative importance of words in a given text. We validate our Neural Mask Generator (NMG) on several question answering and text classification datasets using BERT and DistilBERT as the language models, on which it outperforms rule-based masking strategies, by automatically learning optimal adaptive maskings.

6 citations

Proceedings Article

ERNIE-NLI: Analyzing the Impact of Domain-Specific External Knowledge on Enhanced Representations for NLI

Lisa Bauer¹, Lingjia Deng, Mohit Bansal¹

University of North Carolina at Chapel Hill¹

01 Jun 2021

TL;DR: Using the ERNIE architecture, a detailed analysis is provided of the types of knowledge that result in a performance increase on the Natural Language Inference task, specifically on the Multi-Genre Natural Language Inference Corpus (MNLI).

Abstract: We examine the effect of domain-specific external knowledge variations on deep large scale language model performance. Recent work in enhancing BERT with external knowledge has been very popular, resulting in models such as ERNIE (Zhang et al., 2019a). Using the ERNIE architecture, we provide a detailed analysis on the types of knowledge that result in a performance increase on the Natural Language Inference (NLI) task, specifically on the Multi-Genre Natural Language Inference Corpus (MNLI). While ERNIE uses general TransE embeddings, we instead train domain-specific knowledge embeddings and insert this knowledge via an information fusion layer in the ERNIE architecture, allowing us to directly control and analyze knowledge input. Using several different knowledge training objectives, sources of knowledge, and knowledge ablations, we find a strong correlation between knowledge and classification labels within the same polarity, illustrating that knowledge polarity is an important feature in predicting entailment. We also perform classification change analysis across different knowledge variations to illustrate the importance of selecting appropriate knowledge input regarding content and polarity, and show representative examples of these changes.

6 citations

Proceedings Article

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

01 Jan 2022

TL;DR: This article proposed a unified plug-and-play model for task-oriented dialogue, called PPTOD, which uses a multi-task pre-training strategy that allows the model to learn the primary task completion skills from heterogeneous dialog corpora.

Abstract: Pre-trained language models have been recently shown to benefit task-oriented dialogue (TOD) systems. Despite their success, existing methods often formulate this task as a cascaded generation problem which can lead to error accumulation across different sub-tasks and greater data annotation overhead. In this study, we present PPTOD, a unified plug-and-play model for task-oriented dialogue. In addition, we introduce a new dialogue multi-task pre-training strategy that allows the model to learn the primary TOD task completion skills from heterogeneous dialog corpora. We extensively test our model on three benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification. Experimental results show that PPTOD achieves new state of the art on all evaluated tasks in both high-resource and low-resource scenarios. Furthermore, comparisons against previous SOTA methods show that the responses generated by PPTOD are more factually correct and semantically coherent as judged by human annotators.

6 citations

Posted Content

GooAQ: Open Question Answering with Diverse Answer Types

Daniel Khashabi¹, Amos Ng, Tushar Khot¹, Ashish Sabharwal¹, Hannaneh Hajishirzi², Chris Callison-Burch³

Allen Institute for Artificial Intelligence¹, Facebook², University of Pennsylvania³

18 Apr 2021, arXiv: Computation and Language

TL;DR: GooAQ is a large-scale dataset with a variety of answer types, containing over 5 million questions and 3 million answers collected from the Google search engine using its autocomplete feature and answer boxes.

Abstract: While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GooAQ and observe that: (a) in line with recent work, LMs' strong performance on GooAQ's short-answer questions heavily benefits from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as 'how' and 'why' questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types.

6 citations