Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
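To make the text-to-text framing concrete, here is a minimal sketch using the released T5 checkpoints through the Hugging Face `transformers` library (a convenience assumption; the paper's own code lives in the google-research `text-to-text-transfer-transformer` repository). The task prefixes are the ones the paper uses: every input and output is plain text, whatever the task.

```python
# Minimal sketch of the text-to-text framing via Hugging Face `transformers`
# (an assumption of convenience; not the paper's original codebase).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is rendered as text with a task prefix, and the model always
# answers in text: translation, acceptability classification, and
# summarization share one interface (even regression targets such as STS-B
# similarity scores are emitted as strings like "3.8").
examples = [
    "translate English to German: The house is wonderful.",
    "cola sentence: The course is jumping well.",
    "summarize: state authorities dispatched emergency crews tuesday to "
    "survey the damage after an onslaught of severe weather in mississippi.",
]
for text in examples:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```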


Citations
Posted Content
TL;DR: RnG-KBQA first uses a contrastive ranker to rank a set of candidate logical forms obtained by searching over the knowledge graph, then introduces a tailored generation model conditioned on the question and the top-ranked candidates to compose the final logical form.
Abstract: Existing KBQA approaches, despite achieving strong performance on i.i.d. test data, often struggle in generalizing to questions involving unseen KB schema items. Prior ranking-based approaches have shown some success in generalization, but suffer from the coverage issue. We present RnG-KBQA, a Rank-and-Generate approach for KBQA, which remedies the coverage issue with a generation model while preserving a strong generalization capability. Our approach first uses a contrastive ranker to rank a set of candidate logical forms obtained by searching over the knowledge graph. It then introduces a tailored generation model conditioned on the question and the top-ranked candidates to compose the final logical form. We achieve new state-of-the-art results on GrailQA and WebQSP datasets. In particular, our method surpasses the prior state-of-the-art by a large margin on the GrailQA leaderboard. In addition, RnG-KBQA outperforms all prior approaches on the popular WebQSP benchmark, even including the ones that use the oracle entity linking. The experimental results demonstrate the effectiveness of the interplay between ranking and generation, which leads to the superior performance of our proposed approach across all settings with especially strong improvements in zero-shot generalization.
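The rank-and-generate interplay described above can be sketched in a few lines; `rank_score` and `generate` below are hypothetical stand-ins for the paper's contrastive ranker and tailored generation model, and the prompt format is illustrative.

```python
from typing import Callable, List

def rng_kbqa(
    question: str,
    candidates: List[str],                    # logical forms found by KB search
    rank_score: Callable[[str, str], float],  # contrastive ranker (hypothetical)
    generate: Callable[[str], str],           # seq2seq generator (hypothetical)
    top_k: int = 5,
) -> str:
    """Rank-and-generate: rank candidate logical forms against the question,
    then let a generation model compose the final logical form conditioned on
    the question plus the top-ranked candidates, which lets it produce forms
    the enumeration step may have missed (the coverage issue)."""
    ranked = sorted(candidates, key=lambda lf: rank_score(question, lf),
                    reverse=True)
    prompt = question + " ; " + " ; ".join(ranked[:top_k])
    return generate(prompt)
```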

5 citations

Posted Content
TL;DR: The authors train a machine reading comprehension model using a small number of semi-structured explanations that explicitly inform machines why answer spans are correct.
Abstract: Advances in machine reading comprehension (MRC) rely heavily on the collection of large scale human-annotated examples in the form of (question, paragraph, answer) triples. In contrast, humans are typically able to generalize with only a few examples, relying on deeper underlying world knowledge, linguistic sophistication, and/or simply superior deductive powers. In this paper, we focus on "teaching" machines reading comprehension, using a small number of semi-structured explanations that explicitly inform machines why answer spans are correct. We extract structured variables and rules from explanations and compose neural module teachers that annotate instances for training downstream MRC models. We use learnable neural modules and soft logic to handle linguistic variation and overcome sparse coverage; the modules are jointly optimized with the MRC model to improve final performance. On the SQuAD dataset, our proposed method achieves 70.14% F1 score with supervision from 26 explanations, comparable to plain supervised learning using 1,100 labeled instances, yielding a 12x speed up.
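As a toy illustration of the annotate-then-train idea, the sketch below compiles one explanation into a hard matching rule that pseudo-labels answer spans. The paper replaces such brittle pattern matching with learnable neural modules and soft logic, so treat this purely as a sketch of the pipeline, not the method; the explanation wording and helper names are hypothetical.

```python
import re
from typing import Optional

def rule_from_explanation(trigger: str, answer_pattern: str):
    """Compile a crude rule from an explanation such as 'the answer is the
    4-digit number right after the phrase X'. Returns an annotator that
    pseudo-labels answer spans in unlabeled paragraphs; a downstream MRC
    model is then trained on these annotations."""
    pattern = re.compile(rf"\b{re.escape(trigger)}\s+({answer_pattern})")

    def annotate(paragraph: str) -> Optional[str]:
        match = pattern.search(paragraph)
        return match.group(1) if match else None

    return annotate

# Usage: pseudo-label instances, then train a standard MRC model on them.
annotate = rule_from_explanation("founded in", r"\d{4}")
print(annotate("The company was founded in 1998 in Menlo Park."))  # -> 1998
```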

4 citations

Proceedings Article
03 May 2021
TL;DR: This paper proposed Pointwise Mutual Information (PMI) masking, which jointly masks a token n-gram if it exhibits high collocation over the corpus, and showed that PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking.
Abstract: Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of pretraining.
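As a sketch of the underlying quantity, the snippet below computes the PMI of a bigram from corpus counts: high-PMI spans are collocations whose parts predict each other, and PMI-Masking masks such spans jointly rather than token by token. This is the textbook bigram PMI; the paper extends the measure to longer n-grams with a segmentation-aware variant, and the toy counts here are illustrative.

```python
import math
from collections import Counter

def pmi(ngram, unigrams, ngrams, n_tokens, n_ngrams):
    """Pointwise mutual information of a token n-gram: the log of how much
    more often the span occurs than its tokens would if independent."""
    p_span = ngrams[ngram] / n_ngrams
    p_indep = math.prod(unigrams[w] / n_tokens for w in ngram)
    return math.log(p_span / p_indep)

# Toy usage on a tiny corpus (counts are illustrative only):
corpus = "the editorial board met and the editorial board voted".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
score = pmi(("editorial", "board"), unigrams, bigrams,
            len(corpus), len(corpus) - 1)
print(f"PMI(editorial, board) = {score:.2f}")  # high: mask the span jointly
```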

4 citations

Proceedings ArticleDOI
28 Jul 2020
TL;DR: This initial work compares the performance of the SentenceClassifier (StC) against the Token-Classifier (TkC), using state-of-the-art recurrent neural networks (RNN), to detect fall incidents in progress notes, and shows that deep-learning algorithms used as token classifiers outperform text classifiers.
Abstract: Electronic health records (EHR) are a key source of information for identifying adverse events in patients. The largest category of adverse events in hospitals is fall incidents. Identifying such incidents leads to a better comprehension of the event and enhances the quality of patient health care. In this initial work, we compare the performance of the SentenceClassifier (StC) against the Token-Classifier (TkC) with state-of-the-art recurrent neural networks (RNN) to detect fall incidents in progress notes. Our experiments show that using deep-learning algorithms as token classifiers outperforms text classifiers, improving fall identification from 65% (F-measure) with StC to 92% with TkC. Additionally, the token classifier can explain which words are most important for a positive detection.
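The distinction the paper draws can be made concrete with two RNN heads. The sketch below is a hypothetical PyTorch rendering (the LSTM choice and layer sizes are assumptions, not the paper's configuration): the sentence classifier emits one fall/no-fall label per note, while the token classifier emits a label per word, which is what lets it point at the words driving a positive detection.

```python
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """StC framing: one fall/no-fall label for the whole progress note."""
    def __init__(self, vocab_size: int, dim: int = 64, n_classes: int = 2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        _, (h_n, _) = self.rnn(self.emb(tokens))
        return self.out(h_n[-1])            # (batch, n_classes)

class TokenClassifier(nn.Module):
    """TkC framing: a label per token, so positive evidence is localized."""
    def __init__(self, vocab_size: int, dim: int = 64, n_classes: int = 2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_classes)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        hidden, _ = self.rnn(self.emb(tokens))
        return self.out(hidden)             # (batch, seq_len, n_classes)
```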

4 citations

Proceedings Article
10 Apr 2021
TL;DR: This paper proposed meta-tuning, which directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets, and constructed the meta-dataset by aggregating 43 existing datasets and annotating 441 label descriptions in a question-answering (QA) format.
Abstract: Large pre-trained language models (LMs) such as GPT-3 have acquired a surprising ability to perform zero-shot learning. For example, to classify sentiment without any training examples, we can “prompt” the LM with the review and the label description “Does the user like this movie?”, and ask whether the next word is “yes” or “no”. However, the next word prediction training objective is still misaligned with the target zero-shot learning objective. To address this weakness, we propose meta-tuning, which directly optimizes the zero-shot learning objective by fine-tuning pre-trained language models on a collection of datasets. We focus on classification tasks, and construct the meta-dataset by aggregating 43 existing datasets and annotating 441 label descriptions in a question-answering (QA) format. When evaluated on unseen tasks, meta-tuned models outperform a same-sized QA model and the previous SOTA zero-shot learning system based on natural language inference. Additionally, increasing parameter count from 220M to 770M improves AUC-ROC scores by 6.3%, and we forecast that even larger models would perform better. Therefore, measuring zero-shot learning performance on language models out-of-the-box might underestimate their true potential, and community-wide efforts on aggregating datasets and unifying their formats can help build models that answer prompts better.
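The abstract's movie-review example suggests how the meta-tuning data might be rendered. The sketch below is a hypothetical version of that QA-format conversion (the exact template is not given in the abstract, so the rendering is a guess): each dataset contributes (text, label-description question) pairs with "yes"/"no" targets, and the LM is fine-tuned across the whole collection.

```python
def to_zero_shot_prompt(text: str, label_description: str) -> str:
    """Render a classification example as a yes/no question, the format used
    to fine-tune the LM across many datasets (template is illustrative)."""
    return f"{text}\nQuestion: {label_description} Answer yes or no."

# One sentiment example; the label description is quoted from the abstract.
prompt = to_zero_shot_prompt(
    "The plot was thin, but the performances were outstanding.",
    "Does the user like this movie?",
)
target = "yes"  # training target for a positive review
print(prompt, "->", target)
```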

4 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.