Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
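
As a rough illustration of the text-to-text framing described in the abstract, the sketch below casts a few different tasks as plain string-to-string pairs and runs them through a public T5 checkpoint via the Hugging Face `transformers` library. The task prefixes follow the paper's convention, but the checkpoint name and generation settings here are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the text-to-text framing: every task becomes
# "input string -> output string" handled by one seq2seq model.
# Assumes the Hugging Face `transformers` package and the public
# "t5-small" checkpoint; prefixes follow the paper's convention.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    # translation
    "translate English to German: The house is wonderful.",
    # summarization
    "summarize: Transfer learning, where a model is first pre-trained "
    "on a data-rich task before being fine-tuned on a downstream task, "
    "has emerged as a powerful technique in NLP.",
    # sentence acceptability (classification cast as text generation)
    "cola sentence: The book was read by me yesterday.",
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```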


Citations
Proceedings Article
01 Nov 2021
TL;DR: This paper introduced ESTER, a dataset for event semantic relation reasoning that leverages natural language queries to reason about the five most common event semantic relations, provides more than 6K questions, and captures 10.1K event relation pairs.
Abstract: Understanding how events are semantically related to each other is the essence of reading comprehension. Recent event-centric reading comprehension datasets focus mostly on event arguments or temporal relations. While these tasks partially evaluate machines’ ability of narrative understanding, human-like reading comprehension requires the capability to process event-based information beyond arguments and temporal reasoning. For example, to understand causality between events, we need to infer motivation or purpose; to establish event hierarchy, we need to understand the composition of events. To facilitate these tasks, we introduce **ESTER**, a comprehensive machine reading comprehension (MRC) dataset for Event Semantic Relation Reasoning. The dataset leverages natural language queries to reason about the five most common event semantic relations, provides more than 6K questions, and captures 10.1K event relation pairs. Experimental results show that the current SOTA systems achieve 22.1%, 63.3% and 83.5% for token-based exact-match (**EM**), **F1** and event-based **HIT@1** scores, which are all significantly below human performances (36.0%, 79.6%, 100% respectively), highlighting our dataset as a challenging benchmark.
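
The token-based exact-match (EM) and F1 scores quoted above are standard extractive-QA style metrics; the sketch below shows how such scores are typically computed, assuming the usual SQuAD-style recipe (the official ESTER scorer may normalize or tie-break differently).

```python
# Sketch of token-level exact match (EM) and F1 between a predicted
# answer and a gold answer, following the common SQuAD-style recipe.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.lower().split() == gold.lower().split())

print(token_f1("the committee approved the merger", "approved the merger"))
print(exact_match("approved the merger", "approved the merger"))
```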

4 citations

Posted Content
TL;DR: This paper used definitions retrieved from traditional dictionaries to produce word embeddings for rare words and achieved state-of-the-art performance on embedding unknown (OOV) words.
Abstract: Word embeddings are powerful dictionaries, which may easily capture language variations. However, these dictionaries fail to give sense to rare words, which are surprisingly often covered by traditional dictionaries. In this paper, we propose to use definitions retrieved in traditional dictionaries to produce word embeddings for rare words. For this purpose, we introduce two methods: Definition Neural Network (DefiNNet) and Define BERT (DefBERT). In our experiments, DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods devised for producing embeddings of unknown words. In fact, DefiNNet significantly outperforms FastText, which implements a method for the same task based on n-grams, and DefBERT significantly outperforms the BERT method for OOV words. Hence, definitions in traditional dictionaries are useful to build word embeddings for rare words.
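
The core idea of giving a rare word an embedding through its dictionary definition can be sketched as below. This is not the paper's DefBERT implementation, just an illustrative recipe that mean-pools a BERT encoding of the definition text; the model name and pooling choice are assumptions.

```python
# Illustrative sketch (not the paper's DefBERT): approximate an
# embedding for a rare/OOV word by encoding its dictionary definition
# with a pre-trained BERT model and mean-pooling the token states.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def definition_embedding(definition: str) -> torch.Tensor:
    inputs = tokenizer(definition, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# e.g. an OOV word represented by its dictionary gloss
vec = definition_embedding("a small, crustless sandwich served at teas")
print(vec.shape)  # torch.Size([1, 768])
```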

4 citations

Posted Content
TL;DR: This article proposed two new metrics, label loyalty and probability loyalty, to measure how closely a compressed model (i.e., student) mimics the original model (i.e., teacher), and explored the effect of compression on robustness under adversarial attacks.
Abstract: Recent studies on compression of pretrained language models (e.g., BERT) usually use preserved accuracy as the metric for evaluation. In this paper, we propose two new metrics, label loyalty and probability loyalty that measure how closely a compressed model (i.e., student) mimics the original model (i.e., teacher). We also explore the effect of compression with regard to robustness under adversarial attacks. We benchmark quantization, pruning, knowledge distillation and progressive module replacing with loyalty and robustness. By combining multiple compression techniques, we provide a practical strategy to achieve better accuracy, loyalty and robustness.
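
The loyalty metrics can be sketched roughly as follows: label loyalty as the fraction of inputs on which the student's predicted label matches the teacher's, and probability loyalty as a similarity between the two output distributions. The Jensen-Shannon-based score below is one plausible reading of the latter; the paper's exact formulation may differ.

```python
# Rough sketch of the two loyalty metrics for a compressed "student"
# model against its original "teacher". Probability loyalty is shown
# as 1 - Jensen-Shannon distance (base 2, so values lie in [0, 1]);
# the paper's exact definition may differ.
import numpy as np
from scipy.spatial.distance import jensenshannon

def label_loyalty(teacher_probs: np.ndarray, student_probs: np.ndarray) -> float:
    # fraction of examples where the argmax labels agree
    return float(np.mean(teacher_probs.argmax(-1) == student_probs.argmax(-1)))

def probability_loyalty(teacher_probs: np.ndarray, student_probs: np.ndarray) -> float:
    # mean over examples of 1 - JS distance between output distributions
    dists = [jensenshannon(t, s, base=2) for t, s in zip(teacher_probs, student_probs)]
    return float(1.0 - np.mean(dists))

teacher = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
student = np.array([[0.7, 0.3], [0.3, 0.7], [0.4, 0.6]])
print(label_loyalty(teacher, student))        # 2/3 of labels agree
print(probability_loyalty(teacher, student))  # closer to 1 means more similar
```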

4 citations

Proceedings ArticleDOI
Xinyin Ma, Yongliang Shen, Gongfan Fang, Chen Chen, Chenghao Jia, Weiming Lu
01 Nov 2020
TL;DR: Ma et al. proposed a two-stage data-free distillation method, named Adversarial self-Supervised Data-Free Distillation (AS-DFD), designed for compressing large-scale transformer-based models (e.g., BERT).
Abstract: Large pre-trained transformer-based language models have achieved impressive results on a wide range of NLP tasks. In the past few years, Knowledge Distillation(KD) has become a popular paradigm to compress a computationally expensive model to a resource-efficient lightweight model. However, most KD algorithms, especially in NLP, rely on the accessibility of the original training dataset, which may be unavailable due to privacy issues. To tackle this problem, we propose a novel two-stage data-free distillation method, named Adversarial self-Supervised Data-Free Distillation (AS-DFD), which is designed for compressing large-scale transformer-based models (e.g., BERT). To avoid text generation in discrete space, we introduce a Plug & Play Embedding Guessing method to craft pseudo embeddings from the teacher’s hidden knowledge. Meanwhile, with a self-supervised module to quantify the student’s ability, we adapt the difficulty of pseudo embeddings in an adversarial training manner. To the best of our knowledge, our framework is the first data-free distillation framework designed for NLP tasks. We verify the effectiveness of our method on several text classification datasets.
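
A heavily simplified sketch of the data-free distillation idea follows: with no real training text, optimize "pseudo embeddings" so that the frozen teacher is confident on them, then train the student to match the teacher's outputs on those embeddings. Toy MLPs stand in for BERT-sized models here, and the adversarial/self-supervised components of AS-DFD are omitted; this shows only the general recipe, not the paper's procedure.

```python
# Generic data-free distillation sketch (NOT the paper's AS-DFD):
# 1) craft pseudo embeddings the frozen teacher is confident about,
# 2) distill the student against the teacher on those embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, num_classes, batch = 32, 4, 16

teacher = torch.nn.Sequential(torch.nn.Linear(dim, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, num_classes))
student = torch.nn.Sequential(torch.nn.Linear(dim, 16), torch.nn.ReLU(),
                              torch.nn.Linear(16, num_classes))
for p in teacher.parameters():
    p.requires_grad_(False)

# Step 1: optimize pseudo embeddings to maximize teacher confidence.
pseudo = torch.randn(batch, dim, requires_grad=True)
emb_opt = torch.optim.Adam([pseudo], lr=0.1)
for _ in range(100):
    logits = teacher(pseudo)
    loss = -(F.log_softmax(logits, dim=-1).max(dim=-1).values).mean()
    emb_opt.zero_grad()
    loss.backward()
    emb_opt.step()

# Step 2: distill the student on the crafted embeddings (KL to teacher).
pseudo_fixed = pseudo.detach()
stu_opt = torch.optim.Adam(student.parameters(), lr=1e-2)
for _ in range(200):
    with torch.no_grad():
        t_logp = F.log_softmax(teacher(pseudo_fixed), dim=-1)
    s_logp = F.log_softmax(student(pseudo_fixed), dim=-1)
    kd_loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
    stu_opt.zero_grad()
    kd_loss.backward()
    stu_opt.step()

print("final KD loss:", float(kd_loss))
```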

4 citations

Proceedings Article
16 Apr 2021
TL;DR: This article showed that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning, proposed a method for evaluating WS-like sentences in a zero-shot setting to account for the commonsense reasoning abilities acquired during pretraining, and observed that popular language models perform randomly in this setting under the stricter evaluation.
Abstract: The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current evaluation method of WS is sub-optimal and propose a modification that uses twin sentences for evaluation. We also propose two new baselines that indicate the existence of artifacts in WS benchmarks. We then develop a method for evaluating WS-like sentences in a zero-shot setting to account for the commonsense reasoning abilities acquired during the pretraining and observe that popular language models perform randomly in this setting when using our more strict evaluation. We conclude that the observed progress is mostly due to the use of supervision in training WS models, which is not likely to successfully support all the required commonsense reasoning skills and knowledge.
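
Zero-shot evaluation of WS-like sentences typically scores each candidate referent by the language-model likelihood of the sentence with that candidate substituted for the pronoun. The sketch below does this with GPT-2 as the scorer; the model choice and scoring details are assumptions, not the paper's exact protocol.

```python
# Sketch of zero-shot Winograd-style evaluation: substitute each
# candidate referent into the sentence and pick the candidate whose
# completed sentence the language model finds more probable.
# GPT-2 is used as the scorer here; the paper's protocol may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_score(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # mean per-token negative log-likelihood; negate so higher = better
        loss = model(ids, labels=ids).loss
    return -loss.item()

schema = "The trophy does not fit into the suitcase because the {} is too large."
candidates = ["trophy", "suitcase"]
scores = {c: sentence_score(schema.format(c)) for c in candidates}
print(max(scores, key=scores.get), scores)
```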

4 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.