Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and systematically compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
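
The unifying idea is that a single model and a single decoding interface serve every task: inputs are task-prefixed strings and outputs are always strings. Below is a minimal sketch of that text-to-text interface using the publicly released t5-small checkpoint via the Hugging Face transformers library; the library choice is an assumption, since the authors released their own codebase.

```python
# Minimal text-to-text sketch: one model, one interface for every task.
# Assumes the Hugging Face `transformers` library and the public
# `t5-small` checkpoint; the paper's own code release differs.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Each task is selected purely by a text prefix; outputs are always text.
prompts = [
    "translate English to German: The house is wonderful.",
    "cola sentence: The car drove me to.",  # grammaticality -> "acceptable"/"unacceptable"
    "summarize: Transfer learning, where a model is first pre-trained "
    "on a data-rich task before being fine-tuned on a downstream task, "
    "has emerged as a powerful technique in NLP.",
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```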


Citations
Posted Content
TL;DR: This article proposes three strategies for training neural models on an existing media frame corpus to generate properly framed text: framed-language pretraining, named-entity preservation, and adversarial learning.
Abstract: Framing a news article means to portray the reported event from a specific perspective, e.g., from an economic or a health perspective. Reframing means to change this perspective. Depending on the audience or the submessage, reframing can become necessary to achieve the desired effect on the readers. Reframing is related to adapting style and sentiment, which can be tackled with neural text generation techniques. However, it is more challenging since changing a frame requires rewriting entire sentences rather than single phrases. In this paper, we study how to computationally reframe sentences in news articles while maintaining their coherence to the context. We treat reframing as a sentence-level fill-in-the-blank task for which we train neural models on an existing media frame corpus. To guide the training, we propose three strategies: framed-language pretraining, named-entity preservation, and adversarial learning. We evaluate respective models automatically and manually for topic consistency, coherence, and successful reframing. Our results indicate that generating properly-framed text works well but with tradeoffs.
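
To make the setup concrete, here is a hypothetical sketch of the sentence-level fill-in-the-blank formulation the abstract describes: one sentence is blanked out, the surrounding article context plus a target frame label form the source, and the original sentence is the generation target. The field names and blank convention below are illustrative, not the paper's actual data format.

```python
# Hypothetical sketch of sentence-level fill-in-the-blank reframing.
# The <blank> convention and frame labels are illustrative only.
BLANK = "<blank>"

def make_reframing_example(sentences, i, target_frame):
    """Blank out sentence i and ask for a rewrite in `target_frame`.

    Returns a (source, target) pair for sequence-to-sequence training:
    the source carries the frame label plus the blanked context, and
    the target is the sentence the model should generate.
    """
    context = sentences[:i] + [BLANK] + sentences[i + 1:]
    source = f"frame: {target_frame} | " + " ".join(context)
    target = sentences[i]  # in training data, already written in target_frame
    return source, target

sentences = [
    "The city approved the new factory on Monday.",
    "Officials say it will create 500 jobs.",
    "Residents worry about air quality near schools.",
]
src, tgt = make_reframing_example(sentences, 1, target_frame="economic")
print(src)
print(tgt)
```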

4 citations

Proceedings ArticleDOI
01 Jun 2021
TL;DR: CoRT is shown to significantly increase candidate recall by complementing BM25 with missing candidates, and passage retrieval using CoRT can be realized with surprisingly low latencies.
Abstract: Many recent approaches towards neural information retrieval mitigate their computational costs by using a multi-stage ranking pipeline. In the first stage, a number of potentially relevant candidates are retrieved using an efficient retrieval model such as BM25. Although BM25 has proven decent performance as a first-stage ranker, it tends to miss relevant passages. In this context we propose CoRT, a simple neural first-stage ranking model that leverages contextual representations from pretrained language models such as BERT to complement term-based ranking functions while causing no significant delay at query time. Using the MS MARCO dataset, we show that CoRT significantly increases the candidate recall by complementing BM25 with missing candidates. Consequently, we find that subsequent re-rankers achieve superior results with fewer candidates. We further demonstrate that passage retrieval using CoRT can be realized with surprisingly low latencies.
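
A sketch of the candidate-complementing idea: the term-based and neural rankers retrieve independently, and their ranked lists are interleaved so that relevant passages BM25 misses still reach the re-ranker. The two ranked lists below are toy values, and interleaving is one plausible merging strategy rather than necessarily CoRT's exact procedure.

```python
# Sketch of complementing BM25 with a neural first-stage ranker.
# The two input lists stand in for a term-based index and a
# BERT-based encoder index; doc ids here are toy values.
def merge_candidates(bm25_ranked, neural_ranked, k):
    """Interleave two ranked lists, dropping duplicates, up to k ids.

    Passages found only by the neural ranker are the "missing
    candidates" that lift first-stage recall.
    """
    merged, seen = [], set()
    for pair in zip(bm25_ranked, neural_ranked):
        for doc_id in pair:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
            if len(merged) == k:
                return merged
    return merged

bm25_ranked = ["d3", "d7", "d1", "d9"]
neural_ranked = ["d4", "d3", "d8", "d2"]
print(merge_candidates(bm25_ranked, neural_ranked, k=6))
# ['d3', 'd4', 'd7', 'd1', 'd8', 'd9']
```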

4 citations

Journal Article
TL;DR: This article addresses the high computational and memory cost of Transformers in NLP through Approximate Computing, proposing a framework that creates models that are faster, smaller, and/or more accurate than the original pre-trained model, depending on the user's constraints.
Abstract: Transformer models have garnered a lot of interest in recent years by delivering state-of-the-art performance in a range of Natural Language Processing (NLP) tasks. However, these models can have over a hundred billion parameters, presenting very high computational and memory requirements. We address this challenge through Approximate Computing, specifically targeting the use of Transformers in NLP tasks. Transformers are typically pre-trained and subsequently specialized for specific tasks through transfer learning. We observe that pre-trained Transformers are often over-parameterized for several downstream NLP tasks and propose a framework to create smaller and faster models with comparable accuracy. The key cornerstones of the framework are a Significance Analysis (SA) method to identify important components in a pre-trained Transformer for a given task, and techniques to approximate the less significant components. Our framework can be adapted to produce models that are faster, smaller and/or more accurate, depending on the user's constraints. We apply our framework to multiple Transformer models and different downstream tasks, including previously proposed optimized models like DistilBERT and Q8BERT. We demonstrate that our framework produces models that are up to 4× faster and up to 14× smaller (with less than 0.5% relative accuracy degradation), or up to 5.5% more accurate with simultaneous model size and speed improvements of up to 9.8× and 2.9×, respectively.
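
A hedged sketch of the Significance Analysis idea: score each prunable component, such as an attention head, by how much held-out accuracy drops when it is masked, then approximate or remove the least significant components first. The toy model and evaluation below exist only to make the sketch self-contained and runnable; the paper's actual SA method may score components differently.

```python
# Hedged sketch of significance analysis for Transformer components.
# The toy "model" is a dict of per-head contributions; real SA would
# evaluate a pre-trained Transformer on a downstream task.
from contextlib import contextmanager

def evaluate(model):
    """Toy stand-in: accuracy is the sum of active head contributions."""
    return sum(v for head, v in model["head_value"].items()
               if head not in model["masked"])

@contextmanager
def mask_component(model, comp):
    """Temporarily disable one component (e.g., an attention head)."""
    model["masked"].add(comp)
    try:
        yield
    finally:
        model["masked"].discard(comp)

def significance_scores(model, components):
    """Score each component by the accuracy drop its removal causes."""
    base = evaluate(model)
    scores = {}
    for comp in components:
        with mask_component(model, comp):
            scores[comp] = base - evaluate(model)
    return scores

def prune_least_significant(scores, budget):
    """Return the `budget` components whose removal hurts least."""
    return sorted(scores, key=scores.get)[:budget]

model = {"head_value": {"h0": 0.02, "h1": 0.00, "h2": 0.10}, "masked": set()}
scores = significance_scores(model, ["h0", "h1", "h2"])
print(prune_least_significant(scores, budget=2))  # ['h1', 'h0']
```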

4 citations

Posted Content
TL;DR: The authors provide a comprehensive assessment of recently proposed dialog evaluation metrics across a number of datasets, comparing their performance on turn-level and dialog-level data and across different types of response generation models (generative, retrieval, simple, and state-of-the-art).
Abstract: Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets, and no systematic comparison between them has yet been made. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets. In this paper, 23 different automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings, to better qualify their respective strengths and weaknesses. Metrics are assessed (1) on both the turn level and the dialog level, (2) for different dialog lengths, (3) for different dialog qualities (e.g., coherence, engagingness), (4) for different types of response generation models (i.e., generative, retrieval, simple models and state-of-the-art models), (5) taking into account the similarity of different metrics and (6) exploring combinations of different metrics. This comprehensive assessment offers several takeaways pertaining to dialog evaluation metrics in general. It also suggests how to best assess evaluation metrics and indicates promising directions for future work.
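
The core computation behind such an assessment is the correlation between each automatic metric's scores and human judgements. A minimal sketch with SciPy, using toy scores purely for illustration (not data from the paper):

```python
# Minimal sketch of turn-level metric assessment: correlate each
# automatic metric's scores with human ratings. Values are toy
# illustrations, not results from the paper.
from scipy.stats import pearsonr, spearmanr

human = [4.5, 2.0, 3.5, 1.0, 5.0, 2.5]  # human quality ratings per response
metrics = {
    "metric_a": [0.81, 0.40, 0.66, 0.15, 0.90, 0.52],
    "metric_b": [0.30, 0.70, 0.20, 0.90, 0.10, 0.60],
}
for name, scores in metrics.items():
    rho, p = spearmanr(scores, human)
    r, _ = pearsonr(scores, human)
    print(f"{name}: Spearman={rho:.2f} (p={p:.3f}), Pearson={r:.2f}")
```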

4 citations

Posted Content
Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey E. Hinton
TL;DR: Pix2Seq casts object detection as a language modeling task conditioned on the observed pixel inputs: object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and a neural network is trained to perceive the image and generate the desired sequence.
Abstract: This paper presents Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we simply cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural net to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural net knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
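
The central construction is the discretization of boxes: continuous coordinates are quantized into a fixed number of bins so that each object becomes a short token sequence of the form [ymin, xmin, ymax, xmax, class]. The sketch below follows that general scheme; the bin count and vocabulary layout are assumptions, not the paper's exact settings.

```python
# Sketch of Pix2Seq-style box tokenization: quantize coordinates into
# discrete bins so a detection becomes a short token sequence.
# NUM_BINS and the [ymin, xmin, ymax, xmax, class] order follow the
# paper's general scheme; the exact settings are an assumption.
NUM_BINS = 1000

def quantize(coord, extent):
    """Map a pixel coordinate in [0, extent] to a bin in [0, NUM_BINS-1]."""
    return min(int(coord / extent * NUM_BINS), NUM_BINS - 1)

def box_to_tokens(box, img_w, img_h, class_id):
    """Serialize one box (xmin, ymin, xmax, ymax) as five discrete tokens.

    Class tokens are offset past the coordinate vocabulary so the two
    token types never collide.
    """
    xmin, ymin, xmax, ymax = box
    return [
        quantize(ymin, img_h),
        quantize(xmin, img_w),
        quantize(ymax, img_h),
        quantize(xmax, img_w),
        NUM_BINS + class_id,
    ]

print(box_to_tokens((48.0, 32.0, 320.0, 240.0), img_w=640, img_h=480, class_id=17))
# [66, 75, 500, 500, 1017]
```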

4 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.