Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization

[...]

Yichen Jiang, Asli Celikyilmaz¹, Paul Smolensky¹, Paul Soulos, Sudha Rao¹, Hamid Palangi¹, Roland Fernandez¹, Caitlin Smith², Mohit Bansal³, Jianfeng Gao¹ - Show less +6 more•Institutions (3)

Microsoft¹, University of Southern California², University of North Carolina at Chapel Hill³

02 Jun 2021

TL;DR: This paper adapts TP-Transformer, an architecture that enriches the original Transformer with the explicitly compositional Tensor Product Representation with the argument that the structured intermediate representations enable the model to take better control of the contents and structures when generating the summary.

...read moreread less

Abstract: ive summarization, the task of generating a concise summary of input documents, requires: (1) reasoning over the source document to determine the salient pieces of information scattered across the long document, and (2) composing a cohesive text by reconstructing these salient facts into a shorter summary that faithfully reflects the complex relations connecting these facts. In this paper, we adapt TP-Transformer (Schlag et al., 2019), an architecture that enriches the original Transformer (Vaswani et al., 2017) with the explicitly compositional Tensor Product Representation (TPR), for the task of abstractive summarization. The key feature of our model is a structural bias that we introduce by encoding two separate representations for each token to represent the syntactic structure (with role vectors) and semantic content (with filler vectors) separately. The model then binds the role and filler vectors into the TPR as the layer output. We argue that the structured intermediate representations enable the model to take better control of the contents (salient facts) and structures (the syntax that connects the facts) when generating the summary. Empirically, we show that our TP-Transformer outperforms the Transformer and the original TP-Transformer significantly on several abstractive summarization datasets based on both automatic and human evaluations. On several syntactic and semantic probing tasks, we demonstrate the emergent structural information in the role vectors and the performance gain by information specificity of the role vectors and improved syntactic interpretability in the TPR layer outputs.(Code and models are available at https://github.com/jiangycTarheel/TPT-Summ)

...read moreread less

7 citations

Proceedings Article•DOI•

Reordering Examples Helps during Priming-based Few-Shot Learning

[...]

Sawan Kumar¹, Partha Pratim Talukdar¹•Institutions (1)

Indian Institute of Science¹

01 Aug 2021

TL;DR: The authors propose to use examples as prompts for few-shot learning and demonstrate the effectiveness of the proposed method on the tasks of sentiment classification, natural language inference, and fact retrieval, including the idea that two training examples in the right order alone can provide competitive performance.

...read moreread less

Abstract: The ability to learn from limited data, or few-shot learning, is a desirable and often critical requirement for NLP systems. While many existing methods do poorly at learning from a handful of examples, large pretrained language models have recently been shown to be efficient few-shot learners. One approach to few-shot learning, which does not require finetuning of model parameters, is to augment the language model's input with priming text which is typically constructed using task specific descriptions and examples. In this work, we further explore priming-based few-shot learning, with focus on using examples as prompts. We show that presenting examples in the right order is key for generalization. We introduce PERO (Prompting with Examples in the Right Order), where we formulate few-shot learning as search over the set of permutations of the training examples. We show that PERO can learn to generalize efficiently using as few as 10 examples, in contrast to existing approaches. While the newline token is a natural choice for separating the examples in the prompt, we show that learning a new separator token can potentially provide further gains in performance. We demonstrate the effectiveness of the proposed method on the tasks of sentiment classification, natural language inference and fact retrieval. Finally, we analyze the learned prompts to reveal novel insights, including the idea that two training examples in the right order alone can provide competitive performance for sentiment classification and natural language inference.

...read moreread less

7 citations

Proceedings Article•

REBEL: Relation Extraction By End-to-end Language generation.

[...]

Pere-Lluís Huguet Cabot, Roberto Navigli¹•Institutions (1)

Sapienza University of Rome¹

01 Nov 2021

TL;DR: This article proposed an autoregressive seq2seq model for relation triplets extraction, which can be expressed as a sequence of text and performed end-to-end relation extraction for more than 200 different relation types.

...read moreread less

Abstract: Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model’s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.

...read moreread less

7 citations

Posted Content•

Improving Robotic Grasping on Monocular Images Via Multi-Task Learning and Positional Loss

[...]

William Prew¹, Toby P. Breckon¹, Magnus Bordewich¹, Ulrik Beierholm¹•Institutions (1)

Durham University¹

05 Nov 2020-arXiv: Robotics

TL;DR: Two methods of improving real-time object grasping performance from monocular colour images in an end-to-end CNN architecture are introduced including the addition of an auxiliary task during model training (multi-task learning) and introducing a positional loss function that emphasises loss per pixel for secondary parameters only on points of an object where a successful grasp can take place.

...read moreread less

Abstract: In this paper, we introduce two methods of improving real-time object grasping performance from monocular colour images in an end-to-end CNN architecture. The first is the addition of an auxiliary task during model training (multi-task learning). Our multi-task CNN model improves grasping performance from a baseline average of 72.04% to 78.14% on the large Jacquard grasping dataset when performing a supplementary depth reconstruction task. The second is introducing a positional loss function that emphasises loss per pixel for secondary parameters (gripper angle and width) only on points of an object where a successful grasp can take place. This increases performance from a baseline average of 72.04% to 78.92% as well as reducing the number of training epochs required. These methods can be also performed in tandem resulting in a further performance increase to 79.12% while maintaining sufficient inference speed to afford real-time grasp processing.

...read moreread less

7 citations

Proceedings Article•DOI•

Long-Span Summarization via Local Attention and Content Selection

[...]

Potsawee Manakul¹, Mark J. F. Gales¹•Institutions (1)

University of Cambridge¹

01 Aug 2021

TL;DR: The authors exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention and explicit content selection, which achieves state-of-the-art results on all three tasks in the ROUGE scores.

...read moreread less

Abstract: Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization. Typically these systems are trained by fine-tuning a large pre-trained model to the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention; and explicit content selection. These approaches are compared on a range of network configurations. Experiments are carried out on standard long-span summarization tasks, including Spotify Podcast, arXiv, and PubMed datasets. We demonstrate that by combining these methods, we can achieve state-of-the-art results on all three tasks in the ROUGE scores. Moreover, without a large-scale GPU card, our approach can achieve comparable or better results than existing approaches.

...read moreread less

7 citations