Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
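To make the text-to-text framing concrete, here is a minimal sketch of how different tasks can be cast as (input string, target string) pairs. The task prefixes follow examples given in the paper; the helper function itself is a hypothetical illustration, not the released T5 code.

```python
# Illustrative sketch: casting different NLP tasks into a single text-to-text
# format, as the abstract describes. The task prefixes follow examples from
# the paper; this helper is hypothetical, not the authors' released code.

def to_text_to_text(task: str, **fields):
    """Return an (input_text, target_text) string pair for a given task."""
    if task == "translation":
        return (f"translate English to German: {fields['source']}",
                fields["target"])
    if task == "summarization":
        return (f"summarize: {fields['document']}", fields["summary"])
    if task == "cola":  # binary acceptability -> literal label words
        return (f"cola sentence: {fields['sentence']}",
                "acceptable" if fields["label"] else "unacceptable")
    if task == "stsb":  # regression -> score rounded to the nearest 0.2
        return (f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}",
                f"{round(fields['score'] * 5) / 5:.1f}")
    raise ValueError(f"unknown task: {task}")

# Every task becomes ordinary string-to-string prediction:
inp, tgt = to_text_to_text("cola", sentence="The book fell floor.", label=0)
print(inp)  # cola sentence: The book fell floor.
print(tgt)  # unacceptable
```

In this format, translation, classification, and even regression (scores rendered as rounded number strings) all reduce to the same string-to-string prediction problem, so one model, objective, and decoding procedure covers every task.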


Citations
Posted Content
TL;DR: This work proposes a deep reinforced query reformulation (DRQR) model that automatically generates new reformulations of a query, incorporating query performance prediction into the reward function to encourage the model to generate queries that achieve high retrieval performance.
Abstract: Query reformulations have long been a key mechanism to alleviate the vocabulary-mismatch problem in information retrieval, for example by expanding queries with related terms or by generating paraphrases of the queries. In this work, we propose a deep reinforced query reformulation (DRQR) model to automatically generate new reformulations of the query. To encourage the model to generate queries that achieve high performance on the retrieval task, we incorporate query performance prediction into our reward function. In addition, to evaluate the quality of the reformulated query in the context of information retrieval, we first train our DRQR model, then apply the retrieval ranking model to the obtained reformulated query. Experiments are conducted on the TREC 2020 Deep Learning track MSMARCO document ranking dataset. Our results show that our proposed model outperforms several query reformulation baselines on the retrieval task. In addition, improvements are also observed when combining our model with various retrieval techniques, such as query expansion and BERT.
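The distinctive design point in this abstract, folding query performance prediction (QPP) into the reinforcement-learning reward, can be sketched as follows. The function names and the convex-combination weighting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a reward blending a retrieval metric with a
# query-performance-prediction (QPP) score, as the DRQR abstract describes.
# `retrieval_metric` and `qpp_score` are assumed stand-ins, not the paper's API.

def drqr_reward(reformulated_query: str,
                relevant_doc_ids: set,
                retrieval_metric,  # e.g. computes NDCG@10 for a query
                qpp_score,         # pre-retrieval QPP predictor in [0, 1]
                alpha: float = 0.5) -> float:
    """Blend observed task performance with predicted query quality."""
    task_reward = retrieval_metric(reformulated_query, relevant_doc_ids)
    predicted_quality = qpp_score(reformulated_query)
    # Simple convex combination; the paper's exact reward shaping may differ.
    return alpha * task_reward + (1.0 - alpha) * predicted_quality
```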

12 citations

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This article proposes a meta-learning-augmented version of supervised learning whose objective directly optimizes for out-of-distribution generalization by sub-sampling existing training data, in an effort to inhibit models from memorizing their input.
Abstract: Natural language is compositional; the meaning of a sentence is a function of the meaning of its parts. This property allows humans to create and interpret novel sentences, generalizing robustly outside their prior experience. Neural networks have been shown to struggle with this kind of generalization, in particular performing poorly on tasks designed to assess compositional generalization (i.e. where training and testing distributions differ in ways that would be trivial for a compositional strategy to resolve). Their poor performance on these tasks may in part be due to the nature of supervised learning which assumes training and testing data to be drawn from the same distribution. We implement a meta-learning augmented version of supervised learning whose objective directly optimizes for out-of-distribution generalization. We construct pairs of tasks for meta-learning by sub-sampling existing training data. Each pair of tasks is constructed to contain relevant examples, as determined by a similarity metric, in an effort to inhibit models from memorizing their input. Experimental results on the COGS and SCAN datasets show that our similarity-driven meta-learning can improve generalization performance.
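A rough sketch of the task-construction step the abstract describes: meta-learning pairs are sub-sampled from the existing training data so that each meta-test (query) set contains examples similar, under some metric, to the meta-train (support) set. The similarity function here is a placeholder; the paper's metric and sampling details may differ.

```python
import random

# Illustrative sketch of similarity-driven meta-task construction, following
# the abstract: sub-sample training data into (support, query) pairs where the
# query set holds examples *similar* to the support set, discouraging
# memorization. `similarity` is an assumed placeholder, not the paper's metric.

def make_meta_task(train_data, similarity, support_size=32, query_size=32):
    """Build one (support, query) task pair from existing training examples."""
    support = random.sample(train_data, support_size)
    # Rank the remaining examples by their max similarity to the support set.
    pool = [ex for ex in train_data if ex not in support]
    scored = sorted(pool,
                    key=lambda ex: max(similarity(ex, s) for s in support),
                    reverse=True)
    query = scored[:query_size]  # most-similar examples form the meta-test set
    return support, query
```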

12 citations

Posted Content
TL;DR: This work uses gradient attribution to analyze how the output of an attention head depends on the input tokens, effectively extending the local attention-based analysis to account for the mixing of information throughout the transformer layers.
Abstract: We take a deep look into the behavior of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model's behavior, we show that attention distributions can nevertheless provide insights into the local behavior of attention heads. This way, we propose a distinction between local patterns revealed by attention and global patterns that refer back to the input, and analyze BERT from both angles. We use gradient attribution to analyze how the output of an attention head depends on the input tokens, effectively extending the local attention-based analysis to account for the mixing of information throughout the transformer layers. We find that there is a significant discrepancy between attention and attribution distributions, caused by the mixing of context inside the model. We quantify this discrepancy and observe that, interestingly, some patterns persist across all layers despite the mixing.
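A minimal PyTorch-flavored sketch of the gradient-attribution idea described above: back-propagate a scalar summary of one attention head's output to the input token embeddings, yielding one attribution score per token. `model` and `head_output_of` are assumed stand-ins; the paper's exact attribution method may differ.

```python
import torch

# Hypothetical sketch of gradient attribution for one attention head's output,
# in the spirit of the abstract: attribute the head's output back to the input
# token embeddings. `model` and `head_output_of` are assumed stand-ins, not
# the paper's released code.

def head_attribution(model, embeddings, head_output_of):
    """Per-token attribution scores for one head (|gradient x input| norm)."""
    embeddings = embeddings.detach().clone().requires_grad_(True)
    head_out = head_output_of(model, embeddings)  # shape: [seq_len, d_head]
    # Back-propagate a scalar summary of the head's output to the inputs.
    head_out.norm().backward()
    # Gradient-times-input, reduced to one score per input token.
    return (embeddings.grad * embeddings).sum(dim=-1).abs()
```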

12 citations

Journal ArticleDOI
01 Jan 2021
TL;DR: It is argued that combining discrete and continuous representations and their processing will be essential to build systems that exhibit a general form of intelligence.
Abstract: Discrete and continuous representations of content (e.g., of language or images) have interesting properties to be explored for the understanding of or reasoning with this content by machines. This position paper puts forward our opinion on the role of discrete and continuous representations and their processing in the deep learning field. Current neural network models compute continuous-valued data. Information is compressed into dense, distributed embeddings. In stark contrast, humans use discrete symbols in their communication with language. Such symbols represent a compressed version of the world that derives its meaning from shared contextual information. Additionally, human reasoning involves symbol manipulation at a cognitive level, which facilitates abstract reasoning, the composition of knowledge and understanding, generalization and efficient learning. Motivated by these insights, in this paper we argue that combining discrete and continuous representations and their processing will be essential to build systems that exhibit a general form of intelligence. We suggest and discuss several avenues that could improve current neural networks with the inclusion of discrete elements to combine the advantages of both types of representations.

12 citations

Posted Content
TL;DR: This paper devise a novel FOL-based reasoner, called Braid, that supports probabilistic rules, and uses the notion of custom unification functions and dynamic rule generation to overcome the brittle matching and knowledge-gap problem prevalent in traditional reasoners.
Abstract: Traditional symbolic reasoning engines, while attractive for their precision and explicability, have a few major drawbacks: the use of brittle inference procedures that rely on exact matching (unification) of logical terms, an inability to deal with uncertainty, and the need for a precompiled rule-base of knowledge (the "knowledge acquisition" problem). These issues are particularly severe for the Natural Language Understanding (NLU) task, where we often use implicit background knowledge to understand and reason about text, resort to fuzzy alignment of concepts and relations during reasoning, and constantly deal with ambiguity in representations. To address these issues, we devise a novel FOL-based reasoner, called Braid, that supports probabilistic rules, and uses the notion of custom unification functions and dynamic rule generation to overcome the brittle-matching and knowledge-gap problems prevalent in traditional reasoners. In this paper, we describe the reasoning algorithms used in Braid-BC (the backchaining component of Braid), and their implementation in a distributed task-based framework that builds proof/explanation graphs for an input query in a scalable manner. We use a simple QA example from a children's story to motivate Braid-BC's design and explain how the various components work together to produce a coherent logical explanation.
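The two central ideas in this abstract, backward chaining over probabilistic rules and replacing exact unification with a scored "soft" match, can be sketched roughly as follows. Everything here (the Rule class, soft_unify, the scoring scheme) is a simplified assumption for illustration, not Braid's actual algorithm.

```python
# Rough illustrative sketch of backchaining with *soft* unification, in the
# spirit of the Braid-BC description: rules carry confidences, and goal/head
# matching returns a score in [0, 1] instead of a hard match/fail. All names
# and the scoring scheme are simplified assumptions, not Braid's actual code.

from dataclasses import dataclass

@dataclass
class Rule:
    head: str          # e.g. "likes(X, Y)"
    body: list         # subgoals to prove
    confidence: float  # probabilistic rule weight

def backchain(goal, rules, facts, soft_unify, depth=3):
    """Return the best proof score for `goal` (0.0 if unprovable)."""
    if depth == 0:
        return 0.0
    # Facts may also match fuzzily, e.g. via embedding similarity.
    best = max((soft_unify(goal, f) * conf for f, conf in facts.items()),
               default=0.0)
    for rule in rules:
        match = soft_unify(goal, rule.head)
        if match <= 0.0:
            continue
        # Score of a rule application: match x rule weight x weakest subgoal.
        sub = min((backchain(g, rules, facts, soft_unify, depth - 1)
                   for g in rule.body), default=1.0)
        best = max(best, match * rule.confidence * sub)
    return best
```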

12 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not explicitly discuss the limitations of transfer learning with a unified text-to-text transformer.