Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
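To make the text-to-text framing concrete, here is a minimal sketch of how different tasks can be cast as (input string, target string) pairs. The task prefixes follow examples given in the paper; the helper function itself is a hypothetical illustration, not the released T5 code.

```python
# Illustrative sketch: casting different NLP tasks into a single text-to-text
# format, as the abstract describes. The task prefixes follow examples from
# the paper; this helper is hypothetical, not the authors' released code.

def to_text_to_text(task: str, **fields):
    """Return an (input_text, target_text) string pair for a given task."""
    if task == "translation":
        return (f"translate English to German: {fields['source']}",
                fields["target"])
    if task == "summarization":
        return (f"summarize: {fields['document']}", fields["summary"])
    if task == "cola":  # binary acceptability -> literal label words
        return (f"cola sentence: {fields['sentence']}",
                "acceptable" if fields["label"] else "unacceptable")
    if task == "stsb":  # regression -> score rounded to the nearest 0.2
        return (f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}",
                f"{round(fields['score'] * 5) / 5:.1f}")
    raise ValueError(f"unknown task: {task}")

# Every task becomes ordinary string-to-string prediction:
inp, tgt = to_text_to_text("cola", sentence="The book fell floor.", label=0)
print(inp)  # cola sentence: The book fell floor.
print(tgt)  # unacceptable
```

In this format, translation, classification, and even regression (scores rendered as rounded number strings) all reduce to the same string-to-string prediction problem, so one model, objective, and decoding procedure covers every task.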


Citations
Posted Content
TL;DR: This work proposes a deep reinforced query reformulation (DRQR) model that automatically generates new reformulations of a query, incorporating query performance prediction into the reward function to encourage the model to generate queries that achieve high retrieval performance.
Abstract: Query reformulations have long been a key mechanism to alleviate the vocabulary-mismatch problem in information retrieval, for example by expanding queries with related terms or by generating paraphrases of the queries. In this work, we propose a deep reinforced query reformulation (DRQR) model to automatically generate new reformulations of the query. To encourage the model to generate queries that achieve high performance on the retrieval task, we incorporate query performance prediction into our reward function. In addition, to evaluate the quality of the reformulated query in the context of information retrieval, we first train our DRQR model, then apply the retrieval ranking model to the obtained reformulated query. Experiments are conducted on the TREC 2020 Deep Learning track MSMARCO document ranking dataset. Our results show that our proposed model outperforms several query reformulation baselines on the retrieval task. In addition, improvements are also observed when combining our model with various retrieval techniques, such as query expansion and BERT.
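The distinctive design point in this abstract, folding query performance prediction (QPP) into the reinforcement-learning reward, can be sketched as follows. The function names and the convex-combination weighting are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a reward blending a retrieval metric with a
# query-performance-prediction (QPP) score, as the DRQR abstract describes.
# `retrieval_metric` and `qpp_score` are assumed stand-ins, not the paper's API.

def drqr_reward(reformulated_query: str,
                relevant_doc_ids: set,
                retrieval_metric,  # e.g. computes NDCG@10 for a query
                qpp_score,         # pre-retrieval QPP predictor in [0, 1]
                alpha: float = 0.5) -> float:
    """Blend observed task performance with predicted query quality."""
    task_reward = retrieval_metric(reformulated_query, relevant_doc_ids)
    predicted_quality = qpp_score(reformulated_query)
    # Simple convex combination; the paper's exact reward shaping may differ.
    return alpha * task_reward + (1.0 - alpha) * predicted_quality
```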

12 citations

Proceedings ArticleDOI
01 Aug 2021
TL;DR: This article proposes a meta-learning-augmented version of supervised learning whose objective directly optimizes for out-of-distribution generalization by sub-sampling existing training data, in an effort to inhibit models from memorizing their input.
Abstract: Natural language is compositional; the meaning of a sentence is a function of the meaning of its parts. This property allows humans to create and interpret novel sentences, generalizing robustly outside their prior experience. Neural networks have been shown to struggle with this kind of generalization, in particular performing poorly on tasks designed to assess compositional generalization (i.e. where training and testing distributions differ in ways that would be trivial for a compositional strategy to resolve). Their poor performance on these tasks may in part be due to the nature of supervised learning which assumes training and testing data to be drawn from the same distribution. We implement a meta-learning augmented version of supervised learning whose objective directly optimizes for out-of-distribution generalization. We construct pairs of tasks for meta-learning by sub-sampling existing training data. Each pair of tasks is constructed to contain relevant examples, as determined by a similarity metric, in an effort to inhibit models from memorizing their input. Experimental results on the COGS and SCAN datasets show that our similarity-driven meta-learning can improve generalization performance.
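A rough sketch of the task-construction step the abstract describes: meta-learning pairs are sub-sampled from the existing training data so that each meta-test (query) set contains examples similar, under some metric, to the meta-train (support) set. The similarity function here is a placeholder; the paper's metric and sampling details may differ.

```python
import random

# Illustrative sketch of similarity-driven meta-task construction, following
# the abstract: sub-sample training data into (support, query) pairs where the
# query set holds examples *similar* to the support set, discouraging
# memorization. `similarity` is an assumed placeholder, not the paper's metric.

def make_meta_task(train_data, similarity, support_size=32, query_size=32):
    """Build one (support, query) task pair from existing training examples."""
    support = random.sample(train_data, support_size)
    # Rank the remaining examples by their max similarity to the support set.
    pool = [ex for ex in train_data if ex not in support]
    scored = sorted(pool,
                    key=lambda ex: max(similarity(ex, s) for s in support),
                    reverse=True)
    query = scored[:query_size]  # most-similar examples form the meta-test set
    return support, query
```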

12 citations

Posted Content
TL;DR: This work uses gradient attribution to analyze how the output of an attention head depends on the input tokens, effectively extending the local attention-based analysis to account for the mixing of information throughout the transformer layers.
Abstract: We take a deep look into the behavior of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model's behavior, we show that attention distributions can nevertheless provide insights into the local behavior of attention heads. This way, we propose a distinction between local patterns revealed by attention and global patterns that refer back to the input, and analyze BERT from both angles. We use gradient attribution to analyze how the output of an attention head depends on the input tokens, effectively extending the local attention-based analysis to account for the mixing of information throughout the transformer layers. We find that there is a significant discrepancy between attention and attribution distributions, caused by the mixing of context inside the model. We quantify this discrepancy and observe that, interestingly, some patterns persist across all layers despite the mixing.
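A minimal PyTorch-flavored sketch of the gradient-attribution idea described above: back-propagate a scalar summary of one attention head's output to the input token embeddings, yielding one attribution score per token. `model` and `head_output_of` are assumed stand-ins; the paper's exact attribution method may differ.

```python
import torch

# Hypothetical sketch of gradient attribution for one attention head's output,
# in the spirit of the abstract: attribute the head's output back to the input
# token embeddings. `model` and `head_output_of` are assumed stand-ins, not
# the paper's released code.

def head_attribution(model, embeddings, head_output_of):
    """Per-token attribution scores for one head (|gradient x input| norm)."""
    embeddings = embeddings.detach().clone().requires_grad_(True)
    head_out = head_output_of(model, embeddings)  # shape: [seq_len, d_head]
    # Back-propagate a scalar summary of the head's output to the inputs.
    head_out.norm().backward()
    # Gradient-times-input, reduced to one score per input token.
    return (embeddings.grad * embeddings).sum(dim=-1).abs()
```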

12 citations

Journal ArticleDOI
01 Jan 2021
TL;DR: It is argued that combining discrete and continuous representations and their processing will be essential to build systems that exhibit a general form of intelligence.
Abstract: Discrete and continuous representations of content (e.g., of language or images) have interesting properties to be explored for the understanding of or reasoning with this content by machines. This position paper puts forward our opinion on the role of discrete and continuous representations and their processing in the deep learning field. Current neural network models compute continuous-valued data. Information is compressed into dense, distributed embeddings. In stark contrast, humans use discrete symbols in their communication with language. Such symbols represent a compressed version of the world that derives its meaning from shared contextual information. Additionally, human reasoning involves symbol manipulation at a cognitive level, which facilitates abstract reasoning, the composition of knowledge and understanding, generalization and efficient learning. Motivated by these insights, in this paper we argue that combining discrete and continuous representations and their processing will be essential to build systems that exhibit a general form of intelligence. We suggest and discuss several avenues that could improve current neural networks with the inclusion of discrete elements to combine the advantages of both types of representations.

12 citations

Posted Content
TL;DR: This paper devise a novel FOL-based reasoner, called Braid, that supports probabilistic rules, and uses the notion of custom unification functions and dynamic rule generation to overcome the brittle matching and knowledge-gap problem prevalent in traditional reasoners.
Abstract: Traditional symbolic reasoning engines, while attractive for their precision and explicability, have a few major drawbacks: the use of brittle inference procedures that rely on exact matching (unification) of logical terms, an inability to deal with uncertainty, and the need for a precompiled rule-base of knowledge (the "knowledge acquisition" problem). These issues are particularly severe for the Natural Language Understanding (NLU) task, where we often use implicit background knowledge to understand and reason about text, resort to fuzzy alignment of concepts and relations during reasoning, and constantly deal with ambiguity in representations. To address these issues, we devise a novel FOL-based reasoner, called Braid, that supports probabilistic rules, and uses the notion of custom unification functions and dynamic rule generation to overcome the brittle-matching and knowledge-gap problems prevalent in traditional reasoners. In this paper, we describe the reasoning algorithms used in Braid-BC (the backchaining component of Braid), and their implementation in a distributed task-based framework that builds proof/explanation graphs for an input query in a scalable manner. We use a simple QA example from a children's story to motivate Braid-BC's design and explain how the various components work together to produce a coherent logical explanation.
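The two central ideas in this abstract, backward chaining over probabilistic rules and replacing exact unification with a scored "soft" match, can be sketched roughly as follows. Everything here (the Rule class, soft_unify, the scoring scheme) is a simplified assumption for illustration, not Braid's actual algorithm.

```python
# Rough illustrative sketch of backchaining with *soft* unification, in the
# spirit of the Braid-BC description: rules carry confidences, and goal/head
# matching returns a score in [0, 1] instead of a hard match/fail. All names
# and the scoring scheme are simplified assumptions, not Braid's actual code.

from dataclasses import dataclass

@dataclass
class Rule:
    head: str          # e.g. "likes(X, Y)"
    body: list         # subgoals to prove
    confidence: float  # probabilistic rule weight

def backchain(goal, rules, facts, soft_unify, depth=3):
    """Return the best proof score for `goal` (0.0 if unprovable)."""
    if depth == 0:
        return 0.0
    # Facts may also match fuzzily, e.g. via embedding similarity.
    best = max((soft_unify(goal, f) * conf for f, conf in facts.items()),
               default=0.0)
    for rule in rules:
        match = soft_unify(goal, rule.head)
        if match <= 0.0:
            continue
        # Score of a rule application: match x rule weight x weakest subgoal.
        sub = min((backchain(g, rules, facts, soft_unify, depth - 1)
                   for g in rule.body), default=1.0)
        best = max(best, match * rule.confidence * sub)
    return best
```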

12 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not explicitly discuss the limitations of transfer learning with a unified text-to-text transformer.