Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
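The "text-to-text" framing means that every task, classification included, is handled by feeding the model an input string and training it to emit an output string, with tasks distinguished only by a prefix. As a rough, hedged illustration of that interface (assuming the Hugging Face transformers library and the public t5-small checkpoint; the prefixes follow the conventions described in the paper), a single model can be queried for translation, summarization, and acceptability judgments:

```python
# Minimal sketch of the text-to-text interface, assuming the Hugging Face
# `transformers` library and the public `t5-small` checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is cast as "text in, text out" via a task prefix.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "cola sentence: The books was on the table.",  # grammatical acceptability
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```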


Citations
Proceedings ArticleDOI
26 Oct 2021
TL;DR: In this paper, Wang et al. presented a novel dataset, ECNum, for understanding the numerals in the transcript of earnings conference calls and proposed a simple but efficient method, Numeral-Aware Model (NAM), for enhancing the capacity of numeral understanding of neural network models.
Abstract: The volatility of a stock's price reflects the risk of the stock and influences the risk of an investor's portfolio. It is also a crucial part of pricing derivative securities. Researchers have paid attention to predicting stock volatility with different kinds of textual data. However, most of them focus on using word information only; few touch on capturing the numeral information in textual data, which provides fine-grained clues for financial document understanding. In this paper, we present a novel dataset, ECNum, for understanding the numerals in the transcripts of earnings conference calls. We propose a simple but efficient method, the Numeral-Aware Model (NAM), for enhancing the capacity of neural network models to understand numerals. We employ the distilled information in the stock volatility forecasting task and achieve the best performance compared with previous works in short-term scenarios.

3 citations

Proceedings ArticleDOI
04 May 2021
TL;DR: In this article, a Hybrid Graph Network (HGN) is proposed, which jointly generates feature representations for new triples (as complement to the existing edges in the KG), determines relevance of the triples to the reasoning context, and learns graph model parameters for encoding the relational information.
Abstract: Recently, neural-symbolic architectures have achieved success on commonsense reasoning by effectively encoding relational structures retrieved from external knowledge graphs (KGs), obtaining state-of-the-art results on tasks such as (commonsense) question answering and natural language inference. However, current neural-symbolic reasoning methods rely on high-quality, contextualized knowledge structures (i.e., fact triples) that can be retrieved at the pre-processing stage, and overlook challenges such as dealing with the incompleteness of a KG (low coverage), the limited expressiveness of its relations, and irrelevant retrieved facts in the reasoning context. In this paper, we present a novel neural-symbolic approach, named Hybrid Graph Network (HGN), which jointly generates feature representations for new triples (as a complement to the existing edges in the KG), determines the relevance of the triples to the reasoning context, and learns graph model parameters for encoding the relational information. Our method learns a compact graph structure (comprising both retrieved and generated edges) by filtering out edges that are unhelpful to the reasoning process. We show marked improvements on three commonsense reasoning benchmarks and demonstrate the superiority of the learned graph structures with user studies.
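The core mechanism described above, scoring retrieved and generated fact triples against the reasoning context and pruning the unhelpful ones before graph encoding, can be pictured with a small sketch. The code below is only an illustrative approximation under assumed names and dimensions (the EdgeScorer module, the 0.5 threshold, and the embedding sizes are hypothetical); it is not the HGN authors' implementation.

```python
# Illustrative sketch only: score candidate KG edges against the reasoning
# context and keep the relevant ones. Names and dimensions are hypothetical,
# not the actual HGN implementation.
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    def __init__(self, edge_dim: int, ctx_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(edge_dim + ctx_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, edge_feats: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # edge_feats: (num_edges, edge_dim); ctx: (ctx_dim,) encoded question/answer context
        ctx_rep = ctx.unsqueeze(0).expand(edge_feats.size(0), -1)
        scores = self.mlp(torch.cat([edge_feats, ctx_rep], dim=-1))
        return torch.sigmoid(scores).squeeze(-1)

scorer = EdgeScorer(edge_dim=100, ctx_dim=768)
edges = torch.randn(50, 100)        # retrieved + generated triple embeddings
context = torch.randn(768)          # encoded question/answer pair
relevance = scorer(edges, context)  # (50,) relevance scores in [0, 1]
kept = edges[relevance > 0.5]       # prune edges judged unhelpful to reasoning
```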

3 citations

Posted Content
TL;DR: This paper proposes a more direct and complementary solution that consists of applying a generic change to the architecture of transformer-based models to delay the attention between subparts of the input and allow more efficient management of computation.
Abstract: Open Domain Question Answering (ODQA) over a large-scale corpus of documents (e.g., Wikipedia) is a key challenge in computer science. Although transformer-based language models such as BERT have shown, on SQuAD, the ability to surpass humans at extracting answers from small passages of text, they suffer from their high complexity when faced with a much larger search space. The most common way to tackle this problem is to add a preliminary Information Retrieval step to heavily filter the corpus and keep only the relevant passages. In this paper, we propose a more direct and complementary solution, which consists of applying a generic change to the architecture of transformer-based models to delay the attention between subparts of the input and allow more efficient management of computation. The resulting variants are competitive with the original models on the extractive task and, in the ODQA setting, allow a significant speedup and even a performance improvement in many cases.
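The architectural change described above amounts to postponing attention between subparts of the input: each question-passage chunk is first encoded independently, and attention across chunks is applied only in later layers, so most of the computation scales with the number of chunks rather than with the square of the total input length. The sketch below illustrates that general idea with standard PyTorch encoder layers; the layer counts and dimensions are placeholder assumptions, not the paper's actual configuration.

```python
# Sketch of "delayed" attention: lower layers encode each (question, passage)
# chunk independently; upper layers attend over the concatenation.
# Illustrative approximation only, not the paper's exact architecture.
import torch
import torch.nn as nn

d_model, nhead = 256, 8
lower = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=4
)
upper = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2
)

def encode(chunks: torch.Tensor) -> torch.Tensor:
    # chunks: (num_chunks, chunk_len, d_model), each chunk already embedded
    independent = lower(chunks)                 # no attention across chunks here
    flat = independent.reshape(1, -1, d_model)  # concatenate chunks along the sequence
    return upper(flat)                          # full attention, applied late and once

chunks = torch.randn(10, 128, d_model)  # 10 passages of 128 tokens
joint = encode(chunks)                  # (1, 1280, d_model)
```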

3 citations

Proceedings Article
03 May 2021
TL;DR: In this paper, a contrastive regularization is introduced to capture the global relationship among all the data samples and a momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss.
Abstract: Data augmentation has been demonstrated to be an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging. In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% when applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training baselines (including in low-resource settings). Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework.
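The contrastive regularization with a momentum encoder and memory bank follows a familiar recipe: an augmented example should sit closer to the representation of its original (produced by a slowly updated momentum encoder) than to representations queued from earlier batches. The snippet below is a generic, MoCo-style sketch of that recipe, not CoDA's exact objective or hyperparameters.

```python
# Generic sketch of a contrastive loss with a momentum encoder and memory bank
# (MoCo-style); not CoDA's exact formulation or hyperparameters.
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, memory_bank, temperature=0.07):
    """q: (B, D) anchors from the online encoder (augmented inputs).
    k: (B, D) positives from the momentum encoder (original inputs).
    memory_bank: (K, D) negatives queued from earlier batches."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    bank = F.normalize(memory_bank, dim=-1)
    pos = (q * k).sum(dim=-1, keepdim=True)             # (B, 1) anchor-positive similarity
    neg = q @ bank.t()                                   # (B, K) anchor-negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(online_params, momentum_params, m=0.999):
    # The momentum encoder tracks an exponential moving average of the online encoder.
    for p_o, p_m in zip(online_params, momentum_params):
        p_m.data.mul_(m).add_(p_o.data, alpha=1 - m)
```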

3 citations

Proceedings ArticleDOI
01 Dec 2020
TL;DR: This paper examines the ability of large-scale pre-trained language models to distinguish commonsense from non-commonsense statements, and the utility of external resources that aim to supplement the world knowledge inherent in such language models, including commonsense knowledge graph embedding models, word concreteness ratings, and text-to-image generation models.
Abstract: In this paper, we present our submission for subtask A of the Common Sense Validation and Explanation (ComVE) shared task. We examine the ability of large-scale pre-trained language models to distinguish commonsense from non-commonsense statements. We also explore the utility of external resources that aim to supplement the world knowledge inherent in such language models, including commonsense knowledge graph embedding models, word concreteness ratings, and text-to-image generation models. We find that such resources provide insignificant gains to the performance of fine-tuned language models. We also provide a qualitative analysis of the limitations of the language model fine-tuned to this task.
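A hedged illustration of the basic idea is to use a language model's loss as a plausibility signal and flag the statement with the higher loss as the non-commonsense one. The zero-shot heuristic below (assuming the Hugging Face transformers library and the public gpt2 checkpoint) is only in the spirit of ComVE subtask A, not the authors' fine-tuned system.

```python
# Zero-shot heuristic sketch: the statement with higher LM loss (perplexity)
# is taken as the nonsensical one. Not the authors' fine-tuned setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

pair = ("He put a turkey into the fridge.", "He put an elephant into the fridge.")
nonsense = max(pair, key=lm_loss)  # higher loss -> less plausible under the LM
print(nonsense)
```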

3 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.