Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
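The "text-to-text" framing means that every task, classification included, is handled by feeding the model an input string and training it to emit an output string, with tasks distinguished only by a prefix. As a rough, hedged illustration of that interface (assuming the Hugging Face transformers library and the public t5-small checkpoint; the prefixes follow the conventions described in the paper), a single model can be queried for translation, summarization, and acceptability judgments:

```python
# Minimal sketch of the text-to-text interface, assuming the Hugging Face
# `transformers` library and the public `t5-small` checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is cast as "text in, text out" via a task prefix.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "cola sentence: The books was on the table.",  # grammatical acceptability
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```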


Citations
Proceedings ArticleDOI
26 Oct 2021
TL;DR: In this paper, Wang et al. presented a novel dataset, ECNum, for understanding the numerals in the transcript of earnings conference calls and proposed a simple but efficient method, Numeral-Aware Model (NAM), for enhancing the capacity of numeral understanding of neural network models.
Abstract: The volatility of a stock's price reflects the risk of the stock and influences the risk of an investor's portfolio. It is also a crucial part of pricing derivative securities. Researchers have paid attention to predicting stock volatility with different kinds of textual data. However, most of them focus on using word information only; few touch on capturing the numeral information in textual data, which provides fine-grained clues for financial document understanding. In this paper, we present a novel dataset, ECNum, for understanding the numerals in the transcripts of earnings conference calls. We propose a simple but efficient method, the Numeral-Aware Model (NAM), for enhancing the capacity of neural network models to understand numerals. We employ the distilled information in the stock volatility forecasting task and achieve the best performance compared with previous works in short-term scenarios.

3 citations

Proceedings ArticleDOI
04 May 2021
TL;DR: In this article, a Hybrid Graph Network (HGN) is proposed, which jointly generates feature representations for new triples (as complement to the existing edges in the KG), determines relevance of the triples to the reasoning context, and learns graph model parameters for encoding the relational information.
Abstract: Recently, neural-symbolic architectures have achieved success on commonsense reasoning by effectively encoding relational structures retrieved from external knowledge graphs (KGs), obtaining state-of-the-art results on tasks such as (commonsense) question answering and natural language inference. However, current neural-symbolic reasoning methods rely on high-quality, contextualized knowledge structures (i.e., fact triples) that can be retrieved at the pre-processing stage, and overlook challenges such as dealing with the incompleteness of a KG (low coverage), the limited expressiveness of its relations, and irrelevant retrieved facts in the reasoning context. In this paper, we present a novel neural-symbolic approach, named Hybrid Graph Network (HGN), which jointly generates feature representations for new triples (as a complement to the existing edges in the KG), determines the relevance of the triples to the reasoning context, and learns graph model parameters for encoding the relational information. Our method learns a compact graph structure (comprising both retrieved and generated edges) by filtering out edges that are unhelpful to the reasoning process. We show marked improvements on three commonsense reasoning benchmarks and demonstrate the superiority of the learned graph structures with user studies.
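The core mechanism described above, scoring retrieved and generated fact triples against the reasoning context and pruning the unhelpful ones before graph encoding, can be pictured with a small sketch. The code below is only an illustrative approximation under assumed names and dimensions (the EdgeScorer module, the 0.5 threshold, and the embedding sizes are hypothetical); it is not the HGN authors' implementation.

```python
# Illustrative sketch only: score candidate KG edges against the reasoning
# context and keep the relevant ones. Names and dimensions are hypothetical,
# not the actual HGN implementation.
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    def __init__(self, edge_dim: int, ctx_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(edge_dim + ctx_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, edge_feats: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # edge_feats: (num_edges, edge_dim); ctx: (ctx_dim,) encoded question/answer context
        ctx_rep = ctx.unsqueeze(0).expand(edge_feats.size(0), -1)
        scores = self.mlp(torch.cat([edge_feats, ctx_rep], dim=-1))
        return torch.sigmoid(scores).squeeze(-1)

scorer = EdgeScorer(edge_dim=100, ctx_dim=768)
edges = torch.randn(50, 100)        # retrieved + generated triple embeddings
context = torch.randn(768)          # encoded question/answer pair
relevance = scorer(edges, context)  # (50,) relevance scores in [0, 1]
kept = edges[relevance > 0.5]       # prune edges judged unhelpful to reasoning
```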

3 citations

Posted Content
TL;DR: This paper proposes a more direct and complementary solution that consists of applying a generic change to the architecture of transformer-based models to delay the attention between subparts of the input and allow more efficient management of computation.
Abstract: Open Domain Question Answering (ODQA) over a large-scale corpus of documents (e.g., Wikipedia) is a key challenge in computer science. Although transformer-based language models such as BERT have shown, on SQuAD, the ability to surpass humans at extracting answers from small passages of text, they suffer from their high complexity when faced with a much larger search space. The most common way to tackle this problem is to add a preliminary Information Retrieval step to heavily filter the corpus and keep only the relevant passages. In this paper, we propose a more direct and complementary solution, which consists of applying a generic change to the architecture of transformer-based models to delay the attention between subparts of the input and allow more efficient management of computation. The resulting variants are competitive with the original models on the extractive task and, in the ODQA setting, allow a significant speedup and even a performance improvement in many cases.
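The architectural change described above amounts to postponing attention between subparts of the input: each question-passage chunk is first encoded independently, and attention across chunks is applied only in later layers, so most of the computation scales with the number of chunks rather than with the square of the total input length. The sketch below illustrates that general idea with standard PyTorch encoder layers; the layer counts and dimensions are placeholder assumptions, not the paper's actual configuration.

```python
# Sketch of "delayed" attention: lower layers encode each (question, passage)
# chunk independently; upper layers attend over the concatenation.
# Illustrative approximation only, not the paper's exact architecture.
import torch
import torch.nn as nn

d_model, nhead = 256, 8
lower = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=4
)
upper = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2
)

def encode(chunks: torch.Tensor) -> torch.Tensor:
    # chunks: (num_chunks, chunk_len, d_model), each chunk already embedded
    independent = lower(chunks)                 # no attention across chunks here
    flat = independent.reshape(1, -1, d_model)  # concatenate chunks along the sequence
    return upper(flat)                          # full attention, applied late and once

chunks = torch.randn(10, 128, d_model)  # 10 passages of 128 tokens
joint = encode(chunks)                  # (1, 1280, d_model)
```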

3 citations

Proceedings Article
03 May 2021
TL;DR: In this paper, a contrastive regularization is introduced to capture the global relationship among all the data samples and a momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss.
Abstract: Data augmentation has been demonstrated to be an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging. In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% when applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training baselines (including in low-resource settings). Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework.
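The contrastive regularization with a momentum encoder and memory bank follows a familiar recipe: an augmented example should sit closer to the representation of its original (produced by a slowly updated momentum encoder) than to representations queued from earlier batches. The snippet below is a generic, MoCo-style sketch of that recipe, not CoDA's exact objective or hyperparameters.

```python
# Generic sketch of a contrastive loss with a momentum encoder and memory bank
# (MoCo-style); not CoDA's exact formulation or hyperparameters.
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, memory_bank, temperature=0.07):
    """q: (B, D) anchors from the online encoder (augmented inputs).
    k: (B, D) positives from the momentum encoder (original inputs).
    memory_bank: (K, D) negatives queued from earlier batches."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    bank = F.normalize(memory_bank, dim=-1)
    pos = (q * k).sum(dim=-1, keepdim=True)             # (B, 1) anchor-positive similarity
    neg = q @ bank.t()                                   # (B, K) anchor-negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(online_params, momentum_params, m=0.999):
    # The momentum encoder tracks an exponential moving average of the online encoder.
    for p_o, p_m in zip(online_params, momentum_params):
        p_m.data.mul_(m).add_(p_o.data, alpha=1 - m)
```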

3 citations

Proceedings ArticleDOI
01 Dec 2020
TL;DR: This paper examines the ability of large-scale pre-trained language models to distinguish commonsense from non-commonsense statements, and the utility of external resources that aim to supplement the world knowledge inherent in such language models, including commonsense knowledge graph embedding models, word concreteness ratings, and text-to-image generation models.
Abstract: In this paper, we present our submission for subtask A of the Common Sense Validation and Explanation (ComVE) shared task. We examine the ability of large-scale pre-trained language models to distinguish commonsense from non-commonsense statements. We also explore the utility of external resources that aim to supplement the world knowledge inherent in such language models, including commonsense knowledge graph embedding models, word concreteness ratings, and text-to-image generation models. We find that such resources provide insignificant gains to the performance of fine-tuned language models. We also provide a qualitative analysis of the limitations of the language model fine-tuned to this task.
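A hedged illustration of the basic idea is to use a language model's loss as a plausibility signal and flag the statement with the higher loss as the non-commonsense one. The zero-shot heuristic below (assuming the Hugging Face transformers library and the public gpt2 checkpoint) is only in the spirit of ComVE subtask A, not the authors' fine-tuned system.

```python
# Zero-shot heuristic sketch: the statement with higher LM loss (perplexity)
# is taken as the nonsensical one. Not the authors' fine-tuned setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

pair = ("He put a turkey into the fridge.", "He put an elephant into the fridge.")
nonsense = max(pair, key=lm_loss)  # higher loss -> less plausible under the LM
print(nonsense)
```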

3 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.