Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
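As a rough illustration of what "converting every text-based problem into a text-to-text format" can mean in practice, the sketch below casts three kinds of labeled examples into (input text, target text) string pairs. The function name, the dictionary keys, and the exact prefix strings are hypothetical choices made for this example; the abstract does not specify them.

```python
from typing import Dict, Tuple


def to_text_to_text(task: str, example: Dict[str, str]) -> Tuple[str, str]:
    """Cast a labeled example into an (input text, target text) pair.

    The task prefixes and field names are illustrative assumptions,
    not the paper's released preprocessing code.
    """
    if task == "translation":
        return ("translate English to German: " + example["source"],
                example["target"])
    if task == "summarization":
        return ("summarize: " + example["document"], example["summary"])
    if task == "classification":
        # The class label is emitted as plain text, so one model with a
        # single text-generation objective can cover classification too.
        return ("classify: " + example["sentence"], example["label"])
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    pair = to_text_to_text(
        "classification",
        {"sentence": "The course is jumping well.", "label": "unacceptable"},
    )
    print(pair)
```

Because every task reduces to the same string-in, string-out interface, the same model, loss, and decoding procedure can in principle be reused across all of them, which is what makes the systematic comparison described in the abstract possible.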

Citations

Posted Content•

Tracking Turbulence Through Financial News During COVID-19.

Philip Hossu, Natalie Parde¹

University of Illinois at Chicago¹

09 Sep 2021-arXiv: Computation and Language

TL;DR: This article introduced a set of expert annotations of financial sentiment for articles from major American financial news publishers and used them to predict financial sentiment during the 2020 pandemic-motivated U.S. financial crash.

Abstract: Grave human toll notwithstanding, the COVID-19 pandemic created uniquely unstable conditions in financial markets. In this work we uncover and discuss relationships involving sentiment in financial publications during the 2020 pandemic-motivated U.S. financial crash. First, we introduce a set of expert annotations of financial sentiment for articles from major American financial news publishers. After an exploratory data analysis, we then describe a CNN-based architecture to address the task of predicting financial sentiment in this anomalous, tumultuous setting. Our best performing model achieves a maximum weighted F1 score of 0.746, establishing a strong performance benchmark. Using predictions from our top performing model, we close by conducting a statistical correlation study with real stock market data, finding interesting and strong relationships between financial news and the S&P 500 index, trading volume, market volatility, and different single-factor ETFs.

Posted Content•

EncT5: Fine-tuning T5 Encoder for Non-autoregressive Tasks

Frederick Liu, Siamak Shakeri, Hongkun Yu, Jing Li¹

Google¹

16 Oct 2021-arXiv: Computation and Language

TL;DR: This paper proposed EncT5, a way to efficiently fine-tune pre-trained encoder-decoder T5 models for classification and regression tasks by using only the encoder layers; with less than half of the parameters of T5, it performs similarly to T5 models on the GLUE benchmark.

Abstract: Encoder-decoder transformer architectures have become popular recently with the advent of T5 models. Given their generality, they are also favored over architectures like BERT for pre-training on the language modeling task when it comes to large-scale models, which could take months to train. While these models generalize to more tasks, it is not evident whether the encoder-decoder architecture is the most efficient for fine-tuning on classification and regression tasks given the pre-trained model. In this work, we study fine-tuning pre-trained encoder-decoder models such as T5. In particular, we propose EncT5 as a way to efficiently fine-tune pre-trained encoder-decoder T5 models for classification and regression tasks by using the encoder layers. Our experimental results show that EncT5, with less than half of the parameters of T5, performs similarly to T5 models on the GLUE benchmark. We believe our proposed approach can be easily applied to any pre-trained encoder-decoder model.

Posted Content•

Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering

Gangwoo Kim¹, Hyunjae Kim¹, Jungsoo Park¹, Jaewoo Kang¹

Korea University¹

22 Jun 2021-arXiv: Computation and Language

TL;DR: This article proposed a framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency), to enhance the abilities of QA models in comprehending conversational context.

Abstract: One of the main challenges in conversational question answering (CQA) is to resolve the conversational dependency, such as anaphora and ellipsis. However, existing approaches do not explicitly train QA models on how to resolve the dependency, and thus these models are limited in understanding human dialogues. In this paper, we propose a novel framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency) to enhance the abilities of QA models in comprehending conversational context. ExCorD first generates self-contained questions that can be understood without the conversation history, then trains a QA model with the pairs of original and self-contained questions using a consistency-based regularizer. In our experiments, we demonstrate that ExCorD significantly improves the QA models' performance by up to 1.2 F1 on QuAC, and 5.2 F1 on CANARD, while addressing the limitations of the existing approaches.

Posted Content•

Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision.

Si Sun¹, Yingzhuo Qian, Zhenghao Liu¹, Chenyan Xiong², Kaitao Zhang¹, Jie Bao¹, Zhiyuan Liu¹, Paul N. Bennett²

Tsinghua University¹, Microsoft²

29 Dec 2020-arXiv: Information Retrieval

TL;DR: MetaAdaptRank is a domain adaptive learning method that generalizes Neu-IR models from label-rich source domains to few-shot target domains by contrastively synthesizing a large number of weak supervision signals for target domains.

Abstract: The effectiveness of Neural Information Retrieval (Neu-IR) often depends on a large scale of in-domain relevance training signals, which are not always available in real-world ranking scenarios. To democratize the benefits of Neu-IR, this paper presents MetaAdaptRank, a domain adaptive learning method that generalizes Neu-IR models from label-rich source domains to few-shot target domains. Drawing on source-domain massive relevance supervision, MetaAdaptRank contrastively synthesizes a large number of weak supervision signals for target domains and meta-learns to reweight these synthetic "weak" data based on their benefits to the target-domain ranking accuracy of Neu-IR models. Experiments on three TREC benchmarks in the web, news, and biomedical domains show that MetaAdaptRank significantly improves the few-shot ranking accuracy of Neu-IR models. Further analyses indicate that MetaAdaptRank thrives from both its contrastive weak data synthesis and meta-reweighted data selection. The code and data of this paper can be obtained from this https URL.

Corrected CBOW Performs as well as Skip-gram

Ozan Irsoy, Adrian Benton, Karl Stratos

01 Nov 2021

TL;DR: This paper showed that after correcting a bug in the CBOW gradient update, one can learn CBOW word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks, while being many times faster to train.

Abstract: Mikolov et al. (2013a) observed that continuous bag-of-words (CBOW) word embeddings tend to underperform Skip-gram (SG) embeddings, and this finding has been reported in subsequent works. We find that these observations are driven not by fundamental differences in their training objectives, but more likely by faulty negative-sampling CBOW implementations in popular libraries such as the official implementation, word2vec.c, and Gensim. We show that after correcting a bug in the CBOW gradient update, one can learn CBOW word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks, while being many times faster to train.
