Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
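
The unified format is easy to see in code: every task, whether summarization, translation, or classification, is posed as plain text in and plain text out, distinguished only by a task prefix. A minimal sketch, assuming the Hugging Face transformers library and the publicly released t5-small checkpoint (neither is part of this page):

```python
# Minimal sketch of the text-to-text formulation: every task is expressed as
# "input text in, output text out", distinguished only by a task prefix.
# Assumes the Hugging Face `transformers` library and the public "t5-small"
# checkpoint; both are illustrative choices, not part of the original page.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Summarization, translation, and classification all share one interface.
prompts = [
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    "translate English to German: The house is wonderful.",
    "cola sentence: The course is jumping well.",  # acceptability judgment
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same generate call serves every task; only the prefix and the expected output string change, which is what lets the paper compare objectives and architectures under a single framework.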


Citations
Proceedings ArticleDOI
28 Jan 2021
TL;DR: VX2TEXT is a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio, in which each modality is first converted into a set of language embeddings by a learnable tokenizer.
Abstract: We present VX2TEXT, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability of tokenization on continuous inputs (e.g., video or audio), we utilize a relaxation scheme that enables end-to-end training. Furthermore, unlike prior encoder-only models, our network includes an autoregressive decoder to generate open-ended text from the multimodal embeddings fused by the language encoder. This renders our approach fully generative and makes it directly applicable to different "video+x to text" problems without the need to design specialized network heads for each task. The proposed framework is not only conceptually simple but also remarkably effective: experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks: captioning, question answering, and audio-visual scene-aware dialog.
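
The non-differentiability issue the abstract mentions is the crux of end-to-end training. The sketch below illustrates one standard relaxation of discrete token selection (a Gumbel-softmax over a language vocabulary); the exact scheme used in VX2TEXT may differ, and all module names here are illustrative.

```python
# Illustrative sketch of a "learnable tokenizer" for a continuous modality:
# continuous features (e.g., video or audio segments) are mapped to a
# distribution over a language vocabulary, and a Gumbel-softmax relaxation
# keeps the selection differentiable so the whole model trains end to end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelaxedTokenizer(nn.Module):
    def __init__(self, feat_dim: int, vocab_size: int, embed_dim: int):
        super().__init__()
        self.to_logits = nn.Linear(feat_dim, vocab_size)      # modality -> vocab scores
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # shared language embeddings

    def forward(self, feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # feats: (batch, num_segments, feat_dim)
        logits = self.to_logits(feats)
        # Soft one-hot selection; gradients flow through the relaxation.
        soft_one_hot = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
        # Weighted mix of language embeddings = "language tokens" ready for fusion.
        return soft_one_hot @ self.embedding.weight

tokens = RelaxedTokenizer(feat_dim=512, vocab_size=30522, embed_dim=768)(
    torch.randn(2, 8, 512)
)
print(tokens.shape)  # torch.Size([2, 8, 768])
```

The resulting vectors live in the same embedding space as ordinary text tokens, which is what allows fusion to happen in the language space rather than through ad-hoc cross-modal modules.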

46 citations

Journal ArticleDOI
TL;DR: This article summarizes and examines the current state-of-the-art (SOTA) NLP models employed across numerous NLP tasks, focusing on their performance and efficiency.
Abstract: In recent years, Natural Language Processing (NLP) models have achieved phenomenal success in linguistic and semantic tasks like text classification, machine translation, cognitive dialogue systems, information retrieval via Natural Language Understanding (NLU), and Natural Language Generation (NLG). This feat is primarily attributed to the seminal Transformer architecture, leading to designs such as BERT, GPT (I, II, III), etc. Although these large models have achieved unprecedented performances, they come at high computational costs. Consequently, some recent NLP architectures have utilized concepts of transfer learning, pruning, quantization, and knowledge distillation to achieve moderate model sizes while retaining nearly the same performance as their predecessors. Additionally, to mitigate the data size challenge raised by language models from a knowledge extraction perspective, Knowledge Retrievers have been built to extract explicit data documents from a large corpus of databases with greater efficiency and accuracy. Recent research has also focused on superior inference by providing efficient attention to longer input sequences. In this paper, we summarize and examine the current state-of-the-art (SOTA) NLP models that have been employed for numerous NLP tasks for optimal performance and efficiency. We provide a detailed understanding and functioning of the different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in NLP.
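
Of the compression techniques listed above, knowledge distillation is a representative example whose training objective is simple to state: a small student matches the teacher's temperature-softened output distribution in addition to the gold labels. A generic sketch, not tied to any specific model in the survey:

```python
# Generic knowledge-distillation loss: the student is trained against the
# teacher's softened output distribution (KL term) plus ordinary cross-entropy
# on the gold labels. Illustrative only; the surveyed models add their own
# refinements (layer matching, attention transfer, etc.).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)              # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randint(0, 10, (4,)))
```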

46 citations

Posted Content
TL;DR: This paper proposes SparTerm, a novel framework to directly learn sparse text representations in the full vocabulary space, aiming to improve the representation capacity of the bag-of-words (BoW) method for semantic-level matching while still keeping its advantages.
Abstract: Term-based sparse representations dominate first-stage text retrieval in industrial applications due to their advantages in efficiency, interpretability, and exact term matching. In this paper, we study the problem of transferring the deep knowledge of a pre-trained language model (PLM) to term-based sparse representations, aiming to improve the representation capacity of the bag-of-words (BoW) method for semantic-level matching while still keeping its advantages. Specifically, we propose a novel framework, SparTerm, to directly learn sparse text representations in the full vocabulary space. The proposed SparTerm comprises an importance predictor to predict the importance of each term in the vocabulary and a gating controller to control term activation. These two modules cooperatively ensure the sparsity and flexibility of the final text representation, unifying term weighting and expansion in the same framework. Evaluated on the MS MARCO dataset, SparTerm significantly outperforms traditional sparse methods and achieves state-of-the-art ranking performance among all PLM-based sparse models.
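
The two cooperating modules described above map naturally to a small amount of code: an importance predictor scoring every vocabulary term from contextual embeddings, and a gating controller deciding which terms stay active. A rough sketch under those assumptions; names and details are illustrative, not the authors' implementation.

```python
# Rough sketch of the SparTerm idea: contextual token representations are
# projected onto the full vocabulary (importance predictor) and multiplied by
# a gate (gating controller), yielding a sparse bag-of-words-style vector.
# Illustrative placeholders only, not the authors' code.
import torch
import torch.nn as nn

class SparseTermRepresentation(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.importance = nn.Linear(hidden_dim, vocab_size)  # per-term importance
        self.gate = nn.Linear(hidden_dim, vocab_size)        # which terms to activate

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from a pre-trained encoder.
        importance = torch.relu(self.importance(hidden_states)).max(dim=1).values
        gate = torch.sigmoid(self.gate(hidden_states)).max(dim=1).values
        # Hard threshold here is for illustration; a trained gate would be
        # learned with a sparsity objective.
        return importance * (gate > 0.5).float()

reps = SparseTermRepresentation(hidden_dim=768, vocab_size=30522)(
    torch.randn(2, 16, 768)
)
print((reps > 0).float().mean())  # fraction of activated vocabulary terms
```

Because the output vector is indexed by vocabulary terms, it can drop straight into an inverted-index retrieval pipeline, which is the efficiency and interpretability advantage the abstract refers to.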

46 citations

Proceedings ArticleDOI
06 Apr 2020
TL;DR: This work focuses on artifacts associated with the representation of given names (e.g., Donald), which, depending on the corpus, may be associated with specific entities, as indicated by next-token prediction (e.g., Trump), and suggests that additional pre-training on different corpora may mitigate this bias.
Abstract: Pre-trained language models (LMs) may perpetuate biases originating in their training corpus to downstream models. We focus on artifacts associated with the representation of given names (e.g., Donald), which, depending on the corpus, may be associated with specific entities, as indicated by next-token prediction (e.g., Trump). While helpful in some contexts, this grounding also happens in under-specified or inappropriate contexts. For example, endings generated for "Donald is a" substantially differ from those of other names, and often have more-than-average negative sentiment. We demonstrate the potential effect on downstream tasks with reading comprehension probes where name perturbation changes the model answers. As a silver lining, our experiments suggest that additional pre-training on different corpora may mitigate this bias.
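
The probing recipe in the abstract, generating endings for a template such as "Donald is a" with different names and comparing their sentiment, is straightforward to outline. A hedged sketch using off-the-shelf Hugging Face pipelines; the model choices are illustrative, not the paper's exact setup.

```python
# Sketch of the name-perturbation probe described above: generate endings for
# "<NAME> is a" with different given names and compare their sentiment.
# GPT-2 and the default sentiment pipeline are illustrative stand-ins, not
# the models used in the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
sentiment = pipeline("sentiment-analysis")

for name in ["Donald", "James", "Maria"]:
    prompt = f"{name} is a"
    endings = generator(prompt, max_new_tokens=15, num_return_sequences=5,
                        do_sample=True, pad_token_id=50256)
    scores = sentiment([e["generated_text"] for e in endings])
    negative = sum(s["label"] == "NEGATIVE" for s in scores) / len(scores)
    print(f"{name}: {negative:.0%} negative continuations")
```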

45 citations

Posted Content
TL;DR: This article proposes an entity knowledge model, Entities as Experts (EAE), that can access distinct memories of the entities mentioned in a piece of text, capturing declarative knowledge about entities directly in the learned parameters of a language model.
Abstract: We focus on the problem of capturing declarative knowledge about entities in the learned parameters of a language model. We introduce a new model - Entities as Experts (EAE) - that can access distinct memories of the entities mentioned in a piece of text. Unlike previous efforts to integrate entity knowledge into sequence models, EAE's entity representations are learned directly from text. We show that EAE's learned representations capture sufficient knowledge to answer TriviaQA questions such as "Which Dr. Who villain has been played by Roger Delgado, Anthony Ainley, Eric Roberts?", outperforming an encoder-generator Transformer model with 10x the parameters. According to the LAMA knowledge probes, EAE contains more factual knowledge than a similarly sized BERT, as well as previous approaches that integrate external sources of entity knowledge. Because EAE associates parameters with specific entities, it only needs to access a fraction of its parameters at inference time, and we show that the correct identification and representation of entities is essential to EAE's performance.
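
The mechanism described above hinges on an entity memory: mention representations retrieve dedicated entity embeddings, so only a small fraction of the model's parameters is touched for any given input. A simplified sketch of that lookup-and-integrate step; the structure and names are placeholders, not the authors' architecture.

```python
# Simplified sketch of an "entities as experts" style memory: mention-span
# representations query a large entity-embedding table, and the retrieved
# entity vectors are projected back as updates to the token representations.
# Only the rows for the retrieved entities are accessed at inference time.
# Placeholder structure, not the authors' architecture.
import torch
import torch.nn as nn

class EntityMemory(nn.Module):
    def __init__(self, hidden_dim: int, num_entities: int, entity_dim: int):
        super().__init__()
        self.entity_embeddings = nn.Embedding(num_entities, entity_dim)
        self.query_proj = nn.Linear(hidden_dim, entity_dim)
        self.output_proj = nn.Linear(entity_dim, hidden_dim)

    def forward(self, mention_states: torch.Tensor, top_k: int = 1) -> torch.Tensor:
        # mention_states: (num_mentions, hidden_dim) pooled over each mention span.
        queries = self.query_proj(mention_states)             # (M, entity_dim)
        scores = queries @ self.entity_embeddings.weight.t()  # (M, num_entities)
        top_ids = scores.topk(top_k, dim=-1).indices          # (M, top_k)
        retrieved = self.entity_embeddings(top_ids).mean(dim=1)
        # Project back as a residual update for the mention tokens.
        return self.output_proj(retrieved)

memory = EntityMemory(hidden_dim=768, num_entities=100_000, entity_dim=256)
update = memory(torch.randn(3, 768))
print(update.shape)  # torch.Size([3, 768])
```

Tying parameters to specific entities is also why, as the abstract notes, correct identification of the mentioned entities is essential to the model's performance.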

45 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.