Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

Pay Attention to MLPs

[...]

Hanxiao Liu¹, Zihang Dai¹, David R. So¹, Quoc V. Le¹•Institutions (1)

Google¹

17 May 2021-arXiv: Learning

TL;DR: The authors proposed a simple network architecture, gMLP, based on MLPs with gating, and showed that it can perform as well as Transformers in key language and vision applications.

...read moreread less

Abstract: Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.

...read moreread less

35 citations

Proceedings Article•DOI•

Cascade Reasoning Network for Text-based Visual Question Answering

[...]

Fen Liu¹, Guanghui Xu¹, Qi Wu², Qing Du, Wei Jia, Mingkui Tan¹ - Show less +2 more•Institutions (2)

South China University of Technology¹, University of Adelaide²

12 Oct 2020

TL;DR: A novel Cascade Reasoning Network (CRN) is proposed that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module that aims to explicitly model the connections and interactions between texts and visual concepts.

...read moreread less

Abstract: We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA) which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, the training will be very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion process. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.

...read moreread less

35 citations

Journal Article•DOI•

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

[...]

Prakhar Ganesh¹, Yao Chen¹, Xin Lou¹, Mohammad Ali Khan¹, Yin Yang², Hassan Sajjad³, Preslav Nakov³, Deming Chen⁴, Marianne Winslett⁴ - Show less +5 more•Institutions (4)

Agency for Science, Technology and Research¹, Khalifa University², Qatar Computing Research Institute³, University of Illinois at Urbana–Champaign⁴

24 Sep 2021-Transactions of the Association for Computational Linguistics

TL;DR: This article summarized the state of the art in model compression for BERT and clarified the current best practices for compressing large-scale Transformer models, and provided insights into the workings of various methods.

...read moreread less

Abstract: Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks. However, these models often have billions of parameters, and thus are too resource-hungry and computation-intensive to suit low-capability devices or applications with strict latency requirements. One potential remedy for this is model compression, which has attracted a lot of research attention. Here, we summarize the research in compressing Transformers, focusing on the especially popular BERT model. In particular, we survey the state of the art in compression for BERT, we clarify the current best practices for compressing large-scale Transformer models, and we provide insights into the workings of various methods. Our categorization and analysis also shed light on promising future research directions for achieving lightweight, accurate, and generic NLP models.

...read moreread less

35 citations

Proceedings Article•DOI•

Logic2Text: High-Fidelity Natural Language Generation from Logical Forms

[...]

Zhiyu Chen¹, Wenhu Chen¹, Hanwen Zha¹, Xiyou Zhou¹, Yunkai Zhang, Sairam Sundaresan², William Yang Wang¹ - Show less +3 more•Institutions (2)

University of California, Santa Barbara¹, Intel²

01 Apr 2020

TL;DR: This work forms high-fidelity NLG as generation from logical forms in order to obtain controllable and faithful generations, and presents a new large-scale dataset, Logic2Text, with 10,753 descriptions involving common logic types paired with the underlying logical forms.

...read moreread less

Abstract: Previous studies on Natural Language Generation (NLG) from structured data have primarily focused on surface-level descriptions of record sequences. However, for complex structured data, e.g., multi-row tables, it is often desirable for an NLG system to describe interesting facts from logical inferences across records. If only provided with the table, it is hard for existing models to produce controllable and high-fidelity logical generations. In this work, we formulate high-fidelity NLG as generation from logical forms in order to obtain controllable and faithful generations. We present a new large-scale dataset, Logic2Text, with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms show diversified graph structure of free schema, which pose great challenges on the model’s ability to understand the semantics. We experiment on (1) Fully-supervised training with the full datasets, and (2) Few-shot setting, provided with hundreds of paired examples; We compare several popular generation models and analyze their performances. We hope our dataset can encourage research towards building an advanced NLG system capable of natural, faithful, and human-like generation. The dataset and code is available at https://github.com/czyssrs/Logic2Text.

...read moreread less

35 citations

Posted Content•

Data Augmentation Approaches in Natural Language Processing: A Survey.

[...]

Bohan Li¹, Yutai Hou¹, Wanxiang Che¹•Institutions (1)

Harbin Institute of Technology¹

05 Oct 2021-arXiv: Computation and Language

TL;DR: Data augmentation (DA) as discussed by the authors alleviates data scarcity scenarios where deep learning techniques may fail by improving the diversity of training data, thereby helping the model to better generalize to unseen testing data.

...read moreread less

Abstract: As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It is widely applied in computer vision then introduced to natural language processing and achieves improvements in many tasks. One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data. In this survey, we frame DA methods into three categories based on the diversity of augmented data, including paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to the above categories. Further, we also introduce their applications in NLP tasks as well as the challenges.

...read moreread less

34 citations