Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
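
As an illustration of the text-to-text format described in the abstract, the following minimal Python sketch casts three different tasks as plain string-to-string pairs. The helper `to_text_to_text`, the `TextToTextExample` type, and the placeholder article text are hypothetical names chosen for this sketch. The task prefixes follow the style used in the paper but should be read as illustrative, not as the released preprocessing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    source: str  # input string fed to the model
    target: str  # output string the model is trained to produce


def to_text_to_text(task_prefix: str, text: str, target: str) -> TextToTextExample:
    """Prepend a task prefix so a single model can tell tasks apart from text alone."""
    return TextToTextExample(source=f"{task_prefix}: {text}", target=target)


# Translation, classification, and summarization all become (string, string) pairs.
examples = [
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    to_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
    to_text_to_text("summarize", "<long article text>", "<short summary>"),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because every task shares the same input and output type, the same model, loss, and decoding procedure can in principle be reused across tasks; even the classification label is emitted as text.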

Citations

Proceedings Article

Multilingual Translation via Grafting Pre-trained Language Models

Zewei Sun¹, Mingxuan Wang², Lei Li³

Nanjing University¹, Carnegie Mellon University², University of California, Santa Barbara³

11 Sep 2021

TL;DR: The authors propose Graformer, which grafts separately pre-trained (masked) language models for machine translation, using monolingual data for pre-training and parallel data for grafting training so that both types of data are fully exploited.

Abstract: Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder can be challenging in machine translation, for GPT-like models lack a cross-attention component that is needed in seq2seq decoders. In this paper, we propose Graformer to graft separately pre-trained (masked) language models for machine translation. With monolingual data for pre-training and parallel data for grafting training, we make maximal use of both types of data. Experiments on 60 directions show that our method achieves average improvements of 5.8 BLEU in x2en and 2.9 BLEU in en2x directions compared with the multilingual Transformer of the same size.

Posted Content

Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories

Wenlin Yao¹, Xiaoman Pan¹, Lifeng Jin¹, Jianshu Chen¹, Dian Yu¹, Dong Yu¹

Tencent¹

27 Oct 2021-arXiv: Computation and Language

TL;DR: The authors propose a gloss alignment algorithm that aligns definition sentences (glosses) with the same meaning across different sense inventories to collect rich lexical knowledge. They then use the aligned inventories to train a model that identifies semantic equivalence between a target word in context and one of its glosses, and this model transfers well to many WSD tasks.

Abstract: Word Sense Disambiguation (WSD) aims to automatically identify the exact meaning of a word according to its context. Existing supervised models struggle to make correct predictions on rare word senses due to limited training data and can only select the best definition sentence from one predefined word sense inventory (e.g., WordNet). To address the data sparsity problem and generalize the model to be independent of one predefined inventory, we propose a gloss alignment algorithm that can align definition sentences (glosses) with the same meaning from different sense inventories to collect rich lexical knowledge. We then train a model to identify semantic equivalence between a target word in context and one of its glosses using these aligned inventories, which exhibits strong transfer capability to many WSD tasks. Experiments on benchmark datasets show that the proposed method improves predictions on both frequent and rare word senses, outperforming prior work by 1.2% on the All-Words WSD Task and 4.3% on the Low-Shot WSD Task. Evaluation on the WiC Task also indicates that our method can better capture word meanings in context.

Posted Content

ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors

Huaishao Luo, Lei Ji¹, Yanyong Huang², Bin Wang³, Shenggong Ji⁴, Tianrui Li

Southwest Jiaotong University¹, Microsoft², Southwestern University of Finance and Economics³, Ocean University of China⁴

02 Dec 2021-arXiv: Computation and Language

TL;DR: This paper proposes ScaleVLAD, a fusion model that gathers multi-scale representations from text, video, and audio with shared Vectors of Locally Aggregated Descriptors to improve unaligned multimodal sentiment analysis.

Abstract: Fusion techniques are a key research topic in multimodal sentiment analysis. Recent attention-based fusion demonstrates advances over simple operation-based fusion. However, these fusion works adopt single-scale (i.e., token-level or utterance-level) unimodal representations. Such single-scale fusion is suboptimal because different modalities should be aligned at different granularities. This paper proposes a fusion model named ScaleVLAD to gather multi-scale representations from text, video, and audio with shared Vectors of Locally Aggregated Descriptors to improve unaligned multimodal sentiment analysis. These shared vectors can be regarded as shared topics to align different modalities. In addition, we propose a self-supervised shifted clustering loss to preserve the differentiation of fused features among samples. The backbones are three Transformer encoders corresponding to the three modalities, and the aggregated features generated by the fusion module are fed to a Transformer plus a fully connected layer to produce task predictions. Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.

Proceedings Article

TAG: Gradient Attack on Transformer-based Language Models

Jieren Deng¹, Yijue Wang¹, Ji Li², Chenghong Wang³, Chao Shang¹, Hang Liu⁴, Sanguthevar Rajasekaran¹, Caiwen Ding¹

University of Connecticut¹, Microsoft², Duke University³, Stevens Institute of Technology⁴

21 Sep 2021

TL;DR: This article proposed a gradient attack algorithm, TAG, to reconstruct the local training data of Transformer-based NLP models. TAG works well on more weight distributions when reconstructing training data and achieves 1.5x recover rate and 2.5x ROUGE-2 over prior methods.

Abstract: Although distributed learning has increasingly gained attention in terms of effectively utilizing local devices for data privacy enhancement, recent studies show that publicly shared gradients in the training process can reveal the private training data (gradient leakage) to a third party. We have, however, no systematic understanding of the gradient leakage mechanism on Transformer-based language models. In this paper, as the first attempt, we formulate the gradient attack problem on Transformer-based language models and propose a gradient attack algorithm, TAG, to reconstruct the local training data. Experimental results on Transformer, TinyBERT4, TinyBERT6, BERT_BASE, and BERT_LARGE using the GLUE benchmark show that, compared with DLG, TAG works well on more weight distributions in reconstructing training data and achieves 1.5x recover rate and 2.5x ROUGE-2 over prior methods without needing ground-truth labels. TAG can obtain up to 90% of the data by attacking gradients in the CoLA dataset. In addition, TAG is stronger than previous approaches on larger models, smaller dictionary size, and smaller input length. We hope the proposed TAG will shed some light on the privacy leakage problem in Transformer-based NLP models.

Proceedings Article

Zero-Shot Information Extraction as a Unified Text-to-Triple Translation

Chenguang Wang¹, Xiao Liu², Zui Chen³, Haoyun Hong³, Jie Tang³, Dawn Song¹

University of California, Berkeley¹, Beijing Institute of Technology², Tsinghua University³

01 Nov 2021

TL;DR: This article cast a suite of information extraction tasks into a text-to-triple translation framework, which enables a task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about the task.

Abstract: We cast a suite of information extraction tasks into a text-to-triple translation framework. Instead of solving each task relying on task-specific datasets and models, we formalize the task as a translation between task-specific input text and output triples. By taking the task-specific input, we enable a task-agnostic translation by leveraging the latent knowledge that a pre-trained language model has about the task. We further demonstrate that a simple pre-training task of predicting which relational information corresponds to which input text is an effective way to produce task-specific outputs. This enables the zero-shot transfer of our framework to downstream tasks. We study the zero-shot performance of this framework on open information extraction (OIE2016, NYT, WEB, PENN), relation classification (FewRel and TACRED), and factual probe (Google-RE and T-REx). The model transfers non-trivially to most tasks and is often competitive with a fully supervised method without the need for any task-specific training. For instance, we significantly outperform the F1 score of the supervised open information extraction without needing to use its training set.
