Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
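The text-to-text framing can be made concrete with the publicly released T5 checkpoints. The sketch below assumes the Hugging Face transformers library and the "t5-small" checkpoint (tooling choices not made in the paper itself); it shows how different tasks are expressed purely as input and output strings via task prefixes.

```python
# Minimal sketch of the text-to-text framing, assuming the Hugging Face
# "transformers" library and the released "t5-small" checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is phrased as plain text: a task prefix plus the input.
examples = [
    "translate English to German: The house is wonderful.",  # translation
    "summarize: state authorities dispatched emergency crews tuesday to survey the damage.",  # summarization
    "cola sentence: The course is jumping well.",             # acceptability classification
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    # The model always produces text, so classification labels and
    # translations come out of the same decoding step.
    output_ids = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because every task shares the same string-in, string-out interface, adding a new task amounts to choosing a prefix and a text serialization of its inputs and outputs.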


Citations
Posted Content
TL;DR: In this article, the authors present DIRTY, a technique for improving the quality of decompiler output that automatically generates meaningful variable names and types for decompiled code.
Abstract: A common tool used by security professionals for reverse-engineering binaries found in the wild is the decompiler. A decompiler attempts to reverse compilation, transforming a binary to a higher-level language such as C. High-level languages ease reasoning about programs by providing useful abstractions such as loops, typed variables, and comments, but these abstractions are lost during compilation. Decompilers are able to deterministically reconstruct structural properties of code, but comments, variable names, and custom variable types are technically impossible to recover. In this paper we present DIRTY (DecompIled variable ReTYper), a novel technique for improving the quality of decompiler output that automatically generates meaningful variable names and types. Empirical evaluation on a novel dataset of C code mined from GitHub shows that DIRTY outperforms prior work approaches by a sizable margin, recovering the original names written by developers 66.4% of the time and the original types 75.8% of the time.
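To make the setting concrete, the toy sketch below shows the kind of post-processing such a retyping model enables. The prediction table is a hard-coded stand-in for DIRTY's learned Transformer, whose interface the abstract does not describe; only the substitution step is illustrated.

```python
import re

# Decompiler output typically uses opaque placeholder identifiers.
decompiled = """
__int64 sub_4005D0(__int64 a1, int a2)
{
    int v3;
    for (v3 = 0; v3 < a2; ++v3)
        *(int *)(a1 + 4 * v3) += 1;
    return a1;
}
"""

# Stand-in for the model's predictions: placeholder identifier ->
# (recovered name, recovered type). In DIRTY these come from a learned
# model, not a hard-coded table.
predictions = {
    "a1": ("buffer", "int *"),
    "a2": ("length", "int"),
    "v3": ("i", "int"),
}

renamed = decompiled
for old, (new, new_type) in predictions.items():
    # Whole-word substitution so "a1" does not clobber "a10" etc.
    renamed = re.sub(rf"\b{old}\b", new, renamed)
    print(f"{old} -> {new_type} {new}")

print(renamed)
```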
Journal Article
TL;DR: This paper proposed a one-step approach that reduces dependence on labelled data by experimentally selecting the best combination of recurrent-based language models and semantic similarity measures for a new aspect category detection model.
Abstract: Aspect-based Sentiment Analysis (ABSA) aims to extract significant aspects of an item or product from reviews and predict the sentiment of each aspect. Previous similarity methods tend to extract aspect categories at the word level by incorporating Language Models (LM) into their models. A drawback of such LM-based models is their dependence on a large amount of labelled data from a specific domain to function well. This work proposes a one-step approach that addresses this labelled-data dependency by experimenting to find the best combination of recurrent-based LM architectures and semantic similarity measures for a new aspect category detection model. The proposed model addresses the drawbacks of previous aspect category detection models in an implicit manner. The datasets of this study, S1 and S2, are from the standard SemEval online competition. The proposed model outperforms previous baseline models in terms of the F1-score of aspect category detection, finding more relevant aspect categories with a more stable and robust model. The F1-score of the best model for aspect category detection is 79.03% in the restaurant domain for the S1 dataset; for the S2 dataset, the F1-score is 72.65% in the laptop domain and 75.11% in the restaurant domain.
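As a generic illustration of similarity-based aspect category detection (not the paper's exact architecture), the sketch below scores a review sentence against seed words for each category and ranks the categories. The TF-IDF vectors and seed words are illustrative stand-ins for the recurrent-LM embeddings and similarity measures the paper actually studies.

```python
# Generic illustration of similarity-based aspect category detection:
# score a sentence against seed words for each category. The paper
# replaces these TF-IDF vectors with recurrent-LM representations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = {
    "food":    "pizza pasta taste delicious menu dish flavour",
    "service": "waiter staff friendly rude slow attentive service",
    "price":   "cheap expensive price value overpriced bill",
}

sentence = "The waiter was friendly but the pasta was bland."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([sentence] + list(categories.values()))

# Similarity between the sentence (row 0) and each category description.
scores = cosine_similarity(matrix[0], matrix[1:])[0]
for (name, _), score in zip(categories.items(), scores):
    print(f"{name}: {score:.3f}")
```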
Proceedings Article
01 Nov 2021
TL;DR: This article proposed the SideControl framework for controlling the generation of Transformer-based pre-trained language models, which leverages a novel control attributes loss to incorporate useful control signals and is shown to perform well with very limited training samples.
Abstract: Transformer-based pre-trained language models boost the performance of open-domain dialogue systems. Prior works leverage Transformer-based pre-trained language models to generate texts with desired attributes in two general approaches: (1) gradient-based methods: updating all latent representations of pre-trained models with gradients from attribute models; (2) weighted-decoding methods: re-ranking beam candidates from pre-trained models with attribute functions. However, gradient-based methods lead to high computation cost and can easily get overfitted on small training sets, while weighted-decoding methods are inherently constrained by the low-variance high-bias pre-trained model. In this work, we propose a novel approach to control the generation of Transformer-based pre-trained language models: the SideControl framework, which leverages a novel control attributes loss to incorporate useful control signals, and is shown to perform well with very limited training samples. We evaluate our proposed method on two benchmark open-domain dialogue datasets, and results show that the SideControl framework has better controllability, higher generation quality and better sample-efficiency than existing gradient-based and weighted-decoding baselines.
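For orientation, the toy sketch below illustrates the weighted-decoding baseline described in this abstract: beam candidates from a pre-trained model are re-ranked by an attribute function. The candidates and the keyword-based scorer are invented stand-ins; SideControl itself instead trains a side network with a control-attributes loss, which is not shown here.

```python
# Toy weighted-decoding baseline: re-rank beam candidates from a
# pre-trained dialogue model with an attribute function.
def attribute_score(text, keywords):
    """Fraction of the desired control keywords that appear in the candidate."""
    tokens = set(text.lower().split())
    return len(tokens & keywords) / max(len(keywords), 1)

def weighted_decode(candidates, keywords, weight=2.0):
    """Combine the model's log-probability with the attribute score."""
    return max(candidates,
               key=lambda c: c[1] + weight * attribute_score(c[0], keywords))[0]

# (candidate text, model log-probability) pairs, here hard-coded.
beam = [
    ("i do not know .", -1.2),
    ("sure , let us talk about football tonight .", -2.0),
    ("that sounds great , see you later .", -1.5),
]
print(weighted_decode(beam, keywords={"football", "tonight"}))
```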
Posted Content
TL;DR: CARLS as discussed by the authors is a framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components (model trainers, knowledge makers and knowledge banks) to concertedly work together in an asynchronous fashion across hardware platforms.
Abstract: In this work, we propose CARLS, a novel framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components (model trainers, knowledge makers and knowledge banks) to work together concertedly in an asynchronous fashion across hardware platforms. The proposed CARLS is particularly suitable for learning paradigms where model training benefits from additional knowledge inferred or discovered during training, such as node embeddings for graph neural networks or reliable pseudo labels from model predictions. We also describe three learning paradigms (semi-supervised learning, curriculum learning and multimodal learning) as examples that can be scaled up efficiently by CARLS. One version of CARLS has been open-sourced and is available for download at: this https URL
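The asynchronous pattern the abstract describes can be sketched in a few lines: a "knowledge maker" publishes freshly computed knowledge (here, fake pseudo-labels) into a shared "knowledge bank" while a "model trainer" reads whatever is currently available. This is a single-process toy with threads, not the CARLS API, which spans processes and hardware platforms.

```python
# Toy illustration of asynchronous trainer / knowledge-maker / knowledge-bank
# interaction. All names and timings here are illustrative.
import threading, time, random

knowledge_bank = {}          # shared store: example id -> pseudo label
bank_lock = threading.Lock()

def knowledge_maker(num_examples):
    for i in range(num_examples):
        time.sleep(0.01)                      # pretend inference takes time
        with bank_lock:
            knowledge_bank[i] = random.randint(0, 1)

def model_trainer(steps):
    for step in range(steps):
        time.sleep(0.02)                      # pretend a training step
        with bank_lock:
            available = len(knowledge_bank)
        print(f"step {step}: training with {available} pseudo-labelled examples")

maker = threading.Thread(target=knowledge_maker, args=(50,))
trainer = threading.Thread(target=model_trainer, args=(5,))
maker.start(); trainer.start()
maker.join(); trainer.join()
```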
Posted Content
TL;DR: The authors built a single encoder trained with the BERT objective on unlabeled text and the w2v-BERT objective on unlabeled speech, improving CoVoST 2 speech translation by around 1 BLEU over single-modality pre-trained models while staying close to SotA on LibriSpeech and SpeechStew ASR.
Abstract: Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks. On four GLUE tasks and text-normalization, we observe evidence of capacity limitations and interference between the two modalities, leading to degraded performance compared to an equivalent text-only model, while still being competitive with BERT. Through extensive empirical analysis we also demonstrate the importance of the choice of objective function for speech pre-training, and the beneficial effect of adding additional supervised signals on the quality of the learned representations.
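A deliberately simplified PyTorch sketch of the shared-encoder idea is shown below: text and speech are projected to a common width, pass through one shared Transformer encoder, and contribute separate objectives that are summed. The front-ends, dimensions, and placeholder losses are assumptions for illustration; the actual model uses BERT masked language modeling, w2v-BERT, and the TLM/STM alignment losses.

```python
# Simplified shared-encoder sketch on dummy data; the losses below are
# placeholders standing in for BERT-MLM and w2v-BERT objectives.
import torch
import torch.nn as nn

d_model, vocab = 256, 1000
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
text_embed = nn.Embedding(vocab, d_model)      # text front-end
speech_proj = nn.Linear(80, d_model)           # speech front-end (80-dim filterbanks)
text_head = nn.Linear(d_model, vocab)          # token prediction head

# Dummy batch: 8 text sequences of 20 tokens, 8 utterances of 50 frames.
tokens = torch.randint(0, vocab, (8, 20))
frames = torch.randn(8, 50, 80)

text_repr = shared_encoder(text_embed(tokens))
speech_repr = shared_encoder(speech_proj(frames))

text_loss = nn.functional.cross_entropy(
    text_head(text_repr).reshape(-1, vocab), tokens.reshape(-1))
speech_loss = speech_repr.pow(2).mean()        # stand-in for a contrastive term

total_loss = text_loss + speech_loss           # both modalities train the shared encoder
total_loss.backward()
print(float(total_loss))
```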
Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.