Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
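The text-to-text framing can be made concrete with the publicly released T5 checkpoints. The sketch below assumes the Hugging Face transformers library and the "t5-small" checkpoint (tooling choices not made in the paper itself); it shows how different tasks are expressed purely as input and output strings via task prefixes.

```python
# Minimal sketch of the text-to-text framing, assuming the Hugging Face
# "transformers" library and the released "t5-small" checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is phrased as plain text: a task prefix plus the input.
examples = [
    "translate English to German: The house is wonderful.",  # translation
    "summarize: state authorities dispatched emergency crews tuesday to survey the damage.",  # summarization
    "cola sentence: The course is jumping well.",             # acceptability classification
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    # The model always produces text, so classification labels and
    # translations come out of the same decoding step.
    output_ids = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because every task shares the same string-in, string-out interface, adding a new task amounts to choosing a prefix and a text serialization of its inputs and outputs.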


Citations
Posted Content
TL;DR: In this article, the authors present DIRTY, a technique for improving the quality of decompiler output that automatically generates meaningful variable names and types for decompiled code.
Abstract: A common tool used by security professionals for reverse-engineering binaries found in the wild is the decompiler. A decompiler attempts to reverse compilation, transforming a binary to a higher-level language such as C. High-level languages ease reasoning about programs by providing useful abstractions such as loops, typed variables, and comments, but these abstractions are lost during compilation. Decompilers are able to deterministically reconstruct structural properties of code, but comments, variable names, and custom variable types are technically impossible to recover. In this paper we present DIRTY (DecompIled variable ReTYper), a novel technique for improving the quality of decompiler output that automatically generates meaningful variable names and types. Empirical evaluation on a novel dataset of C code mined from GitHub shows that DIRTY outperforms prior work approaches by a sizable margin, recovering the original names written by developers 66.4% of the time and the original types 75.8% of the time.
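To make the setting concrete, the toy sketch below shows the kind of post-processing such a retyping model enables. The prediction table is a hard-coded stand-in for DIRTY's learned Transformer, whose interface the abstract does not describe; only the substitution step is illustrated.

```python
import re

# Decompiler output typically uses opaque placeholder identifiers.
decompiled = """
__int64 sub_4005D0(__int64 a1, int a2)
{
    int v3;
    for (v3 = 0; v3 < a2; ++v3)
        *(int *)(a1 + 4 * v3) += 1;
    return a1;
}
"""

# Stand-in for the model's predictions: placeholder identifier ->
# (recovered name, recovered type). In DIRTY these come from a learned
# model, not a hard-coded table.
predictions = {
    "a1": ("buffer", "int *"),
    "a2": ("length", "int"),
    "v3": ("i", "int"),
}

renamed = decompiled
for old, (new, new_type) in predictions.items():
    # Whole-word substitution so "a1" does not clobber "a10" etc.
    renamed = re.sub(rf"\b{old}\b", new, renamed)
    print(f"{old} -> {new_type} {new}")

print(renamed)
```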
Journal Article
TL;DR: This paper proposed a one-step approach that reduces dependence on labelled data by experimentally selecting the best combination of recurrent-based language models and semantic similarity measures for a new aspect category detection model.
Abstract: Aspect-based Sentiment Analysis (ABSA) aims to extract significant aspects of an item or product from reviews and predict the sentiment of each aspect. Previous similarity methods tend to extract aspect categories at the word level by incorporating Language Models (LM) into their models. A drawback of such LM-based models is their dependence on a large amount of labelled data from a specific domain to function well. This work proposes a one-step approach that addresses this labelled-data dependency by experimenting to find the best combination of recurrent-based LM architectures and semantic similarity measures for a new aspect category detection model. The proposed model addresses the drawbacks of previous aspect category detection models in an implicit manner. The datasets of this study, S1 and S2, are from the standard SemEval online competition. The proposed model outperforms previous baseline models in terms of the F1-score of aspect category detection, finding more relevant aspect categories with a more stable and robust model. The F1-score of the best model for aspect category detection is 79.03% in the restaurant domain for the S1 dataset; for the S2 dataset, the F1-score is 72.65% in the laptop domain and 75.11% in the restaurant domain.
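As a generic illustration of similarity-based aspect category detection (not the paper's exact architecture), the sketch below scores a review sentence against seed words for each category and ranks the categories. The TF-IDF vectors and seed words are illustrative stand-ins for the recurrent-LM embeddings and similarity measures the paper actually studies.

```python
# Generic illustration of similarity-based aspect category detection:
# score a sentence against seed words for each category. The paper
# replaces these TF-IDF vectors with recurrent-LM representations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = {
    "food":    "pizza pasta taste delicious menu dish flavour",
    "service": "waiter staff friendly rude slow attentive service",
    "price":   "cheap expensive price value overpriced bill",
}

sentence = "The waiter was friendly but the pasta was bland."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([sentence] + list(categories.values()))

# Similarity between the sentence (row 0) and each category description.
scores = cosine_similarity(matrix[0], matrix[1:])[0]
for (name, _), score in zip(categories.items(), scores):
    print(f"{name}: {score:.3f}")
```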
Proceedings Article
01 Nov 2021
TL;DR: This article proposed the SideControl framework for controlling the generation of Transformer-based pre-trained language models, which leverages a novel control attributes loss to incorporate useful control signals and is shown to perform well with very limited training samples.
Abstract: Transformer-based pre-trained language models boost the performance of open-domain dialogue systems. Prior works leverage Transformer-based pre-trained language models to generate texts with desired attributes in two general approaches: (1) gradient-based methods: updating all latent representations of pre-trained models with gradients from attribute models; (2) weighted-decoding methods: re-ranking beam candidates from pre-trained models with attribute functions. However, gradient-based methods lead to high computation cost and can easily get overfitted on small training sets, while weighted-decoding methods are inherently constrained by the low-variance high-bias pre-trained model. In this work, we propose a novel approach to control the generation of Transformer-based pre-trained language models: the SideControl framework, which leverages a novel control attributes loss to incorporate useful control signals, and is shown to perform well with very limited training samples. We evaluate our proposed method on two benchmark open-domain dialogue datasets, and results show that the SideControl framework has better controllability, higher generation quality and better sample-efficiency than existing gradient-based and weighted-decoding baselines.
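For orientation, the toy sketch below illustrates the weighted-decoding baseline described in this abstract: beam candidates from a pre-trained model are re-ranked by an attribute function. The candidates and the keyword-based scorer are invented stand-ins; SideControl itself instead trains a side network with a control-attributes loss, which is not shown here.

```python
# Toy weighted-decoding baseline: re-rank beam candidates from a
# pre-trained dialogue model with an attribute function.
def attribute_score(text, keywords):
    """Fraction of the desired control keywords that appear in the candidate."""
    tokens = set(text.lower().split())
    return len(tokens & keywords) / max(len(keywords), 1)

def weighted_decode(candidates, keywords, weight=2.0):
    """Combine the model's log-probability with the attribute score."""
    return max(candidates,
               key=lambda c: c[1] + weight * attribute_score(c[0], keywords))[0]

# (candidate text, model log-probability) pairs, here hard-coded.
beam = [
    ("i do not know .", -1.2),
    ("sure , let us talk about football tonight .", -2.0),
    ("that sounds great , see you later .", -1.5),
]
print(weighted_decode(beam, keywords={"football", "tonight"}))
```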
Posted Content
TL;DR: CARLS as discussed by the authors is a framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components (model trainers, knowledge makers and knowledge banks) to concertedly work together in an asynchronous fashion across hardware platforms.
Abstract: In this work, we propose CARLS, a novel framework for augmenting the capacity of existing deep learning frameworks by enabling multiple components (model trainers, knowledge makers and knowledge banks) to work together concertedly in an asynchronous fashion across hardware platforms. The proposed CARLS is particularly suitable for learning paradigms where model training benefits from additional knowledge inferred or discovered during training, such as node embeddings for graph neural networks or reliable pseudo labels from model predictions. We also describe three learning paradigms (semi-supervised learning, curriculum learning and multimodal learning) as examples that can be scaled up efficiently by CARLS. One version of CARLS has been open-sourced and is available for download at: this https URL
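The asynchronous pattern the abstract describes can be sketched in a few lines: a "knowledge maker" publishes freshly computed knowledge (here, fake pseudo-labels) into a shared "knowledge bank" while a "model trainer" reads whatever is currently available. This is a single-process toy with threads, not the CARLS API, which spans processes and hardware platforms.

```python
# Toy illustration of asynchronous trainer / knowledge-maker / knowledge-bank
# interaction. All names and timings here are illustrative.
import threading, time, random

knowledge_bank = {}          # shared store: example id -> pseudo label
bank_lock = threading.Lock()

def knowledge_maker(num_examples):
    for i in range(num_examples):
        time.sleep(0.01)                      # pretend inference takes time
        with bank_lock:
            knowledge_bank[i] = random.randint(0, 1)

def model_trainer(steps):
    for step in range(steps):
        time.sleep(0.02)                      # pretend a training step
        with bank_lock:
            available = len(knowledge_bank)
        print(f"step {step}: training with {available} pseudo-labelled examples")

maker = threading.Thread(target=knowledge_maker, args=(50,))
trainer = threading.Thread(target=model_trainer, args=(5,))
maker.start(); trainer.start()
maker.join(); trainer.join()
```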
Posted Content
TL;DR: The authors built a single encoder trained with the BERT objective on unlabeled text and the w2v-BERT objective on unlabeled speech, improving CoVoST 2 speech translation by around 1 BLEU over single-modality pre-trained models while staying close to SotA on LibriSpeech and SpeechStew ASR.
Abstract: Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech. To further align our model representations across modalities, we leverage alignment losses, specifically Translation Language Modeling (TLM) and Speech Text Matching (STM) that make use of supervised speech-text recognition data. We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks. On four GLUE tasks and text-normalization, we observe evidence of capacity limitations and interference between the two modalities, leading to degraded performance compared to an equivalent text-only model, while still being competitive with BERT. Through extensive empirical analysis we also demonstrate the importance of the choice of objective function for speech pre-training, and the beneficial effect of adding additional supervised signals on the quality of the learned representations.
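A deliberately simplified PyTorch sketch of the shared-encoder idea is shown below: text and speech are projected to a common width, pass through one shared Transformer encoder, and contribute separate objectives that are summed. The front-ends, dimensions, and placeholder losses are assumptions for illustration; the actual model uses BERT masked language modeling, w2v-BERT, and the TLM/STM alignment losses.

```python
# Simplified shared-encoder sketch on dummy data; the losses below are
# placeholders standing in for BERT-MLM and w2v-BERT objectives.
import torch
import torch.nn as nn

d_model, vocab = 256, 1000
shared_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
text_embed = nn.Embedding(vocab, d_model)      # text front-end
speech_proj = nn.Linear(80, d_model)           # speech front-end (80-dim filterbanks)
text_head = nn.Linear(d_model, vocab)          # token prediction head

# Dummy batch: 8 text sequences of 20 tokens, 8 utterances of 50 frames.
tokens = torch.randint(0, vocab, (8, 20))
frames = torch.randn(8, 50, 80)

text_repr = shared_encoder(text_embed(tokens))
speech_repr = shared_encoder(speech_proj(frames))

text_loss = nn.functional.cross_entropy(
    text_head(text_repr).reshape(-1, vocab), tokens.reshape(-1))
speech_loss = speech_repr.pow(2).mean()        # stand-in for a contrastive term

total_loss = text_loss + speech_loss           # both modalities train the shared encoder
total_loss.backward()
print(float(total_loss))
```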
Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.