Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
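The unified text-to-text framing described in the abstract can be illustrated with the Hugging Face `transformers` port of T5. The sketch below is illustrative only (model size, task prefixes, and generation settings are example choices), not the authors' released TensorFlow/Mesh codebase: every task is cast as "input text in, output text out" by prepending a task prefix.

```python
# Minimal sketch of the text-to-text framing using the Hugging Face port of T5.
# Each task is expressed as plain text with a task prefix; the same model and
# decoding procedure handle translation, summarization, and classification.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = {
    "translation":    "translate English to German: The house is wonderful.",
    "summarization":  "summarize: Transfer learning pre-trains a model on a "
                      "data-rich task before fine-tuning it on a downstream task.",
    "classification": "cola sentence: The books was on the table.",  # acceptability judgment as text
}

for task, text in examples.items():
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(task, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```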


Citations
Proceedings Article
01 Nov 2021
TL;DR: In this paper, a time-efficient sampling method is proposed to select the data that is most relevant to the primary task, which can only train on the most beneficial sub-datasets from the auxiliary tasks, achieving efficient multi-task auxiliary learning.
Abstract: Multi-task auxiliary learning utilizes a set of relevant auxiliary tasks to improve the performance of a primary task. A common usage is to manually select multiple auxiliary tasks for multi-task learning on all data, which raises two issues: (1) selecting beneficial auxiliary tasks for a primary task is nontrivial; (2) when the auxiliary datasets are large, training on all data becomes time-expensive and impractical. Therefore, this paper focuses on addressing these problems and proposes a time-efficient sampling method to select the data that is most relevant to the primary task. The proposed method allows us to only train on the most beneficial sub-datasets from the auxiliary tasks, achieving efficient multi-task auxiliary learning. The experiments on three benchmark datasets (RTE, MRPC, STS-B) show that our method significantly outperforms random sampling and ST-DNN. Also, by applying our method, the model can surpass fully-trained MT-DNN on RTE, MRPC, STS-B, using only 50%, 66%, and 1% of data, respectively.

2 citations
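The abstract above describes selecting the auxiliary-task data most relevant to the primary task, but not the scoring function itself. The following sketch therefore assumes, purely as a placeholder, that relevance is scored by cosine similarity between sentence embeddings of each auxiliary example and the primary-task training set; the paper's actual sampling method may differ.

```python
# Hypothetical sketch of relevance-based sub-sampling of auxiliary-task data.
# Assumption: relevance = cosine similarity between each auxiliary example's
# embedding and the centroid of the primary-task embeddings (not the paper's
# exact criterion, which the abstract does not specify).
import numpy as np

def select_auxiliary_subset(primary_embs: np.ndarray,
                            aux_embs: np.ndarray,
                            keep_fraction: float = 0.5) -> np.ndarray:
    """Return indices of the auxiliary examples most similar to the primary task."""
    centroid = primary_embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    aux_norm = aux_embs / np.linalg.norm(aux_embs, axis=1, keepdims=True)
    scores = aux_norm @ centroid                    # cosine similarity per auxiliary example
    k = max(1, int(keep_fraction * len(scores)))
    return np.argsort(-scores)[:k]                  # indices of the top-k most relevant examples

# Usage (hypothetical): embed RTE (primary) and an auxiliary dataset with any
# sentence encoder, then train multi-task only on aux_embs[selected].
# selected = select_auxiliary_subset(primary_embs, aux_embs, keep_fraction=0.5)
```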

Posted Content
TL;DR: This article proposed BANG, a new pretraining model to bridge the gap between AR and NAR generation by designing a novel model structure for large-scale pretraining, which can simultaneously support AR, NAR and semi-NAR generation to meet different requirements.
Abstract: In this paper, we propose BANG, a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generation can be uniformly regarded as to what extent previous tokens can be attended, and BANG bridges AR and NAR generation by designing a novel model structure for large-scale pretraining. The pretrained BANG model can simultaneously support AR, NAR and semi-NAR generation to meet different requirements. Experiments on question generation (SQuAD 1.1), summarization (XSum) and dialogue generation (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly as well as attaining comparable performance with strong AR pretrained models. Compared with the semi-NAR strong baselines, BANG achieves absolute improvements of 14.01 and 5.24 in the overall scores of SQuAD 1.1 and XSum, respectively. In addition, BANG achieves absolute improvements of 10.73, 6.39 and 5.90 in the overall scores of SQuAD, XSum and PersonaChat, respectively, compared with the strong NAR baselines.

2 citations
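The abstract frames AR and NAR generation as differing only in how many previously generated target tokens may be attended. The sketch below makes that spectrum concrete as a decoder self-attention mask parameterized by a visibility window; it illustrates the framing only and is not BANG's actual pretraining structure.

```python
# Illustrative sketch of the AR/NAR spectrum as a decoder self-attention mask.
# `visible_prev` controls how many previously generated tokens each position may
# attend to: seq_len -> fully autoregressive, 0 -> fully non-autoregressive,
# intermediate values -> semi-NAR. Not BANG's actual model structure.
import torch

def visibility_mask(seq_len: int, visible_prev: int) -> torch.Tensor:
    """Boolean mask: mask[i, j] is True if position i may attend to position j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    # Each position sees itself plus at most `visible_prev` earlier tokens.
    return (j <= i) & (j >= i - visible_prev)

print(visibility_mask(4, visible_prev=4).int())  # causal mask (AR)
print(visibility_mask(4, visible_prev=0).int())  # diagonal only (NAR)
print(visibility_mask(4, visible_prev=1).int())  # one previous token (semi-NAR)
```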

Posted Content
Tong Niu, Semih Yavuz, Yingbo Zhou, Nitish Shirish Keskar, Huan Wang, Caiming Xiong
TL;DR: The authors adopt a transfer learning approach and propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting, which achieves state-of-the-art performance on both the Quora Question Pair (QQP) and ParaNMT datasets and is robust to domain shift between the two datasets of distinct distributions.
Abstract: Paraphrase generation has benefited extensively from recent progress in the designing of training objectives and model architectures. However, previous explorations have largely focused on supervised methods, which require a large amount of labeled data that is costly to collect. To address this drawback, we adopt a transfer learning approach and propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting. Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking (DB). To enforce a surface form dissimilar from the input, whenever the language model emits a token contained in the source sequence, DB prevents the model from outputting the subsequent source token for the next generation step. We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair (QQP) and the ParaNMT datasets and is robust to domain shift between the two datasets of distinct distributions. We also demonstrate that our model transfers to paraphrasing in other languages without any additional finetuning.

2 citations
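The Dynamic Blocking rule stated in the abstract is simple enough to sketch at decode time: if the token just emitted appears in the source sequence, the token that follows it in the source is blocked at the next step. The code below is a simplified sketch of that rule only (details such as blocking probabilities or resetting the block list are omitted and may differ from the paper).

```python
# Simplified sketch of the Dynamic Blocking rule from the abstract: when the
# last emitted token occurs in the source, forbid the model from copying the
# source token that immediately follows it at the next decoding step.
import torch

def apply_dynamic_blocking(logits: torch.Tensor,
                           source_ids: list[int],
                           last_emitted: int) -> torch.Tensor:
    """Mask out source successors of the last emitted token in next-step logits."""
    blocked = logits.clone()
    for pos, tok in enumerate(source_ids[:-1]):
        if tok == last_emitted:
            blocked[source_ids[pos + 1]] = float("-inf")  # block copying the next source token
    return blocked

# Usage inside a greedy decode loop (hypothetical variable names):
# logits = model_step(prefix_ids)                           # [vocab_size] next-token logits
# logits = apply_dynamic_blocking(logits, source_ids, prefix_ids[-1])
# next_id = int(torch.argmax(logits))
```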

Journal ArticleDOI
TL;DR: SSL-Reg as mentioned in this paper is a data-dependent regularization approach based on self-supervised learning (SSL), which defines auxiliary tasks on input data without using any human-provided labels and learns data representations by solving these auxiliary tasks.
Abstract: Text classification is a widely studied problem and has broad applications. In many real-world problems, the number of texts for training classification models is limited, which renders these models prone to overfitting. To address this problem, we propose SSL-Reg, a data-dependent regularization approach based on self-supervised learning (SSL). SSL is an unsupervised learning approach which defines auxiliary tasks on input data without using any human-provided labels and learns data representations by solving these auxiliary tasks. In SSL-Reg, a supervised classification task and an unsupervised SSL task are performed simultaneously. The SSL task is unsupervised, which is defined purely on input texts without using any human-provided labels. Training a model using an SSL task can prevent the model from being overfitted to a limited number of class labels in the classification task. Experiments on 17 text classification datasets demonstrate the effectiveness of our proposed method. Code is available at https://github.com/UCSD-AI4H/SSReg

2 citations
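The SSL-Reg objective described in the abstract combines a supervised classification loss with an unsupervised self-supervised loss on the same input texts. The sketch below assumes masked language modeling as the SSL task and a single weighting coefficient; both are illustrative choices rather than the paper's exact configuration.

```python
# Minimal sketch of an SSL-regularized training objective: supervised
# classification loss plus an unsupervised masked-language-modeling loss
# computed on the same inputs. Weighting and SSL task are assumptions.
import torch
import torch.nn.functional as F

def ssl_reg_loss(cls_logits: torch.Tensor,   # [batch, num_classes]
                 labels: torch.Tensor,       # [batch]
                 mlm_logits: torch.Tensor,   # [batch, seq_len, vocab]
                 mlm_labels: torch.Tensor,   # [batch, seq_len], -100 marks unmasked positions
                 ssl_weight: float = 1.0) -> torch.Tensor:
    ce = F.cross_entropy(cls_logits, labels)                         # supervised task
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),  # SSL task on input text
                          mlm_labels.view(-1), ignore_index=-100)
    return ce + ssl_weight * mlm             # SSL term acts as a data-dependent regularizer
```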

Proceedings ArticleDOI
03 Sep 2021
TL;DR: In this article, the authors exploited the power of recent advances in pre-trained and transformer-based NLP models to perform text summarization over the COVID-19 Public Media Dataset.
Abstract: Amidst the grueling SARS-CoV-2 pandemic, which has affected the lives of people across the world, the accelerating growth in COVID-19-related news articles is making it difficult for the general public to stay up-to-date with all the information. News articles are a crucial medium to convey coronavirus-related information across the world to the public. Short summaries of news articles can assist the public in grasping the gist of an entire article without having to read it fully. With the evolution of Deep Learning in Natural Language Processing (NLP), we exploited the power of recent advances in pre-trained and transformer-based NLP models to perform text summarization over the COVID-19 Public Media Dataset. For this, we analyzed and compared the results of BERT, GPT-2, XLNet, BART, and T5. The first three models are among the most popular extractive summarization models and the last two are abstractive summarization models. We evaluated the results of our experiments using ROUGE scores (ROUGE-2 and ROUGE-L) and found that BERT, a transformer autoencoder, outperforms the other models under consideration in SARS-CoV-2 news summarization. Thus, we leveraged BERT in our web application “CoVShorts” to summarize COVID-19 articles. Further, we visually analyzed the dataset to depict the most used words in COVID-19 news articles using Word Cloud to validate the accuracy of the summarization task. CoVShorts will serve the public by helping them gain brief, concise, and to-the-point summaries quickly.

2 citations
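The evaluation loop described in the abstract, summarizing an article with a pre-trained model and scoring it with ROUGE-2 and ROUGE-L, can be sketched as follows. The model name, the `rouge-score` package, and the example texts are illustrative assumptions, not necessarily the exact tools or data the authors used.

```python
# Sketch of summarize-then-score evaluation with ROUGE-2 / ROUGE-L.
# Model choice and the example article/reference are illustrative only.
from transformers import pipeline
from rouge_score import rouge_scorer

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)

article = ("Health officials reported a sharp rise in new coronavirus cases this week, "
           "urging residents to wear masks indoors and get vaccinated as hospitals fill up.")
reference = "Officials urge masks and vaccination as coronavirus cases rise."

candidate = summarizer(article, max_length=40, min_length=10)[0]["summary_text"]
scores = scorer.score(reference, candidate)
print("ROUGE-2:", scores["rouge2"].fmeasure, "ROUGE-L:", scores["rougeL"].fmeasure)
```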

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.