Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
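The unified text-to-text framing described in the abstract can be illustrated with the Hugging Face `transformers` port of T5. The sketch below is illustrative only (model size, task prefixes, and generation settings are example choices), not the authors' released TensorFlow/Mesh codebase: every task is cast as "input text in, output text out" by prepending a task prefix.

```python
# Minimal sketch of the text-to-text framing using the Hugging Face port of T5.
# Each task is expressed as plain text with a task prefix; the same model and
# decoding procedure handle translation, summarization, and classification.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = {
    "translation":    "translate English to German: The house is wonderful.",
    "summarization":  "summarize: Transfer learning pre-trains a model on a "
                      "data-rich task before fine-tuning it on a downstream task.",
    "classification": "cola sentence: The books was on the table.",  # acceptability judgment as text
}

for task, text in examples.items():
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(task, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```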


Citations
Proceedings Article
01 Nov 2021
TL;DR: In this paper, a time-efficient sampling method is proposed to select the data that is most relevant to the primary task, which can only train on the most beneficial sub-datasets from the auxiliary tasks, achieving efficient multi-task auxiliary learning.
Abstract: Multi-task auxiliary learning utilizes a set of relevant auxiliary tasks to improve the performance of a primary task. A common usage is to manually select multiple auxiliary tasks for multi-task learning on all data, which raises two issues: (1) selecting beneficial auxiliary tasks for a primary task is nontrivial; (2) when the auxiliary datasets are large, training on all data becomes time-expensive and impractical. Therefore, this paper focuses on addressing these problems and proposes a time-efficient sampling method to select the data that is most relevant to the primary task. The proposed method allows us to only train on the most beneficial sub-datasets from the auxiliary tasks, achieving efficient multi-task auxiliary learning. The experiments on three benchmark datasets (RTE, MRPC, STS-B) show that our method significantly outperforms random sampling and ST-DNN. Also, by applying our method, the model can surpass fully-trained MT-DNN on RTE, MRPC, STS-B, using only 50%, 66%, and 1% of data, respectively.

2 citations
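The abstract above describes selecting the auxiliary-task data most relevant to the primary task, but not the scoring function itself. The following sketch therefore assumes, purely as a placeholder, that relevance is scored by cosine similarity between sentence embeddings of each auxiliary example and the primary-task training set; the paper's actual sampling method may differ.

```python
# Hypothetical sketch of relevance-based sub-sampling of auxiliary-task data.
# Assumption: relevance = cosine similarity between each auxiliary example's
# embedding and the centroid of the primary-task embeddings (not the paper's
# exact criterion, which the abstract does not specify).
import numpy as np

def select_auxiliary_subset(primary_embs: np.ndarray,
                            aux_embs: np.ndarray,
                            keep_fraction: float = 0.5) -> np.ndarray:
    """Return indices of the auxiliary examples most similar to the primary task."""
    centroid = primary_embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    aux_norm = aux_embs / np.linalg.norm(aux_embs, axis=1, keepdims=True)
    scores = aux_norm @ centroid                    # cosine similarity per auxiliary example
    k = max(1, int(keep_fraction * len(scores)))
    return np.argsort(-scores)[:k]                  # indices of the top-k most relevant examples

# Usage (hypothetical): embed RTE (primary) and an auxiliary dataset with any
# sentence encoder, then train multi-task only on aux_embs[selected].
# selected = select_auxiliary_subset(primary_embs, aux_embs, keep_fraction=0.5)
```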

Posted Content
TL;DR: This article proposed BANG, a new pretraining model to bridge the gap between AR and NAR generation by designing a novel model structure for large-scale pretraining, which can simultaneously support AR, NAR and semi-NAR generation to meet different requirements.
Abstract: In this paper, we propose BANG, a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generation can be uniformly regarded as to what extent previous tokens can be attended, and BANG bridges AR and NAR generation by designing a novel model structure for large-scale pretraining. The pretrained BANG model can simultaneously support AR, NAR and semi-NAR generation to meet different requirements. Experiments on question generation (SQuAD 1.1), summarization (XSum) and dialogue generation (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly as well as attaining comparable performance with strong AR pretrained models. Compared with the semi-NAR strong baselines, BANG achieves absolute improvements of 14.01 and 5.24 in the overall scores of SQuAD 1.1 and XSum, respectively. In addition, BANG achieves absolute improvements of 10.73, 6.39 and 5.90 in the overall scores of SQuAD, XSum and PersonaChat, respectively, compared with the strong NAR baselines.

2 citations
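The abstract frames AR and NAR generation as differing only in how many previously generated target tokens may be attended. The sketch below makes that spectrum concrete as a decoder self-attention mask parameterized by a visibility window; it illustrates the framing only and is not BANG's actual pretraining structure.

```python
# Illustrative sketch of the AR/NAR spectrum as a decoder self-attention mask.
# `visible_prev` controls how many previously generated tokens each position may
# attend to: seq_len -> fully autoregressive, 0 -> fully non-autoregressive,
# intermediate values -> semi-NAR. Not BANG's actual model structure.
import torch

def visibility_mask(seq_len: int, visible_prev: int) -> torch.Tensor:
    """Boolean mask: mask[i, j] is True if position i may attend to position j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    # Each position sees itself plus at most `visible_prev` earlier tokens.
    return (j <= i) & (j >= i - visible_prev)

print(visibility_mask(4, visible_prev=4).int())  # causal mask (AR)
print(visibility_mask(4, visible_prev=0).int())  # diagonal only (NAR)
print(visibility_mask(4, visible_prev=1).int())  # one previous token (semi-NAR)
```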

Posted Content
Tong Niu, Semih Yavuz, Yingbo Zhou, Nitish Shirish Keskar, Huan Wang, Caiming Xiong
TL;DR: The authors adopt a transfer learning approach and propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting, which achieves state-of-the-art performance on both the Quora Question Pair (QQP) and ParaNMT datasets and is robust to domain shift between the two datasets of distinct distributions.
Abstract: Paraphrase generation has benefited extensively from recent progress in the designing of training objectives and model architectures. However, previous explorations have largely focused on supervised methods, which require a large amount of labeled data that is costly to collect. To address this drawback, we adopt a transfer learning approach and propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting. Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking (DB). To enforce a surface form dissimilar from the input, whenever the language model emits a token contained in the source sequence, DB prevents the model from outputting the subsequent source token for the next generation step. We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair (QQP) and the ParaNMT datasets and is robust to domain shift between the two datasets of distinct distributions. We also demonstrate that our model transfers to paraphrasing in other languages without any additional finetuning.

2 citations
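The Dynamic Blocking rule stated in the abstract is simple enough to sketch at decode time: if the token just emitted appears in the source sequence, the token that follows it in the source is blocked at the next step. The code below is a simplified sketch of that rule only (details such as blocking probabilities or resetting the block list are omitted and may differ from the paper).

```python
# Simplified sketch of the Dynamic Blocking rule from the abstract: when the
# last emitted token occurs in the source, forbid the model from copying the
# source token that immediately follows it at the next decoding step.
import torch

def apply_dynamic_blocking(logits: torch.Tensor,
                           source_ids: list[int],
                           last_emitted: int) -> torch.Tensor:
    """Mask out source successors of the last emitted token in next-step logits."""
    blocked = logits.clone()
    for pos, tok in enumerate(source_ids[:-1]):
        if tok == last_emitted:
            blocked[source_ids[pos + 1]] = float("-inf")  # block copying the next source token
    return blocked

# Usage inside a greedy decode loop (hypothetical variable names):
# logits = model_step(prefix_ids)                           # [vocab_size] next-token logits
# logits = apply_dynamic_blocking(logits, source_ids, prefix_ids[-1])
# next_id = int(torch.argmax(logits))
```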

Journal ArticleDOI
TL;DR: SSL-Reg as mentioned in this paper is a data-dependent regularization approach based on self-supervised learning (SSL), which defines auxiliary tasks on input data without using any human-provided labels and learns data representations by solving these auxiliary tasks.
Abstract: Text classification is a widely studied problem and has broad applications. In many real-world problems, the number of texts for training classification models is limited, which renders these models prone to overfitting. To address this problem, we propose SSL-Reg, a data-dependent regularization approach based on self-supervised learning (SSL). SSL is an unsupervised learning approach which defines auxiliary tasks on input data without using any human-provided labels and learns data representations by solving these auxiliary tasks. In SSL-Reg, a supervised classification task and an unsupervised SSL task are performed simultaneously. The SSL task is unsupervised, which is defined purely on input texts without using any human-provided labels. Training a model using an SSL task can prevent the model from being overfitted to a limited number of class labels in the classification task. Experiments on 17 text classification datasets demonstrate the effectiveness of our proposed method. Code is available at https://github.com/UCSD-AI4H/SSReg

2 citations
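The SSL-Reg objective described in the abstract combines a supervised classification loss with an unsupervised self-supervised loss on the same input texts. The sketch below assumes masked language modeling as the SSL task and a single weighting coefficient; both are illustrative choices rather than the paper's exact configuration.

```python
# Minimal sketch of an SSL-regularized training objective: supervised
# classification loss plus an unsupervised masked-language-modeling loss
# computed on the same inputs. Weighting and SSL task are assumptions.
import torch
import torch.nn.functional as F

def ssl_reg_loss(cls_logits: torch.Tensor,   # [batch, num_classes]
                 labels: torch.Tensor,       # [batch]
                 mlm_logits: torch.Tensor,   # [batch, seq_len, vocab]
                 mlm_labels: torch.Tensor,   # [batch, seq_len], -100 marks unmasked positions
                 ssl_weight: float = 1.0) -> torch.Tensor:
    ce = F.cross_entropy(cls_logits, labels)                         # supervised task
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),  # SSL task on input text
                          mlm_labels.view(-1), ignore_index=-100)
    return ce + ssl_weight * mlm             # SSL term acts as a data-dependent regularizer
```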

Proceedings ArticleDOI
03 Sep 2021
TL;DR: In this article, the authors exploited the power of recent advances in pre-trained and transformer-based NLP models to perform text summarization over the COVID-19 Public Media Dataset.
Abstract: Amidst the grueling SARS-CoV-2 pandemic, which has affected the lives of people across the world, the accelerating growth in COVID-19-related news articles is making it difficult for the general public to stay up-to-date with all the information. News articles are a crucial medium to convey coronavirus-related information across the world to the public. Short summaries of news articles can assist the public in grasping the gist of an entire article without having to read it fully. With the evolution of Deep Learning in Natural Language Processing (NLP), we exploited the power of recent advances in pre-trained and transformer-based NLP models to perform text summarization over the COVID-19 Public Media Dataset. For this, we analyzed and compared the results of BERT, GPT-2, XLNet, BART, and T5. The first three models are among the most popular extractive summarization models and the last two are abstractive summarization models. We evaluated the results of our experiments using ROUGE scores (ROUGE-2 and ROUGE-L) and found that BERT, a transformer autoencoder, outperforms the other models under consideration in SARS-CoV-2 news summarization. Thus, we leveraged BERT in our web application “CoVShorts” to summarize COVID-19 articles. Further, we visually analyzed the dataset to depict the most used words in COVID-19 news articles using Word Cloud to validate the accuracy of the summarization task. CoVShorts will serve the public by helping them gain brief, concise, and to-the-point summaries quickly.

2 citations
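The evaluation loop described in the abstract, summarizing an article with a pre-trained model and scoring it with ROUGE-2 and ROUGE-L, can be sketched as follows. The model name, the `rouge-score` package, and the example texts are illustrative assumptions, not necessarily the exact tools or data the authors used.

```python
# Sketch of summarize-then-score evaluation with ROUGE-2 / ROUGE-L.
# Model choice and the example article/reference are illustrative only.
from transformers import pipeline
from rouge_score import rouge_scorer

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)

article = ("Health officials reported a sharp rise in new coronavirus cases this week, "
           "urging residents to wear masks indoors and get vaccinated as hospitals fill up.")
reference = "Officials urge masks and vaccination as coronavirus cases rise."

candidate = summarizer(article, max_length=40, min_length=10)[0]["summary_text"]
scores = scorer.score(reference, candidate)
print("ROUGE-2:", scores["rouge2"].fmeasure, "ROUGE-L:", scores["rougeL"].fmeasure)
```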

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.