Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
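
As a minimal sketch of what the text-to-text format described in the abstract could look like in practice, the Python snippet below casts a translation example and a classification example as (input text, target text) string pairs. The helper name `to_text_to_text`, the field layout, and the exact prefix strings are assumptions made for illustration; they are not taken from the released code.

```python
from typing import List, Tuple


def to_text_to_text(task_prefix: str,
                    inputs: List[Tuple[str, str]],
                    target: object) -> Tuple[str, str]:
    """Cast one example as an (input text, target text) pair.

    `inputs` is a list of (field name, field text) pairs; an empty field
    name means the text is appended without a label. This layout is a
    hypothetical convention chosen for this sketch.
    """
    parts = [f"{name}: {text}" if name else text for name, text in inputs]
    source = f"{task_prefix}: " + " ".join(parts)
    # Every target, including a class label, is emitted as plain text.
    return source, str(target)


# Translation: the target is the translated sentence.
translation = to_text_to_text(
    "translate English to German", [("", "That is good.")], "Das ist gut."
)

# Classification: the target is the label name written out as a string.
classification = to_text_to_text(
    "mnli",
    [("hypothesis", "The man is asleep."), ("premise", "The man is running.")],
    "contradiction",
)

print(translation)
print(classification)
```

Because every task reduces to string pairs of this kind, the same model, loss, and decoding procedure can in principle be reused across tasks, which is the point of the unified framework.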

Citations

Posted Content•

Make Lead Bias in Your Favor: Zero-shot Abstractive News Summarization

Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, Xuedong Huang

25 Dec 2019-arXiv: Computation and Language

TL;DR: This work proposes a self-supervised method for pre-training abstractive news summarization models on large-scale unlabeled news corpora, and shows that this approach can dramatically improve summarization quality and achieve state-of-the-art results for zero-shot news summarization without any fine-tuning.

Abstract: Lead bias is a common phenomenon in news summarization, where early parts of an article often contain the most salient information. While many algorithms exploit this fact in summary generation, it has a detrimental effect on teaching the model to discriminate and extract important information in general. We propose that the lead bias can be leveraged in our favor in a simple and effective way to pre-train abstractive news summarization models on large-scale unlabeled news corpora: predicting the leading sentences using the rest of an article. We collect a massive news corpus and conduct data cleaning and filtering via statistical analysis. We then apply the proposed self-supervised pre-training to existing generation models BART and T5 for domain adaptation. Via extensive experiments on six benchmark datasets, we show that this approach can dramatically improve the summarization quality and achieve state-of-the-art results for zero-shot news summarization without any fine-tuning. For example, in the DUC2003 dataset, the ROUGE-1 score of BART increases 13.7% after the lead-bias pre-training. We deploy the model in Microsoft News and provide public APIs as well as a demo website for multi-lingual news summarization.
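
The pre-training objective named in the abstract, predicting the leading sentences from the rest of an article, can be sketched as a data-construction step. The Python snippet below is only an illustration: the function name `lead_bias_example`, the default of three leading sentences, and the naive regex sentence splitter are all assumptions, and the cleaning and filtering the authors describe are not reproduced.

```python
import re
from typing import Optional, Tuple


def lead_bias_example(article: str, num_lead: int = 3) -> Optional[Tuple[str, str]]:
    """Build one (source, target) pair for lead-bias pre-training.

    target = the first `num_lead` sentences of the article
    source = the remaining sentences
    `num_lead` is a hypothetical default chosen for this sketch.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]
    if len(sentences) <= num_lead:
        return None  # too short to leave any source text
    target = " ".join(sentences[:num_lead])
    source = " ".join(sentences[num_lead:])
    return source, target


article = "Storm hits coast. Thousands lose power. Crews are working. Repairs may take days."
pair = lead_bias_example(article, num_lead=1)
print(pair)
```

Pairs produced this way need no human-written summaries, which is what makes the objective self-supervised.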

8 citations

Posted Content•

TeaForN: Teacher-Forcing with N-grams

Sebastian Goodman¹, Nan Ding¹, Radu Soricut¹

Google¹

07 Oct 2020-arXiv: Computation and Language

TL;DR: The proposed method, Teacher-Forcing with N-grams (TeaForN), addresses exposure bias and the lack of differentiability across timesteps in teacher-forced sequence generation models through the use of a stack of N decoders trained to decode along a secondary time axis that allows model parameter updates based on N prediction steps.

Abstract: Sequence generation models trained with teacher-forcing suffer from issues related to exposure bias and lack of differentiability across timesteps. Our proposed method, Teacher-Forcing with N-grams (TeaForN), addresses both these problems directly, through the use of a stack of N decoders trained to decode along a secondary time axis that allows model parameter updates based on N prediction steps. TeaForN can be used with a wide class of decoder architectures and requires minimal modifications from a standard teacher-forcing setup. Empirically, we show that TeaForN boosts generation quality on one Machine Translation benchmark, WMT 2014 English-French, and two News Summarization benchmarks, CNN/Dailymail and Gigaword.

7 citations

Posted Content•

SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising

Kuan Xuan, Yongbo Wang, Yongliang Wang, Zujie Wen, Yang Dong

17 May 2021-arXiv: Computation and Language

TL;DR: This paper proposes Schema-aware Denoising (SeaD), which improves the performance of seq-to-seq models in both schema linking and grammar correctness and establishes a new state of the art on the WikiSQL benchmark.

Abstract: In the text-to-SQL task, seq-to-seq models often perform sub-optimally due to limitations in their architecture. In this paper, we present a simple yet effective approach that adapts a transformer-based seq-to-seq model to robust text-to-SQL generation. Instead of imposing constraints on the decoder or reformatting the task as slot-filling, we propose to train the seq-to-seq model with Schema-aware Denoising (SeaD), which consists of two denoising objectives that train the model to either recover the input or predict the output from two novel noises, erosion and shuffle. These denoising objectives act as auxiliary tasks for better modeling the structural data in S2S generation. In addition, we improve on execution-guided (EG) decoding and propose a clause-sensitive EG decoding strategy to overcome the limitations of EG decoding for generative models. The experiments show that the proposed method improves the performance of the seq-to-seq model in both schema linking and grammar correctness and establishes a new state of the art on the WikiSQL benchmark. The results indicate that the capacity of the vanilla seq-to-seq architecture for text-to-SQL may have been underestimated.

7 citations

Journal Article•

SKR-QA: Semantic ranking and knowledge revise for multi-choice question answering

Mucheng Ren¹, Heyan Huang¹, Yang Gao¹

Beijing Institute of Technology¹

12 Oct 2021-Neurocomputing

TL;DR: It is argued that complete knowledge already exists in natural language and that the knowledge obtained by the Semantic-rank-and-Knowledge-Revise-based Question Answering approach is more conducive to machine understanding, thus providing certain interpretability.

7 citations

Posted Content•

With Little Power Comes Great Responsibility

Dallas Card¹, Peter Henderson¹, Urvashi Khandelwal¹, Robin Jia¹, Kyle Mahowald², Dan Jurafsky¹

Stanford University¹, University of California, Santa Barbara²

13 Oct 2020-arXiv: Computation and Language

TL;DR: This paper found that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point, and that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied.

Abstract: Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
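
To make the definition of statistical power in the abstract concrete, the Python sketch below estimates power by simulation: it repeatedly draws test-set results for two models with a known true accuracy gap and counts how often a significance test rejects the null hypothesis. Everything specific here is an assumption for illustration, including the function names, the accuracies, the test-set sizes, and the choice of an unpaired two-proportion z-test on independent outcomes. It is not the procedure from the authors' notebooks, and it ignores the fact that both models are normally scored on the same examples.

```python
import math
import random


def z_test_rejects(correct_a: int, correct_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Two-sided two-proportion z-test at roughly the 5% level."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False  # identical, degenerate outcomes: nothing to reject
    return abs(p_a - p_b) / se > z_crit


def estimate_power(acc_a: float, acc_b: float, n: int,
                   trials: int = 200, seed: int = 0) -> float:
    """Fraction of simulated experiments in which the real gap is detected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Simulate per-example correctness for each model independently.
        correct_a = sum(rng.random() < acc_a for _ in range(n))
        correct_b = sum(rng.random() < acc_b for _ in range(n))
        rejections += z_test_rejects(correct_a, correct_b, n)
    return rejections / trials


# Hypothetical setting: a true 3-point accuracy gap, two test-set sizes.
power_small = estimate_power(0.80, 0.83, n=300)
power_large = estimate_power(0.80, 0.83, n=3000)
print(power_small, power_large)
```

Under these assumptions the smaller test set detects the same true gap far less often than the larger one, which is the sense in which small test sets leave comparisons underpowered.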

7 citations