Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
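
The text-to-text framing means every task, from translation to classification to summarization, is cast as feeding the model an input string and training it to generate a target string. Below is a minimal sketch of that usage pattern, assuming the publicly released T5 checkpoints as exposed by the Hugging Face transformers library; the `t5-small` name and task prefixes follow that library's conventions and are not code from the paper itself.

```python
# Minimal text-to-text sketch: every task is "text in, text out".
# Assumes the Hugging Face `transformers` package and the public t5-small checkpoint.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is encoded purely in the input text via a prefix.
prompts = [
    "translate English to German: The house is wonderful.",   # translation
    "cola sentence: The book did not sold well.",              # acceptability classification
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",  # summarization
]

for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```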


Citations
Posted Content
TL;DR: This work exploits large pre-trained transformer-based models and addresses long-span dependencies in abstractive summarization using two methods, local self-attention and explicit content selection, achieving comparable or better results than existing approaches.
Abstract: Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization. Typically these systems are trained by fine-tuning a large pre-trained model to the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention and explicit content selection. These approaches are compared on a range of network configurations. Experiments are carried out on standard long-span summarization tasks, including the Spotify Podcast, arXiv, and PubMed datasets. We demonstrate that by combining these methods, we can achieve state-of-the-art results on all three tasks in terms of ROUGE scores. Moreover, without a large-scale GPU card, our approach can achieve comparable or better results than existing approaches.
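
Local self-attention restricts each token to attending within a fixed-size window, so memory grows linearly rather than quadratically with input length. Below is a hedged, generic sketch of such a banded attention mask in PyTorch; the window size and masking convention are illustrative assumptions, not the authors' configuration.

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to.

    Each position i sees only positions j with |i - j| <= window, which keeps
    attention cost O(seq_len * window) instead of O(seq_len ** 2).
    """
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = local_attention_mask(seq_len=8, window=2)
# Pass the inverted mask (~mask) as attn_mask when calling an attention layer,
# e.g. torch.nn.MultiheadAttention(...)(q, k, v, attn_mask=~mask).
print(mask.int())
```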

21 citations

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, a well-optimized E2E ASR encoder and a pre-trained language model encoder are combined into a transformer decoder for end-to-end spoken language understanding.
Abstract: End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder. The unified speech-language pre-trained model (SLP) is continually enhanced on limited labeled data from a target domain by using a conditional masked language model (MLM) objective, and thus can effectively generate a sequence of intent, slot type, and slot value for a given input speech during inference. The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method. It also outperforms the present state-of-the-art approaches to E2E SLU with much less paired data.
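
The core architectural idea is a decoder that cross-attends jointly to the outputs of a speech (ASR) encoder and a text (language model) encoder. Below is a toy PyTorch sketch of that fusion; the dimensions, layer counts, and the simple concatenation of the two memories are illustrative assumptions, not the paper's SLP implementation.

```python
import torch
import torch.nn as nn

class FusedSLUDecoder(nn.Module):
    """Toy decoder that attends to both a speech encoder and a text encoder."""

    def __init__(self, d_model: int = 256, vocab_size: int = 1000, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, text_feats, target_ids):
        # Concatenate the two encoder memories along the time axis so the
        # decoder cross-attends to speech and language representations jointly.
        memory = torch.cat([speech_feats, text_feats], dim=1)
        tgt = self.embed(target_ids)
        return self.out(self.decoder(tgt, memory))

# Example shapes: batch of 2, 50 speech frames, 20 text tokens, 10 target tokens.
model = FusedSLUDecoder()
logits = model(torch.randn(2, 50, 256), torch.randn(2, 20, 256),
               torch.randint(0, 1000, (2, 10)))
print(logits.shape)  # torch.Size([2, 10, 1000])
```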

21 citations

Journal ArticleDOI
TL;DR: This work evaluates a combination of in-domain and cross-domain transfer learning strategies for damage detection in bridges, and provides visual explanations of the predictive models to enable algorithmic transparency and give experts insight into the decision logic of typically black-box deep models.
Abstract: We investigate the capabilities of transfer learning in the area of structural health monitoring. In particular, we are interested in damage detection for concrete structures. Typical image datasets for such problems are relatively small, calling for the transfer of learned representations from a related large-scale dataset. Past efforts of damage detection using images have mainly considered cross-domain transfer learning approaches using pre-trained ImageNet models that are subsequently fine-tuned for the target task. However, there are rising concerns about the generalizability of ImageNet representations for specific target domains, such as for visual inspection and medical imaging. We, therefore, evaluate a combination of in-domain and cross-domain transfer learning strategies for damage detection in bridges. We perform comprehensive comparisons to study the impact of cross-domain and in-domain transfer, with various initialization strategies, using six publicly available visual inspection datasets. The pre-trained models are also evaluated for their ability to cope with the extremely low-data regime. We show that the combination of cross-domain and in-domain transfer consistently shows superior performance, especially with tiny datasets. Likewise, we also provide visual explanations of predictive models to enable algorithmic transparency and provide insights to experts about the intrinsic decision logic of typically black-box deep models.
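
In practice, the in-domain plus cross-domain recipe amounts to two stages of fine-tuning: start from ImageNet weights, adapt on a related visual-inspection dataset, then fine-tune on the small target bridge dataset. Below is a hedged sketch with torchvision; the ResNet-50 backbone, dataset names, and training loop are assumptions for illustration only, not the paper's setup.

```python
import torch.nn as nn
from torchvision import models

def build_backbone(num_classes: int) -> nn.Module:
    """ResNet-50 initialized from ImageNet weights with a fresh classifier head."""
    net = models.resnet50(weights="IMAGENET1K_V1")  # cross-domain initialization
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

# Stage 1 (in-domain transfer): fine-tune on a larger, related inspection dataset.
model = build_backbone(num_classes=5)
# train(model, related_inspection_loader)        # hypothetical training loop

# Stage 2 (target task): reuse the adapted weights, swap the head, and fine-tune
# on the small bridge damage dataset.
model.fc = nn.Linear(model.fc.in_features, 2)    # e.g. damaged vs. intact
# train(model, bridge_damage_loader, lr=1e-4)     # hypothetical training loop
```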

21 citations

Posted Content
TL;DR: This work fine-tunes BERT on AQuA-RAT, a popular dataset of math word problems, and proposes new pretext tasks for learning mathematical rules; the resulting model achieves significantly better outcomes than data-driven baselines and performs on par with more tailored models.
Abstract: Imagine you are in a supermarket. You have two bananas in your basket and want to buy four apples. How many fruits do you have in total? This seemingly straightforward question can be challenging for data-driven language models, even if trained at scale. However, we would expect such generic language models to possess some mathematical abilities in addition to typical linguistic competence. Towards this goal, we investigate if a commonly used language model, BERT, possesses such mathematical abilities and, if so, to what degree. For that, we fine-tune BERT on a popular dataset of math word problems, AQuA-RAT, and conduct several tests to better understand the learned representations. Since we teach models trained on natural language to do formal mathematics, we hypothesize that such models would benefit from training on semi-formal steps that explain how math results are derived. To better accommodate such training, we also propose new pretext tasks for learning mathematical rules. We call them (Neighbor) Reasoning Order Prediction (ROP or NROP). With this new model, we achieve significantly better outcomes than data-driven baselines and even perform on par with more tailored models. We also show how to reduce positional bias in such models.
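
The (Neighbor) Reasoning Order Prediction pretext tasks turn the ordered rationale steps of AQuA-RAT into a self-supervised signal: the model must judge whether steps appear in their original order. Below is a hedged sketch of constructing such training examples; the pairing and labeling scheme is a plausible reading of the abstract, not the authors' exact recipe.

```python
import random
from typing import List, Tuple

def make_rop_examples(rationale_steps: List[str]) -> List[Tuple[str, str, int]]:
    """Build (step_a, step_b, label) pairs from ordered rationale steps:
    label 1 if the neighbor order is original, 0 if the two steps were swapped."""
    examples = []
    for i in range(len(rationale_steps) - 1):
        a, b = rationale_steps[i], rationale_steps[i + 1]
        if random.random() < 0.5:
            examples.append((a, b, 1))   # original neighbor order
        else:
            examples.append((b, a, 0))   # swapped neighbors
    return examples

steps = [
    "2 bananas are in the basket.",
    "4 apples are added.",
    "2 + 4 = 6 fruits in total.",
]
print(make_rop_examples(steps))
```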

21 citations

Proceedings ArticleDOI
Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, Jaewoo Kang
01 Nov 2020
TL;DR: This article showed that using the prior distribution of answer positions as a bias model is very effective at reducing position bias, recovering the performance of BERT from 37.48% to 81.64% when trained on a biased SQuAD dataset.
Abstract: Many extractive question answering models are trained to predict start and end positions of answers. The choice of predicting answers as positions is mainly due to its simplicity and effectiveness. In this study, we hypothesize that when the distribution of the answer positions is highly skewed in the training set (e.g., answers lie only in the k-th sentence of each passage), QA models predicting answers as positions can learn spurious positional cues and fail to give answers in different positions. We first illustrate this position bias in popular extractive QA models such as BiDAF and BERT and thoroughly examine how position bias propagates through each layer of BERT. To safely deliver position information without position bias, we train models with various de-biasing methods including entropy regularization and bias ensembling. Among them, we found that using the prior distribution of answer positions as a bias model is very effective at reducing position bias, recovering the performance of BERT from 37.48% to 81.64% when trained on a biased SQuAD dataset.
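
Using the prior distribution of answer positions as a bias model is a form of bias ensembling: during training, the QA model's span logits are combined with the log prior so the main model is not rewarded for re-learning the positional shortcut, and at test time the prior is dropped. Below is a hedged sketch of a bias-product variant of this idea; the exact ensembling function used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def debiased_loss(span_logits: torch.Tensor,
                  position_prior: torch.Tensor,
                  gold_positions: torch.Tensor) -> torch.Tensor:
    """Bias-product training: add the log prior over answer positions to the
    model's logits so positional regularities are explained by the prior rather
    than absorbed into the model. Use plain `span_logits` at test time."""
    combined = span_logits + torch.log(position_prior + 1e-12)
    return F.cross_entropy(combined, gold_positions)

# Toy example: batch of 2, passages with 6 token positions.
logits = torch.randn(2, 6)
prior = torch.tensor([0.5, 0.2, 0.1, 0.1, 0.05, 0.05]).expand(2, 6)
gold = torch.tensor([1, 4])
print(debiased_loss(logits, prior, gold))
```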

21 citations

Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.