Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
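The text-to-text idea in the abstract can be illustrated with a small sketch: every task becomes a pair of strings, an input with a task prefix and a target to generate. The helper `to_text_to_text`, the `TextToTextExample` type, and the exact prefix strings below are illustrative assumptions, not details stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One training pair: the model reads `source` and must generate `target`."""
    source: str
    target: str


def to_text_to_text(task_prefix: str, text: str, label: str) -> TextToTextExample:
    # Hypothetical helper: prepend a task description to the input so that a
    # single model can tell tasks apart; the label is kept as plain text.
    return TextToTextExample(
        source=f"{task_prefix}: {text.strip()}",
        target=label.strip(),
    )


# Translation and classification share the same string-in, string-out shape.
examples = [
    to_text_to_text("translate English to German", "That is good.", "Das ist gut."),
    to_text_to_text("cola sentence", "The course is jumping well.", "not acceptable"),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because class labels are emitted as words rather than class indices, the same training loss and decoding procedure can serve every task in such a setup.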

Citations

Proceedings Article

RewardsOfSum: Exploring Reinforcement Learning Rewards for Summarisation

Jacob Parnell, Inigo Jauregi Unanue, Massimo Piccardi

01 Aug 2021

TL;DR: This article proposes two reward functions for abstractive summarization: the first, referred to as RwB-Hinge, dynamically selects the samples for the gradient update, and the second, nicknamed RISK, leverages a small pool of strong candidates to inform the reward.

Abstract: To date, most abstractive summarisation models have relied on variants of the negative log-likelihood (NLL) as their training objective. In some cases, reinforcement learning has been added to train the models with an objective that is closer to their evaluation measures (e.g. ROUGE). However, the reward function to be used within the reinforcement learning approach can play a key role for performance and is still partially unexplored. For this reason, in this paper, we propose two reward functions for the task of abstractive summarisation: the first function, referred to as RwB-Hinge, dynamically selects the samples for the gradient update. The second function, nicknamed RISK, leverages a small pool of strong candidates to inform the reward. In the experiments, we probe the proposed approach by fine-tuning an NLL pre-trained model over nine summarisation datasets of diverse size and nature. The experimental results show a consistent improvement over the negative log-likelihood baselines.

Proceedings Article

Generating Datasets with Pretrained Language Models

Timo Schick¹, Hinrich Schütze¹

Ludwig Maximilian University of Munich¹

15 Apr 2021

TL;DR: The authors leverage the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which they then use for finetuning much smaller and more efficient models.

Abstract: To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.

Proceedings Article

StreamHover: Livestream Transcript Summarization and Annotation

Sangwoo Cho¹, Franck Dernoncourt², Tim Ganter¹, Trung Bui², Nedim Lipka², Walter Chang², Hailin Jin², Jonathan Brandt², Hassan Foroosh², Fei Liu¹

University of Central Florida¹, Adobe Systems²

01 Nov 2021

TL;DR: StreamHover is a framework for annotating and summarizing livestream transcripts; its neural extractive model leverages a vector-quantized variational autoencoder to learn latent vector representations of spoken utterances and identify salient utterances from the transcripts to form summaries.

Abstract: With the explosive growth of livestream broadcasting, there is an urgent need for new summarization technology that enables us to create a preview of streamed content and tap into this wealth of knowledge. However, the problem is nontrivial due to the informal nature of spoken language. Further, there has been a shortage of annotated datasets that are necessary for transcript summarization. In this paper, we present StreamHover, a framework for annotating and summarizing livestream transcripts. With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora. We explore a neural extractive summarization model that leverages vector-quantized variational autoencoder to learn latent vector representations of spoken utterances and identify salient utterances from the transcripts to form summaries. We show that our model generalizes better and improves performance over strong baselines. The results of this study provide an avenue for future research to improve summarization solutions for efficient browsing of livestreams.

Posted Content

Attention-guided Generative Models for Extractive Question Answering

Peng Xu, Davis Liang, Zhiheng Huang, Bing Xiang¹

Amazon.com¹

12 Oct 2021-arXiv: Computation and Language

TL;DR: This article proposed an approach to obtain an extractive answer span from a generative model by leveraging the decoder's cross-attention patterns, and applied joint training to further improve QA performance.

Abstract: We propose a novel method for applying Transformer models to extractive question answering (QA) tasks. Recently, pretrained generative sequence-to-sequence (seq2seq) models have achieved great success in question answering. Contributing to the success of these models are internal attention mechanisms such as cross-attention. We propose a simple strategy to obtain an extractive answer span from the generative model by leveraging the decoder cross-attention patterns. Viewing cross-attention as an architectural prior, we apply joint training to further improve QA performance. Empirical results show that on open-domain question answering datasets like NaturalQuestions and TriviaQA, our method approaches state-of-the-art performance on both generative and extractive inference, all while using much fewer parameters. Furthermore, this strategy allows us to perform hallucination-free inference while conferring significant improvements to the model's ability to rerank relevant passages.

Posted Content

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Haiyue Song¹, Raj Dabre², Atsushi Fujita², Sadao Kurohashi³

Shanghai Jiao Tong University¹, National Institute of Information and Communications Technology², Kyoto University³

26 Dec 2019-arXiv: Computation and Language

TL;DR: This paper mined a parallel corpus from publicly available lectures at Coursera and used the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation.

Abstract: Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language independent framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese–English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
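The cosine-similarity step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the greedy best-match strategy, the `threshold` value, the function names, and the toy two-dimensional vectors are all assumptions, and the machine translation step that would precede embedding is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its most similar target sentence.

    Assumed strategy: greedy best match per source sentence, keeping a pair
    only when its similarity reaches `threshold` (a hypothetical cut-off).
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = -1, -1.0
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy sentence representations standing in for real sentence embeddings.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 0.1], [-1.0, -1.0]]

pairs = align(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

In this toy run the third source vector has no target above the threshold, so it is left unaligned; filtering of that kind is one simple way to limit noise in mined pairs.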
