Journal Article
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
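The text-to-text framing described in the abstract can be sketched minimally: every task, including classification, is cast as mapping an input string to a target string by prepending a task prefix. The prefixes below follow the convention described in the paper, but the helper function is illustrative, not part of the released code.

```python
# Minimal sketch of the text-to-text framing: every task becomes
# "input text -> target text" via a task prefix. Classification
# targets are simply emitted as label words, e.g. "acceptable".

def to_text_to_text(task: str, text: str) -> str:
    """Prefix the raw input with its task description."""
    prefixes = {
        "translate_en_de": "translate English to German: ",
        "summarize": "summarize: ",
        "cola": "cola sentence: ",
    }
    return prefixes[task] + text

example_input = to_text_to_text("summarize", "Transfer learning is effective.")
# → "summarize: Transfer learning is effective."
```

Because every task shares the same string-in, string-out interface, a single model, loss, and decoding procedure can serve all of them.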
Citations
01 Aug 2021
TL;DR: This article proposes two reward functions for abstractive summarization: the first, referred to as RwB-Hinge, dynamically selects the samples for the gradient update, while the second, nicknamed RISK, leverages a small pool of strong candidates to inform the reward.
Abstract: To date, most abstractive summarisation models have relied on variants of the negative log-likelihood (NLL) as their training objective. In some cases, reinforcement learning has been added to train the models with an objective that is closer to their evaluation measures (e.g. ROUGE). However, the reward function to be used within the reinforcement learning approach can play a key role for performance and is still partially unexplored. For this reason, in this paper, we propose two reward functions for the task of abstractive summarisation: the first function, referred to as RwB-Hinge, dynamically selects the samples for the gradient update. The second function, nicknamed RISK, leverages a small pool of strong candidates to inform the reward. In the experiments, we probe the proposed approach by fine-tuning an NLL pre-trained model over nine summarisation datasets of diverse size and nature. The experimental results show a consistent improvement over the negative log-likelihood baselines.
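The hinge-style sample selection behind RwB-Hinge can be sketched as follows: a sampled summary contributes to the policy-gradient update only when its reward beats a greedy baseline. The `reward` function below is a toy unigram-overlap proxy for ROUGE, and the function names are illustrative, not the authors' implementation.

```python
# Hedged sketch of hinge-based sample selection for RL fine-tuning of
# a summarizer: samples that do not improve on the greedy baseline are
# dropped from the gradient update (their weight is clipped to zero).

def reward(summary: str, reference: str) -> float:
    # Toy proxy for ROUGE: unigram overlap ratio against the reference.
    s, r = set(summary.split()), set(reference.split())
    return len(s & r) / max(len(r), 1)

def hinge_weight(sampled: str, greedy: str, reference: str) -> float:
    """Scaling applied to the policy-gradient term for one sample.

    Zero when the sampled summary does not beat the greedy baseline,
    so such samples are excluded from the update."""
    delta = reward(sampled, reference) - reward(greedy, reference)
    return max(delta, 0.0)
```

In a real training loop the weight would multiply the negative log-probability of the sampled summary, so only baseline-beating samples move the model.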
15 Apr 2021
TL;DR: The authors leverage the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which they then use for finetuning much smaller and more efficient models.
Abstract: To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.
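The pipeline in this abstract can be sketched in two steps: a large generative model writes labeled sentence pairs from scratch, and the resulting dataset supervises a smaller embedding model. `generate_pair` below is a hypothetical stand-in for prompting a large PLM; only the overall pipeline shape follows the abstract.

```python
# Hedged sketch of fully unsupervised dataset generation: a pretend-PLM
# emits (sentence_a, sentence_b, label) triples, which would then be
# used to fine-tune a small sentence-embedding model.

def generate_pair(label: str) -> tuple[str, str, str]:
    """Stand-in for a PLM prompted to write a labeled sentence pair."""
    a = "The cat sat on the mat."
    b = a if label == "similar" else "Stocks fell sharply today."
    return (a, b, label)

def build_dataset(n: int) -> list[tuple[str, str, str]]:
    """Alternate labels to produce a balanced synthetic dataset."""
    labels = ["similar", "dissimilar"]
    return [generate_pair(labels[i % 2]) for i in range(n)]

data = build_dataset(4)
# `data` would then supervise a much smaller, more efficient model.
```

The point of the design is that no human labeling is needed: the large model's generative ability substitutes for annotation, and only the small model is deployed.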
01 Nov 2021
TL;DR: StreamHover, as discussed by the authors, leverages a vector-quantized variational autoencoder to learn latent vector representations of spoken utterances and identify salient utterances from the transcripts to form summaries.
Abstract: With the explosive growth of livestream broadcasting, there is an urgent need for new summarization technology that enables us to create a preview of streamed content and tap into this wealth of knowledge. However, the problem is nontrivial due to the informal nature of spoken language. Further, there has been a shortage of annotated datasets that are necessary for transcript summarization. In this paper, we present StreamHover, a framework for annotating and summarizing livestream transcripts. With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora. We explore a neural extractive summarization model that leverages vector-quantized variational autoencoder to learn latent vector representations of spoken utterances and identify salient utterances from the transcripts to form summaries. We show that our model generalizes better and improves performance over strong baselines. The results of this study provide an avenue for future research to improve summarization solutions for efficient browsing of livestreams.
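The vector-quantization step at the heart of a VQ-VAE can be sketched concisely: each utterance embedding is snapped to its nearest codebook vector, and the discrete code index serves as the latent representation. Shapes and names below are illustrative, not StreamHover's actual implementation.

```python
# Hedged sketch of VQ-VAE quantization: map each continuous utterance
# embedding to the index of its nearest codebook vector.

import numpy as np

def quantize(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each embedding, the index of its nearest code vector."""
    # (n, 1, d) - (1, k, d) -> (n, k) squared distances via broadcasting.
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
utts = np.array([[0.1, -0.1], [0.9, 1.2]])
codes = quantize(utts, codebook)  # → array([0, 1])
```

In the full model these discrete codes would be learned jointly with the encoder and decoder; salient utterances are then identified from the learned latent space.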
TL;DR: This article proposed a cross-attention-based approach to obtain an extractive answer span from the generative model by leveraging the decoder's cross-attention patterns, and applied joint training to further improve QA performance.
Abstract: We propose a novel method for applying Transformer models to extractive question answering (QA) tasks. Recently, pretrained generative sequence-to-sequence (seq2seq) models have achieved great success in question answering. Contributing to the success of these models are internal attention mechanisms such as cross-attention. We propose a simple strategy to obtain an extractive answer span from the generative model by leveraging the decoder cross-attention patterns. Viewing cross-attention as an architectural prior, we apply joint training to further improve QA performance. Empirical results show that on open-domain question answering datasets like NaturalQuestions and TriviaQA, our method approaches state-of-the-art performance on both generative and extractive inference, all while using far fewer parameters. Furthermore, this strategy allows us to perform hallucination-free inference while conferring significant improvements to the model's ability to rerank relevant passages.
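Reading an extractive span off decoder cross-attention can be sketched at a high level: average the attention that generated answer tokens pay to each passage token, then take the highest-scoring contiguous window. This mirrors the idea in the abstract only in outline; array names and the fixed window width are illustrative.

```python
# Hedged sketch of span extraction from cross-attention: aggregate
# attention mass per passage token, then pick the best fixed-width
# contiguous window as the extractive answer span.

import numpy as np

def extract_span(cross_attn: np.ndarray, width: int) -> tuple:
    """cross_attn: (decoder_steps, passage_len) attention weights.

    Returns (start, end) of the passage window receiving the most
    attention mass, aggregated over decoder steps."""
    scores = cross_attn.mean(axis=0)                  # per-token mass
    window = np.convolve(scores, np.ones(width), mode="valid")
    start = int(window.argmax())
    return (start, start + width)

attn = np.array([[0.1, 0.1, 0.5, 0.3],
                 [0.0, 0.1, 0.6, 0.3]])
span = extract_span(attn, 2)  # → (2, 4): tokens 2..3 get the most mass
```

Because the span comes directly from passage tokens, inference of this kind cannot hallucinate text that is absent from the input, which matches the abstract's "hallucination-free" claim.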
TL;DR: This paper mines a parallel corpus from publicly available lectures at Coursera and uses the resulting corpus in multistage fine-tuning-based domain adaptation for high-quality lectures translation.
Abstract: Lectures translation is a case of spoken language translation, and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language-independent framework for parallel corpus mining, a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments by relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in multistage fine-tuning-based domain adaptation for high-quality lectures translation. For Japanese–English lectures translation, we extracted approximately 40,000 lines of parallel data and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
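The alignment step described in the abstract can be sketched as scoring candidate sentence pairs by cosine similarity between continuous-space sentence embeddings and keeping pairs above a threshold. The embeddings below are toy vectors standing in for a multilingual sentence encoder; the greedy one-best matching and the threshold value are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of parallel-sentence mining via cosine similarity:
# pair each source sentence with its highest-cosine target, keeping
# only matches above a similarity threshold.

import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    """Greedy one-best alignment over sentence embeddings."""
    pairs = []
    for i, u in enumerate(src_embs):
        sims = [cosine(u, v) for v in tgt_embs]
        j = int(np.argmax(sims))
        if sims[j] >= threshold:
            pairs.append((i, j))
    return pairs

src = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # toy source embeddings
tgt = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]  # toy target embeddings
aligned = mine_pairs(src, tgt)  # → [(0, 0), (1, 1)]
```

The threshold trades precision for recall in the mined corpus; the paper's guidelines on addressing noise in the mined data speak to exactly this trade-off.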