Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduces a unified framework that converts all text-based language problems into a text-to-text format and systematically compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
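
To make the text-to-text framing concrete, here is a minimal sketch assuming the Hugging Face Transformers interface to the released T5 checkpoints (the authors also released their own codebase, which differs); the task prefixes shown are the ones the paper uses for translation, summarization, and CoLA.

```python
# Minimal sketch of the text-to-text format using a released T5 checkpoint
# via Hugging Face Transformers (an assumption; the authors' own codebase
# and pre-trained models are also available).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is cast as text in, text out: a task prefix tells the model
# which problem to solve, and even classification labels are emitted as text.
inputs = [
    "translate English to German: The house is wonderful.",
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has "
    "emerged as a powerful technique in natural language processing.",
    "cola sentence: The course is jumping well.",  # -> "acceptable"/"unacceptable"
]

for text in inputs:
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```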

Citations
Proceedings ArticleDOI
01 Nov 2020
TL;DR: This work shows that even a handful of summaries is sufficient to bootstrap generation of the summary text with all expected properties, such as writing style, informativeness, fluency, and sentiment preservation.
Abstract: Opinion summarization is the automatic creation of text reflecting subjective information expressed in multiple documents, such as user reviews of a product. The task is practically important and has attracted a lot of attention. However, due to the high cost of summary production, datasets large enough for training supervised models are lacking. Instead, the task has been traditionally approached with extractive methods that learn to select text fragments in an unsupervised or weakly-supervised way. Recently, it has been shown that abstractive summaries, potentially more fluent and better at reflecting conflicting information, can also be produced in an unsupervised fashion. However, these models, not being exposed to actual summaries, fail to capture their essential properties. In this work, we show that even a handful of summaries is sufficient to bootstrap generation of the summary text with all expected properties, such as writing style, informativeness, fluency, and sentiment preservation. We start by training a conditional Transformer language model to generate a new product review given other available reviews of the product. The model is also conditioned on review properties that are directly related to summaries; the properties are derived from reviews with no manual effort. In the second stage, we fine-tune a plug-in module that learns to predict property values on a handful of summaries. This lets us switch the generator to summarization mode. We show on Amazon and Yelp datasets that our approach substantially outperforms previous extractive and abstractive methods in automatic and human evaluation.
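
A rough sketch of the two-stage idea follows; the property definitions and variable names here are simplified assumptions for illustration, not the paper's exact features.

```python
# Illustrative sketch (not the authors' code): derive conditioning properties
# from reviews automatically, then fit a tiny "plug-in" target from a handful
# of gold summaries so the same generator can be switched into summarization
# mode. The two properties below (length, overlap) are assumed simplifications.
import numpy as np

def properties(text: str, other_reviews: list) -> np.ndarray:
    """Derive conditioning properties with no manual effort:
    a length feature and lexical overlap with the other reviews."""
    tokens = set(text.lower().split())
    overlaps = [
        len(tokens & set(r.lower().split())) / max(len(tokens), 1)
        for r in other_reviews
    ]
    return np.array([len(text.split()) / 100.0, float(np.mean(overlaps))])

# Stage 1 (conceptually): train p(review | other_reviews, properties).
# Stage 2: average the properties of a handful of gold summaries; feeding
# this vector to the generator switches it into summarization mode.
reviews = ["long detailed review ...", "another opinionated review ..."]
gold_summaries = ["compact balanced overview ...", "short consensus text ..."]

plugin_target = np.mean([properties(s, reviews) for s in gold_summaries], axis=0)
print("summary-mode conditioning vector:", plugin_target)
```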

59 citations

Proceedings ArticleDOI
20 Apr 2020
TL;DR: A novel evaluation metric, usefulness, is established, which goes beyond relevance and measures whether the suggestions provide valuable information for the next step of a user’s journey, and a public benchmark for useful question suggestion is constructed.
Abstract: This paper studies a new scenario in conversational search, conversational question suggestion, which leads search engine users to more engaging experiences by suggesting interesting, informative, and useful follow-up questions. We first establish a novel evaluation metric, usefulness, which goes beyond relevance and measures whether the suggestions provide valuable information for the next step of a user’s journey, and construct a public benchmark for useful question suggestion. Then we develop two suggestion systems, a BERT-based ranker and a GPT-2-based generator, both trained with novel weak supervision signals that convey past users’ search behaviors in search sessions. The weak supervision signals help ground the suggestions to users’ information-seeking trajectories: we identify more coherent and informative sessions using encodings, and then weakly supervise our models to imitate how users transition to the next state of search. Our offline experiments demonstrate the crucial role our “next-turn” inductive training plays in improving usefulness over a strong online system. Our online A/B test in Bing shows that our more useful question suggestions receive 8% more user clicks than the previous system.
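
As a hedged sketch of the ranker component, the cross-encoder below scores (query, candidate follow-up question) pairs; the checkpoint name is a stand-in, and the weak supervision from Bing search sessions is not reproduced here, so the untrained scoring head only illustrates the interface.

```python
# Sketch of a BERT-based cross-encoder ranker over candidate follow-up
# questions. "bert-base-uncased" is a placeholder checkpoint; scores from the
# untrained regression head are random and shown only to illustrate the API.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ranker = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1  # regression-style usefulness score
)

query = "how to train a text-to-text transformer"
candidates = [
    "what data is used to pre-train T5?",
    "what is the weather today?",
    "how do I fine-tune T5 on my own task?",
]

# Encode each (query, candidate) pair jointly, as a cross-encoder does.
enc = tokenizer([query] * len(candidates), candidates,
                padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = ranker(**enc).logits.squeeze(-1)

for s, c in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{s:+.3f}  {c}")
```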

59 citations

Posted Content
TL;DR: This work introduces Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes; it also presents a new method to estimate the best possible performance on tasks with inherently diverse label distributions and explores likelihood functions that separate intrinsic from model uncertainty.
Abstract: As AI systems become an increasing part of people's everyday lives, it becomes ever more important that they understand people's ethical norms. Motivated by descriptive ethics, a field of study that focuses on people's descriptive judgments rather than theoretical prescriptions on morality, we investigate a novel, data-driven approach to machine ethics. We introduce Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes. Each anecdote recounts a complex ethical situation, often posing moral dilemmas, paired with a distribution of judgments contributed by the community members. Our dataset presents a major challenge to state-of-the-art neural language models, leaving significant room for improvement. However, when presented with simplified moral situations, the results are considerably more promising, suggesting that neural models can effectively learn simpler ethical building blocks. A key takeaway of our empirical analysis is that norms are not always clean-cut; many situations are naturally divisive. We present a new method to estimate the best possible performance on such tasks with inherently diverse label distributions, and explore likelihood functions that separate intrinsic from model uncertainty.
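
The intuition behind a performance ceiling for divisive labels can be sketched in a few lines; the estimator below (expected agreement of a per-item majority-label oracle) is a simplified assumption, not necessarily the paper's exact method.

```python
# Sketch of estimating a performance ceiling when labels are inherently
# diverse: if each anecdote carries a distribution over judgments, no
# classifier can beat the expected agreement of the per-item majority label.
import numpy as np

# Rows: anecdotes; columns: probability of each ethical judgment,
# estimated from community votes (toy numbers for illustration).
label_dist = np.array([
    [0.90, 0.10, 0.00],   # near-unanimous
    [0.50, 0.40, 0.10],   # divisive
    [0.34, 0.33, 0.33],   # maximally divisive
])

# An oracle predicts the majority judgment for each item; its expected
# accuracy is the mean of the per-item maximum probability.
best_possible_accuracy = label_dist.max(axis=1).mean()
print(f"estimated performance ceiling: {best_possible_accuracy:.3f}")  # ~0.58
```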

58 citations

Book ChapterDOI
05 Sep 2021
TL;DR: The authors modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch.
Abstract: We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token bounding boxes, avoiding, in this way, the use of raw images. This leads to a layout-aware language model which can then be fine-tuned on downstream tasks.
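
A minimal sketch of the core idea follows; the class name, embedding sizes, and coordinate range are illustrative assumptions, but the mechanism matches the abstract: keep the pre-trained word embeddings and only add embeddings of OCR bounding-box coordinates, so language semantics need not be re-learned.

```python
# Sketch: augment a pre-trained LM's input with token bounding-box
# coordinates from OCR, avoiding raw images entirely. Names/sizes assumed.
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, max_coord=1024):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)  # reused from the pre-trained LM
        self.x0 = nn.Embedding(max_coord, dim)     # left   edge of the box
        self.y0 = nn.Embedding(max_coord, dim)     # top    edge
        self.x1 = nn.Embedding(max_coord, dim)     # right  edge
        self.y1 = nn.Embedding(max_coord, dim)     # bottom edge

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq, 4) integer coordinates scaled to [0, max_coord)
        return (self.word(token_ids)
                + self.x0(boxes[..., 0]) + self.y0(boxes[..., 1])
                + self.x1(boxes[..., 2]) + self.y1(boxes[..., 3]))

emb = LayoutAwareEmbedding()
tokens = torch.randint(0, 30522, (1, 5))
boxes = torch.randint(0, 1024, (1, 5, 4))
print(emb(tokens, boxes).shape)  # torch.Size([1, 5, 768])
```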

58 citations

Posted Content
TL;DR: This paper introduces a meta-evaluation framework for evaluating factual consistency metrics and experiments with nine recent factuality metrics using synthetic and human-labeled factuality data from short news, long news and dialogue summarization domains.
Abstract: Text generation models can generate factually inconsistent text containing distorted or fabricated facts about the source text. Recent work has focused on building evaluation models to verify the factual correctness of semantically constrained text generation tasks such as document summarization. While the field of factuality evaluation is growing fast, we don't have well-defined criteria for measuring the effectiveness, generalizability, reliability, or sensitivity of the factuality metrics. Focusing on these aspects, in this paper, we introduce a meta-evaluation framework for evaluating factual consistency metrics. We introduce five necessary, common-sense conditions for effective factuality metrics and experiment with nine recent factuality metrics using synthetic and human-labeled factuality data from short news, long news, and dialogue summarization domains. Our framework enables assessing the efficacy of any new factual consistency metric on a variety of dimensions over multiple summarization domains and can be easily extended with new meta-evaluation criteria. We also present our conclusions toward standardizing factuality evaluation metrics.
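
One such meta-evaluation check can be sketched directly; the separation condition below is an illustrative assumption in the spirit of the paper's "necessary, common-sense conditions", not a condition quoted from it.

```python
# Sketch of a single meta-evaluation condition: a sane factuality metric
# should, on average, score human-labeled consistent summaries above
# inconsistent ones. The condition and data here are illustrative assumptions.
import numpy as np

def passes_separation_check(metric_scores, labels) -> bool:
    """metric_scores: one score per summary; labels: 1 = consistent, 0 = not."""
    scores = np.asarray(metric_scores, dtype=float)
    labels = np.asarray(labels)
    return scores[labels == 1].mean() > scores[labels == 0].mean()

# Toy human-labeled factuality data for one candidate metric.
scores = [0.91, 0.85, 0.40, 0.35, 0.78, 0.20]
labels = [1,    1,    0,    0,    1,    0]
print("separation condition satisfied:", passes_separation_check(scores, labels))
```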

57 citations

Trending Questions
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.