Journal Article

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
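As a concrete illustration of the text-to-text framing, here is a minimal sketch using the Hugging Face transformers port of T5 (not code from the paper): every task is handled by the same model, the task is identified only by a textual prefix on the input, and the answer is decoded as a string. The prefixes follow the conventions described in the paper; the "t5-small" checkpoint is simply the smallest public release.

```python
# Minimal sketch: one model, many tasks, all expressed as string-to-string.
# Assumes the Hugging Face `transformers` library and the public "t5-small" checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    # Translation: the task is named by the prefix, the target is decoded as text.
    "translate English to German: The house is wonderful.",
    # Summarization uses the same interface with a different prefix.
    "summarize: Transfer learning, where a model is first pre-trained on a "
    "data-rich task before being fine-tuned on a downstream task, has emerged "
    "as a powerful technique in natural language processing.",
    # Even classification (CoLA acceptability) is cast as generating a label string.
    "cola sentence: The course is jumping well.",
]

for text in examples:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```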


Citations
Proceedings Article
Amy Pu, Hyung Won Chung, Ankur P. Parikh, Sebastian Gehrmann, Thibault Sellam
01 Nov 2021
TL;DR: This paper investigated the tradeoff between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model, using data from the WMT Metrics Shared Task.
Abstract: Recent developments in machine translation and multilingual text generation have led researchers to adopt trained metrics such as COMET or BLEURT, which treat evaluation as a regression problem and use representations from multilingual pre-trained models such as XLM-RoBERTa or mBERT. Yet studies on related tasks suggest that these models are most efficient when they are large, which is costly and impractical for evaluation. We investigate the trade-off between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model, using data from the WMT Metrics Shared Task. We present a series of experiments which show that model size is indeed a bottleneck for cross-lingual transfer, then demonstrate how distillation can help address this bottleneck by leveraging synthetic data generation and transferring knowledge from one teacher to multiple students trained on related languages. Our method yields up to 10.5% improvement over vanilla fine-tuning and reaches 92.6% of RemBERT's performance using only a third of its parameters.
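For readers unfamiliar with distilling a learned metric, the sketch below shows the core idea in PyTorch: a small student is trained to regress onto the scalar quality scores of a frozen teacher on (possibly synthetic) examples. The `TinyScorer` class and the random batch are illustrative placeholders, not the models or data used in the paper.

```python
# Hypothetical sketch of score distillation for a learned evaluation metric:
# a small student regresses onto the scalar scores of a large, frozen teacher.
import torch
import torch.nn as nn

class TinyScorer(nn.Module):
    """Toy stand-in for a small multilingual encoder with a regression head."""
    def __init__(self, vocab_size=32000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids):
        return self.head(self.embed(token_ids)).squeeze(-1)

def distill_step(student, optimizer, token_ids, teacher_scores):
    """One optimisation step: match the teacher's scores with an MSE loss."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(student(token_ids), teacher_scores)
    loss.backward()
    optimizer.step()
    return loss.item()

student = TinyScorer()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
batch = torch.randint(0, 32000, (8, 64))   # fake token ids for 8 examples
teacher_scores = torch.rand(8)             # scores produced by a frozen teacher
print(distill_step(student, opt, batch, teacher_scores))
```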
Posted Content
TL;DR: This paper provides some of the first documentation for the Colossal Clean Crawled Corpus (C4), a dataset created by applying a set of filters to a single snapshot of Common Crawl, and evaluates the text those filters removed in order to understand their impact.
Abstract: Large language models have led to remarkable progress on many NLP tasks, and researchers are turning to ever-larger text corpora to train them. Some of the largest corpora available are made by scraping significant portions of the internet, and are frequently introduced with only minimal documentation. In this work we provide some of the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl. We begin by investigating where the data came from, and find a significant amount of text from unexpected sources like patents and US military websites. Then we explore the content of the text itself, and find machine-generated text (e.g., from machine translation systems) and evaluation examples from other benchmark NLP datasets. To understand the impact of the filters applied to create this dataset, we evaluate the text that was removed, and show that blocklist filtering disproportionately removes text from and about minority individuals. Finally, we conclude with some recommendations for how to create and document web-scale datasets from a scrape of the internet.
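To make the kind of filtering under discussion concrete, here is a rough sketch of heuristics in the spirit of those described for C4: keep only lines ending in terminal punctuation, drop very short lines, drop pages with too few sentences, and drop pages that hit a word blocklist. The thresholds and blocklist below are placeholders rather than the exact C4 configuration, and the real pipeline also applies further steps (language identification, deduplication) not shown here.

```python
# Illustrative sketch of C4-style heuristic page filtering.
# Thresholds and the blocklist are placeholders, not the exact C4 settings.
import re

BLOCKLIST = {"placeholder_bad_word"}   # stand-in for the "bad words" blocklist

def clean_page(text, min_sentences=3, min_words_per_line=5):
    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith((".", "!", "?", '"')):
            continue                       # keep only terminal-punctuation lines
        if len(line.split()) < min_words_per_line:
            continue                       # drop very short lines
        if any(w.lower() in BLOCKLIST for w in line.split()):
            return None                    # blocklist hit: drop the whole page
        kept_lines.append(line)
    cleaned = "\n".join(kept_lines)
    if len([s for s in re.split(r"[.!?]", cleaned) if s.strip()]) < min_sentences:
        return None                        # drop pages with too few sentences
    return cleaned

page = ("Buy now!\n"
        "This is a normal sentence about patents.\n"
        "Another full sentence appears here.\n"
        "And a third complete sentence follows.")
print(clean_page(page))
```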
Posted Content
TL;DR: This paper presents a new Massive Open Online Course (MOOC) on Natural Language Processing targeted at non-English speaking students; each of the 12 weeks consists of lectures, practical sessions, and quiz assignments, and three of the weeks are followed by Kaggle-style coding assignments.
Abstract: This paper presents a new Massive Open Online Course on Natural Language Processing, targeted at non-English speaking students. The course lasts 12 weeks; every week consists of lectures, practical sessions, and quiz assignments. Three weeks out of 12 are followed by Kaggle-style coding assignments. Our course intends to serve multiple purposes: (i) familiarize students with the core concepts and methods in NLP, such as language modeling or word and sentence representations; (ii) show that recent advances, including pre-trained Transformer-based models, are built upon these concepts; (iii) introduce architectures for the most in-demand real-life applications; (iv) develop practical skills to process texts in multiple languages. The course was prepared and recorded during 2020, launched by the end of the year, and by early 2021 had received positive feedback.
Posted Content
TL;DR: This paper proposed MERCURIE, an interactive system that refines its explanations for a given reasoning task by getting human feedback in natural language, and showed that appending the corrected explanation structures to the output yields a gain of 1.2 accuracy points on defeasible reasoning across all three domains.
Abstract: A class of explainable NLP models for reasoning tasks support their decisions by generating free-form or structured explanations, but what happens when these supporting structures contain errors? Our goal is to allow users to interactively correct explanation structures through natural language feedback. We introduce MERCURIE - an interactive system that refines its explanations for a given reasoning task by getting human feedback in natural language. Our approach generates graphs that have 40% fewer inconsistencies compared with the off-the-shelf system. Further, simply appending the corrected explanation structures to the output leads to a gain of 1.2 points in accuracy on defeasible reasoning across all three domains. We release a dataset of over 450k graphs for defeasible reasoning generated by our system at this https URL.
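As a rough illustration of the interaction pattern described above (not the authors' implementation), the loop below generates an explanation structure, collects free-form feedback, and asks the model to refine it; `generate_graph` and `refine_graph` are hypothetical callables standing in for the underlying model calls.

```python
# Highly simplified sketch of an interactive refine-with-feedback loop in the
# spirit of the system described above. `generate_graph` and `refine_graph`
# are hypothetical callables (e.g. wrappers around a text-to-text model),
# not the authors' actual interfaces.
def interactive_refinement(task, generate_graph, refine_graph, max_rounds=3):
    graph = generate_graph(task)                     # initial explanation structure
    for _ in range(max_rounds):
        feedback = input("Feedback on the explanation (empty to accept): ")
        if not feedback.strip():
            break                                    # user accepts the explanation
        graph = refine_graph(task, graph, feedback)  # fold the feedback back in
    return graph
```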
Posted Content
TL;DR: This work develops a conversational search system that tracks the conversational context with neural query-rewriting models and uses a Transformer-based re-ranking method, extended with the conversational history, to return the most relevant answers.
Abstract: The use of conversational assistants to search for information is becoming increasingly popular among the general public, pushing research towards more advanced and sophisticated techniques. In the last few years in particular, interest in conversational search has been increasing, not only because of the generalization of conversational assistants but also because conversational search is a step forward in allowing a more natural interaction with the system. In this work, the focus is on exploring the context present in the conversation, via the historical utterances and their respective embeddings, with the aim of developing a conversational search system that helps people search for information in a natural way. In particular, this system must be able to understand the context in which a question is posed, tracking the current state of the conversation and detecting mentions of previous questions and answers. We achieve this by using a context-tracking component based on neural query-rewriting models. Another crucial aspect of the system is to provide the most relevant answers given the question and the conversational history. To achieve this objective, we used a Transformer-based re-ranking method and expanded this architecture to use the conversational context. The results obtained with the developed system show the advantages of using the context present in the natural language utterances and in the neural embeddings generated throughout the conversation.
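The two-stage pattern the abstract describes can be sketched as follows: first rewrite the current question into a self-contained query using the conversation history, then score candidate answers with a Transformer cross-encoder. The checkpoints named below (castorini/t5-base-canard for query rewriting and cross-encoder/ms-marco-MiniLM-L-6-v2 for re-ranking) are public models chosen for illustration; they are an assumption, not the models used in the cited work.

```python
# Hedged sketch of conversational query rewriting + Transformer re-ranking.
# Assumes the `transformers` and `sentence-transformers` libraries and two
# public checkpoints chosen only for illustration.
from transformers import pipeline
from sentence_transformers import CrossEncoder

rewriter = pipeline("text2text-generation", model="castorini/t5-base-canard")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

history = ["Who introduced the T5 model?", "Raffel et al., in 2020."]
question = "What corpus did they release with it?"

# 1) Context tracking: rewrite the question into a self-contained query.
rewrite_input = " ||| ".join(history + [question])
query = rewriter(rewrite_input, max_length=64)[0]["generated_text"]

# 2) Re-ranking: score (query, passage) pairs and keep the best answer.
passages = [
    "C4, the Colossal Clean Crawled Corpus, was released together with T5.",
    "BERT was trained on BooksCorpus and English Wikipedia.",
]
scores = reranker.predict([(query, p) for p in passages])
best = max(zip(passages, scores), key=lambda pair: pair[1])
print(query)
print(best[0])
```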
Trending Questions (1)
What are the limitations of transfer learning with a unified text-to-text transformer?

The paper does not mention the limitations of transfer learning with a unified text-to-text transformer.