Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Home
/
Papers
/
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Journal Article•

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu - Show less +5 more

01 Jan 2020-Journal of Machine Learning Research-Vol. 21, Iss: 140, pp 1-67

TL;DR: This article introduced a unified framework that converts all text-based language problems into a text-to-text format and compared pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks.

read less

Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•

Bandits Don’t Follow Rules: Balancing Multi-Facet Machine Translation with Multi-Armed Bandits

[...]

Julia Kreutzer¹, David Vilar², Artem Sokolov¹•Institutions (2)

Google¹, Amazon.com²

01 Nov 2021

TL;DR: The authors propose a multi-armed bandit to dynamically choose between different facets in a way that is most beneficial for the MT system and evaluate it on three different multi-facet applications: balancing translationese and natural training data.

...read moreread less

Abstract: Training data for machine translation (MT) is often sourced from a multitude of large corpora that are multi-faceted in nature, e.g. containing contents from multiple domains or different levels of quality or complexity. Naturally, these facets do not occur with equal frequency, nor are they equally important for the test scenario at hand. In this work, we propose to optimize this balance jointly with MT model parameters to relieve system developers from manual schedule design. A multi-armed bandit is trained to dynamically choose between facets in a way that is most beneficial for the MT system. We evaluate it on three different multi-facet applications: balancing translationese and natural training data, or data from multiple domains or multiple language pairs. We find that bandit learning leads to competitive MT systems across tasks, and our analysis provides insights into its learned strategies and the underlying data sets.

...read moreread less

Proceedings Article•

ForumSum: A Multi-Speaker Conversation Summarization Dataset.

[...]

Misha Khalman, Yao Zhao¹, Mohammad Saleh¹•Institutions (1)

Google¹

01 Nov 2021

TL;DR: The authors collected ForumSum, a diverse and high-quality conversation summarization dataset with human written summaries, and used a conversational corpus for pre-training to improve the quality of the chat summarization model.

...read moreread less

Abstract: ive summarization quality had large improvements since recent language pretraining techniques. However, currently there is a lack of datasets for the growing needs of conversation summarization applications. Thus we collected ForumSum, a diverse and high-quality conversation summarization dataset with human written summaries. The conversations in ForumSum dataset are collected from a wide variety of internet forums. To make the dataset easily expandable, we also release the process of dataset creation. Our experiments show that models trained on ForumSum have better zero-shot and few-shot transferability to other datasets than the existing large chat summarization dataset SAMSum. We also show that using a conversational corpus for pre-training improves the quality of the chat summarization model.

...read moreread less

Proceedings Article•

ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

[...]

Xuan Ouyang¹, Wang Shuohuan¹, Pang Chao¹, Yu Sun¹, Hao Tian¹, Hua Wu¹, Haifeng Wang¹ - Show less +3 more•Institutions (1)

Baidu¹

31 Dec 2020

TL;DR: This article proposed Ernie-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance.

...read moreread less

Abstract: Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for low-resource languages. In this paper, we propose Ernie-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that Ernie-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks. The codes and pre-trained models will be made publicly available.

...read moreread less

Posted Content•

Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing

[...]

Ganesh Jawahar, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

18 May 2021-arXiv: Computation and Language

TL;DR: This article proposed a curriculum learning approach where they first finetune the language models on synthetic data and then on gold code-mixed data, and showed that their synthetic code-mixing method is competitive with (and in some cases, even superior to) several standard methods under a diverse set of conditions.

...read moreread less

Abstract: We describe models focused at the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task finding both to work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations that we exploit for improving language model performance. In particular, armed with this additional data, we adopt a curriculum learning approach where we first finetune the language models on synthetic data then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases is even superior to) several standard methods (backtranslation, method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves best translation performance (12.67 BLEU). Our models place first in the overall ranking of the English-Hinglish official shared task.

...read moreread less

Posted Content•

Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models

[...]

Steven Y. Feng¹, Kevin Lu¹, Zhuofu Tao¹, Malihe Alikhani¹, Teruko Mitamura², Eduard Hovy³, Varun Gangal⁴ - Show less +3 more•Institutions (4)

Carnegie Mellon University¹, University of Waterloo², University of California, Los Angeles³, University of Pittsburgh⁴

08 Sep 2021-arXiv: Computation and Language

TL;DR: VisCTG as mentioned in this paper uses multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation, which significantly improves model performance while successfully addressing several issues of the baseline generations.

...read moreread less

Abstract: We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.

...read moreread less