mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

doi:10.18653/V1/2021.NAACL-MAIN.41

Citations

PDF

Open Access

More filters

Book Chapter•DOI•

Overview of JOKER@CLEF 2022: Automatic Wordplay and Humour Translation Workshop

[...]

L. Ermakova, Tristan Miller, Fabio Regattin, A-G. Bosser, Claudine Borg, Elise Mathurin, Gaelle Le Corre, Sílvia Araújo, Radia Hannachi, Julien Boccou, Albin Digue, Aurianne Damoy, Benoît Jeanjean - Show less +9 more

01 Jan 2022

13 citations

Journal Article•DOI•

Exploring the Benefits of Training Expert Language Models over Instruction Tuning

[...]

Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, Min Joon Seo - Show less +4 more

07 Feb 2023-arXiv.org

TL;DR: The authors showed that fine-tuning a language model on a single task can outperform a multi-task-prompted language model trained on 300+ different tasks by a mean accuracy of 3.20% and 1.29%.

...read moreread less

Abstract: Recently, Language Models (LMs) instruction-tuned on multiple tasks, also known as multitask-prompted fine-tuning (MT), have shown the capability to generalize to unseen tasks. Previous work has shown that scaling the number of training tasks is the key component in making stronger MT LMs. In this work, we report an unexpected finding that an expert LM fine-tuned on just a single task can outperform an MT LM trained with 300+ different tasks on 11 different unseen datasets and on 13 datasets of the BIG-bench benchmark by a mean accuracy of 3.20% and 1.29%, respectively. This finding casts doubt on the previously held belief that simply scaling the number of tasks makes stronger MT LMs. Leveraging this finding, we further show that this distributed approach of training a separate expert LM per training task instead of a single MT LM for zero-shot inference possesses many benefits including (1) avoiding negative task transfer that often occurs during instruction tuning, (2) being able to continually learn new tasks without having to re-train on previous tasks to avoid catastrophic forgetting, and (3) showing compositional capabilities when merging individual experts together. The code is available at https://github.com/joeljang/ELM.

...read moreread less

12 citations

Journal Article•DOI•

Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models

[...]

Najoung Kim, Tal Linzen, Paul Smolensky

21 Dec 2022-arXiv.org

TL;DR: This article showed that exposure to pretraining data may break the distributional control of pre-trained models and lead to a lower generalization performance in the COGS benchmark. But the performance degradation is more extreme with novel embeddings, and the degradation increases with the amount of pretraining training data, highlighting an interesting case of inverse scaling.

...read moreread less

Abstract: Human linguistic capacity is often characterized by compositionality and the generalization it enables -- human learners can produce and comprehend novel complex expressions by composing known parts. Several benchmarks exploit distributional control across training and test to gauge compositional generalization, where certain lexical items only occur in limited contexts during training. While recent work using these benchmarks suggests that pretrained models achieve impressive generalization performance, we argue that exposure to pretraining data may break the aforementioned distributional control. Using the COGS benchmark of Kim and Linzen (2020), we test two modified evaluation setups that control for this issue: (1) substituting context-controlled lexical items with novel character sequences, and (2) substituting them with special tokens represented by novel embeddings. We find that both of these setups lead to lower generalization performance in T5 (Raffel et al., 2020), suggesting that previously reported results have been overestimated due to uncontrolled lexical exposure during pretraining. The performance degradation is more extreme with novel embeddings, and the degradation increases with the amount of pretraining data, highlighting an interesting case of inverse scaling.

...read moreread less

12 citations

Proceedings Article•DOI•

Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning

[...]

Bill Yuchen Lin¹, Seyeon Lee¹, Xiaoyang Qiao, Xiang Ren¹•Institutions (1)

University of Southern California¹

01 Aug 2021

TL;DR: This paper proposed a multilingual contrastive pretraining (MCP) method to improve the performance of ML-LMs for cross-lingual commonsense reasoning, which significantly enhances sentence representations.

...read moreread less

Abstract: Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-general probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we also create two new datasets, X-CSQA and X-CODAH, by translating their English versions to 14 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve the performance beyond English, we propose a simple yet effective method — multilingual contrastive pretraining (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks (e.g., +2.7% accuracy for X-CSQA over XLM-R_L).

...read moreread less

12 citations

Book Chapter•DOI•

Automatic Simplification of Scientific Texts: SimpleText Lab at CLEF-2022

[...]

L. Ermakova, Patrice Bellot, Jaap Kamps, Diana Nurbakova, Irina Ovchinnikova, Eric SanJuan, Elise Mathurin, Sílvia Araújo, Radia Hannachi, Stéphane Huet, Nicolas Poinsu - Show less +7 more

01 Jan 2022

TL;DR: The CLEF 2022 SimpleText track as discussed by the authors addresses the challenges of text simplification approaches in the context of promoting scientific information access, by providing appropriate data and benchmarks, and creating a community of NLP and IR researchers working together to resolve one of the greatest challenges of today.

...read moreread less

Abstract: The Web and social media have become the main source of information for citizens, with the risk that users rely on shallow information in sources prioritizing commercial or political incentives rather than the correctness and informational value. Non-experts tend to avoid scientific literature due to its complex language or their lack of prior background knowledge. Text simplification promises to remove some of these barriers. The CLEF 2022 SimpleText track addresses the challenges of text simplification approaches in the context of promoting scientific information access, by providing appropriate data and benchmarks, and creating a community of NLP and IR researchers working together to resolve one of the greatest challenges of today. The track will use a corpus of scientific literature abstracts and popular science requests. It features three tasks. First, content selection (what is in, or out?) challenges systems to select passages to include in a simplified summary in response to a query. Second, complexity spotting (what is unclear?) given a passage and a query, aims to rank terms/concepts that are required to be explained for understanding this passage (definitions, context, applications). Third, text simplification (rewrite this!) given a query, asks to simplify passages from scientific abstracts while preserving the main content.

...read moreread less

12 citations

Collapse

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Citations

References

"mT5: A Massively Multilingual Pre-t..." refers methods in this paper

"mT5: A Massively Multilingual Pre-t..." refers background or methods in this paper

Related Papers (5)

Trending Questions (3)