mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

doi:10.18653/V1/2021.NAACL-MAIN.41

Proceedings Article

What Language Model to Train if You Have One Million GPU Hours?

Teven Le Scao, Thomas Wang, Daniel Hesslow, Lucile Saulnier, Stas Bekman, Saiful Bari, Stella Biderman, Hady Elsahar, Niklas Muennighoff, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang A. Sutawika, Jae-Oong Tae, Zheng-Xin Yong, Julien Launay, Iz Beltagy

27 Oct 2022

TL;DR: An ablation study at the billion-parameter scale compares different modeling practices and their impact on zero-shot generalization, and the performance of a multilingual model is compared with that of an English-only one.

Abstract: The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM--the Big Science Large Open-science Open-access Multilingual language model--our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience .

35 citations

Posted Content

ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic.

Muhammad Abdul-Mageed¹, AbdelRahim A. Elmadany¹, El Moatez Billah Nagoudi¹

University of British Columbia¹

27 Dec 2020-arXiv: Computation and Language

TL;DR: The authors introduce two deep bidirectional Transformer-based models, ARBERT and MARBERT, for a collection of diverse Arabic varieties, together with ARLUE, a benchmark for multi-dialectal Arabic language understanding evaluation. When fine-tuned on ARLUE, the models collectively achieve state-of-the-art results on the majority of tasks (37 out of 48 classification tasks, on the 42 datasets).

Abstract: Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large (~ 3.4 x larger size). Our models are publicly available at this https URL and ARLUE will be released through the same repository.

35 citations

Journal Article

AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

Saleh Soltan, Shankar Ananthakrishnan, John G. Fitzgerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andrew Rosenbaum, Anna Rumshisky, Chandan Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, Prem Natarajan

02 Aug 2022-arXiv.org

TL;DR: It is demonstrated that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks.

Abstract: In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.

33 citations

Proceedings Article

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

A-C. Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara E. Rivera, Ankur Bapna

25 May 2022

TL;DR: This paper provides baselines for the tasks based on multilingual pre-trained models like mSLAM, and introduces FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark.

Abstract: We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), and Speech-Text Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like speech-only w2v-BERT [1] and speech-text multimodal mSLAM [2]. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.

33 citations

Proceedings Article

UL2: Unifying Language Learning Paradigms

Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Loh Seong Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, T. Schuster, Huaixiu Zheng, Denny Zhou, Neil Houlsby, Donald Metzler

10 May 2022

TL;DR: By scaling the model up to 20B parameters, this paper achieves SOTA performance on 50 well-established supervised NLP tasks ranging from language generation, language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval.

Abstract: Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised finetuning based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive to FLAN-PaLM 62B. We release Flax-based T5X checkpoints for the UL2 20B and Flan-UL2 20B.

33 citations
