mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
pp. 483-498
TLDR
This paper proposes mT5, a multilingual variant of T5 pre-trained on a new Common Crawl-based dataset covering 101 languages, and demonstrates state-of-the-art performance on many multilingual benchmarks.
Abstract
The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.
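Since the abstract notes that the code and checkpoints are public, the short sketch below illustrates the text-to-text usage pattern in practice. It assumes the Hugging Face Transformers port of the checkpoints (model name "google/mt5-small") rather than the paper's own T5 codebase, and the prompt string is purely illustrative.

# Minimal sketch, not the paper's own code: load a public mT5 checkpoint via the
# Hugging Face Transformers port (an assumption; the official release uses the T5
# codebase) and run one text-to-text generation step.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_name = "google/mt5-small"  # assumed Hub name of the smallest public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Every task is framed as text in, text out: the prompt below is illustrative, and
# the raw pre-trained model (trained only on span corruption) generally needs
# fine-tuning before its outputs are meaningful for a task such as summarization.
inputs = tokenizer(
    "summarize: mT5 is a multilingual variant of T5 covering 101 languages.",
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))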
Citations
Journal Article
mSLAM: Massively multilingual joint pre-training for speech and text
Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin George Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, A-C. Conneau
TL;DR: mSLAM is evaluated on several downstream speech understanding tasks; joint pre-training with text improves quality on speech translation, speech intent classification, and speech language identification, while remaining competitive on multilingual ASR compared against speech-only pre-training.
Proceedings Article
Language Models are Multilingual Chain-of-Thought Reasoners
Haoyue Shi, Mirac M. Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Loh Seong Wei
TL;DR: It is shown that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili.
Proceedings Article
ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic
TL;DR: The authors introduced two powerful deep bidirectional transformer-based models, ARBERT and MARBERT, for multi-dialectal Arabic language understanding evaluation, which achieved state-of-the-art results on the majority of tasks (37 out of 48 classification tasks across 42 datasets).
Proceedings Article
Designing Effective Sparse Expert Models
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeffrey Dean, Noam Shazeer, William Fedus
TL;DR: This paper proposed a stable and transferable sparse Mixture-of-Experts model (MoE-32B) with 269B parameters and a computational cost comparable to that of a 32B dense encoder-decoder Transformer.
Posted Content
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics.
Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur P. Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou
TL;DR: GEM, as discussed by the authors, is a living benchmark for natural language generation (NLG), its evaluation, and metrics; it provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested.
References
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposed the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is found that BERT was significantly undertrained and, with better training, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Proceedings Article
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TL;DR: The Stanford Question Answering Dataset (SQuAD) as mentioned in this paper is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
Proceedings Article
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and demonstrates for the first time the possibility of multilingual modeling without sacrificing per-language performance.
Proceedings Article
Universal Language Model Fine-tuning for Text Classification
Jeremy Howard, Sebastian Ruder
TL;DR: Universal Language Model Fine-tuning (ULMFiT), as mentioned in this paper, is an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.