mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
- pp. 483–498
TLDR
This paper proposes mT5, a multilingual variant of T5 pre-trained on a new Common Crawl-based dataset covering 101 languages, which achieves state-of-the-art performance on many multilingual benchmarks.
Abstract
The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.
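Because the abstract notes that the model checkpoints are publicly available, a minimal usage sketch of mT5's text-to-text interface is given below. It assumes the Hugging Face transformers library and the google/mt5-small checkpoint name; neither is specified in the abstract, so treat both as assumptions.

```python
# Minimal sketch of mT5's text-to-text interface. Assumes the Hugging Face
# `transformers` library and the `google/mt5-small` checkpoint (assumptions;
# the abstract only says the checkpoints are publicly available).
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Every task is text in, text out. Here we feed a span-corruption-style input
# with a sentinel token, mirroring the T5-style pre-training objective.
inputs = tokenizer(
    "UN Offizier sagt, dass weiter <extra_id_0> werden muss.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As with T5, the raw pre-trained checkpoint is intended to be fine-tuned on a downstream task before use; out of the box it only completes span-corruption-style inputs.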
Citations
Proceedings Article
The Geometry of Multilingual Language Model Representations
TL;DR: The results suggest that multilingual language models encode features by projecting representations onto orthogonal axes in the representation space, enabling the simultaneous encoding of a wide variety of signals for downstream tasks and multilingual learning.
Journal Article
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
Bonaventure F. P. Dossou, A. Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris Chinenye Emezue
TL;DR: This paper proposed a self-active learning framework for pre-training multilingual language models on 23 African languages, achieving good performance on downstream NLP tasks (NER, text classification, and sentiment analysis).
Proceedings Article
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
John G. Fitzgerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide Bernardi, Abhishek Bhagia, Claudio Delli Bovi, Jin Cao, Rakesh Chada, Amit Chauhan, Luoxin Chen, Anurag Dwarakanath, Satya Vart Dwivedi, Turan Gojayev, Karthik Gopalakrishnan, Thomas Gueudré, Dilek Hakkani-Tur, Wael Hamza, Jonathan Hueser, Kevin Martin Jose, Haidar Khan, Bei Liu, Jianhua Lu, A. Manzotti, Pradeep Natarajan, Karolina Owczarzak, Gokmen Oz, Enrico Palumbo, Charith Peris, Chandan Prakash, Stephen Rawls, Andrew Rosenbaum, Anjali Shenoy, Saleh Soltan, Mukund Sridhar, Lizhen Tan, Fabian Triefenbach, Pan Wei, Haiyang Yu, Shuai Zheng, Gokhan Tur, Prem Natarajan
TL;DR: This paper presents results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models, and the application of these models to the Natural Language Understanding (NLU) component of a virtual assistant system.
Proceedings Article
Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment
TL;DR: XLM-Align introduces denoising word alignment as a new cross-lingual pre-training task: the model first self-labels word alignments for parallel sentences and then randomly masks tokens in a bitext pair.
Proceedings Article
Towards Making the Most of Cross-Lingual Transfer for Zero-Shot Neural Machine Translation
TL;DR: SixT+ is a strong many-to-English NMT model that supports 100 source languages but is trained with parallel data from only six source languages; it also offers a set of model parameters that can be further fine-tuned for other unsupervised tasks.
References
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposed the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art results on English-to-German and English-to-French machine translation.
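To make "based solely on attention mechanisms" concrete, here is a short NumPy sketch of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V; the function name and example shapes are illustrative, not drawn from the authors' code.

```python
# Illustrative NumPy sketch of scaled dot-product attention as defined in
# "Attention is All you Need": Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
# Names and shapes are illustrative, not from the paper's released code.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_queries, n_keys)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # (n_queries, d_v)

# Tiny usage example with random inputs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 8)
```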
Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: This work finds that BERT was significantly undertrained and that, with improved training, it can match or exceed the performance of every model published after it; the best configuration achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Proceedings Article
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TL;DR: The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
Proceedings Article
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: This work shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and demonstrates, for the first time, the possibility of multilingual modeling without sacrificing per-language performance.
Proceedings Article
Universal Language Model Fine-tuning for Text Classification
Jeremy Howard, Sebastian Ruder
TL;DR: Universal Language Model Fine-tuning (ULMFiT) is an effective transfer learning method that can be applied to any task in NLP; the paper also introduces techniques that are key for fine-tuning a language model.