mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

doi:10.18653/V1/2021.NAACL-MAIN.41

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

A New Generation of Perspective API: Efficient Multilingual Character-level Transformers

[...]

Alyssa Whitlock Lees, Vinh Q. Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, Lucy Vasserman - Show less +3 more

22 Feb 2022

TL;DR: This paper presents the fundamentals behind the next version of the Perspective API from Google Jigsaw, and presents a single multilingual token-free Charformer model that is applicable across a range of languages, domains, and tasks.

...read moreread less

Abstract: On the world wide web, toxic content detectors are a crucial line of defense against potentially hateful and offensive messages. As such, building highly effective classifiers that enable a safer internet is an important research area. Moreover, the web is a highly multilingual, cross-cultural community that develops its own lingo over time. As such, it is crucial to develop models that are effective across a diverse range of languages, usages, and styles. In this paper, we present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model that is applicable across a range of languages, domains, and tasks. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings. We additionally outline the techniques employed to make such a byte-level model efficient and feasible for productionization. Through extensive experiments on multilingual toxic comment classification benchmarks derived from real API traffic and evaluation on an array of code-switching, covert toxicity, emoji-based hate, human-readable obfuscation, distribution shift, and bias evaluation settings, we show that our proposed approach outperforms strong baselines. Finally, we present our findings from deploying this system in production.

...read moreread less

26 citations

Posted Content•

What Makes Good In-Context Examples for GPT-3?

[...]

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin¹, Weizhu Chen - Show less +2 more•Institutions (1)

Duke University¹

17 Jan 2021-arXiv: Computation and Language

TL;DR: The authors propose to retrieve examples that are semantically-similar to a test sample to formulate its corresponding prompt, and evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline.

...read moreread less

Abstract: GPT-$3$ has attracted lots of attention due to its superior performance across a wide range of NLP tasks, especially with its powerful and versatile in-context few-shot learning ability. Despite its success, we found that the empirical results of GPT-$3$ depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples (relative to random sampling) that better leverage GPT-$3$'s few-shot capabilities. Inspired by the recent success of leveraging a retrieval module to augment large-scale neural network models, we propose to retrieve examples that are semantically-similar to a test sample to formulate its corresponding prompt. Intuitively, the in-context examples selected with such a strategy may serve as more informative inputs to unleash GPT-$3$'s extensive knowledge. We evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline. Moreover, it is observed that the sentence encoders fine-tuned on task-related datasets yield even more helpful retrieval results. Notably, significant gains are observed on tasks such as table-to-text generation (41.9% on the ToTTo dataset) and open-domain question answering (45.5% on the NQ dataset). We hope our investigation could help understand the behaviors of GPT-$3$ and large-scale pre-trained LMs in general and enhance their few-shot capabilities.

...read moreread less

25 citations

Journal Article•DOI•

<i>Samanantar</i>: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

[...]

01 Jan 2022-Transactions of the Association for Computational Linguistics

TL;DR: Samanantar as mentioned in this paper is the largest publicly available parallel corpora collection for Indic languages, which contains 49.7 million sentence pairs between English and 11 languages (from two language families).

...read moreread less

Abstract: Abstract We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the Web, resulting in a 4× increase. We mine the parallel sentences from the Web by combining many corpora, tools, and methods: (a) Web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validate the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at Samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.

...read moreread less

25 citations

Posted Content•

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

[...]

Isaac Caswell¹, Julia Kreutzer¹, Lisa Wang¹, Ahsan Wahab, Daan van Esch¹, Nasanbayar Ulzii-Orshikh, Allahsera Auguste Tapo, Nishant Subramani², Artem Sokolov¹, Claytone Sikasote³, Monang Setyawan¹, Supheakmungkol Sarin, Sokhar Samb⁴, Benoît Sagot, Clara E. Rivera¹, Annette Rios⁵, Isabel Papadimitriou⁶, Salomey Osei⁷, Pedro Javier Ortiz Suárez⁸, Iroro Orife, Kelechi Ogueji⁹, Rubungo Andre Niyongabo¹⁰, Toan Q. Nguyen¹¹, Mathias Müller⁵, André Müller⁵, Shamsuddeen Hassan Muhammad¹², Nanda Muhammad¹, Ayanda Mnyakeni¹, Jamshidbek Mirzakhalov¹³, Tapiwanashe Matangira¹, Colin Leong, Nze Lawson¹, Sneha Kudugunta¹, Yacine Jernite, Mathias Jenny⁵, Orhan Firat¹, Bonaventure F. P. Dossou¹⁴, Sakhile Dlamini¹, Nisansa de Silva¹⁵, Sakine Çabuk Ballı¹, Stella Biderman, Alessia Battisti⁵, Ahmed Baruwa¹⁶, Ankur Bapna¹, Pallavi Baljekar¹, Israel Abebe Azime⁴, Ayodele Awokoya¹⁷, Duygu Ataman⁵, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal¹⁸, Mofetoluwa Adeyemi - Show less +48 more•Institutions (18)

Google¹, Intel², University of Zambia³, African Institute for Mathematical Sciences⁴, University of Zurich⁵, Stanford University⁶, Kwame Nkrumah University of Science and Technology⁷, University of Paris⁸, University of Waterloo⁹, University of Electronic Science and Technology of China¹⁰, University of Notre Dame¹¹, Bayero University Kano¹², University of South Florida¹³, Jacobs University Bremen¹⁴, University of Moratuwa¹⁵, Obafemi Awolowo University¹⁶, University of Ibadan¹⁷, University of Maryland, Baltimore¹⁸

23 Mar 2021-arXiv: Computation and Language

TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).

...read moreread less

Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.

...read moreread less

24 citations

Proceedings Article•

Few-shot Learning with Multilingual Generative Language Models

[...]

TL;DR: The authors train multilingual generative language models on a corpus covering a diverse set of languages, and study their few-and zero-shot learning capabilities in a wide range of tasks, including commonsense reasoning and natural language inference.

...read moreread less

Abstract: Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We conduct an in-depth analysis of different multilingual prompting approaches, showing in particular that strong few-shot learning performance across languages can be achieved via cross-lingual transfer through both templates and demonstration examples.

...read moreread less

24 citations

Collapse

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Citations

References

"mT5: A Massively Multilingual Pre-t..." refers methods in this paper

"mT5: A Massively Multilingual Pre-t..." refers background or methods in this paper

Related Papers (5)

Trending Questions (3)