Proceedings ArticleDOI

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

01 Jun 2021, pp. 483-498
TL;DR: This paper proposes mT5, a multilingual variant of T5 pre-trained on a new Common Crawl-based dataset covering 101 languages, and demonstrates state-of-the-art performance on many multilingual benchmarks.
Abstract: The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.
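
The released checkpoints can also be run through third-party tooling. The sketch below uses the Hugging Face Transformers port rather than the authors' original codebase, and the "google/mt5-small" checkpoint name and the toy prompt are assumptions for illustration:

    # Minimal sketch: load an mT5 checkpoint and generate text in the text-to-text
    # format (requires the transformers and sentencepiece packages).
    from transformers import MT5ForConditionalGeneration, T5Tokenizer

    model_name = "google/mt5-small"  # smallest public size; larger variants exist
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = MT5ForConditionalGeneration.from_pretrained(model_name)

    # Every task is cast as text in, text out; here a toy summarization-style prompt.
    inputs = tokenizer("summarize: mT5 is a multilingual variant of T5.",
                       return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note that, unlike T5, the released mT5 checkpoints are pre-trained only on the unsupervised mC4 objective, so they generally need fine-tuning on a downstream task before producing useful outputs.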


Citations
Journal Article
TL;DR: PaLM, a 540-billion-parameter, densely activated Transformer language model, achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark.
Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning , which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

1,429 citations

Journal ArticleDOI
TL;DR: BLOOM is a 176B-parameter, open-access, decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).
Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

407 citations

Journal ArticleDOI
14 Apr 2022
TL;DR: The authors introduce GPT-NeoX-20B, a 20-billion-parameter autoregressive language model trained on the Pile, whose weights are made freely and openly available to the public under a permissive license.
Abstract: We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B’s architecture and training, and evaluate its performance. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.
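
Because the weights are openly released, the model can be loaded with standard tooling. The sketch below uses the Hugging Face Transformers port rather than the EleutherAI gpt-neox training codebase linked above, and assumes enough memory for the roughly 40 GB of fp16 weights:

    # Sketch: sample from the openly released GPT-NeoX-20B weights via the
    # Hugging Face Transformers port of the checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")

    inputs = tokenizer("GPT-NeoX-20B is a 20 billion parameter", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
    print(tokenizer.decode(outputs[0]))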

213 citations

Journal ArticleDOI
TL;DR: This survey reviews the recent advances of large language models (LLMs), Transformer models pre-trained over large-scale corpora that show strong capabilities in solving various NLP tasks, covering pre-training, adaptation tuning, utilization, and capacity evaluation.
Abstract: Language is essentially a complex, intricate system of human expressions governed by grammatical rules. Developing capable AI algorithms that can comprehend and master a language poses a significant challenge. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvements, they have further studied the scaling effect by increasing model size to ever larger scales. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models. To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size. Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable milestone is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which may revolutionize the way we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. We also summarize the available resources for developing LLMs and discuss the remaining issues for future directions.

149 citations

Proceedings ArticleDOI
14 Sep 2022
TL;DR: PaLI achieves state-of-the-art results on multiple vision and language tasks, while retaining a simple, modular, and scalable design.
Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

133 citations

References
Proceedings Article
01 Nov 2019
TL;DR: The authors describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages, following the data processing introduced in fastText, which deduplicates documents and identifies their language.
Abstract: Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as their quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.

313 citations


"mT5: A Massively Multilingual Pre-t..." refers methods in this paper

  • ...To improve the pre-training data quality, pages from Common Crawl were filtered by an n-gram language model trained on Wikipedia (Wenzek et al., 2020). mBART (Liu et al., 2020a) is a multilingual encoder-decoder model that is based on BART (Lewis et al., 2020b). mBART is trained with a combination of span masking and sentence shuffling objectives on a subset of 25 languages from the same data as XLM-R....

  • ...The line length filter provides a +2 point boost, corroborating the findings of Conneau et al. (2020) and Raffel et al. (2020) that filtering low-quality pages from Common Crawl is valuable.... [Footnote 9: We use the 2020 Wikipedia data from TensorFlow Datasets, selecting the same languages as mBERT. https://www.tensorflow.org/datasets/catalog/wikipedia]

  • ...We run six ablations, modifying various settings, using our Large model as a baseline: (i) increase dropout to 0.1 in hopes of mitigating overfitting on low-resource languages, (ii) decrease sequence length to 512 (as was used in T5), (iii) increase the average noise span length in the pre-training objective to 10 since we observe fewer characters per token than T5, (iv) adjust the language sampling exponent α to {0.2, 0.7} as used in MMNMT (Arivazhagan et al., 2019) and mBERT (Devlin, 2018), respectively, (v) turn off the “line length filter” in the mC4 data pipeline, and (vi) supplement mC4 with Wikipedia data [9] from 103 languages....

  • ...Many pre-trained versions of XLM have been released; the most massively-multilingual variant was trained on 100 languages from Wikipedia....

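The "line length filter" mentioned in the excerpts above is described in the mT5 paper as a page-level heuristic that keeps a Common Crawl page only if it contains at least three lines of text with 200 or more characters. A minimal sketch of such a filter (the function name and the use of these thresholds as defaults are illustrative, not the actual mC4 pipeline code):

    # Sketch of a page-level line length filter in the spirit of the mC4 pipeline:
    # keep a page only if at least `min_lines` of its lines have `min_chars`+ characters.
    def passes_line_length_filter(page_text: str,
                                  min_lines: int = 3,
                                  min_chars: int = 200) -> bool:
        long_lines = [line for line in page_text.splitlines()
                      if len(line.strip()) >= min_chars]
        return len(long_lines) >= min_lines

    # A page of short boilerplate lines is dropped; a page with three long
    # paragraphs (one per line) is kept.
    assert not passes_line_length_filter("menu\nlogin\ncontact us")
    assert passes_line_length_filter("\n".join(["x" * 250] * 3))

The ablation quoted above reports that this filter provides a +2 point boost, i.e. filtering low-quality Common Crawl pages is valuable.
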
Posted Content
TL;DR: This work sets a milestone by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples, and demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines.
Abstract: We introduce our efforts towards building a universal neural machine translation (NMT) system capable of translating between any language pair. We set a milestone towards this goal by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples. Our system demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines. We provide in-depth analysis of various aspects of model building that are crucial to achieving quality and practicality in universal NMT. While we prototype a high-quality universal translation system, our extensive empirical analysis exposes issues that need to be further addressed, and we suggest directions for future research.

299 citations


"mT5: A Massively Multilingual Pre-t..." refers background or methods in this paper

  • ...Values used by prior work include α = 0.7 for mBERT (Devlin, 2018), α = 0.3 for XLM-R (Conneau et al., 2020), and α = 0.2 for MMNMT (Arivazhagan et al., 2019)....

  • ...Massively multilingual models have been observed to underperform on a given language when compared to a similarly-sized “dedicated” model trained specifically for that language (Arivazhagan et al., 2019)....

  • ...We therefore take the approach used in (Devlin, 2018; Conneau et al., 2019; Arivazhagan et al., 2019) and boost lower-resource languages by sampling examples according to the probability p(L) ∝ |L|^α, where p(L) is the probability of sampling text from a given language during pre-training and |L| is the number of examples in the language....

  • ...We run six ablations, modifying various settings, using our Large model as a baseline: (i) increase dropout to 0.1 in hopes of mitigating overfitting on low-resource languages, (ii) decrease sequence length to 512 (as was used in T5), (iii) increase the average noise span length in the pre-training objective to 10 since we observe fewer characters per token than T5, (iv) adjust the language sampling exponent α to {0.2, 0.7} as used in MMNMT (Arivazhagan et al., 2019) and mBERT (Devlin, 2018), respectively, (v) turn off the “line length filter” in the mC4 data pipeline, and (vi) supplement mC4 with Wikipedia data [9] from 103 languages....

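The sampling scheme quoted above, p(L) ∝ |L|^α, up-weights low-resource languages when α < 1 and reduces to sampling proportionally to corpus size at α = 1 (mT5 uses α = 0.3 as its default). A small illustrative sketch with made-up corpus counts:

    # Sketch: temperature-based language sampling, p(L) proportional to |L|**alpha,
    # where |L| is the number of examples in language L. alpha < 1 boosts
    # low-resource languages relative to their raw share of the corpus.
    def sampling_probs(num_examples: dict, alpha: float = 0.3) -> dict:
        weights = {lang: n ** alpha for lang, n in num_examples.items()}
        total = sum(weights.values())
        return {lang: w / total for lang, w in weights.items()}

    toy_counts = {"en": 1_000_000, "sw": 10_000, "yo": 1_000}  # made-up numbers
    print(sampling_probs(toy_counts, alpha=0.3))  # low-resource shares rise
    print(sampling_probs(toy_counts, alpha=1.0))  # proportional to raw counts
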
Proceedings ArticleDOI
01 Jun 2019
TL;DR: Transfer learning, a set of methods that extend the classical supervised machine learning paradigm by leveraging data from additional domains or tasks to train models with better generalization properties, has significantly improved the state of the art on a wide range of NLP tasks.
Abstract: The classic supervised machine learning paradigm is based on learning in isolation, a single predictive model for a task using a single dataset. This approach requires a large number of training examples and performs best for well-defined and narrow tasks. Transfer learning refers to a set of methods that extend this approach by leveraging data from additional domains or tasks to train a model with better generalization properties. Over the last two years, the field of Natural Language Processing (NLP) has witnessed the emergence of several transfer learning methods and architectures which significantly improved upon the state-of-the-art on a wide range of NLP tasks. These improvements together with the wide availability and ease of integration of these methods are reminiscent of the factors that led to the success of pretrained word embeddings and ImageNet pretraining in computer vision, and indicate that these methods will likely become a common tool in the NLP landscape as well as an important research direction. We will present an overview of modern transfer learning methods in NLP, how models are pre-trained, what information the representations they learn capture, and review examples and case studies on how these models can be integrated and adapted in downstream NLP tasks.

267 citations

Proceedings ArticleDOI
10 Feb 2020
TL;DR: The authors fine-tune pre-trained models to answer questions without access to any external context or knowledge, showing that this approach scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source.
Abstract: It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source when answering questions. To facilitate reproducibility and future work, we release our code and trained models.
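
In the text-to-text setting, this "closed-book" setup simply means the input is the question alone and the target is the answer string, with no retrieved passage at fine-tuning or inference time. A schematic sketch of the example format (the task prefix and the specific question/answer pairs are illustrative assumptions, not the paper's exact preprocessing):

    # Sketch: closed-book QA examples in text-to-text form. The model must produce
    # the answer from knowledge stored in its parameters, since no context is given.
    closed_book_examples = [
        {"input": "trivia question: who wrote the novel Things Fall Apart?",
         "target": "Chinua Achebe"},
        {"input": "trivia question: what is the capital of Kenya?",
         "target": "Nairobi"},
    ]

    # Contrast with open-book / retrieval-based QA, where the input also carries a
    # retrieved context passage for the model to read.
    open_book_example = {
        "input": ("question: what is the capital of Kenya? context: Nairobi is "
                  "the capital and largest city of Kenya."),
        "target": "Nairobi",
    }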

247 citations

Proceedings ArticleDOI
30 Aug 2019
TL;DR: PAWS-X, a new dataset of 23,659 human-translated PAWS evaluation pairs in six typologically distinct languages, shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.
Abstract: Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. We remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. We provide baseline numbers for three models with different capacity to capture non-local context and sentence structure, and using different multilingual training and evaluation regimes. Multilingual BERT fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.

237 citations


"mT5: A Massively Multilingual Pre-t..." refers background in this paper

  • ...We cast all tasks into the text-to-text format, i.e. generating the label text (XNLI and PAWS-X), entity tags and labels (WikiAnn NER), or answer (XQuAD, MLQA, and TyDi QA) directly in a generative fashion....

  • ...On the PAWS-X task, FILTER used translation data from the original task instead....

  • ...7, and 11 languages respectively; the Named Entity Recognition (NER) dataset of WikiAnn (Pan et al., 2017) restricted to the 40 languages from XTREME (Hu et al., 2020), and the PAWS-X (Yang et al., 2019) paraphrase identification dataset with 7 languages....

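The first excerpt above, on casting classification tasks into the text-to-text format, can be made concrete with toy conversion functions; the prefixes and label strings below are illustrative assumptions rather than the paper's exact preprocessing:

    # Sketch: casting XNLI and PAWS-X examples into text-to-text form, so a single
    # generative model emits the label text directly instead of a class index.
    def xnli_to_text(premise: str, hypothesis: str, label: str) -> dict:
        # label is one of "entailment", "neutral", "contradiction"
        return {"input": f"xnli: premise: {premise} hypothesis: {hypothesis}",
                "target": label}

    def pawsx_to_text(sentence1: str, sentence2: str, is_paraphrase: bool) -> dict:
        return {"input": f"paws-x: sentence1: {sentence1} sentence2: {sentence2}",
                "target": "paraphrase" if is_paraphrase else "not_paraphrase"}

    print(xnli_to_text("Der Hund schläft.", "Ein Tier ruht sich aus.", "entailment"))
    print(pawsx_to_text("La casa es roja.", "La casa es azul.", False))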

Trending Questions (2)
isiNdebele text generation under NLP using the mT5 tool

The paper does not specifically mention isiNdebele text generation using the mT5 tool. It introduces mT5, a multilingual variant of T5, and demonstrates its performance on multilingual benchmarks.

A Massively Multilingual Pre-trained Text-to-Text Transformer?

The paper introduces mT5, a multilingual variant of T5, which is a massively multilingual pre-trained text-to-text transformer.