mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
...To improve the pre-training data quality, pages from Common Crawl were filtered by an n-gram language model trained on Wikipedia (Wenzek et al., 2020). mBART (Liu et al., 2020a) is a multilingual encoder-decoder model that is based on BART (Lewis et al., 2020b). mBART is trained with a combination of span masking and sentence shuffling objectives on a subset of 25 languages from the same data as XLM-R....
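The perplexity-based filtering mentioned above (Wenzek et al., 2020) can be sketched roughly as follows. The `lm_perplexity` scorer and the `max_perplexity` threshold are placeholders for illustration, not values or interfaces taken from the paper:

```python
from typing import Callable, Iterable, Iterator

def filter_by_lm_perplexity(
    pages: Iterable[str],
    lm_perplexity: Callable[[str], float],
    max_perplexity: float = 1000.0,
) -> Iterator[str]:
    """Keep Common Crawl pages whose perplexity under a Wikipedia-trained
    n-gram language model is low enough (CCNet-style filtering).

    `lm_perplexity` is a hypothetical scorer (e.g. a KenLM model wrapped in
    a function); `max_perplexity` is an illustrative threshold, not the one
    used by Wenzek et al. (2020).
    """
    for page in pages:
        if lm_perplexity(page) <= max_perplexity:
            yield page
```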
[...]
...The line length filter provides a +2 point boost, corroborating the findings of Conneau et al. (2020) and Raffel et al. (2020) that filtering low-quality pages from Common Crawl is valuable....
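A minimal sketch of such a line length filter is shown below; the thresholds (at least three lines of 200 or more characters) follow the filter described for mC4, but the code itself is illustrative rather than the actual pipeline implementation:

```python
def passes_line_length_filter(
    page_text: str,
    min_long_lines: int = 3,
    min_chars_per_line: int = 200,
) -> bool:
    """mC4-style line length filter: keep a page only if it contains at least
    `min_long_lines` lines with `min_chars_per_line` or more characters."""
    long_lines = sum(
        1
        for line in page_text.splitlines()
        if len(line.strip()) >= min_chars_per_line
    )
    return long_lines >= min_long_lines
```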
[...]
...We run six ablations, modifying various settings, using our Large model as a baseline: (i) increase dropout to 0.1 in hopes of mitigating overfitting on low-resource languages, (ii) decrease sequence length to 512 (as was used in T5), (iii) increase the average noise span length in the pre-training objective to 10 since we observe fewer characters per token than T5, (iv) adjust the language sampling exponent α to {0.2, 0.7} as used in MMNMT (Arivazhagan et al., 2019) and mBERT (Devlin, 2018), respectively, (v) turn off the “line length filter” in the mC4 data pipeline, and (vi) supplement mC4 with Wikipedia data from 103 languages (footnote 9: we use the 2020 Wikipedia data from TensorFlow Datasets, selecting the same languages as mBERT; https://www.tensorflow.org/datasets/catalog/wikipedia)....
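Ablation (iii) concerns the span-corruption pre-training objective. The sketch below illustrates, in simplified form, how a mean noise span length parameter shapes the inputs/targets; it is not the exact T5/mT5 implementation (which segments noise spans with a different randomized scheme), though the `<extra_id_N>` sentinel naming follows T5's convention:

```python
import random
from typing import List, Tuple

def span_corrupt(
    tokens: List[str],
    noise_density: float = 0.15,
    mean_span_length: float = 3.0,  # ablation (iii) raises this to 10
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Simplified T5-style span corruption: replace random spans covering
    roughly `noise_density` of the tokens with sentinel tokens, and build
    the target from the dropped spans."""
    rng = random.Random(seed)
    n = len(tokens)
    num_noise = max(1, round(n * noise_density))
    num_spans = max(1, round(num_noise / mean_span_length))
    span_len = max(1, round(num_noise / num_spans))

    # Pick span start positions, grow each span, and merge overlaps.
    starts = sorted(rng.sample(range(n), num_spans))
    spans = []
    for start in starts:
        end = min(start + span_len, n)
        if spans and start < spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))

    inputs, targets = [], []
    prev_end = 0
    for idx, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{idx}>"
        inputs.extend(tokens[prev_end:start])
        inputs.append(sentinel)        # span replaced by a sentinel in the input
        targets.append(sentinel)       # target reconstructs the dropped span
        targets.extend(tokens[start:end])
        prev_end = end
    inputs.extend(tokens[prev_end:])
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets
```

With a larger `mean_span_length`, the same noise budget is spent on fewer, longer spans, so each sentinel in the input hides more consecutive tokens.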
[...]
...Many pre-trained versions of XLM have been released; the most massively-multilingual variant was trained on 100 languages from Wikipedia....
[...]
...Values used by prior work include α = 0.7 for mBERT (Devlin, 2018), α = 0.3 for XLM-R (Conneau et al., 2020), and α = 0.2 for MMNMT (Arivazhagan et al., 2019)....
[...]
...Massively multilingual models have been observed to underperform on a given language when compared to a similarly-sized “dedicated” model trained specifically for that language (Arivazhagan et al., 2019)....
[...]
...We therefore take the approach used in (Devlin, 2018; Conneau et al., 2019; Arivazhagan et al., 2019) and boost lower-resource languages by sampling examples according to the probability p(L) ∝ |L|^α, where p(L) is the probability of sampling text from a given language during pre-training and |L| is the number of examples in the language....
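A minimal sketch of this sampling scheme: raise the per-language example counts |L| to the power α and renormalize. The example counts below are invented for illustration; the α values are the ones quoted in the excerpts above:

```python
def language_sampling_probs(example_counts: dict, alpha: float) -> dict:
    """Compute p(L) ∝ |L|^alpha, the language sampling distribution used to
    boost lower-resource languages during pre-training."""
    weights = {lang: count ** alpha for lang, count in example_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Illustrative counts only: one high-resource vs. one low-resource language.
counts = {"en": 3_000_000_000, "sw": 3_000_000}
print(language_sampling_probs(counts, alpha=1.0))  # purely proportional sampling
print(language_sampling_probs(counts, alpha=0.3))  # low-resource language boosted
print(language_sampling_probs(counts, alpha=0.2))  # stronger boost (MMNMT setting)
```

Smaller α flattens the distribution: with α = 1 the low-resource language is sampled about 0.1% of the time here, while with α = 0.2 it is sampled roughly 20% of the time.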
[...]
...We cast all tasks into the text-to-text format, i.e. generating the label text (XNLI and PAWS-X), entity tags and labels (WikiAnn NER), or answer (XQuAD, MLQA, and TyDi QA) directly in a generative fashion....
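The text-to-text casting can be illustrated with small formatting helpers. The prompt templates and field names below are illustrative assumptions, not the exact templates used by mT5 or XTREME; only the idea that labels and answers are generated as target text comes from the excerpt:

```python
def xnli_to_text_to_text(premise: str, hypothesis: str, label: str) -> dict:
    """Cast an XNLI example to text-to-text: the model reads the sentence pair
    and generates the label string ("entailment", "neutral", "contradiction").
    The "xnli ..." prefix is an illustrative template."""
    return {
        "inputs": f"xnli premise: {premise} hypothesis: {hypothesis}",
        "targets": label,
    }

def qa_to_text_to_text(question: str, context: str, answer: str) -> dict:
    """Cast an extractive QA example (XQuAD/MLQA/TyDi QA) so the answer text
    is generated directly as the target."""
    return {
        "inputs": f"question: {question} context: {context}",
        "targets": answer,
    }
```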
[...]
...On the PAWS-X task, FILTER used translation data from the original task instead....
[...]
...7, and 11 languages respectively; the Named Entity Recognition (NER) dataset of WikiAnn (Pan et al., 2017) restricted to the 40 languages from XTREME (Hu et al., 2020), and the PAWS-X (Yang et al., 2019) paraphrase identification dataset with 7 languages....
[...]