Unsupervised Cross-lingual Representation Learning at Scale

doi:10.18653/V1/2020.ACL-MAIN.747

Open AccessProceedings ArticleDOI

Unsupervised Cross-lingual Representation Learning at Scale

Alexis Conneau, +9 more

- pp 8440-8451

Chats0

TLDR

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.

Abstract:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

Citations

PDF

Open Access

More filters

Proceedings ArticleDOI

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Linting Xue, +7 more

TL;DR: This paper proposed a multilingual variant of T5, mT5, which was pre-trained on a new Common Crawl-based dataset covering 101 languages and achieved state-of-the-art performance on many multilingual benchmarks.

...read moreread less

Posted Content

Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference

Timo Schick, +1 more

- 21 Jan 2020 -

arXiv: Computation and Language

TL;DR: This work introduces Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task.

...read moreread less

Posted Content

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Junjie Hu, +5 more

- 24 Mar 2020 -

arXiv: Computation and Language

TL;DR: The Cross-lingual TRansfer Evaluation of Multilingual Encoders XTREME benchmark is introduced, a multi-task benchmark for evaluating the cross-lingually generalization capabilities of multilingual representations across 40 languages and 9 tasks.

...read moreread less

Proceedings ArticleDOI

CamemBERT: a Tasty French Language Model

Louis Martin, +7 more

- 10 Nov 2019 -

arXiv: Computation and Language

TL;DR: This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating their language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.

...read moreread less

Journal ArticleDOI

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, +386 more

- 09 Nov 2022 -

arXiv.org

TL;DR: BLOOM as discussed by the authors is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).

...read moreread less

Collapse

References

PDF

Open Access

More filters

Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

...read moreread less

Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

TL;DR: A new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.

...read moreread less

Proceedings ArticleDOI

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

...read moreread less

Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.

...read moreread less

Posted Content

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, +9 more

- 26 Jul 2019 -

arXiv: Computation and Language

TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

...read moreread less

Collapse

arXiv: Computation and Language

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

Deep contextualized word representations

Matthew E. Peters, +6 more

Unsupervised Cross-lingual Representation Learning at Scale

Citations

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

CamemBERT: a Tasty French Language Model

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

References

Attention is All you Need

Glove: Global Vectors for Word Representation

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Distributed Representations of Words and Phrases and their Compositionality

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Related Papers (5)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Attention is All you Need

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Adam: A Method for Stochastic Optimization

Deep contextualized word representations