Open Access Proceedings Article
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, Sebastian Ruder
pp. 10186–10203
TLDR
This paper proposes a series of data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to low-resource languages and unseen scripts. The methods rely on matrix factorization to exploit the latent multilingual knowledge already present in the pretrained model's embedding matrix.
Abstract:
Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model's embedding matrix. Furthermore, we show that learning of the new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT's and the target language's vocabulary. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.
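As a rough illustration of the adaptation recipe sketched above, the following NumPy snippet factorizes a stand-in embedding matrix and initializes a new target-language embedding matrix from lexically overlapping tokens. All sizes, the random matrices, and the `overlap` mapping are hypothetical; this is a sketch of the general idea, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch: factorize a pretrained embedding matrix E (V x d) into
# low-dimensional token factors F (V x k) and a shared up-projection G (k x d),
# then learn only a new F' for the target-language vocabulary while keeping G fixed.
rng = np.random.default_rng(0)
V, d, k = 5_000, 768, 100            # hypothetical sizes
E = rng.normal(size=(V, d))          # stands in for mBERT's embedding matrix

# Truncated SVD: E ~ U_k S_k Vt_k, so F = U_k S_k and G = Vt_k.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
F = U[:, :k] * S[:k]                 # latent token coordinates (V x k)
G = Vt[:k, :]                        # shared projection (k x d)

# New target-language vocabulary: random init, then copy latent coordinates
# for tokens that lexically overlap with the source vocabulary.
V_tgt = 2_000
F_tgt = rng.normal(scale=0.02, size=(V_tgt, k))
overlap = {5: 17, 42: 3}             # hypothetical {target_id: source_id} pairs
for tgt_id, src_id in overlap.items():
    F_tgt[tgt_id] = F[src_id]

E_tgt = F_tgt @ G                    # new embedding matrix for the target language
print(E_tgt.shape)                   # (2000, 768)
```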
Citations
Posted Content
AdapterFusion: Non-Destructive Task Composition for Transfer Learning
TL;DR: This work proposes AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks by separating the two stages, i.e., knowledge extraction and knowledge composition, so that the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner.
Proceedings ArticleDOI
AdapterFusion: Non-destructive task composition for transfer learning
TL;DR: In this paper, the authors propose a two-stage learning algorithm that leverages knowledge from multiple tasks to solve the problem of catastrophic forgetting and difficulties in dataset balancing, by separating the two stages, i.e., knowledge extraction and knowledge composition.
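For intuition about the composition stage described in these two entries, here is a minimal PyTorch sketch of attention over the outputs of several frozen, task-specific adapters. Module names and dimensions are illustrative rather than the paper's actual code.

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    """Attention that mixes the outputs of several frozen adapters (sketch)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, adapter_outputs: torch.Tensor) -> torch.Tensor:
        # hidden:          (batch, seq, dim)            transformer layer output
        # adapter_outputs: (batch, seq, n_adapters, dim) outputs of the frozen adapters
        q = self.query(hidden).unsqueeze(2)              # (b, s, 1, dim)
        k = self.key(adapter_outputs)                    # (b, s, n, dim)
        v = self.value(adapter_outputs)                  # (b, s, n, dim)
        scores = (q * k).sum(-1).softmax(dim=-1)         # (b, s, n) mixing weights
        return (scores.unsqueeze(-1) * v).sum(dim=2)     # (b, s, dim)

fusion = AdapterFusion(hidden_dim=768)
h = torch.randn(2, 16, 768)
a = torch.randn(2, 16, 3, 768)       # stand-in outputs of three pretrained adapters
print(fusion(h, a).shape)            # torch.Size([2, 16, 768])
```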
Proceedings ArticleDOI
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while the pretraining data size is an important factor in downstream performance, a designated monolingual tokenizer plays an equally important role.
Posted Content
AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages.
Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Argenton Ramos, Annette Rios, Ivan Vladimir, Gustavo A. Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando A. Coto Solano, Ngoc Thang Vu, Katharina Kann
TL;DR: In this paper, an extension of XNLI to 10 indigenous languages of the Americas is presented, and the authors find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
Posted Content
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models.
TL;DR: This paper provided a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and found that while the pretraining data size is an important factor in the downstream performance of the multilingual model, a designated monolingual tokenizer plays an equally important role.
References
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
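The core operation behind this architecture is scaled dot-product attention; a minimal single-head PyTorch sketch (illustrative only, not the full multi-head implementation) follows.

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # attention distribution over keys
    return weights @ v                              # (batch, seq_q, d_v)

q = torch.randn(1, 5, 64)
k = torch.randn(1, 7, 64)
v = torch.randn(1, 7, 64)
print(attention(q, k, v).shape)  # torch.Size([1, 5, 64])
```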
Proceedings ArticleDOI
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT as mentioned in this paper pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
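To make the "one additional output layer" concrete, here is a small PyTorch sketch of a classification head over a pooled [CLS] representation; the encoder output is a stand-in random tensor rather than an actual BERT model.

```python
import torch
import torch.nn as nn

# One linear output layer on top of the [CLS] vector is enough for classification fine-tuning.
num_labels, hidden_dim = 3, 768
classifier = nn.Linear(hidden_dim, num_labels)

cls_representation = torch.randn(8, hidden_dim)            # hypothetical [CLS] vectors for a batch of 8
logits = classifier(cls_representation)                    # (8, num_labels)
loss = nn.functional.cross_entropy(logits, torch.randint(0, num_labels, (8,)))
loss.backward()                                            # gradients flow into the head (and, normally, the encoder)
```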
Proceedings Article
Categorical Reparameterization with Gumbel-Softmax
Eric Jang, Shixiang Gu, Ben Poole
TL;DR: Gumbel-Softmax as mentioned in this paper replaces non-differentiable samples from a categorical distribution with differentiable samples from a novel Gumbel-Softmax distribution, which has the essential property that it can be smoothly annealed into the categorical distribution.
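A minimal sketch of the sampling trick in plain PyTorch (the library also ships `torch.nn.functional.gumbel_softmax`, which implements the same idea):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Perturb logits with Gumbel noise and apply a temperature-scaled softmax.

    As tau -> 0 the sample approaches a one-hot categorical draw while remaining
    differentiable for any tau > 0.
    """
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / tau, dim=-1)

logits = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)
sample = gumbel_softmax_sample(logits, tau=0.5)
sample.sum().backward()              # gradients reach the logits, unlike a hard categorical sample
print(sample)
```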
Proceedings ArticleDOI
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Posted Content
Gaussian Error Linear Units (GELUs)
Dan Hendrycks, Kevin Gimpel
TL;DR: An empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations is performed and performance improvements are found across all considered computer vision, natural language processing, and speech tasks.
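For reference, GELU is defined as x·Φ(x), where Φ is the standard normal CDF; a short PyTorch sketch of the exact erf form:

```python
import math
import torch

def gelu(x: torch.Tensor) -> torch.Tensor:
    """GELU nonlinearity, x * Phi(x), via the exact erf formulation."""
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3, 3, 7)
print(gelu(x))                       # matches torch.nn.functional.gelu(x)
```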